Running Policies¶

URLab runs and evaluates policies, and you can train in it too. It steps one simulation at a time, though, so for large-scale reinforcement learning a dedicated massively parallel MuJoCo framework such as mjlab or mujoco_warp will train far faster. Because they and URLab all use MuJoCo, a policy trained there evaluates directly here, with Unreal's rendering, cameras, and recording.

This page covers running a bundled policy, registering your own, and wrapping a client as a gymnasium.Env for evaluation pipelines.

If you only want to see a robot move, follow Run a bundled policy. To plug URLab into an evaluation pipeline, jump to Gym environment.

Run a bundled policy¶

1. Install the policy extra¶

cd urlab_bridge
uv sync --extra ui          # dashboard
uv sync --extra robojudo    # bundled pretrained policies

PHC-flagged policies (BeyondMimic, AMO, H2H, twist_tracker) also need the RoboJuDo submodule fully checked out:

cd RoboJuDo && git submodule update --init --recursive

2. Import the matching MJCF¶

Each policy expects a specific model. Drag the .xml into the Unreal Content Browser; the importer runs the mesh pipeline and produces an Articulation Blueprint.

Key	Robot	DOF	MJCF
`unitree_12dof`	G1	12	`assets/robots/g1/g1_29dof_rev_1_0.xml`
`unitree_wo_gait`	G1	29	`assets/robots/g1/g1_29dof_rev_1_0.xml`
`smooth`	G1	29	`assets/robots/g1/g1_29dof_rev_1_0.xml`
`beyondmimic_dance`	G1	29	`assets/robots/g1/g1_29dof_rev_1_0.xml`
`amo`	G1	29	`assets/robots/g1/g1_29dof_rev_1_0.xml`
`h2h`	G1	21	`assets/robots/g1/g1_29dof_rev_1_0.xml`
`twist_tracker`	G1	12	`assets/robots/g1/g1_29dof_rev_1_0.xml`
`go2_wtw`	Go2	12	`mujoco_menagerie/unitree_go2/go2.xml` (download)

The G1 XMLs ship in urlab_bridge/assets/robots/g1/ with their mesh dirs. Go2 comes from mujoco_menagerie.

3. Add a PD controller¶

Every articulation driven by a bundled policy needs a UMjPDController component on its Blueprint. The policies emit position targets at policy rate; the PD controller converts those into per-step joint torques against the live qpos / qvel. Without it the robot either does not move or moves with the wrong gains.

Open the imported Articulation Blueprint.
Add Component, search MjPDController, add it. It auto-binds to every actuator.
Compile and save.

Tune gains in the Blueprint defaults, or at runtime from Python through art.controller.set_gains(...) (see Controllers).

4. Run¶

Start UE, drop the Blueprint into a level, click Play. The dispatcher comes up on tcp://localhost:5559.

From the dashboard:

uv run urlab-ui

Connect, open the Policy tab, pick a policy, set the articulation prefix (for example g1), and click Run.

Headless from the CLI:

uv run urlab-policy --policy unitree_12dof --prefix g1

The launcher checks the policy's required_step_mode against the current bridge mode and refuses to start in an incompatible one, naming the mode it needs. Pass --help for the full flag set.

The policy registry¶

The registry binds a short key to a config class, a URLab env config, an MJCF asset, and a DOF count. The dashboard, the urlab-policy launcher, and your own code all look policies up by name.

from urlab_policy.registry import POLICIES, get_policy_labels

print(get_policy_labels())
print(POLICIES["unitree_12dof"]["dofs"])

Adapters¶

The merged urlab_policy.registry.POLICIES is the union of every adapter's registry.py. Two adapters ship:

adapters/robojudo/ - the RoboJuDo ecosystem. All the bundled entries above live here.
adapters/mjlab/ - mjlab-trained policies, run via the framework-agnostic PolicyRunner + TaskSpec machinery.

Add a policy¶

Drop an entry into the relevant adapter's registry.py, copying an existing one as a template. The schema is whatever the adapter reads; open the adapter source to see the required fields. After it lands:

uv run urlab-policy --policy my_policy --prefix robot
uv run urlab-ui                            # picks it up in the Policy tab

The dashboard imports the merged registry on launch, so new entries show up next session.

`required_step_mode`¶

Mode-sensitive policies declare what they need; the launcher refuses to start in an incompatible mode rather than producing silent garbage.

"my_policy": {
    # ...
    "required_step_mode": "direct",              # single mode
    # or
    "required_step_mode": ("direct", "puppet"),  # any-of
}

Policy trait	Declare
Calls `mj_step` on `client.data` itself (MJX, custom integrator)	`puppet`
Reads `art.controller` gains, expects UE to step	`direct`
Wraps a teleop device, needs continuous publishing	`live`
Pure black-box `act(obs) -> action`	omit (works everywhere)

Skip the registry for one-offs¶

The registry is for pretrained policies that should be discoverable. For a one-off control loop, skip it and use URLabClient directly:

from urlab_client import URLabClient

with URLabClient("tcp://localhost", step_mode="direct") as client:
    client.discover()
    client.sim.start()
    robot = client.articulations["g1"]
    for _ in range(1000):
        robot.set_ctrl({"left_hip_pitch": 0.5})
        client.step(n_steps=10)

Gym environment¶

URLabEnv is a thin gymnasium.Env wrapper around a single URLabClient. Use it to step a policy trained elsewhere through URLab's contacts and rendering, or when a script already speaks the gym interface.

Not for high-throughput training

One process, one editor, one env. Vectorised GPU sims (mjlab, MJX, mujoco_warp) train orders of magnitude faster. Use URLabEnv for the rollout that matters: eval, sim-to-sim verification, or a sanity rollout against URLab's higher-fidelity contacts.

Minimal usage¶

from urlab_client import URLabClient
from urlab_policy.adapters.robojudo.env import URLabEnv

client = URLabClient("tcp://localhost", step_mode="direct")
client.discover()

env = URLabEnv(client)
obs, info = env.reset(seed=42)

for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()

URLabEnv wraps an already-discovered client (its first positional argument). The compiled model is at env.client.model, and the client itself is reachable as env.client.

Observation and action spaces¶

URLabEnv(client, space_mode="flat")   # default
URLabEnv(client, space_mode="dict")

Mode	Type	Use when
`flat`	`Box` (qpos + qvel + sensors concatenated in discovery order)	Single-articulation rollouts, flat actor-critic inputs
`dict`	`Dict` keyed by articulation prefix	Multi-articulation scenes, structure-aware nets

Rewards and termination¶

Both are opt-in callables; URLab does not invent rewards. Without them, reward is 0.0 and terminated is False every step. Both take the same info dict the env returns from step(), which carries the live client, the per-articulation collections, step_count, and sim_time.

def my_reward(info) -> float:
    art = info["client"].articulations["vx300s"]
    waist = art.qpos_array[art.joints["waist"].qpos_local_offset]
    return -abs(waist - 0.5)

env = URLabEnv(client, reward_fn=my_reward, max_episode_steps=1000)

Seeding¶

obs, info = env.reset(seed=42)

reset(seed=...) records the seed URLab-side (the manager's Seed field) so client code and the recording layer can mirror it for reproducibility.

The seed is not written into MuJoCo options

Modern mjOption has no seed field, and mj_step does not depend on a stored integrator seed; randomness comes from user-set noise inputs. In direct mode, the same seed plus the same scene and starting keyframe still reproduces a rollout because the integrator is deterministic, not because a seed was pushed into MuJoCo. In puppet mode the client owns the integrator, so reproducibility follows your local MuJoCo install.

Observation level¶

URLabEnv(client, observations="standard")   # default
URLabEnv(client, observations="minimal")
URLabEnv(client, observations="full")

Trades wire bandwidth for richness. See Protocol Reference for the per-level contents.

Closing¶

env.close()

Tears down the underlying client socket. Always call it; UE keeps the session alive until the client disconnects.