BOSS: A Benchmark for Observation Space Shift in Long-Horizon Task

What is Observation Space Shift?

Skill chaining — sequentially executing pre-learned visuomotor skills — is a simple yet powerful way to tackle long-horizon robot tasks. But failures often arise at skill transitions: the terminal state of one skill can fall outside the initial-state distribution the next skill was trained on. Crucially, the culprit is frequently a change that is logically irrelevant to the next skill, yet still appears in its visual observation.

Consider chaining PlaceObject(potato, bowl) with MoveContainer(bowl, cabinet). The second policy was only ever trained on an empty bowl, so the leftover potato creates a visual mismatch that breaks it — even though moving the bowl is still perfectly feasible. We call this Observation Space Shift (OSS).

Formal definition

We model the environment as a POMDP $\langle \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, \mathcal{Z}, r, \gamma \rangle$, where the observation space $\mathcal{O}$ includes third-person camera views and proprioception. Following Task and Motion Planning, a predicate $\psi_i : \mathcal{S} \to \{0, 1\}$ is a binary property of the state (e.g. $\texttt{In(potato, bowl)}$), and an operator $op = \langle \textit{Pre}, \textit{Eff}, c \rangle$ specifies a skill's preconditions, effects, and cost.

A predicate is irrelevant to an operator when $\psi_i \notin \textit{Pre} \cup \textit{Eff}$ — it does not affect feasibility. OSS is the problem in which changes to such irrelevant predicates in the visual observation space $\mathcal{O}$, caused by the effects of preceding skills, degrade the performance of the current visuomotor policy.
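The irrelevance test above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code: the `Operator` class, the `is_irrelevant` helper, and the specific predicate strings chosen for the bowl-moving skill are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical stand-in for op = <Pre, Eff, c>."""
    name: str
    pre: frozenset   # predicates that must hold before the skill runs
    eff: frozenset   # predicates the skill changes
    cost: float = 1.0


def is_irrelevant(predicate: str, op: Operator) -> bool:
    """A predicate is irrelevant when it is in neither Pre nor Eff."""
    return predicate not in (op.pre | op.eff)


# Assumed preconditions and effects, chosen only for illustration.
move_bowl = Operator(
    name="MoveContainer(bowl, cabinet)",
    pre=frozenset({"On(bowl, table)"}),
    eff=frozenset({"In(bowl, cabinet)"}),
)

print(is_irrelevant("In(potato, bowl)", move_bowl))  # True
print(is_irrelevant("On(bowl, table)", move_bowl))   # False
```

Under these assumed sets, the leftover potato is irrelevant to moving the bowl, so any failure it causes counts as OSS rather than a genuine infeasibility.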

Prior skill-transition methods don't address this well: offline transition policies assume each skill's initial states are always reachable (here, that would mean undoing the potato placement, which is logically invalid), while online fine-tuning assumes policies can be continually retrained — impractical when OSS can occur at every transition of a long-horizon task.

The BOSS Benchmark

BOSS is built on the LIBERO simulator (a Franka Emika Panda arm across 12 manipulation scenes). We start from 44 single-skill tasks drawn from LIBERO-100, then generate modified counterparts with our Rule-based Automatic Modification Generator (RAMG), which edits PDDL predicates — repositioning or adding objects, toggling fixtures (e.g. opening a drawer), and changing containment relations — while preserving each skill's feasibility and logical consistency. RAMG can produce up to 1,727 single-modification variants, giving a large, controllable testbed for OSS.
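To make the idea of a rule-based generator concrete, here is a rough sketch of one filtering rule such a tool might apply. It is not RAMG itself: the function name, the string encoding of predicates, and the `conflicts` table are assumptions made for illustration. A candidate modification is kept only if it leaves the skill's preconditions and effects alone and does not contradict them.

```python
def filter_modifications(pre, eff, candidates, conflicts):
    """Keep candidate predicate edits that do not touch the skill's logic.

    pre, eff   : sets of predicate strings for the skill under test
    candidates : predicate strings a modification would make true
    conflicts  : maps a candidate to predicates it is mutually exclusive with
    """
    protected = set(pre) | set(eff)
    kept = []
    for cand in candidates:
        if cand in protected:
            continue  # would alter a precondition or an effect
        if conflicts.get(cand, set()) & protected:
            continue  # contradicts something the skill relies on
        kept.append(cand)
    return kept


# Hypothetical skill and candidate edits.
pre = {"On(bowl, table)"}
eff = {"In(bowl, cabinet)"}
candidates = [
    "In(potato, bowl)",    # changes containment, irrelevant to the skill
    "Open(drawer)",        # toggles a fixture, irrelevant to the skill
    "In(bowl, cabinet)",   # already the skill's effect
    "On(bowl, stove)",     # would break the precondition
]
conflicts = {"On(bowl, stove)": {"On(bowl, table)"}}

variants = filter_modifications(pre, eff, candidates, conflicts)
print(variants)  # ['In(potato, bowl)', 'Open(drawer)']
```

Each surviving candidate corresponds to one single-modification variant of the task; the real generator operates on PDDL task files and may apply further rules not shown here.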

Challenge 1

Single Predicate Shift

Tests robustness to a single irrelevant modification introduced by the immediately preceding skill (e.g. a potato placed in the bowl the robot must now move).

Challenge 2

Accumulated Predicate Shift

Tests the cumulative effect of two or three modifications from multiple preceding skills (e.g. a potato in the bowl and an open drawer), a harder, more realistic scenario.

Challenge 3

Skill Chaining

Evaluates 10 real long-horizon tasks, each a chain of three skills, end-to-end — directly measuring how OSS degrades full-task success.

Results

We evaluate four widely used imitation-learning baselines: three Behavioral Cloning policies from LIBERO (BC-RESNET-RNN, BC-RESNET-T, BC-VIT-T) and the vision-language-action model OpenVLA. For C1/C2 we report the Ratio Performance Delta (RPD) — the relative drop in success rate caused by OSS; for C3 we report the Delta to Upper Bound Ratio (DUBR) — the normalized gap between the chain's actual success and its OSS-free upper bound. All results average over three seeds.
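Read literally, both metrics are relative gaps. The sketch below shows one plausible form, assuming RPD is normalized by the success rate on the unmodified task and DUBR by the OSS-free upper bound; the exact formulas in the paper may differ, and the function names are ours. Returning 0 when the denominator is 0 is an assumed convention.

```python
def rpd(sr_original: float, sr_modified: float) -> float:
    """Ratio Performance Delta: relative drop in success rate (assumed form)."""
    if sr_original == 0:
        return 0.0  # assumed convention: nothing left to lose
    return (sr_original - sr_modified) / sr_original


def dubr(upper_bound: float, sr_chain: float) -> float:
    """Delta to Upper Bound Ratio: normalized gap to the OSS-free bound (assumed form)."""
    if upper_bound == 0:
        return 0.0  # assumed convention: the chain already fails without OSS
    return (upper_bound - sr_chain) / upper_bound


# A skill that succeeds 80% of the time normally and 40% after the modification.
print(rpd(0.8, 0.4))   # 0.5
# A chain whose OSS-free upper bound is 0.5 but which succeeds only 10% of the time.
print(dubr(0.5, 0.1))  # 0.8
```

A positive value means OSS hurt performance; a negative RPD would mean the modified task happened to be easier for the policy.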

Challenge 1 — Single Predicate Shift

Figure: C1 scatter plots of success rate on modified vs. unaffected tasks, one panel per baseline. Each point is a task pair (unaffected success vs. modified success). Points below the diagonal are hurt by OSS; darker points have larger RPD.

Takeaway: Even a single irrelevant change is damaging. 50–68% of tasks degrade, with average performance drops of 67% / 35% / 34% / 54% for BC-RESNET-RNN, BC-RESNET-T, BC-VIT-T, and OpenVLA. BC-VIT-T fares best (attention helps focus on task-relevant features); the RNN fares worst.

Challenge 2 — Accumulated Predicate Shift

Figure: C2 bar charts for 1, 2, and 3 accumulated modifications: (top) average positive RPD and (bottom) OSS occurrence rate.

Takeaway: OSS compounds. As modifications accumulate (1→2→3), both severity (average RPD 0.48 → 0.59 → 0.62) and frequency (61.9% → 70.5% → 76.1%) rise — exactly the regime a long-horizon task lives in.
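The two statistics in this takeaway can be computed from per-task RPD values as sketched below. We assume here that "average positive RPD" is the mean over tasks with RPD > 0 and that the occurrence rate is the fraction of such tasks; the function name and the sample values are illustrative, not taken from the benchmark.

```python
def summarize_oss(rpds):
    """Return (average positive RPD, OSS occurrence rate) for a list of per-task RPDs."""
    positive = [r for r in rpds if r > 0]          # tasks actually hurt by OSS
    occurrence = len(positive) / len(rpds) if rpds else 0.0
    avg_positive = sum(positive) / len(positive) if positive else 0.0
    return avg_positive, occurrence


# Made-up RPD values for four tasks.
avg_positive, occurrence = summarize_oss([0.5, 0.0, 0.25, -0.1])
print(avg_positive)  # 0.375
print(occurrence)    # 0.5
```

Separating the two numbers distinguishes how often OSS strikes from how badly it hurts when it does.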

Challenge 3 — Skill Chaining

Figure: C3 bar chart of Delta to Upper Bound Ratio (DUBR) across the 10 three-skill long-horizon tasks, for each baseline. Some BC-RESNET-RNN bars read 0% because its OSS-free upper bound is already 0%: it fails even without OSS.

Takeaway: On real long-horizon tasks, the gap to the OSS-free upper bound is large and positive across nearly every task — OSS sharply reduces end-to-end success.

Can data augmentation fix it? No.

A natural idea is to expose policies to more visual variety during training. We used RAMG to generate 1,727 modified environments and replayed demonstrations to build a new dataset of ~57,000 demonstrations — nearly 30× the original LIBERO data for these 44 tasks. We then compare Setup A (trained on original data) with Setup B (trained on the augmented data), both evaluated on the same modified C1 tasks.

Table: success rate under Setup A vs. Setup B, and their difference (A−B), for each baseline.

Takeaway: Augmentation barely helped one baseline (BC-RESNET-RNN, +0.06), hurt two (BC-RESNET-T −0.13, BC-VIT-T −0.10), and left OpenVLA unchanged (±0). The combinatorial explosion of skill-induced scene variations is simply too large to cover by augmentation, even in simulation.

Other mitigations also fall short

We tested two more intuitive strategies on C1, and neither resolves OSS:

Frozen robotics-specific vision encoders (R3M, LIV). Replacing the trainable encoder with frozen pretrained ones still leaves 55% (BC-R3M-T) and 61% (BC-LIV-T) of tasks affected, with average RPD of 29% and 26%.
3D point-cloud policies (3D Diffuser Actor). Despite better viewpoint generalization, 73% of tasks are still affected, with average RPD 37% — even image-free policies remain vulnerable to OSS.

Together with the data-augmentation result, this shows that current baselines lack any mechanism to handle OSS, motivating the need for new, OSS-targeted algorithms.

BibTeX

@article{yang2025boss,
    title={BOSS: Benchmark for observation space shift in long-horizon task},
    author={Yang, Yue and Zhao, Linfeng and Ding, Mingyu and Bertasius, Gedas and Szafir, Daniel},
    journal={IEEE Robotics and Automation Letters},
    year={2025},
    publisher={IEEE}
}
