<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>UT Austin Robot Perception and Learning Lab</title>
    <description>UTCS Homepage</description>
    <link>https://rpl.cs.utexas.edu/</link>
    <atom:link href="/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>SCIZOR: Self-Supervised Data Curation for Imitation Learning</title>
        <description>&lt;p&gt;Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations. However, large-scale datasets used for policy training often introduce substantial variability in quality, which can negatively impact performance. As a result, automatically curating datasets by filtering out low-quality samples becomes essential. Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity, such as the dataset or trajectory level, failing to account for the quality of individual state-action pairs. To address this, we introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies. SCIZOR targets two complementary sources of low-quality data: suboptimal data, which hinders learning with undesirable actions, and redundant data, which dilutes training with repetitive patterns. SCIZOR leverages a self-supervised task-progress predictor to remove suboptimal samples lacking task progression, and a deduplication module operating on a joint state-action representation to remove samples with redundant patterns. Empirically, we show that SCIZOR enables imitation learning policies to achieve higher performance with less data, yielding an average improvement of 15.4% across multiple benchmarks. More information is available at: https://ut-austin-rpl.github.io/SCIZOR/&lt;/p&gt;
</description>
        <pubDate>Mon, 01 Jun 2026 00:00:00 +0000</pubDate>
        <link>/publications/2026/06/01/zhang-icra26-scizor/</link>
        <guid isPermaLink="true">/publications/2026/06/01/zhang-icra26-scizor/</guid>
      </item>
    
      <item>
        <title>MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos</title>
        <description>&lt;p&gt;We aim to enable humanoid robots to efficiently solve new manipulation tasks from a few video examples. In-context learning (ICL) is a promising framework for achieving this goal due to its test-time data efficiency and rapid adaptability. However, current ICL methods rely on labor-intensive teleoperated data for training, which restricts scalability. We propose using human play videos—continuous, unlabeled videos of people interacting freely with their environment—as a scalable and diverse training data source. We introduce MimicDroid, which enables humanoids to perform ICL using human play videos as the only training data. MimicDroid extracts trajectory pairs with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other. Through this process, the model acquires ICL capabilities for adapting to novel objects and environments at test time. To bridge the embodiment gap, MimicDroid first retargets human wrist poses estimated from RGB videos to the humanoid, leveraging kinematic similarity. It also applies random patch masking during training to reduce overfitting to human-specific cues and improve robustness to visual differences. To evaluate few-shot learning for humanoids, we introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MimicDroid outperformed state-of-the-art methods and achieved nearly twofold higher success rates in the real world.&lt;/p&gt;
</description>
        <pubDate>Mon, 01 Jun 2026 00:00:00 +0000</pubDate>
        <link>/publications/2026/06/01/shah-icra26-mimicdroid/</link>
        <guid isPermaLink="true">/publications/2026/06/01/shah-icra26-mimicdroid/</guid>
      </item>
    
      <item>
        <title>Self-Improving Vision-Language-Action Models with Data Generation via Residual RL</title>
        <description>&lt;p&gt;Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist’s deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.&lt;/p&gt;
</description>
        <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
        <link>/publications/2026/04/01/xiao-iclr26-pld/</link>
        <guid isPermaLink="true">/publications/2026/04/01/xiao-iclr26-pld/</guid>
      </item>
    
      <item>
        <title>DEAS: DEtached value learning with Action Sequence for Scalable Offline RL</title>
        <description>&lt;p&gt;Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high returns in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.&lt;/p&gt;
</description>
        <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
        <link>/publications/2026/04/01/kim-iclr26-deas/</link>
        <guid isPermaLink="true">/publications/2026/04/01/kim-iclr26-deas/</guid>
      </item>
    
      <item>
        <title>FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation</title>
        <description>&lt;p&gt;Handling delicate and fragile objects remains a major challenge for robotic manipulation, especially for rigid parallel grippers. While the simplicity and versatility of parallel grippers have led to widespread adoption, these grippers are limited by their heavy reliance on visual feedback. Tactile sensing and soft robotics can add responsiveness and compliance. However, existing methods typically involve high integration complexity or suffer from slow response times. In this work, we introduce FORTE, a tactile sensing system embedded in compliant gripper fingers. FORTE uses 3D-printed fin-ray grippers with internal air channels to provide low-latency force and slip feedback. FORTE applies just enough force to grasp objects without damaging them, while remaining easy to fabricate and integrate. We find that FORTE can accurately estimate grasping forces from 0-8 N with an average error of 0.2 N, and detect slip events within 100 ms of their onset. We demonstrate FORTE’s ability to grasp a wide range of slippery, fragile, and deformable objects. In particular, FORTE grasps fragile objects like raspberries and potato chips with a 98.6% success rate, and achieves 93% accuracy in detecting slip events. These results highlight FORTE’s potential as a robust and practical solution for enabling delicate robotic manipulation.&lt;/p&gt;
</description>
        <pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate>
        <link>/publications/2026/03/01/shang-arxiv25-forte/</link>
        <guid isPermaLink="true">/publications/2026/03/01/shang-arxiv25-forte/</guid>
      </item>
    
      <item>
        <title>CHIP: Adaptive Compliance for Humanoid Control through Hindsight Perturbation</title>
        <description>&lt;p&gt;Recent progress in humanoid robots has unlocked agile locomotion skills, including backflipping, running, and crawling. Yet it remains challenging for a humanoid robot to perform forceful manipulation tasks such as moving objects, wiping, and pushing a cart. We propose adaptive Compliance for Humanoid control through hIndsight Perturbation (CHIP), a plug-and-play module that enables controllable end-effector stiffness while preserving agile tracking of dynamic reference motions. CHIP is easy to implement and requires neither data augmentation nor additional reward tuning. We show that a generalist motion-tracking controller trained with CHIP can perform a diverse set of forceful manipulation tasks that require different end-effector compliance, such as multi-robot collaboration, wiping, box delivery, and door opening.&lt;/p&gt;
</description>
        <pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate>
        <link>/publications/2026/02/06/chen-arxiv25-chip/</link>
        <guid isPermaLink="true">/publications/2026/02/06/chen-arxiv25-chip/</guid>
      </item>
    
      <item>
        <title>NitroGen: An Open Foundation Model for Generalist Gaming Agents</title>
        <description>&lt;p&gt;We introduce NitroGen, a vision-action foundation model for generalist gaming agents trained on 40,000 hours of gameplay video across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.&lt;/p&gt;
</description>
        <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
        <link>/publications/2026/01/01/magne-arxiv26-nitrogen/</link>
        <guid isPermaLink="true">/publications/2026/01/01/magne-arxiv26-nitrogen/</guid>
      </item>
    
      <item>
        <title>Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer</title>
        <description>&lt;p&gt;Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.&lt;/p&gt;
</description>
        <pubDate>Sun, 30 Nov 2025 00:00:00 +0000</pubDate>
        <link>/publications/2025/11/30/xue-arxiv25-doorman/</link>
        <guid isPermaLink="true">/publications/2025/11/30/xue-arxiv25-doorman/</guid>
      </item>
    
      <item>
        <title>VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation</title>
        <description>&lt;p&gt;A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.&lt;/p&gt;
</description>
        <pubDate>Wed, 19 Nov 2025 00:00:00 +0000</pubDate>
        <link>/publications/2025/11/19/he-arxiv25-viral/</link>
        <guid isPermaLink="true">/publications/2025/11/19/he-arxiv25-viral/</guid>
      </item>
    
      <item>
        <title>SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control</title>
        <description>&lt;p&gt;Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited behavior set, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.&lt;/p&gt;
</description>
        <pubDate>Tue, 11 Nov 2025 00:00:00 +0000</pubDate>
        <link>/publications/2025/11/11/luo-arxiv25-sonic/</link>
        <guid isPermaLink="true">/publications/2025/11/11/luo-arxiv25-sonic/</guid>
      </item>
    
  </channel>
</rss>
