
Google AI Blog: Robot See, Robot Do

People learn to do things by watching others, from mimicking new dance moves to watching YouTube cooking videos. We'd like robots to do the same, i.e., to learn new skills by watching people do things during training. Today, however, the predominant paradigm for teaching robots is to remote control them using specialized hardware for teleoperation and then train them to imitate pre-recorded demonstrations. This limits both who can provide the demonstrations (programmers and roboticists) and where they can be provided (lab settings). If robots could instead self-learn new tasks by watching humans, this capability could allow them to be deployed in more unstructured settings like the home, and make it dramatically easier for anyone to teach or communicate with them, expert or otherwise. Perhaps one day, they might even be able to use YouTube videos to grow their collection of skills over time.

Our motivation is to have robots watch people do tasks, naturally with their hands, and then use that data as demonstrations for learning. Video by Teh Aik Hui and Nathaniel Lim. License: CC-BY

However, an obvious but often overlooked problem is that a robot is physically different from a human, which means it often completes tasks differently than we do. For example, in the pen manipulation task below, the hand can grab all the pens together and quickly transfer them between containers, whereas the two-fingered gripper must transport them one at a time. Prior research assumes that humans and robots can do the same task similarly, which makes manually specifying one-to-one correspondences between human and robot actions easy. But with stark differences in embodiment, defining such correspondences for seemingly easy tasks can be surprisingly difficult and sometimes impossible.

Physically different end-effectors (i.e., "grippers", the part that interacts with the environment) induce different control strategies when solving the same task. Left: The hand grabs all pens and quickly transfers them between containers. Right: The two-fingered gripper transports one pen at a time.

In "XIRL: Cross-embodiment Inverse Reinforcement Learning", presented as an oral paper at CoRL 2021, we explore these challenges further and introduce a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL). Rather than focusing on how individual human actions should correspond to robot actions, XIRL learns the high-level task objective from videos, and summarizes that knowledge in the form of a reward function that is invariant to embodiment differences, such as shape, actions, and end-effector dynamics. The learned rewards can then be used together with reinforcement learning to teach the task to agents with new physical embodiments through trial and error. Our approach is general and scales autonomously with data: the more embodiment diversity presented in the videos, the more invariant and robust the reward functions become. Experiments show that our learned reward functions lead to significantly more sample efficient (roughly 2 to 4 times) reinforcement learning on new embodiments compared to alternative methods. To extend and build on our work, we are releasing an accompanying open-source implementation of our method along with X-MAGICAL, our new simulated benchmark for cross-embodiment imitation.

Cross-Embodiment Inverse Reinforcement Learning (XIRL)

The underlying observation in this work is that in spite of the many differences induced by different embodiments, there still exist visual cues that reflect progression toward a common task objective. For example, in the pen manipulation task above, the presence of pens in the cup but not the mug, or the absence of pens on the table, are key frames that are common to different embodiments and indirectly provide cues for how close to complete a task is. The key idea behind XIRL is to automatically discover these key moments in videos of different length and cluster them meaningfully to encode task progression. This motivation shares many similarities with unsupervised video alignment research, from which we can leverage a method called Temporal Cycle Consistency (TCC), which aligns videos accurately while learning useful visual representations for fine-grained video understanding without requiring any ground-truth correspondences.

We leverage TCC to train an encoder to temporally align video demonstrations of different experts performing the same task. The TCC loss tries to maximize the number of cycle-consistent frames (or mutual nearest-neighbors) between pairs of sequences using a differentiable formulation of soft nearest-neighbors. Once the encoder is trained, we define our reward function as simply the negative Euclidean distance between the current observation and the goal observation in the learned embedding space. We can subsequently insert this reward into a standard MDP and use an RL algorithm to learn the demonstrated behavior. Surprisingly, we find that this simple reward formulation is effective for cross-embodiment imitation.
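To make the cycle-consistency idea concrete, here is a minimal numpy sketch (not the paper's implementation) of the check at the heart of TCC: a frame of one sequence is mapped to its soft nearest-neighbor in a second sequence, then mapped back, and it is cycle-consistent if it lands on itself. The toy 1-D "embeddings" below simply encode task progress; all names and the temperature value are illustrative assumptions.

```python
import numpy as np

def soft_nearest_neighbor(u, V, temperature=0.001):
    """Soft (differentiable) nearest neighbor of embedding u in sequence V.

    Returns a convex combination of the frames of V, weighted by a
    softmax over negative squared distances.
    """
    d = np.sum((V - u) ** 2, axis=1)   # squared distance to each frame of V
    w = np.exp(-d / temperature)
    w = w / w.sum()                    # softmax weights over frames
    return V.T @ w                     # weighted-average frame embedding

def cycle_back(i, U, V, temperature=0.001):
    """Cycle frame i of U through V and back: i -> soft NN in V -> nearest in U."""
    v_tilde = soft_nearest_neighbor(U[i], V, temperature)
    d_back = np.sum((U - v_tilde) ** 2, axis=1)
    return int(np.argmin(d_back))      # index we land on back in U

# Two toy videos of the same task with different lengths (different
# "execution speeds"); each frame's embedding is its task progress in [0, 1].
U = np.linspace(0.0, 1.0, 8)[:, None]    # 8 frames, 1-D embedding
V = np.linspace(0.0, 1.0, 12)[:, None]   # 12 frames

# A frame is cycle-consistent if it returns to itself; the TCC loss
# (differentiably) pushes the encoder to maximize such frames.
consistent = [cycle_back(i, U, V) == i for i in range(len(U))]
print(sum(consistent), "of", len(U), "frames are cycle-consistent")
```

Because the two toy sequences encode the same monotonic progress, every frame cycles back to itself here; with raw pixels and an untrained encoder this would not hold, which is exactly the signal the TCC loss trains on.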

XIRL self-supervises reward functions from expert demonstrations using temporal cycle consistency (TCC), then uses them for downstream reinforcement learning to learn new skills from third-person demonstrations.
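Once an encoder is available, the reward computation itself is just a distance in embedding space, which any standard RL algorithm can consume. A minimal sketch, with the trained encoder stubbed as a fixed random linear map (an assumption purely for illustration; the real encoder is a trained network over video frames):

```python
import numpy as np

# Stand-in for a pretrained TCC encoder: a fixed linear map from a
# 64-D "observation" to a 32-D embedding (hypothetical dimensions).
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))

def encode(obs):
    return W @ obs

def xirl_reward(obs, goal_obs):
    """Reward = negative Euclidean distance to the goal in embedding space."""
    return -np.linalg.norm(encode(obs) - encode(goal_obs))

# Toy observations: one far from the goal, one a small perturbation of it.
goal = rng.standard_normal(64)
far = rng.standard_normal(64)
near = goal + 0.01 * rng.standard_normal(64)

# Observations closer to the goal (in embedding space) earn higher reward,
# and the goal itself earns the maximum reward of 0.
assert xirl_reward(goal, goal) == 0.0
assert xirl_reward(near, goal) > xirl_reward(far, goal)
```

This dense, embodiment-invariant signal is what replaces the environment reward when training new agents with RL.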

X-MAGICAL Benchmark

To evaluate the performance of XIRL and baseline alternatives (e.g., TCN, LIFS, Goal Classifier) in a consistent environment, we created X-MAGICAL, a simulated benchmark for cross-embodiment imitation. X-MAGICAL features a diverse set of agent embodiments, with differences in their shapes and end-effectors, designed to solve tasks in different ways. This leads to differences in execution speeds and state-action trajectories, which poses challenges for current imitation learning techniques, e.g., ones that use time as a heuristic for weak correspondences between two trajectories. The ability to generalize across embodiments is precisely what X-MAGICAL evaluates.

The SweepToTop task we considered for our experiments is a simplified 2D equivalent of a common household robotic sweeping task, where an agent has to push three objects into a goal zone in the environment. We chose this task specifically because its long-horizon nature highlights how different agent embodiments can generate entirely different trajectories (shown below). X-MAGICAL features a Gym API and is designed to be easily extendable to new tasks and embodiments. You can try it out today with `pip install x-magical`.

Different agent shapes in the SweepToTop task in the X-MAGICAL benchmark need to use different strategies to reposition objects into the target area (pink), i.e., to "clear the debris". For example, the long-stick can clear them all in one fell swoop, while the short-stick needs to do multiple consecutive back-and-forths.
Left: Heatmap of state visitation for each embodiment across all expert demonstrations. Right: Examples of expert trajectories for each embodiment.


In our first set of experiments, we checked whether our learned embodiment-invariant reward function can enable successful reinforcement learning, when the expert demonstrations are provided through the agent itself. We find that XIRL significantly outperforms alternative methods, especially on the tougher agents (e.g., short-stick and gripper).

Same-embodiment setting: Comparison of XIRL with baseline reward functions, using SAC for RL policy learning. XIRL is roughly 2 to 4 times more sample efficient than some of the baselines on the harder agents (short-stick and gripper).

We also find that our approach shows great potential for learning reward functions that generalize to novel embodiments. For instance, when reward learning is performed on embodiments that are different from the ones on which the policy is trained, we find that it results in significantly more sample efficient agents compared to the same baselines. In the gripper subplot below (bottom right) for example, the reward is first learned on demonstration videos from long-stick, medium-stick, and short-stick, after which the reward function is used to train the gripper agent.

Cross-embodiment setting: XIRL performs favorably compared to baseline reward functions trained on observation-only demonstrations from different embodiments. Each agent (long-stick, medium-stick, short-stick, gripper) had its reward trained using demonstrations from the other three embodiments.

We also find that we can train on real-world human demonstrations, and use the learned reward to train a Sawyer arm in simulation to push a puck to a designated target zone. In these experiments as well, our method outperforms baseline alternatives. For example, our XIRL variant trained only on the real-world demonstrations (purple in the plots below) reaches 80% of the total performance roughly 85% faster than the RLV baseline (orange).

What Do The Learned Reward Functions Look Like?

To further explore the qualitative nature of our learned rewards in more challenging real-world scenarios, we collect a dataset of the pen transfer task using various household tools.

Below, we show rewards extracted from a successful (top) and an unsuccessful (bottom) demonstration. Both demonstrations follow a similar trajectory at the start of the task execution. The successful one nets a high reward for placing the pens consecutively into the mug, then into the glass cup, while the unsuccessful one obtains a low reward because it drops the pens outside the glass cup towards the end of the execution (orange circle). These results are promising because they show that our learned encoder can represent fine-grained visual differences relevant to a task.


We highlighted XIRL, our approach to tackling the cross-embodiment imitation problem. XIRL learns an embodiment-invariant reward function that encodes task progress using a temporal cycle-consistency objective. Policies learned using our reward functions are significantly more sample-efficient than baseline alternatives. Furthermore, the reward functions do not require manually paired video frames between the demonstrator and the learner, giving them the ability to scale to an arbitrary number of embodiments or experts with varying skill levels. Overall, we are excited about this direction of work, and hope that our benchmark promotes further research in this area. For more details, please check out our paper and download the code from our GitHub repository.


Kevin and Andy summarized research performed together with Pete Florence, Jonathan Tompson, Jeannette Bohg (faculty at Stanford University) and Debidatta Dwibedi. All authors would additionally like to thank Alex Nichol, Nick Hynes, Sean Kirmani, Brent Yi, Jimmy Wu, Karl Schmeckpeper and Minttu Alakuijala for fruitful technical discussions, and Sam Toyer for invaluable help with setting up the simulated benchmark.


