NEW DELHI: Google has demonstrated its first vision-language-action (VLA) model for robot control that showed improved generalisation capabilities and semantic and visual understanding beyond the robotic data it was exposed to.
This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, for example about object categories or high-level descriptions.
The Robotic Transformer 2 (RT-2) is a novel VLA model that learns from both web and robotics data and translates this knowledge into generalised instructions for robotic control, according to Google DeepMind. A traditional robot may be able to pick up a ball but stumble when asked to pick up a cube.
RT-2’s flexible approach means a robot trained to pick up a ball can figure out how to adjust its extremities to pick up a cube or another toy it has never seen before.
“We also show that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, like deciding which object could be used as an improvised hammer (a rock), or which type of drink is best for a tired person (an energy drink),” said the DeepMind team.
The latest model builds upon Robotic Transformer 1 (RT-1), which was trained on multi-task demonstrations.
The team performed a series of qualitative and quantitative experiments on RT-2 models across more than 6,000 robotic trials. (IANS)