RobotX Research Program

ETH Zurich has launched the RobotX Research Program to further expand research and education in mobile robotics and manipulation. Within this framework, a long-term research program brings together competences from academic and industrial research, with the goal of making robots aware of their environment and capable of moving around, making decisions, and performing tasks autonomously.
The program is structured around annual calls that will support research projects from ETH RobotX Initiative research groups.
2024 Funded Projects

There is unprecedented hype around humanoid robots. Companies around the globe are making tremendous investments to build advanced robotic systems, aiming to incorporate the substantial research progress seen in quadrupedal locomotion over the last decade. While building our own humanoid systems at ETH is out of reach given the available budgets, our standing in the community gives us a unique opportunity to position ourselves as a provider of the locomotion synthesis and control pipeline and to demonstrate this through early access to state-of-the-art robots.
Within the scope of the Advanced Humanoid Locomotion (AHL) project, we aim to empower bipedal robots with robust walking skills to overcome terrains that humans and animals encounter in daily life. Such environments may be structured, such as flat ground, stairs, and slopes, or highly irregular, e.g., dense vegetation, rocky ground, and detritus. Traversing such terrains requires sensory feedback from both vision and touch, which, when jointly processed, allows robots to understand topological properties and environmental uncertainties.
We plan to exploit the knowledge we have gathered in the control of quadrupedal robots and aim to establish a new baseline within the humanoid locomotion community. To this end, the Robotic Systems Lab (RSL) will build upon our reinforcement learning (RL)-based sim-to-real pipeline to generate robust and versatile locomotion control policies. Specifically, we will investigate the use of transformer-based terrain interpretation for optimal foothold selection during perceptive locomotion. In parallel, the Computational Robotics Lab (CRL) will investigate model-predictive control (MPC)-guided RL and imitation learning as alternative approaches to controlling such a complex robotic system. Both teams will compare and merge the two approaches to generate versatile yet robust whole-body locomotion skills. The resulting controller will lay the foundation for future research on loco-manipulation (i.e., the combination of locomotion and manipulation) and more advanced bipedal skill synthesis, such as learning parkour.
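To make the intended architecture concrete, the following is a minimal PyTorch sketch of a transformer-based terrain encoder feeding a locomotion actor network. All module names, dimensions, and the height-map patch layout are illustrative assumptions rather than the actual RSL/CRL pipeline.

```python
# Minimal PyTorch sketch of a transformer-based terrain encoder feeding a
# locomotion actor. Module names, dimensions, and the height-map patch layout
# are illustrative assumptions, not the actual RSL/CRL pipeline.
import torch
import torch.nn as nn

class TerrainEncoder(nn.Module):
    """Encodes a set of local height-map patches into one terrain feature."""
    def __init__(self, patch_dim=64, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches):            # patches: (B, n_patches, patch_dim)
        tokens = self.embed(patches)
        encoded = self.encoder(tokens)     # (B, n_patches, d_model)
        return encoded.mean(dim=1)         # pooled terrain feature

class LocomotionPolicy(nn.Module):
    """Maps proprioception plus terrain features to joint position targets."""
    def __init__(self, proprio_dim=48, n_joints=19, d_model=128):
        super().__init__()
        self.terrain = TerrainEncoder(d_model=d_model)
        self.actor = nn.Sequential(
            nn.Linear(proprio_dim + d_model, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, n_joints),
        )

    def forward(self, proprio, height_patches):
        z = self.terrain(height_patches)
        return self.actor(torch.cat([proprio, z], dim=-1))

# One rollout step with random stand-in observations (simulation placeholder).
policy = LocomotionPolicy()
action = policy(torch.randn(1, 48), torch.randn(1, 25, 64))
print(action.shape)   # torch.Size([1, 19])
```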
Principal Investigator
Prof. Marco Hutter
Start (duration)
01.07.2024 (18 months)

Using engineered living muscle tissues to generate motion, bio-actuators have the potential for sustainable production, self-healing, self-assembly, and the pathophysiological study of motion in living beings. However, as their architecture does not fully mimic native muscle, bio-actuators provide limited performance. We lack a principled approach to physically modeling muscle constructs from the myofiber level up, one that could guide bottom-up tissue engineering and increase the efficiency of biofabrication.
We aim to: (1) systematically bioprint bundles of myofibers; (2) characterize their contraction responses to electrical stimulation; (3) formulate a physical model predicting their geometric variability and contractile behavior; and (4) biofabricate muscles with a biomimetic microarchitecture.
We will deposit high-resolution structures of myoblast-laden hydrogel via extrusion-based bioprinting and generate a myofiber performance library to formulate a physics model explaining the engineered muscle’s behavior. Equation discovery coupled with differentiable simulation will explain the contractile behavior of differently sized myobundles. A physics-informed machine learning approach will improve the model’s performance and scale it up to capture the behavior of more complex tissue architectures. To validate the model’s generalizability, we will compare the performance of various myobundle architectures against our model, and finally realize a proof-of-concept bio-actuator with a hierarchical tissue organization and predicted performance.
Our biophysical, machine learning-driven approach to predictable contractile tissue fabrication will enable an understanding of myofiber biophysics and the scaling-up of performant bio-actuators. This interdisciplinary project offers the first opportunity to apply equation discovery to living materials and to engineer bio-actuators from the myofiber level with predicted contractions for high-performance operation.
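To illustrate the equation-discovery step described above, the sketch below recovers an assumed first-order contraction law from synthetic stimulation/force data via sparse regression (SINDy-style sequentially thresholded least squares). The candidate term library, the synthetic dynamics, and all coefficients are illustrative assumptions, not the project's actual muscle model.

```python
# Sparse-regression (SINDy-style) sketch of equation discovery on synthetic
# contraction data. The assumed first-order law, the candidate library, and
# all coefficients are illustrative, not the project's actual muscle model.
import numpy as np

# Synthetic data: force F(t) responding to pulsed stimulation u(t),
# generated from an assumed law dF/dt = a*u - b*F.
dt, a, b = 1e-3, 4.0, 2.5
t = np.arange(0.0, 2.0, dt)
u = (np.sin(2 * np.pi * 2 * t) > 0).astype(float)   # 2 Hz stimulation pulses
F = np.zeros_like(t)
for k in range(len(t) - 1):
    F[k + 1] = F[k] + dt * (a * u[k] - b * F[k])
dFdt = np.gradient(F, dt)

# Candidate term library: [1, u, F, F^2, u*F]
Theta = np.column_stack([np.ones_like(t), u, F, F**2, u * F])

# Sequentially thresholded least squares: fit, zero small terms, refit.
xi = np.linalg.lstsq(Theta, dFdt, rcond=None)[0]
for _ in range(10):
    small = np.abs(xi) < 0.5
    xi[small] = 0.0
    xi[~small] = np.linalg.lstsq(Theta[:, ~small], dFdt, rcond=None)[0]

# Should recover approximately [0, 4.0, -2.5, 0, 0], i.e. dF/dt ≈ 4.0*u - 2.5*F.
print("coefficients for [1, u, F, F^2, u*F]:", np.round(xi, 2))
```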
Principal Investigator
Prof. Robert Katzschmann
Start (duration)
01.09.2024 (18 months)

This project aims to explore localization within 3D scene graphs, with the objective of enabling autonomous agents, such as the Boston Dynamics Spot and ANYmal, to self-localize within a pre-constructed map of a changing environment. The scene graph characterizes object instances at the leaf nodes using various modalities (point cloud, image, and attributes), while the connecting edges denote inter-object relationships (e.g., "nearby"). Interior nodes categorize groups into hierarchy levels, such as rooms or buildings. This representation supports the integration of multiple modalities in localization, ensuring both lightweight operation and efficiency while providing a natural way to handle changing environments. The lightweight nature stems from the ability to distill complex objects into comprehensive descriptors capturing all modalities cohesively. This method stands out for its efficiency by eliminating the reliance on large point clouds or image datasets. Moreover, it allows dynamic objects to be represented similarly to static ones. The project is structured into three work packages (WPs).
The first WP develops localization algorithms that process input from various modalities to provide coarse and fine localization given a pre-existing map. This WP also involves enhancing the framework to leverage sequences, aiming to improve localization robustness. The second WP is dedicated to change detection by comparing queries against the established map. The third WP develops algorithms for 3D scene graph construction that capture dynamic object properties, ensuring that the generated graphs are optimized for cross-modal localization. This phase will also include data collection and will proceed simultaneously with the previous WPs.
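As a rough illustration of the representation and of the coarse localization step in the first WP, the sketch below defines a small scene graph whose object nodes fuse per-modality descriptors into a single compact vector used for matching. Field names, the descriptor fusion, and the similarity ranking are illustrative assumptions, not the project's implementation.

```python
# Sketch of a 3D scene graph for cross-modal localization: leaf nodes hold
# per-object descriptors distilled from several modalities, edges encode
# inter-object relations, interior groups collect objects into rooms.
# Field names and the descriptor fusion are illustrative assumptions.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    node_id: int
    label: str                               # semantic class, e.g. "chair"
    point_desc: np.ndarray                   # descriptor from the point cloud
    image_desc: np.ndarray                   # descriptor from image crops
    attributes: dict = field(default_factory=dict)

    def fused(self) -> np.ndarray:
        """Distill all modalities into one compact, L2-normalized descriptor."""
        d = np.concatenate([self.point_desc, self.image_desc])
        return d / (np.linalg.norm(d) + 1e-9)

@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)    # node_id -> ObjectNode
    edges: list = field(default_factory=list)      # (node_id, node_id, relation)
    groups: dict = field(default_factory=dict)     # room name -> [node ids]

    def coarse_localize(self, query: ObjectNode, top_k: int = 3):
        """Rank map objects by descriptor similarity to a query observation."""
        q = query.fused()
        scored = [(float(np.dot(q, o.fused())), oid) for oid, o in self.objects.items()]
        return sorted(scored, reverse=True)[:top_k]

# Example: match a query detection against a tiny two-object map.
g = SceneGraph()
g.objects[0] = ObjectNode(0, "table", np.ones(8), np.zeros(8))
g.objects[1] = ObjectNode(1, "sofa", np.zeros(8), np.ones(8))
g.edges.append((0, 1, "nearby"))
g.groups["living_room"] = [0, 1]
query = ObjectNode(-1, "table", np.ones(8) + 0.1, np.zeros(8))
print(g.coarse_localize(query))   # the "table" node should rank first
```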
Principal Investigator
Prof. Marc Pollefeys
Start (duration)
01.09.2024 (18 months)

Every one of us interacts with numerous objects daily without giving the interaction much thought. Yet robot manipulators still struggle with actions beyond simple pick-and-place operations. Why is that? While there are many reasons, including limitations in hardware, sensing, and planning, an often overlooked one is the ability to capture and represent physical properties beyond shape.
Humans have a remarkably good intuition about an object's physical properties and how it will interact with the environment. Even without knowing the actual physical quantities, we can make educated guesses about how objects will interact. Current robotic systems are still far from exhibiting a comparable level of physical intuition. Despite recent advances in manipulation and perception, most approaches depend purely on shape information that can be passively captured through vision alone, neglecting other intrinsic properties such as friction and mass distribution.
In this project, we will develop a learning-based method that captures the essence of both visual and physical properties by embedding them in a single shared latent space. The robot will collect vision and force-torque observations through interactions with previously unknown objects. Leveraging state-of-the-art unsupervised learning approaches, we side-step the need for explicit labels for the observed properties and focus on learning a latent representation that incorporates multiple physical and shape properties relevant for interaction. We will demonstrate the power of the learned latent space on a stacking task with complex, non-uniform objects, where successful completion requires an understanding of both shape and physical properties.
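The following is a minimal PyTorch sketch of the kind of shared latent space described above: separate encoders map vision and force-torque observations into one embedding space and are trained with a contrastive (InfoNCE-style) objective so that paired observations land close together. The encoder sizes, loss choice, and data shapes are illustrative assumptions standing in for the project's unsupervised approach.

```python
# Minimal PyTorch sketch of a shared vision / force-torque latent space trained
# with a contrastive (InfoNCE-style) loss so paired observations embed close
# together. Encoder sizes, the loss choice, and data shapes are illustrative
# assumptions standing in for the project's unsupervised approach.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

vision_enc = Encoder(in_dim=512)     # e.g. features from an image backbone
ft_enc = Encoder(in_dim=6 * 50)      # e.g. a 50-step force-torque window

def contrastive_loss(z_a, z_b, temperature=0.1):
    """Pull paired vision/force-torque embeddings together, push others apart."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One illustrative training step on random stand-in observations.
opt = torch.optim.Adam(list(vision_enc.parameters()) + list(ft_enc.parameters()), lr=1e-3)
vision_feat, ft_feat = torch.randn(16, 512), torch.randn(16, 300)
loss = contrastive_loss(vision_enc(vision_feat), ft_enc(ft_feat))
opt.zero_grad()
loss.backward()
opt.step()
```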
Principal Investigator
Prof. Roland Siegwart
Start (duration)
01.07.2024 (12 months)
2023 Funded Projects
Teaching mobile robots long-horizon tasks that involve locomotion and manipulation, such as moving to a table to pick up an object and delivering it to a target location, is challenging. We formulate this task as a reinforcement learning problem and propose a framework that learns from a large set of uncurated demonstrations of the robot interacting with different environments and a few task-specific expert demonstrations. Our method consists of three modules. First, we learn to extract meaningful behaviors from the uncurated demonstrations. Second, we introduce a teacher-student framework that learns to interact with the environment to solve the desired manipulation task. To guide the learning, the teacher rewards the student agent for acting similarly to the expert demonstrations. When encountering novel states, however, the student receives suggestions from the previously extracted behavioral prior. Our end goal is to demonstrate the desired task on a real robotic system, where we do not have access to privileged information as we do in simulation. Crucially, our proposed framework allows easy integration of real-world sensing capabilities while still leveraging the benefits of easily accessible, privileged information in simulation. Due to the modularity of our framework, only the student agent’s sensors have to be adjusted, e.g., to an RGB-D camera, whereas the rest of the framework can still benefit from additional information during training. Our framework will therefore be able to learn complex control tasks and transfer them to the real world, bringing learning-based mobile manipulation robots closer to deployment.
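The guidance logic can be sketched as follows: the teacher rewards the student for matching expert actions on states covered by the demonstrations and falls back to the behavioral prior on novel states. The novelty test, reward weights, and all stand-in models below are illustrative assumptions, not the actual framework.

```python
# Sketch of the guidance rule: reward similarity to the expert on demonstrated
# states, fall back to the behavioral prior on novel states. The novelty test,
# reward weight, and stand-in models are illustrative assumptions only.
import numpy as np

def nearest_expert(state, expert_states, expert_actions):
    """Expert action at the closest demonstrated state, plus that distance."""
    d = np.linalg.norm(expert_states - state, axis=1)
    i = int(np.argmin(d))
    return expert_actions[i], d[i]

def guidance(state, student_action, expert_states, expert_actions,
             behavior_prior, novelty_threshold=0.5, w_imitate=1.0):
    expert_action, dist = nearest_expert(state, expert_states, expert_actions)
    if dist > novelty_threshold:
        # Novel state: no reliable expert match, suggest the prior's action instead.
        return 0.0, behavior_prior(state)
    # Demonstrated state: reward the student for acting like the expert.
    reward = -w_imitate * np.linalg.norm(student_action - expert_action)
    return reward, None

# Illustrative usage with random stand-ins for the demonstrations and the prior.
rng = np.random.default_rng(1)
expert_states = rng.normal(size=(100, 12))
expert_actions = rng.normal(size=(100, 6))
prior = lambda s: np.zeros(6)                    # placeholder behavioral prior
r, suggestion = guidance(rng.normal(size=12), rng.normal(size=6),
                         expert_states, expert_actions, prior)
print(r, suggestion)
```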
Principal Investigator
Prof. Otmar Hilliges
Prof. Stelian Coros
Start (duration)
01.10.2023 (18 months)

Many of the items we interact with in our daily lives are soft and squishy. While research on the robotic manipulation of rigid objects is advancing at a rapid pace, progress has been much slower for deformable objects. The problem calls for the development of advanced optimal control and learning methods that allow robots to handle soft objects. The goal is to equip robots with human-like abilities, such as folding clothes and passing soft parcels with precision, flexibility, and adaptability. Advancements in this field are expected to have far-reaching implications for various industries, including manufacturing, healthcare, and logistics, as well as contribute to the overall progress of the field of robotics.
With this proposal, we investigate the decision-making process involved in activities such as folding clothes. For example, the robot must identify the type of cloth and its properties, such as its material, texture, and size. Based on this information, the robot must determine the best folding strategy, taking into account the cloth's tendency to stretch or wrinkle. Next, the robot must use its sensors to detect the position and orientation of the cloth, as well as any obstacles that may interfere with the folding process. The robot must then choose the appropriate movements, such as picking up, manipulating, and folding the cloth, using its actuators. The entire process requires the integration of perception, decision-making, and motor control systems, as well as the ability to adapt to changing conditions, such as unexpected obstacles or changes in the cloth's shape.
Principal Investigator
Prof. Stelian Coros
Start (duration)
01.07.2023 (18 months)
The ALLIES project aims to develop an intelligent legged robot that can understand the environment, perform complex sequences of mobile manipulation tasks, and assist people autonomously with only linguistic instruction as the interface. Such a technology could free people from tedious household chores and improve people’s quality of life, especially for those with motor impairments since it can greatly reduce their dependence on a caregiver.
The legged manipulator developed at RSL can already perform specific predefined tasks but falls short of reasoning about the environment and completing complex sequences of tasks. In this project, we aim to build on our initial investigations into semantic understanding, robotic planning, and exploration to achieve a high degree of autonomy, allowing the robot to better comprehend and interact with its surroundings. Large language models (LLMs) will be used to understand verbal commands from a human operator and deconstruct them into a plan that the robot is able to execute. In addition, visual language models (VLMs) and open-vocabulary object detection methods will be used to develop a representation of the environment around the robot and provide it with the perception and localization capabilities it needs to execute the plan successfully.
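A minimal sketch of the LLM-based task decomposition is shown below: a verbal command is converted into a sequence of primitive skills the legged manipulator can execute. The skill set, the prompt, and the stubbed `fake_llm` callable are illustrative assumptions; the project would plug in a real language model and the robot's actual skill interface.

```python
# Sketch of LLM-based task decomposition: a verbal command becomes a sequence
# of primitive skills. The skill set, the prompt, and the stubbed `fake_llm`
# are illustrative assumptions; the project would use a real language model
# and the robot's actual skill interface.
SKILLS = ["navigate_to(<object>)", "pick(<object>)",
          "place(<object>, <location>)", "open(<object>)"]

PROMPT = """You control a legged mobile manipulator.
Available skills: {skills}
Decompose the instruction into one skill call per line.
Instruction: {instruction}
Plan:"""

def decompose(instruction: str, llm) -> list[str]:
    response = llm(PROMPT.format(skills=", ".join(SKILLS), instruction=instruction))
    # Keep only lines that start with a known skill name.
    names = tuple(s.split("(")[0] for s in SKILLS)
    return [line.strip() for line in response.splitlines()
            if line.strip().startswith(names)]

def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model, showing the expected plan format."""
    return "navigate_to(kitchen_table)\npick(mug)\nnavigate_to(sofa)\nplace(mug, side_table)"

for step in decompose("Bring me the mug from the kitchen table", fake_llm):
    print(step)
```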
Principal Investigator
Prof. Marco Hutter
Start (duration)
01.07.2023 (18 months)

Mobile robots with mounted manipulators allow for automating daily tasks of increasing complexity in environments designed for humans. However, versatile manipulation with human-like dexterity and autonomy is still beyond current capabilities.
To develop human-like manipulation skills, we propose a system design for a versatile, cost-efficient, agile, and compliant robotic hand with accurate position and tactile sensing. This hand will be integrated with an existing robotic arm on an ANYmal quadrupedal robot to solve real-world challenges in assisting humans such as opening doors, pressing buttons, picking up objects from the floor, and performing robot-to-human handover tasks.
To solve complex manipulation tasks with this robotic hand, we propose a data-driven, deep-learning-based dexterous grasp planning framework. By pre-training our robot transformer on available crowd-sourced point-of-view datasets (3670 h of video) of humans performing daily tasks, we will reduce the need for time-intensive robot demonstrations by an order of magnitude.
We will be the first to equip quadrupedal robots with versatile and shape-invariant grasping using different grasp types, dexterous in-hand manipulation and re-orientation of objects, and grasp compliance for human interaction, capabilities that have previously not been possible with the two- and three-fingered robotic grippers used today.
This robotic hand and grasp planning framework will serve as a versatile platform for future research in machine learning for robotic manipulation and could significantly increase the capabilities of mobile robots deployed in unstructured environments. The technology will enable applications such as care and service robots, inspection robots, and rescue robots.
Principal Investigator
Prof. Robert Katzschmann
Start (duration)
01.07.2023 (18 months)
Novel aerial robotic systems, such as tilt-rotor platforms, have made meaningful in-flight physical interaction possible, enabling a wealth of novel applications in industrial inspection and maintenance. In contrast to regular drones, they use individually servo-actuated arms with propellers to perform thrust vectoring and generate forces in arbitrary directions.
Precise and robust control in flight is highly challenging and prone to modeling mismatches and unknown environmental factors. Classic control approaches applied to tilt-rotor platforms have reached their limits in both precision and robustness, mainly due to complex non-linear dynamics and the lack of accurate models. While accurate propeller thrust models exist, the unknown dynamics and inaccuracies (e.g., backlash) of servomotors are a significant impediment to better flight performance. Additionally, in current approaches the actuator allocation, which distributes desired forces and torques to the different actuators, is blind to the flight context, e.g., proximity to obstacles, airflow disturbances, or precision/smoothness requirements.
We are convinced that control and modeling mismatches are the main impediments to robust and precise aerial manipulation. We propose to tackle these problems by i) applying data-driven modeling techniques for actuator identification, ii) integrating the resulting model into a comprehensive simulation environment, and iii) subsequently training novel context-dependent control and allocation policies.
This will enable the system to leverage the full actuation capabilities and redundancies of tilt-rotor platforms, with the ultimate aim of creating a new generation of aerial workers capable of accurate force generation for precise flight and physical interaction with the environment.
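As an illustration of what context-dependent allocation could look like, the sketch below distributes a desired body wrench to actuator commands through a weighted least-squares solve whose weighting changes with the flight context (e.g., proximity to obstacles). The allocation matrix and the context heuristic are illustrative assumptions, not an identified tilt-rotor model.

```python
# Sketch of context-dependent actuator allocation: a desired body wrench is
# distributed to actuator commands via a weighted least-squares solve whose
# weights change with the flight context. The allocation matrix and the
# context heuristic are illustrative assumptions, not an identified model.
import numpy as np

rng = np.random.default_rng(0)
n_actuators = 8                        # e.g. 4 tilt angles + 4 propeller thrusts
A = rng.normal(size=(6, n_actuators))  # linearized map: actuator commands -> body wrench

def allocate(wrench_des, context):
    """Minimize the weighted command effort u^T W u subject to A u = wrench_des."""
    # Penalize aggressive servo commands more when flying close to obstacles.
    servo_weight = 10.0 if context["near_obstacle"] else 1.0
    w = np.array([servo_weight] * 4 + [1.0] * 4)
    W_inv = np.diag(1.0 / w)
    return W_inv @ A.T @ np.linalg.solve(A @ W_inv @ A.T, wrench_des)

wrench = np.array([0.0, 0.0, 20.0, 0.1, -0.1, 0.0])    # desired force/torque
u_free = allocate(wrench, {"near_obstacle": False})
u_safe = allocate(wrench, {"near_obstacle": True})     # gentler servo usage
print(np.round(u_free, 2))
print(np.round(u_safe, 2))
```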
Principal Investigator
Prof. Roland Siegwart
Start (duration)
01.07.2023 (15 months)
2022 Funded Projects
Project Description
The main objective of this project is to investigate the benefits and challenges of targeted 3D and semantic reconstruction and to develop quality-adaptive, semantically guided Simultaneous Localization and Mapping (SLAM) algorithms. The goal is to enable an agent (e.g., a Boston Dynamics Spot robot) to navigate and find a target object (or other semantics) in an unknown or partially known environment while reconstructing the scene in a quality-adaptive manner. We interpret being quality-adaptive as making the reconstruction accuracy and level of detail dependent on finding the target class, i.e., reconstructing only until we are certain that the observed object does not belong to the target class.
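The quality-adaptive rule can be sketched as a simple decision function: keep refining a region only while it might still contain the target class, and stop once the classifier is confident it does not. The thresholds and level names below are illustrative assumptions (the coarse/middle/fine levels follow the milestone description further down).

```python
# Sketch of the quality-adaptive rule: keep refining a region only while it
# might still contain the target class, and stop once the classifier is
# confident it does not. Thresholds and level names are illustrative
# assumptions.
def choose_resolution(class_probs: dict, target: str,
                      p_reject: float = 0.05, p_accept: float = 0.5) -> str:
    """Map the semantic confidence for the target class to a reconstruction level."""
    p_target = class_probs.get(target, 0.0)
    if p_target < p_reject:
        return "coarse"     # confident it is not the target: stop refining
    if p_target < p_accept:
        return "middle"     # still ambiguous: refine until the decision is clear
    return "fine"           # likely the target: reconstruct at full detail

print(choose_resolution({"chair": 0.02, "table": 0.90}, target="chair"))   # coarse
print(choose_resolution({"chair": 0.30, "table": 0.50}, target="chair"))   # middle
print(choose_resolution({"chair": 0.80, "table": 0.10}, target="chair"))   # fine
```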
Principal Investigator
Prof. Marc Pollefeys
Dr. Iro Armeni
Dr. Daniel Barath
Duration
01.09.2022 - 01.03.2024 (18 months)
Most important achieved milestones
1. Quality-Adaptive 3D Semantic Reconstruction. An algorithm for quality-adaptive semantic reconstruction was designed. This algorithm employs a multi-layer voxel structure to represent the environment. Each voxel encapsulates a truncated signed distance function (TSDF) value indicating the distance to the nearest 3D surface, alongside color, texture information, surface normal, and potential semantic classifications. Adaptive voxel subdivision into eight smaller voxels is governed by multiple criteria, including predefined target semantic categories. This approach allows users to delineate objects requiring high-resolution reconstruction from those less critical for the task at hand. The algorithm categorizes resolution into three levels: coarse (8 cm voxel size), middle (4 cm), and fine (1 cm), adjustable based on task requirements. Furthermore, a criterion based on geometric complexity has been established, facilitating the high-quality automatic reconstruction of complex structures irrespective of their semantic classification.
Our current extension of this method separates the SLAM reconstruction's geometric complexity from its texture details, aiming for high-quality renderings without storing excessively detailed geometry. This is particularly relevant for simple geometries with complex textures, where current methods result in unwarranted reconstruction complexity and substantial storage demands. The proposed solution uses a coarse, adaptable voxel structure for geometry and stores color data in 3D texture boxes, leveraging a triplanar mapping algorithm for enhanced rendering quality with minimal geometric detail.
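A simplified sketch of such an adaptive voxel is given below: each voxel stores TSDF, color, normal, and semantic data, and subdivides into eight children when a criterion fires (target semantics or high geometric complexity). The field names and the complexity test are simplified illustrative assumptions, not the actual data structure.

```python
# Simplified sketch of an adaptive multi-resolution voxel: each voxel stores
# TSDF, color, normal, and semantic data, and subdivides into eight children
# when a criterion fires (target semantics or geometric complexity). Field
# names and the complexity test are illustrative assumptions.
from dataclasses import dataclass, field
import numpy as np

VOXEL_SIZES = {0: 0.08, 1: 0.04, 2: 0.01}   # coarse, middle, fine (meters)

@dataclass
class AdaptiveVoxel:
    level: int = 0
    tsdf: float = 1.0
    color: np.ndarray = field(default_factory=lambda: np.zeros(3))
    normal: np.ndarray = field(default_factory=lambda: np.zeros(3))
    semantics: dict = field(default_factory=dict)   # class -> probability
    children: list = field(default_factory=list)    # 8 children once subdivided

    def should_subdivide(self, target_classes, complexity_threshold=0.3):
        is_target = any(self.semantics.get(c, 0.0) > 0.5 for c in target_classes)
        # Crude complexity proxy: the surface passes close to the voxel center.
        is_complex = abs(self.tsdf) < complexity_threshold * VOXEL_SIZES[self.level]
        return self.level < 2 and (is_target or is_complex)

    def subdivide(self, target_classes):
        if self.should_subdivide(target_classes) and not self.children:
            self.children = [AdaptiveVoxel(level=self.level + 1,
                                           semantics=dict(self.semantics))
                             for _ in range(8)]
        return self.children

v = AdaptiveVoxel(tsdf=0.01, semantics={"chair": 0.8})
v.subdivide(target_classes={"chair"})
print(len(v.children), "children at", VOXEL_SIZES[v.children[0].level], "m")
```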
2. An algorithm was introduced that significantly enhances Voxblox++, enabling high-quality, real-time, incremental 3D panoptic segmentation of the environment. This method combines 2D-to-3D semantic and instance mapping to surpass the accuracy of recent 2D-to-3D semantic instance segmentation techniques on large-scale public datasets. Improvements over Voxblox++ include (the first point is sketched in code after the list):
- a novel application of 2D semantic prediction confidence in the mapping process,
- a new method for segmenting semantic-instance consistent surface regions (super-points) and
- a new graph optimization-based approach for semantic labeling and instance refinement.
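As a simplified stand-in for the first improvement above, the sketch below fuses 2D semantic observations into a voxel label using confidence-weighted voting; the scoring rule is an illustrative assumption, and the method's super-point segmentation and graph optimization are not modeled here.

```python
# Simplified stand-in for the first improvement above: fuse 2D semantic
# observations into a voxel label by confidence-weighted voting, instead of
# counting every observation equally. The scoring rule is an illustrative
# assumption; super-points and graph optimization are not modeled here.
import numpy as np

class SemanticVoxel:
    def __init__(self, n_classes: int):
        self.scores = np.zeros(n_classes)

    def integrate(self, class_id: int, confidence: float):
        """Accumulate one 2D observation, weighted by the network's confidence."""
        self.scores[class_id] += confidence

    def label(self, min_score: float = 1.0):
        """Return the fused label, or None while the evidence is still weak."""
        best = int(np.argmax(self.scores))
        return best if self.scores[best] >= min_score else None

v = SemanticVoxel(n_classes=4)
for conf in (0.9, 0.6, 0.95):            # three confident "class 2" observations
    v.integrate(class_id=2, confidence=conf)
v.integrate(class_id=0, confidence=0.3)  # one low-confidence conflicting view
print(v.label())                         # still class 2
```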
3. Another significant contribution of the project is a novel matching algorithm that incorporates semantics for enhanced feature identification within a SLAM pipeline. This method generates a semantic descriptor from each feature's vicinity, which is integrated with the conventional visual descriptor for feature matching. Demonstrated improvements in accuracy, verified using publicly available datasets, underscore the method's effectiveness while maintaining real-time performance capabilities.
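The core idea can be sketched as follows: each keypoint's visual descriptor is augmented with a normalized histogram of the semantic labels found in its vicinity, and matching runs on the combined descriptor. The histogram construction, weighting, and mutual nearest-neighbor rule are simplified illustrative assumptions, not the published method.

```python
# Sketch of the idea: augment each keypoint's visual descriptor with a
# normalized histogram of semantic labels in its vicinity and match on the
# combined vector. Histogram construction, weighting, and the mutual
# nearest-neighbor rule are simplified assumptions, not the published method.
import numpy as np

N_CLASSES = 8

def semantic_descriptor(labels_in_patch, n_classes=N_CLASSES):
    """Normalized histogram of the semantic labels around a keypoint."""
    hist = np.bincount(labels_in_patch, minlength=n_classes).astype(float)
    return hist / (hist.sum() + 1e-9)

def combined(visual_desc, labels_in_patch, w_sem=0.3):
    """Concatenate the normalized visual descriptor with a weighted semantic part."""
    v = visual_desc / (np.linalg.norm(visual_desc) + 1e-9)
    return np.concatenate([v, w_sem * semantic_descriptor(labels_in_patch)])

def match(descs_a, descs_b):
    """Mutual nearest-neighbor matching on the combined descriptors."""
    d = np.linalg.norm(descs_a[:, None, :] - descs_b[None, :, :], axis=-1)
    nn_ab, nn_ba = d.argmin(axis=1), d.argmin(axis=0)
    return [(i, int(j)) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

rng = np.random.default_rng(0)
a = np.stack([combined(rng.normal(size=32), rng.integers(0, N_CLASSES, 50)) for _ in range(5)])
b = np.stack([combined(rng.normal(size=32), rng.integers(0, N_CLASSES, 50)) for _ in range(5)])
print(match(a, b))
```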
Most important publications
- Oguzhan Ilter, Iro Armeni, Marc Pollefeys, Daniel Barath (ICRA 2024): Semantically Guided Feature Matching for Visual SLAM
- Yang Miao, Iro Armeni, Marc Pollefeys, Daniel Barath (IROS 2024): Volumetric Semantically Consistent 3D Panoptic Mapping
- Jianhao Zheng, Daniel Barath, Marc Pollefeys, Iro Armeni (ECCV 2024): MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps
Project Description
Within the project, we investigate representations of soft and/or articulated robots and objects to enable general manipulation pipelines. To apply these representations to real manipulation tasks, we develop a dexterous robotic platform.
Principal Investigator
Prof. Robert Katzschmann
Prof. Fisher Yu
Duration
01.07.2022 - 01.01.2024 (18 months)
Most important achieved milestones
1. We present ICGNet, which uses point cloud data to create an embedding that contains both surface and volumetric information and can be used to predict occupancy, object classes, and physics- or application-specific details such as grasp poses.
2. We developed a real-time tracking framework for soft and articulated robots that allows real-time mesh construction from point cloud data with point-wise errors almost an order of magnitude lower than the state of the art.
3. As an application platform, we constructed a dexterous robotic hand that is capable of precise and fast manipulation of objects.
Most important publications
- Yasunori Toshimitsu, Benedek Forrai, Barnabas Gavin Cangan, Ulrich Steger, Manuel Knecht, Stefan Weirich and Robert K. Katzschmann (Humanoids 2023): Getting the Ball Rolling: Learning a Dexterous Policy for a Biomimetic Tendon-Driven Hand with Rolling Contact Joints
- René Zurbrügg, Yifan Liu, Francis Engelmann, Suryansh Kumar, Marco Hutter, Vaishakh Patil and Fisher Yu (ICRA 2024): ICGNet: A Unified Approach for Instance-Centric Grasping
- Elham Amin Mansour, Hehui Zheng and Robert K. Katzschmann (ROBOVIS 2024): Fast Point Cloud to Mesh Reconstruction for Deformable Object Tracking
Links to images and videos
- ICGNet Architecture (CC BY-NC-ND 4.0)
- Grasp prediction pipeline (CC BY-NC-ND 4.0)
- Predicted grasps with ICGNet (CC BY-NC-ND 4.0)
- Dexterous robotic hand demo video (Apache-2.0, BSD-3-Clause)
- Robotic hands learning in simulation (Apache-2.0, BSD-3-Clause)
- Robotic hands + rendering pointclouds for ICGNet in simulation (Apache-2.0, BSD-3-Clause)