GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping

Abstract

We study cross-embodiment 6-DOF robot grasping. Unlike prior work, we require the model to generalize not only to novel objects and scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion-based generative 6-DOF grasping models to condition on an additional gripper representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 395 million grasps. In simulation experiments, our model outperforms baseline methods in zero-shot generalization to novel real-world grippers and objects. Our model also serves as a good initialization for fine-tuning to novel grippers. In ablations, we demonstrate the efficiency of our swept-volume gripper representation and our procedural gripper training dataset. Finally, we show zero-shot generalization to novel grippers in real-world 6-DOF grasping, surpassing baselines in cross-embodiment generalization.

Method

Swept-Volume Gripper Representation

We encode gripper geometry using a swept-volume heuristic: the union of the volumes swept by the gripper during its closing motion. This compact representation captures the key aspect of the physical grasping process and enables efficient zero-shot generalization to novel gripper morphologies.
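To make the idea of a swept volume concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical parallel-jaw gripper whose two fingers are axis-aligned boxes, and it approximates the swept volume as voxel occupancy accumulated over discrete steps of the closing motion. The function names, dimensions, and grid resolution are illustrative assumptions.

```python
import numpy as np


def box_occupancy(points, center, half_extents):
    """Boolean mask of the points that lie inside an axis-aligned box."""
    return np.all(np.abs(points - center) <= half_extents, axis=1)


def swept_volume_parallel_jaw(max_opening=0.08, finger_size=(0.01, 0.02, 0.04),
                              n_steps=16, resolution=0.005):
    """Union of finger occupancy over the closing motion (all values hypothetical).

    Returns the sample grid (N, 3) and a boolean occupancy mask (N,).
    """
    half = np.asarray(finger_size) / 2.0
    x_max = max_opening / 2.0 + finger_size[0]
    # Regular grid covering the space the fingers can reach while closing.
    xs = np.arange(-x_max, x_max + 1e-9, resolution)
    ys = np.arange(-half[1], half[1] + 1e-9, resolution)
    zs = np.arange(0.0, finger_size[2] + 1e-9, resolution)
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1).reshape(-1, 3)

    occupied = np.zeros(len(grid), dtype=bool)
    # Step the opening from fully open to fully closed and accumulate the union.
    for opening in np.linspace(max_opening, 0.0, n_steps):
        for sign in (-1.0, 1.0):
            center = np.array([sign * (opening / 2.0 + half[0]), 0.0, half[2]])
            occupied |= box_occupancy(grid, center, half)
    return grid, occupied


grid, swept = swept_volume_parallel_jaw()
_, open_only = swept_volume_parallel_jaw(n_steps=1)  # occupancy of the open pose alone
print(f"open pose: {open_only.sum()} voxels, swept volume: {swept.sum()} voxels")
```

The swept occupancy contains the open-pose occupancy plus the region between the fingers, which is the space an object must occupy to be grasped. A revolute or multi-finger gripper would need its own forward kinematics in place of the linear finger motion assumed here.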

Procedural Gripper Training Dataset

Rather than training on existing grippers, we train on 25 procedurally generated grippers spanning a wide range of morphologies, including parallel 2-finger grippers, revolute 2-finger grippers, and high-DOF 3-finger grippers.
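Procedural generation of this kind can be pictured as sampling gripper parameters from each morphology family. The sketch below is illustrative only: the `GripperSpec` fields, the parameter ranges, and the joint counts are assumptions made for the example and are not taken from the paper. Only the three family names and the count of 25 come from the text above.

```python
import random
from dataclasses import dataclass

# Morphology families named in the text; the string labels are hypothetical.
FAMILIES = ("parallel_2f", "revolute_2f", "high_dof_3f")


@dataclass(frozen=True)
class GripperSpec:
    family: str
    finger_length: float  # meters (assumed range)
    finger_width: float   # meters (assumed range)
    max_opening: float    # meters (assumed range)
    n_joints: int         # actuated joints (assumed values)


def sample_gripper(rng, family):
    """Draw one gripper description from a family (illustrative ranges)."""
    if family == "parallel_2f":
        n_joints = 1
    elif family == "revolute_2f":
        n_joints = 2
    else:
        n_joints = rng.randint(6, 9)
    return GripperSpec(
        family=family,
        finger_length=rng.uniform(0.03, 0.12),
        finger_width=rng.uniform(0.005, 0.03),
        max_opening=rng.uniform(0.04, 0.15),
        n_joints=n_joints,
    )


def sample_gripper_set(n=25, seed=0):
    """Sample n grippers, cycling through the families so each is represented."""
    rng = random.Random(seed)
    return [sample_gripper(rng, FAMILIES[i % len(FAMILIES)]) for i in range(n)]


grippers = sample_gripper_set()
print(len(grippers), sorted({g.family for g in grippers}))
```

Seeding the generator keeps the sampled set reproducible, which matters when the same grippers must be reused for grasp sampling and training.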

Large-Scale Grasp Dataset

GraspGen-X is trained on 350M sampled grasps. The table compares our dataset with other existing grasping datasets.

Experiments

Zero-Shot Cross-Embodiment Generalization

We evaluate GraspGen-X on zero-shot generalization to novel real-world grippers and objects. Our model performs well in cross-embodiment grasping without any fine-tuning. We compare against two baselines. GraspGen-DTR directly transfers a GraspGen model trained only on the Franka gripper. GraspGen-RTG transfers the Franka-trained GraspGen model using grasp pose retargeting.
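The text does not describe how the retargeting baseline maps grasps between grippers. One simple way to picture grasp pose retargeting, shown here purely as an assumption, is to compose each source-gripper grasp pose with a fixed rigid offset between the source and target gripper frames. The helper names and the offset value below are hypothetical.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def retarget_grasp(source_grasp, source_to_target):
    """Express a target-gripper pose given a source grasp and a fixed frame offset.

    source_grasp: pose of the source gripper frame in the world frame.
    source_to_target: pose of the target gripper frame in the source gripper frame.
    """
    return source_grasp @ source_to_target


# Source grasp rotated 90 degrees about the x-axis, located at (0.5, 0.0, 0.2).
rot_x_90 = np.array([[1.0, 0.0, 0.0],
                     [0.0, 0.0, -1.0],
                     [0.0, 1.0, 0.0]])
source_grasp = make_pose(rot_x_90, [0.5, 0.0, 0.2])

# Hypothetical offset: the target frame sits 2 cm back along the approach (z) axis.
offset = make_pose(np.eye(3), [0.0, 0.0, -0.02])

target_grasp = retarget_grasp(source_grasp, offset)
print(target_grasp[:3, 3])
```

A fixed offset of this kind ignores differences in finger geometry and closing motion, which is one reason a retargeted grasp may fail on a gripper with a very different morphology.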

Supervised Fine-Tuning

GraspGen-X serves as a strong initialization for fine-tuning to novel grippers. Fine-tuning from our cross-embodiment model converges faster and reaches higher performance than training from scratch. We compare three settings. GraspGen-Scratch trains GraspGen from scratch. GraspGen-Franka-SFT fine-tunes the Franka-trained GraspGen model. GraspGen-X-SFT (ours) fine-tunes GraspGen-X.

Ablation: Gripper Encoder Comparison

We compare our swept-volume gripper encoder against alternative encoding strategies. The swept-volume representation consistently outperforms other methods across all gripper types.

Grasp Pose Visualization of Zero-Shot Generalization

Qualitative visualization of 6-DOF grasp poses predicted by GraspGen-X in a zero-shot setting, demonstrating diverse and physically plausible grasps across novel grippers and objects.

Grasp pose visualization for zero-shot generalization

BibTeX

@inproceedings{han2026graspgenx,
  title     = {GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping},
  author    = {Han, Beining and Chao, Yu-Wei and Coumans, Erwin and Eppner, Clemens and Deng, Jia and Birchfield, Stan and Murali, Adithyavairavan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}