Shallow-π: Knowledge Distillation for Flow-based VLAs

  • Problem 1: Original large VLA models (e.g., π) are too slow for real-time deployment on edge devices.
  • Problem 2: Small-backbone VLA models lack sufficient capacity for precise manipulation and reliable denoising.
  • Our Solution: We distill a high-capacity teacher into a compact student that is both fast and reliable.

Abstract

The growing demand for real-time robotic deployment necessitates fast, on-device inference for vision–language–action (VLA) models. Within the VLA literature, efficiency has been studied extensively at the token level, for example through visual token pruning. In contrast, systematic transformer layer reduction has received limited attention and, to the best of our knowledge, has not been explored for flow-based VLA models in the context of knowledge distillation. Existing approaches to transformer layer reduction largely rely on layer-skipping techniques that require manual threshold tuning and yield only limited compression ratios. Moreover, these methods are particularly ineffective for flow-based action models such as π, in which the action head computes cross-attention with the VLM backbone at every layer. In this work, we propose Shallow-π, a principled knowledge distillation framework that aggressively reduces the transformer depth of both the VLM backbone and the flow-based action head, compressing the model from 18 to 6 layers. Shallow-π achieves over 2× faster inference with less than a 1% absolute drop in success rate on standard manipulation benchmarks, establishing state-of-the-art performance among compressed VLA models. Crucially, we validate our approach through industrial-scale real-world experiments on Jetson Orin and Jetson Thor, deploying the model across multiple robot platforms, including humanoid-type systems, in complex and dynamic manipulation scenarios that demand both computational speed and precision.
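To make the idea above concrete, the sketch below outlines one plausible form of depth-reduction distillation for a flow-matching action model: a shallow student is initialized from an evenly spaced subset of teacher layers and trained to match the teacher's predicted velocity field and, optionally, its intermediate hidden states. This is a minimal illustration under our own assumptions, not the Shallow-π implementation: the (obs, a_t, t) call signature, the returned hidden-state lists, the layer-mapping rule, and the loss weights are all hypothetical, and the sketch does not model the per-layer cross-attention between the action head and the VLM backbone described above.

# Minimal sketch of depth-reduction distillation for a flow-based VLA.
# Hypothetical interfaces; not the authors' released code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_student_from_teacher(teacher_layers: nn.ModuleList,
                              num_student_layers: int) -> nn.ModuleList:
    """Initialize a shallow student by copying an evenly spaced subset
    of teacher layers (e.g., 6 out of 18)."""
    idx = torch.linspace(0, len(teacher_layers) - 1, num_student_layers).round().long()
    return nn.ModuleList([copy.deepcopy(teacher_layers[i]) for i in idx.tolist()])

def distill_step(teacher, student, batch, optimizer, hidden_weight=0.5):
    """One distillation step. Both models are assumed to return
    (predicted_velocity, list_of_hidden_states) for a noisy action chunk
    a_t at flow time t, conditioned on the observation."""
    obs, actions = batch["obs"], batch["actions"]             # expert action chunk (B, H, D)
    t = torch.rand(actions.shape[0], device=actions.device)   # flow time in [0, 1]
    noise = torch.randn_like(actions)
    # Linear flow-matching path between noise and the expert actions.
    a_t = (1 - t.view(-1, 1, 1)) * noise + t.view(-1, 1, 1) * actions

    with torch.no_grad():
        v_teacher, h_teacher = teacher(obs, a_t, t)
    v_student, h_student = student(obs, a_t, t)

    # Output-level KD: match the teacher's velocity prediction.
    loss = F.mse_loss(v_student, v_teacher)

    # Feature-level KD: map each student layer to an evenly spaced teacher
    # layer (same hidden width is assumed, since layers were copied).
    stride = len(h_teacher) // len(h_student)
    for i, h_s in enumerate(h_student):
        loss = loss + hidden_weight * F.mse_loss(h_s, h_teacher[(i + 1) * stride - 1])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Under this scheme, the 18-to-6 compression corresponds to keeping every third teacher layer as the initialization; whether Shallow-π distills the backbone and action head jointly or in stages, and which data mixture and loss weights it uses, is not specified on this page.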


1. ALOHA Qualitative Results

Comparison across tasks and models on edge devices: Original π vs. SmolVLA vs. Shallow-π.


Peg in hole (ALOHA)

Shallow-π (6 layers) on Jetson Orin (~10 Hz)

Original π model (~3 Hz)

SmolVLA (~5 Hz)

Unseen cylinder position (ALOHA)

Original π model

Shallow-π

Insert foam (ALOHA)

Shallow-π (6 layers) on Jetson Orin (~10 Hz)

Original π model (~3 Hz)

SmolVLA (~5 Hz)

Pour beans into box (ALOHA)

Shallow-π (6 layers) on Jetson Orin (~10 Hz)

Original π model (~3 Hz)

SmolVLA (~5 Hz)

Scoop and put apple (ALOHA)

Shallow-π (6 layers) on Jetson Orin (~10 Hz)

Original π model (~3 Hz)

SmolVLA (~5 Hz)

2. RB-Y1 Qualitative Results (Jetson Thor)

Comparison across tasks on Jetson Thor: Original π vs. Shallow-π.


Open lid & peg: type A and B (RB-Y1)

Shallow-π (6 layers) on Jetson Thor (~80 ms per 50-step action chunk); video played at 2× speed.

Recycle: Shallow-π (~80 ms per 50-step action chunk on Thor) vs. Original π (~130 ms); see the timing sketch after the results below.

Shallow-π (SR=17/20)

Original π (SR=12/20)

Recycle with perturbed trash bins (unseen)

Shallow-π (SR=17/20)
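For reference, timing figures like those above (e.g., ~10 Hz closed-loop inference on Jetson Orin, or ~80 ms per 50-step action chunk on Jetson Thor) are typically obtained by timing a single forward pass that emits one action chunk, after warm-up and with GPU synchronization. The short sketch below illustrates such a measurement; `policy` and `policy.infer(obs)` are hypothetical stand-ins for the deployed model, not part of any released API.

# Sketch of per-chunk latency measurement on a Jetson-class GPU device.
# `policy.infer(obs)` is a hypothetical interface, not a real API.
import time
import torch

@torch.no_grad()
def measure_chunk_latency(policy, obs, warmup=10, iters=100):
    for _ in range(warmup):          # warm up kernels and the allocator
        policy.infer(obs)
    torch.cuda.synchronize()         # flush queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        policy.infer(obs)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / iters
    print(f"{latency * 1e3:.1f} ms per action chunk ({1.0 / latency:.1f} chunks/s)")
    return latency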

BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}