Shallow-π: Knowledge Distillation for Flow-based VLAs

  • Problem 1: Original large VLA models (e.g., π) are too slow for real-time deployment on edge devices.
  • Problem 2: Small-backbone VLA models lack sufficient capacity for precise manipulation and reliable denoising.
  • Our Solution: We distill a high-capacity teacher into a compact student that is both fast and reliable.

Abstract

The growing demand for real-time robotic deployment necessitates fast, on-device inference for vision–language–action (VLA) models. Within the VLA literature, efficiency has been studied extensively at the token level, for example through visual token pruning. In contrast, systematic transformer layer reduction has received limited attention and, to the best of our knowledge, has not been explored for flow-based VLA models in the context of knowledge distillation. Existing approaches to transformer layer reduction largely rely on layer-skipping techniques that require manual threshold tuning and yield only limited compression ratios. Moreover, these methods are particularly ineffective for flow-based action models such as π, in which the action head computes cross-attention with the VLM backbone at every layer. In this work, we propose Shallow-π, a principled knowledge distillation framework that aggressively reduces the transformer depth of both the VLM backbone and the flow-based action head, compressing the model from 18 to 6 layers. Shallow-π achieves over 2× faster inference with less than a 1% absolute drop in success rate on standard manipulation benchmarks, establishing state-of-the-art performance among compressed VLA models. Crucially, we validate our approach through industrial-scale real-world experiments on Jetson Orin and Jetson Thor, deploying the model across multiple robot platforms, including humanoid-type systems, in complex and dynamic manipulation scenarios that demand both computational speed and precision.
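To make the idea above concrete, the sketch below outlines one plausible form of depth-reduction distillation for a flow-matching action model: a shallow student is initialized from an evenly spaced subset of teacher layers and trained to match the teacher's predicted velocity field and, optionally, its intermediate hidden states. This is a minimal illustration under our own assumptions, not the Shallow-π implementation: the (obs, a_t, t) call signature, the returned hidden-state lists, the layer-mapping rule, and the loss weights are all hypothetical, and the sketch does not model the per-layer cross-attention between the action head and the VLM backbone described above.

# Minimal sketch of depth-reduction distillation for a flow-based VLA.
# Hypothetical interfaces; not the authors' released code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_student_from_teacher(teacher_layers: nn.ModuleList,
                              num_student_layers: int) -> nn.ModuleList:
    """Initialize a shallow student by copying an evenly spaced subset
    of teacher layers (e.g., 6 out of 18)."""
    idx = torch.linspace(0, len(teacher_layers) - 1, num_student_layers).round().long()
    return nn.ModuleList([copy.deepcopy(teacher_layers[i]) for i in idx.tolist()])

def distill_step(teacher, student, batch, optimizer, hidden_weight=0.5):
    """One distillation step. Both models are assumed to return
    (predicted_velocity, list_of_hidden_states) for a noisy action chunk
    a_t at flow time t, conditioned on the observation."""
    obs, actions = batch["obs"], batch["actions"]             # expert action chunk (B, H, D)
    t = torch.rand(actions.shape[0], device=actions.device)   # flow time in [0, 1]
    noise = torch.randn_like(actions)
    # Linear flow-matching path between noise and the expert actions.
    a_t = (1 - t.view(-1, 1, 1)) * noise + t.view(-1, 1, 1) * actions

    with torch.no_grad():
        v_teacher, h_teacher = teacher(obs, a_t, t)
    v_student, h_student = student(obs, a_t, t)

    # Output-level KD: match the teacher's velocity prediction.
    loss = F.mse_loss(v_student, v_teacher)

    # Feature-level KD: map each student layer to an evenly spaced teacher
    # layer (same hidden width is assumed, since layers were copied).
    stride = len(h_teacher) // len(h_student)
    for i, h_s in enumerate(h_student):
        loss = loss + hidden_weight * F.mse_loss(h_s, h_teacher[(i + 1) * stride - 1])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Under this scheme, the 18-to-6 compression corresponds to keeping every third teacher layer as the initialization; whether Shallow-π distills the backbone and action head jointly or in stages, and which data mixture and loss weights it uses, is not specified on this page.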


1. ALOHA Qualitative Results

Comparison across tasks and models on edge devices: Original π vs. SmolVLA vs. Shallow-π.


Peg in hole (ALOHA)

Shallow-π (6 layers) on Jetson Orin (~10 Hz)

Original π model (~3 Hz)

SmolVLA (~5 Hz)

Unseen cylinder position (ALOHA)

Original π model

Shallow-π

Insert foam (ALOHA)

Shallow-π (6 layers) on Jetson Orin (~10 Hz)

Original π model (~3 Hz)

SmolVLA (~5 Hz)

Pour beans into box (ALOHA)

Shallow-π (6 layers) on Jetson Orin (~10 Hz)

Original π model (~3 Hz)

SmolVLA (~5 Hz)

Scoop and put apple (ALOHA)

Shallow-π (6 layers) on Jetson Orin (~10 Hz)

Original π model (~3 Hz)

SmolVLA (~5 Hz)

2. RB-Y1 Qualitative Results (Jetson Thor)

Comparison across tasks on Jetson Thor: Original π vs. Shallow-π.


Open lid & peg: type A and B (RB-Y1)

Shallow-π (6 layers) on Jetson Thor (~80 ms per 50-step action chunk); video played at 2× speed.

Recycle: Shallow-π (~80 ms per 50-step action chunk on Thor) vs. Original π (~130 ms); see the timing sketch after the results below.

Shallow-π (SR=17/20)

Original π (SR=12/20)

Recycle with perturbed trash bins (unseen)

Shallow-π (SR=17/20)
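For reference, timing figures like those above (e.g., ~10 Hz closed-loop inference on Jetson Orin, or ~80 ms per 50-step action chunk on Jetson Thor) are typically obtained by timing a single forward pass that emits one action chunk, after warm-up and with GPU synchronization. The short sketch below illustrates such a measurement; `policy` and `policy.infer(obs)` are hypothetical stand-ins for the deployed model, not part of any released API.

# Sketch of per-chunk latency measurement on a Jetson-class GPU device.
# `policy.infer(obs)` is a hypothetical interface, not a real API.
import time
import torch

@torch.no_grad()
def measure_chunk_latency(policy, obs, warmup=10, iters=100):
    for _ in range(warmup):          # warm up kernels and the allocator
        policy.infer(obs)
    torch.cuda.synchronize()         # flush queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        policy.infer(obs)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / iters
    print(f"{latency * 1e3:.1f} ms per action chunk ({1.0 / latency:.1f} chunks/s)")
    return latency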

BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}