- Experimental Results
- Deep Dive: ALOHA Robot Rollout Videos & Qualitative Analysis
- Frequently Asked Questions
- BibTeX
(Turn on sound to follow along with the narration!)
We evaluate OpenVLA-OFT on four LIBERO simulation benchmark task suites, measuring task success rates with and without additional inputs (wrist camera images and proprioceptive state) and comparing against prior methods. OpenVLA-OFT achieves state-of-the-art results in both categories.
Through parallel decoding and action chunking, OpenVLA-OFT achieves 26x faster action generation and 3x lower latency than the base OpenVLA model.
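To make the mechanism concrete, the sketch below contrasts token-by-token autoregressive decoding with OFT-style parallel decoding of a full action chunk in one forward pass. This is an illustrative sketch only: the `model` interface, shapes, and constants are our assumptions, not the actual OpenVLA-OFT code.

```python
import torch

CHUNK_LEN = 25   # actions per chunk (illustrative value)
ACTION_DIM = 7   # action dimensions per timestep (illustrative value)

def autoregressive_decode(model, obs_embeds):
    """Baseline: one forward pass per action element, so generating a chunk
    costs CHUNK_LEN * ACTION_DIM sequential passes through the model.
    `model` is assumed to map (B, T, E) embeddings to (B, T, E) outputs."""
    context = obs_embeds
    outputs = []
    for _ in range(CHUNK_LEN * ACTION_DIM):
        next_embed = model(context)[:, -1:]           # predict only the next element
        outputs.append(next_embed)
        context = torch.cat([context, next_embed], dim=1)
    return torch.cat(outputs, dim=1)

def parallel_decode(model, obs_embeds, placeholder_embeds):
    """OFT-style parallel decoding: feed placeholder ("empty") action embeddings
    alongside the observation and predict the whole chunk in a single pass."""
    inputs = torch.cat([obs_embeds, placeholder_embeds], dim=1)
    return model(inputs)[:, -placeholder_embeds.shape[1]:]
```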
We evaluate both fine-tuned VLAs (RDT-1B, π0, OpenVLA-OFT+) and popular imitation learning policies trained from scratch (ACT, Diffusion Policy) on four representative dexterous manipulation tasks on the ALOHA robot. Fine-tuned VLAs consistently outperform from-scratch policies, with OpenVLA-OFT+ achieving the highest average performance.
In our paper, we evaluate popular imitation learning policies trained from scratch (ACT and Diffusion Policy) and fine-tuned VLAs (RDT-1B, π0, OpenVLA-OFT) on the bimanual ALOHA robot. Here we show real-world rollout videos and focus on qualitative differences between the methods.
All videos below were captured by an external camera and are sped up by 5x unless specified otherwise.
ACT and Diffusion Policy can reliably execute the clothes folding tasks, which do not depend on language inputs since there is just one object to manipulate per task.
In the "scoop X into bowl" task, where the user specifies the target trail mix ingredient via language, ACT and Diffusion Policy approach the correct ingredient most of the time. However, they often make errors during task execution, e.g., hitting the front of the container with the spoon or failing to scoop the ingredients.
Additionally, in the "put X into pot" task, ACT and Diffusion Policy struggle to follow the user's language inputs, often approaching the wrong target object while also making general task execution errors in the process.
The fine-tuned VLAs (RDT-1B, π0, and OpenVLA-OFT+) can also reliably perform the clothes folding tasks, like the previous methods.
Compared to the non-VLA policies, the fine-tuned VLAs show improved language following and task execution in the "scoop X into bowl" and "put X into pot" tasks.
However, the fine-tuned VLAs are not always successful. In some trials of the "put X into pot" task, π0 approaches the wrong object, while RDT-1B approaches the correct one but fails to finish the task. OpenVLA-OFT+, on the other hand, more frequently targets the correct object and completes the task.
One error that fine-tuned RDT-1B makes multiple times in the "scoop X into bowl" task is shown below. Even though the robot misses the bowl at the beginning of the episode, it carries on as if the bowl were correctly placed at the center of the table, proceeding to pour the trail mix ingredients all over the table. We attribute this failure mode to RDT-1B's "Alternating Condition Injection" scheme, which alternates between injecting visual inputs and language inputs in successive transformer layers, a design that encourages the policy to pay more attention to language inputs rather than over-relying on visual inputs. While this specially designed architecture improves the model's ability to follow language, it may impair the model's ability to incorporate visual feedback.
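For readers unfamiliar with this design, the sketch below is our own schematic rendering of the alternating-injection idea as described above: successive blocks cross-attend to either visual or language tokens. It is a simplified illustration with arbitrary layer sizes, not RDT-1B's actual implementation.

```python
import torch
import torch.nn as nn

class AlternatingConditionBlock(nn.Module):
    """Schematic transformer block that cross-attends to EITHER visual tokens
    or language tokens, depending on its position in the stack. A simplified
    illustration of the alternating-injection idea, not RDT-1B's actual code."""

    def __init__(self, dim: int, n_heads: int, use_language: bool):
        super().__init__()
        self.use_language = use_language
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, vis_tokens, lang_tokens):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Alternate the conditioning source from layer to layer.
        cond = lang_tokens if self.use_language else vis_tokens
        x = x + self.cross_attn(self.norm2(x), cond, cond)[0]
        return x + self.mlp(self.norm3(x))

# Even-indexed layers attend to vision, odd-indexed layers to language.
blocks = nn.ModuleList(
    AlternatingConditionBlock(dim=256, n_heads=8, use_language=(i % 2 == 1))
    for i in range(6)
)
```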
On the other hand, while π0 slightly trails RDT-1B in terms of language following ability, it exhibits better closed-loop visuomotor control. For instance, it occasionally retries after making an initial mistake. OpenVLA-OFT+ demonstrates similar retrying behaviors as well. In the videos below, neither policy fully finishes the task before the time limit is reached (π0 does not drop the pepper into the pot, and OpenVLA-OFT+ does not finish closing the pot). However, both methods would have succeeded given more time.
Contrary to the common belief that diffusion-based policies are superior to L1 regression-based policies in imitation learning due to their expressivity and multimodal action modeling capabilities, our findings reveal some important nuances. The characteristics that make diffusion models powerful (e.g., their ability to capture complex action distributions) can lead to issues when training on imperfect demonstration data. Specifically, these models can accurately reproduce even suboptimal behaviors present in the training demonstrations, potentially compromising the policy's performance during deployment. In contrast, L1 policies can benefit from an inherent regularization effect: their limited expressivity naturally filters out noise in the training demonstrations, and they commit to the median of the demonstrated actions. This previously overlooked advantage suggests that simpler algorithms may be more robust than their more sophisticated counterparts in some cases.
We can see this difference clearly in the practical example shown below: when using a diffusion-based fine-tuned VLA (π0) to scoop pretzels, the robot fails because it inserts the spoon too deeply into the container. This problematic behavior arises from reproducing demonstration sequences in which the expert demonstrator inserted the spoon too deeply into the pretzel container (making it difficult for the policy to retract the spoon afterwards). π0 reproduces this behavior in two of twelve trials.
In comparison, our OpenVLA-OFT+ approach, which uses L1 regression, learns to insert the spoon at an ideal depth (neither too deep nor too shallow) and executes the scooping more reliably, achieving a 100% success rate on this task.
Note: We do not intend to suggest that L1 policies are universally better than diffusion policies. In fact, if the action distribution in the training set is multimodal, L1 regression may commit to a single prediction near the conditional median of the action distribution (which, importantly, is different from the conditional mean that MSE regression would collapse to). This may not be ideal in cases where generating alternative action sequences is beneficial for task completion. However, in the real world, high-dimensional observations such as camera images are noisy, and with this noise even deterministic policies can produce seemingly "multimodal" behaviors.
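To make the median-versus-mean distinction concrete, here is a small self-contained demo (our own illustration, not code from the paper): fitting a single constant prediction to a bimodal set of scalar targets with L1 loss converges toward the median, while MSE converges toward the mean.

```python
import torch
import torch.nn.functional as F

# Bimodal 1-D "action" targets: 70% of demos near 0.0, 30% near 1.0.
# Median = 0.0, mean = 0.3.
targets = torch.cat([torch.zeros(70), torch.ones(30)])

def fit_constant(loss_fn, steps=2000, lr=0.01):
    """Fit a single scalar prediction to all targets under the given loss."""
    pred = torch.tensor([0.5], requires_grad=True)
    opt = torch.optim.SGD([pred], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(pred.expand_as(targets), targets).backward()
        opt.step()
    return pred.item()

print("L1  fit:", round(fit_constant(F.l1_loss), 2))   # ~0.0 (the median)
print("MSE fit:", round(fit_constant(F.mse_loss), 2))  # ~0.3 (the mean)
```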
Overall, in practice, we find that a simple L1 regression-based approach with a high-capacity policy backbone like OpenVLA proves to be quite effective for adapting to new robots and new tasks.
We show a bonus clip in which OpenVLA-OFT+ autonomously performs the forward task ("scoop X into bowl") and backward/reset task ("pour X into container") in six consecutive rollouts, alternating between the two tasks while cycling through the three trail mix ingredients based on preset language commands. This particular policy was trained on extra data with demonstrations of the backward task. With our strong imitation learning framework, a policy can autonomously execute a task and reset the scene effectively while also exhibiting steerability via language.
(Last updated on 2025-02-22)
How much compute do I need to fine-tune OpenVLA using the OFT recipe? What if I just want to run inference?
Training
For this project, we ran each OpenVLA-OFT training job on 8 A100 or H100 GPUs (80 GB memory each) and trained for 50K to 150K gradient steps depending on the fine-tuning dataset size, which took 1-2 days. We recommend using 4 or 8 GPUs if possible, though you can use fewer GPUs and enable gradient accumulation (which our fine-tuning script supports); the training runs will just take longer. Since we fine-tune OpenVLA with LoRA instead of full fine-tuning, there is no need for model sharding. We simply use Distributed Data Parallel (DDP) and split each training batch across the GPUs.
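For reference, here is a minimal, generic PyTorch sketch of the DDP + gradient-accumulation pattern described above. The model/dataloader builders, learning rate, and the HuggingFace-style `.loss` output are placeholder assumptions; this is not the project's actual fine-tuning script.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(build_lora_model, build_dataloader, grad_accum_steps=4, num_steps=50_000):
    # Launch with torchrun so the process-group environment variables are set.
    dist.init_process_group("nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = build_lora_model().to(local_rank)        # only LoRA params require grad
    model = DDP(model, device_ids=[local_rank])      # plain DDP; no sharding needed
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=5e-4  # illustrative LR
    )

    dataloader = build_dataloader()                  # should use a DistributedSampler
    step, micro_step = 0, 0
    while step < num_steps:
        for batch in dataloader:
            batch = {k: v.to(local_rank) for k, v in batch.items()}
            loss = model(**batch).loss / grad_accum_steps   # assumes HF-style output
            loss.backward()
            micro_step += 1
            if micro_step % grad_accum_steps == 0:   # one optimizer step per N micro-batches
                optimizer.step()
                optimizer.zero_grad()
                step += 1
                if step >= num_steps:
                    break

    dist.destroy_process_group()
```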
Here is the minimum GPU memory that is needed for each OpenVLA-OFT(+) training configuration with the default bfloat16 data type:
Here is the recommended GPU memory for each OpenVLA-OFT(+) training configuration with the default bfloat16 data type:
Inference
To run inference with OpenVLA-OFT(+) using the default bfloat16 data type, you need less GPU memory than for training:
How does OpenVLA-OFT, which is an L1 policy, outperform fine-tuned diffusion VLAs like RDT-1B and π0, which use more sophisticated algorithms and larger pretraining datasets?
Please see the section on L1 Regression vs. Diffusion.
Does OpenVLA's pretraining help at all if your new fine-tuning recipe uses a different learning algorithm and architecture?
Yes. We ran an ablation study in LIBERO and observed a 5% drop in average success rate when ablating the OpenVLA pretrained representation. See Appendix H and Table XIV in our paper for more details.
Why did you need so many demonstrations for the "put X into pot" task? Even with 300 demonstrations, why is the performance in this task much lower than in other tasks, which seem more difficult?
Number of demonstrations: We do not actually need all 300 demonstrations to get satisfactory performance on this task. We simply collected a large number of demonstrations because, at that point in the project, learned policies were showing poor language following, and we wanted to test whether significantly increasing the training dataset size would enable better language grounding. It turned out that simply doubling or quadrupling the dataset size did not solve the problem; it only slightly improved language following. To achieve much better language grounding, we had to take additional measures that encourage the model to pay more attention to language, such as using FiLM for fine-tuned OpenVLA policies, which infuses language embedding information into all visual features.
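As a rough illustration of how FiLM injects language into the visual features, here is a minimal sketch. It is our own simplified version with assumed feature dimensions, not the exact OpenVLA-OFT+ implementation: the language embedding is projected to per-channel scale and shift parameters that modulate every visual patch feature.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation, simplified: project a language
    embedding to per-channel scale (gamma) and shift (beta) terms and
    apply them to all visual patch features."""

    def __init__(self, lang_dim: int, vis_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * vis_dim)

    def forward(self, vis_feats: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_patches, vis_dim); lang_emb: (batch, lang_dim)
        gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
        # (1 + gamma) keeps the layer close to an identity mapping at initialization.
        return vis_feats * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

# Example with assumed (illustrative) dimensions.
film = FiLMLayer(lang_dim=4096, vis_dim=1024)
modulated = film(torch.randn(2, 256, 1024), torch.randn(2, 4096))
```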
Task performance: This was the first task we collected demonstrations for on the ALOHA robot setup, so it is the oldest. We observed significant performance degradation in all methods over time due to hardware-related distribution shifts (e.g., shifts in the wrist camera viewpoints and slight wear-and-tear in a few robot joints, which affected the dynamics). Earlier in the project, after we figured out how to imbue policies with enhanced language grounding, we observed over 90% success rate on this task with fine-tuned OpenVLA policies. However, due to these distribution shifts, performance dropped quite significantly when we ran the tests again weeks later. To ensure fair comparisons, we evaluated all methods at the same time so that they all encountered the same train-test distribution shifts, hence the relatively low average success rates on this task across all methods.
@article{kim25finetuning,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={{Moo Jin} Kim and Chelsea Finn and Percy Liang},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}