Year: 2025
Role: Machine Learning Engineer
Duration: 2 weeks
Relevant links: Whisper repo, Llama repo
Summary: A set of experiments exploring fine-tuning techniques for Whisper (speech) and Llama (language) models to improve domain-specific performance while retaining general capabilities.
I ran a number of fine-tuning experiments on Whisper and LLaMA. Specifically, I experimented with knowledge injection, knowledge distillation, GRPO reward fine-tuning, and preserving generalizability while improving performance on specific tasks.
Whisper:
Primary: LibriSpeech ASR corpus
Secondary: Small synthetic dataset with simple English phrases (for generalization testing)
LLaMA:
SmolTLDR: Concise summarization dataset with token-length constraints
AI-MO: Math problem-solving dataset with format and accuracy rewards
Text8: For decoder training and distillation
Custom knowledge prompts: Injecting recent knowledge (e.g., "Trump became president in 2025")
Fine-tuning Whisper (base) on LibriSpeech improved WER significantly:
From 0.19 to ~0.09 on test-clean
From 0.27 to ~0.18 on test-other
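The fine-tune itself was a standard supervised pass over (audio, transcript) pairs. Below is a minimal sketch, assuming the Hugging Face openai/whisper-base checkpoint and the librispeech_asr dataset; the data slice, batch size of one, and learning rate are illustrative rather than the exact configuration I used:

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
model.train()

ds = load_dataset("librispeech_asr", "clean", split="train.100")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for example in ds.select(range(1000)):  # small slice, batch size 1 for clarity
    audio = example["audio"]
    features = processor(audio["array"], sampling_rate=audio["sampling_rate"],
                         return_tensors="pt").input_features
    labels = processor.tokenizer(example["text"], return_tensors="pt").input_ids
    loss = model(input_features=features, labels=labels).loss  # token-level cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```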
However, as shown in logs/old-logs/out-of-sample-eval.log, the fine-tuned models completely lost their ability to transcribe simple English phrases outside the training distribution:
Base model (no fine-tuning):
Successfully transcribed "Hello, my name is Izaak" and "Hello, my name is Tolga"
Overall WER: 0.4
Fine-tuned models (standard approach):
Failed to transcribe simple phrases, producing outputs like "HELLO MY MAIMS ISICK"
Overall WER increased dramatically to 1.0–1.1
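The out-of-sample check itself is simple: transcribe a few recorded phrases and score them against reference text. A rough sketch using jiwer for WER; the file names, phrases, and model path are placeholders rather than the actual eval script:

```python
import jiwer
from transformers import pipeline

# Swap in the fine-tuned checkpoint path to reproduce the degraded numbers.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

references = ["hello my name is izaak", "hello my name is tolga"]
hypotheses = [asr(f)["text"].strip().lower() for f in ["izaak.wav", "tolga.wav"]]

print("out-of-sample WER:", jiwer.wer(references, hypotheses))
```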
To address catastrophic forgetting, I used LoRA (Low-Rank Adaptation) together with KL-divergence and EWC (Elastic Weight Consolidation) regularization, which fixed the out-of-distribution failures illustrated above.
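A sketch of how the LoRA and KL pieces fit together, assuming the peft library and a frozen copy of the base model as the KL reference; the EWC term is omitted here, and the rank, target modules, and KL weight are illustrative:

```python
import copy
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
reference = copy.deepcopy(base).eval()  # frozen copy used for the KL penalty
for p in reference.parameters():
    p.requires_grad_(False)

# LoRA: only low-rank adapters on the attention projections are trained.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)

def training_step(batch, kl_weight=0.1):
    """Task loss on LibriSpeech plus a KL term that keeps the fine-tuned
    output distribution close to the original model's."""
    out = model(input_features=batch["input_features"], labels=batch["labels"])
    with torch.no_grad():
        ref = reference(input_features=batch["input_features"], labels=batch["labels"])
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(ref.logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return out.loss + kl_weight * kl
```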
I used the following reward functions to shape the behavior of the baseline Llama model:
reward_len, reward_token_length: For controlling TLDR output length
reward_format, reward_accuracy: For structured math answers
The former worked really well: the model started outputting much shorter messages after a few epochs. The latter, however, barely improved performance on maths benchmarks at all.
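For reference, the length-shaping reward and the GRPO wiring looked roughly like the sketch below, assuming TRL's GRPOTrainer; the ~50-token target, the trl-lib/tldr dataset id, and the model id are stand-ins rather than the exact values I used:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Penalize distance from a ~50-token target so shorter, complete TLDRs win.
    return [-abs(50 - len(completion.split())) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # stand-in TLDR-style dataset

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-tldr"),
    train_dataset=dataset,
)
trainer.train()
```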
I injected post-training knowledge (e.g., political events from 2025) into Llama 3.2 1B using:
LoRA on all layers vs. MLP-only
Adapter-based methods (less successful)
Evaluation via sense checks and control prompts
In my first attempt I naively didn't adhere to the chatbot-style input format and the model totally forgot how to respond. Interestingly, it took up to 50 epochs for the model to learn the fact, which I expected to happen much faster. After this it did correctly learn that Donald Trump is the new President, but it also forgot adjacent facts, such as who the president of Turkey is.
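The fix for the formatting mistake was to run every injection example through the tokenizer's chat template, and the two adapter scopes differed only in their target modules. A hedged sketch, where the fact string, rank, and module lists are illustrative:

```python
from peft import LoraConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Training text must follow the chat template; feeding raw declarative
# sentences was the mistake that broke the model's ability to respond.
messages = [
    {"role": "user", "content": "Who became US president in 2025?"},
    {"role": "assistant", "content": "Donald Trump became president in 2025."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)

# The two adapter scopes that were compared.
lora_all_layers = LoraConfig(
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
lora_mlp_only = LoraConfig(
    r=16,
    target_modules=["gate_proj", "up_proj", "down_proj"],
)
```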
I compared two student models:
Regular: Small FFN trained directly on Text8
Distilled: Mimicked Llama 3.2 1B using a combined soft + hard loss (KL divergence + cross-entropy)
The distilled model retained more semantic coherence and required fewer resources.
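The distillation objective was the usual soft/hard mix. A minimal sketch of the loss, assuming teacher logits from Llama 3.2 1B and student logits from the small FFN; the temperature and mixing weight are illustrative:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    # Soft loss: KL between temperature-scaled teacher and student distributions.
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.log_softmax(teacher_logits / temperature, dim=-1),
                    log_target=True, reduction="batchmean") * temperature ** 2
    # Hard loss: ordinary next-token cross-entropy against the Text8 targets.
    hard = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           targets.view(-1))
    return alpha * soft + (1 - alpha) * hard
```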