Year: 2025
Role: Machine Learning Engineer
Duration: 2 weeks
Relevant links: Whisper repo, Llama repo
Summary: A set of experiments exploring fine-tuning techniques for Whisper (speech) and Llama (language) models to improve domain-specific performance while retaining general capabilities.
I ran a number of fine-tuning experiments on Whisper and LLaMA. Specifically, I experimented with knowledge injection, knowledge distillation, GRPO reward fine-tuning, and preserving generalizability while improving performance on specific tasks.
Whisper:
Primary: LibriSpeech ASR corpus
Secondary: Small synthetic dataset with simple English phrases (for generalization testing)
LLaMA:
SmolTLDR: Concise summarization dataset with token-length constraints
AI-MO: Math problem-solving dataset with format and accuracy rewards
Text8: For decoder training and distillation
Custom knowledge prompts: Injecting recent knowledge (e.g., "Trump became president in 2025")
Fine-tuning Whisper (base) on LibriSpeech improved WER significantly:
From 0.19 to ~0.09 on test-clean
From 0.27 to ~0.18 on test-other
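The fine-tune itself was a standard supervised pass over (audio, transcript) pairs. Below is a minimal sketch, assuming the Hugging Face openai/whisper-base checkpoint and the librispeech_asr dataset; the data slice, batch size of one, and learning rate are illustrative rather than the exact configuration I used:

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
model.train()

ds = load_dataset("librispeech_asr", "clean", split="train.100")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for example in ds.select(range(1000)):  # small slice, batch size 1 for clarity
    audio = example["audio"]
    features = processor(audio["array"], sampling_rate=audio["sampling_rate"],
                         return_tensors="pt").input_features
    labels = processor.tokenizer(example["text"], return_tensors="pt").input_ids
    loss = model(input_features=features, labels=labels).loss  # token-level cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```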
However, as shown in logs/old-logs/out-of-sample-eval.log, the fine-tuned models completely lost their ability to transcribe simple English phrases outside the training distribution:
Base model (no fine-tuning):
Successfully transcribed "Hello, my name is Izaak" and "Hello, my name is Tolga"
Overall WER: 0.4
Fine-tuned models (standard approach):
Failed to transcribe simple phrases, producing outputs like "HELLO MY MAIMS ISICK"
Overall WER increased dramatically to 1.0–1.1
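The out-of-sample check itself is simple: transcribe a few recorded phrases and score them against reference text. A rough sketch using jiwer for WER; the file names, phrases, and model path are placeholders rather than the actual eval script:

```python
import jiwer
from transformers import pipeline

# Swap in the fine-tuned checkpoint path to reproduce the degraded numbers.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

references = ["hello my name is izaak", "hello my name is tolga"]
hypotheses = [asr(f)["text"].strip().lower() for f in ["izaak.wav", "tolga.wav"]]

print("out-of-sample WER:", jiwer.wer(references, hypotheses))
```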
To address catastrophic forgetting, I used LoRA (Low-Rank Adaptation) together with KL-divergence and EWC (Elastic Weight Consolidation) regularization, which fixed the out-of-distribution failures illustrated above.
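A sketch of how the LoRA and KL pieces fit together, assuming the peft library and a frozen copy of the base model as the KL reference; the EWC term is omitted here, and the rank, target modules, and KL weight are illustrative:

```python
import copy
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
reference = copy.deepcopy(base).eval()  # frozen copy used for the KL penalty
for p in reference.parameters():
    p.requires_grad_(False)

# LoRA: only low-rank adapters on the attention projections are trained.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)

def training_step(batch, kl_weight=0.1):
    """Task loss on LibriSpeech plus a KL term that keeps the fine-tuned
    output distribution close to the original model's."""
    out = model(input_features=batch["input_features"], labels=batch["labels"])
    with torch.no_grad():
        ref = reference(input_features=batch["input_features"], labels=batch["labels"])
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(ref.logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return out.loss + kl_weight * kl
```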
I used the following reward functions to shape the behavior of the baseline Llama model:
reward_len, reward_token_length: For controlling TLDR output length
reward_format, reward_accuracy: For structured math answers
The former worked really well: the model started outputting much shorter messages after a few epochs. The latter, however, barely improved performance on maths benchmarks at all.
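For reference, the length-shaping reward and the GRPO wiring looked roughly like the sketch below, assuming TRL's GRPOTrainer; the ~50-token target, the trl-lib/tldr dataset id, and the model id are stand-ins rather than the exact values I used:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Penalize distance from a ~50-token target so shorter, complete TLDRs win.
    return [-abs(50 - len(completion.split())) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # stand-in TLDR-style dataset

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-tldr"),
    train_dataset=dataset,
)
trainer.train()
```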
I injected post-training knowledge (e.g., political events from 2025) into Llama 3.2 1B using:
LoRA on all layers vs. MLP-only
Adapter-based methods (less successful)
Evaluation via sense checks and control prompts
In my first attempt I naively didn't adhere to the chatbot-style input format and the model totally forgot how to respond. Interestingly, it took up to 50 epochs for the model to learn the fact, which I expected to happen much faster. After this it did correctly learn that Donald Trump is the new President, but it also forgot adjacent facts, such as who the president of Turkey is.
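The fix for the formatting mistake was to run every injection example through the tokenizer's chat template, and the two adapter scopes differed only in their target modules. A hedged sketch, where the fact string, rank, and module lists are illustrative:

```python
from peft import LoraConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Training text must follow the chat template; feeding raw declarative
# sentences was the mistake that broke the model's ability to respond.
messages = [
    {"role": "user", "content": "Who became US president in 2025?"},
    {"role": "assistant", "content": "Donald Trump became president in 2025."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)

# The two adapter scopes that were compared.
lora_all_layers = LoraConfig(
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
lora_mlp_only = LoraConfig(
    r=16,
    target_modules=["gate_proj", "up_proj", "down_proj"],
)
```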
I compared two student models:
Regular: Small FFN trained directly on Text8
Distilled: Mimicked Llama 3.2 1B using a combined soft + hard loss (KL divergence + cross-entropy)
The distilled model retained more semantic coherence and required fewer resources.
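The distillation objective was the usual soft/hard mix. A minimal sketch of the loss, assuming teacher logits from Llama 3.2 1B and student logits from the small FFN; the temperature and mixing weight are illustrative:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    # Soft loss: KL between temperature-scaled teacher and student distributions.
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.log_softmax(teacher_logits / temperature, dim=-1),
                    log_target=True, reduction="batchmean") * temperature ** 2
    # Hard loss: ordinary next-token cross-entropy against the Text8 targets.
    hard = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           targets.view(-1))
    return alpha * soft + (1 - alpha) * hard
```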