Despite advances in automatic speech recognition (ASR), large models like Whisper exhibit higher Word Error Rates (WER) for non-native English speakers, creating fairness and accessibility concerns. We propose a multi-agent post-processing pipeline that improves ASR accuracy for non-native speakers without retraining the underlying model. Our system consists of three specialized agents: (1) a BERT-based Error Analysis Agent that detects and classifies token-level errors, (2) a Correction Agent that applies rule-based fixes and T5-based grammar correction using detected error labels, and (3) an Evaluation Agent that measures improvements and provides feedback. On our L2-ARCTIC pilot dataset (100 samples from Mandarin Chinese and Arabic speakers), we demonstrate the feasibility of learned error detection and establish baseline metrics for accent-aware correction.
Our multi-agent pipeline operates entirely on text after ASR transcription, making it model-agnostic and computationally efficient.
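Concretely, a single utterance flows through the three agents as plain text. The sketch below illustrates that chain; the class and method names (predict_labels, correct, evaluate) are placeholders rather than our exact interfaces:

```python
from typing import Optional

def run_pipeline(hypothesis: str,
                 error_agent,        # BERT-based Error Analysis Agent
                 correction_agent,   # rule-based + T5 Correction Agent
                 evaluation_agent,   # WER-tracking Evaluation Agent
                 reference: Optional[str] = None) -> dict:
    """Post-process one ASR hypothesis with the three-agent pipeline."""
    # 1. Detect and classify token-level errors (e.g., "equal", "substitution").
    labels = error_agent.predict_labels(hypothesis)

    # 2. Apply label-guided rule fixes, then T5 grammar correction.
    corrected = correction_agent.correct(hypothesis, labels)

    # 3. Measure the improvement when a reference transcript is available.
    report = (evaluation_agent.evaluate(reference, hypothesis, corrected)
              if reference is not None else None)

    return {"labels": labels, "corrected": corrected, "report": report}
```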
Unlike monolithic correction systems, our modular design allows each agent to specialize:
What problem are we solving?
Current ASR systems perform significantly worse on non-native English speech, with WER gaps of 20-50% compared to native speakers. Fine-tuning models is computationally expensive and requires large amounts of accent-specific data. Moreover, fine-tuned models lack transparency about what was improved.
Why does it matter?
Fair and accurate ASR is critical for education (language learners), accessibility (voice interfaces), and professional settings (international workplaces). Our approach provides an interpretable, efficient, and portable solution that works with any ASR system.
Current limitations of existing approaches:
We sampled 100 utterances from the L2-ARCTIC dataset (scripted non-native English speech), all from speaker NCC (a native Mandarin Chinese speaker). We established baseline metrics using OpenAI Whisper (small.en):
| Metric | Value | Notes |
|---|---|---|
| Overall WER | 9.6% | Better than expected for non-native speech |
| Overall CER | 4.8% | Character-level accuracy |
| Average WER | 10.3% | Per-utterance average |
| Average CER | 5.4% | Per-utterance average |
| Max Sentence Length | 13 words | Important for model context window |
| Total Tokens Annotated | 902 | Word-level error labels |
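These baseline figures come from transcribing each utterance with Whisper and scoring it against the L2-ARCTIC prompt. A minimal sketch of that measurement, assuming the openai-whisper and jiwer packages and simple lowercasing as normalization (our exact normalization may differ):

```python
import jiwer
import whisper  # openai-whisper package

def baseline_metrics(utterances):
    """utterances: list of (wav_path, reference_text) pairs for the pilot set."""
    model = whisper.load_model("small.en")
    refs, hyps, utt_wer, utt_cer = [], [], [], []

    for wav_path, reference in utterances:
        hypothesis = model.transcribe(wav_path)["text"].strip().lower()
        reference = reference.strip().lower()
        refs.append(reference)
        hyps.append(hypothesis)
        utt_wer.append(jiwer.wer(reference, hypothesis))
        utt_cer.append(jiwer.cer(reference, hypothesis))

    return {
        "overall_wer": jiwer.wer(refs, hyps),        # corpus-level score
        "overall_cer": jiwer.cer(refs, hyps),
        "average_wer": sum(utt_wer) / len(utt_wer),  # per-utterance average
        "average_cer": sum(utt_cer) / len(utt_cer),
    }
```

The Overall rows are corpus-level scores, while the Average rows are per-utterance means, which is why the two differ slightly.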
We manually annotated all 100 samples with word-level error labels, creating 902 total token annotations. Our comprehensive taxonomy includes 8 error types covering both standard ASR errors and accent-specific patterns:
Our dataset exhibits an 817:1 imbalance ratio between the majority class (correct tokens) and the rarest error types, presenting significant challenges for model training.
| Error Type | Count | Percentage | Description |
|---|---|---|---|
| equal | 817 | 90.6% | Correct transcription (majority class) |
| substitution | 45 | 5.0% | Word replaced with different word |
| accent_pronunciation | 17 | 1.9% | Accent-driven phonetic error (e.g., "color" → "cooler") |
| homophone | 10 | 1.1% | Sound-alike substitution (e.g., "20th" vs. "twentieth") |
| deletion+equal | 7 | 0.8% | Missing word (label concatenated with next token) |
| insertion | 3 | 0.3% | Extra word inserted (insufficient for learning) |
| substitution+deletion | 1 | 0.1% | Compound error |
| repetition+equal | 1 | 0.1% | Disfluency: word repetition |
| equal+repetition | 1 | 0.1% | Disfluency: word repetition (variant) |
Key Innovation: Our deletion handling concatenates the deletion label with the next word's label (e.g., deletion+equal), keeping all annotations aligned with the hypothesis tokens rather than requiring complex realignment with the reference text.
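A simplified version of this alignment step is sketched below: it assigns one coarse label per hypothesis token via edit-distance opcodes and folds any preceding deletion into the next token's label. The finer-grained classes (homophone, accent_pronunciation, repetition) were assigned manually on top of this skeleton, and the helper name is our own:

```python
from difflib import SequenceMatcher

def label_hypothesis_tokens(reference: str, hypothesis: str) -> list[str]:
    """One coarse label per hypothesis token; deletions fold into the next label."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    labels = [""] * len(hyp)
    pending_deletion = False  # a reference word was dropped before the next hyp token

    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "delete":        # no hypothesis token to attach the label to yet
            pending_deletion = True
            continue
        base = {"equal": "equal", "replace": "substitution", "insert": "insertion"}[op]
        for j in range(j1, j2):
            labels[j] = f"deletion+{base}" if pending_deletion else base
            pending_deletion = False

    if pending_deletion and labels:  # deletion at the very end of the utterance
        labels[-1] += "+deletion"
    return labels
```

For example, label_hypothesis_tokens("please call stella", "please stella") yields ["equal", "deletion+equal"].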
| Metric | Value | Interpretation |
|---|---|---|
| Final Training Loss | 0.77 | Low training error |
| Final Eval Loss | 3.01 | ⚠️ Severe overfitting |
| Training Progress | 2.08 → 0.18 | Loss decreased across 280 steps |
| Overfitting Gap | ~3.9x | Eval loss / Train loss ratio |
⚠️ Expected Limitation: With only 70 training samples and an 817:1 class imbalance, severe overfitting is expected. The model likely predicts "equal" for 95%+ of tokens. This validates our architecture and error taxonomy but requires significantly more data (2000-5000+ samples, based on LearnerVoice results) for production-quality performance.
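For reference, the Error Analysis Agent is fine-tuned as a standard HuggingFace token classifier over the nine label strings from the taxonomy above (eight error types plus equal). The condensed sketch below uses illustrative hyperparameters and assumes the annotated utterances are loaded elsewhere as (words, labels) pairs, split roughly 70/30:

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

LABELS = ["equal", "substitution", "accent_pronunciation", "homophone",
          "deletion+equal", "insertion", "substitution+deletion",
          "repetition+equal", "equal+repetition"]
label2id = {lab: i for i, lab in enumerate(LABELS)}
id2label = {i: lab for lab, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(words, word_labels):
    """Tokenize one hypothesis and align word-level labels to sub-word pieces."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for wid in enc.word_ids():
        # -100 masks special tokens and non-initial sub-word pieces from the loss.
        aligned.append(label2id[word_labels[wid]] if wid is not None and wid != prev else -100)
        prev = wid
    enc["labels"] = aligned
    return enc

def train_error_agent(train_examples, eval_examples, output_dir="error-analysis-agent"):
    """train/eval_examples: lists of (words, word_labels) pairs from the annotation CSV."""
    train_data = [encode(w, l) for w, l in train_examples]
    eval_data = [encode(w, l) for w, l in eval_examples]

    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS), id2label=id2label, label2id=label2id)

    args = TrainingArguments(output_dir=output_dir, num_train_epochs=20,
                             per_device_train_batch_size=8, learning_rate=5e-5)
    trainer = Trainer(model=model, args=args, train_dataset=train_data,
                      eval_dataset=eval_data,
                      data_collator=DataCollatorForTokenClassification(tokenizer))
    trainer.train()
    return trainer.evaluate()  # eval loss on the held-out split
```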
We extracted accent-specific phoneme confusion patterns for speaker NCC (native Mandarin Chinese) by aligning ASR errors with phonetic transcriptions. These patterns guide the Correction Agent's rule-based fixes:
| Reference Phoneme | Confused With | Probability | Linguistic Pattern |
|---|---|---|---|
| R | L, AA1 | ~0.059 each | Classic r/l confusion + vowelization |
| N | D, L | ~0.059 each | Nasal-stop confusion |
| T | D, final_stop_deletion | ~0.059 each | Final consonant deletion (common in Mandarin) |
| D | V, Z | ~0.059 each | Stop-fricative confusion |
| Vowels | Multiple patterns | ~0.059 | AA1→AH1, AY1→IH1, EH1→DH |
💡 Impact: These patterns enable the Correction Agent to apply targeted, accent-aware fixes rather than generic corrections. The confusion matrix provides statistical evidence for which phoneme substitutions are systematic vs. random.
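The construction itself is just a row-normalized count over aligned phoneme pairs extracted from mis-transcribed words. The sketch below is our own illustration of that bookkeeping; the input format and function name are assumptions, and the phoneme-level alignment is done separately:

```python
from collections import Counter, defaultdict

def build_confusion_matrix(aligned_pairs):
    """aligned_pairs: list of (reference_phoneme, hypothesis_phoneme) tuples,
    e.g. [("R", "L"), ("R", "AA1"), ("T", "D"), ...], extracted from words
    that Whisper transcribed incorrectly."""
    pairs = list(aligned_pairs)
    confusions = defaultdict(Counter)
    totals = Counter(ref for ref, _ in pairs)

    for ref_ph, hyp_ph in pairs:
        if ref_ph != hyp_ph:                     # keep only genuine confusions
            confusions[ref_ph][hyp_ph] += 1

    # Row-normalize: entry [ref][hyp] approximates P(hyp phoneme | ref phoneme).
    return {ref: {hyp: count / totals[ref] for hyp, count in row.items()}
            for ref, row in confusions.items()}
```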
Our Correction Agent implements a novel two-stage correction pipeline that uses BERT error labels to guide both rule-based and neural corrections:
Stage 1: Rule-Based Corrections (Label-Guided)
- deletion: reintroduce missing words from the reference
- filler: remove filler words (um, uh, etc.) from the hypothesis
- repetition: remove repeated words (stutters) from the hypothesis
- accent_pronunciation: apply targeted phoneme-confusion fixes for tokens with accent_pronunciation labels
- equal, substitution, and homophone: leave unchanged for Stage 2

Stage 2: T5 Grammar Correction
Based on findings from the LearnerVoice paper (Kim et al., 2024), aggressive grammar correction is not always optimal. Non-native reference transcripts may themselves contain grammatical variations, so excessive correction can actually increase WER by diverging from the reference text. Our T5 stage is calibrated to improve fluency while preserving semantic content and avoiding over-normalization.
Current Implementation Status: The Correction Agent is fully implemented and operational. It currently uses manually annotated error labels from the CSV. Once the BERT classifier generates predicted labels, they will be plugged in seamlessly using the same interface.
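The interface is deliberately simple: the agent receives hypothesis tokens plus one label per token and returns corrected text. The sketch below illustrates both stages under simplifying assumptions: the filler list, the word-level phoneme_rules lookup, and the particular T5 checkpoint are placeholders, and the deletion-reinsertion rule (which requires the reference) is omitted.

```python
from transformers import pipeline

# Stage 2 model: a T5 checkpoint fine-tuned for grammar correction. The checkpoint
# name and the "grammar: " prefix follow its model card; our exact model may differ.
t5_corrector = pipeline("text2text-generation",
                        model="vennify/t5-base-grammar-correction")

FILLERS = {"um", "uh", "er", "ah"}  # illustrative filler list

def correct(tokens, labels, phoneme_rules=None):
    """tokens: hypothesis words; labels: one error label per word (manual or BERT-predicted)."""
    # Stage 1: label-guided, rule-based pass over the hypothesis tokens.
    kept = []
    for tok, lab in zip(tokens, labels):
        if "repetition" in lab and kept and tok.lower() == kept[-1].lower():
            continue                                   # drop stuttered repeats
        if tok.lower() in FILLERS:
            continue                                   # drop filler words
        if lab == "accent_pronunciation" and phoneme_rules:
            tok = phoneme_rules.get(tok.lower(), tok)  # targeted accent-aware swap
        kept.append(tok)      # equal / substitution / homophone pass through to Stage 2

    # Stage 2: conservative T5 grammar correction of the repaired draft.
    draft = " ".join(kept)
    result = t5_corrector("grammar: " + draft, max_new_tokens=64)
    return result[0]["generated_text"]
```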
The Evaluation Agent compares WER before and after correction, tracking:
Based on these metrics, the agent decides whether to:
Current Status: WER calculation and per-speaker metrics are complete. Feedback loop integration is in progress.
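The core comparison is a per-utterance WER delta; per-speaker metrics follow from averaging these records grouped by speaker ID. A minimal sketch (field names are our own):

```python
import jiwer

def evaluate_correction(reference: str, original_hyp: str, corrected_hyp: str) -> dict:
    """Compare WER before and after correction for a single utterance."""
    wer_before = jiwer.wer(reference, original_hyp)
    wer_after = jiwer.wer(reference, corrected_hyp)
    return {
        "wer_before": wer_before,
        "wer_after": wer_after,
        "delta_wer": wer_before - wer_after,   # positive -> the correction helped
        "helpful": wer_after < wer_before,
        "harmful": wer_after > wer_before,     # feeds the planned feedback loop
    }
```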
| Component | Status | Details |
|---|---|---|
| Dataset Preparation | ✅ Complete | 100 L2-ARCTIC samples (speaker NCC) |
| Baseline ASR | ✅ Complete | Whisper small.en, WER: 9.6% |
| Manual Annotation | ✅ Complete | 902 tokens with 8-class error labels |
| Error Analysis Agent (BERT) | ✅ Trained | Proof-of-concept (overfitting expected with n=70) |
| Phoneme Confusion Matrices | ✅ Built | Speaker-specific patterns (NCC) |
| Correction Agent (Hybrid) | ✅ Complete | Rule-based + T5 grammar correction operational |
| Evaluation Agent | 🔄 In Progress | Metrics complete, feedback loop pending |
| Full Pipeline Integration | 🔄 In Progress | Connecting BERT predictions → Corrections → Evaluation |
| Large-Scale Evaluation | 📋 Planned | Waiting for LearnerVoice & Common Voice access |
Based on related work (e.g., LearnerVoice fine-tuning reduced WER by 44% with 50 hours of data), we anticipate the following improvements over baseline Whisper:
| Metric | Current (n=100) | Expected (n=2000-5000) |
|---|---|---|
| WER Reduction | TBD (integration pending) | 15-30% relative improvement |
| ΔWER (Fairness) | TBD | 20-40% reduction in speaker gap |
| Correction Quality | TBD | >70% helpful, <10% harmful |
| Inference Latency | TBD | <200ms per utterance (BERT + T5) |
| BERT Error Classifier | Eval loss: 3.01 (overfitting) | Eval loss: <1.0 (generalizing) |
Note on Dataset Size: The LearnerVoice paper demonstrated that 50 hours of spontaneous L2 speech (approximately 5000+ utterances) achieved strong fine-tuning results. Given the extreme class imbalance in our task, we estimate 2000-5000 annotated utterances will be necessary for the BERT classifier to learn rare error types effectively.
Issue: Only 70 training samples leads to BERT classifier memorization rather than learning generalizable patterns.
Evidence: Train loss (0.77) vs. eval loss (3.01) indicates a ~3.9x overfitting gap.
Mitigation:
Issue: 90% of tokens are "equal" (correct), making it hard to learn rare error types like insertion (n=3).
Evidence: Model likely predicts "equal" for 95%+ of tokens to minimize loss.
Mitigation:
Issue: Current pilot data only covers 1 speaker: NCC (Mandarin Chinese L1).
Mitigation:
Issue: L2-ARCTIC uses read speech, which has fewer disfluencies than spontaneous speech.
Evidence: Only 2 disfluency examples in 902 tokens (0.2%).
Mitigation:
Issue: Aggressive T5 grammar correction may "fix" variations in non-native reference transcripts, increasing WER.
Evidence: LearnerVoice paper found that over-normalization can harm WER when references contain natural L2 variations.
Mitigation:
Our deletion handling (e.g., deletion+equal) keeps all annotations aligned with the hypothesis, avoiding complex realignment.
All team members contributed equally to experimental design, code review, results analysis, and report writing.
We thank Professor Dongyang Kang and TAs Shuyu Gan and Drew Gjerstad for their invaluable guidance, feedback, and support throughout this project. We also acknowledge the creators of L2-ARCTIC (Zhao et al., 2018), LearnerVoice (Kim et al., 2024), and Mozilla Common Voice (Mozilla Foundation, 2024) for making their datasets publicly available for research. Computational resources for this project were provided by Google Colab (T4/A100 GPUs).
Midterm Checkpoint Completed
Last Updated: November 14, 2024
Final Submission: December 2024
CSCI 5541 - Natural Language Processing | Fall 2024
University of Minnesota Twin Cities