Despite advances in automatic speech recognition (ASR), large models like Whisper exhibit higher Word Error Rates (WER) for non-native English speakers, creating fairness and accessibility concerns. We propose a multi-agent post-processing pipeline that improves ASR accuracy for non-native speakers without retraining the underlying model. Our system consists of three specialized agents: (1) a BERT-based Error Analysis Agent that detects and classifies token-level errors, (2) a Correction Agent that applies rule-based fixes and T5-based grammar correction using detected error labels, and (3) an Evaluation Agent that measures improvements and provides feedback. On our L2-ARCTIC pilot dataset (100 samples from Mandarin Chinese and Arabic speakers), we demonstrate the feasibility of learned error detection and establish baseline metrics for accent-aware correction.
Our multi-agent pipeline operates entirely on text after ASR transcription, making it model-agnostic and computationally efficient.
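Concretely, a single utterance flows through the three agents as plain text. The sketch below illustrates that chain; the class and method names (predict_labels, correct, evaluate) are placeholders rather than our exact interfaces:

```python
from typing import Optional

def run_pipeline(hypothesis: str,
                 error_agent,        # BERT-based Error Analysis Agent
                 correction_agent,   # rule-based + T5 Correction Agent
                 evaluation_agent,   # WER-tracking Evaluation Agent
                 reference: Optional[str] = None) -> dict:
    """Post-process one ASR hypothesis with the three-agent pipeline."""
    # 1. Detect and classify token-level errors (e.g., "equal", "substitution").
    labels = error_agent.predict_labels(hypothesis)

    # 2. Apply label-guided rule fixes, then T5 grammar correction.
    corrected = correction_agent.correct(hypothesis, labels)

    # 3. Measure the improvement when a reference transcript is available.
    report = (evaluation_agent.evaluate(reference, hypothesis, corrected)
              if reference is not None else None)

    return {"labels": labels, "corrected": corrected, "report": report}
```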
Unlike monolithic correction systems, our modular design allows each agent to specialize:
What problem are we solving?
Current ASR systems perform significantly worse on non-native English speech, with WER gaps of 20-50% compared to native speakers. Fine-tuning models is computationally expensive and requires large amounts of accent-specific data. Moreover, fine-tuned models lack transparency about what was improved.
Why does it matter?
Fair and accurate ASR is critical for education (language learners), accessibility (voice interfaces), and professional settings (international workplaces). Our approach provides an interpretable, efficient, and portable solution that works with any ASR system.
Current limitations of existing approaches:
We sampled 100 utterances from the L2-ARCTIC dataset (scripted non-native English speech), all from speaker NCC (a native Mandarin Chinese speaker). We established baseline metrics using OpenAI Whisper (small.en):
| Metric | Value | Notes |
|---|---|---|
| Overall WER | 9.6% | Better than expected for non-native speech |
| Overall CER | 4.8% | Character-level accuracy |
| Average WER | 10.3% | Per-utterance average |
| Average CER | 5.4% | Per-utterance average |
| Max Sentence Length | 13 words | Important for model context window |
| Total Tokens Annotated | 902 | Word-level error labels |
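These baseline figures come from transcribing each utterance with Whisper and scoring it against the L2-ARCTIC prompt. A minimal sketch of that measurement, assuming the openai-whisper and jiwer packages and simple lowercasing as normalization (our exact normalization may differ):

```python
import jiwer
import whisper  # openai-whisper package

def baseline_metrics(utterances):
    """utterances: list of (wav_path, reference_text) pairs for the pilot set."""
    model = whisper.load_model("small.en")
    refs, hyps, utt_wer, utt_cer = [], [], [], []

    for wav_path, reference in utterances:
        hypothesis = model.transcribe(wav_path)["text"].strip().lower()
        reference = reference.strip().lower()
        refs.append(reference)
        hyps.append(hypothesis)
        utt_wer.append(jiwer.wer(reference, hypothesis))
        utt_cer.append(jiwer.cer(reference, hypothesis))

    return {
        "overall_wer": jiwer.wer(refs, hyps),        # corpus-level score
        "overall_cer": jiwer.cer(refs, hyps),
        "average_wer": sum(utt_wer) / len(utt_wer),  # per-utterance average
        "average_cer": sum(utt_cer) / len(utt_cer),
    }
```

The Overall rows are corpus-level scores, while the Average rows are per-utterance means, which is why the two differ slightly.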
We manually annotated all 100 samples with word-level error labels, creating 902 total token annotations. Our comprehensive taxonomy includes 8 error types covering both standard ASR errors and accent-specific patterns:
Our dataset exhibits an 817:1 imbalance ratio between the majority class (correct tokens) and the rarest error types, presenting significant challenges for model training.
| Error Type | Count | Percentage | Description |
|---|---|---|---|
| equal | 817 | 90.6% | Correct transcription (majority class) |
| substitution | 45 | 5.0% | Word replaced with different word |
| accent_pronunciation | 17 | 1.9% | Accent-driven phonetic error (e.g., "color" → "cooler") |
| homophone | 10 | 1.1% | Sound-alike substitution (e.g., "20th" vs. "twentieth") |
| deletion+equal | 7 | 0.8% | Missing word (label concatenated with next token) |
| insertion | 3 | 0.3% | Extra word inserted (insufficient for learning) |
| substitution+deletion | 1 | 0.1% | Compound error |
| repetition+equal | 1 | 0.1% | Disfluency: word repetition |
| equal+repetition | 1 | 0.1% | Disfluency: word repetition (variant) |
Key Innovation: Our deletion handling concatenates the deletion label with the next word's label (e.g., deletion+equal), keeping all annotations aligned with the hypothesis tokens rather than requiring complex realignment with the reference text.
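A simplified version of this alignment step is sketched below: it assigns one coarse label per hypothesis token via edit-distance opcodes and folds any preceding deletion into the next token's label. The finer-grained classes (homophone, accent_pronunciation, repetition) were assigned manually on top of this skeleton, and the helper name is our own:

```python
from difflib import SequenceMatcher

def label_hypothesis_tokens(reference: str, hypothesis: str) -> list[str]:
    """One coarse label per hypothesis token; deletions fold into the next label."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    labels = [""] * len(hyp)
    pending_deletion = False  # a reference word was dropped before the next hyp token

    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "delete":        # no hypothesis token to attach the label to yet
            pending_deletion = True
            continue
        base = {"equal": "equal", "replace": "substitution", "insert": "insertion"}[op]
        for j in range(j1, j2):
            labels[j] = f"deletion+{base}" if pending_deletion else base
            pending_deletion = False

    if pending_deletion and labels:  # deletion at the very end of the utterance
        labels[-1] += "+deletion"
    return labels
```

For example, label_hypothesis_tokens("please call stella", "please stella") yields ["equal", "deletion+equal"].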
| Metric | Value | Interpretation |
|---|---|---|
| Final Training Loss | 0.77 | Low training error |
| Final Eval Loss | 3.01 | ⚠️ Severe overfitting |
| Training Progress | 2.08 → 0.18 | Loss decreased across 280 steps |
| Overfitting Gap | ~3.9x | Eval loss / Train loss ratio |
⚠️ Expected Limitation: With only 70 training samples and an 817:1 class imbalance, severe overfitting is expected. The model likely predicts "equal" for 95%+ of tokens. This validates our architecture and error taxonomy but requires significantly more data (2000-5000+ samples, based on LearnerVoice results) for production-quality performance.
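For reference, the Error Analysis Agent is fine-tuned as a standard HuggingFace token classifier over the nine label strings from the taxonomy above (eight error types plus equal). The condensed sketch below uses illustrative hyperparameters and assumes the annotated utterances are loaded elsewhere as (words, labels) pairs, split roughly 70/30:

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

LABELS = ["equal", "substitution", "accent_pronunciation", "homophone",
          "deletion+equal", "insertion", "substitution+deletion",
          "repetition+equal", "equal+repetition"]
label2id = {lab: i for i, lab in enumerate(LABELS)}
id2label = {i: lab for lab, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(words, word_labels):
    """Tokenize one hypothesis and align word-level labels to sub-word pieces."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for wid in enc.word_ids():
        # -100 masks special tokens and non-initial sub-word pieces from the loss.
        aligned.append(label2id[word_labels[wid]] if wid is not None and wid != prev else -100)
        prev = wid
    enc["labels"] = aligned
    return enc

def train_error_agent(train_examples, eval_examples, output_dir="error-analysis-agent"):
    """train/eval_examples: lists of (words, word_labels) pairs from the annotation CSV."""
    train_data = [encode(w, l) for w, l in train_examples]
    eval_data = [encode(w, l) for w, l in eval_examples]

    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS), id2label=id2label, label2id=label2id)

    args = TrainingArguments(output_dir=output_dir, num_train_epochs=20,
                             per_device_train_batch_size=8, learning_rate=5e-5)
    trainer = Trainer(model=model, args=args, train_dataset=train_data,
                      eval_dataset=eval_data,
                      data_collator=DataCollatorForTokenClassification(tokenizer))
    trainer.train()
    return trainer.evaluate()  # eval loss on the held-out split
```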
We extracted accent-specific phoneme confusion patterns for speaker NCC (native Mandarin Chinese) by aligning ASR errors with phonetic transcriptions. These patterns guide the Correction Agent's rule-based fixes:
| Reference Phoneme | Confused With | Probability | Linguistic Pattern |
|---|---|---|---|
| R | L, AA1 | ~0.059 each | Classic r/l confusion + vowelization |
| N | D, L | ~0.059 each | Nasal-stop confusion |
| T | D, final_stop_deletion | ~0.059 each | Final consonant deletion (common in Mandarin) |
| D | V, Z | ~0.059 each | Stop-fricative confusion |
| Vowels | Multiple patterns | ~0.059 | AA1→AH1, AY1→IH1, EH1→DH |
💡 Impact: These patterns enable the Correction Agent to apply targeted, accent-aware fixes rather than generic corrections. The confusion matrix provides statistical evidence for which phoneme substitutions are systematic vs. random.
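The construction itself is just a row-normalized count over aligned phoneme pairs extracted from mis-transcribed words. The sketch below is our own illustration of that bookkeeping; the input format and function name are assumptions, and the phoneme-level alignment is done separately:

```python
from collections import Counter, defaultdict

def build_confusion_matrix(aligned_pairs):
    """aligned_pairs: list of (reference_phoneme, hypothesis_phoneme) tuples,
    e.g. [("R", "L"), ("R", "AA1"), ("T", "D"), ...], extracted from words
    that Whisper transcribed incorrectly."""
    pairs = list(aligned_pairs)
    confusions = defaultdict(Counter)
    totals = Counter(ref for ref, _ in pairs)

    for ref_ph, hyp_ph in pairs:
        if ref_ph != hyp_ph:                     # keep only genuine confusions
            confusions[ref_ph][hyp_ph] += 1

    # Row-normalize: entry [ref][hyp] approximates P(hyp phoneme | ref phoneme).
    return {ref: {hyp: count / totals[ref] for hyp, count in row.items()}
            for ref, row in confusions.items()}
```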
Our Correction Agent implements a novel two-stage correction pipeline that uses BERT error labels to guide both rule-based and neural corrections:
Stage 1: Rule-Based Corrections (Label-Guided)
- deletion: reintroduce missing words from the reference
- filler: remove filler words (um, uh, etc.) from the hypothesis
- repetition: remove repeated words (stutters) from the hypothesis
- accent_pronunciation: apply targeted phoneme-confusion fixes for tokens with accent_pronunciation labels
- equal, substitution, and homophone: leave unchanged for Stage 2

Stage 2: T5 Grammar Correction
Based on findings from the LearnerVoice paper (Kim et al., 2024), aggressive grammar correction is not always optimal. Non-native reference transcripts may themselves contain grammatical variations, so excessive correction can actually increase WER by diverging from the reference text. Our T5 stage is calibrated to improve fluency while preserving semantic content and avoiding over-normalization.
Current Implementation Status: The Correction Agent is fully implemented and operational. It currently uses manually annotated error labels from the CSV. Once the BERT classifier generates predicted labels, they will be plugged in seamlessly using the same interface.
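The interface is deliberately simple: the agent receives hypothesis tokens plus one label per token and returns corrected text. The sketch below illustrates both stages under simplifying assumptions: the filler list, the word-level phoneme_rules lookup, and the particular T5 checkpoint are placeholders, and the deletion-reinsertion rule (which requires the reference) is omitted.

```python
from transformers import pipeline

# Stage 2 model: a T5 checkpoint fine-tuned for grammar correction. The checkpoint
# name and the "grammar: " prefix follow its model card; our exact model may differ.
t5_corrector = pipeline("text2text-generation",
                        model="vennify/t5-base-grammar-correction")

FILLERS = {"um", "uh", "er", "ah"}  # illustrative filler list

def correct(tokens, labels, phoneme_rules=None):
    """tokens: hypothesis words; labels: one error label per word (manual or BERT-predicted)."""
    # Stage 1: label-guided, rule-based pass over the hypothesis tokens.
    kept = []
    for tok, lab in zip(tokens, labels):
        if "repetition" in lab and kept and tok.lower() == kept[-1].lower():
            continue                                   # drop stuttered repeats
        if tok.lower() in FILLERS:
            continue                                   # drop filler words
        if lab == "accent_pronunciation" and phoneme_rules:
            tok = phoneme_rules.get(tok.lower(), tok)  # targeted accent-aware swap
        kept.append(tok)      # equal / substitution / homophone pass through to Stage 2

    # Stage 2: conservative T5 grammar correction of the repaired draft.
    draft = " ".join(kept)
    result = t5_corrector("grammar: " + draft, max_new_tokens=64)
    return result[0]["generated_text"]
```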
The Evaluation Agent compares WER before and after correction, tracking:
Based on these metrics, the agent decides whether to:
Current Status: WER calculation and per-speaker metrics are complete. Feedback loop integration is in progress.
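The core comparison is a per-utterance WER delta; per-speaker metrics follow from averaging these records grouped by speaker ID. A minimal sketch (field names are our own):

```python
import jiwer

def evaluate_correction(reference: str, original_hyp: str, corrected_hyp: str) -> dict:
    """Compare WER before and after correction for a single utterance."""
    wer_before = jiwer.wer(reference, original_hyp)
    wer_after = jiwer.wer(reference, corrected_hyp)
    return {
        "wer_before": wer_before,
        "wer_after": wer_after,
        "delta_wer": wer_before - wer_after,   # positive -> the correction helped
        "helpful": wer_after < wer_before,
        "harmful": wer_after > wer_before,     # feeds the planned feedback loop
    }
```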
| Component | Status | Details |
|---|---|---|
| Dataset Preparation | ✅ Complete | 100 L2-ARCTIC samples (speaker NCC) |
| Baseline ASR | ✅ Complete | Whisper small.en, WER: 9.6% |
| Manual Annotation | ✅ Complete | 902 tokens with 8-class error labels |
| Error Analysis Agent (BERT) | ✅ Trained | Proof-of-concept (overfitting expected with n=70) |
| Phoneme Confusion Matrices | ✅ Built | Speaker-specific patterns (NCC) |
| Correction Agent (Hybrid) | ✅ Complete | Rule-based + T5 grammar correction operational |
| Evaluation Agent | 🔄 In Progress | Metrics complete, feedback loop pending |
| Full Pipeline Integration | 🔄 In Progress | Connecting BERT predictions → Corrections → Evaluation |
| Large-Scale Evaluation | 📋 Planned | Waiting for LearnerVoice & Common Voice access |
Based on related work (e.g., LearnerVoice fine-tuning reduced WER by 44% with 50 hours of data), we anticipate the following improvements over baseline Whisper:
| Metric | Current (n=100) | Expected (n=2000-5000) |
|---|---|---|
| WER Reduction | TBD (integration pending) | 15-30% relative improvement |
| ΔWER (Fairness) | TBD | 20-40% reduction in speaker gap |
| Correction Quality | TBD | >70% helpful, <10% harmful |
| Inference Latency | TBD | <200ms per utterance (BERT + T5) |
| BERT Error Classifier | Eval loss: 3.01 (overfitting) | Eval loss: <1.0 (generalizing) |
Note on Dataset Size: The LearnerVoice paper demonstrated that 50 hours of spontaneous L2 speech (approximately 5000+ utterances) achieved strong fine-tuning results. Given the extreme class imbalance in our task, we estimate 2000-5000 annotated utterances will be necessary for the BERT classifier to learn rare error types effectively.
Issue: Only 70 training samples leads to BERT classifier memorization rather than learning generalizable patterns.
Evidence: Train loss (0.77) vs. eval loss (3.01) indicates a ~3.9x overfitting gap.
Mitigation:
Issue: 90% of tokens are "equal" (correct), making it hard to learn rare error types like insertion (n=3).
Evidence: Model likely predicts "equal" for 95%+ of tokens to minimize loss.
Mitigation:
Issue: Current pilot data only covers 1 speaker: NCC (Mandarin Chinese L1).
Mitigation:
Issue: L2-ARCTIC uses read speech, which has fewer disfluencies than spontaneous speech.
Evidence: Only 2 disfluency examples in 902 tokens (0.2%).
Mitigation:
Issue: Aggressive T5 grammar correction may "fix" variations in non-native reference transcripts, increasing WER.
Evidence: LearnerVoice paper found that over-normalization can harm WER when references contain natural L2 variations.
Mitigation:
Our deletion handling (e.g., deletion+equal) keeps all annotations aligned with the hypothesis, avoiding complex realignment.
All team members contributed equally to experimental design, code review, results analysis, and report writing.
We thank Professor Dongyang Kang and TAs Shuyu Gan and Drew Gjerstad for their invaluable guidance, feedback, and support throughout this project. We also acknowledge the creators of L2-ARCTIC (Zhao et al., 2018), LearnerVoice (Kim et al., 2024), and Mozilla Common Voice (Mozilla Foundation, 2024) for making their datasets publicly available for research. Computational resources for this project were provided by Google Colab (T4/A100 GPUs).
Midterm Checkpoint Completed
Last Updated: November 14, 2024
Final Submission: December 2024
CSCI 5541 - Natural Language Processing | Fall 2024
University of Minnesota Twin Cities