Multi-Agent Post-Processing Pipeline for Non-Native English ASR

Fall 2024 CSCI 5541 NLP: Class Project - University of Minnesota

MultiAgentTeam

Rishabh Agarwal

Ella Boytim

Sharon Soedarto



Abstract

Despite advances in automatic speech recognition (ASR), large models like Whisper exhibit higher Word Error Rates (WER) for non-native English speakers, creating fairness and accessibility concerns. We propose a multi-agent post-processing pipeline that improves ASR accuracy for non-native speakers without retraining the underlying model. Our system consists of three specialized agents: (1) a BERT-based Error Analysis Agent that detects and classifies token-level errors, (2) a Correction Agent that applies rule-based fixes and T5-based grammar correction using detected error labels, and (3) an Evaluation Agent that measures improvements and provides feedback. On our L2-ARCTIC pilot dataset (100 samples from a native Mandarin Chinese speaker), we demonstrate the feasibility of learned error detection and establish baseline metrics for accent-aware correction.


System Overview

Our multi-agent pipeline operates entirely on text after ASR transcription, making it model-agnostic and computationally efficient.

[Figure: Multi-Agent Pipeline architecture diagram]

Key Innovation: Agent-Based Architecture

Unlike monolithic correction systems, our modular design allows each agent to specialize (see the sketch below):

  1. Error Analysis Agent: a BERT token classifier that detects and labels word-level errors.
  2. Correction Agent: applies label-guided rule-based fixes, then T5 grammar correction.
  3. Evaluation Agent: measures WER improvements and feeds results back to the pipeline.
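To make the data flow concrete, here is a minimal Python sketch of the agent interfaces; the class names, method signatures, and stand-in bodies are illustrative assumptions, not our exact implementation.

```python
# A minimal sketch of the three-agent pipeline; names are illustrative.
from dataclasses import dataclass

@dataclass
class Transcript:
    tokens: list[str]                  # hypothesis tokens from the ASR system
    labels: list[str] | None = None    # per-token error labels

class ErrorAnalysisAgent:
    def analyze(self, t: Transcript) -> Transcript:
        t.labels = ["equal"] * len(t.tokens)   # stand-in for BERT predictions
        return t

class CorrectionAgent:
    def correct(self, t: Transcript) -> Transcript:
        return t                       # stand-in for rule-based + T5 stages

class EvaluationAgent:
    def evaluate(self, before: Transcript, after: Transcript, ref: str) -> dict:
        return {"accepted": True}      # stand-in for WER comparison / feedback

def run_pipeline(asr_output: str, reference: str) -> dict:
    original = Transcript(asr_output.split())
    analyzed = ErrorAnalysisAgent().analyze(original)
    corrected = CorrectionAgent().correct(analyzed)
    return EvaluationAgent().evaluate(original, corrected, reference)
```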

Motivation

What problem are we solving?

Current ASR systems perform significantly worse on non-native English speech, with WER gaps of 20-50% compared to native speakers. Fine-tuning models is computationally expensive and requires large amounts of accent-specific data. Moreover, fine-tuned models lack transparency about what was improved.

Why does it matter?

Fair and accurate ASR is critical for education (language learners), accessibility (voice interfaces), and professional settings (international workplaces). Our approach provides an interpretable, efficient, and portable solution that works with any ASR system.

Current limitations of existing approaches:

  1. Fine-tuning must be repeated for every ASR model and is computationally expensive.
  2. Large amounts of accent-specific training data are required.
  3. Fine-tuned models lack transparency about what was improved.


Our Approach

1. Data Preparation & Baseline Establishment

We sampled 100 utterances from the L2-ARCTIC dataset (scripted non-native English speech) from one speaker, NCC (a native Mandarin Chinese speaker), and established baseline metrics using OpenAI Whisper (small.en):

Baseline ASR Performance (n=100)

| Metric | Value | Notes |
|---|---|---|
| Overall WER | 9.6% | Better than expected for non-native speech |
| Overall CER | 4.8% | Character-level accuracy |
| Average WER | 10.3% | Per-utterance average |
| Average CER | 5.4% | Per-utterance average |
| Max sentence length | 13 words | Important for model context window |
| Total tokens annotated | 902 | Word-level error labels |
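For reference, a minimal sketch of how such baseline numbers can be measured, assuming the openai-whisper and jiwer packages; the audio path is a placeholder, and text normalization is deliberately simplified.

```python
# A minimal sketch of baseline WER/CER measurement; the audio path is a
# placeholder and normalization is simplified for illustration.
import re
import jiwer
import whisper

normalize = lambda s: re.sub(r"[^a-z' ]+", " ", s.lower()).strip()

model = whisper.load_model("small.en")
hyp = model.transcribe("l2arctic/NCC/wav/arctic_a0001.wav")["text"]
ref = "Author of the danger trail, Philip Steels, etc."  # L2-ARCTIC prompt

print("WER:", jiwer.wer(normalize(ref), normalize(hyp)))
print("CER:", jiwer.cer(normalize(ref), normalize(hyp)))
```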

2. Error Analysis Agent (BERT Token Classifier)

We manually annotated all 100 samples with word-level error labels, creating 902 total token annotations. Our comprehensive taxonomy includes 8 error types covering both standard ASR errors and accent-specific patterns:

⚠️ Critical Challenge: Extreme Class Imbalance

Our dataset exhibits an 817:1 imbalance ratio between the majority class (correct tokens) and the rarest error types, presenting significant challenges for model training.

| Error Type | Count | Percentage | Description |
|---|---|---|---|
| equal | 817 | 90.1% | ✓ Correct transcription (majority class) |
| substitution | 45 | 5.0% | Word replaced with a different word |
| accent_pronunciation | 17 | 1.9% | Accent-driven phonetic error (e.g., "color" → "cooler") |
| homophone | 10 | 1.1% | Sound-alike substitution (e.g., "20th" vs. "twentieth") |
| deletion+equal | 7 | 0.8% | Missing word (label concatenated with next token) |
| insertion | 3 | 0.3% | ❌ Extra word inserted (insufficient for learning) |
| substitution+deletion | 1 | 0.1% | Compound error |
| repetition+equal | 1 | 0.1% | Disfluency: word repetition |
| equal+repetition | 1 | 0.1% | Disfluency: word repetition (variant) |

Key Innovation: Our deletion handling concatenates the deletion label with the next word's label (e.g., deletion+equal), keeping all annotations aligned with the hypothesis tokens rather than requiring complex realignment with the reference text.
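To illustrate the scheme, here is a minimal sketch of hypothesis-aligned labeling built on Python's difflib; whitespace tokenization and the helper itself are simplifying assumptions, not our exact annotation tooling.

```python
# A minimal sketch of hypothesis-aligned error labeling; a trailing
# deletion (nothing after it to attach to) is ignored for brevity.
from difflib import SequenceMatcher

def label_hypothesis(reference: str, hypothesis: str) -> list[tuple[str, str]]:
    ref, hyp = reference.split(), hypothesis.split()
    labels = [""] * len(hyp)
    pending_deletion = False           # a reference word was just dropped

    for op, r1, r2, h1, h2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "delete":             # ref word(s) missing from hypothesis
            pending_deletion = True
            continue
        base = {"equal": "equal", "replace": "substitution",
                "insert": "insertion"}[op]
        for h in range(h1, h2):
            labels[h] = ("deletion+" + base) if pending_deletion else base
            pending_deletion = False
    return list(zip(hyp, labels))

print(label_hypothesis("the color of the sky", "the cooler of sky"))
# [('the', 'equal'), ('cooler', 'substitution'),
#  ('of', 'equal'), ('sky', 'deletion+equal')]
```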

BERT Model Training Configuration

Training Results (Proof-of-Concept)

| Metric | Value | Interpretation |
|---|---|---|
| Final training loss | 0.77 | Low training error |
| Final eval loss | 3.01 | ❌ Severe overfitting |
| Training progress | 2.08 → 0.18 | Loss decreased across 280 steps |
| Overfitting gap | ~3.9x | Eval loss / train loss ratio |

⚠️ Expected Limitation: With only 70 training samples and an 817:1 class imbalance, severe overfitting is expected. The model likely predicts "equal" for 95%+ of tokens. This validates our architecture and error taxonomy, but production-quality performance will require significantly more data (2,000-5,000+ samples, based on LearnerVoice results).
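For concreteness, a minimal sketch of the token-classification setup with Hugging Face Transformers; the checkpoint, example sentence, and subword-alignment convention (label only the first subword, mask the rest with -100) are illustrative assumptions.

```python
# A minimal sketch of BERT token classification over our label set;
# the checkpoint and example are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["equal", "substitution", "accent_pronunciation", "homophone",
          "deletion+equal", "insertion", "substitution+deletion",
          "repetition+equal", "equal+repetition"]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

words = ["the", "cooler", "of", "the", "sky"]
word_labels = [0, 2, 0, 0, 0]          # 2 = accent_pronunciation ("cooler")

# Align word-level labels to subword tokens: the first subword of each
# word carries the label; special tokens and continuations get -100.
enc = tok(words, is_split_into_words=True, return_tensors="pt")
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(-100 if wid is None or wid == prev else word_labels[wid])
    prev = wid

loss = model(**enc, labels=torch.tensor([aligned])).loss  # training loss
```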

3. Phoneme Confusion Matrix Analysis

We extracted accent-specific phoneme confusion patterns for speaker NCC (native Mandarin Chinese) by aligning ASR errors with phonetic transcriptions. These patterns guide the Correction Agent's rule-based fixes:

📊 Key Phoneme Confusions (NCC Speaker - Mandarin Chinese L1)

| Reference Phoneme | Confused With | Probability | Linguistic Pattern |
|---|---|---|---|
| R | L, AA1 | ~0.059 each | Classic r/l confusion + vowelization |
| N | D, L | ~0.059 each | Nasal-stop confusion |
| T | D, final-stop deletion | ~0.059 each | Final consonant deletion (common in Mandarin) |
| D | V, Z | ~0.059 each | Stop-fricative confusion |
| Vowels | Multiple patterns | ~0.059 | AA1↔AH1, AY1↔IH1, EH1↔DH |

💡 Impact: These patterns enable the Correction Agent to apply targeted, accent-aware fixes rather than generic corrections. The confusion matrix provides statistical evidence for which phoneme substitutions are systematic vs. random.
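A minimal sketch of how such confusion probabilities can be estimated, assuming phoneme-level alignments between reference and hypothesis pronunciations are already available from an earlier step.

```python
# A minimal sketch of phoneme confusion estimation from aligned pairs.
from collections import Counter, defaultdict

def confusion_probs(aligned_pairs):
    """aligned_pairs: iterable of (reference_phoneme, hypothesis_phoneme)."""
    counts = defaultdict(Counter)
    for ref_ph, hyp_ph in aligned_pairs:
        counts[ref_ph][hyp_ph] += 1
    return {ref_ph: {hyp_ph: n / sum(c.values()) for hyp_ph, n in c.items()}
            for ref_ph, c in counts.items()}

pairs = [("R", "L"), ("R", "R"), ("R", "AA1"), ("T", "D"), ("T", "T")]
probs = confusion_probs(pairs)
print(probs["R"]["L"])   # 1/3 of reference R tokens were heard as L
```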

4. Correction Agent (Hybrid Rule-Based + T5)

Our Correction Agent implements a novel two-stage correction pipeline that uses BERT error labels to guide both rule-based and neural corrections:

Two-Stage Correction Pipeline

Stage 1: Rule-Based Corrections (Label-Guided). Tokens flagged by the Error Analysis Agent receive targeted, accent-aware substitutions informed by the phoneme confusion matrices; tokens labeled "equal" are left untouched.

Stage 2: T5 Grammar Correction. The rule-corrected sentence is then passed through a T5 grammar-correction model to improve fluency.

⚠️ Important Consideration: Grammar Over-Correction

Based on findings from the LearnerVoice paper (Kim et al., 2024), aggressive grammar correction is not always optimal. Non-native reference transcripts may themselves contain grammatical variations, so excessive correction can actually increase WER by diverging from the reference text. Our T5 stage is calibrated to improve fluency while preserving semantic content and avoiding over-normalization.

Current Implementation Status: The Correction Agent is fully implemented and operational. It currently uses manually annotated error labels from the CSV. Once the BERT classifier generates predicted labels, they will be plugged in seamlessly using the same interface.
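A minimal sketch of the two-stage flow under stated assumptions: the rule table is a single hypothetical entry, and the T5 checkpoint (a public grammar-correction fine-tune that expects a "grammar: " prompt prefix) stands in for whatever model the final system ships with.

```python
# A minimal sketch of label-guided rules followed by T5 grammar correction.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

RULES = {("cooler", "accent_pronunciation"): "color"}   # hypothetical entry

def stage1_rules(tokens, labels):
    # Substitute only where an error label fires; "equal" tokens are
    # never touched, which limits the risk of harming correct output.
    return [RULES.get((t, l), t) for t, l in zip(tokens, labels)]

ckpt = "vennify/t5-base-grammar-correction"   # assumed public checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
t5 = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

def stage2_t5(sentence):
    # Conservative decoding (small beam, no sampling) to limit the
    # over-correction risk discussed above.
    ids = tok("grammar: " + sentence, return_tensors="pt").input_ids
    out = t5.generate(ids, num_beams=2, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

tokens = ["the", "cooler", "of", "sky", "is", "blue"]
labels = ["equal", "accent_pronunciation", "equal", "equal", "equal", "equal"]
print(stage2_t5(" ".join(stage1_rules(tokens, labels))))
```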

5. Evaluation Agent & Feedback Loop

The Evaluation Agent compares WER before and after correction, tracking per-utterance and per-speaker deltas as well as the share of corrections that helped, hurt, or had no effect. Based on these metrics, the agent decides whether to accept the corrected transcript or fall back to the original, and feeds the outcome back to the Correction Agent.

Current Status: WER calculation and per-speaker metrics are complete. Feedback loop integration is in progress.
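A minimal sketch of the before/after comparison, assuming jiwer for WER; the accept/revert rule and verdict labels are illustrative.

```python
# A minimal sketch of the Evaluation Agent's per-utterance decision.
import jiwer

def evaluate(reference: str, asr_hyp: str, corrected: str) -> dict:
    before = jiwer.wer(reference, asr_hyp)
    after = jiwer.wer(reference, corrected)
    verdict = ("helpful" if after < before
               else "harmful" if after > before else "neutral")
    return {"wer_before": before, "wer_after": after,
            "accept": after <= before, "verdict": verdict}

print(evaluate("the color of the sky", "the cooler of the sky",
               "the color of the sky"))
# {'wer_before': 0.2, 'wer_after': 0.0, 'accept': True, 'verdict': 'helpful'}
```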


Current Status (Midterm Checkpoint)

Component Progress

| Component | Status | Details |
|---|---|---|
| Dataset preparation | ✅ Complete | 100 L2-ARCTIC samples (speaker NCC) |
| Baseline ASR | ✅ Complete | Whisper small.en, WER: 9.6% |
| Manual annotation | ✅ Complete | 902 tokens with word-level error labels (8 error types + equal) |
| Error Analysis Agent (BERT) | ✅ Trained | Proof-of-concept (overfitting expected with n=70) |
| Phoneme confusion matrices | ✅ Built | Speaker-specific patterns (NCC) |
| Correction Agent (hybrid) | ✅ Complete | Rule-based + T5 grammar correction operational |
| Evaluation Agent | 🔄 In progress | Metrics complete, feedback loop pending |
| Full pipeline integration | 🔄 In progress | Connecting BERT predictions → corrections → evaluation |
| Large-scale evaluation | 📋 Planned | Waiting for LearnerVoice & Common Voice access |

Next Steps & Timeline

Immediate Goals (By Final Submission - December 2024)

  1. Week 1-2: Pipeline Integration & Initial Testing
  2. Week 2-3: Dataset Expansion (LearnerVoice, Common Voice)
  3. Week 3-4: Model Improvement & Comprehensive Evaluation
  4. Week 5: Final Report & Presentation

Expected Final Results

Based on related work (e.g., LearnerVoice fine-tuning reduced WER by 44% with 50 hours of data), we anticipate the following improvements over baseline Whisper:

| Metric | Current (n=100) | Expected (n=2000-5000) |
|---|---|---|
| WER reduction | TBD (integration pending) | 15-30% relative improvement |
| ΔWER (fairness) | TBD | 20-40% reduction in speaker gap |
| Correction quality | TBD | >70% helpful, <10% harmful |
| Inference latency | TBD | <200 ms per utterance (BERT + T5) |
| BERT error classifier | Eval loss: 3.01 (overfitting) | Eval loss: <1.0 (generalizing) |

Note on Dataset Size: The LearnerVoice paper demonstrated that 50 hours of spontaneous L2 speech (approximately 5000+ utterances) achieved strong fine-tuning results. Given the extreme class imbalance in our task, we estimate 2000-5000 annotated utterances will be necessary for the BERT classifier to learn rare error types effectively.
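For clarity, the ΔWER row above refers to the per-speaker WER gap; a minimal sketch with hypothetical numbers:

```python
# ΔWER: the gap between the worst and best per-speaker WER (lower is fairer).
def delta_wer(per_speaker_wer: dict[str, float]) -> float:
    return max(per_speaker_wer.values()) - min(per_speaker_wer.values())

print(delta_wer({"NCC": 0.096, "native": 0.050}))  # hypothetical values
```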


Current Limitations & Mitigation Strategies

1. Small Training Set → Severe Overfitting

Issue: Only 70 training samples leads to BERT classifier memorization rather than learning generalizable patterns.

Evidence: Train loss (0.77) vs. Eval loss (3.01) indicates ~3.9x overfitting gap.

Mitigation: Expand the annotated dataset toward the 2,000-5,000 utterances estimated above, drawing on LearnerVoice and Common Voice once access is granted.

2. Extreme Class Imbalance (817:1)

Issue: 90% of tokens are "equal" (correct), making it hard to learn rare error types like insertion (n=3).

Evidence: Model likely predicts "equal" for 95%+ of tokens to minimize loss.

Mitigation: Gather enough additional data for rare error types to become learnable, and weight the training loss toward minority classes; a sketch of the weighting follows.
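A minimal sketch of inverse-frequency class weighting using the label counts from the table above; whether this is the scheme we finally adopt is still open.

```python
# Inverse-frequency class weights from the annotated label counts.
import torch

counts = torch.tensor([817., 45., 17., 10., 7., 3., 1., 1., 1.])
weights = counts.sum() / (len(counts) * counts)     # rare classes weigh more
loss_fn = torch.nn.CrossEntropyLoss(weight=weights, ignore_index=-100)
```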

3. Limited Speaker/Accent Coverage (1 speaker)

Issue: Current pilot data only covers 1 speaker: NCC (Mandarin Chinese L1).

Mitigation: Annotate additional L2-ARCTIC speakers (the corpus covers 24 speakers from six L1 backgrounds, including Arabic) and add accented Common Voice data.

4. Scripted Speech Only (Limited Disfluencies)

Issue: L2-ARCTIC uses read speech, which has fewer disfluencies than spontaneous speech.

Evidence: Only 2 disfluency examples in 902 tokens (0.2%).

Mitigation: Incorporate spontaneous speech from LearnerVoice, which is far richer in disfluencies, once dataset access is granted.

5. Grammar Over-Correction Risk

Issue: Aggressive T5 grammar correction may "fix" variations in non-native reference transcripts, increasing WER.

Evidence: LearnerVoice paper found that over-normalization can harm WER when references contain natural L2 variations.

Mitigation: Keep the T5 stage conservatively calibrated (as described in Approach step 4) and use the Evaluation Agent's helpful/harmful tracking to revert corrections that increase WER.


Novel Contributions

  1. Multi-Agent Architecture: First post-ASR correction system with specialized, communicating agents and feedback loops (to our knowledge)
  2. Hybrid Correction Strategy: Novel combination of BERT-guided rule-based corrections + T5 grammar improvement, avoiding over-correction pitfalls
  3. Comprehensive Error Taxonomy: 8-class error classification covering accent-specific errors, disfluencies, and standard ASR errors in a unified framework
  4. Innovative Deletion Handling: Concatenating deletion labels with subsequent tokens (e.g., deletion+equal) keeps all annotations aligned with the hypothesis tokens, avoiding complex realignment
  5. Fairness-Focused Evaluation: Explicit measurement of ΔWER to quantify equity improvements across speaker groups, not just overall accuracy
  6. Model-Agnostic Design: Works with any ASR system (Whisper, Wav2Vec, commercial APIs) without retraining the underlying model

Team Contributions

All team members contributed equally to experimental design, code review, results analysis, and report writing.


Key References

  1. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." Proceedings of ICML 2023. arXiv:2212.04356.
  2. Kim, J., Myung, J., Kang, D., Lee, H., & Kim, J. (2024). "LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech." Proceedings of EMNLP 2024. arXiv:2407.04280.
  3. Zhao, G., Sonsaat, S., Silpachai, A., Lucic, I., Chukharev-Hudilainen, E., Levis, J., & Gutierrez-Osuna, R. (2018). "L2-ARCTIC: A Non-Native English Speech Corpus." Proceedings of Interspeech 2018, pp. 2787-2791.
  4. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., & Goel, S. (2020). "Racial Disparities in Automated Speech Recognition." Proceedings of the National Academy of Sciences (PNAS), 117(14), 7684-7689.
  5. Feng, S., Kudina, O., Halpern, Y., & Scharenborg, O. (2021). "Quantifying Bias in Automatic Speech Recognition." Proceedings of Interspeech 2021, pp. 3810-3814.
  6. Mozilla Foundation (2024). "Mozilla Common Voice 15.0." HuggingFace Datasets. Available: https://huggingface.co/datasets/mozilla-foundation/common_voice_15_0
  7. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT 2019, pp. 4171-4186.

Acknowledgments

We thank Professor Dongyeop Kang and TAs Shuyu Gan and Drew Gjerstad for their invaluable guidance, feedback, and support throughout this project. We also acknowledge the creators of L2-ARCTIC (Zhao et al., 2018), LearnerVoice (Kim et al., 2024), and Mozilla Common Voice (Mozilla Foundation, 2024) for making their datasets publicly available for research. Computational resources for this project were provided by Google Colab (T4/A100 GPUs).


📊 Project Status

Midterm Checkpoint Completed

Last Updated: November 14, 2024

Final Submission: December 2024

CSCI 5541 - Natural Language Processing | Fall 2024

University of Minnesota Twin Cities