Whisper
Contributions
The paper makes two major contributions:
- Training on a weakly supervised dataset
- Training a multi-task model
Weakly supervised
- This just means that the data is not perfect: it is noisy and can contain mistakes
- The data is not a gold standard; it is not labeled by humans but partly by other pretrained ASR models
- “Transcript-ese”: the paper's term for data generated by other ASR models
- They used heuristics to remove sub-par quality data and data generated by other ASR models
Multi-task
- The model is trained on multiple tasks
- Transcribing English audio to English text
- Transcribing audio in language X to text in language X
- Translating audio in language X to English text
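A rough sketch of what this looks like in practice, assuming the openai-whisper package is installed (the audio file name is a placeholder): the same checkpoint handles both tasks by switching the task argument.

```python
import whisper

# One multi-task checkpoint; the model size here is an arbitrary choice.
model = whisper.load_model("base")

# Transcribe: French audio -> French text (X -> X).
same_lang = model.transcribe("speech_fr.mp3", task="transcribe", language="fr")
print(same_lang["text"])

# Translate: the same French audio -> English text (X -> en).
to_english = model.transcribe("speech_fr.mp3", task="translate", language="fr")
print(to_english["text"])
```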
New concepts to me
Log mel spectrogram
- A color image of the speech signal
- The X axis is time and the Y axis is frequency in mels
- The mel scale is just a unit conversion from hertz (frequency) that reflects how humans perceive pitch
- The color at each point depicts the (log) amplitude
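A minimal sketch of computing one, assuming librosa is available (the file name is a placeholder; the 25 ms window, 10 ms hop, and 80 mel bins match the settings Whisper uses for its input features):

```python
import librosa

# Hypothetical audio file; Whisper resamples all audio to 16 kHz.
y, sr = librosa.load("speech.wav", sr=16000)

# Mel spectrogram with 25 ms windows (400 samples), 10 ms hop (160 samples),
# and 80 mel bins.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)

# Taking the log (here via decibels) gives the log-mel spectrogram:
# X axis = time frames, Y axis = mel bins, values = log amplitude.
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (80, num_frames)
```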
Tokenization?
- Why separate tokenization methods?
- Tokenization breaks a sentence into words and words into sub-words, assigns each sub-word an index, and converts that index into an embedding during the forward pass
- The naive way of doing tokenization involves a lot of rules and ends up creating a huge vocabulary. It is also tough to keep updated, since every new word requires a new token
- A specialized tokenizer can tokenize any word across different languages because it mixes character-level and word-level units
- How are tokenization and embedding related? Some answer here
- Whisper uses a byte-level BPE tokenizer (not a contribution of this research paper)
- Reference of popular tokenization methods
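To get a feel for byte-level BPE, here is a small sketch using tiktoken's GPT-2 encoding as a stand-in (not Whisper's exact vocabulary):

```python
import tiktoken

# GPT-2's byte-level BPE; Whisper's tokenizer follows the same idea.
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Whisper transcribes speech.")
print(ids)                              # sub-word indices
print([enc.decode([i]) for i in ids])   # the sub-word each index maps to

# Because it operates on raw bytes, any string round-trips exactly,
# even words and scripts the tokenizer has never seen:
assert enc.decode(enc.encode("zxqvortle 音声")) == "zxqvortle 音声"
```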
Other ideas
- How to measure robustness?
- Train on data D
- Evaluate on a similar-distribution dev dataset drawn from D -> X
- Evaluate on a completely out-of-distribution dataset Y
- If the model is completely robust, the error rate on X and Y should be the same
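A toy sketch of that comparison, assuming the jiwer package for WER (the reference/hypothesis strings are placeholders):

```python
import jiwer

# Hypothetical model outputs on an in-distribution set X
# and an out-of-distribution set Y.
refs_x = ["turn the lights off", "what time is it"]
hyps_x = ["turn the lights off", "what time is it"]

refs_y = ["please schedule the meeting", "read my last email"]
hyps_y = ["please schedule a meeting", "read my last mail"]

wer_x = jiwer.wer(refs_x, hyps_x)  # in-distribution word error rate
wer_y = jiwer.wer(refs_y, hyps_y)  # out-of-distribution word error rate

# A perfectly robust model would have wer_x == wer_y; the gap between
# the two is one way to quantify how robust it actually is.
print(f"WER on X: {wer_x:.2%}  WER on Y: {wer_y:.2%}  gap: {wer_y - wer_x:.2%}")
```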
- Models to try
- CLD2 model for language detection
- Maestro for X -> en translation
- mSLAM-CTC for language identification
- Text normalizer
- Rules/methods to convert text into a standard format. For example, "you're" and "you are" are converted to the same form
- This also helps in calculating WER
- This is done manually and a fixed set of rules is stored
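A toy sketch of the idea (this is not Whisper's actual normalizer, which ships a much more extensive rule set; the contraction table here is a placeholder):

```python
import re

CONTRACTIONS = {"you're": "you are", "won't": "will not", "it's": "it is"}

def normalize(text: str) -> str:
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^\w\s]", "", text)        # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

# Both spellings normalize to the same string, so the difference
# no longer counts as an error when computing WER.
print(normalize("You're late!"))  # -> "you are late"
print(normalize("you are late"))  # -> "you are late"
```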
- Greedy search vs Beam search
- Instead of greedily picking the single most probable token at each timestep, beam search looks for the best overall sequence of tokens
- With a beam width of 2, at each timestep the algorithm keeps the 2 best partial sequences, extends each one with candidate tokens conditioned on it, and again keeps only the 2 best extensions. This eventually yields (an approximation of) the best overall sequence
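A self-contained toy implementation of that pruning loop (not Whisper's decoding code; the "model" here is a made-up next-token distribution):

```python
import math

def beam_search(next_log_probs, beam_width=2, length=3):
    """Keep only the `beam_width` best partial sequences at every step."""
    beams = [([], 0.0)]  # (token sequence so far, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            # Extend each surviving sequence with every candidate token.
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Prune: only the best `beam_width` sequences move to the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy stand-in for a decoder: same next-token distribution regardless of context.
toy_model = lambda seq: {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}
print(beam_search(toy_model, beam_width=2, length=3))
```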
- CTC loss
- In our case, the input is a continuous audio signal and the output is a sequence of words at discrete timestamps; CTC is a loss function designed for such scenarios
- CTC aligns the input and output sequences when the input is continuous, the output is discrete, and there are no clear element boundaries that map the input to the elements of the output sequence
- Slightly complex, need to understand fully
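A minimal PyTorch sketch of what the CTC loss expects, using random tensors with arbitrary placeholder sizes (note that Whisper itself is an encoder-decoder trained with cross-entropy, not CTC; this is only to understand baselines like mSLAM-CTC):

```python
import torch
import torch.nn as nn

# Placeholder sizes: T input frames, N utterances, C output classes
# (including the blank at index 0), S target tokens.
T, N, C, S = 50, 1, 30, 12

log_probs = torch.randn(T, N, C).log_softmax(dim=2)  # per-frame class log-probs
targets = torch.randint(1, C, (N, S))                # target token ids
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC marginalizes over all monotonic frame-to-token alignments, so no
# explicit mapping between the T frames and the S tokens is needed.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```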