
Whisper

  1. Contributions
    1. Weak supervision
    2. Multi-task
  2. New concepts to me
    1. Log mel spectrogram
    2. Tokenization?
    3. Other ideas
    4. References

Contributions

The paper makes two major contributions:

  1. Training on a weakly supervised dataset
  2. Training a multi-task model

Weak supervision

  • This just means the labels are not perfect: the data is noisy and can contain mistakes
  • The data is not gold standard: it was not labeled by humans, and some of it was generated by other pretrained ASR models
  • “Transcript-ese”: the telltale style of transcripts generated by other ASR models
    • The authors used heuristics to detect and remove sub-par data and ASR-generated transcripts

Multi-task

  • The model is trained on multiple tasks (a minimal usage sketch follows this list):
  • Transcribing English speech to English text
  • Transcribing speech in language X to text in language X
  • Translating speech in language X to English text
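
A minimal sketch of switching between these tasks, assuming the open-source openai-whisper package and a placeholder audio file french.mp3:

```python
import whisper  # the open-source openai-whisper package

model = whisper.load_model("base")

# Transcribe: speech in language X -> text in language X
result = model.transcribe("french.mp3", language="fr", task="transcribe")
print(result["text"])  # French text

# Translate: speech in language X -> English text
result = model.transcribe("french.mp3", language="fr", task="translate")
print(result["text"])  # English text
```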

New concepts to me

Log mel spectrogram

  • Like a color photo of speech
  • The X axis is time and the Y axis is frequency in mels
  • The mel scale is a unit conversion from hertz (frequency) that reflects how humans perceive pitch
  • The color depicts the amplitude; the “log” means the amplitudes are log-scaled (decibels), again closer to human perception (see the sketch below)
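
A minimal sketch of computing one with librosa (the file path is a placeholder; the window/hop/mel parameters are illustrative values similar to Whisper's, not guaranteed to match it exactly):

```python
import librosa
import numpy as np

# Load audio at 16 kHz ("audio.wav" is a placeholder path)
y, sr = librosa.load("audio.wav", sr=16000)

# Mel spectrogram: 25 ms windows, 10 ms hop, 80 mel bins
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)

# Log-scale the amplitudes (power -> decibels)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames): Y axis is mels, X axis is time
```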

Tokenization?

  • Why separate tokenization methods?
  • Tokenization breaks a sentence into words and words into sub-words, assigns an index to each sub-word, and maps that index to an embedding during the forward pass
  • The naive way of tokenizing (one token per word) involves a lot of rules and ends up creating a huge vocabulary. It is also tough to keep updated, since each new word requires a new token
  • Specialized sub-word methods can tokenize any word across different languages, since they mix character-level and word-level tokens
  • How are tokenization and embedding related? Some answer here
  • Whisper uses a byte-level BPE tokenizer (not a contribution of this paper); see the sketch below
  • Reference on popular tokenization methods
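
A quick way to poke at a byte-level BPE tokenizer is GPT-2's, via the Hugging Face transformers library (used here purely as an illustration of sub-word tokenization):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

# Words are split into sub-words; each sub-word gets an integer index
ids = tok.encode("Tokenization breaks sentences into sub-words")
print(ids)                             # a list of integer token ids
print(tok.convert_ids_to_tokens(ids))  # the sub-word pieces

# Byte-level BPE can encode any string, even unseen words or other scripts
print(tok.encode("schadenfreude 你好"))
```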

Other ideas

  • How to measure robustness?
    • Train on dataset D
    • Evaluate on a dev set X drawn from the same distribution as D
    • Evaluate on a completely out-of-distribution dataset Y
    • If the model is perfectly robust, the error rates on X and Y should be the same (toy WER-gap sketch after this list)
  • Models to try
    • CLD2 for (text) language detection
    • Maestro for X -> en speech translation
    • mSLAM-CTC for (spoken) language identification
  • Text normalizer
    • Rules/methods to convert text into a standard form. For example, "you're" and "you are" are normalized to the same string
    • This also makes the WER calculation fair: a contraction shouldn't count as an error
    • The rules are written manually and a fixed vocabulary of replacements is stored (toy sketch after this list)
  • Greedy search vs Beam search
    • Greedy search selects the single most probable token at each timestep; beam search instead searches for the most probable sequence of tokens
    • With a beam width of 2, the algorithm keeps the 2 best partial sequences at each timestep, extends each candidate, and re-ranks, eventually yielding the best overall sequence (toy sketch after this list)
  • CTC loss
    • In our case, the input is a continuous audio signal and the output is words at discrete timestamps, with no given alignment between them. CTC is a loss function for exactly this scenario
    • CTC aligns the input and output sequences when the input is continuous, the output is discrete, and there are no clear element boundaries mapping input frames to output tokens
    • Slightly complex, need to understand fully (PyTorch usage sketch after this list)
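
A toy version of the robustness check above, using the jiwer library for WER (the (reference, hypothesis) pairs are made up and stand in for model output on each dataset):

```python
import jiwer  # pip install jiwer

# X: dev set from the training distribution D; Y: out-of-distribution set
X = [("the cat sat on the mat", "the cat sat on the mat")]
Y = [("the cat sat on the mat", "the cat sat on a mat")]

wer_x = jiwer.wer([r for r, _ in X], [h for _, h in X])
wer_y = jiwer.wer([r for r, _ in Y], [h for _, h in Y])

# A perfectly robust model would show no gap between the two error rates
print(f"in-dist WER: {wer_x:.2f}, OOD WER: {wer_y:.2f}, gap: {wer_y - wer_x:.2f}")
```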
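
A toy sketch of rule-based text normalization and its effect on WER (these regex rules are illustrative assumptions, far simpler than Whisper's actual normalizer):

```python
import re
import jiwer

# Hand-written replacement rules (a tiny stand-in for a fixed vocabulary)
RULES = [
    (re.compile(r"\byou're\b"), "you are"),
    (re.compile(r"\bwon't\b"), "will not"),
    (re.compile(r"[.,!?]"), ""),  # drop punctuation
]

def normalize(text: str) -> str:
    text = text.lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return " ".join(text.split())

ref, hyp = "You're late!", "you are late"
print(jiwer.wer(ref, hyp))                        # counts spurious errors
print(jiwer.wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization
```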
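
A toy beam search over a fixed table of per-step token log-probabilities (the table is made up; a real decoder would condition each step's probabilities on the tokens chosen so far):

```python
import math

# log P(token | step): a made-up distribution for illustration
STEP_LOGPROBS = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"a": math.log(0.5), "b": math.log(0.5)},
    {"a": math.log(0.1), "b": math.log(0.9)},
]

def beam_search(width: int = 2):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for dist in STEP_LOGPROBS:
        # extend every kept sequence with every possible next token
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        # keep only the `width` best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams

for seq, score in beam_search(width=2):
    print(seq, f"{score:.3f}")
```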
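
A minimal usage sketch of CTC loss via PyTorch's built-in torch.nn.CTCLoss (the shapes and lengths are arbitrary toy values):

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 20   # input timesteps, batch size, vocab size (incl. blank)
S = 10                # target sequence length

ctc = nn.CTCLoss(blank=0)  # index 0 is the special CTC blank token

# Model outputs: log-probabilities over the vocab at every input timestep
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Targets: discrete token ids (1..C-1, since 0 is reserved for blank)
targets = torch.randint(1, C, (N, S))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC marginalizes over all ways of aligning T input frames to S tokens
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```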

References

  1. YouTube video
  2. Visual explanation of beam search