
Whisper

  1. Contributions
    1. Weak supervision
    2. Multi-task
  2. New concepts to me
    1. Log mel spectrogram
    2. Tokenization?
    3. Other ideas
    4. References

Contributions

The paper makes two major contributions:

  1. Training on a weakly supervised dataset
  2. Training a multi-task model

Weak supervision

  • This just means the labels are not perfect: the data is noisy and can contain mistakes
  • The data is not gold standard: it was not labeled by humans, and some of it was generated by other pretrained ASR models
  • “Transcript-ese”: the telltale style of transcripts generated by other ASR models
    • The authors used heuristics to detect and remove sub-par data and ASR-generated transcripts

Multi-task

  • The model is trained on multiple tasks (a minimal usage sketch follows this list):
  • Transcribing English speech to English text
  • Transcribing speech in language X to text in language X
  • Translating speech in language X to English text
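
A minimal sketch of switching between these tasks, assuming the open-source openai-whisper package and a placeholder audio file french.mp3:

```python
import whisper  # the open-source openai-whisper package

model = whisper.load_model("base")

# Transcribe: speech in language X -> text in language X
result = model.transcribe("french.mp3", language="fr", task="transcribe")
print(result["text"])  # French text

# Translate: speech in language X -> English text
result = model.transcribe("french.mp3", language="fr", task="translate")
print(result["text"])  # English text
```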

New concepts to me

Log mel spectrogram

  • Like a color photo of speech
  • The X axis is time and the Y axis is frequency in mels
  • The mel scale is a unit conversion from hertz (frequency) that reflects how humans perceive pitch
  • The color depicts the amplitude; the “log” means the amplitudes are log-scaled (decibels), again closer to human perception (see the sketch below)
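
A minimal sketch of computing one with librosa (the file path is a placeholder; the window/hop/mel parameters are illustrative values similar to Whisper's, not guaranteed to match it exactly):

```python
import librosa
import numpy as np

# Load audio at 16 kHz ("audio.wav" is a placeholder path)
y, sr = librosa.load("audio.wav", sr=16000)

# Mel spectrogram: 25 ms windows, 10 ms hop, 80 mel bins
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)

# Log-scale the amplitudes (power -> decibels)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (n_mels, n_frames): Y axis is mels, X axis is time
```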

Tokenization?

  • Why separate tokenization methods?
  • Tokenization breaks a sentence into words and words into sub-words, assigns an index to each sub-word, and maps that index to an embedding during the forward pass
  • The naive way of tokenizing (one token per word) involves a lot of rules and ends up creating a huge vocabulary. It is also tough to keep updated, since each new word requires a new token
  • Specialized sub-word methods can tokenize any word across different languages, since they mix character-level and word-level tokens
  • How are tokenization and embedding related? Some answer here
  • Whisper uses a byte-level BPE tokenizer (not a contribution of this paper); see the sketch below
  • Reference on popular tokenization methods
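
A quick way to poke at a byte-level BPE tokenizer is GPT-2's, via the Hugging Face transformers library (used here purely as an illustration of sub-word tokenization):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

# Words are split into sub-words; each sub-word gets an integer index
ids = tok.encode("Tokenization breaks sentences into sub-words")
print(ids)                             # a list of integer token ids
print(tok.convert_ids_to_tokens(ids))  # the sub-word pieces

# Byte-level BPE can encode any string, even unseen words or other scripts
print(tok.encode("schadenfreude 你好"))
```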

Other ideas

  • How to measure robustness?
    • Train on dataset D
    • Evaluate on a dev set X drawn from the same distribution as D
    • Evaluate on a completely out-of-distribution dataset Y
    • If the model is perfectly robust, the error rates on X and Y should be the same (toy WER-gap sketch after this list)
  • Models to try
    • CLD2 for (text) language detection
    • Maestro for X -> en speech translation
    • mSLAM-CTC for (spoken) language identification
  • Text normalizer
    • Rules/methods to convert text into a standard form. For example, "you're" and "you are" are normalized to the same string
    • This also makes the WER calculation fair: a contraction shouldn't count as an error
    • The rules are written manually and a fixed vocabulary of replacements is stored (toy sketch after this list)
  • Greedy search vs Beam search
    • Greedy search selects the single most probable token at each timestep; beam search instead searches for the most probable sequence of tokens
    • With a beam width of 2, the algorithm keeps the 2 best partial sequences at each timestep, extends each candidate, and re-ranks, eventually yielding the best overall sequence (toy sketch after this list)
  • CTC loss
    • In our case, the input is a continuous audio signal and the output is words at discrete timestamps, with no given alignment between them. CTC is a loss function for exactly this scenario
    • CTC aligns the input and output sequences when the input is continuous, the output is discrete, and there are no clear element boundaries mapping input frames to output tokens
    • Slightly complex, need to understand fully (PyTorch usage sketch after this list)
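
A toy version of the robustness check above, using the jiwer library for WER (the (reference, hypothesis) pairs are made up and stand in for model output on each dataset):

```python
import jiwer  # pip install jiwer

# X: dev set from the training distribution D; Y: out-of-distribution set
X = [("the cat sat on the mat", "the cat sat on the mat")]
Y = [("the cat sat on the mat", "the cat sat on a mat")]

wer_x = jiwer.wer([r for r, _ in X], [h for _, h in X])
wer_y = jiwer.wer([r for r, _ in Y], [h for _, h in Y])

# A perfectly robust model would show no gap between the two error rates
print(f"in-dist WER: {wer_x:.2f}, OOD WER: {wer_y:.2f}, gap: {wer_y - wer_x:.2f}")
```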
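
A toy sketch of rule-based text normalization and its effect on WER (these regex rules are illustrative assumptions, far simpler than Whisper's actual normalizer):

```python
import re
import jiwer

# Hand-written replacement rules (a tiny stand-in for a fixed vocabulary)
RULES = [
    (re.compile(r"\byou're\b"), "you are"),
    (re.compile(r"\bwon't\b"), "will not"),
    (re.compile(r"[.,!?]"), ""),  # drop punctuation
]

def normalize(text: str) -> str:
    text = text.lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return " ".join(text.split())

ref, hyp = "You're late!", "you are late"
print(jiwer.wer(ref, hyp))                        # counts spurious errors
print(jiwer.wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization
```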
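
A toy beam search over a fixed table of per-step token log-probabilities (the table is made up; a real decoder would condition each step's probabilities on the tokens chosen so far):

```python
import math

# log P(token | step): a made-up distribution for illustration
STEP_LOGPROBS = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"a": math.log(0.5), "b": math.log(0.5)},
    {"a": math.log(0.1), "b": math.log(0.9)},
]

def beam_search(width: int = 2):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for dist in STEP_LOGPROBS:
        # extend every kept sequence with every possible next token
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        # keep only the `width` best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams

for seq, score in beam_search(width=2):
    print(seq, f"{score:.3f}")
```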
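
A minimal usage sketch of CTC loss via PyTorch's built-in torch.nn.CTCLoss (the shapes and lengths are arbitrary toy values):

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 20   # input timesteps, batch size, vocab size (incl. blank)
S = 10                # target sequence length

ctc = nn.CTCLoss(blank=0)  # index 0 is the special CTC blank token

# Model outputs: log-probabilities over the vocab at every input timestep
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Targets: discrete token ids (1..C-1, since 0 is reserved for blank)
targets = torch.randint(1, C, (N, S))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC marginalizes over all ways of aligning T input frames to S tokens
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```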

References

  1. YouTube video
  2. Visual explanation of beam search