Facebook AI Research (FAIR) open-sourced Cross-Lingual Speech Recognition (XLSR), a multilingual speech recognition AI model. XLSR is trained on 53 languages and outperforms existing systems when evaluated on common benchmarks. The model architecture and related experiments were described in a paper published on arXiv.

XLSR is built on the wav2vec architecture and uses transfer learning to improve performance on "low-resource" languages. The system is pre-trained on three public datasets containing 53 languages. When evaluated on the CommonVoice and BABEL benchmarks, the model outperforms existing baselines. The system can also learn languages not seen during pre-training, outperforming monolingual models specifically trained on those languages. The team's stated goal is to enable few-shot learning for languages that are actually low-resource, by leveraging unsupervised data from higher-resource languages.

Training a deep-learning model requires a large dataset of labeled examples; for speech recognition, this means audio data with corresponding text transcripts. Acquiring such a dataset can be challenging for non-European languages, often termed low-resource languages because of the lack of readily available data. In this situation, researchers turn to transfer learning: fine-tuning models that have been pre-trained on a large publicly available dataset. This strategy has been applied by Facebook and others for neural machine translation, using popular Transformer-based natural-language models such as BERT.

FAIR published the original wav2vec deep-learning model for automated speech recognition (ASR) in 2019 and the updated wav2vec 2.0 model in 2020. The model uses a convolutional neural-network (CNN) feature encoder to convert audio into latent speech representations, which are quantized and fed into a Transformer; the Transformer converts the sequence of speech representations into text.
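The encoder-plus-Transformer pipeline described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the overall shape of the model, not the published XLSR configuration: the layer sizes, kernel widths, and vocabulary size here are illustrative assumptions, and a real wav2vec 2.0 model adds quantization, masking, and contrastive pre-training that this sketch omits.

```python
# Sketch of a wav2vec-style ASR pipeline: a CNN feature encoder turns raw
# audio into latent frame representations, a Transformer contextualizes
# them, and a linear head emits per-frame character logits (for CTC
# decoding). All sizes are illustrative, not the real XLSR hyperparameters.
import torch
import torch.nn as nn

class TinySpeechRecognizer(nn.Module):
    def __init__(self, vocab_size=32, dim=64):
        super().__init__()
        # CNN feature encoder: downsamples the raw waveform into latent frames
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
        )
        # Transformer: contextualizes the sequence of latent frames
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Linear head: maps each contextual frame to character logits
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, waveform):                  # waveform: (batch, samples)
        z = self.encoder(waveform.unsqueeze(1))   # (batch, dim, frames)
        c = self.transformer(z.transpose(1, 2))   # (batch, frames, dim)
        return self.head(c)                       # (batch, frames, vocab)

# One second of 16 kHz "audio" per example, batch of two
logits = TinySpeechRecognizer()(torch.randn(2, 16000))
print(logits.shape)  # (batch, downsampled frames, vocab_size)
```

In the real model the latent representations are also quantized into discrete speech units during pre-training; fine-tuning for a target language then only needs a comparatively small labeled dataset, which is what makes the transfer-learning approach attractive for low-resource languages.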