Getting Started with End-to-End Speech Translation

With PyTorch you can translate English speech in only a few steps

Introduction

Speech-to-text translation is the task of translating speech given in a source language into text written in a different, target language. It is a task with a history that dates back to a demo given in 1983. The classic approach to tackle this task consists in training a cascade of systems including automatic speech recognition (ASR) and machine translation (MT). You can see it in your Google Translate app, where your speech is first transcribed and then translated (although the translation appears to be real-time).

Both ASR and MT have been studied for a long time, and the quality of their systems has experienced significant leaps with the adoption of deep learning techniques. Indeed, the availability of big data (at least for some languages), large computing power and clear evaluation metrics made these two tasks perfect targets for big companies like Google, which invested a lot in research; see as a reference the papers about the Transformer [1] and SpecAugment [2]. As this blog post is not about cascaded systems, I refer the interested reader to the system that won the last IWSLT competition [3].

IWSLT is the main yearly workshop devoted to spoken language translation. Every edition hosts a "shared task", a kind of competition, with the goal of recording the progress of spoken language technologies. Since 2018, the shared task has included a separate evaluation for "end-to-end" systems, that is, systems consisting of a single model that learns to translate directly from audio to text in the target language, without intermediate steps. Our group has been participating in this new evaluation since its first edition, and I reported on our first participation in a previous story.

The quality of end-to-end models is still debated when compared to the cascaded approach, but it is a growing research topic and quality improvements are reported quite frequently. The goal of this tutorial is to lower the entry barrier to this field by providing the reader with a step-by-step guide to train an end-to-end system. In particular, we will focus on a system that can translate English speech into Italian, but it can be easily extended to seven additional languages: Dutch, French, German, Spanish, Portuguese, Romanian or Russian.

What you need

The minimum requirements are access to at least one GPU, which you can get for free with Colab, and PyTorch 0.4 installed.

However, the K80 GPUs are quite slow and will require several days of training. Access to better or more GPUs will be of great help.
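
Before starting, you can quickly check that PyTorch sees your GPU. This is just a generic sanity check, nothing specific to this tutorial:

import torch

# Check the installed PyTorch version and whether at least one GPU is visible.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))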

Getting data

We will use MuST-C, the largest multilingual corpus available for the direct speech translation task. You can find a detailed description in the paper that introduced it [4] or in the Medium story dedicated to it.

To get the corpus, go to https://mustc.fbk.eu/, click on the button "Click here to download the corpus", then fill in the form and you will soon be able to download it.

MuST-C is divided into 8 portions, one for each target language. Feel free to download one or all of them, but for this tutorial we will use the Italian target (it) as an example. Each portion contains TED talks given in English and translated into the target language (the translations are provided by the TED website). The size of the training set depends on the availability of translations for the given language, while the validation and test sets are extracted from a common pool of talks.

Each portion of MuST-C is divided into train, dev, tst-COMMON and tst-HE. Train, dev and tst-COMMON represent our split into training, validation and test set, while you can safely ignore tst-HE. In each of the three directories you will find three sub-directories: wav/, txt/ and h5/. wav/ contains the audio side of the set in the form of .wav files, one for each talk. txt/ contains the transcripts and translations; for our example with Italian you will find, under the train/txt directory, the files train.it, train.en and train.yaml. The first two are, respectively, the textual translation and transcript. train.yaml contains the audio segmentation, aligned with the textual files. As a bonus, the .en and .it files are parallel and, as such, they can be used to train MT systems. If you don't know what to do with the segmentation provided by the yaml file, don't be afraid! In the h5/ directory there is a single .h5 file that contains the audio already segmented and transformed into 40 Mel filterbank features.
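
If you are curious about these precomputed features, you can inspect the .h5 file with h5py. This is only a sketch, assuming that the file stores one 2-D matrix of filterbank features per segment; adapt the path and the key handling to what you actually find in your copy of the corpus.

import h5py

# Hypothetical path: adjust it to where you extracted the corpus.
h5_path = "en-it/data/train/h5/train.h5"

with h5py.File(h5_path, "r") as f:
    keys = list(f.keys())
    print(len(keys), "segments in the file")
    # Each entry is expected to be a (num_frames, 40) matrix of filterbank features.
    first = f[keys[0]][()]
    print("first segment shape:", first.shape)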

NOTE: The dataset will be downloaded from Google Drive. If you want to download it from a machine with no GUI, you can try to use the tool gdown. However, it does not always work correctly; if you are unable to download with gdown, please try again after a few hours.
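
For reference, gdown can also be used from Python. This is only a sketch: the file id and the output name below are placeholders that you have to replace with the ones from the Google Drive link you receive.

import gdown

# Placeholder id: replace it with the id of the Google Drive file you were given.
file_id = "REPLACE_WITH_YOUR_FILE_ID"
url = "https://drive.google.com/uc?id=" + file_id
# The output name is also just an example.
gdown.download(url, "MUSTC_v1.0_en-it.tar.gz", quiet=False)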

Getting the software

We will use FBK-Fairseq-ST, that is, the fairseq tool by Facebook for MT, adapted for the direct speech translation task. Clone the repository from GitHub:

git clone https://github.com/mattiadg/FBK-Fairseq-ST.git

Then, also clone mosesdecoder, which contains useful scripts for text preprocessing.

git clone https://github.com/moses-smt/mosesdecoder.git

Data preprocessing

The audio side of the data is already preprocessed in the .h5 file, so we only have to care about the textual side.

Let us first create a directory where we will put the tokenized data.

> mkdir mustc-tokenized
> cd mustc-tokenized

Then, we can proceed to tokenize our Italian texts (an analogous process is needed for the other target languages):

> for file in $MUSTC/en-it/data/{train,dev,tst-COMMON}/txt/*.it; do
    $mosesdecoder/scripts/tokenizer/tokenizer.perl -l it < $file |
    $mosesdecoder/scripts/tokenizer/deescape-special-chars.perl > $(basename $file)
  done
> mkdir tokenized
> for file in *.it; do
    cp $file tokenized/$file.char
    sh word_level2char_level.sh tokenized/$file
  done

The second for-loop splits the words into characters, as done in our paper that sets baselines for all the MuST-C languages [5].
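
To give an idea of what the character-level conversion produces, here is a minimal Python sketch. The word_level2char_level.sh script in the repository may use a different word-boundary symbol, so treat this only as an illustration.

def word_to_char_level(line, space_symbol="<space>"):
    # Split a tokenized sentence into space-separated characters,
    # marking the original word boundaries with a placeholder symbol.
    words = line.strip().split()
    chars = []
    for i, word in enumerate(words):
        chars.extend(list(word))
        if i < len(words) - 1:
            chars.append(space_symbol)
    return " ".join(chars)

print(word_to_char_level("Buongiorno a tutti"))
# B u o n g i o r n o <space> a <space> t u t t i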

Now, we have to binarize the data to bring audio and text into a single format for fairseq. First, link the h5 files into the data directory.

> cd tokenized
> for file in $MUSTC/en-it/data/{train,dev,tst-COMMON}/h5/*.h5; do
ln -s $file
done

Then, we can move to the actual binarization:

> python $FBK-Fairseq-ST/preprocess.py --trainpref train --validpref dev --testpref tst-COMMON -s h5 -t it --inputtype audio --format h5 --destdir bin

This will take a few minutes, and in the end you should get something like this:

> ls bin/
dict.it.txt train.h5-it.it.bin valid.h5-it.it.idx
test.h5-it.it.bin train.h5-it.it.idx valid.h5-it.h5.bin
test.h5-it.it.idx train.h5-it.h5.bin valid.h5-it.h5.idx
test.h5-it.h5.bin train.h5-it.h5.idx
test.h5-it.h5.idx valid.h5-it.it.bin

We have a dictionary for the target language (dict.it.txt), and for each split of the data, an index and a content file for the source side (*.h5.idx and *.h5.bin) and the same for the target side (*.it.idx and *.it.bin).

With this, we have finished the data preprocessing and can move on to the training!

Training your model

For training, we are going to replicate the setup reported in [5]. You just need to run the following command:

> mkdir models
> CUDA_VISIBLE_DEVICES=$GPUS python $FBK-Fairseq-ST/train.py bin/ \
    --clip-norm 20 \
    --max-sentences 8 \
    --max-tokens 12000 \
    --save-dir models/ \
    --max-epoch 50 \
    --lr 5e-3 \
    --dropout 0.1 \
    --lr-schedule inverse_sqrt \
    --warmup-updates 4000 --warmup-init-lr 3e-4 \
    --optimizer adam \
    --arch speechconvtransformer_big \
    --distance-penalty log \
    --task translation \
    --audio-input \
    --max-source-positions 1400 --max-target-positions 300 \
    --update-freq 16 \
    --skip-invalid-size-inputs-valid-test \
    --sentence-avg \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1

Let me explain it step by step. bin/ is the directory containing the binarized data, as above, while models/ is the directory where the checkpoints will be saved (one at the end of each epoch). --clip-norm refers to gradient clipping, and --dropout should be clear if you are familiar with deep learning. --max-tokens is the maximum number of audio frames that can be loaded into a single GPU at every iteration, and --max-sentences is the maximum batch size, which is also limited by max-tokens. --update-freq also affects the batch size, as here we are saying that the weights have to be updated only every 16 iterations; it basically emulates training with 16x GPUs. Now, the optimization policy: --optimizer adam selects the Adam optimizer, and --lr-schedule inverse_sqrt uses the schedule introduced by the Transformer paper [1]: the learning rate grows linearly for --warmup-updates steps (4000) from --warmup-init-lr (0.0003) to --lr (0.005), and then decreases proportionally to the inverse square root of the number of updates. The loss to optimize ( --criterion ) is cross entropy with label smoothing, using a --label-smoothing of 0.1. With --sentence-avg the loss is averaged over sentences rather than tokens. --arch defines the architecture to use and its hyperparameters; these can be changed when running the training, but speechconvtransformer_big uses the same hyperparameters as in our paper, except for the distance penalty, which is specified explicitly in our command.
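
To make the learning-rate policy concrete, here is a small Python sketch of an inverse square-root schedule with linear warmup, using the values from the command above. It mirrors the description in the text, not necessarily the exact fairseq implementation.

def inverse_sqrt_lr(step, lr=5e-3, warmup_init_lr=3e-4, warmup_updates=4000):
    # Linear warmup from warmup_init_lr to lr, then decay with the
    # inverse square root of the update number.
    if step < warmup_updates:
        return warmup_init_lr + (lr - warmup_init_lr) * step / warmup_updates
    return lr * (warmup_updates / step) ** 0.5

for step in (1, 2000, 4000, 16000, 64000):
    print(step, round(inverse_sqrt_lr(step), 6))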

The deep learning architecture is an adaptation of the Transformer to the speech translation task, which modifies the encoder to work with spectrograms as input. I will describe it in a future blog post.

During training, one checkpoint will be saved at the end of each epoch, named checkpoint1.pt, checkpoint2.pt, and so on. Additionally, two more checkpoints are updated at the end of every epoch: checkpoint_best.pt and checkpoint_last.pt. The former is a copy of the checkpoint with the best validation loss, the latter a copy of the last saved checkpoint.
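
If you are curious about what a checkpoint contains, you can peek at it with PyTorch. This is only a sketch, assuming the checkpoint is a regular torch-serialized dictionary, as fairseq-style checkpoints usually are.

import torch

# Load a saved checkpoint on CPU and look at its top-level keys
# (typically the model weights plus training arguments and optimizer state).
ckpt = torch.load("models/checkpoint_best.pt", map_location="cpu")
print(list(ckpt.keys()))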

Generation and evaluation

When you are ready to run a translation from audio (actually, preprocessed spectrograms), you can run the following command:

python $FBK-Fairseq-ST/generate.py tokenized/bin/ --path models/checkpoint_best.pt --audio-input \
    [--gen-subset valid] [--beam 5] [--batch 32] \
    [--skip-invalid-size-inputs-valid-test] [--max-source-positions N] [--max-target-positions N] > test.raw.txt

What is absolutely needed here is the directory with the binarized data bin/, the path to a checkpoint with --path models/checkpoint_best.pt (it can be any of the saved checkpoints), and --audio-input to inform the software that it has to expect audio (and not text) as input.

By design, this command will look for the "test" portion of the dataset within the given directory. If you want to translate another split, valid or train, you can do it with --gen-subset {valid,train}. The beam size and the batch size can be modified, respectively, with --beam and --batch. --skip-invalid-size-inputs-valid-test lets the software skip the segments that are longer than the limits set by --max-source-positions and --max-target-positions.

The output, which the command above redirects to test.raw.txt, contains the generated translations together with their scores.
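
If you want to collect just the hypotheses for evaluation, a small script like the following can help. It assumes the usual fairseq convention of printing each hypothesis on a line of the form H-<id><tab><score><tab><translation>; double-check the format against your actual test.raw.txt. Also remember that the hypotheses are still at the character level and need to be converted back to words before computing any score.

import re

# Collect the hypotheses from the generation log, indexed by segment id.
hypotheses = {}
with open("test.raw.txt", encoding="utf-8") as f:
    for line in f:
        match = re.match(r"H-(\d+)\t\S+\t(.*)", line)
        if match:
            hypotheses[int(match.group(1))] = match.group(2)

# Write the hypotheses in segment order, one per line.
with open("test.hyp.txt", "w", encoding="utf-8") as out:
    for idx in sorted(hypotheses):
        out.write(hypotheses[idx] + "\n")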
