As a special machine translation task, dialect translation has two main characteristics: 1) lack of parallel training corpus; and 2) possessing similar grammar between two sides of the translation. In this paper, we investigate how to exploit the commonality and diversity between dialects thus to build unsupervised translation models merely accessing to monolingual data. Specifically, we leverage pivot-private embedding, layer coordination, as well as parameter sharing to sufficiently model commonality and diversity among source and target, ranging from lexical, through syntactic, to semantic levels. In order to examine the effectiveness of the proposed models, we collect 20 million monolingual corpus for each of Mandarin and Cantonese, which are official language and the most widely used dialect in China. Experimental results reveal that our methods outperform rule-based simplified and traditional Chinese conversion and conventional unsupervised translation models over 12 BLEU scores.
Future human action forecasting from partial observations of activities is an important problem in many practical applications such as assistive robotics, video surveillance and security. We present a method to forecast actions for the unseen future of the video using a neural machine translation technique that uses encoder-decoder architecture. The input to this model is the observed RGB video, and the target is to generate the future symbolic action sequence. Unlike most methods that predict frame or clip level predictions for some unseen percentage of video, we predict the complete action sequence that is required to accomplish the activity. To cater for two types of uncertainty in the future predictions, we propose a novel loss function. We show a combination of optimal transport and future uncertainty losses help to boost results. We evaluate our model in three challenging video datasets (Charades, MPII cooking and Breakfast). We outperform other state-of-the art techniques for frame based action forecasting task by 5.06\% on average across several action forecasting setups.
Over 800 languages are spoken across West Africa. Despite the obvious diversity among people who speak these languages, one language significantly unifies them all - West African Pidgin English. There are at least 80 million speakers of West African Pidgin English. However, there is no known natural language processing (NLP) work on this language. In this work, we perform the first NLP work on the most popular variant of the language, providing three major contributions. First, the provision of a Pidgin corpus of over 56000 sentences, which is the largest we know of. Secondly, the training of the first ever cross-lingual embedding between Pidgin and English. This aligned embedding will be helpful in the performance of various downstream tasks between English and Pidgin. Thirdly, the training of an Unsupervised Neural Machine Translation model between Pidgin and English which achieves BLEU scores of 7.93 from Pidgin to English, and 5.18 from English to Pidgin. In all, this work greatly reduces the barrier of entry for future NLP works on West African Pidgin English.
Sequence-level knowledge distillation (SLKD) is a model compression technique that leverages large, accurate teacher models to train smaller, under-parameterized student models. Why does pre-processing MT data with SLKD help us train smaller models? We test the common hypothesis that SLKD addresses a capacity deficiency in students by "simplifying" noisy data points and find it unlikely in our case. Models trained on concatenations of original and "simplified" datasets generalize just as well as baseline SLKD. We then propose an alternative hypothesis under the lens of data augmentation and regularization. We try various augmentation strategies and observe that dropout regularization can become unnecessary. Our methods achieve BLEU gains of 0.7-1.2 on TED Talks.
Neural Machine Translation has lately gained a lot of "attention" with the advent of more and more sophisticated but drastically improved models. Attention mechanism has proved to be a boon in this direction by providing weights to the input words, making it easy for the decoder to identify words representing the present context. But by and by, as newer attention models with more complexity came into development, they involved large computation, making inference slow. In this paper, we have modelled the attention network using techniques resonating with social choice theory. Along with that, the attention mechanism, being a Markov Decision Process, has been represented by reinforcement learning techniques. Thus, we propose to use an election method ($k$-Borda), fine-tuned using Q-learning, as a replacement for attention networks. The inference time for this network is less than a standard Bahdanau translator, and the results of the translation are comparable. This not only experimentally verifies the claims stated above but also helped provide a faster inference.
Common intermediate language representation in neural machine translation can be used to extend bilingual to multilingual systems by incremental training. In this paper, we propose a new architecture based on introducing an interlingual loss as an additional training objective. By adding and forcing this interlingual loss, we are able to train multiple encoders and decoders for each language, sharing a common intermediate representation. Translation results on the low-resourced tasks (Turkish-English and Kazakh-English tasks, from the popular Workshop on Machine Translation benchmark) show the following BLEU improvements up to 2.8. However, results on a larger dataset (Russian-English and Kazakh-English, from the same baselines) show BLEU loses if the same amount. While our system is only providing improvements for the low-resourced tasks in terms of translation quality, our system is capable of quickly deploying new language pairs without retraining the rest of the system, which may be a game-changer in some situations (i.e. in a disaster crisis where international help is required towards a small region or to develop some translation system for a client). Precisely, what is most relevant from our architecture is that it is capable of: (1) reducing the number of production systems, with respect to the number of languages, from quadratic to linear (2) incrementally adding a new language in the system without retraining languages previously there and (3) allowing for translations from the new language to all the others present in the system
We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them.