NMT Inference
This module provides the NMT-based translation service for various Indic language pairs. The NMT models are currently trained with version 1 of the OpenNMT-py framework. The model binaries are generated with the CTranslate2 converter provided for OpenNMT-py, and CTranslate2 is also used to generate model predictions.
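As a rough illustration of the prediction step, the sketch below wraps a CTranslate2 translator. It is not the code used in this module. It assumes CTranslate2 2.x or later, where each result exposes a `hypotheses` list. The helper names `load_translator` and `translate_sentences`, the model directory and the `tokenize`/`detokenize` callables are all hypothetical.

```python
def load_translator(model_dir, device="cpu"):
    """Load a converted CTranslate2 model from `model_dir` (hypothetical path)."""
    # Imported lazily so the helper below can be used without the package.
    import ctranslate2
    return ctranslate2.Translator(model_dir, device=device)


def translate_sentences(translator, sentences, tokenize, detokenize, beam_size=5):
    """Translate raw sentences with a CTranslate2-style translator.

    `tokenize` maps a sentence to a list of tokens and `detokenize` maps a
    list of tokens back to a sentence. Both are assumptions standing in for
    the subword model used during training.
    """
    batch = [tokenize(sentence) for sentence in sentences]
    results = translator.translate_batch(batch, beam_size=beam_size)
    # Each result holds the n-best hypotheses as token lists; keep the best one.
    return [detokenize(result.hypotheses[0]) for result in results]
```

In practice the tokens passed to the translator have to match the subword segmentation the model was trained with, so whitespace splitting would only be a placeholder.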
NMT requires a parallel corpus between languages. A corpus typically contains millions of examples, and it must have enough of them to cover a wide range of situations. Building it is one of the most important parts of the system and very time-consuming work, because the quality of the data has to be checked to ensure accurate translation. At Anuvaad, we have collected parallel corpora for 11 languages. The corpus is available under the MIT license.
Training and retraining are a continuous process, and training depends on the quality of the input dataset. The quality of translation has to be monitored constantly. Translation mistakes should be used to generate training examples, and retraining has to be taken up periodically. The training cycle is costly because it needs GPU infrastructure and long training hours.
The model output is evaluated on pre-selected sentences and a BLEU score is calculated. The BLEU score serves as guidance on model quality. Translation output also has to be evaluated by human translators before it can be used in a production environment.
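To make the BLEU calculation concrete, here is a minimal sketch of corpus-level BLEU written from the standard definition: clipped n-gram precision up to 4-grams combined with a brevity penalty. It assumes one reference per sentence, pre-tokenized input and no smoothing. It is an illustration only and not the evaluation script used for these models.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing.

    Both arguments are lists of token lists. Returns a score in [0, 1].
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Counter intersection clips each count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a single order with no matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on the red mat".split()
score = corpus_bleu([hypothesis], [reference])
print(f"BLEU = {score:.4f}")
```

Because there is no smoothing, short test sets with no matching 4-grams score zero, which is one reason the score is best computed over a whole set of pre-selected sentences rather than sentence by sentence.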
Anuvaad uses the current state-of-the-art Transformer model for target sentence prediction (translation). The supporting code and paper are in the open-source domain, LINK
We are leveraging an open-source project called “OpenNMT” and are also exploring “FairSeq” (IndicTrans) for enhancement and usage. The deep learning platform used is PyTorch.
Vocabulary or dictionary generation
Tokenizer (detokenizer): breaks a given sentence into words or sub-words. This step is language specific: Moses, or IndicNLP for Indian languages.
SentencePiece or subword-nmt. The supporting code and paper are in the open-source domain, LINK
BPE (Byte Pair Encoding)
Unigram
Tune model parameters and hyperparameters to improve accuracy.
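The BPE option listed above can be illustrated with a small sketch of the merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. This is a simplified toy version and not the SentencePiece or subword-nmt implementation. The function names, the `</w>` end-of-word marker and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=10)
print(apply_bpe("lowest", merges))
```

The number of merges controls the vocabulary size: few merges leave words split into many small pieces, while many merges turn frequent words into single tokens.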
OpenNMT-py based
https://github.com/project-anuvaad/nmt-training
Fairseq based
https://github.com/AI4Bharat/indicTrans
Python 3.6
Ubuntu 16.04
Install the Python libraries listed in the requirements.txt file.
Run app.py to start the service once all the packages are installed.
For more information about the API, see the contract documentation at https://github.com/project-anuvaad/aaib4-inference/tree/main/docs/contracts