Model Retraining
Briefly explains how to retrain a translation model to accommodate a domain-specific use case.
The production environment of Anuvaad runs on translation models trained on general-domain data, which covers a good range of scenarios. However, if a separate instance of Anuvaad is needed to translate domain-specific data (e.g. financial or biomedical), the existing model must be fine-tuned with more relevant data from that domain to improve translation accuracy. This page briefly summarises how this can be done.
Data Collection
A bilingual, or parallel, dataset is required for training the model. It consists of sentence pairs with the same meaning in both the source and target language. Example:
Source(en): India is my country
Target(hi): भारत मेरा देश है
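A common way to store such pairs (one convention among several; the exact format expected by a given training pipeline may differ) is two line-aligned plain-text files, one per language. A minimal sketch:

```python
# Sketch: store parallel data as two line-aligned files, where line i of
# the source file pairs with line i of the target file. File names are
# illustrative, not mandated by any specific pipeline.
pairs = [
    ("India is my country", "भारत मेरा देश है"),
]

with open("train.en", "w", encoding="utf-8") as src_f, \
     open("train.hi", "w", encoding="utf-8") as tgt_f:
    for en, hi in pairs:
        src_f.write(en + "\n")
        tgt_f.write(hi + "\n")

# Reading back preserves the pairing by line position
with open("train.en", encoding="utf-8") as src_f, \
     open("train.hi", encoding="utf-8") as tgt_f:
    loaded = list(zip((l.strip() for l in src_f),
                      (l.strip() for l in tgt_f)))
print(loaded[0])  # ('India is my country', 'भारत मेरा देश है')
```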
The more data available, the more accurately the model can be trained. In short, data collection can be done in one of the following three ways:
Manual Annotation by linguistic experts
This is exhaustive but the best approach. At least a small sample of the dataset must be manually curated and used for validation purposes.
Creation of Corpus by web and document crawling
Certain websites publish the same content in multiple languages. The idea is to find matching pairs of sentences across those versions. Scraping can be done using frameworks such as Selenium, and sentence matching can be done using techniques such as LaBSE. Used properly, this method can produce huge amounts of data without much manual effort; however, random manual verification is recommended to ensure data accuracy.
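The extraction step can be sketched as follows. Selenium would typically be used to fetch JavaScript-heavy pages; here the HTML is inlined and parsed with the standard library only, as a stand-in, to show the idea of pulling candidate sentences out of a page:

```python
# Sketch: extract candidate sentences (here, <p> contents) from crawled
# HTML. In a real crawler the HTML would come from Selenium or an HTTP
# client; this inlined page is purely illustrative.
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

html = "<html><body><p>India is my country</p><p>भारत मेरा देश है</p></body></html>"
parser = ParagraphExtractor()
parser.feed(html)
print(parser.paragraphs)  # ['India is my country', 'भारत मेरा देश है']
```

The extracted sentences from the two language versions of a page then become the input to the sentence-matching step.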
Several sample crawlers are available for reference in this repo.
For sentence matching of scraped sentences, the Anuvaad aligner, which is implemented using LaBSE, can also be used. The specs for the aligner are available here.
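The core of LaBSE-style alignment is: embed every source and target sentence into a shared multilingual vector space, then pair each source sentence with its most similar target above a cosine-similarity threshold. The sketch below uses toy vectors in place of real LaBSE embeddings (which in practice would come from a model such as `sentence-transformers`' LaBSE); the matching logic itself is what is illustrated:

```python
import numpy as np

def align(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source vector with its best-matching target vector.

    The 0.8 threshold is an illustrative default; real pipelines tune it.
    """
    # Normalise rows so a dot product equals cosine similarity
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = s @ t.T  # (n_src, n_tgt) similarity matrix
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs

# Toy 2-D vectors standing in for real LaBSE sentence embeddings
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.0, 1.0], [0.9, 0.1]])
print(align(src, tgt))  # source 0 pairs with target 1, source 1 with target 0
```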
Purchasing or using an open-sourced dataset
Very often, datasets are made available by research institutes or private vendors. This data can also be included to increase the quantity of training data.
Data cleaning & formatting
The raw data that is purchased or web-scraped may contain noise that degrades training accuracy and, thereby, translation quality. Noise includes unwanted characters, blank spaces, bullets and numbering, HTML tags, etc.
A basic script for sentence alignment, cleaning and formatting is available here.
However, additional cleaning rules can be applied based on the context and manual verification of the raw data.
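A minimal cleaning pass for the noise types listed above might look like this (illustrative only; the rules and their order should be adapted after manually inspecting the raw data):

```python
import re

def clean_sentence(text: str) -> str:
    """Remove common scraping noise from one sentence (illustrative rules)."""
    text = re.sub(r"<[^>]+>", " ", text)   # strip HTML tags
    # Drop leading bullets (•, -, *) or numbering like "1." / "1)"
    text = re.sub(r"^\s*(?:[\u2022\-\*]|\d+[.)])\s*", "", text)
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()

print(clean_sentence("  1. <b>India</b> is   my country "))
# → "India is my country"
```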
Model retraining
The current default model of Anuvaad is IndicTrans. The instructions to retrain and benchmark an IndicTrans model are explained here.
The training repo for the legacy OpenNMT-py models is available here.
Once a model is retrained, if there are plans to open-source it, hosting it in Dhruva will facilitate seamless integration with Anuvaad.