
Model Retraining

Briefly explains how to retrain a translation model to accommodate a domain-specific use case.

The production environment of Anuvaad runs on top of translation models trained on general-domain data, which covers a wide range of scenarios. However, if a separate instance of Anuvaad is needed to translate domain-specific data (e.g., financial or biomedical), the existing model must be fine-tuned with relevant in-domain data to improve translation accuracy. This page briefly summarises how that can be done.

Data Collection

A bilingual, or parallel, dataset is required for training the model: pairs of sentences that carry the same meaning in the source and target languages. Example:

Source(en): India is my country

Target(hi): भारत मेरा देश है
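
As a point of reference, parallel corpora are commonly stored as two line-aligned plain-text files, one per language. A minimal sketch of reading such a corpus in Python (the file names are illustrative, not an Anuvaad convention):

```python
# Read a line-aligned parallel corpus: line N of train.en translates line N of train.hi.
with open("train.en", encoding="utf-8") as src_f, open("train.hi", encoding="utf-8") as tgt_f:
    pairs = [(s.strip(), t.strip()) for s, t in zip(src_f, tgt_f)]

print(pairs[0])  # e.g. ('India is my country', 'भारत मेरा देश है')
```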

The more data available, the more accurately the model can be trained. In short, data collection can be done through one of the following three approaches.

Manual Annotation by linguistic experts

This is the most labour-intensive but also the most reliable approach. At a minimum, a small sample of the dataset must be manually curated and used for validation purposes.

Creation of a corpus by web and document crawling

Certain websites publish the same content in multiple languages. The idea is to find matching pairs of sentences across them. Scraping could be done using frameworks such as Selenium, and sentence matching could be done using techniques such as LaBSE. If used properly, this method can produce huge amounts of data without much manual effort; however, random manual verification is recommended to ensure data accuracy.

A number of sample crawlers are available for reference in this repo.

To match scraped sentences into pairs, the Anuvaad Aligner, which is implemented using LaBSE, could also be used. The specs for the Aligner are available here.
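
As a minimal sketch of LaBSE-based matching, the open-source sentence-transformers package can score candidate pairs by embedding similarity (this illustrates the technique, not the Aligner's exact implementation; the threshold below is an assumption to be tuned on verified samples):

```python
# Score candidate English-Hindi sentence pairs with LaBSE embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

en_sentences = ["India is my country", "The weather is nice today"]
hi_sentences = ["भारत मेरा देश है", "कल बारिश होगी"]

en_emb = model.encode(en_sentences, convert_to_tensor=True)
hi_emb = model.encode(hi_sentences, convert_to_tensor=True)

# Cosine-similarity matrix: rows = English sentences, columns = Hindi sentences.
scores = util.cos_sim(en_emb, hi_emb)

THRESHOLD = 0.8  # assumed cut-off; tune against manually verified pairs
for i, en in enumerate(en_sentences):
    j = int(scores[i].argmax())
    if float(scores[i][j]) >= THRESHOLD:
        print(f"{en}\t{hi_sentences[j]}\t{float(scores[i][j]):.2f}")
```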

Purchasing or using an open-source dataset

Research institutes and private vendors often make datasets available. Such data can also be included to increase the quantity of training data.

Data cleaning & formatting

Raw data that is purchased or web-scraped may contain a lot of noise, which can hurt training accuracy and thereby translation quality. Noise includes unwanted characters, blank spaces, bullets and numbering, HTML tags, etc.

Additional cleaning rules can be applied depending on the scenario, based on context and manual inspection of the raw data.
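
A hedged sketch of such rule-based cleaning (the rules and length-ratio bounds below are assumptions; extend them after manually inspecting your raw data):

```python
import re

def clean(line: str) -> str:
    line = re.sub(r"<[^>]+>", " ", line)                        # strip HTML tags
    line = re.sub(r"^\s*[\u2022\-\*\d]+[\.\)]?\s+", "", line)   # strip leading bullets/numbering
    line = re.sub(r"\s+", " ", line)                            # collapse runs of whitespace
    return line.strip()

def keep(src: str, tgt: str) -> bool:
    """Drop empty pairs and pairs with implausible length ratios."""
    if not src or not tgt:
        return False
    ratio = len(src) / max(len(tgt), 1)
    return 0.5 <= ratio <= 2.0  # assumed bounds

print(clean("<li>1. India is my country</li>"))  # -> India is my country
```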

Model retraining

The basic script for sentence alignment, cleaning and formatting is available here.

The present default model of Anuvaad is IndicTrans. The instructions to retrain and benchmark an IndicTrans model are explained here.

The training repo of the legacy OpenNMT-py models is available here.
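
Anuvaad's actual retraining follows the IndicTrans and OpenNMT-py instructions linked above. Purely to illustrate the general fine-tuning flow, here is a sketch using the HuggingFace transformers library on a stand-in English-Hindi model (the model name, file names, and hyperparameters are all assumptions, not Anuvaad's pipeline):

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Helsinki-NLP/opus-mt-en-hi"  # stand-in base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Expects a TSV of cleaned sentence pairs with "src" and "tgt" columns.
data = load_dataset("csv", data_files={"train": "train.tsv"}, delimiter="\t")

def preprocess(batch):
    # Tokenize source text and target text (labels) in one call.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=256)

tokenized = data["train"].map(preprocess, batched=True,
                              remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="finetuned-en-hi",
    per_device_train_batch_size=16,
    learning_rate=3e-5,   # assumed; tune on a held-out set
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

A small manually curated sample (as noted under Data Collection) should be held out to validate the fine-tuned model before deployment.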

Once a model is retrained, if there are plans to open-source it, hosting it on Dhruva will facilitate seamless integration with Anuvaad.
