Anuvaad offers a rich set of features that provide an optimal experience for the end user and smooth the process of document translation. The notable features are highlighted below:
Document digitization is the process of converting physical documents into digital formats, making them easily accessible and editable.
Anuvaad is coupled with custom-trained layout detection models for identifying and comprehending a document's structure, which involves the recognition of key elements including headings, paragraphs, tables, and images. This process is essential not only for enhancing OCR accuracy but also for preserving the document's layout and structure in the translated version.
Document translation involves converting text from one language to another, facilitating cross-lingual communication and information access. Anuvaad supports NMT models served directly from Bhashini's Dhruva platform as well as built-in, plug-and-play models for domain-specific use cases.
This feature ensures that the original formatting, layout, and structure of documents are maintained during the translation process, preserving the document's visual integrity.
Speech to text technology converts spoken language into written text, enabling audio content to be transcribed for translation or other purposes.
Translation memory stores and retrieves previously translated segments to ensure consistency across documents and reduce translation time.
Glossary support provides access to defined terminology and specialised vocabulary, ensuring consistency and precision in translations, particularly in specialised fields.
Usage analytics and metrics offer insights into how the platform is utilised, helping users track and optimise translation processes and workflows.
File format conversion simplifies the process of converting documents from one file format to another while preserving their content and structure, enhancing compatibility.
Transliteration support enables the conversion of text from one script or alphabet to another, aiding users in dealing with different writing systems and ensuring the correct pronunciation of words, especially in multilingual contexts.
Overview
Anuvaad was bootstrapped by EkStep Foundation in late 2019 as a solution to enable easier translation of legal documents from English to Indic languages and vice versa. The platform allows legal entities to digitize and translate orders/judgements using an easy-to-use interface.
Anuvaad leverages state-of-the-art AI/ML models, including NMT, OCR, and layout detection, to provide a high level of accuracy. Project Anuvaad was envisioned as an end-to-end, open-source solution for document translation across multiple domains.
Project Anuvaad is REST API driven, and hence any third-party system can use various features like sentence translation, layout detection, etc.
NOTE: The documentation is still a work in progress. Feel free to contribute to it or raise issues if the desired information is not up to date. Explore the KT videos if you would like to dive deep into each module.
Project Anuvaad is an open-source project funded by EkStep Foundation.
Anuvaad is an AI-based, open-source document translation platform to digitize and translate documents in Indic languages at scale. Anuvaad provides easy-to-edit capabilities on top of the plug-and-play NMT models. Separate instances of Anuvaad are deployed for NCERT, SUVAS, and Amar Vasha.
Follow these steps to set up the Anuvaad Web Application on your local machine:
Clone the Repository:
Navigate to the Project Directory:
Install Dependencies:
or
Environment Variables: Create a .env file in the root directory of the project and configure the necessary environment variables. You can use the .env.example file as a reference.
Start the Development Server:
or
Build the Application:
or
Run Tests:
or
General Guidelines:
Clone the repo and go to the module specific directory.
Run pip3 install -r requirements.txt.
Make necessary changes to the config files with respect to MongoDB and Kafka.
Run python3 src/app.py.
Alternatively, modules could be run by building and running Docker images. Make sure configs and ports are configured as per your local setup.
Build Docker Image:
Run Docker Container:
Various video tutorials demonstrating features and step-by-step instructions to utilize the best out of Anuvaad!
Access the Application: Once the development server is started, you can access the application by navigating to in your web browser.
Note: Apart from this, the Docker images running in the user's environment could be found .
Workflow Manager(WM)
Centralized Orchestrator based on user request.
Auditor
Python package/library used for log formatting and exception handling.
File Uploader
Microservice to upload and maintain user documents.
File Converter
Microservice to convert files from one format to another, e.g. .doc to .pdf.
Aligner
Microservice that accepts source and target sentences and aligns them to form a parallel corpus.
Tokenizer
Microservice that tokenises paragraphs into independently translatable sentences.
Layout Detector
Microservice interface for Layout detection model.
Block Segmenter
Handles layout detection misclassifications and region unification.
Word Detector
Word detection.
Block Merger
An OCR system that extracts text, images, tables, blocks, etc. from the input file and makes them available in a format which can be utilised by downstream services to perform translation. This can also be used as an independent product that performs OCR on files, images, ppts, etc.
Translator
Translator pushes sentences to NMT module, which internally invokes IndicTrans model hosted in Dhruva to translate and push back sentences during the document translation flow.
Content Handler
Repository Microservice which maintains and manages all the translated documents
Translation Memory X(TMX)
System translation memory to facilitate overriding NMT translation with a user-preferred translation. TMX provides three levels of caching: Global, User, Organisation.
User Translation Memory(UTM)
The system tracks and remembers individual users' translations or corrected translations and applies them automatically when the same sentences are encountered again.
Internal modules are integrated through Kafka messaging.
Primary data storage.
Secondary in memory storage.
Cloud Storage
Samba storage is used to store user input files.
Serves as a redirection server and also takes care of system-level configs. Nginx acts as the gateway.
API gateway to apply filters on client requests and to authenticate, authorize, and throttle them.
Layout detection model.
Used for Line detection.
Custom trained Tesseract used for OCR.
Custom trained model used for translation.
open-source platform for serving language AI models at scale.
The project Anuvaad repository serves as the primary codebase for the Anuvaad project, aimed at facilitating document processing and translation tasks efficiently.
anuvaad-api: Houses standalone APIs utilized within the project, such as login and analytics functionalities.
anuvaad-fe: Contains frontend-related code, responsible for the user interface and interaction aspects of the application.
chrome-extension: Hosts code relevant to the Anuvaad Chrome extension, offering additional features and integrations within the Chrome browser environment.
anuvaad-nmt-inference [legacy]: Previously held legacy OpenNMT Python-based inference code. Deprecated and not actively utilized within the current project framework.
anuvaad-etl: Comprises sub-modules dedicated to document processing tasks, enhancing the extraction, transformation, and loading capabilities within the Anuvaad ecosystem.
As an application, the Workflow Manager, in conjunction with independent APIs, forms the foundational architecture of Anuvaad. The Workflow Manager facilitates communication among various modules and orchestrates their interactions. However, Anuvaad's design accommodates diverse use cases, allowing each module to operate autonomously when necessary. For instance, the Tokenizer service can function independently to tokenize an Indic sentence without reliance on other modules.
Each microservice within Anuvaad adheres to a consistent structure, comprising the following common elements:
Dockerfile: Provides instructions to build the individual microservice within a Docker container, ensuring portability and consistency across different environments.
docs Folder: Contains documentation outlining the API contracts necessary for running and testing the module independently. This documentation serves as a reference for developers and users alike.
config Folder: Stores module-specific configurations and secrets required for the proper functioning of the microservice. Centralizing configuration management simplifies deployment and maintenance tasks.
kafkawrapper: Defines Kafka/WFM (Workflow Manager) related communication protocols, facilitating seamless integration and communication between modules. In the production environment, the Workflow Manager plays a crucial role in establishing communication channels, rendering standalone APIs redundant.
Anuvaad follows the standard feature-master type of branching strategy for code maintenance. The releases happen through the master branch via release tags.
Feature branches are a set of branches owned by individual developers in order to work on specific tasks. These branches are forked out of the master branch and they eventually feed into the same master branch once the code for that particular use case is developed and tested. These branches can either be deleted right after merging to master or can be retained to be reused for other use cases.
Feature branches can ONLY be deployed in the ‘Dev’ environment. The ‘Dev’ environment is a dedicated VM for the developers to test their code. Once the code is dev-tested, it must be merged to the ‘develop’ branch which further feeds into the ‘master’ branch.
The ‘Develop’ branch is a mirror branch to the master branch. This branch is dedicated for QA/UAT testing. All feature branches must feed into this branch before the use-case is sent for QA testing and at times UAT if needed. This branch will also act as a backup in case there’s something wrong with the master branch.
The develop branch can ONLY be deployed to the ‘QA’ environment. This is a dedicated environment for the QAs to perform unit, regression, and smoke testing of the features and the app as a whole. This environment can also be used for UAT purposes. Once there’s a QA signoff on the features, this will directly feed into master.
The master branch is the main branch from which all releases happen. All features, once dev-tested and QA-tested, will feed into master via the develop branch. The master branch is from where the code is deployed to production. Every release to production from the master branch will be tagged with the specific version of that release.
In case of production issues, we can fallback to any of the previous stable releases.
Hotfix branches are temporary branches which are forked directly from the ‘master’ branch and will feed back into the master only. These are for special cases when there’s a production bug to be resolved, and the develop branch is at the (n+m)th commit and master at (n)th commit.
These branches will act as temporary mirror branches for the master branch and can be tested on the QA env. Once tested and merged back to master, these branches have to be deleted. After the merge, the develop branch will have to be rebased with master, and the features will have to be rebased with the develop branch. The commits will flow upstream only after a rebase is successfully completed on all the forks.
Apart from the feature branches, individual devs will also own these branches.
Feature Branches: Code check-in to feature branches can be done by anyone; there’s no need for a review as such. These branches are mainly for the devs to test their code. The use case developed in this branch will have to be dev-tested on the ‘Dev’ environment before a merge request to the ‘develop’ branch is raised.
Develop Branch: Code check-in to the develop branch should only happen after a Peer Review. Merge to develop will only happen once the code is dev-tested on the Dev environment. It should be noted that a merge to develop should ensure that the code quality is up to the mark, all standards are followed, and it doesn’t break anything that is already merged to the develop branch by other devs. QA testing must happen on this branch deployed in a dedicated environment for QA/UAT. Any bugs reported will be fixed in the feature branch, reviewed, and then merged back to the develop branch. QA signoff happens on this branch.
Master Branch: Code check-in to the master branch will only happen from the develop branch and NO feature branches. Any merge to the master branch apart from the hotfix branch MUST come from the develop branch only. Merge from develop to master should happen only after an extensive code review from the leads. Only a select few members of the team will have access to merge the code to the master branch. The onus of the master branch is on the Technology leads of the team. Once the code is merged to master, a final round of regression testing must take place before the code is tagged for release.
Hotfix Branch: Code check-in to the hotfix branch can be done by individual devs once it is reviewed by a peer and the leads. This branch feeds into the master only after a second round of review. QA must happen on the hotfix branch before it is merged to master. The merge to master must also be released only after regression testing is done on the fix.
Configs are parameters of a module that can be injected into and out of the system with zero to minimal code change in order to enable/disable/modify certain features of the system. The configs pertaining to the modules of the Anuvaad data-flow pipeline can be broadly classified into two categories:
Configs outside the build (docker image)
Configs within the build. (docker image)
These are the configs that are injected into the system on the fly; the changes can be incorporated at runtime, or with just a restart, without having to rebuild or push a logical piece of code. For instance, WFM reads configs for identifying the different workflows configured. In order to add/edit/delete a workflow, one needs to make the required changes to the config file and push the file; the changes will be incorporated into the system on restart or through a reload API at runtime. (https://raw.githubusercontent.com/project-anuvaad/anuvaad/master/anuvaad-etl/anuvaad-workflow-mgr/config/etl-wf-manager-config-dev.yml) These files are saved in the ‘configs’ folder outside the source code of the system.
These configs travel with the build, meaning they are part of the Docker image. They can be controlled via an environment file during deployment or internally within the code. This also means that any change in these parameters needs a rebuild and redeployment of the system; however, no change in the logic or the code should be needed to incorporate the change. Most of the hooks exposed by a given system fall under this category. These configs are kept in the ‘configs’ folder inside the source code. It is recommended to use just one ***config.py file inside the folder for all these configs, for better maintainability. If someone prefers to separate config files based on concern, they can do so, but they bear the overhead of maintaining them. For convenience and readability, these configs are further divided into:
Cross module common configs
Module specific configs.
These configs are used across all modules. Configs like the Kafka host, Mongo host, file-upload URL, etc. fall under this category, as they do not change from module to module. However, if a module chooses to use different values for these parameters, it can do so by using a different variable. The point of having this category is to avoid creating redundant variables in the environment file and to reuse the variables that are already defined.
For convenience, the module-specific variables are further categorised as:
Kafka Configs: Configs required for kafka like topics, consumer groups, partition keys etc. that are very specific to the module. Some other parameters required to customise your consumer and producer as per your requirement can be mentioned under this category.
Datastore Configs: Configs required for the datastore that is being used which is mostly Mongo in our use case. In case you’re using MySQL, Redis, Elasticsearch etc, mention the required parameters in this category.
Module Configs: All other configs required for your module can be mentioned here.
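As a minimal sketch of this layout (the variable names below are illustrative, not the actual Anuvaad config keys), a module's ***config.py might look like this:

```python
# configs/module_config.py -- illustrative sketch only; real Anuvaad modules use their own key names.
import os

# Cross-module common configs (shared env variables, reused rather than redefined per module)
kafka_bootstrap_server = os.environ.get('KAFKA_BOOTSTRAP_SERVER_HOST', 'localhost:9092')
mongo_server_host = os.environ.get('MONGO_SERVER_HOST', 'mongodb://localhost:27017')

# Kafka configs (module-specific topics, consumer groups, partition keys)
consumer_topic = os.environ.get('MY_MODULE_INPUT_TOPIC', 'my-module-input-v1')
producer_topic = os.environ.get('MY_MODULE_OUTPUT_TOPIC', 'my-module-output-v1')
consumer_group = os.environ.get('MY_MODULE_CONSUMER_GROUP', 'my-module-consumer-group')

# Datastore configs (Mongo in most Anuvaad modules)
mongo_db_name = os.environ.get('MY_MODULE_MONGO_DB', 'my-module-db')
mongo_collection = os.environ.get('MY_MODULE_MONGO_COLLECTION', 'my-module-data')

# Module configs (everything else the module needs)
batch_size = int(os.environ.get('MY_MODULE_BATCH_SIZE', 25))
```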
Note: It is recommended to have most of these parameters deriving values from the environment file only. In some cases, they can also be hard-coded within the code. It is mandatory for every file/class within the project to use these parameters from these variables of the ***config.py file only.
Please ensure the folder structure is perfectly maintained.
Never check in sensitive data like AWS keys, passwords, PII, etc. in the config file; always erase/mask/encrypt them before pushing to GitHub.
Eg:
These configs are specific to the module and will change for each module. This category includes both the common variables inside the code which are used at multiple places in your project (Eg: https://github.com/project-anuvaad/anuvaad/blob/69b494224626d51a7baf0405603106a4a66a25c7/anuvaad-etl/anuvaad-extractor/aligner/etl-aligner/configs/alignerconfig.py#L10) and the variables deriving their value from the environment file (Eg: )
You can check this file for reference:
Summary of the purpose of each module and necessary links
1. user management: Manages the user- and admin-side functionalities in Anuvaad.
2. file handler: The user uploads a file, which is stored in the Samba share for further APIs to access.
3. file converter: Consumes the input files and converts them into PDF. Best results are obtained only for the file formats supported by LibreOffice.
4. file translator: Transforms the data in the file into a JSON file and downloads the translated files of type docx, pptx, and html.
5. content handler: Handles and retrieves the contents (final result) of files translated in the Anuvaad system.
6. document converter: Generates the final document after translation and digitization. Currently supports pdf, txt, and xlsx document generation.
7. tokenizer: Tokenises the input paragraphs into independently translatable sentences which can be consumed by downstream services to translate the entire input.
8. ocr tokenizer: Tokenises the input paragraphs received in the digitization flow into independently translatable sentences which can be consumed by downstream services to translate the entire input.
9. ocr content handler: Handles and manipulates the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system.
10. Aligner: "Aligns", i.e. finds similar sentence pairs from two lists of sentences.
11. workflow manager: Centralized orchestrator which directs the user input through the dataflow pipeline to achieve the desired output.
12. Block merger: Extracts text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.
13. translator: A wrapper over NMT, used to send the document sentence by sentence to NMT for translation.
14. word detector: Takes a PDF or image as input; if the input is a PDF, it is first converted into images. Uses a custom Prima line model for line detection in the image.
15. layout detector: Takes the output of the word detector as input. Uses a Prima layout model for layout detection in the image.
16. block segmenter: Takes the output of the layout detector as input. Collates lines and words at the layout level.
17. google vision ocr: Takes the output of the block segmenter as input. Uses Google Vision as the OCR engine. Collates text at word, line, and paragraph level.
18. tesseract ocr: Takes the output of the block segmenter as input. Uses the Anuvaad OCR model as the OCR engine. Collates text at word, line, and paragraph level.
19. NMT: Gets the translated content either by invoking the model directly or by fetching translated content from the Dhruva platform.
20. metrics: Displays analytics.
Key API contract:
Workflow Manager is the orchestrator for the entire dataflow pipeline.
This document provides details about the Workflow Manager. WFM is the orchestrating module for the Anuvaad pipeline.
WFM is the backbone service of the Anuvaad system. It is a centralized orchestrator which directs the user input through the dataflow pipeline to achieve the desired output, and it maintains a record of all the jobs and all the tasks involved in each job. WFM is the SPOC (single point of contact) for clients to retrieve details, status, error reports, etc. about the jobs executed (sync/async) in the system. Using WFM, we have been able to use Anuvaad not just as a translation platform but also as an OCR, tokenization, and sentence-alignment platform for dataset curation. Every use case in Anuvaad is defined as a ‘Workflow’ in the WFM; these workflow definitions are in the form of a YAML file, which WFM reads as an external configuration file.
WFM Config: This is a YAML file which has a well defined structure to create workflows in the Anuvaad system. Every use-case in Anuvaad is called ‘Workflow’.
Workflow - Set of steps to be executed on a given input to obtain the desired output. Anuvaad has 2 types of workflows: Async WF and Sync WF.
Async WF - These are asynchronous workflows, wherein the modules involved in this flow communicate with each other and the WFM via the kafka queue asynchronously.
Sync WF - These are synchronous workflows wherein the modules involved communicate with each other and the WFM via REST APIs. The client receives responses in real time.
Structure of the config is as follows:
workflowCode: An alphanumeric code that UNIQUELY identifies a workflow. Format: WF_<A/S>_<codes_of_modules_in_sequence>
type: Type of the workflow - ASYNC or SYNC
description: Description of the workflow to explain what the workflow does
useCase: An alphanumeric prefix to the job ID signifying a reference to the workflowCode.
sequence: The set of steps to be defined under the workflow. This is a list of ‘steps’ where each ‘step’ contains keys order, tool & endState.
The ‘tool’ key is the definition of the tool used in the corresponding ‘step’ in the ‘sequence’. Each tool contains keys name, description, kafka-input, topic, partitions, kafka-output. In case of Sync WFs, the tool contains keys name, description, api-details, uri.
Order: Number that defines the order of this step in the sequence. 0 is the value for the first step, 1 being next and so on.
name: Name of the tool
description: Description of the tool
kafka-input: Details of the kafka input for that particular tool. The tool must accept input on this topic from the WFM.
kafka-output: Details of the kafka output for that particular tool. The tool must produce output on this topic to the WFM.
api-details: Details of the API exposed by the tool for WFM to access.
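To make the structure concrete, here is a hypothetical single-step async workflow entry expressed in YAML and loaded in Python. The keys mirror the description above, but the module name, topics, and partition counts are assumptions, not an actual Anuvaad workflow:

```python
# Illustrative only: a hypothetical async workflow entry mirroring the documented keys.
import yaml

WORKFLOW_YAML = """
workflowCodes:
  - workflowCode: WF_A_TK          # hypothetical single-step async workflow
    type: ASYNC
    description: Tokenise an input document
    useCase: A_TK
    sequence:
      - order: 0
        tool:
          - name: TOKENISER
            description: Tokenises paragraphs into sentences
            kafka-input:
              - topic: anuvaad-tokeniser-input-v1       # assumed topic name
                partitions: 1
            kafka-output:
              - topic: anuvaad-tokeniser-output-v1      # assumed topic name
        endState: TOKENISED
"""

config = yaml.safe_load(WORKFLOW_YAML)
for wf in config["workflowCodes"]:
    assert wf["type"] in ("ASYNC", "SYNC")
    for step in wf["sequence"]:
        print(wf["workflowCode"], "step", step["order"], "->", step["tool"][0]["name"])
```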
WFM has 2 types of IDs involved in its jobs that help uniquely identify a job and its intermediate tasks: jobID & taskID.
jobID: An alphanumeric ID that uniquely identifies a job in the system. jobIDs are generated for both Sync and Async jobs. Format: <use_case>-<random_string>-<13-digit epoch time>
taskID: A job contains multiple intermediate tasks; taskID is a unique ID used to identify each of those tasks. A combination of these taskIDs mapped to a given jobID can help trace an entire job through the system. Format: <module_code>-<13-digit epoch time>
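A small illustration of how IDs in these formats could be generated (the helper names are made up for the example):

```python
# Illustrative helpers that follow the documented ID formats.
import time
import uuid

def make_job_id(use_case: str) -> str:
    """<use_case>-<random_string>-<13-digit epoch time>"""
    return f"{use_case}-{uuid.uuid4().hex[:10]}-{int(time.time() * 1000)}"

def make_task_id(module_code: str) -> str:
    """<module_code>-<13-digit epoch time>"""
    return f"{module_code}-{int(time.time() * 1000)}"

print(make_job_id("A_FCBMTKTR"))   # e.g. A_FCBMTKTR-3f9c1a2b4d-1716543210123
print(make_task_id("TOK"))         # e.g. TOK-1716543210123
```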
API Details
WFM exposes multiple APIs for the client to execute and fetch jobs in the Anuvaad system. The APIs are as follows:
/async/initiate: API to execute Async workflows.
/sync/initiate: API to execute Sync workflows.
/configs/search: API to search WFM configs.
/jobs/search: API to search initiated jobs.
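As a hedged sketch, a client could call the async initiation API roughly as follows; the base URL, header name, gateway path prefix, and body fields are assumptions, so refer to the API details linked below for the exact contract:

```python
# Sketch only: headers, path prefix, and payload fields are assumptions; check the WFM API contract.
import requests

BASE_URL = "https://your-anuvaad-host"        # placeholder deployment URL
AUTH_TOKEN = "<jwt-token-from-UMS-login>"     # placeholder

payload = {
    "workflowCode": "WF_A_FCBMTKTR",          # example workflow code, explained below
    "jobName": "sample-document.pdf",
    "files": [
        {"path": "<uploaded-file-id>", "type": "pdf", "locale": "en"}
    ],
}

resp = requests.post(
    f"{BASE_URL}/async/initiate",             # full gateway path prefix is deployment-specific
    json=payload,
    headers={"auth-token": AUTH_TOKEN},
    timeout=30,
)
print(resp.json())  # expected to contain the jobID, which can then be polled via /jobs/search
```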
python 3.7
ubuntu 16.04
Dependencies:
Run:
An example workflowCode: WF_A_FCBMTKTR, where WF = Workflow, A = Async, FC = File Converter, BM = Block Merger, TK = Tokeniser, TR = Translator. Configs can be found here:
Details of the APIs can be found here:
Details of the requests flowing in and out through kafka can be found here:
Workflows have to be configured in a .yaml file as shown in the following document:
A Python package that provides standardized logging and error handling for the Anuvaad dataflow pipeline. This package serves features like session tracing, job tracing, error debugging, and troubleshooting.
Prerequisites:
Python 3.7
Command:
This part of the library provides features for logging by exposing the following functions:
Import file:
Logs INFO level information.
Logs DEBUG level information.
Logs ERROR level information. Should be used for logical errors like “File is not valid”, “File format not accepted” etc.
Logs EXCEPTION level information. Should be used in case of exceptions like “TypeError”, “KeyError” etc.
In all the functions, message and input-object are mandatory.
These functions build an object using these parameters and index them to Elasticsearch for easy tracing.
Ensure all major functions have a log_info call, all exceptions have log_exception calls, and all logical errors have log_error calls.
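For illustration, a minimal usage sketch is shown below; the import path and exact argument lists are assumptions based on the package name, and only the message and the input object are documented as mandatory:

```python
# Sketch only: verify the import path and signatures against the anuvaad-auditor source.
from anuvaad_auditor.loghandler import log_info, log_error, log_exception

def tokenise_file(task_input):
    log_info("Tokenisation started", task_input)
    try:
        if not task_input.get("files"):
            # logical error -> log_error
            log_error("File is not valid", task_input)
            return None
        # ... tokenisation work ...
        log_info("Tokenisation completed", task_input)
        return task_input
    except KeyError:
        # runtime exception -> log_exception
        log_exception("KeyError while reading task input", task_input)
        raise
```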
This part of the library provides features for standardizing and indexing the error objects of the pipeline.
Import file:
Returns a standard error object for replying back to the client during a SYNC call and indexes the error to an error index.
Constructs a standard error object which will be indexed to a different error index and PUSHES THE ERROR TO WFM internally.
Usage Notes:
Use post_error_wf for flows triggered via Kafka or REST through WFM.
Ensure both log functions and error functions are used in case of exceptions or errors.
Errors are indexed to two different indexes: Error index and Audit Index.
Use post_error_wf carefully, as this method will take the entire job to the FAILED state.
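A corresponding sketch for the error functions (the import path and argument lists are again assumptions; only the behaviour described above is taken from the docs):

```python
# Sketch only: import path and argument lists are assumptions.
from anuvaad_auditor.errorhandler import post_error, post_error_wf

def handle_sync_failure(job_input):
    # SYNC flow: build a standard error object to return to the client;
    # the error is also indexed to the error index.
    return post_error("INVALID_FILE", "File format not accepted", job_input)

def handle_async_failure(job_input):
    # Kafka/WFM flow: pushes the error to WFM and moves the whole job to FAILED,
    # so use it only for unrecoverable errors.
    post_error_wf("TRANSLATION_FAILED", "NMT did not return a translation", job_input, None)
```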
anuvaad-auditor==0.1.1: please use this version.
Source code:
This microservice is served with multiple APIs to handle and manipulate the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system. This service is functionally similar to the Content Handler service but differs since the output document (digitized doc) structure varies.
API to save translated documents. The JSON request object is generated from anuvaad-gv-document-digitizer and later updated by the tokenizer. This API is used internally.
Mandatory parameters: files, record_id
Actions:
Validating input params as per the policies
The document to be saved is converted into blocks of pages
Each block contains regions such as line, word, table, etc.
Every block is created with UUID
Saving blocks in the database
API to update the text in the digitized doc. RBAC enabled.
Mandatory parameters: words, record_id, region_id, word_id, updated_word
Actions:
Validating input params as per the policies
Looping over the regions to locate the word to be updated
Updating the word and setting a flag save=True
API to fetch back the document. RBAC enabled.
Mandatory parameters: record_id, start_page, end_page
Actions:
Validating input params as per the policies
Returning back the document as an array of pages
This microservice is used to extract text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.
It takes an image or pdf as an input.
If the input is a pdf, it converts the pdf into images.
The pdftohtml tool is used to extract page-level information like text, word coordinates, page width, page height, tables, images, and others.
If the document language is vernacular, the pdftohtml tool does not work well, so Tesseract (or, alternatively, Google Vision if required) is used for OCR.
Horizontal merging is used to get lines using word coordinates.
Vertical merging is used to get blocks using line coordinates.
The final JSON contains page-level information like page width, page height, paragraphs, lines, words, and layout class.
Input:
Here it takes a PDF or image path as an input and the language of that document.
Input:
Upload a PDF or image file using the upload API:
Get the upload ID and copy it into the path of the block merger's wf-initiate input.
Do a bulk search using the jobID to get the JSON ID from the BM service response:
Bulk search input format:
Download JSON using download API:
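Taken together, the steps above can be sketched with Python requests roughly as follows; the endpoint paths, headers, response shapes, and the workflow code are assumptions and depend on the gateway setup, so treat this as an outline rather than the exact contract:

```python
# Sketch only: paths, headers, workflow code and payload fields are assumptions; see the API contract below.
import requests

BASE = "https://your-anuvaad-host"       # placeholder
HEADERS = {"auth-token": "<jwt-token>"}  # placeholder

# 1. Upload the PDF/image and note the returned upload identifier.
with open("sample.pdf", "rb") as f:
    upload = requests.post(f"{BASE}/upload", files={"file": f}, headers=HEADERS).json()
upload_id = upload["data"]               # assumed response shape

# 2. Initiate the block-merger workflow with the upload id as the file path.
job = requests.post(f"{BASE}/async/initiate", headers=HEADERS, json={
    "workflowCode": "WF_A_BM",           # hypothetical workflow code for block merger alone
    "files": [{"path": upload_id, "type": "pdf", "locale": "en"}],
}).json()
job_id = job["jobID"]                    # assumed response shape

# 3. Poll the WFM bulk search with the jobID to find the output JSON file id.
status = requests.post(f"{BASE}/jobs/search/bulk", headers=HEADERS,
                       json={"jobIDs": [job_id]}).json()

# 4. Once the job is complete, download the JSON via the download API
#    using the file id found in the bulk-search response.
```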
API Contract:
Code location:
URL:
URL:
Upload URL:
Bulk search URL:
Download URL:
This microservice is served with multiple APIs to handle and retrieve the contents (final result) of files translated in the Anuvaad system.
Some common information that applies to the save and update operations on translations:
WF_S_TR and WF_S_TKTR: Change the sentence structure, hence the s0 pair needs to be updated.
DP_WFLOW_S_C: Doesn't change the sentence structure, hence no need to update the s0 pair.
s0_src: Source sentence extracted from the file.
s0_tgt: Sentence translation from NMT.
tgt: Translation updated by the user (user translation). (The source may vary if the user edits the input document; otherwise it remains the same as s0_src.)
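As an illustration of how these fields relate within a sentence entry (the values and the surrounding schema are simplified placeholders, not the exact stored document):

```python
# Illustrative sentence entry only; real blocks carry additional fields.
sentence = {
    "s_id": "a1b2c3d4",                        # sentence identifier (UUID)
    "s0_src": "<source sentence from file>",    # extracted from the uploaded document
    "s0_tgt": "<translation from NMT>",         # machine translation of s0_src
    "tgt": "<translation edited by the user>",  # user translation saved over the NMT output
    "save": True,                               # flag set when the user saves the sentence
}
```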
API to save translated documents. The JSON request object is generated from block-merger and later updated by tokenizer and translator. This API is used internally.
Mandatory parameters: userid, pages, record_id, src_lang, tgt_lang
Actions:
Validating input parameters as per the policies.
The document to be saved is converted into blocks.
Block can be of type images, lines, text_blocks, etc.
Every block is created with UUID.
Saving blocks in the database.
API to fetch back the documents. The response object would be an array of pages, with pagination enabled. RBAC enabled.
Mandatory parameters: start_page, end_page, record_id
Actions:
Validating input parameters as per the policies.
Fetching back the blocks as per the page number requested.
API to update the block content; triggered on split, merge, re-translate operations. Used internally.
Mandatory parameters: record_id, user_id, blocks, workflowCode
Actions:
Validating input parameters as per the policies.
Updating the list of blocks.
Internal API to store S3 link references to translated documents (on docx flow).
Mandatory parameters: job_id, file_link
Actions:
Validating input parameters as per the policies.
Storing records in the database.
API to fetch back the S3 link for docx files. RBAC enabled.
Mandatory parameters: job_ids
Actions:
Validating input parameters as per the policies.
Fetching back the data from the database.
API to store user translations. RBAC enabled.
Mandatory parameters: sentences, workflowCode, user_id
Actions:
Validating input parameters as per the policies.
Updating the sentence blocks.
Saved sentences are always updated with the "save": true flag.
Saved sentences are also saved in the Redis store for Sentence Memory.
Bulk API to fetch back sentences. RBAC enabled.
Mandatory parameters: sentences, record_id, block_identifier, s_id
Actions:
Validating input parameters as per the policies.
Returning back an array of sentences searched for.
This pipeline is used to extract text from a digital/scanned document. Lines and layouts (header, footer, paragraph, table, cell, image) are detected by a custom-trained Prima layout model and OCR is done using the Anuvaad OCR model.
Upload a PDF or image file using the upload API:
Get the upload ID and copy it to the DD2.0 input path.
Initiate the Workflow:
DD2.0 Input:
Input: PDF or image
Output: List of pages with detected lines and page information.
Input: Output of word detector
Output: List of pages with detected layouts and lines.
Input: Output of layout detector
Output: Collation of line and word at layout level.
Input: Output of block segmenter
Output: Text collation at word, line, and paragraph level using Google Vision as the OCR engine.
Input: Output of block segmenter
Output: Text collation at word, line, and paragraph level using Anuvaad OCR model.
Github repo:
API contract:
Upload URL:
WF URL:
Github repo:
API contract:
Upload URL:
WF URL:
Github repo:
API contract:
WF URL:
Github repo:
API contract:
WF URL:
Github repo:
API contract:
WF URL:
WF URL:
Github repo:
API contract:
This microservice is served with multiple APIs to transform the data in the file into a JSON file and to download the translated files of type DOCX, PPTX, and HTML.
Steps:
Transformation Flow
Use the data in DOCX, PPTX, or HTML file to create a JSON file.
Tokenizer Flow
Read the JSON file created in the Transformation Flow and tokenize each paragraph.
Tokenization is a process where we extract all the sentences in a paragraph.
Translation Flow
Translate each sentence.
Steps:
WF_A_FTTKTR Flow
This flow must be completed before calling the download flow.
Download Flow
Fetch content for the file, replace original sentences with translated ones, and download the file.
Mandatory Params for File Translator:
Path
Type
Locale
Actions:
Validate input parameters.
Generate a JSON file from the data of the given file.
Convert the given file to HTML and PDF, and push it to S3 (for showing it on the UI).
Get the S3 link of the converted file and call content handler API to store the link.
Mandatory Params for File Translator:
Path
Type
Locale
Actions:
Validate input parameters.
Call fetch-content to get the translation of the file passed in the param.
Replace the original text in the file with the translated text.
Return the path of the translated file.
This microservice is intended to generate the final document after translation and digitization. It currently supports pdf, txt, and xlsx document generation.
API to create digitized txt & xlsx files for the Translation flow. RBAC enabled.
Mandatory parameters: record_id, user_id, file_type
Actions:
Validating input params as per the policies
Page data is converted into dataframes
Writing the data into file and storing them on Samba store
API to create digitized txt & pdf files in the Document Digitization flow. RBAC enabled.
Mandatory parameters: record_id, user_id, file_type
Actions:
Validating input params as per the policies
Generating the docs using ReportLab
Writing the data into file and storing them on Samba store
UMS is the initial Anuvaad module that facilitates user login and other account-related functionalities. It features admin-level and user-level login. Only the Super Admin has the authority to create new organizations or add new users to the system (apart from self sign-up). The Admin can assign roles to the new users as well.
Whitelisted bulk API to create/register users in the system.
Mandatory params: userName, email, password, roles
Actions:
Validating input params as per the policies
Storing the user entry in the database and assigning a unique ID (userID)
Triggering verification email
Whitelisted API to verify and complete the registration process on Anuvaad.
Mandatory params: userName, userID
Actions:
Validating input params as per the policies
Activating the user
Triggering registration successful email
Whitelisted API for login.
Mandatory params: userName, password
Actions:
Validating input params as per the policies
Issuing auth token (JWT token)
Activating user session
Whitelisted API for logging out.
Mandatory params: userName
Actions:
Validating input params as per the policies
Turning off user session
API to validate auth tokens and fetch back user details.
Mandatory params: token
Actions:
Validating the token
Returning user records matching the token only when the token is active
Same API is used for verifying a token generated on forgot-password as well.
Bulk API to update user details, RBAC enabled.
Mandatory params: userID
Updatable fields: orgID, roles, models, email
Actions:
Validating input params as per the policies
Updating DB records
API for forgot password.
Mandatory params: userName
Actions:
Validating input params as per the policies
Generating reset password link and sending it via email
API to update password, RBAC enabled.
Mandatory params: userName, password
Actions:
Validating input params as per the policies
Generating reset password link and sending it via email
(Only Admin has access)
Bulk API to onboard users to the Anuvaad system.
Mandatory params: userName, email, password, roles
Actions:
Validating input params as per the policies
Storing user entry in the database and assigning a unique userID
User account is verified and activated by default
API for bulk search with pagination property.
Actions:
Validating input params as per the policies
All user records are returned if skip_pagination is set to True.
When no offset and limit are provided, default values are set as per configs.
Only the records matching the search values are returned if skip_pagination is False.
API to update the activation status of a user.
Mandatory params: userName, is_active
Actions:
Validating input params as per the policies
Updating the user activation status
API to fetch active roles in Anuvaad.
Actions:
Returning active role codes
CreateOrganization: Bulk API to upsert organizations.
Mandatory params: code, active
Actions:
Validating input params as per the policies
Creating or deactivating orgs as per the active status in the request
SearchOrganization: API to get organization details.
Actions:
If org_code is given, searches for that organization alone; otherwise, all organizations are returned.
GenerateIdToken: Generating token for web extension user.
Mandatory params: id_token
Actions:
Decrypting and validating the token
If the token is valid, register the user and return auth token
Add APIs with Zuul if they need external access.
Rebuild and deploy UMS whenever a new role is added with Zuul.
Email ID used for system notifications: anuvaad.support@tarento.com
Run the docker container.
Initialize the DB by creating a Super-Admin account directly in the DB.
Additional users can be added from the UI by logging into the super admin account.
Create an account (Admin is preferred) using the API anuvaad/user-mgmt/v1/users/create.
Get the verification token from the email (second-last ID on the ‘verify now’ link) or the userID from the user table.
Complete the registration process by calling the anuvaad/user-mgmt/v1/users/verify-user API.
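A hedged sketch of the account creation and verification calls with Python requests; the endpoint paths come from the steps above, while the body fields, role structure, and response handling are assumptions:

```python
# Sketch only: body fields and role structure are assumptions; endpoint paths are from the docs.
import requests

BASE = "https://your-anuvaad-host"   # placeholder

# Create an (admin) account -- bulk API, hence a list of users.
requests.post(f"{BASE}/anuvaad/user-mgmt/v1/users/create", json={
    "users": [{
        "userName": "admin@example.com",
        "email": "admin@example.com",
        "password": "<strong-password>",
        "roles": [{"roleCode": "ADMIN"}],   # role structure is an assumption
    }]
})

# Complete registration with the userName and the verification token / userID
# taken from the verification email or directly from the user table.
requests.post(f"{BASE}/anuvaad/user-mgmt/v1/users/verify-user", json={
    "userName": "admin@example.com",
    "userID": "<verification-token-or-userID>",
})
```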
This document provides details about the translator service used in Anuvaad. Translator is a wrapper over the NMT and is used to send sentence by sentence to NMT for translation of the document.
Translator receives its input from the tokeniser module; the input is a JSON file that contains tokenised sentences. These tokenised sentences are extracted from the JSON file and then sent to NMT over Kafka for translation. NMT expects a batch of ‘n’ sentences in one request, so the Translator creates ‘m’ batches of ‘n’ sentences each and pushes them to the NMT input topic. In parallel, it also listens to NMT’s output topic to receive the translations of the batches sent. Once all ‘m’ batches are received back from NMT, the translation of the document is marked complete.
Next, the Translator appends these translations back to the JSON file received from the Tokeniser. This JSON, now enriched with a translation against every sentence, is pushed to the Content Handler via an API, and the Content Handler then stores these translations.
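The batching step can be pictured roughly as below; the topic name, message fields, and batch size are assumptions, and kafka-python is used purely for illustration:

```python
# Rough illustration of batching tokenised sentences for NMT over Kafka.
# Topic name and message fields are assumptions, not the actual Anuvaad contract.
import json
from kafka import KafkaProducer

BATCH_SIZE = 25  # 'n' sentences per NMT request (value is illustrative)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def push_batches(job_id, sentences, src_lang="en", tgt_lang="hi"):
    # Split the tokenised sentences into 'm' batches of 'n' and push each to the NMT input topic.
    for i in range(0, len(sentences), BATCH_SIZE):
        batch = sentences[i:i + BATCH_SIZE]
        producer.send("nmt-input-topic", {          # assumed topic name
            "jobID": job_id,
            "batch_id": i // BATCH_SIZE,
            "src_lang": src_lang,
            "tgt_lang": tgt_lang,
            "sentences": batch,
        })
    producer.flush()
```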
TMX is the system translation memory. A user can decide to override Anuvaad’s translation of a text/sentence by inserting ‘preferred translations’ into the system. TMX is backed by a Redis store which hashes and stores user-specific translations for a text. It can be thought of as the user’s personal cache of translations.
TMX provides three levels of caching: Global, Org, User.
Global: A global bucket of preferred translations where an ADMIN or a global-level user can feed in translations which will be applied across all users and orgs.
Org: An org-level bucket where Anuvaad translations are overridden by preferred translations only for users who belong to a particular organisation. Any ADMIN or org-level user can feed in these translations to be applied across the users of his/her org.
User: A user-level bucket where a user can feed in his/her preferred translations, and the system will override Anuvaad translations only for that particular user.
TMX can be uploaded sentence by sentence or in bulk, both APIs are supported.
UTM is the User Translation Memory. It is slightly different from TMX: there are no levels; it is purely a translation cache. The system remembers the user's translation and applies it automatically when it encounters the same sentence for the same user.
Let’s say we have a sentence ‘S1’ in a document ‘D1’, and Anuvaad’s translation of this sentence is ‘T1’. The user, on encountering this, changes the translation of ‘S1’ from ‘T1’ to ‘T2’. Anuvaad now remembers this, so that in any other document, say ‘D2’, whenever ‘S1’ appears and NMT translates it to ‘T1’, Anuvaad automatically overrides the translation to ‘T2’. However, if NMT gets better with time and now translates ‘S1’ to ‘T3’, Anuvaad doesn’t override it, because the user’s correction was made in the context of ‘S1’ -> ‘T1’ and not ‘S1’ -> ‘T3’.
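The override rule can be sketched as a keyed lookup where the key includes the NMT output, so a changed NMT translation (‘T3’) no longer matches the remembered context. This is a conceptual sketch, not Anuvaad's actual Redis schema:

```python
# Conceptual sketch of the UTM override rule; not the actual storage schema.
import hashlib

utm_store = {}  # stands in for the Redis-backed store

def _key(user_id, source, nmt_translation):
    raw = f"{user_id}|{source}|{nmt_translation}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def remember_correction(user_id, s1, t1, t2):
    # The user changed NMT output T1 for sentence S1 to T2.
    utm_store[_key(user_id, s1, t1)] = t2

def apply_utm(user_id, s1, nmt_out):
    # Override only when the NMT output still matches the remembered context (S1 -> T1).
    return utm_store.get(_key(user_id, s1, nmt_out), nmt_out)

remember_correction("u1", "S1", "T1", "T2")
print(apply_utm("u1", "S1", "T1"))   # -> T2 (overridden)
print(apply_utm("u1", "S1", "T3"))   # -> T3 (NMT improved; no override)
```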
The Tokenizer submodule in Anuvaad is designed to break down paragraphs into sentences or words, facilitating efficient preprocessing and accurate translations. This submodule is integral for preparing text data for subsequent processing and translation tasks.
Paragraph to Sentence Tokenization: Splits paragraphs into individual sentences, making the text easier to process.
Sentence to Word Tokenization: Breaks down sentences into individual words for detailed analysis and translation.
Document-Specific Handling: Manages document-specific symbols and special characters to ensure consistency and accuracy in tokenization.
Flexible Integration: Can be invoked independently as a standalone service or as part of a larger workflow through the Workflow Manager.
The Tokenizer can be utilized in two main ways:
Independent Invocation:
As an independent service, the Tokenizer can be directly called to process text data. This is useful for isolated tasks where only tokenization is required.
Workflow Manager Integration:
Within the Workflow Manager, the Tokenizer works as a part of the broader document processing and translation pipeline. This integration allows for seamless interaction with other Anuvaad submodules, ensuring smooth and efficient data flow.
By employing the Tokenizer submodule, Anuvaad ensures that text data is meticulously prepared, contributing to the overall accuracy and efficiency of the document processing and translation workflow.
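Anuvaad's tokenizer applies its own language- and document-specific rules; purely as an illustration of paragraph-to-sentence tokenization, a naive regex-based split looks like this:

```python
# Naive illustration only; Anuvaad's tokenizer handles Indic punctuation,
# abbreviations, and document-specific symbols far more carefully.
import re

def split_into_sentences(paragraph: str):
    # Split after ., !, ? or the Devanagari danda (।), keeping the delimiter.
    parts = re.split(r"(?<=[.!?।])\s+", paragraph.strip())
    return [p for p in parts if p]

para = "The appeal is allowed. Costs are awarded to the appellant. No further orders."
print(split_into_sentences(para))
# ['The appeal is allowed.', 'Costs are awarded to the appellant.', 'No further orders.']
```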
API Contract:
Code:
Feature Branch name: user-mangement_feature
API Contract:
Email templates are available .
Example:
You can find the code for the Tokenizer submodule in the Anuvaad repository at the following link:
For detailed information about the API endpoints and their usage, refer to the API contract available at:
When Anuvaad is implemented at an organizational level, analytics is crucial for tracking usage and metrics. A dedicated module exists to serve this purpose.
Every X hours, a cron job creates a CSV file, and the analytics that need time-consuming computation are drawn from it. For other metrics, data is fetched directly from the database. The following analytics are currently available, with room for more metrics to be visualized:
Total Documents Translated, Language-wise
Organization-wise Sentences Translated
Organization-wise Dashboard
Reviewer Metrics
Code Repository:
API Contract:
The Aligner module is designed for “aligning” or finding similar sentence pairs from two lists of sentences, preferably in different languages. The Aligner is a standalone service that cannot be accessed from the UI as of now. The service is dependent on the file uploader and workflow manager (WFM) services.
The Aligner service is based on Google’s LaBSE model and FB’s FAISS algorithm. It accepts two files as inputs, from which two lists of sentences are collected. LaBSE Embeddings are calculated for each of the sentences in the list. Cosine similarity between embeddings is calculated to find meaningfully similar sentence pairs. The FAISS algorithm is used to dramatically speed up the whole process.
The service accepts two text files, and the aligner module can ideally be invoked using WFM. It is time-consuming and hence an async service. Once the run is fully done, a WFM-based search can be conducted using the job ID to obtain the result.
The response is typically a JSON file path, which can be downloaded using the download API. The JSON file is self-explanatory and contains source_text, target_text, and the corresponding cosine similarity between them.
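A condensed sketch of the underlying idea using the sentence-transformers LaBSE checkpoint and FAISS; the real aligner additionally applies thresholds and filtering heuristics, so treat this as a simplification:

```python
# Simplified sketch of LaBSE + FAISS alignment; the actual aligner adds thresholds and filtering.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

src_sentences = ["The appeal is allowed.", "The order is set aside."]
tgt_sentences = ["आदेश रद्द किया जाता है।", "अपील स्वीकार की जाती है।"]

# Normalised embeddings -> inner product equals cosine similarity.
src_emb = model.encode(src_sentences, normalize_embeddings=True)
tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)

index = faiss.IndexFlatIP(tgt_emb.shape[1])
index.add(np.asarray(tgt_emb, dtype="float32"))

scores, ids = index.search(np.asarray(src_emb, dtype="float32"), k=1)
for i, (score, j) in enumerate(zip(scores[:, 0], ids[:, 0])):
    print(src_sentences[i], "<->", tgt_sentences[j], f"(cosine={score:.2f})")
```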
Clone the Repo
Install dependencies
Run the application
Access from local:
It returns a JOB ID, which can be searched using the WFM Bulk search API to see job progress and pull out results once done.
WF_A_JAL is the workflow code for the JSON-based aligner, which returns the filepath of a JSON file that can be downloaded using the download API.
WF_A_AL is the old workflow code, which returns multiple txt files.
Upload two files.
Call API endpoint with file paths as parameters.
Verify if sentences are matching properly in the JSON.
Can be used as an independent service by deploying the file-uploader and aligner modules alone on a server, preferably GPU-based (tested to work well on g4dn.2xlarge).
Simplified implementations of the aligner could be found .
An explanatory article could be found and .
This module provides the NMT-based translation service for various Indic language pairs. Currently, the NMT models are trained using the OpenNMT-py framework (version 1), and the model binaries are generated using the ctranslate2 module provided for OpenNMT-py; the same module is used to generate model predictions.
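At inference time, translation with a CTranslate2-converted OpenNMT-py model looks roughly like the sketch below; the model paths and the SentencePiece tokenisation step are assumptions, and the actual service wires this logic behind its REST/Kafka interface:

```python
# Simplified sketch of inference with a CTranslate2-converted OpenNMT-py model.
# Model paths and the SentencePiece tokenisation are assumptions.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("models/en-hi-ct2")               # placeholder model dir
sp_src = spm.SentencePieceProcessor(model_file="models/src.model")    # placeholder
sp_tgt = spm.SentencePieceProcessor(model_file="models/tgt.model")    # placeholder

def translate_batch(sentences):
    # Tokenise, translate the whole batch, then detokenise the best hypothesis.
    tokens = [sp_src.encode(s, out_type=str) for s in sentences]
    results = translator.translate_batch(tokens)
    return [sp_tgt.decode(r.hypotheses[0]) for r in results]

print(translate_batch(["The appeal is allowed."]))
```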
NMT requires a parallel corpus between languages. Typically, the size of a language corpus is in the millions. The corpus must have enough examples to cover various situations. This is one of the most important parts of the system and a very time-consuming piece of work, where the quality of data has to be checked to ensure the accuracy of translation. At Anuvaad, we have collected data for 11 languages as parallel corpora. The corpus is available under the MIT license.
Training and retraining is a continuous process, and training depends on the quality of the input dataset. The quality of translation has to be constantly monitored. Translation mistakes should be used to generate training examples, and retraining exercises have to be taken up periodically. The training cycle is a costly affair, as it needs GPU infrastructure and long training hours.
The model output is evaluated on pre-selected sentences and a BLEU score is calculated. The BLEU score acts as guidance and provides feedback on the model quality. The translation output has to be evaluated by human translators as well before it can be used in a production environment.
We are leveraging an open-source project called "OpenNMT" and also exploring "Fairseq" (IndicTrans) from the perspective of enhancement and usage. The deep learning platform used is PyTorch.
Vocabulary or dictionary generation
Tokenizer (detokenizer): breaking a given sentence into words or sub-words (language specific); Moses or IndicNLP (for Indian languages)
BPE (Byte Pair Encoding)
Unigram
Tune model parameters and hyper parameters to improve accuracy.
Opennmt-py based
Fairseq based
python 3.6
ubuntu 16.04
Install various python libraries as mentioned in requirements.txt file
Run app.py to start the service with all the packages installed
This page will help the user to get themselves onboarded on Anuvaad and perform an operation.
Once a user reaches the Sign Up page, they have to fill in the required details as shown below:
Upon successful submission, an E-Mail will be sent to the registered ID with a verification link
Clicking the verification link will redirect the user to the login page as shown below
For security purposes, Anuvaad follows an OTP-based login mechanism. You will be asked for the E-Mail to which a one-time password will be sent.
Upon confirming the ID, you will be asked to opt for one of the authentication methods.
Everything discussed above is a one-time process and is applicable only on the initial login to the application. Going forward, you will be redirected to the OTP verification page upon successful login.
Upon providing the correct OTP, the user will reach the below landing page of Anuvaad and we are good to go!
The translate sentence feature enables the user to input a text and instantaneously get its translation in another language. To use it, simply click on the Translate Sentence option on the landing page of Anuvaad.
The user has to select the source language in which the input is provided and the target language into which the text must be translated.
If an Indic language is selected as the source, the Transliteration feature is enabled by default, assisting the user to type in that particular language from a normal keyboard with ease.
Upon clicking submit, the translated sentence will be displayed along with the model used to perform the translation. Using this feature, a stakeholder can quickly check the accuracy of the translation performed by Anuvaad.
Digitize Document feature helps to convert scanned documents into editable digital format by preserving the structure. This process recognizes text in scanned (non hand-written for now) documents and converts it into searchable text.
To perform a document digitization, click on the Digitize Document option on the landing page of Anuvaad. You will be greeted with a screen as below
User may select a document/image and choose the appropriate source language and then trigger the digitization process. A pop-up window appears which shows the progress of the ongoing process. This is an async job and will happen in the background. The time taken will be dependent on the nature of the uploaded file
If the status of a job is completed, you can view the result and make changes by clicking on the view document icon, which is second in the last column under the label Action.
Users can make changes, if any, by double-clicking on the word. Once done, the digitized document can be downloaded in the desired format by clicking on the download button in the top-right corner.
The translate document feature enables the user to upload a document and get its translated version. The key highlight of the feature is that Anuvaad tries to maintain the original structure of the document in the best possible manner.
To perform a document translation, click on the Translate Document option on the landing page of Anuvaad. You will be greeted with a screen as below
Here, the user will have the provision to upload a document (pdf,docx,pptx formats are supported as of now) and select a source and target language to perform translation. On successful upload of a supported file, the process begins and status will be shown
There are various stages happening behind the scenes of document translation, and the status will be displayed on screen. The total time taken to complete the process depends on the number of pages and the structure of the input document. The translation is an async process; once initiated, it is performed in the background. Users can keep adding more tasks to Anuvaad in the meantime.
Users can make the necessary changes to the document using Anuvaad's easy editor, which was developed with document translators at the forefront, and later download the translated document back to their system in the desired format.
The blue icon in the bottom-right corner provides the merge feature; it can be used when two or more text blocks need to be combined into a single unit.
Being an AI-assisted translation software, some occurrences can happen where a machine translation can give meaningful, yet non-contextual output to certain words/phrases. In order to work around this, Anuvaad offers user-level Glossary support.
A translator can store certain phrases and their predefined translations so that, if a similar phrase occurs in the document being translated, the system acts accordingly based on the predefined criteria.
Anuvaad also offers a built-in Analytics feature to keep track of usage metrics. These Analytics are instance-specific. Bar graph-based representations offer quick insights into how well Anuvaad is utilized. This also helps stakeholders to keep track of the number of Documents processed, Languages used, sentences translated, and organization-level information. Furthermore, these data could be exported in the desired format for future reference.
Anuvaad uses the current state-of-the-art Transformer model to achieve target sentence prediction/translation. The supporting code and paper are in the open-source domain.
SentencePiece or subword-nmt is used for subword tokenization. The supporting code and paper are in the open-source domain.
For more information about api documentation, please check @
The automated onboarding process is disabled for now to restrict resource usage. For the time being, please fill out the details and we will get back to you soon with the login details (remember to check spam as well). Please send a mail in case you don't receive any response within a day.
Users shall onboard themselves on Anuvaad via the link below: Registration:
The status of all tasks given by user to Anuvaad can be viewed on the as below
After a certain time, by going to the , the user can access the translated document.
The list of default Glossaries and added ones can be viewed on page.
To delete a glossary, simply click on the bin icon under the Action column. To add an extra item, click on the button in the top right corner. You will be redirected to the following page where new words/phrases and their corresponding translation can be added. Once submitted, the new items will be visible on page.
Integration
This is the very first step: users should register at . Anuvaad will send a verification email; please verify the email before starting to make any API calls. Registration is a one-time activity.
It is recommended to translate batches of sentences in order to get high throughput.
/aai4b-nmt-inference/v1/translate
Translate batches of sentences.
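A hedged example of calling the batch endpoint with Python requests; the request/response field names, header, and model identifier are assumptions, so check the API documentation referenced above for the exact schema:

```python
# Sketch only: field names and the model id are assumptions; see the API documentation for the contract.
import requests

BASE = "https://your-anuvaad-host"            # placeholder
HEADERS = {"auth-token": "<api-key-or-jwt>"}  # placeholder

payload = {
    "src_list": [                             # batch the sentences for higher throughput
        {"src": "The appeal is allowed."},
        {"src": "The order is set aside."},
    ],
    "source_language_code": "en",
    "target_language_code": "hi",
    "model_id": 100,                          # hypothetical model identifier
}

resp = requests.post(f"{BASE}/aai4b-nmt-inference/v1/translate",
                     json=payload, headers=HEADERS, timeout=60)
print(resp.json())
```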
This document explains how Netflix Zuul is used as an API Gateway in Anuvaad to perform Authentication, Authorization and API redirection to all inbound API calls. Zuul is an Open Source Project.
Zuul is an API gateway developed as an open-source project by Netflix. Zuul provides various features to abstract out some of the common operations of the system and provide a strong layer for authentication, authorization, API pre & post hooks, API throttling, session monitoring, and much more. Zuul is an edge service that proxies requests to multiple backing services. It provides a unified “front door” to your system, which allows a browser, mobile app, or other user interface to consume services from multiple hosts without managing cross-origin resource sharing (CORS) and authentication for each one.
Zuul in Anuvaad is a config-driven implementation where APIs, roles, and role actions are read by Zuul from a file stored in a remote repository.
The set of roles defined in the system; these are attached to users and also mapped to the APIs to provide role-based access control (RBAC).
Set of APIs exposed in the system by various microservices. Each action is an API which will be mapped against the roles. APIs are of 2 types: Open APIs and Closed APIs. Open APIs can be accessed without authentication, in other words: these APIs are whitelisted. Closed APIs can only be accessed after auth checks.
Mapping between the roles and actions, Zuul uses this to decide if the User should be allowed to access a particular API. These configs can be found here:
Anuvaad uses JWT auth tokens for authentication and authorization purposes. The same token is also used as the session ID. These tokens are generated and stored securely by a UMS system. Example:
Anuvaad Zuul uses 3 pre-filters, namely Correlation, Auth, and Rbac.
Correlation: Filter to add a correlation ID to the inbound request.
Auth: Filter to perform the authentication check on the inbound request.
Rbac: Filter to perform the authorization check on the inbound request.
API redirection configuration is provided in the
In Anuvaad, all the services are orchestrated through the Workflow Manager using the respective workflow configs defined for each service, and communication between these services is done by means of Kafka. Each service has its own functionality and is not dependent on the outputs of predecessor services. Two main flows:
1. Translation - Single/Block Translation or Document (.pdf, .docx, .pptx) Translation.
2. Digitization - OCR on documents (.pdf or images).
WFM is the backbone service of the Anuvaad system; it is a centralized orchestrator which directs the user input through the dataflow pipeline to achieve the desired output. It maintains a record of all the jobs and all the tasks involved in each job. WFM is the SPOC for the clients to retrieve details, status, error reports etc. about the jobs executed (sync/async) in the system. Using WFM, we’ve been able to use Anuvaad not just as a Translation platform but also as an OCR platform, Tokenization platform and Sentence Alignment platform for dataset curation. Every use-case in Anuvaad is defined as a ‘Workflow’ in the WFM. These workflow definitions are in the form of a YAML file, which is read by WFM as an external configuration file.
This microservice serves multiple APIs to manage the User and Admin side functionalities in Anuvaad.
If the document is a .pdf:
FILE-UPLOADER -> FILE-CONVERTER -> BLOCK-MERGER -> TOKENISER -> TRANSLATOR -> CONTENT-HANDLER
If the document is a .docx or .pptx:
FILE-UPLOADER -> FILE-TRANSLATOR -> TOKENISER -> TRANSLATOR -> CONTENT-HANDLER
V1.0
FILE-UPLOADER -> FILE-CONVERTER -> GOOGLE-VISION-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER
V1.5
FILE-UPLOADER -> FILE-CONVERTER -> WORD-DETECTOR -> LAYOUT-DETECTOR -> BLOCK-SEGMENTER -> GOOGLE-VISION-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER
V2.0
FILE-UPLOADER -> FILE-CONVERTER -> WORD-DETECTOR -> LAYOUT-DETECTOR -> BLOCK-SEGMENTER -> TESSERACT-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER
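As a rough illustration of how WFM walks a job through one of the pipelines listed above, the Python sketch below passes a job record through an ordered list of step names. It is purely conceptual: the step implementations are stand-ins, and the real WFM reads YAML workflow definitions and dispatches steps to the respective microservices over Kafka.

# Conceptual sketch only: step names mirror the pipelines above, but plain function calls
# stand in for the YAML workflow config and Kafka-based dispatch used in the real system.
from typing import Any, Callable, Dict, List

STEP_REGISTRY: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "FILE-UPLOADER": lambda job: {**job, "file_path": "/samba/in.pdf"},   # stand-in step
    "FILE-CONVERTER": lambda job: {**job, "pages": ["page1.png"]},        # stand-in step
    "WORD-DETECTOR": lambda job: {**job, "words": [["sample"]]},          # stand-in step
    # ... remaining steps omitted for brevity
}

def run_workflow(steps: List[str], job: Dict[str, Any]) -> Dict[str, Any]:
    # Pass the job record through each configured step in order, tracking progress.
    for name in steps:
        job = STEP_REGISTRY[name](job)
        job.setdefault("completed_steps", []).append(name)
    return job

v2_digitization = ["FILE-UPLOADER", "FILE-CONVERTER", "WORD-DETECTOR"]
print(run_workflow(v2_digitization, {"job_id": "A-0001"}))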
The user uploads the file, and the file is stored in the Samba share for the subsequent APIs to access.
This microservice, which is a Kafka consumer service, consumes the input files and converts them into PDF. Best results are obtained for the file formats supported by LibreOffice.
If the document format is .pdf, then Block Merger is used for OCR on the document.
It is used to extract text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.
If the document format is .docx or .pptx, then the File Translator service is used.
This microservice serves multiple APIs to transform the data in the file into a JSON file and to download the translated files of type docx, pptx and html.
This service tokenises the input paragraphs into independently translatable sentences, which can be consumed by downstream services to translate the entire input. Regular expressions and libraries such as NLTK are used to build this tokeniser.
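For intuition, here is a minimal sentence-tokenisation sketch using NLTK. It is not the Anuvaad tokeniser itself, which layers additional regular-expression rules and Indic-specific handling on top; it only illustrates the core paragraph-to-sentence step.

# Minimal paragraph -> sentence tokenisation with NLTK (illustrative, not Anuvaad's actual rules).
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)       # sentence tokenizer model
nltk.download("punkt_tab", quiet=True)   # required instead on newer NLTK releases

paragraph = (
    "The appeal was filed on 12.01.2021. The Hon'ble Court heard both parties. "
    "Judgment was reserved."
)
for sentence in sent_tokenize(paragraph):
    print(sentence)   # each sentence can now be translated independently downstream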
Translator receives input from the Tokeniser module; the input is a JSON file that contains tokenised sentences. These tokenised sentences are extracted from the JSON file and then sent to NMT over Kafka for translation. NMT expects a batch of ‘n’ sentences in one request, so Translator creates ‘m’ batches of ‘n’ sentences each and pushes them to the NMT input topic. In parallel, it listens to the NMT’s output topic to receive the translations of the batches sent. Once all ‘m’ batches are received back from the NMT, the translation of the document is marked complete. Next, Translator appends these translations back to the JSON file received from the Tokeniser. This JSON, now enriched with a translation against every sentence, is pushed to Content Handler via API, and Content Handler then stores these translations.
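A simplified sketch of the batching-and-publish step described above, using the kafka-python client. The topic name, batch size and message schema are assumptions made for illustration and do not reflect the service's actual contract.

# Sketch of splitting tokenised sentences into batches and pushing them to an NMT input topic.
# Topic name, batch size and message schema are assumed.
import json
from kafka import KafkaProducer

NMT_INPUT_TOPIC = "nmt-input"   # assumed topic name
BATCH_SIZE = 25                 # 'n' sentences per batch

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def push_batches(job_id, sentences):
    # Create 'm' batches of up to BATCH_SIZE sentences and publish each to the NMT topic.
    batches = [sentences[i:i + BATCH_SIZE] for i in range(0, len(sentences), BATCH_SIZE)]
    for idx, batch in enumerate(batches):
        producer.send(NMT_INPUT_TOPIC, {"job_id": job_id, "batch_no": idx, "sentences": batch})
    producer.flush()
    return len(batches)   # the Translator waits until this many batches return on the output topic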
This microservice serves multiple APIs to handle and retrieve the contents (final result) of files translated in the Anuvaad system.
The input is a PDF or an image. If the input is a PDF, it is first converted into images. A custom PRIMA line model is used for line detection in the image. The output is a list of pages; each page includes a list of lines along with page information (page path, page resolution).
Takes the output of the word detector as input. A PRIMA layout model is used for layout detection in the image. Layout classes: Paragraph, Image, Table, Footer, Header, Maths formula. The output is a list of pages; each page includes a list of layouts and a list of lines.
Takes the output of the layout detector as input and collates lines and words at the layout level.
Takes the output of the block segmenter as input, uses Google Vision as the OCR engine, and collates text at word, line and paragraph level.
Takes the output of the block segmenter as input, uses the Anuvaad OCR model as the OCR engine, and collates text at word, line and paragraph level.
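The production services wrap custom models and a dedicated Tesseract server, but the word/line collation idea can be illustrated with plain Tesseract via pytesseract. This is a stand-in sketch, not the Anuvaad OCR model.

# Illustration of OCR plus line-level collation using pytesseract (a stand-in, not Anuvaad's OCR stack).
import pytesseract
from PIL import Image
from pytesseract import Output

image = Image.open("page1.png")   # one page image produced by the block segmenter
# 'hin+eng' assumes the Hindi and English traineddata files are installed for Tesseract.
data = pytesseract.image_to_data(image, lang="hin+eng", output_type=Output.DICT)

# Collate recognised words into lines using the (block, paragraph, line) numbers Tesseract reports.
lines = {}
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
    lines.setdefault(key, []).append(word)

for key in sorted(lines):
    print(" ".join(lines[key]))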
This service tokenises the input paragraphs into independently translatable sentences, which can be consumed by downstream services to translate the entire input. Regular expressions and libraries such as NLTK are used to build this tokeniser.
This microservice serves multiple APIs to handle and manipulate the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system. This service is functionally similar to the Content Handler service but differs because the structure of the output (digitized) document varies.
This module is for “aligning”, that is, finding similar sentence pairs from two lists of sentences, preferably in different languages. The service depends on the File Uploader and Workflow Manager (WFM) services. The Aligner service is based on Google’s LaBSE model and Facebook’s FAISS library. It accepts two files as inputs, from which two lists of sentences are collected. LaBSE embeddings are calculated for each of the sentences in the lists, and cosine similarity between embeddings is calculated to find meaningfully similar sentence pairs. The FAISS algorithm is used to speed up the whole process dramatically.
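A condensed sketch of the alignment idea using the LaBSE checkpoint from sentence-transformers and a FAISS inner-product index (cosine similarity on normalised embeddings). The similarity threshold and model handle are illustrative choices, not the Aligner service's actual configuration.

# Sketch of LaBSE + FAISS sentence alignment (threshold and model handle are illustrative).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

en_sentences = ["India is my country", "The hearing is adjourned"]
hi_sentences = ["सुनवाई स्थगित कर दी गई है", "भारत मेरा देश है"]

# With normalised embeddings, inner product equals cosine similarity.
en_emb = model.encode(en_sentences, normalize_embeddings=True)
hi_emb = model.encode(hi_sentences, normalize_embeddings=True)

index = faiss.IndexFlatIP(hi_emb.shape[1])
index.add(np.asarray(hi_emb, dtype="float32"))

scores, ids = index.search(np.asarray(en_emb, dtype="float32"), k=1)
for i in range(len(en_sentences)):
    if scores[i, 0] > 0.8:   # illustrative similarity threshold
        print(en_sentences[i], "<->", hi_sentences[ids[i, 0]], f"(score={scores[i, 0]:.2f})")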
This microservice is intended to generate the final document after translation and digitization. It currently supports pdf, txt and xlsx document generation.
1. aai4b-nmt-inference
2. block-segmenter
3. content-handler
4. etl-document-converter
5. etl-file-translator
6. etl-tokeniser
7. etl-tokeniser-ocr
8. etl-translator
9. etl-wf-manager [critical]
10. file-converter
11. layout-detector-prima
12. gv-document-digitization (optional)
13. metrics
14. ocr-content-handler
15. ocr-tesseract-server
16. user-fileuploader
17. word-detector-craft
18. docx-download-service [nodejs]
19. Two Factor Authentication (optional)
20. User Management System
21. Sentence Aligner
22. Zuul
23. Architecture, Config Management, & Git Strategies [critical]
24. Frontend
Login and auth token
Before making any other API call, the application should call this API first to receive an authorization token. Once an application has a valid token, the same token can be used to make all subsequent calls. Please note that the token can expire, so it is good practice to validate the token.
/v1/users/login
To log in the user. Requires the user's email and password.
/v1/users/auth-token-search
To check validity of the token
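A minimal sketch of the login and token-validation calls with Python requests. The request body field names and the response field holding the token are assumptions; check the API documentation for the exact schema.

# Sketch of obtaining and validating an auth token (field names are assumed).
import requests

ANUVAAD_HOST = "https://<anuvaad-host>"   # replace with the actual API host endpoint

resp = requests.post(
    f"{ANUVAAD_HOST}/v1/users/login",
    json={"email": "user@example.com", "password": "********"},   # assumed body shape
    timeout=30,
)
resp.raise_for_status()
token = resp.json().get("token")   # assumed response field; reuse this token for subsequent calls

# Tokens can expire, so validate before reuse (payload shape again assumed).
check = requests.post(
    f"{ANUVAAD_HOST}/v1/users/auth-token-search",
    json={"token": token},
    timeout=30,
)
print(check.status_code)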
Supported Language pairs and translation models
Integrating applications first have to fetch supported language pairs (source and target language) along with respective translation model identifiers. These two parameters are mandatory before calling the translation API.
/v2/fetch-models
To get the list of models and supported languages.
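A short sketch of fetching the supported language pairs and model identifiers before translation. Whether the endpoint is a GET or a POST, and the shape of the response, are assumptions here; adjust to the actual API documentation.

# Sketch of listing supported models and language pairs (HTTP method and response shape are assumed).
import requests

ANUVAAD_HOST = "https://<anuvaad-host>"
AUTH_TOKEN = "<jwt-token-from-login>"

resp = requests.get(
    f"{ANUVAAD_HOST}/v2/fetch-models",
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for model in resp.json().get("data", []):   # assumed: a list of model records
    print(model.get("model_id"), model.get("source_language_name"), "->", model.get("target_language_name"))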
Frequently asked questions about Anuvaad
API HOST ENDPOINT
This is the host URL for making API calls; all the APIs mentioned should be called against this endpoint.
The NMT module is responsible for the translation of sentences. It can be invoked directly or via the Workflow Manager. The NMT module works in correlation with the ETL Translator to enhance translation efficiency based on previous translations or pre-provided glossary and TMX support (refer to other sections). The module supports batch inferencing and provides APIs that return model details for language and other dropdown menus.
In the early days of Anuvaad, OpenNMT-py based models trained on Anuvaad's proprietary data were used. These models were primarily focused on judicial content. The inference code for this initial version is available here: .
With the collaboration between Anuvaad and , data from Anuvaad and other sources were used to publish the Samanantar paper (https://arxiv.org/abs/2104.05596). Using the Samanantar dataset, IndicTrans, a more general domain model, was trained. This model performed well for legal use cases, leading to the replacement of OpenNMT with . The IndicTrans-based inferencing code is available here: .
As the Sunbird ecosystem developed, the need for hosting multiple ML models independently became resource-intensive. This led to the development of , a centralized platform for hosting models. Applications can now utilize models from Dhruva using APIs. In Dhruva, models are wrapped with NVIDIA Triton, facilitating a scalable architecture. The IndicTrans model was moved to Dhruva, and currently, models are invoked from Dhruva via wrapper APIs from the NMT module rather than using dedicated inference. The Dhruva-ported code is available here: .
Briefly explains retraining a translation model to accommodate a domain-specific use case.
The production environment of Anuvaad runs on top of translation models trained in the general domain, which covers a good number of scenarios. However, if we need a separate instance of Anuvaad to translate domain-specific data (e.g. financial, biomedical), the existing model must be fine-tuned with more relevant data from that particular domain to improve the accuracy of translation. This page briefly summarises how it can be done.
A bilingual, or parallel, dataset is required for training the model. It is simply the same sentence pair in both the source and target languages. Example:
Source(en): India is my country
Target(hi): भारत मेरा देश है
The more data available, the more accurately the model can be trained. In short, data collection can be done using one of the following three approaches:
This is laborious but the best approach. At least a small sample of the dataset must be manually curated and used for validation purposes.
Very often, datasets will be made available by certain research institutes or private vendors. This data can also be included to increase the quantity of training data.
Raw data that is purchased or web-scraped may contain a lot of noise, which can affect training accuracy and thereby the translation. Noise includes unwanted characters, blank spaces, bullets and numbering, HTML tags, etc.
Additional cleaning rules can be applied based on the context and manual verification of the raw data, as the scenario demands.
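A tiny example of the kind of rule-based cleaning described above; the regular expressions are illustrative, and the rules actually used in Anuvaad's scripts differ by context.

# Illustrative cleaning of noisy corpus lines (example rules only, not Anuvaad's exact script).
import re

def clean_line(text):
    text = re.sub(r"<[^>]+>", " ", text)                      # strip HTML tags
    text = re.sub(r"^\s*(\d+[\.\)]|[•\-\*])\s*", "", text)    # drop leading bullets/numbering
    text = re.sub(r"\s+", " ", text)                          # collapse repeated whitespace
    return text.strip()

print(clean_line("  3) <b>India is my country</b>   "))   # -> "India is my country"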
Certain websites have the same data in multiple languages. The idea is to find matching pairs of sentences from them. Scraping can be done using frameworks such as , and sentence matching can be done using techniques such as . If used properly, this method can produce huge amounts of data without much manual effort; however, random manual verification is recommended to ensure data accuracy.
A lot of sample crawlers are available for reference in this
For sentence matching of scraped sentences, the Anuvaad Aligner, which is implemented using LaBSE, can also be used. The specs for it are available
The basic script for sentence alignment, cleaning and formatting is available
The present default model of Anuvaad is IndicTrans. The instructions to retrain and benchmark an IndicTrans model are explained
The training repo of legacy openNMT-py models is available
Once a model is retrained, if there are plans to open-source it, hosting it in will facilitate seamless integration with Anuvaad.