Sunbird Anuvaad


Features

Anuvaad is packed with features that provide an optimal experience for end users and streamline the document translation process. The notable features are highlighted below:

Document Digitization

Document digitization is the process of converting physical documents into digital formats, making them easily accessible and editable.

Layout Detection

Anuvaad ships with custom-trained layout detection models for identifying and comprehending a document's structure, i.e., recognizing key elements such as headings, paragraphs, tables, and images. This step is essential not only for enhancing OCR accuracy but also for preserving the document's layout and structure in the translated version.

Document Translation

Document translation involves converting text from one language to another, facilitating cross-lingual communication and information access. Anuvaad supports NMT models served directly from Bhashini's Dhruva as well as in-built, plug-and-play models for domain-specific use cases.

Document Structure Preservation

This feature ensures that the original formatting, layout, and structure of documents are maintained during the translation process, preserving the document's visual integrity.

Improve Translation from Speech

Speech to text technology converts spoken language into written text, enabling audio content to be transcribed for translation or other purposes.

Translation Memory

Translation memory stores and retrieves previously translated segments to ensure consistency across documents and reduce translation time.

Glossary Support

Glossary support provides access to defined terminology and specialised vocabulary, ensuring consistency and precision in translations, particularly in specialised fields.

Usage Analytics and Metrics

Usage analytics and metrics offer insights into how the platform is utilised, helping users track and optimise translation processes and workflows.

File Format Conversion

File format conversion simplifies the process of converting documents from one file format to another while preserving their content and structure, enhancing compatibility.

Transliteration Support

Transliteration support enables the conversion of text from one script or alphabet to another, aiding users in dealing with different writing systems and ensuring the correct pronunciation of words, especially in multilingual contexts.

Sunbird Anuvaad Overview

Overview

Project Anuvaad is an open-source project funded by the EkStep Foundation.

It was bootstrapped by the EkStep Foundation in late 2019 as a solution for easier translation of legal documents from English to Indic languages and vice versa. The Anuvaad platform allowed legal entities to digitize and translate orders/judgements using an easy-to-use interface.

Anuvaad is an AI-based, open-source document translation platform to digitize and translate documents in Indic languages at scale. It provides easy-to-edit capabilities on top of plug-and-play NMT models. Separate instances of Anuvaad are deployed for Diksha (NCERT), the Supreme Court of India (SUVAS) and the Supreme Court of Bangladesh (Amar Vasha).

Anuvaad leverages state-of-the-art AI/ML models, including NMT, OCR and layout detection, to provide a high level of accuracy. Project Anuvaad was envisioned as an end-to-end open-source solution for document translation across multiple domains.

Project Anuvaad is driven by REST APIs, so any third-party system can use features such as sentence translation and layout detection.

NOTE: The documentation is still a work in progress. Feel free to contribute to it or raise issues if the desired info is not up to date. Explore the KT videos if you would like to dive deep into each module.

Video Tutorials

Various video tutorials demonstrate features and give step-by-step instructions to get the best out of Anuvaad!

Integration


ETL Translator

File uploader

Anuvaad Module Config Guidelines

Configs:

Parameters of a module that can be injected in and out of the system with zero to minimal code change in order to enable/disable/modify certain features. The configs pertaining to the modules of the Anuvaad data-flow pipeline can be broadly classified into 2 categories:

  1. Configs outside the build (docker image)

  2. Configs within the build (docker image)

Configs outside the build:

These configs are injected into the system on the fly; the changes can be incorporated at runtime or with just a restart, without having to re-build or push any code. For instance, WFM reads configs to identify the different workflows configured. To add/edit/delete a workflow, one makes the required changes to the config file and pushes it; the changes are incorporated on restart or, at runtime, through a reload API. (https://raw.githubusercontent.com/project-anuvaad/anuvaad/master/anuvaad-etl/anuvaad-workflow-mgr/config/etl-wf-manager-config-dev.yml) These files are saved in the ‘configs’ folder outside the source code of the system.

Configs within the build:

These configs travel with the build, meaning they are part of the docker image. They can be controlled via an environment file during deployment or internally within the code. This also means that any change to these parameters needs a rebuild and re-deployment of the system; however, no change in the logic or the code should be required. Most of the hooks exposed by a given system fall under this category. They are kept in the ‘configs’ folder inside the source code. It is recommended to use just one ***config.py file inside that folder for all these configs, for better maintainability. If someone prefers to separate out config files by concern, they can do so, but they bear the overhead of maintaining them. For convenience and readability, these configs are further divided as:

  1. Cross module common configs

  2. Module specific configs.

Cross module common configs:

These configs are used across all modules. Configs like the Kafka host, Mongo host and file-upload URL fall under this category, as they do not change from module to module. If a module chooses to use a different value for one of these parameters, it can do so with a different variable. The point is to avoid creating redundant variables in the environment file and to read from the variables that are already defined.

Eg:
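For example, a cross-module section of a ***config.py might read shared values from the environment like this (variable and environment names here are illustrative, not the exact Anuvaad ones):

```python
import os

# Cross-module common configs: every module reads these from the SAME
# environment variables instead of defining redundant ones.
kafka_bootstrap_server = os.environ.get('KAFKA_BOOTSTRAP_SERVER_HOST', 'localhost:9092')
mongo_server_host = os.environ.get('MONGO_SERVER_HOST', 'mongodb://localhost:27017')
file_upload_url = os.environ.get('FILE_UPLOAD_URL', 'http://localhost:5001/upload')
```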

Module specific configs:

These configs are specific to the module and change for each module. This category includes both common variables used at multiple places in your project (e.g. https://github.com/project-anuvaad/anuvaad/blob/69b494224626d51a7baf0405603106a4a66a25c7/anuvaad-etl/anuvaad-extractor/aligner/etl-aligner/configs/alignerconfig.py#L10 ) and variables deriving their value from the environment file (Eg: )

For convenience, the second type of variables is further categorised as:

  1. Kafka Configs: Configs required for Kafka, like topics, consumer groups, partition keys etc., that are very specific to the module. Other parameters required to customise your consumer and producer can also be mentioned under this category.

  2. Datastore Configs: Configs required for the datastore in use, which is mostly Mongo in our case. If you are using MySQL, Redis, Elasticsearch etc., mention the required parameters in this category.

  3. Module Configs: All other configs required for your module can be mentioned here.

Note: It is recommended that most of these parameters derive their values from the environment file; in some cases they can also be hard-coded within the code. It is mandatory for every file/class within the project to use these parameters only via the variables of the ***config.py file.
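Putting the three categories together, a module-specific ***config.py could be organised as below. All names are hypothetical; the point is that every value derives from the environment file, with a hard-coded fallback only where acceptable.

```python
import os

# --- Kafka Configs: topics, consumer groups etc. specific to this module ---
consumer_topic = os.environ.get('ANU_MODULE_INPUT_TOPIC', 'module-input-v1')
consumer_group = os.environ.get('ANU_MODULE_CONSUMER_GROUP', 'module-consumer-group')

# --- Datastore Configs: Mongo in most Anuvaad modules ---
mongo_db = os.environ.get('MONGO_DB_IDENTIFIER', 'module-db')
mongo_collection = os.environ.get('MONGO_COLLECTION', 'module-jobs')

# --- Module Configs: everything else the module needs ---
max_batch_size = int(os.environ.get('ANU_MODULE_MAX_BATCH_SIZE', 25))
```

Every other file in the module then imports these names from the config file rather than reading os.environ directly.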

Having said that:

  1. Please ensure the folder structure is perfectly maintained.

  2. Never check in sensitive data like AWS keys, passwords, PII etc. in the config file; always erase/mask/encrypt them before pushing to GitHub.

  3. You can check this file for reference:

Tokenizer

The Tokenizer submodule in Anuvaad is designed to break down paragraphs into sentences or words, facilitating efficient preprocessing and accurate translations. This submodule is integral for preparing text data for subsequent processing and translation tasks.

Key Features

  • Paragraph to Sentence Tokenization: Splits paragraphs into individual sentences, making the text easier to process.

  • Sentence to Word Tokenization: Breaks down sentences into individual words for detailed analysis and translation.

  • Document-Specific Handling: Manages document-specific symbols and special characters to ensure consistency and accuracy in tokenization.

  • Flexible Integration: Can be invoked independently as a standalone service or as part of a larger workflow through the Workflow Manager.
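The first two features can be illustrated with a naive, regex-based sketch. This is not Anuvaad's tokenizer code; the real module additionally handles document-specific symbols, abbreviations and special characters:

```python
import re

def tokenize(paragraph):
    """Split a paragraph into sentences, and each sentence into words.

    Sentence boundaries here: ., !, ? or the Devanagari danda (U+0964).
    """
    sentences = [s.strip()
                 for s in re.split(r'(?<=[.!?\u0964])\s+', paragraph.strip())
                 if s.strip()]
    # Word tokenization: split each sentence on whitespace.
    words = [re.findall(r'\S+', s) for s in sentences]
    return sentences, words
```

For instance, `tokenize("This is one. This is two!")` yields two sentences, each broken into its words.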

Usage

The Tokenizer can be utilized in two main ways:

  1. Independent Invocation:

    • As an independent service, the Tokenizer can be directly called to process text data. This is useful for isolated tasks where only tokenization is required.

  2. Workflow Manager Integration:

    • Within the Workflow Manager, the Tokenizer works as a part of the broader document processing and translation pipeline. This integration allows for seamless interaction with other Anuvaad submodules, ensuring smooth and efficient data flow.

Code Repository

You can find the code for the Tokenizer submodule in the Anuvaad repository at the following link:

API Contract

For detailed information about the API endpoints and their usage, refer to the API contract available at:

By employing the Tokenizer submodule, Anuvaad ensures that text data is meticulously prepared, contributing to the overall accuracy and efficiency of the document processing and translation workflow.


Document converter

This microservice generates the final document after translation and digitization. It currently supports pdf, txt and xlsx document generation.

  • API Contract:

  • Code:

Setting up Anuvaad on your own

Follow these steps to set up the Anuvaad Web Application on your local machine:

Frontend

  1. Clone the Repository:

     git clone https://github.com/project-anuvaad/anuvaad.git

  2. Navigate to the Project Directory:

     cd anuvaad/anuvaad-fe/anuvaad-webapp

API Host Endpoints

All APIs mentioned in this documentation should be called against the host URL below.

API endpoint: https://users-auth.anuvaad.org

Registration

This is the very first step: users should register at https://users.anuvaad.org. Anuvaad will send a verification email; please verify the email before making any API calls. Registration is a one-time activity.


Modules

Converter Module

DocumentConverter

API to create digitized txt & xlsx files for Translation Flow. RBAC enabled.

Mandatory parameters: record_id, user_id, file_type

Actions:

  • Validating input params as per the policies

  • Page data is converted into dataframes

  • Writing the data into file and storing them on Samba store
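The first action, validating input params, can be sketched as below (a hypothetical illustration; the actual policy checks in the service may differ):

```python
def validate_converter_input(body):
    """Check the mandatory DocumentConverter parameters.

    The Translation Flow converter produces txt and xlsx files, so
    file_type is restricted to those two values in this sketch.
    """
    mandatory = ("record_id", "user_id", "file_type")
    missing = [key for key in mandatory if not body.get(key)]
    if missing:
        return {"ok": False, "error": "missing mandatory params: " + ", ".join(missing)}
    if body["file_type"] not in ("txt", "xlsx"):
        return {"ok": False, "error": "unsupported file_type: " + body["file_type"]}
    return {"ok": True}
```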

DocumentConverter CURL Request
curl --location --request POST 'http://localhost:5001/anuvaad-etl/document-converter/v0/document-converter' \
--header 'Content-Type: application/json' \
--data-raw '{ 
  "record_id":"A_OD10GV-IVRCU-1617009019569%7C0-16170090212740283.json", 
  "user_id":"d4e0b570-b72a-44e5-9110-5fdd54370a9d", 
  "file_type":"txt" 
}'

DocumentExporter

API to create digitized txt & pdf files on Document Digitization flow. RBAC enabled.

Mandatory parameters: record_id, user_id, file_type

Actions:

  • Validating input params as per the policies

  • Generating the docs using ReportLab

  • Writing the data into file and storing them on Samba store

DocumentExporter CURL Request

curl --location --request POST 'http://localhost:5001/anuvaad-etl/document-converter/v0/document-exporter' \
--header 'Content-Type: application/json' \
--data-raw '{ 
  "record_id":"A_OD10GV-IVRCU-1617009019569%7C0-16170090212740283.json", 
  "user_id":"d4e0b570-b72a-44e5-9110-5fdd54370a9d", 
  "file_type":"txt" 
}'

Setting up Anuvaad on your own (continued)

Frontend

  3. Install Dependencies:

     npm install

     or

     yarn install

  4. Environment Variables: Create a .env file in the root directory of the project and configure the necessary environment variables. You can use the .env.example file as a reference.

  5. Start the Development Server:

     npm start

     or

     yarn start

  6. Access the Application: Once the development server has started, you can access the application by navigating to http://localhost:3000 in your web browser.

Additional Commands

  • Build the Application:

     npm run build

    or

     yarn build

  • Run Tests:

     npm test

    or

     yarn test

Backend

General Guidelines:

  1. Clone the repo and go to the module-specific directory.

  2. Run pip3 install -r requirements.txt.

  3. Make the necessary changes to the config files with respect to MongoDB and Kafka.

  4. Run python3 src/app.py.

Alternatively, modules can be run by building and running Docker images. Make sure configs and ports are configured as per your local setup.

  • Build Docker Image:

     docker build -t <service-name> .

  • Run Docker Container:

     docker run <service-name>

Note: Apart from this, the Docker images running in the user's environment can be found on the Anuvaad Docker Hub.

    Playbook

    This page will help the user to get themselves onboarded on Anuvaad and perform an operation.

    Account Creation

    The automated onboarding process is disabled for now to restrict resource usage. For the time being, please fill out the details here and we will reach back to you soon with login details (remember to check the spam folder as well). Please send a mail in case you don't receive any response within a day.

    Users shall onboard themselves on Anuvaad via the registration link: https://users.anuvaad.org/user/signup#

    Once a user reaches the Sign Up page, they have to fill in the required details as shown below:

    Upon successful submission, an E-Mail will be sent to the registered ID with a verification link

    Please check the spam folder as well for authentication E-Mail

    Clicking the verification link will redirect the user to the login page as shown below

    For security purposes, Anuvaad follows an OTP-based login mechanism. You will be asked for the E-Mail to which a one-time password will be sent.

    Upon confirming the ID, you will be asked to opt for one of the authentication methods.

    Note that the selected authentication method could be changed later on as well.

    Everything discussed above is a one-time process and applies only on the initial login to the application. Going forward, you will be redirected to the OTP verification page upon successful login.

    Upon providing the correct OTP, the user will reach the below landing page of Anuvaad and we are good to go!

    Translate Sentence

    The translate sentence feature enables the user to input a text and instantaneously get its translation in another language. To use it, simply click on the Translate Sentence option on the landing page of Anuvaad.

    The user will have to select the source language, in which the input is provided, and the target language into which the text must be translated.

    If an Indic language is selected as the source, by default Transliteration feature will be enabled, assisting the user to type in that particular language from the normal keyboard with ease.

    Upon clicking submit, the translated sentence will be displayed along with the model used to perform the translation. Using this feature, a stakeholder can quickly check the accuracy of the translation performed by Anuvaad.

    Digitize Document

    Digitize Document feature helps to convert scanned documents into editable digital format by preserving the structure. This process recognizes text in scanned (non hand-written for now) documents and converts it into searchable text.

    To perform a document digitization, click on the Digitize Document option on the landing page of Anuvaad. You will be greeted with a screen as below

    The user may select a document/image, choose the appropriate source language, and then trigger the digitization process. A pop-up window appears showing the progress of the ongoing process. This is an async job and happens in the background; the time taken depends on the nature of the uploaded file.

    The status of all tasks given by the user to Anuvaad can be viewed on the Digitization Dashboard as below.

    If the status of a job is completed, you can view the result and make changes by clicking on the view document icon, which is second in the last column under the label Action.

    Users can make changes, if any, by double-clicking on a word. Once done, the digitized document can be downloaded in the desired format by clicking the download button in the top right corner.

    Anuvaad digitization works well on documents that have non-selectable yet printed content. It has mostly been tested on scanned files.

    Translate Document

    The translate document feature enables the user to upload a document and get its translated version. The key highlight of the feature is that Anuvaad tries to maintain the original structure of the document in the best possible manner.

    To perform a document translation, click on the Translate Document option on the landing page of Anuvaad. You will be greeted with a screen as below

    Pro tip: If the document to be translated does not contain unicode fonts, please perform document digitization and then translate the digitized document.

    Here, the user will have the provision to upload a document (pdf, docx and pptx formats are supported as of now) and select a source and target language to perform the translation. On successful upload of a supported file, the process begins and the status will be shown.

    There are various stages happening behind the scenes of document translation and the status will be displayed on screen. The total time taken to complete the process depends on the number of pages and the structure of the input document. The translation is an async process and once initiated it will be performed in the background. Users can keep on adding more and more tasks to Anuvaad meanwhile.

    After a certain time, by going to the Translation Dashboard, the user can access the translated document.

    Users can make necessary changes to the document using Anuvaad's easy editor, which was developed with document translators' needs at the forefront, and later download the translated document back to their system in the desired format.

    The blue icon in the bottom right corner opens the merge feature; it can be used when two or more text blocks need to be combined into a single unit.

    Glossary Support

    Being an AI-assisted translation software, some occurrences can happen where a machine translation can give meaningful, yet non-contextual output to certain words/phrases. In order to work around this, Anuvaad offers user-level Glossary support.

    A translator can store certain phrases and their predefined translations so that when a similar phrase occurs in a document being translated, the system acts based on the predefined entries.

    The list of default and user-added glossaries can be viewed on the My Glossary page.

    To delete a glossary, simply click on the bin icon under the Action column. To add an extra item, click on the Create Glossary button in the top right corner. You will be redirected to a page where new words/phrases and their corresponding translations can be added. Once submitted, the new items will be visible on the My Glossary page.

    Analytics

    Anuvaad also offers a built-in Analytics feature to keep track of usage metrics. These Analytics are instance-specific. Bar graph-based representations offer quick insights into how well Anuvaad is utilized. This also helps stakeholders to keep track of the number of Documents processed, Languages used, sentences translated, and organization-level information. Furthermore, these data could be exported in the desired format for future reference.

    All macro-level features to enhance translation speed are explained in detail in the Tutorial videos section.

    Repository structure and developers guide

    The project Anuvaad repository serves as the primary codebase for the Anuvaad project, aimed at facilitating document processing and translation tasks efficiently.

    Purpose of Folders

    • anuvaad-api: Houses standalone APIs utilized within the project, such as login and analytics functionalities.

    • anuvaad-fe: Contains frontend-related code, responsible for the user interface and interaction aspects of the application.

    • chrome-extension: Hosts code relevant to the Anuvaad Chrome extension, offering additional features and integrations within the Chrome browser environment.

    • anuvaad-nmt-inference [legacy]: Previously held legacy OpenNMT Python-based inference code. Deprecated and not actively utilized within the current project framework.

    • anuvaad-etl: Comprises sub-modules dedicated to document processing tasks, enhancing the extraction, transformation, and loading capabilities within the Anuvaad ecosystem.

    Microservice Structure

    As an application, the Workflow Manager, in conjunction with independent APIs, forms the foundational architecture of Anuvaad. The Workflow Manager facilitates communication among various modules and orchestrates their interactions. However, Anuvaad's design accommodates diverse use cases, allowing each module to operate autonomously when necessary. For instance, the Tokenizer service can function independently to tokenize an Indic sentence without reliance on other modules.

    Components of Each Microservice

    Each microservice within Anuvaad adheres to a consistent structure, comprising the following common elements:

    • Dockerfile: Provides instructions to build the individual microservice within a Docker container, ensuring portability and consistency across different environments.

    • docs Folder: Contains documentation outlining the API contracts necessary for running and testing the module independently. This documentation serves as a reference for developers and users alike.

    • config Folder: Stores module-specific configurations and secrets required for the proper functioning of the microservice. Centralizing configuration management simplifies deployment and maintenance tasks.

    • kafkawrapper: Defines Kafka/WFM (Workflow Manager) related communication protocols, facilitating seamless integration and communication between modules. In the production environment, the Workflow Manager plays a crucial role in establishing communication channels, rendering standalone APIs redundant.

    Anuvaad Translator

    Overview

    This document provides details about the translator service used in Anuvaad. Translator is a wrapper over NMT: it sends a document to NMT sentence by sentence for translation.

    Getting Started

    Translator receives input from the Tokeniser module: a JSON file that contains tokenised sentences. These sentences are extracted from the JSON file and sent to NMT over Kafka for translation. NMT expects a batch of ‘n’ sentences per request, so Translator creates ‘m’ batches of ‘n’ sentences each and pushes them to the NMT input topic. In parallel, it listens to NMT's output topic to receive the translations of the batches sent. Once all ‘m’ batches are received back from NMT, the translation of the document is marked complete.
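    The batching step described above amounts to splitting the tokenised sentences into m = ceil(total / n) batches. A minimal sketch (the real Translator pushes each batch to a Kafka topic rather than returning a list):

```python
def make_batches(sentences, n):
    """Split tokenised sentences into batches of at most n each.

    Translator would push each batch to the NMT input topic and mark
    the document complete once all batches return on the output topic.
    """
    return [sentences[i:i + n] for i in range(0, len(sentences), n)]
```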

    Next, Translator appends these translations back to the JSON file received from the Tokeniser. This JSON, now enriched with a translation against every sentence, is pushed to Content Handler via an API, and Content Handler stores the translations.

    TMX:

    TMX is the system translation memory. A user can override Anuvaad's translation of a text/sentence by inserting ‘preferred translations’ into the system. TMX is backed by a Redis store which hashes and stores user-specific translations for a text; it can be thought of as the user's personal cache of translations.

    TMX provides three levels of caching: Global, Org, User.

    Global Level:

    This is a global bucket of preferred translations where an ADMIN or a Global-level user can feed in translations, which will be applied across all users and orgs.

    Org Level:

    This is an Org-level bucket where Anuvaad translations are overridden by preferred translations only for users belonging to a particular organisation. Any ADMIN or Org-level user can feed in these translations to be applied across the users of his/her org.

    User Level:

    This is a User-level bucket where a user can feed in his/her preferred translations; the system will override Anuvaad translations only for that particular user.

    TMX can be uploaded sentence by sentence or in bulk, both APIs are supported.

    Example: see the API details in the Postman collection: https://www.getpostman.com/collections/677974f4cfe1c3e119fb
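    The three-level lookup can be sketched with a dictionary standing in for the Redis store (the key scheme and method names here are hypothetical, not the actual TMX API):

```python
class TMX:
    """User > Org > Global preferred-translation lookup (sketch only)."""

    def __init__(self):
        self.store = {}  # stands in for the Redis hash store

    def insert(self, text, translation, level, user_id=None, org_id=None):
        # level is 'user', 'org' or 'global'
        self.store[(level, user_id, org_id, text)] = translation

    def lookup(self, text, user_id, org_id):
        # The most specific bucket wins: User, then Org, then Global.
        for key in (('user', user_id, None, text),
                    ('org', None, org_id, text),
                    ('global', None, None, text)):
            if key in self.store:
                return self.store[key]
        return None
```

Bulk upload would simply call `insert()` once per entry.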

    UTM:

    UTM is User Translation Memory. It is slightly different from TMX: there are no levels here; it is purely a translation cache. The system remembers the user's translation and applies it automatically when it encounters the same sentence for the same user.

    Let's say we have a sentence ‘S1’ in a document ‘D1’, and Anuvaad's translation of this sentence is ‘T1’. Suppose the user, on encountering this, changes the translation of ‘S1’ from ‘T1’ to ‘T2’. Anuvaad remembers this, so that in any document, say ‘D2’, whenever ‘S1’ appears and NMT translates it to ‘T1’, Anuvaad automatically overrides the translation to ‘T2’. However, if NMT has improved with time and now translates ‘S1’ to ‘T3’, Anuvaad doesn't override it, because the user context was ‘S1’ → ‘T1’ and not ‘S1’ → ‘T3’.
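    The S1/T1/T2/T3 behaviour above can be captured in a few lines (a sketch; the real UTM persists this cache per user rather than holding it in memory):

```python
class UTM:
    """User Translation Memory: (user, source, nmt_output) -> user's fix.

    The override fires only when NMT reproduces the exact translation
    the user originally corrected, preserving the user context.
    """

    def __init__(self):
        self.memory = {}

    def remember(self, user, source, nmt_translation, user_translation):
        self.memory[(user, source, nmt_translation)] = user_translation

    def apply(self, user, source, nmt_translation):
        # Return the remembered fix, or the NMT output unchanged.
        return self.memory.get((user, source, nmt_translation), nmt_translation)
```

With ('S1', 'T1') remembered as 'T2', `apply()` returns 'T2' whenever NMT still says 'T1', but leaves a new 'T3' untouched.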

    Modules

    Source Code: https://github.com/project-anuvaad/anuvaad/tree/master/anuvaad-etl/anuvaad-translator

    Login and auth token


    Before making any other API call, the application should call this API first to receive authorization tokens. Once an application has a valid token, the same can be used for all subsequent calls. Please note the token can expire, so it is good practice to validate the token.

    API endpoint | Description | API contracts

    /v1/users/login | To log the user in; takes user email and password. | /v1/users/login contract

    /v1/users/auth-token-search | To check the validity of the token. | /v1/users/auth-token-search contract

    Analytics


    When Anuvaad is implemented at an organizational level, analytics is crucial for tracking usage and metrics. A dedicated module exists to serve this purpose.

    Code Repository: Anuvaad Metrics

    API Contract: Metrics API Contract

    Every X hours, a cron job creates a CSV file, and the time-consuming analytics computations are drawn from it. Other metrics are fetched directly from the database. The following analytics are currently available, with room for more metrics to be visualized:

    1. Total Documents Translated, Language-wise

    2. Organization-wise Sentences Translated

    3. Organization-wise Dashboard

    4. Reviewer Metrics

    FAQ

    Frequently asked questions about Anuvaad

    1. How to host Anuvaad locally?

    Setting up the whole application locally is not recommended, as there are 10+ modules, some of which are resource-demanding. However, individual modules can be run locally for development purposes.

    General Guidelines:

    • Clone the repository and navigate to the module-specific directory.

    • Run pip3 install -r requirements.txt.

    • Make necessary changes to config files with respect to MongoDB and Kafka.

    • Run python3 src/app.py.

    Alternatively, modules can be run by building and running Docker images. Ensure configs and ports are configured as per your local setup:

    • docker build -t <service-name> .

    • docker run <service-name>

    Apart from this, the Docker images running in the user's environment can be found on the Anuvaad Docker Hub.

    2. How to make a contribution?

    Fork the repo, make the necessary changes, and create a PR. We will review and merge it. Post queries in the discussions/issues section.

    3. How to contact maintainers for credentials?

    Check discussions or reach out to [email protected]

    4. Can I use individual modules of Anuvaad rather than the whole application?

    Yes, refer to the documentation and KT of the specific module.

    5. How to use Anuvaad features from my code?

    Refer to the API specifications.

    6. I need assistance in setting up Anuvaad in my organization's infrastructure. Who can help?

    Reach out to [email protected] or feel free to raise a request in the discussions section.

    7. Are there any videos of Anuvaad usage and codebase available?

    Yes, they are available here: https://anuvaad.sunbird.org/engage/kt-videos

    NMT

    The NMT module is responsible for the translation of sentences. It can be invoked directly or via the Workflow Manager. The NMT module works in correlation with the ETL Translator to enhance translation efficiency based on previous translations or pre-provided glossary and TMX support (refer to other sections). The module supports batch inferencing and provides APIs that return model details for language and other dropdown menus.

    Initial Version

    In the early days of Anuvaad, OpenNMT-py based models trained on Anuvaad's proprietary data were used. These models were primarily focused on judicial content. The inference code for this initial version is available here: OpenNMT-py Inference Code.

    Intermediary Version

    Through the collaboration between Anuvaad and AI4Bharat, data from Anuvaad and other sources were used to publish the Samanantar paper (https://arxiv.org/abs/2104.05596). Using the Samanantar dataset, IndicTrans, a more general-domain model, was trained. This model performed well for legal use cases, leading to the replacement of OpenNMT with IndicTrans. The IndicTrans-based inferencing code is available here: IndicTrans Inference Code.

    Current Version

    As the Sunbird ecosystem developed, hosting multiple ML models independently became resource-intensive. This led to the development of Dhruva, a centralized platform for hosting models. Applications can now utilize models from Dhruva using APIs. In Dhruva, models are wrapped with NVIDIA Triton, facilitating a scalable architecture. The IndicTrans model was moved to Dhruva, and currently models are invoked from Dhruva via wrapper APIs from the NMT module rather than using dedicated inference. The Dhruva-ported code is available here: Dhruva Ported Code.


    Git branching strategies

    Anuvaad follows the standard feature-master type of branching strategy for code maintenance. The releases happen through the master branch via release tags.

    Branches

    Translate texts

    It is recommended to translate batches of sentences in order to get high throughput.

    API endpoint
    Description
    API contracts
    kafkawrapper: Defines Kafka/WFM (Workflow Manager) related communication protocols, facilitating seamless integration and communication between modules. In the production environment, the Workflow Manager plays a crucial role in establishing communication channels, rendering standalone APIs redundant.
    Anuvaad Docker Hub
    Ai4Bharat
    IndicTrans
    IndicTrans Inference Code
    Dhruva
    Dhruva Ported Code

    /v1/users/auth-token-search contract

    /v1/users/login contract
    https://users.anuvaad.org/user/signup#
    Digitization Dashboard
    Translation Dashboard
    my glossary
    Create Glossary
    my glossary
    https://users.anuvaad.org/user/signup#
    login screen
    E-Mail screen
    choose Authentication method
    OTP based login
    Landing page
    Quick translate feature
    upload file
    progress screen
    Digitization dashboard
    upload screen
    progress screen
    List of available Glossaries
    Add a new Glossary
    https://www.getpostman.com/collections/677974f4cfe1c3e119fb
    API Details Postman
    https://github.com/project-anuvaad/anuvaad/tree/master/anuvaad-etl/anuvaad-translator
    Total Documents Translated, Language-wise in SUVAS
    Organization-wise Sentences Translated in SUVAS
    SUVAS Reviewer Metrics
    SUVAS Organization-wise Dashboard
    Feature Branches

    Feature branches are a set of branches owned by individual developers in order to work on specific tasks. These branches are forked out of the master branch and they eventually feed into the same master branch once the code for that particular use case is developed and tested. These branches can either be deleted right after merging to master or can be retained to be reused for other use cases.

    Feature branches can ONLY be deployed in the ‘Dev’ environment. The ‘Dev’ environment is a dedicated VM for the developers to test their code. Once the code is dev-tested, it must be merged to the ‘develop’ branch which further feeds into the ‘master’ branch.

    Develop Branch

    The ‘Develop’ branch is a mirror branch to the master branch. This branch is dedicated for QA/UAT testing. All feature branches must feed into this branch before the use-case is sent for QA testing and at times UAT if needed. This branch will also act as a backup in case there’s something wrong with the master branch.

    The develop branch can ONLY be deployed to the ‘QA’ environment. This is a dedicated environment for the QAs to perform unit, regression, and smoke testing of the features and the app as a whole. This environment can also be used for UAT purposes. Once there’s a QA signoff on the features, this will directly feed into master.

    Master Branch

    The master branch is the main branch from which all releases happen. All features, once dev-tested and QA-tested, will feed into master via the develop branch. The master branch is from where the code is deployed to production. Every release to production from the master branch will be tagged with the specific version of that release.

    In case of production issues, we can fallback to any of the previous stable releases.

    Hotfix Branches

    Hotfix branches are temporary branches which are forked directly from the ‘master’ branch and will feed back into the master only. These are for special cases when there’s a production bug to be resolved, and the develop branch is at the (n+m)th commit and master at (n)th commit.

    These branches will act as temporary mirror branches for the master branch and can be tested on the QA env. Once tested and merged back to master, these branches have to be deleted. After the merge, the develop branch will have to be rebased with master, and the features will have to be rebased with the develop branch. The commits will flow upstream only after a rebase is successfully completed on all the forks.

    Apart from the feature branches, individual devs will also own these branches.

    Code Check-in

    • Feature Branches: Code check-in to feature branches can be done by anyone; there’s no need for a review as such. These branches are mainly for the devs to test their code. The use case developed in this branch will have to be dev-tested on the ‘Dev’ environment before a merge request to the ‘develop’ branch is raised.

    • Develop Branch: Code check-in to the develop branch should only happen after a Peer Review. Merge to develop will only happen once the code is dev-tested on the Dev environment. It should be noted that a merge to develop should ensure that the code quality is up to the mark, all standards are followed, and it doesn’t break anything that is already merged to the develop branch by other devs. QA testing must happen on this branch deployed in a dedicated environment for QA/UAT. Any bugs reported will be fixed in the feature branch, reviewed, and then merged back to the develop branch. QA signoff happens on this branch.

    • Master Branch: Code check-in to the master branch will only happen from the develop branch and NO feature branches. Any merge to the master branch apart from the hotfix branch MUST come from the develop branch only. Merge from develop to master should happen only after an extensive code review from the leads. Only a select few members of the team will have access to merge the code to the master branch. The onus of the master branch is on the Technology leads of the team. Once the code is merged to master, a final round of regression testing must take place before the code is tagged for release.

    • Hotfix Branch: Code check-in to the hotfix branch can be done by individual devs once it is reviewed by a peer and the leads. This branch feeds into the master only after a second round of review. QA must happen on the hotfix branch before it is merged to master. The merge to master must also be released only after regression testing is done on the fix.

    /aai4b-nmt-inference/v1/translate

    Translate batches of sentences.

    /aai4b-nmt-inference/v1/translate contract
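The linked contract defines the exact request fields. As a rough sketch of the batching recommendation (the field names `src_list` and `model_id` below are illustrative assumptions for this sketch, not the published contract), a client might chunk its sentences before posting:

```python
# Sketch: batching sentences before calling /aai4b-nmt-inference/v1/translate.
# Payload field names ("src_list", "model_id") are assumptions for illustration;
# refer to the API contract linked above for the actual schema.

def make_batches(sentences, batch_size=25):
    """Split a sentence list into fixed-size batches for higher throughput."""
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]

sentences = [f"Sentence {n}" for n in range(60)]
batches = make_batches(sentences, batch_size=25)

# Each batch would become one request body, e.g.:
payloads = [{"model_id": 103, "src_list": [{"src": s} for s in batch]}
            for batch in batches]
```

Sending a few large batches rather than one request per sentence amortizes per-request overhead, which is why batch translation is recommended.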

    Anuvaad Zuul Gateway System

    Overview

    This document explains how Netflix Zuul is used as an API Gateway in Anuvaad to perform authentication, authorization, and API redirection for all inbound API calls.

    Getting Started

    Zuul is an API Gateway developed as an open-source project by Netflix. It abstracts out some of the common operations of the system and provides a strong layer for authentication, authorization, API pre- and post-hooks, API throttling, session monitoring, and much more. Zuul is an edge service that proxies requests to multiple backing services: it provides a unified “front door” to the system, allowing a browser, mobile app, or other user interface to consume services from multiple hosts without managing cross-origin resource sharing (CORS) and authentication for each one.

    Zuul in Anuvaad

    Zuul in Anuvaad is a config-driven implementation in which APIs, roles, and role actions are read by Zuul from a file stored in a remote repository.

    Roles

    The set of roles defined in the system. Roles are attached to users and also mapped to APIs to provide role-based access control (RBAC).

    Actions

    The set of APIs exposed in the system by various microservices. Each action is an API that is mapped against roles. APIs are of two types: Open APIs and Closed APIs. Open APIs can be accessed without authentication; in other words, these APIs are whitelisted. Closed APIs can only be accessed after auth checks.

    role-actions:

    Mapping between roles and actions; Zuul uses this to decide whether a user should be allowed to access a particular API. These configs can be found here:
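Purely as an illustration (the real schema lives in the config repository linked above, and all names here are hypothetical), a role-action mapping of this kind could look like:

```json
{
  "roles": ["TRANSLATOR", "ANNOTATOR", "ADMIN"],
  "actions": [
    {"uri": "/anuvaad-etl/wf-manager/v1/workflow/async/initiate", "type": "closed"},
    {"uri": "/v1/users/login", "type": "open"}
  ],
  "role-actions": {
    "TRANSLATOR": ["/anuvaad-etl/wf-manager/v1/workflow/async/initiate"]
  }
}
```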

    Authentication in Anuvaad:

    Anuvaad uses JWT auth tokens for authentication and authorization purposes. The same token is also used as the session ID. These tokens are generated and stored securely by the UMS (user management) system. Example:
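A JWT is a three-part, base64url-encoded token of the form `header.payload.signature`. The sketch below fabricates a sample token for illustration only (real Anuvaad tokens are issued by the UMS) and decodes its payload with the standard library:

```python
import base64
import json

# Fabricated JWT for illustration only: header.payload.signature, each
# segment base64url-encoded. Real Anuvaad tokens are issued by the UMS.
def b64url(obj):
    raw = base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=")
    return raw.decode()

token = ".".join([
    b64url({"typ": "JWT", "alg": "HS256"}),
    b64url({"userName": "user@example.com", "exp": 1616571237}),
    "fake-signature",
])

def decode_payload(jwt_token):
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

claims = decode_payload(token)
```

Note that this only *inspects* the claims; verifying the signature (as the gateway's Auth filter must) additionally requires the signing secret.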

    Source Code

    Anuvaad Zuul uses 3 pre-filters, namely Correlation, Auth, and Rbac. Correlation: filter to add a correlation ID to the inbound request. Auth: filter to perform the authentication check on the inbound request. Rbac: filter to perform the authorization check on the inbound request. The API redirection configuration is provided in the application.properties file.

    File translator

    This microservice exposes multiple APIs to transform the data in a file into a JSON file and to download the translated files of type DOCX, PPTX, and HTML.

    Modules

    Workflow Code

    WF_A_FTTKTR

    Steps:

    1. Transformation Flow

      • Use the data in DOCX, PPTX, or HTML file to create a JSON file.

    2. Tokenizer Flow

      • Read the JSON file created in the Transformation Flow and tokenize each paragraph.

    WF_S_FT

    Steps:

    1. WF_A_FTTKTR Flow

      • This flow must be completed before calling the download flow.

    2. Download Flow

      • Fetch content for the file, replace original sentences with translated ones, and download the file.

    Through WF Manager

    Transform Flow

    Mandatory Params for File Translator:

    • Path

    • Type

    • Locale

    Actions:

    1. Validate input parameters.

    2. Generate a JSON file from the data of the given file.

    3. Convert the given file to HTML, PDF, and push it to S3 (For showing it on UI).

    4. Get the S3 link of the converted file and call content handler API to store the link.

    Transform Flow CURL Request

    Download Flow

    Mandatory Params for File Translator:

    • Path

    • Type

    • Locale

    Actions:

    1. Validate input parameters.

    2. Call fetch-content to get the translation of the file passed in the param.

    3. Replace the original text in the file with the translated text.

    4. Return the path of the translated file.

    Download Flow CURL Request

    Model Retraining

    Briefly explains about retraining a translation model to accommodate a domain-specific usecase.

    The production environment of Anuvaad runs on top of translation models trained in the general domain, which covers a good number of scenarios. However, if we need a separate instance of Anuvaad to translate domain-specific data (e.g., financial or biomedical), the existing model must be fine-tuned with more relevant data in that particular domain to improve translation accuracy. This page briefly summarises how this can be done.

    Data Collection

    A bilingual, or parallel, dataset is required for training the model. It is simply the same sentence pair in both the source and target languages. Example:

    Source(en): India is my country

    Target(hi): भारत मेरा देश है

    The more data that is available, the more accurately the model can be trained. In short, data collection can be done using one of the following three approaches:

    Manual Annotation by linguistic experts

    This is labour-intensive but the best approach. At least a small sample of the dataset must be manually curated and used for validation purposes.

    Creation of Corpus by web and document crawling

    Certain websites will have the same data in multiple languages. The idea is to find matching pairs of sentences from them. Scraping can be done using frameworks such as Selenium, and sentence matching can be done using techniques such as LaBSE. This method, if used properly, can produce huge amounts of data without much manual effort; however, random manual verification is recommended to ensure data accuracy.

    A lot of sample crawlers for reference are available in this repo.

    To do sentence matching of scraped sentences, the Anuvaad Aligner can also be used; it is implemented using LaBSE. Its specs are available here.

    Purchasing or using an open-sourced dataset

    Very often, datasets are made available by research institutes or private vendors. Such data can also be included to increase the quantity of training data.

    Data cleaning & formatting

    The raw data that is purchased or web-scraped may contain a lot of noise, which can affect training accuracy and thereby translation quality. Noise includes unwanted characters, blank spaces, bullets and numbering, HTML tags, etc.

    The basic script for sentence alignment, cleaning, and formatting is available here.

    However, more rules for cleaning could be applied based on context and manual verification of raw data as per the scenario.
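As a minimal sketch of the kind of rules involved (illustrative only; the linked script is the reference implementation), a cleaning pass might strip HTML tags, leading bullets or numbering, and extra whitespace:

```python
import re

def clean_sentence(text):
    """Apply a few illustrative noise-removal rules to a raw scraped sentence."""
    text = re.sub(r"<[^>]+>", " ", text)                   # drop HTML tags
    text = re.sub(r"^\s*(\d+[\.\)]|[-•*])\s*", "", text)   # leading numbering/bullets
    text = re.sub(r"\s+", " ", text)                       # collapse whitespace
    return text.strip()

raw = "  1. <b>India</b> is   my country  "
print(clean_sentence(raw))  # -> "India is my country"
```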

    Model retraining

    The present default model of Anuvaad is IndicTrans. The instructions to retrain and benchmark an IndicTrans model are explained here.

    The training repo of the legacy OpenNMT-py models is available here.

    Once a model is retrained, if there are plans to open-source it, hosting it in Dhruva will facilitate seamless integration with Anuvaad.

    Architecture

    Architecture of Anuvaad

    The architecture is built around two major blocks:

    • Document Digitization

    • Document Translation

    Aligner

    The Aligner module is designed for “aligning” or finding similar sentence pairs from two lists of sentences, preferably in different languages. The Aligner is a standalone service that cannot be accessed from the UI as of now. The service is dependent on the file uploader and workflow manager (WFM) services.

    The Aligner service is based on Google’s LaBSE model and FB’s FAISS algorithm. It accepts two files as inputs, from which two lists of sentences are collected. LaBSE Embeddings are calculated for each of the sentences in the list. Cosine similarity between embeddings is calculated to find meaningfully similar sentence pairs. The FAISS algorithm is used to dramatically speed up the whole process.

    Simplified implementation:

    The service accepts two text files, and the aligner module can ideally be invoked using WFM. It is time-consuming and hence an async service. Once the run is fully done, a WFM-based search can be conducted using the job ID to obtain the result.
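The core similarity step can be sketched as below, with tiny toy vectors standing in for the high-dimensional LaBSE sentence embeddings (in the real service, FAISS performs this nearest-neighbour search at scale):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy stand-ins for LaBSE embeddings of two sentence lists (values invented).
source_vecs = {"India is my country": [0.9, 0.1, 0.2]}
target_vecs = {"भारत मेरा देश है": [0.88, 0.12, 0.25],
               "unrelated sentence": [0.05, 0.9, 0.1]}

# For each source sentence, pick the target with the highest similarity.
pairs = {s: max(target_vecs, key=lambda t: cosine(v, target_vecs[t]))
         for s, v in source_vecs.items()}
```

Because a brute-force comparison of every source against every target is quadratic, FAISS indexing is what makes the alignment practical for large files.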

    Technology Stack

    Technology Stack

    Component
    Details

    Block merger

    This microservice is used to extract text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.

    Architecture

    • It takes an image or pdf as an input.

    selenium
    LaBSE
    repo
    Aligner
    here
    here
    here
    here
    Dhruva
    https://github.com/project-anuvaad/anuvaad/tree/master/anuvaad-api/anuvaad-zuul-api-gw/dev-configs
    https://www.getpostman.com/collections/d91b48529bc5f0474617
    Source Code for Anuvaad Zuul
    application.properties
    Zuul repo

    Legacy

    Tokenization is a process where we extract all the sentences in a paragraph.

  • Translation Flow

    • Translate each sentence.

    curl --location --request POST 'https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate' \
    --header 'auth-token: AUTHTOKEN' \
    --header 'content-type: application/json' \
    --data-raw '{
       "workflowCode": "WF_A_FTTKTR",
       "jobName": "HTML FILE.html",
       "jobDescription": "",
       "files": [
           {
               "path": "f3cf11bd-c6b8-4ea2-9bd2-9828b9847c8a.html",
               "type": "html",
               "locale": "en",
               "model": {...},
               "context": "JUDICIARY",
               "modifiedSentences": "A"
           }
       ]
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/sync/initiate' \
    --header 'auth-token: AUTHTOKEN' \
    --header 'Content-Type: application/json' \
    --data-raw '{
       "workflowCode": "WF_S_FT",
       "jobName": "ch 2 communication skills.docx",
       "jobDescription": "",
       "files": [
           {
               "path": "A_FTTTR-RJjbi-1623847596274|DOCX-8f8c43a9-ac35-407f-874e-51d91be7f433.json",
               "type": "json",
               "locale": "en",
               "model": {...},
               "context": "JUDICIARY",
               "modifiedSentences": "A"
           }
       ]
    }'
    Components
    Component
    Details

    Workflow Manager(WM)

    Centralized Orchestrator based on user request.

    Auditor

    Python package/library used for standardized logging, formatting, and exception handling.

    File Uploader

    Microservice to upload and maintain user documents.

    File Converter

    Microservice to convert files from one format to another, e.g., .doc to .pdf.

    Aligner

    Microservice that accepts source and target sentences and aligns them to form a parallel corpus.

    Document Digitization Flow
    Block Diagram

    The response is typically a JSON file path, which can be downloaded using the download API. The JSON file is self-explanatory and it contains source_text, target_text, and the corresponding cosine similarity between them.
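For illustration only (the `source_text` and `target_text` keys are stated above; the score key name and values here are assumptions), one entry of the output JSON might look like:

```json
[
  {
    "source_text": "India is my country",
    "target_text": "भारत मेरा देश है",
    "cosine_similarity": 0.93
  }
]
```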

    Local Setup (Without WFM & Uploader)

    1. Clone the Repo

    2. Install dependencies

    3. Run the application

    4. Access from local:

    Aligner CURL Request
    curl --location --request POST 'http://127.0.0.1:5001/anuvaad-etl/extractor/aligner/v1/sentences/align' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "filepath": "/home/test.en",
            "locale": "en",
            "type": "json"
        },
        "target": {
    

    Remote (Invoked via WFM)

    Initiate Workflow

    It returns a JOB ID, which can be searched using the WFM Bulk search API to see job progress and pull out results once done.

    Search Bulk Workflow Jobs
    • WF_A_JAL is the Workflow code for JSON-based aligner, which returns the filepath of a JSON file that could be downloaded using the download API.

    • WF_A_AL is the old workflow code, that returns multiple txt files.

    Testing

    1. Upload two files.

    2. Call API endpoint with file paths as parameters.

    3. Verify if sentences are matching properly in the JSON.

    Notes

    • Can be used as an independent service by deploying file-uploader and aligner modules alone on a server, preferably GPU-based (tested working well on g4dn2xlarge).

    • Simplified implementations of the aligner could be found here.

    • An explanatory article could be found here and here.

    NGINX

    Serves as a redirection server and also takes care of system-level configs. NGINX acts as the gateway.

    Zuul

    API Gateway to apply filters on client requests and to authenticate, authorize, and throttle them.

    AI ML Assets

    Component
    Details

    PRIMA

    Layout detection model.

    CRAFT

    Used for line detection.

    Tesseract

    Custom-trained Tesseract used for OCR.

    IndicTrans2

    Custom-trained model used for translation.

    Dhruva

    Open-source platform for serving language AI models at scale.

    Apache Kafka

    Internal modules are integrated through Kafka messaging.

    MongoDB

    Primary data storage.

    Redis

    Secondary in-memory storage.

    Cloud Storage

    Samba storage is used to store user input files.

  • If the input is a PDF, it converts the PDF into images.
  • The pdftohtml tool is used to extract page-level information like text, word coordinates, page width, page height, tables, images, and others.

  • If the document language is vernacular, the pdftohtml tool does not work well, so we use Tesseract (or Google Vision, if required) for OCR.

  • Horizontal merging is used to get lines using word coordinates.

  • Vertical merging is used to get blocks using line coordinates.

  • The final JSON contains page-level information like page width, page height, paragraphs, lines, words, and layout class.

  • API Contract: here

  • Code location: here

  • Modules
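The horizontal-merging step described above can be sketched as grouping word boxes that fall in the same vertical band into lines (the coordinate fields and tolerance rule here are simplified assumptions, not the production logic; vertical merging into blocks follows the same idea using line coordinates):

```python
def merge_words_into_lines(words, y_tolerance=5):
    """Group word boxes into lines: words whose top y-coordinates are within
    y_tolerance of each other are taken to lie on the same text line."""
    lines = []
    for word in sorted(words, key=lambda w: (w["y"], w["x"])):
        if lines and abs(lines[-1][-1]["y"] - word["y"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, read words left to right.
    return [" ".join(w["text"] for w in sorted(line, key=lambda w: w["x"]))
            for line in lines]

words = [{"text": "is", "x": 120, "y": 52},
         {"text": "India", "x": 40, "y": 50},
         {"text": "country", "x": 60, "y": 90}]
```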

    API Details

    Local Testing

    URL: http://0.0.0.0:5001/anuvaad-etl/block-merger/v0/merge-blocks

    Input:

    Here it takes a PDF or image path and the language of that document as input.

    Workflow Initiate

    URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

    Input:

    Steps:

    1. Upload a PDF or image file using the upload API:

      Upload URL: https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file

    2. Get the upload ID and copy that to the path of wf-initiate input of the block merger.

    3. Do bulk search using jobIDs to get JSON ID of the BM service response:

      Bulk search URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/jobs/search/bulk

      Bulk search input format:

    4. Download JSON using download API:

      Download URL: https://auth.anuvaad.org/download/0-1640069280533983.json

    OCR Content handler

    This microservice is served with multiple APIs to handle and manipulate the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system. This service is functionally similar to the Content Handler service but differs since the output document (digitized doc) structure varies.

    Modules

    OCR Document Modules

    DigitalDocumentSave

    API to save translated documents. The JSON request object is generated from anuvaad-gv-document-digitizer and later updated by tokenizer. This API is being used internally.

    Mandatory parameters: files, record_id

    Actions:

    • Validating input params as per the policies

    • The document to be saved is converted into blocks of pages

    • Each block contains regions such as line, word, table, etc.

    • Every block is created with UUID

    DigitalDocumentSave CURL Request

    DigitalDocumentUpdateWord

    API to update the text in the digitized doc. RBAC enabled.

    Mandatory parameters: words, record_id, region_id, word_id, updated_word

    Actions:

    • Validating input params as per the policies

    • Looping over the regions to locate the word to be updated

    • Updating the word and setting a flag save=True

    DigitalDocumentUpdateWord CURL Request

    DigitalDocumentGet

    API to fetch back the document. RBAC enabled.

    Mandatory parameters: record_id, start_page, end_page

    Actions:

    • Validating input params as per the policies

    • Returning back the document as an array of pages

    DigitalDocumentGet CURL Request

    NMT Inference

    NMT Inference

    This module provides the NMT-based translation service for various Indic language pairs. Currently, the NMT models are trained using the OpenNMT-py framework (version 1); the model binaries are generated using the CTranslate2 module provided for OpenNMT-py, which is also used to generate model predictions.

    Data preparation

    NMT requires a parallel corpus between languages. Typically, the size of a language corpus is in the millions of sentence pairs. The corpus must have enough examples to cover various situations. This is one of the most important parts of the system and very time-consuming work, as data quality has to be checked to ensure translation accuracy. At Anuvaad, we have collected parallel corpora for 11 languages. The corpus is available under the MIT license.

    Training and Retraining

    Training and retraining is a continuous process, and training depends on the quality of the input dataset. We have to constantly monitor translation quality: translation mistakes should be used to generate new training examples, and retraining exercises have to be taken up periodically. Training cycles are a costly affair, as they need GPU infrastructure and long training hours.

    Model Evaluation

    The model output is evaluated on pre-selected sentences and the BLEU score is calculated. BLEU provides a score that serves as guidance on model quality. Translation output has to be evaluated by human translators as well before it can be used in a production environment.

    Architecture

    Anuvaad uses the current state-of-the-art Transformer model for target-sentence prediction, i.e., translation. The supporting code and paper are in the open-source domain.

    We are leveraging an open-source project called OpenNMT and also exploring FairSeq (IndicTrans) from the perspective of enhancement and usage. The deep learning platform used is PyTorch.

    MODEL TRAINING

    • Vocabulary or dictionary generation

      1. Tokenizer (and detokenizer): breaking a given sentence into words or sub-words (language-specific), using Moses or IndicNLP (for Indian languages).

      2. SentencePiece or subword-nmt; the supporting code and papers are in the open-source domain.

    Approaches used:

    • BPE (Byte Pair Encoding)

    • Unigram

    • Tune model parameters and hyperparameters to improve accuracy.
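As a toy illustration of the BPE idea named above, the most frequent adjacent symbol pair in a corpus is found and would then be merged into a new sub-word symbol; real pipelines use SentencePiece or subword-nmt for this:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of space-split symbol
    sequences and return the most frequent one (the next BPE merge)."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

# Toy corpus: each word is a sequence of characters plus an end-of-word marker,
# mapped to its frequency.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
pair = most_frequent_pair(corpus)  # the pair to merge in this iteration
```

Repeating this count-and-merge loop yields the sub-word vocabulary used during tokenization.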

    Modules

    • Opennmt-py based

    Training Repo

    API Details

    aai4b-nmt-inference (indicTrans)

    • Fairseq based

    Training Repo

    API Details

    Prerequisites

    • python 3.6

    • ubuntu 16.04

    Install various python libraries as mentioned in requirements.txt file

    APIs and Documentation

    Run app.py to start the service once all the packages are installed.

    For more information about the API documentation, please check here.

    License

    Auditor

    A Python package that provides standardized logging and error handling for the Anuvaad dataflow pipeline. This package serves features like session tracing, job tracing, error debugging, and troubleshooting.

    Installation

    Prerequisites:

    • Python 3.7

    Source code: GitHub Repository

    Command:

    Logging/Auditing

    This part of the library provides features for logging by exposing the following functions:

    Import file:

    Functions

    log_info

    Logs INFO level information.

    log_debug

    Logs DEBUG level information.

    log_error

    Logs ERROR level information. Should be used for logical errors like “File is not valid”, “File format not accepted” etc.

    log_exception

    Logs EXCEPTION level information. Should be used in case of exceptions like “TypeError”, “KeyError” etc.

    Notes

    • In all the functions, message and input-object are mandatory.

    • These functions build an object using these parameters and index them to Elasticsearch for easy tracing.

    • Ensure all major functions have a log_info call, all exceptions have log_exception calls, and all logical errors have log_error calls.

    Error Handling

    This part of the library provides features for standardizing and indexing the error objects of the pipeline.

    Import file:

    Functions

    post_error

    Returns a standard error object for replying back to the client during a SYNC call and indexes the error to an error index.

    post_error_wf

    Constructs a standard error object which will be indexed to a different error index and PUSHES THE ERROR TO WFM internally.

    Usage Notes:

    • Use post_error_wf for flows triggered via Kafka or REST through WFM.

    • Ensure both log functions and error functions are used in case of exceptions or errors.

    • Errors are indexed to two different indexes: Error index and Audit Index.

    • Use post_error_wf carefully, as this method will take the entire job to a FAILED state.

    Example Usage
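Since exact signatures should be checked against the package itself, the sketch below uses stand-in definitions (assumptions, not the real anuvaad-auditor API) to show the intended calling pattern of pairing log calls with error calls:

```python
# Stand-in definitions so this sketch is self-contained; in a real service these
# come from the anuvaad-auditor package, and the signatures here are assumptions.
def log_info(message, input_object):
    return {"level": "INFO", "message": message, "input": input_object}

def log_exception(message, input_object, exc):
    return {"level": "EXCEPTION", "message": message, "input": input_object,
            "cause": repr(exc)}

def post_error(code, message, exc):
    return {"errorID": code, "message": message, "cause": repr(exc)}

def tokenise(request):
    """Hypothetical handler: log on entry, and on failure log the exception
    AND return a standard error object, as the usage notes above require."""
    log_info("Tokenisation started", request)
    try:
        if "recordID" not in request:
            raise KeyError("recordID")
        return {"status": "SUCCESS"}
    except KeyError as exc:
        log_exception("Tokenisation failed", request, exc)
        return post_error("TOKENISER_ERROR", "Tokenisation failed", exc)

result = tokenise({"jobID": "TOK-123"})
```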

    Current Version

    anuvaad-auditor==0.1.1 - Please use this version.


    Anuvaad Workflow Manager

    ANUVAAD DATAFLOW PIPELINE WORKFLOW MANAGER

    Workflow Manager is the orchestrator for the entire dataflow pipeline.

    Overview

    Supported Language pairs and translation models


    Integrating applications first have to fetch supported language pairs (source and target language) along with respective translation model identifiers. These two parameters are mandatory before calling the translation API.

    API endpoint
    Description
    API contracts
    pip install -r requirements.txt
    python app.py
    curl --location --request POST 'https://stage-auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate' \
    --header 'Content-Type: application/json' \
    --header 'auth-token: {{auth-token}}' \
    --header 'context:' \
    --data-raw '{
        "workflowCode": "WF_A_JAL",
        "files": [
            {
                "locale": "ml",
                "path": "983da7e1-7cde-4091-8db4-cf845b5ea3c3.txt",
                "type": "txt"
            },
            {
                "locale": "en",
                "path": "aab70b95-ec0d-4c1c-9bfe-0c4864aecda0.txt",
                "type": "txt"
            }
        ]
    }'
    curl --location --request POST 'https://stage-auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/jobs/search/bulk' \
    --header 'auth-token: {{auth-token}}' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "jobIDs": [
            "{{jobIDs}}"
        ],
        "taskDetails": false
    }'
    {
        "input": {
            "files": [
                {
                    "locale": "en",
                    "path": "1.pdf",
                    "type": "pdf"
                }
            ]
        },
        "jobID": "BM-15913540488115873",
        "state": "INITIATED",
        "status": "STARTED",
        "stepOrder": 0,
        "workflowCode": "abc",
        "tool": "BM",
        "metadata": { 
            "module": "WORKFLOW-MANAGER",
            "receivedAt": 15993163946431696,
            "sessionID": "4M1qOZj53tIZsCoLNzP0oP",
            "userID": "d4e0b570-b72a-44e5-9110-5fdd54370a9d"
        }
    }
    {
        "workflowCode": "WF_A_BM",
        "files": [
            {
                "path": "763b0d80-4e82-423f-a432-23ddffe5ad92.pdf",
                "type": "pdf",
                "locale": "en"
            }
        ]
    }
    "filepath": "/home/test.ml",
    "locale": "ml",
    "type": "json"
    }
    }'
    Search Jobs in Local

    curl --location --request GET 'http://127.0.0.1:5001/anuvaad-etl/extractor/aligner/v1/alignment/jobs/get/ALIGN-1614743930159'
    NGINX
    Zuul
    PRIMA
    CRAFT
    Tesseract
    IndicTrans2
    Dhruva

    /v2/fetch-models

    To get the list of models and supported languages.

    /v2/fetch-models contract

    Introduction
    User creation and login
    https://auth.anuvaad.org/download/0-1640069280533983.json

    Saving blocks in the database

    curl --location --request POST 'http://localhost:5001//anuvaad/ocr-content-handler/v0/ocr/save-document' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "jobID": "BM-15913540488115873", 
        "state": "INITIATED", 
        "status": "STARTED", 
        "stepOrder": 0, 
        "workflowCode": "abc", 
        "taskID": "vision_ocr1615969391110792", 
        "tool": "GVOCR", 
        "message": "OCR", 
        "metadata": { 
            "module": "WORKFLOW-MANAGER", 
            "receivedAt": 15993163946431696, 
            "sessionID": "4M1qOZj53tIZsCoLNzP0oP", 
            "userID": "d4e0b570-b72a-44e5-9110-5fdd54370a9d" 
        }, 
        "files": [{ 
            "file": { 
                "identifier": "string", 
                "name": "20695.pdf", 
                "type": "json" 
            }, 
            "config": { 
                "language": "en" 
            }, 
            "pages": [ 
                { 
                    "identifier": "958b00e5-7864-4a73-a3ed-7640b1c3c1cf", 
                    "resolution": 300, 
                    "path": "/home/naresh/anuvaad/anuvaad-etl/anuvaad-extractor/document-processor/gv-document-digitization/upload/20695_41c92afd-53fd-4446-aaee-bedd194c59cf/images/206950001-1.jpg", 
                    "boundingBox": { 
                        "vertices": [{ 
                            "x": 0, 
                            "y": 0 
                        }, { 
                            "x": 2481, 
                            "y": 0 
                        }, { 
                            "x": 2481, 
                            "y": 3508 
                        }, { 
                            "x": 0, 
                            "y": 3508 
                        }] 
                    }, 
                    "page_no": 0, 
                    "regions": [ 
                        { 
                            "identifier": "1e0f1313-4c2f-47f3-9971-a797452439f8", 
                            "boundingBox": { 
                                "vertices": [{ 
                                    "x": 0, 
                                    "y": 0 
                                }, { 
                                    "x": 2481, 
                                    "y": 0 
                                }, { 
                                    "x": 2481, 
                                    "y": 3508 
                                }, { 
                                    "x": 0, 
                                    "y": 3508 
                                }] 
                            }, 
                            "class": "BGIMAGE", 
                            "data": "/home/naresh/anuvaad/anuvaad-etl/anuvaad-extractor/document-processor/gv-document-digitization/upload/20695_41c92afd-53fd-4446-aaee-bedd194c59cf/images/206950001-1_bgimages_.jpg" 
                        }
                    ]
                }
            ]
        }]
    }'
    anuvaad-nmt-inference
    https://github.com/project-anuvaad/nmt-training
    Api contract
    https://github.com/project-anuvaad/aaib4-inference
    https://github.com/AI4Bharat/indicTrans
    Api contract
    https://github.com/project-anuvaad/aaib4-inference/tree/main/docs/contracts
    MIT
    Use this method carefully, as it will take the entire job to the FAILED state.
    {
        "jobIDs": ["A_B-MtrjS-1640069221694"],
        "taskDetails": true
    }
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/ocr-content-handler/v0/ocr/update-word' \
    --header 'userID: d4e0b570-b72a-44e5-9110-5fdd54370a9d' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6ImphaW55LmpveUB0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkcXFjYUM2WW5yU2RFM2hDT2h4aXpnT0ZILjBxeFR4UWJBTHloZDFjTjBFOWluSnRqaTguOWknIiwiZXhwIjoxNjE2NTcxMjM3fQ.vCOncRM7BNK0qsv0OWnioIDfy-lOusTcMERsusm_ics' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "words":[ 
            { 
                "record_id":"A_OD10GV-msJYb-1616508492867|0-1616508495552232.json", 
                "region_id":"7df5afdc-6aac-498d-af2c-e73cdf438b90", 
                "word_id":"78f682ba-5571-4099-9055-f51d4d82368a", 
                "updated_word":"Constituency" 
            }
        ] 
    }'
    curl --location --request GET 'https://auth.anuvaad.org/anuvaad/ocr-content-handler/v0/ocr/fetch-document?recordID=A_FWLBOD15GOT-eAIRP-1632812802745%7C0-16328129323475454.json&start_page=0&end_page=0' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6ImphaW55LmpveUB0YXJlbnRvLmNvbSIsIkphaW55QDEyMyI6ImInJDJiJDEyJFh4VU9ZbVBGZ1NyMkhuclFZNTVqR2U3a3VmUmRoakxmTTdjU2NLSkxHZVNTZkxBQmJ4UGlPJyIsImV4cCI6MTYzMzA4Mzk2OX0.-hWfzbCR7ErGjK8B8PjnkpvtVBm1Rpavmjast0E4P4I'
    pip install -r requirements.txt
    python src/app.py
    pip install anuvaad-auditor==0.1.6
    from anuvaad_auditor import loghandler
    loghandler.log_info(<str(message)>, <json(input_object)>)
    loghandler.log_debug(<str(message)>, <json(input_object)>)
    loghandler.log_error(<str(error_message)>, <json(input_object)>, <exception_object>)
    loghandler.log_exception(<str(exception_message)>, <json(input_object)>, <exception_object>)
    from anuvaad_auditor import errorhandler
    errorhandler.post_error(<str(error_code)>, <str(error_msg)>, <exception_object>)
    errorhandler.post_error_wf(<str(error_code)>, <str(error_msg)>, <json(input_object)>, <exception_object>)
    from anuvaad_auditor.loghandler import log_info, log_debug, log_error, log_exception
    from anuvaad_auditor.errorhandler import post_error, post_error_wf
    
    def example_function():
        input_object = {"example": "data"}
        try:
            log_info("Starting example function", input_object)
            # Your code logic here
            raise KeyError('A sample key error')
        except KeyError as ke:
            log_exception("Caught a KeyError", input_object, ke)
            error_obj = post_error("KEY_ERROR", "KeyError occurred in example function", ke)
            return error_obj
        except Exception as e:
            log_exception("An unexpected error occurred", input_object, e)
            error_obj = post_error_wf("UNEXPECTED_ERROR", "Unexpected error in example function", input_object, e)
            return error_obj
    
    example_function()
    This document provides details about the Workflow Manager. WFM is the orchestrating module for the Anuvaad pipeline.

    Getting Started

    WFM is the backbone service of the Anuvaad system; it is a centralized orchestrator which directs the user input through the dataflow pipeline to achieve the desired output. It maintains a record of all the jobs and all the tasks involved in each job. WFM is the SPOC for the clients to retrieve details, status, error reports etc. about the jobs executed (sync/async) in the system. Using WFM, we’ve been able to use Anuvaad not just as a Translation platform but also as an OCR platform, Tokenization platform, and Sentence Alignment platform for dataset curation. Every use-case in Anuvaad is defined as a ‘Workflow’ in the WFM. These workflow definitions are in the form of a YAML file, which is read by WFM as an external configuration file.

    WFM Config: This is a YAML file which has a well defined structure to create workflows in the Anuvaad system. Every use-case in Anuvaad is called ‘Workflow’.

    Workflow - Set of steps to be executed on a given input to obtain the desired output. Anuvaad has 2 types of workflows: Async WF and Sync WF.

    Async WF - These are asynchronous workflows, wherein the modules involved in this flow communicate with each other and the WFM via the kafka queue asynchronously.

    Sync WF - These are synchronous workflows wherein the modules involved communicate with each other and the WFM via REST APIs. The client receives responses in real time.

    Structure of the config is as follows:

    • workflowCode: An alphanumeric code that UNIQUELY identifies a workflow. Format: WF_<A/S>_<codes_of_modules_in_sequence>

    • type: Type of the workflow - ASYNC or SYNC

    • description: Description of the workflow to explain what the workflow does

    • useCase: An alphanumeric prefix to the job ID signifying a reference to the workflowCode.

    • sequence: The set of steps to be defined under the workflow. This is a list of ‘steps’ where each ‘step’ contains keys order, tool & endState.

    • The ‘tool’ key is the definition of the tool used in the corresponding ‘step’ in the ‘sequence’. Each tool contains keys name, description, kafka-input, topic, partitions, kafka-output. In case of Sync WFs, the tool contains keys name, description, api-details, uri.

    • order: Number that defines the order of this step in the sequence. 0 is the value for the first step, 1 the next, and so on.

    • name: Name of the tool

    • description: Description of the tool

    • kafka-input: Details of the kafka input for that particular tool. The tool must accept input on this topic from the WFM.

    • kafka-output: Details of the kafka output for that particular tool. The tool must produce output on this topic to the WFM.

    • api-details: Details of the API exposed by the tool for WFM to access.

    An example workflowCode: WF_A_FCBMTKTR, where WF = Workflow, A = Async, FC = File Converter, BM = Block Merger, TK = Tokeniser, TR = Translator. Configs can be found here: wfm_configs
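
    The fields described above can be illustrated with a minimal sketch of a workflow definition. The topic names, endState values, and exact file layout here are assumptions for illustration; the actual configs are in the wfm_configs linked above.

    ```yaml
    # Hypothetical Async workflow: Tokeniser followed by Translator.
    workflowCode: WF_A_TKTR
    type: ASYNC
    description: Tokenise input paragraphs and translate the resulting sentences
    useCase: A_TKTR
    sequence:
      - order: 0
        tool:
          - name: TOKENISER
            description: Tokenises paragraphs into sentences
            kafka-input:
              topic: anuvaad-etl-tokeniser-input    # assumed topic name
              partitions: 1
            kafka-output:
              topic: anuvaad-etl-tokeniser-output   # assumed topic name
              partitions: 1
        endState: TOKENISED
      - order: 1
        tool:
          - name: TRANSLATOR
            description: Translates tokenised sentences via NMT
            kafka-input:
              topic: anuvaad-etl-translator-input   # assumed topic name
              partitions: 1
            kafka-output:
              topic: anuvaad-etl-translator-output  # assumed topic name
              partitions: 1
        endState: TRANSLATED
    ```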

    WFM has 2 types of IDs involved in the jobs that help uniquely identify a job and its intermediate tasks: jobID & taskID. jobID: This is an alphanumeric ID that uniquely identifies a job in the system. jobIDs are generated for both Sync and Async jobs. Format: <use_case>-<random_string>-<13-digit epoch time> taskID: A job contains multiple intermediate tasks; taskID is a unique ID used to identify each of those tasks. A combination of these taskIDs mapped to a given jobID can help trace an entire job through the system. Format: <module_code>-<13-digit epoch time>
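
    The two ID formats can be sketched as follows; only the overall shape is taken from the formats above, while the random-string length and character set are assumptions.

    ```python
    import random
    import string
    import time

    def epoch_ms():
        # 13-digit epoch time in milliseconds, as used in jobID and taskID.
        return str(int(time.time() * 1000))

    def generate_job_id(use_case):
        # Format: <use_case>-<random_string>-<13-digit epoch time>
        # The random-string length (5) and charset are assumptions.
        rand = "".join(random.choices(string.ascii_letters + string.digits, k=5))
        return "{}-{}-{}".format(use_case, rand, epoch_ms())

    def generate_task_id(module_code):
        # Format: <module_code>-<13-digit epoch time>
        return "{}-{}".format(module_code, epoch_ms())
    ```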

    Modules

    API Details: WFM exposes multiple APIs for the client to execute and fetch jobs in the Anuvaad system. The APIs are as follows:

    • /async/initiate: API to execute Async workflows.

    • /sync/initiate: API to execute Sync workflows.

    • /configs/search: API to search WFM configs.

    • /jobs/search: API to search initiated jobs.
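
    A minimal client sketch for /async/initiate, reusing the payload shape shown in the digitization examples later on this page; the auth-token header and the response shape are assumptions, so refer to the Postman collection below for the authoritative requests.

    ```python
    import json
    import urllib.request

    WFM_BASE = "https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow"

    def build_initiate_payload(path, locale, file_type, workflow_code):
        # Payload structure follows the digitization examples on this page.
        return {
            "files": [{
                "locale": locale,
                "path": path,
                "type": file_type,
                "config": {"OCR": {"option": "HIGH_ACCURACY", "language": locale}},
            }],
            "workflowCode": workflow_code,
        }

    def initiate_async(payload, auth_token):
        # The auth-token header name is taken from the curl examples on this
        # page; the response schema is an assumption.
        req = urllib.request.Request(
            WFM_BASE + "/async/initiate",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json", "auth-token": auth_token},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    ```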

    Postman Collection:

    https://www.getpostman.com/collections/11b7d2bc4e5aa37d04c8

    Code:

    https://github.com/project-anuvaad/anuvaad/tree/master/anuvaad-etl/anuvaad-workflow-mgr

    Prerequisites

    • python 3.7

    • ubuntu 16.04

    Dependencies:

    Run:

    APIs and Documentation

    Details of the APIs can be found here: https://raw.githubusercontent.com/project-anuvaad/anuvaad/master/anuvaad-etl/anuvaad-workflow-mgr/docs/etl-wf-manager-kafka-contract.yml

    Details of the requests flowing in and out through kafka can be found here: https://raw.githubusercontent.com/project-anuvaad/anuvaad/master/anuvaad-etl/anuvaad-workflow-mgr/docs/etl-wf-manager-kafka-contract.yml

    Configs

    Workflows have to be configured in a .yaml file as shown in the following document: https://raw.githubusercontent.com/project-anuvaad/anuvaad/master/anuvaad-etl/anuvaad-workflow-mgr/config/etl-wf-manager-workflow-config-users.yml

    License

    MIT

    Tokenizer

    Microservice that tokenises paragraphs into independently translatable sentences.

    Layout Detector

    Microservice interface for Layout detection model.

    Block Segmenter

    Handles layout detection misclassifications and region unification.

    Word Detector

    Word detection.

    Block Merger

    An OCR system that extracts text, images, tables, blocks etc. from the input file and makes them available in a format that can be utilised by downstream services to perform translation. This can also be used as an independent product that performs OCR on files, images, ppts, etc.

    Translator

    Translator pushes sentences to NMT module, which internally invokes IndicTrans model hosted in Dhruva to translate and push back sentences during the document translation flow.

    Content Handler

    Repository microservice which maintains and manages all the translated documents.

    Translation Memory X(TMX)

    System translation memory to facilitate overriding NMT translation with user-preferred translation. TMX provides three levels of caching: Global, User, and Organisation.
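
    The three cache levels can be sketched as a cascading lookup. The precedence order (User over Organisation over Global) and the dict-shaped caches are assumptions for illustration; the real TMX service may order and store its levels differently.

    ```python
    def tmx_lookup(sentence, user_id, org_id, user_tmx, org_tmx, global_tmx):
        """Return a stored translation for `sentence`, or None on a miss.

        Checks the narrowest scope first: the user's own memory, then the
        organisation's, then the global cache. A None result would fall
        through to plain NMT translation.
        """
        for cache, key in ((user_tmx, user_id), (org_tmx, org_id)):
            hit = cache.get(key, {}).get(sentence)
            if hit is not None:
                return hit
        return global_tmx.get(sentence)
    ```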

    User Translation Memory(UTM)

    The system tracks and remembers individual user translations or corrected translations and applies them automatically when the same sentences are encountered again.

    Service Contracts

    In Anuvaad, all services talk through the Workflow Manager according to the workflow configs defined for each service, and communication between these services is done by means of Kafka. Each service has its own functionality and is not dependent on its predecessor services’ outputs. Two Main Flows:

    1. Translation - Single/Block Translation or Document (.pdf , .docx , .pptx) Translation.

    2. Digitization - OCR on (.pdf or Images).

    Document Digitization

    This pipeline is used to extract text from a digital/scanned document. Lines and layouts (header, footer, paragraph, table, cell, image) are detected by a custom-trained Prima layout model and OCR is done using the Anuvaad OCR model.

    Github repo:

    API contract:

    How to Use

    pip install -r requirements.txt
    python app.py

    WFM-KAFKA-CONTRACT

    WFM-CONFIGS

    WFM is the backbone service of the Anuvaad system, it is a centralized orchestrator which directs the user input through the dataflow pipeline to achieve the desired output. It maintains a record of all the jobs and all the tasks involved in each job. WFM is the SPOC for the clients to retrieve details, status, error reports etc about the jobs executed (sync/async) in the system. Using WFM, we’ve been able to use Anuvaad not just as a Translation platform but also as an OCR platform, Tokenization platform, Sentence Alignment platform for dataset curation. Every use-case in Anuvaad is defined as a ‘Workflow’ in the WFM, These workflow definitions are in the form of a YAML file, which is read by WFM as an external configuration file.

    USER-MANAGEMENT

    USER-MAGAGEMENT-CONTRACT

    This microservice is served with multiple APIs to manage the User and Admin side functionalities in Anuvaad.

    Translation

    if pdf document:

    FILE-UPLOADER -> FILE-CONVERTER -> BLOCK-MERGER -> TOKENISER -> TRANSLATOR -> CONTENT-HANDLER

    if docx and pptx:

    FILE-UPLOADER -> FILE-TRANSLATOR -> TOKENISER -> TRANSLATOR -> CONTENT-HANDLER

    Digitization

    V1.0

    FILE-UPLOADER -> FILE-CONVERTER -> GOOGLE-VISION-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER

    V1.5

    FILE-UPLOADER -> FILE-CONVERTER -> WORD-DETECTOR -> LAYOUT-DETECTOR -> BLOCK-SEGMENTER -> GOOGLE-VISION-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER

    V2.0

    FILE-UPLOADER -> FILE-CONVERTER -> WORD-DETECTOR -> LAYOUT-DETECTOR -> BLOCK-SEGMENTER -> TESSERACT-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER

    FILE-UPLOADER

    FILE-UPLOADER-CONTRACT

    The user uploads the file, which is then stored in the Samba share for subsequent APIs to access.

    FILE-CONVERTER

    FILE-CONVERTER-CONTRACT

    This microservice is a Kafka consumer service that consumes the input files and converts them into PDF. Best results are obtained only for the file formats supported by LibreOffice.
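
    The conversion step can be sketched with LibreOffice's headless CLI; the wrapper function names and paths are illustrative, and this is not the service's actual code.

    ```python
    import subprocess
    from pathlib import Path

    def build_convert_cmd(src, outdir):
        # LibreOffice headless conversion to PDF. As noted above, results
        # are best for formats LibreOffice itself supports.
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", outdir, src]

    def convert_to_pdf(src, outdir):
        # Runs the conversion and returns the expected output path.
        subprocess.run(build_convert_cmd(src, outdir), check=True)
        return Path(outdir) / (Path(src).stem + ".pdf")
    ```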

    If Document format is .pdf then Block-merger will be used for OCR on the document.

    BLOCK-MERGER

    BLOCK-MERGER-CONTRACT

    It is used to extract text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.

    If Document format is .docx or .pptx then File-Translator service will be used.

    FILE-TRANSLATOR

    This microservice is served with multiple APIs to transform the data in the file into a JSON file and to download the translated files of type docx, pptx, and html.

    TOKENISER

    TOKENISER-CONTRACT

    This service is used to tokenise the input paragraphs received into independently translatable sentences, which can be consumed by downstream services to translate the entire input. Regular expressions and specific libraries such as NLTK are used to build this tokeniser.

    TRANSLATOR

    Translator receives input from the tokeniser module; the input is a JSON file that contains tokenised sentences. These tokenised sentences are extracted from the JSON file and then sent to NMT over kafka for translation. NMT expects a batch of ‘n’ sentences in one request, so the Translator creates ‘m’ batches of ‘n’ sentences each and pushes them to the NMT input topic. In parallel it also listens to the NMT’s output topic to receive the translation of the batches sent. Once all ‘m’ batches are received back from the NMT, the translation of the document is marked complete. Next, Translator appends these translations back to the JSON file received from the Tokeniser. The whole of this JSON, now enriched with translations against every sentence, is pushed to Content Handler via API, and Content Handler then stores these translations.
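
    The batching logic described above can be sketched as follows; the function names are illustrative, with the batch size ‘n’ supplied by the caller since the actual value is a service configuration.

    ```python
    def make_batches(sentences, n):
        # Split tokenised sentences into batches of at most n, as the
        # Translator does before pushing them to the NMT input topic.
        return [sentences[i:i + n] for i in range(0, len(sentences), n)]

    def all_batches_received(sent_batch_ids, received_batch_ids):
        # The document's translation is marked complete only when every
        # batch pushed to NMT has come back on the output topic.
        return set(sent_batch_ids) == set(received_batch_ids)
    ```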

    CONTENT-HANDLER

    CONTENT-HANDLER

    This microservice is served with multiple APIs to handle and retrieve back the contents (final result) of files translated in the Anuvaad system.

    WORD-DETECTOR

    WORD-DETECTOR-CONTRACT

    Input is a PDF or image. If the input is a PDF, it is first converted into images. A custom Prima line model is used for line detection in each image. Returns a list of pages, where each page includes a list of lines along with page information (page path, page resolution).

    LAYOUT-DETECTOR

    LAYOUT-DETECTOR-CONTRACT

    Takes the output of the word detector as input. Uses a Prima layout model for layout detection in the image. Layout classes: Paragraph, Image, Table, Footer, Header, Maths formula. Returns a list of pages, where each page includes a list of layouts and a list of lines.

    BLOCK-SEGMENTER

    BLOCK-SEGMENTER-CONTRACT

    Takes the output of the layout detector as input. Collates lines and words at the layout level.

    GOOGLE-VISION-OCR

    GOOGLE-VISION-OCR-CONTRACT

    Takes the output of the block segmenter as input. Uses Google Vision as the OCR engine. Text is collated at word, line, and paragraph level.

    TESSERACT-OCR

    TESSERACT-OCR-CONTRACT

    Takes the output of the block segmenter as input. Uses the Anuvaad OCR model as the OCR engine. Text is collated at word, line, and paragraph level.

    OCR-TOKENISER

    OCR-TOKENISER-CONTRACT

    This service is used to tokenise the input paragraphs received into independently translatable sentences, which can be consumed by downstream services to translate the entire input. Regular expressions and specific libraries such as NLTK are used to build this tokeniser.

    OCR-CONTENT-HANDLER

    OCR-CONTENT-HANDLER-CONTRACT

    This microservice is served with multiple APIs to handle and manipulate the digitized data from anuvaad-gv-document-digitize which is part of the Anuvaad system. This service is functionally similar to the Content Handler service but differs since the output document (digitized doc) structure varies.

    ALIGNER

    ALIGNER-CONTRACT

    This module is for “aligning”, or simply, finding similar sentence pairs from two lists of sentences, preferably in different languages. The service is dependent on the file uploader and workflow manager (WFM) services. The Aligner service is based on Google’s LaBSE model and FB’s FAISS library. It accepts two files as inputs, from which two lists of sentences are collected. LaBSE embeddings are calculated for each of the sentences in the lists. Cosine similarity between embeddings is calculated to find meaningfully similar sentence pairs. The FAISS algorithm is used to speed up the whole process dramatically.
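
    The pairing criterion can be sketched with plain cosine similarity over precomputed embedding vectors. This illustrates only the matching step: real LaBSE embeddings and the FAISS index are omitted, and the threshold value is an assumption.

    ```python
    import math

    def cosine(u, v):
        # Cosine similarity between two embedding vectors.
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def align(src_embeddings, tgt_embeddings, threshold=0.8):
        # For each source embedding, pick the most similar target embedding;
        # keep the pair only if it clears the threshold. FAISS would replace
        # this brute-force inner loop in the real service.
        pairs = []
        for i, u in enumerate(src_embeddings):
            j, score = max(
                ((j, cosine(u, v)) for j, v in enumerate(tgt_embeddings)),
                key=lambda t: t[1],
            )
            if score >= threshold:
                pairs.append((i, j, score))
        return pairs
    ```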

    DOCUMENT-CONVERTER

    DOCUMENT-CONVERTER-CONTRACT

    This microservice is intended to generate the final document after translation and digitization. This currently supports pdf, txt, xlsx document generation.

    WORKFLOW-MANAGER
    WFM-CONTRACT

    Upload a PDF or image file using the upload API:

    Upload URL: https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file

    Get the upload ID and copy it to the DD2.0 input path.

  • Initiate the Workflow:

    WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

    DD2.0 Input:

  • Microservices

    Word Detector

    • Input: PDF or image

    • Output: List of pages with detected lines and page information.

    sample

    Github repo: Word Detector Craft

    API contract: Word Detector API Contract

    How to use: Word Detector
    1. Upload a PDF or image file using the upload API:

      Upload URL: https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file

    2. Initiate the Word Detector Workflow:

      WF URL:

      Word Detector Input:

    Layout Detector

    • Input: Output of word detector

    • Output: List of pages with detected layouts and lines.

    sample

    Github repo: Layout Detector Prima

    API contract: Layout Detector API Contract

    How to use: Layout Detector
    1. Input JSON file of the word detector as an input path.

    2. Initiate the Layout Detector Workflow:

      WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

      Layout Detector Input:

    Block Segmenter

    • Input: Output of layout detector

    • Output: Collation of line and word at layout level.

    Github repo: Block Segmenter

    API contract: Block Segmenter API Contract

    How to use: Block Segmenter
    1. Input JSON file of the layout detector as an input path.

    2. Initiate the Block Segmenter Workflow:

      WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

      Block Segmenter Input:

    • Input: Output of block segmenter

    • Output: Text collation at word, line, and paragraph level using Google Vision as the OCR engine.

    Tesseract OCR

    • Input: Output of block segmenter

    • Output: Text collation at word, line, and paragraph level using Anuvaad OCR model.

    Github repo: OCR Tesseract Server

    API contract: Google Vision API Contract

    How to use: Tesseract OCR
    1. Input JSON file of the block segmenter as an input path.

    2. Initiate the Tesseract OCR Workflow:

      WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

      Tesseract OCR Input:

    Google OCR (Tesseract Alternative)

    How to use: Google Vision OCR
    1. Input JSON file of the block segmenter as an input path.

    2. Initiate the Google OCR Workflow:

      WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

      Google OCR Input:

    Github repo: OCR Google Vision Server

    API contract: Google Vision API Contract

    Anuvaad Document Processor
    API Contract
    Document Digitization
    {
        "files": [
            {
                "locale": "language",
                "path": "file_name",
                "type": "file_format",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_FCWDLDBSOD20TESOTK"
    }
    https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate
    {
        "files": [
            {
                "locale": "language",
                "path": "file_name",
                "type": "file_format",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_WD"
    }
    {
        "files": [
            {
                "locale": "language",
                "path": "word_detector_output",
                "type": "json",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_LD"
    }
    {
        "files": [
            {
                "locale": "language",
                "path": "layout_detector_output",
                "type": "json",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_BS"
    }
    {
        "files": [
            {
                "locale": "language",
                "path": "block_segmenter_output",
                "type": "json",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_OD20TES"
    }
    {
        "files": [
            {
                "locale": "language",
                "path": "block_segmenter_output",
                "type": "json",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_OTES"
    }

    User management

    Feature Branch name: user-mangement_feature API Contract: here

    UMS is the initial Anuvaad module that facilitates user login and other account-related functionalities. It features admin-level login and user-level login. Only the super admin has the authority to create new organizations or add new users to the system (when not using sign-up). The admin can assign roles to the new users as well.

    Modules

    User Modules

    CreateUsers

    Whitelisted bulk API to create/register users in the system.

    Mandatory params: userName, email, password, roles

    Actions:

    • Validating input params as per the policies

    • Storing user entry in the database and assigning a unique id (userID)

    • Triggering verification email

    CreateUsers CURL Request

    VerifyUsers

    Whitelisted API to verify and complete the registration process on Anuvaad.

    Mandatory params: userName, userID

    Actions:

    • Validating input params as per the policies

    • Activating the user

    • Triggering registration successful email

    VerifyUsers CURL Request

    UserLogin

    Whitelisted API for login.

    Mandatory params: userName, password

    Actions:

    • Validating input params as per the policies

    • Issuing auth token (JWT token)

    • Activating user session

    UserLogin CURL Request

    UserLogout

    Whitelisted API for logging out.

    Mandatory params: userName

    Actions:

    • Validating input params as per the policies

    • Turning off user session

    UserLogout CURL Request

    AuthTokenSearch

    API to validate auth tokens and fetch back user details.

    Mandatory params: token

    Actions:

    • Validating the token

    • Returning user records matching the token only when the token is active

    • Same API is used for verifying a token generated on forgot-password as well.

    AuthTokenSearch CURL Request

    UpdateUsers

    Bulk API to update user details, RBAC enabled.

    Mandatory params: userID

    Updatable fields: orgID, roles, models, email

    Actions:

    • Validating input params as per the policies

    • Updating DB records

    UpdateUsers CURL Request

    ForgotPassword

    API for forgot password.

    Mandatory params: userName

    Actions:

    • Validating input params as per the policies

    • Generating reset password link and sending it via email

    ForgotPassword CURL Request

    ResetPassword

    API to update password, RBAC enabled.

    Mandatory params: userName, password

    Actions:

    • Validating input params as per the policies

    • Generating reset password link and sending it via email

    ResetPassword CURL Request

    Admin Modules

    (Only Admin has access)

    OnboardUsers

    Bulk API to onboard users to the Anuvaad system.

    Mandatory params: userName, email, password, roles

    Actions:

    • Validating input params as per the policies

    • Storing user entry in the database and assigning a unique userID

    • User account is verified and activated by default

    OnboardUsers CURL Request

    SearchUsers

    API for bulk search with pagination property.

    Actions:

    • Validating input params as per the policies

    • All user records are returned if skip_pagination is set to True

    • When no offset and limit are provided, default values are set as per configs

    • Only the records matching the search values are returned if skip_pagination is False

    SearchUsers CURL Request

    ActivateDeactivateUser

    API to update the activation status of a user.

    Mandatory params: userName, is_active

    Actions:

    • Validating input params as per the policies

    • Updating the user activation status

    ActivateDeactivateUser CURL Request

    SearchRoles

    API to fetch active roles in Anuvaad.

    Actions:

    • Returning active role codes

    SearchRoles CURL Request

    Organization Modules (Currently only ADMIN has access)

    CreateOrganization: Bulk API to upsert organizations.

    Mandatory params: code, active

    Actions:

    • Validating input params as per the policies

    • Creating or deactivating orgs as per active status on request

    CreateOrganization CURL Request

    SearchOrganization: API to get organization details.

    Actions:

    • If org_code is given, searches for that organization alone; otherwise, all organizations are returned.

    SearchOrganization CURL Request

    Extension (for Anuvaad web extension)

    GenerateIdToken: Generating token for web extension user.

    Mandatory params: id_token

    Actions:

    • Decrypting and validating the token

    • If the token is valid, register the user and return auth token

    GenerateIdToken CURL Request

    Notes

    • Add APIs with Zuul if they need external access.

    • Rebuild and deploy UMS, along with Zuul, whenever a new role is added.

    • Email ID used for system notifications: [email protected]

    • Email templates are available.

    Setup Tips

    • Run the docker container.

    • Initialize the DB by creating a Super-Admin account directly in the DB.

    • Additional users can be added from the UI by logging into the super admin account.

    How to Initialize UMS without UI?

    1. Create an account (Admin is preferred) using the API anuvaad/user-mgmt/v1/users/create.

    2. Get the verification token from the email (2nd last ID on the ‘verify now’ link) or the userID from the user table.

    3. Complete the registration process by calling the anuvaad/user-mgmt/v1/users/verify-user API.

    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/create' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "users": [ 
            { 
                "name": "Jainy Joy", 
                "userName": "[email protected]", 
                "password": "password123", 
                "email": "[email protected]", 
                "orgID" : "ANUVAAD", 
                "roles": [ 
                    { 
                        "roleCode":"TRANSLATOR", 
                        "roleDesc":"Has access to translation services" 
                    } 
                ], 
                "models": [ 
                    { 
                        "src_lang": "en", 
                        "tgt_lang": "ml", 
                        "uuid": "7156838-90b6-465f-a9aa-5e8f7bfa97e8" 
                    }, 
                    { 
                        "src_lang": "en", 
                        "tgt_lang": "hi", 
                        "uuid": "2e2fb17a-c470-4562-9cf6-aef0a0ba70ec" 
                    } 
                ] 
            } 
        ]    
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/verify-user' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "userName": "[email protected]", 
        "userID": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/login' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "userName": "[email protected]", 
        "password": "password123" 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/logout' \
    --header 'auth-token;' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "userName": "[email protected]" 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/auth-token-search' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VySUQiOiIxMTIyMzM0NCIsImV4cCI6MTYyNzU1MTk5NX0.Wqha17Jsf-D_6KXOsEj3STpV4FBfM_27DRghYKXp7Sg" 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/update' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6Imt1bWFyLmRlZXBha0B0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIka2V1VFNUU2dTZW5vUzI1Y2djTmJxLmpaVWF1cVN6SXpaL0xGWHdySDRrenJTZE1WMkZPQnUnIiwiZXhwIjoxNjE3OTU4MTMxfQ.TEIg306dXvtiTvuCYdPWF1ZNjv9fQ1Y0iZyBXHoaqzM' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "users": [ 
            { 
                "userID": "530761e5be1e4e4ebf1335b985c0b1181617878383934", 
                "orgID": "NONMT", 
                "roleCode": "TRANSLATOR", 
                "models": [ 
                    { 
                        "src_lang": "en", 
                        "tgt_lang": "ml", 
                        "uuid": "7156838-90b6-465f-a9aa-5e8f7bfa97e8" 
                    } 
                ] 
            } 
        ] 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/forgot-password' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "userName": "[email protected]" 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/reset-password' \
    --header 'x-user-id: 7505827e810344b98db9433b8bab4f3d1606377202908' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "userName": "[email protected]", 
        "password": "xxxxxxx" 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/onboard-users' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6Imt1bWFyLmRlZXBha0B0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkeFFISlZiUGhkVTFaL2RnNzAzbkUxdWtwZy5YY2wwV1A3R3U3S29JWEI2aHd2aHZILjVqN0snIiwiZXhwIjoxNjEyMzMzNDk0fQ.kVZRyyqaDnHOT9Qgqpet1sIzHjVbJwDALTgOpVxA6yo' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "users": [ 
            { 
                "name": "Test User", 
                "userName": "[email protected]", 
                "password": "password1123", 
                "email": "[email protected]", 
                "phoneNo": "", 
                "roles": [ 
                    { 
                        "roleCode": "TRANSLATOR", 
                        "roleDesc": "Has access to translation related resources" 
                    } 
                ], 
                "orgID": "TESTORG03" 
            } 
        ] 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/search' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6Imt1bWFyLmRlZXBha0B0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkNUpsTWhYOUt0REVOQmxzYlZqYS5OdUYuLmxvWkV4VWw0b2ZDNng3S0dNaHhGMkVraHQvWjInIiwiZXhwIjoxNjExMjk0NzU4fQ.mXlh6tL4ahc1xL16QGv8qDHBWamEYsJmE5b5_lDiioE' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "userIDs": [], 
        "userNames": [], 
        "roleCodes": [ 
            "TRANSLATOR", 
            "ANNOTATOR" 
        ], 
        "offset": null, 
        "limit": null, 
        "skip_pagination": false 
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/activate-user' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6Imt1bWFyLmRlZXBha0B0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkeFFISlZiUGhkVTFaL2RnNzAzbkUxdWtwZy5YY2wwV1A3R3U3S29JWEI2aHd2aHZILjVqN0snIiwiZXhwIjoxNjEyMzMzNDk0fQ.kVZRyyqaDnHOT9Qgqpet1sIzHjVbJwDALTgOpVxA6yo' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "userName": "[email protected]", 
        "is_active": true 
    }'
    curl --location --request GET 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/users/get-roles' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6Imt1bWFyLmRlZXBha0B0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIka2V1VFNUU2dTZW5vUzI1Y2djTmJxLmpaVWF1cVN6SXpaL0xGWHdySDRrenJTZE1WMkZPQnUnIiwiZXhwIjoxNjE3OTU4MTMxfQ.TEIg306dXvtiTvuCYdPWF1ZNjv9fQ1Y0iZyBXHoaqzM' \
    --data-raw ''
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/org/upsert' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6Imt1bWFyLmRlZXBha0B0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkN1dkc1MzUW9Ob1dxY1NUSzUxREsxZWFIUFhWUW9oRWl2LnFtSTFXM2pJZVZoejVCdnVwRjYnIiwiZXhwIjoxNjExMTU0MjIxfQ.5aDzGWOemHW7dgdwezJhnAWiRXS6ljOSWEuPwW6pQUQ' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "organizations": [ 
            { 
                "code": "ANUVAAD", 
                "active": true, 
                "description": "default org for the users of Anuvaad system" 
            } 
        ] 
    }'
    curl --location --request GET 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/org/search?org_code=ANUVAAD' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6Imt1bWFyLmRlZXBha0B0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkcmI3TlZ3SEk1RVZYcFpmU05KSms2Lng0dEw4b01RMW9oZldsR01SNUFqdkFWa3BSRWNzckcnIiwiZXhwIjoxNjIwNDUxNjcxfQ.dpCOSd0pvxcKsyGqt3HzxtjWZDdNlLG_0zjhSsKfNbA' \
    --data-raw ''
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/user-mgmt/v1/extension/users/get/token' \
    --header 'Content-Type: application/json' \
    --data-raw '{ 
        "id_token": "eE7S2Tn/s8+xhU/EGJKxSC+SvR9IOrGcnbC7Jq5iCLuFrpxNOe8c/aGg5Le1:eV0n09cXpNXXVCfPSkdPmCi4gC68b1oH" 
    }'

    Content Handler

This microservice exposes multiple APIs to handle and retrieve the contents (final result) of files translated in the Anuvaad system.

    Modules

    Common Information

Common information applicable to the save and update operations on translations.

    Workflow Code

    • WF_S_TR and WF_S_TKTR: Change the sentence structure, hence the s0 pair needs to be updated.

    • DP_WFLOW_S_C: Doesn't change the sentence structure, hence no need to update the s0 pair.

    Sentence Keys

    • s0_src: Source sentence extracted from the file.

    • s0_tgt: Sentence translation from NMT.

    • tgt: Translation updated by the user (user translation). (The source may vary if the user edits the input document; otherwise it remains the same as s0_src.)
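The update rules above can be sketched as a small helper. This is a hypothetical illustration using the sentence keys listed here; the actual logic lives in the content-handler service.

```python
# Workflow codes that change sentence structure (e.g. split/merge),
# so the stored s0 pair must be refreshed as well.
STRUCTURE_CHANGING = {"WF_S_TR", "WF_S_TKTR"}

def apply_update(sentence, workflow_code, new_src, new_tgt):
    """Return an updated copy of a sentence dict per the workflow code."""
    updated = dict(sentence)
    updated["src"] = new_src   # may differ from s0_src if the user edited it
    updated["tgt"] = new_tgt   # user translation is always updated
    if workflow_code in STRUCTURE_CHANGING:
        # Structure changed: re-baseline the s0 pair too.
        updated["s0_src"] = new_src
        updated["s0_tgt"] = new_tgt
    return updated
```

Under a structure-preserving code such as DP_WFLOW_S_C, only src/tgt move while the s0 pair stays fixed as the original extraction/NMT baseline.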

    File Content Modules

    SaveFileContent

    API to save translated documents. The JSON request object is generated by block-merger and later updated by the tokenizer and translator. This API is used internally.

    Mandatory parameters: userid, pages, record_id, src_lang, tgt_lang

    Actions:

    • Validating input parameters as per the policies.

    • The document to be saved is converted into blocks.

    • Block can be of type images, lines, text_blocks, etc.

    • Every block is created with UUID.
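A minimal sketch of the block-preparation step, assuming the page/block shapes shown in the SaveFileContent request below (the function name is hypothetical):

```python
import uuid

def prepare_blocks(pages):
    """Assign a fresh UUID to every block on every page.

    Blocks can be of type images, lines, or text_blocks; each one
    gets a unique block_identifier before being saved.
    """
    for page in pages:
        for data_type in ("images", "lines", "text_blocks"):
            for block in page.get(data_type, []):
                block["block_identifier"] = str(uuid.uuid4())
    return pages
```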

    SaveFileContent CURL Request

    GetFileContent

    API to fetch the documents back. The response object is an array of pages, with pagination enabled. RBAC is enabled.

    Mandatory parameters: start_page, end_page, record_id

    Actions:

    • Validating input parameters as per the policies.

    • Fetching back the blocks as per the page number requested.
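The paginated fetch can be sketched as follows (a hypothetical helper; the real API reads start_page and end_page as query parameters):

```python
def fetch_pages(record, start_page, end_page):
    """Return the stored pages whose page_no lies in the inclusive
    range [start_page, end_page], mirroring the fetch-content API's
    start_page/end_page query parameters."""
    return [page for page in record["pages"]
            if start_page <= page["page_no"] <= end_page]
```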

    GetFileContent CURL Request

    UpdateFileContent

    API to update the block content; triggered on split, merge, re-translate operations. Used internally.

    Mandatory parameters: record_id, user_id, blocks, workflowCode

    Actions:

    • Validating input parameters as per the policies.

    • Updating the list of blocks.

    UpdateFileContent CURL Request

    SaveFileContentReferences

    Internal API to store S3 link references to translated documents (in the docx flow).

    Mandatory parameters: job_id, file_link

    Actions:

    • Validating input parameters as per the policies.

    • Storing records in the database.

    SaveFileContentReferences CURL Request

    GetFileContentReferences

    API to fetch back the S3 link for docx files. RBAC enabled.

    Mandatory parameters: job_ids

    Actions:

    • Validating input parameters as per the policies.

    • Fetching back the data from the database.
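Taken together with SaveFileContentReferences above, the store/fetch pair can be sketched as follows, using a plain dict as a hypothetical stand-in for the database:

```python
def store_refs(db, records):
    """Store S3 file-link references keyed by job_id
    (the SaveFileContentReferences operation)."""
    for record in records:
        db[record["job_id"]] = record["file_link"]

def fetch_refs(db, job_ids):
    """Fetch the stored links for the requested job_ids
    (the GetFileContentReferences operation); unknown IDs are skipped."""
    return {job_id: db[job_id] for job_id in job_ids if job_id in db}
```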

    GetFileContentReferences CURL Request

    Sentence Modules

    SaveSentence

    API to store user translations. RBAC enabled.

    Mandatory parameters: sentences, workflowCode, user_id

    Actions:

    • Validating input parameters as per the policies.

    • Updating the sentence blocks.

    • Saved sentences are always updated with the "save": true flag.

    • Saved sentences are also saved in the Redis store for Sentence Memory.
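The save step can be sketched as below. The in-memory `memory` dict is a hypothetical stand-in for the Redis sentence-memory store, and the field names follow the SaveSentence request body:

```python
def save_sentences(store, memory, sentences):
    """Persist user translations: every saved sentence gets the
    "save": true flag, and its (src -> tgt) pair is also written to a
    sentence-memory cache keyed by language pair and source text."""
    for sentence in sentences:
        saved = dict(sentence)
        saved["save"] = True
        store[saved["s_id"]] = saved
        memory[(saved["src_lang"], saved["tgt_lang"], saved["src"])] = saved["tgt"]
    return store
```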

    SaveSentence CURL Request

    FetchSentence

    Bulk API to fetch back sentences. RBAC enabled.

    Mandatory parameters: sentences, record_id, block_identifier, s_id

    Actions:

    • Validating input parameters as per the policies.

    • Returning an array of the sentences searched for.

    FetchSentence CURL Request

    Saving blocks in the database.

    curl --location --request POST 'http://gateway_anuvaad-content-handler:5001/anuvaad/content-handler/v0/save-content' \
    --header 'userid: 06b5419ab0f14669b1dff654533416411608108799138' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "file_locale": "en",
      "record_id": "FC-BM-TOK-TRANS-1601531696387|0-16015317191287522.json",
      "src_lang": "en",
      "tgt_lang": "hi",
      "pages": [
        {
          "images": [],
          "lines": [],
          "page_height": 1188,
          "page_no": 1,
          "page_width": 918,
          "text_blocks": [
            {
              "attrib": null,
              "avg_line_height": 15,
              "block_id": "ae3165c2-03aa-11eb-a840-02420a00032e-0",
              "block_identifier": "24610b3f-c0fd-4cbf-9597-1c037e84fc70",
              "children": [
                {
                  "attrib": "HEADER",
                  "block_id": "ae3165c2-03aa-11eb-a840-02420a00032e-0-0",
                  "children": null,
                  "font_color": "#000000",
                  "font_family": "ArialMT",
                  "font_size": 13,
                  "text": "Consulting Manager: Sample manager",
                  "text_height": 15,
                  "text_left": 108,
                  "text_top": 63,
                  "text_width": 293
                }
              ],
              "data_type": "text_blocks",
              "file_locale": "68072f3c-c57a-4f62-a7fc-42ed6f776c1e",
              "font_color": "#000000",
              "font_family": "ArialMT",
              "font_size": 13,
              "job_id": "",
              "page_info": {
                "page_height": 1188,
                "page_no": 1,
                "page_width": 918
              },
              "record_id": "FC-BM-TOK-TRANS-1601531696387|0-16015317191287522.json",
              "text": " Consulting Manager: Sample Manager  Phone: +91-1234567898/+91-80 123456 Email:  ​ [email protected] ",
              "text_height": 47,
              "text_left": 108,
              "text_top": 63,
              "text_width": 293,
              "tokenized_sentences": [
                {
                  "input_subwords": "['▁Consult', 'ing', '▁Manager', '▁:']",
                  "n_id": "FC-BM-TOK-TRANS-1601531696387|0-16015317191287522.json|1|ae3165c2-03aa-11eb-a840-02420a00032e-0",
                  "output_subwords": "['▁परामर्श', '▁प्रबंधक', 'ः']",
                  "pred_score": -0.8280696868896484,
                  "s_id": "94695768-5976-4fdc-853d-9aa49630ce77",
                  "src": "Consulting Manager:",
                  "tagged_src": "Consulting Manager:",
                  "tagged_tgt": "परामर्श प्रबंधकः",
                  "tgt": "परामर्श प्रबंधकः"
                }
              ],
              "underline": 1
            }
          ]
        }
      ]
    }'
    curl --location --request GET 'https://auth.anuvaad.org/anuvaad/content-handler/v0/fetch-content?record_id=A_FTTTR-GBWSA-1623682123483%7CDOCX-c7759250-6952-4575-9514-66a1383caabb.json&start_page=0&end_page=0' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6ImphaW55LmpveUB0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkNzJjY1ZFRmNIcC9qSkg5dzBGMXFTdU5ZQlNXQThSMzdRak1zdm8wN01rMnNYeVI2N24xRlcnIiwiZXhwIjoxNjIzNzY5Njg0fQ.a6gaxGvG-yCLrE6qeTshf2V8j_S44-U6obgWyyHZRK8'
    curl --location --request POST 'http://gateway_anuvaad-content-handler:5001/anuvaad/content-handler/v0/update-content' \
    --header 'userid: kd' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "record_id": "FC-BM-TOK-TRANS-1601531696387|0-16015317191287522.json",
      "blocks": [
        {
          "attrib": null,
          "avg_line_height": 15,
          "block_id": "ae3165c2-03aa-11eb-a840-02420a00032e-0",
          "block_identifier": "24610b3f-c0fd-4cbf-9597-1c037e84fc70",
          "children": [
            {
              "attrib": "HEADER",
              "block_id": "ae3165c2-03aa-11eb-a840-02420a00032e-0-0",
              "children": null,
              "font_color": "#000000",
              "font_family": "ArialMT",
              "font_size": 13,
              "text": "Consulting Manager: Sample Manager",
              "text_height": 15,
              "text_left": 108,
              "text_top": 63,
              "text_width": 293
            }
          ],
          "data_type": "text_blocks",
          "file_locale": "68072f3c-c57a-4f62-a7fc-42ed6f776c1e",
          "font_color": "#000000",
          "font_family": "ArialMT",
          "font_size": 13,
          "job_id": "",
          "page_info": {
            "page_height": 1188,
            "page_no": 1,
            "page_width": 918
          },
          "record_id": "FC-BM-TOK-TRANS-1601531696387|0-16015317191287522.json",
          "text": " Consulting Manager: Sample Manager  Phone: +91-1234567898/+91-80 123456 Email:  ​ [email protected] ",
          "text_height": 47,
          "text_left": 108,
          "text_top": 63,
          "text_width": 293,
          "tokenized_sentences": [
            {
              "input_subwords": "['▁Consult', 'ing', '▁Manager', '▁:']",
              "n_id": "FC-BM-TOK-TRANS-1601531696387|0-16015317191287522.json|1|ae3165c2-03aa-11eb-a840-02420a00032e-0",
              "output_subwords": "['▁परामर्श', '▁प्रबंधक', 'ः']",
              "pred_score": -0.8280696868896484,
              "s_id": "94695768-5976-4fdc-853d-9aa49630ce77",
              "src": "Consulting Manager:",
              "tagged_src": "Consulting Manager:",
              "tagged_tgt": "परामर्श प्रबंधकः",
              "tgt": "परामर्श प्रबंधकः"
            }
          ],
          "underline": 1
        }
      ]
    }'
    curl --location --request POST 'http://gateway_anuvaad-content-handler:5001/anuvaad/content-handler/v0/ref-link/store' \
    --header 'ad-userid: kd' \
    --header 'userid: kd' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "records": [
        {
          "job_id": "abc1",
          "file_link": {
            "HTML": {
              "LIBRE": "https://anuvaad1.s3.amazonaws.com/upload/sample3tableshredacrossPages/LIBRE/sample3tableshredacrossPages.html",
              "PDFTOHTML": "https://anuvaad1.s3.amazonaws.com/upload/sample3tableshredacrossPages/PDFTOHTML/sample3tableshredacrossPages-html.html"
            },
            "PDF": {
              "LIBRE": "https://anuvaad1.s3.amazonaws.com/upload/sample3tableshredacrossPages/PDFTOHTML/sample3tableshredacrossPages.pdf"
            }
          }
        }
      ]
    }'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/content-handler/v0/ref-link/fetch' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6ImphaW55LmpveUB0YXJlbnRvLmNvbSIsIkphaW55QDEyMyI6ImInJDJiJDEyJDk2YzRMb0ZCTG05ZU1XVlJXNVFzTE9ydTlLZVc1emJnVnBhaFouclBuYnFReU96YUNDMFVpJyIsImV4cCI6MTY0MDY3MDg4N30.R0zEJyEeXhOZ41TnsPTD0rFov3kPmUVfL_DdOxKU0QI' \
    --header 'Content-Type: application/json' \
    --data-raw '{"job_ids":["A_FTTTR-cSCim-1632805831132"]}'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/content-handler/v0/save-content-sentence' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6ImphaW55LmpveUB0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkaXJXU2xrdjFDSWUzNzJZMzZiWlhFdTdKSDQ0QlViR2d2QlVSMW5OMXJxNEEuMWpuQ0JsTi4nIiwiZXhwIjoxNjEzNzM5NTI5fQ.g-JLNqFen-ol3y40OAFA82q1pi-b3BDSGtoWi-OyjhA' \
    --header 'Content-Type: application/json' \
    --data-raw '{"workflowCode":"DP_WFLOW_S_C",
    "sentences": [
      {
        "bleu_score": 1,
        "n_id": "",
        "s0_src": "He was released on bail on the 1st of December. We used to go there to bail out the old man.",
        "s0_tgt": "उन्हें 1 दिसंबर को जमानत पर रिहा कर दिया गया था। हम वहां पुराने आदमी को जमानत देने जाते थे।",
        "s_id": "4e412457-e357-419b-b477-1676b314afd5",
        "save": true,
        "src": "He was released on bail on the 1st of December. We used to go there to bail out the old man.",
        "src_lang": "en",
        "tagged_src": "He was released on bail on the NnUuMm०st of December. We used to go there to bail out the old man.",
        "tagged_tgt": "उन्हें NnUuMm० दिसंबर को जमानत पर रिहा कर दिया गया था। हम वहां पुराने आदमी को जमानत देने जाते थे।",
        "tgt": "उन्हें 1 दिसंबर को जमानत पर रिहा कर दिया गया था। हम वहां पुराने आदमी को जमानत देने जाते थे।", 
        "tgt_lang": "hi",
        "time_spent_ms": 6797,
        "tmx_phrases": []
      }
    ]}'
    curl --location --request POST 'https://auth.anuvaad.org/anuvaad/content-handler/v0/fetch-content-sentence' \
    --header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6Imt1bWFyLmRlZXBha0B0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkTWVEZzhpUGY3dWJFR21jbDRaNUE3dUo0bEk4VEdJcVpzL3R4ckJZOF

    Modulewise Appendix

    Summary of the purpose of each module and necessary links

    Key API contract: API Contract

    slno | Module name | Purpose | Code location | API contract

    1. user management: Manages the user- and admin-side functionalities in Anuvaad.

    2. file handler: Accepts the file uploaded by the user and stores it in the Samba share for downstream APIs to access. (GitHub | API Contract)

    3. file converter: Consumes the input files and converts them into PDF. Best results are obtained for the file formats supported by LibreOffice. (GitHub | API Contract)

    4. file translator: Transforms the data in the file into a JSON file, and downloads the translated files of type docx, pptx, and html. (GitHub | API Contract)

    5. content handler: Handles and retrieves the contents (final result) of files translated in the Anuvaad system. (GitHub | API Contract)

    6. document converter: Generates the final document after translation and digitization; currently supports pdf, txt, and xlsx document generation. (GitHub | API Contract)

    7. tokenizer: Tokenises the input paragraphs into independently translatable sentences that downstream services consume to translate the entire input. (GitHub | API Contract)

    8. ocr tokenizer: Tokenises the input paragraphs into independently translatable sentences that downstream services consume to translate the entire input. (GitHub | API Contract)

    9. ocr content handler: Handles and manipulates the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system. (GitHub | API Contract)

    10. Aligner: Aligns, i.e. finds similar sentence pairs in, two lists of sentences. (GitHub | API Contract)

    11. workflow manager: Centralized orchestrator that directs the user input through the dataflow pipeline to achieve the desired output. (GitHub | API Contract)

    12. Block merger: Extracts text from a digital document in a structured format (paragraph, image, table), which is then used for translation. (GitHub | API Contract)

    13. translator: A wrapper over NMT; sends the document sentence by sentence to NMT for translation. (GitHub)

    14. word detector: Takes a PDF or image as input; if the input is a PDF, converts it into images, then uses a custom Prima line model for line detection in the image. (GitHub | API Contract)

    15. layout detector: Takes the output of the word detector as input and uses a Prima layout model for layout detection in the image. (GitHub | API Contract)

    16. block segmenter: Takes the output of the layout detector as input and collates lines and words at the layout level. (GitHub | API Contract)

    17. google vision ocr: Takes the output of the block segmenter as input, uses Google Vision as the OCR engine, and collates text at the word, line, and paragraph level. (GitHub | API Contract)

    18. tesseract ocr: Takes the output of the block segmenter as input, uses the Anuvaad OCR model as the OCR engine, and collates text at the word, line, and paragraph level. (GitHub | API Contract)

    19. NMT: Gets the translated content either by invoking the model directly or by fetching it from the Dhruva platform. (GitHub | API Contract)

    20. metrics: Displays analytics. (GitHub | API Contract)

    KT Videos

    Service and KT Video Links

    S. No. | Service | KT Video Link

    1. aai4b-nmt-inference
    2. block-segmenter: Video Link
    3. content-handler: Video Link
    4. etl-document-converter: Video Link
    5. etl-file-translator: Video Link
    6. etl-tokeniser: Video Link
    7. etl-tokeniser-ocr: Video Link
    8. etl-translator: Video Link 1, Video Link 2
    9. etl-wf-manager [critical]: Video Link (summary), Video Link (detailed)
    10. file-converter
    11. layout-detector-prima: Video Link
    12. gv-document-digitization (optional): Video Link
    13. metrics: Video Link 1, Video Link 2
    14. ocr-content-handler: Video Link 1, Video Link 2
    15. ocr-tesseract-server: Video Link 1, Video Link 2
    16. user-fileuploader: Video Link
    17. word-detector-craft: Video Link
    18. docx-download-service [nodejs]: Video Link
    19. Two-Factor Authentication (optional): Video Link
    20. User Management System: Video Link
    21. Sentence Aligner
    22. Zuul: Video Link 1, Video Link 2
    23. Architecture, Config Management, & Git Strategies [critical]: Video Link
    24. Frontend: Video Link