Anuvaad offers a rich set of features that provide an optimal experience for the end user and smooth the process of document translation. The notable features are highlighted below:
Document digitization is the process of converting physical documents into digital formats, making them easily accessible and editable.
Anuvaad is coupled with custom-trained layout detection models for identifying and comprehending a document's structure, which involves the recognition of key elements including headings, paragraphs, tables, and images. This process is essential not only for enhancing OCR accuracy but also for preserving the document's layout and structure in the translated version.
Document translation involves converting text from one language to another, facilitating cross-lingual communication and information access. Anuvaad supports NMT models served directly from Bhashini's Dhruva platform as well as built-in, plug-and-play models for domain-specific use cases.
This feature ensures that the original formatting, layout, and structure of documents are maintained during the translation process, preserving the document's visual integrity.
Speech to text technology converts spoken language into written text, enabling audio content to be transcribed for translation or other purposes.
Translation memory stores and retrieves previously translated segments to ensure consistency across documents and reduce translation time.
Glossary support provides access to defined terminology and specialised vocabulary, ensuring consistency and precision in translations, particularly in specialised fields.
Usage analytics and metrics offer insights into how the platform is utilised, helping users track and optimise translation processes and workflows.
File format conversion simplifies the process of converting documents from one file format to another while preserving their content and structure, enhancing compatibility.
Transliteration support enables the conversion of text from one script or alphabet to another, aiding users in dealing with different writing systems and ensuring the correct pronunciation of words, especially in multilingual contexts.
Overview
Anuvaad was bootstrapped by EkStep Foundation in late 2019 as a solution to enable easier translation of legal documents from English to Indic languages and vice versa. The platform allows legal entities to digitize and translate orders/judgements using an easy-to-use interface.
Anuvaad leverages state-of-the-art AI/ML models, including NMT, OCR, and layout detection, to provide a high level of accuracy. Project Anuvaad was envisioned as an end-to-end, open-source solution for document translation across multiple domains.
Project Anuvaad is REST API driven, and hence any third-party system can use various features like sentence translation, layout detection, etc.
NOTE: The documentation is still a work in progress. Feel free to contribute to it or raise issues if the desired information is not up to date. Explore the KT videos if you would like to dive deep into each module.
Project Anuvaad is an open-source project funded by EkStep Foundation.
Anuvaad is an AI-based, open-source document translation platform to digitize and translate documents in Indic languages at scale. Anuvaad provides easy-to-edit capabilities on top of the plug-and-play NMT models. Separate instances of Anuvaad are deployed for NCERT, SUVAS, and Amar Vasha.
Follow these steps to set up the Anuvaad Web Application on your local machine:
Clone the Repository:
Navigate to the Project Directory:
Install Dependencies:
or
Environment Variables: Create a .env file in the root directory of the project and configure the necessary environment variables. You can use the .env.example file as a reference.
Start the Development Server:
or
Build the Application:
or
Run Tests:
or
General Guidelines:
Clone the repo and go to the module specific directory.
Run pip3 install -r requirements.txt.
Make necessary changes to the config files with respect to MongoDB and Kafka.
Run python3 src/app.py.
Alternatively, modules could be run by building and running Docker images. Make sure configs and ports are configured as per your local setup.
Build Docker Image:
Run Docker Container:
Various video tutorials demonstrating features and step-by-step instructions to utilize the best out of Anuvaad!
Access the Application: Once the development server is started, you can access the application by navigating to in your web browser.
Note: Apart from this, the Docker images running in the user's environment could be found .
Workflow Manager(WM)
Centralized Orchestrator based on user request.
Auditor
Python package/library used for log formatting and exception handling.
File Uploader
Microservice to upload and maintain user documents.
File Converter
Microservice to convert files from one format to another, e.g. .doc to .pdf.
Aligner
Microservice that accepts source and target sentences and aligns them to form a parallel corpus.
Tokenizer
Microservice that tokenises paragraphs into independently translatable sentences.
Layout Detector
Microservice interface for Layout detection model.
Block Segmenter
Handles layout detection misclassifications and region unification.
Word Detector
Word detection.
Block Merger
An OCR system that extracts text, images, tables, blocks, etc. from the input file and makes them available in a format which can be utilised by downstream services to perform translation. This can also be used as an independent product that performs OCR on files, images, ppts, etc.
Translator
Translator pushes sentences to NMT module, which internally invokes IndicTrans model hosted in Dhruva to translate and push back sentences during the document translation flow.
Content Handler
Repository Microservice which maintains and manages all the translated documents
Translation Memory X(TMX)
System translation memory to facilitate overriding NMT translation with a user-preferred translation. TMX provides three levels of caching: Global, User, Organisation.
User Translation Memory(UTM)
The system tracks and remembers individual users' translations or corrected translations and applies them automatically when the same sentences are encountered again.
Internal modules are integrated through Kafka messaging.
Primary data storage.
Secondary in memory storage.
Cloud Storage
Samba storage is used to store user input files.
Serves as a redirection server and also takes care of system-level configs. Nginx acts as the gateway.
API gateway to apply filters on client requests and to authenticate, authorize, and throttle them.
Layout detection model.
Used for Line detection.
Custom trained Tesseract used for OCR.
Custom trained model used for translation.
open-source platform for serving language AI models at scale.
The project Anuvaad repository serves as the primary codebase for the Anuvaad project, aimed at facilitating document processing and translation tasks efficiently.
anuvaad-api: Houses standalone APIs utilized within the project, such as login and analytics functionalities.
anuvaad-fe: Contains frontend-related code, responsible for the user interface and interaction aspects of the application.
chrome-extension: Hosts code relevant to the Anuvaad Chrome extension, offering additional features and integrations within the Chrome browser environment.
anuvaad-nmt-inference [legacy]: Previously held legacy OpenNMT Python-based inference code. Deprecated and not actively utilized within the current project framework.
anuvaad-etl: Comprises sub-modules dedicated to document processing tasks, enhancing the extraction, transformation, and loading capabilities within the Anuvaad ecosystem.
As an application, the Workflow Manager, in conjunction with independent APIs, forms the foundational architecture of Anuvaad. The Workflow Manager facilitates communication among various modules and orchestrates their interactions. However, Anuvaad's design accommodates diverse use cases, allowing each module to operate autonomously when necessary. For instance, the Tokenizer service can function independently to tokenize an Indic sentence without reliance on other modules.
Each microservice within Anuvaad adheres to a consistent structure, comprising the following common elements:
Dockerfile: Provides instructions to build the individual microservice within a Docker container, ensuring portability and consistency across different environments.
docs Folder: Contains documentation outlining the API contracts necessary for running and testing the module independently. This documentation serves as a reference for developers and users alike.
config Folder: Stores module-specific configurations and secrets required for the proper functioning of the microservice. Centralizing configuration management simplifies deployment and maintenance tasks.
kafkawrapper: Defines Kafka/WFM (Workflow Manager) related communication protocols, facilitating seamless integration and communication between modules. In the production environment, the Workflow Manager plays a crucial role in establishing communication channels, rendering standalone APIs redundant.
Anuvaad follows the standard feature-master type of branching strategy for code maintenance. The releases happen through the master branch via release tags.
Feature branches are a set of branches owned by individual developers in order to work on specific tasks. These branches are forked out of the master branch and they eventually feed into the same master branch once the code for that particular use case is developed and tested. These branches can either be deleted right after merging to master or can be retained to be reused for other use cases.
Feature branches can ONLY be deployed in the ‘Dev’ environment. The ‘Dev’ environment is a dedicated VM for the developers to test their code. Once the code is dev-tested, it must be merged to the ‘develop’ branch which further feeds into the ‘master’ branch.
The ‘Develop’ branch is a mirror branch to the master branch. This branch is dedicated for QA/UAT testing. All feature branches must feed into this branch before the use-case is sent for QA testing and at times UAT if needed. This branch will also act as a backup in case there’s something wrong with the master branch.
The develop branch can ONLY be deployed to the ‘QA’ environment. This is a dedicated environment for the QAs to perform unit, regression, and smoke testing of the features and the app as a whole. This environment can also be used for UAT purposes. Once there’s a QA signoff on the features, this will directly feed into master.
The master branch is the main branch from which all releases happen. All features, once dev-tested and QA-tested, will feed into master via the develop branch. The master branch is from where the code is deployed to production. Every release to production from the master branch will be tagged with the specific version of that release.
In case of production issues, we can fallback to any of the previous stable releases.
Hotfix branches are temporary branches which are forked directly from the ‘master’ branch and will feed back into the master only. These are for special cases when there’s a production bug to be resolved, and the develop branch is at the (n+m)th commit and master at (n)th commit.
These branches will act as temporary mirror branches for the master branch and can be tested on the QA env. Once tested and merged back to master, these branches have to be deleted. After the merge, the develop branch will have to be rebased with master, and the features will have to be rebased with the develop branch. The commits will flow upstream only after a rebase is successfully completed on all the forks.
Apart from the feature branches, individual devs will also own these branches.
Feature Branches: Code check-in to feature branches can be done by anyone; there’s no need for a review as such. These branches are mainly for the devs to test their code. The use case developed in this branch will have to be dev-tested on the ‘Dev’ environment before a merge request to the ‘develop’ branch is raised.
Develop Branch: Code check-in to the develop branch should only happen after a Peer Review. Merge to develop will only happen once the code is dev-tested on the Dev environment. It should be noted that a merge to develop should ensure that the code quality is up to the mark, all standards are followed, and it doesn’t break anything that is already merged to the develop branch by other devs. QA testing must happen on this branch deployed in a dedicated environment for QA/UAT. Any bugs reported will be fixed in the feature branch, reviewed, and then merged back to the develop branch. QA signoff happens on this branch.
Master Branch: Code check-in to the master branch will only happen from the develop branch and NO feature branches. Any merge to the master branch apart from the hotfix branch MUST come from the develop branch only. Merge from develop to master should happen only after an extensive code review from the leads. Only a select few members of the team will have access to merge the code to the master branch. The onus of the master branch is on the Technology leads of the team. Once the code is merged to master, a final round of regression testing must take place before the code is tagged for release.
Hotfix Branch: Code check-in to the hotfix branch can be done by individual devs once it is reviewed by a peer and the leads. This branch feeds into the master only after a second round of review. QA must happen on the hotfix branch before it is merged to master. The merge to master must also be released only after regression testing is done on the fix.
Configs are parameters of a module that can be injected into and out of the system with zero to minimal code change in order to enable/disable/modify certain features of the system. The configs pertaining to the modules of the Anuvaad data-flow pipeline can be broadly classified into two categories:
Configs outside the build (docker image)
Configs within the build. (docker image)
These are the configs that are injected into the system on the fly; the changes can be incorporated at runtime, or with just a restart, without having to rebuild or push a logical piece of code. For instance, WFM reads configs for identifying the different workflows configured. In order to add/edit/delete a workflow, one needs to make the required changes to the config file and push the file; the changes will be incorporated into the system on restart or through a reload API at runtime. (https://raw.githubusercontent.com/project-anuvaad/anuvaad/master/anuvaad-etl/anuvaad-workflow-mgr/config/etl-wf-manager-config-dev.yml) These files are saved in the ‘configs’ folder outside the source code of the system.
These configs travel with the build, meaning they are part of the Docker image. They can be controlled via an environment file during deployment or internally within the code. This also means that any change in these parameters needs a rebuild and redeployment of the system; however, no change in the logic or the code should be needed to incorporate the change. Most of the hooks exposed by a given system fall under this category. These configs are kept in the ‘configs’ folder inside the source code. It is recommended to use just one ***config.py file inside the folder for all these configs, for better maintainability. If someone prefers to separate config files based on concern, they can do so, but they bear the overhead of maintaining them. For convenience and readability, these configs are further divided into:
Cross module common configs
Module specific configs.
These configs are used across all modules. Configs like the Kafka host, Mongo host, file-upload URL, etc. fall under this category, as they do not change from module to module. However, if a module chooses to use different values for these parameters, it can do so by using a different variable. The point of having this category is to avoid creating redundant variables in the environment file and to reuse the variables that are already defined.
For convenience, the module-specific variables are further categorised as:
Kafka Configs: Configs required for kafka like topics, consumer groups, partition keys etc. that are very specific to the module. Some other parameters required to customise your consumer and producer as per your requirement can be mentioned under this category.
Datastore Configs: Configs required for the datastore that is being used which is mostly Mongo in our use case. In case you’re using MySQL, Redis, Elasticsearch etc, mention the required parameters in this category.
Module Configs: All other configs required for your module can be mentioned here.
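As a minimal sketch of this layout (the variable names below are illustrative, not the actual Anuvaad config keys), a module's ***config.py might look like this:

```python
# configs/module_config.py -- illustrative sketch only; real Anuvaad modules use their own key names.
import os

# Cross-module common configs (shared env variables, reused rather than redefined per module)
kafka_bootstrap_server = os.environ.get('KAFKA_BOOTSTRAP_SERVER_HOST', 'localhost:9092')
mongo_server_host = os.environ.get('MONGO_SERVER_HOST', 'mongodb://localhost:27017')

# Kafka configs (module-specific topics, consumer groups, partition keys)
consumer_topic = os.environ.get('MY_MODULE_INPUT_TOPIC', 'my-module-input-v1')
producer_topic = os.environ.get('MY_MODULE_OUTPUT_TOPIC', 'my-module-output-v1')
consumer_group = os.environ.get('MY_MODULE_CONSUMER_GROUP', 'my-module-consumer-group')

# Datastore configs (Mongo in most Anuvaad modules)
mongo_db_name = os.environ.get('MY_MODULE_MONGO_DB', 'my-module-db')
mongo_collection = os.environ.get('MY_MODULE_MONGO_COLLECTION', 'my-module-data')

# Module configs (everything else the module needs)
batch_size = int(os.environ.get('MY_MODULE_BATCH_SIZE', 25))
```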
Note: It is recommended to have most of these parameters deriving values from the environment file only. In some cases, they can also be hard-coded within the code. It is mandatory for every file/class within the project to use these parameters from these variables of the ***config.py file only.
Please ensure the folder structure is perfectly maintained.
Never check in sensitive data like AWS keys, passwords, PII, etc. in the config file; always erase/mask/encrypt them before pushing to GitHub.
Eg:
These configs are specific to the module and will change for each module. This category includes both the common variables inside the code which are used at multiple places in your project (Eg: https://github.com/project-anuvaad/anuvaad/blob/69b494224626d51a7baf0405603106a4a66a25c7/anuvaad-etl/anuvaad-extractor/aligner/etl-aligner/configs/alignerconfig.py#L10) and the variables deriving their value from the environment file (Eg: )
You can check this file for reference:
Summary of the purpose of each module and necessary links
1. user management: Manages the user- and admin-side functionalities in Anuvaad.
2. file handler: The user uploads a file, which is stored in the Samba share for further APIs to access.
3. file converter: Consumes the input files and converts them into PDF. Best results are obtained only for the file formats supported by LibreOffice.
4. file translator: Transforms the data in the file into a JSON file and downloads the translated files of type docx, pptx, and html.
5. content handler: Handles and retrieves the contents (final result) of files translated in the Anuvaad system.
6. document converter: Generates the final document after translation and digitization. Currently supports pdf, txt, and xlsx document generation.
7. tokenizer: Tokenises the input paragraphs into independently translatable sentences which can be consumed by downstream services to translate the entire input.
8. ocr tokenizer: Tokenises the input paragraphs received in the digitization flow into independently translatable sentences which can be consumed by downstream services to translate the entire input.
9. ocr content handler: Handles and manipulates the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system.
10. Aligner: "Aligns", i.e. finds similar sentence pairs from two lists of sentences.
11. workflow manager: Centralized orchestrator which directs the user input through the dataflow pipeline to achieve the desired output.
12. Block merger: Extracts text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.
13. translator: A wrapper over NMT, used to send the document sentence by sentence to NMT for translation.
14. word detector: Takes a PDF or image as input; if the input is a PDF, it is first converted into images. Uses a custom Prima line model for line detection in the image.
15. layout detector: Takes the output of the word detector as input. Uses a Prima layout model for layout detection in the image.
16. block segmenter: Takes the output of the layout detector as input. Collates lines and words at the layout level.
17. google vision ocr: Takes the output of the block segmenter as input. Uses Google Vision as the OCR engine. Collates text at word, line, and paragraph level.
18. tesseract ocr: Takes the output of the block segmenter as input. Uses the Anuvaad OCR model as the OCR engine. Collates text at word, line, and paragraph level.
19. NMT: Gets the translated content either by invoking the model directly or by fetching translated content from the Dhruva platform.
20. metrics: Displays analytics.
Key API contract:
Workflow Manager is the orchestrator for the entire dataflow pipeline.
This document provides details about the Workflow Manager. WFM is the orchestrating module for the Anuvaad pipeline.
WFM is the backbone service of the Anuvaad system. It is a centralized orchestrator which directs the user input through the dataflow pipeline to achieve the desired output, and it maintains a record of all the jobs and all the tasks involved in each job. WFM is the SPOC (single point of contact) for clients to retrieve details, status, error reports, etc. about the jobs executed (sync/async) in the system. Using WFM, we have been able to use Anuvaad not just as a translation platform but also as an OCR, tokenization, and sentence-alignment platform for dataset curation. Every use case in Anuvaad is defined as a ‘Workflow’ in the WFM; these workflow definitions are in the form of a YAML file, which WFM reads as an external configuration file.
WFM Config: This is a YAML file which has a well defined structure to create workflows in the Anuvaad system. Every use-case in Anuvaad is called ‘Workflow’.
Workflow - Set of steps to be executed on a given input to obtain the desired output. Anuvaad has 2 types of workflows: Async WF and Sync WF.
Async WF - These are asynchronous workflows, wherein the modules involved in this flow communicate with each other and the WFM via the kafka queue asynchronously.
Sync WF - These are synchronous workflows wherein the modules involved communicate with each other and the WFM via REST APIs. The client receives responses in real time.
Structure of the config is as follows:
workflowCode: An alphanumeric code that UNIQUELY identifies a workflow. Format: WF_<A/S>_<codes_of_modules_in_sequence>
type: Type of the workflow - ASYNC or SYNC
description: Description of the workflow to explain what the workflow does
useCase: An alphanumeric prefix to the job ID signifying a reference to the workflowCode.
sequence: The set of steps to be defined under the workflow. This is a list of ‘steps’ where each ‘step’ contains keys order, tool & endState.
The ‘tool’ key is the definition of the tool used in the corresponding ‘step’ in the ‘sequence’. Each tool contains keys name, description, kafka-input, topic, partitions, kafka-output. In case of Sync WFs, the tool contains keys name, description, api-details, uri.
Order: Number that defines the order of this step in the sequence. 0 is the value for the first step, 1 being next and so on.
name: Name of the tool
description: Description of the tool
kafka-input: Details of the kafka input for that particular tool. The tool must accept input on this topic from the WFM.
kafka-output: Details of the kafka output for that particular tool. The tool must produce output on this topic to the WFM.
api-details: Details of the API exposed by the tool for WFM to access.
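To make the structure concrete, here is a hypothetical single-step async workflow entry expressed in YAML and loaded in Python. The keys mirror the description above, but the module name, topics, and partition counts are assumptions, not an actual Anuvaad workflow:

```python
# Illustrative only: a hypothetical async workflow entry mirroring the documented keys.
import yaml

WORKFLOW_YAML = """
workflowCodes:
  - workflowCode: WF_A_TK          # hypothetical single-step async workflow
    type: ASYNC
    description: Tokenise an input document
    useCase: A_TK
    sequence:
      - order: 0
        tool:
          - name: TOKENISER
            description: Tokenises paragraphs into sentences
            kafka-input:
              - topic: anuvaad-tokeniser-input-v1       # assumed topic name
                partitions: 1
            kafka-output:
              - topic: anuvaad-tokeniser-output-v1      # assumed topic name
        endState: TOKENISED
"""

config = yaml.safe_load(WORKFLOW_YAML)
for wf in config["workflowCodes"]:
    assert wf["type"] in ("ASYNC", "SYNC")
    for step in wf["sequence"]:
        print(wf["workflowCode"], "step", step["order"], "->", step["tool"][0]["name"])
```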
WFM has 2 types of IDs involved in its jobs that help uniquely identify a job and its intermediate tasks: jobID & taskID.
jobID: An alphanumeric ID that uniquely identifies a job in the system. jobIDs are generated for both Sync and Async jobs. Format: <use_case>-<random_string>-<13-digit epoch time>
taskID: A job contains multiple intermediate tasks; taskID is a unique ID used to identify each of those tasks. A combination of these taskIDs mapped to a given jobID can help trace an entire job through the system. Format: <module_code>-<13-digit epoch time>
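A small illustration of how IDs in these formats could be generated (the helper names are made up for the example):

```python
# Illustrative helpers that follow the documented ID formats.
import time
import uuid

def make_job_id(use_case: str) -> str:
    """<use_case>-<random_string>-<13-digit epoch time>"""
    return f"{use_case}-{uuid.uuid4().hex[:10]}-{int(time.time() * 1000)}"

def make_task_id(module_code: str) -> str:
    """<module_code>-<13-digit epoch time>"""
    return f"{module_code}-{int(time.time() * 1000)}"

print(make_job_id("A_FCBMTKTR"))   # e.g. A_FCBMTKTR-3f9c1a2b4d-1716543210123
print(make_task_id("TOK"))         # e.g. TOK-1716543210123
```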
API Details
WFM exposes multiple APIs for the client to execute and fetch jobs in the Anuvaad system. The APIs are as follows:
/async/initiate: API to execute Async workflows.
/sync/initiate: API to execute Sync workflows.
/configs/search: API to search WFM configs.
/jobs/search: API to search initiated jobs.
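As a hedged sketch, a client could call the async initiation API roughly as follows; the base URL, header name, gateway path prefix, and body fields are assumptions, so refer to the API details linked below for the exact contract:

```python
# Sketch only: headers, path prefix, and payload fields are assumptions; check the WFM API contract.
import requests

BASE_URL = "https://your-anuvaad-host"        # placeholder deployment URL
AUTH_TOKEN = "<jwt-token-from-UMS-login>"     # placeholder

payload = {
    "workflowCode": "WF_A_FCBMTKTR",          # example workflow code, explained below
    "jobName": "sample-document.pdf",
    "files": [
        {"path": "<uploaded-file-id>", "type": "pdf", "locale": "en"}
    ],
}

resp = requests.post(
    f"{BASE_URL}/async/initiate",             # full gateway path prefix is deployment-specific
    json=payload,
    headers={"auth-token": AUTH_TOKEN},
    timeout=30,
)
print(resp.json())  # expected to contain the jobID, which can then be polled via /jobs/search
```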
python 3.7
ubuntu 16.04
Dependencies:
Run:
An example workflowCode: WF_A_FCBMTKTR, where WF = Workflow, A = Async, FC = File Converter, BM = Block Merger, TK = Tokeniser, TR = Translator. Configs can be found here:
Details of the APIs can be found here:
Details of the requests flowing in and out through kafka can be found here:
Workflows have to be configured in a .yaml file as shown in the following document:
A Python package that provides standardized logging and error handling for the Anuvaad dataflow pipeline. This package serves features like session tracing, job tracing, error debugging, and troubleshooting.
Prerequisites:
Python 3.7
Command:
This part of the library provides features for logging by exposing the following functions:
Import file:
Logs INFO level information.
Logs DEBUG level information.
Logs ERROR level information. Should be used for logical errors like “File is not valid”, “File format not accepted” etc.
Logs EXCEPTION level information. Should be used in case of exceptions like “TypeError”, “KeyError” etc.
In all the functions, message and input-object are mandatory.
These functions build an object using these parameters and index them to Elasticsearch for easy tracing.
Ensure all major functions have a log_info call, all exceptions have log_exception calls, and all logical errors have log_error calls.
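For illustration, a minimal usage sketch is shown below; the import path and exact argument lists are assumptions based on the package name, and only the message and the input object are documented as mandatory:

```python
# Sketch only: verify the import path and signatures against the anuvaad-auditor source.
from anuvaad_auditor.loghandler import log_info, log_error, log_exception

def tokenise_file(task_input):
    log_info("Tokenisation started", task_input)
    try:
        if not task_input.get("files"):
            # logical error -> log_error
            log_error("File is not valid", task_input)
            return None
        # ... tokenisation work ...
        log_info("Tokenisation completed", task_input)
        return task_input
    except KeyError:
        # runtime exception -> log_exception
        log_exception("KeyError while reading task input", task_input)
        raise
```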
This part of the library provides features for standardizing and indexing the error objects of the pipeline.
Import file:
Returns a standard error object for replying back to the client during a SYNC call and indexes the error to an error index.
Constructs a standard error object which will be indexed to a different error index and PUSHES THE ERROR TO WFM internally.
Usage Notes:
Use post_error_wf for flows triggered via Kafka or REST through WFM.
Ensure both log functions and error functions are used in case of exceptions or errors.
Errors are indexed to two different indexes: Error index and Audit Index.
Use post_error_wf carefully, as this method will take the entire job to the FAILED state.
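A corresponding sketch for the error functions (the import path and argument lists are again assumptions; only the behaviour described above is taken from the docs):

```python
# Sketch only: import path and argument lists are assumptions.
from anuvaad_auditor.errorhandler import post_error, post_error_wf

def handle_sync_failure(job_input):
    # SYNC flow: build a standard error object to return to the client;
    # the error is also indexed to the error index.
    return post_error("INVALID_FILE", "File format not accepted", job_input)

def handle_async_failure(job_input):
    # Kafka/WFM flow: pushes the error to WFM and moves the whole job to FAILED,
    # so use it only for unrecoverable errors.
    post_error_wf("TRANSLATION_FAILED", "NMT did not return a translation", job_input, None)
```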
anuvaad-auditor==0.1.1: please use this version.
Source code:
This microservice is served with multiple APIs to handle and manipulate the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system. This service is functionally similar to the Content Handler service but differs since the output document (digitized doc) structure varies.
API to save translated documents. The JSON request object is generated from anuvaad-gv-document-digitizer and later updated by the tokenizer. This API is used internally.
Mandatory parameters: files, record_id
Actions:
Validating input params as per the policies
The document to be saved is converted into blocks of pages
Each block contains regions such as line, word, table, etc.
Every block is created with UUID
Saving blocks in the database
API to update the text in the digitized doc. RBAC enabled.
Mandatory parameters: words, record_id, region_id, word_id, updated_word
Actions:
Validating input params as per the policies
Looping over the regions to locate the word to be updated
Updating the word and setting a flag save=True
API to fetch back the document. RBAC enabled.
Mandatory parameters: record_id, start_page, end_page
Actions:
Validating input params as per the policies
Returning back the document as an array of pages
This microservice is used to extract text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.
It takes an image or pdf as an input.
If the input is a pdf, it converts the pdf into images.
The pdftohtml tool is used to extract page-level information like text, word coordinates, page width, page height, tables, images, and others.
If the document language is vernacular, the pdftohtml tool does not work well, so Tesseract (or, alternatively, Google Vision if required) is used for OCR.
Horizontal merging is used to get lines using word coordinates.
Vertical merging is used to get blocks using line coordinates.
The final JSON contains page-level information like page width, page height, paragraphs, lines, words, and layout class.
Input:
Here it takes a PDF or image path as an input and the language of that document.
Input:
Upload a PDF or image file using the upload API:
Get the upload ID and copy it into the path of the block merger's wf-initiate input.
Do a bulk search using the jobID to get the JSON ID from the BM service response:
Bulk search input format:
Download JSON using download API:
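Taken together, the steps above can be sketched with Python requests roughly as follows; the endpoint paths, headers, response shapes, and the workflow code are assumptions and depend on the gateway setup, so treat this as an outline rather than the exact contract:

```python
# Sketch only: paths, headers, workflow code and payload fields are assumptions; see the API contract below.
import requests

BASE = "https://your-anuvaad-host"       # placeholder
HEADERS = {"auth-token": "<jwt-token>"}  # placeholder

# 1. Upload the PDF/image and note the returned upload identifier.
with open("sample.pdf", "rb") as f:
    upload = requests.post(f"{BASE}/upload", files={"file": f}, headers=HEADERS).json()
upload_id = upload["data"]               # assumed response shape

# 2. Initiate the block-merger workflow with the upload id as the file path.
job = requests.post(f"{BASE}/async/initiate", headers=HEADERS, json={
    "workflowCode": "WF_A_BM",           # hypothetical workflow code for block merger alone
    "files": [{"path": upload_id, "type": "pdf", "locale": "en"}],
}).json()
job_id = job["jobID"]                    # assumed response shape

# 3. Poll the WFM bulk search with the jobID to find the output JSON file id.
status = requests.post(f"{BASE}/jobs/search/bulk", headers=HEADERS,
                       json={"jobIDs": [job_id]}).json()

# 4. Once the job is complete, download the JSON via the download API
#    using the file id found in the bulk-search response.
```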
API Contract:
Code location:
URL:
URL:
Upload URL:
Bulk search URL:
Download URL:
This microservice is served with multiple APIs to handle and retrieve the contents (final result) of files translated in the Anuvaad system.
Some common information that applies to the save and update operations on translations:
WF_S_TR and WF_S_TKTR: Change the sentence structure, hence the s0 pair needs to be updated.
DP_WFLOW_S_C: Doesn't change the sentence structure, hence no need to update the s0 pair.
s0_src: Source sentence extracted from the file.
s0_tgt: Sentence translation from NMT.
tgt: Translation updated by the user (user translation). (The source may vary if the user edits the input document; otherwise it remains the same as s0_src.)
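As an illustration of how these fields relate within a sentence entry (the values and the surrounding schema are simplified placeholders, not the exact stored document):

```python
# Illustrative sentence entry only; real blocks carry additional fields.
sentence = {
    "s_id": "a1b2c3d4",                        # sentence identifier (UUID)
    "s0_src": "<source sentence from file>",    # extracted from the uploaded document
    "s0_tgt": "<translation from NMT>",         # machine translation of s0_src
    "tgt": "<translation edited by the user>",  # user translation saved over the NMT output
    "save": True,                               # flag set when the user saves the sentence
}
```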
API to save translated documents. The JSON request object is generated from block-merger and later updated by tokenizer and translator. This API is used internally.
Mandatory parameters: userid, pages, record_id, src_lang, tgt_lang
Actions:
Validating input parameters as per the policies.
The document to be saved is converted into blocks.
Block can be of type images, lines, text_blocks, etc.
Every block is created with UUID.
Saving blocks in the database.
API to fetch back the documents. The response object would be an array of pages, with pagination enabled. RBAC enabled.
Mandatory parameters: start_page, end_page, record_id
Actions:
Validating input parameters as per the policies.
Fetching back the blocks as per the page number requested.
API to update the block content; triggered on split, merge, re-translate operations. Used internally.
Mandatory parameters: record_id, user_id, blocks, workflowCode
Actions:
Validating input parameters as per the policies.
Updating the list of blocks.
Internal API to store S3 link references to translated documents (on docx flow).
Mandatory parameters: job_id, file_link
Actions:
Validating input parameters as per the policies.
Storing records in the database.
API to fetch back the S3 link for docx files. RBAC enabled.
Mandatory parameters: job_ids
Actions:
Validating input parameters as per the policies.
Fetching back the data from the database.
API to store user translations. RBAC enabled.
Mandatory parameters: sentences, workflowCode, user_id
Actions:
Validating input parameters as per the policies.
Updating the sentence blocks.
Saved sentences are always updated with the "save": true flag.
Saved sentences are also saved in the Redis store for Sentence Memory.
Bulk API to fetch back sentences. RBAC enabled.
Mandatory parameters: sentences, record_id, block_identifier, s_id
Actions:
Validating input parameters as per the policies.
Returning back an array of sentences searched for.
This pipeline is used to extract text from a digital/scanned document. Lines and layouts (header, footer, paragraph, table, cell, image) are detected by a custom-trained Prima layout model and OCR is done using the Anuvaad OCR model.
Upload a PDF or image file using the upload API:
Get the upload ID and copy it to the DD2.0 input path.
Initiate the Workflow:
DD2.0 Input:
Input: PDF or image
Output: List of pages with detected lines and page information.
Input: Output of word detector
Output: List of pages with detected layouts and lines.
Input: Output of layout detector
Output: Collation of line and word at layout level.
Input: Output of block segmenter
Output: Text collation at word, line, and paragraph level using Google Vision as the OCR engine.
Input: Output of block segmenter
Output: Text collation at word, line, and paragraph level using Anuvaad OCR model.
Github repo:
API contract:
Upload URL:
WF URL:
Github repo:
API contract:
Upload URL:
WF URL:
Github repo:
API contract:
WF URL:
Github repo:
API contract:
WF URL:
Github repo:
API contract:
WF URL:
WF URL:
Github repo:
API contract:
This microservice is served with multiple APIs to transform the data in the file into a JSON file and to download the translated files of type DOCX, PPTX, and HTML.
Steps:
Transformation Flow
Use the data in DOCX, PPTX, or HTML file to create a JSON file.
Tokenizer Flow
Read the JSON file created in the Transformation Flow and tokenize each paragraph.
Tokenization is a process where we extract all the sentences in a paragraph.
Translation Flow
Translate each sentence.
Steps:
WF_A_FTTKTR Flow
This flow must be completed before calling the download flow.
Download Flow
Fetch content for the file, replace original sentences with translated ones, and download the file.
Mandatory Params for File Translator:
Path
Type
Locale
Actions:
Validate input parameters.
Generate a JSON file from the data of the given file.
Convert the given file to HTML and PDF, and push it to S3 (for showing it on the UI).
Get the S3 link of the converted file and call content handler API to store the link.
Mandatory Params for File Translator:
Path
Type
Locale
Actions:
Validate input parameters.
Call fetch-content to get the translation of the file passed in the param.
Replace the original text in the file with the translated text.
Return the path of the translated file.
This microservice is intended to generate the final document after translation and digitization. It currently supports pdf, txt, and xlsx document generation.
API to create digitized txt & xlsx files for the Translation flow. RBAC enabled.
Mandatory parameters: record_id, user_id, file_type
Actions:
Validating input params as per the policies
Page data is converted into dataframes
Writing the data into file and storing them on Samba store
API to create digitized txt & pdf files in the Document Digitization flow. RBAC enabled.
Mandatory parameters: record_id, user_id, file_type
Actions:
Validating input params as per the policies
Generating the docs using ReportLab
Writing the data into file and storing them on Samba store
UMS is the initial Anuvaad module that facilitates user login and other account-related functionalities. It features admin-level and user-level login. Only the Super Admin has the authority to create new organizations or add new users to the system (apart from self sign-up). The Admin can assign roles to the new users as well.
Whitelisted bulk API to create/register users in the system.
Mandatory params: userName, email, password, roles
Actions:
Validating input params as per the policies
Storing the user entry in the database and assigning a unique ID (userID)
Triggering verification email
Whitelisted API to verify and complete the registration process on Anuvaad.
Mandatory params: userName, userID
Actions:
Validating input params as per the policies
Activating the user
Triggering registration successful email
Whitelisted API for login.
Mandatory params: userName, password
Actions:
Validating input params as per the policies
Issuing auth token (JWT token)
Activating user session
Whitelisted API for logging out.
Mandatory params: userName
Actions:
Validating input params as per the policies
Turning off user session
API to validate auth tokens and fetch back user details.
Mandatory params: token
Actions:
Validating the token
Returning user records matching the token only when the token is active
Same API is used for verifying a token generated on forgot-password as well.
Bulk API to update user details, RBAC enabled.
Mandatory params: userID
Updatable fields: orgID, roles, models, email
Actions:
Validating input params as per the policies
Updating DB records
API for forgot password.
Mandatory params: userName
Actions:
Validating input params as per the policies
Generating reset password link and sending it via email
API to update password, RBAC enabled.
Mandatory params: userName, password
Actions:
Validating input params as per the policies
Generating reset password link and sending it via email
(Only Admin has access)
Bulk API to onboard users to the Anuvaad system.
Mandatory params: userName, email, password, roles
Actions:
Validating input params as per the policies
Storing user entry in the database and assigning a unique userID
User account is verified and activated by default
API for bulk search with pagination property.
Actions:
Validating input params as per the policies
All user records are returned if skip_pagination is set to True.
When no offset and limit are provided, default values are set as per configs.
Only the records matching the search values are returned if skip_pagination is False.
API to update the activation status of a user.
Mandatory params: userName, is_active
Actions:
Validating input params as per the policies
Updating the user activation status
API to fetch active roles in Anuvaad.
Actions:
Returning active role codes
CreateOrganization: Bulk API to upsert organizations.
Mandatory params: code, active
Actions:
Validating input params as per the policies
Creating or deactivating orgs as per the active status in the request
SearchOrganization: API to get organization details.
Actions:
If org_code is given, searches for that organization alone; otherwise, all organizations are returned.
GenerateIdToken: Generating token for web extension user.
Mandatory params: id_token
Actions:
Decrypting and validating the token
If the token is valid, register the user and return auth token
Add APIs with Zuul if they need external access.
Rebuild and deploy UMS whenever a new role is added with Zuul.
Email ID used for system notifications: anuvaad.support@tarento.com
Run the docker container.
Initialize the DB by creating a Super-Admin account directly in the DB.
Additional users can be added from the UI by logging into the super admin account.
Create an account (Admin is preferred) using the API anuvaad/user-mgmt/v1/users/create.
Get the verification token from the email (second-last ID on the ‘verify now’ link) or the userID from the user table.
Complete the registration process by calling the anuvaad/user-mgmt/v1/users/verify-user API.
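A hedged sketch of the account creation and verification calls with Python requests; the endpoint paths come from the steps above, while the body fields, role structure, and response handling are assumptions:

```python
# Sketch only: body fields and role structure are assumptions; endpoint paths are from the docs.
import requests

BASE = "https://your-anuvaad-host"   # placeholder

# Create an (admin) account -- bulk API, hence a list of users.
requests.post(f"{BASE}/anuvaad/user-mgmt/v1/users/create", json={
    "users": [{
        "userName": "admin@example.com",
        "email": "admin@example.com",
        "password": "<strong-password>",
        "roles": [{"roleCode": "ADMIN"}],   # role structure is an assumption
    }]
})

# Complete registration with the userName and the verification token / userID
# taken from the verification email or directly from the user table.
requests.post(f"{BASE}/anuvaad/user-mgmt/v1/users/verify-user", json={
    "userName": "admin@example.com",
    "userID": "<verification-token-or-userID>",
})
```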
This document provides details about the translator service used in Anuvaad. Translator is a wrapper over the NMT and is used to send sentence by sentence to NMT for translation of the document.
Translator receives its input from the tokeniser module; the input is a JSON file that contains tokenised sentences. These tokenised sentences are extracted from the JSON file and then sent to NMT over Kafka for translation. NMT expects a batch of ‘n’ sentences in one request, so the Translator creates ‘m’ batches of ‘n’ sentences each and pushes them to the NMT input topic. In parallel, it also listens to NMT’s output topic to receive the translations of the batches sent. Once all ‘m’ batches are received back from NMT, the translation of the document is marked complete.
Next, the Translator appends these translations back to the JSON file received from the Tokeniser. This JSON, now enriched with a translation against every sentence, is pushed to the Content Handler via an API, and the Content Handler then stores these translations.
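The batching step can be pictured roughly as below; the topic name, message fields, and batch size are assumptions, and kafka-python is used purely for illustration:

```python
# Rough illustration of batching tokenised sentences for NMT over Kafka.
# Topic name and message fields are assumptions, not the actual Anuvaad contract.
import json
from kafka import KafkaProducer

BATCH_SIZE = 25  # 'n' sentences per NMT request (value is illustrative)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def push_batches(job_id, sentences, src_lang="en", tgt_lang="hi"):
    # Split the tokenised sentences into 'm' batches of 'n' and push each to the NMT input topic.
    for i in range(0, len(sentences), BATCH_SIZE):
        batch = sentences[i:i + BATCH_SIZE]
        producer.send("nmt-input-topic", {          # assumed topic name
            "jobID": job_id,
            "batch_id": i // BATCH_SIZE,
            "src_lang": src_lang,
            "tgt_lang": tgt_lang,
            "sentences": batch,
        })
    producer.flush()
```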
TMX is the system translation memory. A user can decide to override Anuvaad’s translation of a text/sentence by inserting ‘preferred translations’ into the system. TMX is backed by a Redis store which hashes and stores user-specific translations for a text. It can be thought of as the user’s personal cache of translations.
TMX provides three levels of caching: Global, Org, User.
Global: A global bucket of preferred translations where an ADMIN or a global-level user can feed in translations which will be applied across all users and orgs.
Org: An org-level bucket where Anuvaad translations are overridden by preferred translations only for users who belong to a particular organisation. Any ADMIN or org-level user can feed in these translations to be applied across the users of his/her org.
User: A user-level bucket where a user can feed in his/her preferred translations, and the system will override Anuvaad translations only for that particular user.
TMX can be uploaded sentence by sentence or in bulk, both APIs are supported.
UTM is the User Translation Memory. It is slightly different from TMX: there are no levels; it is purely a translation cache. The system remembers the user's translation and applies it automatically when it encounters the same sentence for the same user.
Let’s say we have a sentence ‘S1’ in a document ‘D1’, and Anuvaad’s translation of this sentence is ‘T1’. The user, on encountering this, changes the translation of ‘S1’ from ‘T1’ to ‘T2’. Anuvaad now remembers this, so that in any other document, say ‘D2’, whenever ‘S1’ appears and NMT translates it to ‘T1’, Anuvaad automatically overrides the translation to ‘T2’. However, if NMT gets better with time and now translates ‘S1’ to ‘T3’, Anuvaad doesn’t override it, because the user’s correction was made in the context of ‘S1’ -> ‘T1’ and not ‘S1’ -> ‘T3’.
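The override rule can be sketched as a keyed lookup where the key includes the NMT output, so a changed NMT translation (‘T3’) no longer matches the remembered context. This is a conceptual sketch, not Anuvaad's actual Redis schema:

```python
# Conceptual sketch of the UTM override rule; not the actual storage schema.
import hashlib

utm_store = {}  # stands in for the Redis-backed store

def _key(user_id, source, nmt_translation):
    raw = f"{user_id}|{source}|{nmt_translation}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def remember_correction(user_id, s1, t1, t2):
    # The user changed NMT output T1 for sentence S1 to T2.
    utm_store[_key(user_id, s1, t1)] = t2

def apply_utm(user_id, s1, nmt_out):
    # Override only when the NMT output still matches the remembered context (S1 -> T1).
    return utm_store.get(_key(user_id, s1, nmt_out), nmt_out)

remember_correction("u1", "S1", "T1", "T2")
print(apply_utm("u1", "S1", "T1"))   # -> T2 (overridden)
print(apply_utm("u1", "S1", "T3"))   # -> T3 (NMT improved; no override)
```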
The Tokenizer submodule in Anuvaad is designed to break down paragraphs into sentences or words, facilitating efficient preprocessing and accurate translations. This submodule is integral for preparing text data for subsequent processing and translation tasks.
Paragraph to Sentence Tokenization: Splits paragraphs into individual sentences, making the text easier to process.
Sentence to Word Tokenization: Breaks down sentences into individual words for detailed analysis and translation.
Document-Specific Handling: Manages document-specific symbols and special characters to ensure consistency and accuracy in tokenization.
Flexible Integration: Can be invoked independently as a standalone service or as part of a larger workflow through the Workflow Manager.
The Tokenizer can be utilized in two main ways:
Independent Invocation:
As an independent service, the Tokenizer can be directly called to process text data. This is useful for isolated tasks where only tokenization is required.
Workflow Manager Integration:
Within the Workflow Manager, the Tokenizer works as a part of the broader document processing and translation pipeline. This integration allows for seamless interaction with other Anuvaad submodules, ensuring smooth and efficient data flow.
By employing the Tokenizer submodule, Anuvaad ensures that text data is meticulously prepared, contributing to the overall accuracy and efficiency of the document processing and translation workflow.
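Anuvaad's tokenizer applies its own language- and document-specific rules; purely as an illustration of paragraph-to-sentence tokenization, a naive regex-based split looks like this:

```python
# Naive illustration only; Anuvaad's tokenizer handles Indic punctuation,
# abbreviations, and document-specific symbols far more carefully.
import re

def split_into_sentences(paragraph: str):
    # Split after ., !, ? or the Devanagari danda (।), keeping the delimiter.
    parts = re.split(r"(?<=[.!?।])\s+", paragraph.strip())
    return [p for p in parts if p]

para = "The appeal is allowed. Costs are awarded to the appellant. No further orders."
print(split_into_sentences(para))
# ['The appeal is allowed.', 'Costs are awarded to the appellant.', 'No further orders.']
```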
API Contract:
Code:
Feature Branch name: user-mangement_feature
API Contract:
Email templates are available .
Example:
You can find the code for the Tokenizer submodule in the Anuvaad repository at the following link:
For detailed information about the API endpoints and their usage, refer to the API contract available at:
When Anuvaad is implemented at an organizational level, analytics is crucial for tracking usage and metrics. A dedicated module exists to serve this purpose.
Every X hours, a cron job creates a CSV file, and the analytics that need time-consuming computation are drawn from it. For other metrics, data is fetched directly from the database. The following analytics are currently available, with room for more metrics to be visualized:
Total Documents Translated, Language-wise
Organization-wise Sentences Translated
Organization-wise Dashboard
Reviewer Metrics
Code Repository:
API Contract:
The Aligner module is designed for “aligning” or finding similar sentence pairs from two lists of sentences, preferably in different languages. The Aligner is a standalone service that cannot be accessed from the UI as of now. The service is dependent on the file uploader and workflow manager (WFM) services.
The Aligner service is based on Google’s LaBSE model and FB’s FAISS algorithm. It accepts two files as inputs, from which two lists of sentences are collected. LaBSE Embeddings are calculated for each of the sentences in the list. Cosine similarity between embeddings is calculated to find meaningfully similar sentence pairs. The FAISS algorithm is used to dramatically speed up the whole process.
The service accepts two text files, and the aligner module can ideally be invoked using WFM. It is time-consuming and hence an async service. Once the run is fully done, a WFM-based search can be conducted using the job ID to obtain the result.
The response is typically a JSON file path, which can be downloaded using the download API. The JSON file is self-explanatory and contains source_text, target_text, and the corresponding cosine similarity between them.
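A condensed sketch of the underlying idea using the sentence-transformers LaBSE checkpoint and FAISS; the real aligner additionally applies thresholds and filtering heuristics, so treat this as a simplification:

```python
# Simplified sketch of LaBSE + FAISS alignment; the actual aligner adds thresholds and filtering.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

src_sentences = ["The appeal is allowed.", "The order is set aside."]
tgt_sentences = ["आदेश रद्द किया जाता है।", "अपील स्वीकार की जाती है।"]

# Normalised embeddings -> inner product equals cosine similarity.
src_emb = model.encode(src_sentences, normalize_embeddings=True)
tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)

index = faiss.IndexFlatIP(tgt_emb.shape[1])
index.add(np.asarray(tgt_emb, dtype="float32"))

scores, ids = index.search(np.asarray(src_emb, dtype="float32"), k=1)
for i, (score, j) in enumerate(zip(scores[:, 0], ids[:, 0])):
    print(src_sentences[i], "<->", tgt_sentences[j], f"(cosine={score:.2f})")
```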
Clone the Repo
Install dependencies
Run the application
Access from local:
It returns a JOB ID, which can be searched using the WFM Bulk search API to see job progress and pull out results once done.
WF_A_JAL is the workflow code for the JSON-based aligner, which returns the filepath of a JSON file that can be downloaded using the download API.
WF_A_AL is the old workflow code, which returns multiple txt files.
Upload two files.
Call API endpoint with file paths as parameters.
Verify if sentences are matching properly in the JSON.
Can be used as an independent service by deploying the file-uploader and aligner modules alone on a server, preferably GPU-based (tested to work well on g4dn.2xlarge).
Simplified implementations of the aligner could be found .
An explanatory article could be found and .
This module provides the NMT-based translation service for various Indic language pairs. Currently, the NMT models are trained using the OpenNMT-py framework (version 1), and the model binaries are generated using the ctranslate2 module provided for OpenNMT-py; the same module is used to generate model predictions.
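At inference time, translation with a CTranslate2-converted OpenNMT-py model looks roughly like the sketch below; the model paths and the SentencePiece tokenisation step are assumptions, and the actual service wires this logic behind its REST/Kafka interface:

```python
# Simplified sketch of inference with a CTranslate2-converted OpenNMT-py model.
# Model paths and the SentencePiece tokenisation are assumptions.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("models/en-hi-ct2")               # placeholder model dir
sp_src = spm.SentencePieceProcessor(model_file="models/src.model")    # placeholder
sp_tgt = spm.SentencePieceProcessor(model_file="models/tgt.model")    # placeholder

def translate_batch(sentences):
    # Tokenise, translate the whole batch, then detokenise the best hypothesis.
    tokens = [sp_src.encode(s, out_type=str) for s in sentences]
    results = translator.translate_batch(tokens)
    return [sp_tgt.decode(r.hypotheses[0]) for r in results]

print(translate_batch(["The appeal is allowed."]))
```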
NMT requires a parallel corpus between languages. Typically, the size of a language corpus is in the millions. The corpus must have enough examples to cover various situations. This is one of the most important parts of the system and a very time-consuming piece of work, where the quality of data has to be checked to ensure the accuracy of translation. At Anuvaad, we have collected data for 11 languages as parallel corpora. The corpus is available under the MIT license.
Training and retraining is a continuous process, and training depends on the quality of the input dataset. The quality of translation has to be constantly monitored. Translation mistakes should be used to generate training examples, and retraining exercises have to be taken up periodically. The training cycle is a costly affair, as it needs GPU infrastructure and long training hours.
The model output is evaluated on pre-selected sentences and a BLEU score is calculated. The BLEU score acts as guidance and provides feedback on the model quality. The translation output has to be evaluated by human translators as well before it can be used in a production environment.
We are leveraging an open-source project called "OpenNMT" and also exploring "Fairseq" (IndicTrans) from the perspective of enhancement and usage. The deep learning platform used is PyTorch.
Vocabulary or dictionary generation
Tokenizer (detokenizer): breaking a given sentence into words or sub-words (language specific); Moses or IndicNLP (for Indian languages)
BPE (Byte Pair Encoding)
Unigram
Tune model parameters and hyper parameters to improve accuracy.
Opennmt-py based
Fairseq based
python 3.6
ubuntu 16.04
Install various python libraries as mentioned in requirements.txt file
Run app.py to start the service with all the packages installed
This page will help the user to get themselves onboarded on Anuvaad and perform an operation.
Once a user reaches the Sign Up page, they have to fill in the required details as shown below:
Upon successful submission, an E-Mail will be sent to the registered ID with a verification link
Clicking the verification link will redirect the user to the login page as shown below
For security purposes, Anuvaad follows an OTP-based login mechanism. You will be asked for the E-Mail to which a one-time password will be sent.
Upon confirming the ID, you will be asked to opt for one of the authentication methods.
Everything discussed above is a one-time process and is applicable only on the initial login to the application. Going forward, you will be redirected to the OTP verification page upon successful login.
Upon providing the correct OTP, the user will reach the below landing page of Anuvaad and we are good to go!
The translate sentence feature enables the user to input a text and instantaneously get its translation in another language. To use it, simply click on the Translate Sentence option on the landing page of Anuvaad.
The user has to select the source language in which the input is provided and the target language into which the text must be translated.
If an Indic language is selected as the source, the Transliteration feature is enabled by default, assisting the user to type in that particular language from a normal keyboard with ease.
Upon clicking submit, the translated sentence will be displayed along with the model used to perform the translation. Using this feature, a stakeholder can quickly check the accuracy of the translation performed by Anuvaad.
Digitize Document feature helps to convert scanned documents into editable digital format by preserving the structure. This process recognizes text in scanned (non hand-written for now) documents and converts it into searchable text.
To perform a document digitization, click on the Digitize Document option on the landing page of Anuvaad. You will be greeted with a screen as below
User may select a document/image and choose the appropriate source language and then trigger the digitization process. A pop-up window appears which shows the progress of the ongoing process. This is an async job and will happen in the background. The time taken will be dependent on the nature of the uploaded file
If the status of a job is completed, you can view the result and make changes by clicking on the view document icon, which is second in the last column under the label Action.
Users can make changes, if any, by double-clicking on the word. Once done, the digitized document can be downloaded in the desired format by clicking on the download button in the top-right corner.
The translate document feature enables the user to upload a document and get its translated version. The key highlight of the feature is that Anuvaad tries to maintain the original structure of the document in the best possible manner.
To perform a document translation, click on the Translate Document option on the landing page of Anuvaad. You will be greeted with a screen as below
Here, the user will have the provision to upload a document (pdf,docx,pptx formats are supported as of now) and select a source and target language to perform translation. On successful upload of a supported file, the process begins and status will be shown
There are various stages happening behind the scenes of document translation, and the status will be displayed on screen. The total time taken to complete the process depends on the number of pages and the structure of the input document. The translation is an async process; once initiated, it is performed in the background. Users can keep adding more tasks to Anuvaad in the meantime.
Users can make the necessary changes to the document using Anuvaad's easy editor, which was developed with document translators at the forefront, and later download the translated document back to their system in the desired format.
The blue icon in the bottom-right corner provides the merge feature; it can be used when two or more text blocks need to be combined into a single unit.
Being an AI-assisted translation software, some occurrences can happen where a machine translation can give meaningful, yet non-contextual output to certain words/phrases. In order to work around this, Anuvaad offers user-level Glossary support.
A translator can store certain phrases and their predefined translations so that, if a similar phrase occurs in the document being translated, the system acts accordingly based on the predefined criteria.
Anuvaad also offers a built-in Analytics feature to keep track of usage metrics. These Analytics are instance-specific. Bar graph-based representations offer quick insights into how well Anuvaad is utilized. This also helps stakeholders to keep track of the number of Documents processed, Languages used, sentences translated, and organization-level information. Furthermore, these data could be exported in the desired format for future reference.
Anuvaad uses the current state-of-the-art Transformer model to achieve target sentence prediction/translation. The supporting code and paper are in the open-source domain.
SentencePiece or subword-nmt is used for subword tokenization. The supporting code and paper are in the open-source domain.
For more information about api documentation, please check @
The automated onboarding process is disabled for now to restrict resource usage. For the time being, please fill out the details and we will get back to you soon with the login details (remember to check spam as well). Please send a mail in case you don't receive any response within a day.
Users shall onboard themselves on Anuvaad via the link below: Registration:
The status of all tasks given by user to Anuvaad can be viewed on the as below
After a certain time, by going to the , the user can access the translated document.
The list of default Glossaries and added ones can be viewed on page.
To delete a glossary, simply click on the bin icon under the Action column. To add an extra item, click on the button in the top right corner. You will be redirected to the following page where new words/phrases and their corresponding translation can be added. Once submitted, the new items will be visible on page.
Integration
This is the very first step: users should register at . Anuvaad will send a verification email; please verify the email before starting to make any API calls. Registration is a one-time activity.
It is recommended to translate batches of sentences in order to get high throughput.
/aai4b-nmt-inference/v1/translate
Translate batches of sentences.
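A hedged example of calling the batch endpoint with Python requests; the request/response field names, header, and model identifier are assumptions, so check the API documentation referenced above for the exact schema:

```python
# Sketch only: field names and the model id are assumptions; see the API documentation for the contract.
import requests

BASE = "https://your-anuvaad-host"            # placeholder
HEADERS = {"auth-token": "<api-key-or-jwt>"}  # placeholder

payload = {
    "src_list": [                             # batch the sentences for higher throughput
        {"src": "The appeal is allowed."},
        {"src": "The order is set aside."},
    ],
    "source_language_code": "en",
    "target_language_code": "hi",
    "model_id": 100,                          # hypothetical model identifier
}

resp = requests.post(f"{BASE}/aai4b-nmt-inference/v1/translate",
                     json=payload, headers=HEADERS, timeout=60)
print(resp.json())
```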
This document explains how Netflix Zuul is used as an API Gateway in Anuvaad to perform Authentication, Authorization and API redirection to all inbound API calls. Zuul is an Open Source Project.
Zuul is an API gateway developed as an open-source project by Netflix. Zuul provides various features to abstract out some of the common operations of the system and provide a strong layer for authentication, authorization, API pre & post hooks, API throttling, session monitoring, and much more. Zuul is an edge service that proxies requests to multiple backing services. It provides a unified “front door” to your system, which allows a browser, mobile app, or other user interface to consume services from multiple hosts without managing cross-origin resource sharing (CORS) and authentication for each one.
Zuul in Anuvaad is a config-driven implementation where APIs, roles, and role actions are read by Zuul from a file stored in a remote repository.
The set of roles defined in the system; these are attached to users and also mapped to the APIs to provide role-based access control (RBAC).
Set of APIs exposed in the system by various microservices. Each action is an API which will be mapped against the roles. APIs are of 2 types: Open APIs and Closed APIs. Open APIs can be accessed without authentication, in other words: these APIs are whitelisted. Closed APIs can only be accessed after auth checks.
Mapping between the roles and actions, Zuul uses this to decide if the User should be allowed to access a particular API. These configs can be found here:
Anuvaad uses JWT auth tokens for authentication and authorization purposes. The same token is also used as the session ID. These tokens are generated and stored securely by a UMS system. Example:
Anuvaad Zuul uses 3 pre-filters, namely Correlation, Auth, and Rbac.
Correlation: Filter to add a correlation ID to the inbound request.
Auth: Filter to perform the authentication check on the inbound request.
Rbac: Filter to perform the authorization check on the inbound request.
API redirection configuration is provided in the
In Anuvaad, all the services are orchestrated through the Workflow Manager using the respective workflow configs defined for each service, and communication between these services is done by means of Kafka. Each service has its own functionality and is not dependent on the outputs of predecessor services. Two main flows:
1. Translation - Single/Block Translation or Document (.pdf, .docx, .pptx) Translation.
2. Digitization - OCR on documents (.pdf or images).
WFM is the backbone service of the Anuvaad system; it is a centralized orchestrator which directs the user input through the dataflow pipeline to achieve the desired output. It maintains a record of all the jobs and all the tasks involved in each job. WFM is the SPOC for the clients to retrieve details, status, error reports etc. about the jobs executed (sync/async) in the system. Using WFM, we’ve been able to use Anuvaad not just as a Translation platform but also as an OCR platform, Tokenization platform and Sentence Alignment platform for dataset curation. Every use-case in Anuvaad is defined as a ‘Workflow’ in the WFM. These workflow definitions are in the form of a YAML file, which is read by WFM as an external configuration file.
This microservice serves multiple APIs to manage the User and Admin side functionalities in Anuvaad.
If the document is a .pdf:
FILE-UPLOADER -> FILE-CONVERTER -> BLOCK-MERGER -> TOKENISER -> TRANSLATOR -> CONTENT-HANDLER
If the document is a .docx or .pptx:
FILE-UPLOADER -> FILE-TRANSLATOR -> TOKENISER -> TRANSLATOR -> CONTENT-HANDLER
V1.0
FILE-UPLOADER -> FILE-CONVERTER -> GOOGLE-VISION-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER
V1.5
FILE-UPLOADER -> FILE-CONVERTER -> WORD-DETECTOR -> LAYOUT-DETECTOR -> BLOCK-SEGMENTER -> GOOGLE-VISION-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER
V2.0
FILE-UPLOADER -> FILE-CONVERTER -> WORD-DETECTOR -> LAYOUT-DETECTOR -> BLOCK-SEGMENTER -> TESSERACT-OCR -> OCR-TOKENISER -> OCR-CONTENT-HANDLER
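As a rough illustration of how WFM walks a job through one of the pipelines listed above, the Python sketch below passes a job record through an ordered list of step names. It is purely conceptual: the step implementations are stand-ins, and the real WFM reads YAML workflow definitions and dispatches steps to the respective microservices over Kafka.

# Conceptual sketch only: step names mirror the pipelines above, but plain function calls
# stand in for the YAML workflow config and Kafka-based dispatch used in the real system.
from typing import Any, Callable, Dict, List

STEP_REGISTRY: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "FILE-UPLOADER": lambda job: {**job, "file_path": "/samba/in.pdf"},   # stand-in step
    "FILE-CONVERTER": lambda job: {**job, "pages": ["page1.png"]},        # stand-in step
    "WORD-DETECTOR": lambda job: {**job, "words": [["sample"]]},          # stand-in step
    # ... remaining steps omitted for brevity
}

def run_workflow(steps: List[str], job: Dict[str, Any]) -> Dict[str, Any]:
    # Pass the job record through each configured step in order, tracking progress.
    for name in steps:
        job = STEP_REGISTRY[name](job)
        job.setdefault("completed_steps", []).append(name)
    return job

v2_digitization = ["FILE-UPLOADER", "FILE-CONVERTER", "WORD-DETECTOR"]
print(run_workflow(v2_digitization, {"job_id": "A-0001"}))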
The user uploads the file, and the file is stored in the Samba share for the subsequent APIs to access.
This microservice, which is a Kafka consumer service, consumes the input files and converts them into PDF. Best results are obtained for the file formats supported by LibreOffice.
If the document format is .pdf, then Block Merger is used for OCR on the document.
It is used to extract text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.
If the document format is .docx or .pptx, then the File Translator service is used.
This microservice serves multiple APIs to transform the data in the file into a JSON file and to download the translated files of type docx, pptx and html.
This service tokenises the input paragraphs into independently translatable sentences, which can be consumed by downstream services to translate the entire input. Regular expressions and libraries such as NLTK are used to build this tokeniser.
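For intuition, here is a minimal sentence-tokenisation sketch using NLTK. It is not the Anuvaad tokeniser itself, which layers additional regular-expression rules and Indic-specific handling on top; it only illustrates the core paragraph-to-sentence step.

# Minimal paragraph -> sentence tokenisation with NLTK (illustrative, not Anuvaad's actual rules).
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)       # sentence tokenizer model
nltk.download("punkt_tab", quiet=True)   # required instead on newer NLTK releases

paragraph = (
    "The appeal was filed on 12.01.2021. The Hon'ble Court heard both parties. "
    "Judgment was reserved."
)
for sentence in sent_tokenize(paragraph):
    print(sentence)   # each sentence can now be translated independently downstream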
Translator receives input from the Tokeniser module; the input is a JSON file that contains tokenised sentences. These tokenised sentences are extracted from the JSON file and then sent to NMT over Kafka for translation. NMT expects a batch of ‘n’ sentences in one request, so Translator creates ‘m’ batches of ‘n’ sentences each and pushes them to the NMT input topic. In parallel, it listens to the NMT’s output topic to receive the translations of the batches sent. Once all ‘m’ batches are received back from the NMT, the translation of the document is marked complete. Next, Translator appends these translations back to the JSON file received from the Tokeniser. This JSON, now enriched with a translation against every sentence, is pushed to Content Handler via API, and Content Handler then stores these translations.
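A simplified sketch of the batching-and-publish step described above, using the kafka-python client. The topic name, batch size and message schema are assumptions made for illustration and do not reflect the service's actual contract.

# Sketch of splitting tokenised sentences into batches and pushing them to an NMT input topic.
# Topic name, batch size and message schema are assumed.
import json
from kafka import KafkaProducer

NMT_INPUT_TOPIC = "nmt-input"   # assumed topic name
BATCH_SIZE = 25                 # 'n' sentences per batch

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def push_batches(job_id, sentences):
    # Create 'm' batches of up to BATCH_SIZE sentences and publish each to the NMT topic.
    batches = [sentences[i:i + BATCH_SIZE] for i in range(0, len(sentences), BATCH_SIZE)]
    for idx, batch in enumerate(batches):
        producer.send(NMT_INPUT_TOPIC, {"job_id": job_id, "batch_no": idx, "sentences": batch})
    producer.flush()
    return len(batches)   # the Translator waits until this many batches return on the output topic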
This microservice serves multiple APIs to handle and retrieve the contents (final result) of files translated in the Anuvaad system.
The input is a PDF or an image. If the input is a PDF, it is first converted into images. A custom PRIMA line model is used for line detection in the image. The output is a list of pages; each page includes a list of lines along with page information (page path, page resolution).
Takes the output of the word detector as input. A PRIMA layout model is used for layout detection in the image. Layout classes: Paragraph, Image, Table, Footer, Header, Maths formula. The output is a list of pages; each page includes a list of layouts and a list of lines.
Takes the output of the layout detector as input and collates lines and words at the layout level.
Takes the output of the block segmenter as input, uses Google Vision as the OCR engine, and collates text at word, line and paragraph level.
Takes the output of the block segmenter as input, uses the Anuvaad OCR model as the OCR engine, and collates text at word, line and paragraph level.
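The production services wrap custom models and a dedicated Tesseract server, but the word/line collation idea can be illustrated with plain Tesseract via pytesseract. This is a stand-in sketch, not the Anuvaad OCR model.

# Illustration of OCR plus line-level collation using pytesseract (a stand-in, not Anuvaad's OCR stack).
import pytesseract
from PIL import Image
from pytesseract import Output

image = Image.open("page1.png")   # one page image produced by the block segmenter
# 'hin+eng' assumes the Hindi and English traineddata files are installed for Tesseract.
data = pytesseract.image_to_data(image, lang="hin+eng", output_type=Output.DICT)

# Collate recognised words into lines using the (block, paragraph, line) numbers Tesseract reports.
lines = {}
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
    lines.setdefault(key, []).append(word)

for key in sorted(lines):
    print(" ".join(lines[key]))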
This service tokenises the input paragraphs into independently translatable sentences, which can be consumed by downstream services to translate the entire input. Regular expressions and libraries such as NLTK are used to build this tokeniser.
This microservice serves multiple APIs to handle and manipulate the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system. This service is functionally similar to the Content Handler service but differs because the structure of the output (digitized) document varies.
This module is for “aligning”, that is, finding similar sentence pairs from two lists of sentences, preferably in different languages. The service depends on the File Uploader and Workflow Manager (WFM) services. The Aligner service is based on Google’s LaBSE model and Facebook’s FAISS library. It accepts two files as inputs, from which two lists of sentences are collected. LaBSE embeddings are calculated for each of the sentences in the lists, and cosine similarity between embeddings is calculated to find meaningfully similar sentence pairs. The FAISS algorithm is used to speed up the whole process dramatically.
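A condensed sketch of the alignment idea using the LaBSE checkpoint from sentence-transformers and a FAISS inner-product index (cosine similarity on normalised embeddings). The similarity threshold and model handle are illustrative choices, not the Aligner service's actual configuration.

# Sketch of LaBSE + FAISS sentence alignment (threshold and model handle are illustrative).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

en_sentences = ["India is my country", "The hearing is adjourned"]
hi_sentences = ["सुनवाई स्थगित कर दी गई है", "भारत मेरा देश है"]

# With normalised embeddings, inner product equals cosine similarity.
en_emb = model.encode(en_sentences, normalize_embeddings=True)
hi_emb = model.encode(hi_sentences, normalize_embeddings=True)

index = faiss.IndexFlatIP(hi_emb.shape[1])
index.add(np.asarray(hi_emb, dtype="float32"))

scores, ids = index.search(np.asarray(en_emb, dtype="float32"), k=1)
for i in range(len(en_sentences)):
    if scores[i, 0] > 0.8:   # illustrative similarity threshold
        print(en_sentences[i], "<->", hi_sentences[ids[i, 0]], f"(score={scores[i, 0]:.2f})")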
This microservice is intended to generate the final document after translation and digitization. It currently supports pdf, txt and xlsx document generation.
1. aai4b-nmt-inference
2. block-segmenter
3. content-handler
4. etl-document-converter
5. etl-file-translator
6. etl-tokeniser
7. etl-tokeniser-ocr
8. etl-translator
9. etl-wf-manager [critical]
10. file-converter
11. layout-detector-prima
12. gv-document-digitization (optional)
13. metrics
14. ocr-content-handler
15. ocr-tesseract-server
16. user-fileuploader
17. word-detector-craft
18. docx-download-service [nodejs]
19. Two Factor Authentication (optional)
20. User Management System
21. Sentence Aligner
22. Zuul
23. Architecture, Config Management, & Git Strategies [critical]
24. Frontend
Login and auth token
Before making any other API call, the application should call this API first to receive an authorization token. Once an application has a valid token, the same token can be used to make all subsequent calls. Please note that the token can expire, so it is good practice to validate the token.
/v1/users/login
To log in the user. Requires the user's email and password.
/v1/users/auth-token-search
To check validity of the token
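A minimal sketch of the login and token-validation calls with Python requests. The request body field names and the response field holding the token are assumptions; check the API documentation for the exact schema.

# Sketch of obtaining and validating an auth token (field names are assumed).
import requests

ANUVAAD_HOST = "https://<anuvaad-host>"   # replace with the actual API host endpoint

resp = requests.post(
    f"{ANUVAAD_HOST}/v1/users/login",
    json={"email": "user@example.com", "password": "********"},   # assumed body shape
    timeout=30,
)
resp.raise_for_status()
token = resp.json().get("token")   # assumed response field; reuse this token for subsequent calls

# Tokens can expire, so validate before reuse (payload shape again assumed).
check = requests.post(
    f"{ANUVAAD_HOST}/v1/users/auth-token-search",
    json={"token": token},
    timeout=30,
)
print(check.status_code)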
Supported Language pairs and translation models
Integrating applications first have to fetch supported language pairs (source and target language) along with respective translation model identifiers. These two parameters are mandatory before calling the translation API.
/v2/fetch-models
To get the list of models and supported languages.
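A short sketch of fetching the supported language pairs and model identifiers before translation. Whether the endpoint is a GET or a POST, and the shape of the response, are assumptions here; adjust to the actual API documentation.

# Sketch of listing supported models and language pairs (HTTP method and response shape are assumed).
import requests

ANUVAAD_HOST = "https://<anuvaad-host>"
AUTH_TOKEN = "<jwt-token-from-login>"

resp = requests.get(
    f"{ANUVAAD_HOST}/v2/fetch-models",
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for model in resp.json().get("data", []):   # assumed: a list of model records
    print(model.get("model_id"), model.get("source_language_name"), "->", model.get("target_language_name"))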
Frequently asked questions about Anuvaad
API HOST ENDPOINT
This is the host URL for making API calls; all the APIs mentioned should be called against this endpoint.
The NMT module is responsible for the translation of sentences. It can be invoked directly or via the Workflow Manager. The NMT module works in correlation with the ETL Translator to enhance translation efficiency based on previous translations or pre-provided glossary and TMX support (refer to other sections). The module supports batch inferencing and provides APIs that return model details for language and other dropdown menus.
In the early days of Anuvaad, OpenNMT-py based models trained on Anuvaad's proprietary data were used. These models were primarily focused on judicial content. The inference code for this initial version is available here: .
With the collaboration between Anuvaad and , data from Anuvaad and other sources were used to publish the Samanantar paper (https://arxiv.org/abs/2104.05596). Using the Samanantar dataset, IndicTrans, a more general domain model, was trained. This model performed well for legal use cases, leading to the replacement of OpenNMT with . The IndicTrans-based inferencing code is available here: .
As the Sunbird ecosystem developed, the need for hosting multiple ML models independently became resource-intensive. This led to the development of , a centralized platform for hosting models. Applications can now utilize models from Dhruva using APIs. In Dhruva, models are wrapped with NVIDIA Triton, facilitating a scalable architecture. The IndicTrans model was moved to Dhruva, and currently, models are invoked from Dhruva via wrapper APIs from the NMT module rather than using dedicated inference. The Dhruva-ported code is available here: .
Briefly explains retraining a translation model to accommodate a domain-specific use case.
The production environment of Anuvaad runs on top of translation models trained in the general domain, which covers a good number of scenarios. However, if we need a separate instance of Anuvaad to translate domain-specific data (e.g. financial, biomedical), the existing model must be fine-tuned with more relevant data from that particular domain to improve the accuracy of translation. This page briefly summarises how it can be done.
A bilingual, or parallel, dataset is required for training the model. It is simply the same sentence pair in both the source and target languages. Example:
Source(en): India is my country
Target(hi): भारत मेरा देश है
The more data available, the more accurately the model can be trained. In short, data collection can be done using one of the following three approaches:
This is laborious but the best approach. At least a small sample of the dataset must be manually curated and used for validation purposes.
Very often, datasets will be made available by certain research institutes or private vendors. This data can also be included to increase the quantity of training data.
Raw data that is purchased or web-scraped may contain a lot of noise, which can affect training accuracy and thereby the translation. Noise includes unwanted characters, blank spaces, bullets and numbering, HTML tags, etc.
Additional cleaning rules can be applied based on the context and manual verification of the raw data, as the scenario demands.
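A tiny example of the kind of rule-based cleaning described above; the regular expressions are illustrative, and the rules actually used in Anuvaad's scripts differ by context.

# Illustrative cleaning of noisy corpus lines (example rules only, not Anuvaad's exact script).
import re

def clean_line(text):
    text = re.sub(r"<[^>]+>", " ", text)                      # strip HTML tags
    text = re.sub(r"^\s*(\d+[\.\)]|[•\-\*])\s*", "", text)    # drop leading bullets/numbering
    text = re.sub(r"\s+", " ", text)                          # collapse repeated whitespace
    return text.strip()

print(clean_line("  3) <b>India is my country</b>   "))   # -> "India is my country"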
Certain websites have the same data in multiple languages. The idea is to find matching pairs of sentences from them. Scraping can be done using frameworks such as , and sentence matching can be done using techniques such as . If used properly, this method can produce huge amounts of data without much manual effort; however, random manual verification is recommended to ensure data accuracy.
A lot of sample crawlers are available for reference in this
For sentence matching of scraped sentences, the Anuvaad Aligner, which is implemented using LaBSE, can also be used. The specs for it are available
The basic script for sentence alignment, cleaning and formatting is available
The present default model of Anuvaad is IndicTrans. The instructions to retrain and benchmark an IndicTrans model are explained
The training repo of legacy openNMT-py models is available
Once a model is retrained, if there are plans to open-source it, hosting it in will facilitate seamless integration with Anuvaad.