Architecture

Components

Component
Details
Workflow Manager (WM)
Centralized orchestrator that coordinates the other services based on the user request.
Auditor
Python package/library used for formatting and exception handling.
File Uploader
Microservice to upload and maintain user documents.
File Converter
Microservice to convert files from one format to another, e.g. .doc to .pdf.
Aligner
Microservice that accepts source and target sentences and aligns them to form a parallel corpus.
Tokenizer
Microservice that tokenises paragraphs into independently translatable sentences.
Layout Detector
Microservice interface for the layout detection model.
Block Segmenter
Handles layout detection misclassifications and region unification.
Word Detector
Word detection.
Block Merger
An OCR system that extracts text, images, tables, blocks, etc. from the input file and makes them available in a format that downstream services can use to perform translation. It can also be used as an independent product that performs OCR on files, images, PPTs, etc.
Translator
During the document translation flow, Translator pushes sentences to OpenNMT, which translates them and pushes the results back.
Content Handler
Repository microservice that maintains and manages all translated documents.
Translation Memory X (TMX)
System translation memory that allows NMT translations to be overridden with user-preferred translations. TMX provides three levels of caching: Global, User, and Organisation.
User Translation Memory (UTM)
Tracks and remembers each user's translations or corrected translations and applies them automatically when the same sentences are encountered again.
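
The TMX and UTM rows describe overriding NMT output with stored translations at three scopes. The following is a minimal Python sketch of that idea, not the project's implementation. The class name `TranslationMemory`, the function `apply_memory`, the in-memory dictionaries, and the lookup precedence (User first, then Organisation, then Global) are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


class TranslationMemory:
    """Hypothetical layered memory with User, Organisation, and Global scopes."""

    def __init__(self) -> None:
        # Plain dictionaries stand in for whatever store the real system uses.
        self._user: Dict[Tuple[str, str], str] = {}
        self._org: Dict[Tuple[str, str], str] = {}
        self._global: Dict[str, str] = {}

    def remember(
        self,
        source: str,
        target: str,
        user_id: Optional[str] = None,
        org_id: Optional[str] = None,
    ) -> None:
        """Store a preferred translation at the narrowest scope supplied."""
        if user_id is not None:
            self._user[(user_id, source)] = target
        elif org_id is not None:
            self._org[(org_id, source)] = target
        else:
            self._global[source] = target

    def lookup(self, source: str, user_id: str, org_id: str) -> Optional[str]:
        """Return a stored translation, or None if no scope has one."""
        # Assumed precedence: most specific scope wins.
        candidates = (
            self._user.get((user_id, source)),
            self._org.get((org_id, source)),
            self._global.get(source),
        )
        for hit in candidates:
            if hit is not None:
                return hit
        return None


def apply_memory(
    source: str, nmt_output: str, memory: TranslationMemory, user_id: str, org_id: str
) -> str:
    """Prefer a remembered translation over the NMT output when one exists."""
    override = memory.lookup(source, user_id, org_id)
    return nmt_output if override is None else override
```

In this sketch, a correction saved with `memory.remember(source, corrected, user_id="u1")` would be returned by `apply_memory` the next time user `u1` submits the same sentence, while other users would still receive the organisation-level, global, or NMT result.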

AI/ML Assets

Component
Details
PRIMA
Layout detection model.
Used for OCR in Document Digitization v1.0 and v1.5. Replaced with custom-trained Tesseract in the latest versions.
CRAFT
Used for line detection.
Tesseract
Custom-trained Tesseract used for OCR.
OpenNMT
Custom-trained OpenNMT used for translation.
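
The Translator row in Components describes sentences being pushed to OpenNMT and pushed back once translated. The sketch below illustrates that round trip in Python under stated assumptions: two `queue.Queue` objects stand in for the messaging layer, `nmt_worker` and `translate_document` are hypothetical names, and the `model` callable is a placeholder for the real OpenNMT model. It shows only the request/response pattern, not the actual integration.

```python
import queue
import threading
from typing import Callable, Dict, List


def nmt_worker(
    requests: queue.Queue, responses: queue.Queue, model: Callable[[str], str]
) -> None:
    """Consume sentence messages, translate them, and publish the results."""
    while True:
        message = requests.get()
        if message is None:  # Sentinel: no more sentences to translate.
            break
        responses.put({"id": message["id"], "target": model(message["source"])})


def translate_document(sentences: List[str], model: Callable[[str], str]) -> List[str]:
    """Push every sentence to the worker and collect translations in order."""
    requests: queue.Queue = queue.Queue()
    responses: queue.Queue = queue.Queue()
    worker = threading.Thread(target=nmt_worker, args=(requests, responses, model))
    worker.start()

    # Each message carries an id so results can be matched to their sentence.
    for index, sentence in enumerate(sentences):
        requests.put({"id": index, "source": sentence})
    requests.put(None)
    worker.join()

    translated: Dict[int, str] = {}
    while not responses.empty():
        message = responses.get()
        translated[message["id"]] = message["target"]
    return [translated[index] for index in range(len(sentences))]
```

Tagging each message with an id is a design choice of this sketch: it lets the caller reassemble the document even if responses arrive out of order.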

Technology Stack

Component
Details
Kafka
Translator and OpenNMT are integrated through Kafka messaging.
MongoDB
Primary data storage.
Redis
Secondary in-memory storage.
Cloud Storage
Samba storage is used to store user input files.
NGINX
Serves as a redirection server and also handles system-level configuration. NGINX acts as the gateway.
Zuul
API gateway that applies filters to client requests in order to authenticate, authorize, and throttle them.
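
The Zuul row lists three kinds of request filtering: authentication, authorization, and throttling. The Python sketch below shows how such a filter chain could be ordered. It does not use or reproduce Zuul's own filter API (Zuul is a Java gateway). The names `Request`, `Rejected`, `Throttle`, and `handle`, the token and permission tables, and the sliding-window limit are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

# Hypothetical lookup tables; a real gateway would query an auth service.
TOKENS: Dict[str, str] = {"tok-alice": "alice"}
PERMISSIONS: Dict[str, Set[str]] = {"alice": {"/translate"}}


@dataclass
class Request:
    token: Optional[str]
    path: str
    client_id: str


class Rejected(Exception):
    """Raised by a filter to stop the chain with an HTTP-style status."""

    def __init__(self, status: int) -> None:
        super().__init__(status)
        self.status = status


def authenticate(request: Request) -> None:
    if request.token not in TOKENS:
        raise Rejected(401)


def authorize(request: Request) -> None:
    user = TOKENS[request.token]  # Safe: authenticate runs first in the chain.
    if request.path not in PERMISSIONS.get(user, set()):
        raise Rejected(403)


class Throttle:
    """Sliding-window limiter: at most `limit` requests per window per client."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits: Dict[str, List[float]] = {}

    def __call__(self, request: Request) -> None:
        now = self.clock()
        recent = [t for t in self._hits.get(request.client_id, [])
                  if now - t < self.window]
        if len(recent) >= self.limit:
            self._hits[request.client_id] = recent
            raise Rejected(429)
        recent.append(now)
        self._hits[request.client_id] = recent


def handle(request: Request, filters: List[Callable[[Request], None]]) -> int:
    """Run the filters in order; return 200 if every filter passes."""
    try:
        for check in filters:
            check(request)
    except Rejected as exc:
        return exc.status
    return 200
```

Running authentication before authorization and throttling means unauthenticated traffic is rejected before any per-client state is touched; a different ordering may suit other deployments.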