Block merger

This microservice extracts text from a digital document in a structured format (paragraphs, images, tables). The extracted output is then used for translation.

Architecture

  • It takes an image or PDF as input.

  • If the input is a PDF, it converts the PDF into images.

  • The pdftohtml tool is used to extract page-level information like text, word coordinates, page width, page height, tables, images, and others.

  • If the document is in a vernacular language, the pdftohtml tool does not work well, so Tesseract (or GV as an alternative, if required) is used for OCR.

  • Horizontal merging is used to get lines using word coordinates.

  • Vertical merging is used to get blocks using line coordinates.

  • The final JSON contains page-level information like page width, page height, paragraphs, lines, words, and layout class.

  • API Contract: here

  • Code location: here
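
The horizontal and vertical merging steps above can be illustrated with a minimal sketch. This is not the service's actual code: the Box type, the function names, and the thresholds (min_overlap, max_gap) are assumptions made for illustration. Words are grouped into lines when their boxes overlap vertically, and lines are grouped into blocks when the vertical gap between them is small.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical bounding box for a word, line, or block (page coordinates)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def _overlap_ratio(a: Box, b: Box) -> float:
    """Vertical overlap of two boxes, relative to the height of the shorter one."""
    overlap = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    return overlap / shorter if shorter > 0 else 0.0


def _join(boxes: List[Box], sep: str) -> Box:
    """Merge boxes into one box that encloses all of them."""
    return Box(
        text=sep.join(b.text for b in boxes),
        left=min(b.left for b in boxes),
        top=min(b.top for b in boxes),
        right=max(b.right for b in boxes),
        bottom=max(b.bottom for b in boxes),
    )


def merge_horizontal(words: List[Box], min_overlap: float = 0.5) -> List[Box]:
    """Horizontal merging: group words that share a vertical band into lines."""
    groups: List[List[Box]] = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        for group in groups:
            if _overlap_ratio(_join(group, " "), word) >= min_overlap:
                group.append(word)
                break
        else:
            # No existing line overlaps this word, so it starts a new line.
            groups.append([word])
    lines = [_join(sorted(g, key=lambda w: w.left), " ") for g in groups]
    return sorted(lines, key=lambda line: line.top)


def merge_vertical(lines: List[Box], max_gap: float = 10.0) -> List[Box]:
    """Vertical merging: group consecutive lines with a small gap into blocks."""
    blocks: List[List[Box]] = []
    for line in sorted(lines, key=lambda l: l.top):
        if blocks and line.top - blocks[-1][-1].bottom <= max_gap:
            blocks[-1].append(line)
        else:
            blocks.append([line])
    return [_join(b, "\n") for b in blocks]


if __name__ == "__main__":
    sample = [
        Box("Hello", 10, 10, 50, 22),
        Box("world", 55, 11, 95, 23),
        Box("Footer", 10, 100, 60, 112),
    ]
    for block in merge_vertical(merge_horizontal(sample)):
        print(repr(block.text), (block.left, block.top, block.right, block.bottom))
```

A real implementation would likely derive the thresholds from font size or line height rather than fixed values, and would also need to handle multi-column layouts, which this sketch ignores.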

Modules

API Details

Local Testing

URL: http://0.0.0.0:5001/anuvaad-etl/block-merger/v0/merge-blocks

Input:

The request takes the path of a PDF or image file and the language of that document.

Workflow Initiate

URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

Input:

Steps:

  1. Upload a PDF or image file using the upload API:

    Upload URL: https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file

  2. Get the upload ID from the upload response and set it as the path in the block merger's wf-initiate input.

  3. Run a bulk search using the jobIDs to get the JSON ID from the BM service response:

    Bulk search URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/jobs/search/bulk

    Bulk search input format:

  4. Download JSON using download API:

    Download URL: https://auth.anuvaad.org/download/0-1640069280533983.json
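
The last step can be sketched in Python as follows. This is a rough illustration, not the documented client: it assumes the JSON ID returned by the bulk search is the file name at the end of the sample download URL, and that any authentication headers the server may require are added separately. The helper names build_download_url and download_json are hypothetical.

```python
import json
import urllib.request

# Base path taken from the sample download URL in step 4.
DOWNLOAD_BASE = "https://auth.anuvaad.org/download"


def build_download_url(json_id: str) -> str:
    """Assumption: the download URL is the base path followed by the JSON ID."""
    return f"{DOWNLOAD_BASE}/{json_id.lstrip('/')}"


def download_json(json_id: str, timeout: float = 30.0):
    """Fetch and parse the block merger output for the given JSON ID.

    Authentication, if required by the server, is not handled here.
    """
    url = build_download_url(json_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, build_download_url("0-1640069280533983.json") reproduces the sample download URL shown in step 4.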
