Block merger
This microservice extracts text from a digital document in a structured format (paragraphs, images, tables); the structured output is then used for translation.
It takes an image or a PDF as input.
If the input is a PDF, it first converts the PDF into images.
The pdftohtml tool is used to extract page-level information such as text, word coordinates, page width, page height, tables, and images.
If the document language is vernacular, the pdftohtml tool does not work well, so we use Tesseract (or, alternatively, GV if required) for OCR.
Horizontal merging is used to get lines using word coordinates.
Vertical merging is used to get blocks using line coordinates.
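The two merging steps above can be illustrated with a minimal Python sketch. This is not the service's actual implementation: the `Box` type, the function names, and the `min_overlap` and `max_gap` thresholds are all assumptions made for illustration. Words whose vertical extents overlap are merged into a line, and consecutive lines separated by a small vertical gap are merged into a block.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical bounding box for a word, line, or block."""
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def _union(boxes: List[Box], sep: str) -> Box:
    """Return one box covering all given boxes, joining their text with sep."""
    left = min(b.left for b in boxes)
    top = min(b.top for b in boxes)
    right = max(b.right for b in boxes)
    bottom = max(b.bottom for b in boxes)
    return Box(sep.join(b.text for b in boxes), left, top, right - left, bottom - top)


def merge_horizontal(words: List[Box], min_overlap: float = 0.5) -> List[Box]:
    """Group words into lines using word coordinates.

    A word joins an existing line when their vertical overlap is at least
    min_overlap times the smaller of the two heights (assumed heuristic).
    """
    groups: List[List[Box]] = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        for group in groups:
            line_top = min(w.top for w in group)
            line_bottom = max(w.bottom for w in group)
            overlap = min(line_bottom, word.bottom) - max(line_top, word.top)
            if overlap >= min_overlap * min(word.height, line_bottom - line_top):
                group.append(word)
                break
        else:
            groups.append([word])
    lines = [_union(sorted(g, key=lambda w: w.left), " ") for g in groups]
    return sorted(lines, key=lambda l: l.top)


def merge_vertical(lines: List[Box], max_gap: float = 0.6) -> List[Box]:
    """Group lines into blocks using line coordinates.

    A line joins the current block when the gap to the previous line is at
    most max_gap times the previous line's height (assumed heuristic).
    """
    groups: List[List[Box]] = []
    for line in sorted(lines, key=lambda l: l.top):
        if groups:
            prev = groups[-1][-1]
            if line.top - prev.bottom <= max_gap * prev.height:
                groups[-1].append(line)
                continue
        groups.append([line])
    return [_union(g, "\n") for g in groups]
```

Calling `merge_vertical(merge_horizontal(words))` on a page's word boxes yields one box per block, with the lines inside each block separated by newlines. A production implementation would likely also consider horizontal alignment, font size, and layout class, none of which this sketch handles.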
The final JSON contains page-level information like page width, page height, paragraphs, lines, words, and layout class.
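To make the nesting concrete, here is a small sketch of how such a page record could be assembled and serialized. Every key name below (`page_width`, `page_height`, `paragraphs`, `lines`, `words`, `class`, and the coordinate fields) is a hypothetical choice for illustration; the real field names are defined by the API contract, not by this example.

```python
import json
from typing import Any, Dict, List


def make_line(words: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a line record from word records (hypothetical schema)."""
    return {"text": " ".join(w["text"] for w in words), "words": words}


def make_paragraph(layout_class: str, lines: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a paragraph record carrying its layout class and its lines."""
    return {
        "class": layout_class,  # e.g. paragraph, image, or table
        "text": " ".join(l["text"] for l in lines),
        "lines": lines,
    }


def make_page(page_width: int, page_height: int,
              paragraphs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the page-level record: page dimensions plus paragraphs."""
    return {
        "page_width": page_width,
        "page_height": page_height,
        "paragraphs": paragraphs,
    }


# Example: one page containing one paragraph with a single two-word line.
words = [
    {"text": "Hello", "left": 10, "top": 10, "width": 40, "height": 12},
    {"text": "world", "left": 55, "top": 10, "width": 40, "height": 12},
]
page = make_page(595, 842, [make_paragraph("paragraph", [make_line(words)])])
serialized = json.dumps(page, indent=2)
```

The point of the sketch is the hierarchy (page, then paragraphs, then lines, then words), which mirrors the merge order described above.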
API Contract:
Code location:
Input:
The service takes the path of a PDF or image file, along with the language of the document, as input.
Input:
Upload a PDF or image file using the upload API:
Copy the returned upload ID into the path field of the block merger's wf-initiate input.
Run a bulk search using jobIDs to get the JSON ID of the block merger (BM) service response:
Bulk search input format:
Download the JSON using the download API:
URL:
URL:
Upload URL:
Bulk search URL:
Download URL: