Block merger

This microservice is used to extract text from a digital document in a structured format (paragraph, image, table), which is then used for translation purposes.


  • It takes an image or pdf as an input.

  • If the input is a pdf, it converts the pdf into images.

  • The pdftohtml tool is used to extract page-level information like text, word coordinates, page width, page height, tables, images, and others.

  • If the document language is vernacular, the pdftohtml tool does not work well, so we use Tesseract (Google Vision) for OCR.

  • Horizontal merging is used to get lines using word coordinates.

  • Vertical merging is used to get blocks using line coordinates.

  • The final JSON contains page-level information like page width, page height, paragraphs, lines, words, and layout class.

  • API Contract: here

  • Code location: here


API Details

Local Testing



    "input": {
        "files": [
                "locale": "en",
                "path": "1.pdf",
                "type": "pdf"
    "jobID": "BM-15913540488115873",
    "state": "INITIATED",
    "status": "STARTED",
    "stepOrder": 0,
    "workflowCode": "abc",
    "tool": "BM",
    "metadata": { 
        "module": "WORKFLOW-MANAGER",
        "receivedAt": 15993163946431696,
        "sessionID": "4M1qOZj53tIZsCoLNzP0oP",
        "userID": "d4e0b570-b72a-44e5-9110-5fdd54370a9d"

Here it takes a PDF or image path as an input and the language of that document.

Workflow Initiate



    "workflowCode": "WF_A_BM",
    "files": [
            "path": "763b0d80-4e82-423f-a432-23ddffe5ad92.pdf",
            "type": "pdf",
            "locale": "en"


  1. Upload a PDF or image file using the upload API:

    Upload URL:

  2. Get the upload ID and copy that to the path of wf-initiate input of the block merger.

  3. Do bulk search using jobIDs to get JSON ID of the BM service response:

    Bulk search URL:

    Bulk search input format:

        "jobIDs": ["A_B-MtrjS-1640069221694"],
        "taskDetails": true
  4. Download JSON using download API:

    Download URL:

Last updated