OCR Content handler

This microservice is served with multiple APIs to handle and manipulate the digitized data from anuvaad-gv-document-digitize, which is part of the Anuvaad system. This service is functionally similar to the Content Handler service but differs since the output document (digitized doc) structure varies.

Modules

OCR Document Modules

DigitalDocumentSave

API to save translated documents. The JSON request object is generated from anuvaad-gv-document-digitizer and later updated by tokenizer. This API is being used internally.

Mandatory parameters: files, record_id

Actions:

  • Validating input params as per the policies

  • The document to be saved is converted into blocks of pages

  • Each block contains regions such as line, word, table, etc.

  • Every block is created with UUID

  • Saving blocks in the database

DigitalDocumentSave CURL Request
curl --location --request POST 'http://localhost:5001//anuvaad/ocr-content-handler/v0/ocr/save-document' \
--header 'Content-Type: application/json' \
--data-raw '{ 
    "jobID": "BM-15913540488115873", 
    "state": "INITIATED", 
    "status": "STARTED", 
    "stepOrder": 0, 
    "workflowCode": "abc", 
    "taskID": "vision_ocr1615969391110792", 
    "tool": "GVOCR", 
    "message": "OCR", 
    "metadata": { 
        "module": "WORKFLOW-MANAGER", 
        "receivedAt": 15993163946431696, 
        "sessionID": "4M1qOZj53tIZsCoLNzP0oP", 
        "userID": "d4e0b570-b72a-44e5-9110-5fdd54370a9d" 
    }, 
    "files": [{ 
        "file": { 
            "identifier": "string", 
            "name": "20695.pdf", 
            "type": "json" 
        }, 
        "config": { 
            "language": "en" 
        }, 
        "pages": [ 
            { 
                "identifier": "958b00e5-7864-4a73-a3ed-7640b1c3c1cf", 
                "resolution": 300, 
                "path": "/home/naresh/anuvaad/anuvaad-etl/anuvaad-extractor/document-processor/gv-document-digitization/upload/20695_41c92afd-53fd-4446-aaee-bedd194c59cf/images/206950001-1.jpg", 
                "boundingBox": { 
                    "vertices": [{ 
                        "x": 0, 
                        "y": 0 
                    }, { 
                        "x": 2481, 
                        "y": 0 
                    }, { 
                        "x": 2481, 
                        "y": 3508 
                    }, { 
                        "x": 0, 
                        "y": 3508 
                    }] 
                }, 
                "page_no": 0, 
                "regions": [ 
                    { 
                        "identifier": "1e0f1313-4c2f-47f3-9971-a797452439f8", 
                        "boundingBox": { 
                            "vertices": [{ 
                                "x": 0, 
                                "y": 0 
                            }, { 
                                "x": 2481, 
                                "y": 0 
                            }, { 
                                "x": 2481, 
                                "y": 3508 
                            }, { 
                                "x": 0, 
                                "y": 3508 
                            }] 
                        }, 
                        "class": "BGIMAGE", 
                        "data": "/home/naresh/anuvaad/anuvaad-etl/anuvaad-extractor/document-processor/gv-document-digitization/upload/20695_41c92afd-53fd-4446-aaee-bedd194c59cf/images/206950001-1_bgimages_.jpg" 
                    }
                ]
            }
        ]
    }]
}'

DigitalDocumentUpdateWord

API to update the text in the digitized doc. RBAC enabled.

Mandatory parameters: words, record_id, region_id, word_id, updated_word

Actions:

  • Validating input params as per the policies

  • Looping over the regions to locate the word to be updated

  • Updating the word and setting a flag save=True

DigitalDocumentUpdateWord CURL Request
curl --location --request POST 'https://auth.anuvaad.org/anuvaad/ocr-content-handler/v0/ocr/update-word' \
--header 'userID: d4e0b570-b72a-44e5-9110-5fdd54370a9d' \
--header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6ImphaW55LmpveUB0YXJlbnRvLmNvbSIsInBhc3N3b3JkIjoiYickMmIkMTIkcXFjYUM2WW5yU2RFM2hDT2h4aXpnT0ZILjBxeFR4UWJBTHloZDFjTjBFOWluSnRqaTguOWknIiwiZXhwIjoxNjE2NTcxMjM3fQ.vCOncRM7BNK0qsv0OWnioIDfy-lOusTcMERsusm_ics' \
--header 'Content-Type: application/json' \
--data-raw '{ 
    "words":[ 
        { 
            "record_id":"A_OD10GV-msJYb-1616508492867|0-1616508495552232.json", 
            "region_id":"7df5afdc-6aac-498d-af2c-e73cdf438b90", 
            "word_id":"78f682ba-5571-4099-9055-f51d4d82368a", 
            "updated_word":"Constituency" 
        }
    ] 
}'

DigitalDocumentGet

API to fetch back the document. RBAC enabled.

Mandatory parameters: record_id, start_page, end_page

Actions:

  • Validating input params as per the policies

  • Returning back the document as an array of pages

DigitalDocumentGet CURL Request
curl --location --request GET 'https://auth.anuvaad.org/anuvaad/ocr-content-handler/v0/ocr/fetch-document?recordID=A_FWLBOD15GOT-eAIRP-1632812802745%7C0-16328129323475454.json&start_page=0&end_page=0' \
--header 'auth-token: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyTmFtZSI6ImphaW55LmpveUB0YXJlbnRvLmNvbSIsIkphaW55QDEyMyI6ImInJDJiJDEyJFh4VU9ZbVBGZ1NyMkhuclFZNTVqR2U3a3VmUmRoakxmTTdjU2NLSkxHZVNTZkxBQmJ4UGlPJyIsImV4cCI6MTYzMzA4Mzk2OX0.-hWfzbCR7ErGjK8B8PjnkpvtVBm1Rpavmjast0E4P4I'

Last updated