# Document Digitization

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXdLr95e_bBkqYMvWo8RyV2_NMgNTCwC17gLh4a_MPNCyEsF7xQ0fKVDvhqsCS6ZIVc6Q08iY6fC-nD2YHTR8DARlAQzBNJrey4MtdWnAj8IE96ahZKSeARgg8ZtglkRiHfmlOEIqpkUIibBNdzxwyaOR-so?key=JvwCIxJ4D3_YxZg0iSqTdA" alt=""><figcaption><p>Documet Digitization</p></figcaption></figure>

This pipeline is used to extract text from a digital/scanned document. Lines and layouts (header, footer, paragraph, table, cell, image) are detected by a custom-trained Prima layout model and OCR is done using the Anuvaad OCR model.

**Github repo:** [Anuvaad Document Processor](https://github.com/project-anuvaad/anuvaad/tree/develop/anuvaad-etl/anuvaad-extractor/document-processor)

**API contract:** [API Contract](https://github.com/project-anuvaad/anuvaad/blob/develop/anuvaad-etl/anuvaad-extractor/document-processor/ocr/ocr-tesseract-server/doc/google-vision-api-contract.yml)

## How to Use

1. **Upload a PDF or image file using the upload API:**

   **Upload URL:** <https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file>

   Get the upload ID and copy it to the DD2.0 input path.
2. **Initiate the Workflow:**

   **WF URL:** <https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate>

   **DD2.0 Input:**

   ```json
   {
       "files": [
           {
               "locale": "language",
               "path": "file_name",
               "type": "file_format",
               "config": {
                   "OCR": {
                       "option": "HIGH_ACCURACY",
                       "language": "language"
                   }
               }
           }
       ],
       "workflowCode": "WF_A_FCWDLDBSOD20TESOTK"
   }
   ```

## Microservices

### Word Detector

* **Input:** PDF or image
* **Output:** List of pages with detected lines and page information.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXekE4Zz0G__TeyG_JG67w98bDuNVsIcHDaVCTKD3jgTE-E7BiKg5gMUWaad5kfk8H7yGkqZGoeKQPa7smrDIQ-MnhB1Mnb8A4WAWAK3VmO6zhvXpzZspRqwLSdHZNHXF4r3cfaT9jcuoNGUV7zrtC5-hoaj?key=JvwCIxJ4D3_YxZg0iSqTdA" alt=""><figcaption><p>sample</p></figcaption></figure>

**Github repo:** [Word Detector Craft](https://github.com/project-anuvaad/anuvaad/tree/develop/anuvaad-etl/anuvaad-extractor/document-processor/word-detector/craft)

**API contract:** [Word Detector API Contract](https://github.com/project-anuvaad/anuvaad/blob/develop/anuvaad-etl/anuvaad-extractor/document-processor/word-detector/craft/doc/word-detector-craft-api-contract.yml)

<details>

<summary>How to use: Word Detector</summary>

1. **Upload a PDF or image file using the upload API:**

   **Upload URL:** <https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file>
2. **Initiate the Word Detector Workflow:**

   **WF URL:** <https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate>

   **Word Detector Input:**

   ```json
   {
       "files": [
           {
               "locale": "language",
               "path": "file_name",
               "type": "file_format",
               "config": {
                   "OCR": {
                       "option": "HIGH_ACCURACY",
                       "language": "language"
                   }
               }
           }
       ],
       "workflowCode": "WF_A_WD"
   }
   ```

</details>

### Layout Detector

* **Input:** Output of word detector
* **Output:** List of pages with detected layouts and lines.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcpfqBrfX8OiFT1mghhzudtUBBADa-kEgPrnQeXIMmKskaurmUZq6MNW_Sw-srnupPjHC1ss22dP9hllflBneXcFp_NE5raHaGbpjWAMBG3MPfwNg_jFVhx7lo8xeY8v507Mh26RqnbAGUWZrTR4WWRKnHx?key=JvwCIxJ4D3_YxZg0iSqTdA" alt=""><figcaption><p>sample</p></figcaption></figure>

**Github repo:** [Layout Detector Prima](https://github.com/project-anuvaad/anuvaad/tree/develop/anuvaad-etl/anuvaad-extractor/document-processor/layout-detector/prima)

**API contract:** [Layout Detector API Contract](https://github.com/project-anuvaad/anuvaad/blob/develop/anuvaad-etl/anuvaad-extractor/document-processor/layout-detector/prima/doc/page-layout-prima-api-contract.yml)

<details>

<summary>How to use: Layout Detector</summary>

1. **Input JSON file of the word detector as an input path.**
2. **Initiate the Layout Detector Workflow:**

   **WF URL:** <https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate>

   **Layout Detector Input:**

   ```json
   {
       "files": [
           {
               "locale": "language",
               "path": "word_detector_output",
               "type": "json",
               "config": {
                   "OCR": {
                       "option": "HIGH_ACCURACY",
                       "language": "language"
                   }
               }
           }
       ],
       "workflowCode": "WF_A_LD"
   }
   ```

</details>

### Block Segmenter

* **Input:** Output of layout detector
* **Output:** Collation of line and word at layout level.

**Github repo:** [Block Segmenter](https://github.com/project-anuvaad/anuvaad/tree/develop/anuvaad-etl/anuvaad-extractor/document-processor/block-segmenter)

**API contract:** [Block Segmenter API Contract](https://github.com/project-anuvaad/anuvaad/blob/develop/anuvaad-etl/anuvaad-extractor/document-processor/block-segmenter/docs/block-sementer-api-contract.yml)

<details>

<summary>How to use: Block Segmenter</summary>

1. **Input JSON file of the layout detector as an input path.**
2. **Initiate the Block Segmenter Workflow:**

   **WF URL:** <https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate>

   **Block Segmenter Input:**

   ```json
   {
       "files": [
           {
               "locale": "language",
               "path": "layout_detector_output",
               "type": "json",
               "config": {
                   "OCR": {
                       "option": "HIGH_ACCURACY",
                       "language": "language"
                   }
               }
           }
       ],
       "workflowCode": "WF_A_BS"
   }
   ```

</details>

* **Input:** Output of block segmenter
* **Output:** Text collation at word, line, and paragraph level using Google Vision as the OCR engine.

### Tesseract OCR

* **Input:** Output of block segmenter
* **Output:** Text collation at word, line, and paragraph level using Anuvaad OCR model.

**Github repo:** [OCR Tesseract Server](https://github.com/project-anuvaad/anuvaad/tree/develop/anuvaad-etl/anuvaad-extractor/document-processor/ocr/ocr-tesseract-server)

**API contract:** [Google Vision API Contract](https://github.com/project-anuvaad/anuvaad/blob/develop/anuvaad-etl/anuvaad-extractor/document-processor/ocr/ocr-tesseract-server/doc/google-vision-api-contract.yml)

<details>

<summary>How to use: Tesseract OCR</summary>

1. **Input JSON file of the block segmenter as an input path.**
2. **Initiate the Tesseract OCR Workflow:**

   **WF URL:** <https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate>

   **Tesseract OCR Input:**

   ```json
   {
       "files": [
           {
               "locale": "language",
               "path": "block_segmenter_output",
               "type": "json",
               "config": {
                   "OCR": {
                       "option": "HIGH_ACCURACY",
                       "language": "language"
                   }
               }
           }
       ],
       "workflowCode": "WF_A_OD20TES"
   }
   ```

</details>

### Google OCR (Tesseract Alternative)

<details>

<summary>How to use: Google Vision OCR</summary>

1. **Input JSON file of the block segmenter as an input path.**
2. **Initiate the Google OCR Workflow:**

   **WF URL:** <https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate>

   **Google OCR Input:**

   ```json
   {
       "files": [
           {
               "locale": "language",
               "path": "block_segmenter_output",
               "type": "json",
               "config": {
                   "OCR": {
                       "option": "HIGH_ACCURACY",
                       "language": "language"
                   }
               }
           }
       ],
       "workflowCode": "WF_A_OTES"
   }
   ```

</details>

**Github repo:** [OCR Google Vision Server](https://github.com/project-anuvaad/anuvaad/tree/develop/anuvaad-etl/anuvaad-extractor/document-processor/ocr/ocr-gv-server)

**API contract:** [Google Vision API Contract](https://github.com/project-anuvaad/anuvaad/blob/develop/anuvaad-etl/anuvaad-extractor/document-processor/ocr/ocr-gv-server/doc/google-vision-api-contract.yml)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://anuvaad.sunbird.org/modules/document-digitization.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
