Document Digitization

This pipeline extracts text from digital or scanned documents. A custom-trained Prima layout model detects lines and layouts (header, footer, paragraph, table, cell, image), and the Anuvaad OCR model performs the OCR.

How to Use

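  1. Upload a PDF or image file using the upload API:

    Upload URL: https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file

    Copy the upload ID from the response into the "path" field of the DD2.0 input.

  2. Initiate the workflow:

    WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate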

    DD2.0 Input:

    {
        "files": [
            {
                "locale": "language",
                "path": "file_name",
                "type": "file_format",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_FCWDLDBSOD20TESOTK"
    }
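
The two steps above can be sketched in Python as follows. This is a minimal illustration, not the project's client: the function names are hypothetical, the "auth-token" header name is an assumption, the values "en", "pdf" and "<upload_id>" are placeholders, and treating the OCR language as defaulting to the locale is also an assumption. Only the WF URL, the field names and the workflow code come from this page. The request is prepared but not sent.

```python
import json
import urllib.request

# WF URL as given on this page.
WF_URL = "https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate"


def build_payload(path, locale, file_type, workflow_code, ocr_language=None):
    """Return a workflow input in the shape of the DD2.0 input above."""
    return {
        "files": [
            {
                "locale": locale,
                "path": path,  # upload ID returned by the upload API
                "type": file_type,
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        # Assumption: OCR language defaults to the locale.
                        "language": ocr_language or locale,
                    }
                },
            }
        ],
        "workflowCode": workflow_code,
    }


def build_initiate_request(payload, auth_token):
    """Prepare (but do not send) the POST request that initiates a workflow."""
    return urllib.request.Request(
        WF_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "auth-token": auth_token,  # assumed header name
        },
        method="POST",
    )


payload = build_payload("<upload_id>", "en", "pdf", "WF_A_FCWDLDBSOD20TESOTK")
request = build_initiate_request(payload, auth_token="<token>")
# To actually submit the workflow: urllib.request.urlopen(request)
```

Keeping payload construction separate from the HTTP call makes it possible to inspect or log the input before anything is submitted.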

Microservices

Word Detector

  • Input: PDF or image

  • Output: List of pages with detected lines and page information.

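How to use: Word Detector

  1. Upload a PDF or image file using the upload API:

    Upload URL: https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file

  2. Initiate the Word Detector workflow:

    WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate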

    Word Detector Input:

    {
        "files": [
            {
                "locale": "language",
                "path": "file_name",
                "type": "file_format",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_WD"
    }
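
Because every stage input on this page shares the same shape, a small check before submission can catch a missing field early. The sketch below assumes that all fields shown in the Word Detector input are required; the page does not state which ones the service actually enforces, and the function name is hypothetical.

```python
def validate_workflow_input(payload):
    """Return a list of problems found in a workflow input (empty if none)."""
    problems = []
    code = payload.get("workflowCode")
    if not isinstance(code, str) or not code:
        problems.append("workflowCode must be a non-empty string")

    files = payload.get("files")
    if not isinstance(files, list) or not files:
        problems.append("files must be a non-empty list")
        return problems

    for i, entry in enumerate(files):
        # Top-level fields of each file entry.
        for key in ("locale", "path", "type"):
            if not entry.get(key):
                problems.append(f"files[{i}].{key} is missing")
        # Nested OCR configuration.
        ocr = entry.get("config", {}).get("OCR", {})
        for key in ("option", "language"):
            if not ocr.get(key):
                problems.append(f"files[{i}].config.OCR.{key} is missing")
    return problems


word_detector_input = {
    "files": [
        {
            "locale": "en",          # placeholder value
            "path": "<upload_id>",   # placeholder value
            "type": "pdf",           # placeholder value
            "config": {"OCR": {"option": "HIGH_ACCURACY", "language": "en"}},
        }
    ],
    "workflowCode": "WF_A_WD",
}
print(validate_workflow_input(word_detector_input))  # prints []
```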

Layout Detector

  • Input: Output of word detector

  • Output: List of pages with detected layouts and lines.

How to use: Layout Detector

  1. Provide the word detector's output JSON file as the input path.

  2. Initiate the Layout Detector workflow:

    WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

    Layout Detector Input:

    {
        "files": [
            {
                "locale": "language",
                "path": "word_detector_output",
                "type": "json",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_LD"
    }

Block Segmenter

  • Input: Output of layout detector

  • Output: Lines and words collated at the layout level.

How to use: Block Segmenter

  1. Provide the layout detector's output JSON file as the input path.

  2. Initiate the Block Segmenter workflow:

    WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

    Block Segmenter Input:

    {
        "files": [
            {
                "locale": "language",
                "path": "layout_detector_output",
                "type": "json",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_BS"
    }
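
The word detector, layout detector and block segmenter form a chain in which each stage reads the previous stage's JSON output. A minimal sketch of that chaining follows. The workflow codes are copied from the inputs above; `run_workflow` is a hypothetical caller-supplied function that submits one input and returns the path of the JSON file that stage produced (this page does not describe how outputs are retrieved), and using one value for both "locale" and the OCR "language" is an assumption.

```python
# Workflow codes copied from the stage inputs above, in pipeline order.
STAGES = [
    ("word detector", "WF_A_WD"),
    ("layout detector", "WF_A_LD"),
    ("block segmenter", "WF_A_BS"),
]


def stage_input(workflow_code, path, file_type, language):
    """Build one stage's input in the shape used throughout this page."""
    return {
        "files": [
            {
                "locale": language,
                "path": path,
                "type": file_type,
                "config": {"OCR": {"option": "HIGH_ACCURACY", "language": language}},
            }
        ],
        "workflowCode": workflow_code,
    }


def run_chain(uploaded_path, file_type, language, run_workflow):
    """Run the stages in order, feeding each output path to the next stage."""
    path, current_type = uploaded_path, file_type
    for _name, code in STAGES:
        path = run_workflow(stage_input(code, path, current_type, language))
        current_type = "json"  # later stages read the previous stage's JSON output
    return path


# Dry run with a stand-in for the real service call.
submitted = []


def fake_run_workflow(payload):
    submitted.append(payload)
    return payload["workflowCode"] + "_output.json"


final_path = run_chain("<upload_id>", "pdf", "en", fake_run_workflow)
```

The path returned at the end is the block segmenter output, which is what the OCR stages take as their input path.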

Tesseract OCR

  • Input: Output of block segmenter

  • Output: Text collation at word, line, and paragraph level using the Anuvaad OCR model.

How to use: Tesseract OCR

  1. Provide the block segmenter's output JSON file as the input path.

  2. Initiate the Tesseract OCR workflow:

    WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

    Tesseract OCR Input:

    {
        "files": [
            {
                "locale": "language",
                "path": "block_segmenter_output",
                "type": "json",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_OD20TES"
    }

Google OCR (Tesseract Alternative)

  • Input: Output of block segmenter

  • Output: Text collation at word, line, and paragraph level using Google Vision as the OCR engine.

How to use: Google Vision OCR

  1. Provide the block segmenter's output JSON file as the input path.

  2. Initiate the Google OCR workflow:

    WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

    Google OCR Input:

    {
        "files": [
            {
                "locale": "language",
                "path": "block_segmenter_output",
                "type": "json",
                "config": {
                    "OCR": {
                        "option": "HIGH_ACCURACY",
                        "language": "language"
                    }
                }
            }
        ],
        "workflowCode": "WF_A_OTES"
    }
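
The Tesseract and Google OCR inputs differ only in their workflow code, so choosing an engine can be reduced to a lookup. The sketch below copies both codes verbatim from the inputs above; the engine keys "tesseract" and "google" and the function name are hypothetical labels introduced for illustration, and using one value for both "locale" and the OCR "language" is an assumption.

```python
# Workflow codes copied verbatim from the two OCR inputs above.
OCR_WORKFLOW_CODES = {
    "tesseract": "WF_A_OD20TES",
    "google": "WF_A_OTES",
}


def ocr_input(engine, block_segmenter_output, language):
    """Build the OCR stage input for the chosen engine."""
    if engine not in OCR_WORKFLOW_CODES:
        raise ValueError(f"unknown OCR engine: {engine!r}")
    return {
        "files": [
            {
                "locale": language,
                "path": block_segmenter_output,
                "type": "json",  # both OCR stages read the block segmenter JSON
                "config": {"OCR": {"option": "HIGH_ACCURACY", "language": language}},
            }
        ],
        "workflowCode": OCR_WORKFLOW_CODES[engine],
    }


tesseract_input = ocr_input("tesseract", "block_segmenter_output", "en")
google_input = ocr_input("google", "block_segmenter_output", "en")
```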

Upload URL: https://auth.anuvaad.org/anuvaad-api/file-uploader/v0/upload-file

WF URL: https://auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate

Github repos and API contracts:

  • Document Digitization: Github repo: Anuvaad Document Processor; API contract: API Contract

  • Word Detector: Github repo: Word Detector Craft; API contract: Word Detector API Contract

  • Layout Detector: Github repo: Layout Detector Prima; API contract: Layout Detector API Contract

  • Block Segmenter: Github repo: Block Segmenter; API contract: Block Segmenter API Contract

  • Tesseract OCR: Github repo: OCR Tesseract Server; API contract: Google Vision API Contract

  • Google OCR: Github repo: OCR Google Vision Server; API contract: Google Vision API Contract