
Aligner


Last updated 11 months ago

The Aligner module aligns sentences: given two lists of sentences, preferably in different languages, it finds the pairs that are similar in meaning. The Aligner is a standalone service that is not currently accessible from the UI. It depends on the file uploader and workflow manager (WFM) services.

The Aligner service is based on Google’s LaBSE model and Facebook AI’s FAISS similarity-search library. It accepts two files as inputs and collects one list of sentences from each. A LaBSE embedding is calculated for every sentence in both lists, and the cosine similarity between embeddings is used to find sentence pairs with similar meaning. FAISS is used to dramatically speed up the similarity search.
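The matching step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the service's actual code: the tiny hand-written vectors stand in for real LaBSE embeddings, the brute-force loop stands in for the FAISS search, and the `align` helper and its `threshold` value of 0.8 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align(src_sentences, src_vecs, tgt_sentences, tgt_vecs, threshold=0.8):
    """Pair each source sentence with its most similar target sentence.

    Brute-force search over all targets (the role FAISS plays at scale).
    Pairs scoring below the assumed threshold are dropped.
    """
    pairs = []
    for s_text, s_vec in zip(src_sentences, src_vecs):
        scores = [cosine(s_vec, t_vec) for t_vec in tgt_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((s_text, tgt_sentences[best], scores[best]))
    return pairs


# Toy 3-dimensional vectors standing in for real LaBSE embeddings.
src = ["Hello", "Thank you"]
src_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tgt = ["ml_thanks", "ml_hello", "ml_other"]  # placeholder target sentences
tgt_vecs = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

aligned = align(src, src_vecs, tgt, tgt_vecs)
for source_text, target_text, score in aligned:
    print(source_text, "<->", target_text, round(score, 3))
```

With real embeddings the brute-force loop costs one comparison per source/target pair, which is the part an approximate or indexed nearest-neighbour search is meant to accelerate.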

Simplified implementation:

The service accepts two text files and is ideally invoked through WFM. Alignment is time-consuming, so the service runs asynchronously. Once the run has finished, the result can be retrieved with a WFM job search using the job ID.

The response is typically the path to a JSON file, which can be downloaded using the download API. The JSON file is self-explanatory: it contains source_text, target_text, and the corresponding cosine similarity between them.
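Once the JSON file has been downloaded, low-scoring pairs can be filtered out before use. The sketch below is a guess at how that might look: `source_text` and `target_text` come from the description above, but the `cosine_similarity` key name, the assumption that the file is a flat list of objects, and the 0.75 cut-off are all hypothetical and should be checked against a real output file.

```python
import json


def load_confident_pairs(json_text, min_score=0.75, score_key="cosine_similarity"):
    """Return (source_text, target_text, score) tuples at or above min_score.

    Assumes the aligner output is a JSON list of objects; the score key
    name is an assumption and is configurable for that reason.
    """
    records = json.loads(json_text)
    return [
        (r["source_text"], r["target_text"], r[score_key])
        for r in records
        if r[score_key] >= min_score
    ]


# Made-up sample data in the assumed shape, not real aligner output.
sample = json.dumps([
    {"source_text": "Hello", "target_text": "ml_hello", "cosine_similarity": 0.93},
    {"source_text": "Good night", "target_text": "ml_other", "cosine_similarity": 0.41},
])

kept = load_confident_pairs(sample)
print(kept)
```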

Local Setup (Without WFM & Uploader)

  1. Clone the Repo

  2. Install dependencies

    pip install -r requirements.txt
  3. Run the application

    python app.py
  4. Access from local:

Aligner CURL Request
curl --location --request POST 'http://127.0.0.1:5001/anuvaad-etl/extractor/aligner/v1/sentences/align' \
--header 'Content-Type: application/json' \
--data-raw '{
    "source": {
        "filepath": "/home/test.en",
        "locale": "en",
        "type": "json"
    },
    "target": {
        "filepath": "/home/test.ml",
        "locale": "ml",
        "type": "json"
    }
}'
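For scripted use, the same request can be built from Python with the standard library. This is a minimal sketch that mirrors the curl call above; the `build_align_request` helper is a hypothetical name, and the request is only constructed here, not sent, so it runs without the service.

```python
import json
import urllib.request

# Local endpoint taken from the curl example above.
ALIGN_URL = "http://127.0.0.1:5001/anuvaad-etl/extractor/aligner/v1/sentences/align"


def build_align_request(src_path, src_locale, tgt_path, tgt_locale,
                        file_type="json", url=ALIGN_URL):
    """Build (but do not send) the POST request for the align endpoint."""
    payload = {
        "source": {"filepath": src_path, "locale": src_locale, "type": file_type},
        "target": {"filepath": tgt_path, "locale": tgt_locale, "type": file_type},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_align_request("/home/test.en", "en", "/home/test.ml", "ml")

# To actually send it, the service must be running locally:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```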

Search Jobs in Local
curl --location --request GET 'http://127.0.0.1:5001/anuvaad-etl/extractor/aligner/v1/alignment/jobs/get/ALIGN-1614743930159'

Remote (Invoked via WFM)

Initiate Workflow
curl --location --request POST 'https://stage-auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/async/initiate' \
--header 'Content-Type: application/json' \
--header 'auth-token: {{auth-token}}' \
--header 'context:' \
--data-raw '{
    "workflowCode": "WF_A_JAL",
    "files": [
        {
            "locale": "ml",
            "path": "983da7e1-7cde-4091-8db4-cf845b5ea3c3.txt",
            "type": "txt"
        },
        {
            "locale": "en",
            "path": "aab70b95-ec0d-4c1c-9bfe-0c4864aecda0.txt",
            "type": "txt"
        }
    ]
}'

The call returns a job ID, which can be passed to the WFM bulk search API to check job progress and retrieve the results once the job is done.

Search Bulk Workflow Jobs
curl --location --request POST 'https://stage-auth.anuvaad.org/anuvaad-etl/wf-manager/v1/workflow/jobs/search/bulk' \
--header 'auth-token: {{auth-token}}' \
--header 'Content-Type: application/json' \
--data-raw '{
    "jobIDs": [
        "{{jobIDs}}"
    ],
    "taskDetails": false
}'
  • WF_A_JAL is the workflow code for the JSON-based aligner, which returns the path of a JSON file that can be downloaded using the download API.

  • WF_A_AL is the old workflow code, which returns multiple txt files.
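Because the workflow is asynchronous, a client usually has to poll the bulk search API until the job finishes. The loop below is one possible shape for that, offered as a sketch only: the WFM response schema is not documented on this page, so the caller supplies both the `search` function and the `is_done` check, and the `status` field and its values in the demo are made up.

```python
import time


def wait_for_job(job_id, search, is_done, interval_s=10.0, max_attempts=60,
                 sleep=time.sleep):
    """Poll search(job_id) until is_done(result) is true.

    search  : callable that performs the bulk-search call and returns its result
    is_done : callable deciding whether that result means the job has finished
    Raises TimeoutError if the job is still not done after max_attempts polls.
    """
    for attempt in range(max_attempts):
        result = search(job_id)
        if is_done(result):
            return result
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"job {job_id} not finished after {max_attempts} polls")


# Demo with canned responses; the "status" key and its values are hypothetical.
fake_responses = iter([
    {"status": "INPROGRESS"},
    {"status": "INPROGRESS"},
    {"status": "COMPLETED"},
])

result = wait_for_job(
    "JOB-ID-PLACEHOLDER",
    search=lambda job_id: next(fake_responses),
    is_done=lambda r: r["status"] == "COMPLETED",
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result)
```

Keeping the HTTP call and the completion check outside the loop means the helper does not need to change if the real response fields differ from the ones assumed here.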

Testing

  1. Upload two files.

  2. Call the API endpoint with the file paths as parameters.

  3. Verify that the sentences are matched correctly in the output JSON.

Notes

  • The Aligner can be used as an independent service by deploying only the file-uploader and aligner modules on a server, preferably GPU-based (tested and working well on g4dn.2xlarge).

Simplified implementations of the aligner can be found here.

An explanatory article can be found here and here.