Aligner
The Aligner module finds similar (“aligned”) sentence pairs across two lists of sentences, preferably in different languages. The Aligner is a standalone service that is not currently accessible from the UI. It depends on the file uploader and workflow manager (WFM) services.
The Aligner service is based on Google’s LaBSE model and Facebook’s FAISS library. It accepts two files as input and collects a list of sentences from each. A LaBSE embedding is computed for every sentence, and the cosine similarity between embeddings is used to find sentence pairs with similar meaning. FAISS is used to speed up the similarity search across the two lists.
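The pairing step described above can be sketched in plain Python. This is a minimal illustration and not the service's actual code: the toy 3-dimensional vectors stand in for real LaBSE embeddings, the names align and min_score and the 0.75 threshold are assumptions, and the brute-force loop marks the place where a FAISS index would do the nearest-neighbour search.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def align(src_sentences, src_vecs, tgt_sentences, tgt_vecs, min_score=0.75):
    """For each source sentence, keep the best-scoring target above min_score.

    min_score is a hypothetical threshold chosen for this sketch.
    """
    src_unit = [normalize(v) for v in src_vecs]
    tgt_unit = [normalize(v) for v in tgt_vecs]
    pairs = []
    for i, s in enumerate(src_unit):
        # Brute-force comparison against every target vector.
        # A FAISS index would replace this loop for large inputs.
        scores = [sum(a * b for a, b in zip(s, t)) for t in tgt_unit]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= min_score:
            pairs.append({
                "source_text": src_sentences[i],
                "target_text": tgt_sentences[best],
                "score": scores[best],
            })
    return pairs

# Toy data: hand-made vectors, NOT real LaBSE output.
src = ["Hello", "Good night"]
tgt = ["Bonne nuit", "Bonjour"]
src_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
tgt_vecs = [[0.1, 0.9, 0.2], [0.9, 0.2, 0.0]]

pairs = align(src, src_vecs, tgt, tgt_vecs)
for p in pairs:
    print(p["source_text"], "<->", p["target_text"], round(p["score"], 3))
```

Normalizing first means the inner product and the cosine similarity coincide, which is why an inner-product index can serve as the search backend in this kind of setup.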
Simplified implementation: Here
The service accepts two text files and is ideally invoked through WFM. Because alignment is time-consuming, the service runs asynchronously. Once the run is complete, the result can be retrieved with a WFM search using the job ID.
The response is typically a JSON file path, which can be downloaded using the download API. The JSON file is self-explanatory: it contains source_text, target_text, and the corresponding cosine similarity between them.
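As a rough sketch of how a downloaded result might be consumed, the snippet below filters pairs by score. Only source_text and target_text come from the description above. The top-level list layout, the key name cosine_similarity, and the helper load_pairs are assumptions made for illustration, so check them against a real output file.

```python
import json

def load_pairs(json_text, min_score=0.0, score_key="cosine_similarity"):
    """Parse aligner output and keep pairs scoring at least min_score.

    score_key is a hypothetical field name; adjust it to the real file.
    """
    records = json.loads(json_text)
    return [r for r in records if float(r[score_key]) >= min_score]

# Hand-written sample, NOT real service output.
sample = """
[
  {"source_text": "Hello", "target_text": "Bonjour", "cosine_similarity": 0.93},
  {"source_text": "Thanks", "target_text": "Bonne nuit", "cosine_similarity": 0.41}
]
"""

strong = load_pairs(sample, min_score=0.8)
for pair in strong:
    print(pair["source_text"], "<->", pair["target_text"])
```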
Local Setup (Without WFM & Uploader)
Clone the Repo
Install dependencies
Run the application
Access from local:
Remote (Invoked via WFM)
The invocation returns a job ID, which can be looked up with the WFM Bulk search API to track job progress and retrieve the results once the job is done.
WF_A_JAL is the workflow code for the JSON-based aligner, which returns the file path of a JSON file that can be downloaded using the download API.
WF_A_AL is the old workflow code, which returns multiple txt files.
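Because the job runs asynchronously, a client usually polls until it finishes. The sketch below shows one way to structure that loop without assuming anything about the WFM API itself: fetch_status is a caller-supplied function that would wrap the bulk search call, and the status strings, the "status" key, and the "output" value in the demo are hypothetical placeholders.

```python
import time

def wait_for_job(job_id, fetch_status, done_states=("COMPLETED",),
                 failed_states=("FAILED",), interval=5.0, max_attempts=60,
                 sleep=time.sleep):
    """Poll fetch_status(job_id) until the job reaches a terminal state.

    The state names are placeholders; use the values WFM actually returns.
    """
    for _ in range(max_attempts):
        record = fetch_status(job_id)
        status = record.get("status")
        if status in done_states:
            return record
        if status in failed_states:
            raise RuntimeError(f"job {job_id} failed with status {status}")
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} checks")

# Demo with canned responses standing in for real WFM search results.
_responses = iter([
    {"status": "INPROGRESS"},
    {"status": "COMPLETED", "output": "aligned.json"},
])
result = wait_for_job("demo-job", lambda job_id: next(_responses),
                      interval=0, sleep=lambda seconds: None)
print(result["status"])
```

Passing the fetch function in keeps the loop testable offline and leaves the request format to whatever the WFM deployment expects.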
Testing
Upload two files.
Call API endpoint with file paths as parameters.
Verify that the sentence pairs in the JSON are matched correctly.
Notes
The Aligner can be used as an independent service by deploying only the file-uploader and aligner modules on a server, preferably a GPU-based one (tested and working well on g4dn.2xlarge).
Simplified implementations of the aligner can be found here.