Pralekha: An Indic Document Alignment Evaluation Benchmark

Overview

PRALEKHA is a large-scale benchmark for evaluating document-level alignment techniques. It includes 2M+ documents, covering 11 Indic languages and English, with a balanced mix of aligned and unaligned pairs.

Usage

1. Getting Started

Follow these steps to set up the environment and get started with the pipeline:

1. Clone the Repository

Clone this repository to your local system:

git clone https://github.com/AI4Bharat/Pralekha.git
cd Pralekha

2. Set Up a Conda Environment

Create and activate a new Conda environment for this project:

conda create -n pralekha python=3.9 -y
conda activate pralekha

3. Install Dependencies

Install the required Python packages:

pip install -r requirements.txt

2. Input Directory Structure

The pipeline expects a directory structure in the following format:

A main directory containing language subdirectories named with their 3-letter ISO codes (e.g., eng for English, hin for Hindi, tam for Tamil).
Each language subdirectory contains .txt documents named {doc_id}.txt, where doc_id is the unique identifier of the document.

Below is an example of the expected directory structure:

data/
├── eng/
│   ├── tech-innovations-2023.txt                
│   ├── sports-highlights-day5.txt     
│   ├── press-release-456.txt         
│   ├── ...
├── hin/
│   ├── daily-briefing-april.txt       
│   ├── market-trends-yearend.txt      
│   ├── इंडिया-न्यूज़123.txt              
│   ├── ...
├── tam/
│   ├── kollywood-review-movie5.txt   
│   ├── 2023-pilgrimage-guide.txt       
│   ├── கடலோர-மாநில-செய்தி.txt          
│   ├── ...
...
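
Before running the later steps, it can help to confirm that a dataset follows this layout. The snippet below is a minimal sketch and not part of the repository: the function name index_documents and the rule that language directories must have three lowercase ASCII letters are assumptions made for illustration.

```python
from pathlib import Path


def index_documents(root):
    """Map each language code to the sorted list of doc_ids found under it.

    Assumes the layout described above: root/<lang>/<doc_id>.txt.
    Directories whose names are not three lowercase ASCII letters are skipped
    (an illustrative check, not something the pipeline is known to enforce).
    """
    index = {}
    for lang_dir in sorted(Path(root).iterdir()):
        name = lang_dir.name
        looks_like_code = (
            len(name) == 3 and name.isascii() and name.isalpha() and name.islower()
        )
        if not (lang_dir.is_dir() and looks_like_code):
            continue
        # Path.stem drops the final ".txt", leaving the doc_id.
        index[name] = sorted(p.stem for p in lang_dir.glob("*.txt"))
    return index
```

Calling index_documents("data") on a tree like the one above returns a dictionary keyed by language code, which makes it easy to spot empty or misnamed subdirectories.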

3. Split Documents into Granular Shards

To process documents into granular shards, use the doc2granular-shards.sh script.

This script allows you to:

Tokenize documents into sentences.
Split documents into chunks.

Run the script:

bash doc2granular-shards.sh
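
To make the two granularities concrete, here is a rough sketch of sentence splitting followed by chunking. It is not the logic of doc2granular-shards.sh: the regular expression, the function names, and the chunk_size parameter are all assumptions, and the script may use a proper language-aware tokenizer and different chunking rules.

```python
import re

# Sentence-ending punctuation: ".", "!", "?" and the Devanagari danda "।".
# This is a deliberately naive rule; it will mis-split abbreviations.
_SENTENCE_END = re.compile(r"(?<=[.!?।])\s+")


def split_sentences(text):
    """Split text into sentences at whitespace that follows end punctuation."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def make_chunks(sentences, chunk_size):
    """Group consecutive sentences into non-overlapping chunks of chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]
```

The last chunk may hold fewer than chunk_size sentences when the document length is not an exact multiple.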

4. Create Embeddings

Generate embeddings for your dataset using one of the two supported models: LaBSE or SONAR.

bash create_embeddings.sh

Choose the desired model by editing the script as needed. Both models can be run sequentially or independently by enabling/disabling the respective sections.
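
For orientation, the sketch below shows one way shard embeddings could be produced with LaBSE. It is an illustration under stated assumptions, not the contents of create_embeddings.sh: it assumes the third-party sentence-transformers package and the public sentence-transformers/LaBSE checkpoint, and the helper names load_labse_encoder and embed_shards are hypothetical. SONAR is not covered.

```python
def load_labse_encoder():
    """Return a function mapping a list of strings to a list of vectors.

    Assumes the sentence-transformers package is installed and that the
    "sentence-transformers/LaBSE" checkpoint can be downloaded.
    """
    # Imported lazily so the rest of the module works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/LaBSE")

    def encode(texts):
        # normalize_embeddings=True returns unit-length vectors.
        return model.encode(texts, normalize_embeddings=True).tolist()

    return encode


def embed_shards(shards_by_doc, encode):
    """Embed every shard while keeping the doc_id -> vectors grouping."""
    return {
        doc_id: (encode(shards) if shards else [])
        for doc_id, shards in shards_by_doc.items()
    }
```

Passing the encoder in as an argument keeps embed_shards independent of the model, so the same grouping code could be reused with a different encoder.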

5. Run the Pipeline

The final step is to execute the pipeline based on your chosen method:

For baseline approaches:

bash run_baseline_pipeline.sh

For the proposed DAC approach:

bash run_dac_pipeline.sh

Each pipeline exposes a number of configurable parameters. Review and edit the scripts before running them so that they match your setup.
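
As a generic illustration of embedding-based document alignment, the sketch below pairs documents that are each other's nearest neighbour under cosine similarity. It is neither the code in run_baseline_pipeline.sh nor the DAC approach; the function names and the threshold parameter are hypothetical, and each document is assumed to be represented by a single vector.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_best_matches(src, tgt, threshold=0.0):
    """Pair src and tgt doc_ids that are each other's most similar document.

    src and tgt map doc_id -> document vector. threshold is a hypothetical
    cut-off below which a pair is treated as unaligned.
    """
    if not src or not tgt:
        return []

    def best(vec, pool):
        # doc_id in pool whose vector is most similar to vec.
        return max(pool, key=lambda k: cosine(vec, pool[k]))

    pairs = []
    for s_id, s_vec in src.items():
        t_id = best(s_vec, tgt)
        is_mutual = best(tgt[t_id], src) == s_id
        if is_mutual and cosine(s_vec, tgt[t_id]) >= threshold:
            pairs.append((s_id, t_id))
    return pairs
```

Documents without a mutual match, or whose best match falls below the threshold, are left unpaired, which matters for a benchmark that mixes aligned and unaligned documents.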

License

This dataset is released under the CC BY 4.0 license.

Contact

For any questions or feedback, please contact:

Raj Dabre ([email protected])
Sanjay Suryanarayanan ([email protected])
Haiyue Song ([email protected])
Mohammed Safi Ur Rahman Khan ([email protected])

Please get in touch with us for any copyright concerns.
