Collection of ETL scripts used to create a dataset of text in Spanish to train Large Language Models.

jsurrea/LLM-Latino

Seneca Extractor

Description

seneca_extractor is a Python package for extracting files and metadata from Seneca, the institutional repository of Universidad de los Andes. It is part of the LLM-Latino project and aims to make the data stored in the repository easier to access and manipulate.

Authors

  • Juan Sebastian Urrea Lopez
  • David Santiago Ortiz Almanza

Contact

Installation

We recommend installing this package in a Python virtual environment to avoid dependency conflicts. Follow these steps to set up your environment and install seneca_extractor:

  1. Create and activate a virtual environment (optional, but recommended):

    • On Windows:
      python -m venv venv
      .\venv\Scripts\activate
    • On Unix or macOS:
      python3 -m venv venv
      source venv/bin/activate
  2. Install the package:

    • Navigate to the directory where the source code is located and run:
      pip install -e .

    This installs seneca_extractor in editable mode, so changes to the package's Python source code take effect the next time the package is imported, without reinstalling.
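After following the steps above, a quick sanity check can confirm that the virtual environment is active and that the package can be found. The snippet below is a minimal sketch using only the standard library. It assumes the import name is `seneca_extractor`, matching the package name in this README. The helper names `in_virtualenv` and `is_installed` are hypothetical and not part of the package.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the import name matches the package name used in this README.
    name = "seneca_extractor"
    print(f"virtual environment active: {in_virtualenv()}")
    print(f"{name} importable: {is_installed(name)}")
```

If the second line reports `False`, the most likely causes are that the virtual environment is not activated in the current shell or that `pip install -e .` was run from a different directory than the one containing the source code.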
