DocETL: Powering Complex Document Processing Pipelines

Website (Includes Demo) | Documentation | Discord | Paper (coming soon!)

DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define LLM-powered operations on complex data.

When to Use DocETL

DocETL is the ideal choice when you're looking to maximize correctness and output quality for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:

You want to perform semantic processing on a collection of data
You have complex tasks that you want to represent via map-reduce (e.g., map over your documents, group by the result of the map call, then reduce each group)
You're unsure how to best express your task to maximize LLM accuracy
You're working with long documents that don't fit into a single prompt or are too lengthy for effective LLM reasoning
You have validation criteria and want tasks to automatically retry when the validation fails
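
The map-reduce pattern mentioned in the list above can be sketched in plain Python. This only illustrates the data flow and is not DocETL's API: `classify_topic` and `summarize_group` are hypothetical stand-ins for LLM calls, and the sample documents are made up.

```python
from collections import defaultdict


def classify_topic(doc: str) -> str:
    """Hypothetical stand-in for an LLM map call: label a document with a topic."""
    return "billing" if "invoice" in doc.lower() else "support"


def summarize_group(topic: str, docs: list[str]) -> str:
    """Hypothetical stand-in for an LLM reduce call: summarize one group."""
    return f"{topic}: {len(docs)} document(s)"


def map_group_reduce(docs: list[str]) -> dict[str, str]:
    # Map: compute a key for every document.
    mapped = [(classify_topic(doc), doc) for doc in docs]

    # Group by the result of the map call.
    groups: dict[str, list[str]] = defaultdict(list)
    for topic, doc in mapped:
        groups[topic].append(doc)

    # Reduce: collapse each group into a single output.
    return {topic: summarize_group(topic, group) for topic, group in groups.items()}


if __name__ == "__main__":
    sample = ["Invoice #1 is overdue", "My login fails", "Please resend the invoice"]
    print(map_group_reduce(sample))
```

In a real pipeline the two stand-in functions would be LLM-powered operations; the grouping step in the middle stays ordinary, deterministic code.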

Installation

See the documentation for installing from PyPI.

Prerequisites

Before installing DocETL, ensure you have Python 3.10 or later installed on your system. You can check your Python version by running:

python --version
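
If you would rather check the requirement from inside Python (for example in a setup script), a minimal sketch follows. The helper name `meets_minimum` is illustrative and not part of DocETL.

```python
import sys

MIN_VERSION = (3, 10)  # minimum stated in the prerequisites above


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(map(str, sys.version_info[:3]))
    if meets_minimum():
        print(f"Python {current} satisfies the 3.10+ requirement")
    else:
        print(f"Python {current} is too old; DocETL needs 3.10 or later")
```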

Installation Steps (from Source)

Clone the DocETL repository:

git clone https://github.com/shreyashankar/docetl.git
cd docetl

Install Poetry (if not already installed):

pip install poetry

Install the project dependencies:

poetry install

Set up your OpenAI API key:

Create a .env file in the project root and add your OpenAI API key:

OPENAI_API_KEY=your_api_key_here

Alternatively, you can set the OPENAI_API_KEY environment variable in your shell.
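
To confirm the key is visible before running anything that calls OpenAI, a small standard-library check such as the following may help. It is a sketch that assumes simple KEY=value lines in .env; it does not reproduce how DocETL itself loads the file, and the function names `read_env_file` and `find_api_key` are illustrative.

```python
import os
from pathlib import Path
from typing import Optional


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines; skips blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def find_api_key(path: str = ".env") -> Optional[str]:
    """Prefer the shell environment, then fall back to the .env file."""
    return os.environ.get("OPENAI_API_KEY") or read_env_file(path).get("OPENAI_API_KEY")


if __name__ == "__main__":
    print("OPENAI_API_KEY found" if find_api_key() else "OPENAI_API_KEY is not set")
```

The script reports only whether a key was found and never prints the key itself, so it is safe to run in a shared terminal.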

Run the basic test suite to ensure everything is working (this costs less than $0.01 with OpenAI):

make tests-basic

That's it! You've successfully installed DocETL and are ready to start processing documents.

For more detailed information on usage and configuration, please refer to our documentation.

