Merge main into feature/move-integration-test
Showing 33 changed files with 589 additions and 50 deletions.
@@ -0,0 +1,30 @@
FROM --platform=linux/amd64 python:3.8-slim as base

# System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[component,aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component
COPY src/ src/

FROM base as test
COPY tests/ tests/
RUN pip3 install --no-cache-dir -r tests/requirements.txt
RUN python -m pytest tests

FROM base
COPY tests/ tests/
WORKDIR /component/src
ENTRYPOINT ["fondant", "execute", "main"]
@@ -0,0 +1,69 @@
# Load from pdf

<a id="load_from_pdf#description"></a>
## Description
Load PDF data stored locally or remotely using langchain loaders.


<a id="load_from_pdf#inputs_outputs"></a>
## Inputs / outputs

<a id="load_from_pdf#consumes"></a>
### Consumes


**This component does not consume data.**


<a id="load_from_pdf#produces"></a>
### Produces
**This component produces:**

- pdf_path: string
- file_name: string
- text: string



<a id="load_from_pdf#arguments"></a>
## Arguments

The component takes the following arguments to alter its behavior:

| argument | type | description | default |
| -------- | ---- | ----------- | ------- |
| pdf_path | str | The path to a PDF file or a folder containing PDF files to load. Can be a local path or a remote path. If the path is remote, the loader class will be determined by the scheme of the path. | / |
| n_rows_to_load | int | Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale. | / |
| index_column | str | Column to use as the index in the load component. If not specified, a default globally unique index is set. | / |
| n_partitions | int | Number of partitions of the dask dataframe. If not specified, the number of partitions will be equal to the number of CPU cores. Set a high value if the data is large and the pipeline is running out of memory. | / |
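
The `pdf_path` description says that the loader for a remote path is chosen from the scheme of the path. The sketch below only illustrates how such a scheme could be read from a path using the standard library. It is not the component's actual implementation, and the function name `path_scheme`, the `"file"` fallback, and the example bucket paths are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def path_scheme(pdf_path: str) -> str:
    """Return the URL scheme of `pdf_path`, or "file" for plain local paths.

    Hypothetical helper: the real component may resolve loaders differently.
    """
    scheme = urlsplit(pdf_path).scheme
    # An empty scheme means a plain local path. A single-letter "scheme" is
    # most likely a Windows drive letter (e.g. C:\docs), so treat it as local.
    if len(scheme) <= 1:
        return "file"
    return scheme


# Hypothetical example paths
print(path_scheme("/data/pdfs"))
print(path_scheme("gs://my-bucket/pdfs/"))
print(path_scheme("s3://my-bucket/report.pdf"))
```

A caller could then map the returned scheme to a loader class of its choice.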

<a id="load_from_pdf#usage"></a>
## Usage

You can add this component to your pipeline using the following code:

```python
from fondant.pipeline import Pipeline


pipeline = Pipeline(...)

dataset = pipeline.read(
    "load_from_pdf",
    arguments={
        # Add arguments
        # "pdf_path": ,
        # "n_rows_to_load": 0,
        # "index_column": ,
        # "n_partitions": 0,
    },
)
```
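
Since only `pdf_path` is required and the other arguments are optional, it can help to assemble the `arguments` dict in one place and leave out anything that was not set. The following is a minimal sketch of that idea. The helper name `load_from_pdf_arguments` and the example bucket path are hypothetical and not part of Fondant.

```python
from typing import Any, Dict, Optional


def load_from_pdf_arguments(
    pdf_path: str,
    n_rows_to_load: Optional[int] = None,
    index_column: Optional[str] = None,
    n_partitions: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the `arguments` dict, leaving out options that were not set."""
    optional = {
        "n_rows_to_load": n_rows_to_load,
        "index_column": index_column,
        "n_partitions": n_partitions,
    }
    arguments: Dict[str, Any] = {"pdf_path": pdf_path}
    # Only pass options the caller actually provided.
    arguments.update({k: v for k, v in optional.items() if v is not None})
    return arguments


# Hypothetical remote folder, limited to 10 rows for a small test run
arguments = load_from_pdf_arguments("gs://my-bucket/pdfs/", n_rows_to_load=10)
print(arguments)
```

The resulting dict can be passed on as `pipeline.read("load_from_pdf", arguments=arguments)`.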

<a id="load_from_pdf#testing"></a>
## Testing

You can run the tests using Docker with BuildKit. From this directory, run:
```
docker build . --target test
```
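
If the test build is triggered from a script, the command can be assembled as an argument list. The sketch below assumes the `test` build stage and the `FONDANT_VERSION` build argument defined in this component's Dockerfile, and uses the standard `--target` and `--build-arg` options of `docker build`. The function name `docker_test_command` is hypothetical.

```python
import shlex
from typing import List, Optional


def docker_test_command(
    context: str = ".", fondant_version: Optional[str] = None
) -> List[str]:
    """Return the `docker build` argument list that runs the test stage."""
    cmd = ["docker", "build", context, "--target", "test"]
    if fondant_version is not None:
        # Overrides the Dockerfile's `ARG FONDANT_VERSION` default.
        cmd += ["--build-arg", f"FONDANT_VERSION={fondant_version}"]
    return cmd


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(docker_test_command()))
print(shlex.join(docker_test_command(fondant_version="main")))
```

To actually run the build, the list can be handed to `subprocess.run(cmd, check=True)`.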
@@ -0,0 +1,41 @@
name: Load from pdf
description: |
  Load PDF data stored locally or remotely using langchain loaders.
image: fndnt/load_from_pdf:dev
tags:
  - Data loading

produces:
  pdf_path:
    type: string
  file_name:
    type: string
  text:
    type: string

args:
  pdf_path:
    description: |
      The path to a PDF file or a folder containing PDF files to load.
      Can be a local path or a remote path. If the path is remote, the loader class will be
      determined by the scheme of the path.
    type: str
  n_rows_to_load:
    description: |
      Optional argument that defines the number of rows to load. Useful for testing pipeline runs
      on a small scale.
    type: int
    default: None
  index_column:
    description: |
      Column to use as the index in the load component. If not specified, a default globally
      unique index is set.
    type: str
    default: None
  n_partitions:
    description: |
      Number of partitions of the dask dataframe. If not specified, the number of partitions will
      be equal to the number of CPU cores. Set a high value if the data is large and the pipeline
      is running out of memory.
    type: int
    default: None
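
The `n_partitions` description states that the partition count falls back to the number of CPU cores when the argument is not set. One way such a fallback could look is sketched below. This is an assumption about the behavior described in the spec, not the component's code, and `resolve_n_partitions` is a hypothetical name.

```python
import os
from typing import Optional


def resolve_n_partitions(n_partitions: Optional[int] = None) -> int:
    """Return the requested partition count, or the CPU core count if unset."""
    if n_partitions is not None:
        if n_partitions < 1:
            raise ValueError("n_partitions must be a positive integer")
        return n_partitions
    # os.cpu_count() can return None when the count cannot be determined,
    # so fall back to a single partition in that case.
    return os.cpu_count() or 1


print(resolve_n_partitions(8))
print(resolve_n_partitions())
```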
@@ -0,0 +1 @@
PyMuPDF==1.23.8