Load subset into dataframe #54
Conversation
This was branched from #48 and merges back into it. The file diff is too big, let me investigate.
Force-pushed from 3f933b6 to b2c790a.
Nice, the partitions persisting when writing to and reading from Parquet will be really beneficial to us.
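A minimal sketch of the Parquet round trip being referred to, assuming a local path, default settings, and pyarrow installed; the file name and columns are illustrative only:

```python
# Dask writes one Parquet file per partition, so the partition layout is
# typically preserved when the dataframe is read back in.
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({"id": range(8), "value": range(8)}), npartitions=4)
df.to_parquet("subset.parquet", write_index=True)

reloaded = dd.read_parquet("subset.parquet")
print(df.npartitions, reloaded.npartitions)  # both report 4 in this sketch
```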
Force-pushed from 3641dce to 1daed44.
Thanks @GeorgesLorre, left some final comments.
pyproject.toml
@@ -41,7 +41,7 @@ classifiers = [
 [tool.poetry.dependencies]
 python = "^3.8"
 jsonschema = "^4.17.3"
-dask = "^2022.2.0"
+dask = "^2023.4.0"
Why don't we need the `dataframe` extra here?
fondant/dataset.py
        col: name + "_" + col for col in df.columns if col not in index_fields
    }
)
# df = df.rename(
Can we re-enable this?
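For reference, a small self-contained sketch of the renaming pattern shown in the diff above; the subset name, index fields, and data are assumptions made for illustration:

```python
# Prefix every non-index column with its subset name, e.g. "Name" -> "properties_Name".
import dask.dataframe as dd
import pandas as pd

index_fields = ["id", "source"]  # assumed index fields
name = "properties"              # assumed subset name

df = dd.from_pandas(
    pd.DataFrame({"id": [1], "source": ["raw"], "Name": ["Bulbasaur"]}),
    npartitions=1,
)
df = df.rename(
    columns={col: f"{name}_{col}" for col in df.columns if col not in index_fields}
)
print(list(df.columns))  # ['id', 'source', 'properties_Name']
```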
tests/test_dataset.py
fds = FondantDataset(manifest=manifest)
df = fds.load_dataframe(spec=component_spec)
assert len(df) == 151
assert list(df.columns) == ["id", "source", "Name", "HP", "Type 1", "Type 2"]
The column names should be "subset name underscore field name", like "properties_name"
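A minimal illustration of the "subset name underscore field name" convention this comment describes; the subset and field names are assumptions:

```python
import pandas as pd

# e.g. a "properties" subset contributing a "name" field becomes "properties_name"
df = pd.DataFrame({"id": [1], "source": ["raw"], "properties_name": ["Bulbasaur"]})
assert list(df.columns) == ["id", "source", "properties_name"]
```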
tests/example_data/raw/split.py
""" | ||
This is a small script to split the raw data into different subsets to be used while testing. | ||
|
||
The data is the 151 first pokemon and the following field are available: |
I'm not a fan of adding Parquet files to the GitHub repository, as it slows things down. And since they stay in the commit history forever, removing them later won't make it faster either.
I'd host the Parquet files on the Hub and load them using the `hf_hub_download` method.
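A hedged sketch of that suggestion; the repo ID, file path, and revision below are placeholders, not real locations:

```python
# Download a test Parquet file from the Hugging Face Hub instead of committing it.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="some-org/fondant-test-data",  # hypothetical dataset repo
    filename="raw/subset_a.parquet",       # hypothetical file in that repo
    repo_type="dataset",
    revision="main",  # a commit hash can be pinned here to version the file
)
print(local_path)  # cached local path to the downloaded file
```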
As long as we stay in the range of kilobytes, I don't see an issue with this. Adding data files for the tests is fine as long as the goal is to validate format handling and they contain minimal data.
Moving the files out of the repo decouples them from versioning and adds complexity.
Moving the files out of the repo also makes our tests dependent on an internet connection, which will only bring pain.
Some best practices for tests:
https://web.archive.org/web/20210510024513/http://beyondcoding.net/articles/firstprinciples.html
OK, fine for me as long as the files are very small, although I've never had an issue with the internet not being available when testing things out. With HF you can also pin a commit hash if you want to fix the version of the file, if needed.
Force-pushed from 1daed44 to 129b603.
pyarrow = "^11.0.0"
Nit: can we move this up from the optional to the required dependencies?
Thanks @GeorgesLorre!
Thanks for working on this!
Load subset into dataframe
Fixes ml6team/fondant-internal#54

PR that adds the functionality to load PDF documents from different local and remote storage. The implementation differs from the solution suggested at [#54](ml6team/fondant-internal#54) since:

* Accumulating different loaders and loading each document individually seems inefficient, since it would require initializing a client, temp storage, ... on every invocation ([link](https://github.com/langchain-ai/langchain/blob/04caf07dee2e2843ab720e5b8f0c0e83d0b86a3e/libs/community/langchain_community/document_loaders/gcs_file.py#L62))
* The langchain cloud loaders don't have a unified interface:
  * Each requires specific arguments to be passed (in contrast, fsspec is much simpler)
  * Only the Google loader allows defining a custom loader class; the rest use the `Unstructured` loader, which requires a lot of system and CUDA dependencies to be installed (a lot of overhead for just loading PDFs)

The current implementation relies on copying the PDFs to temporary local storage and loading them using the `PyPDFDirectoryLoader`; they are then loaded lazily. The assumption for now is that the loaded docs won't exceed the storage of the device, which should hold for most use cases. Later on, we can think about how to optimize this further.
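A rough, hedged sketch of the copy-then-load approach described above, assuming fsspec, langchain-community, and pypdf are installed; the bucket path is a placeholder:

```python
# Copy PDFs from (possibly remote) storage to a local temp dir with fsspec,
# then parse them with langchain's PyPDFDirectoryLoader.
import tempfile

import fsspec
from langchain_community.document_loaders import PyPDFDirectoryLoader

remote_path = "gs://some-bucket/pdfs/"  # hypothetical remote location

with tempfile.TemporaryDirectory() as tmp_dir:
    fs, _, _ = fsspec.get_fs_token_paths(remote_path)
    fs.get(remote_path, tmp_dir, recursive=True)  # download the PDFs locally

    documents = PyPDFDirectoryLoader(tmp_dir).load()
    print(len(documents))
```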
This PR implements a very basic way of merging all subsets into 1 dataframe to be used in the component.
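A rough sketch of what such a merge could look like; the subset names, columns, and use of an "id" index are assumptions for illustration:

```python
# Prefix each subset's columns with its name and merge everything on the index
# into a single Dask dataframe.
import dask.dataframe as dd
import pandas as pd

subsets = {  # toy stand-ins for subsets read from Parquet
    "properties": pd.DataFrame({"id": [1, 2], "Name": ["Bulbasaur", "Ivysaur"]}),
    "types": pd.DataFrame({"id": [1, 2], "Type 1": ["Grass", "Grass"]}),
}

merged = None
for name, pdf in subsets.items():
    subset_df = dd.from_pandas(pdf.set_index("id"), npartitions=1)
    subset_df = subset_df.rename(
        columns={col: f"{name}_{col}" for col in subset_df.columns}
    )
    merged = (
        subset_df
        if merged is None
        else merged.merge(subset_df, left_index=True, right_index=True, how="left")
    )

print(merged.compute())
```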
Still to do:
Out of scope of this PR: