Load subset into dataframe #54

Merged 7 commits into main from feature/load-subset-into-dataframe on May 2, 2023

Conversation

@GeorgesLorre (Collaborator) commented Apr 27, 2023

This PR implements a very basic way of merging all subsets into a single dataframe to be used in the component.

Still to do:

  • add tests
  • add more tests

Out of scope of this PR:

  • Do we add the subset name to the column when loading or when writing (when creating the subset)?
  • No indices are used when merging; it is a hash merge right now (see the sketch below this list)
  • Writing out the resulting dataframe into subsets
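
For context, a minimal sketch of the hash-merge idea in Dask. The `merge_subsets` helper, the subset dict, and the `id`/`source` index fields are illustrative assumptions, not the actual Fondant API:

    import dask.dataframe as dd

    # Assumed index fields shared by all subsets (illustrative).
    index_fields = ["id", "source"]

    def prefix_columns(df: dd.DataFrame, name: str) -> dd.DataFrame:
        # Prefix non-index columns with the subset name, e.g. "properties_name".
        return df.rename(
            columns={
                col: f"{name}_{col}" for col in df.columns if col not in index_fields
            }
        )

    def merge_subsets(subsets: dict) -> dd.DataFrame:
        # Plain hash merge on the index fields; no dask index is set,
        # matching the "hash merge" note above.
        dataframes = [prefix_columns(df, name) for name, df in subsets.items()]
        merged = dataframes[0]
        for df in dataframes[1:]:
            merged = merged.merge(df, on=index_fields, how="left")
        return merged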

@GeorgesLorre (Collaborator, Author) commented Apr 27, 2023

This was branched from #48 and merges back into it; the file diff is too big, let me investigate.

@GeorgesLorre GeorgesLorre marked this pull request as draft April 27, 2023 10:06
@GeorgesLorre GeorgesLorre linked an issue Apr 27, 2023 that may be closed by this pull request
@GeorgesLorre GeorgesLorre added the Core (Core framework) label Apr 27, 2023
@GeorgesLorre GeorgesLorre changed the base branch from feature/split-components-load-transform to main April 27, 2023 13:20
@GeorgesLorre GeorgesLorre force-pushed the feature/load-subset-into-dataframe branch from 3f933b6 to b2c790a April 27, 2023 14:38
@RobbeSneyders (Member) commented:

Nice, the partitions persisting when writing to and reading from parquet will be really beneficial to us.
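
To illustrate that behaviour (a standalone sketch, not project code): Dask writes one parquet file per partition, so the partitioning survives a round trip through parquet.

    import dask.dataframe as dd
    import pandas as pd

    df = dd.from_pandas(pd.DataFrame({"id": range(8)}), npartitions=4)
    df.to_parquet("data/")               # writes one parquet file per partition
    restored = dd.read_parquet("data/")  # small files come back one per partition
    assert restored.npartitions == 4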

tests/test_dataset.py (outdated review thread, resolved)
@GeorgesLorre GeorgesLorre force-pushed the feature/load-subset-into-dataframe branch from 3641dce to 1daed44 May 2, 2023 11:50
@GeorgesLorre GeorgesLorre marked this pull request as ready for review May 2, 2023 12:08
@GeorgesLorre GeorgesLorre changed the title (WIP) Feature/load subset into dataframe Load subset into dataframe May 2, 2023
@RobbeSneyders (Member) left a comment:

Thanks @GeorgesLorre, left some final comments.

pyproject.toml (outdated review thread, resolved)
pyproject.toml (outdated):
@@ -41,7 +41,7 @@ classifiers = [
[tool.poetry.dependencies]
python = "^3.8"
jsonschema = "^4.17.3"
dask = "^2022.2.0"
dask = "^2023.4.0"
Member:

Why don't we need the dataframe extra here?

            col: name + "_" + col for col in df.columns if col not in index_fields
        }
    )
    # df = df.rename(
Member:

Can we re-enable this?

fds = FondantDataset(manifest=manifest)
df = fds.load_dataframe(spec=component_spec)
assert len(df) == 151
assert list(df.columns) == ["id", "source", "Name", "HP", "Type 1", "Type 2"]
@NielsRogge (Contributor) commented May 2, 2023:

The column names should be "subset name underscore field name", like "properties_name". For example:
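
An illustrative assertion, assuming the fields above live in a subset named "properties" (the subset name is an assumption based on the "properties_name" example):

    assert list(df.columns) == [
        "id",
        "source",
        "properties_Name",
        "properties_HP",
        "properties_Type 1",
        "properties_Type 2",
    ]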

"""
This is a small script to split the raw data into different subsets to be used while testing.

The data is the first 151 pokemon and the following fields are available:
@NielsRogge (Contributor) commented May 2, 2023:

Not a fan of adding Parquet files to the GitHub repository: it slows things down, and since they stay in the commit history forever, simply removing them later won't make it faster.

I'd host the Parquet files on the Hub and load them using the hf_hub_download method, for example:
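
A sketch of that approach (the repo id and filename are placeholders; hf_hub_download is the real huggingface_hub API, and revision can pin a commit hash as mentioned further down):

    from huggingface_hub import hf_hub_download

    # Placeholder repo and filename; revision optionally pins an exact commit.
    path = hf_hub_download(
        repo_id="some-org/pokemon-test-data",
        filename="subsets/properties.parquet",
        repo_type="dataset",
        revision="main",
    )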

Member:

As long as we stay in the range of kilobytes, I don't see an issue with this. Adding data files for the tests is fine as long as the goal is to validate format handling and they contain minimal data.

Moving the files out of the repo decouples them from the versioning and adds complexity.

Member:

Moving the files out of the repo also makes our tests dependent on an internet connection, which will only bring pain.

Some best practices for tests:
https://web.archive.org/web/20210510024513/http://beyondcoding.net/articles/firstprinciples.html

Contributor:

Ok, fine for me as long as the files are very small. Although I've never had an issue with internet not being available when testing things out. With HF you can also pin a commit hash if you want to version the file.

@GeorgesLorre GeorgesLorre force-pushed the feature/load-subset-into-dataframe branch from 1daed44 to 129b603 May 2, 2023 14:22

pyarrow = "^11.0.0"
Member:

Nit: can we move this up from the optional to the required dependencies?

@RobbeSneyders (Member) left a comment:

Thanks @GeorgesLorre!

@NielsRogge (Contributor) left a comment:

Thanks for working on this!

@GeorgesLorre GeorgesLorre merged commit e6c2a71 into main May 2, 2023
@RobbeSneyders RobbeSneyders deleted the feature/load-subset-into-dataframe branch May 4, 2023 07:34
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
PhilippeMoussalli added a commit that referenced this pull request Jan 11, 2024
Fixes ml6team/fondant-internal#54

PR that adds the functionality to load PDF documents from different
local and remote storage locations.

The implementation differs from the suggested solution at
[#54](ml6team/fondant-internal#54) since:
* Accumulating different loaders and loading each document individually
seems inefficient, since it would require initializing a client, temp
storage, ... on every invocation
[link](https://github.com/langchain-ai/langchain/blob/04caf07dee2e2843ab720e5b8f0c0e83d0b86a3e/libs/community/langchain_community/document_loaders/gcs_file.py#L62)
* The langchain cloud loaders don't have a unified interface
  * Each would require specific arguments to be passed (in contrast,
fsspec is much simpler)
* Only the Google loader enables defining a custom loader class; the
rest use the `Unstructured` loader, which requires a lot of system and
CUDA dependencies to be installed (a lot of overhead for just loading
PDFs)

The current implementation relies on copying the PDFs to temporary
local storage and then loading them lazily using the
`PyPDFDirectoryLoader` (see the sketch below). The assumption for now
is that the loaded docs won't exceed the storage of the device, which
should be valid for most use cases. Later on, we can think about how
to optimize this further.
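
A sketch of that flow (the bucket path is a placeholder; fsspec handles the local or remote copy, and PyPDFDirectoryLoader is langchain's PDF directory loader, which needs pypdf installed):

    import tempfile

    import fsspec
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    remote_path = "gs://my-bucket/pdfs/"  # placeholder; any fsspec-supported URL

    with tempfile.TemporaryDirectory() as local_dir:
        # Copy the PDFs from (possibly remote) storage to temporary local storage.
        fs, _, paths = fsspec.get_fs_token_paths(remote_path)
        fs.get(paths[0], local_dir, recursive=True)

        # Load the documents lazily from the local copy.
        loader = PyPDFDirectoryLoader(local_dir)
        for document in loader.lazy_load():
            ...  # process each document while the temp dir still exists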

Labels: Core (Core framework)
Linked issue: Merge loaded subsets into a single dataframe
4 participants