Load subset into dataframe #54
Conversation
This was branched from #48 and merges back into it. The file diff is too big, let me investigate.
Force-pushed from 3f933b6 to b2c790a.
Nice, the partitions persisting when writing to and reading from Parquet will be really beneficial to us.
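A minimal sketch of the Parquet round trip being referred to, assuming a local path, default settings, and pyarrow installed; the file name and columns are illustrative only:

```python
# Dask writes one Parquet file per partition, so the partition layout is
# typically preserved when the dataframe is read back in.
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({"id": range(8), "value": range(8)}), npartitions=4)
df.to_parquet("subset.parquet", write_index=True)

reloaded = dd.read_parquet("subset.parquet")
print(df.npartitions, reloaded.npartitions)  # both report 4 in this sketch
```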
Force-pushed from 3641dce to 1daed44.
Thanks @GeorgesLorre, left some final comments.
pyproject.toml
@@ -41,7 +41,7 @@ classifiers = [
 [tool.poetry.dependencies]
 python = "^3.8"
 jsonschema = "^4.17.3"
-dask = "^2022.2.0"
+dask = "^2023.4.0"
Why don't we need the `dataframe` extra here?
fondant/dataset.py
        col: name + "_" + col for col in df.columns if col not in index_fields
    }
)
# df = df.rename(
Can we re-enable this?
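For reference, a small self-contained sketch of the renaming pattern shown in the diff above; the subset name, index fields, and data are assumptions made for illustration:

```python
# Prefix every non-index column with its subset name, e.g. "Name" -> "properties_Name".
import dask.dataframe as dd
import pandas as pd

index_fields = ["id", "source"]  # assumed index fields
name = "properties"              # assumed subset name

df = dd.from_pandas(
    pd.DataFrame({"id": [1], "source": ["raw"], "Name": ["Bulbasaur"]}),
    npartitions=1,
)
df = df.rename(
    columns={col: f"{name}_{col}" for col in df.columns if col not in index_fields}
)
print(list(df.columns))  # ['id', 'source', 'properties_Name']
```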
tests/test_dataset.py
fds = FondantDataset(manifest=manifest)
df = fds.load_dataframe(spec=component_spec)
assert len(df) == 151
assert list(df.columns) == ["id", "source", "Name", "HP", "Type 1", "Type 2"]
The column names should be "subset name underscore field name", like "properties_name"
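A minimal illustration of the "subset name underscore field name" convention this comment describes; the subset and field names are assumptions:

```python
import pandas as pd

# e.g. a "properties" subset contributing a "name" field becomes "properties_name"
df = pd.DataFrame({"id": [1], "source": ["raw"], "properties_name": ["Bulbasaur"]})
assert list(df.columns) == ["id", "source", "properties_name"]
```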
tests/example_data/raw/split.py
""" | ||
This is a small script to split the raw data into different subsets to be used while testing. | ||
|
||
The data is the 151 first pokemon and the following field are available: |
I'm not a fan of adding Parquet files to the GitHub repository, as it slows things down. And since they stay in the commit history forever, removing them later won't make it faster either.
I'd host the Parquet files on the Hub and load them using the `hf_hub_download` method.
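A hedged sketch of that suggestion; the repo ID, file path, and revision below are placeholders, not real locations:

```python
# Download a test Parquet file from the Hugging Face Hub instead of committing it.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="some-org/fondant-test-data",  # hypothetical dataset repo
    filename="raw/subset_a.parquet",       # hypothetical file in that repo
    repo_type="dataset",
    revision="main",  # a commit hash can be pinned here to version the file
)
print(local_path)  # cached local path to the downloaded file
```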
As long as we stay in the range of kilobytes, I don't see an issue with this. Adding data files for the tests is fine as long as the goal is to validate format handling and they contain minimal data.
Moving the files out of the repo decouples them from versioning and adds complexity.
Moving the files out of the repo also makes our tests dependent on an internet connection, which will only bring pain.
Some best practices for tests:
https://web.archive.org/web/20210510024513/http://beyondcoding.net/articles/firstprinciples.html
OK, fine for me as long as the files are very small, although I've never had an issue with the internet not being available when testing things out. With HF you can also pin a commit hash if you want to fix the version of the file, if needed.
Force-pushed from 1daed44 to 129b603.
pyarrow = "^11.0.0"
Nit: can we move this up from the optional to the required dependencies?
Thanks @GeorgesLorre!
Thanks for working on this!
Load subset into dataframe
Fixes ml6team/fondant-internal#54

PR that adds the functionality to load PDF documents from different local and remote storage. The implementation differs from the solution suggested at [#54](ml6team/fondant-internal#54) since:

* Accumulating different loaders and loading each document individually seems inefficient, since it would require initializing a client, temp storage, ... on every invocation ([link](https://github.com/langchain-ai/langchain/blob/04caf07dee2e2843ab720e5b8f0c0e83d0b86a3e/libs/community/langchain_community/document_loaders/gcs_file.py#L62))
* The langchain cloud loaders don't have a unified interface:
  * Each requires specific arguments to be passed (in contrast, fsspec is much simpler)
  * Only the Google loader allows defining a custom loader class; the rest use the `Unstructured` loader, which requires a lot of system and CUDA dependencies to be installed (a lot of overhead for just loading PDFs)

The current implementation relies on copying the PDFs to temporary local storage and loading them using the `PyPDFDirectoryLoader`; they are then loaded lazily. The assumption for now is that the loaded docs won't exceed the storage of the device, which should hold for most use cases. Later on, we can think about how to optimize this further.
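A rough, hedged sketch of the copy-then-load approach described above, assuming fsspec, langchain-community, and pypdf are installed; the bucket path is a placeholder:

```python
# Copy PDFs from (possibly remote) storage to a local temp dir with fsspec,
# then parse them with langchain's PyPDFDirectoryLoader.
import tempfile

import fsspec
from langchain_community.document_loaders import PyPDFDirectoryLoader

remote_path = "gs://some-bucket/pdfs/"  # hypothetical remote location

with tempfile.TemporaryDirectory() as tmp_dir:
    fs, _, _ = fsspec.get_fs_token_paths(remote_path)
    fs.get(remote_path, tmp_dir, recursive=True)  # download the PDFs locally

    documents = PyPDFDirectoryLoader(tmp_dir).load()
    print(len(documents))
```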
This PR implements a very basic way of merging all subsets into 1 dataframe to be used in the component.
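A rough sketch of what such a merge could look like; the subset names, columns, and use of an "id" index are assumptions for illustration:

```python
# Prefix each subset's columns with its name and merge everything on the index
# into a single Dask dataframe.
import dask.dataframe as dd
import pandas as pd

subsets = {  # toy stand-ins for subsets read from Parquet
    "properties": pd.DataFrame({"id": [1, 2], "Name": ["Bulbasaur", "Ivysaur"]}),
    "types": pd.DataFrame({"id": [1, 2], "Type 1": ["Grass", "Grass"]}),
}

merged = None
for name, pdf in subsets.items():
    subset_df = dd.from_pandas(pdf.set_index("id"), npartitions=1)
    subset_df = subset_df.rename(
        columns={col: f"{name}_{col}" for col in subset_df.columns}
    )
    merged = (
        subset_df
        if merged is None
        else merged.merge(subset_df, left_index=True, right_index=True, how="left")
    )

print(merged.compute())
```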
Still to do:
Out of scope of this PR: