feat: add server dataset hub import #5591

jfcalvo · 2024-10-10T12:31:39Z

Description

This PR adds support for importing datasets from the HF Hub into Argilla, with the following changes:

Add a new POST /api/v1/datasets/:dataset_id/import endpoint that enqueues a background job to import records from a HF Hub dataset and returns information about the enqueued job. It expects the following parameters:
- name: the name of the dataset (e.g. lhoestq/demo1). @burtenshaw suggested renaming it to repo_id.
- subset: the dataset subset (e.g. default). @burtenshaw suggested making this parameter optional.
- split: the dataset split (e.g. train). @burtenshaw suggested making this parameter optional.
Add a new background job so the import process runs outside of request time.
Add a new HubDataset class encapsulating all the logic to import a dataset from the Hub.
Add a new /api/v1/jobs/:job_id endpoint to get the status of a specific job. This is useful if the UI or the SDK needs to know whether the import process has finished. (@frascuchon we can use this to report on other processes, for example when a dataset's distribution settings are changed.)
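
As a rough illustration of how a client might use the two endpoints described above, the sketch below builds the import request and then polls the job endpoint. The paths and the name, subset and split parameters come from the description. Everything else is an assumption made for the example: the helper names, the injected `http_get` callable, and in particular the `status` field and its `finished`/`failed` values, which are not a confirmed response schema.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def build_import_request(
    base_url: str,
    dataset_id: str,
    name: str,
    subset: Optional[str] = None,
    split: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return the (url, json_body) pair for the import endpoint."""
    url = f"{base_url.rstrip('/')}/api/v1/datasets/{dataset_id}/import"
    body = {"name": name}
    # subset and split are only sent when given (reviewers suggested making them optional).
    if subset is not None:
        body["subset"] = subset
    if split is not None:
        body["split"] = split
    return url, body


def wait_for_job(
    http_get: Callable[[str], Dict[str, Any]],
    base_url: str,
    job_id: str,
    interval: float = 1.0,
    max_polls: int = 60,
) -> Dict[str, Any]:
    """Poll /api/v1/jobs/:job_id until the job reaches a terminal state.

    Assumption: the job JSON has a "status" field whose terminal values
    are "finished" and "failed". Adjust to the real response schema.
    """
    url = f"{base_url.rstrip('/')}/api/v1/jobs/{job_id}"
    for _ in range(max_polls):
        job = http_get(url)
        if job.get("status") in ("finished", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing the HTTP call in as a function keeps the sketch independent of any particular client library; in practice `http_get` would wrap an authenticated GET that returns the decoded JSON body.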

Refs argilla-io/roadmap#21

Type of change

New feature (non-breaking change which adds functionality)

How Has This Been Tested

Tested manually and by adding more automated tests to our suite.

Checklist

I added relevant documentation
I followed the style guidelines of this project
I did a self-review of my code
I made corresponding changes to the documentation
I confirm my changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

codecov · 2024-10-10T12:37:23Z

Codecov Report

Attention: Patch coverage is 90.00000% with 22 lines in your changes missing coverage. Please review.

Project coverage is 91.18%. Comparing base (ac29661) to head (f064267).
Report is 1 commits behind head on feat/argilla-direct-feature-branch.

Files with missing lines	Patch %	Lines
...-server/src/argilla_server/api/handlers/v1/jobs.py	66.66%	7 Missing ⚠️
argilla-server/src/argilla_server/jobs/hub_jobs.py	75.00%	5 Missing ⚠️
...rgilla_server/api/handlers/v1/datasets/datasets.py	50.00%	4 Missing ⚠️
...c/argilla_server/api/policies/v1/dataset_policy.py	40.00%	3 Missing ⚠️
argilla-server/src/argilla_server/contexts/hub.py	98.21%	2 Missing ⚠️
...r/src/argilla_server/api/policies/v1/job_policy.py	80.00%	1 Missing ⚠️

Additional details and impacted files

@@                          Coverage Diff                           @@
##           feat/argilla-direct-feature-branch    #5591      +/-   ##
======================================================================
- Coverage                               91.23%   91.18%   -0.05%     
======================================================================
  Files                                     145      150       +5     
  Lines                                    6058     6253     +195     
======================================================================
+ Hits                                     5527     5702     +175     
- Misses                                    531      551      +20

Flag	Coverage Δ
argilla-server	`91.18% <90.00%> (-0.05%)`	⬇️

…5597)

# Description

I have added a limit on the number of rows taken when importing a dataset from the hub: 500K rows. This helps us avoid importing very large datasets with millions of rows into Argilla.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [ ] Manually tested importing a dataset with a large number of records (> 500K)

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Paco Aranda <[email protected]>
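
The 500K row cap described above can be pictured as a lazy truncation of the row stream. This is a minimal sketch using only the standard library, not the code from the PR: `MAX_ROWS` and `take_rows` are hypothetical names. When a dataset is loaded in streaming mode, the `datasets` library offers a comparable `take` method on the streamed dataset.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

# Limit stated in the description; the constant name is hypothetical.
MAX_ROWS = 500_000


def take_rows(rows: Iterable[Dict[str, Any]], limit: int = MAX_ROWS) -> Iterator[Dict[str, Any]]:
    """Yield at most `limit` rows without materializing the whole iterable."""
    return islice(rows, limit)
```

Because `islice` stops pulling from the source once the limit is reached, rows beyond the cap are never fetched.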

# Description

This PR adds the following features:

* Add support for dataset `ClassLabel` features, casting them with `int2str` so we store string values in the imported Argilla dataset.
* Casting is now done at the row level, discarding keys that are not part of the mapping sources. By the time the values are used to create the different record values, they are already cast.
* We cast `ClassLabel` features to strings and `Image` features to data-URL strings.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding additional tests that import real datasets from HF.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
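
To make the row-level casting above concrete, here is a small standard-library sketch. It is not the `HubDataset` implementation: `cast_row` and `to_data_url` are hypothetical helpers, and a plain list of label names stands in for what `ClassLabel.int2str` provides in the `datasets` library.

```python
import base64
from typing import Any, Dict, List, Set


def cast_row(
    row: Dict[str, Any],
    class_label_names: Dict[str, List[str]],
    mapped_sources: Set[str],
) -> Dict[str, Any]:
    """Keep only mapped columns and cast class-label integers to their names."""
    casted = {}
    for key, value in row.items():
        if key not in mapped_sources:
            continue  # discard keys that are not part of the mapping sources
        names = class_label_names.get(key)
        if names is not None and isinstance(value, int):
            if not 0 <= value < len(names):
                raise ValueError(f"label index {value} out of range for column {key!r}")
            value = names[value]
        casted[key] = value
    return casted


def to_data_url(image_bytes: bytes, image_format: str = "png") -> str:
    """Encode already-serialized image bytes as a data-URL string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{image_format.lower()};base64,{encoded}"
```

For example, `cast_row({"text": "hi", "label": 1, "extra": 0}, {"label": ["neg", "pos"]}, {"text", "label"})` drops `extra` and turns the label into `"pos"`.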

# Description

This PR adds the following changes to how the `HubDataset` import functionality processes rows with images:

* If the image has no format, we transform it to `png`.
* We convert images to the `RGB` color space to avoid problems with other, unsupported color spaces.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Manually testing the `microsoft/cats_vs_dogs` dataset.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

…atures (#5607)

# Description

This PR adds the following changes:

* Add support for `ClassLabel` features that use `-1` (no label) values, casting these values to `None`.
* When importing `fields`, `metadata`, or `suggestions`, values that are `None` are now ignored.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding new tests and modifying existing ones.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
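
The handling of `-1` labels and `None` values described above could look roughly like the following. This is a sketch under assumptions rather than the PR's code: the helper names and the `NO_LABEL` constant are hypothetical, and a list of names again stands in for the `ClassLabel` feature.

```python
from typing import Any, Dict, List, Optional

# ClassLabel value meaning "no label", as stated in the description.
NO_LABEL = -1


def cast_class_label(value: Optional[int], names: List[str]) -> Optional[str]:
    """Map a class-label integer to its name, or to None when there is no label."""
    if value is None or value == NO_LABEL:
        return None
    return names[value]


def drop_none_values(values: Dict[str, Any]) -> Dict[str, Any]:
    """Ignore entries whose value is None, e.g. before building fields or metadata."""
    return {key: value for key, value in values.items() if value is not None}
```

Checking for `-1` before indexing matters in Python, because `names[-1]` would otherwise silently return the last label name.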

feat: first iteration of background job to import datasets from hub

b2594b9

jfcalvo requested review from frascuchon and burtenshaw October 10, 2024 12:31

jfcalvo changed the title ~~feat: add server dataset hub import~~ [DRAFT] feat: add server dataset hub import Oct 10, 2024

jfcalvo marked this pull request as draft October 10, 2024 12:33

jfcalvo changed the title ~~[DRAFT] feat: add server dataset hub import~~ feat: add server dataset hub import Oct 10, 2024

jfcalvo commented Oct 11, 2024

argilla-server/src/argilla_server/jobs/hub_jobs.py (outdated)

jfcalvo added 2 commits October 11, 2024 11:43

feat: improve import_dataset_from_hub_job to get dataset before instantiate HubDataset class

6108ffe

feat: improve HubDataset batch processing

3a875ee

burtenshaw reviewed Oct 11, 2024

argilla-server/pyproject.toml

jfcalvo and others added 5 commits October 11, 2024 14:56

feat: use UpsertRecordsBulk of CreateRecordsBulk for importing datasets from hub

b10a92e

feat: transform dataset importing value columns with PIL images to data URLs

2b47522

feat: add support to map suggestions importing datasets from hub

7f93e0b

Merge branch 'feat/argilla-direct-feature-branch' into feat/add-hub-dataset-import

07aeec6

Merge branch 'feat/argilla-direct-feature-branch' into feat/add-hub-dataset-import

83237a7

frascuchon reviewed Oct 14, 2024

argilla-server/src/argilla_server/api/handlers/v1/datasets/datasets.py

frascuchon reviewed Oct 14, 2024

argilla-server/src/argilla_server/jobs/hub_jobs.py (outdated)

jfcalvo added 6 commits October 14, 2024 15:32

feat: add support for hub dataset mapping

a38fdda

feat: set metadata and suggestions as optional for HubDatasetMapping

15d694f

feat: when no external_id is mapped row_idx is used

c3752e8

feat: use streaming when loading the dataset

469ebc4

feat: refactor UpsertRecordsBulk to validate records individually

b665019

feat: ignore invalid records when importing datasets from hub

a2fbc10

frascuchon reviewed Oct 15, 2024

argilla-server/src/argilla_server/bulk/records_bulk.py (outdated)

jfcalvo and others added 5 commits October 15, 2024 15:10

Merge branch 'feat/argilla-direct-feature-branch' into feat/add-hub-dataset-import

f3e33bd

Merge branch 'feat/argilla-direct-feature-branch' into feat/add-hub-dataset-import

fde2603

Merge branch 'feat/argilla-direct-feature-branch' into feat/add-hub-dataset-import

e84fb45

jfcalvo and others added 2 commits October 18, 2024 12:56

jfcalvo marked this pull request as ready for review October 18, 2024 11:03

frascuchon approved these changes Oct 18, 2024

jfcalvo merged commit 8e85f50 into feat/argilla-direct-feature-branch Oct 18, 2024
5 of 6 checks passed

jfcalvo deleted the feat/add-hub-dataset-import branch October 18, 2024 11:27
