feat: add server dataset hub import #5591

Merged

jfcalvo
Member

@jfcalvo jfcalvo commented Oct 10, 2024

Description

This PR adds changes to support importing datasets from the HF Hub into Argilla, with the following features:

  • Add a new POST /api/v1/datasets/:dataset_id/import endpoint that enqueues a background job to import records from a HF Hub dataset and returns information about the enqueued job. It expects the following parameters:
    • name: the name of the dataset (e.g. lhoestq/demo1). @burtenshaw suggested renaming it to repo_id.
    • subset: the dataset subset (e.g. default). @burtenshaw suggested making this parameter optional.
    • split: the dataset split (e.g. train). @burtenshaw suggested making this parameter optional.
  • Add a new background job so the import process can be done outside of request time.
  • Add a new HubDataset class encapsulating all the logic to import a dataset from the Hub.
  • Add a new /api/v1/jobs/:job_id endpoint to get information about the status of a specific job. This is useful if the UI or the SDK needs to know whether the import process has finished. (@frascuchon we can use this to report on other processes, for example when a dataset's distribution settings are changed).
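
As a rough client-side illustration of the two endpoints described above, the sketch below builds the import request and polls a job until it completes. Only the paths and the `name`, `subset` and `split` parameters come from this description. The `X-Argilla-Api-Key` header, the `status` field, the terminal status values, and the helper names `build_import_request` and `wait_for_job` are assumptions made for the example.

```python
import json
import time
from typing import Callable, Dict
from urllib import request

# Assumed set of terminal states; the PR description does not list job statuses.
TERMINAL_STATUSES = {"finished", "failed"}


def build_import_request(base_url: str, dataset_id: str, api_key: str,
                         name: str, subset: str, split: str) -> request.Request:
    """Build (but do not send) the POST request that enqueues a Hub import."""
    payload = {"name": name, "subset": subset, "split": split}
    return request.Request(
        url=f"{base_url}/api/v1/datasets/{dataset_id}/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Argilla-Api-Key": api_key,  # assumed authentication header
        },
        method="POST",
    )


def wait_for_job(fetch_job: Callable[[str], Dict], job_id: str,
                 interval: float = 1.0, max_attempts: int = 60) -> Dict:
    """Poll fetch_job(job_id) until the job reports a terminal status."""
    for _ in range(max_attempts):
        job = fetch_job(job_id)
        # "status" is an assumed field name in the job payload.
        if job.get("status") in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

The request object can be sent with `urllib.request.urlopen`, and `fetch_job` would typically wrap a call to /api/v1/jobs/:job_id that decodes the JSON response. Passing the fetcher in as a callable keeps the polling loop independent of the HTTP client.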

Refs argilla-io/roadmap#21

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested

  • Manual testing, plus additional automated tests added to our suite.

Checklist

  • I added relevant documentation
  • I followed the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • I confirm My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@jfcalvo jfcalvo changed the title feat: add server dataset hub import [DRAFT] feat: add server dataset hub import Oct 10, 2024
@jfcalvo jfcalvo marked this pull request as draft October 10, 2024 12:33
@jfcalvo jfcalvo changed the title [DRAFT] feat: add server dataset hub import feat: add server dataset hub import Oct 10, 2024

codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 90.00000% with 22 lines in your changes missing coverage. Please review.

Project coverage is 91.18%. Comparing base (ac29661) to head (f064267).
Report is 1 commits behind head on feat/argilla-direct-feature-branch.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...-server/src/argilla_server/api/handlers/v1/jobs.py | 66.66% | 7 Missing ⚠️ |
| argilla-server/src/argilla_server/jobs/hub_jobs.py | 75.00% | 5 Missing ⚠️ |
| ...rgilla_server/api/handlers/v1/datasets/datasets.py | 50.00% | 4 Missing ⚠️ |
| ...c/argilla_server/api/policies/v1/dataset_policy.py | 40.00% | 3 Missing ⚠️ |
| argilla-server/src/argilla_server/contexts/hub.py | 98.21% | 2 Missing ⚠️ |
| ...r/src/argilla_server/api/policies/v1/job_policy.py | 80.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@                          Coverage Diff                           @@
##           feat/argilla-direct-feature-branch    #5591      +/-   ##
======================================================================
- Coverage                               91.23%   91.18%   -0.05%     
======================================================================
  Files                                     145      150       +5     
  Lines                                    6058     6253     +195     
======================================================================
+ Hits                                     5527     5702     +175     
- Misses                                    531      551      +20     
Flag Coverage Δ
argilla-server 91.18% <90.00%> (-0.05%) ⬇️


jfcalvo and others added 5 commits October 15, 2024 15:10
…5597)

# Description

I have added a limit on the number of rows taken when importing a
dataset from the Hub, specifically 500K rows. This helps us avoid
importing very large datasets with millions of rows into Argilla.
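
One way to picture such a cap is a lazy slice over the row iterator, so that rows past the limit are never read. This is only a sketch: the 500K figure comes from the description above, while `MAX_ROWS` and `take_rows` are hypothetical names and the actual import code may apply the limit differently.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

# Limit stated in the description above; the constant name is hypothetical.
MAX_ROWS = 500_000


def take_rows(rows: Iterable[Dict[str, Any]], limit: int = MAX_ROWS) -> Iterator[Dict[str, Any]]:
    """Yield at most `limit` rows without consuming the rest of the iterable."""
    return islice(rows, limit)
```

Because `islice` is lazy, this also works with a streamed dataset: iteration simply stops once the limit is reached.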

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [ ] Manually tested importing a dataset with a large number of records
(> 500K)

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Paco Aranda <[email protected]>
# Description

This PR adds the following features:
* Add support for dataset `ClassLabel` features, casting them with
`int2str` so that string values are stored in the imported Argilla dataset.
* Casting is now done at the row level, discarding keys that are not
part of the mapping sources. By the time the values are used to create
records, they have already been cast.
* `ClassLabel` features are cast to strings and `Image` features to
data-URL strings.
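
The row-level casting described above could look roughly like the following. This is a sketch under assumptions, not the PR's code: `cast_row`, `sources` and `casters` are hypothetical names, and a plain list of label names stands in for what `ClassLabel.int2str` does in the `datasets` library.

```python
from typing import Any, Callable, Dict, Set


def cast_row(row: Dict[str, Any], sources: Set[str],
             casters: Dict[str, Callable[[Any], Any]]) -> Dict[str, Any]:
    """Keep only keys that are mapping sources and cast their values."""
    casted = {}
    for key, value in row.items():
        if key not in sources:
            continue  # keys outside the mapping sources are discarded
        caster = casters.get(key)
        casted[key] = caster(value) if caster else value
    return casted


# Stand-in for a ClassLabel feature: integer class ids map to label names.
label_names = ["negative", "positive"]
casters = {"label": lambda class_id: label_names[class_id]}

row = {"text": "great movie", "label": 1, "unused_column": 42}
casted_row = cast_row(row, sources={"text", "label"}, casters=casters)
```

Casting once per row means every later step receives values that are already in their final form.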

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding additional tests that import real datasets from HF.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
jfcalvo and others added 2 commits October 18, 2024 12:56
# Description

This PR adds the following changes to how the `HubDataset` import
functionality processes rows with images:
* If the image has no format, we convert it to `png`.
* We convert images to the `RGB` color space to avoid problems with
unsupported color spaces.
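
A minimal sketch of these two rules using Pillow, combined with the data-URL encoding mentioned earlier in this PR. The function name `image_to_data_url` is hypothetical and the actual `HubDataset` code may be organized differently.

```python
import base64
import io

from PIL import Image


def image_to_data_url(image: Image.Image) -> str:
    """Encode a PIL image as an RGB data URL, defaulting to PNG."""
    # Read the format first: Image.convert() returns a new image with no format.
    image_format = (image.format or "PNG").lower()
    rgb_image = image.convert("RGB")

    buffer = io.BytesIO()
    rgb_image.save(buffer, format=image_format)
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f"data:image/{image_format};base64,{encoded}"
```

Images created in memory (rather than opened from a file) have `format` set to `None`, which is the case the `png` fallback covers.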

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Manually testing `microsoft/cats_vs_dogs` dataset.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)
…atures (#5607)

# Description

This PR adds the following changes:
* Add support for `ClassLabel` features that use `-1` (no label) values,
casting these values to `None`.
* When importing `fields`, `metadata`, or `suggestions`, values that are
`None` are now ignored.
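
Both behaviours can be sketched in a few lines. The helper names `label_to_str` and `drop_none` are hypothetical, and the list of names again stands in for a `ClassLabel` feature.

```python
from typing import Any, Dict, List, Optional


def label_to_str(class_id: int, names: List[str]) -> Optional[str]:
    """Cast a class id to its label name, treating -1 as 'no label'."""
    if class_id == -1:
        return None
    return names[class_id]


def drop_none(values: Dict[str, Any]) -> Dict[str, Any]:
    """Ignore entries whose value is None before building a record."""
    return {key: value for key, value in values.items() if value is not None}


names = ["cat", "dog"]
metadata = drop_none({
    "animal": label_to_str(-1, names),
    "source": "hub",
})
```

The explicit `-1` check matters in plain Python because `names[-1]` would otherwise silently return the last label.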

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Adding new tests and modifying existing ones.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm My changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@jfcalvo jfcalvo marked this pull request as ready for review October 18, 2024 11:03
@jfcalvo jfcalvo merged commit 8e85f50 into feat/argilla-direct-feature-branch Oct 18, 2024
5 of 6 checks passed
@jfcalvo jfcalvo deleted the feat/add-hub-dataset-import branch October 18, 2024 11:27
3 participants