feat: add server dataset hub import #5591
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## feat/argilla-direct-feature-branch #5591 +/- ##
======================================================================
- Coverage 91.23% 91.18% -0.05%
======================================================================
Files 145 150 +5
Lines 6058 6253 +195
======================================================================
+ Hits 5527 5702 +175
- Misses 531 551 +20
…5597)

# Description

I have added a limit on the number of rows taken when importing a dataset from the Hub, specifically 500K rows. This helps us avoid importing very large datasets with millions of rows into Argilla.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [ ] Manually tested importing a dataset with a large number of records (> 500K)

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Paco Aranda <[email protected]>
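A minimal sketch of how such a row cap could work, assuming rows arrive as an iterable (for example a streamed dataset). The names `MAX_ROWS_TO_IMPORT` and `take_rows` are hypothetical and are not taken from the Argilla code base:

```python
from itertools import islice
from typing import Any, Iterable, Iterator

# Assumption: the 500K cap mentioned in the description; the constant name is hypothetical.
MAX_ROWS_TO_IMPORT = 500_000


def take_rows(rows: Iterable[Any], limit: int = MAX_ROWS_TO_IMPORT) -> Iterator[Any]:
    """Lazily yield at most `limit` rows, so a huge dataset is never fully read."""
    return islice(rows, limit)


# Usage with a small limit so the effect is visible.
rows = ({"id": i} for i in range(10))
print(list(take_rows(rows, limit=3)))
# [{'id': 0}, {'id': 1}, {'id': 2}]
```

Because `islice` is lazy, rows beyond the limit are never pulled from the source iterator.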
# Description

This PR adds the following features:

* Add support for dataset `ClassLabel` features, casting them with `int2str` so that string values are stored in the imported Argilla dataset.
* Casting is now done at the row level, discarding keys that are not part of the mapping sources. By the time values are used to build the records to create, they have already been cast.
* We cast `ClassLabel` features to strings and `Image` features to data-URL strings.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Added tests that import real datasets from HF.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
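The row-level casting described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: `cast_row` and the `casters` dictionary are hypothetical, and a plain list lookup stands in for the `int2str` method of a `ClassLabel` feature:

```python
from typing import Any, Callable, Dict, Iterable


def cast_row(
    row: Dict[str, Any],
    casters: Dict[str, Callable[[Any], Any]],
    mapping_sources: Iterable[str],
) -> Dict[str, Any]:
    """Keep only the keys listed in `mapping_sources` and cast their values."""
    casted = {}
    for key in mapping_sources:
        if key not in row:
            continue
        caster = casters.get(key)
        # Keys without a caster (plain strings, numbers, ...) pass through unchanged.
        casted[key] = caster(row[key]) if caster else row[key]
    return casted


# Stand-in for ClassLabel.int2str: map an integer class index to its name.
label_names = ["negative", "positive"]
casters = {"label": lambda index: label_names[index]}

row = {"text": "great movie", "label": 1, "unused": 42}
print(cast_row(row, casters, mapping_sources=["text", "label"]))
# {'text': 'great movie', 'label': 'positive'}
```

Casting once per row means the code that later builds fields, metadata, or suggestions never has to know about the original feature types.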
# Description

This PR adds the following changes to how the `HubDataset` import functionality processes rows with images:

* If the image has no format, we convert it to `png`.
* We convert images to the `RGB` color space to avoid problems with other, unsupported color spaces.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Manually tested with the `microsoft/cats_vs_dogs` dataset.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
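A hedged sketch of the two image rules above, combined with the data-URL casting. The function name `image_to_data_url` is hypothetical; it expects a Pillow `Image`-like object and relies only on its `format`, `mode`, `convert`, and `save` members. Building the MIME type from the format name is a simplification that holds for common formats such as PNG and JPEG:

```python
import base64
import io


def image_to_data_url(image) -> str:
    """Encode a Pillow-style image as a data-URL string.

    Assumption: `image` behaves like `PIL.Image.Image`.
    """
    # Read the format first: Pillow's convert() returns a new image whose
    # format is None. Fall back to png when the image has no format.
    image_format = (image.format or "png").lower()

    # Normalize the color space so modes such as CMYK or P do not cause problems.
    if image.mode != "RGB":
        image = image.convert("RGB")

    buffer = io.BytesIO()
    image.save(buffer, format=image_format)
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f"data:image/{image_format};base64,{encoded}"
```

With Pillow installed, something like `image_to_data_url(Image.new("CMYK", (1, 1)))` should return a string starting with `data:image/png;base64,`.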
…atures (#5607)

# Description

This PR adds the following changes:

* Add support for `ClassLabel` features that use `-1` (no label) values, casting these values to `None`.
* When importing `fields`, `metadata`, or `suggestions`, values that are `None` are now ignored.

Refs argilla-io/roadmap#21

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Added new tests and modified existing ones.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
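To illustrate the `-1` handling and the skipping of `None` values, here is a small sketch. The helpers `cast_class_label` and `drop_none_values` are hypothetical names, and a list of class names stands in for a real `ClassLabel` feature:

```python
from typing import Any, Dict, List, Optional

NO_LABEL = -1  # ClassLabel value meaning "no label"


def cast_class_label(value: int, names: List[str]) -> Optional[str]:
    """Return the class name for `value`, or None when the row has no label."""
    if value == NO_LABEL:
        return None
    return names[value]


def drop_none_values(values: Dict[str, Any]) -> Dict[str, Any]:
    """Ignore entries whose value is None (fields, metadata, or suggestions)."""
    return {key: value for key, value in values.items() if value is not None}


names = ["cat", "dog"]
fields = {"text": "a photo", "label": cast_class_label(-1, names)}
print(drop_none_values(fields))
# {'text': 'a photo'}
```

The explicit check matters because in Python `names[-1]` would silently return the last class name instead of signalling a missing label.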
8e85f50 into feat/argilla-direct-feature-branch
Description

This PR adds changes to support importing datasets from the HF Hub into Argilla, with the following features:

* A `POST /api/v1/datasets/:dataset_id/import` endpoint that enqueues a background job to import records from a HF Hub dataset and returns information about the enqueued job. It expects the following parameters:
  * `name`: the name of the dataset (e.g. `lhoestq/demo1`). @burtenshaw suggested changing it to `repo_id`.
  * `subset`: the dataset subset (e.g. `default`). @burtenshaw suggested making the parameter optional.
  * `split`: the dataset split (e.g. `train`). @burtenshaw suggested making the parameter optional.
* A `HubDataset` class encapsulating all the logic to import a dataset from the Hub.
* An `/api/v1/jobs/:job_id` endpoint to get information about the status of one specific job. This is useful if the UI or the SDK needs to know whether the import process has finished. (@frascuchon we can use this to give information about other processes, for example when a dataset's distribution settings are changed.)

Refs argilla-io/roadmap#21
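As a rough illustration of how a client might call the import endpoint, the sketch below only builds the request without sending it. Several details are assumptions that the description does not confirm: that the parameters travel as a JSON body, that authentication uses an `X-Argilla-Api-Key` header, and the helper names `build_import_request` and `job_status_url`:

```python
import json
from urllib.request import Request


def build_import_request(
    base_url: str, dataset_id: str, name: str, subset: str, split: str, api_key: str
) -> Request:
    """Build (but do not send) the request that enqueues a Hub import job."""
    # Assumption: parameters are sent as a JSON body.
    payload = {"name": name, "subset": subset, "split": split}
    return Request(
        url=f"{base_url}/api/v1/datasets/{dataset_id}/import",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: API-key header name.
        headers={"Content-Type": "application/json", "X-Argilla-Api-Key": api_key},
        method="POST",
    )


def job_status_url(base_url: str, job_id: str) -> str:
    """URL of the job endpoint used to check whether the import has finished."""
    return f"{base_url}/api/v1/jobs/{job_id}"


request = build_import_request(
    "http://localhost:6900", "my-dataset-id", "lhoestq/demo1", "default", "train", "my-api-key"
)
print(request.get_method(), request.full_url)
# POST http://localhost:6900/api/v1/datasets/my-dataset-id/import
```

Passing the request to `urllib.request.urlopen` would send it. The shape of the returned job information is not specified in the description, so it is not modeled here.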
Type of change
How Has This Been Tested
Checklist