Adding Dataverse to Croissant 🥐 Online Health #530

pdurbin · 2024-02-16T17:09:32Z

Hi! I'm interested in some details about how to add Dataverse to Croissant 🥐 Online Health.

In health/crawler/spiders/huggingface.py I'm seeing an example for Hugging Face like this:

https://datasets-server.huggingface.co/croissant?dataset=mnist

Is this roughtly what you need from us? Can the URL be different? A URL like the following would fit better into our existing pattern:

https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/TJCLKP

Thanks!

marcenacp · 2024-02-19T21:52:39Z

Hi @pdurbin, thanks for creating this issue! The best way to crawl Hugging Face was to list all datasets and then visit each individual URL using their API.

https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/TJCLKP is perfect. How can I derive this URL for all other datasets on Dataverse? Do you have an API? Or should we crawl a website?

pdurbin · 2024-02-22T20:00:02Z

@marcenacp hi! We list all the datasets in a sitemap, so https://dataverse.harvard.edu/sitemap.xml contains this, for example:

<loc>https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP</loc>

Would that be enough? From there you could extract the DOI/Persistent ID and put into a (future) URL to download metadata in Croissant format. (I used exporter=schema.org in the example above but we would create a new exporter for Croissant.)

benjelloun · 2024-02-23T10:55:01Z

Hi Phil, It would be great to also embed the Croissant metadata inside the dataset Web pages (as an extension of the schema.org metadata you already have) so that it's crawlable by Search engines, and provide a download link / button for users. @pierre Marcenac ***@***.***> For Croissant Health we should ideally favor the Croissant metadata directly present in Web pages rather than the one available via an API call. As we realized with HuggingFace, there can be some slight discrepancies between them. Best, Omar

…

On Thu, Feb 22, 2024 at 9:00 PM Philip Durbin ***@***.***> wrote: @marcenacp <https://github.com/marcenacp> hi! We list all the datasets in a sitemap, so https://dataverse.harvard.edu/sitemap.xml contains this, for example: <loc> https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP </loc> Would that be enough? From there you could extract the DOI/Persistent ID and put into a (future) URL to download metadata in Croissant format. (I used exporter=schema.org in the example above but we would create a new exporter for Croissant.) — Reply to this email directly, view it on GitHub <#530 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABMV3YRHUP2RYJLMJ2MSLS3YU6PU7AVCNFSM6AAAAABDMLSV6GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRQGE4DANJYGM> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

pdurbin · 2024-02-23T16:50:39Z

@benjelloun thanks. Right, the schema.org metadata inside the <head> of Dataverse web pages (what we'd call "dataset landing pages") was added to Dataverse in 2017 to support Google Dataset Search, which was new at the time.

I've been thinking that I'd leave that format alone (which we call schema.org internally and "Schema.org JSON-LD" in our web interface) because I wouldn't want to break anything related to Google Dataset Search.

However, I know the Croissant team is collaborating with the Google Dataset Search team. If you're saying it's safe to modify the older format, extend it as you say, without breaking Google Dataset Search, that's great.

From a product perspective, I'm wondering if it would make sense to rebrand this format within Dataverse. As shown in the screenshot below, we call the older format "Schema.org JSON-LD" but perhaps we should simply change the label to "Croissant" if we proceed with a single, updated, extended format.

Alternatively, we could add Croissant as a new format (with an internal name of croissant, probably), swap it into the <head> (I assume we wouldn't want both formats in the <head>!), and still offer the older format to download via API or a button click.

We'll probably know more once we start hacking and see that Croissant is truly an extension, with no backward-incompatible changes from the older format.

qqmyers · 2024-02-24T15:37:25Z

Perhaps Signposting would be a way to expose the URL for the desired format in a standardized way. Dataverse already supports signposting and we could nominally add new metadata formats to what is listed for level 2 metadata formats.

pdurbin · 2024-02-26T11:39:36Z

@qqmyers hmm, good point. Signposting does provide a machine-readable way (HTTP headers) to retrieve URLs with more information.

@benjelloun @marcenacp are you familiar with Signposting? What do you think?

To use my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP as an example, the "link" header looks like this:

link: <0000-0002-9528-9470>;rel="author", https://doi.org/10.7910/DVN/TJCLKP;rel="cite-as", https://dataverse.harvard.edu/api/access/datafile/6867328;rel="item";type="application/zip",https://dataverse.harvard.edu/api/access/datafile/6867331;rel="item";type="text/tab-separated-values",https://dataverse.harvard.edu/api/access/datafile/6867336;rel="item";type="text/x-python-script", https://doi.org/10.7910/DVN/TJCLKP;rel="describedby";type="application/vnd.citationstyles.csl+json",https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP;rel="describedby";type="application/ld+json", https://schema.org/AboutPage;rel="type",https://schema.org/Dataset;rel="type", http://creativecommons.org/publicdomain/zero/1.0;rel="license", https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/linkset?persistentId=doi:10.7910/DVN/TJCLKP ; rel="linkset";type="application/linkset+json"

benjelloun · 2024-02-26T13:55:18Z

On Mon, Feb 26, 2024 at 12:39 PM Philip Durbin ***@***.***> wrote: @qqmyers <https://github.com/qqmyers> hmm, good point. Signposting does provide a machine-readable way (HTTP headers) to retrieve URLs with more information. @benjelloun <https://github.com/benjelloun> @marcenacp <https://github.com/marcenacp> are you familiar with Signposting <https://signposting.org>? What do you think?

Signposting looks interesting! At the moment, we do have restrictions on the crawling side that prevent us from reading metadata that is not directly embedded in the Webpage, so we can't rely on this approach in the near future unfortunately.

…

To use my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP as an example, the "link" header looks like this: link: <0000-0002-9528-9470>;rel="author", https://doi.org/10.7910/DVN/TJCLKP;rel="cite-as", https://dataverse.harvard.edu/api/access/datafile/6867328 ;rel="item";type="application/zip", https://dataverse.harvard.edu/api/access/datafile/6867331 ;rel="item";type="text/tab-separated-values", https://dataverse.harvard.edu/api/access/datafile/6867336;rel="item";type="text/x-python-script", https://doi.org/10.7910/DVN/TJCLKP ;rel="describedby";type="application/vnd.citationstyles.csl+json", https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP;rel="describedby";type="application/ld+json", https://schema.org/AboutPage;rel="type",https://schema.org/Dataset;rel="type", http://creativecommons.org/publicdomain/zero/1.0;rel="license", https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/linkset?persistentId=doi:10.7910/DVN/TJCLKP ; rel="linkset";type="application/linkset+json" — Reply to this email directly, view it on GitHub <#530 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABMV3YQ3UKXSXFP3N4DX3DTYVRYAJAVCNFSM6AAAAABDMLSV6GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRTHE2DGOBWG4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

benjelloun · 2024-02-26T14:06:29Z

Hi Phil, Please see replies inline.

On Fri, Feb 23, 2024 at 5:50 PM Philip Durbin ***@***.***> wrote: @benjelloun <https://github.com/benjelloun> thanks. Right, the schema.org metadata inside the <head> of Dataverse web pages (what we'd call "dataset landing pages") was added <IQSS/dataverse#4252> to Dataverse in 2017 to support Google Dataset Search, which was new at the time. Screenshot.2024-02-23.at.11.41.47.AM.png (view on web) <https://github.com/mlcommons/croissant/assets/21006/d483b7c4-02a9-4ea4-b3d2-2c417a35a96b> I've been thinking that I'd leave that format alone (which we call schema.org internally and "Schema.org JSON-LD" in our web interface) because I wouldn't want to break anything related to Google Dataset Search. However, I know the Croissant team is collaborating with the Google Dataset Search team. If you're saying it's safe to modify the older format, extend it as you say, without breaking Google Dataset Search, that's great.

I am also a member of the Google Dataset Search team :) I would definitely encourage you to modify the older format. That should not break Dataset Search. In case there are any unforeseen issues, we will fix them with high priority.

From a product perspective, I'm wondering if it would make sense to rebrand this format within Dataverse. As shown in the screenshot below, we call the older format "Schema.org JSON-LD" but perhaps we should simply change the label to "Croissant" if we proceed with a single, updated, extended format.

I think that would be a fine change. I don't think "schema.org JSON-LD" as an export format is very useful, because that format is primarily targeted at Search Engines, while Croissant is meant to be used as a working format for ML datasets.

Alternatively, we could add Croissant as a new format (with an internal name of croissant, probably), swap it into the <head> (I assume we wouldn't want both formats in the <head>!), and still offer the older format to download via API or a button click.

You definitely want the Croissant description (which extends the schema.org/Dataset one) to be embedded in the dataset page, so that it can be picked up by Search engines like Dataset Search.

We'll probably know more once we start hacking and see that Croissant is truly an extension, with no backward-incompatible changes from the older format.

Sounds good! Please reach out if you have any questions. This makes me think that we should probably write a short migration guide for dataset authors / platforms that already support schema.org/Dataset and would like to upgrade to Croissant. Best, Omar

…

Screenshot.2024-02-23.at.11.35.24.AM.png (view on web) <https://github.com/mlcommons/croissant/assets/21006/04c57d48-4364-45e9-b3f6-1abaea0d8fe2> — Reply to this email directly, view it on GitHub <#530 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABMV3YRCAZIW6BS453JUPHLYVDCGZAVCNFSM6AAAAABDMLSV6GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRRGY3DKNZWHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

4tikhonov · 2024-02-26T14:43:26Z

Hi @benjelloun, Signposting is a child of Herbert van de Sompel, you can find the full spec here: https://signposting.org/FAIR/
We at DANS have contributed the initial version in Dataverse, lately it was extended and leveraged by Harvard and GDCC. I can bring Herbert here if you're interested. :)

pdurbin · 2024-03-20T19:44:55Z

@benjelloun @goeffthomas et al. thanks for the opportunity today to present to the Croissant Task Force the progress @4tikhonov and I have made toward supporting Croissant in Dataverse. Here are the talking points I was reading from as well as Slava's slides. (We also shared them with the Dataverse Google Group.)

As you suggested, I'll continue creating GitHub issues. And I'll come when I can to task force meetings. Thanks again!

pdurbin · 2024-05-06T20:00:24Z

@marcenacp you seem to have written most of the crawler. I got a weird error when I got to the scrapydweb step: err.txt. Any advice? Thanks!

Oh, I did uncomment this early return because I don't want to crawl all of Hugging Face, if that matters:

# Uncomment this early return for debugging purposes:
return [
    "lkarjun/Malayalam-Artiicles",

(I think the README might need to be updated, by the way. start_requests was removed from huggingface.py in 5264dcf.)

pdurbin · 2024-05-07T14:18:28Z

I opened a dedicated issue about scrapydweb:

health: scrapydweb fails to launch, seems to require newer version #647

marcenacp · 2024-05-14T14:13:34Z

@pdurbin, just saw this comment. I answered on #647. Thanks!

pdurbin · 2024-08-22T19:27:50Z

Good news! We just enabled Croissant on Harvard Dataverse. I wrote up a little announcement.

From my perspective, this issue (#530) is now unblocked in that somebody could probably fork this repo and start trying to add a dataverse.py script to health/crawler/spiders. But who should do that? Last I tried I had trouble (@marcenacp thanks for replying on #647).

The main thing to keep in mind is that there are 120+ installations of Dataverse out there. dataverse.py should probably:

check the list of Dataverse installations in JSON
for each installation
- check the sitemap (e.g. https://dataverse.harvard.edu/sitemap.xml if it uses a single sitemap or /sitemap/sitemap_index.xml if it uses multiple sitemaps)
- pick the first dataset and try to download the croissant export format (e.g. https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP ) to know if Croissant is enabled or not. Alternatively, you could check the <head> tag for Croissant metadata.
for the Dataverse installations where Croissant is enabled... do whatever health/crawler/spiders/huggingface.py or health/crawler/spiders/openml.py does to add those systems to Croissant Online Health. 😄

benjelloun · 2024-08-22T19:38:42Z

Great news! Thank you Phil for pushing this through. I'm very excited about the wealth of datasets that will become available in Croissant thanks to Dataverse. Best, Omar

…

On Thu, Aug 22, 2024 at 3:28 PM Philip Durbin ***@***.***> wrote: Good news! We just enabled <IQSS/dataverse.harvard.edu#294> Croissant on Harvard Dataverse. I wrote up a little announcement <https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/dLmV7HTcAgAJ> . From my perspective, this issue (#530 <#530>) is now unblocked in that somebody could probably fork this repo and start trying to add a dataverse.py script to health/crawler/spiders <https://github.com/mlcommons/croissant/tree/v1.0.7/health/crawler/spiders>. But who should do that? Last I tried I had trouble ***@***.*** <https://github.com/marcenacp> thanks for replying on #647 <#647>). The main thing to keep in mind is that there are 120+ installations <https://dataverse.org/installations> of Dataverse out there. dataverse.py should probably: - check the list of Dataverse installations in JSON <https://iqss.github.io/dataverse-installations/data/data.json> - for each installation - check the sitemap (e.g. https://dataverse.harvard.edu/sitemap.xml if it uses a single sitemap <https://guides.dataverse.org/en/6.3/installation/config.html#single-sitemap-file> or /sitemap/sitemap_index.xml if it uses multiple sitemaps <https://guides.dataverse.org/en/6.3/installation/config.html#multiple-sitemap-files-sitemap-index-file> ) - pick the first dataset and try to download the croissant export format (e.g. https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP ) to know if Croissant is enabled or not. Alternatively, you could check the <head> tag for Croissant metadata. - for the Dataverse installations where Croissant is enabled... do whatever health/crawler/spiders/huggingface.py or health/crawler/spiders/openml.py does to add those systems to Croissant Online Health. 😄 — Reply to this email directly, view it on GitHub <#530 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABMV3YR7XBRAFYPECJVFQ23ZSY3UZAVCNFSM6AAAAABDMLSV6GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBVGQ3TSNRRGE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

goeffthomas · 2024-08-22T19:44:58Z

Nice work @pdurbin! BTW, as a little prework on the Kaggle <> Dataverse integration we've been looking at, I wrote a little notebook to figure out what version every installation is running: https://www.kaggle.com/code/goefft/check-dataverse-installation-versions

I just made it public in case it's of value to anyone who's looking into what you've suggested above. Basically, you can use this to target installations that are running >= 6.3 and they should have the exporter, right?

pdurbin · 2024-08-22T20:25:42Z

@benjelloun me too!

@goeffthomas nice notebook! You're right, I forgot to mention versions. Definitely any version lower than 6.0 should be excluded because the external exporter mechanism I'm using is not supported.

Dataverse 6.2 and higher puts the Croissant metadata in <head> (IQSS/dataverse#10382). But 6.0 and 6.1 both support the exporter. That is, URLs like https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP should work even on 6.0 and 6.1 if the croissant exporter is enabled.

marcenacp added this to mlcroissant Feb 19, 2024

pdurbin mentioned this issue Feb 22, 2024

bump example jar from 1.0.0 to 2.0.0 and update README gdcc/dataverse-exporters#2

Merged

pdurbin mentioned this issue Feb 26, 2024

Croissant support 🥐 IQSS/dataverse#10341

Closed

pdurbin mentioned this issue Mar 15, 2024

1.0 as a string should be a valid version for a dataset #609

Open

cmbz mentioned this issue Mar 27, 2024

Project: Kaggle (Croissant) IQSS/dataverse-pm#163

Open

12 tasks

This was referenced May 6, 2024

Add Croissant to Signposting "describedby" output IQSS/dataverse#10542

Open

health: scrapydweb fails to launch, seems to require newer version #647

Open

pdurbin mentioned this issue Nov 22, 2024

expose links to all export formats via Signposting IQSS/dataverse#11045

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Dataverse to Croissant 🥐 Online Health #530

Adding Dataverse to Croissant 🥐 Online Health #530

pdurbin commented Feb 16, 2024

marcenacp commented Feb 19, 2024

pdurbin commented Feb 22, 2024

benjelloun commented Feb 23, 2024 via email

pdurbin commented Feb 23, 2024

qqmyers commented Feb 24, 2024

pdurbin commented Feb 26, 2024

benjelloun commented Feb 26, 2024 via email

benjelloun commented Feb 26, 2024 via email

4tikhonov commented Feb 26, 2024

pdurbin commented Mar 20, 2024 •

edited

Loading

pdurbin commented May 6, 2024

pdurbin commented May 7, 2024

marcenacp commented May 14, 2024

pdurbin commented Aug 22, 2024

benjelloun commented Aug 22, 2024 via email

goeffthomas commented Aug 22, 2024

pdurbin commented Aug 22, 2024

Adding Dataverse to Croissant 🥐 Online Health #530

Adding Dataverse to Croissant 🥐 Online Health #530

Comments

pdurbin commented Feb 16, 2024

marcenacp commented Feb 19, 2024

pdurbin commented Feb 22, 2024

benjelloun commented Feb 23, 2024 via email

pdurbin commented Feb 23, 2024

qqmyers commented Feb 24, 2024

pdurbin commented Feb 26, 2024

benjelloun commented Feb 26, 2024 via email

benjelloun commented Feb 26, 2024 via email

4tikhonov commented Feb 26, 2024

pdurbin commented Mar 20, 2024 • edited Loading

pdurbin commented May 6, 2024

pdurbin commented May 7, 2024

marcenacp commented May 14, 2024

pdurbin commented Aug 22, 2024

benjelloun commented Aug 22, 2024 via email

goeffthomas commented Aug 22, 2024

pdurbin commented Aug 22, 2024

pdurbin commented Mar 20, 2024 •

edited

Loading