Adding Dataverse to Croissant 🥐 Online Health #530
Comments
Hi @pdurbin, thanks for creating this issue! The best way to crawl Hugging Face was to list all datasets and then visit each individual URL using their API. https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/TJCLKP is perfect. How can I derive this URL for all other datasets on Dataverse? Do you have an API? Or should we crawl a website? |
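@marcenacp hi! We list all the datasets in a sitemap, so https://dataverse.harvard.edu/sitemap.xml contains this, for example:
<loc>https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP</loc>
Would that be enough? From there you could extract the DOI/Persistent ID and put it into a (future) URL to download metadata in Croissant format. (I used exporter=schema.org in the example above but we would create a new exporter for Croissant.) |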
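A minimal sketch of the sitemap approach described above, assuming the sitemap follows the standard sitemaps.org layout and that dataset entries carry a `persistentId` query parameter as in the example. The function names `persistent_ids` and `export_url` are made up for illustration and are not part of Dataverse or the crawler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode, urlparse


def persistent_ids(sitemap_xml):
    """Return the persistentId of every <loc> entry that has one."""
    root = ET.fromstring(sitemap_xml)
    pids = []
    for element in root.iter():
        # Match <loc> with or without the sitemaps.org namespace prefix.
        if element.tag != "loc" and not element.tag.endswith("}loc"):
            continue
        url = (element.text or "").strip()
        query = parse_qs(urlparse(url).query)
        if "persistentId" in query:
            pids.append(query["persistentId"][0])
    return pids


def export_url(base_url, pid, exporter="schema.org"):
    """Build the metadata export URL for one dataset.

    The exporter name for Croissant is an assumption until one exists;
    "schema.org" is the exporter used in the example URL in this thread.
    """
    query = urlencode({"exporter": exporter, "persistentId": pid}, safe="/")
    return f"{base_url.rstrip('/')}/api/datasets/export?{query}"
```

Collection pages in the sitemap have no `persistentId`, so they are skipped rather than treated as datasets.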
Hi Phil,
It would be great to also embed the Croissant metadata inside the dataset
Web pages (as an extension of the schema.org metadata you already have) so
that it's crawlable by Search engines, and provide a download link / button
for users.
@marcenacp For Croissant Health we should
ideally favor the Croissant metadata directly present in Web pages rather
than the one available via an API call. As we realized with HuggingFace,
there can be some slight discrepancies between them.
Best,
Omar
On Thu, Feb 22, 2024 at 9:00 PM Philip Durbin wrote:
> @marcenacp <https://github.com/marcenacp> hi! We list all the datasets in
> a sitemap, so https://dataverse.harvard.edu/sitemap.xml contains this,
> for example:
>
> <loc>https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP</loc>
>
> Would that be enough? From there you could extract the DOI/Persistent ID
> and put into a (future) URL to download metadata in Croissant format. (I
> used exporter=schema.org in the example above but we would create a new
> exporter for Croissant.)
|
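Reading the metadata embedded in a dataset page, as preferred above, could look roughly like the sketch below. It assumes the page carries its metadata in a `<script type="application/ld+json">` element and that a Croissant document declares itself through a `conformsTo` value pointing at mlcommons.org/croissant. The names `JsonLdCollector` and `embedded_croissant` are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every JSON-LD <script> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._chunks))


def embedded_croissant(html):
    """Return the first embedded JSON-LD object that looks like Croissant."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore malformed blocks instead of failing the page
        # Assumption: Croissant documents carry a conformsTo marker.
        if isinstance(doc, dict) and "mlcommons.org/croissant" in str(
            doc.get("conformsTo", "")
        ):
            return doc
    return None
```

A page that only carries plain schema.org/Dataset JSON-LD returns `None`, which is one way to tell the older format from the extended one.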
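@benjelloun thanks. Right, the schema.org metadata inside the <head> of Dataverse web pages (what we'd call "dataset landing pages") was added to Dataverse in 2017 (IQSS/dataverse#4252) to support Google Dataset Search, which was new at the time. I've been thinking that I'd leave that format alone (which we call schema.org internally and "Schema.org JSON-LD" in our web interface) because I wouldn't want to break anything related to Google Dataset Search. However, I know the Croissant team is collaborating with the Google Dataset Search team. If you're saying it's safe to modify the older format, extend it as you say, without breaking Google Dataset Search, that's great. From a product perspective, I'm wondering if it would make sense to rebrand this format within Dataverse. As shown in the screenshot below, we call the older format "Schema.org JSON-LD" but perhaps we should simply change the label to "Croissant" if we proceed with a single, updated, extended format. Alternatively, we could add Croissant as a new format (with an internal name of croissant, probably), swap it into the <head> (I assume we wouldn't want both formats in the <head>!), and still offer the older format to download via API or a button click. We'll probably know more once we start hacking and see that Croissant is truly an extension, with no backward-incompatible changes from the older format. |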
Perhaps Signposting would be a way to expose the URL for the desired format in a standardized way. Dataverse already supports signposting and we could nominally add new metadata formats to what is listed for level 2 metadata formats. |
Hi Phil,
Please see replies inline.
On Fri, Feb 23, 2024 at 5:50 PM Philip Durbin wrote:
> @benjelloun <https://github.com/benjelloun> thanks. Right, the schema.org
> metadata inside the <head> of Dataverse web pages (what we'd call
> "dataset landing pages") was added
> <IQSS/dataverse#4252> to Dataverse in 2017 to
> support Google Dataset Search, which was new at the time.
>
> Screenshot.2024-02-23.at.11.41.47.AM.png (view on web)
> <https://github.com/mlcommons/croissant/assets/21006/d483b7c4-02a9-4ea4-b3d2-2c417a35a96b>
>
> I've been thinking that I'd leave that format alone (which we call
> schema.org internally and "Schema.org JSON-LD" in our web interface)
> because I wouldn't want to break anything related to Google Dataset Search.
> However, I know the Croissant team is collaborating with the Google
> Dataset Search team. If you're saying it's safe to modify the older format,
> extend it as you say, without breaking Google Dataset Search, that's great.

I am also a member of the Google Dataset Search team :)
I would definitely encourage you to modify the older format. That should
not break Dataset Search. In case there are any unforeseen issues, we will
fix them with high priority.

> From a product perspective, I'm wondering if it would make sense to
> rebrand this format within Dataverse. As shown in the screenshot below, we
> call the older format "Schema.org JSON-LD" but perhaps we should simply
> change the label to "Croissant" if we proceed with a single, updated,
> extended format.

I think that would be a fine change. I don't think "schema.org JSON-LD" as
an export format is very useful, because that format is primarily targeted
at Search Engines, while Croissant is meant to be used as a working format
for ML datasets.

> Alternatively, we could add Croissant as a new format (with an internal
> name of croissant, probably), swap it into the <head> (I assume we
> wouldn't want both formats in the <head>!), and still offer the older
> format to download via API or a button click.

You definitely want the Croissant description (which extends the
schema.org/Dataset one) to be embedded in the dataset page, so that it can
be picked up by Search engines like Dataset Search.

> We'll probably know more once we start hacking and see that Croissant is
> truly an extension, with no backward-incompatible changes from the older
> format.

Sounds good! Please reach out if you have any questions.
This makes me think that we should probably write a short migration guide
for dataset authors / platforms that already support schema.org/Dataset and
would like to upgrade to Croissant.
Best,
Omar

> Screenshot.2024-02-23.at.11.35.24.AM.png (view on web)
> <https://github.com/mlcommons/croissant/assets/21006/04c57d48-4364-45e9-b3f6-1abaea0d8fe2>
|
Hi @benjelloun, Signposting is a brainchild of Herbert van de Sompel; you can find the full spec here: https://signposting.org/FAIR/ |
@benjelloun @goeffthomas et al. thanks for the opportunity today to present to the Croissant Task Force the progress @4tikhonov and I have made toward supporting Croissant in Dataverse. Here are the talking points I was reading from as well as Slava's slides. (We also shared them with the Dataverse Google Group.) As you suggested, I'll continue creating GitHub issues. And I'll come when I can to task force meetings. Thanks again! |
@marcenacp you seem to have written most of the crawler. I got a weird error when I got to the Oh, I did uncomment this early return because I don't want to crawl all of Hugging Face, if that matters:
(I think the README might need to be updated, by the way. |
I opened a dedicated issue about |
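Good news! We just enabled Croissant on Harvard Dataverse (IQSS/dataverse.harvard.edu#294). I wrote up a little announcement: https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/dLmV7HTcAgAJ
From my perspective, this issue (#530) is now unblocked in that somebody could probably fork this repo and start trying to add a dataverse.py script to health/crawler/spiders. But who should do that? Last I tried I had trouble (@marcenacp thanks for replying on #647).
The main thing to keep in mind is that there are 120+ installations of Dataverse out there. dataverse.py should probably:
- check the list of Dataverse installations in JSON (https://iqss.github.io/dataverse-installations/data/data.json)
- for each installation
  - check the sitemap (e.g. https://dataverse.harvard.edu/sitemap.xml if it uses a single sitemap or /sitemap/sitemap_index.xml if it uses multiple sitemaps)
  - pick the first dataset and try to download the croissant export format (e.g. https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP) to know if Croissant is enabled or not. Alternatively, you could check the <head> tag for Croissant metadata.
- for the Dataverse installations where Croissant is enabled... do whatever health/crawler/spiders/huggingface.py or health/crawler/spiders/openml.py does to add those systems to Croissant Online Health. 😄
|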
Great news! Thank you Phil for pushing this through. I'm very excited about
the wealth of datasets that will become available in Croissant thanks to
Dataverse.
Best,
Omar
On Thu, Aug 22, 2024 at 3:28 PM Philip Durbin wrote:
> Good news! We just enabled
> <IQSS/dataverse.harvard.edu#294> Croissant on
> Harvard Dataverse. I wrote up a little announcement
> <https://groups.google.com/g/dataverse-community/c/JI8HPgGarr8/m/dLmV7HTcAgAJ>.
>
> From my perspective, this issue (#530) is now unblocked in
> that somebody could probably fork this repo and start trying to add a
> dataverse.py script to health/crawler/spiders
> <https://github.com/mlcommons/croissant/tree/v1.0.7/health/crawler/spiders>.
> But who should do that? Last I tried I had trouble (@marcenacp
> <https://github.com/marcenacp> thanks for replying on #647).
>
> The main thing to keep in mind is that there are 120+ installations
> <https://dataverse.org/installations> of Dataverse out there. dataverse.py
> should probably:
>
> - check the list of Dataverse installations in JSON
>   <https://iqss.github.io/dataverse-installations/data/data.json>
> - for each installation
>   - check the sitemap (e.g. https://dataverse.harvard.edu/sitemap.xml
>     if it uses a single sitemap
>     <https://guides.dataverse.org/en/6.3/installation/config.html#single-sitemap-file>
>     or /sitemap/sitemap_index.xml if it uses multiple sitemaps
>     <https://guides.dataverse.org/en/6.3/installation/config.html#multiple-sitemap-files-sitemap-index-file>)
>   - pick the first dataset and try to download the croissant export
>     format (e.g.
>     https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP
>     ) to know if Croissant is enabled or not. Alternatively, you could check
>     the <head> tag for Croissant metadata.
> - for the Dataverse installations where Croissant is enabled... do
>   whatever health/crawler/spiders/huggingface.py or
>   health/crawler/spiders/openml.py does to add those systems to Croissant
>   Online Health. 😄
|
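The dataverse.py steps listed above could be sketched along these lines. This is not the crawler's actual code. The shape of the installations JSON (`installations` entries with a `hostname`) is an assumption that should be checked against the file, the "has an `@context` key" test is only a heuristic for "this looks like JSON-LD", and all function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode


def installation_base_urls(data):
    """Base URLs from the installations JSON.

    Assumption: the file looks like {"installations": [{"hostname": ...}]}.
    """
    urls = []
    for entry in data.get("installations", []):
        hostname = entry.get("hostname")
        if hostname:
            urls.append(f"https://{hostname}")
    return urls


def sitemap_candidates(base_url):
    """Single-sitemap location first, then the sitemap index location."""
    base = base_url.rstrip("/")
    return [f"{base}/sitemap.xml", f"{base}/sitemap/sitemap_index.xml"]


def croissant_export_url(base_url, pid):
    """Export URL for one dataset using the croissant exporter."""
    query = urlencode({"exporter": "croissant", "persistentId": pid}, safe="/")
    return f"{base_url.rstrip('/')}/api/datasets/export?{query}"


def http_get(url, timeout=30):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def croissant_enabled(base_url, pid, fetch=http_get):
    """Probe one dataset to guess whether the installation exports Croissant."""
    try:
        doc = json.loads(fetch(croissant_export_url(base_url, pid)))
    except (OSError, ValueError):
        # Network/HTTP errors and non-JSON bodies both count as "not enabled".
        return False
    return isinstance(doc, dict) and "@context" in doc
```

Passing `fetch` as a parameter keeps the probe testable offline; a spider would plug in its own request mechanism instead.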
Nice work @pdurbin! BTW, as a little prework on the Kaggle <> Dataverse integration we've been looking at, I wrote a little notebook to figure out what version every installation is running: https://www.kaggle.com/code/goefft/check-dataverse-installation-versions I just made it public in case it's of value to anyone who's looking into what you've suggested above. Basically, you can use this to target installations that are running >= 6.3 and they should have the exporter, right? |
@benjelloun me too! @goeffthomas nice notebook! You're right, I forgot to mention versions. Definitely any version lower than 6.0 should be excluded because the external exporter mechanism I'm using is not supported. Dataverse 6.2 and higher puts the Croissant metadata in |
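The version filter discussed here might be sketched as follows. How the version string is obtained is left out (the notebook linked above covers that), and `parse_version` and `may_have_croissant` are illustrative names only. The default minimum of 6.0 reflects the "lower than 6.0 should be excluded" remark; a version at or above the minimum is necessary but not sufficient, since the exporter still has to be installed.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from a string such as "6.3" or "v5.13"."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if not match:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def may_have_croissant(version_text, minimum=(6, 0, 0)):
    """True if the reported version is at least the given minimum."""
    version = parse_version(version_text)
    return version is not None and version >= minimum
```

Raising `minimum` to `(6, 3, 0)` would give the stricter filter suggested in the previous comment.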
Hi! I'm interested in some details about how to add Dataverse to Croissant 🥐 Online Health.
In health/crawler/spiders/huggingface.py I'm seeing an example for Hugging Face like this:
https://datasets-server.huggingface.co/croissant?dataset=mnist
Is this roughly what you need from us? Can the URL be different? A URL like the following would fit better into our existing pattern:
https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/TJCLKP
Thanks!