
Adding Dataverse to Croissant 🥐 Online Health #530

Open · Tracked by #163
pdurbin opened this issue Feb 16, 2024 · 17 comments

Comments

@pdurbin
Member

pdurbin commented Feb 16, 2024

Hi! I'm interested in some details about how to add Dataverse to Croissant 🥐 Online Health.

In health/crawler/spiders/huggingface.py I'm seeing an example for Hugging Face like this:

https://datasets-server.huggingface.co/croissant?dataset=mnist

Is this roughly what you need from us? Can the URL be different? A URL like the following would fit better into our existing pattern:

https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/TJCLKP
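
For illustration, here's a minimal sketch of fetching that export with a plain GET, using the same example DOI; it assumes the endpoint returns the JSON-LD body directly:

# Minimal sketch: fetch the Schema.org JSON-LD export for one dataset.
import requests

BASE = "https://dataverse.harvard.edu"
PID = "doi:10.7910/DVN/TJCLKP"  # the example dataset above

resp = requests.get(
    f"{BASE}/api/datasets/export",
    params={"exporter": "schema.org", "persistentId": PID},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("name"))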

Thanks!

@marcenacp
Contributor

Hi @pdurbin, thanks for creating this issue! The best way to crawl Hugging Face was to list all datasets and then visit each individual URL using their API.

https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%3A10.7910/DVN/TJCLKP is perfect. How can I derive this URL for all other datasets on Dataverse? Do you have an API? Or should we crawl a website?

@pdurbin
Member Author

pdurbin commented Feb 22, 2024

@marcenacp hi! We list all the datasets in a sitemap, so https://dataverse.harvard.edu/sitemap.xml contains this, for example:

<loc>https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP</loc>

Would that be enough? From there you could extract the DOI/Persistent ID and put it into a (future) URL to download metadata in Croissant format. (I used exporter=schema.org in the example above, but we would create a new exporter for Croissant.)
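
To make that flow concrete, here's a rough sketch; it assumes sitemap.xml is a flat urlset, and the croissant exporter name is a placeholder for the future exporter mentioned above:

# Rough sketch: pull persistent IDs out of the sitemap and build export URLs.
import urllib.parse
import xml.etree.ElementTree as ET

import requests

SITEMAP_URL = "https://dataverse.harvard.edu/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP_URL, timeout=60).content)
for loc in root.findall(".//sm:loc", NS):
    query = urllib.parse.parse_qs(urllib.parse.urlparse(loc.text).query)
    pid = query.get("persistentId", [None])[0]
    if pid is None:
        continue  # not a dataset landing page
    export_url = (
        "https://dataverse.harvard.edu/api/datasets/export"
        "?exporter=croissant&persistentId=" + urllib.parse.quote(pid, safe=":/")
    )
    print(export_url)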

@benjelloun
Contributor

benjelloun commented Feb 23, 2024 via email

@pdurbin
Member Author

pdurbin commented Feb 23, 2024

@benjelloun thanks. Right, the schema.org metadata inside the <head> of Dataverse web pages (what we'd call "dataset landing pages") was added to Dataverse in 2017 to support Google Dataset Search, which was new at the time.

[Screenshot: Schema.org JSON-LD metadata in the <head> of a Dataverse dataset landing page]

I've been thinking that I'd leave that format alone (which we call schema.org internally and "Schema.org JSON-LD" in our web interface) because I wouldn't want to break anything related to Google Dataset Search.

However, I know the Croissant team is collaborating with the Google Dataset Search team. If you're saying it's safe to modify the older format (that is, to extend it as you describe) without breaking Google Dataset Search, that's great.

From a product perspective, I'm wondering if it would make sense to rebrand this format within Dataverse. As shown in the screenshot below, we call the older format "Schema.org JSON-LD" but perhaps we should simply change the label to "Croissant" if we proceed with a single, updated, extended format.

Alternatively, we could add Croissant as a new format (with an internal name of croissant, probably), swap it into the <head> (I assume we wouldn't want both formats in the <head>!), and still offer the older format to download via API or a button click.

We'll probably know more once we start hacking and can confirm that Croissant is truly an extension, with no backward-incompatible changes relative to the older format.

[Screenshot: the Dataverse metadata export menu, where the older format is labeled "Schema.org JSON-LD"]

@qqmyers

qqmyers commented Feb 24, 2024

Perhaps Signposting would be a way to expose the URL for the desired format in a standardized way. Dataverse already supports Signposting, and we could, in principle, add new metadata formats to what is listed among the level 2 metadata formats.

@pdurbin
Member Author

pdurbin commented Feb 26, 2024

@qqmyers hmm, good point. Signposting does provide a machine-readable way (HTTP headers) to retrieve URLs with more information.

@benjelloun @marcenacp are you familiar with Signposting? What do you think?

To use my dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP as an example, the "link" header looks like this:

link: <0000-0002-9528-9470>; rel="author",
      <https://doi.org/10.7910/DVN/TJCLKP>; rel="cite-as",
      <https://dataverse.harvard.edu/api/access/datafile/6867328>; rel="item"; type="application/zip",
      <https://dataverse.harvard.edu/api/access/datafile/6867331>; rel="item"; type="text/tab-separated-values",
      <https://dataverse.harvard.edu/api/access/datafile/6867336>; rel="item"; type="text/x-python-script",
      <https://doi.org/10.7910/DVN/TJCLKP>; rel="describedby"; type="application/vnd.citationstyles.csl+json",
      <https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP>; rel="describedby"; type="application/ld+json",
      <https://schema.org/AboutPage>; rel="type",
      <https://schema.org/Dataset>; rel="type",
      <http://creativecommons.org/publicdomain/zero/1.0>; rel="license",
      <https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/linkset?persistentId=doi:10.7910/DVN/TJCLKP>; rel="linkset"; type="application/linkset+json"
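
For what it's worth, a small sketch of how a crawler might pick the JSON-LD "describedby" entry out of that header, using the RFC 8288 link parser bundled with requests:

# Sketch: pull the JSON-LD "describedby" URL out of the Signposting Link header.
import requests
from requests.utils import parse_header_links

LANDING_PAGE = (
    "https://dataverse.harvard.edu/dataset.xhtml"
    "?persistentId=doi:10.7910/DVN/TJCLKP"
)

resp = requests.head(LANDING_PAGE, timeout=30)
links = parse_header_links(resp.headers.get("Link", ""))
jsonld = [
    link["url"]
    for link in links
    if link.get("rel") == "describedby" and link.get("type") == "application/ld+json"
]
print(jsonld)  # expected: the /api/datasets/export?exporter=schema.org&... URL above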

@benjelloun
Contributor

benjelloun commented Feb 26, 2024 via email

@benjelloun
Contributor

benjelloun commented Feb 26, 2024 via email

@4tikhonov

Hi @benjelloun, Signposting is the brainchild of Herbert van de Sompel; you can find the full spec here: https://signposting.org/FAIR/
We at DANS contributed the initial implementation in Dataverse; it was later extended and leveraged by Harvard and GDCC. I can bring Herbert here if you're interested. :)

@pdurbin
Member Author

pdurbin commented Mar 20, 2024

@benjelloun @goeffthomas et al. thanks for the opportunity today to present to the Croissant Task Force the progress @4tikhonov and I have made toward supporting Croissant in Dataverse. Here are the talking points I was reading from as well as Slava's slides. (We also shared them with the Dataverse Google Group.)

As you suggested, I'll continue creating GitHub issues. And I'll come when I can to task force meetings. Thanks again!

@pdurbin
Member Author

pdurbin commented May 6, 2024

@marcenacp you seem to have written most of the crawler. I hit a weird error when I got to the scrapydweb step: err.txt. Any advice? Thanks!

Oh, I did uncomment this early return because I don't want to crawl all of Hugging Face, if that matters:

# Uncomment this early return for debugging purposes:
return [
    "lkarjun/Malayalam-Artiicles",
    # ...
]

(I think the README might need to be updated, by the way. start_requests was removed from huggingface.py in 5264dcf.)

@pdurbin
Member Author

pdurbin commented May 7, 2024

I opened a dedicated issue about scrapydweb: #647.

@marcenacp
Contributor

@pdurbin, just saw this comment. I answered on #647. Thanks!

@pdurbin
Member Author

pdurbin commented Aug 22, 2024

Good news! We just enabled Croissant on Harvard Dataverse. I wrote up a little announcement.

From my perspective, this issue (#530) is now unblocked, in that somebody could probably fork this repo and start trying to add a dataverse.py script to health/crawler/spiders. But who should do that? The last time I tried, I ran into trouble (@marcenacp, thanks for replying on #647).

The main thing to keep in mind is that there are 120+ installations of Dataverse out there, so dataverse.py should probably iterate over all of those installations, taking each one's Dataverse version into account.

@benjelloun
Contributor

benjelloun commented Aug 22, 2024 via email

@goeffthomas
Contributor

Nice work @pdurbin! BTW, as a little prework on the Kaggle <> Dataverse integration we've been looking at, I wrote a little notebook to figure out what version every installation is running: https://www.kaggle.com/code/goefft/check-dataverse-installation-versions

I just made it public in case it's of value to anyone who's looking into what you've suggested above. Basically, you can use this to target installations that are running >= 6.3 and they should have the exporter, right?

@pdurbin
Member Author

pdurbin commented Aug 22, 2024

@benjelloun me too!

@goeffthomas nice notebook! You're right, I forgot to mention versions. Any version lower than 6.0 should definitely be excluded, because the external exporter mechanism I'm using isn't supported there.

Dataverse 6.2 and higher puts the Croissant metadata in <head> (IQSS/dataverse#10382). But 6.0 and 6.1 both support the exporter. That is, URLs like https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP should work even on 6.0 and 6.1 if the croissant exporter is enabled.
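
To make the version logic concrete, here's a hedged sketch of the per-installation check a future dataverse.py might do; it assumes the native API's /api/info/version endpoint and simply probes whether the croissant exporter answers for one known dataset:

# Sketch: skip installations below 6.0, then probe whether the croissant
# exporter is actually enabled by asking for one known dataset.
import requests

def version_tuple(v: str) -> tuple:
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def supports_croissant(base_url: str, pid: str) -> bool:
    info = requests.get(f"{base_url}/api/info/version", timeout=30).json()
    if version_tuple(info["data"]["version"]) < (6, 0):
        return False  # external exporters only exist from 6.0 on
    probe = requests.get(
        f"{base_url}/api/datasets/export",
        params={"exporter": "croissant", "persistentId": pid},
        timeout=30,
    )
    return probe.ok

print(supports_croissant("https://dataverse.harvard.edu", "doi:10.7910/DVN/TJCLKP"))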
