Add Croissant to Signposting "describedby" output #10542

Open
pdurbin opened this issue May 6, 2024 · 2 comments · May be fixed by #11045
Labels
FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) Size: 10 A percentage of a sprint. 7 hours.

Comments

@pdurbin
Member

pdurbin commented May 6, 2024

Today @siacus and I were talking about how dataset landing pages can become heavy when the machine-readable JSON we put in the <head> (Schema.org JSON-LD or Croissant) gets large. In a real-life dataset with 25K files, the Croissant file can be 7.1 MB.

We talked about putting a link to the Croissant file in our Signposting output, like we do for Schema.org JSON-LD. Basically, robots could request just the headers (e.g. with curl --head) and receive a link to the Croissant file, rather than the entire payload, which can be large.

Unfortunately, people suffering from heavy dataset pages won't get relief until the large content is removed from the <head> of the page, but putting the link in Signposting gives machines an option for the future if the world wants to move in that direction. We already suggested Signposting to the Croissant/Google Dataset Search team at mlcommons/croissant#530 (comment)

In our Signposting output, we already include a link for downloading Schema.org JSON-LD data via API. For example:

<https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP>;rel="describedby"
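To illustrate what a robot could do with headers alone, here is a minimal sketch in Java that pulls the "describedby" targets out of a Link header value. It is not Dataverse code: the class name, the regular expressions, and the extra "cite-as" entry in the sample header are assumptions made for illustration, and the parsing is deliberately simplified (it does not handle every corner of the Link header grammar). Fetching the header itself, for example with a HEAD request, is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Dataverse.
final class DescribedByLinks {
    // One link-value: <target> followed by zero or more ;name=value parameters.
    private static final Pattern LINK = Pattern.compile(
            "<([^>]*)>((?:\\s*;\\s*[\\w-]+\\s*=\\s*(?:\"[^\"]*\"|[^;,\\s]+))*)");
    // The rel parameter, quoted or unquoted.
    private static final Pattern REL = Pattern.compile(
            ";\\s*rel\\s*=\\s*(?:\"([^\"]*)\"|([^;,\\s]+))", Pattern.CASE_INSENSITIVE);

    // Returns the target URL of every link whose rel includes "describedby".
    static List<String> describedBy(String linkHeader) {
        List<String> targets = new ArrayList<>();
        Matcher link = LINK.matcher(linkHeader);
        while (link.find()) {
            Matcher rel = REL.matcher(link.group(2));
            if (!rel.find()) {
                continue; // no rel parameter on this link
            }
            String value = rel.group(1) != null ? rel.group(1) : rel.group(2);
            // rel may hold several space-separated relation types.
            for (String relType : value.trim().split("\\s+")) {
                if (relType.equalsIgnoreCase("describedby")) {
                    targets.add(link.group(1));
                    break;
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Sample header: the first entry is made up, the second is the example above.
        String header = "<https://example.org/landing>;rel=\"cite-as\", "
                + "<https://dataverse.harvard.edu/api/datasets/export"
                + "?exporter=schema.org&persistentId=doi:10.7910/DVN/TJCLKP>;rel=\"describedby\"";
        System.out.println(describedBy(header));
    }
}
```

A client that only wants the Croissant or Schema.org export can follow one of the returned URLs and never download the landing page.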

The Signposting spec seems to allow multiple "describedby" values, but if we prefer to keep a single "describedby" value, we could consider swapping out schema.org for croissant when it's available, like we do for the <head> tag:

I don't think this is a lot of work. A 3 is probably enough but I'll give it a 10 for reviewing the Signposting spec and talking to that community, if need be, about multiple "describedby" values. The file to edit is SignpostingResources.java as seen in PR #8981.

See also this issue we opened with the Croissant team where we asked for guidance on large Croissant files:

Related issues and PRs:

@pdurbin pdurbin added the Size: 10 A percentage of a sprint. 7 hours. label May 6, 2024
@qqmyers
Member

qqmyers commented May 6, 2024

FWIW: I think Signposting uses multiple "describedby" links, since you add the type attribute to specify the format of each one. We originally didn't put all of our exports in it because the draft/spec said something about only common formats, but after subsequent discussions I don't think there would be any concern if we automatically added every installed export to the list.
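As a rough sketch of that idea (one "describedby" link per export, each with its own type attribute), something like the following could build the header value. This is not the code in SignpostingResources.java. The class and method names are hypothetical, and the exporter name "croissant" and the "application/ld+json" media types are assumptions. Only the export URL pattern is taken from the example in this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch, not the actual Dataverse implementation.
final class DescribedByHeaderSketch {

    // Builds one describedby link per exporter, each carrying a type attribute.
    static String describedByLinks(String exportApiUrl, String persistentId,
                                   Map<String, String> mediaTypeByExporter) {
        StringJoiner links = new StringJoiner(", ");
        for (Map.Entry<String, String> entry : mediaTypeByExporter.entrySet()) {
            links.add("<" + exportApiUrl
                    + "?exporter=" + entry.getKey()
                    + "&persistentId=" + persistentId
                    + ">;rel=\"describedby\";type=\"" + entry.getValue() + "\"");
        }
        return links.toString();
    }

    public static void main(String[] args) {
        // Exporter names and media types below are assumptions for illustration.
        Map<String, String> exporters = new LinkedHashMap<>();
        exporters.put("schema.org", "application/ld+json");
        exporters.put("croissant", "application/ld+json");

        System.out.println(describedByLinks(
                "https://dataverse.harvard.edu/api/datasets/export",
                "doi:10.7910/DVN/TJCLKP",
                exporters));
    }
}
```

Because both exports in this sketch share a media type, a client could not tell them apart by type alone, which may be worth raising with the Signposting community.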

@cmbz

cmbz commented Aug 1, 2024

2024/08/01

  • Prioritized at request of @siacus

@pdurbin pdurbin moved this from This Sprint 🏃‍♀️ 🏃 to In Progress 💻 in IQSS Dataverse Project Nov 7, 2024
@pdurbin pdurbin self-assigned this Nov 7, 2024
@cmbz cmbz added the FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) label Nov 7, 2024
@cmbz cmbz added the FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) label Nov 21, 2024
@pdurbin pdurbin linked a pull request Nov 22, 2024 that will close this issue
@pdurbin pdurbin moved this from In Progress 💻 to Done 🧹 in IQSS Dataverse Project Nov 22, 2024
@pdurbin pdurbin removed their assignment Nov 22, 2024
pdurbin added a commit that referenced this issue Nov 25, 2024
…0542

The test file is used in InfoIT#testGetExportFormats
pdurbin added a commit that referenced this issue Nov 25, 2024
pdurbin added a commit that referenced this issue Nov 25, 2024
Before this PR...

In development:

Expected: is "http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/6A3292"
  Actual: is "http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/6A3292"

On Jenkins

Expected: is "http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/6A3292"
  Actual: http://ec2-3-225-221-142.compute-1.amazonaws.com/dataset.xhtml?persistentId=doi:10.5072/FK2/6A3292

So we'll change to just "endsWith" since we aren't actually testing the baseurl,
just the datasetPid which we fixed up in ca93d60.
Projects
Status: Done 🧹

3 participants