Add JSON-LD extraction from HTML #2804

wallberg · 2024-06-20T16:33:13Z

Implementation of issue #2692.

See also https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .

Summary of changes

If source.content_type is "text/html" or "application/xhtml+xml" then parse the document as HTML and extract script elements of type="application/ld+json" as JSON-LD.

The default behavior is to extract only the first matching script element. These overrides are available:

To extract all script elements: supply an optional extract_all_scripts=True parameter to JsonLDParser.parse()
To extract one script element with a specific id attribute value: add the id value as a fragment identifier in the IRI available from source.getSystemId()
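
The three selection behaviours above can be sketched with the standard library's html.parser. This is an illustrative sketch only, not the code in this PR: the names ScriptCollector and select_scripts are hypothetical, error handling is omitted, and giving the fragment identifier precedence over extract_all_scripts is an assumption.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []  # list of (id attribute or None, script text)
        self._current_id = None
        self._in_jsonld = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._current_id = attrs.get("id")
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.scripts.append((self._current_id, "".join(self._chunks)))
            self._in_jsonld = False


def select_scripts(html, fragment_id=None, extract_all_scripts=False):
    """Hypothetical helper mirroring the default and the two overrides."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    if fragment_id is not None:
        # Only the script element whose id matches the fragment identifier.
        texts = [text for sid, text in collector.scripts if sid == fragment_id][:1]
    elif extract_all_scripts:
        texts = [text for _, text in collector.scripts]
    else:
        # Default: the first matching script element only.
        texts = [text for _, text in collector.scripts][:1]
    return [json.loads(text) for text in texts]
```

With a page holding two JSON-LD script elements, select_scripts(html) would return only the first document, while extract_all_scripts=True would return both.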

Detailed changes

rdflib.plugins.parsers.jsonld.JsonLDParser.parse

add docstring
change parameter list from **kwargs to explicit list
add optional extract_all_scripts parameter
get the fragment identifier from source.getSystemId()
add fragment_id and extract_all_scripts parameters to the call to source_to_json
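
One way to split the fragment identifier off an IRI such as the one returned by source.getSystemId() is urllib.parse.urldefrag. This is a minimal sketch, not necessarily how the PR does it; split_fragment is a hypothetical name, and treating an empty fragment as None is an assumption.

```python
from urllib.parse import urldefrag


def split_fragment(system_id):
    """Return (IRI without fragment, fragment identifier or None)."""
    if system_id is None:
        return None, None
    url, fragment = urldefrag(system_id)
    # urldefrag returns an empty string when there is no fragment.
    return url, (fragment or None)
```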

rdflib.plugins.shared.jsonld.util.source_to_json

add docstring
add optional fragment_id and extract_all_scripts parameters
change the return value to a tuple with the extracted JSON document and value of the HTML base element
if source.content_type is "text/html" or "application/xhtml+xml" then parse source as HTML and extract the appropriate script element(s) and the HTML base element
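
The content-type check and the lookup of the HTML base element can be sketched as follows. This is an assumption-laden illustration using only the standard library: BaseFinder and find_html_base are hypothetical names, and taking the href of the first base element is a simplification.

```python
from html.parser import HTMLParser

HTML_CONTENT_TYPES = ("text/html", "application/xhtml+xml")


class BaseFinder(HTMLParser):
    """Remember the href of the first <base> element encountered."""

    def __init__(self):
        super().__init__()
        self.base = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base is None:
            self.base = dict(attrs).get("href")


def find_html_base(content_type, text):
    """Return the base href for HTML content, or None otherwise."""
    if content_type not in HTML_CONTENT_TYPES:
        return None
    finder = BaseFinder()
    finder.feed(text)
    finder.close()
    return finder.base
```

Given the tuple return value described above, a caller would presumably unpack the result as (json document, base).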

test/jsonld/test_onedotone.py

enable all existing html tests (except html/f004-in). (Note: for more information on the failing html/f004-in test, see https://lists.w3.org/Archives/Public/public-json-ld-wg/2024May/0000.html)
if inputpath ends with ".html" (with optional fragment identifier) then invoke runner.do_test_html
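
The dispatch condition for HTML inputs can be sketched as below. The function name is_html_input and the sample paths in the usage note are hypothetical; the actual check in test_onedotone.py may be written differently.

```python
def is_html_input(inputpath):
    """True if the path, ignoring any '#fragment' suffix, ends with '.html'."""
    path, _, _fragment = inputpath.partition("#")
    return path.endswith(".html")
```

For example, both "html/example-in.html" and "html/example-in.html#second" would be routed to runner.do_test_html under this check.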

test/jsonld/runner.py

add new do_test_html function (Note: the html test cases from the JSON-LD Test Suite combine testing
for JSON-LD extraction from the HTML with testing for other algorithms (e.g. compact/flatten),
which rdflib does not currently support. In order to test extraction only and ignore
the compact/flatten algorithms, do_test_html performs a graph comparison using
rdflib.compare.isomorphic, without serializing back to JSON)

Breaking Changes

When rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.

I can think of other ways to return the base without breaking the current return value:

Return json when processing a json document and tuple (json, base) when processing an html document.
Add an optional parameter to return tuple (json, base) instead of json.
Continue returning only json, but add an optional parameter which will receive the value of base.
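
As a rough illustration of the second alternative (an opt-in parameter), the return shape could look like this. load_document and return_base are hypothetical stand-ins, not rdflib names; only the return-shape logic matters here.

```python
import json
from typing import Any, Optional, Tuple, Union


def load_document(
    text: str,
    html_base: Optional[str] = None,
    return_base: bool = False,
) -> Union[Any, Tuple[Any, Optional[str]]]:
    """Hypothetical stand-in for source_to_json."""
    data = json.loads(text)
    if return_base:
        # Opt-in: new callers receive the tuple (json, base).
        return data, html_base
    # Default: existing callers keep receiving only the JSON document.
    return data
```

The trade-off is that the return type then depends on an argument, which is harder to type-check than always returning a tuple.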

Checklist

Checked that there aren't other open pull requests for the same change.
Checked that all tests and type checking pass.
If the change adds new features or changes the RDFLib public API:
- Created an issue to discuss the change and get in-principle agreement: #2692, Turning rdflib jsonld into a "full processor" (a.o. for schema.org compliance).
- Considered adding an example in ./examples.
If the change has a potential impact on users of this project:
- Added or updated tests that fail without the change.
- Updated relevant documentation to avoid inaccuracies.
- Considered adding additional documentation.
Considered granting push permissions to the PR branch, so maintainers can fix minor issues and keep your PR up to date.

coveralls · 2024-06-20T16:40:24Z

coverage: 91.036% (+0.006%) from 91.03%
when pulling 53b353f on wallberg:issue-2692-embedded-jsonld
into 0ecc400 on RDFLib:main.

coveralls · 2024-07-24T09:57:19Z

coverage: 91.067% (+0.02%) from 91.047%
when pulling 13166ec on wallberg:issue-2692-embedded-jsonld
into bb17072 on RDFLib:main.

nicholascar · 2024-07-24T23:47:40Z

@wallberg you have prefixed this PR with "Draft" but it's not actually a draft PR. Do you consider it ready for review?

Since @ashleysommer fixed the GitHub validation pipeline, it appears to be passing all tests.

wallberg · 2024-07-25T18:24:51Z

@nicholascar yes, ready for review.

ashleysommer · 2024-07-25T22:17:07Z

I'm happy to see this is using the built-in html.parser library, because we will soon be removing the old html5lib dependency.

wallberg added 2 commits May 8, 2024 15:17

Merge branch 'main' of github.com:RDFLib/rdflib into issue-2692-embedded-jsonld

53b353f

Merge branch 'main' into issue-2692-embedded-jsonld

3ec036a

Merge branch 'main' into issue-2692-embedded-jsonld

3edf1a6

nicholascar added the "awaiting feedback" label Jul 24, 2024

wallberg changed the title ~~Draft: Add JSON-LD extraction from HTML~~ Add JSON-LD extraction from HTML Jul 25, 2024

Merge branch 'main' into issue-2692-embedded-jsonld

13166ec

nicholascar removed the "awaiting feedback" label Jul 26, 2024

nicholascar self-requested a review July 26, 2024 09:15

nicholascar approved these changes Jul 26, 2024

nicholascar merged commit 0853467 into RDFLib:main Jul 26, 2024
22 checks passed

wallberg mentioned this pull request Oct 27, 2024

Fetch schema.org data from DRUM wallberg/linked-data-sandbox#1

Closed
