Turning rdflib jsonld into a "full processor" (a.o. for schema.org compliance) #2692

ghost · 2024-01-28T11:21:01Z

The JSON-LD 1.1 draft spec mentions different levels of processing for JSON-LD: https://w3c.github.io/json-ld-syntax/#processor-levels

A pure processor can only parse JSON-LD expressed in JSON directly, but a full processor can also parse JSON-LD embedded in HTML.

It would be great if rdflib-jsonld would support this. It would make rdflib-jsonld a library that could be used for HTML documents following the schema.org guidelines for embedding (meta)data in HTML pages as described in their getting started guide https://schema.org/docs/gs.html.

Together with the RDFa and microdata parsers, this could then work as a fully RDF-based version of Google's Structured Data Testing Tool: https://search.google.com/structured-data/testing-tool.

ghost · 2024-01-28T11:36:23Z

I'm looking into implementing this using RDFLib/rdflib-jsonld#63 as a starting point. In that code, HTML parsing is attempted only after JSON parsing fails, but I'm looking at choosing the parser based on the content type of the source instead.
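As a rough illustration of the content-type approach (not the code from rdflib or from RDFLib/rdflib-jsonld#63), a dispatcher might look like the sketch below. The helper name `choose_parse_mode` and its return values are assumptions made up for this example; the two HTML media types are the ones named later in this thread.

```python
# Hypothetical helper: decide how to parse a source from its content type
# instead of trying JSON first and falling back to HTML on failure.
HTML_CONTENT_TYPES = {"text/html", "application/xhtml+xml"}


def choose_parse_mode(content_type):
    """Return "html" or "json" for a Content-Type header value."""
    if content_type is None:
        # Assumption: an unknown content type is treated as plain JSON-LD.
        return "json"
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return "html" if media_type in HTML_CONTENT_TYPES else "json"
```

Compared with the fallback strategy, this avoids parsing the same input twice and keeps JSON syntax errors from being masked by a second HTML parsing attempt.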

I see that the JSON-LD test suite in "test/jsonld/1.1" already provides a number of tests for this, which just need to be enabled.

ghost · 2024-04-01T20:53:48Z

FYI: I'm occasionally working on this at https://github.com/wallberg-umd/rdflib/tree/issue-2692-embedded-jsonld-draft . I've added the basic functionality and enabled the existing tests. I'm now working through making the tests pass.

See https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .

Implementation summary:

rdflib.plugins.parsers.jsonld.JsonLDParser.parse
* add docstring
* change parameter list from **kwargs to explicit list
* add optional extract_all_scripts parameter
* get the fragment identifier from source.getSystemId()
* add fragment_id and extract_all_scripts parameters to the call to source_to_json

rdflib.plugins.shared.jsonld.util.source_to_json
* add docstring
* add optional fragment_id and extract_all_scripts parameters
* change the return value to a tuple with the extracted JSON document and value of the HTML base element
* if source.content_type is "text/html" or "application/xhtml+xml" then parse source as HTML and extract the appropriate script element(s) and the HTML base element

Testing

test/jsonld/test_onedotone.py
* enable all existing html tests (except html/f004-in)
* if inputpath ends with ".html" (with optional fragment identifier) then invoke runner.do_test_html

For more information on the failing html/f004-in test, see https://lists.w3.org/Archives/Public/public-json-ld-wg/2024May/0000.html .

test/jsonld/runner.py
* add new do_test_html function

Note that the html test cases from the JSON-LD Test Suite combine testing for JSON-LD extraction from the HTML with testing for other algorithms (e.g. compact/flatten), which rdflib does not currently support. In order to test extraction only and ignore the compact/flatten algorithms, do_test_html performs a graph comparison using rdflib.compare.isomorphic, without serializing back to JSON.
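To make the extraction rules in the summary concrete, here is a minimal standard-library sketch. It is not the rdflib implementation: the function name `extract_jsonld`, the `system_id` parameter, and the error handling are assumptions for illustration, and the script selection logic follows my reading of the HTML content algorithms linked above (a fragment identifier selects one script element by id, extract_all_scripts merges every JSON-LD script, otherwise the first one is used).

```python
import json
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _ScriptCollector(HTMLParser):
    """Collects JSON-LD script elements and the first <base href> value."""

    def __init__(self):
        super().__init__()
        self.base = None
        self.scripts = []  # list of (id attribute, script text) pairs
        self._current_id = None
        self._chunks = None  # a list while inside a JSON-LD script, else None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base is None:
            self.base = attrs.get("href")
        elif tag == "script":
            script_type = (attrs.get("type") or "").split(";", 1)[0].strip().lower()
            if script_type == "application/ld+json":
                self._current_id = attrs.get("id")
                self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.scripts.append((self._current_id, "".join(self._chunks)))
            self._chunks = None


def extract_jsonld(html_text, system_id="", extract_all_scripts=False):
    """Return a (json_document, base) tuple extracted from an HTML string."""
    collector = _ScriptCollector()
    collector.feed(html_text)
    collector.close()

    # The fragment identifier, if any, comes from the source's system id (URL).
    fragment_id = urldefrag(system_id).fragment
    if fragment_id:
        matches = [text for sid, text in collector.scripts if sid == fragment_id]
        if not matches:
            raise ValueError(f"no JSON-LD script element with id {fragment_id!r}")
        document = json.loads(matches[0])
    elif extract_all_scripts:
        # Merge every script: arrays are flattened, other values are appended.
        document = []
        for _, text in collector.scripts:
            parsed = json.loads(text)
            document.extend(parsed if isinstance(parsed, list) else [parsed])
    else:
        if not collector.scripts:
            raise ValueError("no JSON-LD script element found")
        document = json.loads(collector.scripts[0][1])
    return document, collector.base
```

Returning the base alongside the JSON mirrors the tuple return value described in the summary, since a relative IRI inside the script has to be resolved against the HTML base element rather than the document URL alone.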

wallberg · 2024-05-08T21:04:05Z

I've completed an initial implementation for this issue, see https://github.com/wallberg/rdflib/tree/issue-2692-embedded-jsonld .

It contains one breaking change: when rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.

I can think of other ways to return the base without breaking the current return value:

* Return json when processing a json document and (json, base) when processing an html document.
* Add an optional parameter to return (json, base) instead of json.
* Continue returning only json, but add an optional parameter which will receive the value of base.
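For discussion purposes, the second and third alternatives could look roughly like the sketch below. Everything here is hypothetical: `_load` is a stand-in for the real extraction, and the names `source_to_json_flag`, `source_to_json_holder`, `return_base`, and `base_holder` are invented for this example rather than taken from rdflib.

```python
import json


def _load(text, is_html):
    # Stand-in for the real work: pretend HTML sources carry a base URL.
    return json.loads(text), ("http://example.org/" if is_html else None)


# Alternative 2: an opt-in flag; existing callers keep getting only the JSON.
def source_to_json_flag(text, is_html=False, return_base=False):
    data, base = _load(text, is_html)
    return (data, base) if return_base else data


# Alternative 3: the caller passes a mutable holder that receives the base.
def source_to_json_holder(text, is_html=False, base_holder=None):
    data, base = _load(text, is_html)
    if base_holder is not None:
        base_holder["base"] = base
    return data
```

Both keep the default return value unchanged. The first alternative in the list needs no new parameter, but callers would have to inspect the return type, which is easy to get wrong when the JSON document itself may be a list.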

I'd like to get some feedback on the preferred approach before submitting the PR.

A note on the current status of validation:

* task test passes
* task lint passes
* task mypy fails, I believe due to existing problems in main
* poetry run python -m mypy rdflib/plugins/parsers/jsonld.py rdflib/plugins/shared/jsonld/context.py rdflib/plugins/shared/jsonld/util.py test/jsonld/runner.py test/jsonld/test_context.py test/jsonld/test_onedotone.py passes

Co-authored-by: Ashley Sommer
Co-authored-by: Nicholas Car

ghost changed the title ~~Turning rdflib-jsonld into a "full processor" (a.o. for schema.org compliance)~~ Turning rdflib jsonld into a "full processor" (a.o. for schema.org compliance) Jan 29, 2024

wallberg mentioned this issue Jun 20, 2024

Add JSON-LD extraction from HTML #2804

Merged

8 tasks

nicholascar closed this as completed Aug 1, 2024
