Add JSON-LD extraction from HTML #2804

Merged
merged 5 commits into from
Jul 26, 2024

Contributor

@wallberg wallberg commented Jun 20, 2024

Implementation of issue #2692.

See also https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .

Summary of changes

If source.content_type is "text/html" or "application/xhtml+xml" then parse the document as HTML and extract script elements of type="application/ld+json" as JSON-LD.

The default behavior is to extract only the first matching script element. These overrides are available:

  • To extract all script elements: supply an optional extract_all_scripts=True parameter to JsonLDParser.parse()
  • To extract one script element with a specific id attribute value: add the id value as a fragment identifier in the IRI available from source.getSystemId()
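The three selection modes above (first script, all scripts, script by id) can be illustrated with a minimal standard-library sketch. This is not the code from this PR: `JsonLdScriptCollector` and `extract_json_ld` are hypothetical names, the `type` attribute is matched exactly (no media type parameters), and returning a list for the "all scripts" case is an assumption made for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptCollector(HTMLParser):
    """Collect the text of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # list of (id attribute or None, raw text)
        self._current_id = None
        self._chunks = None      # text chunks while inside a matching script

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._current_id = attrs.get("id")
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.scripts.append((self._current_id, "".join(self._chunks)))
            self._chunks = None


def extract_json_ld(html, fragment_id=None, extract_all_scripts=False):
    """Hypothetical helper mirroring the selection rules described above."""
    collector = JsonLdScriptCollector()
    collector.feed(html)
    collector.close()
    scripts = collector.scripts

    if fragment_id is not None:
        # Select the one script whose id matches the fragment identifier.
        for script_id, text in scripts:
            if script_id == fragment_id:
                return json.loads(text)
        raise ValueError(f"no JSON-LD script element with id {fragment_id!r}")

    if extract_all_scripts:
        return [json.loads(text) for _, text in scripts]

    if not scripts:
        raise ValueError("no JSON-LD script element found")
    # Default: only the first matching script element.
    return json.loads(scripts[0][1])
```

With a page containing two JSON-LD scripts with ids `a` and `b`, `extract_json_ld(page)` returns the first document, `extract_json_ld(page, fragment_id="b")` returns the second, and `extract_json_ld(page, extract_all_scripts=True)` returns both.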

Detailed changes

rdflib.plugins.parsers.jsonld.JsonLDParser.parse

  • add docstring
  • change parameter list from **kwargs to explicit list
  • add optional extract_all_scripts parameter
  • get the fragment identifier from source.getSystemId()
  • add fragment_id and extract_all_scripts parameters to the call to source_to_json
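As a rough illustration of the fragment identifier step, the standard library's `urllib.parse.urldefrag` splits an IRI into its document part and its fragment. The helper name `split_fragment` is hypothetical, and whether the PR uses `urldefrag` or another approach is not stated here.

```python
from urllib.parse import urldefrag


def split_fragment(system_id):
    """Return (iri_without_fragment, fragment_id), with fragment_id None if absent.

    Hypothetical helper: system_id stands in for the value that
    source.getSystemId() would return.
    """
    if system_id is None:
        return None, None
    url, fragment = urldefrag(system_id)
    # urldefrag yields an empty string when there is no fragment.
    return url, (fragment or None)
```

For example, `split_fragment("http://example.org/doc.html#second")` yields the document IRI and `"second"`, which would then select the script element with `id="second"`.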

rdflib.plugins.shared.jsonld.util.source_to_json

  • add docstring
  • add optional fragment_id and extract_all_scripts parameters
  • change the return value to a tuple with the extracted JSON document and value of the HTML base element
  • if source.content_type is "text/html" or "application/xhtml+xml" then parse source as HTML and extract the appropriate script element(s) and the HTML base element
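The new return shape can be sketched as follows, assuming a simplified function that takes the document text and content type directly instead of an rdflib input source. `load_json_and_base` and `FirstScriptAndBase` are hypothetical names, and only the default "first script" behavior is shown.

```python
import json
from html.parser import HTMLParser

HTML_CONTENT_TYPES = ("text/html", "application/xhtml+xml")


class FirstScriptAndBase(HTMLParser):
    """Capture the first JSON-LD script body and the first <base href>."""

    def __init__(self):
        super().__init__()
        self.base = None
        self.script_text = None
        self._chunks = None  # text chunks while inside the first matching script

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base is None:
            self.base = attrs.get("href")
        elif (
            tag == "script"
            and self.script_text is None
            and attrs.get("type") == "application/ld+json"
        ):
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.script_text = "".join(self._chunks)
            self._chunks = None


def load_json_and_base(text, content_type=None):
    """Return (json_document, html_base); html_base is None for plain JSON."""
    if content_type in HTML_CONTENT_TYPES:
        parser = FirstScriptAndBase()
        parser.feed(text)
        parser.close()
        if parser.script_text is None:
            raise ValueError("no JSON-LD script element found")
        return json.loads(parser.script_text), parser.base
    # Any other content type is treated as a plain JSON document.
    return json.loads(text), None
```

Callers always unpack two values, so the HTML and plain JSON paths share one calling convention.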

test/jsonld/test_onedotone.py

test/jsonld/runner.py

  • add new do_test_html function (Note: the html test cases from the JSON-LD Test Suite combine testing
    for JSON-LD extraction from the HTML with testing for other algorithms (e.g. compact/flatten),
    which rdflib does not currently support. In order to test extraction only and ignore
    the compact/flatten algorithms, do_test_html performs a graph comparison using
    rdflib.compare.isomorphic, without serializing back to JSON)

Breaking Changes

When rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.

I can think of other ways to return the base without breaking the current return value:

  • Return json when processing a json document and tuple (json, base) when processing an html document.
  • Add an optional parameter to return tuple (json, base) instead of json.
  • Continue returning only json, but add an optional parameter which will receive the value of base.
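For comparison, the second alternative (an optional parameter that switches the return value to a tuple) might look like the sketch below. The names `source_to_json_compat`, `return_base`, and `_load` are hypothetical, and `_load` is a stand-in that never finds a base.

```python
import json


def _load(text):
    """Stand-in loader returning (document, base); base is always None here."""
    return json.loads(text), None


def source_to_json_compat(text, return_base=False):
    """Return only the JSON by default; return (json, base) on request."""
    document, base = _load(text)
    return (document, base) if return_base else document
```

Existing callers keep receiving a bare JSON document, while new callers opt in with `return_base=True`. The cost is a return type that depends on an argument, which is harder to type-check than the always-a-tuple approach taken in this PR.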

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Checked that all tests and type checking passes.
  • If the change adds new features or changes the RDFLib public API:
  • If the change has a potential impact on users of this project:
    • Added or updated tests that fail without the change.
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

Testing

test/jsonld/test_onedotone.py
* enable all existing html tests (except html/f004-in)
* if inputpath ends with ".html" (with optional fragment identifier) then invoke runner.do_test_html

For more information on the failing html/f004-in test, see https://lists.w3.org/Archives/Public/public-json-ld-wg/2024May/0000.html .

@coveralls

Coverage Status

coverage: 91.036% (+0.006%) from 91.03%
when pulling 53b353f on wallberg:issue-2692-embedded-jsonld
into 0ecc400 on RDFLib:main.

@coveralls

coveralls commented Jul 24, 2024

Coverage Status

coverage: 91.067% (+0.02%) from 91.047%
when pulling 13166ec on wallberg:issue-2692-embedded-jsonld
into bb17072 on RDFLib:main.

@nicholascar
Member

@wallberg you have prefixed this PR with "Draft" but it's not actually a draft PR. Do you consider it ready for review?

Since @ashleysommer fixed the GitHub validation pipeline, it appears to be passing all tests.

@nicholascar nicholascar added the "awaiting feedback" label (More feedback is needed from the author of the PR or Issue.) Jul 24, 2024
@wallberg wallberg changed the title Draft: Add JSON-LD extraction from HTML Add JSON-LD extraction from HTML Jul 25, 2024
@wallberg
Contributor Author

@nicholascar yes, ready for review.

@ashleysommer
Contributor

I'm happy to see this is using the built-in html.parser library, because we will soon be removing the old html5lib dependency.

@nicholascar nicholascar removed the "awaiting feedback" label (More feedback is needed from the author of the PR or Issue.) Jul 26, 2024
@nicholascar nicholascar self-requested a review July 26, 2024 09:15
@nicholascar nicholascar merged commit 0853467 into RDFLib:main Jul 26, 2024
22 checks passed