Add JSON-LD extraction from HTML #2804
Merged
See https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .

Implementation summary:

rdflib.plugins.parsers.jsonld.JsonLDParser.parse
* add docstring
* change parameter list from **kwargs to explicit list
* add optional extract_all_scripts parameter
* get the fragment identifier from source.getSystemId()
* add fragment_id and extract_all_scripts parameters to the call to source_to_json

rdflib.plugins.shared.jsonld.util.source_to_json
* add docstring
* add optional fragment_id and extract_all_scripts parameters
* change the return value to a tuple with the extracted JSON document and value of the HTML base element
* if source.content_type is "text/html" or "application/xhtml+xml" then parse source as HTML and extract the appropriate script element(s) and the HTML base element

Testing

test/jsonld/test_onedotone.py
* enable all existing html tests (except html/f004-in)
* if inputpath ends with ".html" (with optional fragment identifier) then invoke runner.do_test_html

For more information on the failing html/f004-in test, see https://lists.w3.org/Archives/Public/public-json-ld-wg/2024May/0000.html .

test/jsonld/runner.py
* add new do_test_html function

Note that the html test cases from the JSON-LD Test Suite combine testing for JSON-LD extraction from the HTML with testing for other algorithms (e.g. compact/flatten), which rdflib does not currently support. In order to test extraction only and ignore the compact/flatten algorithms, do_test_html performs a graph comparison using rdflib.compare.isomorphic, without serializing back to JSON.
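The two checks described in the summary (taking the fragment identifier from the system ID and deciding whether the content type is HTML) can be sketched with the standard library alone. This is only an illustration, not the code from this PR: the helper names split_fragment and is_html are hypothetical, and stripping parameters such as "; charset=utf-8" before comparing the content type is an assumption of the sketch.

```python
from urllib.parse import urldefrag

# Content types that the summary says trigger HTML extraction.
HTML_CONTENT_TYPES = {"text/html", "application/xhtml+xml"}


def split_fragment(system_id):
    """Split a system ID (URL or path) into (url_without_fragment, fragment_or_None).

    Hypothetical helper, not part of rdflib.
    """
    url, fragment = urldefrag(system_id)
    return url, (fragment or None)


def is_html(content_type):
    """Return True if content_type names one of the HTML media types.

    Assumption of this sketch: media type parameters are ignored and the
    comparison is case-insensitive.
    """
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_CONTENT_TYPES


url, fragment_id = split_fragment("http://example.org/page.html#data")
print(url, fragment_id)
print(is_html("text/html; charset=utf-8"))
```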
@wallberg you have prefixed this PR with "Draft" but it's not actually a draft PR. Do you consider it ready for review? Since @ashleysommer fixed the GitHub validation pipeline, it appears to be passing all tests.
nicholascar added the "awaiting feedback" label on Jul 24, 2024
wallberg changed the title from "Draft: Add JSON-LD extraction from HTML" to "Add JSON-LD extraction from HTML" on Jul 25, 2024
@nicholascar yes, ready for review.
I'm happy to see this is using the built-in |
nicholascar removed the "awaiting feedback" label on Jul 26, 2024
nicholascar approved these changes on Jul 26, 2024
Implementation of issue #2692.
See also https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .
Summary of changes
If source.content_type is "text/html" or "application/xhtml+xml" then parse the document as HTML and extract script elements of type="application/ld+json" as JSON-LD.

The default behavior is to extract only the first matching script element. These overrides are available:
* extract_all_scripts=True parameter to JsonLDParser.parse()
* a fragment identifier in source.getSystemId()
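The selection rules above can be sketched as follows. This is a minimal illustration built on Python's standard html.parser module, not the implementation from this PR: JsonLdScriptCollector and extract_jsonld are hypothetical names, the sketch always returns a list of documents together with the base, it matches the type attribute exactly, and giving the fragment identifier precedence over extract_all_scripts is an assumption.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptCollector(HTMLParser):
    """Collect JSON-LD script bodies and the first <base href> value."""

    def __init__(self):
        super().__init__()
        self.base = None      # href of the first <base> element, if any
        self.scripts = []     # list of (id attribute or None, script text)
        self._current_id = None
        self._buffer = None   # a list while inside a JSON-LD script element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base is None and attrs.get("href"):
            self.base = attrs["href"]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._current_id = attrs.get("id")
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append((self._current_id, "".join(self._buffer)))
            self._buffer = None


def extract_jsonld(html, fragment_id=None, extract_all_scripts=False):
    """Return (list of parsed JSON-LD documents, base href or None).

    Hypothetical helper: a fragment_id selects the script with that id,
    extract_all_scripts selects every script, otherwise only the first.
    """
    collector = JsonLdScriptCollector()
    collector.feed(html)
    collector.close()
    if fragment_id:
        texts = [text for sid, text in collector.scripts if sid == fragment_id][:1]
    elif extract_all_scripts:
        texts = [text for _, text in collector.scripts]
    else:
        texts = [text for _, text in collector.scripts[:1]]
    return [json.loads(text) for text in texts], collector.base


page = """<html><head>
<base href="http://example.org/">
<script type="application/ld+json" id="first">{"@id": "a"}</script>
<script type="application/ld+json" id="second">{"@id": "b"}</script>
</head><body></body></html>"""

print(extract_jsonld(page))                            # first script only
print(extract_jsonld(page, extract_all_scripts=True))  # both scripts
print(extract_jsonld(page, fragment_id="second"))      # script with id="second"
```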
Detailed changes
rdflib.plugins.parsers.jsonld.JsonLDParser.parse
rdflib.plugins.shared.jsonld.util.source_to_json
test/jsonld/test_onedotone.py
test/jsonld/runner.py
* add new do_test_html function (Note: the html test cases from the JSON-LD Test Suite combine testing for JSON-LD extraction from the HTML with testing for other algorithms (e.g. compact/flatten), which rdflib does not currently support. In order to test extraction only and ignore the compact/flatten algorithms, do_test_html performs a graph comparison using rdflib.compare.isomorphic, without serializing back to JSON)
Breaking Changes
When rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.

I can think of other ways to return the base without breaking the current return value:
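To illustrate why the tuple return is a breaking change for existing callers, here is a minimal sketch using two hypothetical stand-in functions (load_json_only and load_json_with_base). They are not rdflib's source_to_json and only mimic the shape of the old and new return values.

```python
import json


def load_json_only(text):
    """Stand-in for the old shape: return just the parsed JSON."""
    return json.loads(text)


def load_json_with_base(text, html_base=None):
    """Stand-in for the new shape: return (parsed JSON, HTML base or None)."""
    return json.loads(text), html_base


doc_text = '{"@id": "http://example.org/a"}'

# Old-style caller: the result is the JSON document itself.
print(load_json_only(doc_text)["@id"])

# Updated caller: unpack the tuple to get the document and the base.
data, html_base = load_json_with_base(doc_text, "http://example.org/")
print(data["@id"], html_base)

# An old-style caller run against the new shape fails, because a tuple
# cannot be indexed with a string key.
try:
    load_json_with_base(doc_text)["@id"]
except TypeError as exc:
    print("old-style caller breaks:", exc)
```

Any code that calls the function directly and uses the result as the JSON document would need to unpack the tuple in the same way.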