[HTML] Add Schema.org and other inline rdf support #164
Comments
Currently, it is only generating an RDF-like view of the DOM tree. In general, SA generates the main graph for the resource content (RDF-like view) and, in some cases, additional graphs for metadata (e.g. EXIF metadata for images). In the case of HTML, SA could generate additional named graphs with extracted metadata.
We could use http://any23.apache.org -- other ideas? |
Thanks. You might look at
https://github.com/wbsg-uni-mannheim/WDCFramework/blob/master/pom.xml since
they extract these formats and seem to build upon any23
Named graphs make sense to distinguish the different syntax sources.
UK Guardian newspaper pages are usually good if you want to find examples
of json-ld and microdata in the same page. Or at least used to be.
…On Tue, 30 Nov 2021 at 10:19, Enrico Daga ***@***.***> wrote:
Currently, it is only generating an RDF-like view of the DOM tree.
In general, SA generates the main graph for the resource content (RDF-like
view) and, in some cases, additional graphs for metadata (e.g. EXIF
metadata for images).
In the case of HTML, SA could generate additional named graphs with
extracted metadata.
These should include:
- RDFa
- Microdata
- Microformats
- Others?
We could use http://any23.apache.org -- other ideas?
|
This relates to #13 |
With dcc589e SA is able to extract metadata from HTML pages. |
That's fantastic - nice work!
…On Sat, 11 Dec 2021, 08:31 luigi-asprino, ***@***.***> wrote:
With dcc589e
SA is able to extract metadata from HTML pages.
This feature relies on Any23.
By default Any23 extracts quads having the URL of the page as graph URI.
Therefore, at the moment, the content extracted by SA and Any23 collapses
on the same graph.
The option to enable this feature is html.metadata=(true/false) (false by
default).
Of course, we can discuss which is the best way to serve Any23 extracted
content. This was just a tentative implementation of the feature.
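As a rough illustration of how the option might be used, the sketch below builds a query string that turns on `html.metadata` and lists the graphs returned for a page. The helper name `build_query` and the example URL are hypothetical, and the query shape (options passed in the `x-sparql-anything:` SERVICE IRI, a `GRAPH ?g` pattern inside it) is an assumption about how one would probe the result, not something confirmed in this thread.

```python
def build_query(page_url: str, metadata: bool = True) -> str:
    """Build a SPARQL query string that lists the graphs produced for page_url.

    Hypothetical helper: only the option name html.metadata comes from the
    thread; everything else is an illustrative assumption.
    """
    flag = "true" if metadata else "false"
    options = f"location={page_url},html.metadata={flag}"
    return (
        "SELECT DISTINCT ?g WHERE {\n"
        f"  SERVICE <x-sparql-anything:{options}> {{\n"
        "    GRAPH ?g { ?s ?p ?o }\n"
        "  }\n"
        "}"
    )


if __name__ == "__main__":
    # Example URL is a placeholder, not a page mentioned in the issue.
    print(build_query("https://example.org/page.html"))
```

With the behaviour described above, where the SA and Any23 output collapse on the same graph, such a query would presumably return a single graph name.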
|
Graph names can be customized according to the running extractor. Will do a commit with partial work in this direction. |
Any23 should use the HTTP client of SA.
However, this means that we need to make a public method |
However, I would prefer to just pass an InputStream to Any23, really. |
cool i do see the embedded json-ld (which uses schema.org) from IMDB now.
yields:
it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs). |
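A minimal sketch of the check described here, assuming the extracted metadata ends up in graphs other than the one holding the DOM view. The quads and graph names below are made up for illustration (the `#json-ld` suffix is hypothetical, not SA's naming scheme).

```python
from typing import Iterable, Set, Tuple

# A quad is (subject, predicate, object, graph name).
Quad = Tuple[str, str, str, str]


def distinct_graphs(quads: Iterable[Quad]) -> Set[str]:
    """Collect the set of graph names used by the quads."""
    return {graph for (_, _, _, graph) in quads}


def has_embedded_rdf(quads: Iterable[Quad], content_graph: str) -> bool:
    """True if any quad lives in a graph other than the main content graph."""
    return bool(distinct_graphs(quads) - {content_graph})


if __name__ == "__main__":
    page = "https://example.org/page.html"  # placeholder URL
    quads = [
        ("_:root", "rdf:type", "xhtml:html", page),
        ("_:movie", "rdf:type", "schema:Movie", page + "#json-ld"),  # hypothetical graph name
    ]
    print(len(distinct_graphs(quads)))    # 2
    print(has_embedded_rdf(quads, page))  # True
```

If everything collapses on the page URL as the graph name, the count is 1 and this check cannot tell the two sources apart, which is the motivation for separate named graphs.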
Oops, I missed them in the snippet, but they are there. EDIT: here they are
|
Yes, this is the plan |
A great many pages contain RDF data via Schema.org (in Microdata, JSON-LD, RDFa). There are also other vocabularies that use those syntaxes. Does SPARQL Anything represent that data naturally, or could it be adapted to do so?