[HTML] Add Schema.org and other inline rdf support #164

danbri · 2021-11-29T12:00:23Z

A great many pages contain RDF data via Schema.org (in microdata, json-ld, rdfa). There are also other vocabularies which use those syntaxes. Does SPARQL Anything represent that data naturally, or could it be adapted to do so?

enridaga · 2021-11-30T10:19:45Z

Currently, it is only generating an RDF-like view of the DOM tree.

In general, SA generates the main graph for the resource content (RDF-like view) and, in some cases, additional graphs for metadata (e.g. EXIF metadata for images).

In the case of HTML, SA could generate additional named graphs with extracted metadata.
These should include:

RDFa
Microdata
Microformats
Others?

We could use http://any23.apache.org -- other ideas?

danbri · 2021-11-30T11:16:30Z

Thanks. You might look at https://github.com/wbsg-uni-mannheim/WDCFramework/blob/master/pom.xml since they extract these formats and seem to build upon any23.

Named graphs make sense to distinguish the different syntax sources.

UK Guardian newspaper pages are usually good if you want to find examples of json-ld and microdata in the same page. Or at least used to be.


luigi-asprino · 2021-12-07T16:27:55Z

This relates to #13

luigi-asprino · 2021-12-11T08:30:59Z

With dcc589e SA is able to extract metadata from HTML pages.
This feature relies on Any23.
By default, Any23 extracts quads that use the URL of the page as the graph URI.
Therefore, at the moment, the content extracted by SA and the content extracted by Any23 end up in the same graph.
The option to enable this feature is html.metadata (true/false, false by default).
Of course, we can discuss which is the best way to serve Any23 extracted content. This was just a tentative implementation of the feature.
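
For illustration, a query that switches the option on could be assembled as below. This is only a minimal sketch, not part of SA: the helper name `build_metadata_query` is hypothetical, and the `fx:properties` syntax is assumed to match the form used in the query posted later in this thread.

```python
FX_NS = "http://sparql.xyz/facade-x/ns/"


def build_metadata_query(location: str, metadata: bool = True) -> str:
    """Return a SPARQL query asking SA to triplify an HTML page.

    html.metadata is false by default, so it is always set explicitly here.
    Assumption: `location` contains no double quotes (no escaping is done).
    """
    flag = "true" if metadata else "false"
    return f"""PREFIX fx: <{FX_NS}>
SELECT * WHERE {{
  SERVICE <x-sparql-anything:> {{
    fx:properties fx:location "{location}" .
    fx:properties fx:media-type "text/html" .
    fx:properties fx:html.metadata "{flag}" .
    GRAPH ?g {{ ?s ?p ?o . }}
  }}
}}"""


if __name__ == "__main__":
    # Print the query so it can be pasted into an SA endpoint or CLI call.
    print(build_metadata_query("https://www.imdb.com/title/tt1160419/"))
```

Leaving `metadata=False` should reproduce the default behaviour, where only the RDF-like view of the DOM tree is returned.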

danbri · 2021-12-11T15:23:56Z

That's fantastic - nice work!


enridaga · 2021-12-13T09:53:40Z

Graph names can be customized according to the running extractor. Will do a commit with partial work in this direction.


enridaga · 2021-12-13T09:57:02Z

Any23 should use the HTTP client of SA.

Any23.setHTTPClient

However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.

enridaga · 2021-12-13T09:57:46Z

Any23 should use the HTTP client of SA.
Any23.setHTTPClient
However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.

However, I would prefer to just pass an InputStream to Any23, really.

justin2004 · 2021-12-16T23:06:35Z

cool i do see the embedded json-ld (which uses schema.org) from IMDB now.

curl --silent 'http://localhost:3000/sparql.anything'  \
-H 'Accept: text/csv' \
--data-urlencode 'query=
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
select *
# construct {?s ?p ?o}
WHERE {
service <x-sparql-anything:>{
    fx:properties fx:location "https://www.imdb.com/title/tt1160419/" .
    fx:properties fx:media-type "text/html" .
    fx:properties fx:html.metadata "true" .
    graph ?g {?s ?p ?o .}
}
}'

yields:

s,p,o,g

...
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/url,https://www.imdb.com/title/tt1160419/,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/title,Dune (2021) - IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/description,"Dune: Directed by Denis Villeneuve. With Timothée Chalamet, Rebecca Ferguson, Oscar Isaac, Jason Momoa. Feature adaptation of Frank Herbert's science fiction novel about the son of a noble family entrusted with the protection of the most valuable asset and most vital element in the galaxy.",https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/type,video.movie,https://www.imdb.com/title/tt1160419/
...

it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs).
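
The check described here could be scripted against the CSV output. A minimal sketch, assuming the result keeps a `g` column as in the query above; `distinct_graphs` is a hypothetical helper, not something SA provides.

```python
import csv
import io


def distinct_graphs(csv_text: str) -> set:
    """Collect the distinct values of the `g` column from a CSV result."""
    # csv.DictReader handles quoted fields, e.g. descriptions containing commas.
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["g"] for row in reader if row.get("g")}


if __name__ == "__main__":
    # Two rows taken from the output above; both share the page URL as graph.
    sample = (
        "s,p,o,g\n"
        "https://www.imdb.com/title/tt1160419/,"
        "http://opengraphprotocol.org/schema/site_name,IMDb,"
        "https://www.imdb.com/title/tt1160419/\n"
        "https://www.imdb.com/title/tt1160419/,"
        "http://opengraphprotocol.org/schema/type,video.movie,"
        "https://www.imdb.com/title/tt1160419/\n"
    )
    print(len(distinct_graphs(sample)))  # 1, since both rows use the same graph
```

With the current behaviour every row carries the page URL as its graph, so the count stays at one whether or not embedded RDF was found; separate named graphs per extractor would make the count informative.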

justin2004 · 2021-12-16T23:08:12Z

oops i missed them in the snippet but they are there.

EDIT

here they are

s,p,o,g
_:b0,http://schema.org/actor,_:b1,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b2,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b3,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/aggregateRating,_:b4,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/alternateName,Dune,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/contentRating,PG-13,https://www.imdb.com/title/tt1160419/
...

enridaga · 2021-12-17T09:16:01Z

"it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs)."

Yes, this is the plan

luigi-asprino · 2024-09-13T14:59:19Z

231fb35 includes a test for RDFa, which passes.
6de3a47 includes a test for microformats, which fails.
I'm not familiar with microformats.
So currently only microdata and RDFa are supported.
Any23 should have extractors for microformats, but I couldn't get it to work.

enridaga changed the title ~~Schema.org and other inline rdf support~~ [HTML] Add Schema.org and other inline rdf support Nov 30, 2021

enridaga added the Feature New feature or request label Nov 30, 2021

luigi-asprino added a commit that referenced this issue Dec 10, 2021

See #164

dcc589e

luigi-asprino added a commit that referenced this issue Dec 11, 2021

#164

900484c

enridaga added a commit that referenced this issue Dec 13, 2021

Partial work for moving triplified metadata to other graphs. See #164 …

e32c9ed

…(comment) #164

enridaga mentioned this issue Feb 10, 2022

Prepare release 0.6.0 #204

Closed

luigi-asprino added a commit that referenced this issue Sep 13, 2024

#164 Include test for RDFa

231fb35

luigi-asprino added a commit that referenced this issue Sep 13, 2024

#164 Include test for microformats

6de3a47
