[HTML] Add Schema.org and other inline rdf support #164

Open
danbri opened this issue Nov 29, 2021 · 12 comments
Labels
Feature New feature or request

Comments

@danbri

danbri commented Nov 29, 2021

A great many pages contain RDF data via Schema.org (in Microdata, JSON-LD, or RDFa). There are also other vocabularies that use those syntaxes. Does SPARQL Anything represent that data naturally, or could it be adapted to do so?
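
For readers unfamiliar with inline RDF, here is a minimal Python sketch of what "Schema.org data in a page" can look like when it is carried as JSON-LD. It is not part of SPARQL Anything: the class name `JsonLdExtractor` and the sample HTML are made up for illustration, and only the standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # JSON-LD is embedded in script elements with this media type.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Made-up sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Movie", "name": "Dune"}
</script>
</head><body><p>Hello</p></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
for doc in extractor.documents:
    print(doc["@type"], doc["name"])
```

Microdata and RDFa are spread across element attributes instead of sitting in one script block, so they need a different kind of extractor.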

@enridaga
Member

Currently, it is only generating an RDF-like view of the DOM tree.

In general, SA generates the main graph for the resource content (RDF-like view) and, in some cases, additional graphs for metadata (e.g. EXIF metadata for images).

In the case of HTML, SA could generate additional named graphs with extracted metadata.
These should include:

  • RDFa
  • Microdata
  • Microformats
  • Others?

We could use http://any23.apache.org -- other ideas?

@danbri
Author

danbri commented Nov 30, 2021 via email

enridaga changed the title from "Schema.org and other inline rdf support" to "[HTML] Add Schema.org and other inline rdf support" on Nov 30, 2021
enridaga added the "Feature" label (New feature or request) on Nov 30, 2021
@luigi-asprino
Member

This relates to #13

luigi-asprino added a commit that referenced this issue Dec 10, 2021
@luigi-asprino
Member

With dcc589e SA is able to extract metadata from HTML pages.
This feature relies on Any23.
By default, Any23 extracts quads that use the URL of the page as the graph URI.
Therefore, at the moment, the content extracted by SA and the content extracted by Any23 end up in the same graph.
The option to enable this feature is html.metadata=(true/false) (false by default).
Of course, we can discuss which is the best way to serve Any23 extracted content. This was just a tentative implementation of the feature.
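
As a rough illustration of how the option might be toggled from a client, the Python sketch below builds a query with `fx:html.metadata` set either way. The query shape and the endpoint URL mirror the curl example posted later in this thread; the helper names `build_query` and `build_request` are hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl example in this thread; adjust for your setup.
ENDPOINT = "http://localhost:3000/sparql.anything"

QUERY_TEMPLATE = """PREFIX fx: <http://sparql.xyz/facade-x/ns/>
SELECT * WHERE {{
  SERVICE <x-sparql-anything:> {{
    fx:properties fx:location "{location}" .
    fx:properties fx:media-type "text/html" .
    fx:properties fx:html.metadata "{metadata}" .
    GRAPH ?g {{ ?s ?p ?o . }}
  }}
}}"""


def build_query(location, metadata=False):
    """Return the query text; no escaping is done on `location` (sketch only)."""
    flag = "true" if metadata else "false"
    return QUERY_TEMPLATE.format(location=location, metadata=flag)


def build_request(query):
    """Build (but do not send) a form-encoded POST asking for CSV results."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(ENDPOINT, data=body, headers={"Accept": "text/csv"})


query = build_query("https://www.imdb.com/title/tt1160419/", metadata=True)
request = build_request(query)
print(query)
```

Passing `request` to `urllib.request.urlopen` would send it, assuming a SPARQL Anything server is listening at that address.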

luigi-asprino added a commit that referenced this issue Dec 11, 2021
@danbri
Author

danbri commented Dec 11, 2021 via email

@enridaga
Member

Graph names can be customized according to the running extractor. Will do a commit with partial work in this direction.
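
One way to picture per-extractor graph names is the Python sketch below. The naming scheme (page URL plus a fragment naming the extractor) and the extractor labels are assumptions made up for this example; the thread does not say which scheme SA will use.

```python
from collections import defaultdict


def graph_name(page_url, extractor):
    # Hypothetical naming scheme: page URL plus a fragment naming the extractor.
    return f"{page_url}#{extractor}"


def route(page_url, triples_by_extractor):
    """Place each extractor's triples in a graph named after that extractor."""
    graphs = defaultdict(list)
    for extractor, triples in triples_by_extractor.items():
        graphs[graph_name(page_url, extractor)].extend(triples)
    return dict(graphs)


page = "https://www.imdb.com/title/tt1160419/"
graphs = route(page, {
    # Extractor labels below are illustrative, not real Any23 extractor names.
    "jsonld": [("_:b0", "http://schema.org/alternateName", "Dune")],
    "opengraph": [(page, "http://opengraphprotocol.org/schema/site_name", "IMDb")],
})
for name, triples in graphs.items():
    print(name, len(triples))
```

With a scheme like this, the DOM view could keep the plain page URL as its graph name while each metadata syntax gets a graph of its own.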

@enridaga
Member

Any23 should use the HTTP client of SA.

Any23.setHTTPClient

However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.

@enridaga
Member

enridaga commented Dec 13, 2021

> Any23 should use the HTTP client of SA.
>
> Any23.setHTTPClient
>
> However, this means that we need to make a public method Triplifier.getHTTPClient, which we don't have at the moment.

However, I would prefer to just pass an InputStream to Any23, really.
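
The stream-passing idea can be sketched independently of the Java APIs. In the Python sketch below, the page is fetched once (here it is just a made-up byte string) and every consumer receives its own fresh stream over the same bytes, so no HTTP client has to be shared. None of these function names exist in SA or Any23; they are placeholders.

```python
import io


def run_extractors(content, extractors):
    """Give every extractor its own fresh stream over the same bytes."""
    return [extract(io.BytesIO(content)) for extract in extractors]


def dom_view(stream):
    # Placeholder for the DOM triplifier: just reports how many bytes it read.
    return ("dom", len(stream.read()))


def metadata_view(stream):
    # Placeholder for a metadata extractor: counts Microdata itemprop attributes.
    return ("metadata", stream.read().count(b"itemprop"))


# Made-up page content standing in for a single HTTP fetch.
page = b'<div itemscope><span itemprop="name">Dune</span></div>'
results = run_extractors(page, [dom_view, metadata_view])
print(results)
```

The trade-off is that the whole body is held in memory so it can be replayed for each consumer.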

@justin2004
Contributor

justin2004 commented Dec 16, 2021

cool i do see the embedded json-ld (which uses schema.org) from IMDB now.

curl --silent 'http://localhost:3000/sparql.anything'  \
-H 'Accept: text/csv' \
--data-urlencode 'query=
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
select *
# construct {?s ?p ?o}
WHERE {
service <x-sparql-anything:>{
    fx:properties fx:location "https://www.imdb.com/title/tt1160419/" .
    fx:properties fx:media-type "text/html" .
    fx:properties fx:html.metadata "true" .
    graph ?g {?s ?p ?o .}
}
}'

yields:

s,p,o,g

...
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/url,https://www.imdb.com/title/tt1160419/,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/title,Dune (2021) - IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/description,"Dune: Directed by Denis Villeneuve. With Timothée Chalamet, Rebecca Ferguson, Oscar Isaac, Jason Momoa. Feature adaptation of Frank Herbert's science fiction novel about the son of a noble family entrusted with the protection of the most valuable asset and most vital element in the galaxy.",https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/type,video.movie,https://www.imdb.com/title/tt1160419/
...

it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs).
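
The graph-counting check could look like the Python sketch below, which reads CSV results of the shape shown above and collects the distinct values of the `g` column. The two sample rows are copied from the output above; the function name `distinct_graphs` is made up for this example.

```python
import csv
import io


def distinct_graphs(csv_text):
    """Return the set of graph names found in the `g` column of CSV results."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["g"] for row in reader}


SAMPLE_CSV = """s,p,o,g
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/site_name,IMDb,https://www.imdb.com/title/tt1160419/
https://www.imdb.com/title/tt1160419/,http://opengraphprotocol.org/schema/type,video.movie,https://www.imdb.com/title/tt1160419/
"""

graphs = distinct_graphs(SAMPLE_CSV)
print(len(graphs), "distinct graph(s)")
```

With the current behaviour every row carries the page URL as its graph, so the count cannot yet distinguish a page with embedded RDF from one without it.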

@justin2004
Contributor

justin2004 commented Dec 16, 2021

oops, i missed them in the snippet but they are there.

EDIT

here they are

s,p,o,g
_:b0,http://schema.org/actor,_:b1,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b2,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/actor,_:b3,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/aggregateRating,_:b4,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/alternateName,Dune,https://www.imdb.com/title/tt1160419/
_:b0,http://schema.org/contentRating,PG-13,https://www.imdb.com/title/tt1160419/
...
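
Until the metadata lands in its own graph, one workaround is to look at predicate namespaces instead. The Python sketch below tallies predicates by vocabulary; the predicate IRIs are copied from the two result snippets in this thread, and the helper `namespace_of` is a made-up name that simply cuts an IRI after its last `/` or `#`.

```python
from collections import Counter


def namespace_of(iri):
    """Return the IRI up to and including its last '/' or '#'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1]


# Predicates copied from the result snippets posted in this thread.
predicates = [
    "http://schema.org/actor",
    "http://schema.org/aggregateRating",
    "http://schema.org/alternateName",
    "http://opengraphprotocol.org/schema/title",
    "http://opengraphprotocol.org/schema/type",
]

by_vocabulary = Counter(namespace_of(p) for p in predicates)
for vocabulary, count in by_vocabulary.items():
    print(vocabulary, count)
```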

@enridaga
Member

> it would be nice if it was in a different named graph so i could easily tell if html had embedded RDF (by counting the number of distinct graphs).

Yes, this is the plan

@luigi-asprino
Member

231fb35 includes a test for RDFa, which passes.
6de3a47 includes a test for microformats, which fails.
I'm not familiar with microformats.
So currently only microdata and RDFa are supported.
Any23 should have extractors for microformats, but I couldn't get it to work.


4 participants