The structure of an HTML document is captured by the Document Object Model (DOM) specification, in which HTML elements (also known as tags) can be considered containers.
According to the Facade-X model, SPARQL Anything uses:

- RDF properties for specifying tag attributes;
- container membership properties for specifying relations to child elements in the DOM tree. These may include text, which is expressed as RDF literals of type xsd:string;
- tag names to type the container. Specifically, the tag name is used to mint a URI that identifies the class of the corresponding containers.

SPARQL Anything selects this transformer for the following file extensions:
- html
SPARQL Anything selects this transformer for the following media types:
- text/html

```html
<html>
  <head>
    <title>Hello world!</title>
  </head>
  <body>
    <p class="paragraph">Hello world</p>
  </body>
</html>
```

Located at https://sparql-anything.cc/examples/simple.html

```sparql
CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.html> {
    GRAPH ?g {
      ?s ?p ?o
    }
  }
}
```

```turtle
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX eg: <http://www.example.org/>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rss: <http://purl.org/rss/1.0/>
PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#>
PREFIX whatwg: <https://html.spec.whatwg.org/#>
PREFIX xhtml: <http://www.w3.org/1999/xhtml#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX xyz: <http://sparql.xyz/facade-x/data/>

[ rdf:type fx:root , xhtml:html;
  rdf:_1 [ rdf:type xhtml:head;
           rdf:_1 [ rdf:type xhtml:title;
                    rdf:_1 "Hello world!";
                    whatwg:innerHTML "Hello world!";
                    whatwg:innerText "Hello world!"
                  ];
           whatwg:innerHTML "<title>Hello world!</title>";
           whatwg:innerText "Hello world!"
         ];
  rdf:_2 [ rdf:type xhtml:body;
           rdf:_1 [ rdf:type xhtml:p;
                    rdf:_1 "Hello world";
                    xhtml:class "paragraph";
                    whatwg:innerHTML "Hello world";
                    whatwg:innerText "Hello world"
                  ];
           whatwg:innerHTML "<p class=\"paragraph\">Hello world</p>";
           whatwg:innerText "Hello world"
         ];
  whatwg:innerHTML "<head>\n <title>Hello world!</title>\n</head>\n<body>\n <p class=\"paragraph\">Hello world</p>\n</body>";
  whatwg:innerText "Hello world! Hello world"
] .
```

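
The mapping can also be queried selectively instead of returning every triple. As a minimal sketch based only on the triples listed above, the following query uses the class minted from the tag name (xhtml:p), the property minted from the attribute (xhtml:class), and the first container membership property (rdf:_1) to retrieve the class and text of each paragraph.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xhtml: <http://www.w3.org/1999/xhtml#>

SELECT ?class ?text
WHERE {
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.html> {
    # Tag name -> class of the container
    ?p rdf:type xhtml:p ;
       # Tag attribute -> RDF property
       xhtml:class ?class ;
       # First child (here, the text of the paragraph) -> rdf:_1
       rdf:_1 ?text .
  }
}
```

Given the triples listed above, this should bind ?class to "paragraph" and ?text to "Hello world".
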

| Option name | Description | Valid Values | Default Value |
|---|---|---|---|
| html.selector | A CSS selector that restricts the HTML tags to consider for the triplification. | Any valid CSS selector | :root |
| html.metadata | Tells the triplifier to extract inline RDF from HTML pages. The extracted triples are included in the default graph (see #164). | true/false | false |
| html.browser | Tells the triplifier to use the specified browser to navigate to the page and obtain the HTML. By default, no browser is used. Using a browser requires additional dependencies; see BROWSER and justin2004's blogpost. | chromium, webkit, firefox | Not set |
| html.parser | Tells the triplifier which JSoup parser to use. | xml, html | html |
| html.browser.wait | When using a browser, tells the triplifier to wait for the specified number of seconds after telling the browser to navigate to the page, before attempting to obtain the HTML. See justin2004's blogpost. | Any integer | Not set |
| html.browser.screenshot | When using a browser, takes a screenshot of the web page (for example, for troubleshooting) and saves it at the given location. See justin2004's blogpost. | Any valid URL | Not set |
| html.browser.timeout | When using a browser, the maximum time (in milliseconds) to wait for the load event to be emitted; if it is exceeded, the operation times out. See justin2004's blogpost. | Any integer | 30000 |

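
The queries in this section pass options as comma-separated key=value pairs in the SERVICE IRI. To my knowledge, SPARQL Anything also accepts options as triples on the subject fx:properties inside an otherwise empty x-sparql-anything: SERVICE. That syntax is not described in this section, so treat the sketch below as an assumption to check against the general SPARQL Anything documentation. It may be more convenient when a value, such as a CSS selector, contains commas or equals signs.

```sparql
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
PREFIX whatwg: <https://html.spec.whatwg.org/#>

SELECT ?text
WHERE {
  SERVICE <x-sparql-anything:> {
    # Assumed triple-based option syntax: one triple per option on fx:properties
    fx:properties fx:location "https://sparql-anything.cc/examples/simple.html" ;
                  fx:html.selector ".paragraph" .
    ?s whatwg:innerText ?text
  }
}
```
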

html.selector

- Description: A CSS selector that restricts the HTML tags to consider for the triplification.
- Valid values: Any valid CSS selector.
- Default value: :root

Example: selecting the text contained in elements of the class "paragraph".

```html
<html>
  <head>
    <title>Hello world!</title>
  </head>
  <body>
    <p class="paragraph">Hello world</p>
  </body>
</html>
```

Located at https://sparql-anything.cc/examples/simple.html

```sparql
PREFIX whatwg: <https://html.spec.whatwg.org/#>

SELECT ?text
WHERE {
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.html,html.selector=.paragraph> {
    ?s whatwg:innerText ?text
  }
}
```

```
-----------------
| text          |
=================
| "Hello world" |
-----------------
```

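
Other CSS selectors can be used in the same way. As a sketch against the same simple.html file, the selector below targets the title element by tag name rather than by class.

```sparql
PREFIX whatwg: <https://html.spec.whatwg.org/#>

SELECT ?text
WHERE {
  # html.selector=title restricts the triplification to the <title> element
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.html,html.selector=title> {
    ?s whatwg:innerText ?text
  }
}
```

In the full triplification of this file the innerText of the title element is "Hello world!", so that is the value this query would be expected to return.
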

html.metadata

- Description: Tells the triplifier to extract inline RDF from HTML pages. The extracted triples are included in the default graph (see #164).
- Valid values: true/false
- Default value: false

Example: extracting the triples embedded in the web page at https://sparql-anything.cc/examples/Microdata1.html

```html
<!--
  ~ Copyright (c) 2022 SPARQL Anything Contributors @ http://github.com/sparql-anything
  ~
  ~ Licensed under the Apache License, Version 2.0 (the "License");
  ~ you may not use this file except in compliance with the License.
  ~ You may obtain a copy of the License at
  ~
  ~ http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing, software
  ~ distributed under the License is distributed on an "AS IS" BASIS,
  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ~ See the License for the specific language governing permissions and
  ~ limitations under the License.
  -->
<!DOCTYPE html>
<html>
  <body>
    <div itemscope itemtype="https://schema.org/Movie">
      <h1 itemprop="name">Avatar</h1>
      <span>Director: James Cameron (born August 16, 1954)</span>
    </div>
  </body>
</html>
```

Located at https://sparql-anything.cc/examples/Microdata1.html

```sparql
CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/Microdata1.html,html.metadata=true> {
    GRAPH ?g {
      ?s ?p ?o
    }
  }
}
```

```turtle
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX eg: <http://www.example.org/>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rss: <http://purl.org/rss/1.0/>
PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#>
PREFIX whatwg: <https://html.spec.whatwg.org/#>
PREFIX xhtml: <http://www.w3.org/1999/xhtml#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX xyz: <http://sparql.xyz/facade-x/data/>

<https://sparql-anything.cc/examples/Microdata1.html>
  <http://www.w3.org/1999/xhtml/microdata#item>
    [ rdf:type <https://schema.org/Movie>;
      <https://schema.org/name> "Avatar"
    ] .

[ rdf:type fx:root , xhtml:html;
  rdf:_1 [ rdf:type xhtml:head ];
  rdf:_2 [ rdf:type xhtml:body;
           rdf:_1 [ rdf:type xhtml:div;
                    rdf:_1 [ rdf:type xhtml:h1;
                             rdf:_1 "Avatar";
                             xhtml:itemprop "name";
                             whatwg:innerHTML "Avatar";
                             whatwg:innerText "Avatar"
                           ];
                    rdf:_2 [ rdf:type xhtml:span;
                             rdf:_1 "Director: James Cameron (born August 16, 1954)";
                             whatwg:innerHTML "Director: James Cameron (born August 16, 1954)";
                             whatwg:innerText "Director: James Cameron (born August 16, 1954)"
                           ];
                    xhtml:itemscope "";
                    xhtml:itemtype "https://schema.org/Movie";
                    whatwg:innerHTML "<h1 itemprop=\"name\">Avatar</h1><span>Director: James Cameron (born August 16, 1954)</span>";
                    whatwg:innerText "Avatar Director: James Cameron (born August 16, 1954)"
                  ];
           whatwg:innerHTML "<div itemscope itemtype=\"https://schema.org/Movie\">\n <h1 itemprop=\"name\">Avatar</h1><span>Director: James Cameron (born August 16, 1954)</span>\n</div>";
           whatwg:innerText "Avatar Director: James Cameron (born August 16, 1954)"
         ];
  whatwg:innerHTML "<head></head>\n<body>\n <div itemscope itemtype=\"https://schema.org/Movie\">\n <h1 itemprop=\"name\">Avatar</h1><span>Director: James Cameron (born August 16, 1954)</span>\n </div>\n</body>";
  whatwg:innerText "Avatar Director: James Cameron (born August 16, 1954)"
] .
```

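
When only the embedded metadata is of interest, the Facade-X structure of the page can be ignored. The following is a minimal sketch that relies on the statement above that extracted triples are included in the default graph, and on the schema.org terms visible in the output. The schema: prefix is introduced here for readability and is not part of the prefix list above.

```sparql
PREFIX schema: <https://schema.org/>

SELECT ?name
WHERE {
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/Microdata1.html,html.metadata=true> {
    # Matches the Microdata item extracted from the page
    ?movie a schema:Movie ;
           schema:name ?name .
  }
}
```

Given the output above, ?name should be bound to "Avatar".
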

html.browser

- Description: Tells the triplifier to use the specified browser to navigate to the page and obtain the HTML. By default, no browser is used. Using a browser requires additional dependencies; see BROWSER and justin2004's blogpost.
- Valid values: chromium|webkit|firefox
- Default value: Not set
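
As an illustration, html.browser is added to the SERVICE IRI like any other option. The sketch below reuses simple.html from the earlier examples and picks firefox arbitrarily. It assumes the browser dependencies mentioned above are installed.

```sparql
PREFIX whatwg: <https://html.spec.whatwg.org/#>

SELECT ?text
WHERE {
  # The page is loaded through the browser before being triplified
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.html,html.browser=firefox,html.selector=.paragraph> {
    ?s whatwg:innerText ?text
  }
}
```

For a static page such as this one, the result should be the same as without a browser. The option matters mostly for pages whose content is only available after the browser has rendered them.
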

html.parser

- Description: Tells the triplifier which JSoup parser to use.
- Valid values: xml, html
- Default value: html

Example: element names are case-sensitive when using the XML parser.

```xml
<?xml version="1.0" ?>
<xx:Element xmlns:xx="http://www.example.org">
  <xx:someThing>Hallo world</xx:someThing>
  <xx:someThingElse xx:key="0.1"/>
</xx:Element>
```

Located at https://sparql-anything.cc/examples/simple.xml

```sparql
CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.xml,triplifier=io.github.sparqlanything.html.HTMLTriplifier,html.parser=xml> {
    ?s ?p ?o
  }
}
```

```turtle
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX eg: <http://www.example.org/>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rss: <http://purl.org/rss/1.0/>
PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#>
PREFIX whatwg: <https://html.spec.whatwg.org/#>
PREFIX xhtml: <http://www.w3.org/1999/xhtml#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX xyz: <http://sparql.xyz/facade-x/data/>

[ rdf:type <http://www.example.org#Element> , fx:root;
  rdf:_1 [ rdf:type <http://www.example.org#someThing>;
           rdf:_1 "Hallo world";
           whatwg:innerHTML "Hallo world";
           whatwg:innerText "Hallo world"
         ];
  rdf:_2 [ rdf:type <http://www.example.org#someThingElse>;
           <http://www.example.org#key> "0.1"
         ];
  xhtml:xmlns:xx "http://www.example.org";
  whatwg:innerHTML "\n\t<xx:someThing>Hallo world</xx:someThing>\n\t<xx:someThingElse xx:key=\"0.1\" />\n";
  whatwg:innerText "Hallo world"
] .
```

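
Because the XML parser preserves the case of element names, class URIs in a query must be written exactly as the names appear in the source. A minimal sketch against the same file, using the URIs from the output above:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?value
WHERE {
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.xml,triplifier=io.github.sparqlanything.html.HTMLTriplifier,html.parser=xml> {
    # "someThing" keeps its capital T; <http://www.example.org#something> would not match
    ?e rdf:type <http://www.example.org#someThing> ;
       rdf:_1 ?value .
  }
}
```

Given the triples above, ?value should be bound to "Hallo world".
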

html.browser.wait

- Description: When using a browser to navigate, tells the triplifier to wait for the specified number of seconds (after telling the browser to navigate to the page) before attempting to obtain the HTML. See justin2004's blogpost.
- Valid values: Any integer
- Default value: Not set

html.browser.screenshot

- Description: When using a browser to navigate, takes a screenshot of the web page (for example, for troubleshooting) and saves it at the given location. See justin2004's blogpost.
- Valid values: Any valid URL
- Default value: Not set

html.browser.timeout

- Description: When using a browser to navigate, the maximum time (in milliseconds) to wait for the load event to be emitted; if it is exceeded, the operation times out. See justin2004's blogpost.
- Valid values: Any integer
- Default value: 30000
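
The browser options html.browser.wait, html.browser.screenshot and html.browser.timeout can be combined in a single SERVICE IRI. The following is only a sketch: the values (5 seconds, 60000 milliseconds) and the screenshot location file:///tmp/simple.png are hypothetical, and using a file: URL for the screenshot is an assumption based on the "Any valid URL" constraint.

```sparql
PREFIX whatwg: <https://html.spec.whatwg.org/#>

SELECT ?text
WHERE {
  # html.browser.wait is expressed in seconds, html.browser.timeout in milliseconds
  SERVICE <x-sparql-anything:location=https://sparql-anything.cc/examples/simple.html,html.browser=chromium,html.browser.wait=5,html.browser.timeout=60000,html.browser.screenshot=file:///tmp/simple.png> {
    ?s whatwg:innerText ?text
  }
}
```

The two durations use different units, so a wait of 5 corresponds to 5000 on the scale used by the timeout.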