Erroneous datasets

Unfortunately, many datasets cannot be included in the LOD Cloud because they do not follow standards. Datasets that are currently excluded because of errors are described in this directory.

Certificate error

The following datasets cannot be accessed because of an incorrect certificate:

Does not exist (404 reply)

The following datasets cannot be accessed because their online location does not exist:

Server error (503)

Intentional server error (503) for DDoS mitigation

Some servers use CloudFlare DDoS mitigation. The intention is to allow human users, who access the data through a web browser with a JavaScript engine, and to disallow machine users, who access the data from scripts (typically without a JavaScript engine).

The CloudFlare process works as follows:

  1. When a human user first visits a URL in their web browser with a JavaScript engine, the browser does not know that the URL will be serviced from a CloudFlare server. The browser therefore sends a regular HTTP request.

  2. The CloudFlare server does not answer the regular HTTP request with the requested resource. Instead, it sends back a 503 server error, together with an HTML page body containing JavaScript code.

  3. The web browser loads the HTML page with the intention of showing a human-readable message to the user. However, the HTML page also includes JavaScript code from CloudFlare. This code tries to determine whether the user is a legitimate user. The requirements for being a legitimate user are unclear, but they at least include the requirement under step (1): accessing the URL through a web browser with a JavaScript engine.

  4. If the CloudFlare code determines that the user who issued the HTTP request under step (1) is a valid user, the JavaScript code automatically issues another HTTP request for the URL originally requested in step (1). The JavaScript code ensures that the new request contains certain tokens to communicate to the server that the request is now initiated through CloudFlare JavaScript code.

  5. Observing the tokens in the new HTTP request, the server verifies whether the tokens make the request eligible for a reply. If the request is considered eligible, the reply will provide access to the resource requested in step (1). The reply will also contain a Set-Cookie header.

  6. The web browser retains the cookie. If the same resource is requested in the future, the web browser will include the cookie in the request, and will directly obtain access to the resource.

Since this process requires a JavaScript engine and a cookie store, most machine users will not be able to access datasets disseminated through CloudFlare.
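A script that runs into this behaviour can at least tell it apart from an ordinary server failure. The following is a minimal sketch, not a description of how the LOD Cloud tooling works: it assumes that a challenge reply carries status 503 together with a Server header that mentions "cloudflare". The function name and the header check are assumptions made for illustration.

```python
def looks_like_cloudflare_challenge(status, headers):
    """Heuristic check for the 503 reply described in step (2).

    Assumption: the reply names Cloudflare in its Server header.
    `headers` is a plain dict of response header names to values.
    """
    server = ""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == "server":
            server = value.lower()
    return status == 503 and "cloudflare" in server


if __name__ == "__main__":
    # Hypothetical replies, for illustration only.
    print(looks_like_cloudflare_challenge(503, {"Server": "cloudflare"}))
    print(looks_like_cloudflare_challenge(503, {"Server": "nginx"}))
```

A positive result only suggests that the 503 is intentional; it does not make the dataset accessible to a client without a JavaScript engine.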

The following datasets cannot be accessed because they use this CloudFlare approach:

Flaky server

The following dataset regularly cannot be downloaded because of an unstable server:

  • Semantic Finlex

Erroneous Content-Type header

The following datasets are serviced with an incorrect Content-Type header:

  • AIFB emits binary/octet-stream i.o. text/n3.
  • BabelNet emits text/rdf+n3;charset=utf-8 i.o. text/turtle.
  • BIBO emits application/xml i.o. application/rdf+xml.
  • Bibsonomy emits application/xml i.o. application/rdf+xml.
  • Function Ontology emits application/octet-stream i.o. text/turtle.
  • Infection Transmission Ontology emits application/octect-stream i.o. text/turtle.
  • OGC GeoSPARQL emits text/xml i.o. application/rdf+xml.
  • Linked Art emits application/xml i.o. application/rdf+xml.
  • Provenance emits application/rdf\+xml i.o. application/rdf+xml.
  • Public Contracts Ontology emits text/plain i.o. application/rdf+xml.
  • SDMX Attribute emits text/plain; charset=utf-8 i.o. text/turtle.
  • SDMX Code emits text/plain; charset=utf-8 i.o. text/turtle.
  • SDMX Concept emits text/plain; charset=utf-8 i.o. text/turtle.
  • SDMX Dimension emits text/plain; charset=utf-8 i.o. text/turtle.
  • SDMX Measure emits text/plain; charset=utf-8 i.o. text/turtle.
  • W3C R2RML emits text/html i.o. application/rdf+xml.
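Mismatches like the ones listed above can be spotted mechanically by stripping the parameters from a Content-Type value and comparing the bare media type against a set of RDF media types. The sketch below is one possible check; the set `RDF_MEDIA_TYPES` and the function names are assumptions chosen for illustration and are not taken from the LOD Cloud code.

```python
# Assumed set of acceptable RDF media types (illustrative, not exhaustive).
RDF_MEDIA_TYPES = {
    "application/ld+json",
    "application/n-quads",
    "application/n-triples",
    "application/rdf+xml",
    "application/trig",
    "text/n3",
    "text/turtle",
}


def media_type(content_type):
    """Return the media type without parameters, lowercased."""
    return content_type.split(";", 1)[0].strip().lower()


def is_rdf_media_type(content_type):
    """True if the Content-Type value names one of the assumed RDF types."""
    return media_type(content_type) in RDF_MEDIA_TYPES


if __name__ == "__main__":
    for value in ["text/plain; charset=utf-8", "text/turtle",
                  "application/rdf\\+xml"]:
        print(value, "->", is_rdf_media_type(value))
```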

No Content-Type header

The following datasets emit no Content-Type header at all:

Erroneous handling of Accept header

The following Accept header value is used when accessing RDF documents online:

application/trig,
application/n-quads,
application/n-triples;q=0.9,
text/turtle;q=0.9,
application/x-turtle;q=0.9,
text/rdf+n3;q=0.9,
application/rdf+xml;q=0.8,
text/plain;q=0.8,
*/*;q=0.7

The following datasets are serviced from servers that cannot process the above Accept header:

Use the following cURL command to test these URLs:

curl -vL -H 'Accept: application/trig, application/n-quads, application/n-triples;q=0.9, text/turtle;q=0.9, application/x-turtle;q=0.9, text/rdf+n3;q=0.9, application/ld+json;q=0.85, application/rdf+xml;q=0.8, text/plain;q=0.8, */*;q=0.7' '{url}' | head
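The same header can be assembled in a script. The sketch below uses the media types and q-values from the cURL command above and shows how the header string is put together and read back; the names `ACCEPT`, `format_accept`, and `parse_accept` are hypothetical helpers, and the parser is deliberately simplistic (it only understands the `q` parameter).

```python
# Media types and weights as they appear in the cURL command above.
ACCEPT = [
    ("application/trig", 1.0),
    ("application/n-quads", 1.0),
    ("application/n-triples", 0.9),
    ("text/turtle", 0.9),
    ("application/x-turtle", 0.9),
    ("text/rdf+n3", 0.9),
    ("application/ld+json", 0.85),
    ("application/rdf+xml", 0.8),
    ("text/plain", 0.8),
    ("*/*", 0.7),
]


def format_accept(entries):
    """Serialize (media type, q) pairs; q=1 is the default and is omitted."""
    parts = []
    for media, q in entries:
        parts.append(media if q == 1.0 else f"{media};q={q:g}")
    return ", ".join(parts)


def parse_accept(header):
    """Map each media type in an Accept header value to its q-value."""
    result = {}
    for item in header.split(","):
        fields = [field.strip() for field in item.split(";")]
        q = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        result[fields[0]] = q
    return result


if __name__ == "__main__":
    print(format_accept(ACCEPT))
```

The resulting string can be passed as the value of the Accept header in any HTTP client.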

No RDF

Requesting the following datasets results in a valid server reply, but the reply does not contain RDF data:

  • SWRL emits text/html; charset=iso-8859-1.

Erroneous IRIs

Incorrect escaping

Character escapes in IRIs must use %hh-notation.

\u-notation

The following datasets use \u-escaping:

  • Library of Congress Names line 66,292,711: <http://viaf.org/processed/NLI\u007C001461487>
  • VIAF line 841,558: <http://dbpedia.org/resource/National_Theatre_"To\u0161a_Jovanovi\u0107">
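Occurrences like these can be located with a regular expression. The sketch below finds `\uXXXX` and `\UXXXXXXXX` sequences in an IRI string and, as one possible repair, replaces each with the %hh-encoding of the UTF-8 bytes of the character it denotes. Whether that repair matches the publisher's intent is an assumption; the function names are hypothetical.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def find_u_escapes(iri):
    """Return every \\u or \\U escape sequence found in the IRI string."""
    return [match.group(0) for match in U_ESCAPE.finditer(iri)]


def u_escapes_to_percent(iri):
    """Replace each escape with the %hh form of its UTF-8 bytes (assumed repair)."""
    def replace(match):
        char = chr(int(match.group(1) or match.group(2), 16))
        return "".join(f"%{byte:02X}" for byte in char.encode("utf-8"))
    return U_ESCAPE.sub(replace, iri)


if __name__ == "__main__":
    iri = r"http://viaf.org/processed/NLI\u007C001461487"
    print(find_u_escapes(iri))
    print(u_escapes_to_percent(iri))
```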

Absent escaping

Some characters are not allowed to appear unescaped in IRIs.

Unescaped backslash characters

  • VIAF line 841,558: <http://dbpedia.org/resource/National_Theatre_"To\u0161a_Jovanovi\u0107">

Unescaped caret characters

  • Pleiades line 60,882: <http://www.persee.fr/web/revues/home/prescript/article/racf_0220-6617_1991_num_30_1_2657?luceneQuery=%28%2B%28content%3AAQUAE+title%3AAQUAE^2.0+fullContent%3AAQUAE^100.0+fullTitle%3AAQUAE^140.0+summary%3AAQUAE+authors%3AAQUAE^5.0+illustrations%3AAQUAE^4.0>; reported at isawnyu/pleiades-rdf#7.

Unescaped double quote

Unescaped space characters

  • ISO 19115-1@2014 file https://raw.githubusercontent.com/ISO-TC211/GOM/master/isotc211_GOM_harmonizedOntology/iso19115/-1/2014/ExampleOfExtendedMatadata.rdf contains a space in IRI http://def.isotc211.org/iso19115/-1/2014/ExampleOfExtendedMatadata/code/KeywordTypeCode -BioCollection.
  • ISO 19115-1@2018 file https://raw.githubusercontent.com/ISO-TC211/GOM/master/isotc211_GOM_harmonizedOntology/iso19115/-1/2018/ExampleOfExtendedMatadata.rdf contains a space in IRI http://def.isotc211.org/iso19115/-1/2014/ExampleOfExtendedMatadata/code/KeywordTypeCode -BioCollection.
  • LingHub line 11: <http://logd.tw.rpi.edu/source/congress-gov/file/biographical-directory-of-the-united-states-congress/version/2012-Jan-04/conversion/congress-gov-biographical-directory-of-the-united-state s-congress-2012-Jan-04.ttl.tgz>
  • Linked Movie Database (2012-02-10) line 35,710: <http://data.linkedmdb.org/resource/country/iso alpha2>.
  • Rijksmuseum Actors line 106,332: <skos:exactMatch rdf:resource=" https://rkd.nl/explore/artists/420649"/>
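Unescaped characters of the kinds listed in this section (backslash, caret, double quote, space) can be found with a single character class. The sketch below assumes the set of characters excluded by the IRIREF production of the Turtle grammar (control characters and space, plus `<`, `>`, `"`, `{`, `}`, `|`, `^`, the backtick, and the backslash) and rewrites each one in %hh-notation. The choice of character set and the function names are assumptions for illustration.

```python
import re

# Characters assumed to be disallowed in an IRI unless escaped.
DISALLOWED = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def find_disallowed(iri):
    """Return the disallowed characters in the IRI, in order of appearance."""
    return [match.group(0) for match in DISALLOWED.finditer(iri)]


def percent_escape(iri):
    """Rewrite every disallowed character in %hh-notation."""
    return DISALLOWED.sub(lambda m: f"%{ord(m.group(0)):02X}", iri)


if __name__ == "__main__":
    iri = "http://data.linkedmdb.org/resource/country/iso alpha2"
    print(find_disallowed(iri))
    print(percent_escape(iri))
```

Note that this treats a backslash as an unescaped character, so it should be applied only after any \u-sequences have been dealt with.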

Scheme grammar violations

The following datasets cannot be parsed because they contain a forward slash character in their scheme component:

Compression errors

GNU zip errors

The following dataset uses GNU zip compression, but seems to contain strange characters when decompressed:

Syntax errors

The following datasets contain syntax errors:

Alias overloading