Find a sensible way to handle many new URI/literals #2119

Closed
pgwillia opened this issue Jan 26, 2021 · 12 comments

@pgwillia
Member

Let's add a TODO to consider more robust ways of doing this (lookup table in DB or some kind of authority query), because I could see this growing quite large.

Originally posted by @mbarnett in #2089 (comment)

Peel will introduce many new URIs in its metadata. This differs from the Controlled Vocabularies used before, where we drew from a limited, restricted set of permitted responses when submitting items. What is similar is the need to convert between a URI and a humanized string.

@pgwillia
Member Author

pgwillia commented Feb 12, 2021

Problem

Peel will introduce many new URIs in its metadata. This differs from the Controlled Vocabularies used before, where we drew from a limited, restricted set of permitted responses when submitting items. What is similar is the need to convert between a URI and a humanized string.

I think that having an intermediate symbol is desirable in URI translation so that code dealing with special cases remains readable, e.g. CONTROLLED_VOCABULARIES[:era][:item_type].article

What are we doing?

We have a number of files under config/controlled_vocabularies which are used to set up the app-wide CONTROLLED_VOCABULARIES hash.

Each URI in the controlled vocabularies has a symbol, and the same symbol appears in our locales to give a humanized literal:

def humanize_uri_code(vocab, code)
  t("controlled_vocabularies.#{vocab}.#{code}")
end

def humanize_uri(vocab, uri)
  code = CONTROLLED_VOCABULARIES[vocab].from_uri(uri)
  return nil if code.nil?

  humanize_uri_code(vocab, code)
end
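
For readers without the app at hand, here is a minimal, stdlib-only sketch of the lookup described above. Only `humanize_uri`, `humanize_uri_code`, `from_uri` and `CONTROLLED_VOCABULARIES` come from the thread; the `Vocab` struct, the `TRANSLATIONS` hash (standing in for the locale files that `t` reads) and the example.org URI are assumptions made for illustration.

```ruby
# Hypothetical stand-in for one vocabulary: a map from codes (symbols) to URIs.
Vocab = Struct.new(:code_to_uri) do
  # Reverse lookup: Hash#key returns the first key for a value, or nil.
  def from_uri(uri)
    code_to_uri.key(uri)
  end
end

CONTROLLED_VOCABULARIES = {
  item_type: Vocab.new({ article: 'http://example.org/vocab/item_type/article' })
}.freeze

# Stand-in for the locale files; the real app goes through Rails' `t`.
TRANSLATIONS = {
  'controlled_vocabularies.item_type.article' => 'Article'
}.freeze

def humanize_uri_code(vocab, code)
  TRANSLATIONS["controlled_vocabularies.#{vocab}.#{code}"]
end

def humanize_uri(vocab, uri)
  code = CONTROLLED_VOCABULARIES[vocab].from_uri(uri)
  return nil if code.nil?

  humanize_uri_code(vocab, code)
end
```

The symbol sits between the URI and the label, so the URI can change without touching the locale files, and the label can change without touching the vocabulary.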

We use this to

What have others done?

Rails i18n

This guide discusses i18n in Rails.

easy-to-use and extensible framework for translating your application to a single custom language other than English or for providing multi-language support in your application.

The public API is

I18n.t 'store.title' # alias for translate - Lookup text translations
I18n.l Time.now # alias for localize  - Localize Date and Time objects to local formats

The default backend is Simple. The guide also discusses the Chain backend, which is useful when you want to use standard translations with a Simple backend but store custom application translations in a database or other backends. The same author who worked on the i18n gem also has an i18n-active_record gem that implements an ActiveRecord backend. This backend stores translations in a Translation table with locale, key, value, interpolations and is_proc attributes.

Other references

URI Service

URI Service: a Ruby gem from Columbia University Libraries that provides a database-backed, Solr-cached lookup/creation service for URIs.

UriService.client.find_term_by_uri('http://id.example.com/123')
# Returns a term hash or nil
#   {
#     'uri' => 'http://id.example.com/123',
#     'value' => 'Batman, John, 1800-1839',
#     'vocabulary_string_key' => 'names',
#     'type' => 'external'
#   }

This gem also allows you to

  • create local, external and temporary vocabularies.
  • search for a term in a vocabulary UriService.client.find_terms_by_query('names', 'batman')

Linked Data concepts

Resource Caching

How can an application that relies on loading data be more tolerant of network failures and/or reduce its use of bandwidth?

Label Everything

How can we ensure that every resource has a basic human-readable name?

@pgwillia
Member Author

pgwillia commented Feb 12, 2021

Solution

Generate an ActiveRecord Model called Vocabulary with namespace, vocab, uri, and code attributes. Perhaps make the code optional and derive it from snake case of the label.

The API for this model:

  • Vocabulary.<namespace>.<vocab>.<code> returns a uri
  • Vocabulary.<namespace>.<vocab>.from_uri(uri) returns a code
  • t('vocabulary.<namespace>.<vocab>.<code>') returns a label

Use the ActiveRecord backend for i18n to store the translations/labels for each URI/code. Have an after_create hook on Vocabulary which adds the translation to the i18n ActiveRecord backend.

Other considerations

  • Look first in Vocabulary but fall back to our CONTROLLED_VOCABULARIES if not found.
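
The proposal above could be sketched roughly as follows. This is an in-memory illustration only, not an ActiveRecord model: `VocabularyStore`, `VocabularyRow`, `code_from_label` and the method signatures are hypothetical names, and the `fallback` hash merely plays the role of CONTROLLED_VOCABULARIES.

```ruby
# In-memory stand-in for rows of the proposed Vocabulary table (hypothetical).
VocabularyRow = Struct.new(:namespace, :vocab, :uri, :code, :label)

# Derive a code from a label when none is supplied,
# e.g. "Folk music festivals" becomes :folk_music_festivals.
def code_from_label(label)
  label.downcase.gsub(/[^a-z0-9]+/, '_').gsub(/\A_+|_+\z/, '').to_sym
end

class VocabularyStore
  # fallback: nested hash of namespace => vocab => { code => uri }
  def initialize(fallback = {})
    @rows = []
    @fallback = fallback
  end

  # The code is optional and derived from the label if omitted.
  def add(namespace:, vocab:, uri:, label:, code: nil)
    row = VocabularyRow.new(namespace, vocab, uri, code || code_from_label(label), label)
    @rows << row
    row
  end

  # URI -> code: look in the store first, then fall back to the static hash.
  def from_uri(namespace, vocab, uri)
    row = @rows.find { |r| r.namespace == namespace && r.vocab == vocab && r.uri == uri }
    return row.code if row

    @fallback.dig(namespace, vocab)&.key(uri)
  end
end
```

In the real proposal the label would be written to the i18n ActiveRecord backend from an after_create hook rather than kept on the row; it is stored on the struct here only to keep the sketch self-contained.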

@pgwillia
Member Author

This was an exercise to see if we could get a label from a URI programmatically. We might do this once on ingest and then store in our database(s).

linkeddata

require 'rdf'
require 'linkeddata'

graph = RDF::Graph.load("http://id.loc.gov/authorities/names/n79007225") # a URI from the FolkFest data
graph.query({predicate: RDF::Vocab::SKOS.prefLabel}).first.object.to_s
=> "Edmonton (Alta.)"

graph.query({predicate: RDF::Vocab::SKOS.prefLabel}) { |statement| puts statement.object.to_s }
Edmonton (Alta.)
Alberta--Edmonton Region
Alberta--Edmonton Metropolitan Area
Alberta--Edmonton
Alberta--Edmonton Suburban Area
Edmonton
Strathcona (Edmonton, Alta.)--(DLC)n 81057372

From Getting data from the Semantic Web (Ruby) and Queryable:query

@mbarnett
Contributor

Pretty sure there's a much cheaper way to do that, because the RDF::Graph calls are very expensive/slow. This was a huge issue in ActiveTriples/ActiveFedora. Give me a bit to check

@mbarnett
Contributor

Hmmm, so you can do something like:

irb(main):016:0> uri = "http://purl.org/dc/terms/title"
=> "http://purl.org/dc/terms/title"
irb(main):017:0> RDF::Vocabulary.find_term(uri)
=> #<RDF::Vocabulary::Term:0x3fcd9556c1a4 ID:http://purl.org/dc/terms/title>
irb(main):018:0> RDF::Vocabulary.find(uri)
=> RDF::StrictVocabulary(http://purl.org/dc/terms/)
irb(main):019:0> RDF::Vocabulary.find_term(uri).label.to_s
=> "Title"

generally, which is good both because it bypasses all the in memory graph stuff and because in your example we have to know the vocab the URI comes from to form the query, which is a bunch of extra work to have to do if we can't just have a Vocab-agnostic conversion. It's still not terribly efficient because it does this in just about the dumbest way possible – it simply looks through every vocab it knows about sequentially trying to find the corresponding URI. But the Graph query is worse in that I think it's built on top of this, so it's several more layers of inefficient on top of this inefficiency.

Looking up the specific URI "http://id.loc.gov/authorities/names/n79007225" doesn't seem to work, however. Could be a bug in RDF, could be something else at play there.

Long story short, humanize_uri gets called a lot and it's in a hot path on the facets rendering code, so it can't be slow without having a very large impact on performance. That probably rules out anything like the Graph approach, the RDF find methods, and definitely querying external authorities whose labels we may not even like anyways. I think we're going to have to store the mappings in the DB by hand and only rely on this kind of method for an initial population of the values, and only where the labels these methods provide are sufficiently user-friendly to be worth using.
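
To illustrate the hot-path concern, here is a small sketch of keeping the expensive lookup out of the rendering path by memoising results in a Hash. `LabelCache` and `label_for` are hypothetical names, and the loader block stands in for whatever slow source (a DB query, an authority lookup) would fill the cache; thread safety and invalidation are deliberately ignored.

```ruby
# Memoising wrapper: the expensive loader is consulted once per URI,
# and every later call for that URI is a plain Hash read.
class LabelCache
  def initialize(&loader)
    @loader = loader
    @labels = {}
  end

  def label_for(uri)
    # key? (rather than ||=) so that unknown URIs, cached as nil,
    # do not trigger the slow loader again.
    return @labels[uri] if @labels.key?(uri)

    @labels[uri] = @loader.call(uri)
  end
end
```

Whether the cache is filled lazily like this or loaded in full at boot is a design choice the thread leaves open; either way, the facet rendering code would only ever see Hash lookups.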

Not sure there's any value in the URI Service gem either beyond what we could more easily do ourselves without having to worry that it's abandoned.

@pgwillia
Member Author

pgwillia commented Feb 18, 2021

> Looking up the specific URI "http://id.loc.gov/authorities/names/n79007225" doesn't seem to work, however. Could be a bug in RDF, could be something else at play there.

There could be something else at play there. The example you gave was the URI corresponding to a predicate. The example I'm trying to solve is the object as a URI. I tried some of the other URIs in our config/controlled_vocabularies and none of these worked either.

Here is an example that didn't work with RDF::Graph, so this method isn't bulletproof either.

english: https://iso639-3.sil.org/code/eng

ERROR <https://iso639-3.sil.org/code/eng>: 5:501: FATAL: error parsing attribute name
...

> Long story short, humanize_uri gets called a lot and it's in a hot path on the facets rendering code, so it can't be slow without having a very large impact on performance. That probably rules out anything like the Graph approach, the RDF find methods, and definitely querying external authorities whose labels we may not even like anyways. I think we're going to have to store the mappings in the DB by hand and only rely on this kind of method for an initial population of the values, and only where the labels these methods provide are sufficiently user-friendly to be worth using.

I have no intention of using RDF::Graph in the application code for any just-in-time lookup. The problem I was trying to solve with that snippet was finding a reasonable label for the URIs in the Folk Fest data. I did this by hand for [config/controlled_vocabularies/digitization_*] in my FolkFest Modelling PR by visiting each URI in my browser to find a reasonable code. I hadn't yet even done the work to fill in the values for the i18n translations. When we load the metadata, we could use something like that RDF::Graph snippet in an ActiveJob to look up and store an appropriate label. Interpreting URIs by hand is one of the activities that I didn't think would scale when we add more digitization content. The label would then be stored in the i18n ActiveRecord backend and used by humanize_uri whenever the app needs it.

> Not sure there's any value in the URI Service gem either beyond what we could more easily do ourselves without having to worry that it's abandoned.

💯 agree. The URI Service gem will not be useful for us. This was just part of my scan of the state of the world.

@mbarnett
Contributor

Hmmm yeah good point, wasn't thinking in terms of Predicate vs Object.

I almost want to say we should look at more of a Questioning Authority approach for populating the DB, but I'm still concerned about some of the labels maybe not being what we'd want to present to an end user (perhaps needlessly worried, but Hydra was so bad for presenting really poor UX due to that kind of reliance on mechanical mapping)

@pgwillia
Member Author

pgwillia commented Feb 18, 2021

> generally, which is good both because it bypasses all the in memory graph stuff and because in your example we have to know the vocab the URI comes from to form the query, which is a bunch of extra work to have to do if we can't just have a Vocab-agnostic conversion. It's still not terribly efficient because it does this in just about the dumbest way possible – it simply looks through every vocab it knows about sequentially trying to find the corresponding URI. But the Graph query is worse in that I think it's built on top of this, so it's several more layers of inefficient on top of this inefficiency.

It would be nice to have something where we could just call label and have it do the right thing. That's where I did some reading about "Linked Data Concepts" and how this plays out in most graphs. There are some other predicates to look at if skos:prefLabel isn't there like skos:altLabel and rdfs:label. From what I gather prefLabel should be vocabulary agnostic. We're trying to find the edges of the graph not the connecting parts, if that makes any sense.

[edit] vocabulary agnostic is I guess not really what I meant. Maybe graph agnostic? Most nodes in a linked data graph will terminate with a prefLabel.
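
The fallback order mentioned above (skos:prefLabel, then skos:altLabel, then rdfs:label) could be sketched as follows. To stay self-contained this works on plain `[subject, predicate, object]` string arrays rather than an RDF::Graph, and `label_for` is a hypothetical helper, not part of any gem.

```ruby
SKOS_PREF_LABEL = 'http://www.w3.org/2004/02/skos/core#prefLabel'
SKOS_ALT_LABEL  = 'http://www.w3.org/2004/02/skos/core#altLabel'
RDFS_LABEL      = 'http://www.w3.org/2000/01/rdf-schema#label'

# Predicates to try, in order of preference.
LABEL_PREDICATES = [SKOS_PREF_LABEL, SKOS_ALT_LABEL, RDFS_LABEL].freeze

# triples: array of [subject, predicate, object] strings.
# Returns the first label found for the subject, or nil if it has none.
def label_for(triples, subject)
  LABEL_PREDICATES.each do |predicate|
    match = triples.find { |s, p, _o| s == subject && p == predicate }
    return match[2] if match
  end
  nil
end
```

Constraining the match to the subject matters: a query on the predicate alone returns the labels of every node in the loaded document, not just the URI of interest.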

@pgwillia
Member Author

Tell me more about the Questioning Authority approach?

The blog post about the i18n ActiveRecord backend includes a basic admin interface for adding/editing translations. Would that help with fixing bad mappings?

@mbarnett
Contributor

I was actually thinking less in terms of "it might not be prefLabel" and more in terms of "it might not be in RDF::Vocab::SKOS" but yeah, there might be two ways for that to go wrong.

Lemme mull the questioning authority vs graph query approach some more, may or may not come up with some thoughts

@pgwillia
Member Author

pgwillia commented Feb 24, 2021

First we can create a graph from a spreadsheet like 'FolkFestTriples - v.1', downloaded as a CSV called 'folkfest.csv' in the current path.

require 'csv'
require 'rdf'

# Build an in-memory graph from the Entity/Property/Value columns.
graph = RDF::Graph.new
CSV.foreach('folkfest.csv', headers: true) do |row|
  subject = RDF::URI.new(row["Entity"], validate: true)
  predicate = RDF::URI.new(row["Property"], validate: true)
  # Values that fail URI validation are stored as literals instead.
  object = begin
    RDF::URI.new(row["Value"], validate: true)
  rescue
    RDF::Literal.new(row["Value"])
  end
  graph << [subject, predicate, object]
end

Then we can find the URIs and their labels:

query = RDF::Query.new do
  pattern [:collection, RDF::URI.new("http://rdaregistry.info/Elements/u/P60249"), :program]
  pattern [:program, :predicate, :term]
  pattern [:term, RDF::Vocab::SKOS.prefLabel, :label], optional: true
end
thesaurus = graph.query(query).select(:predicate, :term, :label).order_by(:predicate).filter { |solution| solution.term.uri? }.distinct
| predicate | term | label |
| --- | --- | --- |
| http://id.loc.gov/vocabulary/relators/aut | http://terms.library.ualberta.ca/name/name1 | |
| http://id.loc.gov/vocabulary/relators/pbl | http://viaf.org/viaf/148861489 | |
| http://id.loc.gov/vocabulary/relators/pup | http://id.loc.gov/authorities/names/n79007225 | "Edmonton (Alta.)" |
| http://pcdm.org/models#memberOf | https://digitalcollections.library.ualberta.ca/collection/UUID | |
| http://purl.org/dc/elements/1.1/coverage | http://id.loc.gov/authorities/names/n79007225 | "Edmonton (Alta.)" |
| http://purl.org/dc/elements/1.1/subject | http://viaf.org/viaf/148861489 | |
| http://purl.org/dc/elements/1.1/subject | http://id.loc.gov/authorities/subjects/sh2008004058 | "Folk music festivals" |
| http://purl.org/dc/terms/language | https://iso639-3.sil.org/code/eng | "English" |
| http://purl.org/dc/terms/type | http://id.loc.gov/vocabulary/resourceTypes/txt | "Text" |
| http://www.europeana.eu/schemas/edm/hasType | http://id.loc.gov/authorities/genreForms/gf2014026156 | "Programs (Publications)" |
| http://www.europeana.eu/schemas/edm/rights | https://rightsstatements.org/page/InC/1.0/?language=en | "In Copyright" |
| https://www.w3.org/TR/rdf-schema/#ch_type | https://schema.org/CreativeWork | |

@pgwillia
Member Author

pgwillia commented Mar 4, 2021

I think this is solved by @mbarnett's ControlledVocabulary API
