Indexing file content in Solr #1043

Natkeeran · 2019-03-04T20:23:46Z

One of the powerful features of Islandora 7.x is its ability to index content in the datastreams. In Islandora 8, we can index field information in content types. However, there is no prescribed way to index file content (e.g. XML or JSON files). What is the approach that will be taken to support this feature in Islandora 8?

seth-shaw-unlv · 2019-03-04T20:37:08Z

Two possible strategies:

1. Have a file->text extractor service that can update a "transcript" field on the metadata node. (This is what we were planning on.)
2. Index the files/media as independent entities that are returned in search results with links to the related item. It appears search_api_attachments (D8 version still in beta) does this, but it presumes a Node -> Media relationship rather than the Media -> Node relationship we use.

dannylamb · 2019-03-04T21:06:31Z

FWIW I was considering @seth-shaw-unlv's first strategy for stuff like (H)OCR and transcripts. Having that as a field you don't display but index is hands down the simplest way to go. Make an action to run in response to updates of a media and have it dump its contents into a field.
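To make the "dump its contents into a field" idea concrete, here is a minimal sketch in Python of the extraction step for hOCR. It is not Islandora code: the field name `field_extracted_text`, the helper names, and the shape of the returned dictionary are all assumptions for illustration, and the action that would actually write the value to the node is left out.

```python
from html.parser import HTMLParser

# Simplified sample; real hOCR output carries more markup and metadata.
SAMPLE_HOCR = """<html><head><title>page 1</title></head>
<body><div class='ocr_page'>
<span class='ocrx_word' title='bbox 10 10 50 30'>Hello</span>
<span class='ocrx_word' title='bbox 60 10 120 30'>world</span>
</div></body></html>"""


class HocrTextExtractor(HTMLParser):
    """Collects the words found inside <body>, ignoring the <head>."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # split() drops the whitespace and newlines between tags.
        if self._in_body:
            self.words.extend(data.split())


def extract_text(hocr: str) -> str:
    """Return the plain text of an hOCR document as one string."""
    parser = HocrTextExtractor()
    parser.feed(hocr)
    parser.close()
    return " ".join(parser.words)


def build_field_update(hocr: str, field_name: str = "field_extracted_text") -> dict:
    """Build a hypothetical payload for a hidden, indexed text field."""
    return {field_name: extract_text(hocr)}


if __name__ == "__main__":
    print(build_field_update(SAMPLE_HOCR))
```

The point of the sketch is only that the extracted text ends up as an ordinary field value, so the existing field indexing can pick it up without any file-level indexing.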

For something that would require a transform I'm less certain as to how it would play out. You could transform in Drupal with Twig templates and json or make a microservice so you're no longer constrained by PHP's limited xml handling. It all depends on the use case, I guess.

Natkeeran · 2019-03-06T13:57:14Z

@dannylamb

For full text, the first approach can work.

But there are many modules/use cases out there that require a transform (TEI, oral history, etc.), so having a way to support that would be helpful.

whikloj · 2019-03-25T20:13:55Z

@Natkeeran could you flesh out the requirement of transform a little bit? I am unclear on how you would use TEI in an Islandora 8 context.

Natkeeran · 2019-03-27T13:11:33Z

@whikloj
In 7.x, you can use custom XSLTs to index TEI elements into Solr. Those Solr fields can then be searched and faceted in Drupal via Islandora Solr Search. We are using this feature in several places:

- Oral histories index cues in Solr, which are then queried for display.
- We use Solr to bring back audit, tech/FITS, and FOXML info for reporting.
- TEI files are indexed as noted above.
- VTT files and annotations are indexed into Solr in a similar way as well.

Though we may not need all the above use cases in 8.x, the question remains whether we need a generic way to index media/datastreams in Solr and then make them available for search, faceting, etc. in Drupal.
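As a rough illustration of what a generic transform could produce without XSLT, the Python sketch below flattens a TEI document into a dictionary of index fields. Only the TEI namespace is real; the field names (`tei_title_s`, `tei_person_ms`, and so on), the sample document, and the function name are assumptions, and nothing here is an existing Islandora 8 service.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny made-up TEI document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Letter to a friend</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <p><persName>Ada Lovelace</persName> wrote from <placeName>London</placeName> to
    <persName>Charles Babbage</persName>.</p>
  </body></text>
</TEI>"""


def tei_to_fields(tei_xml: str) -> dict:
    """Flatten selected TEI elements into hypothetical Solr-style fields."""
    root = ET.fromstring(tei_xml)

    def values(path: str) -> list:
        # Text of every matching element, de-duplicated, document order kept.
        found = ("".join(el.itertext()).strip() for el in root.iterfind(path, TEI_NS))
        return list(dict.fromkeys(v for v in found if v))

    body = root.find(".//tei:text", TEI_NS)
    # Collapse runs of whitespace so the full text is one clean string.
    full_text = " ".join("".join(body.itertext()).split()) if body is not None else ""

    return {
        "tei_title_s": root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS).strip(),
        "tei_person_ms": values(".//tei:persName"),
        "tei_place_ms": values(".//tei:placeName"),
        "tei_fulltext_t": full_text,
    }


if __name__ == "__main__":
    for name, value in tei_to_fields(SAMPLE_TEI).items():
        print(name, "=", value)
```

Multi-valued fields such as the person list are the kind of thing that could be faceted on; how the resulting dictionary would reach Solr (a Drupal field, a Search API processor, or a direct update) is exactly the open question in this issue.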

whikloj · 2019-03-27T13:34:15Z

My concern is thinking in 7.x terms for 8.

For instance (IMHO) media !== datastream; it is more like media & file == datastream, but even that seems a little wrong, as a datastream in Fcrepo 3 only has one parent. In Drupal 8 we could have multiple content nodes pointing to the same file with separate media entities.

Maybe we need some sort of special entity to store file information. These entities would reference a file and could contain the FITS type data. If more than one node references the file, this data is still only stored once and perhaps not as XML.

Could we convert it to some usable JSON that would be easier to work with? This data is meant to be machine readable.
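A minimal sketch of such a conversion in Python, assuming a generic XML-to-JSON mapping rather than any agreed Islandora format: attributes become `@name` keys, child elements become lists keyed by their local name, and namespaces are dropped. The sample XML is a simplified, FITS-like stand-in, not real FITS output, and the function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Simplified FITS-like sample; real FITS output is namespaced and much richer.
SAMPLE_XML = """<fits>
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf"/>
  </identification>
  <fileinfo>
    <size>12345</size>
    <md5checksum>abc123</md5checksum>
  </fileinfo>
</fits>"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag or attribute name."""
    return tag.rsplit("}", 1)[-1]


def element_to_data(el: ET.Element):
    """Convert an element to plain Python data that json.dumps can serialize."""
    node = {"@" + local_name(k): v for k, v in el.attrib.items()}
    for child in el:
        # Every child name maps to a list so repeated elements are preserved.
        node.setdefault(local_name(child.tag), []).append(element_to_data(child))
    text = (el.text or "").strip()
    if text:
        if not node:
            return text  # leaf element: just its text
        node["#text"] = text
    return node


def xml_to_json(xml_text: str) -> str:
    """Serialize an XML document as indented JSON."""
    root = ET.fromstring(xml_text)
    return json.dumps({local_name(root.tag): element_to_data(root)}, indent=2)


if __name__ == "__main__":
    print(xml_to_json(SAMPLE_XML))
```

A generic mapping like this keeps everything but is verbose; a purpose-built structure that stores only the values people actually report on would probably be easier to index.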

I guess what I'm saying is that most people in the Islandora 7.x world have trouble with and then learn to hate the XSLTs. So I think it might be nice to dump them.

But I'm good with XSLTs, so I can go either way.

whikloj added this to the 1.x milestone Apr 11, 2019

kstapelfeldt added Search labels Sep 9, 2021

kstapelfeldt added "Type: enhancement" and "Subject: Search" and removed enhancement labels Sep 25, 2021

rosiel mentioned this issue Oct 22, 2021

Use Case: OCR is searchable and i can tell it a language. #1957

Open

kstapelfeldt added this to Islandora Issues Queue Feb 8, 2022

kstapelfeldt moved this to Todo in Islandora Issues Queue Feb 8, 2022
