Indexing file content in Solr #1043

Open
Natkeeran opened this issue Mar 4, 2019 · 6 comments
Labels
Subject: Search related to advanced and basic searching capabilities. Type: enhancement Identifies work on an enhancement to the Islandora codebase

Comments

@Natkeeran
Contributor

One of the powerful features of Islandora 7.x is its ability to index content in the datastreams. In Islandora 8, we can index field information in content types. However, there is no prescribed way to index file content (e.g., XML or JSON files). What approach will be taken to support this feature in Islandora 8?

@seth-shaw-unlv
Contributor

Two possible strategies:

  1. Have a file->text extractor service that can update a "transcript" field on the metadata node. (This is what we were planning on.)
  2. Index the files/media as independent entities that are returned in search results with links to the related item. It appears search_api_attachments (D8 version still in beta) does this, but it presumes a Node -> Media relationship rather than the Media -> Node relationship we use.

@dannylamb
Contributor

FWIW I was considering @seth-shaw-unlv's first strategy for stuff like (H)OCR and transcripts. Having that as a field you don't display but index is hands down the simplest way to go. Make an action to run in response to updates of a media and have it dump its contents into a field.
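As a rough illustration of the "dump its contents into a field" idea, here is a minimal Python sketch of the extraction step only, assuming the media file is hOCR. The function names and the `field_transcript` field name are hypothetical and are not part of Islandora; wiring this to a Drupal action is left out.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects every text node, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def hocr_to_transcript(hocr: str) -> str:
    """Return the plain text of an hOCR document with whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(hocr)
    parser.close()
    # Join all text nodes, then collapse runs of whitespace to single spaces.
    return " ".join(" ".join(parser.chunks).split())


def build_field_update(hocr: str) -> dict:
    """Build the payload an action might write to the metadata node."""
    # "field_transcript" is an assumed field name used only for illustration.
    return {"field_transcript": hocr_to_transcript(hocr)}
```

The returned dictionary is only a stand-in for whatever mechanism actually updates the node; the point is that the indexed field holds plain text and nothing else.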

For something that would require a transform, I'm less certain how it would play out. You could transform in Drupal with Twig templates and JSON, or make a microservice so you're no longer constrained by PHP's limited XML handling. It all depends on the use case, I guess.

@Natkeeran
Contributor Author

@dannylamb

For full text, the first approach can work.

But there are many modules and use cases out there that require a transform (TEI, oral history, etc.), so having a way to support that would be helpful.

@whikloj
Member

whikloj commented Mar 25, 2019

@Natkeeran could you flesh out the transform requirement a little bit? I am unclear on how you would use TEI in an Islandora 8 context.

@Natkeeran
Contributor Author

Natkeeran commented Mar 27, 2019

@whikloj
In 7.x, you can use custom XSLTs to index TEI elements into Solr. Those Solr fields can then be searched and faceted in Drupal via Islandora Solr Search. We are using this feature in several places:

  • oral histories index cues in Solr, which are then queried for display
  • we use Solr to bring back audit, tech/FITS, and FOXML info for reporting
  • TEI is indexed as noted above
  • VTT and annotations are indexed into Solr in a similar way as well

Though we may not need all the above use cases in 8.x, the question remains whether we need a generic way to index media/datastreams in Solr and then make them available for search, faceting, etc. in Drupal.
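To make the TEI use case concrete, here is a minimal Python sketch of the kind of transform a custom XSLT performs: pulling selected TEI elements into multi-valued fields of a Solr-style document. The field names and the XPath mapping are hypothetical, chosen only for illustration; the TEI namespace is the standard one. It does not post anything to Solr.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical mapping from a Solr field name to an XPath inside the TEI file.
FIELD_XPATHS = {
    "tei_title_ms": ".//tei:titleStmt/tei:title",
    "tei_person_ms": ".//tei:persName",
    "tei_place_ms": ".//tei:placeName",
}


def tei_to_solr_doc(tei_xml: str, doc_id: str) -> dict:
    """Build a Solr-style document (a plain dict) from a TEI string."""
    root = ET.fromstring(tei_xml)
    doc = {"id": doc_id}
    for field, xpath in FIELD_XPATHS.items():
        values = []
        for element in root.findall(xpath, TEI_NS):
            # Flatten nested markup and collapse whitespace.
            text = " ".join("".join(element.itertext()).split())
            # Skip empty values and duplicates, keeping first-seen order.
            if text and text not in values:
                values.append(text)
        if values:
            doc[field] = values
    return doc
```

Keeping the mapping in a plain dictionary means a site could change what is indexed without touching the extraction logic, which is roughly the role the per-site XSLTs played in 7.x.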

@whikloj
Member

whikloj commented Mar 27, 2019

My concern is thinking in 7.x terms for 8.

For instance (IMHO) media !== datastream. It's more like media & file == datastream, but even that seems a little wrong, as a datastream in Fcrepo 3 has only one parent. In Drupal 8 we could have multiple content nodes pointing to the same file with separate media entities.

Maybe we need some sort of special entity to store file information. These entities would reference a file and could contain the FITS type data. If more than one node references the file, this data is still only stored once and perhaps not as XML.

Could we convert it to some usable JSON that would be easier to work with? This data is meant to be machine readable.
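One way to picture that conversion is a generic XML-to-JSON pass, sketched below in Python. This is an assumption about what "usable JSON" could look like, not an agreed format: attributes become `@name` keys, repeated child elements become lists, namespaces are dropped, and text mixed between child elements is ignored. The function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip an ElementTree '{namespace}' prefix from a tag or attribute name."""
    return tag.rsplit("}", 1)[-1]


def element_to_dict(element):
    """Convert an element to a dict, or to a string for plain text leaves."""
    node = {"@" + _local(name): value for name, value in element.attrib.items()}
    text = (element.text or "").strip()
    children = list(element)
    if not children and not node:
        return text  # leaf with no attributes: just its text
    if text:
        node["#text"] = text
    for child in children:
        key = _local(child.tag)
        value = element_to_dict(child)
        if key in node:
            # Repeated element names are collected into a list.
            if not isinstance(node[key], list):
                node[key] = [node[key]]
            node[key].append(value)
        else:
            node[key] = value
    return node


def xml_to_json(xml_text: str) -> str:
    """Serialize the converted tree as JSON with stable key order."""
    return json.dumps(element_to_dict(ET.fromstring(xml_text)), sort_keys=True)
```

Because the output is an ordinary JSON object, it could be stored once per file and read without any XSLT, which is the trade-off being weighed here.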

I guess what I'm saying is that most people in the Islandora 7.x world have trouble with and then learn to hate the XSLTs. So I think it might be nice to dump them.

But I'm good with XSLTs, so I can go either way.

@whikloj whikloj added this to the 1.x milestone Apr 11, 2019
@kstapelfeldt kstapelfeldt added Type: enhancement Identifies work on an enhancement to the Islandora codebase Subject: Search related to advanced and basic searching capabilities. and removed enhancement labels Sep 25, 2021

5 participants