Solr: make some internal fields indexed=true (searchable) for troubleshooting #2038

Closed
pdurbin opened this issue Apr 17, 2015 · 12 comments

@pdurbin
Member

pdurbin commented Apr 17, 2015

Currently many "internal" fields are set to indexed=false in our Solr schema.xml because we weren't sure if we really want them to be searchable.

@scolapasta and I have been talking about how it might be useful to make more of them searchable for troubleshooting purposes.

Unfortunately, it's not enough to simply change the schema.xml file to indexed=true. You also have to reindex everything. So, let's use this ticket to define which internal fields we want to be searchable. Here is a list of candidates:

  • entityId
  • parentId
  • parentIdentifier
  • parentName
  • parentCitation
  • citation
  • identifier
  • persistentUrl
  • unf
  • fileSizeInBytes
  • fileMd5
  • fileContentType
  • deaccessionReason
  • datasetVersionId

For info on what each of these does: https://github.com/IQSS/dataverse/blob/master/src/main/java/edu/harvard/iq/dataverse/search/SearchFields.java

Note that we do not plan to copy any of these to the "catchall" field ("text") used by Basic Search. Nor do we intend to show these on the Advanced Search page. The idea is that if you know to type "entityId:123", you'll find the object both in the GUI and through the Search API (this applies to all users, not just superusers).

Also note that when I say searchable I don't mean in a "friendly" way. These are "string" and "long" Solr field types so you have to provide an exact string match when searching.
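
To make the exact-match point concrete, here is a minimal Python sketch. It assumes that a Solr "string" field is stored untokenized, so only the whole value matches. The helper names are invented for illustration and are not part of Dataverse or Solr.

```python
# Helper names are made up for this sketch; they are not Dataverse or Solr APIs.

def field_query(field: str, value: object) -> str:
    """Build a fielded query such as 'entityId:123'."""
    return f"{field}:{value}"


def string_field_matches(indexed_value: str, query_value: str) -> bool:
    """Toy model of an untokenized "string" field: only the whole value matches."""
    return indexed_value == query_value


print(field_query("entityId", 123))                     # entityId:123
print(string_field_matches("image/png", "image/png"))   # True
print(string_field_matches("image/png", "png"))         # False
```

In other words, a search on fileContentType would need the full value (for example "image/png"), not a fragment of it.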

I'm giving this to @scolapasta to indicate which fields should be searchable.

@pdurbin pdurbin added this to the Dataverse 4.0: Release Patch milestone Apr 17, 2015
@pdurbin pdurbin changed the title Solr: make some internal fields indexed=true for troubleshooting Solr: make some internal fields indexed=true (searchable) for troubleshooting Apr 17, 2015
@scolapasta scolapasta modified the milestones: 4.0.1, Dataverse 4.0: Release Patch, In Review - Short Term Apr 17, 2015
pdurbin added a commit that referenced this issue Apr 24, 2015
Requires schema change and re-indexing #2038

Also show output for orphaned files in index/status API call.
@pdurbin
Member Author

pdurbin commented May 21, 2015

As I mentioned in #2086 I think the only way to solve #2086 is to make this field searchable:

  • parentId

While we're making more fields searchable, I would suggest also making these searchable because the values are short and shouldn't contain very many special characters (or at least the special characters will be predictable):

  • entityId
  • identifier
  • parentIdentifier
  • unf
  • fileSizeInBytes
  • fileMd5
  • fileContentType
  • datasetVersionId

I'd suggest taking a "wait and see" approach on making the following searchable because the values are long and potentially tricky to search on given special characters and such:

  • parentName
  • parentCitation
  • citation
  • persistentUrl
  • deaccessionReason

@scolapasta I'm moving this to the current milestone since it's required for #2086. If you want to simply give me this ticket to make the change above, that's fine.

@pdurbin pdurbin modified the milestones: 4.0.1, In Review May 21, 2015
pdurbin added a commit that referenced this issue May 21, 2015
Requires schema change and re-indexing #2038

Also show output for orphaned files in index/status API call.
@pdurbin
Member Author

pdurbin commented May 21, 2015

@scolapasta in 96e411c I started making parentId searchable, which (again) I needed for #2086.

Feel free to move this ticket out of 4.0.1. There's no rush for the other fields.

@raprasad
Contributor

raprasad commented Jun 1, 2015

@scolapasta, this will be needed for the Data Related to Me page.
If a dvobject id is known, there is currently no way to grab only that object from Solr.

  • Ideally: index by dvobject id
  • OR index
    • dataverse alias
    • dataset persistent id
    • datafile id

@scolapasta scolapasta modified the milestones: 4.0.2, 4.0.1 Jun 8, 2015
@scolapasta
Contributor

entityId is now searchable

@scolapasta scolapasta modified the milestones: Candidates for 4.0.3, 4.0.2 Jul 8, 2015
@pdurbin
Member Author

pdurbin commented Jul 8, 2015

"entityId is now searchable"

Right. This was merged from the "mydata" branch into the "4.0.2" branch the other day.

@scolapasta scolapasta modified the milestones: 4.2, Candidates for 4.2 Jul 15, 2015
@pdurbin
Member Author

pdurbin commented Jul 29, 2015

@scolapasta should we use this issue for the idea of a new Solr field containing "4.2" or whatever Dataverse version was used to index the Solr document? I guess we could call it "indexedByDataverseVersion" or something (suggestions welcome). It sounds like we want it to be searchable for troubleshooting purposes which is what made me think of this issue. /cc @kcondon @landreev

@scolapasta scolapasta modified the milestones: 4.3, 4.2 Sep 17, 2015
@scolapasta scolapasta assigned pdurbin and unassigned scolapasta Sep 17, 2015
@pdurbin
Member Author

pdurbin commented Sep 17, 2015

@scolapasta and I discussed my recommendations at #2038 (comment) and decided to go with them. That is to say that as of f14ab64 the following fields are searchable after reindexing:

  • parentId (added in 4.0.1)
  • entityId (added in 4.1)
  • identifier
  • parentIdentifier
  • unf
  • fileSizeInBytes
  • fileMd5
  • fileContentType
  • datasetVersionId

Here's an example of searching by the MD5 of a file: curl -s 'http://localhost:8983/solr/collection1/select?rows=1000000&wt=json&indent=true&q=fileMd5:0386269a5acb2c57b4eade587ff4db64'
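
The same request can be sketched in Python using only the standard library. The host, port, and core name (localhost:8983, collection1) are copied from the curl example above and are assumptions about a local setup; `solr_select_url` and `search` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed local Solr endpoint, taken from the curl example in this comment.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def solr_select_url(query: str, rows: int = 10) -> str:
    """Build a select URL; urlencode percent-encodes the ':' in the query."""
    params = {"q": query, "rows": rows, "wt": "json", "indent": "true"}
    return f"{SOLR_SELECT}?{urlencode(params)}"


def search(query: str, rows: int = 10) -> dict:
    """Run the query against a live Solr and return the parsed JSON response."""
    with urlopen(solr_select_url(query, rows)) as resp:
        return json.load(resp)


url = solr_select_url("fileMd5:0386269a5acb2c57b4eade587ff4db64")
print(url)
```

Calling `search(...)` needs a running Solr that has been reindexed after the schema change; building the URL does not.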

Passing to QA. I'd suggest testing #2530 at the same time.

@kcondon
Contributor

kcondon commented Oct 1, 2015

@pdurbin
A few observations, not sure if they're issues:

  1. When matching a file such as with Md5, the file type field shows up twice on the file card:
    File Type: PNG Image
    File Type: PNG Image
  2. For file size in bytes, I was not able to match based on a specific reported file size: 289.5KB for example. I tried 289.5 (I know, bytes), 289.5x1024=296448, 289.5x1000=289500, no match. May be difficult in practice using this. Also, would it be useful to be able to specify >, < or a range?
  3. Could not specify an individual UNF value though it works for *. I think this is because the UNF value has : in the value: UNF:6:x10r+Q9EK6aF/BMi+eKzGw==

Passing back for comment. Otherwise, if the purpose is only for troubleshooting, then probably ok.

@pdurbin
Member Author

pdurbin commented Oct 2, 2015

A few observations, not sure if they're issues:

1: md5

  1. When matching a file such as with Md5, the file type field shows up twice on the file card:
    File Type: PNG Image
    File Type: PNG Image

I suspect I'm doing something wrong but I can't reproduce this. When I search for "fileMd5:28bea8a0f1d3ceb96a1f2fe1f33c4bd2" it seems to find just the file I want:

[Screenshot: harvard_dataverse_-_2015-10-02_15 35 31]

2: size in bytes

2. For file size in bytes, I was not able to match based on a specific reported file size: 289.5KB for example. I tried 289.5 (I know, bytes), 289.5x1024=296448, 289.5x1000=289500, no match. May be difficult in practice using this. Also, would it be useful to be able to specify >, < or a range?

Yeah, specifying range queries in Solr is unfriendly, as far as I know. Some day I'd like to work on #370 and make this easier from the GUI. For now, you can find the number of files over 123 MB in size with a search for "fileSizeInBytes:[123456789 TO *]" like this:

[Screenshot: harvard_dataverse_-_2015-10-02_15 39 57]

Look for "range" at https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser for more on this topic.
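
One possible reason the exact values tried above (296448, 289500) found nothing is that a displayed size like 289.5 KB is rounded, so the real byte count could be any value that rounds to it. The Python sketch below turns a displayed size into an inclusive range query. It assumes the GUI uses 1024 bytes per KB and rounds to one decimal place; neither is confirmed in this thread, and the function names are hypothetical.

```python
import math


def size_range_query(shown_kb: float, bytes_per_kb: int = 1024, decimals: int = 1) -> str:
    """Range of byte counts that would display as shown_kb (assumed rounding)."""
    half_step = 0.5 * 10 ** -decimals  # 0.05 when one decimal place is shown
    low = math.ceil((shown_kb - half_step) * bytes_per_kb)
    high = math.floor((shown_kb + half_step) * bytes_per_kb)
    return f"fileSizeInBytes:[{low} TO {high}]"


def min_size_query(min_bytes: int) -> str:
    """Open-ended range: everything of at least min_bytes."""
    return f"fileSizeInBytes:[{min_bytes} TO *]"


print(size_range_query(289.5))    # fileSizeInBytes:[296397 TO 296499]
print(min_size_query(123456789))  # fileSizeInBytes:[123456789 TO *]
```

If the display actually uses 1000 bytes per KB, passing `bytes_per_kb=1000` gives the corresponding range instead.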

3: UNF

3. Could not specify an individual UNF value though it works for *. I think this is because the UNF value has : in the value: UNF:6:x10r+Q9EK6aF/BMi+eKzGw==

When I query Solr directly it seems to work (curl -s 'http://localhost:8983/solr/collection1/select?rows=100&wt=json&indent=true&q=unf:"UNF:3:kaUC9IweEBCicuix7s5ZyQ=="') but the only semi-useful search I can figure out through the GUI is for "unf:*" like you said, to find out the number of files with a UNF in the system:

[Screenshot: harvard_dataverse_-_2015-10-02_15 47 17]
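
For direct Solr queries, the quoting used in the curl command can be sketched in Python. Inside double quotes the standard query parser treats ":", "+" and "/" literally, so only backslashes and double quotes need escaping. A separate concern is the URL itself: a raw "+" in a query string is commonly decoded as a space, so the value should be percent-encoded. Whether either point explains the GUI behavior is not established here, and `quoted_term` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def quoted_term(field: str, value: str) -> str:
    """Wrap the value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


query = quoted_term("unf", "UNF:6:x10r+Q9EK6aF/BMi+eKzGw==")
encoded = urlencode({"q": query})

print(query)    # unf:"UNF:6:x10r+Q9EK6aF/BMi+eKzGw=="
print(encoded)  # '+' becomes %2B, so it is not read back as a space
```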

This makes me think, however, that we should probably be indexing UNF at the dataset level too, since I believe datasets get UNFs in their citation when they contain at least one file that has a UNF. (I'm not exactly sure how this works but @landreev probably knows.)

Back to QA. I hope this helps.

@pdurbin pdurbin assigned kcondon and unassigned pdurbin Oct 2, 2015
@kcondon
Contributor

kcondon commented Oct 2, 2015

Thanks for following up, Phil.
I could not reproduce issue 1. I'm not sure what happened there, but it was not critical anyway.
It seems like we have ways to use the other two, so I'm good. Closing ticket.
