This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Search in shared files using a single index #10

Open
butonic opened this issue Feb 11, 2014 · 26 comments

@butonic
Contributor

butonic commented Feb 11, 2014

Originally opened as owncloud-archive/apps#1464


Steps to reproduce

  1. Alice shares a file with Bob containing the word 'secret'
  2. Bob searches for 'secret'
  3. He gets a search result for an occurrence in the file shared by Alice.

Expected behaviour

Users should be able to find files that have been shared with them by searching in the content.

Actual behaviour

Currently, only the user's own files are indexed.

Technical background

The Lucene index is stored per user and resides in /<userhome>/lucene_index. For performance reasons it is not encrypted. Encrypting it is possible, but would prevent using another user's index for the full text search, because we cannot access their encrypted index without their secret key.

Planned Approach

The current plan is to make the documents in the Lucene index contain the names of the users and groups allowed to access the file. Whenever a file is shared / unshared we need to update the document in the Lucene index. Unfortunately, Lucene by design only allows adding or deleting documents in the index. Initial testing indicates that query hits can be used to obtain the original document, update it with the updated list of users / groups who can access the document and then delete & reinsert the document into the index. All of this works without having to reindex the original file, which would take far too long.
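
The delete & reinsert cycle can be sketched independently of any Lucene binding. The following is a minimal Python illustration, not the search_lucene code: `ToyIndex`, `IndexedDoc` and `update_permissions` are hypothetical names, and the sketch assumes the extracted text is kept with the index entry so it can be recovered from a hit.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class IndexedDoc:
    """An immutable index entry, mimicking an add/delete-only index."""
    file_id: int
    content: str             # text extracted once, at indexing time
    allowed: FrozenSet[str]  # users and groups that may see this file


class ToyIndex:
    """Hypothetical in-memory stand-in for a Lucene index."""

    def __init__(self) -> None:
        self._docs: Dict[int, IndexedDoc] = {}  # internal doc id -> entry
        self._next_id = 0

    def add(self, doc: IndexedDoc) -> int:
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = doc
        return doc_id

    def delete(self, doc_id: int) -> None:
        del self._docs[doc_id]

    def find_by_file_id(self, file_id: int) -> List[int]:
        return [i for i, d in self._docs.items() if d.file_id == file_id]

    def get(self, doc_id: int) -> IndexedDoc:
        return self._docs[doc_id]


def update_permissions(index: ToyIndex, file_id: int,
                       allowed: FrozenSet[str]) -> None:
    """Called on share / unshare: swap the entry without re-reading the file."""
    for doc_id in index.find_by_file_id(file_id):
        old = index.get(doc_id)                   # recover the entry from the hit
        index.delete(doc_id)                      # entries cannot be edited in place
        index.add(replace(old, allowed=allowed))  # reinsert with the new list
```

The reinserted entry gets a new internal id, so anything that needs a stable reference should use the file id rather than the index's own document id.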

Maintaining the permissions like this is described in http://www.lucenetutorial.com/techniques/permission-filtering.html, and we can add the querying user to the query as a subquery, as shown in http://framework.zend.com/manual/1.12/en/zend.search.lucene.searching.html
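
As a rough illustration of the subquery idea, the sketch below builds a Lucene-style query string that requires both a text match and a permission match, and applies the same rule to plain Python dictionaries. The field names `content` and `allowed` and both function names are assumptions made for this example; escaping of special characters in user or group names is deliberately left out.

```python
from typing import Dict, Iterable, List, Set


def build_query(term: str, user: str, groups: Iterable[str] = ()) -> str:
    """Require the search term AND at least one matching principal."""
    principals = sorted({user, *groups})
    acl = " ".join(f"allowed:{p}" for p in principals)
    return f"+content:{term} +({acl})"


def search(docs: List[Dict], term: str, user: str,
           groups: Iterable[str] = ()) -> List[str]:
    """Apply the same rule in memory: text match and permission match."""
    principals: Set[str] = {user, *groups}
    hits = []
    for doc in docs:
        matches_text = term.lower() in doc["content"].lower()
        is_permitted = bool(principals & set(doc["allowed"]))
        if matches_text and is_permitted:
            hits.append(doc["path"])
    return hits


docs = [
    {"path": "alice/plan.txt", "content": "the secret plan", "allowed": ["alice", "bob"]},
    {"path": "alice/diary.txt", "content": "my secret diary", "allowed": ["alice"]},
]
print(build_query("secret", "bob", ["staff"]))
print(search(docs, "secret", "bob"))
```

With this data Bob only sees the file Alice shared with him, even though both files contain the word.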

Further thoughts

When we add user / group permissions we could create a single global index and use that instead of querying each individual user index. Whether that will improve performance (because we only need to access one index) or degrade it (because the index might grow very large) remains to be tested.

Using a single index simplifies the whole architecture and is the way to go.

@butonic
Contributor Author

butonic commented Jul 22, 2014

maybe @craigpg has some input on how to approach searching inside shared files

@butonic changed the title from "Roadmap Full text Search in shares" to "Search in shared files" on Jul 22, 2014
@alexphelps

This is exactly what we're looking for. Is this planned?

@butonic changed the title from "Search in shared files" to "Search in shared files using a single index" on Mar 27, 2015
@VicDeo
Contributor

VicDeo commented May 4, 2015

would be nice to combine with #40

@AmiZya

AmiZya commented Oct 27, 2015

Any update here please ?

@butonic butonic added this to the 9.0-current milestone Oct 28, 2015
@butonic
Contributor Author

butonic commented Oct 28, 2015

I will have a look for OC9.

@NacreData

I have a client who is interested in this work and we may be able to provide some programming time or other resources. We'd prefer to collaborate on work in progress and not duplicate work.

@ghost

ghost commented Feb 17, 2016

@NacreData I suspect there is no duplicate work in place right now. Would you be able to provide some time for this?

@NacreData

Yes, I have some time allocated for it over the next month or so, and I will probably be starting next week. Do you have any ideas, direction, or anything else to tell me that would make my efforts more successful?

devin

contact info: http://nacredata.com/devin


@ghost

ghost commented Feb 23, 2016

@PVince81

@ghost ghost modified the milestones: 9.0.1-next-maintenance, 9.0-current Feb 23, 2016
@PVince81
Contributor

@butonic @VicDeo any hints ?

@NacreData

I am looking today at how much more (compared to my hack described above) would be involved in doing it the "right" way described in "Planned Approach" at the top. It would help greatly to have the code used for "Initial testing indicates that query hits can be used to obtain the original document, update it with the updated list of users / groups who can access the document and then delete & reinsert the document into the index." Is that possible?

@ghost

ghost commented Mar 7, 2016

@butonic could you provide @NacreData that code?

@scolebrook

@NacreData Don't assume that the user's home directory is named after their username. That's the default but is not guaranteed. In an AD-backed system the default internal username will be the value of the objectGUID attribute, a long string of letters and numbers. The user_ldap app allows the home directory to be named after a different attribute, as this is often much more convenient and certainly easier to type.

So the only way to find out the path of the home directory for a given user object is to ask it. \OC::$server->getUserManager()->get($uid)->getHome() is what you're looking for. If the backend can provide the home path, like user_ldap does, you get that returned. Otherwise you get a constructed path of datadirectory.'/'.$uid.
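
The fallback described above can be written down as a small sketch. It is a Python illustration of the rule only, not ownCloud code: `resolve_home` and its parameters are hypothetical names, and `backend_home` stands for whatever the user backend reports (or nothing if it reports no home).

```python
from typing import Optional


def resolve_home(uid: str, data_directory: str,
                 backend_home: Optional[str] = None) -> str:
    """Prefer the backend's home path; otherwise build datadirectory/uid."""
    if backend_home:
        return backend_home
    return data_directory + "/" + uid


# A backend that names homes after a friendlier attribute than the internal uid.
print(resolve_home("9f1c2e7a", "/var/oc/data", backend_home="/var/oc/data/jdoe"))
# A backend that reports no home: fall back to the constructed path.
print(resolve_home("alice", "/var/oc/data"))
```

Code that derives the index location should go through a lookup like this rather than concatenating the data directory and the username itself.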

There are lots of bugs in lots of apps because of this incorrect file system assumption.

@NacreData

The code I have at https://github.com/NacreData/search_lucene/tree/shared-files is now working for me to search across shared files. This does the work to move the index to a combined/site-wide index and then use user-specific file attributes to filter the results. I have submitted a pull request #120. Much of what I've done could probably be done in a better/more maintainable way by folks with more OC experience - I hope some folks will help improve this so it can be committed and maintained. Thanks.

@ghost ghost assigned georgehrke and unassigned butonic Apr 19, 2016
@PVince81
Contributor

@georgehrke mind having a look at the above comments ?

@PVince81
Contributor

@georgehrke ?

@georgehrke
Contributor

I'll take a look Monday when I'm back from vacation

Please excuse my brevity and typos.
Sent from my mobile


@ghost ghost modified the milestones: 9.0.3-next-maintenance, 9.0.2-current-maintenance May 1, 2016
@PVince81 PVince81 modified the milestones: 9.0.4-current-maintenance, 9.0.3 Jun 30, 2016
@PVince81 PVince81 modified the milestones: 9.1.1, 9.0.4 Jul 18, 2016
@feuse8

feuse8 commented Jul 25, 2016

Any help needed on this issue? Can you say anything about the current status?

@NacreData

@feuse8 help would be awesome! As far as I can tell, the best place to find the current status is the last two comments in the pull-request thread here: #121. I am happy to explain what I've done and what I'm thinking to help move this forward, and I'll be writing more code if/when possible over the next month or two.

@AsimAJ33

any update on this?

@NacreData

NacreData commented May 11, 2017 via email

@PVince81 PVince81 modified the milestones: backlog, 10.0 May 17, 2017