Optimize re-assigning datasets to topics #84

Open
amercader opened this issue Oct 2, 2018 · 1 comment
Labels
enhancement New feature or request

Comments

@amercader
Member

Right now we are walking over all datasets, which doesn't scale very well. I worry that if there are many datasets we will run into race condition issues.

What if the job that rebuilds the topics received these parameters (we'll worry about how to get them later):

  • group name
  • terms added
  • terms deleted

Then we can use the search to find datasets that contain any of the added or deleted terms and run the same logic used now to map terms to groups: if a dataset has one of the newly added terms it gets added to the group; if it has one of the deleted ones it gets removed from it.

Does this sound good?
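The mapping step described above could be sketched as a small pure function (the function name and the 'add'/'remove' return convention are hypothetical, not from the extension):

```python
def membership_changes(dataset_terms, terms_added, terms_deleted):
    """Decide whether a dataset should be added to or removed from a group.

    dataset_terms: the terms currently on the dataset
    terms_added / terms_deleted: the terms changed on the group
    Returns 'add', 'remove', or None (no change needed).
    """
    terms = set(dataset_terms)
    if terms & set(terms_added):
        # The dataset has at least one of the newly added terms
        return 'add'
    if terms & set(terms_deleted):
        # The dataset only matches deleted terms
        return 'remove'
    return None
```

Note that a dataset matching both an added and a deleted term stays in the group here; that edge case would need to follow whatever the current mapping logic does.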

We need to do two things for that:

  • Get added and deleted terms: On the create hook all terms are new, so all of them count as added. On the update hook we'll have to fetch the extras directly from the db. Luckily at this point the changes haven't been committed yet, so we can just query them directly:

    model.Session.execute('SELECT * FROM "group_extra" where group_id = :id', {'id': entity.id}).first()
    
  • Index the dataset terms in a way that can be queried: there is a dynamic Solr field to index things as lists. On the before_index hook we need to index vocab_harvest_dataset_terms = dataset['harvest_datasets_terms'].split('\n') and it will get indexed. In the job, we then use this field in the queries.
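The before_index step could look roughly like this (a sketch; the hook and field names are taken from the description above, and the empty-line filtering is an assumption):

```python
def before_index(dataset_dict):
    """Split the newline-separated terms into a list so the dynamic
    Solr field indexes them as multiple values."""
    terms = dataset_dict.get('harvest_datasets_terms') or ''
    # Drop empty entries left by trailing newlines
    dataset_dict['vocab_harvest_dataset_terms'] = [
        t for t in terms.split('\n') if t
    ]
    return dataset_dict
```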

@roll
Contributor

roll commented Oct 3, 2018

@amercader
Regarding the implementation, off the top of my head, there is also:
https://github.com/okfn/ckanext-unhcr/blob/0f677a61041f9bbab4b5ccf5440023b796f22b81/ckanext/unhcr/jobs.py#L141-L164

That's how we were able to get revision data for UNHCR. So probably we can do the same inside this job to get terms_before and terms_after and take the difference.
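Once terms_before and terms_after are available, taking the difference is a one-liner with sets (a sketch; the function name is hypothetical):

```python
def diff_terms(terms_before, terms_after):
    """Return (added, deleted) term sets between two revisions of a group."""
    before, after = set(terms_before), set(terms_after)
    return after - before, before - after
```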

Regarding the issue in general - I don't yet see performance being a factor. It should be milliseconds on many thousands of datasets. But race conditions - yeah, probably something to think about.

@amercader amercader added enhancement New feature or request and removed enhancement New feature or request labels Nov 15, 2018