Optimize re-assigning datasets to topics #84

Open
amercader opened this issue Oct 2, 2018 · 1 comment
Labels
enhancement New feature or request

Comments

@amercader
Member

Right now we are walking over all datasets, which doesn't scale very well. I worry that if there are many datasets we will run into race condition issues.

What if the job that rebuilds the topics received these parameters (we'll worry about how to get them later):

  • group name
  • terms added
  • terms deleted

Then we can use the search to find datasets that contain any of the added or deleted terms and run the same logic used now to map terms to groups: if a dataset has one of the newly added terms it gets added to the group; if it has one of the deleted ones it gets removed from it.

Does this sound good?
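The mapping step described above could be sketched as a small pure function (the function name and the 'add'/'remove' return convention are hypothetical, not from the extension):

```python
def membership_changes(dataset_terms, terms_added, terms_deleted):
    """Decide whether a dataset should be added to or removed from a group.

    dataset_terms: the terms currently on the dataset
    terms_added / terms_deleted: the terms changed on the group
    Returns 'add', 'remove', or None (no change needed).
    """
    terms = set(dataset_terms)
    if terms & set(terms_added):
        # The dataset has at least one of the newly added terms
        return 'add'
    if terms & set(terms_deleted):
        # The dataset only matches deleted terms
        return 'remove'
    return None
```

Note that a dataset matching both an added and a deleted term stays in the group here; that edge case would need to follow whatever the current mapping logic does.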

We need to do two things for that:

  • Get added and deleted terms: On the create hook all terms are new, so all of them count as added. On the update hook we'll have to fetch the extras directly from the db. Luckily at this point the changes haven't been committed yet, so we can just query them directly:

    model.Session.execute('SELECT * FROM "group_extra" where group_id = :id', {'id': entity.id}).first()
    
  • Index the dataset terms in a way that can be queried: there is a dynamic Solr field to index things as lists. On the before_index hook we need to index vocab_harvest_dataset_terms = dataset['harvest_datasets_terms'].split('\n') and it will get indexed. In the job, we then use this field in the queries.
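The before_index step could look roughly like this (a sketch; the hook and field names are taken from the description above, and the empty-line filtering is an assumption):

```python
def before_index(dataset_dict):
    """Split the newline-separated terms into a list so the dynamic
    Solr field indexes them as multiple values."""
    terms = dataset_dict.get('harvest_datasets_terms') or ''
    # Drop empty entries left by trailing newlines
    dataset_dict['vocab_harvest_dataset_terms'] = [
        t for t in terms.split('\n') if t
    ]
    return dataset_dict
```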

@roll
Contributor

roll commented Oct 3, 2018

@amercader
Regarding the implementation, off the top of my head, there is also:
https://github.com/okfn/ckanext-unhcr/blob/0f677a61041f9bbab4b5ccf5440023b796f22b81/ckanext/unhcr/jobs.py#L141-L164

That's how we were able to get revision data for UNHCR. So probably we can do the same inside this job to get terms_before and terms_after and take the difference.
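Once terms_before and terms_after are available, taking the difference is a one-liner with sets (a sketch; the function name is hypothetical):

```python
def diff_terms(terms_before, terms_after):
    """Return (added, deleted) term sets between two revisions of a group."""
    before, after = set(terms_before), set(terms_after)
    return after - before, before - after
```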

Regarding the issue in general - I don't yet see performance being a factor. It should be milliseconds on many thousands of datasets. But race conditions - yeah, probably something to think about.

@amercader amercader added enhancement New feature or request and removed enhancement New feature or request labels Nov 15, 2018