Right now we are walking over all datasets, which doesn't scale very well. I worry that with many datasets we will run into race condition issues.
What if the job that rebuilds the topics received these parameters (we'll worry about how to get them later):
- group name
- terms added
- terms deleted
Then we can use the search to find datasets that contain any of the added or deleted terms and run the same logic we use now to map terms to groups: if a dataset has one of the newly added terms it gets added to the group, and if it has one of the deleted ones it gets removed from it.
Does this sound good?
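Something like this, as a rough sketch of the job. It assumes the `vocab_harvest_dataset_terms` Solr field set up below, that the terms extra comes back at the top level of the search results, and hypothetical helpers for the group membership changes; none of this is an existing API:

```python
import ckan.plugins.toolkit as toolkit


def rebuild_topic_group(group_name, terms_added, terms_deleted):
    # Only fetch datasets containing at least one changed term, instead
    # of walking over every dataset. Real code would Solr-escape the terms.
    changed = list(terms_added) + list(terms_deleted)
    if not changed:
        return
    fq = 'vocab_harvest_dataset_terms:({})'.format(
        ' OR '.join('"{}"'.format(t) for t in changed))
    search = toolkit.get_action('package_search')({}, {'fq': fq, 'rows': 1000})
    for dataset in search['results']:
        terms = dataset.get('harvest_datasets_terms', '').split('\n')
        if any(t in terms for t in terms_added):
            # An added term wins if a dataset matches both lists.
            add_dataset_to_group(dataset['id'], group_name)       # hypothetical helper
        elif any(t in terms for t in terms_deleted):
            remove_dataset_from_group(dataset['id'], group_name)  # hypothetical helper
```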
We need to do two things for that:
Get added and deleted terms: on the `create` hook all terms are new, so all of them count as added. On the `update` hook we'll have to fetch the extras directly from the db. Luckily at this point the changes haven't been committed yet, so we can still query the old values directly (see the first sketch below):
`model.Session.execute('SELECT * FROM "group_extra" where group_id = :id', {'id': entity.id}).first()`
Index the dataset terms in a way that can be queried: there is a dynamic Solr field for indexing things as lists. On the `before_index` hook we need to index `vocab_harvest_dataset_terms = dataset['harvest_datasets_terms'].split('\n')` and it will get indexed. In the job we then use this field in the queries (see the second sketch below).
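For the first point, a minimal sketch of the diff, assuming (as described above) that the raw query still returns the pre-update extras while the entity object already carries the pending ones; `harvest_datasets_terms` is the extra key used elsewhere in this issue, and the function names are illustrative:

```python
import ckan.model as model


def _terms_before(group_id):
    # Raw SQL bypasses the uncommitted session changes, so this should
    # still see the old extras.
    rows = model.Session.execute(
        'SELECT key, value FROM "group_extra" WHERE group_id = :id',
        {'id': group_id})
    extras = dict(rows.fetchall())
    return set(filter(None, extras.get('harvest_datasets_terms', '').split('\n')))


def diff_terms(entity):
    before = _terms_before(entity.id)
    # entity.extras holds the new, not-yet-committed values.
    after = set(filter(None, entity.extras.get('harvest_datasets_terms', '').split('\n')))
    return after - before, before - after  # (terms_added, terms_deleted)
```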
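And for the second point, a sketch of the indexing side; `vocab_*` is a dynamic multi-valued field in CKAN's default Solr schema, and the plugin class is just illustrative:

```python
import ckan.plugins as plugins


class HarvestTermsIndexPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IPackageController, inherit=True)

    def before_index(self, pkg_dict):
        terms = pkg_dict.get('harvest_datasets_terms')
        if terms:
            # A Python list assigned to a vocab_* field is indexed as
            # multiple values, so the job can filter on individual terms.
            pkg_dict['vocab_harvest_dataset_terms'] = terms.split('\n')
        return pkg_dict
```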
That's how we were able to get revision data for UNHCR. So we can probably do the same inside this job to get `terms_before` and `terms_after` and take the difference.
Regarding the issue in general: I don't yet see performance being a factor; it should take milliseconds even across many thousands of datasets. But race conditions, yeah, that's probably something worth thinking about.