Implement a search reindex management command #909

Closed
seav opened this issue Nov 9, 2016 · 2 comments

seav commented Nov 9, 2016

This issue supports the implementation of the search feature (#825) and was discussed during the search call on Nov. 7/8.

There are times when the ES index for a project needs to be recreated: when the index is first populated, when it has gone out of sync with the platform data and needs to be refreshed, when a data migration occurs and new (or existing) fields need to be (re)indexed, etc. To provide this function, a reindex management command needs to be implemented.

This command accepts a project slug as its argument and performs the following:

  1. Create an empty ES index for the specified project. The project's existing index (if any) keeps functioning and serving search queries.
  2. Index all of the project's records and resource metadata by pulling the data from the database and then pushing it to the ES cluster via its bulk API.
  3. Switch the new index with the active index using ES's index aliasing feature, then drop the now inactive and outdated index.
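
To make steps 2 and 3 concrete, here is a minimal sketch of the two request bodies involved: the NDJSON payload for the `_bulk` endpoint and the actions for the `_aliases` endpoint. The helper names (`bulk_body`, `alias_swap_actions`) and the index/alias names are made up for illustration, and the bodies are only built, not sent. Depending on the ES version, the bulk action line may also need a `_type` field, so treat this as a starting point rather than the final implementation.

```python
import json


def bulk_body(index, docs):
    """Build an NDJSON payload for the ES `_bulk` endpoint.

    `docs` is an iterable of (doc_id, source_dict) pairs. Each document
    becomes two lines: an action line and the document source.
    (Hypothetical helper, not part of any existing codebase.)
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    if not lines:
        return ""
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


def alias_swap_actions(alias, new_index, old_index=None):
    """Build the body for POST `_aliases` that repoints `alias`.

    Both actions go in one request, so searches against the alias never
    see a moment with no index behind it.
    """
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


if __name__ == "__main__":
    print(bulk_body("project-x-v2", [("r1", {"name": "Parcel 1"})]), end="")
    print(json.dumps(alias_swap_actions("project-x", "project-x-v2", "project-x-v1")))
```

Once the alias points at the new index, the old one can be dropped with a plain index delete. On the very first run there is no old index, which is why `old_index` is optional.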

When all projects need to be reindexed, the idea is to reindex a small project first, test that the reindexing works as expected, reindex another project if needed, then reindex the rest of the projects via an ad hoc script that calls the command for the remaining projects sequentially.
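
The staged rollout could look roughly like the sketch below. `reindex` and `verify` are placeholder callables (for example, a wrapper around the management command and a check that the new index answers a test query); neither exists yet, and the function name is made up.

```python
def reindex_in_stages(slugs, reindex, verify, pilot_count=1):
    """Reindex projects one at a time, checking the first few by hand.

    The first `pilot_count` slugs are treated as pilots: after each one,
    `verify(slug)` must return True or the run stops before touching the
    remaining projects. Returns the slugs that were reindexed, in order.
    """
    done = []
    for position, slug in enumerate(slugs):
        reindex(slug)
        if position < pilot_count and not verify(slug):
            raise RuntimeError(f"Reindex check failed for {slug!r}; stopping.")
        done.append(slug)
    return done


if __name__ == "__main__":
    # Stand-ins for the real command and check, for demonstration only.
    reindex_in_stages(
        ["small-project", "medium-project", "large-project"],
        reindex=lambda slug: print("reindexing", slug),
        verify=lambda slug: True,
        pilot_count=2,
    )
```

Putting the smallest project first in `slugs` keeps the cost of a failed pilot low.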

@seav seav added this to the Sprint 11 milestone Nov 9, 2016

seav commented Nov 14, 2016

Question: How do we ensure that records created, updated, or deleted while the reindexing is ongoing are neither lost nor left stale? One possible solution is to stop the update processing (#908) while a project is being reindexed, leaving the queue available to receive data. A record may then be updated in the index twice (once when the reindexing picks up the updated record, and again when the queue is finally processed), but that is harmless. I'm not sure about deletions, though: deleting a record from the index a second time may result in an error.
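
One way to think about the delete case is to make the queue replay idempotent, so that deleting a record that is already gone counts as a no-op rather than an error. The sketch below simulates this with a plain dict standing in for the index; the event format (`op`, `id`, `doc`) is invented for illustration and is not the format used by #908. As far as I know, ES itself answers a delete of a missing document with a "not found" result rather than failing the whole request, so the consumer would only need to tolerate that response.

```python
def apply_event(index, event):
    """Apply one queued change to `index` (a dict of doc_id -> document).

    Returns True if the index changed. Deleting a missing document is a
    no-op instead of an error, so replaying an event twice is safe.
    """
    doc_id = event["id"]
    if event["op"] == "delete":
        if doc_id not in index:
            return False  # already gone, e.g. the reindex never saw it
        del index[doc_id]
        return True
    changed = index.get(doc_id) != event["doc"]
    index[doc_id] = event["doc"]
    return changed


def replay(index, queue):
    """Replay queued events in order; return how many changed the index."""
    return sum(1 for event in queue if apply_event(index, event))


if __name__ == "__main__":
    # State right after the reindex: "a" was already picked up in its
    # updated form, and "b" was deleted before the reindex read it.
    index = {"a": {"name": "new"}}
    queue = [
        {"op": "upsert", "id": "a", "doc": {"name": "new"}},
        {"op": "delete", "id": "b"},
    ]
    print(replay(index, queue), index)
```

With this approach the double update and the double delete both leave the index in the same state as a single application would.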

@dpalomino dpalomino modified the milestones: Sprint 12, Sprint 11 Dec 5, 2016
@amplifi amplifi removed the records label Jan 6, 2017
@amplifi amplifi assigned amplifi and unassigned seav and linzjax Jan 17, 2017

amplifi commented Jan 17, 2017

Done, using Logstash to sync the database with Elasticsearch directly.
