Use DAO layer instead of raw 'SELECT *' during migration #3532
Hi, thank you for the patch. It seems necessary, and could ship in our upcoming 0.14 release (for which the merge window closes this Friday). However, not using the DAO in the migrations is actually a deliberate decision.
The DAO has no such retry mechanism. The driver does. The DAO should not be used in migrations because schemas evolve over time, which results in other errors. Unfortunately, this guideline is not always respected. Another reason is that the migrations elect a single coordinator to run all of the migrations, precisely to avoid the issue you are referring to. Since the DAO load-balances between nodes, it increases the chances of hitting a node to which the schema has not been propagated yet. In this particular case, ensuring that the schema reaches consensus is done, but only after all of the migrations have finished running, see: https://github.com/Kong/kong/blob/master/kong/dao/factory.lua#L458-L464 In this case, we should probably wait for schema consensus before making the subsequent query (but this query should still be executed without using the DAO).
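As a rough illustration of the suggestion above (wait for schema consensus first, then run the raw query through the driver rather than the DAO), here is a minimal, self-contained sketch. The `db:query` call mirrors the one quoted in this PR; `wait_for_schema_consensus`, `new_stub_db` and `select_all_apis` are hypothetical names used only for this example and are not confirmed to match Kong's actual API.

```lua
-- Hypothetical stand-in for the Cassandra db object available in a migration.
-- Only `query` appears in this PR; `wait_for_schema_consensus` is an assumed name.
local function new_stub_db(consensus_ok)
  local db = { log = {} }

  function db:wait_for_schema_consensus()
    table.insert(self.log, "consensus")
    if consensus_ok then
      return true
    end
    return nil, "timeout"
  end

  function db:query(cql)
    table.insert(self.log, "query")
    -- a real driver would return the rows of the `apis` table here
    return { { name = "example-api" } }
  end

  return db
end

-- Wait for every node to agree on the schema, then issue the raw query
-- directly through the driver (the DAO is not involved).
local function select_all_apis(db)
  local ok, err = db:wait_for_schema_consensus()
  if not ok then
    return nil, "failed to wait for schema consensus: " .. tostring(err)
  end
  return db:query([[ SELECT * FROM apis; ]])
end

local db = new_stub_db(true)
local rows, err = select_all_apis(db)

local failing_db = new_stub_db(false)
local no_rows, wait_err = select_all_apis(failing_db)
```

With the stub above, the query is only issued once consensus has been reported; when the wait fails, the function returns an error and never touches the `apis` table.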
We will need you to rebase this branch on top of our current master branch. Your PR should only consist of a single commit. See the part of CONTRIBUTING.md about this topic.
Hi @thibaultcha, I understand the risk you mention in using the DAO layer... but the various attempts we made with our 4-node cluster (migration launched with a non-existing kong keyspace => db is empty) highlight that:
So, even if we can understand your analysis, the test results we have are the opposite of the analysis! Oops :(
For sure, we tried the
Last remark:
A DBA told me that your explanation definitely makes sense for potential consistency issues related to data updates, but maybe not for schema updates, because schema propagation appears to be done using the maximum consistency level (and we are using the LOCAL_QUORUM consistency). Any idea on that? Thanks for your help!
@pamiel Yes, I understand that the approach you used here fixes the issue, but it isn't the ideal solution, because the DAO should not be used in migrations (for the several aforementioned reasons). It fixes the issue because
Yes, this is what I mean.
No, this goes directly to the driver and skips the DAO.
Well, as per the Cassandra error observed in #3392, and because of the nature of the fix (waiting for schema consensus), this is obviously the culprit here.
OK, understood. Does this look OK to you? Thanks for your help.
@pamiel Yes, this looks good. Could you please rebase your branch on top of master? We would need your PR to be made of one single commit implementing the change, as stated previously. We do not have enough time to clean up the git history for you, sorry. Thanks!
After a couple of hours fighting against git, I think I properly rebased on master and squashed all the commits into a single commit.
@pamiel Yes, that is great, nicely done.
Manually merged to master. Thank you!
Summary
This PR directly addresses issue #3392 (and indirectly addresses some posts of issue #3005)
Step `2017-01-24-132600_upstream_timeouts_2` of the migration process starts with a selection of all rows of the apis table (https://github.com/Kong/kong/blob/master/kong/dao/migrations/cassandra.lua#L417). However:
- it runs right after step `2017-01-24-132600_upstream_timeouts`, which just altered the same `apis` table
- it uses a raw `SELECT * FROM apis` query, so it does not use the DAO layer and therefore benefits from neither the "retry on read failure" policy nor the "wait schema consensus" provided by this DAO layer.

Working on the issue with a Cassandra DBA, the analysis is that:
- the `SELECT *` is run, generating a read error (most probably due to an inconsistency in the schema of the queried node)... and unfortunately, no retry is implemented on this read error, as this is a raw query.
- with `local rows, err = dao.apis:find_all()` instead of `local rows, err = dao.db:query([[ SELECT * FROM apis; ]])`, the schema consensus is obtained and, even if a read error occurs, a retry is performed by the DAO layer.

Full changelog
- Use `local rows, err = dao.apis:find_all()` instead of `local rows, err = dao.db:query([[ SELECT * FROM apis; ]])` in `kong/dao/migrations/cassandra.lua` (line 417)

Sorry, it is not really clear on my fork: it mentions some "old" commits I already pushed a couple of months ago... but it says that the only updated file is `kong/dao/migrations/cassandra.lua` (which is correct).

Can you cross-check that, from your side, the updates included in this PR are limited to this file only? Thanks a lot!
Issues resolved
Fix #3392