Deadlock error while running test suite #121
Could you give us more information, for example the output of
I can't reproduce with the latest master. What do you mean by "running the tests individually"?
I can reproduce this issue. Sometimes one test is failing, sometimes two or more.
This looks like something to do with unknown users.
@mujx By individual tests I meant:
I tried two other Linux machines (Debian 8, Arch) and still can't reproduce the issue. @farodin91 Turn debug logging on and see if there is any deadlock. What other tests fail? @anuragsoni Can you run
@mujx I will try that later today. Also, I do have the debug logs from the test run that failed for me. I just attached a small snippet above, but I can attach the full file later today when I get home, if you think that will help. I was able to reproduce this on both the docker image and on a PostgreSQL instance installed on my own machine (Ubuntu 16.10, postgresql-9.5).
I think I reproduced the issue. I increased the number of test threads. The problem seems to happen when registering a new user with this helper function. The failures are random, but they all have the same source.
This never happened before; it must be one of the last 5 commits.
@mujx Using RUST_TEST_THREADS=1 works, and that makes sense I guess, since every test will be run sequentially.
We don't actually want to restrict tests to one thread, though. Each test case should happen in a transaction so they don't step on each other like this. |
That is true. There shouldn't be deadlocks while running tests.
So I have been trying to figure out the reasons for this deadlock in tests, but I was unable to get better logs than the ones I attached to the issue above. I tried to replicate the same scenario in another project where I use PostgreSQL, and this is what I can tell so far.

Minimal example to reproduce:

Transaction 1 executes:

```sql
INSERT INTO USERS (id, password_hash, active) VALUES ('alice', 'foo', false);
```

This acquires a row-level lock on the row with user id 'alice'.

Transaction 2 executes:

```sql
INSERT INTO USERS (id, password_hash, active) VALUES ('bob', 'foo', false);
INSERT INTO USERS (id, password_hash, active) VALUES ('alice', 'foo', false);
```

The first statement will successfully acquire a lock and will be executed, but the second statement blocks, waiting for Transaction 1's lock on 'alice'.

Then Transaction 1 executes:

```sql
INSERT INTO USERS (id, password_hash, active) VALUES ('bob', 'foo', false);
```

Transaction 1 attempts to acquire a row-level lock for 'bob', but that lock is already held by Transaction 2. Each transaction is now waiting for the other, so neither can proceed.

The actual error in PostgreSQL will look something like the `deadlock detected` error shown in the logs above.
Thanks very much for digging into this. I'm not super strong with databases—could you explain exactly how the changes in #147 fix the problem? My understanding was that using transactions would isolate the SQL statements in each test from each other, but clearly they still create locks that can affect each other. How does reordering the lines in those tests avoid this? I think I'm just missing the rule of thumb to follow on how to spot this and avoid it in the future. Does this careful ordering also prevent deadlocks from happening at runtime in production? It seems like what happened here with the tests could just as easily happen with concurrent requests on the server. |
I found the following documentation for deadlocks arising because of row-level locks in Postgres: https://www.postgresql.org/docs/9.4/static/explicit-locking.html

Like you, I'm not super strong with the details of database systems, but this problem reminded me of an issue we had at my workplace a while back. In short, my understanding is that the goal is to avoid cycles in concurrent code. In my mind I translate a transaction to a thread that is trying to lock resources. If the rows are locked in a consistent order, i.e. we always try to lock row A before row B, then we will avoid cycles. The transactions may block, but they won't deadlock.

The re-ordering in the tests ensured that the insert calls to the users table are always done in a consistent order of the primary key (the user id in this case; ascending or descending both work, it just needs to be deterministic). In the example above, if Transaction 2 locked the rows in the same order as Transaction 1:

```sql
INSERT INTO USERS (id, password_hash, active) VALUES ('alice', 'foo', false);
INSERT INTO USERS (id, password_hash, active) VALUES ('bob', 'foo', false);
```

it would see that row 'alice' is locked by Transaction 1, wait for it to finish, and only then acquire a lock for row 'bob'. As a result, Transaction 1 can continue its run, lock row 'bob', do its work, and then finish. This avoids deadlock by not allowing a cyclic dependency to occur between the two transactions.

I suppose yes, this scenario can happen in actual requests on the server.

Another article that I found useful:
Also, when a deadlock is detected the database system will abort the transaction, so one more approach could be to retry any transaction that was aborted because of a deadlock.
Would it also work to create a dedicated ID column for the users table? Would that prevent two DB connections trying to create a user with the same name at the same time from locking the same row? Edit: And would that even matter, since we would still need a uniqueness constraint on the Matrix user ID? Or is the table's primary key the only concern for this type of locking? |
The database logs can be accessed through

I agree with @anuragsoni that we should be able to retry some of the transactions. I tried Jimmy's solution with the extra id, but there are still failures when setting the unique constraint on user_ids. I will try again using a composite primary key this time. The changes are here.

Would it make sense to change the test helper in a way that it registers random strings as user_ids?

```rust
let alice = test.register_new_user();
alice.token => "regular token";
alice.id => "@qWIoYYgNCjwTpX:ruma.test"
```

edit: Composite PK seems to work. @aurabindo, @anuragsoni please try to verify.
@mujx @jimmycuadra The error when setting a unique constraint on user_ids seems reasonable, as that will also try to acquire a row-level lock. Composite keys should work when using the SERIAL type, as shown in the link shared by @mujx. The other tables like events, profiles, and room_aliases could potentially have the same problem too.

On another note, from what I've read in the Matrix spec so far, when running in a production-like environment it wouldn't be normal to have a single transaction trying to create/update/delete multiple resources. In our tests we were creating more than one user as part of the same transaction, but in actual usage I'm guessing there would be just one resource per create/update/delete action. If that's true, then there wouldn't be deadlocks as such. And in a case where there is a deadlock nonetheless, we should be able to retry the transaction.

As for the tests, we can control the test environment to not cause problems like this. We could use random strings as usernames, as suggested, or create resources in a specific order, etc.
Sounds like randomized user IDs would avoid the deadlock problem, but it also might be confusing to have a user referred to as "alice" whose user localpart is not actually "alice". I dunno, maybe if the helper method didn't take a localpart parameter like you show above, it would be clear enough reading the code. Technically it still wouldn't be deterministic, but the chances of two identical IDs being created in two concurrent tests is very slim. |
@jimmycuadra for the tests just following a simple rule like "create names in alphabetical order" might be enough. That will ensure no deadlocks in tests. Adding random strings to the user_ids might be confusing (and more code to maintain 😁) |
I'm in favor of the random strings, because that wouldn't change the test logic or add extra mental overhead when trying to write new tests (with the ordering approach, new contributors would have to be aware of the ordering rule, and it would also need to be documented). The composite primary key solution seems unnecessary for the users table. We shouldn't really care if the alice id doesn't contain "alice", because the logic is independent of the names and the wrapper structs will keep the code readable. We can reuse the code from the access tokens to generate them.
I reread this comment and I think I finally understand how changing the order of operations prevents the deadlock. I thought I'd write it out here just to help myself and anyone who comes across this in the future and wants to learn about this problem. My confusion was about the point at which locks are released. Anurag says (as does the Postgres documentation) that this sequence of events:
will result in a deadlock because T1 will be waiting for row B to unlock, which will never happen because T2 is waiting for row A to unlock, which is held by T1. This implies that once a row is locked by a transaction, it stays locked until the transaction is rolled back or committed. Now let's look at two transactions that use the same order:
Assuming the first two locks happen concurrently and T1 happens to win the lock, T2 now cannot proceed until that lock is released. However, T1 can continue on its merry way because row B is not locked—T2 is still stopped and waiting at its previous step. T1 can continue to transaction commit or roll back, at which point the locks on rows A and B will be released, and T2 can now proceed unencumbered. |
Sounds like we all agree on randomized user IDs from the helper method. I believe this will require updating some tests where user IDs are hardcoded into JSON that we are asserting against. I'm inclined to merge #147 for the quick fix, but leave this open until we make the change to the helper method in another PR. |
@jimmycuadra @mujx randomized user IDs sound good then. @jimmycuadra #147 can indeed be a stopgap to unblock any other PR that might be having the deadlock problems. @mujx are you going to work on the test helpers? If not i'm okay to work on that as well. |
@anuragsoni Feel free to start working on it. |
@mujx Okay, I will pick it up. Also, I was thinking about the 'composite key' approach you mentioned above. That might avoid deadlocks in insert actions, but wouldn't it still lead to deadlocks for

@jimmycuadra @mujx Maybe we need another issue to track the change required to retry a transaction in case a deadlock is encountered in a normal flow in production? I don't believe we handle retries currently?
@anuragsoni I opened #148 to discuss retries. |
I was following the setup guide from the README (using docker and docker-compose). When running the full test suite

```
./scripts/cargo test
```

I sometimes see random failures in the following tests. Turning on the debug logs I can see the following:
```
DEBUG:ruma::error: Converting to ApiError from: DatabaseError(__Unknown, "deadlock detected")
DEBUG:ruma::error: Converting to ApiError from: UserReturnedError(ApiError { errcode: Unknown, error: "An unknown server-side error occurred." })
test api::r0::profile::tests::put_avatar_url_unauthorized ... FAILED
DEBUG:ruma::test: Username: {"username": "alice", "password": "secret"}
DEBUG:ruma::error: Converting to ApiError from: DatabaseError(__Unknown, "deadlock detected")
DEBUG:ruma::error: Converting to ApiError from: UserReturnedError(ApiError { errcode: Unknown, error: "An unknown server-side error occurred." })
test api::r0::room_creation::tests::with_invited_users ... FAILED
DEBUG:ruma::test: Username: {"username": "bob", "password": "secret"}
DEBUG:ruma::error: Converting to ApiError from: DatabaseError(__Unknown, "deadlock detected")
DEBUG:ruma::error: Converting to ApiError from: UserReturnedError(ApiError { errcode: Unknown, error: "An unknown server-side error occurred." })
test api::r0::profile::tests::put_displayname_unauthorized ... FAILED
```
Note: Running the tests individually succeeds.
Steps to reproduce: