Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inserted might be None #1525

Closed
peterbe opened this issue Mar 1, 2018 · 15 comments
Closed

inserted might be None #1525

peterbe opened this issue Mar 1, 2018 · 15 comments
Labels

Comments

@peterbe
Copy link
Contributor

peterbe commented Mar 1, 2018

See

inserted = result.fetchone()
if not inserted['inserted']:

If inserted becomes None you get a TypeError trying to call inserted['inserted'].

This has actually happened in a real production instance:

Exception(most recent call first)

TypeError: 'NoneType' object is not subscriptable
  File "kinto/core/utils.py", line 414, in follow_subrequest
    return request.invoke_subrequest(subrequest, **kwargs), subrequest
  File "pyramid/router.py", line 195, in invoke_subrequest
    return self.invoke_request(request, _use_tweens=use_tweens)
  File "pyramid/router.py", line 249, in invoke_request
    response = handle_request(request)
  File "pyramid/router.py", line 156, in handle_request
    view_name
  File "pyramid/view.py", line 642, in _call_view
    response = view_callable(context, request)
  File "pyramid/config/views.py", line 181, in __call__
    return view(context, request)
  File "pyramid/viewderivers.py", line 390, in attr_view
    return view(context, request)
  File "pyramid/viewderivers.py", line 368, in predicate_wrapper
    return view(context, request)
  File "pyramid/viewderivers.py", line 301, in secured_view
    return view(context, request)
  File "pyramid/viewderivers.py", line 439, in rendered_view
    result = view(context, request)
  File "pyramid/viewderivers.py", line 148, in _requestonly_view
    response = view(request)
  File "cornice/service.py", line 494, in wrapper
    response = view_()
  File "kinto/core/resource/__init__.py", line 468, in put
    record = self.model.create_record(new_record)
  File "kinto/core/resource/model.py", line 315, in create_record
    record = super().create_record(record, parent_id)
  File "kinto/core/resource/model.py", line 191, in create_record
    auth=self.auth)
  File "statsd/client.py", line 37, in _wrapped
    return_value = f(*args, **kwargs)
  File "kinto/core/storage/postgresql/__init__.py", line 308, in create
    if not inserted['inserted']:

For those with access see this Sentry entry.

@peterbe
Copy link
Contributor Author

peterbe commented Mar 1, 2018

One immediate thought that springs to mind is that the query uses UNION ALL for two queries. (The ALL just helps to avoid duplicates across the queries). But the python code uses fetchone(). There's something illogical there. "Give be A and B. Result is the first one."

@peterbe
Copy link
Contributor Author

peterbe commented Mar 1, 2018

Sorry, the "UNION ALL" does the opposite.

"This PostgreSQL UNION ALL operator would return a category_id multiple times in your result set if the category_id appeared in both the products and categories table. The PostgreSQL UNION ALL operator does not remove duplicates. If you wish to remove duplicates, try using the PostgreSQL UNION operator."
-- https://www.techonthenet.com/postgresql/union_all.php

"[UNION] eliminates duplicate rows from its result, in the same way as DISTINCT, unless UNION ALL is used."
-- https://www.postgresql.org/docs/8.3/static/queries-union.html

@leplatrem leplatrem added the bug label Mar 1, 2018
@peterbe
Copy link
Contributor Author

peterbe commented Mar 13, 2018

@peterbe
Copy link
Contributor Author

peterbe commented Mar 19, 2018

More examples for those who can see this: https://sentry.prod.mozaws.net/operations/kinto-testpilot-prod/issues/3321530/

@leplatrem
Copy link
Contributor

inserted can be None when:

  • none of the 2 requests in the union returns a record
  • the cursor is consumed twice

I cannot reproduce, but I don't think UNION/UNION ALL is the problem here

@peterbe
Copy link
Contributor Author

peterbe commented Apr 16, 2018

When you say "I cannot reproduce" did you mean you wrote a unit test (functional with real Postgres) or did you mean you tried via HTTP?

I studied the SQL and tested it locally and it looks correct. If the record doesn't exist, it works. If the record existed but was deleted, it works (and makes it not deleted). If it already existed and wasn't delete it still works.

none of the 2 requests in the union returns a record

How could that happen?? The only time the first part of the UNION would return 0 records is if it already exists (by id, parent_id and collection_id) and the update did not do anything. If that is the case, the second part would find it and return at least 1 record. But that makes no sense.

I'm starting to suspect that it lies somewhere else outside this SQL query. I don't actually know how transactions in kinto are done but one thing that makes me nervous is that the SQL query is a write operation (sometimes) but there's no commit on the transaction.

@peterbe
Copy link
Contributor Author

peterbe commented Apr 16, 2018

I've tried really hard to reproduce this and failing.

I think the current solution isn't thread-safe but it's hard to prove because of the way transactions are done in kinto. It's not immediately clear where the commit happens after the kinto_client.create() record happens.

The SQL query can do three things:

  1. Do nothing, just return the row unchanged (inserted=FALSE)
  2. Insert the record (inserted=TRUE)
  3. Update the existing record and change deleted=TRUE (inserted=TRUE)

If the first thing happens, it will raise a UnicityError. (I guess that's to inform the client that you tried to create a record that already existed!)

But it's not clear where the COMMIT happens. I suspect what might happens is that record is changing during an insert. But it's not clear either.

One interesting thing to remember is that we've seen a lot of these errors in buildhub. In buildhub, we never delete things. I.e. we never set deleted=TRUE on any records. So what happens there is probably concurrent calls to kinto_client.create(same, arguments, in, concurrent, threads). But the strange thing is that if you try to call this SQL with the same it should raise an exception like:

duplicate key value violates unique constraint "idx_records_parent_id_collection_id_last_modified"
DETAIL:  Key (parent_id, collection_id, last_modified)=(/buckets/build-hub/collections/releases, record, 2018-04-16 09:39:00.004) already exists.

It concerns me what this can happen. Perhaps we can work together to audit how the transactions are committed. I really think there are better patterns to insert something

@leplatrem
Copy link
Contributor

It concerns me what this can happen.

Me too! I spent quite a while on this today and could not come up with a clear path that would lead to this kind of situation either. According to Sentry it happened like 100 times on Testpilot and almost 500 times on buildhub. This is far from being paranormal!

Perhaps we can work together to audit how the transactions are committed

The transactions are bound to the request lifecycle using pyramid_tm and this SQLAlchemy binding:

# Plug with Pyramid transaction manager
options['extension'] = ZopeTransactionExtension()
session_factory = scoped_session(sessionmaker(bind=engine, **options))

Commit or rollbacks are done just before the response is served.

In buildhub we use kinto-elasticsearch which introduces a round-trip to an ES server in the response cycle. It could make the transactions last a lot longer than normal.

I really think there are better patterns to insert something

:)

@peterbe
Copy link
Contributor Author

peterbe commented Apr 16, 2018

Some progress!
I managed to demonstrate the problem that Erwin talks about in: https://stackoverflow.com/a/15950324/205832

See (but don't judge me how ugly code is): https://gist.github.com/peterbe/883da6f65e315383ca83b7ade5e94e42

When you run that you get this:

▶ python dummy.py
! 0
! 1
commit 1
1 page data = ('firefox_beta_58-0b15rc1_linux-i686_cs2', {}, 1523861592005, True)
commit 0
0 page data = None

...every time!

What it simulates is two concurrent threads. If you merge their timelines you get this:

  1. (Thread 1) BEGIN
  2. (Thread 2) BEGIN
  3. (Thread 1) WITH create_record...
  4. (Thread 2) WITH create_record...
  5. (Thread 2) COMMIT

Basically, if you extrapolate that to the Python/kinto layer I think this happens:

# Thread 1
kinto_client.create('myid', 'mycollection', 'mybucket', '2018-04-16 06:53:12.004')

# Thread 2
kinto_client.create('myid', 'mycollection', 'mybucket', '2018-04-16 06:53:12.005')

The combination (parent_id, collection_id, last_modified) is different in both calls so you don't get trapped by the idx_records_parent_id_collection_id_last_modified uniqueness index on the inserts.

Please check if I've understood that right. If that's the case, the right way to write a unit test would be something like these pseudo kinto_client.create(...) above.

@peterbe
Copy link
Contributor Author

peterbe commented Apr 16, 2018

I really think there are better patterns to insert something

Erwin suggested you use a loop instead. I don't really yet know what that means for us.

Either way, we could perhaps instead reject on a Python level if inserted = result.fetchone() becomes None. If that happens, we know you called create() with the same ID (and parent_id and collection_id) but with different last_modified). It could push the problem onto the calling client.

The reason that's probably a good idea is that one of the clients did get an OK so it would not make sense to update the record for that other client.

Meaning, if it's true that two concurrent clients attempt the same .create() (with possibly different JSON!) it would be weird to "correct" the second client's call.

@leplatrem
Copy link
Contributor

Wao thanks! Excellent :)

Meaning, if it's true that two concurrent clients attempt the same .create() (with possibly different JSON!) it would be weird to "correct" the second client's call.

Definitely, if we know that when inserted is None we are in a conflict situation, we can raise a python exception. The client should probably get a 409 Conflict (that will be retried)

@peterbe
Copy link
Contributor Author

peterbe commented Apr 24, 2018

The easy fix would be to pursue my unfinished PR and just deal with the fact that the inserted becomes None without any locks.

The upsert prototype tries to break up the task by attempting to do an update (e.g. undoing the logical delete) with an UPDATE (using row-level locking), then it tries to do an insert and watch out for IntegrityErrors. That makes sure we can control two concurrent attempts to call create() with the same parameters.

Looking for thoughts and feedback on this.

@Natim
Copy link
Member

Natim commented Nov 30, 2018

I have this bug often with the Default bucket apparently.

image

@Natim
Copy link
Member

Natim commented Dec 3, 2018

I wonder if an easy fix in that case wouldn't be to raise a 409 Conflict error.

@glasserc
Copy link
Contributor

glasserc commented Dec 5, 2018

Just to recap some of the above:

  • inserted being none seems impossible because the query has two parts. The first part tries to insert the record and replaces any tombstone; the second part tries to fetch an existing record. inserted being none means it couldn't insert (there's an existing record) but also that it couldn't fetch an existing record, which seems inconsistent.
  • Like Peter says, it's probably coming from two different threads both doing create() at once. The first thread is inserting (or maybe updating?) successfully but hasn't committed; the second one is not able to insert but not able to read the other (uncommitted) row.
  • create() has to raise a UnicityError in case the record already exists. This is important for e.g. default_bucket not running object creation events. This is not exactly the same as raising a 409. Note that the two requests don't actually conflict so a 409 is a little bit wrong.
  • UnicityError takes as an argument the already-existing record. Of course, if we get no inserted result, then we don't have any way to get access to the already-existing record.
  • The code links to a StackOverflow answer about how to do a "SELECT or INSERT" query, although the answer has been updated since I landed this code. Although the existing code is based on the old answer, the current wisdom is that you can hit the same race condition we do, and that you have to either loop, use a different serialization level, or accept the possibility of the race condition.

This might be a good time to talk about switching to a stronger serialization level, for example repeatable read. This would at least get us to a place where transactions behave predictably. On the other hand, we'd then have to be ready to handle serialization errors, which might require a lot more work.

The Postgres documentation does say (about read committed): "Because of the above rules, it is possible for an updating command to see an inconsistent snapshot: it can see the effects of concurrent updating commands on the same rows it is trying to update, but it does not see effects of those commands on other rows in the database." Indeed, I think that's what we're seeing. This behavior seems to remain in v10 and v11.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants