Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanner: transaction fails with Session not found error #1527

Closed
110y opened this issue Aug 8, 2019 · 22 comments
Closed

spanner: transaction fails with Session not found error #1527

110y opened this issue Aug 8, 2019 · 22 comments
Assignees
Labels
api: spanner Issues related to the Spanner API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@110y
Copy link
Contributor

110y commented Aug 8, 2019

Client

Spanner: v0.37.4 (But as far as I know, it also occurs when we use latest version)

Describe Your Environment

Alpine Docker on GKE

Expected Behavior

If possible, this client library should retries a transaction when it fails with Session not found error.

Actual Behavior

We sometimes get the Session not found errors.
This client library retries transactions only when the Abort error occurred, so when taken session is not active, it just returns the NotFound error to callers without retrying.

  • sh, err = c.idleSessions.takeWriteSession(ctx)
    if err != nil {
    // If session retrieval fails, just fail the transaction.
    return err
    }
  • // Only retry on ABORTED.
    if isAbortErr(funcErr) {
    // Aborted, do exponential backoff and continue.
    b, ok := extractRetryDelay(funcErr)
    if !ok {
    b = backoff.DefaultBackoff.Delay(retryCount)
    }
    trace.TracePrintf(ctx, nil, "Backing off after ABORTED for %s, then retrying", b)
    select {
    case <-ctx.Done():
    return errContextCanceled(ctx, funcErr)
    case <-time.After(b):
    }
    retryCount++
    continue
    }

Since this library creates the pool of the sessions, I think it should retry failed transactions caused by Session not found by taking another session. But are there any problems to take another session from the pool?

@olavloite olavloite added the api: spanner Issues related to the Spanner API. label Aug 8, 2019
@jeanbza jeanbza added the type: question Request for information or clarification. Not an issue. label Aug 8, 2019
@jeanbza jeanbza added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: question Request for information or clarification. Not an issue. labels Aug 8, 2019
@olavloite
Copy link
Contributor

The client library creates and maintains a session pool, and takes measures to keep these sessions alive by doing a GetSession ping call for sessions that have not been used for a while. Still it is possible that Cloud Spanner (or another user) deletes a session server side, causing a Session not found error when using the client library.

This could be mitigated by retrying failed transactions and other server calls that operate on a session by first taking a new session and then retrying. This is however not complete straightforward, as there are roughly 4 different categories of calls that we would need to consider in order of increasing complexity:

  1. Single-use calls that just do one query or one update. These calls could simply be retried with a new session. This also includes the methods that do more than one gRPC call under water to achieve their goals, but that are presented to the user as a single call, such as executePartitionedUpdate.
  2. Read/write transactions that fail with a Session not found. These are already executed in a retry-loop, and the Session not found condition could be added to the conditions that should cause the transaction to be retried, but with the additional requirement that the transaction needs to take a new session.
  3. The BeginTransaction call of a MultiUseReadOnlyTransaction. Read-only transactions are normally not executed within a retry loop, but the initial BeginTransaction call could be retried on a new session.
  4. Any query within a MultiUseReadOnlyTransaction that fails with a Session not found error would need a new read-only transaction to be created on a different session with the same read timestamp as the session that failed, and the query must be retried on the new transaction. The client library must execute any subsequent query for the transaction on the new read-only transaction.

@olavloite
Copy link
Contributor

@110y Do you have any specific use case that reproduces this problem more frequently than others (or even always)?

@110y
Copy link
Contributor Author

110y commented Aug 12, 2019

@olavloite

Do you have any specific use case that reproduces this problem more frequently than others (or even always)?

Unfortunately, I do not understand why this problem occurs...

One of the my case is:

  • Just executing ReadWriteTransaction.
  • There is at least 1 request to the spanner per second.
  • Creates 60 sessions at initial time.
  • Did not delete sessions in my code.
  • Errors caused by Session not found are rarely happen (recently, there are 2 errors per week).

The document says:

There are two ways to delete a session:

  • A client can delete a session.
  • The Cloud Spanner database service can delete a session when the session is idle for more than 1 hour.

and, as far as my understanding is correct, if we are continuously sending requests to the spanner then there are no idle sessions for more than 1 hour since this library manages sessions in FIFO manner and used session will be pushed back to the session pool. That's why I'm wondering this problem and guessing the spanner deletes sessions even though idle time is not over 1 hour.

Do you have any thoughts or suggestions?

@olavloite
Copy link
Contributor

@110y To my understanding, it is possible (although not very common) that Cloud Spanner deletes sessions that have been idle for less than 1 hour, which could cause this problem. The reason I asked whether you had a specific use case that would always (or often) cause this problem, was to check whether you were running into some unknown bug in the session pool. Considering your error rate of 1-2 errors per week at 1QPS, I don't think that this is a specific bug in the Go session pool.

The Java client library for Cloud Spanner added a protection against this problem a couple of months ago along the lines that I mentioned above. I'll have a look to see if it is feasible to add this protection for the Go client library as well.

@110y
Copy link
Contributor Author

110y commented Aug 13, 2019

@olavloite

As trial, I send a CL that makes ReadWriteTransaction retry on Session not found error (case 2 you mentioned above).

gopherbot pushed a commit that referenced this issue Sep 9, 2019
Session not found errors should be retried by taking a new session from
the pool and retrying the gRPC call when that is possible. This change
fixes this for multi-use read-only transactions when the error occurs
on the BeginTransaction call. This method can safely be retried on a
new session, as the user has not yet been able to execute any queries
on the transaction yet. This is also the most probable moment for a
Session not found error to occur for a multi-use read-only transaction.

A Session not found error halfway through a read-only multi use
transaction could in theory be mitigated by starting a new read-only
transaction on a new session with the same read timestamp as the
transaction that produced the error, retry the query that returned the
error on the new transaction and update the internal transaction reference
to ensure any future queries that the user executes on the transaction
will use the new transaction. This is not included in this change.

Updates #1527.

Change-Id: Ibc48e558bf07e8066996c6aaad864c4450abae66
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/44051
Reviewed-by: kokoro <[email protected]>
Reviewed-by: Emmanuel Odeke <[email protected]>
@110y
Copy link
Contributor Author

110y commented Oct 11, 2019

@olavloite

Could you please take a look this CL which fix the problem for ReadWriteTransaction?

https://code-review.googlesource.com/c/gocloud/+/45910

@olavloite
Copy link
Contributor

@110y
Sorry for the delay on this. We had an integration test that started failing after one of the related changes for this was merged (not this one, the one for SingleUse transactions). It looks now like those fails were unrelated to these changes, but I wanted to make sure before proceeding with this. But I'll take another look into this change asap.

@110y
Copy link
Contributor Author

110y commented Nov 6, 2019

@olavloite

I've updated my CL based on your comments.
Could you please take another look?

@kazegusuri
Copy link

@olavloite We still suffered from this error frequently. It happens hundreds of times everyday. What's the status of this issue?

@110y
Copy link
Contributor Author

110y commented Jan 15, 2020

@olavloite

I've fixed my CL: https://code-review.googlesource.com/c/gocloud/+/45910/ to follow latest master.
Please take another look and could you merge the change if it is +1 ?

@olavloite
Copy link
Contributor

@110y and @kazegusuri

Thanks for updating your CL and sorry for taking so long to merge this. I'll have a look at this this morning and try to get it in ASAP.

@110y
Copy link
Contributor Author

110y commented Jan 15, 2020

@olavloite

By the way, do you have a plan to revive this CL which make ROTxn retry on SessionNotFound error?

@olavloite
Copy link
Contributor

@110y Yes (and additional transaction types).

gopherbot pushed a commit that referenced this issue Jan 19, 2020
Updates #1527

Ref: #1527
Change-Id: Iea12342ca098c8056abc2206b91edbeda630e718
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/45910
Reviewed-by: kokoro <[email protected]>
Reviewed-by: Hengfeng Li <[email protected]>
gopherbot pushed a commit that referenced this issue Jan 20, 2020
'Session not found' errors on BeginTransaction calls for a read-only transaction
should be retried on a new session, and the invalid session should be removed
from the session pool.

Updates #1527.

Change-Id: I49a6cb5e096c8b93c7aec76cdbd1c3d640f50c0d
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/50510
Reviewed-by: kokoro <[email protected]>
Reviewed-by: Hengfeng Li <[email protected]>
gopherbot pushed a commit that referenced this issue Jan 21, 2020
Single use read-only transactions should be retried if the query returns
a 'Session not found' error. These queries can safely be retried on a
different session as there is no transaction atomicity that can be
violated. The retry must be executed by the partial result sets stream,
as the query is sent to the server at the first call to
RowIterator.Next.

Updates #1527.

Change-Id: Ia3c77643e42adb2f88dba2cfd5d1bba7f9dbf3be
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/50511
Reviewed-by: Hengfeng Li <[email protected]>
@kazegusuri
Copy link

@olavloite Thanks you for handling issues about spanner. It seems many issues are fixed since last spanner client release as v1.1.0. Do you have a plan to cut a new release for spanner?

@olavloite
Copy link
Contributor

@hengfengli Do you know if there is a date planned for the next release?

@hengfengli
Copy link
Contributor

I'll ask @skuruppu and make a release soon.

@skuruppu
Copy link
Contributor

I'll ask @skuruppu and make a release soon.

@hengfengli, yes please cut a release.

@skuruppu
Copy link
Contributor

skuruppu commented Feb 7, 2020

The release has been cut so closing the issue. Please refer to the release notes.

@skuruppu skuruppu closed this as completed Feb 7, 2020
@kanekv
Copy link

kanekv commented May 15, 2020

@skuruppu I still got Session not found error on v1.5.1. May be it's because I am running on emulator?

@olavloite
Copy link
Contributor

@kanekv
I assume that you are getting this during a test, as you wrote that you are using the emulator. Is that correct?

The protection against Session not found errors was added because the Cloud Spanner backend can sometimes delete sessions without the client knowing about it. The emulator does not do that, at least not to my knowledge. At the same time, there are some known differences between the emulator and the Cloud Spanner backend. One of them is that the emulator reports some errors differently than Cloud Spanner. This also applies to Session not found errors.

So if you have a test that explicitly deletes a session on the emulator, and then tries to use that session in the client, then this error is explainable.

@kanekv
Copy link

kanekv commented May 15, 2020

@olavloite Not sure why it happened, I didn't run tests, I had an app connected to emulator instance and after coming back in a couple of hours (making a request after some idle time) it started throwing this error. May be it is indeed specific to emulator.

@olavloite
Copy link
Contributor

olavloite commented May 15, 2020

@kanekv Thanks for the quick reply. That is interesting information, though. This seems to indicate that the client library is not keeping sessions alive on the emulator. In addition to the Session not found retry protection, the client library also contains a background process that keeps all idle sessions alive. It seems like that one is not working as it should with the emulator. I'll open a separate issue and have a quick look at it. I guess it hasn't really been noticed yet, as most users use the emulator for tests, and those are not running long enough to ever cause a session to time out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the Spanner API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

7 participants