reading specific rows never returns #18199
Comments
Hi @jcsdt, thanks for the report! Would you mind grepping your logs on node 1 for that string? Another option would be to look at debug tracing. If possible, could you run SET CLUSTER SETTING trace.debug.enable = true; and take a look at the traces?
root@localhost:26257/> SET CLUSTER SETTING trace.debug.enable = true;
Is this expected?
It's not listed on that page anyways.
Ah yeah, you're running v1.0.5. Based on your logs, it looks like you've already tried restarting the cluster. Is that correct?
Yes, we've restarted, and we've also upgraded from 1.0.4 to 1.0.5. Anyways, I did the environment variable thing. Shall I just send you access to the UI?
That would be great. You can email me if you'd prefer.
The bug that caused #17741 is definitely in 1.0.x, which is why we're putting the fix into 1.0.6 (not yet released).
@christian-lefty it looks like you restarted your cluster this morning, before the rolling restart. Has the issue gone away? If this is what we're suspecting, then it shouldn't have, because that bug wasn't fixed yet in the version of Cockroach you're running.
Hey, so we rolled out v1.1-alpha.20170817; this didn't fix our issue.
The fix for the issue I've referenced will be in tomorrow's alpha release. This should hopefully fix the issue we're seeing on your cluster. If you're able to compile from source, then you can check out the SHA 3cab35b and build that. If not, I can send you the alpha binary that we're going to publish tomorrow.
If it's easy for you to send it, that would be better, because I've never built CDB from source, so I'd have to set everything up.
Sure, I'll email you the binary.
@christian-lefty are you still seeing similar log messages to the ones posted above?
Yeah, it looks like you're still having issues with your cluster. Specifically, there seem to be some stray intents that our GC process is having issues cleaning up. I'm consistently seeing:

[gc,n3,s7,r10568/4:/Table/52/1/"56{16648…-20705…}] unable to resolve intents of committed txn on gc: context deadline exceeded

in the low verbosity logs I have access to. It seems like these stray intents are blocking any requests that happen to come upon them, which would explain logs like:

"[gc,n2,s6,r9216/3:/Table/52/2/NULL/"54{35…-54…}] push of txn id=3d0a91cd key=/Table/52/1/"5436607633#1496793886960"/0 rw=false pri=0.01699221 iso=SERIALIZABLE stat=PENDING epo=0 ts=1496793894.500072856,1 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false seq=11 failed: context deadline exceeded"

and corresponding goroutines stuck for hours in maybePushTransactions. On top of this, there are a few goroutines that have been stuck in beginCmds and a few that have been stuck in redirectOnOrAcquireLease for about the same amount of time. At the moment, I'm not sure what to make of these other than that they're probably the pushTxnRequests we're looking for and that they're getting clogged up somewhere.

I don't think we're currently tracking any issues where intents for a committed transaction have gotten stuck. @bdarnell or @tschottdorf, do you know of any that I'm forgetting/not finding?

It's tough to tell from the outside what exactly is causing this issue. Right now I think a load balancer is getting in the way of request tracing (that may be why the /debug/requests page is only showing a single active trace), which removes an opportunity to get more insight into what's going wrong. I'm also really missing the new logspy mode (#18081) right now!
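Since most of the blocked operations above surface as "context deadline exceeded", here is a minimal, self-contained Go sketch (not CockroachDB code; the channel simply stands in for a request blocked behind an unresolved intent) of how a wait bounded by a context deadline ends up reporting exactly that error:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForIntent stands in for a request that is blocked behind an unresolved
// intent (e.g. a push of the intent's transaction). The caller's context
// deadline bounds how long we are willing to wait.
func waitForIntent(ctx context.Context, resolved <-chan struct{}) error {
	select {
	case <-resolved:
		return nil // the intent was resolved; the request can proceed
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// The intent is never resolved here, so the wait fails once the deadline hits.
	err := waitForIntent(ctx, make(chan struct{}))
	fmt.Println(err) // context deadline exceeded
}
```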
Hi Nathan, so what are our options now? Should we roll back to a binary from before 1.0? Is that even gonna work?
@christian-lefty Downgrading from 1.1 to 1.0.x is safe[1]. Going back further than that is not. Do you have any clients that might be keeping a long-running transaction alive?

@nvanbenschoten Could it be the lastTerm issue? This looks like a lease/liveness issue, not anything specific to intents or transactions. My recommendation is to move forward with a fresh build from the release-1.1 branch to get the lastTerm fix.

[1] Details on downgrade safety: the final step in the upgrade process will be to run a statement that finalizes the new version; until that step has been run, downgrading remains possible.
Here is a pre-built Linux binary you can use.
@bdarnell I don't think this is the lastTerm issue. @jcsdt initially ran into this on v1.0.5, which was before the lastTerm caching was introduced. It has also remained across a few restarts, so whatever is stuck must be persistently stored. Another interesting datapoint is that I'm seeing GCs fail because of lease acquisition issues. I also see lease acquisition issues on the node liveness range in the traces.
@nvanbenschoten we deployed the binary above so you can access it. We also set the settings you asked for.
@jcsdt thanks for updating the binary.
Hey @nvanbenschoten, just enabled traces if you wanna take a look.
Thanks @christian-lefty! I'm seeing node1 being brought up and down. Is this intentional on your end? Also, I know you're still seeing the issue. The cause of this slow replication is unclear to me, but it could be due in part to large hotspots of activity in the client workload.
When would you clear that?
Yeah, I think the slow recovery here is "expected" for this pathological case. It's probing one entry at a time for the log index where they diverged, because it is very strange for two leaders to be able to ping-pong like this, each accumulating its own conflicting fork of the log. Immediately jumping back to the lastAppliedIndex (so that n2 truncates its entire uncommitted log tail and n3 ships it a new copy of the log) would be optimal in this case, but would result in unnecessary log copies in cases where the divergence is small. I'm not sure which case is better to optimize for (maybe just take larger steps? This seems like a difficult heuristic to tune, since it only matters in rare cases).
My idea is to track the last index (instead of a bool), so that the probe can jump back much further in a single step.
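To make the trade-off concrete, here is a toy Go sketch (my own illustration, not etcd/raft or CockroachDB code; the follower interface and the index numbers are made up) comparing a probe that walks back one entry per round trip against one that takes exponentially larger steps back:

```go
package main

import "fmt"

// follower is a toy stand-in for a raft follower's log: it can answer whether
// it holds a given entry. In the real protocol, that answer arrives as an
// (often rejected) append-response round trip.
type follower interface {
	hasEntry(index uint64) bool
}

// probeOneByOne mimics the slow recovery described above: walk the probe
// index back one entry per round trip until leader and follower agree.
func probeOneByOne(f follower, lastIndex uint64) (matchIndex, roundTrips uint64) {
	idx := lastIndex
	for idx > 0 && !f.hasEntry(idx) {
		idx--
		roundTrips++
	}
	return idx, roundTrips
}

// probeWithBackoff takes exponentially larger steps back. It may land below
// the true divergence point (re-shipping a few extra entries), but needs far
// fewer round trips when the divergence is deep.
func probeWithBackoff(f follower, lastIndex uint64) (safeIndex, roundTrips uint64) {
	step := uint64(1)
	idx := lastIndex
	for idx > 0 && !f.hasEntry(idx) {
		if idx > step {
			idx -= step
		} else {
			idx = 0
		}
		step *= 2
		roundTrips++
	}
	return idx, roundTrips
}

// divergedLog holds entries only up to divergePoint.
type divergedLog struct{ divergePoint uint64 }

func (d divergedLog) hasEntry(index uint64) bool { return index <= d.divergePoint }

func main() {
	f := divergedLog{divergePoint: 100}
	_, slow := probeOneByOne(f, 1000000)
	_, fast := probeWithBackoff(f, 1000000)
	fmt.Println("one-by-one round trips:", slow) // on the order of the divergence length
	fmt.Println("backoff round trips:", fast)    // roughly logarithmic
}
```

With a deeply diverged log, the one-by-one probe needs round trips on the order of the divergence length, while the backoff variant needs only a logarithmic number, at the cost of re-copying a few extra entries when the divergence turns out to be small.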
@nvanbenschoten actually our queries are still hanging.
@christian-lefty yeah, I can see that range. Could you try sending that query to one of the other two nodes' SQL gateways? Also, yes, let's keep this cluster up for a bit longer.
The probe to catch up seems to be making progress now. It looks like the snapshots for this range are slow, though. As with the first issue, I suspect the slow snapshots have a related cause.
Manual testing in cockroachdb#15997 surfaced that one limiting factor in resolving many intents is contention on the transaction's abort cache entry. In one extreme test, I wrote 10E6 abortable intents into a single range, in which case the GC queue sends very large batches of intent resolution requests for the same transaction to the intent resolver. These requests all overlapped on the transaction's abort cache key, causing very slow progress, and ultimately preventing the GC queue from making a dent in the minute allotted to it. Generally this appears to be a somewhat atypical case, but since @nvanbenschoten observed something similar in cockroachdb#18199, it seemed well worth addressing, by means of:

1. allow intent resolutions to not touch the abort span
2. correctly declare the keys for `ResolveIntent{,Range}` to only declare the abort cache key if it is actually going to be accessed

With these changes, the gc queue was able to clear out a million intents comfortably on my older 13" MacBook (single node).

Also use this option in the intent resolver, where possible -- most transactions don't receive abort cache entries, and intents are often "found" by multiple conflicting writers. We want to avoid adding artificial contention there, though in many situations the same intent is resolved and so a conflict still exists.

Migration: a new field number was added to the proto and the old one preserved. We continue to populate it. Downstream of Raft, we use the new field but, if it's unset, synthesize from the deprecated field. I believe this is sufficient and we can just remove all traces of the old field in v1.3. (v1.1 uses the old, v1.2 uses the new with compatibility for the old, v1.3 only the new field.)
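To illustrate the second point, here is a rough Go sketch (hypothetical types and names — Key, SpanSet, declareKeysResolveIntent — not the actual CockroachDB command-declaration API) of declaring the abort-cache key for an intent-resolution command only when the command will actually touch it, so that resolutions which skip the abort span no longer serialize on that key:

```go
package main

import "fmt"

// Key and SpanSet are simplified stand-ins for the real key and span types;
// declareKeysResolveIntent mirrors the idea of a per-command "declare keys"
// hook, not the actual CockroachDB signature.
type Key string

type SpanSet struct{ keys []Key }

func (s *SpanSet) Add(k Key) { s.keys = append(s.keys, k) }

// ResolveIntentRequest is a hypothetical request: which intent to resolve,
// for which transaction, and whether the resolution will poison (write to)
// the transaction's abort-span entry.
type ResolveIntentRequest struct {
	IntentKey Key
	TxnID     string
	Status    string // "COMMITTED" or "ABORTED"
	Poison    bool
}

func abortSpanKey(txnID string) Key { return Key("/Local/AbortSpan/" + txnID) }

// declareKeysResolveIntent always declares the intent's own key, but declares
// the transaction's abort-span key only if the command will actually access
// it (in this sketch: when poisoning the entry of an aborted transaction).
func declareKeysResolveIntent(req ResolveIntentRequest, spans *SpanSet) {
	spans.Add(req.IntentKey)
	if req.Poison && req.Status == "ABORTED" {
		spans.Add(abortSpanKey(req.TxnID))
	}
}

func main() {
	var committed, aborted SpanSet
	declareKeysResolveIntent(ResolveIntentRequest{
		IntentKey: "/Table/52/1/123", TxnID: "deadbeef", Status: "COMMITTED",
	}, &committed)
	declareKeysResolveIntent(ResolveIntentRequest{
		IntentKey: "/Table/52/1/456", TxnID: "deadbeef", Status: "ABORTED", Poison: true,
	}, &aborted)

	// Resolutions of committed intents no longer contend on the abort-span key.
	fmt.Println(committed.keys) // [/Table/52/1/123]
	fmt.Println(aborted.keys)   // [/Table/52/1/456 /Local/AbortSpan/deadbeef]
}
```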
Fallout from cockroachdb#18199 and corresponding testing in cockroachdb#15997. When the context is expired, there is no point in shooting off another gazillion requests.
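A minimal sketch of that guard, using only the standard library (the batch-sending loop here is hypothetical, not the intent resolver's actual code): check the context before dispatching each batch and stop early once it has expired.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendBatches dispatches work in batches, but bails out as soon as the
// caller's context has expired instead of firing off further requests that
// are doomed to fail with "context deadline exceeded".
func sendBatches(ctx context.Context, batches [][]string, send func([]string) error) error {
	for _, b := range batches {
		if err := ctx.Err(); err != nil { // non-nil once the deadline has passed or the context was canceled
			return err
		}
		if err := send(b); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	batches := [][]string{{"a"}, {"b"}, {"c"}}
	err := sendBatches(ctx, batches, func(b []string) error {
		time.Sleep(40 * time.Millisecond) // slow "request"
		fmt.Println("sent", b)
		return nil
	})
	fmt.Println(err) // context deadline exceeded, after only the first batch or two
}
```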
Fallout from cockroachdb#18199 and corresponding testing in cockroachdb#15997. I think it'll be nontrivial to max out these budgets in practice, but I can definitely do it in intentionally evil tests, and it's good to know that there is some rudimentary form of memory accounting in this queue.
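As a rough illustration of that kind of rudimentary accounting (a toy sketch under assumed semantics, not the GC queue's actual budget mechanism), one can accumulate keys until an approximate byte budget is reached and then process the work chunk by chunk:

```go
package main

import "fmt"

// collectUpToBudget accumulates keys until adding the next one would exceed
// an approximate byte budget, then returns the chunk plus the remainder.
// A real implementation would also account for value sizes and overhead.
func collectUpToBudget(keys []string, budgetBytes int) (chunk, rest []string) {
	used := 0
	for i, k := range keys {
		if used+len(k) > budgetBytes && len(chunk) > 0 {
			return chunk, keys[i:]
		}
		used += len(k)
		chunk = append(chunk, k)
	}
	return chunk, nil
}

func main() {
	keys := []string{"/Table/52/1/1", "/Table/52/1/2", "/Table/52/1/3", "/Table/52/1/4"}
	remaining := keys
	for len(remaining) > 0 {
		var chunk []string
		chunk, remaining = collectUpToBudget(remaining, 30) // ~30-byte budget per chunk
		fmt.Println("processing chunk:", chunk)
	}
}
```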
Closing this, since we've already identified the cause of the problems faced here and have opened focused issues to address each. Thanks for all the patience and help during this process @christian-lefty!
Bug report
We are running a 3-node cluster under v1.0.5.
Node 1:
Node 2:
Node 3:
Some of our queries stopped returning results, even simple queries such as
SELECT id FROM users WHERE id = '2970788235';
(id is the primary key of the table). We tried those queries from both our Java code and the cockroach CLI.
While this happens for specific rows, the server continues to work fine when reading others.
Not sure it is related, but our cockroach logs output a lot of