reading specific rows never returns #18199

Closed
jcsdt opened this issue Sep 4, 2017 · 47 comments

Labels
C-question A question rather than an issue. No code/spec/doc change needed. O-community Originated from the community

@jcsdt

jcsdt commented Sep 4, 2017

Bug report

We are running a 3-node cluster on v1.0.5.

Node 1:

I170904 17:40:50.553387 303137322 util/log/clog.go:910  [config] file created at: 2017/09/04 17:40:50
I170904 17:40:50.553387 303137322 util/log/clog.go:910  [config] running on machine: online-cockroach-a
I170904 17:40:50.553387 303137322 util/log/clog.go:910  [config] binary: CockroachDB CCL v1.0.5 (linux amd64, built 2017/08/24 17:43:46, go1.8.3)
I170904 17:40:50.553387 303137322 util/log/clog.go:910  [config] arguments: [/home/lefty/cockroach start --store /data/1/cockroach --store /data/2/cockroach --store /data/3/cockroach --host=0.0.0.0 --advertise-host=10.91.208.131 --certs-dir /home/lefty/cockroach_certs --http-host 0.0.0.0 --http-port 8080 --join 10.91.151.188 --join 10.91.155.209]

Node 2:

I170904 17:44:10.433957 250568015 util/log/clog.go:910  [config] file created at: 2017/09/04 17:44:10
I170904 17:44:10.433957 250568015 util/log/clog.go:910  [config] running on machine: online-cockroach-b
I170904 17:44:10.433957 250568015 util/log/clog.go:910  [config] binary: CockroachDB CCL v1.0.5 (linux amd64, built 2017/08/24 17:43:46, go1.8.3)
I170904 17:44:10.433957 250568015 util/log/clog.go:910  [config] arguments: [/home/lefty/cockroach start --store /data/1/cockroach --store /data/2/cockroach --store /data/3/cockroach --host=0.0.0.0 --advertise-host=10.91.151.188 --certs-dir /home/lefty/cockroach_certs --http-host 0.0.0.0 --http-port 8080 --join 10.91.208.131 --join 10.91.155.209]

Node 3:

I170904 17:44:34.696304 36972590 util/log/clog.go:910  [config] file created at: 2017/09/04 17:44:34
I170904 17:44:34.696304 36972590 util/log/clog.go:910  [config] running on machine: online-cockroach-c
I170904 17:44:34.696304 36972590 util/log/clog.go:910  [config] binary: CockroachDB CCL v1.0.5 (linux amd64, built 2017/08/24 17:43:46, go1.8.3)
I170904 17:44:34.696304 36972590 util/log/clog.go:910  [config] arguments: [/home/lefty/cockroach start --store /data/1/cockroach --store /data/2/cockroach --store /data/3/cockroach --host=0.0.0.0 --advertise-host=10.91.155.209 --certs-dir /home/lefty/certs --http-host 0.0.0.0 --http-port 8080 --join 10.91.208.131 --join 10.91.151.188]

Some of our queries stopped returning results, even simple queries such as
SELECT id from users where id = '2970788235'; (id is the primary key of the table)

We tried those queries from both our Java code and the cockroach CLI.
While this happens for specific rows, the server continues to work fine when reading others.

Not sure whether it is related, but our cockroach logs output a lot of

W170904 17:30:35.746218 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746247 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746295 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746343 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746392 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746445 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746506 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746536 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746565 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746605 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746642 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746693 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746739 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746777 241 storage/gc_queue.go:417  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] unable to resolve intents of committed txn on gc: context deadline exceeded
W170904 17:30:35.746897 302617246 storage/gc_queue.go:879  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] push of txn id=ac2d8029 key=/Table/52/1/"5320052451#1500557009361"/0 rw=false pri=0.01513283 iso=SERIALIZABLE stat=PENDING epo=0 ts=1500557007.703880494,0 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false seq=3 failed: context deadline exceeded
W170904 17:30:35.746907 302617051 storage/gc_queue.go:879  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] push of txn id=e925a948 key=/Table/52/1/"5319772163#1500556927647"/0 rw=false pri=0.00135358 iso=SERIALIZABLE stat=PENDING epo=0 ts=1500556918.902844582,0 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false seq=4 failed: context deadline exceeded
W170904 17:30:35.746912 302617057 storage/gc_queue.go:879  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] push of txn id=e93d4ffc key=/Table/52/1/"5320670056#1500972905215"/0 rw=false pri=0.03541256 iso=SERIALIZABLE stat=PENDING epo=0 ts=1500972901.427410919,0 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false seq=3 failed: context deadline exceeded
W170904 17:30:35.746933 302617247 storage/gc_queue.go:879  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] push of txn id=541e71ce key=/Table/52/1/"5320052451#1500557180174"/0 rw=false pri=0.02997946 iso=SERIALIZABLE stat=PENDING epo=0 ts=1500557104.540700487,0 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false seq=5 failed: context deadline exceeded
W170904 17:30:35.746947 302617189 storage/gc_queue.go:879  [gc,n1,s2,r10122/3:/Table/52/1/"53{19312…-20860…}] push of txn id=aae047d9 key=/Table/52/1/"5320063883#1500516197839"/0 rw=false pri=0.01194207 iso=SERIALIZABLE stat=PENDING epo=0 ts=1500516197.750966587,0 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false seq=3 failed: context deadline exceeded
@dianasaur323 dianasaur323 added this to the 1.1 milestone Sep 5, 2017
@cuongdo cuongdo assigned tbg and nvanbenschoten and unassigned cuongdo and tbg Sep 5, 2017
@nvanbenschoten
Member

Hi @jcsdt, thanks for the report! Would you mind grepping your logs on node 1 for the string ,r10122/ up to the point where you saw the first unable to resolve intents of committed txn on gc: context deadline exceeded error and posting the results?

Another option would be to look at debug tracing. If possible, could you run SET CLUSTER SETTING trace.debug.enable = true? This will allow us to peer into all requests on the admin UI's /debug/requests page. From here we can look at all active requests by clicking on the [<number> active] button. This should be pretty loud based on those logs and should be able to point out where those GC requests are getting stuck.
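
For anyone unfamiliar with that page: /debug/requests is backed by the golang.org/x/net/trace package. A minimal standalone Go sketch of how such a page gets populated, purely for illustration (the handler name and port are made up; this is not CockroachDB's code):

package main

import (
	"log"
	"net/http"

	"golang.org/x/net/trace" // registers /debug/requests on the default mux
)

func main() {
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		// Each request gets a trace; requests still in flight show up under
		// the "[<number> active]" link on /debug/requests.
		tr := trace.New("toy.Request", r.URL.Path)
		defer tr.Finish()
		tr.LazyPrintf("handling %s", r.URL.Path)
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}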

In terms of what's going wrong here, the gcQueue has the default timeout of 1m, so this doesn't seem like a timeout issue (like #18155). Without a bit more information it might be tough to track down.
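
As a generic illustration of how these "context deadline exceeded" messages arise (this is not CockroachDB's actual queue code, and the budget below is shortened so the sketch finishes instantly; the gcQueue budget discussed above is 1m):

package main

import (
	"context"
	"fmt"
	"time"
)

// processOnce mimics a queue giving one replica a fixed processing budget;
// any work still outstanding when the budget expires fails with
// context.DeadlineExceeded.
func processOnce(budget time.Duration, work func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	return work(ctx)
}

func main() {
	err := processOnce(100*time.Millisecond, func(ctx context.Context) error {
		select {
		case <-time.After(time.Second): // simulate slow intent resolution
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	if err == context.DeadlineExceeded {
		fmt.Println("unable to resolve intents of committed txn on gc:", err)
	}
}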

@christian-lefty

root@localhost:26257/> SET CLUSTER SETTING trace.debug.enable = true;
pq: unknown cluster setting 'trace.debug.enable'

Is this expected?

@christian-lefty

It's not listed on that page anyway:
https://www.cockroachlabs.com/docs/stable/cluster-settings.html

@nvanbenschoten
Member

Ah yeah, you're running v1.0.5. SET CLUSTER SETTING trace.debug.enable = true will be available in v1.1, but for now we'll have to start Cockroach with the environment variable COCKROACH_ENABLE_TRACING=true.

Based on your logs it looks like you've already tried restarting the cluster. Is that correct?

@christian-lefty

Yes, we've restarted; we've also upgraded from 1.0.4 to 1.0.5.

Anyway, I did the environment variable thing. Shall I just send you access to the UI?

@nvanbenschoten
Member

That would be great. You can email me if you'd prefer.

@nvanbenschoten
Member

I'm seeing the same retry proposal 5e933a46d8e7d317: reasonTicks issue we saw in #17524. Looking back at #17741, it looks like that was a symptom over there as well. Could they all be the same root issue?

@petermattis
Collaborator

The bug that caused #17741 is definitely in 1.0.x which is why we're putting the fix into 1.0.6 (not yet released).

@nvanbenschoten
Member

Before the rolling restart to v1.1-alpha.20170817, I saw a number of ranges which had lease epochs beneath their leaseholder's node liveness epoch. This was a clear symptom of the bug that caused #17741. However, the range that I was hoping to see exhibit this behavior (r10122, the one in the retry loop) strangely wasn't.

@christian-lefty it looks like you restarted your cluster this morning. Has the issue gone away? If this is what we're suspecting, then it shouldn't have, because that bug wasn't yet fixed in the version of Cockroach you're running.

@christian-lefty

Hey, so we rolled out v1.1-alpha.20170817; this didn't fix our issue.

@nvanbenschoten
Member

The fix for the issue I've referenced will be in tomorrow's alpha release. This should hopefully fix the issue we're seeing on your cluster. If you're able to compile from source then you can check out the SHA 3cab35b and build that. If not, I can send you the alpha binary that we're going to publish tomorrow.

@christian-lefty

If it's easy for you to send it, that would be better, because I've never built CockroachDB from source and I'd have to set everything up.

@nvanbenschoten
Member

Sure, I'll email you the binary.

@nvanbenschoten
Member

@christian-lefty are you still seeing similar log messages to the ones posted above?

@nvanbenschoten
Member

Yeah, it looks like you're still having issues with your cluster. Specifically, there seem to be some stray intents that our GC process is having trouble cleaning up. I'm consistently seeing:

[gc,n3,s7,r10568/4:/Table/52/1/"56{16648…-20705…}] unable to resolve intents of committed txn on gc: context deadline exceeded

in the low-verbosity logs I have access to. It seems like these stray intents are blocking any requests that happen to come upon them, which would explain logs like:

"[gc,n2,s6,r9216/3:/Table/52/2/NULL/"54{35…-54…}] push of txn id=3d0a91cd key=/Table/52/1/"5436607633#1496793886960"/0 rw=false pri=0.01699221 iso=SERIALIZABLE stat=PENDING epo=0 ts=1496793894.500072856,1 orig=0.000000000,0 max=0.000000000,0 wto=false rop=false seq=11 failed: context deadline exceeded"

and corresponding goroutines stuck for hours in maybePushTransactions.

On top of this, there are a few goroutines that have been stuck in beginCmds and a few that have been stuck in redirectOnOrAcquireLease for about the same amount of time. At the moment, I'm not sure what to make of these other than that they're probably the pushTxnRequests we're looking for and that they're getting clogged up somewhere.

I don't think we're currently tracking any issues where intents for a committed transaction have gotten stuck. @bdarnell or @tschottdorf, do you know of any that I'm forgetting/not finding?

It's tough to tell from the outside what exactly is causing this issue. Right now I think a load balancer is getting in the way of request tracing (that may be why the /debug/requests page is only showing a single active trace), which removes an opportunity to get more insight into what's going wrong. I'm also really missing the new logspy mode right now!

@christian-lefty

christian-lefty commented Sep 8, 2017 via email

@bdarnell
Contributor

bdarnell commented Sep 8, 2017

@christian-lefty Downgrading from 1.1 to 1.0.x is safe[1]. Going back further than that is not.

Do you have any clients that might be keeping a long-running transaction alive?

Reducing kv.gc.batch_size might help. Try SET CLUSTER SETTING kv.gc.batch_size=1000; (it defaults to 100k). Maybe even try setting it to 1 to see if that gets things unwedged (I think it would be problematic to run with a GC batch size of 1 in the long term, but it might help here).
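
For anyone applying this from an application rather than the CLI, a minimal Go sketch using database/sql over the Postgres wire protocol (the connection string is a placeholder for your cluster's user, host, and certs):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Placeholder DSN; adjust user, host, and TLS settings for your cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Lower the GC batch size as suggested above (default is 100k).
	if _, err := db.Exec("SET CLUSTER SETTING kv.gc.batch_size = 1000"); err != nil {
		log.Fatal(err)
	}
}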

@nvanbenschoten Could it be the lastTerm issue? This looks like a lease/liveness issue, not anything specific to intents or transactions.

My recommendation is to move forward with a fresh build from the release-1.1 branch to get the lastTerm fix, and the logspy endpoint if that doesn't fix it.

[1] Details on downgrade safety: the final step in the upgrade process will be to run SET CLUSTER SETTING VERSION='1.1'. Once this has been done, downgrading is no longer allowed (but if you just run the new binary without this step, you can go back). This is done automatically when version 1.1 initializes a new cluster, so you can't downgrade a cluster to a version that is older than the one used to create it.

@bdarnell
Contributor

bdarnell commented Sep 8, 2017

Here is a pre-built linux binary you can use.

@nvanbenschoten
Member

@bdarnell I don't think this is the lastTerm issue. @jcsdt initially ran into this on v1.0.5, which was before the lastTerm caching was introduced. It has also remained across a few restarts, so whatever is stuck must be persistently stored.

Another interesting datapoint is that I'm seeing GCs fail because of lease acquisition issues. I also see lease acquisition issues on the node liveness range in the traces (requested lease overlaps previous lease). The range debug page for node liveness looks fine now though, other than 15 dropped commands on a follower.

The kv.gc.batch_size suggestion is a good idea. At the very least, that should help isolate exactly what's stuck by cleaning up anything around it. @christian-lefty if you do run the binary provided above, let me know. The logspy endpoint introduced there should provide much better visibility into the problem.

@jcsdt
Author

jcsdt commented Sep 11, 2017

@nvanbenschoten we deployed the binary above so you can access logspy

We also set kv.gc.batch_size=1; but we're still seeing

unable to resolve intents of committed txn on gc: context deadline exceeded

@nvanbenschoten
Member

@jcsdt thanks for updating the binary. The logspy endpoint has been helpful in letting us identify part of the error. It looks like we're having serious issues performing RangeLookups for the range that contains the stuck intents (think DNS). Would you mind running the command SET CLUSTER SETTING trace.debug.enable = 'true' one more time, so that we can watch a trace of this lookup-cache-evict loop?

@christian-lefty

Hey @nvanbenschoten, I just enabled traces if you want to take a look.

@nvanbenschoten
Member

nvanbenschoten commented Sep 13, 2017

Thanks @christian-lefty! I'm seeing node1 being brought up and down. Is this intentional on your end?

Also, I know you're still seeing the unable to resolve intents log messages, but I'm curious if the initial symptom of SELECT id from users where id = '2970788235'; not finishing is still visible. I ask because throughout my debugging I haven't actually been able to track down anything that's completely stuck. What I have seen is very backed up replication on certain ranges resulting in cascading performance degradations elsewhere.

The cause of this slow replication is unclear to me but could be due in part to large hotspots of activity in the client workload on the user_archives table. This hotspot might be around the key range /Table/52/1/"#1505243733147"-/Table/52/1/"1000875591#1487783699744", although it seems to always be moving slowly. Here, I'm seeing about 4000 (large) Raft commands/sec on a single range. Does a workload like this sound characteristic of your application? An example of what this might look like is an application that inserts into the user_archives table about 4000 times per second with primary keys (in this case, id) that are sequentially ordered.

@petermattis
Collaborator

Fortunately, with a little refactoring in etcd/raft I think we can avoid the need for a precise count. What we require here is to ensure that we never have more than one config change in flight at a time. If we simply assume pessimistically that the tail of the log has a config change (so that the new leader cannot propose a config change until it has applied all entries up to the point of its election), then we can skip the scan.

When would you clear raft.pendingConf? Currently that field gets cleared when the conf change is applied. Keeping track of the current last index and watching for when that index is committed seems doable, but a bit tricky. Did you have a simpler idea in mind?

@bdarnell
Contributor

This might also be expected because I suspect the uncommitted entries in n2 and n3 forked a long time ago when the other issue began. It might fix itself once the index of the msgApps sent to n2 gets down to n2's lastAppliedIndex of 1689719.

Yeah, I think the slow recovery here is "expected" for this pathological case. It's probing one entry at a time for the log index where they diverged, because it is very unusual for two leaders to be able to ping-pong like this, each accumulating its own conflicting fork of the log.

Immediately jumping back to the lastAppliedIndex (so that n2 truncates its entire uncommitted log tail and n3 ships it a new copy of the log) would be optimal in this case, but would result in unnecessary log copies in cases where the divergence is small. I'm not sure which case is better to optimize for (maybe just take larger steps? This seems like a difficult heuristic to tune since it only matters in rare cases).
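
To make the "larger steps" idea concrete, one hypothetical shape for it (not what etcd/raft actually does) would be to double the amount by which the leader backs up its probe after each rejection, bounded by the follower's last applied index:

package raftsketch

// nextProbeIndex is a hypothetical back-off for follower probing: rather than
// stepping the probed index back one entry per rejection, the step doubles
// each time, never going below the follower's last applied index. Small
// divergences still resolve in a few probes, while a deep fork is located in
// logarithmically many round trips instead of linearly many.
func nextProbeIndex(current, lastApplied, step uint64) (next, nextStep uint64) {
	if step == 0 {
		step = 1
	}
	if current <= lastApplied+step {
		return lastApplied, step
	}
	return current - step, step * 2
}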

When would you clear raft.pendingConf? Currently that field gets cleared when the conf change is applied. Keeping track of the current last index and watching for when that index is committed seems doable, but a bit tricky. Did you have a simpler idea in mind?

My idea is to track the last index (instead of a bool pendingConf, it would be configChangeBlockedUntilIndex). I don't think it will be tricky because it doesn't need to persist across leadership changes.
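
A rough Go sketch of that bookkeeping, with hypothetical names (this is not the actual etcd/raft code):

package raftsketch

// confChangeGate replaces the pendingConf bool with an index: on election the
// leader pessimistically assumes the unapplied tail of its log may contain a
// config change and refuses to propose another one until it has applied
// everything up to the point of its election.
type confChangeGate struct {
	blockedUntilIndex uint64
}

func (g *confChangeGate) onBecomeLeader(lastIndex uint64) {
	g.blockedUntilIndex = lastIndex
}

// canProposeConfChange ensures at most one config change is in flight.
func (g *confChangeGate) canProposeConfChange(appliedIndex uint64) bool {
	return appliedIndex >= g.blockedUntilIndex
}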

@christian-lefty

@nvanbenschoten actually our queries are still hanging.
Should I still wait some more? Let me know when the cluster looks stable.

@nvanbenschoten
Member

@christian-lefty yeah I can see that range 3591 is still struggling. node1 is currently waiting on a snapshot for that range, but I suspect that there is a long queue to perform snapshots at the moment since we only allow a single snapshot at a time. I don't have any solid proof for that yet though, so I'll do some digging.

Could you try sending that query to one of the other two nodes' SQL gateways? Also, yes, let's keep this cluster up for a bit longer.

@nvanbenschoten
Member

The probe to catch n1, r11844 up is still ongoing, but it is slowly making progress. Based on its current rate, it should be done within the next few hours. I'm not sure why this slowed down so much; it was probing at about 20 indices/sec before and it's now all the way down to about 10 indices/min. The only clue I have here is that I have seen the splitQueue's use of MVCCFindSplitKey showing up in profiles, which @tschottdorf and I witnessed having a very troubling effect earlier today. I'm sure he'll pursue this further in #15997.

It looks like the snapshots for n1, r3591 are still failing. I was finally able to see the following error message:

E170920 05:48:34.003360 127 storage/queue.go:656  [raftsnapshot,n3,s7,r3591/1:/Table/{SystemCon…-11}] snapshot failed: rate: Wait(n=1) would exceed context deadline

That log is produced by the rate package here. It indicates that a context timeout occurred while processing the snapshot. Just like the gcQueue, the raftSnapshotQueue has a processing timeout of only 1 minute, which seems exceptionally low to me. When we need to send a snapshot to revive a struggling replica, I'd expect us to be liberal with our timeouts, since this is the last strategy we have before abandoning slow replicas. I'm going to bump this timeout up to 10 minutes and see if that helps revive this range. Let's hold off on the restart necessary for this until the probing on r11844 completes, though.

As with the first issue, I suspect the slow snapshots are due to engine-level slowdowns, possibly related to the tight looping MVCCFindSplitKey calls.
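
For reference, that error string comes from golang.org/x/time/rate: Limiter.Wait fails up front when the reservation cannot be satisfied before the caller's context deadline. A minimal reproduction (the limits below are arbitrary, not the snapshot queue's real configuration):

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// One token every 10 seconds, burst of 1.
	limiter := rate.NewLimiter(rate.Every(10*time.Second), 1)
	limiter.Allow() // spend the only available token

	// The caller's budget (1s here for brevity) is shorter than the ~10s
	// wait for the next token, so Wait returns immediately with
	// "rate: Wait(n=1) would exceed context deadline".
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := limiter.Wait(ctx); err != nil {
		fmt.Println("snapshot failed:", err)
	}
}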

tbg added a commit to tbg/cockroach that referenced this issue Sep 20, 2017
Manual testing in cockroachdb#15997 surfaced that one limiting
factor in resolving many intents is contention on the transaction's abort cache entry. In one
extreme test, I wrote 10E6 abortable intents into a single range, in which case the GC queue sends
very large batches of intent resolution requests for the same transaction to the intent resolver.

These requests all overlapped on the transaction's abort cache key, causing very slow progress, and
ultimately preventing the GC queue from making a dent in the minute allotted to it. Generally this
appears to be a somewhat atypical case, but since @nvanbenschoten observed something similar in
cockroachdb#18199 it seemed well worth addressing, by means of

1. allow intent resolutions to not touch the abort span
2. correctly declare the keys for `ResolveIntent{,Range}` to only declare the abort cache key
   if it is actually going to be accessed.

With these changes, the gc queue was able to clear out a million intents comfortably on my older
13" MacBook (single node).

Also use this option in the intent resolver, where possible -- most transactions don't receive abort
cache entries, and intents are often "found" by multiple conflicting writers. We want to avoid
adding artificial contention there, though in many situations the same intent is resolved and so a
conflict still exists.

Migration: a new field number was added to the proto and the old one preserved. We continue to
populate it. Downstream of Raft, we use the new field but if it's unset, synthesize from the
deprecated field. I believe this is sufficient and we can just remove all traces of the old field in
v1.3. (v1.1 uses the old, v1.2 uses the new with compatibility for the old, v1.3 only the new field).
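
As a generic illustration of that migration pattern (the type and field names here are placeholders, not the actual proto):

package migrationsketch

// request stands in for the proto message: the proposer keeps populating the
// deprecated field so old binaries still work, while evaluation downstream of
// Raft prefers the new field and synthesizes it from the deprecated one when
// a proposal from an older binary left it unset.
type request struct {
	DeprecatedOption bool  // old field number, still populated
	Option           *bool // new field number; nil means unset
}

func effectiveOption(r request) bool {
	if r.Option != nil {
		return *r.Option
	}
	return r.DeprecatedOption // synthesized from the deprecated field
}
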
tbg added a commit to tbg/cockroach that referenced this issue Sep 20, 2017
Fallout from cockroachdb#18199 and corresponding
testing in cockroachdb#15997.

When the context is expired, there is no point in shooting off another gazillion requests.
tbg added a commit to tbg/cockroach that referenced this issue Sep 20, 2017
Fallout from cockroachdb#18199 and corresponding testing in cockroachdb#15997. I think it'll be nontrivial to max out
these budgets in practice, but I can definitely do it in intentionally evil tests, and it's good to
know that there is some rudimentary form of memory accounting in this queue.
@nvanbenschoten
Member

Closing this, since we've already identified the cause of the problems faced here and have opened focused issues to address each. Thanks for all the patience and help during this process @christian-lefty!

@jordanlewis jordanlewis added C-question A question rather than an issue. No code/spec/doc change needed. O-community Originated from the community and removed O-deprecated-community-questions labels Apr 24, 2018