
Error executing schema change: failed to update job / does not exist #39255

Closed
stickenhoffen opened this issue Aug 2, 2019 · 16 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

stickenhoffen commented Aug 2, 2019

Describe the problem

The following lines are continually logged across three nodes in a cluster:

W190802 04:50:00.605588 195 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 469519352643780609: job with ID 469519352643780609 does not exist
W190802 04:50:01.458351 195 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 469519352643780609: job with ID 469519352643780609 does not exist
W190802 04:50:02.314610 195 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 469491126704537601: job with ID 469491126704537601 does not exist
W190802 04:50:03.159336 195 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 469530263381803009: job with ID 469530263381803009 does not exist
W190802 04:50:03.946957 195 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 469492583951958017: job with ID 469492583951958017 does not exist
W190802 04:50:04.838493 195 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 469492912178266113: job with ID 469492912178266113 does not exist
W190802 04:50:05.706084 195 sql/schema_changer.go:1586  [n1] Error executing schema change: failed to update job 469503700072857601: job with ID 469503700072857601 does not exist
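
For reference, one way to check whether a job ID from these messages still exists is to query the jobs list from a SQL shell. A minimal sketch, assuming an insecure node reachable on localhost (an empty result confirms the schema changer is referencing a job that is gone):

cockroach sql --insecure --host=localhost -e \
  "SELECT job_id, job_type, status FROM [SHOW JOBS] WHERE job_id = 469519352643780609;"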

To Reproduce

I dropped some indexes and recreated them. I'm not exactly sure when the errors started.
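
For context, the schema changes were along these lines; the table and index names here are hypothetical, not the actual schema:

cockroach sql --insecure -e "DROP INDEX orders@orders_customer_id_idx;"
cockroach sql --insecure -e "CREATE INDEX orders_customer_id_idx ON orders (customer_id);"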

Expected behavior

No errors.

Environment:

  • CCL v19.1.3 @ 2019/07/08 18:24:39 (go1.11.6)
  • Ubuntu 18.04.2 LTS
  • Three-node cluster (4 vCPU / 16 GB GCP instances)

Additional context
What was the impact?

Increased CPU load.

@mattcrdb mattcrdb added the O-community Originated from the community label Aug 5, 2019
roncrdb commented Aug 6, 2019

Hey @stickenhoffen,

It appears that you may be running into a known issue, #38088. Could you send us over a debug zip?

I've shared a Drive folder for you to upload the debug zip file to.
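
In case it's useful, a debug zip can be generated with the cockroach CLI; a minimal sketch, assuming an insecure node reachable on localhost:

cockroach debug zip ./debug.zip --insecure --host=localhost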

Please let me know when you've added the debug zip.

Thanks,

Ron

@roncrdb roncrdb self-assigned this Aug 6, 2019
roncrdb commented Aug 8, 2019

Hey @stickenhoffen,

Just wanted to follow up to see if you're able to send over that debug zip.

roncrdb commented Aug 14, 2019

Hey @stickenhoffen,

Just wanted to follow up, are you still seeing this issue?

Let us know!

stickenhoffen commented Aug 14, 2019 via email

@ricardocrdb

Hello @stickenhoffen

Have you had any luck reproducing the issue? Let us know so we can continue troubleshooting and help you get it resolved.


stickenhoffen commented Aug 27, 2019

Hi @ricardocrdb

I've dropped and recreated a few indexes, and although I'm no longer seeing the schema change error in the logs, CPU usage is climbing rapidly on all three nodes, as before.

I've uploaded a debug ZIP.

There are also several jobs related to indexing that are at "Waiting For GC TTL".

Edit: all those waiting for GC are "DROP INDEX".
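
For reference, a minimal sketch of how those jobs can be inspected from a SQL shell and, if appropriate, how the GC window can be shortened so dropped-index data is reclaimed sooner (the database and table names below are hypothetical):

cockroach sql --insecure -e "SELECT job_id, description, running_status FROM [SHOW JOBS] WHERE job_type = 'SCHEMA CHANGE';"
cockroach sql --insecure -e "ALTER TABLE mydb.mytable CONFIGURE ZONE USING gc.ttlseconds = 600;"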

@ricardocrdb

Hey @stickenhoffen

Would it be possible to get the output of http://<adminUrl>/debug/pprof/ui/profile/ from each of the three nodes and screenshot it for us? You can put the screenshots in the same Google Drive location as the debug zip. I am seeing various connection errors on all three nodes, and this page should give us a little more information.

W190827 08:47:46.049802 39986968 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
W190827 08:47:48.931406 39987187 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
W190827 08:47:49.132941 39987293 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
I190827 08:47:55.926647 169 server/status/runtime.go:500  [n1] runtime stats: 5.8 GiB RSS, 199 goroutines, 672 MiB/296 MiB/1.1 GiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 862.6 CGO/sec, 198.6/34.8 %(u/s)time, 0.0 %gc (5x), 14 MiB/24 MiB (r/w)net
W190827 08:47:56.051428 39987943 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
W190827 08:47:58.933189 39988351 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
W190827 08:47:59.153501 39988424 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
I190827 08:48:05.928882 169 server/status/runtime.go:500  [n1] runtime stats: 5.8 GiB RSS, 206 goroutines, 408 MiB/532 MiB/1.1 GiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 674.0 CGO/sec, 194.0/31.9 %(u/s)time, 0.0 %gc (4x), 13 MiB/22 MiB (r/w)net
W190827 08:48:06.052746 39989374 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
W190827 08:48:08.934566 39989671 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
W190827 08:48:09.135798 39989684 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
I190827 08:48:15.930985 169 server/status/runtime.go:500  [n1] runtime stats: 5.8 GiB RSS, 204 goroutines, 444 MiB/498 MiB/1.1 GiB GO alloc/idle/total, 3.7 GiB/4.6 GiB CGO alloc/total, 410.2 CGO/sec, 185.9/34.6 %(u/s)time, 0.0 %gc (4x), 3.4 MiB/22 MiB (r/w)net
W190827 08:48:16.053974 39990231 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
W190827 08:48:18.946795 39990432 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
W190827 08:48:19.138098 39990401 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
I190827 08:48:22.879095 165 gossip/gossip.go:557  [n1] gossip status (ok, 3 nodes)
gossip client (1/3 cur/max conns)
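
If it's easier than screenshots, a raw CPU profile can also be pulled from each node and attached; a minimal sketch, assuming the standard pprof endpoint on the Admin UI address (the output can be inspected locally with go tool pprof):

curl -o node1.prof "http://<adminUrl>/debug/pprof/profile?seconds=30"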

Let me know when those screenshots have been sent over.

@stickenhoffen

@ricardocrdb uploaded as requested.

@ricardocrdb

Hey @stickenhoffen

Based on those screenshots, nothing jumps out at me at this point. Are you still seeing high CPU usage while waiting for the GC TTL? If possible, another screenshot of the CPU usage from the Admin UI would help, along with the job ID of the DROP INDEX. Let me know once that is uploaded.

@awoods187 awoods187 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Aug 30, 2019
@stickenhoffen

Hi @ricardocrdb

A couple of schoolboy errors on my part. First, we had a cron job hammering a huge table, running functionality that is no longer even required. We were also missing a whole bunch of indexes, and the two together were killing the cluster.

Second, I wasn't using the healthz endpoint for health checks; I was using TCP socket checks, which accounts for the HTTP/2 error lines.
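
For anyone else who sees those gRPC "failed to receive the preface" warnings: a plain HTTP health check avoids the half-open TCP connections that cause them. A minimal sketch, assuming an insecure node with the Admin UI on port 8080 and the documented readiness endpoint:

curl -f "http://localhost:8080/health?ready=1"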

So for now, everything looks good. I'm very happy! :)

Thanks.

@ricardocrdb

Hey @stickenhoffen

That's great to hear, I am glad everything is working smoothly. I will go ahead and close this issue out. If anything else comes up, feel free to open a new issue.

Cheers,
Ricardo

@hollinwilkins

Hey, can we reopen this issue? I am actively getting this error. It is causing clients to be unable to connect to the database, and when I look in the UI, none of my databases are listed.

hollinwilkins commented Oct 4, 2019

@ricardocrdb Also, when I go to the "Jobs" tab in the UI, the spinner spins forever and never loads. Looks like it is stuck on this API call: _admin/v1/jobs?status=&type=0&limit=50
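
One way to tell whether the hang is server-side rather than in the UI is to hit the same endpoint directly; a minimal sketch, assuming an insecure Admin UI on port 8080:

curl -m 30 "http://localhost:8080/_admin/v1/jobs?status=&type=0&limit=50"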

  • Running version: v19.1.2
  • Running in Kubernetes
  • Deployed using stable helm chart
  • Insecure mode

Please help. It doesn't look like any data was lost, but I cannot access the cluster from the JDBC Postgres client.

I CAN access the cluster and run queries using the cockroachdb client.

@hollinwilkins

Funnily enough, I just deleted all the pods, let them come back up, and it seems to be working fine now. If this happens again in the future, how can I help you debug it? Is there any debug information I could save somewhere?
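
In case it recurs, a minimal sketch of what could be captured before restarting anything; the pod name and in-container paths are assumptions based on the stable Helm chart and official image:

kubectl logs cockroachdb-0 --tail=10000 > cockroachdb-0.log
kubectl exec cockroachdb-0 -- /cockroach/cockroach debug zip /cockroach/debug.zip --insecure
kubectl cp cockroachdb-0:/cockroach/debug.zip ./debug.zip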

@hollinwilkins

Spoke too soon; it is still happening. I do have access to my databases now from the JDBC Postgres driver.

irfansharif commented Feb 28, 2020

@hollinwilkins: we seem to have missed your messages here; hopefully you were able to find a resolution elsewhere. We recently did some work in this area in #44299.
