import: allow worker node failure #25480

Closed

maddyblue opened this issue May 14, 2018 · 18 comments · Fixed by #26881
Labels
A-disaster-recovery, C-bug (code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior)

Comments

@maddyblue (Contributor)

During IMPORT the job coordinator can go down and the job will be restarted. However, if another node goes down, I think the job will be marked as failed. In that case we should recognize that this is a transient error and restart the job so that it makes a new plan with the healthy nodes.

maddyblue added the C-bug and A-disaster-recovery labels on May 14, 2018
@thstart commented May 15, 2018

I investigated further, and one reason for this is that the cluster gets too busy while building indexes. This particular schema has 15 fields and 15 indexes. I suppose that is too many indexes. Could you please confirm?

@maddyblue (Contributor, Author)

Indexes do contribute to the time, yes. You could try without them and backfill them later.
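
As a rough sketch of what that could look like (the table, column, and file names below are made up, and the CSV URL scheme depends on where your file actually lives):

```sql
-- Sketch only: table, column, and file names are hypothetical.
-- Import with just the primary key; skip secondary indexes for now.
IMPORT TABLE events (
    id INT PRIMARY KEY,
    payload JSONB,
    created_at TIMESTAMP
)
CSV DATA ('https://myhost/events.csv');

-- Backfill the secondary index once the data is loaded.
CREATE INDEX events_created_at_idx ON events (created_at);
```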

@thstart commented May 16, 2018

This is a great idea. Is it possible to build an index from a field in JSONB? I know about inverted indexes, but my JSONB has many fields and I want only some of them indexed.

@knz (Contributor) commented May 17, 2018

Yes, you can create an index over a field in a JSONB value: 1) create a computed column that extracts the value you want from the JSONB field, then 2) create an index over that computed column.

After that, to use the index you'll need to write the WHERE constraint over the computed column (not the original JSON field), but at least the index will be maintained fully automatically this way.
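
A minimal sketch of that approach (the table and field names here are made up):

```sql
-- Sketch only: table and field names are hypothetical.
CREATE TABLE docs (
    id      INT PRIMARY KEY,
    data    JSONB,
    -- Computed column extracting one field from the JSONB value.
    user_id STRING AS (data->>'user_id') STORED
);

-- Index over the computed column.
CREATE INDEX docs_user_id_idx ON docs (user_id);

-- To use the index, constrain the computed column, not the JSON expression.
SELECT id FROM docs WHERE user_id = '12345';
```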

@thstart commented May 19, 2018

Currently it is not possible to add a computed column after IMPORT. From https://www.cockroachlabs.com/docs/stable/computed-columns.html#main-content: computed columns "Cannot be added after a table is created."

@thstart commented May 20, 2018

Regenerated the CSV again. Now I am trying IMPORT without creating indexes.

  1. tried IMPORT again. In n2 logs I see "W180520 04:32:34.225690 122 sql/schema_changer.go:1146 [n3] Error executing schema change: not hit TTL deadline: -16h36m18.295557736s remaining
    I180520 04:32:37.048262 119 server/status/runtime.go:219 [n3] runtime stats: 4.1 GiB RSS, 185 goroutines, 337 MiB/181 MiB/969 MiB GO alloc/idle/total, 4.5 GiB/5.8 GiB CGO alloc/total, 472.92cgo/sec, 0.12/0.04 %(u/s)time, 0.00 %gc (1x)"
  2. restarted n2,
  3. started IMPORT from n3 - I see "W180520 04:32:34.225690 122 sql/schema_changer.go:1146 [n3] Error executing schema change: not hit TTL deadline: -16h36m18.295557736s remaining
    I180520 04:32:37.048262 119 server/status/runtime.go:219 [n3] runtime stats: 4.1 GiB RSS, 185 goroutines, 337 MiB/181 MiB/969 MiB GO alloc/idle/total, 4.5 GiB/5.8 GiB CGO alloc/total, 472.92cgo/sec, 0.12/0.04 %(u/s)time, 0.00 %gc (1x)" in the logs of n3

@maddyblue (Contributor, Author)

You can add computed columns after IMPORT in the 2.1 alpha release. We may add them to IMPORT in the future too, but it's complicated and it's not a priority yet.
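
In a 2.1 alpha that would look roughly like the following (a sketch only; the table and field names are made up, not from your schema):

```sql
-- Sketch only (2.1 alpha or later); table and field names are hypothetical.
-- Add a stored computed column to an already-imported table, then index it.
ALTER TABLE docs ADD COLUMN user_id STRING AS (data->>'user_id') STORED;
CREATE INDEX docs_user_id_idx ON docs (user_id);
```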

Ok those error messages are interesting. I'll look into this on Monday.

@thstart commented May 20, 2018

After importing for 7 hours I got the following message on the command line:
Error: pq: job 349697967389802499: node liveness error: restarting in the background
Failed running "sql"

On the other hand, in the web UI I still saw this job as running, and after a while it now shows Failed. The CSV file is 150 million records / 200 GB and CRDB cannot handle it.

n1 logs: W180520 21:19:25.282063 213 storage/node_liveness.go:501 [n1,hb] slow heartbeat took 1.1s. (.sst files are being added periodically in the CRDB folder.)
n2 logs: W180520 21:17:35.536163 133 storage/node_liveness.go:501 [n2,hb] slow heartbeat took 4.6s
W180520 21:17:35.536177 133 storage/node_liveness.go:438 [n2,hb] failed node liveness heartbeat: context deadline exceeded. (.sst files are being added periodically in the CRDB folder.)
n3: no .sst files being added.

@thstart commented May 20, 2018

Reason for the stop:
"could not parse JSON: error decoding JSON: invalid character 'U' after object key:value pair"

So after a day of IMPORT it stops because of just one line. I opened the file to see where this character is and cannot find it. I suppose it is some Unicode character. Is there a way to filter it out during import?

Do you have code for a Golang web server which can do what Caddy does, so I can add some regex filtering as well? This is important because I have one database with 400 million records, and this can happen at any time during an import. If I could filter out special characters during import I could prevent this; otherwise I need a separate step that first generates a safe CSV file by filtering them out, which is very time consuming.

I made a comparative IMPORT test with 100 million records:

  1. with PostgreSQL: ~1 hour
  2. with CRDB: ~24 hours

This is a 24x difference and something is very wrong with CRDB. Over such a long time the chance of being interrupted by a communication issue is high. I think the way IMPORT should work is to first get all records onto the node doing the importing and only then begin replication.

There is one more consideration: imagine I have an initial master table of 500 million records / 500 GB which I need to update daily once it is created. There are several issues:

  1. It will take ~1 week to IMPORT.
  2. It will probably take 1 week more to replicate within the same datacenter.

I could live with the fact that it takes so long if it didn't stop. Daily updates with UPSERT would be fine.
What will happen if I have a live cluster with paying users and the cluster fails? I could not wait 2 weeks to recreate the DB. Currently I am using RocksDB with 2 separate servers. Server #1 is live; the next week I update server #2 for 3-4 days and switch server #2 to be live. 2 weeks later I update server #1 and, when it is ready, switch to #1 to be live. It has worked that way for 3 years. If something happens with the live server I always have one server a week behind and can switch to it immediately. That way users have uninterrupted service and they get the new data the next week. For this I need 2 servers.

Now with CRDB I need at least 3 servers, and the way I see it I will probably need 5 to be sure the cluster is not down for some reason. I need to watch and restart CRDB servers manually, because the dashboard does not always show that some server needs a restart. So even with 3 servers there is no guarantee the cluster will not be down. If I UPSERT 10 million records per day it will probably take several hours, and I am not sure how this will affect the users using it live: how slow will it be? 3 servers is not a cheap investment, and it still does not give a stable cluster. With RocksDB and 2 servers the only inconvenience is switching live servers, which I managed to automate. Yes, there is no SQL, but at least it works. Is having SQL too big a price to pay for this?

Now there is more. Currently I have servers in a datacenter on the West Coast; if I want to add servers on the East Coast, here is what will happen:

  1. If replication of 100 million records to 3 servers in the same datacenter takes 24 hours, imagine how long it will take if they need to be replicated to the East Coast.
  2. The way we do it with RocksDB: our master DB is ~500 GB. We put it on a USB drive and send a courier to the East Coast. A person there just copies it into the RocksDB folder and makes it live. So every week we have a fresh DB and servers and can expand.

So I don't know what is better: a system that works, albeit a little primitive at first glance, or fighting with CRDB? I can live with the fact that it takes 2 weeks to IMPORT the master DB and then several hours daily, if it works. But assuming it works for our West Coast datacenter, how do we expand to the East Coast? First, it will take forever just to create the master DB, not to mention the bandwidth cost. There should be a solution to attach a ready-to-use master DB to CRDB from a USB drive and then do daily UPSERT updates from the West Coast to the East Coast. The time spent and the bandwidth paid would be tolerable if our users had daily updates and there were no switching between servers.

If not, CRDB would not be practical for real production use with hundreds of GBs of data.

@thstart commented May 21, 2018

Regarding orchestration: if CRDB were stable there would not be much need for orchestration. But it is not stable. I see that the dashboard does not always reflect what the logs show. To be sure, before an IMPORT I now check every node's logs, and usually I need to restart at least one node first.

With RocksDB we are currently using 2 separate k8s installations, each as a single master/node on 2 physical servers. k8s + ElasticSearch + Kibana + Sentinel for alerts is what is needed to guarantee uptime, monitoring, and alerting. On each server, 2 pods with a web service pointing to a folder mounted as hostPath guarantee that if one pod goes down, k8s will bring it up within minutes. The unclear statements about how hostPath behaves on a k8s cluster made us use 1 physical server per k8s master/node. It is fine to have two k8s installations on two physical servers separately; the benefits of k8s are still there.

Now with CRDB: it needs at least 3 physical servers to work at all. This is fine if justified. But using k8s with 3 servers and persistent storage is a problem. hostPath prevents using k8s effectively, and the explanation is very vague. DaemonSet prevents k8s from recovering from failures. The other solution, StatefulSet, is still in beta and not production ready. So we cannot use k8s as separate master/node installations on each physical server as before. The storage story of k8s with multiple servers is not ready for production yet (we waited a year for that, but it is still not ready).

DC/OS needs at least 5 servers. They need to be configured as planned and it is not possible to change this later. Masters and nodes on one server is not possible. You cannot begin small and test in a real situation before investing too much. The CRDB app in the DC/OS catalog is not recommended for production yet.

So orchestration for CRDB is not ready for production either.

@knz (Contributor) commented May 21, 2018

Hi @thstart, thank you for the quality feedback.

I do not know much about orchestration myself but I can provide a general response to the other points. There are many specific questions in your comments which we can answer one by one.

Before I go there I would like to highlight that there are other (production) CockroachDB users who are able to sustain similar read/write workloads as yours against multiple data centers in different regions (including west + east US) without much concern, so I am confident that the hurdles you see can be overcome.

First, as to the number of CockroachDB nodes vs. performance.

As you know the number 3 is a minimum to create a consensus group for any range. Here consider that any write transaction will need a response from a majority (2 in this case) before it is acknowledged. If you use 3 nodes with 2 on one side and 1 on the other side of a country, any transaction issued from the 1-node side will span the entire country, which is not a happy case for performance. Of course this is true of any strongly consistent system, not just CockroachDB.

Then also if you only use 3 nodes in that way, and there is a whole datacenter outage on the 2-node region, then the entire cluster becomes unavailable. This is not ideal either.

This is why we recommend that users deploy groups of 3 nodes in each datacenter, then configure replication zones to ensure that the data specific to each region is co-located in that region. This increases the performance within each region, and ensures that the data of that region is still available when another DC goes down. (Then perhaps you'll want to use 3 datacenters rather than 2, so that there is still a majority of nodes for the meta-information common to the whole cluster in case 1 DC goes down.)
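
As a rough illustration only (this uses the CONFIGURE ZONE syntax from later releases; in 2.0 the equivalent is done with the cockroach zone set CLI, and the table name and locality value here are made up):

```sql
-- Sketch only: assumes nodes were started with e.g. --locality=region=us-west,
-- and a hypothetical table whose data belongs to the west-coast region.
ALTER TABLE users_west CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+region=us-west]';
```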

Then regarding the performance of IMPORT and subsequent data updates

As you have found out, CockroachDB's IMPORT is slower than PostgreSQL's. A part of this is because the data is replicated across multiple nodes upfront. Another part of this is that we have not yet fully optimized IMPORT so there is still room for improvement (after all, CockroachDB is still younger than PostgreSQL). But beyond that, as you pointed out the initial import of your master database should be a one-time operation. This is a one-time cost which you can amortize.

Once this import is done, updating new records each day should not impact performance on the existing data (if it does, we'll certainly want to look into this). You should probably spread the updates out over time though (instead of issuing them all in one go), because you want to avoid transaction contention. Again, these are general SQL best practices and not specific to CockroachDB.
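
For example (a sketch with made-up table and column names), the daily delta could be applied in moderate batches of UPSERTs rather than one huge statement:

```sql
-- Sketch only: apply the daily delta in batches to limit contention.
-- Table and column names are hypothetical.
UPSERT INTO master (id, payload, updated_at)
VALUES
    (1001, '{"field": "value1"}', now()),
    (1002, '{"field": "value2"}', now());
-- ...repeat for the next batch, e.g. a few thousand rows per statement.
```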

Regarding the error during IMPORT

It is indeed very inconvenient that the entire IMPORT job was aborted just because of a single data error. There are multiple things we could do here. One could be to park the CSV lines containing errors in a separate file for manual inspection after IMPORT completes. Another could be to suspend the job and give you the opportunity to fix the file. Then as you propose perhaps you could also decide to ignore the errors entirely with an IMPORT option.

I agree that currently the situation is not appropriate, so I filed the following issue for you: #25751

Regarding the overall performance of CockroachDB

I recall that on Gitter you referred to your setup where you were using Apple Mac Minis to run CockroachDB. Although this hardware is enough to run CockroachDB, it is certainly not enough to make CockroachDB run fast. If this is the hardware you are using to run your master IMPORT job, then I am not surprised that you observe relatively poor performance.

There are many ways to select hardware for server software, and we can certainly make recommendations as to which server type to select if you use one of the public clouds. In general we recommend a minimum of 6-8 cores per CockroachDB node and 4GB of RAM, and depending on your workload perhaps more.

Again, as I wrote earlier, I am not an orchestration expert, so I will refrain from commenting on your k8s/DC/OS issues. In the meantime I'd like to point out that other users have had success with this before, so if we identify the issues you're running into more specifically, it is likely we can help alleviate the problems you see.

@maddyblue (Contributor, Author)

We're tracking the error message problem in #25532.

@thstart commented May 21, 2018

I'm adding 3 more Dell PowerEdge servers to address hardware performance. The only things that matter for these issues are on the CRDB side:

  1. to stop having communication issues, which are independent of the server hardware;
  2. for IMPORT to be able to finish.

@thstart commented May 25, 2018

I made another attempt to IMPORT. It was again unsuccessful, with the message "Error : pq: Context deadline exceeded". Now I have 6 servers, and I had hoped that with more servers CRDB would find a way to finish.

@maddyblue (Contributor, Author)

We're tracking that bug in #25866.

@thstart commented May 25, 2018

All but one node connected; only one node got the following errors when others tried to connect to it:
W180525 07:29:47.488865 354218 vendor/google.golang.org/grpc/clientconn.go:1158 grpc: addrConn.createTransport failed to connect to {xxx.xxx.xxx.xxx:26257 0 }. Err :connection error: desc = “transport: Error while dialing dial tcp xxx.xxx.xxx.xxx:26257: connect: connection refused”. Reconnecting…

So I suppose that with 6 CRDB servers it will be able to finish once you fix it?

craig bot pushed a commit that referenced this issue Jun 26, 2018
26811: kv: update the TCS's txn on requests way out r=andreimatei a=andreimatei

The TxnCoordSender maintains a copy of the transaction record, used for
things like heartbeating and creating new transactions after a
TransactionAbortedError. This copy is supposed to be kept in sync with
the client.Txn's copy. Before this patch, the syncing was done by
updating the TCS's txn when a response comes back from a request.
This patch moves to updating the TCS's txn on a request's way out, in
addition to continuing to update it when a request comes back.
Besides being the sane thing to do™, this assures that, if the heartbeat
loop triggers before the response to the BeginTransaction's batch comes
back, the transaction already has the key set. Without this patch, if
the heartbeat loop triggered before the BeginTxn response, it would
heartbeat key /Min, which is non-sensical (and creating load on range 0
for TPCC loadtests).

Release note: None

26881: importccl: restart IMPORT on worker node failure r=mjibson a=mjibson

Attempt to detect a context canceled error in IMPORT which is caused by
a node going away in the dist SQL run. Send a special error back to the
job registry indicating a restart should happen instead of a failure.

We are shipping this with a skipped test because it is flaky. We are
ok doing that because it is still better than what we had before in many
cases, just not all. We will work to improve the other things so that we
can correctly detect when IMPORT can be restarted due to a node outage,
which will allow us to unskip this test.

Fixes #25866
Fixes #25480

Release note (bug fix): IMPORT now detects node failure and will restart
instead of fail.

26968: settings: bump minimum supported version to v2.0 r=nvanbenschoten a=nvanbenschoten

We're currently shipping v2.1 alphas, so enforce a minimum binary
version of v2.0. This ensures that no one can upgrade directly from
v1.1 to v2.1. Instead, they need to make a pit stop in v2.0.

Release note: None

26984: storageccl: retry SST chunks with new splits on err r=dt a=dt

Simpler alternative to #26930.

Closes #26930.

Previously an ImportRequest would fail to add SSTables that spanned the
boundaries of the target range(s). This reattempts the AddSSTable call
with re-chunked SSTables that avoid spanning the bounds returned in
range mismatch error. It does this by iterating the SSTable to build and
add smaller sstables for either side of the split.

This error currently happens rarely in practice -- we usually explicitly
split ranges immediately before sending an Import with matching bounds to
them. Usually the empty, just-split range has no reason to split again,
so the Import usually succeeds.

However in some cases, like resuming a prior RESTORE, we may be
re-Importing into ranges that are *not* empty and could have split at
points other than those picked by the RESTORE statement.

Fixes #17819.

Subsumes #24299.
Closes #24299.

Release note: none.

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: David Taylor <[email protected]>
craig bot closed this as completed in #26881 on Jun 26, 2018
@tbg (Member) commented Jun 26, 2018 via email

@maddyblue (Contributor, Author)

Yes, this is correct.
