Kube attempts not catching connector exception properly #5526

Closed · ChristopheDuong opened this issue Aug 19, 2021 · 4 comments
@ChristopheDuong
Contributor

ChristopheDuong commented Aug 19, 2021

Environment

  • Airbyte version: airbyte-cloud (based on airbyte OSS v0.29.3?)

Current Behavior

In the following issue, a failing connection has been set up where the destination connector throws an exception (NPE):
#5525

Notice in the attached logs below that it took 10 hours for the worker to time out and start its second attempt at the sync:

2021-08-18 15:24:18 INFO  Exit code for pod destination-bigquery-denormalized-worker-113-0-ilyga is 2
2021-08-18 15:24:18 INFO  Closed all resources for pod destination-bigquery-denormalized-worker-113-0-ilyga
2021-08-19 01:28:48 WARN  Process is taking too long to finish. Killing it
2021-08-19 01:28:48 INFO  Destroying Kube process: source-intercom-worker-113-0-wqdop
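For illustration only, here is a minimal sketch of the behaviour being asked for (this is not Airbyte's actual worker code, and it uses a local process as a stand-in for the Kube pod): react to the connector process's own exit, with a short safety ceiling, instead of only killing it when a multi-hour deadline fires.

```java
import java.util.concurrent.TimeUnit;

// Minimal sketch, NOT Airbyte's worker implementation: the destination pod
// already returned exit code 2, so the worker should notice the exit promptly
// rather than waiting for a long "taking too long to finish" timeout.
public class BoundedProcessWait {
    public static void main(String[] args) throws Exception {
        // Hypothetical local process standing in for the connector pod.
        Process process = new ProcessBuilder("sleep", "2").start();

        // Wait on the process itself; the 5-minute ceiling is an arbitrary
        // safety net, not a value taken from Airbyte's configuration.
        boolean exited = process.waitFor(5, TimeUnit.MINUTES);
        if (!exited) {
            // Only kill the process if it is genuinely hung.
            process.destroyForcibly().waitFor();
        }
        System.out.println("exit code: " + process.exitValue());
    }
}
```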

Expected Behavior

Attempts should be retried without waiting 10h in between.

Additionally, the logs should probably be improved to count attempt numbers correctly; when starting the second attempt they currently state:

2021-08-19 01:29:16 INFO  Starting attempt: 1 of 3

This should instead be:

2021-08-19 01:29:16 INFO  Starting attempt: 2 of 3

The second attempt should also be split into its own attempt tab instead of appearing in the same log output.
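For illustration, a hypothetical sketch of the off-by-one: the job log ends with a zero-based "attempt id: 0", so the human-readable line presumably needs a `+ 1` when a retry starts. Names and constants below are illustrative, not Airbyte's logging code.

```java
// Hypothetical sketch of attempt-number logging with a zero-based counter.
public class AttemptLogging {
    private static final int MAX_ATTEMPTS = 3;

    static String startLine(int zeroBasedAttemptId) {
        // Add one so the second attempt reads "2 of 3" rather than "1 of 3".
        return String.format("Starting attempt: %d of %d", zeroBasedAttemptId + 1, MAX_ATTEMPTS);
    }

    public static void main(String[] args) {
        System.out.println(startLine(0)); // first attempt  -> "Starting attempt: 1 of 3"
        System.out.println(startLine(1)); // second attempt -> "Starting attempt: 2 of 3"
    }
}
```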

Logs


LOG
2021-08-18 15:23:48 ERROR Caused by: java.lang.NullPointerException
2021-08-18 15:23:48 ERROR 	... 4 more
2021-08-18 15:24:18 INFO  Exit code for pod destination-bigquery-denormalized-worker-113-0-ilyga is 2
2021-08-18 15:24:18 INFO  Closed all resources for pod destination-bigquery-denormalized-worker-113-0-ilyga
2021-08-19 01:28:48 WARN  Process is taking too long to finish. Killing it
2021-08-19 01:28:48 INFO  Destroying Kube process: source-intercom-worker-113-0-wqdop
2021-08-19 01:28:48 INFO  Destroyed Kube process: source-intercom-worker-113-0-wqdop
2021-08-19 01:28:48 WARN  airbyte-source gobbler IOException: Socket closed. Typically happens when cancelling a job.
2021-08-19 01:29:16 INFO  Unable to find pod source-intercom-worker-113-0-wqdop to retrieve exit value. Defaulting to  value 143. This is expected if the job was cancelled.
2021-08-19 01:29:16 INFO  Closed all resources for pod source-intercom-worker-113-0-wqdop
2021-08-19 01:29:16 INFO  Unable to find pod source-intercom-worker-113-0-wqdop to retrieve exit value. Defaulting to  value 143. This is expected if the job was cancelled.
2021-08-19 01:29:16 INFO  Closed all resources for pod source-intercom-worker-113-0-wqdop
2021-08-19 01:29:16 INFO  Unable to find pod source-intercom-worker-113-0-wqdop to retrieve exit value. Defaulting to  value 143. This is expected if the job was cancelled.
2021-08-19 01:29:16 INFO  Closed all resources for pod source-intercom-worker-113-0-wqdop
2021-08-19 01:29:16 INFO  Unable to find pod source-intercom-worker-113-0-wqdop to retrieve exit value. Defaulting to  value 143. This is expected if the job was cancelled.
2021-08-19 01:29:16 INFO  Closed all resources for pod source-intercom-worker-113-0-wqdop
2021-08-19 01:29:16 INFO  Unable to find pod source-intercom-worker-113-0-wqdop to retrieve exit value. Defaulting to  value 143. This is expected if the job was cancelled.
2021-08-19 01:29:16 INFO  Closed all resources for pod source-intercom-worker-113-0-wqdop
2021-08-19 01:29:16 INFO  Unable to find pod source-intercom-worker-113-0-wqdop to retrieve exit value. Defaulting to  value 143. This is expected if the job was cancelled.
2021-08-19 01:29:16 INFO  Closed all resources for pod source-intercom-worker-113-0-wqdop
2021-08-19 01:29:16 ERROR Sync worker failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.net.SocketException: Broken pipe
	at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
	at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:140) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:52) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:146) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
	Suppressed: io.airbyte.workers.WorkerException: Source process exit with code 143. This warning is normal if the job was cancelled.
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteSource.close(DefaultAirbyteSource.java:135) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:121) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:52) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:146) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:832) [?:?]
	Suppressed: java.net.SocketException: Socket closed
		at sun.nio.ch.NioSocketImpl.ensureOpenAndConnected(NioSocketImpl.java:165) ~[?:?]
		at sun.nio.ch.NioSocketImpl.beginWrite(NioSocketImpl.java:366) ~[?:?]
		at sun.nio.ch.NioSocketImpl.implWrite(NioSocketImpl.java:411) ~[?:?]
		at sun.nio.ch.NioSocketImpl.write(NioSocketImpl.java:440) ~[?:?]
		at sun.nio.ch.NioSocketImpl$2.write(NioSocketImpl.java:826) ~[?:?]
		at java.net.Socket$SocketOutputStream.write(Socket.java:1052) ~[?:?]
		at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:242) ~[?:?]
		at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:312) ~[?:?]
		at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:290) ~[?:?]
		at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:131) ~[?:?]
		at java.io.OutputStreamWriter.write(OutputStreamWriter.java:208) ~[?:?]
		at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:120) ~[?:?]
		at java.io.BufferedWriter.flush(BufferedWriter.java:256) ~[?:?]
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.notifyEndOfStream(DefaultAirbyteDestination.java:107) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.close(DefaultAirbyteDestination.java:120) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:121) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:52) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:146) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.lang.RuntimeException: java.net.SocketException: Broken pipe
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getReplicationRunnable$2(DefaultReplicationWorker.java:234) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	... 1 more
Caused by: java.net.SocketException: Broken pipe
	at sun.nio.ch.NioSocketImpl.implWrite(NioSocketImpl.java:420) ~[?:?]
	at sun.nio.ch.NioSocketImpl.write(NioSocketImpl.java:440) ~[?:?]
	at sun.nio.ch.NioSocketImpl$2.write(NioSocketImpl.java:826) ~[?:?]
	at java.net.Socket$SocketOutputStream.write(Socket.java:1052) ~[?:?]
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:242) ~[?:?]
	at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:312) ~[?:?]
	at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:290) ~[?:?]
	at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:131) ~[?:?]
	at java.io.OutputStreamWriter.write(OutputStreamWriter.java:208) ~[?:?]
	at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:120) ~[?:?]
	at java.io.BufferedWriter.write(BufferedWriter.java:233) ~[?:?]
	at java.io.Writer.write(Writer.java:249) ~[?:?]
	at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.accept(DefaultAirbyteDestination.java:99) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
	at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.accept(DefaultAirbyteDestination.java:50) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getReplicationRunnable$2(DefaultReplicationWorker.java:219) ~[io.airbyte-airbyte-workers-0.29.3-alpha.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	... 1 more
2021-08-19 01:29:16 INFO  sync summary: io.airbyte.config.ReplicationAttemptSummary@3c362cf5[status=failed,recordsSynced=513,bytesSynced=1019790,startTime=1629300177620,endTime=1629336556166]
2021-08-19 01:29:16 INFO  Source did not output any state messages
2021-08-19 01:29:16 WARN  State capture: No state retained.
2021-08-19 01:29:16 INFO  Stopping cancellation check scheduling...
2021-08-19 01:29:16 INFO  Last output present: true. Should attempt again: true
2021-08-19 01:29:16 INFO  Starting attempt: 1 of 3
2021-08-19 01:29:16 INFO  Executing worker wrapper. Airbyte version: dev
2021-08-19 01:29:16 INFO  start sync worker. job id: 113 attempt id: 0
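Side note on the defaulted exit value 143 in the log above: it is the conventional status for a process terminated by SIGTERM (128 + 15), which is consistent with the pod being killed rather than exiting on its own.

```java
// 128 + signal number is the conventional exit status of a signal-terminated
// process; SIGTERM is 15, hence the defaulted 143 in the log above.
public class ExitCodes {
    public static void main(String[] args) {
        int exitCode = 143;
        if (exitCode > 128) {
            System.out.println("terminated by signal " + (exitCode - 128)); // 15 = SIGTERM
        }
    }
}
```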

Steps to Reproduce

Are you willing to submit a PR?


ChristopheDuong added the type/bug ("Something isn't working") label on Aug 19, 2021
@cgardens
Contributor

@ChristopheDuong are we sure that this isn't a problem specifically with BQ? I thought our other destinations were pretty good at failing quickly. BQ looks a little different from other destinations because it uses the BQ client instead of the JDBC abstractions the others rely on. It is very possible that it is a problem with kube attempts like you say, but I want to make sure we rule out that it's a BQ destination problem specifically.

@davinchia
Contributor

Tested Intercom (1.1) -> BQ Denormalised (1.2) on the current cloud (0.29.12-alpha) and saw no errors.

I'm thinking this was fixed in #5467 and popped up because we were on an older version.

I also tested the source-e2e-test (failure after 3 records) -> BQ Denormalised in dev and saw correct failure behaviour (immediate retries without the 10-hour timeout). This was a source failure rather than a destination failure, so it's not an exact replica. My recent PR getting rid of retries within an attempt also simplified this behaviour, so there might be something there.

Going to do one last test on my local Kube for the BQ Denormalised connector where I throw an exception just to confirm this is working fine.

davinchia added this to the Core 2021-08-25 milestone on Aug 25, 2021
davinchia self-assigned this on Aug 25, 2021
@davinchia
Contributor

davinchia commented Aug 25, 2021

Built a BQ denormalised image that fails on purpose and conducted a test using the infinite e2e source test - confirmed the retry attempts work as expected on Kube. No masking of failures or long wait times.
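For reference, a hypothetical sketch of the kind of fail-on-purpose patch such a test relies on (illustrative names only, not the actual BQ denormalised connector code):

```java
import java.util.function.Consumer;

// Hypothetical stand-in for a destination record consumer, patched to throw
// unconditionally so the worker's attempt/retry handling can be exercised.
public class AlwaysFailingConsumer implements Consumer<String> {
    @Override
    public void accept(String recordJson) {
        // Mimics the NPE-style crash from the linked issue.
        throw new RuntimeException("intentional failure for retry testing");
    }

    public static void main(String[] args) {
        try {
            new AlwaysFailingConsumer().accept("{\"record\": 1}");
        } catch (RuntimeException e) {
            System.out.println("connector failed as intended: " + e.getMessage());
        }
    }
}
```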

@davinchia
Contributor

davinchia commented Aug 25, 2021

@ChristopheDuong @cgardens reasonably confident this works as expected - I'm inclined to close this issue.
