
S3 and GCS writes are limited to 50GB #5720

Closed
flagbug opened this issue Aug 28, 2021 · 8 comments · Fixed by #5890 or #9039

flagbug commented Aug 28, 2021

Environment

  • Airbyte version: 0.29.12-alpha
  • Environment: Debian 10, GCP
  • Deployment: Docker
  • Source Connector and version: mssql 0.3.4
  • Destination Connector and version: gcs 0.1.0
  • Step where error happened: Sync job

Current Behavior

When syncing our MSSQL database to Google Cloud Storage, after reading a few million rows, the job stops with a java.lang.IndexOutOfBoundsException.

Note that the job doesn't fail, it just stops and never continues.

Expected Behavior

The syncing just works™

Logs

2021-08-26 19:33:48 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):223 - Records read: 256184000
2021-08-26 19:33:48 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):223 - Records read: 256185000
2021-08-26 19:33:49 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):223 - Records read: 256186000
2021-08-26 19:33:49 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):223 - Records read: 256187000
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN i.a.i.b.FailureTrackingAirbyteMessageConsumer(close):78 - {} - Airbyte message consumer: failed.
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN i.a.i.d.g.w.BaseGcsWriter(close):125 - {} - Failure detected. Aborting upload of stream 'Andr'...
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.MultiPartOutputStream(close):158 - {} - Called close() on [MultipartOutputStream for parts 1 - 10000]
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.MultiPartOutputStream(close):158 - {} - Called close() on [MultipartOutputStream for parts 1 - 10000]
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN a.m.s.MultiPartOutputStream(close):160 - {} - [MultipartOutputStream for parts 1 - 10000] is already closed
2021-08-26 19:33:49 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):223 - Records read: 256188000
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.StreamTransferManager(abort):470 - {} - [Manager uploading to mimo_airbyte_sync/production/Andr/2021_08_26_1629985551451_0.avro with id ABPnzm7-I...9-fC_bQ0Y]: Aborted
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN i.a.i.d.g.w.BaseGcsWriter(close):127 - {} - Upload of stream 'AndrPro' aborted.
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN i.a.i.d.g.w.BaseGcsWriter(close):125 - {} - Failure detected. Aborting upload of stream 'Chap...
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.MultiPartOutputStream(close):158 - {} - Called close() on [MultipartOutputStream for parts 1 - 10000]
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.MultiPartOutputStream(close):158 - {} - Called close() on [MultipartOutputStream for parts 1 - 10000]
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN a.m.s.MultiPartOutputStream(close):160 - {} - [MultipartOutputStream for parts 1 - 10000] is already closed
2021-08-26 19:33:49 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):223 - Records read: 256189000
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.StreamTransferManager(abort):470 - {} - [Manager uploading to mimo_airbyte_sync/production/Chap/2021_08_26_1629985551451_0.avro with id ABPnzm6nq...ivdTH52Xc]: Aborted
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN i.a.i.d.g.w.BaseGcsWriter(close):127 - {} - Upload of stream 'Chap' aborted.
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN i.a.i.d.g.w.BaseGcsWriter(close):125 - {} - Failure detected. Aborting upload of stream 'AndRec'...
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.MultiPartOutputStream(close):158 - {} - Called close() on [MultipartOutputStream for parts 1 - 10000]
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.MultiPartOutputStream(close):158 - {} - Called close() on [MultipartOutputStream for parts 1 - 10000]
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN a.m.s.MultiPartOutputStream(close):160 - {} - [MultipartOutputStream for parts 1 - 10000] is already closed
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.StreamTransferManager(abort):470 - {} - [Manager uploading to mimo_airbyte_sync/production/AndrRec/2021_08_26_1629985551451_0.avro with id ABPnzm5DW...aOZGkIUaE]: Aborted
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN i.a.i.d.g.w.BaseGcsWriter(close):127 - {} - Upload of stream 'AndrRec' aborted.
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 WARN i.a.i.d.g.w.BaseGcsWriter(close):125 - {} - Failure detected. Aborting upload of stream 'LessPro'...
2021-08-26 19:33:49 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-08-26 19:33:49 INFO a.m.s.MultiPartOutputStream(close):158 - {} - Called close() on [MultipartOutputStream for parts 1 - 10000]
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - Exception in thread "main" java.lang.IndexOutOfBoundsException: This stream was allocated the part numbers from 1 (inclusive) to 10001 (exclusive)and it has gone beyond the end.
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at alex.mojaki.s3upload.MultiPartOutputStream.putCurrentStream(MultiPartOutputStream.java:121)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at alex.mojaki.s3upload.MultiPartOutputStream.checkSize(MultiPartOutputStream.java:110)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at alex.mojaki.s3upload.MultiPartOutputStream.write(MultiPartOutputStream.java:143)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at org.apache.avro.file.DataFileWriter$BufferedFileOutputStream$PositionFilter.write(DataFileWriter.java:476)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at org.apache.avro.io.BufferedBinaryEncoder$OutputStreamSink.innerWrite(BufferedBinaryEncoder.java:227)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:157)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at org.apache.avro.file.DataFileStream$DataBlock.writeBlockTo(DataFileStream.java:394)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:408)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:351)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:320)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at io.airbyte.integrations.destination.gcs.avro.GcsAvroWriter.write(GcsAvroWriter.java:90)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at io.airbyte.integrations.destination.gcs.GcsConsumer.acceptTracked(GcsConsumer.java:105)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at io.airbyte.integrations.base.FailureTrackingAirbyteMessageConsumer.accept(FailureTrackingAirbyteMessageConsumer.java:66)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at io.airbyte.integrations.base.IntegrationRunner.consumeWriteStream(IntegrationRunner.java:136)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at io.airbyte.integrations.base.IntegrationRunner.run(IntegrationRunner.java:117)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	at io.airbyte.integrations.destination.gcs.GcsDestination.main(GcsDestination.java:48)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 	Suppressed: java.lang.IndexOutOfBoundsException: This stream was allocated the part numbers from 1 (inclusive) to 10001 (exclusive)and it has gone beyond the end.
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at alex.mojaki.s3upload.MultiPartOutputStream.putCurrentStream(MultiPartOutputStream.java:121)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at alex.mojaki.s3upload.MultiPartOutputStream.close(MultiPartOutputStream.java:164)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:188)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:188)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at org.apache.avro.file.DataFileWriter.close(DataFileWriter.java:461)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at io.airbyte.integrations.destination.gcs.avro.GcsAvroWriter.closeWhenFail(GcsAvroWriter.java:102)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at io.airbyte.integrations.destination.gcs.writer.BaseGcsWriter.close(BaseGcsWriter.java:126)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at io.airbyte.integrations.destination.gcs.GcsConsumer.close(GcsConsumer.java:111)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at io.airbyte.integrations.base.FailureTrackingAirbyteMessageConsumer.close(FailureTrackingAirbyteMessageConsumer.java:82)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		at io.airbyte.integrations.base.IntegrationRunner.consumeWriteStream(IntegrationRunner.java:130)
2021-08-26 19:33:49 ERROR () LineGobbler(voidCall):85 - 		... 2 more

Steps to Reproduce

  1. Sync a lot of rows to GCS?

Are you willing to submit a PR?

Unlikely, except if it's a very obvious fix 😄

@flagbug flagbug added the type/bug Something isn't working label Aug 28, 2021

flagbug commented Aug 29, 2021

FYI, I've tried this sync multiple times now, and every single time it fails after exactly 256189000 rows.

Edit: It looks like this actually happens after uploading exactly 10000 parts to the multipart stream.


flagbug commented Aug 30, 2021

So I've looked into this to the best of my knowledge, and it seems to be a fundamental limitation of how an S3 multipart upload works: there cannot be more than 10000 parts. If I'm understanding the code correctly, this issue also affects the native S3 destination. This means that, with the 5MB-per-part limit set in the connector code, Airbyte can't write files larger than 10000 * 5MB (roughly 50GB) to an S3 destination.

According to the documentation of the library used for writing to the underlying S3 destination (http://alexmojaki.github.io/s3-stream-upload/javadoc/apidocs/alex/mojaki/s3upload/StreamTransferManager.html#partSize-long-), what needs to be done here is to increase the size of the parts themselves, since the number of parts can't be increased.
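
For illustration, here is a minimal sketch of how the part size could be raised on the StreamTransferManager that backs these uploads (the library appears in the stack trace above). This is not the actual Airbyte connector code; the bucket, object key, and the 25MB value are placeholder assumptions, and the trade-off is that the library buffers whole parts in memory, so larger parts mean more RAM per stream.

import alex.mojaki.s3upload.MultiPartOutputStream;
import alex.mojaki.s3upload.StreamTransferManager;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class LargeUploadSketch {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // An S3 multipart upload allows at most 10000 parts, so the largest possible
        // object is 10000 * partSize: ~50GB with 5MB parts, ~250GB with 25MB parts.
        StreamTransferManager manager =
                new StreamTransferManager("my-bucket", "path/to/object.avro", s3) // placeholder bucket/key
                        .numStreams(1)
                        .numUploadThreads(2)
                        .queueCapacity(2)
                        .partSize(25); // MB per part; S3 requires at least 5MB per part

        MultiPartOutputStream out = manager.getMultiPartOutputStreams().get(0);
        try {
            out.write("example record\n".getBytes());
            out.close();          // flushes and uploads the final part
            manager.complete();   // completes the multipart upload
        } catch (Exception e) {
            manager.abort(e);     // aborts the upload so partial parts are not left behind
            throw e;
        }
    }
}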

Should this be a configurable option in the destination settings?

@flagbug flagbug changed the title Syncing to Google Cloud Storage fails with java.lang.IndexOutOfBoundsException S3 and GCS writes are limited to 50GB Aug 30, 2021

sherifnada commented Aug 30, 2021

@flagbug thanks for the thorough write-up! Will take a look.

@sherifnada sherifnada added area/connectors Connector related issues priority/medium Medium priority and removed priority/medium Medium priority labels Aug 31, 2021

etsybaev commented Aug 31, 2021

Hi @flagbug.

Many thanks for your investigation. Would it be an acceptable fix if we made those args configurable from the UI?

Regards,
Eugene


flagbug commented Aug 31, 2021

@etsybaev For me personally, sure, but I guess whether it makes sense is something the Airbyte team needs to decide 😄


flagbug commented Sep 20, 2021

@etsybaev Unfortunately it looks like the BigQuery GCS staging functionality implemented in #5614 still suffers from this problem, since it doesn't seem to expose this new chunk size option.

@PS-istarostenko

I believe the same applies to Snowflake. It seems that the fix in #5890 only resolved the issue for the S3 and GCS destinations, not for the case where storage buckets are used as a staging method for loading data warehouses.

@sherifnada

@etsybaev re-opening this issue

@mkhokh-33 mkhokh-33 assigned mkhokh-33 and unassigned etsybaev Nov 5, 2021
@alexandr-shegeda alexandr-shegeda moved this to Implementation in progress in GL Roadmap Dec 17, 2021
@yurii-bidiuk yurii-bidiuk moved this from Implementation in progress to Internal Review in GL Roadmap Dec 22, 2021
@alexandr-shegeda alexandr-shegeda moved this from Internal review to Done in GL Roadmap Dec 24, 2021