Caused by: javax.net.ssl.SSLException: java.net.SocketException: Broken pipe (Write failed) while writing #410
Comments
Could you provide more details on why you want to implement a native FS for Flink using the GCS connector, instead of using Flink's HDFS support with the GCS connector? Also, could you provide the following information: […]
Regarding OOMs, the GCS connector allocates […]
I had to go with a custom one, as the HDFS connector doesn't support Flink's StreamingFileSink. Only `hdfs://` can be used with that sink, as it supports the truncate operation. StreamingFileSink is nice because it natively integrates with Flink checkpointing.
But 73 MiB explains it; I thought the buffers were around 1 MiB. Over the long weekend I rewrote it by adding a local file buffer and limiting upload parallelism to a preconfigured number, and was able to run it on 3 GB of heap.
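The rewrite described above (buffer each stream to a local file, then upload with a bounded degree of parallelism) can be sketched with stdlib primitives only. This is an illustrative sketch, not the actual implementation: the class name and the `uploadToGcs` callback are stand-ins for the real GCS upload code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

// Sketch: data is written to a local temp file; on close(), the file is
// handed to an uploader thread pool. A Semaphore caps how many uploads
// run concurrently, so heap/connection usage stays bounded.
class BufferedUploadStream extends OutputStream {
    private static final Semaphore UPLOAD_PERMITS = new Semaphore(4); // max parallel uploads

    private final Path localBuffer;
    private final OutputStream out;
    private final ExecutorService uploader;
    private final Consumer<Path> uploadToGcs; // placeholder for the real upload call

    BufferedUploadStream(ExecutorService uploader, Consumer<Path> uploadToGcs) throws IOException {
        this.localBuffer = Files.createTempFile("gcs-chunk-", ".buf");
        this.out = Files.newOutputStream(localBuffer);
        this.uploader = uploader;
        this.uploadToGcs = uploadToGcs;
    }

    @Override public void write(int b) throws IOException { out.write(b); }
    @Override public void write(byte[] b, int off, int len) throws IOException { out.write(b, off, len); }

    @Override public void close() throws IOException {
        out.close(); // the local buffer is now complete
        uploader.submit(() -> {
            try {
                UPLOAD_PERMITS.acquire(); // block if too many uploads are in flight
                try {
                    uploadToGcs.accept(localBuffer);
                    Files.deleteIfExists(localBuffer);
                } finally {
                    UPLOAD_PERMITS.release();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

The key design point is that closing the Flink-visible stream is cheap and local; the expensive network work is deferred and throttled separately.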
Interesting, does it mean that for a Flink native FS […]? If you are writing small files, you can reduce the GCS upload chunk size. For small files, direct upload makes sense, because it's more efficient, and anyway you don't need resumable upload for files that can be uploaded in a single request and/or retried from a local FS cache. But generally speaking, GCS is inefficient when processing many small files; that's why you may want to change your pipeline to write fewer, bigger files. Ideally you should write 500+ MiB files to GCS, or at least 100 MiB files. Even if you make it work with many small files, it will be less efficient when you later need to read and process those files. Seems like […]
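For reference, the chunk-size and direct-upload tweaks mentioned above might look like the following `core-site.xml` fragment. The property names match the GCS connector's documented keys as I understand them, but verify both the names and the defaults against your connector version:

```xml
<!-- Assumed property names; check your GCS connector version's docs. -->
<property>
  <name>fs.gs.outputstream.upload.chunk.size</name>
  <value>1048576</value> <!-- 1 MiB instead of the much larger default -->
</property>
<property>
  <name>fs.gs.outputstream.direct.upload.enable</name>
  <value>true</value> <!-- single-request upload; no resumable session -->
</property>
```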
I use an approach similar to the one used in the S3 FS: https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-base. There are two important methods: `persist` and `closeForCommit`. The pipeline controls when those methods are called (with a bulk sink and 10-minute checkpoints, that happens every 10 minutes). So, my original implementation was to open a GCS stream, keep writing until `persist` is called, close the file on `persist`, and open a new one. I use compose to create the final file when `closeForCommit` is called. Rollback then just removes non-persisted files. Another quick question: how does it deal with streams that are open for a long time (minutes) but don't receive much data? Like, the stream is open for 10 minutes, but all the data is written in the first couple of seconds? Or every couple of seconds?
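The `persist`/`closeForCommit` lifecycle described here can be illustrated with a local-files-only sketch, where concatenating chunk files stands in for the GCS compose call. All class and method names below are illustrative, not Flink's actual `RecoverableWriter` API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Sketch: persist() seals the current chunk (so it survives a failure and a
// rollback will keep it) and starts a new one; commit() concatenates all
// chunks into the final file, as a GCS compose would.
class ChunkedWriter {
    private final Path dir;
    private final List<Path> sealedChunks = new ArrayList<>();
    private Path currentChunk;
    private int chunkIndex = 0;

    ChunkedWriter(Path dir) throws IOException {
        this.dir = dir;
        openNextChunk();
    }

    private void openNextChunk() throws IOException {
        currentChunk = dir.resolve("chunk-" + chunkIndex++);
        Files.createFile(currentChunk);
    }

    void write(byte[] data) throws IOException {
        Files.write(currentChunk, data, StandardOpenOption.APPEND);
    }

    // Called on checkpoint: seal the open chunk, open a fresh one.
    void persist() throws IOException {
        sealedChunks.add(currentChunk);
        openNextChunk();
    }

    // Called on closeForCommit: combine chunks into the final object.
    Path commit(String finalName) throws IOException {
        sealedChunks.add(currentChunk);
        Path result = dir.resolve(finalName);
        Files.createFile(result);
        for (Path chunk : sealedChunks) {
            Files.write(result, Files.readAllBytes(chunk), StandardOpenOption.APPEND);
        }
        return result;
    }
}
```

Rollback in this scheme would simply delete the chunks that were never sealed by `persist`.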
A stream that is open for a long time should not be a problem, because the GCS connector caches data until there is enough of it to send a request to the GCS API, so an open stream doesn't mean an open GCS connection. That said, direct upload can open a single long-running request, which is less reliable because it cannot recover from transient network failures during long uploads.
Coming back to retrying: see `hadoop-connectors/util/src/main/java/com/google/cloud/hadoop/util/RetryHttpInitializer.java`, lines 236–237 at commit `d51f2b6`.
It must mean that for your usage pattern the number of retries is not sufficient; you can adjust it through configuration. Closing this issue because retries are already in place.
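If the default retry count is insufficient, the knob would be set in Hadoop configuration. `fs.gs.http.max.retry` is the key as I understand the connector's options, but treat both the property name and the value shown as assumptions to verify against your connector version:

```xml
<!-- Assumed property name; check your GCS connector version's docs. -->
<property>
  <name>fs.gs.http.max.retry</name>
  <value>25</value> <!-- raise the per-request retry limit -->
</property>
```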
Hello,
I'm working on native GCS support for Apache Flink. I implemented a native FS by wrapping it around `GoogleHadoopFileSystem`, and it works fine most of the time, until it hits a wall around 250 open files. The errors I see in logs are the broken-pipe `SSLException`s from the issue title, and sometimes it hits OOM.

I tried to play with different modes, like setting `fs.gs.outputstream.direct.upload.enable` to `true`/`false`, but can't find anything that works reliably.

The way the wrapper is implemented: it returns `fs.create(new org.apache.hadoop.fs.Path(currentChunkName));` to Flink, and Flink is responsible for closing it, so a stream can be open for several minutes (depending on the job's config).

I'm thinking maybe I should switch to writing files locally and uploading them when the stream is closed. Is there a way to configure `GoogleHadoopFileSystem` to do it that way (buffer locally and upload on close)? Or am I doing anything wrong by letting Flink open an unlimited number of files?

Any advice will be appreciated. I will contribute it to Flink once it is stable enough.
Thank you