[Bug]: TextIO read record twice for large delimited files #29146

Closed
vatanrathi opened this issue Oct 26, 2023 · 33 comments · Fixed by #30026

Comments

@vatanrathi

What happened?

The Beam TextIO reader reads the same record twice for a large (~45 GB) delimited (.DAT) file. No issues are encountered when the same file is split into smaller files and processed together.
This behaviour is not intermittent or random: the same records appear twice in every run.

wc -l test_tx_20211101.dat
78132206 test_tx_20211101.dat 

The record count of 78132206 was also verified by loading the file in Spark, which gives the same count.
Note: records are delimited with CRLF (^M$)

Pipeline used to read the file:

input.apply("Read lines", TextIO.read().from("<s3 file path>"))
    .apply("timestamp", WithTimestamps.of(e -> Instant.now()))
    .setCoder(StringUtf8Coder.of())
    .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
    .apply(ParDo.of(new LogOutput<>("Pcollection count: ")));
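For reference, a minimal self-contained sketch of the counting part of this pipeline; the S3 path is a placeholder, and the inline DoFn below stands in for the LogOutput helper, which is not included in this report:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class CountLinesPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("Read lines", TextIO.read().from("<s3 file path>"))
        // Count every line in the file and emit a single Long.
        .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
        // Stand-in for LogOutput: just print the total.
        .apply("Log count", ParDo.of(new DoFn<Long, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            System.out.println("Pcollection count: " + c.element());
          }
        }));
    p.run().waitUntilFinish();
  }
}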

The same code was executed with Beam 2.23.0 + AWS SDK v1 and with Beam 2.48.0 + AWS SDK v2:

Beam 2.23.0 + AWS SDK v1 --> Pcollection count: 78132206
Beam 2.48.0 + AWS SDK v2 --> Pcollection count: 78132208

To eliminate the possibility of a record being corrupted, I deleted the first of the two records that appeared twice. The record following it then appeared twice instead, and the second duplicated record also shifted down to the next one. This implies that something goes wrong during the read at a certain byte offset.

Scenario: records at lines 33554433 and 67108865 appeared twice
Line# item
1
.
.
33554433 abc
.
.
.
67108865 xyz
.
.
78132206

In a second test, when the record at line 33554433 was deleted, the records now at lines 33554433 (previously line 33554434) and 67108865 (previously line 67108866) were read twice.

It seems records are read successfully up to record 33554432, and then record 33554433 is read twice.
Note: 67108865 - 33554433 = 33554432 (= 2^25), so the two duplicates occur exactly 2^25 records apart.

Record#33554433 --> Line 33554433 of 78132206 ; Word 871941986 of 2028601968; Byte 21374174185 of 49770215222
Record#67108865 --> Line 67108865 of 78132206 ; Word 1743499355 of 2028601968; Byte 42748346369 of 49770215222
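To double-check what sits at one of those byte offsets, one could read a small window around it directly; a rough sketch (the offset is taken from the figures above, and the window size is arbitrary):

import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class PeekAtOffset {
  public static void main(String[] args) throws Exception {
    long offset = 21374174185L; // byte offset reported for record 33554433
    try (RandomAccessFile raf = new RandomAccessFile("test_tx_20211101.dat", "r")) {
      raf.seek(Math.max(0, offset - 64)); // back up a little to see the preceding record boundary
      byte[] buf = new byte[256];
      int n = raf.read(buf);
      System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
    }
  }
}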

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@vatanrathi
Author

@mosche @aromanenko-dev Can you please suggest what might be going on here? Or please assign this to someone who can help.

@johnjcasey
Contributor

johnjcasey commented Oct 27, 2023

Is it possible for you to narrow down which Beam version this was introduced in by replaying this against more Beam versions? Identifying the version where this was introduced will greatly help diagnostics.

@vatanrathi
Author

@johnjcasey Beam versions before 2.48 had a performance issue (#25991) with large files, which is stopping me from testing this file on them. I tried running on 2.40, and it is still running after 18 hours. Can you suggest any other way for me to provide the stats you are after?

@johnjcasey
Contributor

I'm trying to understand which Beam version introduced this issue; 2.23 to 2.48 is a very large range of changes to try to debug.

Is it feasible to share this file? I could try to reproduce this with a non-S3 filesystem.

@vatanrathi
Author

I can't share the file as it contains company-sensitive data.
I feel this is related to the AWS SDK v2.

@johnjcasey
Contributor

I assume not, but is it possible to load this file into GCS and try from there, just to isolate the issue?

@vatanrathi
Author

I processed the same file with all versions starting from 2.40 and can confirm that the issue starts appearing from 2.43.0 onwards.
Can we identify the breaking change?

@aromanenko-dev
Contributor

It would be great to narrow down this issue and create a reproducible use case.

@johnjcasey
Contributor

What is the typical width of a line in your file? I'm going to try to generate a scenario that reproduces this.

@johnjcasey
Contributor

Also, I have a commit in 2.43 that is at least a plausible cause for this bug: #23196.

@vatanrathi, are you able to run a custom version of Beam against your file to verify whether this is the faulty commit?

@vatanrathi
Author

Hi @johnjcasey, yes, I should be able to run a test with a custom version.
Please share it.

@johnjcasey
Contributor

Here is the branch with the revert. It is otherwise 2.43.0, so it should be a fair comparison. https://github.com/johnjcasey/beam/tree/2.43.0-no-fileio-improvement

@johnjcasey
Contributor

Also, is it possible to create a file that has the same structure but without private data? I'm trying to brainstorm ways to make it possible for me to debug locally.

@johnjcasey
Contributor

Hi @vatanrathi, did you have any luck with that branch?

@vatanrathi
Author

@johnjcasey I have been busy with something else; I apologise for the delay. I will get it tested over the weekend. Just to confirm, are your changes only in the Java core module?

@vatanrathi
Author

vatanrathi commented Dec 4, 2023

@johnjcasey I can confirm that the duplicate issue is resolved with a custom Java core package built from your branch. So it seems the bug was introduced as part of the change mentioned above. Would you be able to let me know how we can take this further so the fix is available in a mainline Beam release?

@johnjcasey
Contributor

That is good and bad news. Thanks for all your help debugging this. Someone from my team will be looking at this in depth.

It would be incredibly helpful if we could try to reproduce this locally. Is there a way you could create a similar file that reproduces the issue without containing sensitive data?

@vatanrathi
Author

Hi @johnjcasey, is there an update on the bug fix? Unfortunately, I was unable to generate a test file for you.

@johnjcasey
Contributor

We are planning to address it in Q1. We will first need to generate a test file on our end to replicate this, which will be hard. Can you tell us more about your file: how big was it, how many lines, etc.?
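For example, a synthetic file with a similar shape could be generated with something like the sketch below; the line count is taken from the report above, while the output file name, line widths, and payload are made up:

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

public class GenerateCrlfTestFile {
  public static void main(String[] args) throws Exception {
    long lines = 78_132_206L;       // line count from the report
    Random random = new Random(42); // fixed seed so the file is reproducible
    try (BufferedWriter writer =
        Files.newBufferedWriter(Paths.get("test_tx_synthetic.dat"), StandardCharsets.UTF_8)) {
      for (long i = 1; i <= lines; i++) {
        int width = 200 + random.nextInt(800); // variable-width lines, per the report
        StringBuilder sb = new StringBuilder(width + 2);
        sb.append(i).append('|');
        while (sb.length() < width) {
          sb.append((char) ('a' + random.nextInt(26)));
        }
        sb.append("\r\n"); // CRLF record delimiter, as in the original file
        writer.write(sb.toString());
      }
    }
  }
}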

@shunping
Contributor

Could we also know whether the lines have a fixed size or not?

@shunping
Contributor

@vatanrathi To isolate the problem, could you add .withDelimiter(new byte[] { '\r', '\n' }) when calling TextIO.read() and run your pipeline with the original release 2.43.0?
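Applied to the read in the pipeline above, that would look something like this (the S3 path is a placeholder):

input.apply("Read lines",
    TextIO.read()
        .from("<s3 file path>")
        // Treat CRLF explicitly as the record delimiter instead of auto-detecting newlines.
        .withDelimiter(new byte[] {'\r', '\n'}))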

@shunping
Contributor

shunping commented Jan 2, 2024

Hi @vatanrathi, hope you enjoyed the holidays. I am checking to see whether you have tried my suggestion above (#29146 (comment)).

In addition to the questions about the file in use (the size of the file, whether the lines have a fixed size), could you also let us know which runner (e.g. direct runner, Dataflow runner, Flink runner, etc.) you were using to run Beam?

johnjcasey removed their assignment Jan 3, 2024
@vatanrathi
Author

Hi @shunping, I haven't tried your suggested approach as our non-prod AWS systems are down until 8th Jan. I will do it once they are back online and will confirm the results.
Also, it is not a fixed-width file, and the runner is Spark.

@vatanrathi
Author

File size is 49770215222 bytes (49.78 GB). Total records: 78132206, including the header.

@shunping
Contributor

shunping commented Jan 4, 2024

Thanks for your reply! Please keep me posted on how it goes with my suggestion. I am actively working on this issue right now, and I will let you know if I have other questions or find anything out.

@shunping
Contributor

Hi @vatanrathi, did you have a chance to run the pipeline with my suggestion in #29146 (comment)?

Also, could you let me know which version of Spark you are running the pipeline on?

@vatanrathi
Author

Hi @shunping, yes, I tried adding the suggested delimiter, but my pipeline failed to generate any output Parquet files after adding it.
I am running Spark 3.1.1.

@shunping
Contributor

shunping commented Jan 11, 2024

Hi @vatanrathi, I think there could be a problem when we split the file, but I need some more info from you to confirm my guess.

Are you able to run Beam from my branch https://github.com/shunping/beam/tree/add-log-to-2.43.0 (where I added a logging message in FileBasedSource) and send me back some (if not all) of the output logs? At a minimum, I would need the log messages that contain "Beam-issue-29146", "split", or "Split" to further debug the issue.

Thanks for your help!

@vatanrathi
Author

Hi @shunping, I got the required logs, but I am unable to share them; my org doesn't allow any kind of data to be shared externally, including on GitHub. Is there specific information you are after that I could check for in the logs?
I know that's not very helpful.

@shunping
Contributor

shunping commented Jan 13, 2024

@vatanrathi, no worries! That's totally understandable.

Meanwhile, I was able to reproduce the issue in my environment, and I have a fix for you to try.
Could you run your pipeline with Beam from this branch: https://github.com/shunping/beam/tree/fix-dup-rows-2.43?

Thanks!

@vatanrathi
Author

@shunping That's great! I will get it tested today and confirm.

@vatanrathi
Author

@shunping I tested the core module built from your branch, and it returns the correct result without any duplicates.

@shunping
Contributor

Great! Thanks for verifying the fix, @vatanrathi. I will push it to the master branch later this week.
