dataset-serialize benchmark is failing with segfaults #166

Closed
austin3dickey opened this issue Oct 23, 2023 · 11 comments · Fixed by voltrondata-labs/benchmarks#152
Comments

@austin3dickey
Contributor

More details to come.

@austin3dickey
Contributor Author

The only machine that runs this is ursa-i9-9960x. Here is a build link.

231022-13:22:15.225 INFO: Initializing adapter
231022-13:22:15.255 INFO: source nyctaxi_multi_parquet_s3: download, if required
231022-13:22:15.263 INFO: constructed Dataset object for source in 0.0066 s
231022-13:22:15.263 INFO: case ('1pc', 'parquet'): create directory
231022-13:22:15.263 INFO: directory created, path: /dev/shm/bench-cd80377a/1pc-parquet-eed70f79-5ee5-4156-b932-79bdf3b754d0
231022-13:22:15.263 INFO: read 561000 rows of dataset nyctaxi_multi_parquet_s3 into memory
231022-13:22:15.528 INFO: read source dataset into memory in 0.2645 s
231022-13:22:20.250 INFO: try to perform login
231022-13:22:20.250 INFO: try: POST to https://conbench.ursa.dev/api/login/
231022-13:22:20.536 INFO: POST request to https://conbench.ursa.dev/api/login/: took 0.2858 s, response status code: 204
231022-13:22:20.536 INFO: ConbenchClient: initialized
231022-13:22:20.536 INFO: try: POST to https://conbench.ursa.dev/api/benchmark-results/
231022-13:22:20.607 INFO: POST request to https://conbench.ursa.dev/api/benchmark-results/: took 0.0712 s, response status code: 201
231022-13:22:20.671 INFO: stdout of ['du', '-sh', '/dev/shm/bench-cd80377a/1pc-parquet-eed70f79-5ee5-4156-b932-79bdf3b754d0']: 20M
231022-13:22:20.672 INFO: removing directory: /dev/shm/bench-cd80377a/1pc-parquet-eed70f79-5ee5-4156-b932-79bdf3b754d0
231022-13:22:20.674 INFO: case ('1pc', 'arrow'): create directory
231022-13:22:20.674 INFO: directory created, path: /dev/shm/bench-cd80377a/1pc-arrow-b8bced17-2cee-4bca-83d4-534d23e2f468
231022-13:22:20.674 INFO: read 561000 rows of dataset nyctaxi_multi_parquet_s3 into memory
231022-13:22:20.814 INFO: read source dataset into memory in 0.1400 s
Fatal Python error: Segmentation fault

Interestingly, sometimes one or two cases succeed before the segfault, and sometimes none of them do.
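Because the number of cases completed before the crash varies, it helps to pull the in-progress case out of each build log. The following is a minimal sketch, assuming the log format shown above; the function name `case_at_crash` and the regular expression are illustrative and not part of the benchmark code.

```python
import re

# Matches lines such as:
# "231022-13:22:20.674 INFO: case ('1pc', 'arrow'): create directory"
CASE_RE = re.compile(r"INFO: case \((?P<case>[^)]*)\): create directory")


def case_at_crash(log_text):
    """Return the last case started before a fatal error.

    Returns None if the log contains no fatal error, or if the fatal
    error appears before any case was started.
    """
    current_case = None
    for line in log_text.splitlines():
        match = CASE_RE.search(line)
        if match:
            # Remember the most recently started case.
            current_case = match.group("case")
        elif line.startswith("Fatal Python error"):
            return current_case
    return None


# A shortened copy of the log above.
sample_log = """\
231022-13:22:15.263 INFO: case ('1pc', 'parquet'): create directory
231022-13:22:15.528 INFO: read source dataset into memory in 0.2645 s
231022-13:22:20.674 INFO: case ('1pc', 'arrow'): create directory
231022-13:22:20.814 INFO: read source dataset into memory in 0.1400 s
Fatal Python error: Segmentation fault
"""

print(case_at_crash(sample_log))
```

For the log above this reports the `('1pc', 'arrow')` case, which crashed right after the source dataset was read into memory.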

@austin3dickey
Contributor Author

Here's a breakdown of the number of successful dataset-serialize results per run this month:

       run_timestamp        |              run_id              | num_results
----------------------------+----------------------------------+-------------
 2023-10-01 20:47:05.27222  | 1653cbab792a4905950da7a357c27aab |          24
 2023-10-01 23:41:55.081723 | 1e2c0d5208784aa9a98e28fd387a8c67 |          24
 2023-10-02 02:37:15.753681 | 08be4f7cab094940b1c8c31ca9e902c4 |          24
 2023-10-02 05:34:42.301889 | 497dc6271cc541abbd2adec51695259d |          24
 2023-10-02 15:48:42.227189 | 1077a66e57a74edfbea848d528262f86 |          24
 2023-10-03 08:28:15.839898 | 6ca857816880414aa3ab96e2daf0860d |          24
 2023-10-03 14:49:27.261457 | c99fb3bbd61d429d9af8510254f6a8e1 |          24
 2023-10-03 20:20:28.55793  | a1e3d4d07c28450e88e3eba64f420707 |          24
 2023-10-03 23:11:30.643553 | d6530cb0f1cf41b8b1874677ef3ab37a |          24
 2023-10-04 11:00:46.131435 | 3564dbe69233453f8970fd6127ca222b |          24
 2023-10-04 16:00:31.168945 | 7deb05ad67484f16bd92545e79d91f90 |          24
 2023-10-05 08:23:45.134169 | aa5c53940d2942bbad57dad7eafc7e7f |          24
 2023-10-05 11:21:23.874677 | 980a34c6cdb7424189e0de6ca2924057 |          24
 2023-10-05 14:15:26.585407 | 7131e14847454a49a5bbd2cb428e3e67 |          24
 2023-10-05 17:28:01.860131 | 9fe656d750ae4a5c8d8236eed69d7f2b |          24
 2023-10-05 20:19:41.701115 | 8a68c0b0b8ce41db8631e8e236af9235 |          24
 2023-10-05 23:26:28.99192  | f4d6b6343f7a436ba3894f871a524b0c |          24
 2023-10-06 02:22:16.671595 | 8016d5c3fc3e413ca0e8ab7f7bb52e1c |          24
 2023-10-06 05:28:59.872214 | 141651da421049f3b1669d7a3f5a4d88 |          24
 2023-10-06 08:24:59.470588 | 69103244469d4b01bbe493f39194a3fb |          24
 2023-10-06 11:20:45.434535 | c462c49764b5442fa3ac1af671f88b08 |          24
 2023-10-06 14:29:45.198447 | a2af089b4fa1421cbed1d59b18694a87 |          24
 2023-10-06 17:24:26.104814 | 200947668e5c41a79243aaf31ab42380 |          24
 2023-10-06 20:07:02.596901 | 5627438887054b109842daf9287b4332 |           0
 2023-10-06 22:50:22.449426 | 0ecb0e4f83d0406593087e08c0b61fb3 |           1
 2023-10-07 02:57:22.516069 | f6ccd1c4b5fb41c396811b76eef3902d |           0
 2023-10-07 04:32:45.674441 | 36e00707836945e188083a5ebc84f3f6 |           1
 2023-10-07 23:33:45.35763  | cb02ebcfc6d54f649db4dbae0f86c442 |           2
 2023-10-08 22:09:50.399974 | 58f58379c6c545a0b8d4a82e7d894626 |           1
 2023-10-10 01:12:00.395274 | e79336051c534dd0be81d1280d462968 |           1
 2023-10-10 04:07:00.964479 | 6bb312c6084840ef97269ae527a759d1 |           0
 2023-10-10 06:09:36.389056 | 309cc226fc3d4322baaf69faff63c956 |           1
 2023-10-10 08:55:24.822424 | a91a6a2060c545f8b13ad0fe2e082475 |           0
 2023-10-10 11:04:34.424189 | b0b0939418b14986a79b32547489c6a1 |           0
 2023-10-10 13:31:12.50375  | f16770c6bc95419ba5011d10ea3a974d |           0
 2023-10-10 16:07:28.651402 | 870f5c2b5b7c427fbc8a07c8708c2434 |           1
 2023-10-10 18:33:16.425559 | 7fe710688b4d47f6a3932bfab9c599f7 |           1
 2023-10-10 23:12:45.442182 | adf798f0306b44ca902b24e95182072b |           0
 2023-10-11 01:06:54.330537 | e053efe4fccb48d8923ca49f22aa3928 |           0
 2023-10-11 02:51:32.430454 | c985261e64e441fbb2d8da74ee0a271e |           1
 2023-10-11 05:23:57.397817 | 3cedb97c4d0e4ca48c392562d7822c71 |           0
 2023-10-11 07:53:29.72254  | 517c2fb5d61645c0937a4a80e4c02ab1 |           1
 2023-10-11 10:22:12.897347 | 0a40322ac5ad4cc69b1e2da555cc9884 |           1
 2023-10-11 13:45:02.667613 | 05f2a40424984d62808b6600a7be60e8 |           0
 2023-10-11 15:23:16.871397 | 2dcd6824d73c4e238bbc36a77578eafc |           2
 2023-10-11 17:51:40.61575  | 94b8fa73017b4526bb66c179a67c97f5 |           1
 2023-10-11 20:21:02.843915 | 4a441cef677240cfbc593b858d5e6fc5 |           1
 2023-10-11 22:52:02.991529 | da5bc2af4a9e4e849115de5f872e1a43 |           2
 2023-10-12 01:23:48.596752 | 4912265bd177431e9ccab6a865979bc2 |           0
 2023-10-12 04:56:52.251901 | db4cf82bc3e1491ba1500984e727eb02 |           0
 2023-10-12 06:17:58.084406 | 2fd42faf48404478affd58f9bcfbb26e |           1
 2023-10-12 08:52:12.650021 | bb99d7aba6ce4bd79c1b9bbf0ace732f |           2
 2023-10-12 11:23:30.072888 | 8538c42b61e24b8f897eecba8ab51084 |           2
 2023-10-12 14:35:46.545025 | 461fd9038239478199c4ce554783b107 |           0
 2023-10-12 17:12:39.307677 | e09f15cd85d8455f80562dbb91f37389 |           0
 2023-10-12 18:51:16.012983 | 2f6706c923c144ae89eddf11447734bc |           0
 2023-10-12 21:24:05.360257 | 4b879813b1fd466cac3bd9a42b5f0eff |           0
 2023-10-12 23:52:24.865696 | 810555452bb7453eb3637d74cdac4f05 |           1
 2023-10-13 02:24:02.374777 | 6b2f3f65be194fa8aac8c854d4491958 |           0
 2023-10-13 04:53:12.668694 | e9c7b12f180944fd9339d1acc89f27dc |           1
 2023-10-13 07:19:54.560573 | 85c7fdd868d04e6aa61c898cf5a9f3a5 |           1
 2023-10-13 10:33:45.433656 | d8e1f73c854e4d2fa7dcc67f9b28dc53 |           0
 2023-10-13 13:22:57.123444 | e6c83373db21434f940879904d46fbfc |           0
 2023-10-13 15:01:25.231645 | ab8fb74aed7741aba3d2ea6633c060d2 |           1
 2023-10-13 18:30:03.608629 | 0122c1ead02d439f97e113243f87d337 |           0
 2023-10-13 21:46:14.986946 | 7d324417a2f14070bb6a05ce184ae250 |           0
 2023-10-14 00:40:17.52854  | 5f5c93257e234f21a40a6d0f875a9cc9 |           0
 2023-10-14 02:10:36.812122 | 8b67a831fb3947c599bc638ba6b2269a |           2
 2023-10-16 10:04:47.249041 | 5635a683e0b94470bcf0b3ce35ec1f9b |           1
 2023-10-16 12:52:22.633604 | 6b2c2f77b61a4a39a6b33f834edd6b5d |           1
 2023-10-16 17:22:58.089709 | f4fc993a105944aeabe04a56d6fc3a9e |           0
 2023-10-16 18:09:30.601348 | 75f834400473428a87ff70013ccc684f |           0
 2023-10-17 01:42:14.372681 | c5d83122840d4afc827dac61c8f7df22 |           1
 2023-10-17 05:01:10.351914 | f3a18df396fa485ba6cf49231fae70fd |           0
 2023-10-17 08:44:35.353271 | fa505501b1284dcd9ce7a347decbff54 |           0
 2023-10-17 16:31:06.679952 | 56a088bf01764184bf34d59c108ee4e0 |           1
 2023-10-17 20:40:37.481827 | fc2b2170fda04f568c7e5c9a47e7de95 |           0
 2023-10-18 09:35:43.28765  | b57227ca04d64bcaa63b3a311b6f6743 |           0
 2023-10-18 11:12:48.319844 | 7f6271767bf741bf91ea38ae207b7cea |           0
 2023-10-18 14:57:54.460282 | 0ff935e33bc04e8e9d833f99187b8a72 |           0
 2023-10-18 16:10:27.399072 | d379f727daef47d0b14f7647e3ef089b |           1
 2023-10-19 02:44:17.844343 | 03302b3b966049a6ace506ccdb307395 |           1
 2023-10-19 11:32:19.094086 | 8660f5699ed84c58a5b316432271d9a0 |           0
 2023-10-19 13:53:01.875819 | 04f8720de29146db9a50344477afe4cc |           1
 2023-10-19 17:35:08.017099 | 6b5cc6e3f4ac4b45bca66184ee8487ca |           1
 2023-10-19 20:03:41.026884 | aeca00a62f38400baa34aad1782ffd5d |           3
 2023-10-19 22:33:13.40542  | ea9f57e2e9444fc8ab74ca163a2ef4d6 |           1
 2023-10-20 09:51:30.247996 | b97ffbffbe28497d9cda75b124bfb704 |           1
 2023-10-22 18:22:19.403434 | 779a94ec29b649c29c5e4d1be968d6e0 |           1
 2023-10-23 15:02:29.371518 | 9acfb5dd28cc48bcb5bcf57d6ba0cdf7 |           0
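The first run that fell short of 24 results can be picked out of a table like this mechanically. This is a rough sketch, assuming the psql-style layout above and rows already sorted by timestamp; `parse_runs` and `first_incomplete_run` are hypothetical helper names, and the sample reuses three rows from the table.

```python
def parse_runs(table_text):
    """Parse psql-style rows into (timestamp, run_id, num_results) tuples."""
    runs = []
    for line in table_text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        # Data rows have three columns and a numeric count; the header
        # and the "----+----" separator line fail this check.
        if len(parts) == 3 and parts[2].isdigit():
            runs.append((parts[0], parts[1], int(parts[2])))
    return runs


def first_incomplete_run(runs, expected=24):
    """Return the first run with fewer than `expected` results, or None."""
    for run in runs:
        if run[2] < expected:
            return run
    return None


sample_table = """\
       run_timestamp        |              run_id              | num_results
----------------------------+----------------------------------+-------------
 2023-10-06 17:24:26.104814 | 200947668e5c41a79243aaf31ab42380 |          24
 2023-10-06 20:07:02.596901 | 5627438887054b109842daf9287b4332 |           0
 2023-10-06 22:50:22.449426 | 0ecb0e4f83d0406593087e08c0b61fb3 |           1
"""

print(first_incomplete_run(parse_runs(sample_table)))
```

The run ID it finds can then be looked up on Conbench to see which Arrow commit the run was built from.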

@austin3dickey
Contributor Author

The first run without 24 results was this one:
https://conbench.ursa.dev/runs/5627438887054b109842daf9287b4332/

On commit apache/arrow@d7017dd, which has the message GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files (#37854). Interesting!

@austin3dickey changed the title from "dataset-serialize benchmark is failing" to "dataset-serialize benchmark is failing with segfaults" on Oct 23, 2023

@austin3dickey
Contributor Author

@jorisvandenbossche It looks like the dataset-serialize benchmark started segfaulting after apache/arrow#37854 was merged. Do you think we'll need to make changes to how the benchmark is run or is there something that needs to be fixed on the Arrow side?

@austin3dickey
Contributor Author

We could probably just set pre_buffer=False. Or someone could research whether there's a way to consistently avoid the segfault (which I'm assuming is memory-related? not quite sure) even with pre_buffer=True.

I think it depends on what the Arrow community wants to actually be measuring here. For instance, it may not make sense to compare the benchmark timings measured with and without pre_buffer.

@jorisvandenbossche

> We could probably just set pre_buffer=False

We could do that short-term to get the benchmark working again. The benchmark is actually about writing if I am reading it correctly, and so it segfaults in the setup, thus changing this won't impact the actual benchmark.

(although it is a bit strange that it still logs the timing info after reading)

But the change that started this (pre_buffer default change) should not cause a segfault. If that is happening, that's a critical bug, and something we should still try to reproduce outside of the benchmarks.

We did have some crashes on the main Arrow CI as well after merging that PR, but those were fixed with apache/arrow#38073

@austin3dickey
Contributor Author

Okay, I opened apache/arrow#38438. I'll try to see if using pre_buffer=False fixes the problem.

@austin3dickey
Contributor Author

I was able to avoid the segfault locally by setting pre_buffer=False in voltrondata-labs/benchmarks#152. Once I merge that, this issue can be closed.

Like you said though, apache/arrow#38438 seems like a critical bug.
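For readers unfamiliar with the option, here is one way to read a Parquet dataset with pre-buffering turned off in PyArrow. It is only a sketch and not necessarily the change made in voltrondata-labs/benchmarks#152; the function name `read_without_pre_buffer` is hypothetical, and the whole dataset is loaded with `to_table()` for simplicity.

```python
def read_without_pre_buffer(path):
    """Read a Parquet dataset into memory with pre-buffering disabled."""
    # Imported inside the function so this module can still be loaded
    # on machines where pyarrow is not installed.
    import pyarrow.dataset as ds

    # pre_buffer is a Parquet fragment scan option; passing it as the
    # format's default applies it to every scan of this dataset.
    scan_options = ds.ParquetFragmentScanOptions(pre_buffer=False)
    parquet_format = ds.ParquetFileFormat(
        default_fragment_scan_options=scan_options
    )
    dataset = ds.dataset(path, format=parquet_format)
    return dataset.to_table()
```

Setting the option explicitly also keeps the benchmark's setup step unaffected by future changes to the library default.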

@jorisvandenbossche

Thanks for opening the issue! Will try to further look into that tomorrow.

@mapleFU

mapleFU commented Oct 25, 2023

I've tried to fix it here: apache/arrow#38466

I'm not sure this really fixes the bug, but you can give it a try.
