CI: Debug timeouts #55687
Conversation

mroeschke commented on Oct 25, 2023 (edited):
- The Python dev jobs look to be timing out
- Sometimes, the Ubuntu single CPU jobs are timing out
The recently un-skipped pyarrow parser tests are a likely candidate |
xref #51409. I never figured out how to access PYTEST_CURRENT_TEST when a freeze/timeout occurs, but it seems related |
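One possible way to surface PYTEST_CURRENT_TEST when a job hangs is a conftest-level signal handler. This is only a sketch, assuming the runner delivers SIGTERM on cancellation; it may not fire if the hang is inside a C call that never releases the GIL:

import faulthandler
import os
import signal
import sys

def _dump_current_test(signum, frame):
    # pytest exports the currently running test id in this environment variable.
    print("PYTEST_CURRENT_TEST=" + str(os.environ.get("PYTEST_CURRENT_TEST")), file=sys.stderr)
    # Also dump all thread tracebacks so the hanging frame is visible.
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)

# Hypothetical conftest.py hook-up (not something the pandas CI currently does).
signal.signal(signal.SIGTERM, _dump_current_test)
signal.signal(signal.SIGUSR1, _dump_current_test)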
pyarrow isn't installed on the Python dev jobs. I suspect NEP 50 is the cause (Matt, did you ever get your old PR working?), or something else changing on the latest numpy dev. Not sure why the actual numpy-dev build isn't failing, though (maybe a cache issue?). You can see a huge number of Fs (failures) in the log before the job stops. |
Ah, good point. No, I never addressed the NEP 50 failures, but that could be the issue, as other libraries are starting to see NEP 50 being activated #48867 (comment) |
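For context, the kind of change NEP 50 brings is that scalar promotion stops being value-based, which can flip result dtypes and values that tests assert on. A minimal illustration (not from this thread), assuming a NumPy build with the new weak promotion enabled:

import numpy as np

# Legacy value-based promotion: the Python float 1e300 forces an upcast,
# so the result is a float64 equal to 1e300.
# NEP 50 weak promotion: the Python float defers to the float32 dtype,
# so the result stays float32 and overflows to inf (with a RuntimeWarning).
res = np.float32(1.0) + 1e300
print(res, res.dtype)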
Sorry for pushing to your branch. |
No problem push as needed! |
Log from a test that's timing out on the single CPU job:

2023-10-26T04:06:36.0274702Z pandas/tests/io/parser/usecols/test_parse_dates.py::test_usecols_with_parse_dates[pyarrow-usecols1] XFAIL
2023-10-26T05:30:32.0440504Z ##[error]The operation was canceled. |
Can anyone reproduce this locally? |
I tried reproducing the Ubuntu 3.11 Single CPU failures locally on my macOS machine, but I couldn't reproduce the timeout. It appears it's timing out around the pyarrow CSV tests. |
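One way to narrow this down locally is to let pytest's built-in faulthandler plugin dump tracebacks for any test that exceeds a time budget. A sketch, assuming the single_cpu marker selects the relevant tests (as in the CI job) and that pytest-xdist is installed and needs disabling:

import sys
import pytest

# Run the parser tests serially; if any single test runs longer than 300s,
# pytest's faulthandler plugin dumps all thread tracebacks to stderr, which
# should identify the hanging test and frame.
sys.exit(
    pytest.main(
        [
            "pandas/tests/io/parser",
            "-m", "single_cpu",
            "-p", "no:xdist",
            "-o", "faulthandler_timeout=300",
            "-v",
        ]
    )
)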
Example of the logs where the stall happens:

2023-10-31T18:14:24.1476884Z pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-13/13/2019]
2023-10-31T18:14:24.1481880Z SETUP F configure_tests
2023-10-31T18:14:24.1483449Z SETUP F all_parsers['pyarrow']
2023-10-31T18:14:24.1485985Z SETUP F pyarrow_xfail
2023-10-31T18:14:24.1490063Z SETUP F date_string['13/13/2019']
2023-10-31T18:14:24.4276588Z pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-13/13/2019] (fixtures used: all_parsers, configure_tests, date_string, pyarrow_xfail, request)XFAIL
2023-10-31T18:14:24.4282882Z TEARDOWN F date_string['13/13/2019']
2023-10-31T18:14:24.4284062Z TEARDOWN F pyarrow_xfail
2023-10-31T18:14:24.4285135Z TEARDOWN F all_parsers['pyarrow']
2023-10-31T18:14:24.4290439Z TEARDOWN F configure_tests
2023-10-31T18:14:24.4298463Z pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-13/2019]
2023-10-31T18:14:24.4304013Z SETUP F configure_tests
2023-10-31T18:14:24.4305717Z SETUP F all_parsers['pyarrow']
2023-10-31T18:14:24.4308194Z SETUP F pyarrow_xfail
2023-10-31T18:14:24.4312410Z SETUP F date_string['13/2019']
2023-10-31T18:14:24.7108009Z pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-13/2019] (fixtures used: all_parsers, configure_tests, date_string, pyarrow_xfail, request)XFAIL
2023-10-31T18:14:24.7113099Z TEARDOWN F date_string['13/2019']
2023-10-31T18:14:24.7114685Z TEARDOWN F pyarrow_xfail
2023-10-31T18:14:24.7115815Z TEARDOWN F all_parsers['pyarrow']
2023-10-31T18:14:24.7120264Z TEARDOWN F configure_tests
2023-10-31T18:14:24.7130084Z pandas/tests/io/parser/test_parse_dates.py::test_invalid_parse_delimited_date[pyarrow-a3/11/2018]
2023-10-31T18:14:24.7135875Z SETUP F configure_tests
2023-10-31T18:14:24.7137102Z SETUP F all_parsers['pyarrow']
2023-10-31T18:14:24.7139492Z SETUP F pyarrow_xfail
2023-10-31T18:14:24.7143338Z SETUP F date_string['a3/11/2018']
2023-10-31T19:48:18.0840443Z ##[error]The operation was canceled.

Interestingly, there seems to be a small delay between the end of the setup and the running of the tests. |
Not sure about the status of this PR, but I'm fine with just skipping the hanging tests or marking all pyarrow CSV tests as single CPU, if we can't make the timeout go away. |
Yeah I'm still trying to narrow down which test is hanging (so far there doesn't seem to be a consistent offender)
They are already marked as single cpu 😢 |
FWIW, I think using xfail is also not necessarily the best approach for some of those pyarrow CSV tests.
The first PR also has a comment that was since removed, so it seems we experienced this in the past as well. |
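One reason xfail can be the wrong tool here is that an xfail-marked test still executes its body, so a hang inside the pyarrow engine stalls the job before the XFAIL outcome is ever recorded; a skip never enters the test at all. A rough sketch of the two approaches as conftest fixtures (hypothetical; the real pandas fixtures, including the pyarrow_xfail seen in the logs above, may differ), where all_parsers stands for the existing parser fixture from those logs:

import pytest

@pytest.fixture
def pyarrow_xfail(request, all_parsers):
    # xfail: the test body still runs, so a hang in the pyarrow engine
    # stalls the job before the XFAIL outcome is reported.
    if all_parsers.engine == "pyarrow":
        request.applymarker(pytest.mark.xfail(reason="pyarrow engine does not support this case"))

@pytest.fixture
def pyarrow_skip(all_parsers):
    # skip: the test body never runs, so the hanging code path is avoided.
    if all_parsers.engine == "pyarrow":
        pytest.skip("pyarrow engine hangs / does not support this case")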
Based on your hint of hanging on reading an empty file, and the fact that we run our tests in parallel, I think I might have a reproducer:

import pandas as pd
from io import StringIO
from concurrent.futures import ThreadPoolExecutor

data = "x,y,z"

def read_csv_pyarrow(i):
    try:
        pd.read_csv(StringIO(data), engine="pyarrow")
    except:
        pass
    print(i)
    return i

with ThreadPoolExecutor(4) as e:
    list(e.map(read_csv_pyarrow, range(4)))

gives:
Will report this to Arrow then! |
Note that I have a sketchy BytesIOWrapper class here that converts StringIO -> BytesIO, which might also be part of the problem (lines 1093 to 1119 at 09ed69e). |
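For readers without the permalink handy, the wrapper is roughly of this shape: it re-encodes reads from a text buffer into bytes so the pyarrow CSV reader, which expects a binary file object, can consume a StringIO. This is a simplified sketch, not the exact pandas implementation at the referenced lines:

from io import StringIO

class BytesIOWrapper:
    # Simplified sketch of a StringIO -> bytes adapter; the real pandas class
    # at the permalinked lines handles more edge cases.
    def __init__(self, buffer: StringIO, encoding: str = "utf-8") -> None:
        self.buffer = buffer
        self.encoding = encoding
        # Bytes from a previous read that exceeded the requested size.
        self.overflow = b""

    def read(self, n: int = -1) -> bytes:
        bytestring = self.buffer.read(n).encode(self.encoding)
        # Reading n *characters* can encode to more than n *bytes*; return
        # exactly n bytes and keep the remainder for the next call.
        combined = self.overflow + bytestring
        if n is None or n < 0 or n >= len(combined):
            self.overflow = b""
            return combined
        self.overflow = combined[n:]
        return combined[:n]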
I adapted the reproducer to pure pyarrow using BytesIO, and that also hangs sometimes, so it seems to be an issue on that side as well. |
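For reference, the pandas-free variant looks roughly like this; a sketch of what adapting it to BytesIO means, not necessarily the exact snippet reported upstream:

from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

from pyarrow import csv

data = b"x,y,z"

def read_csv_pyarrow(i):
    # Read a header-only CSV through pyarrow directly; tolerate any per-read
    # error, mirroring the pandas snippet above.
    try:
        csv.read_csv(BytesIO(data))
    except Exception:
        pass
    print(i)
    return i

with ThreadPoolExecutor(4) as e:
    list(e.map(read_csv_pyarrow, range(4)))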
Here is some info from helgrind that might be of use:

==87545== Possible data race during write of size 4 at 0x642320 by thread #1
==87545== Locks held: none
==87545== at 0x399780: PyMem_SetAllocator (obmalloc.c:544)
==87545== by 0x398844: pymem_set_default_allocator (obmalloc.c:251)
==87545== by 0x3E7C35: _PyRuntimeState_Fini (pystate.c:165)
==87545== by 0x3C6B52: UnknownInlinedFun (pylifecycle.c:227)
==87545== by 0x3C6B52: Py_FinalizeEx (pylifecycle.c:2027)
==87545== by 0x3D2A4F: Py_RunMain (main.c:682)
==87545== by 0x398006: Py_BytesMain (main.c:734)
==87545== by 0x49A40CF: (below main) (libc_start_call_main.h:58)
==87545== Address 0x642320 is 0 bytes inside data symbol "_PyMem_Raw"
==87545==
==87545== ----------------------------------------------------------------
==87545==
==87545== Lock at 0x7E803A68 was first observed
==87545== at 0x4852F6B: pthread_mutex_init (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==87545== by 0x6CDD46EF: je_arrow_private_je_malloc_mutex_init (mutex.c:171)
==87545== by 0x6CDE1A0A: je_arrow_private_je_ecache_init (ecache.c:9)
==87545== by 0x6CDD626C: je_arrow_private_je_pac_init (pac.c:49)
==87545== by 0x6CDD4B84: je_arrow_private_je_pa_shard_init (pa.c:43)
==87545== by 0x6CD83A76: je_arrow_private_je_arena_new (arena.c:1648)
==87545== by 0x6CD7C708: je_arrow_private_je_arena_choose_hard (jemalloc.c:582)
==87545== by 0x6CDDF6BA: arena_choose_impl (jemalloc_internal_inlines_b.h:46)
==87545== by 0x6CDDF6BA: arena_choose_impl (jemalloc_internal_inlines_b.h:32)
==87545== by 0x6CDDF6BA: arena_choose (jemalloc_internal_inlines_b.h:88)
==87545== by 0x6CDDF6BA: je_arrow_private_je_tsd_tcache_data_init (tcache.c:740)
==87545== by 0x6CDDF967: je_arrow_private_je_tsd_tcache_enabled_data_init (tcache.c:644)
==87545== by 0x6CDE12E1: tsd_data_init (tsd.c:244)
==87545== by 0x6CDE12E1: je_arrow_private_je_tsd_fetch_slow (tsd.c:311)
==87545== by 0x6CD7CE74: tsd_fetch_impl (tsd.h:422)
==87545== by 0x6CD7CE74: tsd_fetch (tsd.h:448)
==87545== by 0x6CD7CE74: imalloc (jemalloc.c:2681)
==87545== by 0x6CD7CE74: je_arrow_mallocx (jemalloc.c:3424)
==87545== by 0x6C0C346C: arrow::memory_pool::internal::JemallocAllocator::AllocateAligned(long, long, unsigned char**) (in /home/willayd/mambaforge/envs/pandas-dev/lib/libarrow.so.1400)
==87545== Address 0x7e803a68 is in a rw- anonymous segment
==87545==
==87545== Possible data race during write of size 8 at 0x7E806040 by thread #24
==87545== Locks held: 1, at address 0x7E803A68
==87545== at 0x6CDC8891: atomic_store_zu (atomic.h:93)
==87545== by 0x6CDC8891: je_arrow_private_je_eset_insert (eset.c:109)
==87545== by 0x6CDCB1C8: extent_deactivate_locked_impl (extent.c:256)
==87545== by 0x6CDCB1C8: extent_deactivate_locked (extent.c:263)
==87545== by 0x6CDCB1C8: je_arrow_private_je_extent_record (extent.c:950)
==87545== by 0x6CDD5909: pac_dalloc_impl (pac.c:277)
==87545== by 0x6CD81CB2: je_arrow_private_je_arena_slab_dalloc (arena.c:570)
==87545== by 0x6CDDD6E5: tcache_bin_flush_impl (tcache.c:477)
==87545== by 0x6CDDD6E5: tcache_bin_flush_bottom (tcache.c:519)
==87545== by 0x6CDDD6E5: je_arrow_private_je_tcache_bin_flush_small (tcache.c:529)
==87545== by 0x6CDDE364: tcache_flush_cache (tcache.c:790)
==87545== by 0x6CDDE9BE: tcache_destroy.constprop.0 (tcache.c:809)
==87545== by 0x6CDE116B: tsd_do_data_cleanup (tsd.c:382)
==87545== by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:408)
==87545== by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:388)
==87545== by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:73)
==87545== by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:22)
==87545== by 0x4A1393F: start_thread (pthread_create.c:455)
==87545== by 0x4AA42E3: clone (clone.S:100)
==87545== Address 0x7e806040 is in a rw- anonymous segment
==87545==
==87545== ----------------------------------------------------------------
==87545==
==87545== Possible data race during read of size 1 at 0x6F41677C by thread #24
==87545== Locks held: none
==87545== at 0x6CD81459: atomic_load_b (atomic.h:89)
==87545== by 0x6CD81459: background_thread_indefinite_sleep (background_thread_inlines.h:45)
==87545== by 0x6CD81459: arena_background_thread_inactivity_check (arena.c:207)
==87545== by 0x6CD81459: je_arrow_private_je_arena_handle_deferred_work (arena.c:223)
==87545== by 0x6CD81CE2: je_arrow_private_je_arena_slab_dalloc (arena.c:572)
==87545== by 0x6CDDD6E5: tcache_bin_flush_impl (tcache.c:477)
==87545== by 0x6CDDD6E5: tcache_bin_flush_bottom (tcache.c:519)
==87545== by 0x6CDDD6E5: je_arrow_private_je_tcache_bin_flush_small (tcache.c:529)
==87545== by 0x6CDDE364: tcache_flush_cache (tcache.c:790)
==87545== by 0x6CDDE9BE: tcache_destroy.constprop.0 (tcache.c:809)
==87545== by 0x6CDE116B: tsd_do_data_cleanup (tsd.c:382)
==87545== by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:408)
==87545== by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:388)
==87545== by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:73)
==87545== by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:22)
==87545== by 0x4A1393F: start_thread (pthread_create.c:455)
==87545== by 0x4AA42E3: clone (clone.S:100)
==87545== Address 0x6f41677c is in a rw- anonymous segment
==87545==
==87545== ----------------------------------------------------------------
==87545==
==87545== Lock at 0x6F403AE8 was first observed
==87545== at 0x4852F6B: pthread_mutex_init (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==87545== by 0x6CDD46EF: je_arrow_private_je_malloc_mutex_init (mutex.c:171)
==87545== by 0x6CDE1A0A: je_arrow_private_je_ecache_init (ecache.c:9)
==87545== by 0x6CDD626C: je_arrow_private_je_pac_init (pac.c:49)
==87545== by 0x6CDD4B84: je_arrow_private_je_pa_shard_init (pa.c:43)
==87545== by 0x6CD83A76: je_arrow_private_je_arena_new (arena.c:1648)
==87545== by 0x6CD7A254: je_arrow_private_je_arena_init (jemalloc.c:443)
==87545== by 0x6CD7BBAC: malloc_init_hard_a0_locked (jemalloc.c:1885)
==87545== by 0x6CD7BE70: malloc_init_hard (jemalloc.c:2129)
==87545== by 0x400536D: call_init.part.0 (dl-init.c:90)
==87545== by 0x4005472: call_init (dl-init.c:136)
==87545== by 0x4005472: _dl_init (dl-init.c:137)
==87545== by 0x4001561: _dl_catch_exception (dl-catch.c:211)
==87545== Address 0x6f403ae8 is in a rw- anonymous segment
==87545==
==87545== Possible data race during write of size 8 at 0x6F4060C0 by thread #24
==87545== Locks held: 1, at address 0x6F403AE8
==87545== at 0x6CDC8891: atomic_store_zu (atomic.h:93)
==87545== by 0x6CDC8891: je_arrow_private_je_eset_insert (eset.c:109)
==87545== by 0x6CDCB1C8: extent_deactivate_locked_impl (extent.c:256)
==87545== by 0x6CDCB1C8: extent_deactivate_locked (extent.c:263)
==87545== by 0x6CDCB1C8: je_arrow_private_je_extent_record (extent.c:950)
==87545== by 0x6CDD5909: pac_dalloc_impl (pac.c:277)
==87545== by 0x6CDD38F9: large_dalloc_finish_impl (large.c:253)
==87545== by 0x6CDD38F9: je_arrow_private_je_large_dalloc (large.c:273)
==87545== by 0x6CDDECE7: arena_dalloc_large_no_tcache (arena_inlines_b.h:253)
==87545== by 0x6CDDECE7: arena_dalloc_no_tcache (arena_inlines_b.h:276)
==87545== by 0x6CDDECE7: arena_dalloc (arena_inlines_b.h:308)
==87545== by 0x6CDDECE7: idalloctm (jemalloc_internal_inlines_c.h:120)
==87545== by 0x6CDDECE7: tcache_destroy.constprop.0 (tcache.c:817)
==87545== by 0x6CDE116B: tsd_do_data_cleanup (tsd.c:382)
==87545== by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:408)
==87545== by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:388)
==87545== by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:73)
==87545== by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:22)
==87545== by 0x4A1393F: start_thread (pthread_create.c:455)
==87545== by 0x4AA42E3: clone (clone.S:100)
==87545== Address 0x6f4060c0 is in a rw- anonymous segment
==87545==
==87545== ----------------------------------------------------------------
==87545==
==87545== Possible data race during read of size 1 at 0x6F4166AC by thread #24
==87545== Locks held: none
==87545== at 0x6CD81459: atomic_load_b (atomic.h:89)
==87545== by 0x6CD81459: background_thread_indefinite_sleep (background_thread_inlines.h:45)
==87545== by 0x6CD81459: arena_background_thread_inactivity_check (arena.c:207)
==87545== by 0x6CD81459: je_arrow_private_je_arena_handle_deferred_work (arena.c:223)
==87545== by 0x6CDD393B: large_dalloc_finish_impl (large.c:255)
==87545== by 0x6CDD393B: je_arrow_private_je_large_dalloc (large.c:273)
==87545== by 0x6CDDECE7: arena_dalloc_large_no_tcache (arena_inlines_b.h:253)
==87545== by 0x6CDDECE7: arena_dalloc_no_tcache (arena_inlines_b.h:276)
==87545== by 0x6CDDECE7: arena_dalloc (arena_inlines_b.h:308)
==87545== by 0x6CDDECE7: idalloctm (jemalloc_internal_inlines_c.h:120)
==87545== by 0x6CDDECE7: tcache_destroy.constprop.0 (tcache.c:817)
==87545== by 0x6CDE116B: tsd_do_data_cleanup (tsd.c:382)
==87545== by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:408)
==87545== by 0x6CDE116B: je_arrow_private_je_tsd_cleanup (tsd.c:388)
==87545== by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:73)
==87545== by 0x4A10630: __nptl_deallocate_tsd (nptl_deallocate_tsd.c:22)
==87545== by 0x4A1393F: start_thread (pthread_create.c:455)
==87545== by 0x4AA42E3: clone (clone.S:100)
==87545== Address 0x6f4166ac is in a rw- anonymous segment |
Could you maybe post that on the Arrow issue? |
These Arrow CSV tests don't run in parallel anymore (they are run on the single CPU job with pytest-xdist disabled). |
Closing since I merged a PR to skip these empty CSV tests |
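For the record, the fix is conceptually a test-level skip rather than an xfail for the pyarrow engine on these empty/header-only CSV cases; a sketch (not the merged diff), reusing the all_parsers fixture from the logs above:

from io import StringIO
import pytest

def test_read_csv_header_only(all_parsers):
    # Sketch: skip (rather than xfail) the pyarrow engine so the potentially
    # hanging read is never executed; other engines still run the test.
    parser = all_parsers
    if parser.engine == "pyarrow":
        pytest.skip("pyarrow engine can hang on empty/header-only CSV input")
    result = parser.read_csv(StringIO("x,y,z"))
    assert result.empty
    assert list(result.columns) == ["x", "y", "z"]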