Hanging tasks on Github Actions Ubuntu Runners (R CMD Check) #53
I can see the work you've put into this and fully commend the effort! Is your hypothesis that the instances are running out of memory on the GitHub runners? I can imagine they'd be quite resource-constrained. On my Ubuntu laptop, the 1,000-task reprex showed 3.55 GB of memory usage at the end vs the 300-task example, which only took up about 400 MB and ran quite snappily. I can see the callr example on the GitHub runner only used 80 MB of memory in the end, but maybe that instance only had that much RAM remaining - so I'm not quite sure what to make of this. |
Thanks! I was trying to rule out memory as a possible explanation. I can see it may be a factor for https://github.com/wlandau/mirai/blob/reprex/tests/test.R. Maybe even https://github.com/wlandau/mirai/blob/reprex2/tests/test.R, but that seems less likely. I wonder how much memory the server daemon in the |
I just pushed wlandau@30b6268 to take more memory readings, and memory usage on the dispatcher and server does not look very different from beginning to end. |
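As an aside, one simple way to take such readings from inside an R process is to log the totals reported by gc(); the helper below is hypothetical and not necessarily what that commit does.

```r
# Hypothetical logging helper (not the code in the commit above): append a
# timestamped memory reading (Ncells + Vcells, in Mb, as reported by gc())
# to a log file from inside the dispatcher or server process.
log_memory <- function(label, file = "memory.log") {
  mb <- sum(gc(full = TRUE)[, 2L])
  cat(format(Sys.time()), label, round(mb, 1L), "Mb\n",
      file = file, append = TRUE)
}

log_memory("dispatcher: start")
# ... run tasks ...
log_memory("dispatcher: end")
```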
I was thinking just in terms of the client actually, as it's all on one machine. I assume caching to disk is set up so it doesn't OOM, but it will likely slow to a crawl - and that may be what we are experiencing. It also seems to happen only on the Ubuntu runners. I ran the same 1,000-task reprex on Windows/Mac and it seems to succeed there. On Mac I have the printouts: https://github.com/shikokuchuo/mirai/actions/runs/4672218165/jobs/8274151885 |
https://github.com/shikokuchuo/mirai/actions/runs/4672298324/jobs/8274326971 just calls |
From my tests, it really seems to be just the fact that the Ubuntu runners are memory-constrained. Everything works for small tasks. On the Mac machine the tests consistently run all the way through without problems. Given the above, I don't believe this is a cause for concern. |
Seems consistent with how much trouble I have been having when I peel back layers to isolate the cause. But it's still strange. In the case of In all these tests, the data objects are small. An empty target object like the one in https://github.com/ropensci/targets/blob/13f5b314cd4ac5c46f86551650b6f37fd54dffe4/tests/testthat/test-tar_make.R#L17-L31 is only around 30 Kb, and each of the task data objects from https://github.com/wlandau/mirai/blob/reprex2/tests/test.R is only 20 Kb. I have been using If it is just |
In |
Honestly I'm not worried until I see it outside of GitHub Actions. Our main production machine runs Ubuntu on 'data-heavy' workloads. The tests vary on Ubuntu - sometimes it finishes only 1 task, sometimes up to 10 - so we can rule out a deterministic reason. It strongly suggests an external cause, which, given that we can't replicate the behaviour anywhere else, we can't diagnose.
Yes, that's a good idea to enable on Mac. Should also work on Windows? |
However, all your hard work does have a good result: seeing the memory usage on your 1,000-task test case caused me to revisit memory handling. In fact I had attempted to optimise this a couple of weeks ago, but it led to the failures in the throughput tests (with the memory corruption errors). I am now taking a slightly different approach by simply removing the reference to the 'aio' external pointer once the results have been retrieved (and cached). This allows the garbage collector to reclaim the resources in its own time. If you have additional testing scripts, could you please re-run them on nanonext 0.8.1.9016 and mirai 0.8.2.9023? I would like to make sure I haven't broken anything inadvertently. I will also need to run this on a staging machine to monitor, but if successful I should be in a position to release. |
Thank you so much! Your updates to I am really excited for the next CRAN releases of |
Thanks that's really good news! |
I wasn't actually trying to fix the test issues... just a note before I forget - the 'aios' now pretty much clean up after themselves as soon as |
Thanks, noted. When crew checks a task, it uses the more minimal .unresolved() from nanonext, then moves the resolved mirai object from the list of tasks to the list of results. After that, the user can call controller$pop() to download the data and release the aio. From what I know about R's copy-on-modify system, I believe this approach avoids copying the pointer. At any rate, the tests I need to pass are passing, thanks to you. |
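For illustration, a minimal sketch of that pattern (not the actual crew source) might look like this:

```r
# Hedged sketch of the pattern described above, not crew's real internals:
# poll each task with nanonext::.unresolved(), move finished mirai objects
# from the task list to the results list, and only read $data when popping.
library(mirai)
library(nanonext)

tasks <- list(a = mirai(1 + 1), b = mirai(2 + 2))
results <- list()

# Polling step (what "checking a task" amounts to here):
done <- !vapply(tasks, .unresolved, logical(1L))
results <- c(results, tasks[done])
tasks <- tasks[!done]

# "pop" step: reading $data retrieves the value so the aio can be released.
if (length(results)) {
  value <- results[[1L]]$data
  results <- results[-1L]
}
```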
Yes, actually I was being too conservative - the |
FYI it appears I am still getting intermittent hanging tests on R CMD check. The good news is that it still works most of the time, so I have implemented timeouts and retries as a workaround. I think that's good enough for me for now. Maybe we could just keep an eye on it going forward? |
For example, the workflow at https://github.com/ropensci/targets/actions/runs/4684600263/jobs/8300914514 reached a timeout in a |
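A hedged sketch of the kind of timeout-and-retry wrapper described above (hypothetical code, not the actual targets test suite; R.utils::withTimeout() stands in here for whatever timeout mechanism is actually used):

```r
# Hypothetical retry wrapper: re-evaluate a test expression up to max_tries
# times, each attempt capped at `seconds` seconds.
retry_with_timeout <- function(expr, max_tries = 3L, seconds = 60) {
  quoted <- substitute(expr)
  envir <- parent.frame()
  for (attempt in seq_len(max_tries)) {
    out <- tryCatch(
      R.utils::withTimeout(eval(quoted, envir = envir), timeout = seconds),
      TimeoutException = function(e) e,
      error = function(e) e
    )
    if (!inherits(out, "condition")) {
      return(out)
    }
    message("attempt ", attempt, " failed: ", conditionMessage(out))
  }
  stop("still failing after ", max_tries, " attempts")
}

# Usage (run_hanging_test() is a placeholder):
# retry_with_timeout(run_hanging_test(), max_tries = 3L, seconds = 60)
```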
Might be able to fix it in the next couple of days, hopefully. I'm trying to pin down this elusive segfault on the CRAN OpenBLAS machine (I mentioned before, all these exotic setups tend to get me!!). I'm in the middle of simplifying and making more robust certain things in However, I've also cut out the new features we're not using yet so that |
OK! Do you want to re-run your tests with I have taken to testing across (a good selection of) the rhub platforms, and the segfault no longer appears (it was quite consistently, though randomly, reproducible for the last CRAN release). Sorry - not completely fixed - I am on to it! |
Thanks for working on this. I just pushed updates to |
I think nanonext v0.8.1.9020 or the latest v0.8.1.9022 should do the trick actually. |
With nanonext v0.8.1.9021 and mirai 0.8.2.9027, I did observe one hanging test which succeeded on the first retry: https://github.com/ropensci/targets/actions/runs/4690189013/jobs/8315468527#step:9:258. Another just like it ran on Mac OS: https://github.com/ropensci/targets/actions/runs/4690189013/jobs/8315467814#step:10:257. Currently trying again with |
With nanonext v0.8.1.9022, it looks like all the targets checks succeeded on the first try. I am running the jobs again to confirm. |
Meant to say v0.8.1.9022 above. |
In later commits of |
nanonext 0.8.1.9025 and mirai 0.8.2.9028 are the release candidates. If final testing doesn't throw any errors, nanonext is on track for release tomorrow. If you notice any particular occasions when it hangs, e.g. when scaling up/down etc., I will have more to go on. I have covered the general bases - it is much more robust (and actually more performant, it seems). |
Glad you're seeing performance gains, and I'm glad the packages are poised for CRAN. I'm afraid I do still see the same timeouts with nanonext 0.8.1.9025 and mirai 0.8.2.9028. The commits to https://github.com/ropensci/targets/tree/try-builds since ropensci/targets@8b4cf6e show intermittent failures testing |
That's really useful info, and it allowed me to simplify |
I wonder, could it have something to do with repeated calls to |
That's really good that it's allowed you to simplify things. I don't see how repeated calls to the above functions can cause problems. Would you mind testing the latest build of mirai v. 0.8.4.9004 85953f8? I got CRAN feedback and I'm doing final tests on it. Notable change is that the default |
This is a tricky issue. I just ran the tests from yesterday again just to confirm - and they do appear to be fixed - just through not polling Did you have any specific concern with |
With I have not been able to isolate this in a small reproducible example without |
Thanks! At least it is no worse than before.
Let me know what you want to try. I was going to suggest, if it is at all possible, instrumenting the tests a bit (so we can try to isolate exactly where it is hanging). I just had a thought - the tests from yesterday were all just repeatedly doing one run, which now succeeds. In the targets tests that fail, is |
If you have any suggestions for how to isolate the tests, I am eager to try. As soon as I try to peel back the layers, either the test passes or the error is different.
I checked locally, and those |
That's good to know re. saisei(), that certainly narrows things down. I am a fan of lots of print statements, but I have first-hand experience of the disappearing failures from yesterday. Let me have a think. I'm also going to attempt another CRAN release in the meantime (wish me luck!). |
Drive-by comment: It sounds like you cannot reproduce the GitHub Actions issues locally. GitHub Actions runs with two CPU cores. Although less complicated than this, I had corner cases way back that only appeared when running on hosts with a single core or two cores. If this is the case here, Linux containers provide ways to limit the CPU resources. Alternatively, Posit Cloud Free runs on a single core, so it may be worth a try to reproduce there. Again, just a thought.
|
Thanks @HenrikBengtsson ! Didn't realise you were still receiving these - apologies for the volume!! I will sign up for Posit Cloud - seems worth a try. |
Another thought: do the targets tests make use of timeouts/auto-scaling? Just looking through the code again, one possibility is that the dispatcher has sent the task but somehow the server has exited and the disconnection has not been registered. Hence it is just waiting to receive when there is no actual server. I don't know how this can happen yet. But to rule out this possibility, is it possible to run the tests with a vanilla non-scaling, no-timeout daemons setting? |
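For reference, a "vanilla" setting in plain mirai terms might look roughly like the sketch below; crew normally manages the daemons itself, so this is illustrative only, and argument defaults follow the mirai documentation.

```r
# Hedged sketch of a "vanilla" setup: persistent local daemons, no idle or
# wall-time limits, no auto-scaling.
library(mirai)

daemons(2L)                    # two persistent local daemons
m <- mirai(Sys.getpid())       # a trivial task
call_mirai(m)                  # block until resolved
m$data
daemons(0L)                    # tear down the daemons when done
```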
Thanks, @HenrikBengtsson. I did try a GitHub Codespace based on https://github.com/ropensci/targets/blob/debug-targets/Dockerfile, which has similar resources, but I could not produce the exact type of hanging I see on GitHub Actions. Posit Cloud is a great idea. @shikokuchuo, all the |
Indeed, from https://github.com/ropensci/targets/actions/runs/4941225899/jobs/8833634834#step:8:341, it looks like the R process of the server is still actually running, as seen from the output of the |
I also ran a round of tests that disabled auto-scaling: I manually launched all the workers up front, waited 3 seconds, and then allowed the pipeline of tasks to start. It looks like I see the same hanging, where the dispatcher and server look fine by all accounts (NNG + |
What would really help is to instrument the script just before and after the mirai is sent - and also, from inside the mirai task, have it cat to a log file before targets begins and another line afterwards. This will confirm where it is stuck: it has not received the task, the eval is failing, or the send is failing. |
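A hedged sketch of that instrumentation; run_target() and target are hypothetical stand-ins for the real work, and the log path is arbitrary:

```r
# Hypothetical instrumentation sketch: cat() on the client around the send,
# and cat() to a log file inside the task before and after the work runs.
library(mirai)

run_target <- function(x) { Sys.sleep(1); x }   # stand-in for the real work
target <- "example_target"

cat(format(Sys.time()), "client: about to send task\n")
m <- mirai(
  {
    cat(format(Sys.time()), "daemon: task received\n",
        file = log, append = TRUE)
    out <- run_target(target)
    cat(format(Sys.time()), "daemon: eval done, sending result\n",
        file = log, append = TRUE)
    out
  },
  run_target = run_target,
  target = target,
  log = "mirai-task.log"
)
cat(format(Sys.time()), "client: task sent\n")
```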
I just submitted another round of tests with more logging. From the 2 lines at https://github.com/ropensci/targets/actions/runs/4947810841/jobs/8847691329#step:8:320, it looks like the task is successfully sent from the client side. From the 2 lines at https://github.com/ropensci/targets/actions/runs/4947810841/jobs/8847691329#step:8:323, the target successfully ran, but for some reason the return value is not sent back to the client or registered as "complete" in |
I doubt it has to do with #42 because the defaults in |
Fantastic! Thanks Will, this is super useful. This really helps me focus on the right area of code. I am taking part in Cambridge Tech Week today; I will be able to make a start tomorrow. |
One thing that came to me was: in the tests, do you actually retrieve the results by accessing $data or using unresolved (rather than .unresolved)? If you don't, the individual contexts are not destroyed, in which case it is possible this becomes a resource issue. I was debugging a Windows issue on my trusty Intel Atom netbook and things were at least an order of magnitude slower. On the other hand, GitHub Linux runners are supposedly dual-core with 7 GB RAM, which should actually be ample power. |
Sorry, I'm not sure I understand. Does After retrieving the result, |
Possibly useful: I just hit the same timeout (capped at 60s) on my local Ubuntu machine which has 4 cores and 16 GB memory. |
Thanks, it's just as I thought, which is fine. I wanted to double check my understanding. There is nothing special about |
@wlandau with your experiences, and also with extra testing on certain rhub configurations, I am leaning towards this being a resource issue again. NNG does create a fair number of threads. Simulated mirai/targets heavy-load tasks complete on rhub (some much quicker than others), but fail on GitHub. They never fail locally on my laptop. Happy to investigate more next week, but just to let you know my thoughts. The code for sending back a completed task from the daemon is quite synchronous and robust in my opinion - it seems unlikely to be a fault there. |
That would make sense for GitHub Actions, as we have discussed. But as I am kicking the tires with |
Let's port further discussion of the actual targets cases to #58. For the record, I am satisfied that nanonext 0.8.3 / mirai 0.8.7/0.8.7.9001 work as intended, and the reprexes using just targets/mirai (after fixing a couple of bugs in them) no longer hang on GitHub. |
As you know, I have been struggling with the final stages of ropensci/targets#1044, which integrates crew into targets. targets encodes instructions in special classed environments which govern the behavior of tasks and data. In R CMD check on GitHub Actions Ubuntu runners, when many of these objects are sent to and from mirai() tasks, the overall work stalls and times out. It only happens on GitHub Actions Ubuntu runners (probably Windows too, but it didn't seem worth checking), and it only happens inside R CMD check.

After about a week of difficult troubleshooting, I managed to reproduce the same kind of stalling using just mirai and nanonext. I have one example with 1,000 tasks at https://github.com/wlandau/mirai/blob/reprex/tests/test.R, and I have another example at https://github.com/wlandau/mirai/blob/reprex2/tests/test.R which has 300 tasks and uses callr to launch the server process. In the first example, you can see time stamps starting at https://github.com/wlandau/mirai/actions/runs/4670004460/jobs/8269199542#step:9:105. The tasks get submitted within about a 20-second window, then something appears to freeze, and then the 5-minute timeout is reached. In the second example, the timestamps at https://github.com/wlandau/mirai/actions/runs/4670012640/jobs/8269219432#step:9:99 show activity within the first 8 seconds, and only 5 of the 300 tasks run within the full 5 minutes. (I know you have a preference against callr, but it was hard to find ways to get this problem to reproduce, and I think mirai servers can be expected to work if launched from callr::r_bg().)

Sorry I have not been able to do more to isolate the problem. I still do not understand why it happens, and I was barely able to create examples that do not use targets or crew. I hope this much is helpful.
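For readers who cannot open the linked test scripts, a rough sketch of the shape of such a reprex follows; this is not the actual linked code, and the payloads, counts, and timings are illustrative only.

```r
# Illustrative sketch: submit many small mirai tasks, then poll until they
# all resolve or a deadline passes, printing timestamps along the way.
library(mirai)
library(nanonext)

daemons(1L)
n <- 1000L
tasks <- lapply(seq_len(n), function(i) mirai(i + 1L, i = i))

deadline <- Sys.time() + 5 * 60            # the CI jobs cap out at 5 minutes
repeat {
  pending <- sum(vapply(tasks, unresolved, logical(1L)))
  cat(format(Sys.time()), "pending:", pending, "\n")
  if (pending == 0L || Sys.time() > deadline) break
  msleep(1000)                             # nanonext::msleep(), milliseconds
}
daemons(0L)
```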