Hanging tasks on Github Actions Ubuntu Runners (R CMD Check) #53

Closed

wlandau opened this issue Apr 11, 2023 · 118 comments

@wlandau

wlandau commented Apr 11, 2023

As you know, I have been struggling with the final stages of ropensci/targets#1044, which integrates crew into targets. targets encodes instructions in special classed environments which govern the behavior of tasks and data. In R CMD check on GitHub Actions Ubuntu runners, when many of these objects are sent to and from mirai() tasks, the overall work stalls and times out. It only happens on GitHub Actions Ubuntu runners (probably Windows too, but it didn't seem worth checking), and it only happens inside R CMD check.

After about a week of difficult troubleshooting, I managed to reproduce the same kind of stalling using just mirai and nanonext. I have one example with 1000 tasks at https://github.com/wlandau/mirai/blob/reprex/tests/test.R, and I have another example at https://github.com/wlandau/mirai/blob/reprex2/tests/test.R which has 300 tasks and uses callr to launch the server process. In the first example, you can see time stamps starting at https://github.com/wlandau/mirai/actions/runs/4670004460/jobs/8269199542#step:9:105. The tasks get submitted within about a 20-second window, then something appears to freeze, and then the 5-minute timeout is reached. In the second example, the timestamps at https://github.com/wlandau/mirai/actions/runs/4670012640/jobs/8269219432#step:9:99 show activity within the first 8 seconds, and only 5 of the 300 tasks run within the full 5 minutes. (I know you have a preference against callr, but it was hard to find ways to get this problem to reproduce, and I think mirai servers can be expected to work if launched from callr::r_bg().)
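
Roughly, the reprexes boil down to a pattern like the sketch below (this is not the exact test.R code; the task count, payload size, and 5-minute cap are taken from the descriptions above):

```r
library(mirai)
daemons(1)                                # one persistent background server process
tasks <- lapply(seq_len(300), function(i) mirai(sum(rnorm(n)), n = 1e3))
deadline <- Sys.time() + 5 * 60           # the 5-minute cap used in the CI runs
while (any(vapply(tasks, nanonext::unresolved, logical(1))) && Sys.time() < deadline) {
  Sys.sleep(0.1)                          # poll until all tasks resolve or we give up
}
daemons(0)                                # reset the daemons connection
```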

Sorry I have not been able to do more to isolate the problem. I still do not understand why it happens, and I was barely able to create examples that do not use targets or crew. I hope this much is helpful.

@shikokuchuo
Owner

I can see the work you've put into this and fully commend the effort! Is your hypothesis that the instances are running out of memory on the Github runners? I can imagine they'd be quite resource constrained.

On my Ubuntu laptop, the 1,000-task reprex showed 3.55 GB of memory usage at the end, whereas the 300-task example only took up about 400 MB and ran quite snappily.

I can see the callr example on the Github runner only used 80 MB of memory in the end, but maybe that instance only had that much RAM remaining - so I'm not quite sure what to make of this.

@wlandau
Author

wlandau commented Apr 11, 2023

I can see the work you've put into this and fully commend the effort! Is your hypothesis that the instances are running out of memory on the Github runners? I can imagine they'd be quite resource constrained.

Thanks! I was trying to rule out memory as a possible explanation. I can see it may be a factor for https://github.com/wlandau/mirai/blob/reprex/tests/test.R. Maybe even https://github.com/wlandau/mirai/blob/reprex2/tests/test.R, but that seems less likely. I wonder how much memory the server daemon in the reprex2 branch is using during R CMD check.

@wlandau
Author

wlandau commented Apr 11, 2023

I just pushed wlandau@30b6268 to take more memory readings, and memory usage on the dispatcher and server do not look very different from beginning to end.

@shikokuchuo
Owner

I was thinking just in terms of the client actually, as it's all on one machine. I assume caching to disk is set up so it doesn't OOM, but it will likely slow to a crawl - and that may be what we are experiencing.

It also seems to happen only on the Ubuntu runners. I ran the same 1000-task reprex on Windows and Mac, and it seems to succeed there. For Mac I have the printouts: https://github.com/shikokuchuo/mirai/actions/runs/4672218165/jobs/8274151885
I couldn't get the printouts on Windows, but looking at the test timing it finishes well within the timeout.

@shikokuchuo
Owner

https://github.com/shikokuchuo/mirai/actions/runs/4672298324/jobs/8274326971 just calls rnorm() but succeeds on Mac and fails on Ubuntu. On the face of it, this suggests a memory issue or some other peculiarity of the runners rather than anything about the complexity of the input/output objects.

@shikokuchuo changed the title from "Hanging tasks when the input and output data objects are complicated" to "Hanging tasks on Github Actions Ubuntu Runners (R CMD Check)" on Apr 11, 2023
@shikokuchuo
Owner

From my tests, it really seems that it is just the fact that the Ubuntu runners are memory constrained.

Everything works for small tasks of rnorm(1e3) size: https://github.com/shikokuchuo/mirai/actions/runs/4675768366
But not for larger objects e.g. rnorm(3e5): https://github.com/shikokuchuo/mirai/actions/runs/4675934509

For the Mac machine the tests consistently run all the way through without problem.
The only command is rnorm() and the return value is just a vector of doubles so nothing complex going on.

Given the above, I don't believe this is a cause for concern.
Indeed, you may wish to try your targets tests again, while ensuring the payloads are minimal.

@wlandau
Author

wlandau commented Apr 12, 2023

Seems consistent with how much trouble I have been having when I peel back layers to isolate the cause. But it's still strange. In the case of targets, it does not take many tasks to hit the deadlock. Sometimes it only takes one, as in https://github.com/ropensci/targets/actions/runs/4670430552/jobs/8270146282. That workflow got stuck at https://github.com/ropensci/targets/blob/13f5b314cd4ac5c46f86551650b6f37fd54dffe4/tests/testthat/test-tar_make.R#L17-L31, so I ended up having to skip all the crew tests on Ubuntu R CMD check.
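
The skip is essentially a guard like this (an assumed sketch, not necessarily the exact code in the targets test suite):

```r
# Skip crew tests on GitHub Actions Ubuntu runners, where they hang during R CMD check.
skip_if_ubuntu_actions <- function() {
  testthat::skip_if(
    identical(Sys.info()[["sysname"]], "Linux") && nzchar(Sys.getenv("GITHUB_ACTIONS")),
    "crew tests hang on GitHub Actions Ubuntu runners"
  )
}
```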

In all these tests, the data objects are small. An empty target object like the one in https://github.com/ropensci/targets/blob/13f5b314cd4ac5c46f86551650b6f37fd54dffe4/tests/testthat/test-tar_make.R#L17-L31 is only around 30 KB, and each of the task data objects from https://github.com/wlandau/mirai/blob/reprex2/tests/test.R is only 20 KB. I have been using clustermq in situations like this for several years and have never encountered anything like this, which made me suspect something about mirai is a factor.

If it is just R CMD check + Ubuntu + GitHub Actions + 1 GB memory, this is not so limiting. But it makes me worry about how this may affect data-heavy workloads on normal machines.

@wlandau
Author

wlandau commented Apr 12, 2023

In targets, I just re-enabled crew tests on Mac OS. In the previous workflow without crew tests, the check time was around 9 minutes. With crew tests, check time was around 12 minutes. (It also appears to have trouble saving the package cache, but that is probably unrelated.)

@shikokuchuo
Owner

shikokuchuo commented Apr 12, 2023

If it is just R CMD check + Ubuntu + GitHub Actions + 1 GB memory, this is not so limiting. But it makes me worry about how this may affect data-heavy workloads on normal machines.

Honestly I'm not worried until I see it outside of Github actions. Our main production machine runs Ubuntu on 'data-heavy' workloads using mirai literally 24/7.

The tests vary on Ubuntu - sometimes it finishes only 1 task, sometimes up to 10 - so we can rule out a deterministic reason. That strongly suggests an external cause, which we can't diagnose, given that we can't replicate the behaviour anywhere else.

In targets, I just re-enabled crew tests on Mac OS. In the previous workflow without crew tests, the check time was around 9 minutes. With crew tests, check time was around 12 minutes. (It also appears to have trouble saving the package cache, but that is probably unrelated.)

Yes, that's a good idea to enable on Mac. Should also work on Windows?

@shikokuchuo
Owner

However, all your hard work does have a good result:

Seeing the memory usage on your 1,000 test case caused me to re-visit memory handling.

In fact I had attempted to optimise this a couple of weeks ago, but this led to the failures in the throughput tests (with the memory corruption errors). I am now taking a slightly different approach by simply removing the reference to the 'aio' external pointer when the results have been retrieved (and cached). This will allow the garbage collector to reclaim the resources in its own time.
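
Schematically, the idea is along these lines (an illustration only, not the actual implementation; here `x` stands for an environment holding a nanonext receive aio in `x$aio`):

```r
# Illustration only: `x` is an environment with the receive aio in x$aio.
resolve_and_release <- function(x) {
  if (is.null(x$value)) {
    nanonext::call_aio(x$aio)  # block until the receive completes
    x$value <- x$aio$data      # cache the retrieved result
    x$aio <- NULL              # drop the external pointer reference; gc() reclaims it later
  }
  x$value
}
```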

If you have additional testing scripts, could you please re-run them on nanonext 0.8.1.9016 and mirai 0.8.2.9023? I would like to make sure I haven't broken anything inadvertently.

I will also need to run this on a staging machine to monitor, but if that is successful I should be in a position to release nanonext to CRAN at some point tomorrow. The recent builds have been really solid.

@wlandau
Author

wlandau commented Apr 12, 2023

Thank you so much! Your updates to nanonext and mirai appear to have eliminated problems in the targets tests, which was what I was mainly worried about. The load tests in crew still appear to hang and time out, but because of the high loads in those tests, I am willing to believe that those remaining instances are indeed due to constrained memory on Ubuntu runners.

I am really excited for the next CRAN releases of nanonext and mirai. After those builds complete, I can release crew and then targets, broadcast our progress, and invite others to write their own crew launchers.

@wlandau closed this as completed Apr 12, 2023
@shikokuchuo
Owner

Thanks that's really good news!

@shikokuchuo
Owner

I wasn't actually trying to fix the test issues... just a note before I forget - the 'aios' now pretty much clean up after themselves as soon as unresolved() returns successfully or after call_mirai() or after the $data is actually accessed. Note that it'd be a mistake to make a copy of an 'aio' before that point as you'd duplicate the reference to the external pointer. Just an FYI as I know you hang on to the 'aios'.
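
In code terms, any of the following resolves a mirai and lets it clean up (a sketch of the three access paths just described):

```r
m <- mirai::mirai(Sys.getpid())
nanonext::unresolved(m)  # polling: returns FALSE once resolved (cleanup has happened)
mirai::call_mirai(m)     # or: block until resolved
m$data                   # or: access the result directly
```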

@wlandau
Author

wlandau commented Apr 12, 2023

Thanks, noted. When crew checks a task, it uses the more minimal .unresolved() from nanonext, then moves the resolved mirai object from the list of tasks to the list of results. After that, the user can call controller$pop() to download the data and release the aio. From what I know about R's copy-on-modify system, I believe this approach avoids copying the pointer. At any rate, the tests I need to pass are passing, thanks to you.
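
Roughly, the pattern looks like this (a sketch with a hypothetical `tasks` list, not crew's actual implementation):

```r
# `tasks` is a hypothetical named list of mirai objects.
unres   <- vapply(tasks, nanonext::.unresolved, logical(1))  # cheap status check, no data retrieval
results <- tasks[!unres]        # resolved mirai objects move to the results list
tasks   <- tasks[unres]
value   <- results[[1L]]$data   # later, pop() accesses $data to retrieve the result
```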

@shikokuchuo
Owner

Yes, actually I was being too conservative - the mirai are environments so if you copy them just the reference to the environment is duplicated. The external pointer sits within the environment, and is the same no matter through which reference you access it. And the cleanup works regardless. Forget my previous comment!!

@wlandau
Author

wlandau commented Apr 13, 2023

FYI it appears I am still getting intermittent hanging tests on R CMD check. The good news is that it still works most of the time, so I have implemented timeouts and retries as a workaround. I think that's good enough for me for now. Maybe we could just keep an eye on it going forward?
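
The workaround is essentially a wrapper like this (an assumed sketch of the pattern, not the exact test code; `run_test` is a placeholder):

```r
# `run_test` is a placeholder function that returns a non-NULL value on success.
retry_with_timeout <- function(run_test, timeout = 60, retries = 3) {
  for (attempt in seq_len(retries)) {
    result <- tryCatch(
      R.utils::withTimeout(run_test(), timeout = timeout),
      TimeoutException = function(condition) NULL  # timed out: try again
    )
    if (!is.null(result)) return(result)
  }
  stop("test still timing out after ", retries, " attempts")
}
```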

@wlandau
Author

wlandau commented Apr 13, 2023

For example, the workflow at https://github.com/ropensci/targets/actions/runs/4684600263/jobs/8300914514 reached a timeout in a crew test, but the first retry succeeded.

@shikokuchuo
Owner

Maybe we could just keep an eye on it going forward?

I might be able to fix it in the next couple of days, hopefully. I'm trying to pin down an elusive segfault on the CRAN OpenBLAS machine (I mentioned before that all these exotic setups tend to get me!!).

I'm in the middle of simplifying certain things in nanonext and making them more robust, which will push back the release slightly.

However, I've also cut out the new features we're not using yet so that mirai only depends on the existing nanonext 0.8.1, so I should be able to release both around the same time.

@shikokuchuo
Owner

shikokuchuo commented Apr 13, 2023

OK! Do you want to re-run your tests with nanonext fb8b05d v0.8.1.9020 and mirai 4e1862e v0.8.2.9027?

I have taken to testing across (a good selection of) the rhub platforms and the segfault no longer appears (it was quite consistently, although randomly, reproducible for the last CRAN release). The changes have a good chance of also fixing your hanging tests.

Sorry - not completely fixed - am on to it!

@wlandau
Author

wlandau commented Apr 13, 2023

Thanks for working on this. I just pushed updates to crew and targets that use those versions of mirai and nanonext, and I look forward to trying more updates when you are ready.

@shikokuchuo
Owner

I think nanonext v0.8.1.9020 or the latest v0.8.1.9022 should do the trick, actually. crew always uses the dispatcher, right? There are some edge cases without the dispatcher that I need to make safe, but otherwise I think it should be fine. Do let me know if anything seems to be off. Thanks!

@wlandau
Author

wlandau commented Apr 13, 2023

With nanonext v0.8.1.9021 and mirai 0.8.2.9027, I did observe one hanging test which succeeded on the first retry: https://github.com/ropensci/targets/actions/runs/4690189013/jobs/8315468527#step:9:258. Another just like it ran on Mac OS: https://github.com/ropensci/targets/actions/runs/4690189013/jobs/8315467814#step:10:257. I am currently trying again with nanonext 0.8.1.9022. Interestingly, I am seeing "read ECONNRESET" in the annotations at https://github.com/ropensci/targets/actions/runs/4690189013. Not sure if that is related.

@wlandau
Author

wlandau commented Apr 13, 2023

With nanonext v0.8.1.9022, it looks like all the targets checks succeeded on the first try. I am running the jobs again to confirm.

@wlandau
Author

wlandau commented Apr 13, 2023

Meant to say v0.8.1.9022 above.

@wlandau
Author

wlandau commented Apr 13, 2023

In later commits of targets using mirai 0.8.2.9027 and nanonext 0.8.1.9022, I unfortunately still notice sporadic hanging. Examples:

@shikokuchuo
Owner

nanonext 0.8.1.9025 and mirai 0.8.2.9028 are the release candidates. If final testing doesn't throw any errors, nanonext is on track for release tomorrow.

If you notice any particular occasions when it hangs, e.g. when scaling up or down, I will have more to go on. I have covered the general bases - it is much more robust (and actually more performant, it seems).

@wlandau
Author

wlandau commented Apr 14, 2023

Glad you're seeing performance gains, and I'm glad the packages are poised for CRAN. I'm afraid I do still see the same timeouts with nanonext 0.8.1.9025 and mirai 0.8.2.9028. The commits to https://github.com/ropensci/targets/tree/try-builds since ropensci/targets@8b4cf6e show intermittent failures testing crew. I disabled the retries on that branch so they would be easier to see.

@wlandau
Author

wlandau commented May 10, 2023

That's really useful info, and it allowed me to simplify crew quite a lot: wlandau/crew@ff833e7. But unfortunately, I still see hanging tasks on my end: https://github.com/ropensci/targets/actions/runs/4936193312/jobs/8823439823. Based on previous logs, this is happening when checks on daemons() are okay, but a mirai task stays in an unresolved state indefinitely. It's really hard to reproduce without targets.

@wlandau
Author

wlandau commented May 10, 2023

I wonder, could it have something to do with repeated calls to daemons() combined with repeated calls to nanonext::.unresolved()?

@shikokuchuo
Owner

That's really good that it's allowed you to simplify things. I don't see how repeated calls to the above functions can cause problems.

Would you mind testing the latest build of mirai v. 0.8.4.9004 85953f8? I got CRAN feedback and I'm doing final tests on it. Notable change is that the default asyncdial is now FALSE across the package for additional safety. Also caught one bug in saisei().

@shikokuchuo
Owner

I wonder, could it have something to do with repeated calls to daemons() combined with repeated calls to nanonext::.unresolved()?

This is a tricky issue. I ran the tests from yesterday again just to confirm - and they do appear to be fixed - just through not polling daemons() right at the start. But perhaps it is still something to do with polling.

Did you have any specific concern with nanonext::.unresolved()? I don't believe that function has the capacity to hang as it is essentially just reading an int in a C struct on the local process.

@wlandau
Author

wlandau commented May 10, 2023

With mirai 0.8.4.9004, the hanging is rare, but it still happens: https://github.com/ropensci/targets/actions/runs/4936588271/jobs/8824297563. From previous tests, I believe this is still a case where the dispatcher is running, daemons() polls just fine, and the server is running fine, but the task is still stuck at unresolved.

I have not been able to isolate this in a small reproducible example without targets, even after weeks of trying, so I wonder if it would be possible to run the same tests on a dev fork of mirai which prints a verbose log (or even a trace).

@shikokuchuo
Owner

With mirai 0.8.4.9004, the hanging is rare, but it still happens: https://github.com/ropensci/targets/actions/runs/4936588271/jobs/8824297563. From previous tests, I believe this is still a case where the dispatcher is running, daemons() polls just fine, and the server is running fine, but the task is still stuck at unresolved.

Thanks! At least it is no worse than before.

I have not been able to isolate this in a small reproducible example without targets, even after weeks of trying, so I wonder if it would be possible to run the same tests on a dev fork of mirai which prints a verbose log (or even a trace).

Let me know what you want to try. I was going to suggest if it was at all possible to instrument the tests a bit (so we can try to isolate where exactly it is hanging).

I just had a thought, as the tests from yesterday were all just repeatedly doing one run, which now succeeds. In the targets tests that fail, is saisei() being called? I wonder if it has to do with switching the listeners between tasks. Do you have any tests in which saisei() isn't called, for comparison?

@wlandau
Author

wlandau commented May 10, 2023

Let me know what you want to try. I was going to suggest if it was at all possible to instrument the tests a bit (so we can try to isolate where exactly it is hanging).

If you have any suggestions for how to isolate the tests, I am eager to try. As soon as I try to peel back the layers, either the test passes or the error is different.

I just had a thought - as the tests from yesterday were all just all repeatedly doing one run, which now succeeds. In the targets tests that fail, is saisei() being called?

I checked locally, and those targets tests actually do not call saisei() at all.

@shikokuchuo
Owner

That's good to know re. saisei(), that certainly narrows things down.

I am a fan of lots of print statements, but I have first-hand experience of the disappearing failures from yesterday.

Let me have a think. I'm also going to attempt another CRAN release in the meantime (wish me luck!).

@HenrikBengtsson

HenrikBengtsson commented May 10, 2023 via email

@shikokuchuo
Owner

Drive-by comment: It sounds like you cannot reproduce the GitHub Actions issues locally. GitHub Actions runs with two CPU cores. Although less complicated than this, I had corner cases way back that only appeared when running on hosts with only one or two cores. If this is the case here, Linux containers provide ways to limit the CPU resources. Alternatively, Posit Cloud Free runs on a single core, so it may be worth trying to reproduce there. Again, just a thought.

Thanks @HenrikBengtsson! Didn't realise you were still receiving these - apologies for the volume!! I will sign up for Posit Cloud - it seems worth a try.

@shikokuchuo
Owner

I checked locally, and those targets tests actually do not call saisei() at all.

Another thought: do the targets tests make use of timeouts/auto-scaling?

Just looking through the code again, one possibility is that the dispatcher has sent the task but somehow the server has exited and the disconnection has not been registered. Hence the dispatcher is just waiting to receive when there is no actual server. I don't know how this can happen yet. But to rule out this possibility, is it possible to run the tests with a vanilla non-scaling, no-timeout daemons setting?

@wlandau
Author

wlandau commented May 10, 2023

Thanks, @HenrikBengtsson. I did try a GitHub Codespace based on https://github.com/ropensci/targets/blob/debug-targets/Dockerfile, which has similar resources, but I could not produce the exact type of hanging I see on GitHub Actions. Posit Cloud is a great idea.

@shikokuchuo, all the targets tests do make use of auto-scaling, but the only timeout is the R.utils::withTimeout() wrapper around everything, which anticipates the hanging. From what I remember of the previous test output, the server process is actually still running, which I confirmed with both daemons()$daemons and printing the processx handle. I just reinserted some logging to confirm.

@wlandau
Author

wlandau commented May 10, 2023

From what I remember from the previous test output, the server process is actually still running, which I confirmed with both daemons()$daemons and printing the processx handle

Indeed, from https://github.com/ropensci/targets/actions/runs/4941225899/jobs/8833634834#step:8:341, it looks like the R process of the server is still actually running, as seen from the output of the processx handle. From https://github.com/ropensci/targets/actions/runs/4941225899/jobs/8833634834#step:8:332, the server is online and has received a task but has not completed it.
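
For reference, the checks I am describing are along these lines (a sketch; `handle` is the processx handle for the server process):

```r
handle$is_alive()          # processx: the server's R process is still running
print(handle)              # printing the handle, as in the log output linked above
mirai::daemons()$daemons   # server status: online, plus task counters
```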

@wlandau
Author

wlandau commented May 10, 2023

I also ran a round of tests that disabled auto-scaling: I manually launched all the workers up front, waited 3 seconds, and then allowed the pipeline of tasks to start. It looks like I see the same hanging, where the dispatcher and server look fine by all accounts (NNG + processx handles) but the task stays unresolved indefinitely: https://github.com/ropensci/targets/actions/runs/4941363308/jobs/8833895886.

@shikokuchuo
Owner

From what I remember from the previous test output, the server process is actually still running, which I confirmed with both daemons()$daemons and printing the processx handle

Indeed, from https://github.com/ropensci/targets/actions/runs/4941225899/jobs/8833634834#step:8:341, it looks like the R process of the server is still actually running, as seen from the output of the processx handle. From https://github.com/ropensci/targets/actions/runs/4941225899/jobs/8833634834#step:8:332, the server is online and has received a task but has not completed it.

What would really help is to instrument the script just before and after the mirai is sent - and also, from inside the mirai task, have it cat a line to a log file before the targets work begins and another one afterwards. This will confirm where it is stuck: the daemon has not received the task, the eval is failing, or the send back is failing.
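
Something along these lines (illustrative only; `do_target_work()` is a placeholder for the real task body):

```r
cat(format(Sys.time()), "sending task\n")
m <- mirai::mirai(
  {
    cat("task started\n", file = "task.log", append = TRUE)
    out <- do_target_work()
    cat("task finished\n", file = "task.log", append = TRUE)
    out
  },
  do_target_work = do_target_work
)
cat(format(Sys.time()), "task sent\n")
```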

@wlandau
Author

wlandau commented May 11, 2023

I just submitted another round of tests with more logging. From the 2 lines at https://github.com/ropensci/targets/actions/runs/4947810841/jobs/8847691329#step:8:320, it looks like the task is successfully sent from the client side. From the 2 lines at https://github.com/ropensci/targets/actions/runs/4947810841/jobs/8847691329#step:8:323, the target successfully ran, but for some reason the return value is not sent back to the client or registered as "complete" in daemons()$daemons.

@wlandau
Author

wlandau commented May 11, 2023

I doubt it has to do with #42 because the defaults in crew_controller_local() opt for completely persistent workers: maxtasks = Inf, idletime = Inf, walltime = Inf.
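
In other words, the controller is effectively set up like this (argument names as above; the other arguments are omitted from this sketch):

```r
controller <- crew::crew_controller_local(
  maxtasks = Inf,  # never retire a worker after a task quota
  idletime = Inf,  # never time out an idle worker
  walltime = Inf   # no wall-time limit
)
```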

@shikokuchuo
Owner

I just submitted another round of tests with more logging. From the 2 lines at https://github.com/ropensci/targets/actions/runs/4947810841/jobs/8847691329#step:8:320, it looks like the task is successfully sent from the client side. From the 2 lines at https://github.com/ropensci/targets/actions/runs/4947810841/jobs/8847691329#step:8:323, the target successfully ran, but for some reason the return value is not sent back to the client or registered as "complete" in daemons()$daemons.

Fantastic! Thanks Will, this is super useful. It really helps me focus on the right area of code. I am taking part in Cambridge Tech Week today, but I will be able to make a start tomorrow.

@shikokuchuo
Owner

I doubt it has to do with #42 because the defaults in crew_controller_local() opt for completely persistent workers: maxtasks = Inf, idletime = Inf, walltime = Inf.

One thing that came to me was: in the tests, do you actually retrieve the results by accessing $data or using unresolved() (rather than .unresolved())? If you don't, the individual contexts are not destroyed, in which case it is possible this becomes a resource issue. I was debugging a Windows issue on my trusty Intel Atom netbook and things were at least an order of magnitude slower. On the other hand, GitHub Linux runners are supposedly dual-core with 7 GB RAM, which should actually be ample power.

@wlandau
Author

wlandau commented May 11, 2023

crew uses .unresolved() to check the status of the mirai tasks while avoiding downloading the data right away. This happens during the collect() method of the controller which makes a note of all the done tasks and thus allows the right auto-scaling decisions to be made. Then in a separate step, the pop() method calls $data on the mirai object of a task previously discovered to be done. unresolved() is not used at all.

in the tests do you actually retrieve the results by accessing $data or using unresolved (rather than .unresolved)? If you don't the individual contexts are not destroyed, in which case it is possible this becomes a resource issue.

Sorry, I'm not sure I understand. Does $data or unresolved() preserve contexts? And does preserving the context create a resource issue?

After retrieving the result, pop() removes the last remaining reference to the task, which I would hope would allow the garbage collector to clean it up.

@wlandau
Author

wlandau commented May 11, 2023

Possibly useful: I just hit the same timeout (capped at 60s) on my local Ubuntu machine which has 4 cores and 16 GB memory.

@shikokuchuo
Owner

crew uses .unresolved() to check the status of the mirai tasks while avoiding downloading the data right away. This happens during the collect() method of the controller which makes a note of all the done tasks and thus allows the right auto-scaling decisions to be made. Then in a separate step, the pop() method calls $data on the mirai object of a task previously discovered to be done. unresolved() is not used at all.

in the tests do you actually retrieve the results by accessing $data or using unresolved (rather than .unresolved)? If you don't the individual contexts are not destroyed, in which case it is possible this becomes a resource issue.

Sorry, I'm not sure I understand. Does $data or unresolved() preserve contexts? And does preserving the context create a resource issue?

After retrieving the result, pop() removes the last remaining reference to the task, which I would hope would allow the garbage collector to clean it up.

Thanks, it's just as I thought, which is fine. I wanted to double-check my understanding. There is nothing special about unresolved(): it triggers the active binding the same as if $data were accessed directly. This destroys the context, as it is no longer of any use. I don't think this line of attack takes us anywhere.

@shikokuchuo
Owner

shikokuchuo commented May 12, 2023

@wlandau with your experiences, and also with extra testing on certain rhub configurations, I am leaning towards this being a resource issue again. NNG does create a fair number of threads. Simulated mirai/targets heavy-load tasks complete on rhub (some much quicker than others) but fail on GitHub. They never fail locally on my laptop. Happy to investigate more next week, but just to let you know my thoughts.

The code for sending back a completed task from the daemon is quite synchronous and robust in my opinion - it seems unlikely that the fault is there.

@wlandau
Author

wlandau commented May 13, 2023

That would make sense for GitHub Actions, as we have discussed. But as I am kicking the tires with crew.cluster, which allows me to run many more workers than before, I am noticing hanging tasks on a fairly light version of the types of simulation pipelines my team and I run on a regular basis. I used a powerful SGE cluster with plenty of memory, so unless there is an intrinsic maximum number of threads or there is something else I am missing, I am not sure which resources would be lacking. I opened a new issue at #58. Sorry to bother you again about this.

@shikokuchuo
Owner

Let's port further discussion of the actual targets cases to #58. For the record, I am satisfied that nanonext 0.8.3 / mirai 0.8.7/0.8.7.9001 work as intended, and the targets/mirai reprexes (after fixing a couple of bugs in them) no longer hang on Github.
