Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Taking Structured Concurrency Seriously #33248

Open
c42f opened this issue Sep 13, 2019 · 76 comments
Open

Taking Structured Concurrency Seriously #33248

c42f opened this issue Sep 13, 2019 · 76 comments
Labels
io Involving the I/O subsystem: libuv, read, write, etc. julep Julia Enhancement Proposal parallelism Parallel or distributed computation

Comments

@c42f
Copy link
Member

c42f commented Sep 13, 2019

In julia 1.3 the threadsafe runtime for truly parallel tasks has arrived which will greatly increase the interest in concurrent computation in Julia's core numerical and technical computing community. For the moment, Threads.@spawn is experimental, but now is a good time to think about APIs for expressing concurrent computation in a safe and composable way with the aim to use such an API as an ecosystem gets built on top of the new functionality.

There's been a lot of recent excitement and hard work on libraries to support so-called structured concurrency, with the claim that this is fundamentally a better model for composable concurrent computation.

I'm convinced that people are onto something here and that the overall ideas are an excellent fit for the Julia language. However, there's a lot of prior art to absorb so I have started a document with the aim to do two things:

  • Survey existing ideas and implementations
  • Explore how these ideas fit within the Julia language and runtime.

Click here to read the WIP document

Meta note

I'm using this julep as a guinea pig to try out a go-proposals style workflow, as discussed at #33239. So let's try discussing this here (non-substantive procedural comments to go to the Juleps repo.

@tkf
Copy link
Member

tkf commented Sep 13, 2019

I'm copying some comments I made in https://github.com/c42f/Juleps

Discuss "black box rule" and robust cancellation separately?

I think it might be better to emphasize that structured concurrency has a few sub-concepts.

My understanding is that what njs calls "black box rule" is a sub-concept of structured concurrency and it's more fundamental than robust cancellation. So, I'd suggest to introduce "black box rule" and cancellation as different sub-concepts from the get-go. I think this is particularly important in Julia because many compute-intensive task don't need cancellation and people may not want to have runtime and conceptual overhead of robust cancellation. An alternative term may be call tree.

One counter-example to my argument that "black box rule" is more fundamental is Go's errgroup and context. errgroup is the implementation of the "black box rule" while context is the implementation of robust cancellation. errgroup is implemented on top of context and it may contradict with my argument that "black box rule" (errgroup) is more fundamental than robust cancellation (errgroup). But I think the fact that they are implemented in separate packages in Go at least backs up my argument that those are different concepts.

So, maybe it is a good idea to organize the sections like this?

  • Structured Concurrency
    • Background terminology
    • The black box rule
    • Robust task cancellation
    • ...

A benefit of having different color

I suggested to add the following paragraph at the end of Syntax section c42f/Juleps#2

On the other hand, a syntax such as async/await is a useful visual marker that emphasise when the cancellation can happen (await) and what functions are cancellable (async). Note that this does not have to be implemented at language level. For example, Go's context and errgroup enforces reader to recognize where the cancellation can happen (listening to the Done channel) and what function can be cancelled (accept Context as an argument).

Also quoting another comment:

I just noticed that you touched this already in "The challenge of preemptive cancellation" but with relation to preemptive cancellation. But maybe it's useful to touch it a bit here in a different context?

Are go statements harmful?

Copying comment from c42f/Juleps#3

OK... I think this can be very controversial. But I suggest to add:

It is important to emphasize that, as absence of goto statement that can cross function boundaries is the major feature of modern languages, the absence of "go statement" (@spawn in Julia 1.3) is the major feature of the structured concurrency. This is because it is impossible for caller to know that a function follows structured concurrency without the language-level guarantee.

To emphasize that it has large impact in Julia than (say) Python, I also added:

Note that the situation in Julia is more serious compared to other languages which has functions with "two-colors" (see below) and plugable eventloop (e.g., Python) where it is possible to enforce the absence of "go statement" at the library level.

@c42f
Copy link
Member Author

c42f commented Sep 13, 2019

Are go statements harmful?

I'm convinced they are harmful. But luckily this isn't a matter of opinion if we can agree that

  1. Structured concurrent programs are simpler to reason about because they are more restricted in form and have many benefits in composability.
  2. There is no significant penalty for giving up the flexibility of an unstructured approach.

The first of these doesn't seem controversial to me.

Lucky for us, the second can be attacked empirically due to the efforts in Trio: @njsmith has repeatedly stated that they've found no reason to back off from their insistence that all tasks are started from a nursery. The Kotlin people seem to agree (though they do have a nod to backward compatibility with the global coroutine scope). I think the other empirical answer we can have to this is to try implementing a suite of standard concurrency patterns ourselves in Awaits.jl or equivalent prototype.

@c42f
Copy link
Member Author

c42f commented Sep 13, 2019

A benefit of having different color

syntax such as async/await is a useful visual marker that emphasise when the cancellation can happen (await) and what functions are cancellable (await)

I agree this can be seen as a benefit and I've updated the document. In a cooperatively concurrent system it also marks where task switching can happen which can otherwise be a source of timing bugs, for example JuliaWeb/HTTP.jl#382 (comment)

@tkf
Copy link
Member

tkf commented Sep 13, 2019

Lucky for us, the second can be attacked empirically due to the efforts in Trio

I also add that Trio convinced Python core devs to add structured concurrency in Python itself: Yury Selivanov - Asyncio in Python 3 7 and 3 8 - YouTube. (Although it says asyncio.TaskGroup (≈ nursery) is likely to be in Python 3.8, I don't think it is added yet.)

when the cancellation can happen (await) and what functions are cancellable (await)

Oops, the second await should be async. Thanks for fixing it in the Julep :)

In a cooperatively concurrent system it also marks where task switching can happen which can otherwise be a source of timing bugs

Just to be clear, we are already in Go/Kotlin group as there is no async/await. We can still take Go's context approach to pass around cancellation tokens (it's the Done channel in Go's context). This would clarify cancellable functions from the call signature (you can't call it unless you pass the cancellation token). However, we can't use this mechanism for I/O functions because existing I/O functions do not have this signature and their color is the same as synchronous functions.

I requested to add that paragraph because I thought it was important to recognize the pros and cons of having async/await syntax to assess the advantage and disadvantage that already exist in Julia. I think the main discussion point might be "Should we pass around cancellation tokens explicitly?"

@tkf
Copy link
Member

tkf commented Sep 13, 2019

By the way, it'd be nice to have a list of compute-intensive applications that require cancellation.

One example I found/implemented was stoppable reduce in Transducers.jl. It can be used to implement parallel findfirst.

@StefanKarpinski StefanKarpinski changed the title Julep: Structured Concurrency Taking Structured Concurrency Seriously Sep 13, 2019
@StefanKarpinski StefanKarpinski added julep Julia Enhancement Proposal io Involving the I/O subsystem: libuv, read, write, etc. parallelism Parallel or distributed computation labels Sep 13, 2019
@c42f
Copy link
Member Author

c42f commented Sep 14, 2019

The following thread about how Rust's new async/await relates to SC needs a careful read. There's many interesting points related to cancellation and zero cost abstractions.

https://trio.discourse.group/t/structured-concurrency-in-rust/73

@c42f
Copy link
Member Author

c42f commented Sep 14, 2019

We can still take Go's context approach to pass around cancellation tokens (it's the Done channel in Go's context). This would clarify cancellable functions from the call signature (you can't call it unless you pass the cancellation token). However, we can't use this mechanism for I/O functions because existing I/O functions do not have this signature and their color is the same as synchronous functions.

We could add versions of IO which use this mechanism and make deprecating the old ones a long term (Julia 2.0) goal. So I think it's worth considering, if only to understand which points in the design space we're giving up on. Actually a full implementation of SC seems to be a julia 2.0 goal in any case because we already have bare @async and it would certainly be breaking to change how that works.

The action of passing around a context could be wrapped up with macros to form the expected lexically scoped chain of custody for cancellation without much syntactic overhead. This would indeed give us colored functions (colored by calling convention), but at least the syntactic price would be no worse than python and there's no language change required.

Here's a related question: are there any systems which don't have colored functions, and which also have a compelling cancellation story?

  • Modern Go has them with Contexts
  • Kotlin has them with suspend functions and coroutine builders
  • Python has them with async/await

@c42f
Copy link
Member Author

c42f commented Sep 14, 2019

Over on the Julep, @vchuravy notes about the Tapir PR:

Note that this is primarily about:

  • Compiler representation of tasks
  • Experimenting with Cilk-style semantics (e.g. may happen in parallel or what I have taken to call structured parallelism)

My thought was that the outcome of the Tapir effort could be visible to users if we find that restricting semantics in a targeted way would greatly improve optimization opportunities. For example, maybe it could enable us to transform symmetric coroutines into lightweight state machines in some cases?

@vchuravy I think it would be really valuable if you could provide a summary of how you see this effort fitting into the bigger picture.

@tkf
Copy link
Member

tkf commented Sep 14, 2019

We could add versions of IO which use this mechanism and make deprecating the old ones a long term (Julia 2.0) goal.

I didn't think of this. That's an interesting possibility.

we already have bare @async

Good point. I forgot about it. I still think we can pretend to have structured concurrency before 2.0 since it is very likely that people are lured to @spawn (or whatever the API we'll have) because of the potential performance benefits.

The action of passing around a context could be wrapped up with macros to form the expected lexically scoped chain of custody for cancellation without much syntactic overhead.

That's actually the strategy I used in Awaits.jl.

are there any systems which don't have colored functions, and which also have a compelling cancellation story?

The only thing somewhat related to this I know is thredo (built on top of curio which is an async I/O library that inspired trio). In EuroPython 2018 (David Beazley - Die Threads - YouTube), the author talked about designing a threading library from scratch, motivated by the two-color problem and also robust cancellation (the talk is mostly demo of thredo). thredo API does not enforce the "black box rule" but it has cancellation and also has (optional) ThreadGroup (≈ nursery). The cancellation can happen only if you use thread building blocks like Queue, Lock, etc. imported from thredo. But it seems that this is treated as limitation rather than a feature for marking cancellable functions. For example, it has a monkey-patching helper to replace stdlib sockets used in third-party libraries with the thredo implementation. It also has implication in the CPU-intensive code as it's cancellable only when it hits curio's eventloop. He is mentioning that this is actually a desirable feature and I guess we all agree randomly inserting checkpoint would be awful for performance-oriented language like Julia.

But I've never played with it and I don't know if there is a discussion on structured concurrency aspect of thredo. AFAICT this is an experimental library and I don't think it's used in the wild. I'm just mentioning it since it seems to fill the design space of "cancellation without color."

@tkf
Copy link
Member

tkf commented Sep 15, 2019

are there any systems which don't have colored functions, and which also have a compelling cancellation story?

Java's Thread.interrupt was mentioned as an unsuccessful story in https://trio.discourse.group/t/structured-concurrency-in-rust/73/25

@c42f
Copy link
Member Author

c42f commented Sep 16, 2019

Hm. Cancellation seems to be a really, really hard problem, and I'm sure I don't understand how hard yet! But I'm naive enough to still think there's some hope, for example kill -9 is an example of timely, preemptive cancellation which actually works most of the time.

So what is it that makes kill -9 work when other things don't? It's (partly?) the strong isolation between resources of one process and another, allowing the resources to be gc'd by the OS kernel.

On slack, @vtjnash had some interesting things to say about that. To quote directly

I think the difference in practice is that sending kill -9 actually has the effect of closing all of the open resources. That’s a scenario the other processes are usually written to handle in some way.
That it also stops running the code is somewhat irrelevant, because it’s already been cut off from all side-effect channels

I think it's interesting is that in the real world cancellation is definitely a thing: if your process starts behaving weirdly you kill -9 it. If that locks up sibling processes on your OS, you restart the machine. If that causes a distributed system to fail, you might need to restart a whole bunch of machines.

There's plenty of messy examples outside of computing. Consider companies; if a business unit is failing to make profit, it's eventually restructured or terminated and its duties redistributed. If that causes the company to fails to service its debts, the company goes into bankruptcy and gets forcibly restructured or just plain killed off (leaving the investors to do the garbage collection).

It's all very messy, but I think there's some kind of interesting informal hierarchy of supervision going on, and I've got a feeling that this is where Erlang supervisors might come in. More reading to do...

@tkf
Copy link
Member

tkf commented Sep 16, 2019

Why are you considering preemptive cancellation? Isn't the difficulty of it the motivation for cooperative cancellation? I don't think kill -9 or even a reboot is a good model for a safe cancellation. For example, this won't work if you have a filesystem-based lock (and that's why you need to do rm -f .git/index.lock).

Even in the non-computing example, if a company with copyright or license for an artwork got a hard shutdown, it may not be able to release the resource (= copyright) and then make the artwork becomes unusable. (OK. I'm not a lawyer and I have no idea if that's a reasonable scenario.)

@c42f
Copy link
Member Author

c42f commented Sep 17, 2019

Why are you considering preemptive cancellation?

An excellent question. Maybe because I'm naive and I want things which I don't know are impossible yet ;-)

Let's lay out some desirable properties for cancelling things which I hope we can agree would be nice. This is not a judgement about whether these are mutually possible! I've labeled each property for sake of further discussion. I would like cancellation to be:

  • Graceful There should be a way to define actions to take for graceful shutdown. This is inherently specific to the code running, as described in https://trio.discourse.group/t/graceful-shutdown/93
  • Releasing Resources which were acquired must be released. There's various ways to handle this (code cooperatively cleaning things up, vs garbage collecting the resources). Your lock files example occurs because the supervisor (the OS) doesn't know about the lock file as a resource which must be reclaimed. On the other hand, the OS is very good at cleaning up things like file handles.
  • Timely It should occur by some deadline. Because buggy infinite loops are a fact of the world no matter how much we wish they'd go away.
  • Ergonomic The system needs to be both simple and have high perceived value so that the ecosystem converges on writing code which handles cancellation correctly. Very nicely put by njsmith here.

It seems there's several models of cancellation which could be considered which might satisfy some subset of these properties.

  1. Cooperative cancellation with task-defined handlers (Trio / deferred pthread_cancel style) where cancel points are highly restricted, and tasks may run cleanup code.
  2. Hard preemptive cancellation (kill -9 style) where you need enough isolation so that resources can be GC'd. Seems incompatible with cleanup defined in the task itself. Has several known in-process APIs which have been a historical disaster; Thread.stop etc.
  3. Preemptive cancellation with task-defined cleanup handlers (InterruptException style) — a weird mixture where cancel points are "almost" everywhere, but task-defined cleanup can still run. Known to be a reliability disaster if just tossed into a language without thinking very hard about what "almost" means. (Possibly rust-like?)
  4. Resource-based cancellation; @vtjnash pointed this out on slack — "Julia-style cancellation, aka OO cancellation. Any IO object can be cancelled by calling close, which initiates termination of outstanding requests". This seems to be a very different model conceptually as it focuses on closing resources rather than cancelling computations (though closing a resource may lead to cancelling a computation).

@stevengj
Copy link
Member

stevengj commented Sep 17, 2019

There is a formal structure of parallel programs that was developed for Cilk and is now used in Tapir/LLVM called serial semantics (also called the "serial elision" or "serial projection" property): this refers to a parallel program that could be correctly executed by running the tasks serially.

One nice thing about serial semantics is that it enables various analyses, e.g. efficient detection of race conditions, or static compiler optimizations in Tapir. So it is a nice property for a language to allow the programmer to express unambiguously.

Are serial semantics a strictly stronger restriction than "structured concurrency"?

@tkf
Copy link
Member

tkf commented Sep 17, 2019

@stevengj In Tapir PR #31086, @vchuravy said that

It is important to note that the semantics of this representation are parallel and not concurrent, by this extent this will not and cannot replace Julia Tasks.

So, I think it is not relevant for the cancellation part of structured concurrency. But I think it is related to the "black box rule" part (see my comment above #33248 (comment)) of structured concurrency. I'm not familiar with the serial-projection property but this property seems to imply that "black box rule" must be followed (e.g., both of them disallow leaking tasks). However, as @vchuravy demonstrated you can write concurrent code following "black box rule":

In order to exemplify this issue see the following Julia task code:

@sync begin
    ch1 = Channel(0)
    ch2 = Channel(0)
    @async begin
        take!(ch1)
        put!(ch2, 1)
    end
    @async begin
        put!(ch1, 1)
        take!(ch2)
    end
end

Doing a serial projection of this code leads to a deadlock.

So, I think serial-projection property is strictly stronger requirement than "black box rule", too.

@tkf
Copy link
Member

tkf commented Sep 17, 2019

@c42f IIUC, I guess I can (almost) summarise your models by how Cancel-Triggered Exception (CTE) behaves, like this?

CTE can happen anywhere CTE happen only at await etc.
CTE catchable Model 3 (e.g., Java) Model 1 (e.g., Trio)
not catchable Model 2 (e.g. kill -9) N/A

But I don't understand what is "Julia-style." It seems to fit in Model 3. Also, I don't understand why "resource-based cancellation" is not doable in Model 1. Trio has "I/O resource" that can be used as communication (lock, channel, etc.) and they are aware of cancellation.

@stevengj
Copy link
Member

So, I think serial-projection property is strictly stronger requirement than "black box rule", too.

Okay, so should we be talking about ways for users to express the stronger semantics (serial projection) or the weaker one (black box rule)? The former seems to allow strictly more tools and optimizations, so maybe it is more useful?

@tkf
Copy link
Member

tkf commented Sep 18, 2019

I'm in favor of having dedicated syntaxes to express the serial-projection property and black box rule.

I wonder if it is possible to "just" introduce a different version of @sync. That is to say for concurrent tasks we use

@sync begin
    ...
end

and for parallel computations with the serial-projection property we use

@sync_tapir begin  # maybe there is a better name
    ...
end

The body ... contains @spawn etc. In particular, replacing @sync_tapir with @sync is valid (the other way is not always true). This way, we can reuse functions for parallel and concurrent programs.

Naively thinking, this seems to be possible if we create a local "task group context" (a more sophisticated version of the vector assigned to Base.sync_varname; aka nursery) of different types in @sync and @sync_tapir. It is better to use type-based solution (rather than, say, macro-based solution) because it let us use the pattern to pass around the task group context and launch tasks inside other functions without immediately synchronizing them inside those functions. (FYI, if you want see actual (toy) code example, see, e.g., https://github.com/tkf/Awaits.jl/blob/24712770164cbd22c4de2568dc4ace1c75553eca/test/test_simple.jl#L21-L44 and Awaits.@go.)

@c42f
Copy link
Member Author

c42f commented Sep 18, 2019

Yes it seems like the serial projection property is rather restrictive and can't replace the general use of Tasks. But it's also a rather natural fit for data parallel problems in technical computing where the amount of parallel compute available at runtime is irrelevant to the algorithm.

So naively I'd favor having syntax for both; with @sync_tapir being an assertion by the programmer that the serial projection is valid for the contained code.

it let us use the pattern to pass around the task group context and launch tasks inside other functions without immediately synchronizing them inside those functions

I'm not so sure about this: serial projection seems to be a local property of an algorithm, and passing a @sync_tapir-generated context to an algorithm which doesn't have the serial projection property could cause deadlock. But perhaps I'm misunderstanding.

@vchuravy
Copy link
Member

So naively I'd favor having syntax for both; with @sync_tapir being an assertion by the programmer that the serial projection is valid for the contained code.

I am worried about the additional complexity for the user to reason about the difference between a task with SPP and concurrent task.
I am working on a couple of ideas, but I am hoping that we don't have to introduce a separate language concept. If that doesn't work out (you know research), we may want to consider having parallel constructs like for-loops with @threads or map to use tasks with SPP.

@tkf
Copy link
Member

tkf commented Sep 18, 2019

I totally agree that it's nice to have composable high-level abstractions like map, (parallel) foreach and reduce so that users can first try to solve their problem without getting into the details of parallel computing. (BTW, making reduce composable is where you start needing something like transducers 😉)

But I'm very curious to know if all programs with SPP are expressable by the higher-order functions (or more like, by reduce, as other examples I brought up can be expressed in terms of it). If not, maybe it makes sense to have @sync_tapir as low-level "unsafe" construct?

@tkf
Copy link
Member

tkf commented Sep 18, 2019

if all programs with SPP are expressable by the higher-order functions

Actually, the fib example in the Tapir PR is a good example where reduce is not enough.

@vchuravy
Copy link
Member

If not, maybe it makes sense to have @sync_tapir as low-level "unsafe" construct?

Yes I think this will be the case. But let's not get ahead of ourselves I have many issues to solve first :)

@c42f
Copy link
Member Author

c42f commented May 26, 2020

Hey guys sorry to be silent here for a while.

Yes, I agree it's unavoidable and especially hard to do in Julia because nobody wants random cancellation points to be inserted in their carefully written tight loops.

Agreed, although efficiency is just one aspect. The more difficult part is that inserting cancel points can break the programmer's expectations in terms of resource handling. I think I was harping on about this recently with the fairly trivial example of Base.lock() not being async signal-safe for InterruptException (ah yes, here it is: #35524 (comment)). We can obviously fix Base, but I think this emphasizes that writing code which is safe for use with InterruptException is currently subtle.

There seem to be some options for cancel points:

  • They could be restricted to certain function calls where the programmer is already thinking about error handling (trio uses IO for this)
  • We could have language syntax + semantics which makes resource acquisition and cleanup atomic with respect to cancellation. For this to succeed, it would have to be sufficiently nice that users write cancel-safe code by default.
  • Rely on users inserting explicit cancel points into their code
  • Make cancellation a property of resources which are being used by the code, rather than of the code itself. (I include this because Jameson has mentioned it on multiple occasions, though I don't fully understand how it would work.)

As has been discussed here, passing around nurseries gives you basically unrestricted dynamic task spawning again.

I don't think it's unrestricted. You can always reason about the scope (= nursery) of the task lexically because you either see @sync or the nursery as an argument.

Yes, exactly!

I think it's interesting to compare to Go, where people started off allowing for goroutines to be spawned anywhere, but current practice converged on passing the context everywhere (I think? I'm not a Go programmer though.) The point is to have a middle ground which is less onerous and more composable than passing a context everywhere explicitly, while also avoiding the downsides of being able to implicitly start new branches of a computation with side effects which outlive the caller.

@tkf
Copy link
Member

tkf commented May 26, 2020

There seem to be some options for cancel points:

For completeness, I think we might need something like disable_sigint but more generic (not just for sigint). Trio has CancelScope.shield, asyncio has shield, Kotlin has NonCancellable, and it's planned(?) for Project Loom (Java).

  • We could have language syntax + semantics which makes resource acquisition and cleanup atomic with respect to cancellation.

Are you thinking something better than the do block? Maybe the f(...)! syntax #7721 (comment)?

But I think it's reasonable to assume that it'd take some time to land. Meanwhile, how about discourage using lock(l)/trylock(l) and recommending/implementing lock(f, l)/trylock(f, l)?

@c42f
Copy link
Member Author

c42f commented May 26, 2020

Are you thinking something better than the do block? Maybe the f(...)! syntax #7721 (comment)?

Oh yes nice find, I don't think I've seen this discussion. It would be doubly nice if it could remove more uses of finalizers which are just generally problematic in many ways :)

@c42f
Copy link
Member Author

c42f commented May 26, 2020

Here's a conversation which touched on @sync vs Experimental.@sync from the Julia Computing slack (we didn't set out to discuss this, but it ended up relevant so I said I'd repost it here for everyone to read):

  • Chris Foster 4:55 PM
    @jeffbezanson Here's the trio issue by @njsmith describing what's wrong with trio's handling of child task errors: MultiError v2 python-trio/trio#611
    I think they reach mostly same conclusion we did today (I guess not by coincidence! I did read through that thread a while back.)
    Of course we've got the advantage that our error handling syntax is half baked right now 😄
    So at least if we get the internal representation of concurrent errors correct we can eventually make a nice API to go with it.

  • Jeff Bezanson 1:24 AM
    To fill everyone in, Chris & I had a discussion about task error handling and we decided:

    • Make excstack a first-class object and replace the .backtrace field in Task with it. A bunch of the C code for munging it can go away and/or move to julia.
    • Combine CapturedException, CompositeException, and TaskFailedException into one thing similar to trio's MultiError. Need a name for this! PropagatedExceptions? ExceptionTree?
    • That new thing should always be used instead of passing tasks around, since Tasks are hard to serialize. Instead we need to process a RawExceptionStack when it's serialized. Not sure exactly how --- we could replace the raw stack with a processed one inside ExceptionTree, or the raw stack itself could be internally converted to a processed form.
  • Jameson Nash 4:43 AM
    I’m not sure I entirely agree about the last point, since it implies we can’t start printing information to the user about an error in one task until we have confirmation that all tasks in a sync block have finished. Accordingly, I think we may want to think about moving towards having show_error do more of the work to process it, so that sync doesn’t need to wait for it to be in a consumable form.

  • Jeff Bezanson 4:46 AM
    We don't need to ban printing tasks --- a task will still contain its exception and backtrace, and you can still look at them.
    sync can also return the raw trace data from tasks, it doesn't need to process them.

  • Jameson Nash 4:49 AM
    The current intent for Experimental.@sync however is to be unable to even get the trace data from all of the Tasks

  • Jeff Bezanson 4:59 AM
    It doesn't seem right to me to insist on that in every case. Surely there are cases where you need to potentially collect errors from multiple tasks?

  • Jameson Nash 5:07 AM
    You can’t have both that and early return, unless the return set is the list of Tasks which lets show_error then do advanced diagnostics

  • Jeff Bezanson 5:11 AM
    I think it's fine to have "return as soon as there is one failure, and tell me just that one failure". But I think what trio does is as soon as there is one failure, try to cancel all the other tasks, then wait for them, then gather exceptions from all the failed ones. So we might need to provide both options.
    Does show_error need anything from a Task besides the exception and backtrace?

  • Jameson Nash 5:19 AM
    Yeah, “try to cancel” is horribly problematic though on many levels

  • Jeff Bezanson 5:19 AM
    Ok but that is a separable issue --- you can also just do what @sync does today, with no canceling.

  • Jameson Nash 5:20 AM
    Yes, but then you can’t collect the errors. You either must return the Tasks, or just the first error.

  • Jeff Bezanson 5:21 AM
    Why? You could still wait for all of them to finish.

  • Jameson Nash 5:21 AM
    I think if it has the Task object itself, we may be able to do some more interesting work showing data dependency cycles, meaning we can print much more useful stacktraces
    Because not waiting for them all to finish is explicitly why Experimental.@sync exists as an improvement over @sync

  • Jeff Bezanson 5:23 AM
    So waiting for a set of tasks to finish is inherently wrong?
    I just don't buy that there is never a need to propagate exceptions from multiple tasks. It seems pretty fundamental in trio.

  • Jameson Nash 5:23 AM
    Not necessarily, that’s just what’s being proposed there
    And I agree we do want to get the exception from multiple tasks, I’m just saying we can’t have both the experimental design proposal, and follow the last bullet point (of not returning the Task itself)
    And if we return the Task itself, we can also print much more interesting backtraces, that the runtime system couldn
    have computed, but must be deferred until error printing time (when we can examine the heap)

  • Jeff Bezanson 5:28 AM
    Ok, well I'm not totally opposed to throwing the Tasks themselves; I see you could do things like check which of them might be waiting for others.
    And Tasks are convenient in that they already bundle an exception and a backtrace; otherwise we need to awkwardly pass around two values for each error.

  • Jeff Bezanson 5:36 AM
    Another question is, do we always have a Task whenever we want to propagate an exception, or do we sometimes need to just propagate a bare pair of exception+backtrace?

  • Jameson Nash 6:04 AM
    We might need to propagate a bare pair if the root task (the one with the sync call) failed

  • Chris Foster 10:26 PM

    Yeah, “try to cancel” is horribly problematic though on many levels

    I'd love to understand better why this is.
    Cancellation seems fairly fundamental to the desire for usable structured concurrency rather than do-what-the-heck-you-like concurrency.
    Experimental.@sync just gives up on scoping of child tasks so I'm not at all convinced it's a long term improvement over @sync

  • Chris Foster 10:43 PM
    I understand that cancellation is hard because resource management is hard (more or less). Preemptive cancellation leaves resources dangling so you really can't do that without complete isolation between tasks enforced by the runtime (erlang style) so the runtime can GC them. Clearly that's completely incompatible with the Julia task model.
    Ok, so what are the options for cooperative cancellation?
    There's the trio / pthreads / etc style where cancellation is level triggered and causes a small well-documented set of IO functions to return an error. That seems pretty reasonable for non-buggy programs which do IO.
    @jameson I think you said the pthreads-like option is insufficient last time we discussed this, but I don't think you explained exactly why.

  • Chris Foster 10:57 PM
    Problem is, we can't cancel compute with that model and a lot of compute is what people are likely to be doing in Julia...
    So is this the main sticking point: there's no known models for safely cancelling compute which would fit with our runtime?

  • Jameson Nash 11:08 PM
    I’ve just never seen a cancellation system that isn’t some combination of:

    • rife with mistakes in the choice of functions (IO is the worst possible choice, IMO)
    • fails to provide any useful bound on time to completion (this is the thing you actually want to cancel)
    • incompatible with proper cleanup of resources (esp. trio, which tries to pretend otherwise)
    • deprecated due to those issues
  • Chris Foster 11:12 PM
    Hmm, interesting. So why is IO the worst choice, and what would be a better one?
    Time to completion is a real killer.

  • Jameson Nash 11:13 PM
    Transaction memory would probably be the main one
    IO is the worst choice because it guarantees you’ll leave the other actors in a broken state
    The goal is to cancel pure work (which might include IO with nobody listening), but instead it cancels the work most likely to lead to permanent corruption
    (and not to say I don’t use kill -9 with abandon, but I usually then do need to go fix the filesystem)
    My proposal (if I had time to work on this more), would be to make sure we have a way to close the dead IO objects explicitly when the nursery is exited
    So I’d propose ignoring tasks completely, and make open the primitive that’s dynamically scoped inside a nursery
    For the same reason actually too that r = open(cat *); wait(r); read(r) is a deadlock, because it waits for the Task instead of the unit of work (the readall call needs to happen before the wait )

  • Chris Foster 11:21 PM
    So is your point something like... it's not so much nesting of task lifetime which is important in structuring concurrency, but something related — nesting of any side effects the tasks may have?

  • Jameson Nash 11:22 PM
    I also think this isn’t something you can retrofit into an existing language, but fortunately, Julia has strong safe management of IO resources and exposes those handles typically to the user, so it’s something we should be able to do
    Right.

  • Chris Foster 11:23 PM
    Ok very interesting. Very easy to confuse those two things, but it's side effects which really matter

  • Jameson Nash 11:23 PM
    Actor model (including Trio) gets you pretty far by asserting that one task must be equivalent to one side effect

  • Chris Foster 11:30 PM
    I'm just trying to digest all that. So the idea is that if tasks communicate exclusively with things which can be opened, then ensuring those channels are closed in a nested way can ensure side effects are nested?
    What about shared memory though?

  • Jameson Nash 11:30 PM
    But doesn’t seem to nest correctly? For example, what if I wanted to write:

    with(database) do:
        for client in accept(port):
            @async with(client) do:
                handle(client)
    
    • Chris Foster
      So, I feel like this is a really important point you're making here. But I don't understand it — would you be able to expand a little?

    • Jameson Nash
      If one client dies, we don’t want it taking down the entire database. But so on client death, it may need to do IO with the database to cleanup and notify other live clients

    • Jameson Nash
      But if the database dies, we need all of the tasks to exit without attempting to cleanup the database (which is already dead)

  • Jameson Nash 11:32 PM
    Shared memory I think also has a close function?
    Not all of them are created via the open verb as well. Some (like Channel) are simply constructed.

  • Chris Foster 11:35 PM

    Shared memory I think also has a close function?

    Sure, in principle, so no more sharing of normal Arrays? I'm not sure I'm understanding 🙂

  • Jameson Nash 12:37 AM
    Ah, I thought you meant the stdlib of that name. I’ve done some design thinking on adding close to Array itself (currently available via empty! + reshape), but that seems tangential to this.

  • Chris Foster 12:51 AM
    Well I'm supposing that mutating memory (in a way which is visible to other tasks) should be correctly nested as much as other side effects should be.
    But I don't understand how that can be made practical.

  • Jeff Bezanson 1:00 AM
    This is a reason I think the current @sync is maybe not so crazy. If you can set up your tasks to work with cancellation, you can probably also set them up to just terminate. Cancellation just feels more magical and automated. Waiting for everything is a pain for debugging, but we can address that with better introspection and tooling i hope.

  • Jameson Nash 1:14 AM
    I think you might have close on a Lock, since that’s generally necessary the shared resource (the Array itself then being single-owner-at-a-time with exchange controlled by the lock)? I haven’t looked into that though.
    Yeah, we could probably implement deadlock detection, and then the current @sync design also might just magically work
    (deadlock here meaning detection that everyone in the sync-set is waiting for an internal, so non-IO, event)

@tkf
Copy link
Member

tkf commented May 26, 2020

I don't get why "IO is the worst choice." You should be ready for failures when dealing with IO. So, it seems to be the best choice.

Also, I'm not sure why it'll "leave the other actors in a broken state." Lexical scoping in structured concurrency is powerful exactly because it works nicely with lexically scoped resource management (Julia's do and Python's with).

I agree using resources as cancellation tokens is a good implementation strategy but I think it's rather a very low-level implementation detail. I think it's also hard to use in practice.


A case-study on using resources as cancellation tokens

In #34543, I wondered how to make Threads.foreach(f, channel::Channel) beahve nicely with errors. There are a few nice properties to have:

  1. When one task throws, all other tasks terminate reasonably soon (structured concurrency).
  2. Don't close the input channel.
  3. All items take!n from the input channel are processed.

Current implementation in #34543 is, roughly speaking, something like this

function tforeach1(f, channel::Channel; ntasks=Threads.nthreads())
    stop = Threads.Atomic{Bool}(false)
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in channel
                f(item)
                stop[] && break
            end
        catch
            stop[] = true
            rethrow()
        end
    end
    return nothing
end

tforeach1 satisfies the property 2 and 3 but not 1. This is because a task can be stuck on take!(channel) in the line for item in channel.

Initially, I thought it might be better to use the channel as the resource that is closed to propagate the cancellation:

function tforeach2(f, channel::Channel; ntasks=Threads.nthreads())
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in channel
                f(item)
            end
        catch
            close(channel)
            rethrow()
        end
    end
    return nothing
end

tforeach2 satisfies the property 1 and 3 but not 2. That is to say, we shouldn't use the resource we do not own as the cancellation token. So how about creating the resource that we "own"?

function tforeach3(f, channel::Channel; ntasks=Threads.nthreads())
    owned_channel = Channel() do owned_channel
        for item in channel
            put!(owned_channel, item)
        end
    end
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in owned_channel
                f(item)
            end
        catch
            close(owned_channel)
            rethrow()
        end
    end
    return nothing
end

Unfortunately, tforeach3 satisfies the property 1 and 2 but not 3. The owned_channel may be closed after an itme is take!n from the input channel.

If we have something like Go's select (i.e., we have a way to say "exactly one of those effect happens"), it'd be possible to do this. But it's rather cumbersome:

function tforeach_safe(f, channel::Channel; ntasks = Threads.nthreads())
    done = Channel{Nothing}(ntasks)
    try
        @sync for _ in 1:ntasks
            Threads.@spawn while true
                @select begin
                    item = take!(channel) begin
                        # `take!(channel)` happened (but `take!(done)` didn't)
                        try
                            f(item)
                        catch
                            put!(done, nothing)
                            rethrow()
                        end
                    end
                    take!(done) begin
                        # `take!(done)` happened (but `take!(channel)` didn't)
                        break
                    end
                end
            end
        end
    finally
        close(done)
    end
end

(Note: more precisely we need to use maybe_take!(::Channel{T}) :: Union{Nothing,Some{T}} instead of take!(channel) to handle the case channel is closed. But it's a pseudo-code anyway.)

@tkf
Copy link
Member

tkf commented May 27, 2020

Actually, tforeach_safe is not enough because we don't control which take! is preferred in @select (at least if it is Go-like)... I guess we need something like

function tforeach_safe2(f, channel::Channel; ntasks = Threads.nthreads())
    taskref = Ref{Task}()
    done = Channel{Nothing}(ntasks)
    @sync try
        request = Channel(taskref = taskref) do request
            while true
                response = take!(request)
                @select begin
                    item = take!(channel) begin
                        put!(response, item)
                    end
                    take!(done) begin
                        close(response)
                        break
                    end
                end
            end
        end
        Base.@sync_add taskref[]
        for _ in 1:ntasks
            Threads.@spawn begin
                response = Channel(1)
                while true
                    try
                        put!(request, response)
                    catch
                        break  # closed by other worker tasks (or no more items)
                    end
                    item = try
                        take!(response)
                    catch
                        break  # closed by the `request` task
                    end
                    try
                        f(item)
                    catch
                        # Allow no more request (but still process
                        # requests that are already sent):
                        close(request)
                        put!(done, nothing)
                        rethrow()
                    end
                end
            end
        end
    finally
        close(done)
    end
    return nothing
end

@NHDaly
Copy link
Member

NHDaly commented Jul 27, 2020

Combine CapturedException, CompositeException, and TaskFailedException into one thing similar to trio's MultiError. Need a name for this! PropagatedExceptions? ExceptionTree?

Also, just wanted to chime-in regarding this point, and point out this package we ended up writing now that we're using more concurrency inside our code at RelationalAI:
https://github.com/NHDaly/ExceptionUnwrapping.jl

It's scary that some internal decisions to multithread some code can change the kinds of exceptions that are being thrown, so we're now using has_wrapped_exception(e, T) everywhere instead of e isa T in our catch blocks.

This seems relevant, though i think orthogonal, to the current discussion, so I just wanted to note it :)

@c42f
Copy link
Member Author

c42f commented Aug 22, 2020

I continue to try to find counter-examples to structured concurrency, or at least cases where it would uncomfortably constrain programming style. Recently I thought I found one but it turned out to be a false alarm. It's an interesting case to think through though.

The setting

For resource lifecycle handling, the open(thing) do resource ; ... end pattern is extremely useful because

  1. The user can't forget to call close
  2. The resource type owns the stack because it owns the open implementation; it can use block constructs like try ... finally, @sync, and control state can be kept in the stack. This seems closely related to the benefits of foldl based loops which @tkf has been emphasizing for a while.

The former is important for ergonomics, but the latter is what matters for structured concurrency as the open() may need to start async background tasks (eg, as a resource which communicates with a remote server). In strict structured concurrency these async background tasks can't escape the call to open(), so the pattern of passing a closure to be invoked with resource in the inner scope is very compatible with this.

For the same reason, the classic file IO like producer-consumer pattern of resource management with paired open/close can't be used in structured concurrency because it leaves no context to manage the child tasks:

resource = open(thing)
do_stuff(resource)
close(resource)

(That is, not without making the task nursery an explicit part of the open API's parameters, which is no good for composability as it essentially colors open() via calling convention.)

Why not simply transition to always using do blocks and closures for resource handling?

Well we should for most things, even just for the ergonomic benefits of writing code which always correctly closes resources.

But it's actually pretty inconvenient in the REPL! In the REPL you want to enter a context where resource is available and use it interactively. So the scoped resource management is pretty inconvenient in this important case.

I thought this was a problem but it actually isn't if the REPL itself maintains a nursery. One way out is to have a macro which, roughly speaking, turns the nested interface

open(thing) do resource
    do_stuff(resource)
end

into something which looks like the producer-consumer interface by introducing a task explicitly into the REPL nursery to manage the resource lifetime. Schematically,

ready = Channel()
done = Channel()
@async_in_ctx repl_nursery open(thing) do r
    global resource = r  # Or whatever is required to set this in the REPL context
    put!(ready, true)
    take!(done)
end

# REPL work ...
do_stuff(resource)

put!(done, true)

Working sketch

Here's a sketch that actually works in current Julia (though I resorted to bare @async because we don't have explicit nurseries, or a REPL which maintains one):

struct AsyncResource
    done::Channel
end

Base.close(r::AsyncResource) = put!(r.done, true)

macro asyncdo(ex)
    @assert ex.head == :(=)
    var = ex.args[1]
    call = ex.args[2]
    doblock = :(placeholder() do x
        # This `global` is problematic!
        # Should be an `outer` variable!?
        global $var = x
        put!(ready_done, true)
        @info "Ready $(current_task())"
        take!(ready_done)
        @info "Done $(current_task())"
        nothing
    end)
    doblock.args[1] = call
    quote
        ready_done = Channel()
        @async $doblock
        take!(ready_done)
        AsyncResource(ready_done)
    end
end

Kinda ugly usage example

julia> resource = @asyncdo io = open("resources.jl") # internally, calls do-based form of open()
[ Info: Ready Task (runnable) @0x00007ff3843549d0
AsyncResource(Channel{Any}(sz_max:0,sz_curr:0))

julia> read(io, 1)
1-element Array{UInt8,1}:
 0x23

julia> close(resource)
[ Info: Done Task (runnable) @0x00007ff3843549d0
true

julia> isopen(io)
false

Of course it looks pretty weird to use this for normal file streams as in this example. But I hope it makes sense if you imagine a world in which there's no such function open("file") but only the scoped form open(f, "file").

@StefanKarpinski
Copy link
Member

Any thoughts on how that might mesh with the io = open(path)! syntax proposal in #7721?

@c42f
Copy link
Member Author

c42f commented Aug 25, 2020

Any thoughts on how that might mesh with the io = open(path)! syntax proposal in #7721?

Aha, thanks for the very relevant link. I was hoping to re-read that discussion last week but I just couldn't find it. (More complete XRef: The relevant discussion of postfix-! syntax starts here #7721 (comment); Jeff suggested it could relate to a more general defer form here #7721 (comment).)

One interesting thing about defer-like approaches is that they avoid excessive nesting syntax (and, perhaps, strict nesting of resource lifetimes) and still provide guarantees about when cleanup will run. There's a lot to like about it, especially with the ! shorthand syntax.

On the other hand, having open(path)! surface syntax suggests that resources would be implemented as state machines with open and closed states rather than a higher order function accepting the user's callback. That's relevant because the following seem incompatible:

  1. Using explicit state machines for resource handling, where the resource implementer returns the resource from open().
  2. Structured concurrency with child task lifetimes which do not outlive function calls
  3. Ensuring that the use of concurrency within open() is an implementation detail. (AKA, avoiding colored functions; colored by the calling convention of whether open() accepts a nursery or not.)

I think there's various ways to attack this apparent problem.

  • Don't use explicit state machines for resource handling; invent some lowering for ! (or other convenient surface syntax) which lets the user use them like a state machine but the implementer implement them as a normal higher order function. FLoops.jl is some interesting prior art for this kind of approach. (The AsyncResource sketch shows one ugly and inefficient way that a callback-based open() can be transformed into a state machine. Perhaps there's something less bad!)
  • Naturally, one could ditch structured concurrency. Seems like a pity!
  • Colored functions; make the calling convention for "resource functions" include a nursery in some way. Could possibly be viable with the right lowering for !, as the appearance of ! in the source code seems like it would induce an explicit chain of custody. (Cf. the way that Kotlin's coroutine context is passed)

No doubt there's other options.

@tkf
Copy link
Member

tkf commented Aug 25, 2020

My current thinking is that the higher-order function pattern does have a limitation in that it is too flexible.

For example, in FGenerators.jl (the "sister project" of FLoops.jl), I added FGenerators.@with so that open(...) do pattern can be used with good performance-oriented iterations over "data collections with resources". This is required to avoid Boxing in the closures.

The closure problem itself may be solved at some point. But I think it's valuable to encode the basic property that the "do block" is executed exactly once. For example, Rust has a similar FnOnce trait (although Rust's resource management is usually RAII). Python's with block satisfies this property exactly. OTOH, in FGenerators.@with, the user has to manually make sure this is the case. This is obviously not ideal (and doing so rigorously ATM is virtually impossible because of multiple dispatch).

the open() may need to start async background tasks

3. Ensuring that the use of concurrency within open() is an implementation detail.

These points are interesting. In a way, Trio is not doing structured concurrency super strictly if you are viewing from this standing point. That is to say, it allows you to leak resource since you can call __aenter__. So, you can define open/close as state machine quite easily even when you need nursery as a sub-resource:

class ResourceWithNursery:

    async def __aenter__(self):
        self.manager = trio.open_nursery()
        nursery = await self.manager.__aenter__()
        nursery.start_soon(trio.sleep, 0.1)
        print("Enter")

    async def __aexit__(self, exc_type, exc_value, traceback):
        print("Exit")
        await self.manager.__aexit__(exc_type, exc_value, traceback)

Of course, most of the time Python programmers do not write resource managers by hand. Instead, it'd be derived from the generator syntax:

from contextlib import asynccontextmanager


@asynccontextmanager
async def resource_with_nursery():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(trio.sleep, 0.1)
        print("Enter")
        yield
        print("Exit")

I think this is close to the spirit of "the implementer implement them as a normal higher order function." From my experience of playing with IRTools.jl a bit, I think something like @asynccontextmanager is totally possible.

But, the "lesson" I'd derive from this is rather that it is OK to provide "unsafe" API as long as the API has the clear indication of unsafeness to nudge the implementer to be very careful. Even Rust can't save you from miss-implemented drop anyway. The open/close state machine would be one of such "unsafe" APIs. To explicitly denote that certain APIs are low-level, I think Python does a good job at it with the "dunder" methods. I'm not suggesting dunder as the only solution but, since we already have __init__, it may not be so crazy to have __open__ and __close__. It gives you some alarm if you see code calling these functions manually.

A bit aside, but I think one of the designing challenges of open(...)!/defer-like approach is what to do when you want to control the lexical scope where the close must happen. For something like a mutex it matters a lot.

@StefanKarpinski
Copy link
Member

Other options are to disallow defer calls in global scope, ignore them (let finalizers handle it), or implicitly turn them into finalizers in global scope. Not sure if any of those approaches simplifies things.

@c42f
Copy link
Member Author

c42f commented Aug 26, 2020

At global scope it would seem logical to turn defer handlers into finalizers as there's no static lifetime information. I'm unsure how defer would interact with closure captures, but we should probably discuss that in #7721.

OTOH, in FGenerators.@with, the user has to manually make sure this is the case. This is obviously not ideal (and doing so rigorously ATM is virtually impossible because of multiple dispatch).

Interesting point. Maybe we could have a wrapper which dynamically asserts the call-once property, but can be optimized away when enough type information is available.

@c42f
Copy link
Member Author

c42f commented Aug 31, 2020

Some interesting discussion of Go's context here:

https://news.ycombinator.com/item?id=24323564#24327585

@c42f
Copy link
Member Author

c42f commented Oct 31, 2020

Swift now has a set of WIP proposals for concurrency in general, including one on structured concurrency. From a fairly quick read, their proposal seems very much similar to the Trio approach. However it's an interesting and very relevant read in its own right, as it's quite concise and discusses structured concurrency as a language feature rather than at the library level:

https://github.com/DougGregor/swift-evolution/blob/structured-concurrency/proposals/nnnn-structured-concurrency.md

I found the section on cancellation interesting. The proposed model seems quite standard: it's fully cooperative cancellation where they expect IO operations to be cancellable. To quote:

With that said, the general expectation is that asynchronous functions should attempt to respond to cancellation by promptly throwing or returning. In most functions, it should be sufficient to rely on lower-level functions that can wait for a long time (for example, I/O functions or Task.Handle.get()) to check for cancellation and abort early. Functions which perform a large amount of synchronous computation may wish to periodically check for cancellation explicitly.

Somewhat related to all this, it seems Swift will go down the path of colored functions — see the WIP async/await proposal.

@vtjnash
Copy link
Member

vtjnash commented Oct 31, 2020

I find these "standard" cancelation proposals (based from trio, inherited from C?) to not be very helpful. They are supposed to give you implicit cancelation after a timeout, then provide neither. They still only cancel after specific points explicitly check for the condition (when reaching IO operations) and they thus still require you to have resources—they just don't let you clean up resources properly (close is also a cancelation point in C, but also probably will leak memory and break other state since any cleanup is skipped). And worse, they seem to add yet another "color" of functions (yellow?): there's those that are canceled (nursery) and those that aren't (detached). The cancellation that I'm arguing we already have is thus equivalent in functionality but better on each axis: we define that the cancellation happens when interacting with a resource ✔️, except the check really is implicit ✔️(via close state checking) and fully scoped and doesn't require passing along additional handles ✔️(just using the ones it already has). This wasn't possible in C since there's no resource management of file descriptors, so pthreads has this sort of cancellation instead. But we don't need to keep inheriting the limitations of that past, if we have an IO, Channel, and Event system planned to handle this.

@c42f
Copy link
Member Author

c42f commented Nov 2, 2020

Right, I know you've said you don't like other cancellation systems on multiple occasions. To be clear, I'm not arguing that in this swift proposal Swift is doing it "the right way", only pointing out that it's interesting to see. BTW, did you read it yourself — I'd be curious if you think my summary is accurate?

The cancellation that I'm arguing we already have

My problem with the cancellation we already have is that using @sync correctly seems to be very tricky and manual — it's extremely easy for a task to accidentally crash and not have its siblings be cancelled. It's necessary to manually pass around cancellation tokens if there's no resource naturally available. For cancelling compute (and memory access), I wish there was a way to check "should this task be cancelled?" without creating a Channel or some other resource and passing it by hand.

Experimental.@sync intentionally leak tasks which may still be running, but I don't think this is a good solution unless the leaked tasks truly have no observable side effects. In current Julia there's many ways a task can have side effects visible to other tasks.

@tro3
Copy link

tro3 commented Dec 10, 2020

Okay, the overall language- vs library-level choices and concurrency strategy decisions are way beyond my Julia pay grade, but there may be some short-term fixes available for one of the comments above that I ran across while poking around #32677.

it's extremely easy for a task to accidentally crash and not have its siblings be cancelled

Completely agreed with this. But I think it this might be a more tactical implementation issue - the @sync task "manager" currently just walks through the @async tasks in order, waiting for each to complete. This allows interdependencies between the @async tasks to lock things up (for example when a task later in the order dies while a task earlier in the order is waiting for it).

An improved parallel structure would be to schedule the @sync task after the completion (for any reason) of each @async task, and have it then scan across the tasks for errors/completion, and immediately throw exceptions that arise. Simple example in the #32677 comments. This won't stop programmatic lockups (one can still generate lockups with Channels if one tries) and it does not guarantee the shutting down of the sibling tasks on an error, but it would prevent lockups due to premature crashes and give the user a guaranteed stacktrace to debug with. Not a total solution nor a grand API shift, but a pretty decent improvement for a minimal change.

@tk3369
Copy link
Contributor

tk3369 commented Jul 5, 2022

Catching up... I also just found https://github.com/JuliaConcurrent/Julio.jl

Does this package:

  1. handle all issues discussed in this issue, if not, where are the gaps?
  2. eliminate the need to do something in Base? why or why not?

Cc: @tkf

@jariji
Copy link
Contributor

jariji commented Dec 2, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc. julep Julia Enhancement Proposal parallelism Parallel or distributed computation
Projects
None yet
Development

No branches or pull requests