Straighten out error handling via a thread-local (but otherwise global) context #1327
How error contexts are implemented and how they are used (including for slicing): this replaces some gunky exception-chaining in the main branch (below), which I previously wasn't happy with, as it complicated the stack trace. Now we don't do exception-chaining within the Awkward library (we do still try-catch errors from other libraries, and of course StopIteration); it only raises the exception at the site of the error—it just has more information when it gets there.
Can you show what they look like now (as in "before" the PR) for comparison? This seems like a lot of specialized, custom handling that is fragile and likely to be hard to remember to include in the future (unless you implement a custom flake8 or pylint check, for example). I'm not sure I know exactly what this is solving - is it solving several things? This might be the best solution, but I want to make sure other options have been exhausted. :)

Note that error handling is changing a lot in Python 3.11; they have added a new way to attach a note to an exception (PEP 678).
I'd also benefit from learning a bit more about the wider context here, and wanted to write a comment rather than just +1 :)
### Implementation

Avoiding fragility is a high priority. This implementation doesn't adjust the stack trace in any way—all it does is construct the error message in a standardized way, in a central place. I'll expand on the wider motivation below, but the thing we're trying to have happen is for the error message, however deeply in the Awkward codebase it occurs, to say something in its text about the user-facing operation.

From the sound of the word "error groups," core Python may be addressing the same thing: not all parts of the stack trace are equally important to all users/developers. In particular, the boundaries between calls within a library and calls between libraries are important in general, not just in Awkward. When that feature becomes available, it would hurt nothing to use both.

### Previous implementations

Users' problems detangling their indexing errors from Awkward internals started in Awkward 0, and one of the reasons I was looking forward to putting the internals into C++ in Awkward 1 was to hide a lot of it from the stack trace. My thinking then was that the Python stack trace would terminate on a user call.

Meanwhile, I'm still hearing that

```python
ak.Array([[1, 2, 3], [], [4, 5]])[[[True, False, True], [], [False, True, True]]]
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/highlevel.py", line 991, in __getitem__
    tmp = ak._util.wrap(self.layout[where], self._behavior)
ValueError: in ListArray64 attempting to get 2, index out of range
```
(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0rc3/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43) is not useful information for users to figure out their indexing problems. At the depth where the indexing error actually occurs, we don't know what the original slice was.

In Awkward 2, I made slice errors look like this:

```python
ak._v2.Array([[1, 2, 3], [], [4, 5]])[[[True, False, True], [], [False, True, True]]]
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/highlevel.py", line 1018, in __getitem__
    out = self._layout[where]
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 476, in __getitem__
    return self._getitem(where)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 582, in _getitem
    return self._getitem(layout)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 573, in _getitem
    return self._getitem((where,))
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 514, in _getitem
    out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/regulararray.py", line 590, in _getitem_next
    down = self._content._getitem_next_jagged(
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/listoffsetarray.py", line 324, in _getitem_next_jagged
    return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/listarray.py", line 352, in _getitem_next_jagged
    self._handle_error(
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 212, in _handle_error
    raise ak._v2._util.error(ValueError(message))
ValueError: cannot slice

    <Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

with

    [[True, False, True], [], [False, True, True]]

Error details: index out of range while attempting to get index 2 (in compiled code:
https://github.com/scikit-hep/awkward-1.0/blob/1.8.0rc3/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)
```

but I did it by introducing the gunky exception-chaining mentioned above. Besides, some uses of the internal error-raising weren't consistent. Besides besides, some slicing errors still weren't raising the new message format. Besides besides besides, this only addresses slicing, not the `ak.*` operations.

### This implementation

The problem is that we have information at an important level of granularity when we first enter an Awkward operation (slice, `ak.*` function, ufunc). One feature of this "outside Awkward" → "inside Awkward" concept is that it is global (per thread). A stack trace can only pass through such a boundary once. If an internal Awkward function calls another public API function, it doesn't count as another boundary cross—only the first one is the one we want to report. Although I avoid global state in almost every situation, this seems like the one-in-a-hundred exception. The global state is in `ak._v2._util`.

When the boundary is crossed the first time, it looks like this:

```python
def public_ak_function(args):
    with ak._v2._util.SomeErrorContext(args):
        do_the_function(args)
```

The logic that I've described about crossing the boundary only once is implemented by the context manager. The things that need to be adhered to are the coding-style rules summarized in the PR description (set up the context manager in every public operation, wrap raised exceptions in `ak._v2._util.error`, and avoid nested `ak.*` calls).
If any of these rules are not followed, nothing disastrous happens—we just don't get the pretty error message. Rule (1) isn't hard to enforce via code review, and any fly-by contributors who add a new function by copying an existing one would copy that context manager along with the rest of the formalism (like the way that all of these functions go in a particular submodule and are exposed to the flat `ak._v2` namespace).
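Since the context-manager implementation itself isn't quoted in this thread, here is a minimal sketch of how a thread-local, enter-once error context can work. All names are illustrative assumptions (the real classes live in `ak._v2._util`); this is not the PR's actual code:

```python
import threading


class ErrorContext:
    # thread-local, but otherwise global: one slot per thread
    _slot = threading.local()

    def __init__(self, name, arguments):
        self._name = name
        self._arguments = arguments
        self._entered = False

    def __enter__(self):
        # only the first boundary-crossing on this thread is recorded;
        # nested public-API calls leave the outermost context in place
        if getattr(ErrorContext._slot, "current", None) is None:
            ErrorContext._slot.current = self
            self._entered = True
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # certain to be released before control returns to the user
        if self._entered:
            ErrorContext._slot.current = None

    def decorate(self, exception):
        # construct the error message in a standardized way, in one central place
        args = "\n".join(f"        {k} = {v!r}" for k, v in self._arguments.items())
        message = f"while calling\n\n    {self._name}(\n{args}\n    )\n\nError details: {exception}"
        return type(exception)(message)
```

Because the slot is thread-local, two user threads that each cross the boundary record their own contexts independently.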
### Motivation

This is an important section, but I've just run out of time to write it. I'll follow up with another comment here in a few hours.
Quick comment: that's exactly what they are. :) https://www.youtube.com/watch?v=OjPT15y2EpE

It seems like at least the post-error message could be added via an exception hook.
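For concreteness, the kind of hook being discussed would look something like this sketch. `sys.excepthook` is real; the `_awkward_note` attribute is a made-up stand-in for wherever a post-error message might be stored. It also shows why this route interacts poorly with tools like IPython and Rich, which install their own hooks; only one hook can be active:

```python
import sys

_default_excepthook = sys.excepthook


def hook(exc_type, exc_value, exc_traceback):
    # print the normal traceback first, then append the extra message
    _default_excepthook(exc_type, exc_value, exc_traceback)
    note = getattr(exc_value, "_awkward_note", None)  # hypothetical attribute
    if note is not None:
        print(note, file=sys.stderr)


sys.excepthook = hook  # global takeover: whoever assigns last wins
```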
(I still intend to give a fuller motivation here.)
Yeah, that's what would make me uncomfortable: modifying something that could interact poorly with IPython, Rich, or even some plain Python modes that I don't know about. What we have here is just plain exception-throwing, without even as much as chaining.
### On this PR

My understanding of what you've written is:
I can see the benefit to solving the problem of traceback clarity. I want to prefix this with "I don't have a good idea of the best solution". What we're trying to solve here isn't just an Awkward problem (hence PEP-678)! My gut instinct is that this is a "Python problem" rather than a "library problem".

### Implementation Details

I'm not sure whether the exception-rewriting solution is ideal, though. On the one hand, it's a thorny issue - without PEP 678, there is no way to modify the printed exception without one of several workarounds:
(1) is the most foolproof - we explicitly impose an error signature. The issue with rewriting (2) is that it makes some assumptions:
Maybe within the context of builtin exceptions these assumptions are acceptable, but this feels slightly fragile if, e.g., an external library implements its own exceptions (e.g. a custom container that raises its own `IndexError`). (3) might be slightly more robust - we can effectively copy the original exception. (4) would be the "safest", but pretty incompatible with other tools. I suppose the questions that come to mind are:
If the answer to (1) is "very important", then I think one of "duck-like wrapper exception" or "re-create exception" is the most foolproof for most users.

In addition to how we attach this information to exceptions, there is also the associated change to each function that wants to implement this behaviour. My first impression is that we need a lot of boilerplate code in every Awkward function in order to handle this. Would it be acceptable just to create a decorator that captures the args and wraps any raised exceptions? I.e.

```python
from functools import wraps
from inspect import signature


def nice_function(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except BaseException as exc:
            # reconstruct the call signature for the error message
            params = signature(func).bind(*args, **kwargs)
            parts = [repr(p) for p in params.args]
            parts.extend(f"{k}={v!r}" for k, v in params.kwargs.items())
            call_msg = "(" + ", ".join(parts) + ")"
            msg = f"Call to {func.__qualname__}{call_msg} failed!"
            raise annotate(exc, msg)  # `annotate` is a placeholder for option (3)

    return wrapper


@nice_function
def sort(array):
    return layout.sort(array)  # pseudocode: stands in for a real Awkward operation
```
Excuse the ugly pseudocode; this is just from my playing around with (3).

### Looking Forward

So, I can see the benefit of this PR despite being slightly reluctant to start modifying exceptions (even if we are most likely the same entity that actually instantiated said exception). Another thought that occurs to me is that PEP 678 is a clear standard that would mostly solve this problem (in my opinion). Maybe, instead of rolling our own solution, we could encourage Rich & IPython to support PEP 678 (once it is accepted) without testing the Python version; then Awkward could use it everywhere. My purist suggestion is just to use PEP 678 and let users of newer Pythons benefit, but I can see that this is probably a rather extreme suggestion 😉
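For reference, the PEP 678 route on Python 3.11+ would look like the following sketch (`add_note` is the standardized method; the function names here are illustrative, and `_impl` is a stand-in):

```python
def _impl(argument):
    raise ValueError("oops")  # stand-in for the real implementation


def public_ak_function(argument):
    try:
        return _impl(argument)
    except Exception as err:
        # on Python 3.11+, notes are appended to the rendered traceback
        err.add_note(f"while calling ak.public_ak_function(argument={argument!r})")
        raise
```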
I think we can install an exception handler only if it's set to the default, which would cover the pure Python case. The "poor interaction" with other tools that have exception handlers is just because you'd have to pick one; they can't be combined. I'd expect libraries would pick up support for this for all Python versions. I didn't realize this was pulled out of PEP 654 into its own PEP, though I remember the worries about it being added to an existing accepted PEP.
On the motivation (which I have yet to write): better error messages, which have been a recurrent issue throughout Awkward's history and a general library problem, are not the only issue. The other motivation—the "why now?" question—has to do with #1321, handling errors if Awkward functions in a GPU context are asynchronous. Then a user error in one operation might not surface until long after the operation returns.

(If we keep this PR open now to discuss it, then I'll have to develop the async prototype as a PR into this PR.)

@agoose77, you have some great points, but on a few points of detail I think you're assuming this does more than it does.
In this implementation, if our code calls a third-party library and that library raises an exception, it will not be rewritten. This annotates our own exceptions only, and all of the exceptions that we raise have no associated data other than the message. (I had the pleasure of reviewing them all last night!) Ideally, our code doesn't use much from third-party libraries (only when we're converting or interoperating with them, such as NumPy and Arrow).

So the code that we've already written generally checks to ensure that NumPy isn't going to be raising exceptions, and those checks raise ValueErrors, IndexErrors, etc. with Awkward-meaningful messages. This PR replaces

```python
if check_for_bad_condition(intermediary_array):
    raise ValueError("your Awkward input is invalid in such-and-such a way")
else:
    do_calculation(intermediary_array)
```

with

```python
if check_for_bad_condition(intermediary_array):
    raise ak._v2._util.error(ValueError("your Awkward input is invalid in such-and-such a way"))
else:
    do_calculation(intermediary_array)
```
You're totally right about that—I agree it would be cleaner to do so. But it's also centralized code that can be fixed in one spot. Most of the changes in this PR are all the individual `raise` statements. Also, notice that there's an `ak._v2._util.indexerror` variant as well.
I'd rather have boilerplate inside our codebase than have functions modify our functions. Having everything be visibly laid out, so you can see what it all does, is better for maintenance, though understandably more effort to type. (Parts of Awkward 0 were decorator-based, and it caused more harm than good. We'd want to be careful with that.) Also, the decorator you describe would annotate all exceptions, chaining them with the try-except, and I only intended to annotate the exceptions we raise with Awkward-meaningful messages in them.
I think this is illustrating the main thing: we're talking about annotating exceptions we raise, and we know how we raise them. I'd like to backtrack from this generality.
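As a hedged sketch of what that centralized annotation can look like (assumed names, not the PR's actual code: the real helper is `ak._v2._util.error`, and the real context manager maintains the thread-local state):

```python
import threading

_state = threading.local()  # set and cleared by the operation's context manager


def error(exception):
    # if no operation boundary was crossed, return the exception unchanged
    context = getattr(_state, "current", None)
    if context is None:
        return exception
    name, arguments = context
    args = ", ".join(f"{k} = {v!r}" for k, v in arguments.items())
    message = f"while calling {name}({args})\n\nError details: {exception}"
    return type(exception)(message)


# usage, mirroring the PR's pattern:
#     raise error(ValueError("your Awkward input is invalid in such-and-such a way"))
```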
Right, I crossed some wires in my head w.r.t. where this PR actually calls `ak._v2._util.error`. Given your explanations, it seems like we're not worrying about the case where an unexpected exception is raised - we're only worried about exceptions that are explicitly raised in the Awkward code-path (known ahead-of-time). That simplifies the scope of the problem a bit.

This is a big PR, and I am going in circles trying to write out a full reply 🤕 I have two separate axes of concern:
Given the explicit, opt-in approach, I think the rewriting approach here is the simplest. This just leaves the boilerplate of the exception context. With

```python
with ak._v2._util.OperationErrorContext(
    "ak._v2.from_arrow_schema",
    dict(schema=schema),
):
    return _impl(schema)
```

what this is mainly doing AFAICT is capturing the arguments to the high-level operation, so that the formatted traceback can guide the user as to what went wrong. This is quite similar in motivation to Rich's traceback rendering.

If we want to implement this (if we're dealing with async stuff, I'm guessing this becomes more important), then I would think that a simple decorator would remove most of the boilerplate:

```python
import inspect
from functools import wraps


def operation(func):
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(*args, **kwargs):
        context = signature.bind_partial(*args, **kwargs)
        context.apply_defaults()
        # push_context/pop_context stand in for maintaining the thread-local slot
        push_context(func.__module__, func.__name__, context)
        try:
            return func(*args, **kwargs)
        finally:
            pop_context()

    return wrapper
```

Is this something you'd be on board with?
This immediately sounds like existing delayed-execution machinery.
I'm not surprised that it's been done before. We may want to take control of the machinery ourselves, though.
This one, yes! The boilerplate of raising all exceptions like `ak._v2._util.error(ValueError(...))` would remain. I'm assuming that a decorator like this would then make them look like

```python
@ak._v2._util.operation
def whatever(arguments, with_some="defaults"):
    """
    Really long docstring.
    """
    return _impl(arguments, with_some)
```

(I had another reason for wanting to move the implementations out to `_impl` functions, as well.)
How does everyone feel about this now? Let me know if you object to merging it in its current state. (Silence is assumed to be consent!)

@agoose77's idea of using a decorator to reduce boilerplate in src/awkward/_v2/operations/**/*.py is a good one, but I think it can be applied at a later date. The harder-to-merge part is all of the individual `raise` changes. @henryiii suggested writing a flake8 check for the `ak._v2._util.*error` wrapping.

With this PR merged into main, #1331 can be retargeted to main instead of this branch.

Oh! And I never did write up that motivation in terms of delayed processing. I'll do that now, because having #1331 to point to would make it easier to talk about.
This does it:

```python
import ast

# `filename` is whichever source file is being checked
parsed = ast.parse(open(filename).read())
for node in ast.walk(parsed):
    if isinstance(node, ast.Raise):
        if not isinstance(node.exc, ast.Call) or ast.unparse(node.exc.func) not in (
            "ak._v2._util.error",
            "ak._v2._util.indexerror",
        ):
            raise ValueError(
                f"{filename} line {node.lineno} needs exception to be wrapped in ak._v2._util.*error"
            )
```

although there should be a way of opting out of some files (src/awkward/_v2/_connect/numba/**/*.py are excluded because those errors are not in Awkward operations and Numba does its own manipulation of error messages), and there should be a way to opt out of individual lines, like with a `# noqa` comment.
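For illustration, assuming the check is registered under the AK101 code that appears later in this thread, a per-line opt-out would use flake8's standard mechanism:

```python
raise ImportError("NumPy 1.13.1 or later required")  # noqa: AK101
```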
### Motivation (as promised)

Apart from the long-standing issues with appropriateness of error messages for users, there's a new one regarding eagerness/laziness and error messages.

@swishdiff and I talked about CUDA occupancy at length on Monday: the CUDA backend exists only for speeding things up, so keeping a GPU fully occupied is its raison d'être. Picking a concurrency model is not a premature optimization. We expanded this conversation to potential Awkward-CUDA users in Discussion #1321. The details are on that Discussion, but two things came out: (1) users are already dissatisfied with error messages and find the primary value to be one of locating the line number in their code, and (2) Awkward's current eagerness strategy would ensure that either the CPU is busy or the GPU is busy, never both. That's bad.

(There's a secondary part to that story that @swishdiff brought up: in addition to keeping both the CPU and GPU busy, you also have to keep all the processors on the GPU busy. With our strict data dependencies between subsequent Awkward-kernels in an Awkward-operation and unknown data dependencies between Awkward-operations (it depends on user code), it would be very difficult for us to run multiple Awkward-kernels, and hence CUDA-kernels, at the same time. The only way we could do this well is by letting the user put independent work on CUDA streams, so everything I say below about a "background worker" applies per CUDA stream.)

We need to run our Awkward-kernels in a particular order to handle data dependencies, but the result does not need to be ready when an operation like `ak.whatever` returns. The CUDA tools we looked at for doing this are (a) unaware of the Python steps we need to perform between CUDA-kernels and (b) don't seem to apply to this case.

Below is (the beginning of) an implementation of a "lagging"/"foot-dragging"/"delayed" executor. This delayed array is similar to v1's VirtualArray, except that it is low-level (Indexes and buffers, not a Content node), it's completely invisible to users, has no evictable cache (it runs once and fills a permanent result), and would never be used for I/O. Its only intended use is for CUDA, but we can separate that part out and just have this delayed mechanism in general.

The threading model is important to get right and keep simple, since "hanging"/"deadlock" is as hard to debug as segfaults. The worker has three states:

1. idle, waiting for a task to appear on its queue
2. running a task
3. dead, after a task raised an exception
State 1 → 2 when a task appears on its queue, 2 → 1 when it completes, and 2 → 3 if it raises an exception. An exception in a task ruins the worker (3 is an absorbing state); a new worker needs to be made to replace it. A future also has three states: pending (its task hasn't finished yet), finished with a result, and finished with an exception.
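Since the executor's code isn't quoted in this excerpt, the following is a hypothetical reconstruction of the Worker/Future pair with the semantics described above. (The PR's actual code is in src/awkward/_v2/_delayed.py; among other details, it also re-raises a stored exception when scheduling onto a dead worker, which this sketch glosses over.)

```python
import queue
import sys
import threading


class Future:
    def __init__(self, task):
        self._task = task
        self._finished = threading.Event()
        self._result = None
        self._exc_info = None

    def _run(self):
        try:
            self._result = self._task()
        except BaseException:
            self._exc_info = sys.exc_info()
        finally:
            self._finished.set()

    def result(self):
        self._finished.wait()  # block the user's thread until the task is done
        if self._exc_info is not None:
            _, exception_value, traceback = self._exc_info
            # re-raise on the user's thread with the worker's stack trace
            raise exception_value.with_traceback(traceback)
        return self._result


class Worker(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self._futures = queue.Queue()

    def schedule(self, task):
        future = Future(task)
        self._futures.put(future)
        return future

    def run(self):
        # strictly sequential: one task at a time, in the order scheduled
        while True:
            future = self._futures.get()
            future._run()
            if future._exc_info is not None:
                return  # state 3 is absorbing: an exception ruins the worker
```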
Attempting to view the result of a pending future blocks until its task is done. If anything goes wrong in a task, it will be reported on the user's thread, either when trying to add a task to a dead worker or when trying to access the result of a future. If you read the code above, you'll see that the stack trace the user sees is a relevant one for debugging the task itself—in other words, Awkward internals—but it's decoupled from the stack trace at the time when it was scheduled—in other words, how the user called the `ak.*` operation.
The thread-local ErrorContext is a way of getting that information from the time when a task is scheduled to the time when the exception is raised. Here are some examples showing how that works. First, a demo of how the delayed processing works when there are no exceptions:

```python
>>> import time
>>> from awkward._v2._delayed import *
>>> def task():
...     print("begin")
...     time.sleep(10)
...     print("end")
...     return 123
...
>>> worker = Worker(); worker.start()
>>> future = worker.schedule(task)
begin
>>> future.result()  # processing has already begun; wait for it to end
end
123
```

Now if the future is scheduled in an OperationErrorContext, it will be able to talk about that context in its error message, even though the exception is raised long after the operation returned:
```python
>>> from awkward._v2._util import OperationErrorContext, error
>>> def task():
...     print("begin")
...     time.sleep(10)
...     raise error(ValueError("oops"))
...
>>> def ak_whatever(**kwargs):
...     with OperationErrorContext("ak.whatever", kwargs):
...         future = worker.schedule(task)
...     return future
...
>>> future = ak_whatever(args=123)
begin
>>> future.result()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 62, in result
    raise exception_value.with_traceback(traceback)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 45, in run
    self._result = self._task()
  File "<stdin>", line 4, in task
ValueError: while calling (from <stdin>, line 1)

    ak.whatever(
        args = 123
    )

Error details: oops
```

The worker does sequential work: if there's an exception anywhere in the sequence, it's the only exception, because nothing can be scheduled or executed after that. In the following, we put two tasks onto the worker:

```python
>>> worker = Worker(); worker.start()
>>> future1 = ak_whatever(args=1)
begin
>>> future2 = ak_whatever(args=2)
```

then wait a long time (more than 10 seconds), then try to put another one on:

```python
>>> future3 = ak_whatever(args=3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in ak_whatever
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 102, in schedule
    self._futures.put(future)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 73, in put
    raise exception_value.with_traceback(traceback)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 45, in run
    self._result = self._task()
  File "<stdin>", line 4, in task
ValueError: while calling (from <stdin>, line 1)

    ak.whatever(
        args = 1
    )

Error details: oops
```

Note that the error message has `args = 1`: the failure is reported where it happened, in the first task. If you try to evaluate `future2.result()` or `future3.result()`, you see the same exception.

Other than the fact that these stack traces are full stack traces from the worker thread (reported on the main thread), they are not being manipulated. The file and line number are part of the final error message. That's why this PR was written.
```
per-file-ignores =
    tests/*: T, AK1
    dev/*: T, AK1
    setup.py: T
    localbuild.py: T
    src/awkward/__init__.py: E402, F401, F403
    ./awkward/__init__.py: E402, F401, F403
    src/awkward/_v2/_connect/numba/*: AK1
```

This check might still be grabbing a tiny bit too much:
It should be applied to files within src/awkward/_v2 only; none of this applies to v1. Looking at the list, the rest of these are actual surprises that I can investigate:
**src/awkward/_v2/operations/structure/ak_broadcast_arrays.py:10:5: AK101 exception must be wrapped in ak._v2._util.\*error**

```python
def broadcast_arrays(*arrays, **kwargs):
    raise NotImplementedError
```

Okay, I could wrap that.

**src/awkward/_v2/_connect/numpy.py:13:5: AK101 exception must be wrapped in ak._v2._util.\*error**

```python
if not numpy_at_least("1.13.1"):
    raise ImportError("NumPy 1.13.1 or later required")
```

This is one that I'd want to write a `# noqa` for.

**src/awkward/_v2/_connect/pyarrow.py:38:9: AK101 exception must be wrapped in ak._v2._util.\*error**

```python
def import_pyarrow(name):
    if pyarrow is None:
        raise ImportError(error_message.format(name))
    return pyarrow
```

Same here.

**src/awkward/_v2/_connect/pyarrow.py:44:9: AK101 exception must be wrapped in ak._v2._util.\*error**

```python
def import_pyarrow_parquet(name):
    if pyarrow is None:
        raise ImportError(error_message.format(name))
```

And here.
Pushed the custom check. It triggers only on the NotImplementedError, though we could allow NotImplementedErrors if you want. I avoided the unparse step just in case it was slow (flake8 is pretty slow on such a large amount of code), though you can add it back if you want. I'm assuming that wrapping it in `ak._v2._util.error` is okay.
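The plugin class itself isn't quoted in this excerpt; the following is a hedged sketch of what an `AwkwardASTPlugin` using the visitor pattern mentioned below might look like (the class and error names follow the snippet underneath, but the details are assumptions, not the repository's actual code):

```python
import ast
from typing import Iterator, NamedTuple


class Error(NamedTuple):
    line_number: int
    offset: int
    msg: str


class AwkwardASTPlugin(ast.NodeVisitor):
    def __init__(self, tree):
        self._tree = tree
        self._errors = []

    def run(self) -> Iterator[Error]:
        self.visit(self._tree)
        yield from self._errors

    def visit_Raise(self, node):
        # flag raise statements whose exception isn't wrapped in ak._v2._util.*error
        if not self._is_wrapped(node.exc):
            self._errors.append(
                Error(
                    node.lineno,
                    node.col_offset,
                    "AK101 exception must be wrapped in ak._v2._util.*error",
                )
            )
        self.generic_visit(node)

    @staticmethod
    def _is_wrapped(exc):
        # check attribute-by-attribute (avoiding ast.unparse) for a call to
        # ak._v2._util.error or ak._v2._util.indexerror
        if not isinstance(exc, ast.Call):
            return False
        func = exc.func
        if not (isinstance(func, ast.Attribute) and func.attr in ("error", "indexerror")):
            return False
        parts = []
        value = func.value
        while isinstance(value, ast.Attribute):
            parts.append(value.attr)
            value = value.value
        if isinstance(value, ast.Name):
            parts.append(value.id)
        return parts == ["_util", "_v2", "ak"]
```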
```python
def main(path):
    with open(path) as f:
        code = f.read()

    node = ast.parse(code)
    plugin = AwkwardASTPlugin(node)
    for err in plugin.run():
        print(f"{path}:{err.line_number}:{err.offset} {err.msg}")


if __name__ == "__main__":
    for item in sys.argv[1:]:
        main(item)
```
This part is only for debugging
Okay, visitor pattern, checking each attribute individually—that all is fine. Better/more formal, in fact: I was trying to make it simple.
The `unparse` could only ever be applied to an expression that is called, which is hardly ever anything more than a Name with Attributes, and it was only doing that if it's to the right of a Raise, so I think the execution time was pretty well controlled.
But this is really nice! (And now I have an example I'll be looking up if I ever want to add another one.) Thanks!
Thanks for the explanation @jpivarski, the additional motivation is much clearer now.
I think async is a reasonable description. Given the fact that we're communicating with a remote executor, and tasks can operate out of order (independent streams), it seems like a good label. I suspect many workloads will not involve large numbers of concurrent streams, because analyses tend to evolve forwards rather than sideways, but that's by-the-by! Also, I don't think "async" in Python makes the async label unusable.
I suppose that this is a reasonable limitation - if any user operation fails, we need to blow up at some point to report the exception. Unlike fully-fledged async executors, we don't really need a recovery mechanism: errors are likely to be deterministic (bad data) or semi-deterministic (OOM).

Overall this sounds like a reasonable direction to me, though I confess I've not spent a lot of time thinking about it yet. The executor support would be interesting - we could even use it to add threading support to Awkward CPU by releasing the GIL. It would probably be quite easy once the CUDA work is done. I'm not sure how much benefit this would bring, because again, this would only speed up concurrent kernels, whereas CUDA does this in addition to running different algorithms.
With v2 reducing the amount of C++ code, we'll be releasing the GIL on all of that C++ code. So, AwkwardForth, ArrayBuilders working on parsing large JSON, etc. Keep in mind that this delayed thread (one per CUDA stream) is a very restricted model of concurrency—not even as general as Python's Executors. Since it runs everything in order, just one of these background threads is technically not "concurrent," though the value we're looking for is to have one attached to each CUDA stream, and the multiple CUDA streams would be concurrent with each other (and parallel). Another thing to keep in mind about a background thread attached to a CUDA stream is that it doesn't have to be fast! It's a bit like an I/O thread, in that the CPU part is just waiting on the GPU as an external resource.
Since it looks like all of the tests are going to pass, I'm going to squash-and-merge. Is everybody ready (nothing more to add)?
For sure, just as having one proc/thread is a mostly useless kind of concurrency in conventional threading models 😄 This reminds me of Dask's single-threaded scheduler.

Could you clarify what you mean by this being less general than Python's executors? AFAICR these all assign work to a pool of sequential executors, which seems to map well to the model you've set out above.
Right, and to clarify, I mean that a "multiple, sequential worker" executor could provide a lightweight mechanism for multiple-CPU execution of Awkward kernels (i.e., with performance boost as an explicit goal). Clearly, this parallelism would be limited - users would only gain any performance benefits on multi-core PCs where they are operating on independent arrays in the same program (assuming that these are assigned to different workers). Moreover, I don't know why I'm suggesting making any more work for ourselves 😆
I meant in the sense that even though we have a main thread and a background thread, there's no concurrency in the non-error behavior. Those two threads don't count as concurrency. If you take the pair of them as one unit and have multiple units, then you can get concurrent behavior among the units, but that would also happen with regular threads, and these are thread pairs.
A Python ThreadPoolExecutor (or a TBB executor, etc.) is a pool of threads that do the tasks they've been given in an arbitrary order. With a ThreadPoolExecutor, you don't know which thread is going to run a given task, if it's going to be before or after or at the same time as another task, etc. Our background thread (singular, always only one of these per main thread) executes its tasks in exactly the order given and one does not begin until the previous one ends. All it buys is looseness between the main thread (controlled by the user's Python process) and the background thread, which is a "GPU shepherd" that keeps the GPU busy. (The GPU runs a bunch of concurrent sub-tasks, but that's another story.)
We gain that in a different way, from Dask. This "background thread" mechanism is useful for dealing with an external resource (quite a lot like an I/O thread, if you count controlling a GPU as "I/O"), which isn't what we have when trying to accelerate work on a CPU: the cpu-kernels compete with Python for the same resource (CPU cores). Dask will scale out threads, processes, and remote processes running ordinary single-threaded Awkward tasks. For that purpose, these background threads are an unnecessary complication. Maybe a different word would help (other than "asynchronous" and "delayed," the two I've used so far): maybe we can call it a "shadow thread" because there's only one of them behind a user thread and it mimics what the user thread could have done on its own, possibly offset in time.
I think I see what you're saying. I believe we're using slightly different interpretations of "thread", which your point on "thread pairs" clarified - I am referring to a wheel-hub model, where we have (background) threads, and a single main thread (hence multiple threads = multiple background (concurrent) threads).
Right, I ended up deleting this from my last comment, but that's where I draw the distinction - in the normal futures executor, tasks are consumed eagerly rather than scheduled.
Oh! Are you referring to a

```mermaid
graph LR;
    Python --> shepherd
    shepherd --> w1(worker 1)
    shepherd --> w2(worker 2)
```

model? I.e. there is a single "shepherd" thread that blockingly manages the GPU, the main thread that doesn't block (unless the user tries to resolve a future), and then N workers? That would explain why we're crossing wires!
Yes, of course. I was thinking about this finer grained parallelism having benefits, but on second thoughts there's no compelling case for it.
Yes, if we only have one shepherd, then we're not concurrent. Shadow thread works pretty well to make the important features known. I'm 👍 on that.
That's it! (And that graph is really cool!) There's only one shepherd/shadow per user thread (and Dask can make multiple of those). The shepherd/shadow is controlling a CUDA stream, which internally has a lot of workers, though that's something that we only see through CUDA tools. This is helping to improve the nomenclature. (@swishdiff, feel free to rename "Worker" as "Shepherd" or "Shadow." In the generality of src/awkward/_v2/_delayed.py, there's no "shepherding" because there's no GPU yet, but its main application will be pairing it with a GPU. Given that generality, maybe "Shadow" is best?)
@jpivarski and I discussed this a little offline, and I realised that we still had slightly different ideas about how the shadow thread system would work. I was picturing something like this,

```mermaid
graph LR;
    Python <--> Shepherd
    Shepherd <--> w1(Worker 1)
    Shepherd <--> w2(Worker 2)
```

all localised to the host. In my understanding, GPU streams would be communicated with from the workers, and the shepherd's role was something like a supervisor/scheduler. But actually the "worker" here is the GPU stream. The worker-stream pair represents the mapping of each stream to a CPU thread:

```mermaid
graph LR;
    Python --> w1
    Python --> w2
    w1("Worker 1 [Host]") --> s1("Stream 1 [GPU]")
    w2("Worker 2 [Host]") --> s2("Stream 2 [GPU]")
```
I don't want to put words into Jim's mouth, but I think the endpoint of our conversation is that Python (the main thread) can talk to multiple workers (shepherds) that keep GPU streams busy. These streams (and therefore workers) offer concurrency between one another, so we can (where data relationships permit) compute independent array operations in parallel.
Yes, that's right, and what I said about the main thread only having one worker/shadow/shepherd was me flaking out: I just hadn't thought of the fact that a main thread can run multiple of these without any problems. It has to send independent tasks to each (and maybe we'll need some way to make sure of that…).
@swishdiff This is setting things up so that we'll have an error state to send to the background thread that we talked about today.

Here's what it looks like in this PR: even though the error occurred deep inside `_to_numpy`, `toRegularArray`, and `_handle_error`, the error message knows that the Awkward operation is `ak._v2.to_numpy`, called from `<stdin>, line 1`. Operations, like `ak.this` and `ak.that`, as well as slices and NumPy ufunc calls, are special—they have more granularity than other functions—so we call `traceback.extract_traceback` and hold onto that stack location for that one point (per thread—we don't know if a user is running this in threads), as well as the function arguments, for the print-out. This memory doesn't leak because a non-reentrant error context is made with a context manager (Python is certain to give it up before control returns to the user).

This does imply a few things for our coding style:

- Exceptions must be wrapped in `ak._v2._util.error` before raising them.
- All `ak.this` and `ak.that` operations must set up an error context manager. They're all implemented in separate files, so I've made the function itself just do the context manager and call `_impl` in the same file, and `_impl` does all of the work.
- Error contexts are non-reentrant (if `ak.this` calls `ak.that`, which calls `ak.this`, only the first `ak.this` is tracked for the error message). So `ak.this` functions calling `ak.that` functions is now "bad form," but not broken. We should try to avoid these nested calls—think of this as a flat set of functions that are all user-oriented, not dog-fooded, but it isn't terrible if it happens by accident.

Since this affects coding style for everyone, let me link everyone in here: @ianna, @ioanaif, @agoose77, @henryiii.
I think this PR is ready to go, but I'm going to leave it open to let everyone have a chance to comment, ask questions, or to object.