Straighten out error handling via a thread-local (but otherwise global) context #1327

Merged
merged 16 commits into from
Mar 3, 2022

Conversation

jpivarski
Member

@jpivarski jpivarski commented Mar 1, 2022

@swishdiff This is setting things up so that we'll have an error state to send to the background thread that we talked about today.

Here's what it looks like in this PR:

ak._v2.to_numpy(ak._v2.Array([[1, 2, 3], [], [4, 5]]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/operations/convert/ak_to_numpy.py", line 41, in to_numpy
    return _impl(array, allow_missing)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/operations/convert/ak_to_numpy.py", line 50, in _impl
    return layout.to_numpy(allow_missing=allow_missing)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 1276, in to_numpy
    return self._to_numpy(allow_missing)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/listoffsetarray.py", line 2182, in _to_numpy
    return ak._v2.operations.convert.to_numpy(self.toRegularArray(), allow_missing)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/listoffsetarray.py", line 159, in toRegularArray
    self._handle_error(
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 212, in _handle_error
    raise ak._v2._util.error(ValueError(message))
ValueError: while calling (from <stdin>, line 1)

    ak._v2.to_numpy(
        array = <Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>,
        allow_missing = True
    )

Error details: cannot convert to RegularArray because subarray lengths are not regular (in compiled code:
https://github.com/scikit-hep/awkward-1.0/blob/1.8.0rc3/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

Even though the error occurred deep inside _to_numpy, toRegularArray, and _handle_error, the error message knows that the Awkward operation is ak._v2.to_numpy, called from <stdin>, line 1. Operations like ak.this and ak.that, as well as slices and NumPy ufunc calls, are special (they have more granularity than other functions), so we call traceback.extract_stack and hold onto that stack location for that one point (per thread, since we don't know whether a user is running this in threads), along with the function arguments, for the print-out. This memory doesn't leak because the non-reentrant error context is created with a context manager, so Python is certain to release it before control returns to the user.

This does imply a few things for our coding style:

  1. All exceptions raised by Awkward code must pass through ak._v2._util.error before being raised.
  2. All ak.this and ak.that operations must set up an error context manager. They're all implemented in separate files, so I've made the function itself just set up the context manager and call _impl in the same file, and _impl does all of the work.
  3. The "non-reentrant error context" is non-reentrant in the sense that it only pays attention to the first level of calling (i.e. if ak.this calls ak.that, which calls ak.this, only the first ak.this is tracked for the error message). So ak.this functions calling ak.that functions is now "bad form," but not broken. We should try to avoid these nested calls: think of this as a flat set of functions that are all user-oriented, not dogfooded. It isn't terrible if it happens by accident, though.
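As a rough illustration of the stack-location capture described above, here is a minimal sketch using the standard library's traceback.extract_stack. The names call_site, public_operation, and user_code are made up for this example and are not taken from the PR; the sketch only shows how a public function can learn who called it.

```python
import traceback


def call_site(depth=1):
    """Describe the frame `depth` levels above the function that calls call_site."""
    # extract_stack() returns frames oldest-first and ends with this function's
    # own frame. limit=depth + 2 keeps this frame, its caller, and `depth` more
    # frames above that, so index 0 is the frame we want.
    stack = traceback.extract_stack(limit=depth + 2)
    frame = stack[0]
    return frame.filename, frame.lineno, frame.name


def public_operation():
    # depth=1 skips public_operation itself and reports whoever called it.
    return call_site(depth=1)


def user_code():
    return public_operation()  # this is the line that should be reported


filename, lineno, name = user_code()
print(f"called from {name} ({filename}, line {lineno})")
```

Only the file name, line number, and function name are kept here; holding onto frame objects themselves would keep their local variables alive for longer than intended.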

Since this affects coding style for everyone, let me link everyone in here: @ianna, @ioanaif, @agoose77, @henryiii.

I think this PR is ready to go, but I'm going to leave it open to let everyone have a chance to comment, ask questions, or to object.

@jpivarski
Member Author

How error contexts are implemented:

https://github.com/scikit-hep/awkward-1.0/blob/813b46234f727e271a5e50e3df7b5195b874d522/src/awkward/_v2/_util.py#L98-L302

How they are used:

https://github.com/scikit-hep/awkward-1.0/blob/813b46234f727e271a5e50e3df7b5195b874d522/src/awkward/_v2/operations/convert/ak_to_numpy.py#L35-L52

Similarly, for slicing:

https://github.com/scikit-hep/awkward-1.0/blob/813b46234f727e271a5e50e3df7b5195b874d522/src/awkward/_v2/contents/content.py#L474-L478

This replaces some gunky exception-chaining in the main branch (below), which I previously wasn't happy with, as it complicated the stack trace.

https://github.com/scikit-hep/awkward-1.0/blob/963b9f4b2530e8a56f84fadec39fefda4c26d110/src/awkward/_v2/contents/content.py#L569-L618

Now we don't do exception-chaining within the Awkward library (we do still try-catch errors from other libraries, and of course StopIteration); it only raises the exception at the site of the error—it just has more information when it gets there.

@codecov

codecov bot commented Mar 1, 2022

Codecov Report

Merging #1327 (f962184) into main (b2fd2be) will increase coverage by 0.33%.
The diff coverage is 50.22%.

Impacted Files Coverage Δ
src/awkward/_v2/_connect/cling.py 0.00% <0.00%> (ø)
src/awkward/_v2/_connect/pyarrow.py 85.74% <0.00%> (ø)
src/awkward/_v2/_lookup.py 97.50% <0.00%> (ø)
src/awkward/_v2/_prettyprint.py 66.09% <0.00%> (+2.29%) ⬆️
src/awkward/_v2/_typetracer.py 69.14% <0.00%> (ø)
src/awkward/_v2/behaviors/string.py 90.00% <ø> (ø)
src/awkward/_v2/forms/bitmaskedform.py 78.04% <0.00%> (ø)
src/awkward/_v2/forms/bytemaskedform.py 77.33% <0.00%> (ø)
src/awkward/_v2/forms/emptyform.py 79.62% <0.00%> (-0.38%) ⬇️
src/awkward/_v2/forms/form.py 90.06% <0.00%> (ø)
... and 141 more

@henryiii
Member

henryiii commented Mar 1, 2022

Can you show what they look like now (as in "before" the PR) for comparison? This seems like a lot of specialized, custom handling that is fragile & likely to be hard to remember to include in the future (unless you implement a custom flake8 [1] or pylint check, for example). I'm not sure I know exactly what this is solving - is it solving several things? This might be the best solution, but I want to make sure other options have been exhausted. :)

Note that error handling is changing a lot in Python 3.11; they have added a new way to add a note field (.add_note() on BaseException that fills __note__: Tuple[str, ...]), and they are also introducing error groups. IPython and Rich both implement a "hide" feature to hide parts of the traceback, too. You also have access to chained exceptions, etc. Just want to make sure this is really the best way to implement whatever you are trying to solve, and that it won't muddle with things like the ability to use a traceback formatter like Rich.

Footnotes

  1. I believe this is actually pretty easy, FYI.

@agoose77
Collaborator

agoose77 commented Mar 1, 2022

I'd also benefit from learning a bit more about the wider context here, and wanted to write a comment rather than just +1 :)

@jpivarski
Member Author

jpivarski commented Mar 1, 2022

This seems like a lot of specialized, custom handling that is fragile & likely to be hard to remember to include in the future

Implementation

Avoiding fragility is a high priority. This implementation doesn't adjust the stack trace in any way; all it does is construct the error message in a standardized way, in a central place. I'll expand on the wider motivation below, but what we want is for the error message, however deep in the Awkward codebase it occurs, to say something in its text about the ak.something/slice/ufunc that it came from, since this is the most useful information to a user of Awkward Array or a developer of a library built on Awkward Array. For Awkward developers, the full stack trace is still there, unmodified, so it's just as useful as it's ever been for debugging Awkward. It's just that the last message emphasizes information for Awkward users.

From the sound of the word "error groups," core Python may be addressing the same thing: not all parts of the stack trace are equally important to all users/developers. In particular, the boundaries between calls within a library and calls between libraries are important in general, not just Awkward. When that feature becomes available, it would hurt nothing to use both.

Similarly, .add_note() could structure this block text message in a better way, and with our error-handling being centralized by this implementation, that would be easier to add in the future.

Previous implementations

Users' problems disentangling their indexing errors from Awkward internals started in Awkward 0, and one of the reasons I was looking forward to putting the internals into C++ in Awkward 1 was to hide a lot of it from the stack trace. My thinking then was that the Python stack trace would terminate on a user call, like my_array[my_complicated_slice], with the user's line number not far from the bottom, making it easier to pick out. As it turned out, we did a lot of the implementation of Awkward 1 in Python anyway, and the part that was done in C++ wasn't an asset because there was no debugging information for us developers. Later, I added the __LINE__ to every C++ exception to get more of this debugging information, walking back that initial hope.

Meanwhile, I'm still hearing that

ak.Array([[1, 2, 3], [], [4, 5]])[[[True, False, True], [], [False, True, True]]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/highlevel.py", line 991, in __getitem__
    tmp = ak._util.wrap(self.layout[where], self._behavior)
ValueError: in ListArray64 attempting to get 2, index out of range

(https://github.com/scikit-hep/awkward-1.0/blob/1.8.0rc3/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)

is not useful information for users to figure out their indexing problems. At the depth where the indexing error actually occurs, we don't know what the original slice/ak.whatever/ufunc was, which is what they really want to know.

In Awkward 2, I made slice errors look like this:

ak._v2.Array([[1, 2, 3], [], [4, 5]])[[[True, False, True], [], [False, True, True]]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/highlevel.py", line 1018, in __getitem__
    out = self._layout[where]
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 476, in __getitem__
    return self._getitem(where)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 582, in _getitem
    return self._getitem(layout)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 573, in _getitem
    return self._getitem((where,))
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 514, in _getitem
    out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/regulararray.py", line 590, in _getitem_next
    down = self._content._getitem_next_jagged(
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/listoffsetarray.py", line 324, in _getitem_next_jagged
    return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/listarray.py", line 352, in _getitem_next_jagged
    self._handle_error(
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/contents/content.py", line 212, in _handle_error
    raise ak._v2._util.error(ValueError(message))
ValueError: cannot slice

    <Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>

with

    [[True, False, True], [], [False, True, True]]

Error details: index out of range while attempting to get index 2 (in compiled code:
https://github.com/scikit-hep/awkward-1.0/blob/1.8.0rc3/src/cpu-kernels/awkward_ListArray_getitem_jagged_apply.cpp#L43)

but I did it by introducing NestedIndexError, which Content.__getitem__ catches, annotates as an IndexError, and re-raises. That's what's currently in main, and I don't like what it does to the stack trace: when debugging, we have to follow the chained exception. Catching other libraries' exceptions is a useful tool because we can't control what other libraries do, and maybe some of the things they think should stop the world shouldn't. But within a library, there's got to be a better way to do it.
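To make the drawback concrete, here is a small sketch of the catch-annotate-re-raise pattern. It is not the code from main: NestedIndexError below is a stand-in class, and low_level_getitem and getitem are invented for the example. The point is only that re-raising with "from" leaves two linked exceptions to read.

```python
class NestedIndexError(IndexError):
    """Stand-in for the internal exception type; it carries only a detail message."""


def low_level_getitem(data, index):
    if not 0 <= index < len(data):
        raise NestedIndexError(
            f"index out of range while attempting to get index {index}"
        )
    return data[index]


def getitem(data, index):
    try:
        return low_level_getitem(data, index)
    except NestedIndexError as err:
        # Re-raising with "from" prints a second traceback, joined to the first by
        # "The above exception was the direct cause of the following exception".
        raise IndexError(f"cannot slice {data!r} with {index!r}: {err}") from err


try:
    getitem([1, 2, 3], 5)
except IndexError as err:
    print(type(err.__cause__).__name__)
```

The original failure is still reachable through __cause__, but anyone debugging has to read the inner traceback to find where the error really happened.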

Besides, some uses of the internal Content._carry method were part of slicing and some weren't, so they took a third argument to indicate whether they should raise NestedIndexError or not. That was ugly.

Besides besides, some slicing errors still weren't raising NestedIndexError and weren't participating in this error-message handling.

Besides besides besides, this only addresses slicing, not ak.whatever or ufuncs.

This implementation

The problem is that we have information at an important level of granularity when we first enter an Awkward operation (slice, ak.whatever, ufunc) from the outside (end-user or downstream library call) and we want that information in the final error message. I had considered passing that information down in an error_context object—but that would mean that every internal function needs another argument, which seemed like a bad idea.

One feature of this "outside Awkward" → "inside Awkward" concept is that it is global (per thread). A stack trace can only pass through such a boundary once. If an internal Awkward function calls another public API function, it doesn't count as another boundary cross—only the first one is the one we want to report. Although I avoid global state in almost every situation, this seems like the one-in-a-hundred exception. The global state is in ErrorContext._slate, a class object attribute, and this is a threading.local() object.

When the boundary is crossed the first time (ErrorContext._slate is clean), it gets set with the information about that boundary crossing. If we ever enter a public API from another internal function (not forbidden, but I'd try to avoid it), the slate is left as-is. It's important to clean the slate after the public API call, and that is implemented using a context manager, so that all public API calls look like this:

def public_ak_function(args):
    with ak._v2._util.SomeErrorContext(args):
        return do_the_function(args)

The logic that I've described about crossing the boundary only once is implemented by the context manager.
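A stripped-down sketch of that logic might look like the following. It is a simplification written for this discussion, not the class in the PR: the attribute name primary and the functions inner and outer are assumptions, and the real ErrorContext stores more than a name and arguments.

```python
import threading


class ErrorContext:
    # One independent slate per thread (simplified, hypothetical layout).
    _slate = threading.local()

    def __init__(self, name, arguments):
        self.name = name
        self.arguments = arguments

    @classmethod
    def primary(cls):
        return getattr(cls._slate, "primary", None)

    def __enter__(self):
        # Only the first context entered in this thread claims the slate.
        if self.primary() is None:
            self._slate.primary = self
        return self

    def __exit__(self, exc_type, exc_value, tb):
        # Only the context that claimed the slate cleans it, even on error.
        if self.primary() is self:
            self._slate.primary = None
        return False  # never swallow the exception


def inner(x):
    with ErrorContext("inner", {"x": x}):
        return ErrorContext.primary().name


def outer(x):
    with ErrorContext("outer", {"x": x}):
        return inner(x)  # nested public call: the slate still says "outer"


print(outer(1), inner(1))
```

Because __exit__ runs whether or not an exception is raised, the slate is clean again by the time control returns to the caller.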

The things that need to be adhered to are

  1. public API functions should use the context manager on entry
  2. internal exceptions should be wrapped with ak._v2._util.error(ExceptionConstructor(...)) to postprocess the exception messages in a centralized place
  3. it would be nice to not have internal Awkward functions call Awkward API functions, but not necessary as the context manager takes care of that.

If any of these rules are not followed, nothing disastrous happens—we just don't get the pretty error message. Rule (1) isn't hard to enforce via code review, and any fly-by contributors who add a new function by copying an existing one would copy that context manager along with the rest of the formalism (like the way that all of these functions go in a particular submodule and are exposed to ak.* in a standardized way). Rule (2) could be hard to enforce by eye (bare exceptions would be hard to spot), but not hard to write an AST-checker for. I don't know how to write flake8 checkers, but I can write a Python AST crawler that would do it if you want to collaborate. There might be a few other local Awkward rules that we're following informally that can get formalized in this way.

I'm not sure I know exactly what this is solving - is it solving several things?

Motivation

This is an important section, but I've just run out of time to write it. I'll follow up with another comment here in a few hours.

@henryiii
Member

henryiii commented Mar 1, 2022

Quick comment:

flake8 checkers, but I can write a Python AST crawler

That's exactly what they are. :) - https://www.youtube.com/watch?v=OjPT15y2EpE

It seems like at least the post-error message could be added via sys.excepthook; and maybe computing the truncation of the stack as well. But then you'd not mix well with other error formatters (IPython, Rich) - but those both support frame hiding.

@jpivarski
Member Author

(I still intend to give a fuller motivation here.)

It seems like at least the post-error message could be added via sys.excepthook; and maybe computing the truncation of the stack as well. But then you'd not mix well with other error formatters (IPython, Rich) - but those both support frame hiding.

Yeah, that's what would make me uncomfortable: modifying something that could interact poorly with IPython, Rich, or even some plain Python modes that I don't know about. What we have here is just plain exception-throwing, without even as much as chaining.

@agoose77
Collaborator

agoose77 commented Mar 1, 2022

On this PR:

My understanding of what you've written is:

  • When high-level operations fail (ak.xxx, array[...]), the error messages are currently hard to read (especially for beginners)
  • This PR adds machinery to make the raised Exception more readable by rewriting the content to include mention of the top-level context.

I can see the benefit to solving the problem of traceback clarity. Tools like rich and IPython install their own traceback formatters, and can shorten long tracebacks, but don't really tackle the problem of "the last exception should be the most relevant". For experienced users of Awkward, maybe the leaf exception is the most important, but understanding why some kernel failed is not as important as "this slice failed".

I want to prefix this with "I don't have a good idea of the best solution". What we're trying to solve here isn't just an Awkward problem (hence PEP-678)! My gut instinct is that this is a "Python problem" rather than a "library problem", because unless users are doing something awful with repr on the exception objects, this shouldn't affect the runtime behaviour.

Implementation Details

I'm not sure whether the exception-rewriting solution is ideal, though. On the one hand, it's a thorny issue - without PEP 678, there is no way to modify the printed exception without either

  1. Creating a generic AwkwardError exception, and raising it from the actual error.
  2. Creating a new exception of the same type, and modifying the message
  3. Creating a mixin AwkwardExplainedException, and overriding the repr/str
  4. Installing a custom traceback handler

(1) is the most foolproof - we explicitly impose an error signature of AwkwardError on all operations. However, this is pretty terrible UX - now users need to unwrap exceptions if they want to handle them.

The issue with rewriting (2) is that it makes some assumptions:

  • the args value of the exception can be modified & types changed
  • the metadata associated with the exception is not important

Maybe within the context of builtin exceptions these assumptions are acceptable, but this feels slightly fragile if, e.g. an external library implements their own exceptions (e.g. a custom container that raises an IndexError subclass) that are ultimately raised inside the awkward context.

(3) might be slightly more robust - we can effectively copy args and __dict__ to make fewer assumptions about the underlying exception than (2), but it's still pretty unpleasant to be creating a new exception object.

(4) would be the "safest", but pretty incompatible with other tools.

I suppose the questions that come to mind are:

  1. How important is this vs the status quo?
  2. Should this be opt-in/opt-out?

If the answer to (1) is "very important", then I think either the "duck-like wrapper exception" or the "re-create exception" approach is the most foolproof for most users.

With the existing PR, I wonder whether ak._util.error should leave the responsibility of the exception formatting to the context object rather than switching with isinstance inline?

In addition to how we attach this information to exceptions, there is also the associated change to each function that wants to implement this behaviour. My first impression is that we need a lot of boilerplate code in every Awkward function in order to handle this. Would it be acceptable just to create a decorator that captures the args and wraps any raised exceptions? I.e.

from functools import wraps
from inspect import signature

def nice_function(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except BaseException as exc:
            # Rebuild the call as text, e.g. "(1, 2, axis=-1)".
            params = signature(func).bind(*args, **kwargs)
            parts = [repr(p) for p in params.args]
            parts.extend(f"{k}={v!r}" for k, v in params.kwargs.items())
            call_msg = "(" + ", ".join(parts) + ")"
            msg = f"Call to {func.__qualname__}{call_msg} failed!"
            # annotate is a placeholder: it stands for building an annotated
            # exception as in option (3).
            raise annotate(exc, msg)
    return wrapper

@nice_function
def sort(...):
    return layout.sort(...)

Excuse the ugly pseudo code, this is just from my playing around with (3)
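For readers wondering what the annotate helper in that pseudocode could be, here is one possible sketch of option (3). It is a guess, not code from this PR: the _Explained mixin and the one-off subclass are assumptions, and the approach only works for exception types that can be rebuilt from their args.

```python
class _Explained:
    """Mixin that appends an explanation to whatever the base exception prints."""

    _explanation = ""

    def __str__(self):
        return f"{super().__str__()}\n\n{self._explanation}"


def annotate(exc, explanation):
    # Build a one-off subclass so that `except type(exc)` handlers still match.
    cls = type(f"Explained{type(exc).__name__}", (_Explained, type(exc)), {})
    new = cls(*exc.args)  # assumes the type can be rebuilt from its args
    new.__dict__.update(exc.__dict__)  # carry over any extra attributes
    new._explanation = explanation
    return new.with_traceback(exc.__traceback__)


err = annotate(ValueError("bad input"), "while calling sort(...)")
print(isinstance(err, ValueError))
print(err)
```

This sketch does not handle an exception that has already been annotated once, and it still creates a new exception object, which is the part described as unpleasant above.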

Looking Forward

So, I can see the benefit of this PR despite being slightly reluctant to start modifying exceptions (even if we are most likely the same entity that actually instantiated said exception).

Another thought that occurs to me is that PEP 678 is a clear standard that would mostly solve this problem (in my opinion). Maybe, instead of rolling our own solution, we could encourage Rich & IPython to support 678 (once it is accepted) without testing the Python version. Then, Awkward could use exception.__dict__ to set __notes__ and work on old Python versions that use relatively recent IPython or Rich. This would only leave pure-Python unsupported, which we could perhaps handle separately, with our own opt-in handler? This is not foolproof either - someone could raise an exception that sets __slots__.

My purist suggestion is just to use PEP 678 and let users of newer Python versions benefit, but I can see that this is probably a rather extreme suggestion 😉
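A minimal sketch of the __notes__ idea, assuming the PEP 678 interface as it ended up in Python 3.11 (BaseException.add_note appending to a __notes__ list). The helper name attach_note is made up here, and the fallback branch only stores the note; older interpreters will not print it unless a formatter chooses to read it.

```python
def attach_note(exc, note):
    """Attach a PEP 678 note, using add_note() where the interpreter provides it."""
    if hasattr(exc, "add_note"):  # Python 3.11 and later
        exc.add_note(note)
    else:
        # Older interpreters ignore __notes__ when printing, but a formatter
        # that knows about PEP 678 could still read it from the instance.
        exc.__notes__ = list(getattr(exc, "__notes__", [])) + [note]
    return exc


err = attach_note(ValueError("cannot slice"), "while calling ak.whatever(...)")
print(err.__notes__)
```

Unlike rewriting the message, this leaves str(err) and err.args untouched, so code that inspects the exception sees exactly what was raised.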

@henryiii
Member

henryiii commented Mar 1, 2022

I think we can install an exception handler only if it's set to the default, which would cover the pure Python case. The "poor interaction" with other tools that have exception handlers is just because you'd have to pick one, they can't be combined. I'd expect libraries would pick up support for this for all Python versions. I didn't realize this was pulled out of PEP 654 into its own PEP, though I remember the worries about it being added to an existing accepted PEP.
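A sketch of what "only if it's set to the default" could mean in practice, assuming the check is an identity comparison against sys.__excepthook__. The function names and the explanation attribute are hypothetical; nothing like this is part of the PR.

```python
import sys
import traceback


def explaining_hook(exc_type, exc_value, exc_tb):
    # Print the ordinary traceback, then any extra text stored on the exception.
    traceback.print_exception(exc_type, exc_value, exc_tb)
    explanation = getattr(exc_value, "explanation", None)  # hypothetical attribute
    if explanation is not None:
        print(explanation, file=sys.stderr)


def install_if_default():
    # sys.__excepthook__ holds the interpreter's original hook, so an identity
    # check tells us whether IPython, Rich, or anyone else has replaced it.
    if sys.excepthook is sys.__excepthook__:
        sys.excepthook = explaining_hook
        return True
    return False
```

A hook installed this way never overrides a formatter the user chose, at the cost of doing nothing at all in those environments.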

I'm rather sad that IPython's __tracebackhide__ hasn't made it into a PEP.

@jpivarski
Member Author

On the motivation (which I have yet to write): better error messages, a recurrent issue throughout Awkward's history and a general library problem, are not the only concern. The other motivation, the "why now?" question, has to do with #1321, handling errors if Awkward functions in a GPU context are asynchronous. Then a user error in ak.whatever could be raised long after ak.whatever has returned, and @swishdiff and I were thinking about how to tell users that their inputs to ak.whatever were wrong when ak.whatever isn't even in the call stack. I had an idea about how to support async Awkward as an alternate mode of operation, and this error tracking is a prerequisite for that.

(If we keep this PR open now to discuss it, then I'll have to develop the async prototype as a PR into this PR.)


@agoose77, you have some great points, but in a few detail points I think you're assuming this does more than it does.

Maybe within the context of builtin exceptions these assumptions are acceptable, but this feels slightly fragile if, e.g. an external library implements their own exceptions (e.g. a custom container that raises an IndexError subclass) that are ultimately raised inside the awkward context.

In this implementation, if our code calls a third-party library and that library raises an exception, it will not be rewritten. This annotates our own exceptions only, and all of the exceptions that we raise have no associated data other than the message. (I had the pleasure of reviewing them all last night!)

Ideally, our code doesn't use much from third-party libraries (only when we're converting or interoperating with them, such as ak.to_pandas or JAX autodiff), and when we do use third-party libraries, NumPy mostly, we check conditions that would lead to exceptions (within reason). For an example of the latter, if we do some complex NumPy gymnastics to implement an Awkward operation—let's say one of the steps is to slice an array by an array—we probably don't want the NumPy exception to pass to the user as-is, since that exception would be complaining about indexes being out of bounds for intermediary arrays that users would ordinarily never see.

So the code that we've already written generally checks to ensure that NumPy isn't going to be raising exceptions, and those checks raise ValueErrors, IndexErrors, etc. with Awkward-meaningful messages. This PR replaces

if check_for_bad_condition(intermediary_array):
    raise ValueError("your Awkward input is invalid in such-and-such a way")
else:
    do_calculation(intermediary_array)

with

if check_for_bad_condition(intermediary_array):
    raise ak._v2._util.error(ValueError("your Awkward input is invalid in such-and-such a way"))
else:
    do_calculation(intermediary_array)

If do_calculation can raise an exception that we don't know about because it's a third-party exception, then before this PR it would be confusing to users and after this PR it would be confusing to users and also not get a banner with the ak.whatever function arguments. In either case, it's something that I hope users would report as a bug that we could fix by checking for the bad condition. The value of the banner is to tell users how to fix their code. If they don't get a banner, then the problem is not in their code.
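To show roughly what the wrapper does to an exception, here is a toy version. It is not the PR's ak._v2._util.error: the module-level _slate, the banner format, and the to_numpy stand-in are all simplifications assumed for this example.

```python
import threading

_slate = threading.local()  # stand-in for the per-thread error context


def error(exception):
    """Return `exception`, with the active operation prepended to its message."""
    operation = getattr(_slate, "operation", None)
    if operation is None:
        return exception  # raised outside any public operation: leave as-is
    banner = f"while calling {operation}\n\nError details: {exception.args[0]}"
    # Rebuild the same exception type; assumes the message is the only argument.
    return type(exception)(banner)


def to_numpy(array):
    _slate.operation = "to_numpy"
    try:
        if any(len(x) != len(array[0]) for x in array):
            raise error(ValueError("subarray lengths are not regular"))
        return array
    finally:
        _slate.operation = None


try:
    to_numpy([[1, 2, 3], [], [4, 5]])
except ValueError as err:
    print(err)
```

An exception that never passes through error, such as one raised by a third-party library, is untouched and simply arrives without the banner.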

With the existing PR, I wonder whether ak._util.error should leave the responsibility of the exception formatting to the context object rather than switching with isinstance inline?

You're totally right about that, and I agree it would be cleaner to do so. But it's also centralized code that can be fixed in one spot. Most of the changes in this PR convert individual raise Something(...) statements to raise ak._v2._util.error(Something(...)).

Also, notice that there's an ak._v2._util.indexerror(arguments, of, indexerror) function, which is replacing the try-except of the old NestedIndexError. I was going to write a function for each type of error (ValueError, IndexError, np.AxisError), taking the arguments of each and building it in the function, rather than rewriting them after they've been built. But every exception we raise is constructed with the error message as its only argument. If we ever deal with exception types that have more complex arguments, we can create a handler function for that, just as we have a special one for indexerror.

My first impression is that we need a lot of boilerplate code in every Awkward function in order to handle this. Would it be acceptable just to create a decorator that captures the args and wraps any raised exceptions?

I'd rather have boilerplate inside our codebase than have functions modify our functions. Having everything be visibly laid out, so you can see what it all does, is better for maintenance, though understandably more effort to type. (Parts of Awkward 0 were decorator-based, and it caused more harm than good. We'd want to be careful with that.) Also, the decorator you describe would annotate all exceptions, chaining them with the try-except, and I only intended to annotate the exceptions we raise with Awkward-meaningful messages in them.

This is not foolproof either - someone could raise an exception that sets __slots__.

I think this is illustrating the main thing: we're talking about annotating exceptions we raise, and we know how we raise them. I'd like to backtrack from this generality.

@agoose77
Collaborator

agoose77 commented Mar 1, 2022

but in a few detail points I think you're assuming this does more than it does.

Right, I crossed some wires in my head w.r.t where this PR actually calls ak._util.error. My mistake!

Given your explanations, it seems like we're not worrying about the case where an unexpected exception is raised - we're only worried about exceptions that are explicitly raised in the Awkward code-path (known ahead-of-time). That simplifies the scope of the problem a bit.

This is a big PR, and I am going in circles trying to write out a full reply 🤕

I have two separate axes of concern:

  • How we set the exception context
  • How we annotate the exceptions

Given the explicit, opt-in approach, I think the rewriting approach here is the simplest. ak._util.error can make assumptions about exceptions it receives. When/if PEP 678 lands, it looks like all we'd be doing is using __notes__ instead of overwriting the exception message. That's a much less important change.

This just leaves the boilerplate of the exception context. With

with ak._v2._util.OperationErrorContext(
    "ak._v2.from_arrow_schema",
    dict(schema=schema),
):
    return _impl(schema)

what this is mainly doing AFAICT is capturing the arguments to the high-level operation, so that the formatted traceback can guide the user as to what went wrong. This is quite similar in motivation to Rich's locals rendering in tracebacks, but targeted for Awkward.

If we want to implement this (if we're dealing with async stuff, I'm guessing this becomes more important), then I would think that a simple decorator would remove most of the boilerplate:

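import inspect
from functools import wraps

def operation(func):
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Record the call's arguments, including defaults the caller left out.
        context = signature.bind_partial(*args, **kwargs)
        context.apply_defaults()
        # push_context and pop_context are placeholders for the context bookkeeping.
        push_context(func.__module__, func.__name__, context)
        try:
            return func(*args, **kwargs)
        finally:
            pop_context()

    return wrapper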

Is this something you'd be on board with?
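For anyone unfamiliar with the inspect calls in that decorator, this short sketch shows what bind_partial followed by apply_defaults captures. The function whatever and its parameters are invented for the example.

```python
import inspect


def whatever(array, axis=-1, highlevel=True):
    return array


signature = inspect.signature(whatever)
bound = signature.bind_partial([1, 2, 3], highlevel=False)
bound.apply_defaults()  # fills in axis=-1, which the caller did not pass

captured = dict(bound.arguments)
print(captured)  # {'array': [1, 2, 3], 'axis': -1, 'highlevel': False}
```

The captured mapping has the same shape as the dict passed by hand to OperationErrorContext, which is why a decorator could take over that boilerplate.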

RE

handling errors if Awkward functions in a GPU context are asynchronous

This immediately sounds like ExceptionGroups (similar to Trio's MultiError), but it's late and the topic probably warrants a longer reply :)

@jpivarski
Member Author

This is quite similar in motivation to Rich's locals rendering in tracebacks, but targeted for Awkward.

I'm not surprised that it's been done before. We may want to take control of the repr of behavior, since this is either None or a dict with dozens or hundreds of entries. Right now, the repr of all arguments are truncated at 80 characters.
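The truncation itself could be as simple as the sketch below. How the PR actually cuts the text is not shown in this thread, so the helper name format_argument and the trailing "..." marker are assumptions.

```python
def format_argument(name, value, width=80):
    """Render `name = repr(value)`, cut to at most `width` characters."""
    text = f"{name} = {value!r}"
    if len(text) > width:
        text = text[: width - 3] + "..."
    return text


print(format_argument("axis", -1))
print(format_argument("behavior", {i: str(i) for i in range(100)}))
```

A special case for behavior could sit in front of this, for example printing only the number of entries instead of a truncated dict repr.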

If we want to implement this (if we're dealing with async stuff, I'm guessing this becomes more important), then I would think that a simple decorator would remove most of the boilerplate:

...

Is this something you'd be on board with?

This one, yes! The boilerplate of raising all exceptions like raise ak._v2._util.error(ActualException(...)) is better than the alternative of decorating every function in our codebase with something that catches and reraises exceptions. However, this one would only decorate high-level operations (all the ak.* functions), which should be called out as being special.

I'm assuming that a decorator like this would then make them look like

@ak._v2._util.operation
def whatever(arguments, with_some="defaults"):
    """
    Really long docstring.
    """
    return _impl(arguments, with_some)

Another reason that I had for wanting to move the implementations out to _impl is because pytest prints the whole function source up to the error, and that includes the really long docstring. I was looking forward to it only printing source code and letting the operation functions be primarily holders of docstrings. But I could go either way on that.

@jpivarski
Member Author

How does everyone feel about this now? Let me know if you object to merging it in its current state. (Silence is assumed to be consent!)

@agoose77's idea of using a decorator to reduce boilerplate in src/awkward/_v2/operations/**/*.py is a good one, but I think it can be applied at a later date. The harder-to-merge part is all of the ak._v2._util.error function calls in the codebase. I'd like to merge this PR so that there will be fewer adjustments to other PRs that touch the same lines.

@henryiii suggested writing a flake8 check for the ak._v2._util.error function calls. I'll write an AST-crawler and post it here before I forget, so that we'll have a foot-in-the-door toward writing that flake8 check.

With this PR merged into main, #1331 can be turned to target main instead of this branch.

Oh! And I never did write up that motivation in terms of delayed processing. I'll do that now because having #1331 to point to would make it easier to talk about.

@jpivarski
Member Author

@henryiii suggested writing a flake8 check for the ak._v2._util.error function calls. I'll write an AST-crawler and post it here before I forget, so that we'll have a foot-in-the-door toward writing that flake8 check.

This does it:

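import ast
import sys

# Assumption: the file to check is given as the first command-line argument.
filename = sys.argv[1]

with open(filename) as file:
    parsed = ast.parse(file.read(), filename)

for node in ast.walk(parsed):
    if isinstance(node, ast.Raise):
        # ast.unparse requires Python 3.9 or later.
        if not isinstance(node.exc, ast.Call) or ast.unparse(node.exc.func) not in (
                "ak._v2._util.error", "ak._v2._util.indexerror"
        ):
            raise ValueError(
                f"{filename} line {node.lineno} needs exception to be wrapped in ak._v2._util.*error"
            )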

although there should be a way of opting-out of some files (src/awkward/_v2/_connect/numba/**/*.py are excluded because those errors are not in Awkward operations and Numba does its own manipulation of error messages), and there should be a way to opt-out of individual lines, like with a # noqa: ??? number.
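A rough sketch of what those two opt-outs could look like, written as a plain function rather than a real flake8 plugin. The `AK101` code is borrowed from the check output later in this thread; the function name, the glob pattern, and the decision to accept a bare `raise` (a re-raise) are assumptions.

```python
import ast
import fnmatch

WRAPPERS = ("ak._v2._util.error", "ak._v2._util.indexerror")
# Files matching these glob patterns are skipped entirely (assumed pattern).
EXCLUDED = ("src/awkward/_v2/_connect/numba/*",)
# A line carrying this comment is exempt from the check.
NOQA = "# noqa: AK101"


def unwrapped_raises(source, filename="<string>"):
    """Return the line numbers of raise statements that are not wrapped."""
    if any(fnmatch.fnmatch(filename, pattern) for pattern in EXCLUDED):
        return []
    lines = source.splitlines()
    found = []
    for node in ast.walk(ast.parse(source, filename)):
        if not isinstance(node, ast.Raise) or node.exc is None:
            continue  # not a raise, or a bare re-raise (assumed acceptable)
        wrapped = (
            isinstance(node.exc, ast.Call)
            and ast.unparse(node.exc.func) in WRAPPERS  # Python 3.9+
        )
        if not wrapped and NOQA not in lines[node.lineno - 1]:
            found.append(node.lineno)
    return sorted(found)
```

The per-line opt-out only looks at the first line of a multi-line `raise`, which is where `node.lineno` points.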

@jpivarski
Member Author

Motivation (as promised)

Apart from the long-standing issues with appropriateness of error messages for users, there's a new one regarding eagerness/laziness and error messages. @swishdiff and I talked about CUDA occupancy at length on Monday: the CUDA backend exists only for speeding things up, so keeping a GPU fully occupied is its raison d'être. Picking a concurrency model is not a premature optimization. We expanded this conversation to potential Awkward-CUDA users in Discussion #1321.

The details are on that Discussion, but two things came out: (1) users are already dissatisfied with error messages and find the primary value to be one of locating the line number in their code, and (2) Awkward's current eagerness strategy would ensure that either the CPU is busy or the GPU is busy, never both. That's bad.

(There's a secondary part to that story that @swishdiff brought up, that in addition to keeping both the CPU and GPU busy, you also have to keep all the processors on the GPU busy. With our strict data dependencies between subsequent Awkward-kernels in an Awkward-operation and unknown data dependencies between Awkward-operations (it depends on user code), it would be very difficult for us to run multiple Awkward-kernels, and hence CUDA-kernels, at the same time. The only way we could do this well is by letting the user put independent work on CUDA streams, so everything I say below about a "background worker" applies per CUDA stream.)

We need to run our Awkward-kernels in a particular order to handle data dependencies, but the result does not need to be ready when an operation like ak.whatever returns control to Python. It needs to be ready when a user looks at values in the array. Ideally, we want all of the work to happen in the same order that it would happen in eager, CPU-bound Awkward Array, but let the evaluation lag behind the user's thread. That way, CUDA-kernel calls can be "packed tightly," since there would be a workload waiting for it, a queue filled by the user thread because the user thread didn't have to wait for numeric calculations to finish.

The CUDA tools we looked at for doing this (a) are unaware of the Python steps we need to perform between CUDA-kernels and (b) don't seem to apply to cudaMalloc, which is every other step in our workflow. That's okay: we can do the "lagging behind" in Python, and moving it there frees us to use all the blocking CUDA calls we want. I would call this "asynchronous," but after a deep-dive into Python's asyncio, I see that Python's use of "asynchronous" is different. We don't want to do coroutines because the order of operations matters very much, and this order isn't entirely encoded in functional dependencies, either, since kernels act purely through side-effects.

Below is (the beginning of) an implementation of a "lagging"/"foot-dragging"/"delayed" executor as an nplike. The idea is that the NumPy-like arrays in the Indexes and NumpyArrays of some Awkward Arrays can be built out of futures. The array object's data might still be in the process of being computed, and it would have to wait for that computation to finish if and only if you ask for that information. Values in the array are always subject to delay, the shape of the array is sometimes subject to delay (it might depend on a value in another, uncomputed array), but the dtype and such information as number of dimensions and contiguousness are always known.

This delayed array is similar to v1's VirtualArray, except that it is low-level (Indexes and buffers, not a Content node), it's completely invisible to users, has no evictable cache (it runs once and fills a permanent result), and would never be used for I/O. Its only intended use is for CUDA, but we can separate that part out and just have this nplike point to a nested_nplike, which can be NumPy or CuPy.

https://github.com/scikit-hep/awkward-1.0/blob/22b8184da46bc50b35a8b31dc4166c16bad76cf2/src/awkward/_v2/_delayed.py#L12-L150
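Independently of the linked file, the split between "always known" and "subject to delay" can be illustrated with a toy wrapper around a standard-library future. `DelayedArray` here is a hypothetical one-dimensional stand-in that holds a list, not the nplike from this PR: `dtype` and `ndim` are fixed at construction, while `shape` and element access block until the value exists.

```python
from concurrent.futures import ThreadPoolExecutor


class DelayedArray:
    """Toy 1-D array whose values come from a future (illustration only)."""

    def __init__(self, future, dtype):
        self._future = future
        self.dtype = dtype  # known immediately, never blocks
        self.ndim = 1       # known immediately, never blocks

    @property
    def shape(self):
        # In this toy, the length depends on the computed values, so this blocks.
        return (len(self._future.result()),)

    def __getitem__(self, index):
        # Looking at a value always waits for the computation to finish.
        return self._future.result()[index]


def compute():
    return [1, 2, 3]


with ThreadPoolExecutor(max_workers=1) as executor:
    delayed = DelayedArray(executor.submit(compute), dtype="int64")
    known_now = (delayed.dtype, delayed.ndim)  # no waiting
    first = delayed[0]                         # waits if compute() is still running
```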

The threading model is important to get right and keep simple, since "hanging"/"deadlock" is as hard to debug as segfaults. The worker has three states:

  1. waiting for a task
  2. processing a task
  3. a task raised an exception, worker is dead

State 1 → 2 when a task appears on its queue, 2 → 1 when it completes, and 2 → 3 if it raises an exception. An exception in a task ruins the worker (3 is an absorbing state); a new worker needs to be made to replace it. A future has three states:

  1. computation in progress and worker is not dead
  2. computation is done and it is a good value
  3. computation failed or a previous task in the sequence failed

Attempting to view the result in state 1 blocks until it leaves that state. State 1 can go to 2 or 3, which are both absorbing states. In state 2, result() returns the result, and in state 3, result() raises the exception from the first task that failed.
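The two state machines can be sketched with the standard library alone. This is a simplified illustration under my own assumptions, not the code in `_delayed.py`: the class and method names mirror the ones used in the examples in this comment, but the internals (a lock around the "dead" flag, draining the queue on failure) are choices made for the sketch.

```python
import queue
import sys
import threading


class Future:
    def __init__(self, task):
        self.task = task
        self._done = threading.Event()
        self._result = None
        self._exc_info = None

    def succeed(self, value):
        self._result = value          # state 1 -> 2
        self._done.set()

    def fail(self, exc_info):
        self._exc_info = exc_info     # state 1 -> 3
        self._done.set()

    def result(self):
        self._done.wait()             # state 1: block until 2 or 3
        if self._exc_info is not None:
            _, value, traceback = self._exc_info
            raise value.with_traceback(traceback)
        return self._result


class Worker(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self._futures = queue.Queue()
        self._lock = threading.Lock()
        self._exc_info = None         # set once; afterwards the worker is dead

    def schedule(self, task):
        future = Future(task)
        with self._lock:
            if self._exc_info is not None:
                _, value, traceback = self._exc_info
                raise value.with_traceback(traceback)
            self._futures.put(future)
        return future

    def run(self):
        while True:
            future = self._futures.get()      # waiting for a task
            try:
                value = future.task()         # processing a task
            except Exception:
                with self._lock:              # worker dies; nothing new can be queued
                    self._exc_info = sys.exc_info()
                    future.fail(self._exc_info)
                    while True:               # later tasks fail with the first error
                        try:
                            self._futures.get_nowait().fail(self._exc_info)
                        except queue.Empty:
                            break
                return
            future.succeed(value)
```

Marking the worker as dead and failing the future under the same lock means that once `result()` has raised on the user's thread, a following `schedule()` call is guaranteed to raise as well.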

If anything goes wrong in a task, it will be reported on the user's thread, either when trying to add a task to a dead worker or when trying to access the result() of the future that failed or a later one in the sequence. The key thing, the whole reason I made this PR, is that this exception is necessarily happening at a different point in time from the scheduling of the task. It was when the task was scheduled that the stack trace included the call from outside Awkward Array into Awkward Array.

If you read the code above, you'll see that the stack trace the user sees is a relevant one for debugging the task itself—in other words, Awkward internals—but it's decoupled from the stack trace at the time when it was scheduled—in other words, how the user called ak.whatever. That's why it was important to add that information into the error message.

  • In normal, eager Awkward Array, the information about ak.whatever is in the stack trace, though it may be buried. Annotating the error message highlights that information.
  • In delayed Awkward Array, that information is not in the stack trace. Annotating the error message is essential.

The thread-local ErrorContext is a way of getting that information from the time when a task is scheduled to the time when the exception is raised. Here are some examples showing how that works.

First, a demo of how the delayed processing works when there are no exceptions.

>>> import time
>>> from awkward._v2._delayed import *
>>> def task():
...     print("begin")
...     time.sleep(10)
...     print("end")
...     return 123
... 
>>> worker = Worker(); worker.start()
>>> future = worker.schedule(task)
begin
>>> future.result()  # processing has already begun; wait for it to end
end
123

Now if the future is scheduled in an OperationErrorContext, it will be able to talk about that context in its error message, even though the exception is raised long after ak_whatever has returned and the OperationErrorContext is no longer operative on the main thread.

>>> from awkward._v2._util import OperationErrorContext, error
>>> def task():
...     print("begin")
...     time.sleep(10)
...     raise error(ValueError("oops"))
... 
>>> def ak_whatever(**kwargs):
...     with OperationErrorContext("ak.whatever", kwargs):
...         future = worker.schedule(task)
...     return future
... 
>>> future = ak_whatever(args=123)
begin
>>> future.result()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 62, in result
    raise exception_value.with_traceback(traceback)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 45, in run
    self._result = self._task()
  File "<stdin>", line 4, in task
ValueError: while calling (from <stdin>, line 1)

    ak.whatever(
        args = 123
    )

Error details: oops

The worker does sequential work: if there's an exception anywhere in the sequence, it's the only exception because nothing can be scheduled or executed after that. In the following, we put two tasks onto the worker:

>>> worker = Worker(); worker.start()
>>> future1 = ak_whatever(args=1)
begin
>>> future2 = ak_whatever(args=2)

then wait a long time (more than 10 seconds), then try to put another one on:

>>> future3 = ak_whatever(args=3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in ak_whatever
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 102, in schedule
    self._futures.put(future)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 73, in put
    raise exception_value.with_traceback(traceback)
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_delayed.py", line 45, in run
    self._result = self._task()
  File "<stdin>", line 4, in task
ValueError: while calling (from <stdin>, line 1)

    ak.whatever(
        args = 1
    )

Error details: oops

Note that the error message has args = 1 from the first task that failed. The line number would also point to the first ak.whatever call that failed, though here everything is on the prompt and the file and line number aren't informative.

If you try to evaluate future2.result() or future1.result(), you get the same error message: that the first task failed, and how it was called.

Other than the fact that these stack traces are full stack traces from the worker thread (reported on the main thread), they are not being manipulated. The file and line number are part of the final error message.

That's why this PR was written.

@henryiii
Member

henryiii commented Mar 3, 2022

per-file-ignores =
    tests/*: T, AK1
    dev/*: T, AK1
    setup.py: T
    localbuild.py: T
    src/awkward/__init__.py: E402, F401, F403
    ./awkward/__init__.py: E402, F401, F403
    src/awkward/_v2/_connect/numba/*: AK1

This check might still be grabbing a tiny bit too much:

src/awkward/_connect/_jax/__init__.py:17:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/__init__.py:30:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:154:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:368:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:382:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:414:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:590:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:734:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:792:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:847:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:951:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:984:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:1496:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:1523:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:1530:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:1688:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:1772:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:1816:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:2058:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:2062:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:2066:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:2076:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:2092:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:2144:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:2170:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:2404:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3047:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3061:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3318:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3323:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3336:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3357:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3444:25: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3474:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3493:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3599:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3852:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3967:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:3986:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:4265:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:4323:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:4344:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:4368:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:4379:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:4395:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:4406:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/structure.py:4673:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/_util.py:43:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/behaviors/mixins.py:86:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/nplike.py:36:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/nplike.py:449:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/nplike.py:480:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/nplike.py:497:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/nplike.py:506:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/nplike.py:519:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/nplike.py:526:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/nplike.py:564:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_typeparser/parser.py:82:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_typeparser/parser.py:102:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_typeparser/parser.py:120:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_typeparser/parser.py:283:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/describe.py:64:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/describe.py:72:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/describe.py:175:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/describe.py:190:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:150:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:160:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:174:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:188:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:202:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:212:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:222:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:236:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:250:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:260:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:276:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:290:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:300:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:337:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/builder.py:561:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:548:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:608:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:630:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:789:25: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:909:29: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:1049:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:1058:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:1071:25: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:1081:25: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:1114:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:1301:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:1500:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_util.py:1800:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numpy.py:240:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numpy.py:268:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numpy.py:281:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numpy.py:290:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/operations/structure/ak_broadcast_arrays.py:10:5: AK101 exception must be wrapped in ak._v2._util.*error
setup.py:68:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numexpr.py:19:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numexpr.py:125:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:256:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:264:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:302:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:334:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:367:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:376:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:457:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:472:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:509:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:526:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:535:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:544:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:612:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:632:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:647:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:667:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:683:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:700:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:709:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:718:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:884:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:931:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1002:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1011:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1108:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1125:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1224:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1562:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1578:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1587:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1606:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1623:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1633:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1638:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1683:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1749:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1844:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1847:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1902:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1920:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1968:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:1982:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:2502:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:2727:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:2852:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:2936:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3076:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3113:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3260:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3292:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3414:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3458:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3461:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3557:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3578:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3652:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3702:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3731:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3814:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:3818:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4144:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4182:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4379:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4386:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4405:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4486:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4517:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4544:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4567:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4586:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4618:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4656:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4663:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4702:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4739:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4754:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4790:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4840:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4847:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:4927:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:5068:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:5076:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:5124:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/operations/convert.py:5249:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/behaviors/string.py:42:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/behaviors/string.py:48:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/behaviors/string.py:83:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/behaviors/string.py:89:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/_connect/numpy.py:13:5: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/partition.py:182:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/partition.py:185:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/partition.py:192:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/partition.py:211:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/partition.py:259:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/partition.py:393:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/partition.py:442:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/partition.py:526:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/arrayview.py:224:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/arrayview.py:265:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/arrayview.py:337:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/arrayview.py:589:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/arrayview.py:877:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/arrayview.py:1505:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:44:5: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:123:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:137:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:238:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:243:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:275:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:292:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:316:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:337:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:523:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:539:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:567:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:899:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:1065:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:1236:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2021:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2028:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2045:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2052:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2066:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2073:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2366:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2371:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2394:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2401:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2408:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2426:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2444:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2451:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2497:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2520:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2536:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2546:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2608:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2644:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2653:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2840:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2882:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/layout.py:2926:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_cuda_kernels.py:11:5: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_cuda_kernels.py:16:5: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_cuda_kernels.py:24:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:42:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:71:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:87:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:110:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:130:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:160:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:185:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:207:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_jax/jax_utils.py:285:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/__init__.py:17:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/__init__.py:30:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_connect/_numba/__init__.py:97:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:207:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:236:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:255:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:328:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:358:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1058:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1067:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1117:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1123:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1489:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1545:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1560:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1569:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1638:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1668:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1795:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1839:21: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:1845:17: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:2083:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:2267:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/highlevel.py:2414:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/_connect/pyarrow.py:38:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/_connect/pyarrow.py:44:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/numba.py:15:13: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/numba.py:29:13: AK101 exception must be wrapped in ak._v2._util.*error

@jpivarski
Member Author

jpivarski commented Mar 3, 2022

It should be applied to files within src/awkward/_v2, excluding src/awkward/_v2/_connect/numba.

None of this applies to v1.

Looking at the _v2 output, _util should probably be excluded, too, because it would refer to this function as error, not ak._v2._util.error.

Then the rest of these are actual surprises that I can investigate:

src/awkward/_v2/operations/structure/ak_broadcast_arrays.py:10:5: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/_connect/numpy.py:13:5: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/_connect/pyarrow.py:38:9: AK101 exception must be wrapped in ak._v2._util.*error
src/awkward/_v2/_connect/pyarrow.py:44:9: AK101 exception must be wrapped in ak._v2._util.*error

@jpivarski
Member Author

src/awkward/_v2/operations/structure/ak_broadcast_arrays.py:10:5: AK101 exception must be wrapped in ak._v2._util.*error

def broadcast_arrays(*arrays, **kwargs):
    raise NotImplementedError

Okay, I could wrap that.

src/awkward/_v2/_connect/numpy.py:13:5: AK101 exception must be wrapped in ak._v2._util.*error

if not numpy_at_least("1.13.1"):
    raise ImportError("NumPy 1.13.1 or later required")

This is one that I'd want to write a # noqa: for.

src/awkward/_v2/_connect/pyarrow.py:38:9: AK101 exception must be wrapped in ak._v2._util.*error

def import_pyarrow(name):
    if pyarrow is None:
        raise ImportError(error_message.format(name))
    return pyarrow

Same here.

src/awkward/_v2/_connect/pyarrow.py:44:9: AK101 exception must be wrapped in ak._v2._util.*error

def import_pyarrow_parquet(name):
    if pyarrow is None:
        raise ImportError(error_message.format(name))

And here.

@henryiii
Member

henryiii commented Mar 3, 2022

Pushed the custom check. It triggers only on the NotImplementedError, though we could allow NotImplementedErrors if you want. I avoided the unparse step just in case it was slow (flake8 is pretty slow on such a large amount of code), though you can add it back if you want. I'm assuming that wrapping it in something.else.error is okay to pass on.

Comment on lines +45 to +57
def main(path):
    with open(path) as f:
        code = f.read()

    node = ast.parse(code)
    plugin = AwkwardASTPlugin(node)
    for err in plugin.run():
        print(f"{path}:{err.line_number}:{err.offset} {err.msg}")


if __name__ == "__main__":
    for item in sys.argv[1:]:
        main(item)
Member

This part is only for debugging

Member Author

Okay, visitor pattern, checking each attribute individually—that all is fine. Better/more formal, in fact: I was trying to make it simple.

The unparse could only ever be applied to an expression that is called, which is hardly ever anything more than a Name with Attributes, and it was only doing that if it's to the right of a Raise, so I think the execution time was pretty well controlled.

But this is really nice! (And now I have an example I'll be looking up if I ever want to add another one.) Thanks!
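
For anyone looking this up later, here is a minimal sketch of the visitor-pattern idea discussed above: walk the attribute chain of whatever follows `raise`, one node at a time, without unparsing. It is not the plugin that was pushed to this PR. The names `UnwrappedRaiseVisitor` and `find_unwrapped_raises` are hypothetical, and the matching rule (the raised expression must be a call to `ak._v2._util.<something>error`) is an assumption that is stricter than "wrapping it in something.else.error is okay".

```python
import ast

MESSAGE = "AK101 exception must be wrapped in ak._v2._util.*error"


class UnwrappedRaiseVisitor(ast.NodeVisitor):
    """Hypothetical checker: records `raise` statements that are not wrapped."""

    def __init__(self):
        self.errors = []  # list of (line, column offset) pairs

    def visit_Raise(self, node):
        # A bare `raise` re-raises the active exception (node.exc is None); skip it.
        if node.exc is not None and not self._is_wrapped(node.exc):
            self.errors.append((node.lineno, node.col_offset))
        self.generic_visit(node)

    @staticmethod
    def _is_wrapped(exc):
        # The raised expression must be a call whose function is an
        # attribute with a name ending in "error".
        if not isinstance(exc, ast.Call):
            return False
        func = exc.func
        if not (isinstance(func, ast.Attribute) and func.attr.endswith("error")):
            return False
        # Walk the attribute chain leftward, checking each node individually.
        parts = []
        value = func.value
        while isinstance(value, ast.Attribute):
            parts.append(value.attr)
            value = value.value
        if not isinstance(value, ast.Name):
            return False
        parts.append(value.id)
        return parts[::-1] == ["ak", "_v2", "_util"]


def find_unwrapped_raises(source, path="<string>"):
    """Return flake8-style report lines (1-based columns) for a source string."""
    visitor = UnwrappedRaiseVisitor()
    visitor.visit(ast.parse(source))
    return [f"{path}:{line}:{col + 1}: {MESSAGE}" for line, col in visitor.errors]
```

In this sketch, `raise NotImplementedError` and `raise ImportError(...)` are both reported, while `raise ak._v2._util.error(ValueError(message))` passes.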

@agoose77
Collaborator

agoose77 commented Mar 3, 2022

Thanks for the explanation @jpivarski, the additional motivation is much clearer now.

> I would call this "asynchronous," but after a deep-dive into Python's asyncio, I see that Python's use of "asynchronous" is different.

I think async is a reasonable description. Given the fact that we're communicating with a remote executor, and tasks can operate out of order (independent streams), it seems like a good label. I suspect many workloads will not involve large numbers of concurrent streams, because analyses tend to evolve forwards rather than sideways, but that's by-the-by!

Also, I don't think "async" in Python makes the async label unusable.

> The worker does sequential work: if there's an exception anywhere in the sequence, it's the only exception because nothing can be scheduled or executed after that. In the following, we put two tasks onto the worker:

I suppose that this is a reasonable limitation - if any user operation fails, we need to blow up at some point to report the exception. Unlike fully-fledged async executors, we don't really need a recovery mechanism: errors are likely to be deterministic errors (bad data) or semi-deterministic (OOM).

Overall this sounds like a reasonable direction to me, though I confess I've not spent a lot of time thinking about it yet. The executor support would be interesting - we could even use it to add threading support to Awkward CPU by releasing the GIL. It would probably be quite easy[1] once the CUDA work is done. I'm not sure how much benefit this would bring, because again, this would only speed up concurrent kernels, whereas CUDA does this in addition to running different algorithms.

Footnotes

  1. Don't quote me on this!

@jpivarski
Member Author

> The executor support would be interesting - we could even use it to add threading support to Awkward CPU by releasing the GIL. It would probably be quite easy once the CUDA work is done.

With v2 reducing the amount of C++ code, we'll be releasing the GIL on all of that C++ code. So, AwkwardForth, ArrayBuilders working on parsing large JSON, etc.


Keep in mind that this delayed thread (one per CUDA stream) is a very restricted model of concurrency—not even as general as Python's Executors. Since it runs everything in order, just one of these background threads is technically not "concurrent," though the value we're looking for is to have one attached to each CUDA stream, and the multiple CUDA streams would be concurrent with each other (and parallel).

Another thing to keep in mind about a background thread attached to a CUDA stream is that it doesn't have to be fast! It's a bit like an I/O thread, in that the CPU part is just waiting on the GPU as an external resource.

@jpivarski
Member Author

Since it looks like all of the tests are going to pass, I'm going to squash-and-merge. Is everybody ready (nothing more to add)?

@jpivarski jpivarski merged commit 8a61c88 into main Mar 3, 2022
@jpivarski jpivarski deleted the jpivarski/straighten-out-error-message-handling branch March 3, 2022 21:46
@agoose77
Collaborator

agoose77 commented Mar 3, 2022

> not even as general as Python's Executors. ... just one of these background threads is technically not "concurrent"

For sure, just as having one proc/thread is a mostly useless kind of concurrency in conventional threading models 😄 This reminds me of Dask's concurrent.futures interface to Client. IIRC it tracks data dependencies and promotes locality (and therefore doesn't block workers that depend upon other workers).

Could you clarify what you mean by this being less general than Python's executors? AFAICR these all assign work to a pool of sequential executors, which seems to map well to the model you've set out above.

> Another thing to keep in mind about a background thread attached to a CUDA stream is that it doesn't have to be fast!

Right, and to clarify, I mean that a "multiple, sequential worker" executor could provide a lightweight mechanism for multiple-CPU execution of Awkward kernels (i.e., with performance boost as an explicit goal). Clearly, this parallelism would be limited - users would only gain any performance benefits on multi-core PCs where they are operating on independent arrays in the same program (assuming that these are assigned to different workers).

Moreover, I don't know why I'm suggesting making any more work for ourselves 😆

@jpivarski
Member Author

> For sure, just as having one proc/thread is a mostly useless kind of concurrency in conventional threading models

I meant in the sense that even though we have a main thread and a background thread, there's no concurrency in the non-error behavior. Those two threads don't count as concurrency. If you take the pair of them as one unit and have multiple units, then you can get concurrent behavior among the units, but that would also happen with regular threads, and these are thread pairs.

> Could you clarify what you mean by this being less general than Python's executors? AFAICR these all assign work to a pool of sequential executors, which seems to map well to the model you've set out above.

A Python ThreadPoolExecutor (or a TBB executor, etc.) is a pool of threads that do the tasks they've been given in an arbitrary order. With a ThreadPoolExecutor, you don't know which thread is going to run a given task, if it's going to be before or after or at the same time as another task, etc. Our background thread (singular, always only one of these per main thread) executes its tasks in exactly the order given and one does not begin until the previous one ends. All it buys is looseness between the main thread (controlled by the user's Python process) and the background thread, which is a "GPU shepherd" that keeps the GPU busy. (The GPU runs a bunch of concurrent sub-tasks, but that's another story.)

> I mean that a "multiple, sequential worker" executor could provide a lightweight mechanism for multiple-CPU execution of Awkward kernels (i.e., with performance boost as an explicit goal).

We gain that in a different way, from Dask. This "background thread" mechanism is useful for dealing with an external resource (quite a lot like an I/O thread, if you count controlling a GPU as "I/O"), which isn't what we have when trying to accelerate work on a CPU: the cpu-kernels compete with Python for the same resource (CPU cores). Dask will scale out threads, processes, and remote processes running ordinary single-threaded Awkward tasks. For that purpose, these background threads are an unnecessary complication.

Maybe a different word would help (other than "asynchronous" and "delayed," the two I've used so far): maybe we can call it a "shadow thread" because there's only one of them behind a user thread and it mimics what the user thread could have done on its own, possibly offset in time.
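
To make the contrast with a thread pool concrete, here is a rough sketch of the single-background-thread idea: one thread, strictly in-order execution, and a single stored exception that surfaces the next time the user thread touches the worker. This is not the code in src/awkward/_v2/_delayed.py. The class name `ShadowThread` and its methods `schedule`, `wait`, and `close` are made up for illustration, and plain Python callables stand in for GPU work.

```python
import queue
import threading


class ShadowThread:
    """Hypothetical sketch: one background thread that runs tasks in order."""

    def __init__(self):
        self._tasks = queue.Queue()  # FIFO, so submission order is preserved
        self._error = None           # the first (and only) captured exception
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            try:
                if task is None:         # sentinel: shut down
                    return
                if self._error is None:  # after a failure, skip everything else
                    task()
            except Exception as err:
                self._error = err
            finally:
                self._tasks.task_done()

    def schedule(self, task):
        # An earlier failure is reported as soon as the user thread comes back.
        if self._error is not None:
            raise self._error
        self._tasks.put(task)

    def wait(self):
        # Block until every scheduled task has run or been skipped.
        self._tasks.join()
        if self._error is not None:
            raise self._error

    def close(self):
        self._tasks.put(None)
        self._thread.join()
```

A task does not begin until the previous one ends, and once a task fails nothing after it is executed, so there is only ever one exception to report. The only looseness is between `schedule` returning and the task actually running.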

@agoose77
Collaborator

agoose77 commented Mar 4, 2022

> If you take the pair of them as one unit and have multiple units, then you can get concurrent behavior among the units

I think I see what you're saying. I believe we're using slightly different interpretations of "thread", which your point on "thread pairs" clarified - I am referring to a wheel-hub model, where we have (background) threads, and a single main thread (hence multiple threads = multiple background (concurrent) threads).

> With a ThreadPoolExecutor, you don't know which thread is going to run a given task, if it's going to be before or after or at the same time as another task,

Right, I ended up deleting this from my last comment, but that's where I draw the distinction - in the normal futures executor, tasks are consumed eagerly rather than scheduled.

> Our background thread (singular, always only one of these per main thread) executes its tasks in exactly the order given and one does not begin until the previous one ends.

Oh! Are you referring to a

graph LR;
Python --> shepherd
shepherd -->  w1(worker 1)
shepherd --> w2(worker 2)

model? I.e. there is a single "shepherd" thread that blockingly manages the GPU, the main thread that doesn't block (unless the user tries to resolve a future), and then N workers? That would explain why we're crossing wires!

> We gain that in a different way, from Dask.

Yes, of course. I was thinking about this finer grained parallelism having benefits, but on second thoughts there's no compelling case for it.

> Maybe a different word would help

Yes, if we only have one shepherd, then we're not concurrent. Shadow thread works pretty well to make the important features known. I'm 👍 on that.

@jpivarski
Member Author

That's it! (And that graph is really cool!) There's only one shepherd/shadow per user thread (and Dask can make multiple of those). The shepherd/shadow is controlling a CUDA stream, which internally has a lot of workers, though that's something that we only see through CUDA tools.

This is helping to improve the nomenclature. (@swishdiff, feel free to rename "Worker" as "Shepherd" or "Shadow." In the generality of src/awkward/_v2/_delayed.py, there's no "shepherding" because there's no GPU yet, but its main application will be pairing it with a GPU. Given that generality, maybe "Shadow" is best?)

@agoose77
Collaborator

agoose77 commented Mar 4, 2022

@jpivarski and I discussed this a little offline, and I realised that we still had slightly different ideas about how the shadow thread system would work.

I was picturing something like this,

graph LR;
Python <--> Shepherd
Shepherd <-->  w1(Worker 1)
Shepherd <--> w2(Worker 2)

all localised to the host. In my understanding, GPU streams would be communicated with from the workers, and the shepherd's role was something like a supervisor/scheduler.

But actually the "worker" here is the GPU stream. The worker-stream pair represents the mapping of each stream to a CPU thread:

graph LR;
Python --> w1
Python --> w2
w1("Worker 1 [Host]") -->  s1("Stream 1 [GPU]")
w2("Worker 2 [Host]") --> s2("Stream 2 [GPU]")

I don't want to put words into Jim's mouth, but I think the endpoint of our conversation is that Python (the main thread) can talk to multiple workers (shepherds) that keep GPU streams busy. These streams (and therefore workers) offer concurrency between one another, so we can (where data relationships permit) compute independent array operations in parallel.

@jpivarski
Member Author

> I don't want to put words into Jim's mouth, but I think the endpoint of our conversation is that Python (the main thread) can talk to multiple workers (shepherds) that keep GPU streams busy. These streams (and therefore workers) offer concurrency between one another, so we can (where data relationships permit) compute independent array operations in parallel.

Yes, that's right, and what I said about the main thread only having one worker/shadow/shepherd was me flaking out: I just hadn't thought of the fact that a main thread can run multiple of these without any problems. It has to send independent tasks to each (and maybe we'll need some way to make sure of that... maybe through the nplikes). But also, there must be exactly one GPU stream per worker/shadow/shepherd thread.
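
As a toy illustration of that layout (one main thread, several workers, each worker strictly sequential), the sketch below uses a `ThreadPoolExecutor(max_workers=1)` as a stand-in for each worker/shadow/shepherd, since a one-thread pool runs its tasks in submission order. This is only an assumption-laden model: there is no GPU here, the names `stream1` and `stream2` are placeholders for CUDA streams, and nothing enforces that the tasks sent to different workers are actually independent.

```python
from concurrent.futures import ThreadPoolExecutor

# One single-threaded executor per (pretend) stream: in order within a
# worker, potentially concurrent between workers.
workers = {name: ThreadPoolExecutor(max_workers=1) for name in ("stream1", "stream2")}
logs = {name: [] for name in workers}


def make_task(name, step):
    def task():
        # Each log list is only ever touched by its own worker thread.
        logs[name].append(step)
        return (name, step)
    return task


# The main thread interleaves submissions to the two workers.
futures = [
    workers[name].submit(make_task(name, step))
    for step in range(3)
    for name in workers
]

# Resolving a future is the only point where the main thread blocks.
results = [future.result() for future in futures]

for worker in workers.values():
    worker.shutdown()
```

Within each worker the steps run as 0, 1, 2; how the two workers interleave with each other is not determined.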
