Prevent MapperPipeline.complete() from infinitely retrying after pipeline is aborted #65

MattFaus
Contributor

Summary:
I was investigating this error:
https://www.khanacademy.org/devadmin/errors/8bb7cca3

I learned that error-monitor-db will match on ANY of the 3 IDs, and one of the IDs is the last 3 words of the first line of the error message, in this case "not yet filled". So the full error message is actually this:

Slot with name "job_id", key "ag5zfmtoYW4tYWNhZGVteXI3CxIRX0FFX1BpcGVsaW5lX1Nsb3QiIGYzNDM1MGNjYWQ1ZjExZTNiYjI3MzkzOWFjYTNhOWZlDA" not yet filled.
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1536, in __call__
    rv = self.handle_exception(request, response, e)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1530, in __call__
    rv = self.router.dispatch(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1102, in __call__
    return handler.dispatch()
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 570, in dispatch
    return method(*args, **kwargs)
  File "/base/data/home/apps/s~khan-academy/batch:150418-1603-73b540bd3651.383702003414263966/third_party/mapreduce/lib/pipeline/pipeline.py", line 2671, in get
    callback_result = stage._callback_internal(kwargs)
  File "/base/data/home/apps/s~khan-academy/batch:150418-1603-73b540bd3651.383702003414263966/third_party/mapreduce/lib/pipeline/pipeline.py", line 1041, in _callback_internal
    return self.callback(**kwargs)
  File "/base/data/home/apps/s~khan-academy/batch:150418-1603-73b540bd3651.383702003414263966/third_party/mapreduce/mapper_pipeline.py", line 107, in callback
    mapreduce_id = self.outputs.job_id.value
  File "/base/data/home/apps/s~khan-academy/batch:150418-1603-73b540bd3651.383702003414263966/third_party/mapreduce/lib/pipeline/pipeline.py", line 194, in value
    % (self.name, self.key))
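
As a rough illustration of why this error was grouped under "not yet filled", here is a minimal sketch of deriving an ID from the last three words of the first line of a message. The function name `last_three_words` and the punctuation stripping are assumptions made for illustration; error-monitor-db's actual implementation may differ.

```python
import string


def last_three_words(error_message):
    """Return the last three words of the first line, sans punctuation.

    Hypothetical helper: only approximates the matching rule described
    in this PR, not error-monitor-db's real code.
    """
    lines = error_message.splitlines()
    if not lines:
        return ""
    words = lines[0].split()[-3:]
    # Strip leading/trailing punctuation such as the final period.
    return " ".join(word.strip(string.punctuation) for word in words)


message = (
    'Slot with name "job_id", key "example-key" not yet filled.\n'
    "Traceback (most recent call last):\n"
)
print(last_three_words(message))  # prints: not yet filled
```

Because that ID ignores the slot key, every unfilled-slot error collapses into the same bucket, regardless of which pipeline produced it.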

Investigating in devshell, I found that this _SlotRecord belonged to a pipeline that had been aborted due to intermittent datastore problems:

dbkey = db.Key("ag5zfmtoYW4tYWNhZGVteXI3CxIRX0FFX1BpcGVsaW5lX1Nsb3QiIDgyM2U1MmYwZTllMTExZTM4ZDhlOTk4YjA1Y2VjYjBmDA")
sr = _SlotRecord.get(dbkey)

mattfaus@.. [68]: sr.status
khan-academy.appspot.com [68]: u'waiting'

mattfaus@.. [72]: sr.root_pipeline.root_pipeline.status
khan-academy.appspot.com [72]: u'aborted'

mattfaus@.. [82]: sr.root_pipeline.root_pipeline.key().to_path()
khan-academy.appspot.com [82]: [u'_AE_Pipeline_Record', u'75aedcb5e9e011e3bf77b331629cd373']

http://www.khanacademy.org/_ah/pipeline/status?root=75aedcb5e9e011e3bf77b331629cd373

To reproduce this error locally, I inserted an exception into MapperPipeline.run(). Before this change, the callback task would retry indefinitely; after this change, it stops retrying. I debated between this fix and changing the upstream code to simply not call the callback when a pipeline is aborted, but based on this comment in Pipeline.finalized(), I think checking was_aborted is the intended pattern:

Implementors be sure to call 'was_aborted' to find out if the finalization
that you're handling is for a success or error case.
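
The pattern can be sketched with stand-in classes. This is a minimal illustration, not the actual patch: `Slot`, `SlotNotFilledError`, `UnguardedMapper`, `GuardedMapper`, and `run_with_retries` are hypothetical names that only mimic the behavior described above, and the toy queue caps retries at five attempts, whereas the real task kept retrying.

```python
class SlotNotFilledError(Exception):
    """Stand-in for the error raised when reading an unfilled slot."""


class Slot(object):
    """Hypothetical minimal slot: reading .value before fill() raises."""

    def __init__(self, name):
        self.name = name
        self._filled = False
        self._value = None

    def fill(self, value):
        self._value = value
        self._filled = True

    @property
    def value(self):
        if not self._filled:
            raise SlotNotFilledError(
                'Slot with name "%s" not yet filled.' % self.name)
        return self._value


class UnguardedMapper(object):
    """Mimics the pre-fix behavior: callback always reads job_id."""

    def __init__(self, was_aborted):
        self.was_aborted = was_aborted
        self.job_id = Slot("job_id")

    def callback(self):
        return self.job_id.value


class GuardedMapper(UnguardedMapper):
    """Mimics the fix: return early when the pipeline was aborted."""

    def callback(self):
        if self.was_aborted:
            return None  # Nothing to finalize; job_id was never filled.
        return self.job_id.value


def run_with_retries(callback, max_attempts=5):
    """Toy task queue: retry while the callback raises.

    Returns (attempts_used, succeeded).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            callback()
            return attempt, True
        except SlotNotFilledError:
            continue
    return max_attempts, False


print(run_with_retries(UnguardedMapper(was_aborted=True).callback))  # (5, False)
print(run_with_retries(GuardedMapper(was_aborted=True).callback))    # (1, True)
```

In this sketch the unguarded callback fails on every attempt because the aborted pipeline never filled `job_id`, while the guarded callback returns on the first attempt, so the task is not retried.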

Test Plan:

  1. Modify MapperPipeline.run() as mentioned above
  2. Run a mapper pipeline
  3. Verify that there are no infinitely retried callback tasks

and for testing in prod:

  1. Deploy
  2. Watch exercise-summary-queue drain

Differential Revision: https://phabricator.khanacademy.org/D18378

tkaitchuck added a commit that referenced this pull request Jun 24, 2015
Prevent MapperPipeline.complete() from infinitely retrying after pipeline is aborted
@tkaitchuck tkaitchuck merged commit 5c8524d into GoogleCloudPlatform:master Jun 24, 2015