Prevent MapperPipeline.complete() from infinitely retrying after pipeline is aborted #65

MattFaus · 2015-06-11T19:49:10Z

Summary:
I was investigating this error:
https://www.khanacademy.org/devadmin/errors/8bb7cca3

I learned that error-monitor-db will match on ANY of the 3 IDs, and one of the IDs is the last 3 words of the first line of the error message, in this case "not yet filled". So, the error message is actually this:

"Slot with name ""job_id"", key ""ag5zfmtoYW4tYWNhZGVteXI3CxIRX0FFX1BpcGVsaW5lX1Nsb3QiIGYzNDM1MGNjYWQ1ZjExZTNiYjI3MzkzOWFjYTNhOWZlDA"" not yet filled.
Traceback (most recent call last):
  File ""/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py"", line 1536, in __call__
    rv = self.handle_exception(request, response, e)
  File ""/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py"", line 1530, in __call__
    rv = self.router.dispatch(request, response)
  File ""/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py"", line 1278, in default_dispatcher
    return route.handler_adapter(request, response)
  File ""/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py"", line 1102, in __call__
    return handler.dispatch()
  File ""/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py"", line 572, in dispatch
    return self.handle_exception(e, self.app.debug)
  File ""/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py"", line 570, in dispatch
    return method(*args, **kwargs)
  File ""/base/data/home/apps/s~khan-academy/batch:150418-1603-73b540bd3651.383702003414263966/third_party/mapreduce/lib/pipeline/pipeline.py"", line 2671, in get
    callback_result = stage._callback_internal(kwargs)
  File ""/base/data/home/apps/s~khan-academy/batch:150418-1603-73b540bd3651.383702003414263966/third_party/mapreduce/lib/pipeline/pipeline.py"", line 1041, in _callback_internal
    return self.callback(**kwargs)
  File ""/base/data/home/apps/s~khan-academy/batch:150418-1603-73b540bd3651.383702003414263966/third_party/mapreduce/mapper_pipeline.py"", line 107, in callback
    mapreduce_id = self.outputs.job_id.value
  File ""/base/data/home/apps/s~khan-academy/batch:150418-1603-73b540bd3651.383702003414263966/third_party/mapreduce/lib/pipeline/pipeline.py"", line 194, in value
    % (self.name, self.key))
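
As a rough illustration of the "last 3 words" ID described above, here is a minimal sketch. The function name `last_three_words_id` and the stripping of the trailing period are assumptions made for this example; this is not error-monitor-db's actual code.

```python
def last_three_words_id(error_message):
    """Return the last three words of the first line of an error message.

    Hypothetical helper: it only illustrates why unrelated errors that
    end in the same three words could be grouped under one ID.
    """
    lines = error_message.splitlines()
    if not lines:
        return ""
    words = lines[0].split()
    # Assumption: trailing punctuation is dropped, so "filled." -> "filled".
    return " ".join(words[-3:]).rstrip(".")


message = (
    'Slot with name "job_id", key "abc" not yet filled.\n'
    "Traceback (most recent call last):"
)
print(last_three_words_id(message))
```

Any error whose first line ends in the same three words would get the same ID under this scheme, which is consistent with the linked error page grouping this traceback under "not yet filled".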

Investigating in devshell, I found that this _SlotRecord belonged to a pipeline that had been aborted due to intermittent datastore problems:

dbkey = db.Key("ag5zfmtoYW4tYWNhZGVteXI3CxIRX0FFX1BpcGVsaW5lX1Nsb3QiIDgyM2U1MmYwZTllMTExZTM4ZDhlOTk4YjA1Y2VjYjBmDA")
sr = _SlotRecord.get(dbkey)

mattfaus@.. [68]: sr.status
khan-academy.appspot.com [68]: u'waiting'

mattfaus@.. [72]: sr.root_pipeline.root_pipeline.status
khan-academy.appspot.com [72]: u'aborted'

mattfaus@.. [82]: sr.root_pipeline.root_pipeline.key().to_path()
khan-academy.appspot.com [82]: [u'_AE_Pipeline_Record', u'75aedcb5e9e011e3bf77b331629cd373']

http://www.khanacademy.org/_ah/pipeline/status?root=75aedcb5e9e011e3bf77b331629cd373

To reproduce this error locally, I inserted an exception into MapperPipeline.run(). Before this change, the callback task would retry indefinitely; after this change, it stops retrying. I debated between this fix and changing the upstream code to simply not call the callback when a pipeline is aborted, but based on this comment in Pipeline.finalized(), I think checking was_aborted is the intended pattern:

Implementors be sure to call 'was_aborted' to find out if the finalization
that you're handling is for a success or error case.
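
The guard pattern that comment describes can be sketched as follows. This is a simplified, self-contained stand-in rather than the actual diff: `FakeSlot`, `FakeMapperPipeline`, and `SlotNotFilledError` are hypothetical names that only mimic the behavior visible in the traceback above, where reading `self.outputs.job_id.value` raises because the slot was never filled.

```python
class SlotNotFilledError(Exception):
    """Stand-in for the error raised when an unfilled slot is read."""


class FakeSlot(object):
    """Hypothetical, minimal imitation of a pipeline output slot."""

    def __init__(self, name):
        self.name = name
        self.filled = False
        self._value = None

    @property
    def value(self):
        if not self.filled:
            raise SlotNotFilledError(
                'Slot with name "%s" not yet filled.' % self.name)
        return self._value


class FakeMapperPipeline(object):
    """Hypothetical pipeline with only the pieces needed for the sketch."""

    def __init__(self, was_aborted):
        self.was_aborted = was_aborted
        self.job_id = FakeSlot("job_id")
        self.completed = False

    def callback(self):
        # The guard: an aborted pipeline never filled job_id, so return
        # before reading the slot instead of raising on every retry.
        if self.was_aborted:
            return
        mapreduce_id = self.job_id.value  # raises if the slot is unfilled
        self.completed = mapreduce_id is not None


aborted = FakeMapperPipeline(was_aborted=True)
aborted.callback()  # returns quietly; nothing left to retry
```

Without the guard, the same call raises `SlotNotFilledError`, and a task queue that retries on failure would keep re-running the callback.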

Test Plan:

1. Modify MapperPipeline.run() as mentioned above
2. Run a mapper pipeline
3. Verify that there are no infinitely retried callback tasks

and for testing in prod:

1. Deploy
2. Watch exercise-summary-queue drain
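
The local check in the test plan can be approximated without App Engine by simulating a queue that retries a task until it stops raising. Everything here is a hypothetical sketch: `run_with_retries` and both callbacks are invented for illustration, and `max_attempts` exists only so the example terminates, whereas the failing task described in this PR kept retrying.

```python
def run_with_retries(task, max_attempts):
    """Call task until it returns without raising.

    Returns (attempts_used, succeeded). The cap is a safety limit for
    this sketch only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return attempt, True
        except Exception:
            # A retrying queue would reschedule the task here.
            continue
    return max_attempts, False


def unguarded_callback():
    # Before the change: reading the unfilled slot raises every time.
    raise RuntimeError('Slot with name "job_id" not yet filled.')


def guarded_callback(was_aborted=True):
    # After the change: an aborted pipeline returns before the slot read.
    if was_aborted:
        return
    raise RuntimeError('Slot with name "job_id" not yet filled.')


print(run_with_retries(unguarded_callback, 5))
print(run_with_retries(guarded_callback, 5))
```

The unguarded callback exhausts every attempt, while the guarded one succeeds on the first call, which is the behavior step 3 of the local plan is meant to confirm.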

Differential Revision: https://phabricator.khanacademy.org/D18378

tkaitchuck added a commit that referenced this pull request Jun 24, 2015

Merge pull request #65 from Khan/fix_infinite_retry_in_mapper_pipeline

5c8524d

tkaitchuck merged commit 5c8524d into GoogleCloudPlatform:master Jun 24, 2015
