Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand monitoring to include Tale launching #7

Open
craig-willis opened this issue Feb 5, 2018 · 8 comments · May be fixed by #27
Open

Expand monitoring to include Tale launching #7

craig-willis opened this issue Feb 5, 2018 · 8 comments · May be fixed by #27
Assignees
Milestone

Comments

@craig-willis
Copy link
Collaborator

craig-willis commented Feb 5, 2018

This is the kind of error that's reported most

  • Launch Ligo tale
  • Run
  • Confirm output
  • Stop it
  • Every hour or N hours
  • Setup a monitoring user
@craig-willis
Copy link
Collaborator Author

@craig-willis
Copy link
Collaborator Author

@Xarthisius This has been running on staging now for several days and I'm seeing at least one error per day during the create call-- occasionally 400 and 401, and more often 500. Looking at the staging girder error log, it's not immediately clear what the cause is:

[2018-06-26 14:17:30,663] ERROR: 500 Error
Traceback (most recent call last):
  File "/girder/girder/api/rest.py", line 620, in endpointDecorator
    val = fun(self, args, kwargs)
  File "/girder/girder/api/rest.py", line 1204, in POST
    return self.handleRoute(method, path, params)
  File "/girder/girder/api/rest.py", line 947, in handleRoute
    val = handler(**kwargs)
  File "/girder/girder/api/access.py", line 63, in wrapped
    return fun(*args, **kwargs)
  File "/girder/girder/api/describe.py", line 702, in wrapped
    return fun(*args, **kwargs)
  File "/girder/plugins/wholetale/server/rest/instance.py", line 166, in createInstance
    save=True)
  File "/girder/plugins/wholetale/server/models/instance.py", line 147, in createInstance
    volume = volumeTask.get(timeout=TASK_TIMEOUT)
  File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 191, in get
    on_message=on_message,
  File "/usr/local/lib/python3.5/dist-packages/celery/backends/async.py", line 190, in wait_for_pending
    return result.maybe_throw(callback=callback, propagate=propagate)
  File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 296, in maybe_throw
    self.throw(value, self._to_remote_traceback(tb))
  File "/usr/local/lib/python3.5/dist-packages/celery/result.py", line 289, in throw
    self.on_ready.throw(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/vine/promises.py", line 217, in throw
    reraise(type(exc), exc, tb)
  File "/usr/local/lib/python3.5/dist-packages/vine/five.py", line 179, in reraise
    raise value
celery.backends.base.ValueError
Additional info:
  Request URL: POST https://girder.stage.wholetale.org/api/v1/instance
  Query string: taleId=59f0b91584b7920001b46f2e
  Remote IP: 10.0.0.2

I've seen alerts at 4:19AM and 9:18AM today.

The Girder error logs also have a surprising number of these:

[2018-06-26 13:17:12,728] ERROR: INFO:girder:Booting server at [https://tmp-rs9nxumk3tuf.stage.wholetale.org/?token=9e9e0a4366fc4131adf9fddc811f9ad8], getting HTTP status [404]

and these

[2018-06-26 11:00:22,466] ERROR: INFO:girder:Running DM file GC

But since they are INFO level, maybe not real errors?

@Xarthisius
Copy link
Contributor

9:18 AM CDT?
Those two "errors" can be ignored. I don't know why they're flagged as errors...

@craig-willis
Copy link
Collaborator Author

Sorry, yes 9:18 AM CDT was the alert time from check_mk.

@Xarthisius
Copy link
Contributor

Looks like ValueError is raised during parsing girder_token:

[2018-06-26 14:17:30,537: INFO/MainProcess] Received task: gwvolman.tasks.create_volume[1389d137-cec7-4619-aec6-293f255def26]  
[2018-06-26 14:17:30,540: WARNING/ForkPoolWorker-1] [2018-06-26 14:17:30,540] WARNING: No jobInfoSpec. Setting job_manager to None.
[2018-06-26 14:17:30,540: WARNING/ForkPoolWorker-1] No jobInfoSpec. Setting job_manager to None.
[2018-06-26 14:17:30,661: WARNING/ForkPoolWorker-1] Bad gider token
[2018-06-26 14:17:30,664: ERROR/ForkPoolWorker-1] Task gwvolman.tasks.create_volume[1389d137-cec7-4619-aec6-293f255def26] raised unexpected: ValueError()
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/girder_worker/task.py", line 148, in __call__
    results = super(Task, self).__call__(*_t_args, **_t_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "gwvolman/tasks.py", line 35, in create_volume
    gc, user, tale = _parse_request_body(payload)
  File "gwvolman/utils.py", line 88, in _parse_request_body
    raise ValueError
ValueError

relevant code:

https://github.com/whole-tale/gwvolman/blob/master/gwvolman/utils.py#L80-L86

either payload was mangled somehow (expired token?) or worker failed to connect back to girder to get user info. I set up flower, so that we can see what's being passed as payload into the tasks and rule out the former.

@craig-willis
Copy link
Collaborator Author

Thank you. This happens so infrequently that we probably won't see anything until tomorrow AM.

The problem does correct itself (i.e., during the next check, things are OK). The check-tale code exchanges an API key for token during each call, so it seems unlikely that it would expire, but maybe I've done something wrong.

@craig-willis
Copy link
Collaborator Author

Got one. From check_mk:

Host wt-stage-00 (wt-stage-00)
Service Tale
Event OK → CRITICAL
Address 149.165.168.151
Date / Time Wed Jun 27 18:51:30 CDT 2018
Plugin Output CRIT - CRITICAL - error creating tale
Host Metrics rta=10.073ms;200.000;500.000;0; pl=0%;80;100;; rtmax=10.236ms;;;; rtmin=9.955ms;;;;
Service Metrics status=500;;;;

Looking at the Flower dashboard, the payload looks OK:

[{
  'apiUrl': 'https://girder.stage.wholetale.org/api/v1',
  'taleId': '59f0b91584b7920001b46f2e',
  'workspaceId': '59f0b91584b7920001b46f2f', 
  'girder_token': 'TOKEN', ' 
  `api_version': '2.0'
}]

I was able to curl with the token:

curl -H "Girder-Token: TOKEN " https://girder.stage.wholetale.org/api/v1/user/me

{
   "_accessLevel": 2, 
   "_id": "5ade46db715d3400011e9524", 
   "_modelType": "user",
   "email": "[email protected]", 
   "firstName": "Monitoring", 
   ...
}

So this seems to point to a possible networking problem, which is unsurprising. I'm going to reduce the check time to every 10 minutes to see how pervasive this is.

@craig-willis
Copy link
Collaborator Author

Also received 401 on POST instance at ~5am CDT this morning

@craig-willis craig-willis modified the milestones: 0.4, 0.5 Jul 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants