Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seems supervisor did not restart the cfjsonlog.pl script! #10

Closed
geoffmcl opened this issue Aug 24, 2016 · 11 comments
Closed

Seems supervisor did not restart the cfjsonlog.pl script! #10

geoffmcl opened this issue Aug 24, 2016 · 11 comments

Comments

@geoffmcl
Copy link
Member

Hi Pete,

The cflog script was running good, under supervisor,
for a few days, so I stopped monitoring it for a few
days... and BANG ;=))

It stopped on 2016-08-21, and for some reason
supervisor did not restart the script ????????

When I entered $ sudo supervisorctl it showed an
error line -

cflog          FATAL    Exited too quickly (process log may have details)

Anyway, I restarted cflog, and it is now outputting
flights-2016-08-24.csv ... over the coming days I will
fill in the missing logs from my RPI2 collection, so
no problem...

But just a question! Why did supervisor not
restart the cflog job?

The super config $ /etc/supervisor/conf.d/cflog.conf
already contains -

autostart=true
autorestart=true

Is there anything else we need to do to make
supervisor restart the script if it fails,
for any reason...

I note the startretries=3. If super restarts the
script too quickly, maybe the json fetch problem,
the only known reason it stops, has not yet
recovered...

Is there are way to config super to wait say
20 to 25 seconds before the restart... then 3
retries over the next minute or so might make
sense...

But am concerned that if super restarts the script
virtually immediately, then it may be easy to get 3
similar fetch errors within the first few seconds...

Will keep monitoring the situation...

Regards,
Geoff.

@pedromorgan
Copy link
Member

Have a look at the stopwaitsecs and startsecs... here http://supervisord.org/configuration.html#program-x-section-settings

Not sure what to do as am not familar with the stop signals for perl..

Also shall I create the supervisor log in /home/geoff/log/* ?? or alike

@geoffmcl
Copy link
Member Author

Hi Pete,

Thanks for the hints, but...

stopwaitsecs, default 10, seems to be more about
when supervisor, aka SV, wants to stop a process.
How long to wait for exit. That does not seem a problem
here...

startsecs, default 10, seems to be a measure of
how long the process should run before it is considered
running ok... so to me the default of 10 secs is ok...

No, this is about when SV detects the process has
stopped, and when to restart it...

As stated, if there is a tiny glitch in the json http
fetch, it is very likely that if SV restarts the
script immediately, that the problem will still be
there...

BUT, if SV waits 2-5 secs before restarting the
script, then there is a good chance the very momentary
json fetch problem will have cleared itself... that's
all...

Is there a way to say to SV - take it easy kid! The
process stopped. Leave it dead for 2-5 secs before
restarting it... Seems like a reasonable option...
but can not find it...

If there is a temporary problem, do not try to
start the process immediately... wait some time
for the problem to resolve itself... then restart
the process...

I do not know how quickly SV detects a process down!

But is SV is very attentive... maybe too attentive...
and restarts the process more that 3 times within the
next second, then I can see this can be a problem!

As you know, the json string in crossfeed is only
updated each seconds... If a json fetch fails,
then trying again, several times, before at least 1,
or more, seconds have passed, will probably mean more
failures... calm down, and wait a little, and all
may, will be well...

At some point, I will get around to moving this
logic into the script... If a json fetch fails...
do not panic, and exit... sleep/wait 5 seconds... and
calmly read again... And let this happen say 3 to
10 times before actually bailing...

Yes, I have created a /home/geoff/log folder you can
write a log to, but not sure this is going to solve
the above maybe SV is too fast, too anxious problem...

And I too am not famiar with the stop signals for perl,
except I would believe they are very similar across
all script languages - perl, python, etc, etc... But
really not sure how this is relevant...

As stated, I will get around to adding this calmness
to the script itself... some day soon... actually waiting
the the next RPI2 acquisition to break, to exactly
narrow this down... and work on it...

All it seems I want at the moment is for SV to take a
little time off ;=)) calm down... take a sip of red
wine... a short nap... and then autorestart the
process...

Regards,
Geoff.

@pedromorgan
Copy link
Member

Supervisor probably does not do what we want.. It designed to keep processes running, ie zero downtime

@pedromorgan
Copy link
Member

Why is the script crashing anyway.. ?

@pedromorgan
Copy link
Member

Another quick hack would be to launch from a bash.script and make is sleep for X seconds before kicking off the pl script

@geoffmcl
Copy link
Member Author

Hi Pete,

Ok, understand supervisor is a tool to keep
maximum up-time, and it seems to do a good
job of that... so no problem there...

Yes, I too thought of the quick hack to
run the pl from a bash script that first
slept for a few seconds...

But as stated I am waiting for this to
happen again on RPI2... I think I know
which pl line to fix, but just want to be sure...
and if I am right, should be an easy fix...

See lines -
https://github.com/geoffmcl/scripts/blob/master/cfjsonlog.pl#L346-L348

That service fetch_url() can return an undefined,
or some truncated, perhaps not valid json string
and it is in the json decode that bombs...

I could add some checks here, and if not a flights.json
string, just sleep the script for a few seconds, and quietly
try again... for say at least the next minute
or so...

But, as usual, it has been running several
days/weeks without problem... and I am sure
when it breaks in RPI2, it will probably break
in m2 at the same time...

I guess it may go back to a fault in
crossfeed (cf_client) code, on the rare
occasion, there are no flights to report,
due to the cf packet filtering...

Maybe at that moment cf forms some bad
json, or 'nothing', or something like that...
need also to look deeper into this...

See service Write_JSON(), from line 532 -
https://gitlab.com/fgtools/crossfeed/blob/master/src/cf_pilot.cxx#L532

This service establishes the next json string
to return... note it uses pvList->size(), but
skips expired pilots... if this is zero, at
that time what exactly does the http json GET
request return?

As stated it seems quite a rare event... but
am on its tail ;=))

And as I now feel Mr SuperVisor is doing his
job, will close this issue... ;=))

Thanks,
Geoff.

@geoffmcl
Copy link
Member Author

Ok, it broke again in RPI2. exactly where predicted... BUT, not in m2!!!!

Maybe the auto restart 1, 2, ... were already in the next crossfeed 1 second cycle...

Made a kludge fix in the pl script, and that is to wait 5 seconds before doing the first fetch... there really is no big rush here... on start up...

Added some more precise timing... just debug stuff...

This is a quick fix, while I look deeper into avoiding it totally ;=))

Also decided to cycle the log to .1, .2, ... on restart, to be able to look back... but this presents a log cleanup to be done later...

TODO: Get the CSV files synced...
TODO: Add a log cycling system

On to the next steps...

@pedromorgan
Copy link
Member

had a quick look at script... and seems strange to be that

  • fetch_url does not return error
  • and decoding I cant figure out,. is that a perl package...

@pedromorgan
Copy link
Member

@geoffmcl adding issues on crossfeed-logger, re analysis..

Issue #1 is your TODO ie CSV files synced freeflightsim/crossfeed-logger#1

@pedromorgan
Copy link
Member

WHAW...
I knew I been here before..

https://github.com/freeflightsim/fgxgobot

@geoffmcl
Copy link
Member Author

@pedromorgan yes, I certainly could add some error return to fetch_url, and maybe also to the JSON perl package decode, but haven't yet...

This did start out just as a WIP test to get an updated MODEL use, but has grown since then...

And as mentioned still a TODO to sync the raw CSV files... presently missing 22 and 23, and 24 is truncated... will get to this, hopefully soon ;=))

And I had not forgotten your experimental FG Multi Player Network Bot written in Go WIP of a few years back... But personally, have no interest in replacing FGMS and/or crossfeed C++ programs with Go ;=))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants