Device freezes #304

Closed
hph304 opened this issue May 8, 2022 · 211 comments
Labels
bug Something isn't working
Milestone

Comments

@hph304

hph304 commented May 8, 2022

I have a 7" OpenVario with sensorboard. After anywhere between 5 seconds and 20 minutes of operation, the screen goes black (or white, or green, or blue...) and the whole device stops responding. I am running image 21118 for the CH070 screen.

After installing 17119, the issue seems to be gone, so there must be something wrong on the software side. I can pull some logs if that helps, but I will need some guidance on where to find them on the device.

@hph304
Author

hph304 commented May 17, 2022

I tried v22086, which produces the same issue. The only stable version for me is 17119.

@OBrown92

We have the same issue in our club glider. We powered the device with a separate battery (new) but it crashes after about 2 hours. 17119 works fine.

@linuxianer99
Member

@OBrown92: Which type of system do you have? Also CH070?

@OBrown92 @hph304: I assume XCSoar is running during these 2 hours?
Can you try just leaving the system in the start menu? (I want to eliminate the main issues ...)

@mihu-ov
Member

mihu-ov commented May 19, 2022

We should also eliminate any potential hardware issues.

@hph304 @OBrown92
Can you please elaborate on your OV hardware? DIY soldered or SteFly? Which type of DC/DC converters? Old images with 3.4 kernel had lower power consumption which may make a difference on some systems.

@OBrown92

We have the same issue in two gliders with SteFly OVs. We haven't modified the DC/DC converter yet, but it seems to happen with a modified one as well (reported in the XCSoar forum). At first we thought it was a power issue because it seems to happen when the radio transmits or receives, but we power it entirely from one battery, so this shouldn't be the issue.
We are at a competition next week; if the weather is bad I can test some things, like keeping it in the start menu.

@DanD222
Contributor

DanD222 commented May 19, 2022

Do we know at what current the resettable fuse kicks in on the SteFly OV?

@OBrown92

Good question. I don't know exactly if there is one or where it is. We have a spare SteFly OV, so I can try to figure it out.

@hph304
Author

hph304 commented May 19, 2022

I can leave it running tomorrow for 2 hours or longer. Which start menu do you mean, the OV menu? Mine is a DIY, not sure about the converter but will open it up and have a look if I can find a part number.

My device is not yet built in. I have connected it to 2 different PSUs (Basetech and Delta). The device draws about 0.58 A during operation with the current image. Once the device freezes, the current jumps to about 0.7 A and stays there.

With image 17119 it draws 0.45 A.
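
To put these readings in proportion, the relative increase can be worked out directly. This is a minimal sketch that assumes the supply voltage was the same for all three measurements, so that the ratio of currents equals the ratio of input power; `percent_increase` is a helper name chosen for illustration.

```python
def percent_increase(baseline: float, value: float) -> float:
    """Return by how many percent `value` exceeds `baseline`."""
    return (value - baseline) / baseline * 100.0


# Currents in amperes, as reported in the comment above.
I_IMAGE_17119 = 0.45   # image 17119
I_CURRENT = 0.58       # current image, normal operation
I_FROZEN = 0.70        # current image, after the freeze

print(f"current image vs 17119: +{percent_increase(I_IMAGE_17119, I_CURRENT):.1f} %")
print(f"frozen vs normal:       +{percent_increase(I_CURRENT, I_FROZEN):.1f} %")
```

With these numbers the current image draws roughly 29 % more than 17119, and the frozen state roughly 21 % more than normal operation.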

@tb59427
Contributor

tb59427 commented May 20, 2022

I seem to have the same issue with my vanilla Stefly 57. Should I piggy back on this issue or do you want me to open an additional one?

@linuxianer99
Member

I think we should approach this systematically... Maybe create a table where everyone can add their affected configuration, so we can eliminate some possibilities.

Data we would need in the table:

Hardware variant (CH070, PQ070, etc.).
Is the device DIY or bought ready-built?
Image version used (21xxx, 22xxx, etc.).
In which state does the freeze happen (XCSoar running, or just in the text console (ov-menu))?
Are there any external effects (e.g. radio TX (EMC))?

Advanced debug:
Is the console still accessible (only graphics/XCSoar frozen), or are the network and serial port frozen as well?
Is there any kernel panic in the logs? (can be read out at next reboot)
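
One way to keep such reports comparable is to give every entry the same fields. The following is only a sketch of a record type mirroring the list above; the class name `FreezeReport`, the field names, and the example values are invented for illustration and are not taken from any real report.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class FreezeReport:
    hardware_variant: str                       # e.g. "CH070", "PQ070"
    ready_built: bool                           # False for DIY builds
    image_version: str                          # e.g. "21118", "22086"
    freeze_state: str                           # "xcsoar" or "ov-menu"
    external_effects: str = "none"              # e.g. "radio TX"
    console_accessible: Optional[bool] = None   # None = not checked yet
    kernel_panic_logged: Optional[bool] = None  # None = not checked yet


def to_row(report: FreezeReport) -> list:
    """Flatten a report into one table row, marking unchecked items."""
    return ["unknown" if v is None else str(v) for v in asdict(report).values()]


# Example values only, not a real report.
print(to_row(FreezeReport("CH070", False, "22086", "xcsoar")))
```

Leaving the two advanced-debug fields as `None` until they have been checked keeps "not tested" distinct from "tested, no".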

@tb59427
Contributor

tb59427 commented May 20, 2022

Fair point, linuxianer99, so here we go

Data we would need in the table:
Hardware variant (CH070, PQ070, etc.): SteFly 5.7 inch, no sensor board
Is the device DIY or bought ready-built?: kind of half. Pre-built by SteFly and completed by me (didn't change anything). Plus rotary encoder from SteFly
Image version used (21xxx, 22xxx, etc.): 22086 as linked on Stefan's web page. variod disabled
In which state does the freeze happen (XCSoar running, or just in the text console (ov-menu)): so far I never tried staying in the menu or console but always started XCSoar, which eventually froze after anything from a few minutes to a couple of hours. Will check with just the console or menu running over the weekend
Are there any external effects (e.g. radio TX (EMC))?: No. The system is not in the glider yet. It sits on my desk and is powered via a 12 V transformer

Advanced debug:
Is the console still accessible (only graphics/XCSoar frozen), or are the network and serial port frozen as well?: the system is completely frozen
Is there any kernel panic in the logs? (can be read out at next reboot): will check after the next crash. Should be in /var/log, no?

So far the OV has been connected via /dev/ttyS1 to an XCVario (with a Flarm connected). I have now changed this to Wi-Fi with a USB Wi-Fi dongle.

Will test / have tested the following cases (may take some time, will update accordingly):
(a) connected via /dev/ttyS1 and XCSoar running: system freezes [x]
(b) connected via /dev/ttyS1 and menu running: system freezes [to be checked]
(c) connected via /dev/ttyS1 and shell running: system freezes [to be checked]
(d) connected via Wi-Fi and XCSoar running: system freezes [to be checked]
(e) connected via Wi-Fi and menu running: system freezes [to be checked]
(f) connected via Wi-Fi and shell running: system freezes [to be checked]

@tb59427
Contributor

tb59427 commented May 20, 2022

I have created a little spreadsheet on Google Docs (not sure everyone here likes Google, but it was the easiest for the moment) where people can document their freezes along the lines of @linuxianer99's request. Feel free to add yourself. Also, feel free to spread the word to OV users who are not reading this thread.

@lordfolken
Contributor

Please start with the minimum for testing: no devices connected and just the menu, then work up from there. Also re-enable the serial console and look at the last messages.

@tb59427
Contributor

tb59427 commented May 20, 2022

How do I enable serial console?

@linuxianer99
Member

linuxianer99 commented May 21, 2022

How do I enable serial console?

It's always enabled ... just connect to the Cubieboard's Serial 0 port.

@tb59427
Contributor

tb59427 commented May 21, 2022

Preliminary results are in the sheet now. Seems my USB-serial adaptor is broken. Ordered a new one and will retest with console open.

@mihu-ov
Member

mihu-ov commented May 23, 2022

Before you trash your USB-serial adaptor, note that only error messages show up on the serial console after #265.
The constant stream of error messages is no more.

@tb59427
Contributor

tb59427 commented May 23, 2022

Thanks for the hint, @mihu-ov, but once I had it connected to the OV I remembered that it already wasn't working properly a year ago, when I was debugging a pfSense router on a Firebox.

115kBaud, 8N1 is the correct setting, though, isn't it?

@mihu-ov
Member

mihu-ov commented May 23, 2022

@lordfolken
Contributor

Do we have a prompt on ttyS0? If so, you can log in and run journalctl -f to get all the log messages. Then let that run until it freezes and look at the last output.
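
Following the journal until the freeze only helps if the last lines survive the hang. Below is a minimal sketch of one way to do that, assuming Python 3 is available on the device (the thread does not confirm this). The helper names `persist_lines` and `follow_journal` and the log path in the usage comment are made up for illustration; `journalctl -f` and `--no-pager` are the only external commands and options used.

```python
import os
import subprocess
from typing import Iterable, TextIO


def persist_lines(lines: Iterable[str], out: TextIO) -> int:
    """Write each line and force it to disk so it survives a hard freeze."""
    count = 0
    for line in lines:
        out.write(line if line.endswith("\n") else line + "\n")
        out.flush()              # push Python's buffer to the OS
        os.fsync(out.fileno())   # ask the OS to write it to the medium
        count += 1
    return count


def follow_journal(path: str) -> None:
    """Stream `journalctl -f` into `path` until the process is stopped."""
    proc = subprocess.Popen(
        ["journalctl", "-f", "--no-pager"],
        stdout=subprocess.PIPE,
        text=True,
    )
    with open(path, "a", encoding="utf-8") as out:
        persist_lines(proc.stdout, out)


# Usage on the device (hypothetical path):
#   follow_journal("/home/root/freeze.log")
```

Forcing every line to disk costs extra SD card writes, so this is something to run only while hunting the freeze.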

@tb59427
Contributor

tb59427 commented May 23, 2022

Based on @mihu-ov's statement re: #265, it doesn't look like there's a login on that port... Provided the necessary packages are built into the OV kernel and distro, we could possibly enable it and do as you suggested, @lordfolken.

@tb59427
Contributor

tb59427 commented May 23, 2022

Still preliminary, however: it appears 22050 is stable, as well as 22028 (FreeVario version). 22086 is crashing. So most likely something changed between 22050 and 22086. I'm not familiar enough with GitHub to find the changes between these two releases. Maybe someone better at GitHub than me could do that.

@mihu-ov
Member

mihu-ov commented May 23, 2022

22050 is day 50 of 2022, i.e. February 19th, and 22086 is March 27th. https://github.com/Openvario/meta-openvario/commits/master shows a kernel update from 5.15.24 to 5.15.27 and "sensord: use systemd socket activation" (but you have sensord disabled according to your spreadsheet).
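
The same conversion can be done for any image number. This is a small sketch that assumes every version follows the two-digit-year plus day-of-year pattern described here and that all years are 20YY; `image_build_date` is a name chosen for illustration.

```python
from datetime import date, timedelta


def image_build_date(version: str) -> date:
    """Decode a five-digit YYDDD image version into a calendar date."""
    if len(version) != 5 or not version.isdigit():
        raise ValueError(f"expected five digits (YYDDD), got {version!r}")
    year = 2000 + int(version[:2])   # assumption: all images are from 20YY
    day_of_year = int(version[2:])
    if day_of_year < 1:
        raise ValueError(f"day of year must be at least 1, got {version!r}")
    # Day 1 is January 1st, so add (day_of_year - 1) days to that date.
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)


for v in ("22050", "22086"):
    print(v, image_build_date(v).isoformat())
# 22050 2022-02-19
# 22086 2022-03-27
```

With the two dates in hand, the commit list can be narrowed to the changes made between them.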

@tb59427
Contributor

tb59427 commented May 23, 2022

Since sensord is unlikely to be the culprit, this leaves us with the kernel update as a possible cause. How easy is it to revert the build system to 5.15.24 but retain all other changes? If it is easy (and someone explains this to me in simple language), I could build an up-to-date image with the old kernel and run that as an attempt at positive proof....

@tb59427
Contributor

tb59427 commented May 23, 2022

Ok, I figured out the build process and built an image w/ kernel 5.15.24....let's see what happens...

Turns out that my ov image 22028 (from the freevario git repo) was built using kernel 5.10.2. So chances are that kernel 5.15.24 may not be the right version for a stable image.

@Scumi
Contributor

Scumi commented May 23, 2022

I filled my information into the spreadsheet. I flew 6.5h at the weekend (and the device was powered ~8h in total), no freezes anymore after disabling sensord, variod and pulseaudio. The Openvario was also more responsive. I am not at the latest master branch commit with my image, so I could try that (after our competition) to see if the Kernel as implied has something to do with it.

@tb59427
Contributor

tb59427 commented May 23, 2022

Added "pulseaudio" to the "running demons" section.

@tb59427
Contributor

tb59427 commented May 24, 2022

Quick request (to whoever entered line 13 in my table): could you please check which kernel version your OV is using? Hook a keyboard to the OV, go to the OV menu, exit to the shell and type "uname -a" (without the quotes). The same goes for anybody else using 22050: be sure to check on the system itself (and not based on what should be in the image :-) ). I was using an image called ..22028.. which in reality was 21350 with kernel version 5.10.2 (a relatively old kernel).
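
For comparing the kernel that is actually running against the versions discussed in this thread, a short script can do the same check as "uname -a". This is a sketch assuming Python 3 is available on the device, which the thread does not confirm; `kernel_version` is a helper name chosen for illustration.

```python
import platform
import re


def kernel_version(release: str) -> tuple:
    """Extract (major, minor, patch) from a `uname -r` style string."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # Missing components (e.g. no patch level) are treated as 0.
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)


# platform.release() returns the same release string that uname -r prints.
running = kernel_version(platform.release())
print("running kernel:", ".".join(map(str, running)))
print("older than 5.15.24:", running < (5, 15, 24))
```

Comparing tuples rather than strings avoids the trap that "5.10.2" sorts after "5.15.24" when compared as plain text.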

@tb59427
Contributor

tb59427 commented May 24, 2022

I filled my information into the spreadsheet. I flew 6.5h at the weekend (and the device was powered ~8h in total), no freezes anymore after disabling sensord, variod and pulseaudio. The Openvario was also more responsive. I am not at the latest master branch commit with my image, so I could try that (after our competition) to see if the Kernel as implied has something to do with it.

@Scumi could you please check the exact version of your kernel on the OV where everything runs fine?

@tb59427
Contributor

tb59427 commented Oct 12, 2022

Hi @mihu-ov,

(1) I have not yet had a chance to measure power consumption. I will redo my panel in the weeks to come and check power consumption during this process.

(2) The voltage increase in itself helps stability but doesn't make the OV completely rock solid. I have played with this and found that the best solution seems to be to fix the CPU frequency at a higher value, e.g. 720 MHz. This raises the question: why increase the voltage at all (I have written a bit about that in my commit message)? Simple answer: the CPU frequency can be fixed (or changed) at run time through changes in the script (see PR #331), and this may create a less stable system (if min freq != max freq and the kernel starts changing CPU frequencies). The system will be increasingly unstable if we allow voltages below 1.2 V (especially on a system that cycles the CPU frequency). In a way, this is a belt-and-braces strategy for the best possible stability. So, long story short: is the 1.2 V fix absolutely necessary (if the CPU frequency is fixed at 720 MHz)? No! Is it better for cases where users want to "play" with changing the CPU speed? Absolutely!

cheers
Torsten

@bomilkar
Contributor

* Do we have any data about the increase in power consumption?

I've been looking at temperature reported by /sys/class/thermal/thermal_zone0/temp and can't find any significant increase.
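
For logging that value over time, it helps to convert it first. This is a minimal sketch, assuming Python 3 on the device and relying on the usual Linux convention that this sysfs file holds millidegrees Celsius; `read_temperature_c` is a name chosen for illustration.

```python
from pathlib import Path

THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")


def read_temperature_c(path: Path = THERMAL_ZONE) -> float:
    """Return the zone temperature in degrees Celsius."""
    # The kernel reports this sysfs value in millidegrees Celsius.
    return int(Path(path).read_text().strip()) / 1000.0


# Usage on the device:
#   print(f"{read_temperature_c():.1f} °C")
```

Sampling this before and after a change makes it easier to tell a real temperature increase from normal variation.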

* Do we know if the frequency fix is required after the "increase min voltage to 1.2V" fix?

Not sure. However, there might be individual boards for which 1.2 V isn't enough at CPU frequencies <= 720 MHz. So we should be prepared.

* Would it be enough to increase the minimum frequency instead of fixed frequency

No. The key is to limit the maximum frequency (to 720 MHz) and make sure the voltage doesn't change (PR #330). I've been experimenting with no limit on the minimum frequency and that went well.
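
Capping the maximum frequency is commonly done through the kernel's cpufreq sysfs files. The following is a sketch under that assumption, not the script from the pull requests: the function name `cap_cpu_frequency` is hypothetical, it needs root on a real system, and `scaling_max_freq` and `scaling_min_freq` take values in kHz. The voltage side of the fix is not something this sketch touches.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def cap_cpu_frequency(max_khz, min_khz=None, cpu_root=CPU_ROOT):
    """Limit the cpufreq range of every CPU found under `cpu_root`.

    Pass min_khz equal to max_khz to pin the frequency, or leave it as
    None to cap only the maximum and keep the current minimum.
    """
    touched = []
    for policy in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        # Lower the ceiling first, then (optionally) raise the floor.
        (policy / "scaling_max_freq").write_text(f"{max_khz}\n")
        if min_khz is not None:
            (policy / "scaling_min_freq").write_text(f"{min_khz}\n")
        touched.append(policy)
    return touched


# Usage on the device, as root (720 MHz = 720000 kHz):
#   cap_cpu_frequency(720000)            # cap only the maximum
#   cap_cpu_frequency(720000, 720000)    # pin min = max = 720 MHz
```

Settings written this way do not survive a reboot, so they would have to be reapplied at every start.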

@tb59427
Contributor

tb59427 commented Oct 29, 2022

root@openvario-57-lvds:/sys/devices/system/cpu/cpu1# uptime
12:03:11 up 39 days, 8:13, load average: 0.19, 0.08, 0.07

I would consider this to be stable enough for the cockpit. When do you guys intend to roll in the pull requests?

@tb59427
Contributor

tb59427 commented Nov 1, 2022

I believe we can close this issue now as being fixed (at least for the time being). I'm only hearing positive feedback to these fixes...

linuxianer99 moved this to Done in V3.0 Release on Dec 16, 2022
linuxianer99 added this to the Release 22304 milestone on Dec 17, 2022
@OBrown92

Hi there,
we tested the image (22304) last weekend and unfortunately it freezes after about an hour. We tested on two different SD cards and it always froze after a while. Before investigating further we will make sure the power management in our glider is good.

@tb59427
Contributor

tb59427 commented Apr 11, 2023

Yeah, please do - also: did you download the image from one of my download locations?

@eku60

eku60 commented Apr 11, 2023 via email

@Scumi
Contributor

Scumi commented Apr 11, 2023

Did anyone notice such a behavior? And how can this be repaired. I think it is not a problem of the image. Two different images showing the same behavior. It might be a problem of the new USB-Connectors which clamp the cables directly into the connector. Has anyone any idea?

See discussion in #312

@tb59427
Contributor

tb59427 commented Apr 11, 2023

Beware: the stick (as well as the rotary) controllers run on Arduino boards that convert whatever input the rotary and stick buttons/dials create into key strokes. Could it be that these Arduino boards are more sensitive to voltage drops than the OV?

Or (given the discussion in #312) it could be the shielding of the boards/cables.

@eku60

eku60 commented Apr 11, 2023 via email

@OBrown92

Yeah, please do - also: did you download the image from one of my download locations?

I've downloaded the image from this repo releases.

@bomilkar
Contributor

Some SteFly OVs are known to have a marginal DC-DC converter. It can't cope with an input voltage below roughly 11 V; it just "browns out". A good battery and good (power) cables will help. Replacing the DC-DC converter is not a bad idea.

I had a couple of OV instruments running for 20+ hours in flight this season and they both work just fine with the recent image.

I fixed one issue with a SteFly OV: the 15-pin connector didn't make reliable contact. I secured it with screws so that the connector stays firmly in place. That issue is easy to detect: just wiggle the connector and see if the OV keeps running.

@OBrown92

Thanks for the tips. We just purchased a LiFePO4 battery and will check our cable management. We will get back to you as soon as we have tested it.

@OBrown92

OBrown92 commented May 1, 2023

So it really seems that our batteries were too weak. We tested the LiFePO4 batteries this weekend and so far have not had a crash. What is still a mystery, though, is that we never had these problems with the old 17119 image. Thanks for all the hard work you guys put into this.

@tb59427
Contributor

tb59427 commented May 1, 2023

Very simple: 17119 used a very different Linux kernel than the current images. All current kernels have this bug in the cpu_freq driver for Cubieboards, or to be more precise: the Cubieboards seem to have a bug when switching CPU frequency and the corresponding voltages, and the kernel doesn't work around it. Hence the crashes. The fix simply pins the voltage and CPU frequency to work around this hardware oddity.

For a detailed background of what happened, why it happened etc. see the earlier parts of this thread or read through the commit messages for the relevant patches...

@OBrown92

OBrown92 commented May 1, 2023

Yes, I followed all of this and was, btw, really impressed by the fixes. The strange thing, though, is that the newest image, including all the frequency and voltage fixes, crashed within about an hour with our old battery setup, but the old image doesn't: same day, same SD card, same battery.

@eku60

eku60 commented May 1, 2023 via email

@HoeckDK

HoeckDK commented May 1, 2023

@eku60

When was this voltage driver update done? What is an old board (version)?

I flew with 22304 the other day: SteFly OV 7" without the vario part. It crashed after 2.5 hours. Black screen, no response to Ctrl+Alt+Del.

My history:
I bought my OV in 2016. It ran flawlessly until 2018, when it started crashing. I suspected a PSU issue and replaced the 5 V PSU with a UBEC. Later I found out it was a faulty Wi-Fi module that killed it. While debugging I bought a second device, which has not been used until now. That setup served me well until the recent updates.

Then, after the more recent FreeVario updates (22097 and the one before), I began to see crashes, and thought that maybe I should swap to the unmodified hardware. I did, and after 2.5 hours on 22304 I got the crash. No luck.

Knowing that the supply may not be up to the task, I will revert to the hardware with the UBEC mod. I hope this fixes it, as a competition is coming up.

I have run the devices for many days on the desk without crashes... But maybe flight conditions stress them more than just sitting on a desk.

@eku60

eku60 commented May 2, 2023 via email

@HoeckDK

HoeckDK commented May 3, 2023

I'll pin my hopes on converting back to the modified device with the UBEC 5 V converter, and pray it will work together with 22304.

@eku60

eku60 commented May 4, 2023 via email

@HoeckDK

HoeckDK commented May 13, 2023

Hi

I did 4 hours of flying today. With 22304 OV. No crash. I guess there is something with the other HW setup then. Most likely the power supply?

@HoeckDK

HoeckDK commented May 13, 2023

Hi all

Today I got to fly again. 4 hrs with the modified hardware. No crashes. I guess there is something power supply related in my unmodified Stefly hardware then.

However, I had issues with Flarm reception, and maybe also TX. This has been a challenge for me ever since I have owned this glider. But this season it worked well (new OV hardware). Today it didn't. Any ideas whether it could be related, e.g. noise from the OV disturbing the Flarm? I have seen a post where the OV causes noise in the radio. That does not seem to be an issue for me, though.

@bomilkar
Contributor

What do you mean "had issues with Flarm reception"?
Does the FLARM produce a valid IGC file? At least a clean IGC file if it doesn't have the IGC seal.
Or is it not providing a stable data stream to the OV?
If the latter, simply replace the cable connection: a standard Ethernet patch cable connects the SteFly tty port to a FLARM. (The 15-pin D connector provides the 12V connection to the tty port. See documentation.)
If it is the former: change the position of the GPS antenna.

@HoeckDK

HoeckDK commented May 15, 2023

RS232 is not the issue. My issue has been receiving Flarm targets over the 868 MHz radio. It has been an issue all along in this glider. Then I thought I saw an improvement on a single flight a few flights ago, which I attributed to another change. However, according to the Flarm IGC files, it had started working for this season. Then on the last flight I had reception issues again, the only change being the OV hardware.

I have now changed the display cable for the shorter, shielded one that belongs to the newer HW revision.

Btw, the HW revisions are 1.01 and 1.34. 1.01 with UBEC does not crash. 1.34 seems to crash.
On the 1.01, to rule out that the disconnected voltage regulator is the issue, I have now removed all connections to the chip except the two grounds.

image
Original Vref disconnected. UBEC connected to the Cubie supply points
image
UBEC powered from the fuse.

Before this last change the UBEC was powered directly from the dB15 power pin.

@bomilkar
Contributor

There are 2 useful tools to analyze the range of the 868 MHz radio of the FLARM:

  1. to analyze your receiver: https://www.flarm.com/support/tools-software/flarm-range-analyzer/
  2. to analyze your transmitter: https://ktrax.kisstech.ch/flarm-liverange

I'd like to help you, but this is the wrong place: this issue is closed and it covers a different topic. Therefore I suggest you open a new thread in the forum: https://forum.xcsoar.org/index.php

@OBrown92

Btw, both of our OVs are working well with the new LiFePO4 batteries. We got around 30 hours of competition time on both of them last week with no crashes. Only once did the OV in our Duo crash before we started, but I think it overheated. After a restart and with the canopy covered up, everything worked great for the next 6 hours.

Projects
Status: Done