This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Nothing tells you when you're timing out on 30 second harvester proof checks #2651

Closed
coding-horror opened this issue Apr 25, 2021 · 22 comments
Labels
bug Something isn't working

Comments

@coding-horror

coding-horror commented Apr 25, 2021

The problem

I thought I was farming, but I wasn't -- because something about my network caused the proof check to take more than the hard-coded 30 second limit.

I had an average time to win of 8 or 9 hours for more than 120 hours without a single win. This seemed statistically implausible, so I researched the logs, and cleared any errors or warnings in the logs (well done, all the warnings and errors in debug.log were indeed things I should fix!). Still no wins for a long time.
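For context on how implausible that dry spell is: if wins are modeled as a Poisson process (a standard modeling assumption, not anything from the Chia docs), the numbers above work out to odds of roughly one in a million:

```python
import math

# Sanity check on the "statistically implausible" claim: with an average
# time-to-win of 8.5 hours, the chance of going 120 hours without a
# single win is e^(-120/8.5) under a Poisson model.
expected_hours_to_win = 8.5
dry_spell_hours = 120

p_no_win = math.exp(-dry_spell_hours / expected_hours_to_win)
print(f"P(no win in {dry_spell_hours} h) = {p_no_win:.1e}")  # roughly 7e-07
```

In other words, a 120-hour dry spell at that estimated win rate is essentially impossible by chance; something had to be broken.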

How to reproduce

Have a bunch of plots on slow storage media; when the proof check happens, verifying the proofs takes longer than the hard-coded 30 seconds allowed. You will never win a single Chia, but there's absolutely nothing in the GUI to inform you that this is happening. You can view the logs, but in the logs it is not even presented as a warning (!), but as an INFO message:

2021-04-24T16:03:29.039 harvester chia.harvester.harvester: INFO     8 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 3.74567 s. Total 3922 plots
2021-04-24T16:03:29.433 harvester chia.harvester.harvester: INFO     7 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 2.92976 s. Total 3922 plots
2021-04-24T16:03:29.635 harvester chia.harvester.harvester: INFO     7 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 2.01540 s. Total 3922 plots
2021-04-24T16:03:44.553 harvester chia.harvester.harvester: INFO     4 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 12.11716 s. Total 3922 plots
2021-04-24T16:04:10.882 harvester chia.harvester.harvester: INFO     6 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 24.51660 s. Total 3922 plots
2021-04-24T16:04:36.101 harvester chia.harvester.harvester: INFO     9 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 37.96922 s. Total 3922 plots
2021-04-24T16:05:13.959 harvester chia.harvester.harvester: INFO     11 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 63.07326 s. Total 3922 plots
2021-04-24T16:05:13.959 harvester chia.harvester.harvester: INFO     5 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 50.81979 s. Total 3922 plots
2021-04-24T16:05:31.022 harvester chia.harvester.harvester: INFO     6 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 54.92095 s. Total 3922 plots
2021-04-24T16:05:31.022 harvester chia.harvester.harvester: INFO     7 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 42.29693 s. Total 3922 plots
2021-04-24T16:05:31.022 harvester chia.harvester.harvester: INFO     8 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 29.67779 s. Total 3922 plots
2021-04-24T16:05:33.568 harvester chia.harvester.harvester: INFO     4 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 18.59361 s. Total 3922 plots
2021-04-24T16:05:33.568 harvester chia.harvester.harvester: INFO     8 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 6.10934 s. Total 3922 plots
2021-04-24T16:05:33.568 harvester chia.harvester.harvester: INFO     10 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 3.17564 s. Total 3922 plots
2021-04-24T16:05:33.568 harvester chia.harvester.harvester: INFO     13 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 3.82808 s. Total 3922 plots
2021-04-24T16:05:35.600 harvester chia.harvester.harvester: INFO     5 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 2.82815 s. Total 3922 plots
2021-04-24T16:05:35.600 harvester chia.harvester.harvester: INFO     12 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 3.56252 s. Total 3922 plots
2021-04-24T16:05:58.022 harvester chia.harvester.harvester: INFO     5 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 12.26593 s. Total 3922 plots
2021-04-24T16:06:11.887 harvester chia.harvester.harvester: INFO     4 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 12.42098 s. Total 3922 plots
2021-04-24T16:06:36.417 harvester chia.harvester.harvester: INFO     3 plots were eligible for farming 1c75d8d21c... Found 0 proofs. Time: 24.36354 s. Total 3922 plots

Of the above, the proofs that take longer than 30 seconds are not eligible to win, but this is not logged as an ERROR or WARNING or surfaced in the UI in any way.
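Until the log level changes, a scan like the following can stand in for the missing warning (a sketch in Python, since the harvester itself is Python; the regex is based on the log lines above, and in practice you would feed it your own debug.log instead of the sample list):

```python
import re

# Matches the harvester lookup lines shown above. 30 s is the hard-coded
# limit past which a lookup can no longer win.
LOOKUP_RE = re.compile(r"eligible for farming.*Time: ([0-9.]+) s")
LIMIT_SECONDS = 30.0

def slow_lookups(lines, limit=LIMIT_SECONDS):
    """Yield (seconds, line) for every lookup that exceeded `limit`."""
    for line in lines:
        m = LOOKUP_RE.search(line)
        if m and float(m.group(1)) > limit:
            yield float(m.group(1)), line.rstrip()

# Two lines lifted from the excerpt above, for demonstration.
sample = [
    "INFO 8 plots were eligible for farming 1c75... Found 0 proofs. Time: 3.74567 s. Total 3922 plots",
    "INFO 11 plots were eligible for farming 1c75... Found 0 proofs. Time: 63.07326 s. Total 3922 plots",
]
for seconds, line in slow_lookups(sample):
    print(f"TOO SLOW ({seconds:.1f} s): {line}")
```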

Expected behavior

The GUI should tell you "hey, your proof checks are too slow, there's absolutely no chance for you to win, even if you are farming infinity plots"

Screenshots

Imgur

Desktop

  • OS: Windows 10
  • OS Version/Flavor: latest 19042.928
  • CPU: i7-1165G7

Additional context

I followed up on the #support channel in Keybase, where I got the important advice to enable INFO level logging and check for the 30 second proof limit.. and I wrote up a detailed account on the forum; if you need excruciating levels of detail, please check there 🙇‍♂️

Recommended solution

  • Allow the 30-second proof check threshold to be configurable to account for slower storage (or relax it a bit)
  • Change the log level for harvester proof checks that exceed 30 seconds to WARN or ERROR
  • In the logs, provide details on which files are being checked for proofs, so you can identify which device or NAS is potentially the problem one slowing everything down
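The first two bullets could look something like this in the harvester's logging path (a sketch only; the threshold name, function names, and logger wiring here are assumptions, not chia's actual code):

```python
import logging

# Hypothetical config knob; the consensus limit itself stays at 30 s.
PROOF_TIME_WARN_SECONDS = 30.0

def lookup_level(seconds, threshold=PROOF_TIME_WARN_SECONDS):
    """WARNING once a lookup is too slow to ever win, INFO otherwise."""
    return logging.WARNING if seconds > threshold else logging.INFO

def log_lookup(log, seconds, eligible, plot_path):
    # Naming the plot file covers the third bullet: it points at the
    # specific drive or NAS that is slowing everything down.
    log.log(
        lookup_level(seconds),
        "%d plots eligible, lookup took %.2f s (%s)",
        eligible, seconds, plot_path,
    )

log = logging.getLogger("chia.harvester.harvester")
log_lookup(log, 37.97, 9, "/mnt/nas3/example.plot")  # hypothetical path
```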
@coding-horror coding-horror added the bug Something isn't working label Apr 25, 2021
@maomingyang1314

Moreover, I would like to know how your laptop can mount nearly 400 TB of disks. Is your storage on a NAS?

@maomingyang1314

Your NAS caused your task to time out. It is recommended to abandon this method.

@AZ663

AZ663 commented Apr 25, 2021

At what speed does this start to become a problem?

@Chia-Network Chia-Network deleted a comment from maomingyang1314 Apr 25, 2021
@Chia-Network Chia-Network deleted a comment from maomingyang1314 Apr 25, 2021
@Chia-Network Chia-Network deleted a comment from maomingyang1314 Apr 25, 2021
@mariano54
Contributor

We can't change the consensus algorithm anymore, but we can change the log level, and show the files.

Please note that looking up qualities for plots passing the filter requires about 7 random reads in a plot, whereas actually looking up a proof requires 64 reads. It might not be feasible on a slow NAS, since these are sequential reads. Furthermore, you need to take into account network latency to propagate your proof and block to the network, so you should be under 5 seconds to reduce risk of losing rewards.

Actually, the proof of space library does them sequentially, but they could be done in parallel, since it's a tree, so you could do 1 read, then 2, then 4, .. etc, for a total of around 7 sequential phases (one for each table in the plot). We haven't got around to doing this yet.
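A back-of-envelope model of the difference (my own arithmetic, not chia code): with fully parallel reads per tree level, the 64 reads collapse into roughly 7 seek-bound batches.

```python
# Latency model for the lookup described above, in milliseconds so the
# arithmetic stays exact. Assumes each random read costs `seek_ms` and
# that reads within one tree level could run fully in parallel.

def sequential_lookup_ms(seek_ms, reads=64):
    # Today's behavior: 64 back-to-back random reads.
    return seek_ms * reads

def parallel_tree_lookup_ms(seek_ms, levels=7):
    # Hypothetical: one parallel batch per table/tree level (1, 2, 4, ...).
    return seek_ms * levels

print(sequential_lookup_ms(400))     # prints 25600 -- close to the 30 s limit
print(parallel_tree_lookup_ms(400))  # prints 2800
```

At a 400 ms seek time (the NFS figure cited later in this thread), the sequential lookup sits right at the edge of the limit, while the parallelized version would have an order of magnitude of headroom.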

@mariano54
Contributor

Also another thing to point out is that the responses are returned to the full node as they come out from the drives, so the high time is probably only affecting the slow drive or the slow NAS.

@coding-horror
Author

Yeah, this only started happening as I added multiple NAS devices to the network. With 1 or 2 NASes, it was all fine. Once you get to 5.. not so much, especially if the algorithm picks plots on 6 different devices. At the very least

  • definitely change the log level to ERROR (or WARN)
  • surface this in the UI for sure, otherwise you're silently "farming".. absolutely nothing. A silent killer.
  • tell us WHICH files were picked for proofs so we can look for "problem" devices

thanks @mariano54 !

@immortalx

Is there any benefit to using larger plot files in this case? It would mean fewer files per directory in his case.

@keliew

keliew commented Apr 26, 2021

I think the problem is inherent in how Windows manages mounted network drives.

If you were to switch to a Linux-based farmer, it should work better. I know it's not an option for most people, but it'd be a good test to see if that's the cause.

@kimklai

kimklai commented Apr 26, 2021

I think lots of people need to know that their whole system is just too slow to provide a valid answer in time. So please mark these log entries as WARNING when lookups take longer than a certain threshold.

@desek

desek commented Apr 26, 2021

This was quite a rough finding since I've been happily farming on my RPi on wifi to a remote storage with proof checking usually between 60-90 seconds.

@tartley

tartley commented Apr 26, 2021

Could some drives be powering down when idle?

@coding-horror
Author

coding-horror commented Apr 26, 2021 via email

@bertybassett

bertybassett commented Apr 27, 2021

yeah my 1 Synology NAS is doing the same if I hit it hard with other services such as docker, Plex, etc

My Synology can put the HDDs to sleep, and out of the box I believe that's set as the default; no idea about yours.

Synology HDD hibernation
Synology NAS, expansion units, and USB HDDs enter different energy-saving modes after being idle. ... Phase 1: HDDs power down after a period of inactivity (idle time). You may go to DSM > Control Panel > Hardware & Power > HDD Hibernation to configure the length of inactivity for the HDDs to enter hibernation.

@cccat6

cccat6 commented Apr 27, 2021

We can't change the consensus algorithm anymore, but we can change the log level, and show the files.

Please note that looking up qualities for plots passing the filter requires about 7 random reads in a plot, whereas actually looking up a proof requires 64 reads. It might not be feasible on a slow NAS, since these are sequential reads. Furthermore, you need to take into account network latency to propagate your proof and block to the network, so you should be under 5 seconds to reduce risk of losing rewards.

Actually, the proof of space library does them sequentially, but they could be done in parallel, since it's a tree, so you could do 1 read, then 2, then 4, .. etc, for a total of around 7 sequential phases (one for each table in the plot). We haven't got around to doing this yet.

@mariano54
My disk array is not local but distributed across different places. I gather the disks into one huge volume and connect to it with NFS. The latency of this setup for each random seek is between 200-1000 ms; the average and the median are about 400 ms.
I wonder whether this is able to farm. 400 ms × (7+64) seeks will be 28.4 s, just a little below the 30 s limitation.
However, this setup supports parallel reads. If it seeks 64 different places in the plot at the same time, it can respond to all of them within a single seek time, which is about 400 ms.
Is parallel seeking on the todo list, and does it work the way I am thinking? When will it be updated?
Thanks a lot!

@coding-horror
Author

Yeah this is a critical issue for the project IMO, since a LOT of people are probably "farming" absolutely nothing due to the 30s harvester timeout, and the logs aren't WARN-ing or ERROR-ing them.. the GUI isn't telling them.. the only way to know this is happening is to intentionally set log level to INFO and scan for 30s or longer in the INFO messages 😱

😭

@dorkmo

dorkmo commented Apr 27, 2021

i believe the small time window was designed to push tape drives out of the chia ecosystem. ideally there would be big flashing warnings if you're missing any rewards. hopefully it can be addressed in future releases.

@desek

desek commented Apr 28, 2021

I did a somewhat scalable workaround for this by running multiple harvesters in containers (on the RPi4), each with ONE plot directory of 100 plots. This gives around 2 seconds for proof checking per container (up to 100 plots).

@bertybassett

bertybassett commented Apr 28, 2021

Isn't the simple answer to make each NAS its own farmer until Chia fixes this? Building the RAID array in performance mode on a Synology also causes this issue.

@qfishpear

It's soooooo sad that by "grep eligible" I found nothing there!
(only 1.5TB in my farm, so.... it's just reasonable)

@Sagittarius

You should try "farming on many machines"

@mariano54
Contributor

Your fast ones will still answer quickly, since it's all threaded. It will just display the time at which the slowest finished.
Most users can farm many plots in under a second, so this is not an issue for most people. But it's worth checking if you haven't won rewards for a while.

@fiveangle
Contributor

fiveangle commented Apr 30, 2021

Your fast ones will still answer quickly, since it's all threaded. It will just display the time at which the slowest finished.
Most users can farm many plots in under a second, so this is not an issue for most people. But it's worth checking if you haven't won rewards for a while.

It seems near criminal that we are conditioned to enable "INFO" level logs and routinely told to ignore the countless "ERROR" level scary sounding messages spamming the logs, yet not a whimper about anything in the logs when a proof challenge request lookup is nearing or fully exhausting some non-presented timeout value that only tribal knowledge or code inspection is aware of.

I watched incredulously as my friends with far fewer plots reached, then far surpassed and doubled my winnings with half the plots. Only after reaching out, once the damage was done, did we learn, via an obtuse "OF COURSE YOU WON'T WIN WITH LOOKUP DELAYS LIKE THIS", what the worst of all the actual "errors" was.

Like I said, that these messages are not flagged as ERROR level or at least WARNING is essentially criminal, especially in the face of the constant stream of non-critical "ERROR" level messages spamming the logs.

@Chia-Network Chia-Network locked and limited conversation to collaborators Apr 30, 2021

