Weekly cleanups #4284
We sometimes have very specific sequences of tx broadcasts and blocks being generated to confirm them. If the confirmation is missed the test can completely get out of sync. Make debugging this easier by logging what we confirmed.
Useful if we want to debug a bit better
Both my machine and, apparently, the CI tester machines regularly run into issues with load on the system, causing timeouts (and unresponsiveness). The throttler limits the rate at which new instances of c-lightning get started, to avoid overloading the system. Since the plugin used for parallelism when testing spawns multiple processes, we need to lock on the filesystem. Since we already have that file open, we also write a couple of performance metrics to it.
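As a rough illustration of the locking idea described above, here is a minimal sketch that appends one metrics row to a file while holding an exclusive lock on it, so that several test worker processes do not interleave their writes. It uses `fcntl.flock` from the standard library (Unix only), which is an assumption and not necessarily the locking mechanism this PR uses; the function name `append_metrics_locked` and the demo path are hypothetical.

```python
import fcntl
import os
import tempfile
import time


def append_metrics_locked(path, row):
    """Append one CSV row to `path` while holding an exclusive lock on it.

    Hypothetical helper: other processes calling this on the same path
    block in flock() until the current holder releases the lock.
    """
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            f.write(", ".join(str(v) for v in row) + "\n")
            f.flush()  # make sure the row is written before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Demo with a throwaway file; the values are made up for illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "throttler-demo.csv")
append_metrics_locked(demo_path, [time.time(), 42.0, 75, 0.5])
```

Reusing the metrics file as the lock target keeps everything in a single file, at the cost of holding the lock slightly longer while the row is written.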
I stumbled over this in a test run and it seems benign.
We were getting bad gossip because some nodes discarded the channel announcement for being in the future: at that time, the node's blockheight was still below the confirmation height. The node would then discard the follow-up messages because they were not preceded by an announcement, and get upset about that.
This was causing the following error:

```
Exception in thread Thread-553:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/cirrus-ci-build/contrib/pyln-testing/pyln/testing/utils.py", line 232, in tail
    self.err_logs.append(line.rstrip().decode('UTF-8', 'replace')).rstrip()
AttributeError: 'NoneType' object has no attribute 'rstrip'
[gw5] [ 33%] FAILED tests/test_misc.py::test_bitcoin_failure
```

Notice the second call to `.rstrip()` on the return value of `.append()`.
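The failure can be reproduced in isolation: `list.append()` returns `None`, so chaining `.rstrip()` onto it raises `AttributeError`. The sketch below uses a made-up log line and a plain list in place of `self.err_logs`; the "fixed" line simply drops the trailing call, which is presumably what the commit does.

```python
err_logs = []
line = b"lightningd: some error output\n"  # hypothetical log line

# Buggy form: append() returns None, so the trailing .rstrip() fails.
try:
    err_logs.append(line.rstrip().decode('UTF-8', 'replace')).rstrip()
except AttributeError as e:
    error = str(e)

# Note that the append itself already happened before the exception,
# so start over with an empty list for the corrected form.
err_logs.clear()

# Fixed form: strip and decode the line, then append, with no chained call.
err_logs.append(line.rstrip().decode('UTF-8', 'replace'))
```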
The CI regularly trips up on this very taxing test (100 nodes) so only run it if we have the required horsepower.
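One way to gate such a test is to skip it unless the machine reports enough CPU cores. This is only a sketch: the threshold, the helper `has_horsepower`, and the use of core count as the measure of "horsepower" are all assumptions, not necessarily the condition this PR uses.

```python
import os
import unittest

MIN_CPUS_FOR_HEAVY_TESTS = 8  # hypothetical threshold


def has_horsepower(min_cpus, cpu_count=None):
    """Return True if at least `min_cpus` cores are available.

    `cpu_count` can be passed explicitly for testing; otherwise it is
    read from os.cpu_count(), which may return None on some platforms.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count()
    return cpu_count is not None and cpu_count >= min_cpus


class HeavyTests(unittest.TestCase):
    @unittest.skipUnless(has_horsepower(MIN_CPUS_FOR_HEAVY_TESTS),
                         "needs more CPU cores")
    def test_many_nodes(self):
        # Placeholder body standing in for the 100-node test.
        self.assertTrue(True)
```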
LGTM, just a question about the throttler
delay = time.time() - start_time
with open("/tmp/ltest-throttler.csv", "a") as f:
    f.write("{}, {}, {}, {}\n".format(time.time(), self.load(), self.target, delay))
self.current_load = 100  # Back off slightly to avoid triggering right away
I read this comment a few times, but I still don't get it 😅. What would a load < 75 trigger? And why would setting it to 100 after getting each node avoid the trigger?
The thing is that if we leave the `current_load` at 75 percent, a follow-up call, e.g., another `lightningd` wanting to start, could end up seeing 75% and decide to immediately schedule its startup (maybe the first one hasn't had time to start up and push the load up yet). We might thus end up clearing the queue despite only the first one getting the all-clear.
Setting this to 100% means the EWMA smoothing needs a couple of rounds before allowing the next one to go through, avoiding the flooding scenario above.
Ok, I had it the other way around, thanks!
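The back-off discussed in this thread can be illustrated with a small, deterministic sketch of an exponentially weighted moving average (EWMA) of the load. The class `EwmaLoad`, the decay factor of 0.5, and the fixed load sample of 50% are assumptions chosen for illustration; only the 75% target and the reset to 100 come from the discussion above.

```python
class EwmaLoad:
    """Hypothetical EWMA load tracker, loosely modelled on the thread above."""

    def __init__(self, target=75.0, decay=0.5):
        self.target = target
        self.decay = decay  # assumed smoothing factor
        self.current_load = target

    def update(self, sample):
        """Blend a new load sample (in percent) into the running average."""
        self.current_load = (self.decay * sample
                             + (1 - self.decay) * self.current_load)
        return self.current_load

    def back_off(self):
        """Pretend the load is at 100% so the next caller has to wait."""
        self.current_load = 100.0


def updates_until_clear(ewma, sample, max_updates=100):
    """Count how many updates it takes for the average to drop below target."""
    for n in range(1, max_updates + 1):
        if ewma.update(sample) < ewma.target:
            return n
    return None


# Without the back-off, the very next measurement lets a node through.
no_backoff = EwmaLoad()
rounds_without = updates_until_clear(no_backoff, 50.0)

# With the back-off, the average has to decay from 100 first.
with_backoff = EwmaLoad()
with_backoff.back_off()
rounds_with = updates_until_clear(with_backoff, 50.0)
```

With these assumed numbers the back-off costs one extra measurement round, which gives the node that was just released some time to start up and show up in the measured load.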
ACK 336e972
Just the usual bunch of drive-by fixes, mostly while trying to get Cirrus-CI to pass.
Changelog-None