This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Ethstore optimizations #6827

Merged
merged 1 commit into master from ethstore-optimize on Oct 20, 2017

Conversation

arkpar
Collaborator

@arkpar arkpar commented Oct 19, 2017

Closes #6456, #6787

@arkpar arkpar added A0-pleasereview 🤓 Pull request needs code review. B0-patch M4-core ⛓ Core client code / Rust. labels Oct 19, 2017
Collaborator

@tomusdrw tomusdrw left a comment


LGTM, minor nitpicks


fn last_modification_date(&self) -> Result<u64, Error> {
use std::time::UNIX_EPOCH;
let duration = fs::metadata(&self.path)?.modified()?.duration_since(UNIX_EPOCH).unwrap();
Collaborator


expect?
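
For reference, a minimal sketch of what the suggested change could look like, based on the excerpt above; the expect message and the final conversion to seconds are assumptions for illustration, not the PR's actual code:

fn last_modification_date(&self) -> Result<u64, Error> {
    use std::time::UNIX_EPOCH;
    // expect() documents why the panic is considered unreachable,
    // instead of a bare unwrap() with no context.
    let duration = fs::metadata(&self.path)?
        .modified()?
        .duration_since(UNIX_EPOCH)
        .expect("file modification time is after UNIX_EPOCH");
    Ok(duration.as_secs())
}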

return Ok(())
let mut last_timestamp = self.timestamp.lock();
let now = Instant::now();
if (now - last_timestamp.last_checked) > Duration::from_secs(REFRESH_TIME_SEC) {
Collaborator


I would rather store last_timestamp.valid_till to avoid subtraction (and possible overflow in case the time is adjusted by the system):

if last_timestamp.valid_till < now {

}

Collaborator Author


Instant is monotonic. It is not affected by clock changes and is guaranteed never to decrease.

Collaborator


Fair enough.
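
For illustration, a self-contained sketch of the two equivalent checks discussed here; the struct and field names are assumptions modelled on the excerpt, not the PR's actual types:

use std::time::{Duration, Instant};

const REFRESH_TIME_SEC: u64 = 5; // assumed value for the sketch

struct Timestamp {
    last_checked: Instant,
}

// The form used in the excerpt: subtract and compare. The subtraction cannot
// underflow in practice because Instant is monotonic, so `now` is never
// earlier than a previously captured `last_checked`.
fn needs_refresh(ts: &Timestamp, now: Instant) -> bool {
    (now - ts.last_checked) > Duration::from_secs(REFRESH_TIME_SEC)
}

struct Deadline {
    valid_till: Instant,
}

// The alternative suggested in review: store the deadline up front and
// compare directly, avoiding the subtraction altogether.
fn needs_refresh_deadline(d: &Deadline, now: Instant) -> bool {
    d.valid_till < now
}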

.cloned()
.ok_or(Error::InvalidAccount)
let cache = self.cache.read();
let mut r = cache.range((::std::collections::Bound::Included(*address), ::std::collections::Bound::Included(*address)));
Collaborator


use ::std::collections::Bound at the top of the function?
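
A sketch of the suggested tidy-up, with the import hoisted to the top of the function; the cache type and the surrounding signature are assumptions for illustration (the real code operates on the account cache):

fn first_cached<A, V>(cache: &std::collections::BTreeMap<A, V>, address: &A) -> Option<V>
where
    A: Ord + Copy,
    V: Clone,
{
    // Importing Bound once at the top keeps the range expression readable.
    use std::collections::Bound;
    let mut r = cache.range((Bound::Included(*address), Bound::Included(*address)));
    r.next().map(|(_, v)| v.clone())
}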

@5chdn
Contributor

5chdn commented Oct 19, 2017

Is this related to #6456 and #6787?

@arkpar
Collaborator Author

arkpar commented Oct 19, 2017

Right, closes #6456, #6787

@AdvancedStyle

This fixes issue #6787, and peers are now being found and kept.
However, Parity is using a massive amount of disk read IO, around 100% disk read utilization all the time (using the latest master branch plus this commit).

let mut last_timestamp = self.timestamp.lock();
let now = Instant::now();
if (now - last_timestamp.last_checked) > Duration::from_secs(REFRESH_TIME_SEC) {
let dir_hash = Some(self.dir.unique_repr()?);


This appears to be creating a disk IO read problem, because every 5 seconds you end up looping over and hashing all the files in the keystore. With a large number of files (in my case 150,000), it ends up causing a total server crash.

Is there a better way to detect changes in the keystore dir? Possibly only detect a change when a new key is added via personal_newAccount... calls, or use a simple file count?

Collaborator


unique_repr is changed in this PR to return the directory's last modification date, so there is no more hashing here.
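
A minimal sketch of the idea, assuming the keystore directory path is at hand; the PR's actual unique_repr implementation may differ in its details:

use std::fs;
use std::io;
use std::path::Path;
use std::time::UNIX_EPOCH;

// A single metadata call on the directory itself, so the periodic refresh
// no longer opens or hashes every key file in the keystore.
fn dir_modification_timestamp(path: &Path) -> io::Result<u64> {
    let modified = fs::metadata(path)?.modified()?;
    let secs = modified
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0); // a modification time before UNIX_EPOCH maps to 0
    Ok(secs)
}

On most filesystems a directory's mtime changes whenever an entry is added, removed, or renamed, which is exactly the event the refresh check needs to detect.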

Contributor


@AdvancedStyle are you saying your server crashes with the changes from this PR?


@AdvancedStyle AdvancedStyle Oct 20, 2017


Yes, doing more testing and will provide more information shortly.

Basically, once I compile and run latest master with this PR, it fixes the peer connections but then crashes the server (with a noticeably huge amount of disk reading).

But I'll do a bit more testing and see if I can gather some actual error messages from the server crash.


@AdvancedStyle AdvancedStyle Oct 20, 2017


OK, I've been using iostat / iotop / tracefile (https://gitlab.com/ole.tange/tangetools/tree/master/tracefile) to monitor disk IO, and here are the results:

Using latest master, the iostat utilization is quite low (less than 5%); however, the tracefile output shows that Parity is constantly accessing the keystore files. That is not causing high disk utilization or a server crash, only causing peers to drop/time out and syncing to stop.

After PR #6827, the iostat utilization goes through the roof (50-100% disk utilization all the time):

iostat output sample:

avg-cpu: %user %nice %system %iowait %steal %idle
23.14 0.00 20.48 13.56 0.27 42.55

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 5339.00 0.00 49176.00 0.00 18.42 0.75 0.14 0.14 0.00 0.09 49.60

avg-cpu: %user %nice %system %iowait %steal %idle
20.21 0.00 17.32 14.70 0.26 47.51

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 17.00 5192.00 3.00 46776.00 80.00 18.04 0.74 0.14 0.14 2.67 0.11 55.60

avg-cpu: %user %nice %system %iowait %steal %idle
25.07 0.00 15.04 10.55 0.53 48.81

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 28.00 4083.00 2.00 38940.00 120.00 19.12 0.60 0.15 0.15 2.00 0.10 42.40

avg-cpu: %user %nice %system %iowait %steal %idle
22.13 0.00 18.67 15.73 0.27 43.20

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 10.00 3744.00 4.00 32000.00 56.00 17.11 0.83 0.22 0.21 1.00 0.17 64.00

avg-cpu: %user %nice %system %iowait %steal %idle
24.42 0.00 15.32 32.21 0.26 27.79

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 13.00 2987.00 4.00 28780.00 68.00 19.29 2.07 0.70 0.70 1.00 0.27 80.80

But the strange thing is that tracefile (the strace script) is not really seeing a lot of file access by the process (Parity is no longer accessing the keystore files individually); it only appears to be accessing the keystore dir. So why all the disk IO?

Sample of tracefile output:

/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484066.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484067.log
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484068.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484066.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484068.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484069.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484069.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484066.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484064.sst
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484070.log
/sys/devices/system/cpu/online
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484071.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484071.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484072.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484072.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484071.sst
/home/ethereum/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d/overlayrecent/db/484068.sst
/home/ethereum/.ethereum/keystore/ethereum
/home/ethereum/.ethereum/keystore/ethereum

The VM then hangs/crashes after a few minutes with hung_task_timeout_secs messages:

INFO: task kworker/[...] blocked for more than 120 seconds.
 Not tainted [...]
 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

And a bunch of similar hung tasks.

Conclusion:
For some reason, the modified-time check on the keystore dir might be creating a lot of read IO?


@AdvancedStyle AdvancedStyle Oct 20, 2017


I mean, at first I was testing with master (no IO read problem / no crash).

Then I compiled with this PR #6827 (git fetch origin pull/6827/head:sometest && git checkout sometest, then compiled).

The IO read problem only occurred after compiling with PR #6827.

Collaborator Author


The IO must be just the database IO. Does the machine have HDD or SSD? How exactly does it crash? Is there a backtrace printed?


@AdvancedStyle AdvancedStyle Oct 20, 2017


SSD. Server running on a KVM virtual machine (Debian host, Ubuntu 16 guest).

The entire virtual machine just locks up; SSH and the virtmanager console become non-responsive, then it dumps a series of error messages in the virtmanager console similar to:

INFO: task kworker/[...] blocked for more than 120 seconds.
Not tainted [...]
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Note: I'm running Parity with these parameters:

screen -d -m parity --cache-size 4096 --allow-ips public --jsonrpc-apis web3,eth,net,parity,traces,rpc,personal --geth --keys-path /home/ethereum/.ethereum/keystore


I've decided to clone the VM and try to delete and resync the chain from 0 to see if that helps. So far it seems to be doing OK (no IO read problems so far); will post back shortly.


I have resynced the chain from 0 using PR #6827 and now disk IO is normal again and all is working.

So everything is fine now (fingers crossed).

The issue of slow key generation (#6839) is also fixed.

@svyatonik svyatonik added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Oct 20, 2017
@arkpar arkpar merged commit 097815c into master Oct 20, 2017
@arkpar arkpar deleted the ethstore-optimize branch October 20, 2017 18:20
arkpar added a commit that referenced this pull request Oct 20, 2017
arkpar added a commit that referenced this pull request Oct 20, 2017
@AdvancedStyle AdvancedStyle mentioned this pull request Oct 21, 2017
lght pushed a commit to lght/parity that referenced this pull request Oct 25, 2017
@5chdn 5chdn added this to the 1.8 milestone Dec 8, 2017
tomusdrw pushed a commit that referenced this pull request Dec 20, 2017
5chdn pushed a commit that referenced this pull request Dec 27, 2017
Labels
A8-looksgood 🦄 Pull request is reviewed well. M4-core ⛓ Core client code / Rust.