
Reduce IO activity from 2.4 GByte/hour #564

Closed
synctext opened this issue Apr 16, 2014 · 40 comments

@synctext
Member

Excessive disk writes have been a long-standing Tribler problem.

16 April 2014: 623 MByte/hour

17 May 2013: 660 MByte/hour

See below for the total IO activity of running Tribler f2889f4 for 16 hours. Test conditions: -O optimized mode, no Dispersy debug panel, no downloads, no searches executed, no GUI activity of any kind for 16 hours. Megacache fully bootstrapped before the experiment: 547 MB dispersy.db and 784 MB tribler.sdb.

tribler_6.3pre_heavy_write_traffic_iotop_apr2014

With 22% average CPU utilization (fraction of one core), this is now a nice value. CPU usage has been an issue since #15, but it is now addressed with the Bloom filter skip and re-use heuristic.
tribler_6.3pre_cpu_usage_htop

Related ticket: #8, running Tribler for 1 hour and checking resource usage.

@LipuFei

LipuFei commented Apr 22, 2014

Tribler is doing way too many database operations (about 600 to 1000 per minute when idle). I'll check if we can optimize it.

@synctext
Member Author

12 May 2014: 538 MByte/hour (b8ec22b branch).

tribler_6.3pre_write_traffic_after_10h40m

@egbertbouman egbertbouman added this to the V9.0: Shadow Internet milestone Nov 3, 2014
@brandc

brandc commented Apr 22, 2016

I'm guessing the proposed solution is to group transactions?
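
If grouping is indeed the fix, the idea would be to buffer the many small writes Tribler does per minute and flush them in one SQLite transaction, so the database commits (and fsyncs) once per batch instead of once per statement. A minimal sketch of that pattern; the table, columns, and file name here are made up for illustration, not Tribler's actual schema:

```python
import sqlite3

def flush_writes(conn, pending_rows):
    """Insert all buffered rows in a single transaction (one commit per batch)."""
    with conn:  # sqlite3.Connection as context manager: BEGIN ... COMMIT once
        conn.executemany(
            "INSERT OR IGNORE INTO torrent (infohash, name) VALUES (?, ?)",
            pending_rows,
        )

conn = sqlite3.connect("megacache.db")
conn.execute("CREATE TABLE IF NOT EXISTS torrent (infohash BLOB, name TEXT)")
buffered = [(b"\x00" * 20, "example torrent"), (b"\x01" * 20, "another torrent")]
flush_writes(conn, buffered)  # one commit instead of one per row
```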

@synctext
Member Author

@brandc thanks for reviving this issue. It's been a whole year since I've measured the IO performance. We don't know where this is coming from.
The AllChannel community needs a rewrite, I think. We plan to do that this year, now that we can filter fake channel votes based on Multi-Chain.

@brandc

brandc commented Apr 27, 2016

I certainly hope it gets fixed somewhat soon, so I don't have to run Tribler on a RAM disk with scripts to back it up periodically.

@synctext synctext modified the milestones: Backlog, V9.0: Shadow Internet Jul 13, 2016
@lfdversluis

lfdversluis commented Jul 27, 2016

@synctext how did you get to 538 MB/hour? If I add write_bytes and read_bytes and divide that by 1024*1024, I end up with 8127.37 MB. Assuming TIME+ 2h 41min = 161 minutes, I obtain 3028.83 MB/hour.

So I guess this run took around 16 hours? But I cannot find that in your screenshot.
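
Just to make the two interpretations explicit, a quick back-of-the-envelope check; the 16-hour wall-clock figure is an assumption taken from the original post:

```python
total_mb = 8127.37              # (write_bytes + read_bytes) / 1024**2 from iotop
cpu_hours = (2 * 60 + 41) / 60  # TIME+ column: accumulated CPU time, not wall-clock
wall_hours = 16                 # assumed wall-clock duration of the run

print(total_mb / cpu_hours)   # ~3029 MB per CPU-hour (the figure computed above)
print(total_mb / wall_hours)  # ~508 MB per wall-clock hour, roughly the reported 538
```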

@lfdversluis

Using my laptop, I have run Tribler (with the MultiChain crawler enabled) for 2 hours. I captured the total I/O using sudo iotop -aoP and the results are far less than 500 MB/hour.

results

From the screenshot above, I see that only around ~70-75 MB/hour of I/O is done. I did use my laptop, which only has an i5, with a 14 Mbit connection. But I guess Tribler running idle won't do 1.5 MB/s. All in all, I think in the above screenshots the amount of I/O was divided by the TIME+ column, which is the total CPU time spent on Tribler. I am not sure whether this is a valid measure to say Tribler does X amount of I/O per "CPU hour", as in an hour of actual computation.
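
One way to sidestep the TIME+ ambiguity entirely is to sample the kernel's cumulative per-process I/O counters over wall-clock time. A rough Linux-only sketch; the PID is a placeholder, and /proc/<pid>/io must be read as the same user or as root:

```python
import time

def io_counters(pid):
    """Return the cumulative I/O counters from /proc/<pid>/io as a dict."""
    counters = {}
    with open("/proc/%d/io" % pid) as handle:
        for line in handle:
            key, value = line.split(":")
            counters[key.strip()] = int(value)
    return counters

pid = 12345          # placeholder: Tribler's process ID
interval = 3600      # one wall-clock hour
before = io_counters(pid)
time.sleep(interval)
after = io_counters(pid)
print("write MB/hour:", (after["write_bytes"] - before["write_bytes"]) / 1024.0 ** 2)
print("read MB/hour:", (after["read_bytes"] - before["read_bytes"]) / 1024.0 ** 2)
```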

@synctext
Member Author

synctext commented Aug 3, 2016

Interesting. Try to reproduce the old results with older code, like b8ec22b.

Please ensure you are connectable on several ports (DHT, libtorrent, Dispersy).

@synctext
Member Author

synctext commented Aug 3, 2016

plus: f2889f4

A nice experiment would be to measure IO intensity over 6 different versions, plus a 7th version from Laurens. Plot the release date on the X axis and the IO per hour on the Y axis, for instance. Each version gets just a 1-hour run with a clean megacache; also plot collected torrents + AllChannel votes (easy in the debug panel with the 'other stuff' option on). CPU user time would also be a nice evolution over time to plot for the 7 versions.
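
Something like the following would produce the suggested plot; all values beyond the two already reported in this ticket are placeholders to be replaced by the measured numbers:

```python
from datetime import date

import matplotlib.pyplot as plt

# Release date -> measured I/O per hour (MB), one entry per Tribler version,
# each from a 1-hour run with a clean megacache.
io_per_release = {
    date(2014, 4, 16): 623,   # figure from this ticket
    date(2014, 5, 12): 538,   # figure from this ticket; remaining versions TBD
}

dates = sorted(io_per_release)
plt.plot(dates, [io_per_release[d] for d in dates], marker="o")
plt.xlabel("Release date")
plt.ylabel("Disk I/O (MB/hour)")
plt.title("Tribler idle I/O per release")
plt.savefig("io_per_release.png")
```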

@lfdversluis

AFAIK I was connectable on all ports. Since I am neither seeding nor downloading, libtorrent should not do much, if anything. I will try that particular commit tomorrow. It will be interesting to observe the amount of I/O there.

Regarding the plot, that might indeed be interesting. I'll try to do that tomorrow.

@synctext
Member Author

synctext commented Aug 3, 2016

@lfdversluis

lfdversluis commented Aug 8, 2016

I have managed to get Tribler 6.3.5 working on the Ubuntu 15.10 machine. It looks like the IO numbers mentioned above are accurate: 10 minutes in, 300 MB of disk writes have already been done. It may actually be higher than 600 MB/hour. Wow.

Edit: after ~1 hour, it read a stunning 26.51 GB of data and wrote 1659.38 MB of data. This brings it to 28805 MB/hour.

@lfdversluis

lfdversluis commented Aug 8, 2016

I note that on a fresh start, both 6.3.5 and 6.4.3 read a lot of data, ~100 MB.

Edit: in ~1 hour, 6.4.3 reads 19.81 GB and writes 850.30 MB. This means 21135 MB/hour.

@lfdversluis

6.5.2 reads 74.44 MB and writes 1530.72 MB in ~1 hour. This means it does 1605 MB/hour.

Apparently the virtual machines on a 100 Mbit connection do a lot more I/O; perhaps this plays a role, maybe in both connectivity and download speed of content?

@lfdversluis

lfdversluis commented Aug 9, 2016

With the twistd plugin: 6.6.0pre-exp1 read 4 KB in ~1 hour and wrote 2.59 GB. This means an I/O activity of 2652 MB/hour.

With tribler.sh (GUI enabled): read 17.52 MB and wrote 2.52 GB, or 2598 MB/hour.

These are all significantly higher than what @synctext measured or what my laptop running on my 14 Mbit connection did. Odd.

@synctext
Member Author

synctext commented Aug 9, 2016

Good progress.
As indicated, did you check how fast AllChannel was collecting votes? And torrent collecting?

Key insight: cost vs. benefit.
My hypothesis is that with 100 Mbps you sync very fast and intensely.

@lfdversluis

lfdversluis commented Aug 9, 2016

No, these were the official .deb files that ran, so no modifications in the code.
I will have to check out the versions via git and modify the code to obtain these statistics. Note that torrent meta messages are shared via the Channel community rather than AllChannel.

@synctext
Member Author

synctext commented Aug 9, 2016

No code changes needed; it's all in the debug panel.

@lfdversluis

lfdversluis commented Aug 13, 2016

Forgot to update.

I ran four Tribler versions idle for one hour using a clean state dir on the Ubuntu 15.10 machine with a 100 Mbit connection. I noted down the packets in the database using the debug panel:

db packets

@lfdversluis

lfdversluis commented Aug 13, 2016

OK, so I have measured the I/O calls and times in both Tribler and Dispersy, using an empty megacache (.Tribler state dir), and ran it for one hour. It turns out Tribler was blocked for 607.723 seconds doing I/O; this result does not include LevelDB, as I am not sure yet whether it has blocking behavior. Dispersy blocked a lot less, 30.399 seconds. In total, Tribler thus did 638.122 seconds of I/O, meaning the main thread was blocked 18% of the time.

A breakdown generated:

Tribler:

------- fetchall --------
Times called: 95195 
total time: 279.232 seconds
avg: 0.00293
min: 0.00000
max: 0.22155
Total amount of functions calling this db function: 5
-----------------

------- execute --------
Times called: 225346 
total time: 14.996 seconds
avg: 0.00007
min: 0.00000
max: 0.20853
Total amount of functions calling this db function: 5
-----------------

------- executemany --------
Times called: 22263 
total time: 10.200 seconds
avg: 0.00046
min: 0.00000
max: 0.06107
Total amount of functions calling this db function: 5
-----------------

------- update --------
Times called: 6582 
total time: 21.420 seconds
avg: 0.00325
min: 0.00000
max: 0.02372
Total amount of functions calling this db function: 5
-----------------

------- insert_or_ignore --------
Times called: 18869 
total time: 50.146 seconds
avg: 0.00266
min: 0.00199
max: 0.21303
Total amount of functions calling this db function: 5
-----------------

------- fetchone --------
Times called: 90732 
total time: 276.729 seconds
avg: 0.00305
min: 0.00000
max: 0.26454
Total amount of functions calling this db function: 5
-----------------

Dispersy:

------- commit --------
Times called: 69 
total time: 3.078 seconds
avg: 0.04460
min: 0.00000
max: 0.21289
Total amount of functions calling this db function: 9
-----------------

------- execute --------
Times called: 770882 
total time: 27.065 seconds
avg: 0.00004
min: 0.00000
max: 0.17525
Total amount of functions calling this db function: 47
-----------------

------- executemany --------
Times called: 6540 
total time: 0.256 seconds
avg: 0.00004
min: 0.00001
max: 0.00320
Total amount of functions calling this db function: 8
-----------------

------- executescript --------
Times called: 3 
total time: 0.000 seconds
avg: 0.00001
min: 0.00001
max: 0.00001
Total amount of functions calling this db function: 7
-----------------

@synctext
Member Author

607 out of 3600 seconds. (You mixed up your . and , btw.)
Solid progress! Great stuff.

Note that the critical info is which calls are the heaviest: execute, fetchall, or the community.py callers.

@lfdversluis

lfdversluis commented Aug 13, 2016

Yes, this info is all captured as well. Lines are in the form of:
1471109266.44 fetchall helper 46 /home/laurens/Documents/thesis/tribler/Tribler/dispersy/util.py 0.0179858207703
From left to right: the time it was measured, the database function, the caller function name, the caller function line number, the path of the caller function's file, and the elapsed query time.

Edit: I just noticed that helper is a decorator function; I'll check if I can get the function one frame later so the actual caller shows.
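
For reference, a sketch of the kind of timing decorator being described; the frame index to skip depends on how many wrappers sit between the decorated call and the real caller, which is exactly the "helper shows up as the caller" problem mentioned above. All names here are illustrative, not the actual Tribler instrumentation:

```python
import inspect
import time
from functools import wraps

def timed_db_call(func):
    """Log a database call as: timestamp, function, caller, line, file, elapsed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.time() - start
            # frames[1] is the immediate caller (e.g. a helper decorator);
            # frames[2] skips that extra frame so the real caller is reported.
            frames = inspect.stack()
            caller = frames[2] if len(frames) > 2 else frames[1]
            print(start, func.__name__, caller.function,
                  caller.lineno, caller.filename, elapsed)
    return wrapper
```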

@lfdversluis

lfdversluis commented Aug 15, 2016

@whirm I remembered you explained this to me (I was doubting myself); thanks for the confirmation 👍

@synctext
Member Author

Jenkins needs IO read/write intensity tracking.

@devos50
Contributor

devos50 commented Sep 11, 2018

Gumby has this functionality (using filefs):

readbytes

writebytes

@devos50
Contributor

devos50 commented Sep 19, 2018

It seems that database I/O has a huge impact on the reactor. I observed this during one of my experiments with TrustChain. Within a network of 100 nodes where every node creates a TrustChain record every second, I get the following evolution of the total number of blocks.

With a file database:

trustchain_blocks_created

When storing all blocks in dictionaries:

trustchain_blocks_created-2

I think this experiment gives us a reliable baseline for further improvements and optimizations.

@synctext
Member Author

synctext commented Sep 19, 2018

Oops!
7.1 RC1 is making things really bad. Started an instance from scratch. Not a priority now; first do IPv8 all the way. Or perhaps bootstrapping everything is just happening much faster than before.

2.4 GByte/hour of write IO! (9.62 GByte in 4 hours)

[screenshot]

Decent, almost 50% CPU usage (1h 51min in 4h):
[screenshot]

We earned 754 MByte of credit by relaying traffic:
[screenshot]

We're discovering and pulling in 2918 channels with their content, stored as 76.9 MByte of database content.
[screenshot]

What is IPv8 doing here? Is this 383 MBytes downloaded or 383 million UDP packets?
[screenshot]
138 peers? We know everybody...
[screenshot]
Nice CPU graph
[screenshot]

@synctext synctext changed the title Reduce IO activity from 623MByte/hour Reduce IO activity from 2.4 GByte/hour Sep 19, 2018
@qstokkink
Contributor

9.62 GByte written to disk
1.055 GByte given to community
0.407 GByte sent over the network

Several things are off here.

@devos50
Contributor

devos50 commented Sep 30, 2018

I placed a timer in the add_block method in the TrustChainDB class and started the TrustChain experiment again (of which I posted two graphs a few comments ago). During a period of 60 seconds where each node creates one block a second, it seems that around 75% (45 seconds) of the time is spent in the add_block method. This explains the inconsistent timing of the reactor during the experiment.

We should explore whether delegating add_block to the thread pool improves performance. This is particularly important for the TrustChain crawler, which receives and processes new blocks at a high rate.
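
A minimal sketch of that delegation using Twisted's thread pool; the function names are placeholders, and in a real change the SQLite connection would also have to be usable from the pool thread:

```python
from twisted.internet.threads import deferToThread

def add_block_blocking(block):
    """Placeholder for TrustChainDB.add_block: the slow, blocking SQLite insert."""
    pass

def on_block_received(block):
    # Run the blocking insert in the reactor's thread pool instead of on the
    # reactor thread, so packet handling is not stalled by database I/O.
    d = deferToThread(add_block_blocking, block)
    d.addErrback(lambda failure: print("add_block failed:", failure))
    return d
```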

@ichorid
Contributor

ichorid commented Feb 23, 2019

@devos50, as #4090 is merged, we should revisit this issue after the 7.3 release.

@devos50 devos50 removed their assignment Sep 5, 2019
@ichorid
Contributor

ichorid commented Sep 14, 2019

Preliminary observations for 7.3.

In the idle state (channels database live for a couple of days, 50 channels, 600 MB database, 600k torrents) on an SSD: ~50 KB of writes per second (0.25% of the peak I/O).

In the "channel processing" state (when inserting downloaded channel contents into the database) on an HDD: ~600 KB of writes per second (50% of the peak I/O).

Processing a 500k-torrent channel on an HDD takes more than 3 days! The database grows to ~500 MB during this time, but, you guessed it, the total write I/O spent by Tribler during this time is about 130 GBytes. That is a crazy amount of write amplification. The reason for this is the reactor-congestion-control algorithm we employ to split channel processing into small chunks so the reactor is able to process other stuff. On slower media like HDDs it becomes super inefficient.

There are two ways to solve this, both requiring very invasive changes to the current Tribler architecture:

  1. Create a separate priority queue for processing ORM calls and serve these in a separate thread (advised by the Pony author; see the sketch below).
  2. Move GigaChannel into a separate process with its own reactor, REST endpoint, etc.

Also, both ways require blocking write access to the GigaChannel database during channel processing.
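
A minimal sketch of what option 1 could look like: a single dedicated thread owns the database and serves calls from a priority queue, so cheap interactive queries can overtake bulk channel-processing inserts. All names are illustrative, not existing Tribler code:

```python
import itertools
import queue
import threading
from concurrent.futures import Future


class PriorityDBWorker(object):
    """Serve database calls from a priority queue on one dedicated thread."""

    def __init__(self):
        self._queue = queue.PriorityQueue()
        self._order = itertools.count()  # tie-breaker: never compare callables
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, priority, func, *args):
        """Queue func(*args); lower priority numbers are served first."""
        future = Future()
        self._queue.put((priority, next(self._order), func, args, future))
        return future

    def _loop(self):
        while True:
            _, _, func, args, future = self._queue.get()
            try:
                future.set_result(func(*args))
            except Exception as exc:  # propagate errors to the waiting caller
                future.set_exception(exc)


# Hypothetical usage: bulk channel processing gets a low priority (large number),
# interactive queries a high one (small number).
# worker = PriorityDBWorker()
# worker.submit(10, insert_channel_chunk, chunk)
# result = worker.submit(0, search_torrents, "ubuntu").result()
```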

@qstokkink
Contributor

Having done option 2 before myself, I agree with the Pony author: option 1 is the way to go if it is a viable alternative.

@devos50
Contributor

devos50 commented Sep 14, 2019

Option 1 sounds easier to implement. We could first verify the performance gains of option 1 with some (small-scale) experiments.

@ichorid
Contributor

ichorid commented Sep 14, 2019

Option 2 should be relatively easy to implement because GigaChannel makes (almost) no use of other parts of Tribler. The only two calls that it uses are "download a torrent" and "check if some torrent is already downloaded". However, we'll still have to implement something like a simplified priority queue even if we go with a separate process.

@ichorid
Contributor

ichorid commented Jan 3, 2020

Now that we use AIOHTTP, we can start gradually making GigaChannel requests asynchronous by wrapping Pony calls into coroutines that run on a background thread. This will allow us to effectively realize option 1, employing the asyncio reactor as a kind of "queue manager".
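
A sketch of that idea: a single-threaded executor keeps Pony/SQLite access serialized, while the asyncio event loop queues and resumes the awaiting request handlers. The search call in the usage comment is hypothetical:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One dedicated thread serializes all Pony ORM / SQLite access.
db_executor = ThreadPoolExecutor(max_workers=1)

async def run_db(func, *args):
    """Run a blocking ORM call on the database thread and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(db_executor, func, *args)

# Hypothetical usage inside an aiohttp request handler:
# results = await run_db(metadata_store.search_keyword, "ubuntu")
```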

@ichorid ichorid modified the milestones: Backlog, V7.5: core refactoring Jan 3, 2020
@devos50
Contributor

devos50 commented Feb 3, 2020

This will allow us to effectively realize option 1 employing asyncio reactor as a kind of "queue manager".

I believe that this is indeed the proper way to manage all our operations in Tribler (and IPv8). The first release(s) of IPv8 used the reactor thread purely for socket listening and processed each incoming packet on one of the available thread pools. However, DAS5 experiments showed that this heavily impacted anonymous download speed. Therefore, we switched to packet processing on the reactor thread. During the Noodle release, we realised that it's actually the other way around and handling packets is better done in the thread pool, at least for TrustChain block processing.

Now that we have switched to asyncio, we should see what the impact on anonymous download speed is if we process packets on the reactor thread vs. the thread pool.

@synctext
Member Author

synctext commented Jun 9, 2020

Wow, impressive! 🥇
The latest Tribler 7.5.0-RC5 is doing very little IO: only 60 MByte/hour.

When idle and doing very little, the IO read/write activity and CPU usage are small. The total accumulated IO on Linux, measured over 15 minutes:

Screenshot from 2020-06-09 16-54-51

@synctext synctext closed this as completed Jun 9, 2020