
Reduce IO activity from 2.4 GByte/hour #564

Closed
synctext opened this issue Apr 16, 2014 · 40 comments

@synctext
Member

Excessive disk writes have been a long-standing Tribler problem.

16 April 2014: 623 MByte/hour

17 May 2013: 660 MByte/hour

See below for the total IO activity of running Tribler f2889f4 for 16 hours. Test conditions: -O optimized mode, no Dispersy debug panel, no downloads, no searches executed, no GUI activity of any kind for 16 hours. Megacache fully bootstrapped before the experiment: 547 MB dispersy.db and 784 MB tribler.sdb.

tribler_6.3pre_heavy_write_traffic_iotop_apr2014

With 22% average CPU utilization (fraction of one core), this is now a nice value. CPU usage has been an issue since #15, but it is now addressed with the Bloom filter skip and re-use heuristic.
tribler_6.3pre_cpu_usage_htop

Related ticket: #8, running Tribler for 1 hour and checking resource usage.

@LipuFei

LipuFei commented Apr 22, 2014

Tribler is doing way too many database operations (about 600 to 1000 per minute when idle). I'll check if we can optimize it.

@synctext
Member Author

12 May 2014: 538 MByte/hour (b8ec22b branch).

tribler_6.3pre_write_traffic_after_10h40m

@egbertbouman egbertbouman added this to the V9.0: Shadow Internet milestone Nov 3, 2014
@brandc

brandc commented Apr 22, 2016

I'm guessing the proposed solution is to group transactions?
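
If grouping is indeed the fix, the idea would be to buffer the many small writes Tribler does per minute and flush them in one SQLite transaction, so the database commits (and fsyncs) once per batch instead of once per statement. A minimal sketch of that pattern; the table, columns, and file name here are made up for illustration, not Tribler's actual schema:

```python
import sqlite3

def flush_writes(conn, pending_rows):
    """Insert all buffered rows in a single transaction (one commit per batch)."""
    with conn:  # sqlite3.Connection as context manager: BEGIN ... COMMIT once
        conn.executemany(
            "INSERT OR IGNORE INTO torrent (infohash, name) VALUES (?, ?)",
            pending_rows,
        )

conn = sqlite3.connect("megacache.db")
conn.execute("CREATE TABLE IF NOT EXISTS torrent (infohash BLOB, name TEXT)")
buffered = [(b"\x00" * 20, "example torrent"), (b"\x01" * 20, "another torrent")]
flush_writes(conn, buffered)  # one commit instead of one per row
```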

@synctext
Member Author

@brandc thanks for reviving this issue. It's been a whole year since I've measured the IO performance. We don't know where this is coming from.
The AllChannel community needs a rewrite, I think. We plan to do that this year, now that we can filter fake channel votes based on Multi-Chain.

@brandc

brandc commented Apr 27, 2016

I certainly hope it gets fixed somewhat soon, so I don't have to run Tribler on a RAM disk with scripts to back it up periodically.

@synctext synctext modified the milestones: Backlog, V9.0: Shadow Internet Jul 13, 2016
@lfdversluis

lfdversluis commented Jul 27, 2016

@synctext how did you get to 538 MB/hour? If I add write_bytes and read_bytes and divide that by 1024*1024, I end up with 8127.37 MB. Assuming TIME+ 2h 41min = 161 minutes, I obtain 3028.83 MB/hour.

So I guess this run took around 16 hours? But I cannot find that in your screenshot.
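
Just to make the two interpretations explicit, a quick back-of-the-envelope check; the 16-hour wall-clock figure is an assumption taken from the original post:

```python
total_mb = 8127.37              # (write_bytes + read_bytes) / 1024**2 from iotop
cpu_hours = (2 * 60 + 41) / 60  # TIME+ column: accumulated CPU time, not wall-clock
wall_hours = 16                 # assumed wall-clock duration of the run

print(total_mb / cpu_hours)   # ~3029 MB per CPU-hour (the figure computed above)
print(total_mb / wall_hours)  # ~508 MB per wall-clock hour, roughly the reported 538
```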

@lfdversluis

Using my laptop, I have run Tribler (with the MultiChain crawler enabled) for 2 hours. I captured the total I/O using sudo iotop -aoP and the results are far less than 500 MB/hour.

results

From the screenshot above, I see that only around ~70-75 MB/hour of I/O is done. I did use my laptop, which only has an i5, with a 14 Mbit connection. But I guess Tribler running idle won't do 1.5 MB/s. All in all, I think in the above screenshots the amount of I/O was divided by the TIME+ column, which is the total CPU time spent on Tribler. I am not sure whether this is a valid measure to say Tribler does X amount of I/O per "CPU hour", as in an hour of actual computation.
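
One way to sidestep the TIME+ ambiguity entirely is to sample the kernel's cumulative per-process I/O counters over wall-clock time. A rough Linux-only sketch; the PID is a placeholder, and /proc/<pid>/io must be read as the same user or as root:

```python
import time

def io_counters(pid):
    """Return the cumulative I/O counters from /proc/<pid>/io as a dict."""
    counters = {}
    with open("/proc/%d/io" % pid) as handle:
        for line in handle:
            key, value = line.split(":")
            counters[key.strip()] = int(value)
    return counters

pid = 12345          # placeholder: Tribler's process ID
interval = 3600      # one wall-clock hour
before = io_counters(pid)
time.sleep(interval)
after = io_counters(pid)
print("write MB/hour:", (after["write_bytes"] - before["write_bytes"]) / 1024.0 ** 2)
print("read MB/hour:", (after["read_bytes"] - before["read_bytes"]) / 1024.0 ** 2)
```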

@synctext
Member Author

synctext commented Aug 3, 2016

Interesting. Try to reproduce the old results with older code, like b8ec22b.

Please ensure you are connectable on several ports (DHT, libtorrent, Dispersy).

@synctext
Member Author

synctext commented Aug 3, 2016

plus: f2889f4

A nice experiment would be to measure IO intensity over 6 different versions, plus a 7th version from Laurens. Plot the release date on the X axis and the IO per hour on the Y axis, for instance. Each version gets just a 1-hour run with a clean megacache; also plot collected torrents + AllChannel votes (easy in the debug panel with the 'other stuff' option on). CPU user time would also be a nice evolution over time to plot for the 7 versions.
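
Something like the following would produce the suggested plot; all values beyond the two already reported in this ticket are placeholders to be replaced by the measured numbers:

```python
from datetime import date

import matplotlib.pyplot as plt

# Release date -> measured I/O per hour (MB), one entry per Tribler version,
# each from a 1-hour run with a clean megacache.
io_per_release = {
    date(2014, 4, 16): 623,   # figure from this ticket
    date(2014, 5, 12): 538,   # figure from this ticket; remaining versions TBD
}

dates = sorted(io_per_release)
plt.plot(dates, [io_per_release[d] for d in dates], marker="o")
plt.xlabel("Release date")
plt.ylabel("Disk I/O (MB/hour)")
plt.title("Tribler idle I/O per release")
plt.savefig("io_per_release.png")
```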

@lfdversluis

AFAIK I was connectable on all ports. Since I am neither seeding nor downloading, libtorrent should not do much, if anything. I will try that particular commit tomorrow. It will be interesting to observe the amount of I/O there.

Regarding the plot, that might indeed be interesting. I'll try to do that tomorrow.

@synctext
Member Author

synctext commented Aug 3, 2016

@lfdversluis

lfdversluis commented Aug 8, 2016

I have managed to get Tribler 6.3.5 working on the Ubuntu 15.10 machine. It looks like the IO numbers mentioned above are accurate: 10 minutes in, 300 MB of disk writes have already been done. It may actually be higher than 600 MB/hour. Wow.

Edit: after ~1 hour, it read a stunning 26.51 GB of data and wrote 1659.38 MB of data. This brings it to 28805 MB/hour.

@lfdversluis

lfdversluis commented Aug 8, 2016

I note that on a fresh start, both 6.3.5 and 6.4.3 read a lot of data, ~100 MB.

Edit: in ~1 hour, 6.4.3 reads 19.81 GB and writes 850.30 MB. This means 21135 MB/hour.

@lfdversluis

6.5.2 reads 74.44 MB and writes 1530.72 MB in ~1 hour. This means it does 1605 MB/hour.

Apparently the virtual machines on a 100 Mbit connection do a lot more I/O; perhaps this plays a role, maybe in both connectivity and download speed of content?

@lfdversluis

lfdversluis commented Aug 9, 2016

With the twistd plugin: 6.6.0pre-exp1 read 4 KB in ~1 hour and wrote 2.59 GB. This means an I/O activity of 2652 MB/hour.

With tribler.sh (GUI enabled): read 17.52 MB and wrote 2.52 GB, or 2598 MB/hour.

These are all significantly higher than what @synctext measured or what my laptop running on my 14 Mbit connection did. Odd.

@synctext
Member Author

synctext commented Aug 9, 2016

Good progress.
As indicated, did you check how fast AllChannel was collecting votes? And torrent collecting?

Key insight: cost vs. benefit.
My hypothesis is that with 100 Mbps you sync very fast and intensely.

@lfdversluis

lfdversluis commented Aug 9, 2016

No, these were the official .deb files that ran, so no modifications in the code.
I will have to check out the versions via git and modify the code to obtain these statistics. Note that torrent meta messages are shared via the Channel community rather than AllChannel.

@synctext
Member Author

synctext commented Aug 9, 2016

No code changes needed; it's all in the debug panel.

@lfdversluis

lfdversluis commented Aug 13, 2016

Forgot to update.

I ran four Tribler versions idle for one hour using a clean state dir on the Ubuntu 15.10 machine with a 100 Mbit connection. I noted down the packets in the database using the debug panel:

db packets

@lfdversluis

lfdversluis commented Aug 13, 2016

OK, so I have measured the I/O calls and times in both Tribler and Dispersy, using an empty megacache (.Tribler state dir), and ran it for one hour. It turns out Tribler was blocked for 607.723 seconds doing I/O; this result does not include LevelDB, as I am not sure yet whether it has blocking behavior. Dispersy blocked a lot less, 30.399 seconds. In total, Tribler thus did 638.122 seconds of I/O, meaning the main thread was blocked 18% of the time.

A breakdown generated:

Tribler:

------- fetchall --------
Times called: 95195 
total time: 279.232 seconds
avg: 0.00293
min: 0.00000
max: 0.22155
Total amount of functions calling this db function: 5
-----------------

------- execute --------
Times called: 225346 
total time: 14.996 seconds
avg: 0.00007
min: 0.00000
max: 0.20853
Total amount of functions calling this db function: 5
-----------------

------- executemany --------
Times called: 22263 
total time: 10.200 seconds
avg: 0.00046
min: 0.00000
max: 0.06107
Total amount of functions calling this db function: 5
-----------------

------- update --------
Times called: 6582 
total time: 21.420 seconds
avg: 0.00325
min: 0.00000
max: 0.02372
Total amount of functions calling this db function: 5
-----------------

------- insert_or_ignore --------
Times called: 18869 
total time: 50.146 seconds
avg: 0.00266
min: 0.00199
max: 0.21303
Total amount of functions calling this db function: 5
-----------------

------- fetchone --------
Times called: 90732 
total time: 276.729 seconds
avg: 0.00305
min: 0.00000
max: 0.26454
Total amount of functions calling this db function: 5
-----------------

Dispersy:

------- commit --------
Times called: 69 
total time: 3.078 seconds
avg: 0.04460
min: 0.00000
max: 0.21289
Total amount of functions calling this db function: 9
-----------------

------- execute --------
Times called: 770882 
total time: 27.065 seconds
avg: 0.00004
min: 0.00000
max: 0.17525
Total amount of functions calling this db function: 47
-----------------

------- executemany --------
Times called: 6540 
total time: 0.256 seconds
avg: 0.00004
min: 0.00001
max: 0.00320
Total amount of functions calling this db function: 8
-----------------

------- executescript --------
Times called: 3 
total time: 0.000 seconds
avg: 0.00001
min: 0.00001
max: 0.00001
Total amount of functions calling this db function: 7
-----------------

@synctext
Member Author

607 out of 3600 seconds. (You mixed up your . and , btw.)
Solid progress! Great stuff.

Note that the critical info is which calls are the heaviest: execute, fetchall, or the community.py callers.

@lfdversluis

lfdversluis commented Aug 13, 2016

Yes, this info is all captured as well. Lines are in the form of:
1471109266.44 fetchall helper 46 /home/laurens/Documents/thesis/tribler/Tribler/dispersy/util.py 0.0179858207703
From left to right: the time it was measured, the database function, the caller function name, the caller function line number, the path of the caller function's file, and the elapsed query time.

Edit: I just noticed that helper is a decorator function; I'll check if I can get the function one frame later so the actual caller shows.
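
For reference, a sketch of the kind of timing decorator being described; the frame index to skip depends on how many wrappers sit between the decorated call and the real caller, which is exactly the "helper shows up as the caller" problem mentioned above. All names here are illustrative, not the actual Tribler instrumentation:

```python
import inspect
import time
from functools import wraps

def timed_db_call(func):
    """Log a database call as: timestamp, function, caller, line, file, elapsed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.time() - start
            # frames[1] is the immediate caller (e.g. a helper decorator);
            # frames[2] skips that extra frame so the real caller is reported.
            frames = inspect.stack()
            caller = frames[2] if len(frames) > 2 else frames[1]
            print(start, func.__name__, caller.function,
                  caller.lineno, caller.filename, elapsed)
    return wrapper
```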

@lfdversluis

lfdversluis commented Aug 15, 2016

@whirm I remembered you explained this to me (I was doubting myself); thanks for the confirmation 👍

@synctext
Member Author

Jenkins needs IO read/write intensity tracking.

@devos50
Contributor

devos50 commented Sep 11, 2018

Gumby has this functionality (using filefs):

readbytes

writebytes

@devos50
Contributor

devos50 commented Sep 19, 2018

It seems that database I/O has a huge impact on the reactor. I observed this during one of my experiments with TrustChain. Within a network of 100 nodes where every node creates a TrustChain record every second, I get the following evolution of the total number of blocks.

With a file database:

trustchain_blocks_created

When storing all blocks in dictionaries:

trustchain_blocks_created-2

I think this experiment gives us a reliable baseline for further improvements and optimizations.

@synctext
Member Author

synctext commented Sep 19, 2018

Oops!
7.1 RC1 is making things really bad. Started an instance from scratch. Not a priority now; first do IPv8 all the way. Or perhaps bootstrapping everything is just happening much faster than before.

2.4 GByte/hour of write IO! (9.62 GByte in 4 hours)

[screenshot]

Decent, almost 50% CPU usage (1h 51min in 4h):
[screenshot]

We earned 754 MByte of credit by relaying traffic:
[screenshot]

We're discovering and pulling in 2918 channels with their content, stored as 76.9 MByte of database content.
[screenshot]

What is IPv8 doing here? Is this 383 MBytes downloaded or 383 million UDP packets?
[screenshot]
138 peers? We know everybody...
[screenshot]
Nice CPU graph
[screenshot]

@synctext synctext changed the title Reduce IO activity from 623MByte/hour Reduce IO activity from 2.4 GByte/hour Sep 19, 2018
@qstokkink
Contributor

9.62 GByte written to disk
1.055 GByte given to community
0.407 GByte sent over the network

Several things are off here.

@devos50
Contributor

devos50 commented Sep 30, 2018

I placed a timer in the add_block method in the TrustChainDB class and started the TrustChain experiment again (of which I posted two graphs a few comments ago). During a period of 60 seconds where each node creates one block a second, it seems that around 75% (45 seconds) of the time is spent in the add_block method. This explains the inconsistent timing of the reactor during the experiment.

We should explore whether delegating add_block to the thread pool improves performance. This is particularly important for the TrustChain crawler, which receives and processes new blocks at a high rate.
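
A minimal sketch of that delegation using Twisted's thread pool; the function names are placeholders, and in a real change the SQLite connection would also have to be usable from the pool thread:

```python
from twisted.internet.threads import deferToThread

def add_block_blocking(block):
    """Placeholder for TrustChainDB.add_block: the slow, blocking SQLite insert."""
    pass

def on_block_received(block):
    # Run the blocking insert in the reactor's thread pool instead of on the
    # reactor thread, so packet handling is not stalled by database I/O.
    d = deferToThread(add_block_blocking, block)
    d.addErrback(lambda failure: print("add_block failed:", failure))
    return d
```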

@ichorid
Contributor

ichorid commented Feb 23, 2019

@devos50, as #4090 is merged, we should revisit this issue after the 7.3 release.

@devos50 devos50 removed their assignment Sep 5, 2019
@ichorid
Contributor

ichorid commented Sep 14, 2019

Preliminary observations for 7.3.

In the idle state (channels database live for a couple of days, 50 channels, 600 MB database, 600k torrents) on an SSD: ~50 KB of writes per second (0.25% of the peak I/O).

In the "channel processing" state (when inserting downloaded channel contents into the database) on an HDD: ~600 KB of writes per second (50% of the peak I/O).

Processing a 500k-torrent channel on an HDD takes more than 3 days! The database grows to ~500 MB during this time, but, you guessed it, the total write I/O spent by Tribler during this time is about 130 GBytes. That is a crazy amount of write amplification. The reason for this is the reactor-congestion-control algorithm we employ to split channel processing into small chunks so the reactor is able to process other stuff. On slower media like HDDs it becomes super inefficient.

There are two ways to solve this, both requiring very invasive changes to the current Tribler architecture:

  1. Create a separate priority queue for processing ORM calls and serve these in a separate thread (advised by the Pony author; see the sketch below).
  2. Move GigaChannel into a separate process with its own reactor, REST endpoint, etc.

Also, both ways require blocking write access to the GigaChannel database during channel processing.
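
A minimal sketch of what option 1 could look like: a single dedicated thread owns the database and serves calls from a priority queue, so cheap interactive queries can overtake bulk channel-processing inserts. All names are illustrative, not existing Tribler code:

```python
import itertools
import queue
import threading
from concurrent.futures import Future


class PriorityDBWorker(object):
    """Serve database calls from a priority queue on one dedicated thread."""

    def __init__(self):
        self._queue = queue.PriorityQueue()
        self._order = itertools.count()  # tie-breaker: never compare callables
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, priority, func, *args):
        """Queue func(*args); lower priority numbers are served first."""
        future = Future()
        self._queue.put((priority, next(self._order), func, args, future))
        return future

    def _loop(self):
        while True:
            _, _, func, args, future = self._queue.get()
            try:
                future.set_result(func(*args))
            except Exception as exc:  # propagate errors to the waiting caller
                future.set_exception(exc)


# Hypothetical usage: bulk channel processing gets a low priority (large number),
# interactive queries a high one (small number).
# worker = PriorityDBWorker()
# worker.submit(10, insert_channel_chunk, chunk)
# result = worker.submit(0, search_torrents, "ubuntu").result()
```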

@qstokkink
Contributor

Having done option 2 before myself, I agree with the Pony author: option 1 is the way to go if it is a viable alternative.

@devos50
Contributor

devos50 commented Sep 14, 2019

Option 1 sounds easier to implement. We could first verify the performance gains of option 1 with some (small-scale) experiments.

@ichorid
Contributor

ichorid commented Sep 14, 2019

Option 2 should be relatively easy to implement because GigaChannel makes (almost) no use of other parts of Tribler. The only two calls that it uses are "download a torrent" and "check if some torrent is already downloaded". However, we'll still have to implement something like a simplified priority queue even if we go with a separate process.

@ichorid
Contributor

ichorid commented Jan 3, 2020

Now that we use AIOHTTP, we can start gradually making GigaChannel requests asynchronous by wrapping Pony calls into coroutines that run on a background thread. This will allow us to effectively realize option 1, employing the asyncio reactor as a kind of "queue manager".
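
A sketch of that idea: a single-threaded executor keeps Pony/SQLite access serialized, while the asyncio event loop queues and resumes the awaiting request handlers. The search call in the usage comment is hypothetical:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One dedicated thread serializes all Pony ORM / SQLite access.
db_executor = ThreadPoolExecutor(max_workers=1)

async def run_db(func, *args):
    """Run a blocking ORM call on the database thread and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(db_executor, func, *args)

# Hypothetical usage inside an aiohttp request handler:
# results = await run_db(metadata_store.search_keyword, "ubuntu")
```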

@ichorid ichorid modified the milestones: Backlog, V7.5: core refactoring Jan 3, 2020
@devos50
Contributor

devos50 commented Feb 3, 2020

This will allow us to effectively realize option 1 employing asyncio reactor as a kind of "queue manager".

I believe that this is indeed the proper way to manage all our operations in Tribler (and IPv8). The first release(s) of IPv8 used the reactor thread purely for socket listening and processed each incoming packet on one of the available thread pools. However, DAS5 experiments showed that this heavily impacted anonymous download speed. Therefore, we switched to packet processing on the reactor thread. During the Noodle release, we realised that it's actually the other way around and handling packets is better done in the thread pool, at least for TrustChain block processing.

Now that we have switched to asyncio, we should see what the impact on anonymous download speed is if we process packets on the reactor thread vs. the thread pool.

@synctext
Member Author

synctext commented Jun 9, 2020

Wow, impressive! 🥇
The latest Tribler 7.5.0-RC5 is doing very little IO: only 60 MByte/hour.

When idle and doing very little, the IO read/write activity and CPU usage are small. The total accumulated IO on Linux, measured over 15 minutes:

Screenshot from 2020-06-09 16-54-51

@synctext synctext closed this as completed Jun 9, 2020