
Database performance is abysmal #405

Closed
Nutomic opened this issue Jan 10, 2020 · 12 comments
Labels: bug, enhancement

Comments

Nutomic (Member) commented Jan 10, 2020:

I ran a quick benchmark against the feeds endpoint, and Lemmy only manages to handle 1.19 requests/second (peertube.social does 34 r/s, and is written in TypeScript). I don't know if this is a problem with feeds in particular or with Lemmy in general, as the other APIs are only available over websocket and are much more complicated to test.

I ran the following using ab.

ab -c 10 -t 10 "https://communism.lemmy.ml/feeds/all.xml?sort=Hot"
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking communism.lemmy.ml (be patient)
Finished 12 requests


Server Software:        nginx
Server Hostname:        communism.lemmy.ml
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Server Temp Key:        X25519 253 bits
TLS Server Name:        communism.lemmy.ml

Document Path:          /feeds/all.xml?sort=Hot
Document Length:        10427 bytes

Concurrency Level:      10
Time taken for tests:   10.084 seconds
Complete requests:      12
Failed requests:        0
Total transferred:      128868 bytes
HTML transferred:       125124 bytes
Requests per second:    1.19 [#/sec] (mean)
Time per request:       8403.400 [ms] (mean)
Time per request:       840.340 [ms] (mean, across all concurrent requests)
Transfer rate:          12.48 [Kbytes/sec] received

Connection Times (ms)
min  mean[+/-sd] median   max
Connect:      179  209  22.3    201     251
Processing:   854 4385 2596.8   4812    7926
Waiting:      854 4384 2597.0   4812    7926
Total:       1033 4594 2606.0   5008    8160

Percentage of the requests served within a certain time (ms)
50%   5008
66%   5799
75%   7382
80%   7382
90%   7912
95%   8160
98%   8160
99%   8160
100%   8160 (longest request)

I took this screenshot on the server at the same time. It is obvious that the database is doing too much work. Maybe we need to add more indexes in the database, or reduce the number of queries.

[Screenshot_20200110_234959: server resource usage during the benchmark]

Edit: Worth noting that I can't reproduce this problem locally, so it is probably related to the database size.

Edit 2: This also means that any Lemmy instance can effectively be DoSed with only about one request per second.
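
One way to pin down which queries are doing the work would be the pg_stat_statements extension; the following is a generic Postgres sketch (nothing Lemmy-specific, and not something assumed to already be enabled on the instance):

-- requires shared_preload_libraries = 'pg_stat_statements' and a server restart
create extension if not exists pg_stat_statements;

-- top queries by total time spent; the column is total_time on Postgres <= 12
-- and total_exec_time on 13+
select calls, total_exec_time, mean_exec_time, left(query, 120) as query
from pg_stat_statements
order by total_exec_time desc
limit 10;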

Nutomic changed the title from "Performance is abysmal" to "Database performance is abysmal" on Jan 10, 2020
Nutomic added the bug label on Jan 10, 2020
Nutomic (Member, Author) commented Jan 10, 2020:

Here is an analysis of the Postgres logs using pgBadger. One thing I noticed is that every query selects a lot of columns and includes a lot of filtering and sorting parameters.

http://nextcloud.nutomic.com/index.php/s/m3j8yKcG6zTR9mC

Edit: Query durations seem to be missing in the log analysis for some reason. The most frequent query (actually the second most frequent, after select 0;) has an execution time of 142 ms on its own. Others are probably similar, and a single request presumably needs far more than one query.
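
If the durations are missing because the server isn't logging them, the standard Postgres settings below could be enabled temporarily so pgBadger picks them up (a generic sketch; whether the instance is configured this way is an assumption):

-- log every statement together with its duration (threshold 0 ms = log everything);
-- takes effect after a configuration reload, no restart needed
alter system set log_min_duration_statement = 0;
select pg_reload_conf();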

dessalines added the enhancement label on Jan 10, 2020
dessalines (Member) commented:

Okay, I've got this big DB imported locally, and I'm getting some similarly bad results.

I'm gonna make a script that generates a bunch of explain files for the most common queries and their output, and puts them in a folder, so they can be analyzed by http://tatiyants.com/pev/

This is the general command it needs:

echo "explain (analyze, format json) select * from user_view" > explain.sql
psql -qAt -U lemmy -f explain.sql > analyze.json

Ideally we'd diff the JSON output before and after creating the indexes.

dessalines (Member) commented:

Okay, yeah, this is pretty substantial. I'm getting roughly 10x better results just by adding some indexes on creator_id for a few of the tables.
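
For illustration, the kind of index involved would look roughly like the following; the table, column, and index names here are assumptions, not the actual migration:

-- illustrative only: plain b-tree indexes on the foreign keys the views join on
create index idx_post_creator_id on post (creator_id);
create index idx_comment_creator_id on comment (creator_id);
create index idx_community_creator_id on community (creator_id);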

dessalines (Member) commented Jan 11, 2020:

Okay I've added some indexes with a commit I'll post here in a second, but here's what just adding a few indexes did:

To test this, run

git fetch
git checkout dessalines/db_indexes
cd server/query_testing
./generate_explain_reports.sh
# check the json files in that folder
cd ..
diesel migration run
cd query_testing
./generate_explain_reports.sh
# check the json files in that folder
# if you want to revert, run: diesel migration revert
query                                                                        before (ms)   after (ms)
select * from user_view                                                             1017          182
select * from post_view where user_id is null order by hot_rank desc                 651          317
select * from comment_view where user_id is null                                     114          114
select * from community_view where user_id is null order by hot_rank desc             94           51

The main thing that needs to be optimized is the post_view (which is the front page), and unfortunately it's still pretty slow because it has to do lots of joins. Before I commit my indexes, I'm gonna see if I can speed that view up at all.

Someone could take a look at the expensive cross joins I'm doing for the views here: https://github.com/dessalines/lemmy/blob/master/server/migrations/2019-12-29-164820_add_avatar/up.sql#L23

They're the only way I could come up with to make everything a single fetch, where you can optionally provide user_id = X (or user_id is null) and get user-specific data like your vote, whether you've saved that post, etc. A rough sketch of the pattern is below.
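
This is a simplified illustration of that pattern, with table and column names guessed rather than taken from the linked migration:

-- simplified sketch: cross join every post with every user id plus a null
-- "logged out" row, then left join the per-user data onto each combination
create view post_view as
select
  p.*,
  u.id as user_id,
  coalesce(pl.score, 0) as my_vote,
  (ps.id is not null) as saved
from post p
cross join (select id from user_ union all select null) u
left join post_like pl on pl.post_id = p.id and pl.user_id = u.id
left join post_saved ps on ps.post_id = p.id and ps.user_id = u.id;

-- callers then filter with either "where user_id = X" or "where user_id is null"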

dessalines (Member) commented:

So I tried to do a LOT of optimizing today, and could only get the post query down to ~200 ms. The issue is basically that the hot_sort function needs both a score (which comes from sum(post_likes)) and a published time, and I can't create an index for that combination, since the scores come from a join of two different tables.

The other options would be to make post_view a materialized view, which would mean the whole caching mess of having to periodically refresh that view, or to move the hot sorting out of SQL, which would be my last resort.
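
For reference, the materialized-view route would look something like this in plain Postgres (names are placeholders, and it assumes (id, user_id) uniquely identifies a row of the view):

-- snapshot the expensive view so reads become a plain indexed scan
create materialized view post_view_fast as
select * from post_view;

-- a unique index is required to refresh concurrently (i.e. without blocking reads)
create unique index idx_post_view_fast on post_view_fast (id, user_id);

-- has to be re-run periodically (e.g. from a scheduled job); this is the
-- caching/refresh burden mentioned above
refresh materialized view concurrently post_view_fast;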

dessalines (Member) commented:
Something weird is going on, maybe with the nginx configs on the servers or something.

I'm testing out adding connection pooling locally, and I'm finding that both before and after adding the proper r2d2 code, I get the same speed with ab: about 800 reqs/s.

When I push to dev.lemmy.ml and test there, I get the same before-and-after results, but it's only 3 reqs/s:

Requests per second:    3.16 [#/sec] (mean)
Time per request:       3169.220 [ms] (mean)
Time per request:       316.922 [ms] (mean, across all concurrent requests)

The only difference is possibly the nginx config.

Nutomic (Member, Author) commented Jan 12, 2020:

Is Postgres still using 90% of the CPU or more while running the benchmark? If so, the queries are still too complex, and we need to cache them or something.

dessalines (Member) commented Jan 12, 2020:

I tried this on my dev server, and Postgres only goes up to 10% usage, but it's still at only 3 reqs per sec.

I'm running the exact same postgres docker image locally too.

dessalines (Member) commented:

I put most of the comments in #411. But we might be able to just close this ticket, since the DB performance is now hundreds of times better, and every major query (front page posts, user searching, comments, and communities) is now < 30 ms.

dessalines (Member) commented:
I'll close now but we can re-open if it becomes an issue. At least we have the performance tests in server/query_testing now.

dessalines (Member) commented:

Re-opening this issue, as the DB is currently the CPU bottleneck. The best way to figure this out is to analyze the longer-running queries as they actually run, and optimize them individually.
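
A starting point for that would be watching what is actually running, using the built-in pg_stat_activity view (generic Postgres, nothing Lemmy-specific):

-- currently running statements, longest first
select now() - query_start as duration, state, left(query, 120) as query
from pg_stat_activity
where state <> 'idle'
order by duration desc
limit 10;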

Nutomic (Member, Author) commented Mar 24, 2022:

Better to make a new issue; this one is already quite long.
