Scan performance when task and execution tables in DB have more than 300k rows #17538
Comments
Hi @AlenversFr, thanks for reporting the performance issue and the details. We will take time to investigate it.
The poor performance is related to this simple query, "select id from task where extra_attrs->>$1 = $2": there is no index on the extra_attrs column, which is in JSON format. As a result, the query is executed with a sequential scan and becomes quite inefficient as the task table grows. Two actions are needed:
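As an illustration of the missing index (a minimal sketch only, assuming PostgreSQL and that the hot query always filters on one specific extra_attrs key; the key name 'report_uuids' below is purely an example, not taken from the Harbor schema):

-- Sketch: an expression index on the key used by the hot query lets the planner
-- avoid scanning the whole task table for that lookup.
create index concurrently idx_task_extra_attrs_report_uuids
    on task ((extra_attrs ->> 'report_uuids'));

Because the key is bound as a parameter ($1), an expression index like this only helps queries that use that exact key.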
We now have a weekly cleaning cron job that executes these queries:
IMAGE SCAN --> keep 14 days of tasks, because that matches the job_log retention config.
REPLICATION --> keep only the last execution per destination:
delete from task where vendor_type='REPLICATION' and status not in ('Pending', 'Scheduled', 'Running') and ((extra_attrs::json ->> 'destination_resource'), creation_time) not in (select (extra_attrs::json ->> 'destination_resource'), max(creation_time) from task group by (extra_attrs::json ->> 'destination_resource'));
Then we delete the dependent rows in the execution and job_log tables.
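For reference, a rough sketch of what that follow-up cleanup could look like; the column names used here (task.execution_id, job_log.creation_time) are assumptions about the Harbor schema, not verified definitions:

-- Assumption: task.execution_id references execution.id, so an execution with no
-- remaining tasks of this vendor_type can be dropped.
delete from execution e
where e.vendor_type = 'REPLICATION'
  and not exists (select 1 from task t where t.execution_id = e.id);

-- Assumption: job_log rows carry a creation_time, so trimming to the same 14-day
-- window keeps the log table consistent with the task cleanup above.
delete from job_log where creation_time < now() - interval '14 days';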
This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.
Not stale.
This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.
not stale |
In version 2.8, we made the following improvements (see the related PRs). Closing the issue as per the above commits; feel free to reopen or open a new issue if the problem still exists after upgrading to v2.8, thanks.
What is the problem?
Since our last upgrade to Harbor version 2.5.3, the scheduled full scan with trivy takes a huge amount of time.
With version 2.4.x the full scan operation was executed in 4h to 5h with around 55k images in the registry.
With version 2.5.3 the full scan takes 20h to 30h to be executed with the same amount of images.
What is the configuration?
Harbor is running in a Kubernetes cluster (AKS)
Trivy has 2 replicas
Job service has HPA with 2 to 5 pods based on CPU and RAM.
Redis is a statefulset with only one pod (memory limit = 2.4 GB)
Database is an Azure flexible PostgreSQL server with 4 vCPU and 8 GB RAM
Storage account is around 9 TB
All the other components are deployed with HA config and some with HPA.
What we see
During the full scan, the DB sits at 100% vCPU for 10h at the beginning of the scheduled action.
Redis memory rises gradually to its 2.4 GB limit, taking around 5h to get there.
Here is the scan_image queue
Investigation on the database with pg_stat_statements
SELECT LEFT(query,60) as query_short, SUM(calls) as tot_calls, SUM(total_exec_time) as tot_time, SUM(total_exec_time)/SUM(calls) as avg_time FROM pg_stat_statements where dbid=24831 GROUP BY query_short ORDER BY tot_time DESC limit 3;
Durations are in milliseconds.
The query over the task table takes a lot of time, even though it is a simple select.
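To see where that time goes, here is a quick sketch of how one could inspect the plan for that select; the key and value below are placeholders for whatever the application binds as $1/$2:

explain (analyze, buffers)
select id from task where extra_attrs ->> 'some_key' = 'some_value';
-- With ~400k rows and no index on extra_attrs, this is expected to show a Seq Scan
-- over the whole task table, matching the pg_stat_statements timings.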
task table
select vendor_type, count(*) from task group by vendor_type;
The oldest task entry has a start_time in May 2021.
It seems that the task table is somehow never cleaned up.
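A simple way to quantify this (a sketch using only the vendor_type and start_time columns already referenced above):

select vendor_type, min(start_time) as oldest, max(start_time) as newest, count(*) as row_count
from task
group by vendor_type
order by row_count desc;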
execution table
In the data model, 1 execution is linked with n tasks.
When you use the UI to scan an image => 1 execution + 1 task are stored in the DB.
When you launch a scheduled full scan => 1 execution + 55000 tasks are stored in the DB (our case).
Once again, we do not see any cleaning of these entries.
A detailed view on the tasks and execution for the scans vendor_type:
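For reference, a sketch of the kind of query that produces such a view; the join column (task.execution_id) and the vendor_type values are assumptions, not verified schema names:

select e.id as execution_id, e.vendor_type, e.start_time, count(t.id) as task_count
from execution e
left join task t on t.execution_id = e.id
where e.vendor_type in ('IMAGE_SCAN', 'SCAN_ALL')
group by e.id, e.vendor_type, e.start_time
order by e.start_time desc;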
What is the issue?
When Harbor hosts a large amount of images, the full scan adds a lot of entries to the task table. After 8 full scans the task table has around 400k rows in it (our case), resulting in poor query performance.
For now we do not want to upgrade the CPU/RAM capacity of the DB because it would double the cost of the PostgreSQL flexible server.
What are the questions?
It is not clear how the cleaning is done in the task and execution tables; where can we find this information for the different vendor_type values?
If there is a cleaning mechanism, how can we check that it is working properly? Are there logs somewhere?
It is not clear what Harbor does with the previous entries in the task and execution tables. Is it safe to clear them? Is it safe to clear all the different vendor_type values (see query result above)?
We did not check with previous versions of Harbor (e.g. 2.4.x) whether there are important changes in the database queries; is this related to the bumped version of the Trivy adapter?
Any help appreciated ^^