
Scan performance when task and execution tables in DB have more than 300k rows #17538

Closed

AlenversFr opened this issue Sep 12, 2022 · 7 comments

AlenversFr commented Sep 12, 2022

What is the problem?
Since our last upgrade to Harbor version 2.5.3, the scheduled full scan with Trivy takes a huge amount of time.

With version 2.4.x, the full scan completed in 4h to 5h with around 55k images in the registry.
With version 2.5.3, the full scan takes 20h to 30h with the same amount of images.

What is the configuration?
Harbor is running in a Kubernetes cluster (AKS)
Trivy has 2 replicas
Job service has an HPA with 2 to 5 pods based on CPU and RAM
Redis is a StatefulSet with only one pod (memory limit = 2.4 GB)
The database is an Azure Flexible Server for PostgreSQL with 4 vCPUs and 8 GB RAM
The storage account holds around 9 TB

All the other components are deployed with HA config and some with HPA.

What we see
During the full scan, the DB runs at 100% vCPU for 10h from the beginning of the scheduled action.

[screenshot: database CPU utilization]

Redis memory rises steadily to the 2.4 GB limit, taking around 5h to get there.

[screenshot: Redis memory usage]

Here is the scan_image queue:
[screenshot: scan_image queue length]

Investigation on the database with pg_stat_statements

SELECT LEFT(query, 60) AS query_short,
       SUM(calls) AS tot_calls,
       SUM(total_exec_time) AS tot_time,
       SUM(total_exec_time) / SUM(calls) AS avg_time
FROM pg_stat_statements
WHERE dbid = 24831
GROUP BY query_short
ORDER BY tot_time DESC
LIMIT 3;

                              query_short                     | tot_calls |  tot_time  | avg_time 
--------------------------------------------------------------+-----------+------------+----------
 select id from task where extra_attrs->>$1 = $2              |      2154 |    2541720 |     1180 
 SELECT T0."id", T0."cve_id", T0."registration_uuid", T0."pac |      1053 |     595942 |      565 
 INSERT INTO "report_vulnerability_record" ("report_uuid", "v |      3924 |     147030 |       37 

Durations are in milliseconds.

The query over the task table takes a lot of time even though it is a simple SELECT.
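
One way to confirm the plan is to run EXPLAIN on the offending statement. A minimal sketch, where the key 'report_uuids' and the literal value are hypothetical stand-ins for the bind parameters $1 and $2:

-- Hypothetical stand-ins for $1 and $2; without an expression index this
-- reports a sequential scan over the whole task table.
EXPLAIN (ANALYZE, BUFFERS)
select id from task where extra_attrs ->> 'report_uuids' = 'some-uuid';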

task table

select vendor_type, count(*) from task group by vendor_type;

    vendor_type     | count  
--------------------+--------
 GARBAGE_COLLECTION |     51
 IMAGE_SCAN         | 183808
 REPLICATION        |  12129
 RETENTION          |  80421
 SCAN_ALL           | 223754
 SCHEDULER          |    436

The oldest task entry has a start_time of May 2021.
It seems that the task table is never cleaned up.
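
A quick way to reproduce this check per vendor_type (a sketch using the same columns):

-- Oldest task row per vendor_type:
select vendor_type, min(start_time) as oldest_start_time
from task
group by vendor_type;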

execution table
In the data model, 1 execution is linked to n tasks:
When you use the UI to scan an image => 1 execution + 1 task are stored in the DB.
When you launch a scheduled full scan => 1 execution + 55,000 tasks are stored in the DB (our case).
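
For illustration, the 1-to-n relation can be checked directly (a sketch using the column names that appear in the queries elsewhere in this issue):

-- Largest executions by attached task count, illustrating the 1-to-n relation:
select e.id, e.vendor_type, count(t.id) as task_cnt
from execution as e
left join task as t on t.execution_id = e.id
group by e.id, e.vendor_type
order by task_cnt desc
limit 5;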

Once again, we do not see any cleanup of these entries.

A detailed view of the tasks and executions for the scan vendor_types:

WITH task_error AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Error' GROUP BY b.vendor_type, b.execution_id
), task_success AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Success' GROUP BY b.vendor_type, b.execution_id
), task_stopped AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Stopped' GROUP BY b.vendor_type, b.execution_id
)
SELECT a.vendor_type, TO_CHAR(a.start_time, 'YYYY-MM-DD'), a.status,
       COUNT(DISTINCT a.id) AS nb_execution,
       SUM(task_error.cnt) AS task_error_cnt,
       SUM(task_success.cnt) AS task_success_cnt,
       SUM(task_stopped.cnt) AS task_stopped_cnt
FROM execution AS a
LEFT JOIN task_error ON task_error.execution_id = a.id
LEFT JOIN task_success ON task_success.execution_id = a.id
LEFT JOIN task_stopped ON task_stopped.execution_id = a.id
WHERE vendor_type LIKE '%SCAN%'
GROUP BY a.vendor_type, TO_CHAR(a.start_time, 'YYYY-MM-DD'), a.status
ORDER BY TO_CHAR(a.start_time, 'YYYY-MM-DD');

 vendor_type |  to_char   | status  | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt 
-------------+------------+---------+--------------+----------------+------------------+------------------
 IMAGE_SCAN  | 2022-08-01 | Success |          778 |                |              778 |
 IMAGE_SCAN  | 2022-08-02 | Error   |            1 |              1 |                  |
 IMAGE_SCAN  | 2022-08-02 | Success |          725 |                |              725 |
 IMAGE_SCAN  | 2022-08-03 | Success |          737 |                |              737 |
 IMAGE_SCAN  | 2022-08-04 | Error   |            2 |              2 |                  |
 IMAGE_SCAN  | 2022-08-04 | Success |          745 |                |              745 |
 IMAGE_SCAN  | 2022-08-05 | Error   |            1 |              1 |                  |
 IMAGE_SCAN  | 2022-08-05 | Success |          629 |                |              629 |
 IMAGE_SCAN  | 2022-08-06 | Success |           87 |                |               87 |
 IMAGE_SCAN  | 2022-08-07 | Success |           42 |                |               42 |
 SCAN_ALL    | 2022-08-07 | Error   |            1 |             12 |            20679 |
 IMAGE_SCAN  | 2022-08-08 | Success |          544 |                |              544 |
 IMAGE_SCAN  | 2022-08-09 | Success |          574 |                |              574 |
 IMAGE_SCAN  | 2022-08-10 | Success |          599 |                |              599 |
 IMAGE_SCAN  | 2022-08-11 | Success |          554 |                |              554 |
 IMAGE_SCAN  | 2022-08-12 | Success |          678 |                |              678 |
 IMAGE_SCAN  | 2022-08-13 | Success |           70 |                |               70 |
 IMAGE_SCAN  | 2022-08-14 | Success |           39 |                |               39 |
 SCAN_ALL    | 2022-08-14 | Error   |            1 |             14 |            21209 |
 IMAGE_SCAN  | 2022-08-15 | Success |           35 |                |               35 |
 IMAGE_SCAN  | 2022-08-16 | Error   |            1 |              1 |                  |
 IMAGE_SCAN  | 2022-08-16 | Success |          452 |                |              452 |
 IMAGE_SCAN  | 2022-08-17 | Success |          548 |                |              548 |
 IMAGE_SCAN  | 2022-08-18 | Success |          620 |                |              620 |
 IMAGE_SCAN  | 2022-08-19 | Error   |            2 |              2 |                  |
 IMAGE_SCAN  | 2022-08-19 | Success |          701 |                |              701 |
 IMAGE_SCAN  | 2022-08-20 | Success |           74 |                |               74 |
 IMAGE_SCAN  | 2022-08-21 | Error   |           28 |             28 |                  |
 IMAGE_SCAN  | 2022-08-21 | Success |           45 |                |               45 |
 SCAN_ALL    | 2022-08-21 | Error   |            1 |            885 |             7657 |            43895
 IMAGE_SCAN  | 2022-08-22 | Error   |          604 |            604 |                  |
 IMAGE_SCAN  | 2022-08-22 | Stopped |            1 |                |                  |                1
 IMAGE_SCAN  | 2022-08-22 | Success |          129 |                |              129 |
 IMAGE_SCAN  | 2022-08-23 | Error   |           11 |             11 |                  |
 IMAGE_SCAN  | 2022-08-23 | Success |          917 |                |              917 |
 IMAGE_SCAN  | 2022-08-24 | Success |          748 |                |              748 |
 IMAGE_SCAN  | 2022-08-25 | Success |          654 |                |              654 |
 IMAGE_SCAN  | 2022-08-26 | Success |          535 |                |              535 |
 IMAGE_SCAN  | 2022-08-27 | Success |           42 |                |               42 |
 IMAGE_SCAN  | 2022-08-28 | Error   |           10 |             10 |                  |
 IMAGE_SCAN  | 2022-08-28 | Success |           39 |                |               39 |
 SCAN_ALL    | 2022-08-28 | Error   |            1 |             52 |            53508 |
 IMAGE_SCAN  | 2022-08-29 | Success |          739 |                |              739 |
 IMAGE_SCAN  | 2022-08-30 | Success |          637 |                |              637 |
 IMAGE_SCAN  | 2022-08-31 | Success |          626 |                |              626 |
 IMAGE_SCAN  | 2022-09-01 | Success |          620 |                |              620 |
 IMAGE_SCAN  | 2022-09-02 | Success |          814 |                |              814 |
 IMAGE_SCAN  | 2022-09-03 | Success |           86 |                |               86 |
 IMAGE_SCAN  | 2022-09-04 | Error   |           12 |             12 |                  |
 IMAGE_SCAN  | 2022-09-04 | Success |           73 |                |               73 |
 SCAN_ALL    | 2022-09-04 | Error   |            1 |            210 |            53937 |
 IMAGE_SCAN  | 2022-09-05 | Error   |            5 |              5 |                  |
 IMAGE_SCAN  | 2022-09-05 | Success |          700 |                |              700 |
 IMAGE_SCAN  | 2022-09-06 | Success |          624 |                |              624 |
 IMAGE_SCAN  | 2022-09-07 | Success |          858 |                |              858 |
 IMAGE_SCAN  | 2022-09-08 | Error   |            1 |              1 |                  |
 IMAGE_SCAN  | 2022-09-08 | Success |          891 |                |              891 |
 IMAGE_SCAN  | 2022-09-09 | Success |          568 |                |              568 |
 IMAGE_SCAN  | 2022-09-10 | Success |           87 |                |               87 |
 IMAGE_SCAN  | 2022-09-11 | Error   |            2 |              2 |                  |
 IMAGE_SCAN  | 2022-09-11 | Success |           79 |                |               79 |
 SCAN_ALL    | 2022-09-11 | Error   |            1 |            257 |            54649 |
 IMAGE_SCAN  | 2022-09-12 | Success |          396 |                |              396 |

What is the issue?
When Harbor hosts a large number of images, the full scan adds a lot of entries to the task table. After 8 full scans, the task table has around 400k rows (our case), resulting in poor query performance.

For now we do not want to increase the CPU/RAM capacity of the DB, because it would double the cost of Azure PostgreSQL Flexible Server.

What are the questions?

  1. It is not clear how cleanup is done in the task and execution tables. Where can we find this information for the different vendor_types?

  2. If there is a cleanup mechanism, how can we check that it is working properly? Are there logs somewhere?

  3. It is not clear what Harbor does with the older entries in the task and execution tables. Is it safe to clear them? Is it safe to clear all the different vendor_types (see the query result above)?

  4. We did not check whether previous versions of Harbor (e.g. 2.4.x) issued significantly different database queries. Is this related to the Trivy adapter version bump?

Any help appreciated ^^

@wy65701436
Contributor

Hi @AlenversFr, thanks for reporting the performance issue and the details. We will take time to investigate it.

@AlenversFr
Author

The poor performance comes from this simple query: "select id from task where extra_attrs->>$1 = $2". There is no index on the extra_attrs column, which holds JSON.

As a result, the query is executed with a sequential scan, which becomes quite inefficient as the task table grows.

2 actions needed:

  • try to optimize the query by adding a JSON index (I tried a B-tree index, but it was not effective) or by adding other criteria to the WHERE clause to narrow the sequential scan (vendor_type and/or creation_time); see the sketch after this list
  • review the cleanup mechanism of the task and execution tables, because it appears to be broken
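
A minimal sketch of the index option, assuming the key actually queried is 'report_uuids' (an assumption; the real key depends on the code path):

-- Sketch only, not a fix shipped by Harbor. Because the slow query binds the
-- JSON key itself as a parameter ($1), a generic index on the column cannot
-- help; an expression index works, but only for one concrete key.
-- 'report_uuids' is an assumed example key:
create index if not exists idx_task_extra_attrs_report
    on task ((extra_attrs ->> 'report_uuids'));

-- Verify that the planner switches from a sequential scan to the index:
explain select id from task where extra_attrs ->> 'report_uuids' = 'some-uuid';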

We now have a weekly cleanup cron job that executes the following queries.

SCAN_ALL --> flush all of these tasks because they are not used in the UI:
delete from task where vendor_type='SCAN_ALL' and status not in('Pending', 'Scheduled', 'Running');

IMAGE_SCAN --> keep 14 days of tasks to match the job_log config
RETENTION --> keep 14 days of tasks to match the job_log config

delete from task where vendor_type in('IMAGE_SCAN', 'RETENTION') AND status not in('Pending', 'Scheduled', 'Running') AND creation_time< NOW() - INTERVAL '14 DAY';

REPLICATION --> keep only the last one executed

delete from task
where vendor_type='REPLICATION'
  and status not in('Pending', 'Scheduled', 'Running')
  and ((extra_attrs::json ->> 'destination_resource'), creation_time) not in (
      select (extra_attrs::json ->> 'destination_resource'), max(creation_time)
      from task
      group by (extra_attrs::json ->> 'destination_resource')
  );

Then delete the dependent rows in the execution and job_log tables:
delete from execution where execution.id not in (select execution_id from task);
delete from job_log where job_uuid not in (select job_id from task);
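
A hedged addition worth pairing with deletes of this size (our suggestion, not part of the original job): PostgreSQL only marks deleted rows as dead, so a manual VACUUM reclaims the space for reuse and refreshes the planner statistics:

-- Reclaim dead tuples and refresh statistics after the bulk deletes:
vacuum (analyze) task;
vacuum (analyze) execution;
vacuum (analyze) job_log;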

@github-actions
github-actions bot commented Jan 3, 2023

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jan 3, 2023
@z0rc

z0rc commented Jan 4, 2023

Not stale.

@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Mar 18, 2023
@dkulchinsky
Contributor

not stale

@github-actions github-actions bot removed the Stale label Mar 19, 2023
@chlins
Member

chlins commented Mar 22, 2023

In version 2.8, we made the following improvements:
We refactored the logic for cleaning up execution records, replacing the unreliable goroutine implementation with a scheduled job that runs once per hour. This can be viewed and managed through the jobservice dashboard, including the cleanup logs. Additionally, we added an implementation to fix the status of dangling executions: Harbor will periodically move executions that have been in the Running state for one week or more to a final state, for easier cleanup by the sweep job.
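
For reference, a sketch of how such dangling executions could be inspected by hand (column names taken from the queries earlier in this issue; the exact criteria Harbor uses may differ):

-- Executions stuck in 'Running' for one week or more:
select id, vendor_type, status, start_time
from execution
where status = 'Running'
  and start_time < NOW() - INTERVAL '7 DAY';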

The related PRs:

Closing the issue per the commits above; feel free to reopen or open a new issue if the problem still exists after upgrading to v2.8. Thanks.

@chlins chlins closed this as completed Mar 22, 2023