
Scan performance when task and execution tables in DB have more than 300k rows #17538

Closed

AlenversFr opened this issue Sep 12, 2022 · 7 comments

AlenversFr commented Sep 12, 2022

What is the problem?
Since our last upgrade to Harbor version 2.5.3, the scheduled full scan with Trivy takes a huge amount of time.

With version 2.4.x, the full scan completed in 4h to 5h with around 55k images in the registry.
With version 2.5.3, the full scan takes 20h to 30h with the same amount of images.

What is the configuration?
Harbor is running in a Kubernetes cluster (AKS)
Trivy has 2 replicas
Job service has an HPA with 2 to 5 pods based on CPU and RAM
Redis is a StatefulSet with only one pod (memory limit = 2.4 GB)
The database is an Azure Flexible Server for PostgreSQL with 4 vCPUs and 8 GB RAM
The storage account holds around 9 TB

All the other components are deployed with HA config and some with HPA.

What we see
During the full scan, the DB runs at 100% vCPU for 10h from the beginning of the scheduled action.

[screenshot: database CPU utilization]

Redis memory rises steadily to the 2.4 GB limit, taking around 5h to get there.

[screenshot: Redis memory usage]

Here is the scan_image queue:
[screenshot: scan_image queue length]

Investigation on the database with pg_stat_statements

SELECT LEFT(query, 60) AS query_short,
       SUM(calls) AS tot_calls,
       SUM(total_exec_time) AS tot_time,
       SUM(total_exec_time) / SUM(calls) AS avg_time
FROM pg_stat_statements
WHERE dbid = 24831
GROUP BY query_short
ORDER BY tot_time DESC
LIMIT 3;

                              query_short                     | tot_calls |  tot_time  | avg_time 
--------------------------------------------------------------+-----------+------------+----------
 select id from task where extra_attrs->>$1 = $2              |      2154 |    2541720 |     1180 
 SELECT T0."id", T0."cve_id", T0."registration_uuid", T0."pac |      1053 |     595942 |      565 
 INSERT INTO "report_vulnerability_record" ("report_uuid", "v |      3924 |     147030 |       37 

Durations are in milliseconds.

The query over the task table takes a lot of time even though it is a simple SELECT.
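
One way to confirm the plan is to run EXPLAIN on the offending statement. A minimal sketch, where the key 'report_uuids' and the literal value are hypothetical stand-ins for the bind parameters $1 and $2:

-- Hypothetical stand-ins for $1 and $2; without an expression index this
-- reports a sequential scan over the whole task table.
EXPLAIN (ANALYZE, BUFFERS)
select id from task where extra_attrs ->> 'report_uuids' = 'some-uuid';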

task table

select vendor_type, count(*) from task group by vendor_type;

    vendor_type     | count  
--------------------+--------
 GARBAGE_COLLECTION |     51
 IMAGE_SCAN         | 183808
 REPLICATION        |  12129
 RETENTION          |  80421
 SCAN_ALL           | 223754
 SCHEDULER          |    436

The oldest task entry has a start_time of May 2021.
It seems that the task table is never cleaned up.
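
A quick way to reproduce this check per vendor_type (a sketch using the same columns):

-- Oldest task row per vendor_type:
select vendor_type, min(start_time) as oldest_start_time
from task
group by vendor_type;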

execution table
In the data model, 1 execution is linked to n tasks:
When you use the UI to scan an image => 1 execution + 1 task are stored in the DB.
When you launch a scheduled full scan => 1 execution + 55,000 tasks are stored in the DB (our case).
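
For illustration, the 1-to-n relation can be checked directly (a sketch using the column names that appear in the queries elsewhere in this issue):

-- Largest executions by attached task count, illustrating the 1-to-n relation:
select e.id, e.vendor_type, count(t.id) as task_cnt
from execution as e
left join task as t on t.execution_id = e.id
group by e.id, e.vendor_type
order by task_cnt desc
limit 5;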

Once again, we do not see any cleanup of these entries.

A detailed view of the tasks and executions for the scan vendor_types:

WITH task_error AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Error' GROUP BY b.vendor_type, b.execution_id
), task_success AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Success' GROUP BY b.vendor_type, b.execution_id
), task_stopped AS (
    SELECT b.execution_id, COUNT(b.id) AS cnt FROM task AS b WHERE b.status = 'Stopped' GROUP BY b.vendor_type, b.execution_id
)
SELECT a.vendor_type, TO_CHAR(a.start_time, 'YYYY-MM-DD'), a.status,
       COUNT(DISTINCT a.id) AS nb_execution,
       SUM(task_error.cnt) AS task_error_cnt,
       SUM(task_success.cnt) AS task_success_cnt,
       SUM(task_stopped.cnt) AS task_stopped_cnt
FROM execution AS a
LEFT JOIN task_error ON task_error.execution_id = a.id
LEFT JOIN task_success ON task_success.execution_id = a.id
LEFT JOIN task_stopped ON task_stopped.execution_id = a.id
WHERE vendor_type LIKE '%SCAN%'
GROUP BY a.vendor_type, TO_CHAR(a.start_time, 'YYYY-MM-DD'), a.status
ORDER BY TO_CHAR(a.start_time, 'YYYY-MM-DD');

 vendor_type |  to_char   | status  | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt 
-------------+------------+---------+--------------+----------------+------------------+------------------
 IMAGE_SCAN  | 2022-08-01 | Success |          778 |                |              778 |
 IMAGE_SCAN  | 2022-08-02 | Error   |            1 |              1 |                  |
 IMAGE_SCAN  | 2022-08-02 | Success |          725 |                |              725 |
 IMAGE_SCAN  | 2022-08-03 | Success |          737 |                |              737 |
 IMAGE_SCAN  | 2022-08-04 | Error   |            2 |              2 |                  |
 IMAGE_SCAN  | 2022-08-04 | Success |          745 |                |              745 |
 IMAGE_SCAN  | 2022-08-05 | Error   |            1 |              1 |                  |
 IMAGE_SCAN  | 2022-08-05 | Success |          629 |                |              629 |
 IMAGE_SCAN  | 2022-08-06 | Success |           87 |                |               87 |
 IMAGE_SCAN  | 2022-08-07 | Success |           42 |                |               42 |
 SCAN_ALL    | 2022-08-07 | Error   |            1 |             12 |            20679 |
 IMAGE_SCAN  | 2022-08-08 | Success |          544 |                |              544 |
 IMAGE_SCAN  | 2022-08-09 | Success |          574 |                |              574 |
 IMAGE_SCAN  | 2022-08-10 | Success |          599 |                |              599 |
 IMAGE_SCAN  | 2022-08-11 | Success |          554 |                |              554 |
 IMAGE_SCAN  | 2022-08-12 | Success |          678 |                |              678 |
 IMAGE_SCAN  | 2022-08-13 | Success |           70 |                |               70 |
 IMAGE_SCAN  | 2022-08-14 | Success |           39 |                |               39 |
 SCAN_ALL    | 2022-08-14 | Error   |            1 |             14 |            21209 |
 IMAGE_SCAN  | 2022-08-15 | Success |           35 |                |               35 |
 IMAGE_SCAN  | 2022-08-16 | Error   |            1 |              1 |                  |
 IMAGE_SCAN  | 2022-08-16 | Success |          452 |                |              452 |
 IMAGE_SCAN  | 2022-08-17 | Success |          548 |                |              548 |
 IMAGE_SCAN  | 2022-08-18 | Success |          620 |                |              620 |
 IMAGE_SCAN  | 2022-08-19 | Error   |            2 |              2 |                  |
 IMAGE_SCAN  | 2022-08-19 | Success |          701 |                |              701 |
 IMAGE_SCAN  | 2022-08-20 | Success |           74 |                |               74 |
 IMAGE_SCAN  | 2022-08-21 | Error   |           28 |             28 |                  |
 IMAGE_SCAN  | 2022-08-21 | Success |           45 |                |               45 |
 SCAN_ALL    | 2022-08-21 | Error   |            1 |            885 |             7657 |            43895
 IMAGE_SCAN  | 2022-08-22 | Error   |          604 |            604 |                  |
 IMAGE_SCAN  | 2022-08-22 | Stopped |            1 |                |                  |                1
 IMAGE_SCAN  | 2022-08-22 | Success |          129 |                |              129 |
 IMAGE_SCAN  | 2022-08-23 | Error   |           11 |             11 |                  |
 IMAGE_SCAN  | 2022-08-23 | Success |          917 |                |              917 |
 IMAGE_SCAN  | 2022-08-24 | Success |          748 |                |              748 |
 IMAGE_SCAN  | 2022-08-25 | Success |          654 |                |              654 |
 IMAGE_SCAN  | 2022-08-26 | Success |          535 |                |              535 |
 IMAGE_SCAN  | 2022-08-27 | Success |           42 |                |               42 |
 IMAGE_SCAN  | 2022-08-28 | Error   |           10 |             10 |                  |
 IMAGE_SCAN  | 2022-08-28 | Success |           39 |                |               39 |
 SCAN_ALL    | 2022-08-28 | Error   |            1 |             52 |            53508 |
 IMAGE_SCAN  | 2022-08-29 | Success |          739 |                |              739 |
 IMAGE_SCAN  | 2022-08-30 | Success |          637 |                |              637 |
 IMAGE_SCAN  | 2022-08-31 | Success |          626 |                |              626 |
 IMAGE_SCAN  | 2022-09-01 | Success |          620 |                |              620 |
 IMAGE_SCAN  | 2022-09-02 | Success |          814 |                |              814 |
 IMAGE_SCAN  | 2022-09-03 | Success |           86 |                |               86 |
 IMAGE_SCAN  | 2022-09-04 | Error   |           12 |             12 |                  |
 IMAGE_SCAN  | 2022-09-04 | Success |           73 |                |               73 |
 SCAN_ALL    | 2022-09-04 | Error   |            1 |            210 |            53937 |
 IMAGE_SCAN  | 2022-09-05 | Error   |            5 |              5 |                  |
 IMAGE_SCAN  | 2022-09-05 | Success |          700 |                |              700 |
 IMAGE_SCAN  | 2022-09-06 | Success |          624 |                |              624 |
 IMAGE_SCAN  | 2022-09-07 | Success |          858 |                |              858 |
 IMAGE_SCAN  | 2022-09-08 | Error   |            1 |              1 |                  |
 IMAGE_SCAN  | 2022-09-08 | Success |          891 |                |              891 |
 IMAGE_SCAN  | 2022-09-09 | Success |          568 |                |              568 |
 IMAGE_SCAN  | 2022-09-10 | Success |           87 |                |               87 |
 IMAGE_SCAN  | 2022-09-11 | Error   |            2 |              2 |                  |
 IMAGE_SCAN  | 2022-09-11 | Success |           79 |                |               79 |
 SCAN_ALL    | 2022-09-11 | Error   |            1 |            257 |            54649 |
 IMAGE_SCAN  | 2022-09-12 | Success |          396 |                |              396 |

What is the issue?
When Harbor hosts a large number of images, the full scan adds a lot of entries to the task table. After 8 full scans, the task table has around 400k rows (our case), resulting in poor query performance.

For now we do not want to increase the CPU/RAM capacity of the DB, because it would double the cost of Azure PostgreSQL Flexible Server.

What are the questions?

  1. It is not clear how cleanup is done in the task and execution tables. Where can we find this information for the different vendor_types?

  2. If there is a cleanup mechanism, how can we check that it is working properly? Are there logs somewhere?

  3. It is not clear what Harbor does with the older entries in the task and execution tables. Is it safe to clear them? Is it safe to clear all the different vendor_types (see the query result above)?

  4. We did not check whether previous versions of Harbor (e.g. 2.4.x) issued significantly different database queries. Is this related to the Trivy adapter version bump?

Any help appreciated ^^

@wy65701436
Contributor

Hi @AlenversFr, thanks for reporting the performance issue and the details. We will take time to investigate it.

@AlenversFr
Author

The poor performance comes from this simple query: "select id from task where extra_attrs->>$1 = $2". There is no index on the extra_attrs column, which holds JSON.

As a result, the query is executed with a sequential scan, which becomes quite inefficient as the task table grows.

2 actions needed:

  • try to optimize the query by adding a JSON index (I tried a B-tree index, but it was not effective) or by adding other criteria to the WHERE clause to narrow the sequential scan (vendor_type and/or creation_time); see the sketch after this list
  • review the cleanup mechanism of the task and execution tables, because it appears to be broken
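
A minimal sketch of the index option, assuming the key actually queried is 'report_uuids' (an assumption; the real key depends on the code path):

-- Sketch only, not a fix shipped by Harbor. Because the slow query binds the
-- JSON key itself as a parameter ($1), a generic index on the column cannot
-- help; an expression index works, but only for one concrete key.
-- 'report_uuids' is an assumed example key:
create index if not exists idx_task_extra_attrs_report
    on task ((extra_attrs ->> 'report_uuids'));

-- Verify that the planner switches from a sequential scan to the index:
explain select id from task where extra_attrs ->> 'report_uuids' = 'some-uuid';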

We now have a weekly cleanup cron job that executes the following queries.

SCAN_ALL --> flush all of these tasks because they are not used in the UI:
delete from task where vendor_type='SCAN_ALL' and status not in('Pending', 'Scheduled', 'Running');

IMAGE_SCAN --> keep 14 days of tasks to match the job_log config
RETENTION --> keep 14 days of tasks to match the job_log config

delete from task where vendor_type in('IMAGE_SCAN', 'RETENTION') AND status not in('Pending', 'Scheduled', 'Running') AND creation_time< NOW() - INTERVAL '14 DAY';

REPLICATION --> keep only the last one executed

delete from task
where vendor_type='REPLICATION'
  and status not in('Pending', 'Scheduled', 'Running')
  and ((extra_attrs::json ->> 'destination_resource'), creation_time) not in (
      select (extra_attrs::json ->> 'destination_resource'), max(creation_time)
      from task
      group by (extra_attrs::json ->> 'destination_resource')
  );

Then delete the dependent rows in the execution and job_log tables:
delete from execution where execution.id not in (select execution_id from task);
delete from job_log where job_uuid not in (select job_id from task);
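
A hedged addition worth pairing with deletes of this size (our suggestion, not part of the original job): PostgreSQL only marks deleted rows as dead, so a manual VACUUM reclaims the space for reuse and refreshes the planner statistics:

-- Reclaim dead tuples and refresh statistics after the bulk deletes:
vacuum (analyze) task;
vacuum (analyze) execution;
vacuum (analyze) job_log;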

@github-actions
github-actions bot commented Jan 3, 2023

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jan 3, 2023
@z0rc

z0rc commented Jan 4, 2023

Not stale.

@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Mar 18, 2023
@dkulchinsky
Contributor

not stale

@github-actions github-actions bot removed the Stale label Mar 19, 2023
@chlins
Member

chlins commented Mar 22, 2023

In version 2.8, we made the following improvements:
We refactored the logic for cleaning up execution records, replacing the unreliable goroutine implementation with a scheduled job that runs once per hour. This can be viewed and managed through the jobservice dashboard, including the cleanup logs. Additionally, we added an implementation to fix the status of dangling executions: Harbor will periodically move executions that have been in the Running state for one week or more to a final state, for easier cleanup by the sweep job.
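
For reference, a sketch of how such dangling executions could be inspected by hand (column names taken from the queries earlier in this issue; the exact criteria Harbor uses may differ):

-- Executions stuck in 'Running' for one week or more:
select id, vendor_type, status, start_time
from execution
where status = 'Running'
  and start_time < NOW() - INTERVAL '7 DAY';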

The related PRs:

Closing the issue per the commits above; feel free to reopen or open a new issue if the problem still exists after upgrading to v2.8. Thanks.

@chlins chlins closed this as completed Mar 22, 2023