Sizing Calculator: Estimating Size projections for Table with existing data #7468
Comments
Proper SQL
|
Using TPCH SF10 and SF100:
[1]+ Running nohup cockroach workload fixtures import tpch --scale-factor=100 &
The check for expected rows is only supported with scale factor 1, so it was disabled for SF100.
SELECT cnt, WITH Time: 33.377451296s root@:26257/defaultdb> select 239977037122/1024/1024/1024
|
@knz @tbg @jordanlewis we had another idea, discussed on cockroachdb/cockroach#20712: using crdb_internal.ranges_no_leases seems to provide a close match to the actual size. |
The exploration above is interesting. Could you give a summary of the final numbers? There's a little too much going on for me to grasp them at a glance. |
I'm a little confused here. In the UI, we use the |
How can we make ApproximateSize accessible via SQL? That could satisfy a couple of other customer requests.
On Jun 3, 2020, at 8:34 AM, Tobias Grieger ***@***.***> wrote:
I'm not sure the ApproximateSize RPC is accessible via SQL. That, however, could be changed.
|
I have not validated the UI numbers in all test cases. I am using this method to approximate, as there is no other method available. If ApproximateSize becomes available at a later date, I can look into that.
… On Jun 3, 2020, at 8:34 AM, Tobias Grieger ***@***.***> wrote:
I'm a little confused here. ranges_no_leases has no info on the range size. The range_stats function returns the MVCCSize, which has no good correlation with the actual on-disk size (MVCCSize does not take into account replication nor compression nor LSM overhead). I would expect that to significantly over- or undershoot, depending on the situation.
In the UI, we use the ApproximateSize method on the storage engine and sum that up. Was that found to be lacking? It is an approximation as the name suggests, but comes from the LSM (storage engine), which is generally supposed to have the best idea of how much data it's actually using.
Note that in your experiments, where you likely imported data and measured, the numbers may work out very differently vs. having the dataset grown organically, because the resulting LSM structure will be different.
I'm not sure the ApproximateSize RPC is accessible via SQL. That, however, could be changed. For now I'd just be curious if the numbers from the admin ui (which should come from that method) make enough sense.
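To make the caveat above concrete, here is a rough sketch of why a logical (MVCC) byte count can land far from the on-disk footprint. Everything in it is an assumption for illustration: the function name `rough_on_disk_bytes`, the multiplicative model, and the compression and LSM factors are hypothetical, not measured values and not how ApproximateSize is computed. Only the logical size is taken from this thread.

```python
def rough_on_disk_bytes(logical_bytes, replication_factor, compression_ratio, lsm_overhead):
    """Very rough on-disk estimate from a logical (MVCC) size.

    Assumed model (illustrative only):
      logical bytes x number of replicas x compressed fraction x LSM overhead.
    """
    return logical_bytes * replication_factor * compression_ratio * lsm_overhead


# Logical size reported for ycsb1000000.usertable in this thread.
logical = 553375965

# Hypothetical factors: 3 replicas, data compressing to half its size,
# 10% extra space held by the LSM. None of these were measured here.
estimate = rough_on_disk_bytes(logical, replication_factor=3,
                               compression_ratio=0.5, lsm_overhead=1.1)
print(f"logical={logical} rough_on_disk={estimate:.0f}")
```

Depending on which factors dominate, the logical number can over- or undershoot, which is why comparing against the admin UI figures is worthwhile.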
|
I'll defer to the SQL team on how that would be exposed, but it amounts to hitting this endpoint: |
Actually it is exposed via http: http://127.0.0.1:xxx/_status/span though it is awkward to use due to the need to post KV-encoded spans. The below is for the entire keyspace. The end_key below is
Note that we get
|
Thanks @tbg , I can look into this. I also noticed cockroachdb/cockroach#20712 (comment) so it seems it is in backlog. |
I don't think anyone is really looking at this right now, it's worth reaching out to SQL PM if this is a big win. It should be relatively easy (mod the usual bike shedding) |
FWIW there's two separate steps that can be looked at:
|
What is required to make the RPC-over-SQL interface a reality?
… FWIW there's two separate steps that can be looked at:
the "current" approach is to add a virtual table that exposes that information
a better/next approach will be to make RPCs universally reachable via SQL, e.g. via a RPC-over-SQL interface (which doesn't exist yet)
|
effort! it's one of the lower priority items on the 20.2 roadmap. But it's on the radar. We really don't like opening the main RPC port to clients and would rather have all operational traffic go either over HTTP or SQL. |
Sounds like we should wait for engineering to create a more official avenue for using current capacity and row count to estimate growth. Leaving this on backlog until that happens. @taroface, @piyush-singh, I think this falls under deployment and ops. |
Relates to #7818. |
This looks like a SQL issue cc @awoods187 |
Jesse Seldess (jseldess) commented: |
Andrew Deally (drewdeally) commented:
size projection SQL
WITH
table_rows
AS (
SELECT
count(1) AS cnt
FROM
ycsb1000000.usertable
),
table_size
AS (
SELECT
sum(
(
crdb_internal.range_stats(
start_key
)->>'key_bytes'
)::INT8
+ (
crdb_internal.range_stats(
start_key
)->>'val_bytes'
)::INT8
)
AS size
FROM
crdb_internal.ranges_no_leases
WHERE
table_name = 'usertable'
AND database_name = 'ycsb1000000'
GROUP BY
database_name, table_name
)
SELECT
cnt,
size,
size / cnt AS size_per_row,
(size / cnt * 10000000) AS estamted_size_10000000,
size / cnt * cnt AS validate
FROM
table_rows, table_size;
cnt | size | size_per_row | estamted_size_10000000 | validate
----------+-----------+--------------+------------------------+-------------------
1000000 | 553375965 | 553.375965 | 5533759650.000000 | 553375965.000000
(1 row)
Time: 4.313701085s
root@:26257/defaultdb> WITH table_rows AS (
SELECT count(1) AS cnt FROM ycsb10000000.usertable
),
table_size AS (
SELECT sum(
(
crdb_internal.range_stats(
start_key
)->>'key_bytes'
)::INT8
+ (
crdb_internal.range_stats(
start_key
)->>'val_bytes'
)::INT8
) AS size
FROM crdb_internal.ranges_no_leases
WHERE table_name = 'usertable'
AND database_name = 'ycsb10000000'
GROUP BY database_name, table_name
)
SELECT cnt,
size,
size / cnt AS size_per_row,
(size / cnt * 10000000) estamted_size_10000000,
size / cnt * cnt AS validate
FROM table_rows, table_size;
cnt | size | size_per_row | estamted_size_10000000 | validate
-----------+------------+--------------+------------------------+---------------------
10000000 | 5533745768 | 553.3745768 | 5533745768.0000000 | 5533745768.0000000
(1 row)
Time: 54.007199522s
root@:26257/defaultdb>
Using bytes per row from ycsb1000000.usertable and validating against ycsb10000000.usertable:
calculated: 5533759650.000000
actual: 5533745768
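The validation above is plain arithmetic, so it can be reproduced outside the SQL shell. A minimal sketch in Python, using only the numbers from the two query outputs; the helper names `project_size` and `relative_error` are illustrative and not part of any CockroachDB tooling.

```python
from decimal import Decimal


def project_size(sample_bytes: int, sample_rows: int, target_rows: int) -> Decimal:
    """Scale the sample's bytes-per-row figure to a target row count."""
    return Decimal(sample_bytes) / Decimal(sample_rows) * Decimal(target_rows)


def relative_error(projected: Decimal, actual: int) -> Decimal:
    """Signed error of the projection relative to the measured size."""
    return (projected - Decimal(actual)) / Decimal(actual)


# 1M-row sample (ycsb1000000.usertable) projected to 10M rows.
projected = project_size(553375965, 1_000_000, 10_000_000)
# Size measured on the 10M-row table (ycsb10000000.usertable).
actual = 5533745768

print(f"projected={projected} actual={actual}")
print(f"difference={projected - actual} bytes")
print(f"relative error={relative_error(projected, actual):.6%}")
```

On these numbers the projection overshoots the measured size by 13,882 bytes, roughly 0.00025%.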
root@:26257/ycsb10000000> WITH
table_rows
AS (
SELECT
count(1) AS cnt
FROM
ycsb1000000.usertable
),
table_size
AS (
SELECT
sum(
(
crdb_internal.range_stats(
start_key
)->>'key_bytes'
)::INT8
+ (
crdb_internal.range_stats(
start_key
)->>'val_bytes'
)::INT8
)
AS size
FROM
crdb_internal.ranges_no_leases
WHERE
table_name = 'usertable'
AND database_name = 'ycsb1000000'
GROUP BY
database_name, table_name
)
SELECT
cnt,
size,
size / cnt AS size_per_row,
(size / cnt * 10000000) AS estamted_size_10000000,
size / cnt * cnt AS validate
FROM
table_rows, table_size;
cnt | size | size_per_row | estamted_size_10000000 | validate
----------+-----------+--------------+------------------------+------------------
1000000 | 719569410 | 719.56941 | 7195694100.00000 | 719569410.00000
(1 row)
Time: 659.006037ms
root@:26257/ycsb10000000> WITH table_rows AS (
SELECT count(1) AS cnt FROM ycsb10000000.usertable
),
table_size AS (
SELECT sum(
(
crdb_internal.range_stats(
start_key
)->>'key_bytes'
)::INT8
+ (
crdb_internal.range_stats(
start_key
)->>'val_bytes'
)::INT8
) AS size
FROM crdb_internal.ranges_no_leases
WHERE table_name = 'usertable'
AND database_name = 'ycsb10000000'
GROUP BY database_name, table_name
)
SELECT cnt,
size,
size / cnt AS size_per_row,
(size / cnt * 10000000) estamted_size_10000000,
size / cnt * cnt AS validate
FROM table_rows, table_size;
cnt | size | size_per_row | estamted_size_10000000 | validate
-----------+------------+--------------+------------------------+---------------------
10000000 | 7195676432 | 719.5676432 | 7195676432.0000000 | 7195676432.0000000
(1 row)
Time: 4.719286553s
root@:26257/ycsb10000000>
root@:26257/tpch> WITH
table_rows
AS (SELECT count(1) AS cnt FROM tpchsf10.customer),
table_size
AS (
SELECT
sum(
(
crdb_internal.range_stats(
start_key
)->>'key_bytes'
)::INT8
+ (
crdb_internal.range_stats(
start_key
)->>'val_bytes'
)::INT8
)
AS size
FROM
crdb_internal.ranges_no_leases
WHERE
table_name = 'customer'
AND database_name = 'tpchsf10'
GROUP BY
database_name, table_name
)
SELECT
cnt,
size,
size / cnt AS size_per_row,
(size / cnt * 15000000) AS estamted_size_15000000,
size / cnt * cnt AS validate
FROM
table_rows, table_size;
cnt | size | size_per_row | estamted_size_15000000 | validate
----------+-----------+-----------------------+------------------------------+------------------------------
1500000 | 322310089 | 214.87339266666666667 | 3223100890.00000000005000000 | 322310089.00000000000500000
(1 row)
Time: 1.147665163s
root@:26257/tpch> show create table tpchsf10.customer;
table_name | create_statement
---------------------------+----------------------------------------------------------------------------------------------------------------
tpchsf10.public.customer | CREATE TABLE customer (
| c_custkey INT8 NOT NULL,
| c_name VARCHAR(25) NOT NULL,
| c_address VARCHAR(40) NOT NULL,
| c_nationkey INT8 NOT NULL,
| c_phone CHAR(15) NOT NULL,
| c_acctbal FLOAT8 NOT NULL,
| c_mktsegment CHAR(10) NOT NULL,
| c_comment VARCHAR(117) NOT NULL,
| CONSTRAINT "primary" PRIMARY KEY (c_custkey ASC),
| CONSTRAINT customer_fkey_nation FOREIGN KEY (c_nationkey) REFERENCES nation(n_nationkey),
| INDEX c_nk (c_nationkey ASC),
| FAMILY "primary" (c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment)
| )
(1 row)
Time: 529.061245ms
root@:26257/tpch> WITH table_rows AS (
SELECT count(1) AS cnt FROM tpch.customer
),
table_size AS (
SELECT sum(
(
crdb_internal.range_stats(
start_key
)->>'key_bytes'
)::INT8
+ (
crdb_internal.range_stats(
start_key
)->>'val_bytes'
)::INT8
) AS size
FROM crdb_internal.ranges_no_leases
WHERE table_name = 'customer'
AND database_name = 'tpch'
GROUP BY database_name, table_name
)
SELECT cnt,
size,
size / cnt AS size_per_row,
(size / cnt * 15000000) estamted_size_15000000,
size / cnt * cnt AS validate
FROM table_rows, table_size;
cnt | size | size_per_row | estamted_size_15000000 | validate
-----------+------------+--------------+------------------------+---------------------
15000000 | 3224900646 | 214.9933764 | 3224900646.0000000 | 3224900646.0000000
(1 row)
Time: 7.991273263s
root@:26257/defaultdb> WITH
table_rows
AS (SELECT count(1) AS cnt FROM tpchsf10.orders),
table_size
AS (
SELECT
sum(
(
crdb_internal.range_stats(
start_key
)->>'key_bytes'
)::INT8
+ (
crdb_internal.range_stats(
start_key
)->>'val_bytes'
)::INT8
)
AS size
FROM
crdb_internal.ranges_no_leases
WHERE
table_name = 'orders'
AND database_name = 'tpchsf10'
GROUP BY
database_name, table_name
)
SELECT
cnt,
size,
size / cnt AS size_per_row,
(size / cnt * 150000000) AS estamted_size_150000000,
size / cnt * cnt AS validate
FROM
table_rows, table_size;
cnt | size | size_per_row | estamted_size_150000000 | validate
-----------+------------+--------------+-------------------------+--------------------
15000000 | 2858921070 | 190.594738 | 28589210700.000000 | 2858921070.000000
(1 row)
Time: 9.824237264s
root@:26257/defaultdb> WITH
table_rows AS (SELECT count(1) AS cnt FROM tpch.orders),
table_size
AS (
SELECT
sum(
(
crdb_internal.range_stats(
start_key
)->>'key_bytes'
)::INT8
+ (
crdb_internal.range_stats(
start_key
)->>'val_bytes'
)::INT8
)
AS size
FROM
crdb_internal.ranges_no_leases
WHERE
table_name = 'orders'
AND database_name = 'tpch'
GROUP BY
database_name, table_name
)
SELECT
cnt,
size,
size / cnt AS size_per_row,
(size / cnt * 150000000) AS estamted_size_150000000,
size / cnt * cnt AS validate
FROM
table_rows, table_size;
cnt | size | size_per_row | estamted_size_150000000 | validate
------------+-------------+--------------+-------------------------+-----------------------
150000000 | 28803450978 | 192.02300652 | 28803450978.00000000 | 28803450978.00000000
(1 row)
Time: 1m27.403918464s
root@:26257/defaultdb>
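As a summary of the runs above, the sketch below recomputes each projection from the smaller table and compares it with the size measured on the larger one. The row counts and byte sizes are copied from the query outputs in this comment; the labels "run 1" and "run 2" and the function name `summarize` are added here for illustration only. These are logical sizes from crdb_internal.range_stats, so the comparison says nothing about on-disk usage.

```python
from fractions import Fraction

# (label, sample_rows, sample_bytes, target_rows, actual_bytes)
RUNS = [
    ("ycsb usertable (run 1)", 1_000_000, 553375965, 10_000_000, 5533745768),
    ("ycsb usertable (run 2)", 1_000_000, 719569410, 10_000_000, 7195676432),
    ("tpch customer", 1_500_000, 322310089, 15_000_000, 3224900646),
    ("tpch orders", 15_000_000, 2858921070, 150_000_000, 28803450978),
]


def summarize(runs):
    """Return (label, projected_bytes, actual_bytes, error_percent) per run."""
    rows = []
    for label, n_small, bytes_small, n_large, actual in runs:
        # bytes per row from the small table, scaled to the large row count
        projected = Fraction(bytes_small, n_small) * n_large
        error_pct = float((projected - actual) / actual * 100)
        rows.append((label, int(projected), actual, error_pct))
    return rows


for label, projected, actual, error_pct in summarize(RUNS):
    print(f"{label:24s} projected={projected:>12d} "
          f"actual={actual:>12d} error={error_pct:+.4f}%")
```

On these numbers every projection lands within 1% of the measured logical size, with the TPC-H orders table furthest off at about 0.74% under.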
Jira Issue: DOC-531