Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block structure mismatch in MergingSorted stream #3162

Closed
Yashin32 opened this issue Sep 18, 2018 · 8 comments
Closed

Block structure mismatch in MergingSorted stream #3162

Yashin32 opened this issue Sep 18, 2018 · 8 comments

Comments

@Yashin32
Copy link

Clickhouse server version 18.12.17

Query to distributed table sometimes fails with error:

2018.09.18 13:39:41.251273 [ 1108366 ] {efbabc85-def6-4747-9443-c302558e8793} <Error> executeQuery: Code: 171, e.displayText() = DB::Exception: Block structure mismatch in MergingSorted stream: different names of columns:
Timestamp DateTime UInt32(size = 5), CampaignID String String(size = 5), TrafficSourceID String String(size = 5), EventType UInt16 UInt16(size = 5)
TrafficSourceID String String(size = 0), Timestamp DateTime UInt32(size = 0), CampaignID String String(size = 0), EventType UInt16 UInt16(size = 0), e.what() = DB::Exception (from [::ffff:127.0.0.1]:49342) (in query: SELECT EventType, Timestamp, TrafficSourceID, CampaignID FROM test.TestDistributed WHERE UserID = '779b588a-6479-4b82-89b6-256fe66536b0' AND EventDate BETWEEN '2018-09-17' AND '2018-09-18' AND Timestamp BETWEEN '2018-09-17 22:00:00' AND '2018-09-18 21:59:59' AND EventType IN (4, 5, 6, 7) AND TrafficSourceID = '9496c5bd-f7f5-4e20-8c06-a6b17aa999e8' ORDER BY Timestamp DESC LIMIT 50 FORMAT JSON), Stack trace:

0. /usr/bin/clickhouse-server(StackTrace::StackTrace()+0x16) [0x9491e46]
1. /usr/bin/clickhouse-server(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x22) [0x3019952]
2. /usr/bin/clickhouse-server() [0x695ecce]
3. /usr/bin/clickhouse-server() [0x69632b7]
4. /usr/bin/clickhouse-server(DB::MergingSortedBlockInputStream::init(std::vector<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn>, std::allocator<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn> > >&)+0x7d9) [0x74ec969]
5. /usr/bin/clickhouse-server(DB::MergingSortedBlockInputStream::readImpl()+0x65) [0x74ed835]
6. /usr/bin/clickhouse-server(DB::IProfilingBlockInputStream::read()+0x25a) [0x698ecfa]
7. /usr/bin/clickhouse-server(DB::ExpressionBlockInputStream::readImpl()+0x2e) [0x74be89e]
8. /usr/bin/clickhouse-server(DB::IProfilingBlockInputStream::read()+0x25a) [0x698ecfa]
9. /usr/bin/clickhouse-server(DB::LimitBlockInputStream::readImpl()+0x225) [0x6993065]
10. /usr/bin/clickhouse-server(DB::IProfilingBlockInputStream::read()+0x25a) [0x698ecfa]
11. /usr/bin/clickhouse-server(DB::AsynchronousBlockInputStream::calculate()+0x56) [0x6984316]
12. /usr/bin/clickhouse-server() [0x6984780]
13. /usr/bin/clickhouse-server(ThreadPool::worker()+0x19e) [0x9692b5e]
14. /usr/bin/clickhouse-server() [0x9e659ef]
15. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5acd0836ba]
16. /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5acc6a441d]

Error appears only for specific WHERE values and only when TrafficSourceID column is included in query. Same query to underlying ReplicatedMergeTree table succeeds on every instance of cluster.

@mdonkers
Copy link
Contributor

mdonkers commented Oct 18, 2018

We are experiencing a similar issue, using the same ClickHouse version (18.12.17).

We are running a multi-tenant setup with separate databases, all having the same tables. The issue occurs on different databases, but thus far only with the below specific query (according to our logs).

We have checked local / distributed tables, and order of columns are in sync.

ClickHouse is being queried via HTTP, resulting in the following error:

com.instana.application.clickhouse.exception.ClickHouseRequestException: Code: 171, e.displayText() = DB::Exception: Block structure mismatch in MergingSorted stream: different names of columns:#012! trace_id String String(size = 41), call_id String String(size = 41), call_name String String(size = 41), t UInt64 UInt64(size = 41), destination_service_id String String(size = 41), destination_application_ids Array(String) Array(size = 41, UInt64(size = 41), String(size = 41)), duration UInt64 UInt64(size = 41), batchCount UInt64 UInt64(size = 41), errorCount UInt64 UInt64(size = 41)#012! destination_service_id String String(size = 0), trace_id String String(size = 0), call_id String String(size = 0), call_name String String(size = 0), t UInt64 UInt64(size = 0), destination_application_ids Array(String) Array(size = 0, UInt64(size = 0), String(size = 0)), duration UInt64 UInt64(size = 0), batchCount UInt64 UInt64(size = 0), errorCount UInt64 UInt64(size = 0), e.what() = DB::Exception#012! .
Query: SELECT call_id,call_name,trace_id,t,destination_service_id,duration,errorCount,batchCount,destination_application_ids,service_label FROM (SELECT call_id,call_name,trace_id,t,destination_service_id,duration,errorCount,batchCount,destination_application_ids FROM tenant_database.all_calls WHERE t >= 1539791407995 AND t <= 1539795007995 AND ingestion_time <= 1539795006000 AND lower(destination_endpoint) = 'my-endpoint' AND (destination_service_id = '060b4106309e8c6e1996') AND (arrayExists((a) -> a = 'SmSKlDyM5yLtJA',destination_application_ids)=1) ORDER BY t DESC LIMIT 50) ANY INNER JOIN services_temp_table USING destination_service_id FORMAT JSON

Looking at the order of columns, there is indeed a difference it seems:

trace_id String String(size = 41),
call_id String String(size = 41),
call_name String String(size = 41),
t UInt64 UInt64(size = 41),
destination_service_id String String(size = 41),
destination_application_ids Array(String) Array(size = 41, UInt64(size = 41), String(size = 41)),
duration UInt64 UInt64(size = 41),
batchCount UInt64 UInt64(size = 41),
errorCount UInt64 UInt64(size = 41)

destination_service_id String String(size = 0),
trace_id String String(size = 0),
call_id String String(size = 0),
call_name String String(size = 0),
t UInt64 UInt64(size = 0),
destination_application_ids Array(String) Array(size = 0, UInt64(size = 0), String(size = 0)),
duration UInt64 UInt64(size = 0),
batchCount UInt64 UInt64(size = 0),
errorCount UInt64 UInt64(size = 0),

Please let me know if more information is required to further analyse this issue.

@mlex
Copy link
Contributor

mlex commented Oct 29, 2018

Hi,
we were able to create a small example to reproduce the bug. You need a cluster of at least 2 nodes.

CREATE TABLE local_test ON CLUSTER testcluster (
  t UInt64,
  date Date MATERIALIZED toDate(t/1000),
  shard UInt64,
  col1 String,
  col2 String
) ENGINE = MergeTree 
PARTITION BY toRelativeDayNum(date) 
ORDER BY (t) 
SETTINGS index_granularity=8192;

CREATE TABLE dist_test ON CLUSTER testcluster (
  t UInt64,
  shard UInt64,
  date Date MATERIALIZED toDate(t/1000),
  col1 String,
  col2 String
) Engine = Distributed(testcluster, default, local_test, shard);

INSERT INTO dist_test (t, shard, col1, col2) 
VALUES 
(1000, 1, 'foo', 'bar'), 
(1000, 2, 'x', 'y');

The following select will then trigger the bug:

SELECT
    col1,
    col2
FROM dist_test
WHERE (t < 3600000) AND (col1 = 'foo')
ORDER BY t ASC

Here is also a full debug log produced by the above query:


[1efae51a1d8e] 2018.10.29 10:33:35.751022 {a5417e06-48a8-435e-ac95-2aedf1a6ce16} [ 79 ] <Debug> executeQuery: (from 172.22.0.3:35656) SELECT     col1,     col2 FROM dist_test WHERE (t < 3600000) AND (col1 = 'foo') ORDER BY t ASC
[1efae51a1d8e] 2018.10.29 10:33:35.752241 {a5417e06-48a8-435e-ac95-2aedf1a6ce16} [ 79 ] <Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition `col1 = 'foo'` moved to PREWHERE
[1efae51a1d8e] 2018.10.29 10:33:35.752654 {a5417e06-48a8-435e-ac95-2aedf1a6ce16} [ 79 ] <Debug> default.local_test (SelectExecutor): Key condition: (column 0 in (-inf, 3599999]), unknown, and
[1efae51a1d8e] 2018.10.29 10:33:35.752688 {a5417e06-48a8-435e-ac95-2aedf1a6ce16} [ 79 ] <Debug> default.local_test (SelectExecutor): MinMax index condition: unknown, unknown, and
[1efae51a1d8e] 2018.10.29 10:33:35.752729 {a5417e06-48a8-435e-ac95-2aedf1a6ce16} [ 79 ] <Debug> default.local_test (SelectExecutor): Selected 1 parts by date, 1 parts by key, 1 marks to read from 1 ranges
[1efae51a1d8e] 2018.10.29 10:33:35.753047 {a5417e06-48a8-435e-ac95-2aedf1a6ce16} [ 79 ] <Debug> executeQuery: Query pipeline:
Expression
 MergingSorted
  Asynchronous
   Materializing
    MergeSorting
     PartialSorting
      Expression
       Filter
        MergeTreeThread
  Asynchronous
   Remote

[451e87648d25] 2018.10.29 10:33:35.754027 {71e89224-9914-4932-92e2-475776b8e8f0} [ 45 ] <Debug> executeQuery: (from 172.22.0.4:59844, initial_query_id: a5417e06-48a8-435e-ac95-2aedf1a6ce16) SELECT col1, col2 FROM default.local_test  WHERE (t < 3600000) AND (col1 = 'foo') ORDER BY t ASC
[451e87648d25] 2018.10.29 10:33:35.754711 {71e89224-9914-4932-92e2-475776b8e8f0} [ 45 ] <Debug> InterpreterSelectQuery: MergeTreeWhereOptimizer: condition `col1 = 'foo'` moved to PREWHERE
[451e87648d25] 2018.10.29 10:33:35.755130 {71e89224-9914-4932-92e2-475776b8e8f0} [ 45 ] <Debug> default.local_test (SelectExecutor): Key condition: (column 0 in (-inf, 3599999]), unknown, and
[451e87648d25] 2018.10.29 10:33:35.755165 {71e89224-9914-4932-92e2-475776b8e8f0} [ 45 ] <Debug> default.local_test (SelectExecutor): MinMax index condition: unknown, unknown, and
[451e87648d25] 2018.10.29 10:33:35.755203 {71e89224-9914-4932-92e2-475776b8e8f0} [ 45 ] <Debug> default.local_test (SelectExecutor): Selected 1 parts by date, 1 parts by key, 1 marks to read from 1 ranges
[451e87648d25] 2018.10.29 10:33:35.755336 {71e89224-9914-4932-92e2-475776b8e8f0} [ 45 ] <Debug> executeQuery: Query pipeline:
MergeSorting
 PartialSorting
  Expression
   Filter
    MergeTreeThread

[451e87648d25] 2018.10.29 10:33:35.756590 {71e89224-9914-4932-92e2-475776b8e8f0} [ 45 ] <Information> executeQuery: Read 1 rows, 32.00 B in 0.002 sec., 400 rows/sec., 12.53 KiB/sec.
[1efae51a1d8e] 2018.10.29 10:33:35.761627 {a5417e06-48a8-435e-ac95-2aedf1a6ce16} [ 79 ] <Error> executeQuery: Code: 171, e.displayText() = DB::Exception: Block structure mismatch in MergingSorted stream: different names of columns:
t UInt64 UInt64(size = 1), col1 String String(size = 1), col2 String String(size = 1)
col1 String String(size = 0), t UInt64 UInt64(size = 0), col2 String String(size = 0), e.what() = DB::Exception (from 172.22.0.3:35656) (in query: SELECT     col1,     col2 FROM dist_test WHERE (t < 3600000) AND (col1 = 'foo') ORDER BY t ASC), Stack trace:

0. /usr/bin/clickhouse-server(StackTrace::StackTrace()+0x16) [0x9491e46]
1. /usr/bin/clickhouse-server(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x22) [0x3019952]
2. /usr/bin/clickhouse-server() [0x695ecce]
3. /usr/bin/clickhouse-server() [0x69632b7]
4. /usr/bin/clickhouse-server(DB::MergingSortedBlockInputStream::init(std::vector<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn>, std::allocator<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn> > >&)+0x7d9) [0x74ec969]
5. /usr/bin/clickhouse-server(DB::MergingSortedBlockInputStream::readImpl()+0x65) [0x74ed835]
6. /usr/bin/clickhouse-server(DB::IProfilingBlockInputStream::read()+0x25a) [0x698ecfa]
7. /usr/bin/clickhouse-server(DB::ExpressionBlockInputStream::readImpl()+0x2e) [0x74be89e]
8. /usr/bin/clickhouse-server(DB::IProfilingBlockInputStream::read()+0x25a) [0x698ecfa]
9. /usr/bin/clickhouse-server(DB::AsynchronousBlockInputStream::calculate()+0x56) [0x6984316]
10. /usr/bin/clickhouse-server() [0x6984780]
11. /usr/bin/clickhouse-server(ThreadPool::worker()+0x19e) [0x9692b5e]
12. /usr/bin/clickhouse-server() [0x9e659ef]
13. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f083d14a6db]
14. /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f083c6c988f]

Received exception from server (version 18.12.17):
Code: 171. DB::Exception: Received from clickhouse-server:9000, 172.22.0.4. DB::Exception: Block structure mismatch in MergingSorted stream: different names of columns:
t UInt64 UInt64(size = 1), col1 String String(size = 1), col2 String String(size = 1)
col1 String String(size = 0), t UInt64 UInt64(size = 0), col2 String String(size = 0).

0 rows in set. Elapsed: 0.012 sec.

I tested the above example with 18.14.10 (using the docker container). It still shows the same error.

@alesapin
Copy link
Member

Can you provide your cluster config or test it yourself on 18.14.11? This error seems to be fixed. Currently I can't reproduce it.

@mlex
Copy link
Contributor

mlex commented Oct 29, 2018

The bug is still there with the clickhouse docker image 18.14.11.

My cluster.xml (for node-1):

<yandex>
    <macros>
        <cluster>testcluster</cluster>
        <shard>1</shard>
    </macros>
    <remote_servers>
        <testcluster>
            <shard>
                <replica>
                    <host>instana-clickhouse1</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>instana-clickhouse2</host>
                    <port>9000</port>
                </replica>
            </shard>
        </testcluster>
    </remote_servers>

    <zookeeper>
        <node>
            <host>zk</host>
            <port>2181</port>
        </node>
    </zookeeper>
</yandex>

(I'm not sure if the zookeeper part is really needed because in the example no Replicated engines are used).

Another observation: When disabling optimize_move_to_prewhere (via SET optimize_move_to_prewhere=0) the error disappears.

@mikeeremin
Copy link

Confirm for 18.14.11 - not fixed.

@alesapin
Copy link
Member

alesapin commented Nov 2, 2018

Repoduced issue with test https://github.com/yandex/ClickHouse/pull/3534/files

@ztlpn
Copy link
Contributor

ztlpn commented Nov 8, 2018

Fix was released in v18.14.13

@ztlpn ztlpn closed this as completed Nov 8, 2018
@hzruandd
Copy link

We are experiencing a similar issue, using the same ClickHouse version (18.12.17).

We are running a multi-tenant setup with separate databases, all having the same tables. The issue occurs on different databases, but thus far only with the below specific query (according to our logs).

We have checked local / distributed tables, and order of columns are in sync.

ClickHouse is being queried via HTTP, resulting in the following error:

com.instana.application.clickhouse.exception.ClickHouseRequestException: Code: 171, e.displayText() = DB::Exception: Block structure mismatch in MergingSorted stream: different names of columns:#012! trace_id String String(size = 41), call_id String String(size = 41), call_name String String(size = 41), t UInt64 UInt64(size = 41), destination_service_id String String(size = 41), destination_application_ids Array(String) Array(size = 41, UInt64(size = 41), String(size = 41)), duration UInt64 UInt64(size = 41), batchCount UInt64 UInt64(size = 41), errorCount UInt64 UInt64(size = 41)#012! destination_service_id String String(size = 0), trace_id String String(size = 0), call_id String String(size = 0), call_name String String(size = 0), t UInt64 UInt64(size = 0), destination_application_ids Array(String) Array(size = 0, UInt64(size = 0), String(size = 0)), duration UInt64 UInt64(size = 0), batchCount UInt64 UInt64(size = 0), errorCount UInt64 UInt64(size = 0), e.what() = DB::Exception#012! .
Query: SELECT call_id,call_name,trace_id,t,destination_service_id,duration,errorCount,batchCount,destination_application_ids,service_label FROM (SELECT call_id,call_name,trace_id,t,destination_service_id,duration,errorCount,batchCount,destination_application_ids FROM tenant_database.all_calls WHERE t >= 1539791407995 AND t <= 1539795007995 AND ingestion_time <= 1539795006000 AND lower(destination_endpoint) = 'my-endpoint' AND (destination_service_id = '060b4106309e8c6e1996') AND (arrayExists((a) -> a = 'SmSKlDyM5yLtJA',destination_application_ids)=1) ORDER BY t DESC LIMIT 50) ANY INNER JOIN services_temp_table USING destination_service_id FORMAT JSON

Looking at the order of columns, there is indeed a difference it seems:

trace_id String String(size = 41),
call_id String String(size = 41),
call_name String String(size = 41),
t UInt64 UInt64(size = 41),
destination_service_id String String(size = 41),
destination_application_ids Array(String) Array(size = 41, UInt64(size = 41), String(size = 41)),
duration UInt64 UInt64(size = 41),
batchCount UInt64 UInt64(size = 41),
errorCount UInt64 UInt64(size = 41)

destination_service_id String String(size = 0),
trace_id String String(size = 0),
call_id String String(size = 0),
call_name String String(size = 0),
t UInt64 UInt64(size = 0),
destination_application_ids Array(String) Array(size = 0, UInt64(size = 0), String(size = 0)),
duration UInt64 UInt64(size = 0),
batchCount UInt64 UInt64(size = 0),
errorCount UInt64 UInt64(size = 0),

Please let me know if more information is required to further analyse this issue.

Hello, we also plan to adopt a multi-tenant setup with separate databases, which involves connection and routing management. Is there a middleware that is easy to use? thanks a lot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants