Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression starting from 19.2 with Elixir / binary protocol when loading arrays #51360

Closed
tlvenn opened this issue Jul 13, 2020 · 14 comments
Closed
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@tlvenn
Copy link
Contributor

tlvenn commented Jul 13, 2020

Describe the problem

Arrays use to work just fine in 19.1 but from 19.2, it seems a regression has been introduced and now Elixir / Postgrex is crashing when trying to decode an array from the DB:

[error] GenServer #PID<0.795.0> terminating
** (DBConnection.ConnectionError) client #PID<0.1054.0> stopped: ** (MatchError) no match of right hand side value: <<0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 19, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 22, 117, 112, 115, 101, 114, 116, 95, 105, 100, 101, 109, 112, 111, 116, 101, 110, 99, 121, 95, 107, 101, 121, 0, 0, 0, 12, ...>>
    (postgrex 0.15.5) lib/postgrex/type_module.ex:897: Postgrex.DefaultTypes."Elixir.Postgrex.Extensions.Array"/8
    (postgrex 0.15.5) lib/postgrex/protocol.ex:2870: Postgrex.Protocol.rows_recv/4
    (postgrex 0.15.5) lib/postgrex/protocol.ex:1899: Postgrex.Protocol.recv_execute/5
    (postgrex 0.15.5) lib/postgrex/protocol.ex:1773: Postgrex.Protocol.bind_execute/4
    (db_connection 2.2.2) lib/db_connection/holder.ex:316: DBConnection.Holder.holder_apply/4
    (db_connection 2.2.2) lib/db_connection.ex:1255: DBConnection.run_execute/5
    (db_connection 2.2.2) lib/db_connection.ex:1342: DBConnection.run/6
    (db_connection 2.2.2) lib/db_connection.ex:539: DBConnection.parsed_prepare_execute/5
    (db_connection 2.2.2) lib/db_connection.ex:532: DBConnection.prepare_execute/4
    (ecto_sql 3.4.5) lib/ecto/adapters/sql.ex:534: Ecto.Adapters.SQL.execute!/4
    (ecto_sql 3.4.5) lib/ecto/adapters/sql.ex:526: Ecto.Adapters.SQL.execute/5
    (ecto 3.4.5) lib/ecto/repo/queryable.ex:192: Ecto.Repo.Queryable.execute/4
    (ecto 3.4.5) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
    (stdlib 3.13) erl_eval.erl:680: :erl_eval.do_apply/6
    (elixir 1.10.4) src/elixir.erl:278: :elixir.recur_eval/3
    (elixir 1.10.4) src/elixir.erl:263: :elixir.eval_forms/3
    (iex 1.10.4) lib/iex/evaluator.ex:258: IEx.Evaluator.handle_eval/5
    (iex 1.10.4) lib/iex/evaluator.ex:239: IEx.Evaluator.do_eval/3
    (iex 1.10.4) lib/iex/evaluator.ex:217: IEx.Evaluator.eval/3
    (iex 1.10.4) lib/iex/evaluator.ex:103: IEx.Evaluator.loop/1

This was first reported to me by @damboscolo in jumpn/ecto_cockroachdb#3 and I was kinda hoping it would go away with the 20.X release but unfortunately we are not so lucky ;)

To Reproduce
Create a table with a array column, in my case it is a VARCHAR(255)[] NOT NULL, insert one row in the database and query it

Expected behavior
The rows should load just fine.

Environment:

  • CockroachDB version: 20.1
  • Server OS: OSX 11
  • Client app: Postgrex / Elixir
@blathers-crl
Copy link

blathers-crl bot commented Jul 13, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Jul 13, 2020
@rafiss
Copy link
Collaborator

rafiss commented Jul 13, 2020

Thanks for the issue.

I haven't been able to reproduce it yet. I wrote the following test using our DSL for testing wire protocol, and it passes.

send
Query {"String": "CREATE TABLE t9 (a INT8 PRIMARY KEY, b VARCHAR(256)[] NOT NULL)"}
----

until
ReadyForQuery
----
{"Type":"CommandComplete","CommandTag":"CREATE TABLE"}
{"Type":"ReadyForQuery","TxStatus":"I"}

send
Query {"String": "INSERT INTO t9 VALUES(4, ARRAY['hello', 'goodbye'])"}
----

until
ReadyForQuery
----
{"Type":"CommandComplete","CommandTag":"INSERT 0 1"}
{"Type":"ReadyForQuery","TxStatus":"I"}

# 80 = ASCII 'P' for Portal
send
Parse {"Name": "s", "Query": "SELECT b FROM t9"}
Bind {"DestinationPortal": "p", "PreparedStatement": "s"}
Describe {"ObjectType": 80, "Name": "p"}
Execute {"Portal": "p"}
Sync
----

until
ReadyForQuery
----
{"Type":"ParseComplete"}
{"Type":"BindComplete"}
{"Type":"RowDescription","Fields":[{"Name":"b","TableOID":0,"TableAttributeNumber":0,"DataTypeOID":1015,"DataTypeSize":-1,"TypeModifier":-1,"Format":0}]}
{"Type":"DataRow","Values":[{"text":"{hello,goodbye}"}]}
{"Type":"CommandComplete","CommandTag":"SELECT 1"}
{"Type":"ReadyForQuery","TxStatus":"I"}

Would you be able to provide an exact sequence of commands and queries that fails? I believe the report is saying that this only happens when using prepared statements. Is that correct?

If you could point me to how to run the code (and if I can get an Elixir environment set up on my computer), I might be able to try capturing the wire protocol commands that are being sent. Or if you are able to capture this yourself with Wireshark, that would be really helpful!

@tlvenn
Copy link
Contributor Author

tlvenn commented Jul 13, 2020

Hi @rafiss, I can definitely capture the sequence for you. I will get back to you on this asap.

@tlvenn
Copy link
Contributor Author

tlvenn commented Jul 27, 2020

Hi @rafiss ,

Please find the captured packets

binary_protocol_arrays.pcapng.gz

View of the table and the data:
image
with completed_phases being defined as completed_phases VARCHAR(255)[] NOT NULL

The exact error:

iex(5)> from(k in Jumpn.Idempotency.Key, select: k.completed_phases) |> Jumpn.Repo.all
[error] GenServer #PID<0.774.0> terminating
** (DBConnection.ConnectionError) client #PID<0.987.0> stopped: ** (MatchError) no match of right hand side value: <<0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 19, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 22, 117, 112, 115, 101, 114, 116, 95, 105, 100, 101, 109, 112, 111, 116, 101, 110, 99, 121, 95, 107, 101, 121, 0, 0, 0, 12, ...>>
    (postgrex 0.15.5) lib/postgrex/type_module.ex:897: Postgrex.DefaultTypes."Elixir.Postgrex.Extensions.Array"/8
    (postgrex 0.15.5) lib/postgrex/protocol.ex:2870: Postgrex.Protocol.rows_recv/4
    (postgrex 0.15.5) lib/postgrex/protocol.ex:1899: Postgrex.Protocol.recv_execute/5
    (postgrex 0.15.5) lib/postgrex/protocol.ex:1773: Postgrex.Protocol.bind_execute/4
    (db_connection 2.2.2) lib/db_connection/holder.ex:316: DBConnection.Holder.holder_apply/4
    (db_connection 2.2.2) lib/db_connection.ex:1255: DBConnection.run_execute/5
    (db_connection 2.2.2) lib/db_connection.ex:1342: DBConnection.run/6
    (db_connection 2.2.2) lib/db_connection.ex:539: DBConnection.parsed_prepare_execute/5
    (db_connection 2.2.2) lib/db_connection.ex:532: DBConnection.prepare_execute/4
    (ecto_sql 3.4.5) lib/ecto/adapters/sql.ex:534: Ecto.Adapters.SQL.execute!/4
    (ecto_sql 3.4.5) lib/ecto/adapters/sql.ex:526: Ecto.Adapters.SQL.execute/5
    (ecto 3.4.5) lib/ecto/repo/queryable.ex:192: Ecto.Repo.Queryable.execute/4
    (ecto 3.4.5) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
    (stdlib 3.13) erl_eval.erl:680: :erl_eval.do_apply/6
    (elixir 1.10.4) src/elixir.erl:278: :elixir.recur_eval/3
    (elixir 1.10.4) src/elixir.erl:263: :elixir.eval_forms/3
    (iex 1.10.4) lib/iex/evaluator.ex:258: IEx.Evaluator.handle_eval/5
    (iex 1.10.4) lib/iex/evaluator.ex:239: IEx.Evaluator.do_eval/3
    (iex 1.10.4) lib/iex/evaluator.ex:217: IEx.Evaluator.eval/3
    (iex 1.10.4) lib/iex/evaluator.ex:103: IEx.Evaluator.loop/1

    (db_connection 2.2.2) lib/db_connection/connection.ex:168: DBConnection.Connection.handle_cast/2
    (connection 1.0.4) lib/connection.ex:810: Connection.handle_async/3
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:236: :proc_lib.wake_up/3

@rafiss
Copy link
Collaborator

rafiss commented Jul 27, 2020

Thank you @tlvenn. I looked at the captured packets, and it looks like the completed_phases column has this type metadata:

Column name: completed_phases
    Table OID: 0
    Column index: 0
    Type OID: 1009
    Column length: -1
    Type modifier: -1
    Format: Text (0)

The type OID is 1099, which is the OID for a TEXT[]. A VARCHAR(n)[] would have an OID of 1015 as seen in the example in my previous comment.

Would you mind sharing the output of the following commands:

SHOW CREATE TABLE idempotency_keys;

SELECT pg_typeof(completed_phases), pg_typeof(completed_phases)::regtype from idempotency_keys

I wonder if some part of the Ecto CockroachDB adapter is turning VARCHAR(n)[] into TEXT[] at table creation time. This could cause issues when reading from the table later and trying to interpret as a VARCHAR(n)[]. As for why this started with 19.2, a good guess is this PR: #36598. This PR made our type system align with the Postgres type system more closely. Maybe before this PR, the Ecto CockroachDB adapter had a workaround for some of the differences in the type system.

@tlvenn
Copy link
Contributor Author

tlvenn commented Jul 28, 2020

SHOW CREATE TABLE idempotency_keys:

	id INT8 NOT NULL DEFAULT unique_rowid(),
	key VARCHAR(255) NOT NULL,
	last_executed_at TIMESTAMP NOT NULL,
	locked_at TIMESTAMP NULL,
	mutation_query STRING NOT NULL,
	mutation_variables JSONB NOT NULL,
	mutation_result JSONB NULL,
	completed_phases VARCHAR(255)[] NOT NULL,
	user_id INT8 NULL,
	inserted_at TIMESTAMP NOT NULL,
	CONSTRAINT "primary" PRIMARY KEY (id ASC),
	CONSTRAINT idempotency_keys_user_id_fkey FOREIGN KEY (user_id) REFERENCES users(id),
	INDEX idempotency_keys_auto_index_idempotency_keys_user_id_fkey (user_id ASC),
	UNIQUE INDEX idempotency_keys_user_id_key_index (user_id ASC, key ASC),
	FAMILY "primary" (id, key, last_executed_at, locked_at, mutation_query, mutation_variables, mutation_result, completed_phases, user_id, inserted_at)
)

SELECT pg_typeof(completed_phases), pg_typeof(completed_phases)::regtype from idempotency_keys

pg_typeof pg_typeof
varchar[] 'character varying[]'

It seems the column has the good type (varchar[]) and the issue is elsewhere.

@tlvenn
Copy link
Contributor Author

tlvenn commented Jul 30, 2020

@rafiss is there any thing else I could help you with ?

@rafiss
Copy link
Collaborator

rafiss commented Jul 30, 2020

Thanks for the info so far. It does look like the schema is correct.

I've tried a few other ideas and still cannot reproduce it. Can you provide me with Elixir code that runs into the issue and instructions to run it?

@tlvenn
Copy link
Contributor Author

tlvenn commented Jul 31, 2020

Sure thing

@tlvenn
Copy link
Contributor Author

tlvenn commented Jul 31, 2020

I created a repo to reproduce the issue at: https://github.com/tlvenn/cdb_test
Please follow the steps as explained in the readme and you should see the following error:
image

@rafiss
Copy link
Collaborator

rafiss commented Aug 1, 2020

Thank you very much for creating this sample. It was really helpful to finally track this down. My previous attempts to reproduce this failed because of a combo of this being fixed in our 20.2 alphas (unintentionally) and my own issues with getting our pgwire testing DSL working on older branches.

The reason this is not working with 19.2 and 20.1 is this code:

var resultOidMap = map[oid.Oid]oid.Oid{
oid.T_bit: oid.T_varbit,
oid.T__bit: oid.T__varbit,
oid.T_bpchar: oid.T_text,
oid.T__bpchar: oid.T__text,
oid.T_char: oid.T_text,
oid.T__char: oid.T__text,
oid.T_varchar: oid.T_text,
oid.T__varchar: oid.T__text,
}
// mapResultOid maps an Oid value returned by the server to an Oid value that is
// backwards-compatible with previous versions of CRDB. See this issue for more
// details: https://github.com/cockroachdb/cockroach/issues/36811
//
// TODO(andyk): Remove this once issue #36811 is resolved.
func mapResultOid(o oid.Oid) oid.Oid {
mapped := resultOidMap[o]
if mapped != 0 {
return mapped
}
return o

PR #44471 removed that bit of code, and indeed this issue doesn't happen in the 20.2 release.

Since we broke functionality here, we could consider backporting the PR that repairs the functionality to the 19.2 and 20.1 branches. Thoughts @jordanlewis / @RaduBerinde? (tagging you two since the idea of backporting this same PR already came up in #49353)

@tlvenn: even if we aren't able to backport, the good news is that this will be fixed in the 20.2 release that will come later this year.

@tlvenn
Copy link
Contributor Author

tlvenn commented Aug 4, 2020

Given how trivial the PR seems to be, it would make sense to backport it to at the very least the 20.1.X branch so it's available within the next release. It seems we missed the 20.1.4 already, hoping to see this fixed with the 20.1.5.

@jordanlewis 🙏

@rafiss
Copy link
Collaborator

rafiss commented Aug 4, 2020

I created the backport #52351 and will advocate for it to be approved.

rafiss added a commit to rafiss/cockroach that referenced this issue Aug 4, 2020
See issue cockroachdb#51360. The bug described in it was fixed somewhat
accidentally, do this test will verify that we don't regress again.

Release note: None
rafiss added a commit to rafiss/cockroach that referenced this issue Aug 4, 2020
See issue cockroachdb#51360. The bug described in it was fixed somewhat
accidentally, do this test will verify that we don't regress again.

Release note: None
craig bot pushed a commit that referenced this issue Aug 5, 2020
51803: cmd/docgen: add HTTP extractor r=mjibson a=mjibson

Add a way to extract docs from the status.proto HTTP endpoint. These
can be imported into the docs project as needed.

Release note: None

52083: roachtest: small misc r=andreimatei a=andreimatei

See individual commits.

52094: rpc: implement tenant access control policies at KV RPC boundary r=nvanbenschoten a=nvanbenschoten

Fixes #47898.

Rebased on #51503 and #52034. Ignore all but the last 3 commits.

This commit adds a collection of access control policies for the newly exposed tenant RPC server. These authorization policies ensure that an authenticated tenant is only able to access keys within its keyspace and that no tenant is able to access data from another tenant's keyspace through the tenant RPC server. This is a major step in providing crypto-backed logical isolation between tenants in a multi-tenant cluster.

The existing auth mechanism is retained on the standard RPC server, which means that the system tenant is still able to access any key in the system.

52352: sql/pgwire: add regression test for varchar OIDs in RowDescription r=jordanlewis a=rafiss

See issue #51360. The bug described in it was fixed somewhat
accidentally, so this test will verify that we don't regress again.

Release note: None

52386: opt: add SerializingProject exec primitive r=RaduBerinde a=RaduBerinde

The top-level projection of a query has a special property - it can project away
columns that we want an ordering on (e.g. `SELECT a FROM t ORDER BY b`).

The distsql physical planner was designed to tolerate such cases, as they were
much more common with the heuristic planner. But the new distsql exec factory
does not; it currently relies on a hack: it detects this case by checking if the
required output ordering is `nil`. This is fragile and doesn't work in all
cases.

This change adds a `SerializingProject` primitive which is like a SimpleProject
but it forces serialization of all parallel streams into one. The new primitive
is used to enforce the final query presentation. We only need to pass column
names for the presentation, so we remove `RenameColumns` and remove the column
names argument from `SimpleProject` (simplifying some execbuilder code).

We also fix a bug in `ConstructSimpleProject` where we weren't taking the
`PlanToStreamColMap` into account when building the projection.

Release note: None

Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@rafiss
Copy link
Collaborator

rafiss commented Aug 5, 2020

Closing this out since it will be in the next 20.1 release.

@rafiss rafiss closed this as completed Aug 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

No branches or pull requests

2 participants