Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IMPORT: support null as non-quoted string in import csv #19743

Closed
maddyblue opened this issue Nov 2, 2017 · 9 comments · Fixed by #84487
Closed

IMPORT: support null as non-quoted string in import csv #19743

maddyblue opened this issue Nov 2, 2017 · 9 comments · Fixed by #84487
Assignees
Labels
A-disaster-recovery C-investigation Further steps needed to qualify. C-label will change. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Milestone

Comments

@maddyblue
Copy link
Contributor

maddyblue commented Nov 2, 2017

Postgres copy with csv format differentiates a nullable text field that is null or an empty string by using "" as the empty string and an unquoted empty string (i.e., nothing) as null. This is round-trippable by default with the postgres copy. We should support this syntax.

See https://forum.cockroachlabs.com/t/import-from-csv-fails-on-null-data-for-int-types/1067/1

Epic: CRDB-14049

Jira issue: CRDB-17420

@maddyblue maddyblue added this to the 1.2 milestone Nov 2, 2017
@dt dt modified the milestones: 2.0, 2.1 Jan 18, 2018
@bobvawter bobvawter self-assigned this Apr 6, 2018
@bobvawter
Copy link
Contributor

I took a look at doing this. There's a slight hiccup using golang's CSV reader. According to RFC 4180, the empty-quoted string "" and the zero-length string are both the same value. We'd need to use a different version of the parser in order to be able to distinguish the case, since the current API only exposes string and not *string or something to distinguish missing versus empty fields.

@mjibson, @dianasaur323 Thoughts on whether or not we want to do this?

Saving additional thoughts for later:

for i, v := range record {

// Map the column value to NULL if it matches the user-defined
// nullIf value or if the column is nullable and the raw string
// has a zero length.  This distinguishes the empty string
// from the null string,  e.g. (1,"",,4).
if nullif != nil && v == *nullif || col.IsNullable() && len(v) == 0 {
	datums[i] = tree.DNull
} else {
	datums[i], err = parser.ParseStringAs(col.Type.ToDatumType(), v, &evalCtx, cenv)
	if err != nil {
		return errors.Wrapf(err, "%s: row %d: parse %q as %s", batch.file, rowNum, col.Name, col.Type.SQLString())
	}
}

		{
			name: "empty string vs. zero string",
			create: `
				i int primary key,
        s string,
        s2 string not null
      `,
			csv: `1,,
2,"",""`,
			query: map[string][][]string{
				`SELECT i, s, s2 from d.t`: {
					{"1", "NULL", ""},
					{"2", "", ""},
				},
			},
		},

@maddyblue
Copy link
Contributor Author

It is possible we decide we just don't want to do this because the guarantees of COPY are different than CSV, even though they are similar.

@maddyblue maddyblue added the C-investigation Further steps needed to qualify. C-label will change. label Apr 26, 2018
@bobvawter bobvawter removed their assignment Oct 8, 2018
@dt dt changed the title sqlccl: support null as non-quoted string in import csv IMPORT: support null as non-quoted string in import csv Jul 16, 2019
@mwang1026
Copy link

Closing since we haven't heard many requests for this recently. If anyone wants to see this feel free to reopen.

@florence-crl
Copy link

A customer would like this feature supported, see this internal issue: Could not parse "NULL" as type bool or JSONB #1683.

@florence-crl florence-crl reopened this Jul 7, 2022
@florence-crl florence-crl added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jul 7, 2022
@jlinder jlinder added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 7, 2022
@vy-ton
Copy link
Contributor

vy-ton commented Jul 14, 2022

@rafiss Can we investigate what it would take to address this and not require the workaround mentioned in https://github.com/cockroachlabs/support/issues/1683#issuecomment-1172624382

@rafiss
Copy link
Collaborator

rafiss commented Jul 15, 2022

It should be possible with around a week of work, since we forked the CSV reader logic and can control it.

Implementation notes: The way to do it would be to modify how quoted strings versus unquoted strings are parsed. Then we'd probably need to change the return value of readRecord so instead of []string it returns []*string or some other type that can represent missing data.


While investigating this, I also realized there is a bug in our implementation of COPY ... FROM STDIN CSV. It treats both the zero-length string (``) and the quoted empty string (`""`) as NULL. However, the correct behavior would be:

rafiss@127:postgres> create table t(a int, b text);
CREATE TABLE

rafiss@127:postgres> copy t from stdin csv;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,""
>> 2,
>> \.
COPY 2

rafiss@127:postgres> select * from t;
+---+--------+
| a | b      |
|---+--------|
| 1 |        |
| 2 | <null> |
+---+--------+
SELECT 2

In Cockroach, this results in:

  a |  b
----+-------
  1 | NULL
  2 | NULL
(2 rows)

I think the fact that there's a bug in COPY too means we should prioritize a fix.

@rafiss rafiss self-assigned this Jul 15, 2022
@rafiss
Copy link
Collaborator

rafiss commented Jul 18, 2022

@vy-ton I have a fix available, but how would you like to handle the backwards compatibility for this? Some specific questions:

  1. If NULLIF is not specified, then should we treat an unquoted empty string as NULL?
  2. If NULLIF='' is used, then should both the unquoted empty string and the quoted empty string be treated as NULL? (This is the current IMPORT behavior. However COPY has a rule that that if the NULL token appears in the input wrapped in quotes, then the parser uses that literal value rather than treating it as NULL. I.e., COPY would treat "" as the empty string, and a completely blank input as NULL.)
  3. If we do want IMPORT to match the COPY behavior described in 2, should that be configurable with a new IMPORT option? E.g., we could name it allow_quoted_nulls or something.

@vy-ton
Copy link
Contributor

vy-ton commented Aug 2, 2022

Proposed guiding principles:

  1. CRDB COPY matches PG COPY by default since COPY is a familiar interface for PG users.
  2. IMPORT INTO matches COPY by default since they are both statements for loading delimited data.
  3. For backward compatibility, we can introduce session settings or IMPORT options to preserve the existing behavior that violates (1) or (2) above.

So applying this for COPY null handling:

postgres=# create table foo(a int, b text);
CREATE TABLE
postgres=# copy foo from stdin csv;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,""
>> 2,''
>> 3,
>> \.
COPY 3
postgres=# select * from foo;
 a | b
---+----
 1 |
 2 | ''
 3 |
(3 rows)
                 ^
postgres=# select * from foo where b is null;
 a | b
---+---
 3 |
(1 row)

@rafiss to answer your questions, let me know what you think

If NULLIF is not specified, then should we treat an unquoted empty string as NULL?

IMPORT should match CRDB/PG COPY

If NULLIF='' is used, then should both the unquoted empty string and the quoted empty string be treated as NULL? (This is the current IMPORT behavior. However COPY has a rule that that if the NULL token appears in the input wrapped in quotes, then the parser uses that literal value rather than treating it as NULL. I.e., COPY would treat "" as the empty string, and a completely blank input as NULL.)L

What does COPY do if NULL AS is specified?

If we do want IMPORT to match the COPY behavior described in 2, should that be configurable with a new IMPORT option? E.g., we could name it allow_quoted_nulls or something.

I was thinking the default should match COPY and configuration needed for the existing behavior. Do you think that's too disruptive?

@rafiss
Copy link
Collaborator

rafiss commented Aug 8, 2022

What does COPY do if NULL AS is specified?

If you do COPY t FROM STDIN WITH CSV NULL '$', then an unquoted $ is NULL, but "$" is treated as a string. (The same rule applies to the empty string.) So we can make IMPORT match that.

I was thinking the default should match COPY and configuration needed for the existing behavior. Do you think that's too disruptive?

That sounds fine to me. I'll mark it as a breaking change in the release notes

@craig craig bot closed this as completed in 5f331ae Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery C-investigation Further steps needed to qualify. C-label will change. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants