
csvsql: agate-sql: Add option to set varchar length to data's maximum length, or to a user-provided length #928

Open
StudentForever opened this issue Jan 31, 2018 · 11 comments

Comments

@StudentForever

Hi

I'm not sure whether I'm missing the right flags, but I noticed that the column sizes in the CREATE TABLE statement default to the maximum, e.g. VARCHAR(max). Is there an option so that column sizes are inferred from the actual data?


cat test.csv
tag,tag2,Date,col1,col2,col3,col4,col5
DFAN8086,sss,12/24/2012,0.9454054,0.391277,0.696738598,0.963471,0.422531
FDG491BT1088,wew,12/24/2012,0.922601887,0.3125817,0.991514,0.740261,0.3096907
GEW00328745,tere,12/24/2012,0.341763918,0.587643,0.834997,0.611671,0.3686074
GWW02993182,qewqeqw,12/24/2012,0.078535992,0.497771,0.394849,0.4730659,0.4074645


csvsql -i mssql test.csv
CREATE TABLE test (
        tag VARCHAR(max) NOT NULL,
        tag2 VARCHAR(max) NOT NULL,
        [Date] DATE NOT NULL,
        col1 DECIMAL(38, 9) NOT NULL,
        col2 DECIMAL(38, 7) NOT NULL,
        col3 DECIMAL(38, 9) NOT NULL,
        col4 DECIMAL(38, 7) NOT NULL,
        col5 DECIMAL(38, 7) NOT NULL
);

@StudentForever StudentForever changed the title csvsql field size csvsql field size defaults to varchar(max) instead of inferring from data ? Jan 31, 2018
@jpmckinney jpmckinney changed the title csvsql field size defaults to varchar(max) instead of inferring from data ? agate-sql: mssql: csvsql field size defaults to varchar(max) instead of inferring from data Feb 2, 2018
@jpmckinney (Member)

Different databases have different default behavior in agate-sql (which csvsql uses) - it seems MSSQL defaults to max. We can add a flag to csvsql to calculate the max length from the data. Some users won't want this, because they may later add new data whose fields are longer, or they may not want csvsql to do any inference on the data (which can be slow for large files).
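For illustration, a minimal sketch of what such a flag could compute - this is not agate-sql's actual implementation, and type inference is simplified to "every column is text" (the real DDL above also types dates and decimals):

import csv

def max_text_lengths(path):
    """Return {column name: length of its longest value} for a CSV file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        lengths = {name: 0 for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                lengths[name] = max(lengths[name], len(value or ""))
    return lengths

# Emit sized columns instead of VARCHAR(max).
lengths = max_text_lengths("test.csv")
columns = ",\n".join(
    f"        {name} VARCHAR({n}) NOT NULL" for name, n in lengths.items()
)
print(f"CREATE TABLE test (\n{columns}\n);")
# For the sample above this yields tag VARCHAR(12), tag2 VARCHAR(7), etc.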

@jpmckinney jpmckinney changed the title agate-sql: mssql: csvsql field size defaults to varchar(max) instead of inferring from data agate-sql: mssql: csvsql add option to calculate maximum varchar length Feb 2, 2018
@StudentForever (Author)

@jpmckinney Thanks for reviewing the request.

Regards,
Kiran

@Freewilly3d

I'm interested in something like this for Oracle as well. Another option would be to set a default varchar length on the command line, e.g. VARCHAR2(1000).

Thanks
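A user-supplied length like that could slot into the same sketch as an override - hypothetical, since no such csvsql flag exists at this point:

# Hypothetical override: ignore the computed lengths and emit Oracle-style
# VARCHAR2 columns at a user-chosen size.
user_len = 1000
columns = ",\n".join(
    f"        {name} VARCHAR2({user_len})" for name in max_text_lengths("test.csv")
)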

@jpmckinney jpmckinney changed the title agate-sql: mssql: csvsql add option to calculate maximum varchar length agate-sql: csvsql add option to provide / calculate maximum varchar length Apr 22, 2018
@jpmckinney jpmckinney changed the title agate-sql: csvsql add option to provide / calculate maximum varchar length agate-sql: csvsql add option to use data's maximum varchar length May 21, 2018
@jpmckinney jpmckinney changed the title agate-sql: csvsql add option to use data's maximum varchar length csvsql: agate-sql: add option to use data's maximum varchar length Oct 17, 2023
@chriszrc

I really thought csvsql used to do this already, at least for Postgres. Finding the max lengths of values across multiple columns and setting reasonable varchar limits is a pain; I'd really like this to be automated.

@jpmckinney (Member)

Why do you need limits when using Postgres? The current behavior with Postgres is to set no limit.

@jpmckinney (Member) commented Mar 29, 2024

The Postgres docs describe no performance difference, so the only reason I can imagine is database-level data validation - though validation is often better handled in application code. https://www.postgresql.org/docs/current/datatype-character.html
See also https://stackoverflow.com/a/4849030
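As a concrete sketch of that trade-off (this assumes a reachable Postgres server and the psycopg2 package; the connection string is hypothetical):

import psycopg2  # third-party: pip install psycopg2-binary

conn = psycopg2.connect("dbname=test")  # hypothetical connection string
cur = conn.cursor()
cur.execute("CREATE TEMP TABLE t (tag VARCHAR(12))")
try:
    # 13 characters into a VARCHAR(12): Postgres rejects the row outright.
    cur.execute("INSERT INTO t VALUES (%s)", ("x" * 13,))
except psycopg2.errors.StringDataRightTruncation as exc:
    print(exc)  # value too long for type character varying(12)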

@chriszrc

@jpmckinney Yes, it's for data validation; it has saved me more than once from erroneous data being inserted into wrong/mismatched columns.

@acosonic commented Jul 8, 2024

Hi, any progress on this? I'm struggling with csvsql creating smaller fields than needed, instead of just using varchar at full length...

@jpmckinney (Member) commented Jul 8, 2024

@acosonic What database are you using? Its behavior differs across databases. FWIW, most people here are discussing csvsql creating fields that are longer (maximum length) than needed – in MSSQL and PostgreSQL, and it sounds like Oracle as well.

@jpmckinney jpmckinney changed the title csvsql: agate-sql: add option to use data's maximum varchar length csvsql: agate-sql: Add option to set varchar length to data's maximum length, or to a user-provided length Jul 8, 2024
@acosonic commented Jul 9, 2024

> @acosonic What database are you using? Its behavior differs across databases. FWIW, most people here are discussing csvsql creating fields that are longer (maximum length) than needed – in MSSQL and PostgreSQL, and it sounds like Oracle as well.

MySQL; it works well on most CSV files.

@jpmckinney (Member)

@acosonic Ok, for your use case, I've added --min-col-len and --col-len-multiplier, which already existed in agate-sql but weren't exposed in csvkit: 60fa3cf. I'll leave the issue open for constraining the maximum length (per the original discussion).
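Illustrative usage of those flags (the values are arbitrary; as described in the comment above, they set a floor and a multiplier on the detected lengths, not a ceiling):

csvsql -i mysql --min-col-len 64 --col-len-multiplier 2 test.csv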
