CSV driver doesn't honor CSVT sidecar in Dataset.GetFileList(), Driver.CreateCopy(), and other I/O operations #8165

Closed
gorloffslava opened this issue Aug 2, 2023 · 5 comments
@gorloffslava
Contributor

gorloffslava commented Aug 2, 2023

Expected behavior and actual behavior.

Given:

Steps to reproduce the problem.

Reproduction case #1:

from osgeo import gdal
ds = gdal.OpenEx("testcsvt.csv")
ds.GetFileList()

Expected output: ['testcsvt.csv', 'testcsvt.csvt']
Actual output: ['testcsvt.csv']

Reproduction case #2:

from osgeo import gdal
ds_src = gdal.OpenEx("testcsvt.csv")
driver = gdal.GetDriverByName("CSV")
ds_dst = driver.CreateCopy("test_csvt_copy", ds_src)
ds_dst.FlushCache()

import os
os.listdir("test_csvt_copy")

Expected output: ['testcsvt.csv', 'testcsvt.csvt']
Actual output: ['testcsvt.csv']

Reproduction case #3:
How can we check that the CSVT is really loaded by GDAL?

import os
import geopandas

gdf = geopandas.read_file("testcsvt.csv")
gdf.info()
"""
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   INTCOL      1 non-null      float64       
 1   REALCOL     1 non-null      float64       
 2   STRINGCOL   2 non-null      object        
 3   INTCOL2     1 non-null      float64       
 4   REALCOL2    1 non-null      float64       
 5   STRINGCOL2  2 non-null      object        
 6   DATETIME    1 non-null      datetime64[ns]
 7   DATE        1 non-null      object        
 8   TIME        1 non-null      object        
 9   geometry    0 non-null      geometry      
dtypes: datetime64[ns](1), float64(4), geometry(1), object(4)
memory usage: 288.0+ bytes
"""
# Typings are applied correctly

os.remove("testcsvt.csvt")
gdf = geopandas.read_file("testcsvt.csv")
gdf.info()
"""
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   INTCOL      2 non-null      object  
 1   REALCOL     2 non-null      object  
 2   STRINGCOL   2 non-null      object  
 3   INTCOL2     2 non-null      object  
 4   REALCOL2    2 non-null      object  
 5   STRINGCOL2  2 non-null      object  
 6   DATETIME    2 non-null      object  
 7   DATE        2 non-null      object  
 8   TIME        2 non-null      object  
 9   geometry    0 non-null      geometry
dtypes: geometry(1), object(9)
memory usage: 288.0+ bytes
"""
# Typings no longer work. Expected, as we deleted `.csvt` sidecar w/ them.
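
The same check can also be made without geopandas, by asking GDAL for the field types of the opened layer. The following is a minimal sketch, assuming the GDAL Python bindings are installed; `field_types` is a hypothetical helper name introduced here for illustration, not part of GDAL.

```python
def field_types(csv_path):
    """Return {field name: OGR type name} for the first layer of csv_path."""
    # Imported inside the function so the helper can be defined
    # even on a machine where the GDAL bindings are missing.
    from osgeo import gdal

    gdal.UseExceptions()
    ds = gdal.OpenEx(csv_path, gdal.OF_VECTOR)
    layer = ds.GetLayer(0)
    defn = layer.GetLayerDefn()
    types = {}
    for i in range(defn.GetFieldCount()):
        fld = defn.GetFieldDefn(i)
        types[fld.GetName()] = fld.GetTypeName()
    # Drop the references so GDAL closes the dataset.
    fld = defn = layer = None
    ds = None
    return types
```

With the sidecar present, typed columns should report names such as Integer or Real; once the `.csvt` is removed, and no type autodetection option is set, every column should fall back to String.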

Operating system

Reproducible w/ any of the following:

  • macOS 14.0 beta 4 (x86_64, Intel)
  • Amazon Linux 2 (x86_64, Intel + aarch64, AWS Graviton2)
  • Amazon Linux 2023 (x86_64, Intel + aarch64, AWS Graviton3E)

GDAL version and provenance

Reproducible w/ any of the following:

  • GDAL 3.7.1 from conda-forge
  • Self-compiled GDAL 3.8.0dev from master branch
@jratike80
Collaborator

By reading the documentation of the CSV driver https://gdal.org/drivers/vector/csv.html, by default the .csvt file is not created. A special layer creation option is required.

CREATE_CSVT=[YES/NO]: Defaults to NO. Create the associated .csvt file (see above paragraph) to describe the type of each column of the layer and its optional width and precision.

@gorloffslava
Contributor Author

By reading the documentation of the CSV driver https://gdal.org/drivers/vector/csv.html, by default the .csvt file is not created. A special layer creation option is required.

CREATE_CSVT=[YES/NO]: Defaults to NO. Create the associated .csvt file (see above paragraph) to describe the type of each column of the layer and its optional width and precision.

Thanks for your response! We use that when writing datasets, yes, and it works.

But in the issue above we copy a dataset rather than create one from scratch, so we expect all sidecars to be copied automatically, as happens, for example, w/ GeoTIFF or ESRI Shapefile. Also, that option doesn't seem to affect opening datasets which already have this sidecar.

@jratike80
Collaborator

jratike80 commented Aug 3, 2023

I may be wrong, but doesn't driver.CreateCopy make a copy of the internal representation of the data that GDAL has after opening the source dataset? So it does not copy files even if the source and target formats are the same; instead the data gets rewritten. Have you tried to use the layer creation option as I suggested? Unfortunately I am not a programmer and I can't tell how to test that.

Maybe https://gdal.org/api/python/osgeo.ogr.html#osgeo.ogr.DataSource.CopyLayer does something similar:

Duplicate an existing layer.
This function creates a new layer, duplicate the field definitions of the source layer and then duplicate each features of the source layer. The papszOptions argument can be used to control driver specific creation options. These options are normally documented in the format specific documentation. The source layer may come from another dataset.
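
One way to try the layer creation option through CopyLayer is sketched below. It assumes the GDAL Python bindings are installed and that the destination directory does not exist yet; `copy_layer_with_csvt` and its parameter names are hypothetical and only for illustration.

```python
def copy_layer_with_csvt(src_csv, dst_dir, layer_name):
    """Rewrite the first layer of src_csv as dst_dir/<layer_name>.csv plus a .csvt."""
    # Imported inside the function so the helper can be defined
    # even on a machine where the GDAL bindings are missing.
    from osgeo import gdal

    gdal.UseExceptions()
    src_ds = gdal.OpenEx(src_csv, gdal.OF_VECTOR)
    src_layer = src_ds.GetLayer(0)
    # Assumption: dst_dir does not exist yet, so the CSV driver
    # creates it as a directory holding one file per layer.
    dst_ds = gdal.GetDriverByName("CSV").Create(dst_dir, 0, 0, 0, gdal.GDT_Unknown)
    # The third argument carries the layer creation options.
    dst_layer = dst_ds.CopyLayer(src_layer, layer_name, ["CREATE_CSVT=YES"])
    # Release the handles so both files are flushed to disk.
    dst_layer = None
    dst_ds = None
    src_layer = None
    src_ds = None
```

Note that the data is still rewritten through GDAL's own model, so the output `.csvt` is regenerated from the field definitions rather than copied from the source sidecar.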

rouault self-assigned this Aug 9, 2023
@rouault
Member

rouault commented Aug 9, 2023

I'm working on having GetFileList() report the .csvt file, but you indeed shouldn't expect CreateCopy() to create a .csvt file, even if the source dataset is a .csv file with a .csvt. Output drivers in GDAL know nothing about input drivers, and everything goes through a pivot model that forgets the implementation details. You'd better use a plain file copy if you want to do CSV -> CSV without any change.
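
A plain file copy that also picks up the sidecar can be sketched with the standard library alone. This assumes the sidecar sits next to the CSV and shares its base name; `copy_csv_with_sidecar` is a hypothetical helper name, not a GDAL function.

```python
import shutil
from pathlib import Path


def copy_csv_with_sidecar(src_csv, dst_dir):
    """Copy src_csv and its .csvt sidecar (if present) into dst_dir.

    Returns the list of paths that were written.
    """
    src = Path(src_csv)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    # Assumption: the sidecar has the same base name with a .csvt extension.
    for path in (src, src.with_suffix(".csvt")):
        if path.exists():
            copied.append(Path(shutil.copy2(path, dst / path.name)))
    return copied
```

For example, `copy_csv_with_sidecar("testcsvt.csv", "test_csvt_copy")` would copy both files byte for byte, without involving any GDAL driver.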
As there isn't a way of providing layer creation options in the GDALDataset::CopyLayer() call done by GDALDriver::DefaultCreateCopy(), you'd better use GDALVectorTranslate() instead.
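
In the Python bindings, GDALVectorTranslate() is exposed as gdal.VectorTranslate(), which accepts layer creation options. The following is a minimal sketch, assuming the GDAL Python bindings are installed and that the destination file does not exist yet; `translate_csv_with_csvt` is a hypothetical wrapper name.

```python
def translate_csv_with_csvt(src_csv, dst_csv):
    """Rewrite src_csv as dst_csv and ask the CSV driver to emit a .csvt too."""
    # Imported inside the function so the helper can be defined
    # even on a machine where the GDAL bindings are missing.
    from osgeo import gdal

    gdal.UseExceptions()
    out_ds = gdal.VectorTranslate(
        dst_csv,
        src_csv,
        format="CSV",
        # Same option that jratike80 quoted from the CSV driver documentation.
        layerCreationOptions=["CREATE_CSVT=YES"],
    )
    # Release the handle so the output files are flushed to disk.
    out_ds = None
```

This still goes through the pivot model described above, so the `.csvt` is written from the field types GDAL read, not copied from the input sidecar.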

rouault added a commit that referenced this issue Aug 10, 2023
CSV: implement GetFileList() and return .csvt if used (fixes #8165)
@gorloffslava
Contributor Author

@rouault big thanks for fixing this!
And for your explanation about CreateCopy() behavior.

@jratike80 big thanks for your assist as well!
