Encoding not handled correctly for natural earth data #739

Open
jtbraun opened this issue Mar 16, 2016 · 4 comments

Comments

@jtbraun

jtbraun commented Mar 16, 2016

Prior to version 3.x of the natural earth data, the strings inside the *.dbf files were encoded as Windows-1252 as documented here: http://www.naturalearthdata.com/features/

Starting with the 3.x versions, the *.dbf files are encoded with UTF-8, as mentioned here: nvkelso/natural-earth-vector#89

At some point the zip files began including a .cpg file (like ne_10m_admin_0_map_subunits.cpg), whose contents specify the character encoding (UTF-8 in the example given).

In my opinion, since cartopy.io.shapereader.natural_earth() does the magic downloading of the Natural Earth data, it should also look for and unzip/cache the *.cpg file and the *.VERSION.txt file. It should take the encoding from the *.cpg file when one is present; if that file doesn't exist, it should read the version, compare it against 3.x, and assume Windows-1252 for earlier versions or UTF-8 for 3.x and later.
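The lookup order described above could be sketched roughly as follows. This is only an illustration, not cartopy code: the function name guess_dbf_encoding is hypothetical, and it assumes that the .cpg and .VERSION.txt files sit next to the .shp file under the same base name and that the version file starts with a dotted version string such as "3.1.0". It also ignores .cpg values that Python does not recognise as codec names.

```python
import codecs
import re
from pathlib import Path


def guess_dbf_encoding(shp_path, default="cp1252"):
    """Guess the encoding of the .dbf next to shp_path (hypothetical helper)."""
    shp_path = Path(shp_path)

    # 1. Prefer the .cpg sidecar, which names the encoding directly.
    cpg = shp_path.with_name(shp_path.stem + ".cpg")
    if cpg.is_file():
        declared = cpg.read_text(encoding="ascii", errors="ignore").strip()
        try:
            # Normalise to Python's codec name, e.g. "UTF-8" -> "utf-8".
            return codecs.lookup(declared).name
        except LookupError:
            pass  # Empty or unknown name: fall through to the version check.

    # 2. Otherwise use the version file (assumed to start with e.g. "3.1.0").
    version_file = shp_path.with_name(shp_path.stem + ".VERSION.txt")
    if version_file.is_file():
        match = re.match(r"\s*(\d+)", version_file.read_text(errors="ignore"))
        if match:
            return "utf-8" if int(match.group(1)) >= 3 else "cp1252"

    # 3. Nothing to go on: return the caller-supplied default.
    return default
```

The returned name could then be passed to whichever reader accepts an encoding argument.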

Then, pyshp (shapefile.py) needs to be modified to allow the encoding to be specified. Today it assumes UTF-8 when sys.version_info[0] == 3, and assumes nothing (passes the bytes back and forth) when sys.version_info[0] != 3 (see GeospatialPython/pyshp#46).
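To make the failure modes concrete, here is a small standard-library sketch of what goes wrong when the assumed encoding does not match the stored bytes. The name "Åland" is just an example string chosen for illustration, not a value read from the dataset.

```python
# The same name as it would be stored under each encoding.
raw_utf8 = "Åland".encode("utf-8")      # b'\xc3\x85land' (3.x-style file)
raw_cp1252 = "Åland".encode("cp1252")   # b'\xc5land'     (pre-3.x-style file)

# Correct pairing: the bytes round-trip cleanly.
right = raw_utf8.decode("utf-8")

# UTF-8 bytes read as Windows-1252: no error, but the text is silently wrong.
mojibake = raw_utf8.decode("cp1252")

# Windows-1252 bytes read as UTF-8: the decoder rejects the byte sequence.
try:
    raw_cp1252.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(repr(right), repr(mojibake), utf8_failed)
```

The first mismatch is the harder one to notice, since nothing raises and the garbled string simply flows through to labels and output.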

@pelson
Member

pelson commented Mar 17, 2016

This sounds reasonable if it isn't part of the core shapefile capability. Out of interest, have you tried loading the shapefiles with Fiona? Is there an assumed encoding there?

@jtbraun
Author

jtbraun commented Mar 17, 2016

I tried fiona at your suggestion. It also seems to default to cp1252. However, unlike shapefile, you can provide an encoding= kwarg to fiona.open, which results in the correct value. If you wanted to take the dependency (or do so at run time), you could use fiona instead of shapefile to get properly decoded fields. Cartopy would still have to do the work of tracking the *.cpg file.

Here's a small sample that reads in the 50m and 10m files w/ fiona and each of the cp1252/utf-8 encodings, and you can see the difference for the 'name' property (which is 'NAME' in the 10m file).

import re
from itertools import product

import cartopy.io.shapereader as sr
import fiona

# Open each file with both candidate encodings and print every name that
# contains non-ASCII characters, so the two decodings can be compared.
for resolution, encoding in product(['50m', '10m'], ['cp1252', 'utf-8']):
    filename = sr.natural_earth(
        resolution=resolution,
        category='cultural',
        name='admin_0_map_subunits')
    source = fiona.open(filename, encoding=encoding)
    print(filename, len(source))
    try:
        for f in source:
            # The property is 'name' in the 50m file and 'NAME' in the 10m file.
            for propname in ['name', 'NAME']:
                try:
                    name = f['properties'][propname]
                    break
                except KeyError:
                    name = '<missing>'

            if re.search(r'[\u0080-\u7fff]', name):
                # ascii() shows the escaped code points next to the rendered string.
                print("%-4s %-8s %-30s ==> %-30s" % (resolution, encoding, ascii(name), name))
    except Exception:
        print("ERROR DURING", resolution, encoding)
    finally:
        source.close()

@jtbraun
Author

jtbraun commented Mar 17, 2016

Incidentally, the fiona user's manual even says:

The format drivers will attempt to detect the encoding of your data, but may fail. In my experience GDAL 1.7.2 (for example) doesn’t detect that the encoding of the Natural Earth dataset is Windows-1252. In this case, the proper encoding can be specified explicitly by using the encoding keyword parameter of fiona.open(): encoding='Windows-1252'.

@pelson
Member

pelson commented Mar 23, 2016

Excellent. Thank you @jtbraun. I think fiona is becoming more readily installable, and is a reasonable optional dependency for cartopy. The upshot will be huge performance boosts, which is always nice 😄
