CSVs are awesome, yet they're pretty dumb. Let's make them smarter!
smartcsv is a Python utility to read and parse CSVs based on model definitions. Instead of just parsing the CSV into lists (like the builtin csv module), it lets you specify models with attribute names. On top of that it adds features like validation, custom parsing, failure control and descriptive error messages.
>>> reader = smartcsv.reader(file_object, columns=COLUMNS, fail_fast=False)
>>> my_object = next(reader)
>>> my_object['title'] # Accessed by model name.
'iPhone 5c Blue'
>>> my_object['price'] # Value transform included
Decimal("799.99")
>>> my_object['currency'] # Based on choices = ['USD', 'YEN']
'USD'
>>> my_object['url'] # custom validator lambda v: v.startswith('http')
'https://www.apple.com/iphone.jpg'
# Nice errors
>>> from pprint import pprint as pp
>>> pp(reader.errors)
{
    17: {  # The row number
        'row': ['', '', ...],  # The complete row for reference
        'errors': {  # Description of the errors
            'url': 'Validation failed',
            'currency': "Invalid choice. Expected ['USD', 'YEN']. Got 'AUD' instead.",
        }
    }
}
pip install smartcsv
For a complete set of usage examples, check the test package (99% coverage).
The basic idea is to define a spec for the columns of your CSV. Assuming the following CSV file:
title,category,subcategory,currency,price,url,image_url
iPhone 5c blue,Phones,Smartphones,USD,399,http://apple.com/iphone,http://apple.com/iphone.jpg
iPad mini,Tablets,Apple,USD,699,http://apple.com/iphone,http://apple.com/iphone.jpg
First you need to define the spec for your columns. This is an example (the one used in tests):
CURRENCIES = ('USD', 'ARS', 'JPY')
COLUMNS_1 = [
    {'name': 'title', 'required': True},
    {'name': 'category', 'required': True},
    {'name': 'subcategory', 'required': False},
    {
        'name': 'currency',
        'required': True,
        'choices': CURRENCIES
    },
    {
        'name': 'price',
        'required': True,
        'validator': is_number
    },
    {
        'name': 'url',
        'required': True,
        'validator': lambda c: c.startswith('http')
    },
    {
        'name': 'image_url',
        'required': False,
        'validator': lambda c: c.startswith('http')
    },
]
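The spec above refers to an is_number validator without defining it. Judging by the lambdas used for url and image_url, a validator is a callable that receives the raw cell string and returns a truthy value when the cell is acceptable. A minimal sketch of such a helper follows; it is an assumption for illustration and may differ from the one in the smartcsv tests:

```python
from decimal import Decimal, InvalidOperation


def is_number(value):
    """Return True if the raw CSV cell parses as a finite decimal number.

    Hypothetical helper: the smartcsv tests may define it differently.
    """
    try:
        return Decimal(value.strip()).is_finite()
    except (InvalidOperation, AttributeError):
        # InvalidOperation: not a number; AttributeError: value is not a string.
        return False


print(is_number('399'), is_number('abc'))
```

Using Decimal instead of float keeps the check consistent with prices that are later parsed as Decimal values.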
You can then use smartcsv to parse the CSV:
import smartcsv
with open('my-csv.csv', 'r') as f:
    reader = smartcsv.reader(f, columns=COLUMNS_1)
    for obj in reader:
        print(obj['title'])
smartcsv.reader uses the builtin csv module and accepts a dialect to use.
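Since the reader builds on the standard csv module, a dialect here is the usual csv dialect mechanism. The sketch below uses only the standard library to show what a dialect changes. The dialect name 'semicolon' and the sample data are made up for illustration, and passing the dialect to smartcsv.reader through a dialect keyword is an assumption based on the sentence above, not a verified signature:

```python
import csv
import io

# Register a hypothetical dialect for semicolon-separated files.
csv.register_dialect('semicolon', delimiter=';', quotechar='"')

data = io.StringIO('title;price\n"iPhone 5c; blue";399\n')
rows = list(csv.reader(data, dialect='semicolon'))
print(rows)

# Assumption: smartcsv would accept the same dialect, for example
# smartcsv.reader(f, columns=COLUMNS_1, dialect='semicolon')
```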
Errors
By default smartcsv will raise a smartcsv.exceptions.InvalidCSVException when it encounters an error in a column (a missing required field, a value outside the allowed choices, a validation failure, etc.). The exception carries a descriptive error message:
from smartcsv.exceptions import InvalidCSVException

# Assuming the price field is missing
try:
    item = next(reader)
except InvalidCSVException as e:
    print(e.errors)
    # {'price': 'Field required and not provided.'}
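To make the three kinds of column errors concrete, here is a rough pure-Python sketch of how a row could be checked against a column spec. It mimics the checks described above (required, choices, validator) and reuses the error messages shown in this README, but validate_row is a hypothetical function and not smartcsv's own implementation:

```python
def validate_row(row, columns):
    """Check one row (a list of strings) against a column spec.

    Returns a dict mapping column name to an error message; empty if valid.
    Illustration only: this is not how smartcsv is necessarily written.
    """
    errors = {}
    for index, column in enumerate(columns):
        name = column['name']
        value = row[index] if index < len(row) else ''
        if not value:
            if column.get('required'):
                errors[name] = 'Field required and not provided.'
            continue  # Empty optional fields skip further checks.
        choices = column.get('choices')
        if choices and value not in choices:
            errors[name] = 'Invalid choice. Expected {0}. Got {1!r} instead.'.format(
                list(choices), value)
            continue
        validator = column.get('validator')
        if validator and not validator(value):
            errors[name] = 'Validation failed'
    return errors


columns = [
    {'name': 'title', 'required': True},
    {'name': 'currency', 'required': True, 'choices': ('USD', 'ARS', 'JPY')},
    {'name': 'price', 'required': True},
    {'name': 'url', 'required': True, 'validator': lambda c: c.startswith('http')},
]
errors = validate_row(['iPad mini', 'AUD', '', 'ftp://example.com'], columns)
print(errors)
```

In fail-fast mode a dict like this would be attached to the raised exception; with fail_fast=False it would be collected per row instead.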
You can always avoid fast failure (raising an exception on the first error) by passing fail_fast=False. That prevents exceptions; instead, the errors are reported on the reader object, indicating the row number and the details of each error. For example, assuming a CSV with an error in the second row:
reader = smartcsv.reader(f, columns=COLUMNS_1, fail_fast=False)
for obj in reader:
    # All the processing is done OK, without exceptions raised.
    print(obj['title'])

error_row = reader.errors['rows'][1]  # Second row has index = 1. Errors are 0-indexed.
print(error_row['row'])  # Print original row data
print(error_row['errors'].keys())  # currency (the currency column)
print(error_row['errors']['currency'])  # Invalid currency... (nice error explanation)
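Once the loop has finished, the collected errors can be turned into a short report. The helper below is a hypothetical sketch that works on a hand-written dict shaped like the structure used above. It assumes 'rows' is a dict keyed by 0-based row index, which the example suggests but does not confirm:

```python
def summarize_errors(errors):
    """Turn a reader.errors-style mapping into one readable line per failure."""
    lines = []
    for row_index in sorted(errors.get('rows', {})):
        detail = errors['rows'][row_index]
        for column, message in sorted(detail['errors'].items()):
            lines.append('row {0}, column {1}: {2}'.format(row_index, column, message))
    return lines


# Hand-written sample data, not real smartcsv output.
sample_errors = {
    'rows': {
        1: {
            'row': ['iPad mini', 'Tablets', 'Apple', 'AUD', '699'],
            'errors': {
                'currency': "Invalid choice. Expected ['USD', 'ARS', 'JPY']. Got 'AUD' instead.",
            },
        },
    },
}
for line in summarize_errors(sample_errors):
    print(line)
```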
You can also specify a max_failures parameter. It will count failures and will raise an exception when that threshold is exceeded.
Strip white spaces
By default the strip_white_spaces option is set to True. Example:
sample.csv
title,price
Some Product , 55.5
row['title'] will be "Some Product" and row['price'] will be "55.5" (spaces stripped).
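For comparison, this is roughly what the option saves you from doing by hand with the standard csv module. The snippet only illustrates the effect of stripping; it is not smartcsv code, and the exact amount of padding in the sample is made up:

```python
import csv
import io

sample = io.StringIO('title,price\nSome Product   ,  55.5\n')
raw = next(csv.DictReader(sample))

# The plain csv module keeps the padding around each value.
print(repr(raw['title']), repr(raw['price']))

# Stripping each value gives the result described above.
stripped = {key: value.strip() for key, value in raw.items()}
print(repr(stripped['title']), repr(stripped['price']))
```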
Skip lines
sample.csv
GENERATED BY AWESOME SCRIPT
2014-08-12

title,price
Some Product,55.5
The first 3 lines don't contain any valuable data so we'll skip them.
reader = smartcsv.reader(f, columns=COLUMNS_1, fail_fast=False, skip_lines=3)
for obj in reader:
    print(obj['title'])
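The same idea can be sketched with the standard library alone, which may help when checking how many lines need skipping. The example assumes the third skipped line is an empty one sitting between the date and the header; it illustrates the concept and is not how smartcsv implements skip_lines:

```python
import csv
import io
from itertools import islice

sample = io.StringIO(
    'GENERATED BY AWESOME SCRIPT\n'
    '2014-08-12\n'
    '\n'
    'title,price\n'
    'Some Product,55.5\n'
)
skip_lines = 3

# Drop the preamble, then let DictReader treat the next line as the header.
reader = csv.DictReader(islice(sample, skip_lines, None))
rows = list(reader)
print(rows[0]['title'])
```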
Stop on the first error
By default, fail_fast is True. You can also set it explicitly with fail_fast=True. The reader then stops as soon as it hits an error in the CSV file. An error can be a mismatch between your column spec and the value found in the file (for example a missing required field or a value outside the allowed choices) or a validator failure.
reader = smartcsv.reader(f, columns=COLUMNS_1, fail_fast=True)
for obj in reader:
    print(obj['title'])
Fork, code, watch your tests pass, submit a PR. To test:
$ python setup.py test # Run tests in your venv
$ tox # Make sure it passes in all versions.
There are "integration" tests included under tests/integration. They are not run by the default test runner. Their purpose is to document real use cases for smartcsv.
You'll have to run them manually:
py.test tests/integration/lpnk/test_lpnk.py