Separated Value Parser Improvements in the works
I always thought my implementation for CSV was pretty original - I use an unbelievably complex regular expression to find the different tokens making up a record. There are some issues with this approach. For one, there is an up-front cost associated with building the regex the first time a SeparatedValueReader is created. Even after the regex is compiled, the generated code is nowhere near as fast as a custom parser would be.
I think I can write a custom parser that is not only ridiculously fast but also flexible and more correct. My ultimate goal is to write the fastest .NET CSV parser available.
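To make the idea of a custom parser concrete, here is a minimal sketch of a hand-written, character-by-character tokenizer. It is written in Python for brevity and is not the FlatFiles implementation; the function name `parse_records` and its parameters are hypothetical. It assumes the common conventions: fields may be wrapped in quotes, a doubled quote inside a quoted field stands for a literal quote, separators and line breaks inside quotes are part of the field, and blank lines are skipped.

```python
from typing import Iterator, List


def parse_records(text: str, separator: str = ",", quote: str = '"') -> Iterator[List[str]]:
    """Yield each record in `text` as a list of field strings."""
    field: List[str] = []   # characters of the field being built
    record: List[str] = []  # completed fields of the current record
    in_quotes = False       # True while inside a quoted field
    pending = False         # True once the current record has any content
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    field.append(quote)  # doubled quote means a literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                field.append(ch)         # separators and newlines are literal here
        elif ch == quote:
            in_quotes = True
            pending = True
        elif ch == separator:
            record.append("".join(field))
            field = []
            pending = True
        elif ch == "\n" or ch == "\r":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1                   # treat \r\n as a single line break
            if pending:                  # blank lines produce no record
                record.append("".join(field))
                yield record
            field, record, pending = [], [], False
        else:
            field.append(ch)
            pending = True
        i += 1
    if pending:                          # last record without a trailing newline
        record.append("".join(field))
        yield record
```

A single pass with a couple of flags like this avoids any regex construction cost, which is the main attraction of the approach. A real parser would read from a stream in buffered chunks rather than from one in-memory string.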
My parser is wrong, today. If you have used this library for a while, you will realize it doesn't always comply with your expectations of CSV. I wish I could say it doesn't comply with some standard or another, but there's really no standard CSV format. There's just a general notion about how CSV should behave and that's good enough. Check out http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm.
I spent a lot of my downtime thinking about how to write a faster, more flexible parser. This is no small undertaking. In order to build an effective parser, I have dozens, if not hundreds, of unit tests to write. Even after I get a working code base, I have to performance tune it. I already have some ideas and code snippets to get me started. I need a solid block of time to just focus on my implementation. Let's hope that block of time comes soon while these ideas are still fresh in my mind.
I had been wanting to work on this CSV parser for about 2 weeks. Finally, Memorial Day weekend provided me enough time to really sit down and focus on the task. I ended up with a decent implementation, decent code coverage and really good performance. Just to test things out, I created a sample data set of 1,000,000 records and ran both FlatFiles and CsvHelper against it. It wasn't even close at first, but I kept tweaking my code until I got my run time down, and eventually FlatFiles ended up beating CsvHelper just slightly. Here are the numbers: http://gist.github.com/jehugaleahsa/86defee87df88be404bb9ea40509da8c.
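For readers who want to try a comparison like this themselves, the shape of such a timing run can be sketched as follows. This is not the benchmark behind the numbers above, which compared two .NET libraries; it is a Python illustration, and the helper names `make_sample`, `time_parser` and `stdlib_parse` are hypothetical. It generates records in memory, times one parser over them, and reports how many records were read.

```python
import csv
import io
import time
from typing import Callable, Iterable, Tuple


def make_sample(record_count: int) -> str:
    """Build an in-memory CSV document with `record_count` three-field records."""
    lines = [f'{i},"name, {i}",{i * 0.5}' for i in range(record_count)]
    return "\n".join(lines) + "\n"


def time_parser(parse: Callable[[str], Iterable], text: str) -> Tuple[int, float]:
    """Run `parse` over `text` and return (records read, elapsed seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in parse(text))  # consume every record
    return count, time.perf_counter() - start


def stdlib_parse(text: str):
    """Parser under test: Python's built-in csv module."""
    return csv.reader(io.StringIO(text))


if __name__ == "__main__":
    sample = make_sample(1_000_000)
    records, seconds = time_parser(stdlib_parse, sample)
    print(f"{records} records in {seconds:.3f} s")
```

Any other parser that accepts a string and yields records can be passed to `time_parser` in the same way. Checking the record count alongside the elapsed time guards against a parser that looks fast only because it is dropping or merging records.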
I'm still not totally satisfied. I feel like there is some obvious optimization I am not seeing. I don't think I have an adequate number of unit tests, either. I am hoping an idea comes along.