Support reading from Text and/or ByteString #92

axman6 · 2017-10-05T04:15:34Z

I have a use case where I want to build frames from data that's been downloaded from the internet (in the form of a .csv.zip), but most of the API seems centred around reading directly from files. I haven't seen pure alternatives to readTableOpt' or readTable which accept Text or ByteString input.

Is there any fundamental reason why this doesn't exist? If not, would it be difficult to add? I assume there's already a pure core in these existing functions which processes the data read in MonadIO, so it feels like it should be possible to split it out (though there are possibly some optimisations that can be made in each situation: streaming data from IO versus reading a known-length chunk of data).
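
To illustrate the kind of split being asked for, here is a minimal sketch that assumes only the `text` package. The names `parseRows` and `readRowsFile` are hypothetical and are not part of the Frames API; the parser is deliberately naive and assumes no quoted fields.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical pure core: turn in-memory CSV text into rows of fields.
-- Naive on purpose: it splits on every separator and every newline,
-- so it does not understand quoted fields.
parseRows :: Char -> T.Text -> [[T.Text]]
parseRows sep =
  map (T.split (== sep))
    . filter (not . T.null)
    . map (T.dropWhileEnd (== '\r')) -- tolerate CRLF line endings
    . T.lines

-- Hypothetical IO wrapper: the file-based reader is just the pure core
-- applied to the file's contents.
readRowsFile :: Char -> FilePath -> IO [[T.Text]]
readRowsFile sep path = parseRows sep <$> TIO.readFile path
```

With this shape, data that was downloaded and unzipped in memory can go straight to `parseRows`; for example, `parseRows ',' (T.pack "a,b\n1,2\n")` evaluates to `[["a","b"],["1","2"]]`.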

acowley · 2017-10-05T12:37:38Z

The pure reading code should clearly be separated from the IO part. That it isn’t is a serious flaw; fixing that would be a significant improvement.

The reason file IO is baked in so deeply is the expectation that compile-time and run-time look at the same data.

That said, the reliance on IO is incidental.

Out of curiosity, are you not using the TH pieces, or perhaps using another file to establish the types before streaming more data at run time?

axman6 · 2017-10-06T00:40:39Z

Yeah I probably will be (though it's more likely I'll write the types by hand in the end, but the TH will help me find them). One of the other reasons for me needing to parse Text values in memory is that the CSVs I deal with aren't nice columnar data, but often have a row of metadata at the beginning and a final row to signify the end of the document (yeah it's pretty awful).
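
As a rough sketch of why in-memory Text helps here: once the whole document is a value, the metadata row and the end marker can be dropped before any parsing happens. `stripEnvelope` is a hypothetical helper, not something Frames provides, and it assumes every record sits on a single line.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: drop the first non-blank line (metadata) and the
-- last non-blank line (end-of-document marker), keeping what is in between.
-- Assumes one record per line, i.e. no newlines inside quoted fields.
stripEnvelope :: T.Text -> T.Text
stripEnvelope input =
  case filter (not . T.null . T.strip) (T.lines input) of
    (_metadata : rest@(_ : _)) -> T.unlines (init rest)
    _ -> T.empty -- fewer than two lines: nothing left to keep
```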

Something I've just realised this morning, that probably needs a robust fix, is that all the code assumes that CSV is a line-based format, but it's perfectly valid to have newlines within quoted text blocks (and some of the data I work with does that a lot). This means the code I've made in the PR is likely broken, as well as the IO hGetLine based code. I know there's been some discussion of using cassava, which I believe handles this problem. Maybe it would be worth investigating taking at least the parser from cassava, since a lot of the work in Frames duplicates it.
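
A small sketch of the problem, assuming only the `text` package: a record splitter has to track whether it is inside double quotes before treating a newline as a record boundary. `splitRecords` is a hypothetical name, not code from Frames or cassava, and it only finds record boundaries; it does not split or unescape fields.

```haskell
import qualified Data.Text as T

-- Hypothetical quote-aware record splitter. A newline ends a record only
-- when we are outside double quotes. An escaped quote ("") toggles the
-- state twice, so it leaves us inside the quoted field as required.
splitRecords :: T.Text -> [T.Text]
splitRecords = go False [] . T.unpack
  where
    -- The accumulator is built in reverse; drop a trailing '\r' (CRLF input).
    emit = T.pack . reverse . dropCR
    dropCR ('\r' : rest) = rest
    dropCR acc = acc

    go _ acc [] = [emit acc | not (null acc)]
    go inQuotes acc (c : cs)
      | c == '"' = go (not inQuotes) (c : acc) cs
      | c == '\n' && not inQuotes = emit acc : go False [] cs
      | otherwise = go inQuotes (c : acc) cs
```

For an input with a header, one record whose quoted field contains a newline, and one plain record, `T.lines` yields four pieces while `splitRecords` yields three records. This converts through `String` for clarity rather than speed, so it is an illustration of the issue rather than a replacement for a real CSV parser.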

axman6 pushed a commit to axman6/Frames that referenced this issue Oct 5, 2017

Add pure parsing functions (Fixes acowley#92)

57a5d86

This was referenced Nov 16, 2017

quoting in custom type #95

Closed

encoding #96

Closed
