Support reading from Text and/or ByteString #92

Open
axman6 opened this issue Oct 5, 2017 · 2 comments


axman6 commented Oct 5, 2017

I have a use case where I want to build frames from data that has been downloaded from the internet (in the form of a .csv.zip), but most of the API seems centred around reading directly from files. I haven't found pure alternatives to readTableOpt' or readTable that accept Text or ByteString input.

Is there any fundamental reason why this doesn't exist? If not, would it be difficult to add? I assume there's already a pure core in these existing functions which processes the data read in MonadIO, so it feels like it should be possible to split it out (though there are possibly some optimisations which can be made in each situation: streaming data from IO versus reading a chunk of data of known length).
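To make the proposed split concrete, here is a minimal sketch of what a pure core with a thin IO wrapper could look like. The names parseRows and parseRowsFile are hypothetical and are not part of the Frames API; the parser is deliberately naive (one record per line, no quoting support) and only illustrates the separation, not how Frames actually parses rows.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- | Hypothetical pure core: turn in-memory text into rows of fields.
-- Assumes one record per line and no quoted fields.
parseRows :: Char -> T.Text -> [[T.Text]]
parseRows sep =
  map (map T.strip . T.splitOn (T.singleton sep)) -- split each line into trimmed fields
    . filter (not . T.null)                        -- skip empty lines
    . T.lines

-- | Hypothetical IO wrapper: the only part that touches the file system.
parseRowsFile :: Char -> FilePath -> IO [[T.Text]]
parseRowsFile sep path = parseRows sep <$> TIO.readFile path
```

With this shape, data that has already been downloaded and unzipped into a Text value can go straight to parseRows, while file-based callers keep using the wrapper.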

axman6 pushed a commit to axman6/Frames that referenced this issue Oct 5, 2017
Owner

acowley commented Oct 5, 2017

The pure reading code should clearly be separated from the IO part. That it isn’t is a serious flaw; fixing that would be a significant improvement.

The reason file IO is baked in so deeply is the expectation that compile-time and run-time look at the same data.

That said, the reliance on IO is incidental.

Out of curiosity, are you not using the TH pieces, or perhaps using another file to establish the types before streaming more data at run time?

Author

axman6 commented Oct 6, 2017

Yeah, I probably will be (though it's more likely I'll write the types by hand in the end, with the TH helping me find them). One of the other reasons I need to parse Text values in memory is that the CSVs I deal with aren't nice columnar data, but often have a row of metadata at the beginning and a final row to signify the end of the document (yeah, it's pretty awful).

Something I've just realised this morning, and that probably needs a robust fix, is that all the code assumes that CSV is a line-based format. It's perfectly valid to have newlines within quoted text blocks (and some of the data I work with does that a lot). This means the code I've written in the PR is likely broken, as is the hGetLine-based IO code. I know there's been some discussion of using cassava, which I believe handles this problem. Maybe it would be worth investigating taking at least the parser from cassava, since a lot of the work in Frames duplicates it.
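As an illustration of why splitting on lines is not enough, the sketch below breaks CSV text into records while treating newlines inside double-quoted fields as data. The function splitRecords is hypothetical; it is neither Frames nor cassava code, and it assumes LF line endings and RFC 4180 style quoting where a literal quote is written as a doubled quote.

```haskell
import qualified Data.Text as T

-- | Hypothetical record splitter: a newline ends a record only when we are
-- outside a quoted field. A doubled quote ("") flips the flag twice, so the
-- scanner stays inside the quoted field. Quotes are kept in the output.
splitRecords :: T.Text -> [T.Text]
splitRecords = map T.pack . go False [] . T.unpack
  where
    -- End of input: emit the pending record unless it is empty.
    go _ acc [] = [reverse acc | not (null acc)]
    go inQuotes acc (c : cs)
      | c == '"'                  = go (not inQuotes) (c : acc) cs
      | c == '\n' && not inQuotes = reverse acc : go False [] cs
      | otherwise                 = go inQuotes (c : acc) cs
```

For an input such as a,"x\ny"\nb,c\n, T.lines yields three lines and cuts the quoted field in half, whereas splitRecords yields two records. A real fix would still need to unescape fields and handle CRLF, which is part of the argument for reusing an existing parser.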

This was referenced Nov 16, 2017