parse_stream error v0.6.7 vs 0.6.9 #356
Comments
Been digging more into this and managed to find a tweet that breaks the parsing function in v0.6.9. The file four_tweets.json.zip contains four tweets, and the second one is the tweet that fails. I cannot see anything wrong with the structure of that tweet. I used several online JSON viewers such as http://jsonviewer.stack.hu/ and https://jsoneditoronline.org/ and they both parse it correctly, so I'm lost. TL;DR: there seem to be some tweets (e.g. the second tweet in four_tweets.json.zip) that cannot be parsed correctly. The difference is that v0.6.9 throws a parsing error, whilst v0.6.7 ignores the tweet and creates a tibble without it. Both behaviours seem problematic when there's nothing visibly wrong with the "offending" tweet.
I had a similar problem and found a solution, which I describe in this comment in issue #355. I tried my function with your examples and it works fine. Instead of ignoring the "damaged" tweets like v0.6.7, you will be asked what to do with them. I would probably attempt to re-download them using
@JBGruber thanks a lot for this. Must admit I originally ignored issue #355 as I thought it was related to broken/damaged tweets, because as explained above, I could open these "offending" tweets using online JSON viewers. Now I realise that the issue with my "offending" tweets was "simply" an extra newline/carriage return halfway through the JSON entry. If I manually open the JSON file and remove this newline/carriage return, the parser works fine. I find it strange that the Twitter API randomly introduces these newlines/carriage returns. I doubt it was a problem with the internet connection, because if that were the case, the whole second part of the tweet would have been lost, no?
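For anyone hitting the same thing, here is a minimal sketch of the manual repair described above: glue consecutive lines back together until they form valid JSON again. This is not the function from issue #355; `rejoin_split_records()` is a hypothetical helper name, and the approach assumes that a record broken by a stray newline stops validating until its remaining fragments are appended (a fragment that happens to be valid JSON on its own would fool it).

```r
library(jsonlite)

# Hypothetical helper (not part of rtweet): glue consecutive lines together
# until the buffer validates as JSON, then emit it as one record.
rejoin_split_records <- function(lines) {
  records <- character(0)
  buffer <- ""
  for (line in lines) {
    line <- sub("\r$", "", line)                # drop a trailing carriage return
    if (!nzchar(line) && !nzchar(buffer)) next  # skip blank lines between records
    buffer <- paste0(buffer, line)
    # validate() returns a logical with attributes; as.logical() strips them
    if (isTRUE(as.logical(jsonlite::validate(buffer)))) {
      records <- c(records, buffer)
      buffer <- ""
    }
  }
  if (nzchar(buffer)) warning("Discarding incomplete trailing record")
  records
}

# Toy input: the second record is split in two by a stray newline
lines <- c(
  '{"id_str":"1","text":"first"}',
  '{"id_str":"2","te',
  'xt":"second"}',
  '{"id_str":"3","text":"third"}'
)
fixed <- rejoin_split_records(lines)
tweets <- jsonlite::stream_in(textConnection(fixed), verbose = FALSE)
```

On a real file the `lines` vector would come from something like `readr::read_lines()`; re-validating the growing buffer on every line is slow for large files, so this is only meant to illustrate the idea.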
Interesting. So it seems it has to do with this: jeroen/jsonlite#47. I updated my function to remove the misbehaving characters before parsing. Now all your files seem to work. The four tweets are parsed. testA has 243 tweets, testB has 231.
To install old version of a package in
@ghostbeagle just an FYI, as I'm not sure if you've seen my subsequent comments regarding this issue. Going back to v0.6.7 doesn't actually solve the issue; it just ignores the "broken" tweets rather than erroring. Best is to use @JBGruber
I think the root cause is that
Ah no, that probably won’t help because any data streamed in the interim will be lost. I think it should probably only ever write complete lines, keeping any trailing fragment that has no newline and adding it to the start of the next chunk. It might also be worth parsing the stream in order to separate tweets from system messages.
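A rough sketch of the "only ever write complete lines" idea, in base R. This is not the code from any rtweet release or pull request; `make_line_buffered_writer()` is a hypothetical name, and it assumes the stream arrives as character chunks whose records end in a newline.

```r
# Hypothetical helper (not part of rtweet): returns a function that accepts
# text chunks and appends only complete, newline-terminated lines to `path`.
make_line_buffered_writer <- function(path) {
  pending <- ""  # trailing fragment carried over to the next chunk
  function(chunk) {
    text <- paste0(pending, chunk)
    # Position of the last newline; gregexpr() gives -1 when there is none
    last_nl <- max(c(0L, gregexpr("\n", text, fixed = TRUE)[[1]]))
    if (last_nl < 1L) {
      pending <<- text  # no complete line yet, keep everything for later
      return(invisible(0L))
    }
    complete <- substr(text, 1L, last_nl)
    pending <<- substr(text, last_nl + 1L, nchar(text))
    cat(complete, file = path, append = TRUE, sep = "")
    invisible(nchar(complete))
  }
}

# Toy usage: the chunk boundaries fall in the middle of records
path <- tempfile(fileext = ".json")
write_chunk <- make_line_buffered_writer(path)
write_chunk('{"id_str":"1"}\n{"id_st')
write_chunk('r":"2"}\n{"id_str"')
written <- readLines(path)
```

After the two calls the file holds the first two records on their own lines, and the unfinished third record stays in memory until a later chunk completes it. Separating tweets from system messages is not covered here.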
I've implemented a completely fresh approach in #526. It should never write invalid data to disk, so you'll never need to do any sort of special parsing.
Should be resolved now. Please open a new issue with a reprex if you run into problems in the future.
Problem
I have hundreds of .json files of tweets that were collected using the stream_tweets() function. When I try to parse these files into an R object, some produce a parsing error and some work fine when using rtweet v0.6.9 (the "new" version). However, they are all parsed correctly when using rtweet v0.6.7 (the "older" version).
Expected behavior
parse_stream() should behave similarly in both versions (v0.6.7 vs 0.6.9).
Reproduce the problem
Here are two sample files to reproduce the problem:
testA.json.zip
testB.json.zip
When using v0.6.9 the following parsing errors are produced
When using v0.6.7, both files are parsed correctly without any error.
Potential culprit
Comparing both versions and debugging the problem, it seems that the parsing error is caused by changes to either or both of these functions:
1. good_lines() (v0.6.7) vs good_lines2() (v0.6.9). The function good_lines2() performs a lot more string manipulation than good_lines().
2. .parse_stream() (v0.6.7) vs .parse_stream_two() (v0.6.9). .parse_stream() uses the jsonlite::stream_in() function whilst .parse_stream_two() uses jsonlite::fromJSON().
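To illustrate why the choice between the two jsonlite functions can matter, here is a small sketch on toy data. It does not reproduce what .parse_stream() or .parse_stream_two() actually do internally; in particular, wrapping the lines in a single JSON array before calling `jsonlite::fromJSON()` is an assumption made for the example, and the toy records are invented.

```r
library(jsonlite)

ndjson <- c('{"id_str":"1","text":"first"}', '{"id_str":"2","text":"second"}')

# Line by line: each element is parsed as its own JSON document
via_stream <- jsonlite::stream_in(textConnection(ndjson), verbose = FALSE)

# All at once (assumed approach): the lines are glued into one JSON array
as_array <- function(x) paste0("[", paste(x, collapse = ","), "]")
via_array <- jsonlite::fromJSON(as_array(ndjson))

# With a truncated record the whole array is invalid, so fromJSON() errors
# and none of the records come back
broken <- c(ndjson[1], '{"id_str":"2","te')
broken_result <- tryCatch(
  jsonlite::fromJSON(as_array(broken)),
  error = function(e) NULL
)
```

For well-formed input both routes return the same two records; with the truncated record the array route returns nothing at all, which is one way a single bad line could take down a whole file.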
My suspicion is that good_lines2() is the one introducing the downstream parsing error. I don't know why this function performs a lot more string manipulation than the previous good_lines() function. Please note that in both cases the output from readr::read_lines() is identical, so it doesn't seem to be a problem with changes to readr.
Session info
v0.6.7
v0.6.9
Token