This repository has been archived by the owner on Nov 10, 2024. It is now read-only.

parse_stream error v0.6.7 vs 0.6.9 #356

Closed
jjvalletta opened this issue Sep 24, 2019 · 14 comments

@jjvalletta

Problem

I have hundreds of .json files of tweets collected with the stream_tweets() function. When I try to parse these files into an R object using rtweet v0.6.9 (the "new" version), some produce a parsing error while others work fine. However, all of them parse correctly with rtweet v0.6.7 (the "older" version).

Expected behavior

parse_stream() should behave the same in both versions (v0.6.7 and v0.6.9).

Reproduce the problem

Here are two sample files to reproduce the problem:
testA.json.zip
testB.json.zip

When using v0.6.9, the following parsing errors are produced:

# unzip file first
dfA <- rtweet::parse_stream("testA.json")

 Error: parse error: invalid object key (must be a string)
          om\/UE\/status\/1105\u2026"},,"is_quote_status":true,"quote_
                     (right here) ------^ 
# unzip file first
dfB <- rtweet::parse_stream("testB.json")

 Error: parse error: unallowed token at this point in JSON text
          b_OrSQIA-cqZH55v.mp4?tag=8"},]},"sizes":{"thumb":{"w":150,"h
                     (right here) ------^ 

When using v0.6.7, both files are parsed correctly without any error.
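The two fragments quoted in the error messages hint at the underlying cause: a doubled comma (`},,"is_quote_status"`) and a trailing comma before a closing bracket (`"},]`) both leave an empty element behind, which strict JSON parsers reject. A minimal illustration of the JSON-level failure (in Python, purely because the problem is in the JSON text itself, not in R):

```python
import json

# Both fragments mirror the errors above: an empty element left behind
# by a doubled comma or a trailing comma is rejected by strict parsers.
for fragment in ['{"a": 1,, "b": 2}', '{"a": [1, 2,]}']:
    try:
        json.loads(fragment)
        print("parsed OK")
    except json.JSONDecodeError as err:
        print(f"parse error: {err}")
```

Any strict parser (yajl, which jsonlite uses, behaves the same way) will refuse both fragments, which is consistent with the errors shown above.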

Potential culprit

Comparing both versions and debugging the problem, it seems the parsing error is caused by changes to one or both of these functions:

  • good_lines() in v0.6.7 vs good_lines2() in v0.6.9

The function good_lines2() performs a lot more string manipulation than good_lines().

  • .parse_stream() in v0.6.7 vs .parse_stream_two() in v0.6.9

.parse_stream() uses the jsonlite::stream_in() function, whilst .parse_stream_two() uses jsonlite::fromJSON().

My suspicion is that good_lines2() is the one introducing the downstream parsing error, though I don't know why it performs so much more string manipulation than the previous good_lines() function. Please note that in both cases the output from readr::read_lines() is identical, so the problem doesn't seem to be caused by changes to readr.
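For illustration, a line filter in the spirit of good_lines() would keep only lines that parse as complete, standalone JSON records (a hypothetical Python sketch; keep_good_lines() is an invented name, not an rtweet internal):

```python
import json

def keep_good_lines(lines):
    """Keep only lines that parse as complete JSON records,
    roughly mimicking what a good_lines()-style filter does."""
    good = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            json.loads(line)
            good.append(line)
        except json.JSONDecodeError:
            pass  # drop partial/damaged records silently
    return good
```

Under this scheme, a tweet split across two physical lines by a stray newline fails to parse on either line, so both halves are silently dropped, which would match the v0.6.7 behaviour described below of returning a tibble without the "offending" tweet.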

Session info

v0.6.7

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_2.2.0 usethis_1.5.1  rtweet_0.6.7  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2        rstudioapi_0.10   magrittr_1.5      pkgload_1.0.2     R6_2.4.0          rlang_0.4.0       httr_1.4.1        tools_3.6.1      
 [9] pkgbuild_1.0.5    DT_0.9            sessioninfo_1.1.1 cli_1.1.0         withr_2.1.2       remotes_2.1.0     htmltools_0.3.6   ellipsis_0.3.0   
[17] rprojroot_1.3-2   digest_0.6.21     assertthat_0.2.1  crayon_1.3.4      processx_3.4.1    callr_3.3.1       htmlwidgets_1.3   fs_1.3.1         
[25] ps_1.3.0          testthat_2.2.1    memoise_1.1.0     glue_1.3.1        compiler_3.6.1    backports_1.1.4   desc_1.2.0        prettyunits_1.0.2
[33] jsonlite_1.6  

v0.6.9

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rtweet_0.6.9

loaded via a namespace (and not attached):
[1] httr_1.4.1     compiler_3.6.1 magrittr_1.5   R6_2.4.0       tools_3.6.1    Rcpp_1.0.2     jsonlite_1.6  

Token

rtweet::get_token()
<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> pull_jj_tweets
  key:    0mxc****************
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret, user_id, screen_name
---
@jjvalletta
Author

Been digging more into this and managed to find a tweet that breaks the parsing function in v0.6.9. The file four_tweets.json.zip contains four tweets. The second tweet (RT @GeorgeMonbiot: BP successfully lobbied Trump....) is the one that causes the parsing error. When using v0.6.7, this tweet is simply ignored: the returned tibble contains only the other three tweets, which is why no error is thrown.

I cannot see anything wrong with the structure of that tweet. I used several online JSON viewers such as http://jsonviewer.stack.hu/ and https://jsoneditoronline.org/ and they both parse it correctly, so I'm lost.

TL;DR - there seem to be some tweets (e.g. the second tweet in four_tweets.json.zip) that cannot be parsed correctly. The difference is that v0.6.9 throws a parsing error, whilst v0.6.7 silently drops the tweet and returns a tibble without it. Both behaviours are problematic when there's nothing visibly wrong with the "offending" tweet.

@JBGruber

I had a similar problem and found a solution, which I describe in this comment in issue #355. I tried my function on your examples and it works fine. Instead of silently ignoring the "damaged" tweets like v0.6.7, you will be asked what to do with them. I would probably attempt to re-download them using lookup_statuses():

source("https://gist.githubusercontent.com/JBGruber/dee4c44e7d38d537426f57ba1e4f84ab/raw/ce28d3e8115f9272db867158794bc710e8e28ee5/recover_stream.R")

tweets <- recover_stream("testB.json")

tweets2 <- rtweet::lookup_statuses(readLines("broken_tweets.txt"),
                                   token = readRDS("twitter_token.RDS"))

@jjvalletta
Author

@JBGruber thanks a lot for this. I must admit I originally ignored issue #355 because I thought it was about broken/damaged tweets, and, as explained above, I could open these "offending" tweets in online JSON viewers. Now I realise that the issue with my "offending" tweets was "simply" an extra newline/carriage return halfway through the JSON entry. If I manually open the JSON file and remove this newline/carriage return, the parser works fine.

I find it strange that the Twitter API randomly introduces these newlines/carriage returns. I doubt it was a problem with the internet connection, because in that case the whole second half of the tweet would've been lost, no?

@JBGruber

JBGruber commented Sep 30, 2019

Interesting. So it seems it has to do with this: jeroen/jsonlite#47. I updated my function to remove the misbehaving characters before parsing. Now all your files seem to work: the four tweets are parsed, testA has 243 tweets, and testB has 231.

@JBGruber

JBGruber commented Oct 2, 2019

To install an old version of a package in R, use:

devtools::install_version("rtweet", version = "0.6.7")

@jjvalletta
Author

jjvalletta commented Oct 3, 2019

@ghostbeagle just as an FYI, as I'm not sure if you've seen my subsequent comments on this issue: going back to v0.6.7 doesn't actually solve the problem, it just ignores the "broken" tweets rather than erroring. Best is to use @JBGruber's recover_stream() function.

@llrs llrs mentioned this issue Feb 15, 2021
@llrs llrs added the bug label Feb 16, 2021
@hadley
Collaborator

hadley commented Feb 28, 2021

I think the root cause is that write_fun() uses writeLines(), which will introduce a newline after every disconnect. I suspect it should just use writeBin() (that should also improve performance, since it wouldn't need to convert to text prior to writing).

@hadley
Collaborator

hadley commented Mar 1, 2021

Ah no, that probably won't help, because any data streamed in the interim would be lost. I think it should probably only ever write complete lines, leaving any trailing fragment without a newline to be added to the start of the next chunk.

It might also be worth parsing the stream in order to separate tweets from system messages.
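The complete-lines-only idea can be sketched as a small buffer that flushes only newline-terminated data and carries the trailing fragment into the next chunk (a hypothetical Python sketch of the proposal above, not rtweet's implementation):

```python
class LineBuffer:
    """Write only complete, newline-terminated records to the sink;
    carry any trailing fragment over to the next chunk."""

    def __init__(self, sink):
        self.sink = sink      # any file-like object
        self.fragment = ""    # incomplete trailing record, if any

    def write_chunk(self, chunk):
        data = self.fragment + chunk
        head, sep, tail = data.rpartition("\n")
        if sep:               # at least one complete line in the buffer
            self.sink.write(head + "\n")
        self.fragment = tail  # "" or a partial record awaiting more data
```

Because a record is only ever written once its terminating newline has arrived, a disconnect mid-record leaves at most one unwritten fragment in memory instead of a half-record plus a spurious newline on disk.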

@hadley
Collaborator

hadley commented Mar 1, 2021

I've implemented a completely fresh approach in #526. This should never write invalid data to disk, therefore you'll never need to do any sort of special parsing.

@hadley
Collaborator

hadley commented Mar 4, 2021

Should be resolved now. Please open a new issue with reprex if you run into problems in the future.
