This repository has been archived by the owner on Nov 10, 2024. It is now read-only.

parse_stream error v0.6.7 vs 0.6.9 #356

Closed
jjvalletta opened this issue Sep 24, 2019 · 14 comments

@jjvalletta

Problem

I have hundreds of .json files of tweets collected with the stream_tweets() function. When I try to parse these files into an R object using rtweet v0.6.9 (the "new" version), some produce a parsing error while others work fine. However, all of them parse correctly with rtweet v0.6.7 (the "older" version).

Expected behavior

parse_stream() should behave the same in both versions (v0.6.7 and v0.6.9).

Reproduce the problem

Here are two sample files to reproduce the problem:
testA.json.zip
testB.json.zip

When using v0.6.9, the following parsing errors are produced:

# unzip file first
dfA <- rtweet::parse_stream("testA.json")

 Error: parse error: invalid object key (must be a string)
          om\/UE\/status\/1105\u2026"},,"is_quote_status":true,"quote_
                     (right here) ------^ 
# unzip file first
dfB <- rtweet::parse_stream("testB.json")

 Error: parse error: unallowed token at this point in JSON text
          b_OrSQIA-cqZH55v.mp4?tag=8"},]},"sizes":{"thumb":{"w":150,"h
                     (right here) ------^ 

When using v0.6.7, both files are parsed correctly without any error.
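The two fragments quoted in the error messages hint at the underlying cause: a doubled comma (`},,"is_quote_status"`) and a trailing comma before a closing bracket (`"},]`) both leave an empty element behind, which strict JSON parsers reject. A minimal illustration of the JSON-level failure (in Python, purely because the problem is in the JSON text itself, not in R):

```python
import json

# Both fragments mirror the errors above: an empty element left behind
# by a doubled comma or a trailing comma is rejected by strict parsers.
for fragment in ['{"a": 1,, "b": 2}', '{"a": [1, 2,]}']:
    try:
        json.loads(fragment)
        print("parsed OK")
    except json.JSONDecodeError as err:
        print(f"parse error: {err}")
```

Any strict parser (yajl, which jsonlite uses, behaves the same way) will refuse both fragments, which is consistent with the errors shown above.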

Potential culprit

Comparing both versions and debugging the problem, it seems the parsing error is caused by changes to one or both of these functions:

  • good_lines() in v0.6.7 vs good_lines2() in v0.6.9

The function good_lines2() performs a lot more string manipulation than good_lines().

  • .parse_stream() in v0.6.7 vs .parse_stream_two() in v0.6.9

.parse_stream() uses the jsonlite::stream_in() function, whilst .parse_stream_two() uses jsonlite::fromJSON().

My suspicion is that good_lines2() is the one introducing the downstream parsing error, though I don't know why it performs so much more string manipulation than the previous good_lines() function. Please note that in both cases the output from readr::read_lines() is identical, so the problem doesn't seem to be caused by changes to readr.
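For illustration, a line filter in the spirit of good_lines() would keep only lines that parse as complete, standalone JSON records (a hypothetical Python sketch; keep_good_lines() is an invented name, not an rtweet internal):

```python
import json

def keep_good_lines(lines):
    """Keep only lines that parse as complete JSON records,
    roughly mimicking what a good_lines()-style filter does."""
    good = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            json.loads(line)
            good.append(line)
        except json.JSONDecodeError:
            pass  # drop partial/damaged records silently
    return good
```

Under this scheme, a tweet split across two physical lines by a stray newline fails to parse on either line, so both halves are silently dropped, which would match the v0.6.7 behaviour described below of returning a tibble without the "offending" tweet.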

Session info

v0.6.7

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_2.2.0 usethis_1.5.1  rtweet_0.6.7  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2        rstudioapi_0.10   magrittr_1.5      pkgload_1.0.2     R6_2.4.0          rlang_0.4.0       httr_1.4.1        tools_3.6.1      
 [9] pkgbuild_1.0.5    DT_0.9            sessioninfo_1.1.1 cli_1.1.0         withr_2.1.2       remotes_2.1.0     htmltools_0.3.6   ellipsis_0.3.0   
[17] rprojroot_1.3-2   digest_0.6.21     assertthat_0.2.1  crayon_1.3.4      processx_3.4.1    callr_3.3.1       htmlwidgets_1.3   fs_1.3.1         
[25] ps_1.3.0          testthat_2.2.1    memoise_1.1.0     glue_1.3.1        compiler_3.6.1    backports_1.1.4   desc_1.2.0        prettyunits_1.0.2
[33] jsonlite_1.6  

v0.6.9

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rtweet_0.6.9

loaded via a namespace (and not attached):
[1] httr_1.4.1     compiler_3.6.1 magrittr_1.5   R6_2.4.0       tools_3.6.1    Rcpp_1.0.2     jsonlite_1.6  

Token

rtweet::get_token()
<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> pull_jj_tweets
  key:    0mxc****************
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret, user_id, screen_name
---
@jjvalletta
Author

Been digging more into this and managed to find a tweet that breaks the parsing function in v0.6.9. The file four_tweets.json.zip contains four tweets. The second tweet (RT @GeorgeMonbiot: BP successfully lobbied Trump....) is the one that causes the parsing error. When using v0.6.7, this tweet is simply ignored: the returned tibble contains only the other three tweets, which is why no error is thrown.

I cannot see anything wrong with the structure of that tweet. I used several online JSON viewers such as http://jsonviewer.stack.hu/ and https://jsoneditoronline.org/ and they both parse it correctly, so I'm lost.

TL;DR - there seem to be some tweets (e.g. the second tweet in four_tweets.json.zip) that cannot be parsed correctly. The difference is that v0.6.9 throws a parsing error, whilst v0.6.7 silently drops the tweet and returns a tibble without it. Both behaviours are problematic when there's nothing visibly wrong with the "offending" tweet.

@JBGruber

I had a similar problem and found a solution, which I describe in this comment in issue #355. I tried my function on your examples and it works fine. Instead of silently ignoring the "damaged" tweets like v0.6.7, you will be asked what to do with them. I would probably attempt to re-download them using lookup_statuses():

source("https://gist.githubusercontent.com/JBGruber/dee4c44e7d38d537426f57ba1e4f84ab/raw/ce28d3e8115f9272db867158794bc710e8e28ee5/recover_stream.R")

tweets <- recover_stream("testB.json")

tweets2 <- rtweet::lookup_statuses(readLines("broken_tweets.txt"),
                                   token = readRDS("twitter_token.RDS"))

@jjvalletta
Author

@JBGruber thanks a lot for this. I must admit I originally ignored issue #355 because I thought it was about broken/damaged tweets, and, as explained above, I could open these "offending" tweets in online JSON viewers. Now I realise that the issue with my "offending" tweets was "simply" an extra newline/carriage return halfway through the JSON entry. If I manually open the JSON file and remove this newline/carriage return, the parser works fine.

I find it strange that the Twitter API randomly introduces these newlines/carriage returns. I doubt it was a problem with the internet connection, because in that case the whole second half of the tweet would've been lost, no?

@JBGruber

JBGruber commented Sep 30, 2019

Interesting. So it seems it has to do with this: jeroen/jsonlite#47. I updated my function to remove the misbehaving characters before parsing. Now all your files seem to work: the four tweets are parsed, testA has 243 tweets, and testB has 231.

@JBGruber

JBGruber commented Oct 2, 2019

To install an old version of a package in R, use:

devtools::install_version("rtweet", version = "0.6.7")

@jjvalletta
Author

jjvalletta commented Oct 3, 2019

@ghostbeagle just as an FYI, as I'm not sure if you've seen my subsequent comments on this issue: going back to v0.6.7 doesn't actually solve the problem, it just ignores the "broken" tweets rather than erroring. Best is to use @JBGruber's recover_stream() function.

@llrs llrs mentioned this issue Feb 15, 2021
@llrs llrs added the bug label Feb 16, 2021
@hadley
Collaborator

hadley commented Feb 28, 2021

I think the root cause is that write_fun() uses writeLines(), which will introduce a newline after every disconnect. I suspect it should just use writeBin() (that should also improve performance, since it wouldn't need to convert to text prior to writing).

@hadley
Collaborator

hadley commented Mar 1, 2021

Ah no, that probably won't help, because any data streamed in the interim would be lost. I think it should probably only ever write complete lines, leaving any trailing fragment without a newline to be added to the start of the next chunk.

It might also be worth parsing the stream in order to separate tweets from system messages.
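The complete-lines-only idea can be sketched as a small buffer that flushes only newline-terminated data and carries the trailing fragment into the next chunk (a hypothetical Python sketch of the proposal above, not rtweet's implementation):

```python
class LineBuffer:
    """Write only complete, newline-terminated records to the sink;
    carry any trailing fragment over to the next chunk."""

    def __init__(self, sink):
        self.sink = sink      # any file-like object
        self.fragment = ""    # incomplete trailing record, if any

    def write_chunk(self, chunk):
        data = self.fragment + chunk
        head, sep, tail = data.rpartition("\n")
        if sep:               # at least one complete line in the buffer
            self.sink.write(head + "\n")
        self.fragment = tail  # "" or a partial record awaiting more data
```

Because a record is only ever written once its terminating newline has arrived, a disconnect mid-record leaves at most one unwritten fragment in memory instead of a half-record plus a spurious newline on disk.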

@hadley
Collaborator

hadley commented Mar 1, 2021

I've implemented a completely fresh approach in #526. This should never write invalid data to disk, therefore you'll never need to do any sort of special parsing.

@hadley
Collaborator

hadley commented Mar 4, 2021

Should be resolved now. Please open a new issue with reprex if you run into problems in the future.
