This repository has been archived by the owner on Nov 10, 2024. It is now read-only.

stream_tweets raises lexical error (-> parsing .json) | random line-break #630

Closed
frodew opened this issue Oct 14, 2021 · 2 comments

frodew commented Oct 14, 2021

Problem

While using stream_tweets, I sometimes (seemingly at random) get the following lexical error:

Error: lexical error: invalid character inside string.
    nload\/android\" rel=\"nofollo w\"\u003eTwitter for Android\
               (right here) ------^

This is my code:

#stream tweets on the American Continents for 20 seconds
stream_tweets(c(-169.1, -57.2, -31.9, 74.7), timeout = 20, parse = FALSE) 

#parse the stream with the error-file
parse_stream("error_stream-20211014145815.json")

#parse the stream with the fix-file (line break removed manually)
parse_stream("fix_stream-20211014145815.json")

And this is my terminal-output:

> stream_tweets(c(-169.1, -57.2, -31.9, 74.7), timeout = 20, parse = FALSE) 
Streaming tweets for 20 seconds...
Finished streaming tweets!
streaming data saved as stream-20211014145815.json
> #parse the stream with the error-file
> parse_stream("error_stream-20211014145815.json")
Error: lexical error: invalid character inside string.
          nload\/android\" rel=\"nofollo w\"\u003eTwitter for Android\
                     (right here) ------^
> #parse the stream with the fix-file (line break removed manually)
> parse_stream("fix_stream-20211014145815.json")
# A tibble: 262 x 90
   user_id             status_id created_at          screen_name text  source display_text_wi~ reply_to_status~
   <chr>               <chr>     <dttm>              <chr>       <chr> <chr>             <dbl> <chr>           
 1 1413163529847386117 14486342~ 2021-10-14 12:58:08 EuropeSpac~ "\U0~ Twitt~               NA 144863422059169~
 2 57790703            14486342~ 2021-10-14 12:58:08 denise_ste~ "@_G~ Twitt~               22 144863178777373~
 3 1713933823          14486342~ 2021-10-14 12:58:08 somoschile~ "Bue~ Twitt~               NA NA              
 4 2272114554          14486342~ 2021-10-14 12:58:08 ahsilla82   "@el~ Twitt~                0 144863033797891~
 5 387903087           14486342~ 2021-10-14 12:58:08 gabizadoro~ "ont~ Twitt~               NA 144406775907629~
 6 1109885932268937216 14486342~ 2021-10-14 12:58:08 agathallet~ "htt~ Twitt~               NA NA              
 7 1353883596394803202 14486342~ 2021-10-14 12:58:08 JoeShow683~ "@Mi~ Twitt~               92 144863414041337~
 8 821113180499820544  14486342~ 2021-10-14 12:58:09 LIGGICPHOTO "@ha~ Twitt~               49 144852012635818~
 9 742807059322839040  14486342~ 2021-10-14 12:58:09 riverarias~ "Vel~ Twitt~               NA NA              
10 293658484           14486342~ 2021-10-14 12:58:08 shinychevy  "@Ta~ Twitt~               27 144863173869189~
# ... with 252 more rows, and 82 more variables: reply_to_user_id <chr>, reply_to_screen_name <chr>,
#   is_quote <lgl>, is_retweet <lgl>, favorite_count <int>, retweet_count <int>, quote_count <int>,
#   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>, urls_t.co <list>,
#   urls_expanded_url <list>, media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
#   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>, ...

If I call stream_tweets with the argument "parse = FALSE", the streaming itself works without any issues. I only get the error in the second step, when I call parse_stream on the saved file. Without "parse = FALSE", the error would of course appear directly in stream_tweets.
-> this is the original (error-)file: error_stream-20211014145815.zip
I looked into the JSON file and found that the error seems to be caused by a line break in the middle of a tweet. Judging from other error files, such a line break can occur many times per file. The breaks appear at random points, i.e., I could not observe a pattern. Normally, each tweet is written on a single line in the JSON file.
-> this is the fixed file: fix_stream-20211014145815.zip

So it seems to me that stream_tweets sometimes inserts a line break at a random position while writing a streamed tweet to the JSON file!
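
The manual fix (removing the stray line break) could probably be automated. Below is a minimal sketch, not part of rtweet: `repair_stream_lines` is a hypothetical helper name, and it assumes that every record in the stream file is a single JSON object and that a broken record only needs its fragments glued back together without any separator. It relies on `jsonlite::validate()` to decide when a record is complete.

```r
# Hypothetical helper (not part of rtweet): glue together records that were
# split across several lines, assuming one JSON object per record.
repair_stream_lines <- function(lines) {
  lines <- lines[nzchar(lines)]          # skip empty lines
  repaired <- character(0)
  buffer <- ""
  for (line in lines) {
    # Append without a separator, i.e. remove the stray line break
    buffer <- paste0(buffer, line)
    # A prefix of a JSON object is never valid JSON on its own, so the
    # buffer only validates once the whole record has been collected
    if (isTRUE(as.logical(jsonlite::validate(buffer)))) {
      repaired <- c(repaired, buffer)
      buffer <- ""
    }
  }
  if (nzchar(buffer)) {
    warning("Dropping an incomplete trailing record")
  }
  repaired
}
```

In principle, the repaired lines could then be written back with something like `writeLines(repair_stream_lines(readLines("error_stream-20211014145815.json", warn = FALSE)), "fix_stream-20211014145815.json")` before calling parse_stream on the new file. I have not checked how this behaves with other kinds of corruption; a record that never becomes valid would swallow all following lines.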

Also: A friend has the same issue, so I don't think it is due to my OS or PC.

Reproduce the problem

Unfortunately, I am not able to reproduce the error reliably. To produce this example, I had to run my 20-second stream_tweets call 4 times. I found a similar issue reported in #356, which is said to be fixed. However, my issue sounds very similar to the one reported there.

rtweet version

## copy/paste output
packageVersion("rtweet")
[1] ‘0.7.0’

Session info

## copy/paste output
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rtweet_0.7.0

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13        magrittr_2.0.1         hms_1.1.0              tidyselect_1.1.1      
 [5] bit_4.0.4              R6_2.5.1               rlang_0.4.11           fansi_0.5.0           
 [9] httr_1.4.2             tools_4.1.1            parallel_4.1.1         vroom_1.5.5           
[13] utf8_1.2.2             cli_3.0.1              withr_2.4.2            askpass_1.1           
[17] ellipsis_0.3.2         openssl_1.4.5          bit64_4.0.5            tibble_3.1.4          
[21] lifecycle_1.0.0        crayon_1.4.1           quanteda.corpora_0.9.2 purrr_0.3.4           
[25] readr_2.0.1            tzdb_0.1.2             vctrs_0.3.8            curl_4.3.2            
[29] glue_1.4.2             compiler_4.1.1         pillar_1.6.2           jsonlite_1.7.2        
[33] pkgconfig_2.0.3       

llrs added the duplicate label Oct 14, 2021

llrs (Collaborator) commented Oct 14, 2021

Issue #356 seems to be the same bug. That bug is fixed in the development version of the package, while you are using the CRAN version. So you aren't using the fix that closed that issue.

If you want to use the package with the fix, you'll need to install the current development version from this repository. Unless you do that and still find the same problem, I'll close this issue as a duplicate.

If you install this version, be aware that it might change before it reaches CRAN and that it differs quite a bit from the version you are currently using.

frodew (Author) commented Oct 14, 2021

Ok, thank you.
I also thought that my issue seemed quite similar. So I'll close this issue, as I think it is already fixed :)
