Query #269

shmuhammadd · 2021-12-20T16:27:11Z

shmuhammadd
Dec 20, 2021

Hi,

I want to search for tweets that contain both an exact word and an emoji. For example, I am searching for tweets where the word "don" appears with the emoji "😃" in the same tweet ("He don pass exam 😃").

I used the query "("di" 😃)", as explained by @chainsawriot in #224 (comment). However, the query returns tweets that contain either "don't" or "don". For example, the tweet "He don't pass exam 😃" was also returned.

get_all_tweets(
  query = "(\"di\" 😃)",
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2021-11-01T00:00:00Z",
  file = "pidginhappy1",
  data_path = "piding/",
  bearer_token = bearer_token,
  n = 10,
  export_query = TRUE,
  country = "ng"
)

Expectation: how can I use word boundaries so that only tweets with "don" and "😃" are returned (not tweets with "don't" or "dont")? I also tried the query "("^don$" 😃)", but got an error.

 get_all_tweets(
    query =  "("^don$" 😃),
    start_tweets = "2020-01-01T00:00:00Z",
    end_tweets = "2021-11-01T00:00:00Z",
    file = "pidginhappy1",
    data_path = "piding/",
    bearer_token = bearer_token,
    n =10,
    export_query = TRUE,
    country = "ng"
  )

Answered by chainsawriot

chainsawriot · 2021-12-27T11:32:23Z

chainsawriot
Dec 27, 2021
Collaborator

@shmuhammad2004
The problem is that the tokenizer at Twitter produces a search index with both "don" and "don't" for tweets with "don't". It makes sense because some languages use apostrophes as a delimiter too, e.g. L’université.

So quoting doesn't produce what you would want (BTW, it makes no sense to quote a single word). My suggestion is to do it this way instead.

require(academictwitteR)
#> Loading required package: academictwitteR

x <- get_all_tweets("don 😃 -don't",
                 start_tweets = "2020-01-01T00:00:00Z",
                 end_tweets = "2021-11-01T00:00:00Z",
                 n = 10,
                 country = "ng",
                 verbose = FALSE)
x$text
#>  [1] "Nkwobi, Bread and Water😀😃😃😀😃\n\nMan U don suffer #OleOutNow https://t.co/lyOA0xgoS5"                                                                                                                                                                            
#>  [2] "Oga, na Bus wey don comot for Park you dey stop.😃😃 https://t.co/cdDhG4sQpu"                                                                                                                                                                                        
#>  [3] "@CyberBug11 @TheBriDen You don get mouth ba? 😃"                                                                                                                                                                                                                     
#>  [4] "Even Neighborhood Watch self don dey from SARS. 😃😃\n#EndSARS"                                                                                                                                                                                                      
#>  [5] "Everybody for Lafia don snap for LAFIA CITY MALL.… \n\nNa only me remain.😃😃"                                                                                                                                                                                       
#>  [6] "@StillYoursADD I don forget her handle 😃😃"                                                                                                                                                                                                                         
#>  [7] "Shay your eyes don clear now? Smile 😃 https://t.co/iI5SsQF2hq"                                                                                                                                                                                                      
#>  [8] "@SamsonEguntola @max_sticks @Onise_iyanu @Mayami0105 @Femaledriver2 @Auntyfeyi @BayoAdedosu @Trinity_Don_JFK @savndaniel @fhinksleem96 😃🤣😂🤣 I'm also 12 years Sir 😂"                                                                                            
#>  [9] "I yab a slim cousin of mine that he looks like Fido Dido, he was just looking at me, I later asked the older cousins of under 25, that do dey kno Fido Dido...? They said no... One of the best TV advert we had.... \nChai... I don old o😃 https://t.co/3xVaYK7CgO"
#> [10] "@TimeyinFreedom1 Wo! I don chop my own.. make e go round abeg. I jump am pass 😃"

Created on 2021-12-27 by the reprex package (v2.0.1)
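
The query above filters on the server side, and it can still return tweets where "don" only appears inside a handle (see result [8], matched via @Trinity_Don_JFK). As a complement, here is a minimal sketch of a local word-boundary filter in base R. It is not part of academictwitteR; the `tweets` vector is a hypothetical stand-in for `x$text`, and the regular expression is just one way to express "don as a whole word, not followed by an apostrophe".

```r
# Hypothetical stand-in for x$text returned by get_all_tweets().
smile <- "\U0001F603"  # the 😃 emoji
tweets <- paste(
  c("He don pass exam",
    "He don't pass exam",
    "He dont pass exam",
    "@Trinity_Don_JFK hello"),
  smile
)

# \b is a word boundary; the negative lookahead rejects "don" when it is
# followed by a straight or curly apostrophe (as in "don't").
# Lookaheads need perl = TRUE. The match is case-sensitive.
has_don <- grepl("\\bdon\\b(?!['\u2019])", tweets, perl = TRUE)

# Literal (non-regex) match for the emoji.
has_emoji <- grepl(smile, tweets, fixed = TRUE)

kept <- tweets[has_don & has_emoji]
print(kept)
```

Passing `ignore.case = TRUE` to the first `grepl()` call would also keep tweets that start with "Don", if that is wanted.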

1 reply

shmuhammadd Dec 27, 2021
Author

@chainsawriot Excellent solution. I appreciate you taking the time to offer this brilliant solution.

shmuhammadd · 2021-12-27T12:38:08Z

shmuhammadd
Dec 27, 2021
Author

@chainsawriot Your answer gives me a clue on how to handle another problem, which involves looking for tweets with a search term that contains diacritics. The Twitter documentation says:

"If you specify a keyword or hashtag query with character accents or diacritics, it will match Tweet text that contains both the term with the accents and diacritics, as well as those terms with normal characters. For example, queries with a keyword Diacrítica or hashtag #cumpleaños will match Diacrítica or #cumpleaños, as well as with Diacritica or #cumpleanos without the tilde í or eñe".

So, using the same trick you applied, I tried to search for tweets that contain "tó" and negate the query with "-to", but it returns nothing.


x <- get_all_tweets("tó -to",
                 start_tweets = "2012-01-01T00:00:00Z",
                 end_tweets = "2021-11-01T00:00:00Z",
                 n = 100,
                 country = "ng",
                 verbose = TRUE)

However, searching with only "tó" returns tweets.

x <- get_all_tweets("tó" ,
                 start_tweets = "2012-01-01T00:00:00Z",
                 end_tweets = "2021-11-01T00:00:00Z",
                 n = 100,
                 country = "ng",
                 verbose = TRUE)
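
One possible workaround, sketched here under assumptions rather than taken from the package: since the documentation quoted above says the API matches both the accented and unaccented forms, the query could use "tó" alone and the accented form could be selected locally afterwards. The helper `has_word()` and the sample strings below are hypothetical; in practice the input would be `x$text`.

```r
# Hypothetical helper: TRUE when `word` occurs as a whole token in a tweet.
# Tokens are obtained by splitting on runs of whitespace and punctuation.
has_word <- function(text, word) {
  tokens <- strsplit(text, "[[:space:][:punct:]]+")
  vapply(tokens, function(tok) word %in% tok, logical(1))
}

# Hypothetical stand-in for x$text; "\u00f3" is the precomposed "ó".
tweets <- c("a t\u00f3 b", "a to b", "a t\u00f3n b")

accented <- tweets[has_word(tweets, "t\u00f3")]
print(accented)
```

The comparison is exact at the code-point level, so a tweet that stores "ó" as "o" plus a combining accent would not match the precomposed form; this sketch does not normalize the text first.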

1 reply

chainsawriot Dec 27, 2021
Collaborator

@shmuhammad2004
For this kind of question (how the search index is constructed, and which queries match which tweets), I believe you would have better luck with the official Twitter developer forums, where Twitter engineers answer questions.
