This repository has been archived by the owner on Mar 30, 2023. It is now read-only.

fix for deprecation of v1.1 endpoints : implemented search functionality #944

Merged Oct 9, 2020 (1 commit)

Conversation

@himanshudabas (Contributor) commented Oct 8, 2020

As we all know, Twitter is shutting down the older v1.1 endpoints. A few weeks ago it closed the endpoint that twint's search functionality relied on, which broke search.
So I've reimplemented the search functionality using the new endpoints.
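
For reference, the reimplementation talks to the endpoints the Twitter web client uses. Below is a minimal sketch of building a search request URL; the endpoint and parameter names reflect the web client at the time and are assumptions, not a stable API:

```python
from urllib.parse import urlencode

# Endpoint used by the Twitter web client at the time (assumption; subject to change).
SEARCH_URL = "https://api.twitter.com/2/search/adaptive.json"

def build_search_url(query, count=20, cursor=None):
    """Build a request URL for the adaptive search endpoint.

    `cursor` is the opaque pagination token returned by the previous page.
    """
    params = {"q": query, "count": count, "tweet_search_mode": "live"}
    if cursor:
        params["cursor"] = cursor
    return SEARCH_URL + "?" + urlencode(params)

print(build_search_url("pineapple"))
```

Pagination then just loops: feed each response's bottom cursor back into `build_search_url` until no new tweets come back.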

Pull request #917 relies on the mobile version of Twitter, but it doesn't return all tweets for a query and doesn't support advanced queries either.

This is the list of what I have tested.

Tested on: Python 3.6.8

Features I've tested:

  1. twint -u username : works
  2. twint -s pineapple : works
  3. twint -s "#hashtag" : works
  4. twint -s "hashtag" -o file.txt
  5. twint -s "hashtag" -o file.csv --csv

I haven't tested the rest yet, but I am fairly certain most features won't work out of the box, since the library is still broken elsewhere. Search-related queries should work, though, because the base search functionality is working again, unless they depend on something I changed.

Need to test more to know what doesn't work.

Note: I've tested this on Python 3.6.8 because, for some odd reason, Twitter's new endpoints don't work with aiohttp on Python 3.8.6 (I haven't tested other versions). aiohttp works fine when requesting other websites, but it never returns anything when requesting the Twitter endpoints.

@pielco11 pielco11 merged commit 2d638de into twintproject:master Oct 9, 2020
@himanshudabas (Contributor, Author)

@pielco11 I was doing some more testing and found one issue.
Twitter doesn't return any data if we use an AWS IP. I don't have a GCP account, so I can't test GCP IPs.

The workaround is to use a proxy when getting the token and then make the actual requests without a proxy. I have tested this and it works fine.
I don't know how proxies are implemented in twint, though; maybe someone else can take a look.

Note: "tor" might also work for this (I haven't tested it yet).

@bushjavier

> @pielco11 I was doing some more testing and found one issue.
> Twitter doesn't return any data if we use an AWS IP. I don't have a GCP account, so I can't test GCP IPs.
>
> The workaround is to use a proxy when getting the token and then make the actual requests without a proxy. I have tested this and it works fine.
> I don't know how proxies are implemented in twint, though; maybe someone else can take a look.
>
> Note: "tor" might also work for this (I haven't tested it yet).

@himanshudabas I can confirm that Google Cloud IPs work when scraping with twint. I did a fresh install of twint via git clone on a Debian instance and everything works as expected.

@himanshudabas (Contributor, Author)

> @himanshudabas I can confirm that Google Cloud IPs work when scraping with twint. I did a fresh install of twint via git clone on a Debian instance and everything works as expected.

Thank you for confirming this.

@NoSuck (Contributor) commented Oct 16, 2020

I have also succeeded in scraping as usual, in my case from a mobile connection in the United States. I noticed that the formats for created_at and timezone have changed, but perhaps that is unrelated to this patch.

Thank you.

@himanshudabas (Contributor, Author)

> I have also succeeded in scraping as usual, in my case from a mobile connection in the United States. I noticed that the formats for created_at and timezone have changed, but perhaps that is unrelated to this patch.
>
> Thank you.

@NoSuck
I apologize if the format change caused issues for you.
In my defense, I only started exploring twint a few weeks ago (after it broke, in fact), and I had no idea what the results looked like before Twitter killed the legacy endpoints. I put this patch up as soon as I could; bringing search back up was my first priority at the time.

Could you share what the previous created_at and timezone looked like?

@NoSuck (Contributor) commented Oct 16, 2020

The previous created_at was in epoch milliseconds with a granularity of one second, so that every string ended in "000", e.g. "1602584056000" instead of "2020-10-13 06:14:16 EDT". A previous timezone value would be "EDT" instead of "-0400". You are very polite, but I cannot say these changes have caused any real issue for me. However, I would agree that treading lightly is usually best.
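
For anyone converting old dumps, the mapping between the two created_at styles is mechanical. A minimal sketch (`old_to_new` is a hypothetical helper; the offset/abbreviation pair is supplied by the caller rather than looked up from real time-zone data):

```python
from datetime import datetime, timedelta, timezone

def old_to_new(created_at_ms, utc_offset_hours, tz_abbr):
    """Render an old epoch-millisecond created_at in the new local-time style."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    dt = datetime.fromtimestamp(int(created_at_ms) / 1000, tz)
    return f"{dt:%Y-%m-%d %H:%M:%S} {tz_abbr}"

print(old_to_new("1602584056000", -4, "EDT"))  # 2020-10-13 06:14:16 EDT
```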

A bigger issue I noticed just now is that --csv output retains embedded newline characters.

EDIT: It seems previously newline characters were simply replaced with spaces (U+0020).
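
Restoring the old behaviour amounts to flattening newlines before each row is written. A minimal sketch (the field names are illustrative, not twint's actual schema):

```python
import csv
import io

def write_csv(rows, fieldnames):
    """Write rows to CSV, replacing embedded newlines with spaces (U+0020),
    as the pre-patch output apparently did."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            k: str(v).replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
            for k, v in row.items()
        })
    return buf.getvalue()

out = write_csv([{"id": 1, "tweet": "first line\nsecond line"}], ["id", "tweet"])
```

Doing the replacement per cell, rather than post-processing the file, keeps the CSV well formed regardless of quoting.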

@NoSuck (Contributor) commented Oct 16, 2020

I took a look and will submit a PR shortly.
