REST Keyword crawler: Configuration of [languageFilter] #1

Open
MonkandMonkey opened this issue Feb 13, 2019 · 0 comments
MonkandMonkey commented Feb 13, 2019

I am using the [REST Keyword crawler] and only want English tweets.
I followed the snippet in README.md, but I still got tweets in multiple languages. So I checked the file "crawler.properties" and changed:
####################################################################################
# REST Crawler of Twitter - by keyword(s)
# Class: org.backingdata.twitter.crawler.rest.TwitterRESTKeywordSearchCrawler
# - Full path of the txt file to read terms from (one term ID per line)
tweetKeyword.fullPathKeywordList=keywords.txt
# - Full path of the output folder to store crawling results
tweetKeyword.fullOutputDirPath=./data/
# - Storage format: "json" to store one tweet per line as tweet JSON object or "tab" to store
# one tweet per line as TWEET_ID<TAB>TWEET_TEXT
tweetID.outputFormat=json
# - If not empty, it is possible to specify a language to retrieve only tweets of a specific language
# (en, es, it, etc.) - if empty, all tweets are retrieved, independently of their language
# IMPORTANT: The language code may be formatted as ISO 639-1 alpha-2 (en), ISO 639-3 alpha-3 (msa), or ISO 639-1 alpha-2 combined with an ISO 3166-1 alpha-2 localization (zh-tw).
tweetID.languageFilter=en

into:
####################################################################################
# REST Crawler of Twitter - by keyword(s)
# Class: org.backingdata.twitter.crawler.rest.TwitterRESTKeywordSearchCrawler
# - Full path of the txt file to read terms from (one term ID per line)
tweetKeyword.fullPathKeywordList=keywords.txt
# - Full path of the output folder to store crawling results
tweetKeyword.fullOutputDirPath=./data/
# - Storage format: "json" to store one tweet per line as tweet JSON object or "tab" to store
# one tweet per line as TWEET_ID<TAB>TWEET_TEXT
tweetKeyword.outputFormat=json
# - If not empty, it is possible to specify a language to retrieve only tweets of a specific language
# (en, es, it, etc.) - if empty, all tweets are retrieved, independently of their language
# IMPORTANT: The language code may be formatted as ISO 639-1 alpha-2 (en), ISO 639-3 alpha-3 (msa), or ISO 639-1 alpha-2 combined with an ISO 3166-1 alpha-2 localization (zh-tw).
tweetKeywod.languageFilter=en

And it worked!
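
A likely explanation, sketched here under assumptions rather than taken from the crawler's source: if the crawler looks up the language filter with `java.util.Properties.getProperty`, a key written under a different prefix simply comes back as `null`, which is treated the same as an empty filter, so tweets in all languages are returned. The class `LanguageFilterKeyCheck` and the helper `readLanguageFilter` are hypothetical names, and the looked-up key is assumed to be the one that worked in this report.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class LanguageFilterKeyCheck {

    // Hypothetical helper: returns the configured language code,
    // or "" when the key is absent from the properties text.
    public static String readLanguageFilter(String propertiesText, String key) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        String value = props.getProperty(key); // null when the key is not present
        return value == null ? "" : value.trim();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this is the key the crawler reads, since it is the one that worked above.
        String key = "tweetKeywod.languageFilter";

        // Relevant lines of crawler.properties before and after the change.
        String before = "tweetID.outputFormat=json\ntweetID.languageFilter=en\n";
        String after = "tweetKeyword.outputFormat=json\ntweetKeywod.languageFilter=en\n";

        // An empty result means no language filter would be applied.
        System.out.println("before: [" + readLanguageFilter(before, key) + "]");
        System.out.println("after:  [" + readLanguageFilter(after, key) + "]");
    }
}
```

With the original `tweetID.` keys the lookup finds nothing and the result is empty; with the changed keys it returns `en`. No error is raised in either case, which would explain why the misnamed key went unnoticed.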
Thanks for your great tool. It is useful and has helped me a lot!
