This script reads a JSON file generated by Twarc and filters out undesired information from each tweet.
This script requires Python 3.7 or greater and pip.
You need to clone this repository.
git clone https://github.com/DataPolitik/twarc_filter.git
Then move into the twarc_filter folder and install the modules required by the script:
pip install -r requirements.txt
filter.py -i <INFILE> -o <OUTFILE> [-f [FIELDS] ] [ -e [EXTENSION]]
- -i | --infile: The path to the input file. It must be a flat twarc JSON file.
- -o | --outfile: The path to the output file.
- -f | --fields: The fields to extract from the input file.
- -e | --extension: The type of file to generate (csv or json). Default is json.
- -r | --related: List only tweets that have a specific relation to another tweet. Allowed values are "retweeted", "mention" or "replied_to".
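As a rough illustration of how the options above fit together, here is a minimal argparse sketch. It is not taken from filter.py: the build_parser helper, the help strings, and the assumption that the long options are spelled with a double dash are all hypothetical, and the real script may define its parser differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in this README."""
    parser = argparse.ArgumentParser(prog="filter.py")
    parser.add_argument("-i", "--infile", required=True,
                        help="path to a flat twarc JSON file")
    parser.add_argument("-o", "--outfile", required=True,
                        help="path to the output file")
    parser.add_argument("-f", "--fields", default="",
                        help="comma-separated fields to extract")
    parser.add_argument("-e", "--extension", choices=["csv", "json"],
                        default="json", help="type of file to generate")
    parser.add_argument("-r", "--related",
                        choices=["retweeted", "mention", "replied_to"],
                        help="keep only tweets with this relation")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-i", "examples/test.json", "-o", "test_output.json",
     "-f", "referenced_tweets.type,source"]
)

# Split the comma-separated field list, ignoring empty entries.
fields = [name for name in args.fields.split(",") if name]
print(args.extension, fields)
```

Passing the argument list explicitly keeps the sketch independent of the command line it is run from.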
You can specify any field names from the input JSON, separated by commas. Example:
filter.py -i examples/test.json -o test_output.json -f referenced_tweets.type,source
This command extracts the fields referenced_tweets.type and source from every tweet in test.json. Make sure that these fields exist in the input file.
You can check the Twitter API documentation (https://developer.twitter.com/en/docs/twitter-api/fields) for more information about fields and expansions.
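To make the dotted field notation concrete, the sketch below shows one way a name such as referenced_tweets.type might be resolved against a tweet. The extract_field and filter_tweet helpers and the sample tweet are hypothetical, and the sketch assumes that a list in the path (such as referenced_tweets) yields one value per element. The actual script may behave differently.

```python
import json


def extract_field(tweet, dotted_name):
    """Follow a dotted path through nested dicts and lists; None if missing."""
    current = tweet
    for key in dotted_name.split("."):
        if isinstance(current, list):
            # Collect the key from every dict in the list.
            current = [item.get(key) for item in current if isinstance(item, dict)]
        elif isinstance(current, dict):
            current = current.get(key)
        else:
            return None
    return current


def filter_tweet(tweet, fields):
    """Keep only the requested fields of one tweet."""
    return {name: extract_field(tweet, name) for name in fields}


# Hypothetical tweet, shaped like one record of a flat twarc file.
sample = {
    "id": "1",
    "text": "hello",
    "source": "Twitter Web App",
    "referenced_tweets": [{"type": "retweeted", "id": "0"}],
}

filtered = filter_tweet(sample, ["referenced_tweets.type", "source"])
print(json.dumps(filtered))
```

In this sketch a field that is absent from the tweet comes back as None rather than raising an error, which is one possible way to handle input files where not every tweet carries every field.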
filter.py -i examples/test.json -o test_output.json -r retweeted
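The -r option keeps only tweets that have a given relation to another tweet. A minimal sketch of such a filter follows, assuming the relation is read from the type of each entry in referenced_tweets. The has_relation helper and the sample tweets are hypothetical, and this sketch does not cover how the script handles the "mention" value.

```python
def has_relation(tweet, relation):
    """Return True if any referenced tweet has the given relation type."""
    references = tweet.get("referenced_tweets") or []
    return any(ref.get("type") == relation for ref in references)


# Hypothetical tweets: a retweet, a reply, and an original tweet.
tweets = [
    {"id": "1", "referenced_tweets": [{"type": "retweeted", "id": "0"}]},
    {"id": "2", "referenced_tweets": [{"type": "replied_to", "id": "1"}]},
    {"id": "3"},
]

retweets = [tweet for tweet in tweets if has_relation(tweet, "retweeted")]
print([tweet["id"] for tweet in retweets])
```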