Filter

A filter plugin for twarc files

This script reads a JSON file generated by twarc and filters out unwanted information from each tweet.

Requirements

This script requires Python 3.7 or greater and pip.

Installation

First, clone this repository:

git clone https://github.com/DataPolitik/twarc_filter.git

Then move into the twarc_filter folder and install the modules required by the script:

pip install -r requirements.txt

Usage

filter.py -i <INFILE> -o <OUTFILE> [-f [FIELDS]] [-e [EXTENSION]] [-r [RELATED]]

  • -i | --infile: The path to the input file. It must be a flat twarc JSON file.
  • -o | --outfile: The path to the output file.
  • -f | --fields: The fields to extract from the input file.
  • -e | --extension: The type of file to generate (csv or json). Default is json.
  • -r | --related: List only tweets that have a specific relation to another tweet. Allowed values are "retweeted", "mention" or "replied_to".
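As a rough illustration of the options above, the following sketch declares an equivalent command-line interface with Python's standard argparse module. It is an assumption about how such an interface could be written, not the actual contents of filter.py; the function name build_parser and the help texts are hypothetical.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented options; the real
    # filter.py may declare them differently.
    parser = argparse.ArgumentParser(
        prog="filter.py",
        description="Filter fields from a flat twarc JSON file.",
    )
    parser.add_argument("-i", "--infile", required=True,
                        help="path to the input file (flat twarc JSON)")
    parser.add_argument("-o", "--outfile", required=True,
                        help="path to the output file")
    parser.add_argument("-f", "--fields", default=None,
                        help="comma-separated field names to extract")
    parser.add_argument("-e", "--extension", default="json",
                        choices=["csv", "json"],
                        help="type of file to generate")
    parser.add_argument("-r", "--related", default=None,
                        choices=["retweeted", "mention", "replied_to"],
                        help="keep only tweets with this relation")
    return parser


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args(
    ["-i", "examples/test.json", "-o", "test_output.json",
     "-f", "referenced_tweets.type,source"]
)
fields = args.fields.split(",") if args.fields else []
print(fields, args.extension)
```

Splitting the value of -f on commas matches the comma-separated form the option expects, and the default of json for -e follows the description above.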

Examples

Fields

You can pass any field names from the input JSON, separated by commas. Example:

filter.py -i examples/test.json -o test_output.json -f referenced_tweets.type,source

This command extracts the fields referenced_tweets.type and source from every tweet in test.json. Make sure that these fields exist in the input file.

You can check the Twitter API documentation (https://developer.twitter.com/en/docs/twitter-api/fields) for more information about fields and expansions.
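To make the dotted field names more concrete, here is a minimal sketch of how a name such as referenced_tweets.type could be resolved against one tweet. It assumes each tweet is a JSON object in which referenced_tweets is a list of objects; the helpers get_field and extract_fields and the sample tweet are hypothetical and may not match how filter.py resolves fields.

```python
import json


def get_field(tweet, dotted_name):
    """Follow a dotted path such as 'referenced_tweets.type' through nested data."""
    current = tweet
    for key in dotted_name.split("."):
        if isinstance(current, list):
            # Apply the key to every dict in a list (e.g. referenced_tweets).
            current = [item.get(key) for item in current if isinstance(item, dict)]
        elif isinstance(current, dict):
            current = current.get(key)
        else:
            # The path goes deeper than the data does.
            return None
    return current


def extract_fields(tweet, fields):
    """Build a reduced tweet containing only the requested fields."""
    return {name: get_field(tweet, name) for name in fields}


# Made-up sample tweet, used only for illustration.
sample_line = (
    '{"id": "1", "source": "Twitter Web App", '
    '"referenced_tweets": [{"type": "retweeted", "id": "0"}]}'
)
tweet = json.loads(sample_line)
print(extract_fields(tweet, "referenced_tweets.type,source".split(",")))
```

In this sketch a field that is missing from the tweet resolves to None rather than raising an error.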

Extract retweets

filter.py -i examples/test.json -o test_output.json -r retweeted
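The relation filter can be pictured with the short sketch below. It assumes the relation is stored in the type key of each entry in referenced_tweets, which covers "retweeted" and "replied_to"; how filter.py handles "mention" is not covered here. The helper has_relation and the sample tweets are hypothetical.

```python
import json


def has_relation(tweet, relation):
    """Return True if any referenced tweet has the given type."""
    referenced = tweet.get("referenced_tweets") or []
    return any(ref.get("type") == relation for ref in referenced)


# Made-up tweets, one JSON object per line.
lines = [
    '{"id": "1", "referenced_tweets": [{"type": "retweeted", "id": "10"}]}',
    '{"id": "2", "referenced_tweets": [{"type": "replied_to", "id": "11"}]}',
    '{"id": "3"}',
]
tweets = [json.loads(line) for line in lines]

# Keep only the tweets that retweet another tweet.
retweets = [t for t in tweets if has_relation(t, "retweeted")]
print([t["id"] for t in retweets])
```

Tweets without a referenced_tweets entry, such as original posts, are dropped by this kind of filter.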