Keyword extractor and tagger for fabric8-analytics.
For getting all available commands issue:
$ f8a_tagger_cli.py --help
Usage: f8a_tagger_cli.py [OPTIONS] COMMAND [ARGS]...
Tagger for fabric8-analytics.
Options:
-v, --verbose Level of verbosity, can be applied multiple times.
--help Show this message and exit.
Commands:
aggregate Aggregate keywords to a single file.
collect Collect keywords from external resources.
diff Compute diff on keyword files.
lookup Perform keywords lookup.
reckon Compute keywords and stopwords based on stemmer and lemmatizer configuration.
To run a command in verbose mode (adds additional messages), run:
$ f8a_tagger_cli.py -vvvv lookup /path/to/tree/or/file
Verbose output will give you additional insides on steps that are performed during execution (debug mode).
$ git clone https://github.com/fabric8-analytics/fabric8-analytics-tagger && cd fabric8-analytics-tagger
$ python3 setup.py install # or make install
The prerequisite for tagging is to collect keywords that are used out there by developers. This also means that tagger uses keywords that are considered as interesting ones by developers.
The collection is done by collectors (available in f8a_tagger/collectors
). These collectors gather keywords and also count number of occurrences for gathered keywords. Collectors do not perform any additional post-processing, but rather gather raw keywords that are after that post-processed by the aggregate
command (see bellow).
An example of raw keywords can be the following YAML file that keeps keywords gathered in PyPI ecosystem.
If you take a look at raw keywords that are gathered by the collect
command explained above, you can easily spot a lot of keywords that are written in a wrong way (they have broken encoding, multi-line keywords, numerical values, one letter keywords, …). These keywords should be removed and other keywords that are present should be normalized and, if possible, there can be computed some obvious synonyms that can be present during keywords lookup phase.
The aggregate
command handles:
-
keywords filtering - suspicious and obviously useless keywords can be directly thrown away (e.g. bogus keywords, one letter keywords, …)
-
keywords normalization - all keywords are normalized to lowercase form, having only ASCII characters, multi-word keywords are separated with dashes rather than spaces
-
synonyms computation - some synonyms can be directly computed - e.g. some people use
machine-learning
, some usemachine learning
-
aggregating multiple
keywords.yaml
files - theaggregate
command can aggregate multiplekeywords.yaml
files into one, this is especially useful if there are more than one keywords sources available for collecting keywords
The output of aggregate
command is a single configuration file (could be JSON or YAML), that keeps the following (aggregated) entries:
-
keywords themselves
-
synonyms
-
regular expressions that match the given keyword
-
occurrence count of the given keyword in keyword sources (used later for keywords scoring, see bellow)
An example of a keyword entries produced by the aggregate
command could be:
machine-learning:
synonyms:
- machine learning
- machinelearning
- machine-learning
- machine_learning
- occurrence_count: 56
django:
occurrence_count: 2654
regexp:
- '.*django.*'
The keywords.yaml
file can be, of course, additionally manually changed as desired.
An example of automatically aggregated keywords.yaml
can be found in fabric8-analytics-tags repository. This keywords.yaml
file was computed based on collected raw keywords from PyPI stated above.
The overall outcome of steps above is a single keywords.yaml
file. This file, with stopwords.txt
file keeping stopwords, is the input for the lookup
command.
The lookup
command does the whole heavy computation needed for keywords extraction. It utilizes NLTK for utilizing many natural language processing tasks.
The overall high-level overview of the lookup
command can be described in the following steps:
-
The first step is to do pre-processing of input files. Input files can be written in different formats. Except plaintext, there can be also used text files using different markup formats (such as Markdown, ASCIIdoc, and such).
-
After input pre-processing there is available plaintext without any markup formatting parts. This text is after that split into sentences. The actual split is done in a smart way (so "This Mr. Baron e.g. Mr. Foo." will be one sentence - not just split on dots).
-
Sentences are tokenized into words. This tokenization is done, again, in a smart way ("e.g. isn’t" is split into three tokens - "e.g", "is" and "n’t".
-
Lemmatization - all words (tokens) are replaced with their representative as words appear in several inflected forms. Lemmatization uses NLTK’s WordNet corpus (large lexical database of English words).
-
After lemmatization there is performed stemming. Stemming ensures that different words are mapped to their word stem (e.g. "licencing", "license" is same). There are available different stemmers, check
lookup --help
for listing of all available ones. Check out Standford’s NLP for more insights on lemmatization and stemming. -
Unwanted tokens are removed - tokens are checked against stopwords file and if there is a match, unwanted tokens are removed. This step ensures that the lookup will perform faster and we also remove obviously wrong words that shouldn’t be marked as keywords (words with high entropy).
-
There are calculated ngrams for multi-word keywords by systematically concatenating tokens (e.g. tokens
["this", "is", "machine", "learning"]
with ngram size equal to 2 create the following tokens:["this", "is", "machine", "learning", "this is", "is machine", "machine learning"]
. This step ensures that there can be performed lookup of multi-word keywords (such as "machine learning"). The actual ngrams size (bigrams, trigrams) is determined bykeywords.yaml
configuration file (based on synonyms), but can be explicitly stated using--ngram-size
option. -
Actual lookup against
keywords.yaml
configuration file. Constructed array of tokens with ngrams is checked againstkeywords.yaml
file. The output of this step is an array of found keywords during keywords mining. -
The last step performs scoring on found keywords based on their relevance in the system (based on occurrence count of the found keyword and occurrence count in the text).
You can watch check output of all steps by running tagger in debug mode by supplying multiple --verbose
command line options. In that case tagger will report what steps are performed, what is input and the outcome. This can also help you when debugging what is going on when using tagger.
There are prepared few commands that can make your life easier when working with keywords database.
This command will apply lemmatization and stemming on your keywords.yaml
and stopwords.txt
files. The output is after that printed to you to check form of keywords and stopwords that will be used during lookup (in respect to lemmatization and stemming).
Check reckon --help
for more info on available options.
File keywords.yaml
keeps all keywords that are in a form of:
keyword:
occurrence_count: 42
synonyms:
- list
- of
- synonyms
regexp:
- 'list.*'
- 'o{1}f{1}'
- 'regular[ _-]expressions?'
A keyword is a key to dictionary containing additional fields:
-
synonyms - for list of synonyms to the given keyword
-
regexp - for list of regular expressions that match the given keyword
-
occurrence_count - number of times the given keyword was found in the external source (helping with keywords scoring)
For example, if you would like to define keyword django
that matches all words that contain "django
", just define:
django:
occurrence_count: 1339
regexp:
- '.*django.*'
Another example demonstrates synonyms. To define synonyms IP, IPv4 and IPv6 as synonyms to networking, just define the following entry:
networking:
synonyms:
- ip
- ipv4
- ipv6
Regular expressions conform to Python regular expressions.
This file contains all stopwords (words that should be left out from text analysis) in raw/plaintext and regular expression format. All stopwords are listed one per line.
An example of stopwords file keeping stopwords ("would", "should" and "are"):
would
should
are
There can be also specified regular expression that describe stopwords.
An example of regular expression stopwords:
re: [0-9][0-9]*
re: https?://[a-zA-Z0-9][a-zA-Z0-9.]*.[a-z]{2,3}
In the example above, there are listed two regular expressions to define stopwords. The first one defines stopwords that consist purely of integer numbers (any integer number will be dropped from textual analysis). The latter example filters out any URL (the regexp is simplified).
Regular expressions conforms to Python regular expressions.
If you would like to set up a virtualenv for your environment, just issue prepared make venv
Make target:
$ make venv
After this command, there should be available virtual environment that can be accessed using:
$ source venv/bin/activate
And exited using:
$ deactivate
To run checks, issue make check
command:
$ make check
The check Make target runs a set of linters provided by Coala; there is also run pylint
, pydocstyle
. To execute only desired linter, run appropriate Make target:
$ make coala
$ make pylint
$ make pydocstyle
Tagger does not use any machine learning technique to gather keywords. All steps correspond to data mining techniques so there is no "accuracy" that could be evaluated. Tagger simply checks for important, key words that are relevant (low entropy). The overall quality of keywords found is equal to quality of keywords.yaml
file.
-
all collectors should receive a set of keywords that are all lowercase
-
the only delimiter that is allowed for multi word keywords is a dash (
-
), all spaces should be replaced with dash -
synonyms for multi word keywords are automatically created in aggregate command, if requested
README.json is a format introduced by one task (GitReadmeCollectorTask
) present in fabric8-analytics-worker. The structure of document is described by one JSON file containing two keys:
-
content
- raw content of README file -
type
- content type that can be markdown, ReStructuredText, … (seef8a_tagger.parsers.abstract
for more info)
Parsers are used to transform README.json files to plaintext files. Their main goal is to remove any markup specific annotations and provide just plaintext that can be directly used for additional text processing.
You can see implementation of parsers in the f8a_tagger/parsers
directory.
There is also present a set of collectors that collect keywords/topics/tags from various external resources such as PyPI, Maven central and such. These collectors produce a list of keywords with they occurrence count that can be later on used for keywords extraction.
All collectors are present under f8a_tagger/collectors
package.