Content

This file contains details on each script that is available in the slr-kit repository. Each script can be run as a standalone program, although, typically, they are run through the main slrkit command.

Available scripts and programs

The following scripts and programs are currently available. They are listed in the order in which they are expected to be used in the SLR workflow. All the scripts expect UTF-8 files as input.

import_biblio.py

  • ACTION: Import a bibliographic file and convert it to the CSV format.
  • INPUT: the bibliographic file.
  • OUTPUT: a CSV file with the desired columns.

The output is sent to stdout unless an output file name is explicitly specified. The input file format can be chosen using an option. Currently, only the RIS format is supported.

The advice is to always start from the RIS format (instead of CSV or BIB), since it allows a better, easier and clearer separation among the different elements of a bibliographic item. The command line parameters allow selecting the fields to export from the RIS file.

During the conversion, a unique progressive number is added to each paper, which acts as a unique identifier in the rest of the processing (within a column named id). Therefore, be careful not to mix different versions of the source file that may generate a different numbering of the papers.

TODO: add an option to append the output to an existing CSV file?

Positional arguments:

  • input_file: input bibliography file

Optional arguments:

  • --type | -t TYPE: Type of the bibliography file. Supported types: RIS. If absent 'RIS' is used.
  • --output | -o FILENAME: output CSV file name
  • --columns | -c col1,..,coln: list of comma-separated columns to export. If absent 'title,abstract' is used. Use '?' for the list of available columns

Example of usage

The standard slr-kit workflow needs a CSV file with two columns: title and abstract. Such CSV file can be obtained with the command:

import_biblio --columns title,abstract dataset.ris > dataset_abstracts.csv

acronyms.py

  • ACTION: Extracts a list of acronyms from the abstracts.
  • INPUT: CSV file with the list of abstracts generated by import_biblio.py.
  • OUTPUT: CSV file containing the short and extended acronyms suitable to be classified with FAWOC.

Uses the algorithm presented in A. Schwartz and M. Hearst, "A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text", Biocomputing, 2003.

The script assumes that the abstracts are contained in a column named abstract. A different column can be specified using a command line option. It also requires a column named id. All the rows in the input file with 'rejected' in the status field (if present) are discarded and not processed. The output is a TSV file with the columns id, term and label. This is the format used by FAWOC. The id is a number that uniquely identifies an acronym. term contains the acronym, in the format <extended acronym> | (<abbreviation>). The label column will be empty, because it is the column that will be used by FAWOC for the classification.
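
For illustration, the output might look like the following (the acronyms and id values are hypothetical examples, the columns are tab-separated, and the label column is left empty):

id    term    label
0     earliest deadline first | (EDF)
1     worst case execution time | (WCET)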

Positional arguments:

  • datafile: input CSV data file

Optional arguments:

  • --output | -o FILENAME: output file name
  • --column | -c COLUMN: Name of the column of datafile to search the acronyms. Default: abstract
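
Example of usage

A usage sketch following the arguments documented above (file names are illustrative):

acronyms.py dataset_abstracts.csv -o dataset_acronyms.tsv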

preprocess.py

  • ACTION: Performs the preprocessing of the documents to prepare them for further processing.
  • INPUT: The CSV file produced by import_biblio.py or the one modified by filter_paper.py.
  • OUTPUT: A CSV file containing the same columns of the input file, plus a new column containing the preprocessed text.

The preprocessing includes:

  • Remove punctuation
  • Convert to lowercase
  • Remove stop words
  • Mark selected n-grams as relevant
  • Acronym substitution
  • Remove special characters and digits
  • Regex based substitutions
  • Lemmatisation

All the rows in the input file with 'rejected' in the status field (if present) are discarded and not processed. The stop words are read only from one or more optional files. These words are replaced, in the output, with a placeholder (called the stopword placeholder) that is recognized in the term extraction phase. The default stopword placeholder is the '@' character.

The user can also specify files containing lists of relevant n-grams. For each specified file, the user can specify a particular marker that will be used as the replacement text for each term in that list. That marker will be surrounded with the stopword placeholder. If the user does not specify a marker for a list of terms, a different approach is used: each term is replaced with a placeholder composed of the stopword placeholder, the words composing the term separated with '_', and finally another stopword placeholder. This kind of placeholder can be used by the subsequent phases to recognize the term without losing the meaning of the n-gram.

Examples (assuming that '@' is the stopword placeholder):

  1. if the user specifies a list of relevant n-grams with the specific placeholder 'TASK_PARAM', all the n-grams in that list will be replaced with '@TASK_PARAM@'.
  2. if the user specifies a list of relevant n-grams without a specific placeholder, then each n-gram will be replaced with a specific string. The n-gram 'edf' will be replaced with '@edf@', the n-gram 'earliest deadline first' will be replaced with '@earliest_deadline_first@'.

This program also handles the acronyms. A TSV file containing the approved acronyms can be specified. This file must have the format defined by the acronyms.py script. The file must have two columns 'term' and 'label':

  • 'term' must contain the acronym in the form <extended acronym> | (<abbreviation>)
  • 'label' is the classification made with FAWOC. Only the rows with label equal to 'relevant' or 'keyword' will be considered.

For each considered row in the TSV file, the program searches for:

  1. the abbreviation of the acronym;
  2. the extended acronym;
  3. the extended acronym with all the spaces substituted with '-'.

The first search is case-sensitive, while the other two are not. The program replaces each recognized acronym with the marker <stopword placeholder><acronym abbreviation><stopword placeholder>. For instance, assuming '@' as the stopword placeholder and an approved acronym 'earliest deadline first | (EDF)', both 'EDF' and 'earliest deadline first' found in the text are replaced with '@EDF@'.

The preprocess.py script can also apply specific regex-based substitutions. Using the --regex option, the user can pass to the script a CSV file containing the instructions to apply these substitutions.

The file accepted by the --regex option has the following structure (an illustrative example is shown after this list):

  • pattern: the pattern to search in the text. Can be a python3 regex pattern or a string to be searched verbatim;
  • repl: the string that substitutes the pattern. The actual text substituted is __<repl-content>__;
  • regexBoolean: if true the pattern is treated as a regular expression. If false the pattern is searched verbatim.
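
A minimal sketch of such a file (assuming a header row and a comma as delimiter; the patterns and replacements are purely illustrative):

pattern,repl,regexBoolean
[0-9]+ *ms,TIME_VALUE,true
hard real-time,hard_real_time,false

With these rows, every match of the regular expression in the first row is replaced with __TIME_VALUE__, while the literal string in the second row is replaced with __hard_real_time__.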

Positional arguments:

  • datafile: input CSV data file

Optional arguments:

  • --output | -o FILENAME output file name. If omitted or '-' stdout is used
  • --placeholder | -p PLACEHOLDER Placeholder for stop words. Also used as a prefix for the relevant words. Default: '@'
  • --stop-words | -b FILENAME [FILENAME ...] stop words file name
  • --relevant-term | -r FILENAME [PLACEHOLDER] relevant terms file name and the placeholder to use with those terms. The placeholder must not contain any space. If the placeholder is omitted, each relevant term from this file is replaced with the stopword placeholder, followed by the term itself with each space changed to the '_' character, and then another stopword placeholder.
  • --acronyms | -a ACRONYMS TSV file with the approved acronyms
  • --target-column | -t TARGET_COLUMN Column in datafile to process. If omitted 'abstract' is used.
  • --output-column OUTPUT_COLUMN name of the column to save. If omitted 'abstract_lem' is used.
  • --input-delimiter INPUT_DELIMITER Delimiter used in datafile. Default '\t'
  • --output-delimiter OUTPUT_DELIMITER Delimiter used in output file. Default '\t'
  • --rows | -R INPUT_ROWS Select maximum number of samples
  • --language | -l LANGUAGE language of text. Must be an ISO 639-1 two-letter code. Default: 'en'
  • --regex REGEX regex .csv for specific substitutions

Example of usage

The following example processes the dataset_abstracts.csv file, filtering the stop words in stop_words.txt and produces dataset_preproc.csv, which contains the same columns of the input file plus the abstract_lem column:

preprocess.py --stop-words stop_words.txt dataset_abstracts.csv > dataset_preproc.csv

The following example processes the dataset_abstracts.csv file, replacing the terms in relevant_terms.txt with placeholders created from the terms themselves, replacing the terms in other_relevant.txt with '@PLACEHOLDER@', and produces dataset_preproc.csv, which contains the same columns of the input file plus the abstract_lem column:

preprocess.py --relevant-term relevant_terms.txt -r other_relevant.txt PLACEHOLDER dataset_abstracts.csv > dataset_preproc.csv

gen_terms.py

  • ACTION: Extracts the terms ({1,2,3,4}-grams) from the abstracts.
  • INPUT: The TSV file produced by preprocess.py (it works on the column abstract_lem).
  • OUTPUT: A TSV file containing the list of terms, and a TSV with their frequency.

This script extracts the terms from the file produced by the preprocess.py script. It uses the placeholder character to skip all the n-grams that contain the placeholder. The script also skips all the terms that contain tokens that start and end with the placeholder. This kind of token is produced by preprocess.py to mark the acronyms and the relevant terms.

The format of the output file is the one used by FAWOC. The structure is the following:

  • id: a progressive identification number;
  • term: the n-gram;
  • label: the label added by FAWOC to the n-gram. This field is left blank by the gen_terms.py script.

This command also produces the fawoc_data.tsv file, with the following structure:

  • id: the identification number of the term;
  • term: the term;
  • count: the number of occurrences of the term.

This file is used by FAWOC to show the number of occurrences of each term.
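
For illustration, assuming the output file is named dataset_terms.tsv, the two files might look like the following (terms and counts are hypothetical, columns are tab-separated):

dataset_terms.tsv:

id    term    label
1     real-time scheduling
2     worst case execution time

dataset_terms_fawoc_data.tsv:

id    term    count
1     real-time scheduling    42
2     worst case execution time    17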

Arguments:

  • inputfile: name of the TSV produced by preprocess.py;
  • outputfile: name of the output file. This name is also used to create the name of the file with the term frequencies. For instance, if outputfile is filename.tsv, the frequencies file will be named filename_fawoc_data.tsv. This file is created in the same directory as outputfile;
  • --stdout | -s: also print the output file on stdout;
  • --n-grams | -n N: maximum size of n-grams. The script will output all the 1-grams, ... N-grams;
  • --min-frequency |-m N: minimum frequency of the n-grams. All the n-grams with a frequency lower than N are not output.
  • --placeholder | -p PLACEHOLDER: placeholder for barrier word. Also used as a prefix for the relevant words. Default: '@'
  • --column | -c COLUMN: column in datafile to process. If omitted 'abstract_lem' is used.
  • --delimiter DELIMITER: delimiter used in datafile. Default '\t'
  • --logfile LOGFILE: log file name. If omitted 'slr-kit.log' is used

Example of usage

Extracts terms from dataset_preproc.csv and stores them in dataset_terms.csv and dataset_terms_fawoc_data.tsv:

gen_terms.py dataset_preproc.csv dataset_terms.csv

fawoc.py, the FAst WOrd Classifier

  • ACTION: GUI program for the fast classification of terms.
  • INPUT: The CSV file with the terms produced by gen_terms.py.
  • OUTPUT: The same input CSV with the labels assigned to the terms.

NOTE: the program changes the content of the input file.

The program also uses two files to save its state and retrieve some information about the terms. Assuming that the input file is called dataset_terms.tsv, FAWOC uses dataset_terms_fawoc_data.json and dataset_terms_fawoc_data.tsv. These two files are searched and saved in the same directory as the input file. The json file is used to save the state of FAWOC, and it is saved every time the input file is updated. The tsv file is used to load some additional data about the terms (currently only the term count). This file is not modified by FAWOC. If these two files are not present, they are created by FAWOC: the json file with the current state, the tsv file with the data loaded from the input file, if any. If no count field was present in the input file, a default value of -1 is used for all terms.

FAWOC saves its data every 10 classifications. To save data more often, use the 'w' key.

The program also writes profiling information into the file profiler.log with the relevant operations that are carried out.

occurrences.py

  • ACTION: Determines the occurrences of the terms in the abstracts.
  • INPUT: Two files: 1) the list of abstracts generated by preprocess.py and 2) the list of terms generated by fawoc.py
  • OUTPUT: A JSON data structure with the position of every term in each abstract; the output is written to stdout by default.

Example of usage

Extracts the occurrences of the terms listed in dataset_terms.csv in the abstracts contained in dataset_preproc.csv, storing the results in dataset_occ_keyword.json:

occurrences.py -l keyword dataset_preproc.csv dataset_terms.csv > dataset_occ_keyword.json

postprocess.py

This script generates a copy of the preprocess.py output file, adding a column whose default name is abstract_filtered. This column contains the terms that appear in the abstract_lem column of the input file, keeping only the terms classified as relevant or keyword in the terms list.

The script also considers as relevant all the tokens that start and end with the placeholder character (the script strips the placeholder characters and uses the rest of the token). These tokens are produced by preprocess.py to mark the additional relevant terms and the acronyms.

Example of usage

postprocess.py dataset_preproc.csv dataset_terms.csv myoutputdir -o data_postprocess.csv

lda.py

  • ACTION: Trains an LDA model and outputs the extracted topics and the association between topics and documents.
  • INPUT: The TSV file produced by postprocess.py (it works on the column abstract_filtered) and the terms TSV file classified with FAWOC.
  • OUTPUT: A JSON file with the description of extracted topics and a JSON file with the association between topics and documents.

The script uses the filtered documents produced by postprocess.py.

For more information and references about LDA, check the Gensim LDA library page.

This script outputs the topics in <outdir>/lda_terms-topics_<date>_<time>.json and the topics assigned to each document in <outdir>/lda_docs-topics_<date>_<time>.json.

IMPORTANT:

There are some issues with the reproducibility of the LDA training. Setting the seed option (see below) is not enough to guarantee the reproducibility of the experiment. It is also necessary to set the environment variable PYTHONHASHSEED to 0. The following command sets the variable for a single run in a Linux shell:

PYTHONHASHSEED=0 python3 lda.py ...

Using a saved model also requires the same seed used for training and PYTHONHASHSEED set to 0. More information on the PYTHONHASHSEED variable can be found here.

Arguments:

Positional:

  • postproc_file: path to the postprocess output file with the text to process.
  • outdir: path to the directory where to save the results. If omitted, the current directory is used.

Optional:

  • --text-column | -t TARGET_COLUMN: Column in preproc_file to process. If omitted 'abstract_lem' is used.
  • --topics TOPICS Number of topics. If omitted 20 is used
  • --alpha ALPHA alpha parameter of LDA. If omitted "auto" is used
  • --beta BETA beta parameter of LDA. If omitted "auto" is used
  • --no_below NO_BELOW Keep tokens which are contained in at least this number of documents. If omitted 20 is used
  • --no_above NO_ABOVE Keep tokens which are contained in no more than this fraction of documents (fraction of total corpus size, not an absolute number). If omitted 0.5 is used
  • --seed SEED Seed to be used in training
  • --model if set, the lda model is saved to directory <outdir>/lda_model. The model is saved with name "model".
  • --load-model LOAD_MODEL Path to a directory where a previously trained model is saved. Inside this directory the model named "model" is searched. the loaded model is used with the dataset file to generate the topics and the topic document association
  • --no_timestamp if set, no timestamp is added to the topics file names
  • --config | -c CONFIG Path to a toml config file like the one used by the slrkit lda command. It overrides all the cli arguments.

Example of usage

Extracts topics from dataset_preproc.csv using the classified terms in dataset_terms.csv and saves the results in /path/to/outdir:

lda.py dataset_preproc.csv dataset_terms.csv /path/to/outdir

lda_ga.py

  • ACTION: Uses a GA to search for the best LDA model parameters; outputs all the trained models, plus the extracted topics and the association between topics and documents produced by the best model.
  • INPUT: The TSV file produced by postprocess.py (it works on the column abstract_filtered), the terms TSV file classified with FAWOC and a toml file with the parameters used by the GA.
  • OUTPUT: All the trained models in a format suitable to be used with the lda.py script and a tsv file that summarizes all the results. It also outputs the extracted topics and the association between topics and documents produced by the best model.

The script searches for the best combination (in terms of coherence) of the number of topics, alpha, beta, no-below and no-above parameters. Each combination of parameters (a.k.a. individual) is represented as

(topics, alpha_val, beta, no_above, no_below, alpha_type)

Each parameter has the following meaning:

  • topics: number of topics. It is an integer number;
  • alpha_val: value of the alpha parameter. It is a floating point number;
  • beta: value of the beta parameter. It is a floating point number;
  • no_above: value of the no-above parameter. It is a floating point number;
  • no_below: value of the no-below parameter. It is an integer number;
  • alpha_type: this integer number tells if the alpha parameter must have the value of alpha_val or if one of the string values must be used. The allowed values are:
    • 0: use the alpha_val;
    • 1: use the string symmetrical;
    • -1: use the string asymmetrical;

It uses a GA algorithm (the mu+lambda genetic algorithm) to find the best model. The mu+lambda GA starts with an initial population. It then creates lambda new individuals (a.k.a. solutions) by replication, mutation or crossover (only one of these operations is applied to a single individual). From the initial population plus the new lambda individuals, the algorithm selects mu individuals that "survive" to the next generation. This procedure is repeated for the configured number of generations. The selection procedure is a tournament where a fixed number of individuals are randomly chosen to participate. The individual in the tournament with the best coherence is selected to pass to the next generation. The tournament is applied mu times in order to select the mu individuals of the next generation. The GA uses a Gaussian mutation: if an individual must be mutated, the mutation randomly selects which parameters have to be mutated, and for each selected parameter a random number is drawn from a Gaussian distribution and added to the parameter. Each parameter has its own Gaussian distribution. The crossover randomly selects a set of parameters that the two individuals exchange. More information and references about the mu+lambda GA can be found here.
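
The following Python sketch illustrates the mu+lambda loop described above. It is not the actual lda_ga.py implementation (which relies on a GA library): the coherence, mutate_gaussian and crossover functions are hypothetical stand-ins for the real operators, and the crossover here returns a single offspring for brevity.

import random

def mu_plus_lambda(population, mu, lambda_, generations,
                   p_mate, p_mutate, tournament_size,
                   coherence, mutate_gaussian, crossover):
    # Sketch of the (mu + lambda) GA: illustrative only.
    for _ in range(generations):
        # Create lambda new individuals; each one is produced by exactly one
        # operation: crossover, mutation or plain replication.
        offspring = []
        for _ in range(lambda_):
            r = random.random()
            if r < p_mate:
                parent1, parent2 = random.sample(population, 2)
                offspring.append(crossover(parent1, parent2))
            elif r < p_mate + p_mutate:
                offspring.append(mutate_gaussian(random.choice(population)))
            else:
                offspring.append(random.choice(population))
        # Select the mu survivors with repeated tournaments over the union of
        # the old population and the new offspring; the individual with the
        # best coherence wins each tournament.
        pool = population + offspring
        population = [max(random.sample(pool, tournament_size), key=coherence)
                      for _ in range(mu)]
    return population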

All the parameters of the GA are taken from a TOML version 1.0.0 file. The format of this file is the following:

  • limits: this section contains the ranges of the parameters;
    • min_topics: minimum number of topics;
    • max_topics: maximum number of topics;
    • max_no_below: maximum value of the no-below parameter. The minimum is always 1. A value of -1 means a tenth of the number of documents;
    • min_no_above: minimum value of the no-above parameter. The maximum is always 1.
  • algorithm: this section contains the parameters used by the GA:
    • mu: number of individuals that will pass each generation;
    • lambda: number of individuals that are generated at each generation;
    • initial: size of the initial population;
    • generations: number of generations;
    • tournament_size: number of individuals randomly selected for the selection tournament.
  • probabilities: this section contains the probabilities used by the script:
    • mutate: probability of mutation. The sum of this probability and the mate probability must be less than 1;
    • component_mutation: probability of mutation of each individual component;
    • mate: probability of crossover (also called mating). The sum of this probability and the mutate probability must be less than 1;
    • no_filter: probability that a new individual is created with no term filter (no_above = no_below = 1);
  • mutate: this section contains the parameters of the gaussian distributions used by the mutation for each parameter:
    • topics.mu and topics.sigma are the mean value and the standard deviation for the topics parameter;
    • alpha_val.mu and alpha_val.sigma are the mean value and the standard deviation for the value of the alpha parameter;
    • beta.mu and beta.sigma are the mean value and the standard deviation for the beta parameter;
    • no_above.mu and no_above.sigma are the mean value and the standard deviation for the no_above parameter;
    • no_below.mu and no_below.sigma are the mean value and the standard deviation for the no_below parameter;
    • alpha_type.mu and alpha_type.sigma are the mean value and the standard deviation for the type of the alpha parameter.

An example of this file can be found in the ga_param.toml file. The default values in this file can be a good starting point for the GA parameters, but they require a check from the user. In particular, the probabilities must be checked to ensure some variation in the individuals. The component_mutation probability must be checked to ensure that the mutation operator applies some variation to each mutating individual. The parameters in the mutate section are also important. In particular, topics.sigma must be chosen taking into account the range of possible numbers of topics, and no_below.sigma must be chosen taking into account the number of documents used in training. The other defaults are usually fine to guarantee some variation of the parameters and can be left untouched.
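
For illustration only, a file following this structure might look like the sketch below; the numbers are made up and are not the defaults shipped in ga_param.toml:

[limits]
min_topics = 5
max_topics = 40
max_no_below = -1
min_no_above = 0.5

[algorithm]
mu = 20
lambda = 40
initial = 100
generations = 20
tournament_size = 4

[probabilities]
mutate = 0.5
component_mutation = 0.5
mate = 0.3
no_filter = 0.05

[mutate]
topics.mu = 0.0
topics.sigma = 5.0
alpha_val.mu = 0.0
alpha_val.sigma = 0.1
beta.mu = 0.0
beta.sigma = 0.1
no_above.mu = 0.0
no_above.sigma = 0.05
no_below.mu = 0.0
no_below.sigma = 10.0
alpha_type.mu = 0.0
alpha_type.sigma = 1.0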

Each trained model is assigned a UUID. The script outputs all the models in <outdir>/<date>_<time>_lda_results/<UUID>. For each trained model, a toml file is produced with all the parameters already set to use the corresponding model with the lda.py script. These toml files are saved in <outdir>/<date>_<time>_lda_results/<UUID>.toml, and can be loaded in the lda.py script using its --config option. It also outputs a tsv file in <outdir>/<date>_<time>_lda_results/results.csv with the following format:

  • id: progressive identification number;
  • topics: number of topics;
  • alpha: alpha value;
  • beta: beta value;
  • no_below: no-below value;
  • no_above: no-above value;
  • coherence: coherence score of the model;
  • times: time spent evaluating this model;
  • seed: seed used;
  • uuid: UUID of the model;
  • num_docs: number of documents;
  • num_not_empty: number of documents not empty after filtering.

The script also outputs the extracted topics and the topics-documents association produced by the best model. The topics are output in <outdir>/lda_terms-topics_<date>_<time>.json and the topics assigned to each document in <outdir>/lda_docs-topics_<date>_<time>.json.

IMPORTANT:

There are some issues with the reproducibility of the LDA training. Setting the seed option (see below) is not enough to guarantee the reproducibility of the experiment. It is also necessary to set the environment variable PYTHONHASHSEED to 0. The following command sets the variable for a single run in a Linux shell:

PYTHONHASHSEED=0 python3 lda_ga.py ...

Using a saved model also requires the same seed used for training and PYTHONHASHSEED set to 0. More information on the PYTHONHASHSEED variable can be found here.

Arguments:

Positional:

  • preproc_file: path to the preprocess file with the text to process.
  • terms_file: path to the file with the classified terms.
  • outdir: path to the directory where to save the results. If omitted, the current directory is used.

Optional:

  • --text-column | -t TARGET_COLUMN Column in preproc_file to process. If omitted 'abstract_lem' is used.
  • --title-column TITLE Column in preproc_file to use as document title. If omitted 'title' is used.
  • --seed SEED Seed to be used in training
  • --placeholder | -p PLACEHOLDER Placeholder for barrier word. Also used as a prefix for the relevant words. Default: '@'
  • --delimiter DELIMITER Delimiter used in preproc_file. Default '\t'
  • --no_timestamp if set, no timestamp is added to the topics file names
  • --logfile LOGFILE log file name. If omitted 'slr-kit.log' is used

Example of usage

lda_ga.py dataset_preproc.csv dataset_terms.csv ga_param.toml /path/to/outdir

stopword_extractor.py

  • ACTION: Extracts a list of terms classified as stopwords from the terms file.
  • INPUT: CSV file with the list of terms classified by FAWOC.
  • OUTPUT: TXT file containing the list of stopwords.

The script reads a file classified by FAWOC and searches for terms labelled as stopword. The output is a TXT file with one stopword per line. This is the format used by preprocess.py for the lists of stopwords.

Positional arguments:

  • terms_file: path to the file with the classified terms;
  • outfile: output file

Example of usage

Extracts the stopwords from dataset_terms.csv and saves the list in stopwords.txt:

stopword_extractor.py dataset_terms.csv stopwords.txt

merge_labels.py

  • ACTION: Merges an old classified terms file with a new terms file.
  • INPUT: a CSV file with the classified terms and another CSV file with new terms.
  • OUTPUT: the merged CSV file.

Usage: merge_labels.py [-h] old new FILENAME

This script is used to recover a classification already done. If a new list of terms is produced (for example after changing some parameters in preprocess.py), this script allows recovering an old classification by taking the labels of the already classified terms and applying them to the new list. This script also searches for a fawoc_data.tsv file associated with the new classification. If this file is found, it is used to create a fawoc_data.tsv file associated with the merged classification.

Positional arguments:

  • old: old CSV data file partially classified
  • new: new CSV data file to be classified
  • output: output file name

Example of usage

Recovers the classification made in dataset_terms.csv, transferring the applied labels to dataset_terms_new.csv and producing the dataset_terms_merged.csv file with the recovered classification:

merge_labels.py dataset_terms.csv dataset_terms_new.csv dataset_terms_merged.csv

topic_report.py

  • ACTION: Generates reports with various statistics regarding topics and papers. The reports are based on 2 templates; if they are not found in the working directory of this script, they are automatically copied from the report_templates directory.
  • INPUT: the abstracts file containing the 'title', 'journal' and 'year' columns, and the lda json file with the topics assigned to each document. This file is usually called lda_docs-topics_<date>_<time>.json.
  • OUTPUT: A directory named report<timestamp>, containing a figure in png format called reportyear.png and a table directory with three tex files containing tables in tex format. Also, a LaTeX report and a Markdown report are saved inside the directory, with names report_template.tex and report.md.

This script prepares reports with some statistics about the analyzed documents. The statistics are:

  • the number of papers classified in each topic, published in each considered year. Since the lda.py script calculates, for each paper, the probability that the paper is about a certain topic, this statistic is a real number;
  • a plot of the statistic above;
  • the number of papers published in each journal for each topic. This statistic is a real number for the same reason as the one above;
  • the number of papers published in each journal in each considered year.

Arguments

Positional:

  • abstract_file: path to the abstracts file
  • json_file: path to the json docs file, generated by lda.py
  • topics_file: path to the json topic terms file, generated by lda.py

Optional:

  • --dir | -d DIR: path to the directory where output will be saved.
  • --minyear | -m YEAR: minimum year that will be used in the reports. If missing, the minimum year found in the data is used;
  • --maxyear | -M YEAR: maximum year that will be used in the reports. If missing, the maximum year found in the data is used.
  • --plotsize | -p SIZE: number of topics to be displayed in each subplot. If missing, 10 will be used.
  • --compact | -c : if set, the table listing all topics and terms will be in a compact style.
  • --no_stats | -s : if set, the table listing all topics and terms won't list each term's coherence with the topic.

Example of usage

topic_report.py dataset_abstract.csv lda_docs-topics_2021-07-14_101211.json

report_templates

This directory contains the 2 templates, report_template.md and report_template.tex, that are used by topic_report.py.

report_template.md

This is the Markdown template that will be automatically cloned and filled by topic_report.py. The report will contain:

  • A table containing data about Topic-Year evolution
  • A graph about Topic-Year evolution
  • A table containing data about the Journals that published the papers, and their topics distribution
  • A table containing data about the Journals that published the papers, and the publication year distribution

journal_lister.py

  • ACTION: Generates a list of the journals where the analyzed papers were published.
  • INPUT: the abstracts file generated by import_biblio.py with the journal column.
  • OUTPUT: A list of journals in the format used by FAWOC.

This command produces a list suitable to be classified with FAWOC in order to filter the journals that are not relevant for the analysis.

The output file has the following format:

  • id: a progressive identification number;
  • term: the name of the journal;
  • label: the label added by FAWOC to the journal. This field is left blank;
  • count: the number of papers published in the journal.

Arguments:

  • abstract_file: path to the abstracts file
  • outfile: path to csv output file
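
Example of usage

A usage sketch following the arguments documented above (file names are illustrative):

journal_lister.py dataset_abstracts.csv dataset_journals.csv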

filter_paper.py

  • ACTION: Modifies the file produced by import_biblio.py, adding a field used to mark the papers from the journals considered not relevant.
  • INPUT: the file with the abstracts (the output of import_biblio.py) and the list of journals (produced by journal_lister.py) with the journals classified.
  • OUTPUT: The file with the abstracts, with a new field that tells whether a paper must be considered or not.

This command uses the classified list of journals to filter the papers. The output file has a new column status added. This field contains the value good for all the papers published in journals classified as relevant or keyword. All the other papers are marked with the rejected value. This field is used by preprocess.py and acronyms.py to exclude the papers marked as rejected.
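
For illustration, a fragment of the output might look like the following (titles and journals are hypothetical, and only some columns are shown):

id    title    journal    status
0     A Paper on a Relevant Subject    Relevant Journal    good
1     A Paper on Something Else    Unrelated Journal    rejected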

Arguments:

  • abstract_file: path to the file with the abstracts of the papers
  • journal_file: path to the file with the classified journals