This file contains details on each script that is available in the slr-kit
repository.
Each script can be run as a standalone program, although they are typically run through the main slrkit command.
The following scripts and programs are currently available. They are listed in the order in which they are expected to be used in the SLR workflow. All the scripts expect UTF-8 encoded files as input.
- ACTION: Import a bibliographic file and convert it to the CSV format.
- INPUT: the bibliographic file.
- OUTPUT: CSV file with the desired columns.
The output is sent to stdout unless an output file name is explicitly specified. The input file format can be chosen using an option. Currently, only the RIS format is supported.
The advice is to always start from the RIS format (instead of CSV or BIB), since it allows a better, easier and clearer separation among the different elements of a bibliographic item. The command line parameters allow selecting the fields to export from the RIS file.
During the conversion, a unique progressive number is added to each paper (in a column named `id`), which acts as its unique identifier in the rest of the processing.
Therefore, be careful not to mix different versions of the source file, since they may generate a different numbering of the papers.
TODO: add an option to append the output to an existing CSV file?
Positional arguments:
- `input_file`: input bibliography file

Optional arguments:
- `--type | -t TYPE`: type of the bibliography file. Supported types: RIS. If absent, 'RIS' is used.
- `--output | -o FILENAME`: output CSV file name.
- `--columns | -c col1,..,coln`: list of comma-separated columns to export. If absent, 'title,abstract' is used. Use '?' to get the list of available columns.
The standard slr-kit workflow needs a CSV file with two columns: `title` and `abstract`. Such a CSV file can be obtained with the command:
import_biblio --columns title,abstract dataset.ris > dataset_abstracts.csv
- ACTION: Extracts a list of acronyms from the abstracts.
- INPUT: CSV file with the list of abstracts generated by `import_biblio.py`.
- OUTPUT: CSV file containing the short and extended acronyms, suitable to be classified with FAWOC.
Uses the algorithm presented in A. Schwartz and M. Hearst, "A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text", Biocomputing, 2003.
The script assumes that the abstracts are contained in a column named `abstract`; a different column can be specified using a command line option. It also requires a column named `id`.
All the rows in the input file with 'rejected' in the `status` field (if present) are discarded and not processed.
The output is a TSV file with the columns `id`, `term` and `label`. This is the format used by FAWOC.
The `id` is a number that uniquely identifies an acronym. `term` contains the acronym, in the format `<extended acronym> | (<abbreviation>)`.
The `label` column is left empty, because it is the column that FAWOC will use for the classification.
Positional arguments:
- `datafile`: input CSV data file

Optional arguments:
- `--output | -o FILENAME`: output file name
- `--column | -c COLUMN`: name of the column of datafile in which to search for acronyms. Default: `abstract`
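A possible invocation, using the arguments documented above (the file names are only examples):

acronyms.py --output dataset_acronyms.tsv dataset_abstracts.csv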
- ACTION: Performs the preprocessing of the documents to prepare them for further processing.
- INPUT: The CSV file produced by `import_biblio.py` or the one modified by `filter_paper.py`.
- OUTPUT: A CSV file containing the same columns as the input file, plus a new column containing the preprocessed text.
The preprocessing includes:
- Remove punctuation
- Convert to lowercase
- Remove stop words
- Mark selected n-grams as relevant
- Acronyms substitution
- Remove special characters and digits
- Regex based substitutions
- Lemmatisation
All the rows in the input file with 'rejected' in the `status` field (if present) are discarded and not processed.
The stop words are read only from one or more optional files.
These words are replaced, in the output, with a placeholder (called stopword placeholder) that is recognized in the term extraction phase.
The default stopword placeholder is the '@' character.
The user can also specify files containing lists of relevant n-grams. For each specified file, the user can specify a particular marker that will be used as the replacement text for each term in that list. That marker is surrounded with the stopword placeholder. If the user does not specify a marker for a list of terms, a different approach is used: each term is replaced with a placeholder composed of the stopword placeholder, the words of the term joined with '_', and finally another stopword placeholder. This kind of placeholder can be used by the subsequent phases to recognize the term without losing the meaning of the n-gram.
Examples (assuming that '@' is the stopword placeholder):
- if the user specifies a list of relevant n-grams with the specific placeholder 'TASK_PARAM', all the n-grams in that list will be replaced with '@TASK_PARAM@'.
- if the user specifies a list of relevant n-grams without a specific placeholder, then each n-gram will be replaced with a specific string. The n-gram 'edf' will be replaced with '@edf@', the n-gram 'earliest deadline first' will be replaced with '@earliest_deadline_first@'.
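As a rough sketch of the replacement logic described above (this is only an illustration with made-up names, not the actual `preprocess.py` code):

```python
# Illustrative sketch of the relevant n-gram replacement (not the real code).
PLACEHOLDER = '@'  # the stopword placeholder

def make_marker(ngram, marker=None):
    """Build the replacement text for a relevant n-gram."""
    if marker is not None:
        # a list-wide marker was given, e.g. 'TASK_PARAM' -> '@TASK_PARAM@'
        return f'{PLACEHOLDER}{marker}{PLACEHOLDER}'
    # no marker: join the words of the term with '_',
    # e.g. 'earliest deadline first' -> '@earliest_deadline_first@'
    return f'{PLACEHOLDER}{ngram.replace(" ", "_")}{PLACEHOLDER}'

print(make_marker('earliest deadline first'))        # @earliest_deadline_first@
print(make_marker('any term', marker='TASK_PARAM'))  # @TASK_PARAM@
```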
This program also handles the acronyms. A TSV file containing the approved acronyms can be specified. This file must have the format defined by the acronyms.py script. The file must have two columns 'term' and 'label':
- 'term' must contain the acronym in the form `<extended acronym> | (<abbreviation>)`;
- 'label' is the classification made with FAWOC. Only the rows with label equal to 'relevant' or 'keyword' will be considered.
For each considered row in the TSV file, the program searches for:
- the abbreviation of the acronym;
- the extended acronym;
- the extended acronym with all the spaces substituted with '-'.
The first search is case-sensitive, while the other two are not.
The program replaces each recognized acronym with the marker `<stopword placeholder><acronym abbreviation><stopword placeholder>`.
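A minimal sketch of these three searches, assuming plain regular expressions (illustration only; the actual `preprocess.py` implementation may differ):

```python
# Illustrative sketch of the acronym substitution (not the real code).
import re

PLACEHOLDER = '@'

def replace_acronym(text, extended, abbreviation):
    marker = f'{PLACEHOLDER}{abbreviation}{PLACEHOLDER}'
    # 1) the abbreviation itself, case-sensitive
    text = re.sub(rf'\b{re.escape(abbreviation)}\b', marker, text)
    # 2) the extended form, case-insensitive
    text = re.sub(rf'\b{re.escape(extended)}\b', marker, text, flags=re.IGNORECASE)
    # 3) the extended form with spaces replaced by '-', case-insensitive
    hyphenated = extended.replace(' ', '-')
    text = re.sub(rf'\b{re.escape(hyphenated)}\b', marker, text, flags=re.IGNORECASE)
    return text

print(replace_acronym('EDF (earliest deadline first) is widely used',
                      'earliest deadline first', 'EDF'))
# -> '@EDF@ (@EDF@) is widely used'
```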
The `preprocess.py` script can also apply some specific regex-based substitutions. Using the `--regex` option, the user can pass to the script a CSV file containing the instructions for these substitutions. The file accepted by the `--regex` option has the following structure:
- `pattern`: the pattern to search in the text. It can be a python3 regex pattern or a string to be searched verbatim;
- `repl`: the string that substitutes the `pattern`. The actual text substituted is `__<repl-content>__`;
- `regexBoolean`: if `true`, the `pattern` is treated as a regular expression; if `false`, the `pattern` is searched verbatim.
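Purely as an illustration of the three columns described above, such a file might look like the following (the delimiter, header and quoting rules here are assumptions; check the script's help for the exact format). With `repl` equal to `TIME_VALUE`, the matched text would become `__TIME_VALUE__`:

```csv
pattern,repl,regexBoolean
real time,real_time,false
[0-9]+ ms,TIME_VALUE,true
```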
Positional arguments:
- `datafile`: input CSV data file

Optional arguments:
- `--output | -o FILENAME`: output file name. If omitted or '-', stdout is used
- `--placeholder | -p PLACEHOLDER`: placeholder for stop words. Also used as a prefix for the relevant words. Default: '@'
- `--stop-words | -b FILENAME [FILENAME ...]`: stop words file name(s)
- `--relevant-term | -r FILENAME [PLACEHOLDER]`: relevant terms file name and the placeholder to use with those terms. The placeholder must not contain any space. If the placeholder is omitted, each relevant term from this file is replaced with the stopword placeholder, followed by the term itself with each space changed to the '_' character, and then another stopword placeholder
- `--acronyms | -a ACRONYMS`: TSV file with the approved acronyms
- `--target-column | -t TARGET_COLUMN`: column in datafile to process. If omitted, 'abstract' is used
- `--output-column OUTPUT_COLUMN`: name of the column to save. If omitted, 'abstract_lem' is used
- `--input-delimiter INPUT_DELIMITER`: delimiter used in datafile. Default '\t'
- `--output-delimiter OUTPUT_DELIMITER`: delimiter used in the output file. Default '\t'
- `--rows | -R INPUT_ROWS`: select the maximum number of samples
- `--language | -l LANGUAGE`: language of the text. Must be an ISO 639-1 two-letter code. Default: 'en'
- `--regex REGEX`: regex .csv file for specific substitutions
The following example processes the `dataset_abstracts.csv` file, filtering the stop words in `stop_words.txt`, and produces `dataset_preproc.csv`, which contains the same columns as the input file plus the `abstract_lem` column:
preprocess.py --stop-words stop_words.txt dataset_abstracts.csv > dataset_preproc.csv
The following example processes the `dataset_abstracts.csv` file, replacing the terms in `relevant_terms.txt` with placeholders created from the terms themselves and the terms in `other_relevant.txt` with '@PLACEHOLDER@', and produces `dataset_preproc.csv`, which contains the same columns as the input file plus the `abstract_lem` column:
preprocess.py --relevant-term relevant_terms.txt -r other_relevant.txt PLACEHOLDER dataset_abstracts.csv > dataset_preproc.csv
- ACTION: Extracts the terms ({1,2,3,4}-grams) from the abstracts.
- INPUT: The TSV file produced by `preprocess.py` (it works on the column `abstract_lem`).
- OUTPUT: A TSV file containing the list of terms, and a TSV file with their frequencies.
This script extracts the terms from the file produced by the `preprocess.py` script. It uses the placeholder character to skip all the n-grams that contain the placeholder. The script also skips all the terms containing tokens that start and end with the placeholder; this kind of token is produced by `preprocess.py` to mark the acronyms and the relevant terms.
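A rough sketch of this skipping logic (illustration only, with made-up names; not the actual `gen_terms.py` code):

```python
# Illustrative n-gram extraction that skips placeholder tokens (not the real code).
PLACEHOLDER = '@'

def ngrams_from_text(text, max_n=4):
    tokens = text.split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            # skip n-grams containing the bare placeholder (filtered stop words)
            if any(tok == PLACEHOLDER for tok in window):
                continue
            # skip n-grams containing '@...@' tokens (acronyms / relevant terms)
            if any(tok.startswith(PLACEHOLDER) and tok.endswith(PLACEHOLDER)
                   for tok in window):
                continue
            yield ' '.join(window)

print(list(ngrams_from_text('hard real_time scheduling @ @edf@ analysis', max_n=2)))
```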
The format of the output file is the one used by FAWOC. The structure is the following:
- `id`: a progressive identification number;
- `term`: the n-gram;
- `label`: the label added by FAWOC to the n-gram. This field is left blank by the `gen_terms.py` script.
This command also produces the `fawoc_data.tsv` file, with the following structure:
- `id`: the identification number of the term;
- `term`: the term;
- `count`: the number of occurrences of the term.

This file is used by FAWOC to show the number of occurrences of each term.
Positional arguments:
- `inputfile`: name of the TSV file produced by `preprocess.py`;
- `outputfile`: name of the output file. This name is also used to create the name of the file with the term frequencies. For instance, if `outputfile` is `filename.tsv`, the frequencies file will be named `filename_fawoc_data.tsv`. This file is created in the same directory as `outputfile`.

Optional arguments:
- `--stdout | -s`: also print the output file on stdout;
- `--n-grams | -n N`: maximum size of the n-grams. The script will output all the 1-grams, ..., N-grams;
- `--min-frequency | -m N`: minimum frequency of the n-grams. All the n-grams with a frequency lower than `N` are not output;
- `--placeholder | -p PLACEHOLDER`: placeholder for barrier words. Also used as a prefix for the relevant words. Default: '@';
- `--column | -c COLUMN`: column in datafile to process. If omitted, 'abstract_lem' is used;
- `--delimiter DELIMITER`: delimiter used in datafile. Default '\t';
- `--logfile LOGFILE`: log file name. If omitted, 'slr-kit.log' is used.
The following example extracts the terms from `dataset_preproc.csv` and stores them in `dataset_terms.csv` and `dataset_terms_fawoc_data.tsv`:
gen_terms.py dataset_preproc.csv dataset_terms.csv
- ACTION: GUI program for the fast classification of terms.
- INPUT: The CSV file with the terms produced by `gen_terms.py`.
- OUTPUT: The same input CSV with the labels assigned to the terms.
NOTE: the program changes the content of the input file.
The program also uses two files to save its state and to retrieve some information about the terms.
Assuming that the input file is called `dataset_terms.tsv`, FAWOC uses `dataset_terms_fawoc_data.json` and `dataset_terms_fawoc_data.tsv`.
These two files are searched for and saved in the same directory as the input file.
The json file is used to save the state of FAWOC, and it is saved every time the input file is updated.
The tsv file is used to load some additional data about the terms (currently only the terms count).
This file is not modified by FAWOC.
If these two files are not present, they are created by FAWOC: the json file with the current state, and the tsv file with any data loaded from the input file.
If no count field is present in the input file, a default value of -1 is used for all terms.
FAWOC saves its data every 10 classifications. To save data more often, use the 'w' key.
The program also writes profiling information into the file `profiler.log`, recording the relevant operations that are carried out.
- ACTION: Determines the occurrences of the terms in the abstracts.
- INPUT: Two files: 1) the list of abstracts generated by `preprocess.py` and 2) the list of terms generated by `fawoc.py`.
- OUTPUT: A JSON data structure with the position of every term in each abstract; the output is written to stdout by default.
The following example extracts the occurrences of the terms listed in `dataset_terms.csv` in the abstracts contained in `dataset_preproc.csv`, storing the results in `dataset_occ_keyword.json`:
occurrences.py -l keyword dataset_preproc.csv dataset_terms.csv > dataset_occ_keyword.json
The `postprocess.py` script generates a copy of the preprocess input file and adds a column, whose default name is `abstract_filtered`, containing the terms that appear in the `abstract_lem` column of the input file; it keeps only the terms classified as `relevant` or `keyword` in the terms list.
The script also considers as `relevant` all the tokens that start and end with the placeholder character (the script strips the placeholder characters and uses the rest of the token). These tokens are used by `preprocess.py` to mark the additional relevant terms and the acronyms.
postprocess.py dataset_preproc.csv dataset_terms.csv myoutputdir -o data_postprocess.csv
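For illustration only, a much simplified sketch of this filtering (single-word terms only, made-up function names; not the actual `postprocess.py` code):

```python
# Simplified sketch of how abstract_filtered could be built (not the real code).
import csv

PLACEHOLDER = '@'

def load_good_terms(terms_file):
    """Return the set of terms labelled 'relevant' or 'keyword' by FAWOC."""
    good = set()
    with open(terms_file, encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            if row['label'] in ('relevant', 'keyword'):
                good.add(row['term'])
    return good

def filter_abstract(abstract_lem, good_terms):
    kept = []
    for tok in abstract_lem.split():
        if tok.startswith(PLACEHOLDER) and tok.endswith(PLACEHOLDER):
            # marked acronym or relevant term: strip the placeholders and keep it
            kept.append(tok.strip(PLACEHOLDER))
        elif tok in good_terms:
            kept.append(tok)
    return ' '.join(kept)
```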
- ACTION: Trains an LDA model and outputs the extracted topics and the association between topics and documents.
- INPUT: The TSV file produced by `postprocess.py` (it works on the column `abstract_filtered`) and the terms TSV file classified with FAWOC.
- OUTPUT: A JSON file with the description of the extracted topics and a JSON file with the association between topics and documents.
The script uses the filtered documents produced by `postprocess.py`.
For more information and references about LDA, check the Gensim LDA library page.
This script outputs the topics in `<outdir>/lda_terms-topics_<date>_<time>.json` and the topics assigned to each document in `<outdir>/lda_docs-topics_<date>_<time>.json`.
IMPORTANT: there are some issues with the reproducibility of the LDA training. Setting the `seed` option (see below) is not enough to guarantee the reproducibility of the experiment. It is also necessary to set the environment variable `PYTHONHASHSEED` to `0`.
The following command sets the variable for a single run in a Linux shell:
PYTHONHASHSEED=0 python3 lda.py ...
Using a saved model also requires the same seed used for training and `PYTHONHASHSEED` set to `0`.
More information on the `PYTHONHASHSEED` variable can be found here.
Positional:
- `postproc_file`: path to the postprocess file with the text to process;
- `outdir`: path to the directory where to save the results. If omitted, the current directory is used.
Optional:
- `--text-column | -t TARGET_COLUMN`: column in postproc_file to process. If omitted, 'abstract_lem' is used.
- `--topics TOPICS`: number of topics. If omitted, 20 is used.
- `--alpha ALPHA`: alpha parameter of LDA. If omitted, "auto" is used.
- `--beta BETA`: beta parameter of LDA. If omitted, "auto" is used.
- `--no_below NO_BELOW`: keep tokens which are contained in at least this number of documents. If omitted, 20 is used.
- `--no_above NO_ABOVE`: keep tokens which are contained in no more than this fraction of documents (fraction of total corpus size, not an absolute number). If omitted, 0.5 is used.
- `--seed SEED`: seed to be used in training.
- `--model`: if set, the LDA model is saved to the directory `<outdir>/lda_model`. The model is saved with name "model".
- `--load-model LOAD_MODEL`: path to a directory where a previously trained model is saved. Inside this directory, the model named "model" is searched for. The loaded model is used with the dataset file to generate the topics and the topics-documents association.
- `--no_timestamp`: if set, no timestamp is added to the topics file names.
- `--config | -c CONFIG`: path to a toml config file like the one used by the slrkit lda command. It overrides all the CLI arguments.
The following example extracts the topics from the dataset `dataset_preproc.csv`, using the classified terms in `dataset_terms.csv`, and saves the results in `/path/to/outdir`:
lda.py dataset_preproc.csv dataset_terms.csv /path/to/outdir
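If a model saved with `--model` is a standard Gensim LdaModel (as the Gensim reference above suggests), it can presumably be inspected directly with Gensim; a minimal sketch, with an illustrative path:

```python
# Sketch: inspect a model saved by lda.py with --model (the path is illustrative).
from gensim.models import LdaModel

lda = LdaModel.load('/path/to/outdir/lda_model/model')

# print the top 10 words of every topic
for topic_id, words in lda.print_topics(num_topics=-1, num_words=10):
    print(topic_id, words)
```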
- ACTION: Uses a GA (genetic algorithm) to search for the best LDA model parameters; outputs all the trained models, plus the extracted topics and the association between topics and documents produced by the best model.
- INPUT: The TSV file produced by `postprocess.py` (it works on the column `abstract_filtered`), the terms TSV file classified with FAWOC, and a toml file with the parameters used by the GA.
- OUTPUT: All the trained models, in a format suitable to be used with the `lda.py` script, and a tsv file that summarizes all the results. It also outputs the extracted topics and the association between topics and documents produced by the best model.
The script searches for the best combination (in terms of coherence) of the number of topics, alpha, beta, no-below and no-above parameters. Each combination of parameters (a.k.a. an individual) is represented as
(topics, alpha_val, beta, no_above, no_below, alpha_type)
Each parameter has the following meaning:
- `topics`: number of topics. It is an integer number;
- `alpha_val`: value of the alpha parameter. It is a floating point number;
- `beta`: value of the beta parameter. It is a floating point number;
- `no_above`: value of the no-above parameter. It is a floating point number;
- `no_below`: value of the no-below parameter. It is an integer number;
- `alpha_type`: this integer number tells whether the alpha parameter must take the value of `alpha_val` or one of the string values. The allowed values are:
  - `0`: use `alpha_val`;
  - `1`: use the string `symmetrical`;
  - `-1`: use the string `asymmetrical`.
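As a sketch of how `alpha_type` selects the alpha value (illustration only; the exact string constants used internally by the script may differ):

```python
# Illustrative decoding of the alpha parameter of an individual (not the real code).
def decode_alpha(alpha_val, alpha_type):
    if alpha_type == 0:
        return alpha_val        # use the numeric value alpha_val
    if alpha_type == 1:
        return 'symmetrical'    # symmetric alpha
    return 'asymmetrical'       # alpha_type == -1: asymmetric alpha
```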
It uses a GA (the mu+lambda genetic algorithm) to find the best model. The mu+lambda GA starts with an initial population. It then creates lambda new individuals (a.k.a. solutions) by replication, mutation or crossover (only one of these operations is applied to a single individual). From the initial population plus the new lambda individuals, the algorithm selects the mu individuals that "survive" to the next generation. This procedure is repeated num-generations times. The selection procedure is a tournament, where a fixed number of individuals are randomly chosen to participate; the individual in the tournament with the best coherence is selected to pass to the next generation. The tournament is applied mu times in order to select the mu individuals of the next generation. The GA uses a gaussian mutation: if an individual must be mutated, the mutation randomly selects which parameters have to be mutated, and for each selected parameter a random number is drawn from a gaussian distribution and added to the parameter. Each parameter has its own gaussian distribution. The crossover randomly selects a set of parameters that the two individuals exchange. More information and references about the mu+lambda GA can be found here.
All the parameters of the GA are taken from a TOML version 1.0.0 file. The format of this file is the following:
- `limits`: this section contains the ranges of the parameters:
  - `min_topics`: minimum number of topics;
  - `max_topics`: maximum number of topics;
  - `max_no_below`: maximum value of the no-below parameter. The minimum is always 1. A value of -1 means a tenth of the number of documents;
  - `min_no_above`: minimum value of the no-above parameter. The maximum is always 1.
- `algorithm`: this section contains the parameters used by the GA:
  - `mu`: number of individuals that pass to each generation;
  - `lambda`: number of individuals that are generated at each generation;
  - `initial`: size of the initial population;
  - `generations`: number of generations;
  - `tournament_size`: number of individuals randomly selected for the selection tournament.
- `probabilities`: this section contains the probabilities used by the script:
  - `mutate`: probability of mutation. The sum of this probability and the mate probability must be less than 1;
  - `component_mutation`: probability of mutation of each individual component;
  - `mate`: probability of crossover (also called mating). The sum of this probability and the mutate probability must be less than 1;
  - `no_filter`: probability that a new individual is created with no term filter (no_above = no_below = 1).
- `mutate`: this section contains the parameters of the gaussian distributions used by the mutation for each parameter:
  - `topics.mu` and `topics.sigma` are the mean value and the standard deviation for the topics parameter;
  - `alpha_val.mu` and `alpha_val.sigma` are the mean value and the standard deviation for the value of the alpha parameter;
  - `beta.mu` and `beta.sigma` are the mean value and the standard deviation for the beta parameter;
  - `no_above.mu` and `no_above.sigma` are the mean value and the standard deviation for the no_above parameter;
  - `no_below.mu` and `no_below.sigma` are the mean value and the standard deviation for the no_below parameter;
  - `alpha_type.mu` and `alpha_type.sigma` are the mean value and the standard deviation for the type of the alpha parameter.
An example of this file can be found in the `ga_param.toml` file. The default values in that file can be a good starting point for the GA parameters, but they require a check by the user. In particular, the probabilities must be checked to ensure some variation in the individuals, and the `component_mutation` probability must be checked to ensure that the mutation operator applies some variation to each mutating individual. The parameters in the `mutate` section are also important: `topics.sigma` must be chosen taking into account the range of possible numbers of topics, and `no_below.sigma` must be chosen taking into account the number of documents used in training. The other defaults are usually fine to guarantee some variation of the parameters and can be left untouched.
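Purely as an illustration of the structure described above, such a file might look like the following (the values are placeholders, not tuned recommendations, and the exact key layout of the `mutate` section is an assumption; refer to `ga_param.toml` for the real defaults):

```toml
[limits]
min_topics = 5
max_topics = 40
max_no_below = -1     # -1 means a tenth of the number of documents
min_no_above = 0.3

[algorithm]
mu = 20
lambda = 40
initial = 100
generations = 20
tournament_size = 4

[probabilities]
mutate = 0.5
component_mutation = 0.5
mate = 0.3
no_filter = 0.05

[mutate]
topics.mu = 0.0
topics.sigma = 5.0
alpha_val.mu = 0.0
alpha_val.sigma = 1.0
beta.mu = 0.0
beta.sigma = 1.0
no_above.mu = 0.0
no_above.sigma = 0.1
no_below.mu = 0.0
no_below.sigma = 10.0
alpha_type.mu = 0.0
alpha_type.sigma = 1.0
```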
Each trained model is assigned a UUID. The script outputs all the models in `<outdir>/<date>_<time>_lda_results/<UUID>`.
For each trained model, a toml file is produced with all the parameters already set to use the corresponding model with the `lda.py` script. These toml files are saved in `<outdir>/<date>_<time>_lda_results/<UUID>.toml`, and can be loaded in the `lda.py` script using its `--config` option.
It also outputs a tsv file in `<outdir>/<date>_<time>_lda_results/results.csv` with the following format:
- `id`: progressive identification number;
- `topics`: number of topics;
- `alpha`: alpha value;
- `beta`: beta value;
- `no_below`: no-below value;
- `no_above`: no-above value;
- `coherence`: coherence score of the model;
- `times`: time spent evaluating this model;
- `seed`: seed used;
- `uuid`: UUID of the model;
- `num_docs`: number of documents;
- `num_not_empty`: number of documents that are not empty after filtering.
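A quick way to inspect these results and pick the best model, as a sketch (the path is illustrative, and `sep='\t'` assumes the file is tab-separated as stated above):

```python
# Sketch: find the model with the best coherence in results.csv.
import pandas as pd

results = pd.read_csv('outdir/2021-07-14_101211_lda_results/results.csv', sep='\t')
best = results.loc[results['coherence'].idxmax()]
print(best['uuid'], best['topics'], best['coherence'])
```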
The script also outputs the extracted topics and the topics-documents association produced by the best model. The topics are output in `<outdir>/lda_terms-topics_<date>_<time>.json` and the topics assigned to each document in `<outdir>/lda_docs-topics_<date>_<time>.json`.
IMPORTANT: there are some issues with the reproducibility of the LDA training. Setting the `seed` option (see below) is not enough to guarantee the reproducibility of the experiment. It is also necessary to set the environment variable `PYTHONHASHSEED` to `0`.
The following command sets the variable for a single run in a Linux shell:
PYTHONHASHSEED=0 python3 lda_ga.py ...
Using a saved model also requires the same seed used for training and `PYTHONHASHSEED` set to `0`.
More information on the `PYTHONHASHSEED` variable can be found here.
Positional:
- `preproc_file`: path to the preprocess file with the text to process;
- `terms_file`: path to the file with the classified terms;
- `outdir`: path to the directory where to save the results. If omitted, the current directory is used.
Optional arguments:
- `--text-column | -t TARGET_COLUMN`: column in preproc_file to process. If omitted, 'abstract_lem' is used.
- `--title-column TITLE`: column in preproc_file to use as the document title. If omitted, 'title' is used.
- `--seed SEED`: seed to be used in training.
- `--placeholder | -p PLACEHOLDER`: placeholder for barrier words. Also used as a prefix for the relevant words. Default: '@'.
- `--delimiter DELIMITER`: delimiter used in preproc_file. Default '\t'.
- `--no_timestamp`: if set, no timestamp is added to the topics file names.
- `--logfile LOGFILE`: log file name. If omitted, 'slr-kit.log' is used.
The following example searches for the best LDA model for `dataset_preproc.csv`, using the terms in `dataset_terms.csv` and the GA parameters in `ga_param.toml`, saving the results in `/path/to/outdir`:
lda_ga.py dataset_preproc.csv dataset_terms.csv ga_param.toml /path/to/outdir
- ACTION: Extracts a list of terms classified as stopwords from the terms file.
- INPUT: CSV file with the list of terms classified by FAWOC.
- OUTPUT: TXT file containing the list of stopwords.
The script reads a file classified by FAWOC and searches for the terms labelled as `stopword`.
The output is a TXT file with one stopword per line.
This is the format used by `preprocess.py` for the lists of stopwords.
Positional arguments:
- `terms_file`: path to the file with the classified terms;
- `outfile`: output file
The following example extracts the stopwords from `dataset_terms.csv` and saves the list in `stopwords.txt`:
stopword_extractor.py dataset_terms.csv stopwords.txt
- ACTION: Merges an old classified terms file with a new terms file.
- INPUT: a CSV file with the classified terms and another CSV file with new terms.
- OUTPUT: the merged CSV file.

Usage: `merge_labels.py [-h] old new FILENAME`
This script is used to recover a classification that was already done. If a new list of terms is produced (e.g., after changing some parameters in `preprocess.py`), this script allows recovering the old classification, taking the labels of the already classified terms and applying them to the new list.
This script also searches for a `fawoc_data.tsv` file associated with the new classification. If this file is found, it is used to create a `fawoc_data.tsv` file associated with the merged classification.
Positional arguments:
- `old`: old CSV data file, partially classified;
- `new`: new CSV data file to be classified;
- `output`: output file name
The following example recovers the classification made in `dataset_terms.csv`, transferring the applied labels to `dataset_terms_new.csv` and producing the `dataset_terms_merged.csv` file with the recovered classification:
merge_labels.py dataset_terms.csv dataset_terms_new.csv dataset_terms_merged.csv
- ACTION: Generates reports with various statistics regarding topics and papers. The reports are based on 2 templates; if they are not found in the working directory of this script, they are automatically copied from the `report_templates` directory.
- INPUT: the abstracts file containing the 'title', 'journal' and 'year' columns, and the lda json file with the topics assigned to each document. This file is usually called `lda_docs-topics_<date>_<time>.json`.
- OUTPUT: A directory named `report<timestamp>`, containing a figure in png format called `reportyear.png` and a `table` directory with three tex files containing tables in tex format. A latex report and a markdown report are also saved inside the directory, with the names `report_template.tex` and `report.md`.
This script prepares reports with some statistics about the analyzed documents. The statistics are:
- the number of papers classified in each topic, published in each considered year. Since the `lda.py` script calculates, for each paper, the probability that the paper is about a certain topic, this statistic is calculated as a real number;
- a plot of the statistic above;
- the number of papers published in each journal for each topic. This statistic is a real number for the same reason as the one above;
- the number of papers published in each journal in each considered year.
Positional:
- `abstract_file`: path to the abstracts file;
- `json_file`: path to the json docs file, generated by `lda.py`;
- `topics_file`: path to the json topic terms file, generated by `lda.py`.
Optional:
- `--dir | -d DIR`: path to the directory where the output will be saved;
- `--minyear | -m YEAR`: minimum year that will be used in the reports. If missing, the minimum year found in the data is used;
- `--maxyear | -M YEAR`: maximum year that will be used in the reports. If missing, the maximum year found in the data is used;
- `--plotsize | -p SIZE`: number of topics to be displayed in each subplot. If missing, 10 is used;
- `--compact | -c`: if set, the table listing all topics and terms will be in a compact style;
- `--no_stats | -s`: if set, the table listing all topics and terms won't list the coherence of each term with the topic.
topic_report.py dataset_abstract.csv lda_docs-topics_2021-07-14_101211.json
In this directory there are the 2 templates, `report_template.md` and `report_template.tex`, that are used by `topic_report.py`.
This is the Markdown template that will be automatically cloned and filled by `topic_report.py`.
The report will contain:
- A table containing data about Topic-Year evolution
- A graph about Topic-Year evolution
- A table containing data about the Journals that published the papers, and their topics distribution
- A table containing data about the Journals that published the papers, and the publication year distribution
- ACTION: Generates a list of the journals where the analyzed papers were published.
- INPUT: the abstracts file generated by `import_biblio.py`, with the `journal` column.
- OUTPUT: A list of journals in the format used by FAWOC.
This command produces a list suitable to be classified with FAWOC, in order to filter out the journals that are not relevant for the analysis.
The output file has the following format:
- `id`: a progressive identification number;
- `term`: the name of the journal;
- `label`: the label added by FAWOC to the journal. This field is left blank;
- `count`: the number of papers published in the journal.
Positional arguments:
- `abstract_file`: path to the abstracts file;
- `outfile`: path to the csv output file.
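A possible invocation, following the positional arguments above (the file names are only examples):

journal_lister.py dataset_abstracts.csv dataset_journals.csv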
- ACTION: Modifies the file produced by `import_biblio.py`, adding a field used to exclude the papers from the journals considered not relevant.
- INPUT: the file with the abstracts (the output of `import_biblio.py`) and the list of journals (produced by `journal_lister.py`) with the journals classified.
- OUTPUT: The file with the abstracts, with a field that tells whether a paper must be considered or not.
This command uses the classified list of journals to filter the papers.
The output file has a new `status` column. This field contains the value `good` for all the papers published in journals classified as `relevant` or `keyword`. All the other papers are marked with the `rejected` value.
This field is used by `preprocess.py` and `acronyms.py` to exclude the papers marked as `rejected`.
Positional arguments:
- `abstract_file`: path to the file with the abstracts of the papers;
- `journal_file`: path to the file with the classified journals.
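A possible invocation, following the positional arguments above (the file names are only examples; check the script's help for how the output destination is chosen):

filter_paper.py dataset_abstracts.csv dataset_journals.csv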