From 75cd3457f5ba3c238ae7f6151ffb4280723f8325 Mon Sep 17 00:00:00 2001
From: Shiva Nadi
Date: Mon, 18 Mar 2024 11:27:20 +0100
Subject: [PATCH 1/3] Add readme

---
 README.md | 105 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 67 insertions(+), 38 deletions(-)

diff --git a/README.md b/README.md
index 0aded1b..54bb8ca 100644
--- a/README.md
+++ b/README.md
@@ -1,51 +1,80 @@
-# re-python-package
+# INTEREST
 
-This template repository is created by the [UU Research Engineering team](https://utrechtuniversity.github.io/research-engineering/) and is aimed to provide a simple project template for python package development.
+The code in this repository investigates how the sentiment of news articles on topics such as fossil fuels and green energy changes over the decades. The interest Python package offers a variety of methods for analysing the sentiment of news articles, from traditional dictionary-based approaches to cutting-edge similarity-based techniques. The methods are tested on a large dataset of news articles harvested from the National Library of the Netherlands ([KB](https://www.kb.nl)).
 
-The template includes:
-- Project directory structure
-- Project configuration using `pyproject.toml`
-- GitHub actions workflows for testing, linting, type checking and publishing on pypi
+## Getting Started
+Clone this repository to your workstation to obtain the example notebooks and Python scripts:
+```
+git clone https://github.com/UtrechtUniversity/historical-news-sentiment.git
+```
 
-Many other project templates exist, check for example this advanced [python template](https://github.com/NLeSC/python-template) by the NL eScience Center.
+### Prerequisites
+To install and run this project you need to have the following prerequisites installed.
+```
+- Python [>=3.9, <3.11]
+```
 
-## Dependencies
-This template uses:
-| Tool | Aim |
-| --- | --- |
-| setuptools | building |
-| flake8, pylint | code linting |
-| pytest | testing |
-| pydocstyle | checking docstrings |
-| mypy | type checking |
-| sphinx | documentation generation |
+### Installation
+To run the project, make sure to install the interest package that is part of this project.
+```
+pip install interest
+```
 
-If needed, most of these tools can be removed by simply removing the GitHub action that calls the tool, or by changing `pyproject.toml`
+### Built with
+These packages are automatically installed in the step above:
+* [scikit-learn](https://scikit-learn.org/stable/)
+* [SciPy](https://scipy.org)
 
-## How to use
+## Usage
+### 1. Preparation
+Harvested KB data is in XML format. Before proceeding, ensure that the data is prepared: organize it into a directory with one folder per newsletter, each containing JSON files compressed in the .gz format. These compressed JSON files hold metadata about the newsletter, along with a list of article titles and their corresponding bodies.
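+A minimal sketch of such a layout (the folder and file names here are purely illustrative):
+```
+data/
+├── newspaper_a/
+│   ├── issue_0001.json.gz
+│   └── issue_0002.json.gz
+└── newspaper_b/
+    └── issue_0001.json.gz
+```
+The snippet below converts raw KB XML exports into this structure (adjust the two placeholder paths to your setup):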
+```
+from pathlib import Path
+from interest.preprocessor.parser import XMLExtractor
+
+input_dir = "path/to/raw/xml/data"  # folder with the harvested XML files
+output_dir = "path/to/converted/json/compressed/output"  # destination for .json.gz files
+
+extractor = XMLExtractor(Path(input_dir), Path(output_dir))
+extractor.extract_xml_string()
+```
 
-### Step 1: Create new repository from this template
-Click `Use this template` at the top of this page to create a new repository using this template
+Navigate to the scripts folder and run:
+```
+python3 convert_input_files.py --input_dir path/to/raw/xml/data --output_dir path/to/converted/json/compressed/output
+```
 
-### Step 2: Change the name of your package in pyproject.toml
-- Change the name of the folder `package-name` to the name of your package
-- Open `pyproject.toml` and change `package-name` to the name of your package
-- Also change the authors and optionally any other items that you want to change
+### 2. Filtering
+To be completed...
 
-### Step 3: Change GitHub Actions workflow
-- Open `.github/workflows/python-package.yml`
-- Change `package-name` to the name of your package (line 21)
-- Many actions are commented out, uncomment them when you want to start using them.
+## About the Project
+**Date**: February 2024
 
-### Step 4: Replace this README file with your README
-- You may use this [README template](https://github.com/UtrechtUniversity/rse-project-templates/blob/master/README-template.md)
+**Researcher(s)**:
 
-### Step 5: Change the license file
-- Open `LICENSE`, change the copyright holder when required (line 3)
-- Or replace the entire license file if another license applies
+Pim Huijnen (p.huijnen@uu.nl)
 
-### Step 6: Add a citation file
-- Create a citation file for your repository using [cffinit](https://citation-file-format.github.io/cff-initializer-javascript/#/)
+**Research Software Engineer(s)**:
+
+- Parisa Zahedi (p.zahedi@uu.nl)
+- Shiva Nadi (s.nadi@uu.nl)
+- Matty Vermet (m.s.vermet@uu.nl)
+
+
-### Step 7: Publising on Pypi (optional/later)
-For publishing the package on Pypi you need to create [API tokens](https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries).
+### License
+
+The code in this project is released under [MIT license](LICENSE).
+
+## Contributing
+
+Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
+
+To contribute:
+
+1. Fork the Project
+2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
+3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
+4. Push to the Branch (`git push origin feature/AmazingFeature`)
+5. Open a Pull Request
+
+## Contact
+
+Pim Huijnen - p.huijnen@uu.nl
+
+Project Link: [https://github.com/UtrechtUniversity/historical-news-sentiment](https://github.com/UtrechtUniversity/historical-news-sentiment)
\ No newline at end of file

From 5edb65a740835b2e357ee9361a13d0fd418e6187 Mon Sep 17 00:00:00 2001
From: parisa-zahedi
Date: Thu, 4 Apr 2024 16:58:46 +0200
Subject: [PATCH 2/3] complete steps of the pipeline

---
 README.md | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 159 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 54bb8ca..6ec8809 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,14 @@
 # INTEREST
 
-The code in this repository investigates how the sentiment of news articles on topics such as fossil fuels and green energy changes over the decades. The interest Python package offers a variety of methods for analysing the sentiment of news articles, from traditional dictionary-based approaches to cutting-edge similarity-based techniques. The methods are tested on a large dataset of news articles harvested from the National Library of the Netherlands ([KB](https://www.kb.nl)).
+The code in this repository implement a pipeline to extract specific articles from a large corpus.
+
+Currently, this tool is tailored for the [Delpher Kranten](https://www.delpher.nl/nl/kranten) corpus, but it can be adapted for other corpora as well.
+
+Articles can be filtered based on individual or multiple features such as title, year, decade, or a set of keywords. To select the most relevant articles, we utilize models such as tf-idf. These models are configurable and extendable.
+
 
 ## Getting Started
-Clone this repository to your workstation to obtain the example notebooks and Python scripts:
+Clone this repository to your workstation to obtain the examples and Python scripts:
 ```
 git clone https://github.com/UtrechtUniversity/historical-news-sentiment.git
 ```
@@ -15,19 +20,69 @@ To install and run this project you need to have the following prerequisites ins
 ```
 
 ### Installation
+#### Option 1 - Install interest package
 To run the project, make sure to install the interest package that is part of this project.
 ```
 pip install interest
 ```
+#### Option 2 - Run from source code
+If you want to run the scripts without installing the package, you need to:
+
+- Install the build requirements:
+```commandline
+pip install setuptools wheel
+```
+Then change your current working directory to the location of your ```pyproject.toml``` file and run:
+```
+python -m build
+pip install .
+```
+- Set the PYTHONPATH environment variable:
+On Linux and macOS, you might have to set PYTHONPATH to point to the repository directory.
+```commandline
+export PYTHONPATH="/path/to/historical-news-sentiment:${PYTHONPATH}"
+```
 
 ### Built with
 These packages are automatically installed in the step above:
 * [scikit-learn](https://scikit-learn.org/stable/)
 * [SciPy](https://scipy.org)
+* [NumPy](https://numpy.org)
+* [spaCy](https://spacy.io)
+* [pandas](https://pandas.pydata.org)
 
 ## Usage
 ### 1. Preparation
-Harvested KB data is in XML format. Before proceeding, ensure that the data is prepared: organize it into a directory with one folder per newsletter, each containing JSON files compressed in the .gz format. These compressed JSON files hold metadata about the newsletter, along with a list of article titles and their corresponding bodies.
+#### Data Preparation
+Before proceeding, ensure that your data is prepared in the expected format: a set of JSON files compressed in the .gz format. Each JSON file contains metadata related to a newsletter, magazine, etc., as well as a list of article titles and their corresponding bodies. These files may be organized within different folders or sub-folders.
+Below is a snapshot of the JSON file format:
+```json
+{
+    "newsletter_metadata": {
+        "title": "Newspaper title ..",
+        "language": "NL",
+        "date": "1878-04-29",
+        ...
+    },
+    "articles": {
+        "1": {
+            "title": "title of article1 ",
+            "body": [
+                "paragraph 1 ....",
+                "paragraph 2...."
+            ]
+        },
+        "2": {
+            "title": "title of article2",
+            "body": [
+                "text..."
+            ]
+        }
+    }
+}
+```
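+
+As a quick sanity check, a file in this format can be inspected with a few lines of Python (the path below is a placeholder for one of your converted files):
+```python
+import gzip
+import json
+
+# Open one compressed newsletter file and list its article titles.
+with gzip.open("path/to/KRANTEN_KBPERS01_000002100.json.gz", "rt", encoding="utf-8") as f:
+    data = json.load(f)
+
+print(data["newsletter_metadata"]["date"])
+for article_id, article in data["articles"].items():
+    print(article_id, article["title"])
+```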
+
+In our use case, the harvested KB data is in XML format. We have provided the following script to transform the original data into the expected format.
 ```
 from pathlib import Path
 from interest.preprocessor.parser import XMLExtractor
 
 input_dir = "path/to/raw/xml/data"  # folder with the harvested XML files
 output_dir = "path/to/converted/json/compressed/output"  # destination for .json.gz files
 
 extractor = XMLExtractor(Path(input_dir), Path(output_dir))
 extractor.extract_xml_string()
 ```
 
@@ -39,10 +94,110 @@ Navigate to the scripts folder and run:
 ```
 python3 convert_input_files.py --input_dir path/to/raw/xml/data --output_dir path/to/converted/json/compressed/output
 ```
+#### Customize the input file
+
+To define a corpus with a new data format, you should:
+
+- add a new input_file_type to [INPUT_FILE_TYPES](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/filter/__init__.py)
+- implement a class that inherits from [input_file.py](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/filter/input_file.py); this class reads the new data format, as sketched below. In our case study we defined [delpher_kranten.py](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/filter/delpher_kranten.py).
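+
+A hypothetical skeleton of such a class (the class and method names below are illustrative assumptions, not the package's real API; the actual interface to implement is defined in [input_file.py](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/filter/input_file.py)):
+```python
+from interest.filter.input_file import InputFile  # assumed base-class name
+
+
+class MyCorpusFile(InputFile):
+    """Reader for a hypothetical corpus format.
+
+    Illustrative only: mirror the real abstract methods declared in
+    interest/filter/input_file.py instead of the placeholders below.
+    """
+
+    def read_metadata(self):
+        # Parse and return the document-level metadata of one input file.
+        raise NotImplementedError
+
+    def read_articles(self):
+        # Yield (article_id, title, body) tuples from one input file.
+        raise NotImplementedError
+```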
 
 ### 2. Filtering
-To be completed...
+In this step, you may select articles based on a filter or a collection of filters. Articles can be filtered by title, year, decade, or a set of keywords defined in the ```config.json``` file.
+```json
+{
+    "filters": [
+        {
+            "type": "TitleFilter",
+            "title": "example"
+        },
+        {
+            "type": "YearFilter",
+            "year": 2022
+        },
+        {
+            "type": "DecadeFilter",
+            "decade": 1960
+        },
+        {
+            "type": "KeywordsFilter",
+            "keywords": ["sustainability", "green"]
+        }
+    ]
+}
+```
+Run the following to filter the articles:
+```commandline
+python3 scripts/step1_filter_articles.py --input-dir "path/to/converted/json/compressed/output/" --output-dir "output_filter/" --input-type "delpher_kranten" --glob "*.gz"
+```
+In our case, the input type is "delpher_kranten" and the input data is a set of compressed JSON files with the ```.gz``` extension.
+
+The output of this script is a JSON file for each selected article in the following format:
+```json
+{
+    "file_path": "output/transfered_data/00/KRANTEN_KBPERS01_000002100.json.gz",
+    "article_id": "5",
+    "Date": "1878-04-29",
+    "Title": "Opregte Haarlemsche Courant"
+}
+```
+### 3. Categorization by timestamp
+The output files generated in the previous step are categorized based on a specified [period-type](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/interest/temporal_categorization/__init__.py), such as ```year``` or ```decade```. This categorization is essential for subsequent steps, especially if you intend to apply tf-idf or other models to specific periods. In our case, we applied tf-idf per decade.
+
+```commandline
+python3 scripts/step2_categorize_by_timestamp.py --input-dir "output_filter/" --glob "*.json" --period-type "decade" --output-dir "output_timestamped/"
+```
+The output consists of a .csv file for each period, such as one file per decade, containing the ```file_path``` and ```article_id``` of the selected articles.
+
+### 4. Select final articles
+This step is applicable when articles are filtered (in step 2) using a set of keywords.
+By utilizing tf-idf, the most relevant articles related to the specified topic (defined by the provided keywords) are selected.
+
+Before applying tf-idf, articles containing any of the specified keywords in their title are selected.
+
+From the remaining articles, to choose the most relevant ones, you can specify one of the following criteria in [config.json](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/config.json):
+
+- Threshold for the tf-idf score
+- Maximum number of selected articles with the top scores
+
+```commandline
+"article_selector":
+    {
+      "type": "threshold",
+      "value": "0.02"
+    },
+
+  OR
+
+"article_selector":
+    {
+      "type": "num_articles",
+      "value": "200"
+    },
+```
+The following script adds a new column, ```selected```, to the .csv files from the previous step:
+```commandline
+python3 scripts/3_select_final_articles.py --input_dir "output/output_timestamped/"
+```
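+
+Conceptually, the selection works like the following sketch (a simplified stand-in for the package's actual implementation, built on scikit-learn; the texts, threshold, and top-N values are examples):
+```python
+from sklearn.feature_extraction.text import TfidfVectorizer
+
+docs = ["a report on green energy ...", "an unrelated article ..."]  # articles of one period
+keywords = ["sustainability", "green"]
+
+# Score each article by the summed tf-idf weight of the keyword terms.
+vectorizer = TfidfVectorizer(vocabulary=keywords)
+scores = vectorizer.fit_transform(docs).sum(axis=1).A1
+
+# "threshold" selector: keep every article scoring above the configured value.
+selected = [doc for doc, score in zip(docs, scores) if score > 0.02]
+
+# "num_articles" selector: keep the 200 highest-scoring articles instead.
+top_n = [docs[i] for i in scores.argsort()[::-1][:200]]
+```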
+
+### 5. Generate output
+As the final step of the pipeline, the text of the selected articles is saved in a .csv file, which can be used for manual labeling. The user has the option to choose whether the text should be divided into paragraphs.
+This feature can be set in [config.json](https://github.com/UtrechtUniversity/historical-news-sentiment/blob/main/config.json):
+```commandline
+"output_unit": "paragraph"
+
+OR
+
+"output_unit": "text"
+```
+
+```commandline
+python3 scripts/step4_generate_output.py --input_dir "output/output_timestamped/" --output-dir "output/output_results/" --glob "*.csv"
+```
 
 ## About the Project
 **Date**: February 2024

From d66b4c0a6773dc676b6871568ffccad37a476578 Mon Sep 17 00:00:00 2001
From: parisa-zahedi
Date: Thu, 4 Apr 2024 17:03:32 +0200
Subject: [PATCH 3/3] Update README.md

fix a grammar error
---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 6ec8809..c8f6ce1 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # INTEREST
 
-The code in this repository implement a pipeline to extract specific articles from a large corpus.
+The code in this repository implements a pipeline to extract specific articles from a large corpus.
 
 Currently, this tool is tailored for the [Delpher Kranten](https://www.delpher.nl/nl/kranten) corpus, but it can be adapted for other corpora as well.
 
@@ -232,4 +232,4 @@
 
 Pim Huijnen - p.huijnen@uu.nl
 
-Project Link: [https://github.com/UtrechtUniversity/historical-news-sentiment](https://github.com/UtrechtUniversity/historical-news-sentiment)
\ No newline at end of file
+Project Link: [https://github.com/UtrechtUniversity/historical-news-sentiment](https://github.com/UtrechtUniversity/historical-news-sentiment)