mlb-data

This repo contains scripts to create the MLB dataset introduced in the paper "Data-to-text Generation with Entity Modeling" (Puduppully, R., Dong, L., & Lapata, M.; ACL 2019).

Prerequisites

Install mlbgame-api:

pip install git+https://github.com/ratishsp/mlbgame-api.git

Steps to create the dataset

Run the following scripts in sequence:

boxscore_data.py collects the game records for one year. It requires the argument -year, which takes an integer from 0 to 10: 0 collects the records for 2018, 1 for 2017, and so on.

python boxscore_data.py -year 1 -output ~/mlb-data/api-output/  # get the data for year 2017
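
The -year argument is an offset, not a calendar year. The sketch below spells out that mapping and prints one command per offset. It assumes the pattern stated above (0 for 2018, 1 for 2017) continues linearly up to 10. The helper name year_for_index is hypothetical and is not part of this repo.

```python
LATEST_YEAR = 2018  # the year collected for "-year 0", per the description above


def year_for_index(index: int) -> int:
    """Map a -year value to the calendar year it is assumed to select."""
    if not 0 <= index <= 10:
        raise ValueError("-year is documented as taking values 0 through 10")
    # Assumption: each step of the offset goes back exactly one year.
    return LATEST_YEAR - index


if __name__ == "__main__":
    # Print the boxscore_data.py command for every documented offset.
    for index in range(11):
        print(
            f"python boxscore_data.py -year {index} "
            f"-output ~/mlb-data/api-output/  # {year_for_index(index)}"
        )
```

Collecting every year means running boxscore_data.py once per offset, as the printed commands show.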

Alternatively, you can download the dataset containing box, line and play-by-play scores from https://drive.google.com/drive/folders/1jLU5wYjic2BR21iOLn9Tkv415AWkFqfj?usp=sharing

extract_summaries_from_recap_html.py extracts the recaps from the HTML pages. The names of the HTML files to be downloaded are listed in the file recap_file_names.txt.

python extract_summaries_from_recap_html.py -recaps ~/mlb-data/recap_file_names.txt -output_folder ~/mlb-data/html-output/

clean_summaries.py removes quotations and text incidental to the game from the extracted summaries.

python clean_summaries.py -input_folder ~/mlb-data/html-output/ -output_folder ~/mlb-data/html-output-cleaned/
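
To illustrate what removing quotations can look like, here is a minimal sketch that drops every sentence containing a double quote. It is not the logic of clean_summaries.py, whose actual rules are not described here. The function name drop_quoted_sentences, the naive sentence splitter and the sample text are all assumptions made for illustration.

```python
import re

# Assumption: a "quotation" is any sentence containing a straight or curly double quote.
QUOTE_CHARS = re.compile(r'["\u201c\u201d]')


def drop_quoted_sentences(paragraph: str) -> str:
    """Return the paragraph without the sentences that contain a double quote."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    kept = [s for s in sentences if s and not QUOTE_CHARS.search(s)]
    return " ".join(kept)


if __name__ == "__main__":
    sample = (
        'The home team won 5-2. "We played well," the manager said. '
        "The starter struck out nine."
    )
    print(drop_quoted_sentences(sample))
```

A real cleaner would also need to handle quotations that span several sentences, which this sketch does not attempt.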

create_combined_dataset.py combines the box scores and the cleaned summaries into a single dataset.

python create_combined_dataset.py -input_folder ~/mlb-data/api-output/ -input_summaries ~/mlb-data/html-output-cleaned/ -output_folder ~/mlb-data/combined/
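
Conceptually, this step pairs each game's records with its summary. The sketch below shows one way to do such a join. It assumes that both sides can be keyed by a shared game identifier. The function name combine, the field names game_id and summary, and the in-memory dictionaries are hypothetical and do not reflect the file formats used by create_combined_dataset.py.

```python
from typing import Dict, List


def combine(boxscores: Dict[str, dict], summaries: Dict[str, str]) -> List[dict]:
    """Join box score records and summaries on a shared (hypothetical) game id."""
    combined = []
    # Keep only games that have both a box score and a summary.
    for game_id in sorted(boxscores.keys() & summaries.keys()):
        record = dict(boxscores[game_id])  # copy so the input is not mutated
        record["game_id"] = game_id
        record["summary"] = summaries[game_id]
        combined.append(record)
    return combined


if __name__ == "__main__":
    boxscores = {"g1": {"home_runs": 5}, "g2": {"home_runs": 3}}
    summaries = {"g1": "The home team won.", "g3": "No box score for this one."}
    print(combine(boxscores, summaries))
```

Games missing either side are dropped in this sketch; whether the real script does the same is not stated here.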

preproc.py preprocesses the dataset into train, validation and test splits. The splits are defined in the file mlb_split_keys.txt.

python preproc.py -input ~/mlb-data/combined/ -mlb_split_keys ~/mlb-data/mlb_split_keys.txt -output ~/mlb-data/splits/
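
The five steps above can be chained in a small driver. The sketch below only assembles the commands already listed in this README and prints them. It runs them only when started with --run, a flag that belongs to this sketch and not to the repo. It assumes the scripts are in the current directory and that all inputs and outputs live under ~/mlb-data, as in the examples above. The function name build_commands is hypothetical.

```python
import os
import subprocess
import sys


def build_commands(year_indices, base):
    """Return the README's commands, in order, as argument lists."""
    # Step 1 is repeated once per -year offset.
    commands = [
        ["python", "boxscore_data.py", "-year", str(i), "-output", f"{base}/api-output/"]
        for i in year_indices
    ]
    # Steps 2 to 5 run once each, after all box score data is collected.
    commands += [
        ["python", "extract_summaries_from_recap_html.py",
         "-recaps", f"{base}/recap_file_names.txt",
         "-output_folder", f"{base}/html-output/"],
        ["python", "clean_summaries.py",
         "-input_folder", f"{base}/html-output/",
         "-output_folder", f"{base}/html-output-cleaned/"],
        ["python", "create_combined_dataset.py",
         "-input_folder", f"{base}/api-output/",
         "-input_summaries", f"{base}/html-output-cleaned/",
         "-output_folder", f"{base}/combined/"],
        ["python", "preproc.py",
         "-input", f"{base}/combined/",
         "-mlb_split_keys", f"{base}/mlb_split_keys.txt",
         "-output", f"{base}/splits/"],
    ]
    return commands


if __name__ == "__main__":
    base = os.path.expanduser("~/mlb-data")
    dry_run = "--run" not in sys.argv[1:]  # default: print only
    for command in build_commands([1], base):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)
```

Stopping at the first failure matters here because each step reads the output folder of the previous one.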

Alternatively, you can download the JSON files from https://drive.google.com/drive/folders/1G4iIE-02icAU2-5skvLlTEPWDQQj1ss4?usp=sharing
