
About dataset and model training #10

Closed
jayer95 opened this issue Mar 22, 2021 · 7 comments

jayer95 commented Mar 22, 2021

Hello, I have some questions about the dataset and training.
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/2KI6IH
After downloading your dataset, I found a folder named Augmented_Samples_In_Parts whose contents are all NumPy-format files. After loading augmented_labels.npy, I found that it is a label file with 89,592 entries, and the categories are the 20 classes listed in Table 1 of the paper.

  1. Many Python scripts in your "develop" branch use paths like the following. What are "REU", "REU_Samples_and_Labels", and "labels.csv"? I cannot find the corresponding files in "dataverse_files.zip".
    data_directory = "/home/gamagee/workspace/gunshot_detection/REU_Data/REU_Samples_and_Labels/"
    label_csv = data_directory + "labels.csv"
    sample_directory = data_directory + "Samples/"

Is there a README.txt describing how to train the model? Any tips would be greatly appreciated. Thank you!

amorehead (Collaborator) commented

@jayer95,

Thank you for your patience in waiting for my response. I am glad to hear that you are excited about training your own models.

Regarding (1), the "python" directory in the develop branch (https://github.com/gabemagee/gunshot_detection/tree/develop/python) was used during the National Science Foundation Research Experience for Undergraduates (REU) program to develop the initial model architectures. Each person working on the project had their own subtasks and approaches to model training (hence the separate per-person directories). However, we ultimately used scripts from @rjhosler's directory to train the final models (one 1D model and two 2D models). As best I can recall, the scripts used to train our final three models are:

1D model - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/1D_train.py
2D model (128 x 64) - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram.py
2D model (128 x 128) - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram_train.py

amorehead (Collaborator) commented

@jayer95,

You'll notice that the scripts linked above reference pre-built NumPy array files, e.g. samples = np.load(BASE_DIRECTORY + "gunshot_augmented_sample_spectrograms.npy") and train_wav = np.load(BASE_DIRECTORY + "128_128_augmented_training_spectrograms.npy"). For the 2D models, we opted to preprocess all of the 1D NumPy sound samples into their respective spectrogram representations (128 x 64 or 128 x 128, depending on the model being trained). Our master copy of the full dataset - a 32 GB NumPy array containing all sound samples - is uploaded on Dataverse as 15 separate NumPy files; we split it up only to stay below Dataverse's individual file upload size limit.

amorehead (Collaborator) commented

@jayer95,

In short, to train your own models on our data, first decide how much of it you can use given your computational resources. If you choose to use all of it, you will need to run every script involving the full NumPy array on a machine with at least 32 GB of RAM (ideally more). Otherwise, you will need to parse the individual files it was split into, decide on a maximum number of samples to use, and filter out the remaining entries from the corresponding NumPy labels file.
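
For example, here is a minimal sketch of both options. The part-file naming pattern below is an assumption - check the actual filenames inside Augmented_Samples_In_Parts:

    import numpy as np

    # Hypothetical naming pattern for the 15 Dataverse part files (assumption).
    NUM_PARTS = 15  # lower this if you cannot fit the full 32 GB array in RAM
    part_files = ["Augmented_Samples_In_Parts/augmented_samples_%d.npy" % i
                  for i in range(NUM_PARTS)]

    # Load and concatenate the selected parts along the sample axis.
    samples = np.concatenate([np.load(f) for f in part_files], axis=0)

    # Keep only the labels that correspond to the loaded samples
    # (assumes the part files are concatenated in their original order).
    labels = np.load("augmented_labels.npy")[:len(samples)]
    assert len(samples) == len(labels)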

Then, to actually train a model: the 1D case should not be too difficult; you can reference Ryan's 1D CNN training script to see how we handled it. For the 2D cases, look at how I converted the original 1D samples into 2D spectrograms using a script like this one (https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/2D%20CNN/spectrogram_creation.py). Once you are familiar with how these spectrograms are created and how to customize their size (64 vs. 128 time frames), you can adapt the script to cache the spectrograms as another NumPy array on your local or remote storage. Then, change the filenames as needed in Ryan's 2D CNN training scripts (https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram.py and https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram_train.py).
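
As a rough illustration of that conversion (a sketch only - the sample rate and dB scaling here are assumptions on my part, so check spectrogram_creation.py for the exact parameters):

    import numpy as np
    import librosa

    SAMPLE_RATE = 22050   # assumed; two-second clips -> 44100 samples each
    HOP_LENGTH = 345 * 2  # 690 gives 64 time frames; 345 gives 128

    spectrograms = []
    for sample in samples:  # iterate over the rows of the 1D samples array
        mel = librosa.feature.melspectrogram(y=sample, sr=SAMPLE_RATE,
                                             n_mels=128, hop_length=HOP_LENGTH)
        spectrograms.append(librosa.power_to_db(mel, ref=np.max))

    # Cache the spectrograms so the training scripts can load them directly.
    spectrograms = np.array(spectrograms).reshape(-1, 128, 64, 1)
    np.save("gunshot_augmented_sample_spectrograms.npy", spectrograms)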

jayer95 commented Mar 31, 2021

@amorehead

Thank you very much for your professional reply.

In the meantime, I have been experimenting repeatedly to understand the source code you developed, cross-comparing the methods and directory names each developer used in their different subprojects.

I used np.concatenate to merge the 15 split .npy files into one file (augmented_samples.npy, 31.6 GB), and checked each axis to confirm that the 15 same-shaped arrays were merged correctly.

Then, referring to the code at the link below, I converted augmented_samples.npy into gunshot_augmented_sample_spectrograms_128_64.npy and gunshot_augmented_sample_spectrograms_128_128.npy (2.94 GB and 5.87 GB respectively):
https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/2D%20CNN/spectrogram_creation.py
The key changes for the 128 x 128 conversion are:

    HOP_LENGTH = 345 * 2                                       ->  HOP_LENGTH = 345
    samples = np.array(spectrograms).reshape(-1, 128, 64, 1)   ->  samples = np.array(spectrograms).reshape(-1, 128, 128, 1)

To verify that the spectrograms at both scales were converted correctly, I also converted gunshot_augmented_sample_spectrograms_128_64.npy into .jpg files (over 80,000 images, each 128 high by 64 wide) and visually inspected a few of them. They do indeed look similar to the images in the paper.
[attached image: test_disp]
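
For anyone wanting to do a similar visual check, a minimal sketch (matplotlib is my own choice here, not necessarily what the repository uses):

    import numpy as np
    import matplotlib.pyplot as plt

    spectrograms = np.load("gunshot_augmented_sample_spectrograms_128_64.npy")
    # Render the first few spectrograms as grayscale images for inspection.
    for i in range(5):
        plt.imsave("spec_%05d.jpg" % i, spectrograms[i].squeeze(), cmap="gray")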

Regarding labels.csv: although I cannot find it in the directory, I suspect it corresponds to augmented_labels.npy. I converted that file to .csv and found it has more than 80,000 rows. This file (.csv or .npy) is what distinguishes gunshot from non-gunshot samples.
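
The conversion I used was roughly the following (a sketch):

    import numpy as np

    labels = np.load("augmented_labels.npy")
    # Write one label per row for easy inspection in a spreadsheet.
    np.savetxt("augmented_labels.csv", labels, fmt="%s", delimiter=",")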

That is my progress so far. Fortunately, my machine has 128 GB of RAM, so the operations above run without problems.

I have also found a minor issue that I am still investigating. Regarding the train_index and test_index files you provide on GitHub, the index counts do not seem to match: together they contain fewer than 20,000 entries, whereas the actual total (train + test) should be more than 80,000.
https://github.com/gabemagee/gunshot_detection/blob/develop/raspberry_pi/indexes/training_set_indexes.npy
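
A quick check of what I am seeing (sketch):

    import numpy as np

    train_index = np.load("training_set_indexes.npy")
    test_index = np.load("testing_set_indexes.npy")
    # Compare the provided index counts against the ~89,592 total samples.
    print(len(train_index), len(test_index), len(train_index) + len(test_index))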

I still need to study Ryan's three training scripts at the locations you provided. Thank you again for your patience and professionalism in answering my questions.

jayer95 commented Apr 6, 2021

@amorehead

Hi, regarding the 2D model (128 x 128) training script:

2D model (128 x 128) - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram_train.py

train_wav = np.load(BASE_DIRECTORY + "128_128_augmented_training_spectrograms.npy")
test_wav = np.load(BASE_DIRECTORY + "128_128_augmented_testing_spectrograms.npy")
valid_wav = np.load(BASE_DIRECTORY + "128_128_augmented_validation_spectrograms.npy")

train_label = np.load(BASE_DIRECTORY + "augmented_training_labels.npy")
test_label = np.load(BASE_DIRECTORY + "augmented_testing_labels.npy")
valid_label = np.load(BASE_DIRECTORY + "augmented_validation_labels.npy")

Which script did you use to split the dataset into the training, testing, and validation sets?

I looked at the following three scripts, which appear to be the data-augmentation pipeline for the training, testing, and validation sets:
Training: https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/Data%20Preprocessing%20(Training)/data_augmentation.py
Testing: https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/Data%20Preprocessing%20(Testing)/data_augmentation.py
Validation: https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/Data%20Preprocessing%20(Validation)/data_augmentation.py

However, the 32 GB .npy file I downloaded from dataverse_files.zip appears to have already been augmented.
How can I split augmented_samples.npy (32 GB) and augmented_labels.npy into training, testing, and validation sets with the proportions mentioned in the paper?

In addition, the 128 x 64 training script loads the training, testing, and validation data in a different way. Which loading method did you use for your final training?

2D model (128 x 64) - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram.py

# ## Loading augmented NumPy files as NumPy arrays

# In[ ]:

samples = np.load(BASE_DIRECTORY + "gunshot_augmented_sample_spectrograms.npy")
labels = np.load(BASE_DIRECTORY + "gunshot_augmented_sound_labels.npy")

print("Successfully loaded all spectrograms and labels as NumPy arrays...")
print("Type of the spectrograms array:", samples.dtype)

# ## Instantiating a sample weights NumPy array

# In[ ]:

# The last 660 samples were recorded on a Raspberry Pi and are weighted more heavily.
sample_weights = np.array(
    [1 for normally_recorded_sample in range(len(samples) - 660)]
    + [15 for raspberry_pi_recorded_sample in range(660)])
# print("Shape of samples weights before splitting:", sample_weights.shape)

# ## Restructuring the label data

# In[ ]:

labels = np.array([("gun_shot" if label == 1 else "other") for label in labels])
label_binarizer = LabelBinarizer()
labels = label_binarizer.fit_transform(labels)
labels = np.hstack((labels, 1 - labels))

# ### Debugging of the sample and label data's shape (optional)

# In[ ]:

# print("Shape of samples array:", samples.shape)
# print("Shape of labels array:", labels.shape)

# ## Arranging the data

# In[ ]:

'''
kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(samples):
    train_wav, test_wav = samples[train_index], samples[test_index]
    train_label, test_label = labels[train_index], labels[test_index]
    train_weights, test_weights = sample_weights[train_index], sample_weights[test_index]
'''

all_index = np.arange(len(samples))
train_index = np.load("training_set_indexes.npy")
test_index = np.load("testing_set_indexes.npy")
valid_index = np.delete(all_index, list(train_index) + list(test_index))

print(train_index)
print(test_index)
print(valid_index)

train_wav, test_wav, valid_wav = samples[train_index], samples[test_index], samples[valid_index]
train_label, test_label, valid_label = labels[train_index], labels[test_index], labels[valid_index]
train_weights, test_weights, valid_weights = sample_weights[train_index], sample_weights[test_index], sample_weights[valid_index]

How can I obtain (or regenerate) the following two .npy index files?
https://github.com/gabemagee/gunshot_detection/blob/develop/raspberry_pi/indexes/training_set_indexes.npy
https://github.com/gabemagee/gunshot_detection/blob/develop/raspberry_pi/indexes/testing_set_indexes.npy
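
If those exact files cannot be recovered, a minimal sketch of generating comparable index files myself - this is a random split that will not reproduce the original one, and the 80/10 proportions below are placeholders rather than the paper's figures:

    import numpy as np

    labels = np.load("augmented_labels.npy")
    num_samples = len(labels)  # total number of spectrograms/labels
    rng = np.random.default_rng(seed=42)
    shuffled = rng.permutation(num_samples)

    # Placeholder proportions; substitute the splits reported in the paper.
    n_train = int(0.8 * num_samples)
    n_test = int(0.1 * num_samples)

    np.save("training_set_indexes.npy", shuffled[:n_train])
    np.save("testing_set_indexes.npy", shuffled[n_train:n_train + n_test])
    # The remaining indexes become the validation set, matching the
    # np.delete(all_index, ...) logic in spectrogram.py quoted above.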

amorehead (Collaborator) commented

@jayer95,

Thank you for your patience once again. You will most likely need to reach out to Ryan Hosler (@rjhosler), one of the coauthors of the paper, to get your latest questions answered. Since Ryan did most of the work in the "Ryan" directory, he will know best how to reproduce the dataset splits and the model training procedure. If it helps, the paper lists what I believe is his current email address ([email protected]). Please let me know if he is able to answer your questions.

Thanks once again.

jayer95 commented Apr 25, 2021

@amorehead
Thank you, I will prepare my questions and email him.
