
About dataset and model training #10

Closed
jayer95 opened this issue Mar 22, 2021 · 7 comments

jayer95 commented Mar 22, 2021

Hello, I have some questions about the dataset and training.
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/2KI6IH
After downloading your dataset, I found a folder named Augmented_Samples_In_Parts whose contents are all NumPy-format files. After loading augmented_labels.npy, I found that it is a label file with 89,592 entries, and the categories are the 20 classes listed in Table 1 of the paper.

  1. Many Python scripts in your "develop" branch use paths like the following. What are "REU", "REU_Samples_and_Labels", and "labels.csv"? I cannot find the corresponding files in "dataverse_files.zip".
    data_directory = "/home/gamagee/workspace/gunshot_detection/REU_Data/REU_Samples_and_Labels/"
    label_csv = data_directory + "labels.csv"
    sample_directory = data_directory + "Samples/"

Is there a README.txt describing how to train the model? Any tips would be greatly appreciated. Thank you!

amorehead (Collaborator) commented

@jayer95,

Thank you for your patience in waiting for my response. I am glad to hear that you are excited about training your own models.

Regarding (1), the "python" directory in the develop branch (https://github.com/gabemagee/gunshot_detection/tree/develop/python) was used during the National Science Foundation Research Experience for Undergraduates (REU) program to develop the initial model architectures. Each person working on the project had their own subtasks and approaches to model training (hence the separate per-person directories). However, we ultimately used scripts from @rjhosler's directory to train the final models (one 1D model and two 2D models). As best I can recall, the scripts used to train our final three models are:

1D model - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/1D_train.py
2D model (128 x 64) - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram.py
2D model (128 x 128) - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram_train.py

amorehead (Collaborator) commented

@jayer95,

You'll notice that the scripts linked above reference pre-built NumPy array files, e.g. samples = np.load(BASE_DIRECTORY + "gunshot_augmented_sample_spectrograms.npy") and train_wav = np.load(BASE_DIRECTORY + "128_128_augmented_training_spectrograms.npy"). For the 2D models, we opted to preprocess all of the 1D NumPy sound samples into their respective spectrogram representations (128 x 64 or 128 x 128, depending on the model being trained). Our master copy of the full dataset - a 32 GB NumPy array containing all sound samples - is uploaded on Dataverse as 15 separate NumPy files; we split it up only to stay below Dataverse's individual file upload size limit.

amorehead (Collaborator) commented

@jayer95,

In short, to train your own models on our data, first decide how much of it you can use given your computational resources. If you choose to use all of it, you will need to run every script involving the full NumPy array on a machine with at least 32 GB of RAM (ideally more). Otherwise, you will need to parse the individual files it was split into, decide on a maximum number of samples to use, and filter out the remaining entries from the corresponding NumPy labels file.
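
For example, here is a minimal sketch of both options. The part-file naming pattern below is an assumption - check the actual filenames inside Augmented_Samples_In_Parts:

    import numpy as np

    # Hypothetical naming pattern for the 15 Dataverse part files (assumption).
    NUM_PARTS = 15  # lower this if you cannot fit the full 32 GB array in RAM
    part_files = ["Augmented_Samples_In_Parts/augmented_samples_%d.npy" % i
                  for i in range(NUM_PARTS)]

    # Load and concatenate the selected parts along the sample axis.
    samples = np.concatenate([np.load(f) for f in part_files], axis=0)

    # Keep only the labels that correspond to the loaded samples
    # (assumes the part files are concatenated in their original order).
    labels = np.load("augmented_labels.npy")[:len(samples)]
    assert len(samples) == len(labels)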

Then, to actually train a model: the 1D case should not be too difficult; you can reference Ryan's 1D CNN training script to see how we handled it. For the 2D cases, look at how I converted the original 1D samples into 2D spectrograms using a script like this one (https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/2D%20CNN/spectrogram_creation.py). Once you are familiar with how these spectrograms are created and how to customize their size (64 vs. 128 time frames), you can adapt the script to cache the spectrograms as another NumPy array on your local or remote storage. Then, change the filenames as needed in Ryan's 2D CNN training scripts (https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram.py and https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram_train.py).
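
As a rough illustration of that conversion (a sketch only - the sample rate and dB scaling here are assumptions on my part, so check spectrogram_creation.py for the exact parameters):

    import numpy as np
    import librosa

    SAMPLE_RATE = 22050   # assumed; two-second clips -> 44100 samples each
    HOP_LENGTH = 345 * 2  # 690 gives 64 time frames; 345 gives 128

    spectrograms = []
    for sample in samples:  # iterate over the rows of the 1D samples array
        mel = librosa.feature.melspectrogram(y=sample, sr=SAMPLE_RATE,
                                             n_mels=128, hop_length=HOP_LENGTH)
        spectrograms.append(librosa.power_to_db(mel, ref=np.max))

    # Cache the spectrograms so the training scripts can load them directly.
    spectrograms = np.array(spectrograms).reshape(-1, 128, 64, 1)
    np.save("gunshot_augmented_sample_spectrograms.npy", spectrograms)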

jayer95 commented Mar 31, 2021

@amorehead

Thank you very much for your professional reply.

In the meantime, I have been experimenting repeatedly to understand the source code you developed, cross-comparing the methods and directory names each developer used in their different subprojects.

I used np.concatenate to merge the 15 split .npy files into one file (augmented_samples.npy, 31.6 GB), and checked each axis to confirm that the 15 same-shaped arrays were merged correctly.

Then, referring to the code at the link below, I converted augmented_samples.npy into gunshot_augmented_sample_spectrograms_128_64.npy and gunshot_augmented_sample_spectrograms_128_128.npy (2.94 GB and 5.87 GB respectively):
https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/2D%20CNN/spectrogram_creation.py
The key changes for the 128 x 128 conversion are:

    HOP_LENGTH = 345 * 2                                       ->  HOP_LENGTH = 345
    samples = np.array(spectrograms).reshape(-1, 128, 64, 1)   ->  samples = np.array(spectrograms).reshape(-1, 128, 128, 1)

To verify that the spectrograms at both scales were converted correctly, I also converted gunshot_augmented_sample_spectrograms_128_64.npy into .jpg files (over 80,000 images, each 128 high by 64 wide) and visually inspected a few of them. They do indeed look similar to the images in the paper.
[attached image: test_disp]
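
For anyone wanting to do a similar visual check, a minimal sketch (matplotlib is my own choice here, not necessarily what the repository uses):

    import numpy as np
    import matplotlib.pyplot as plt

    spectrograms = np.load("gunshot_augmented_sample_spectrograms_128_64.npy")
    # Render the first few spectrograms as grayscale images for inspection.
    for i in range(5):
        plt.imsave("spec_%05d.jpg" % i, spectrograms[i].squeeze(), cmap="gray")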

Regarding labels.csv: although I cannot find it in the directory, I suspect it corresponds to augmented_labels.npy. I converted that file to .csv and found it has more than 80,000 rows. This file (.csv or .npy) is what distinguishes gunshot from non-gunshot samples.
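
The conversion I used was roughly the following (a sketch):

    import numpy as np

    labels = np.load("augmented_labels.npy")
    # Write one label per row for easy inspection in a spreadsheet.
    np.savetxt("augmented_labels.csv", labels, fmt="%s", delimiter=",")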

That is my progress so far. Fortunately, my machine has 128 GB of RAM, so the operations above run without problems.

I have also found a minor issue that I am still investigating. Regarding the train_index and test_index files you provide on GitHub, the index counts do not seem to match: together they contain fewer than 20,000 entries, whereas the actual total (train + test) should be more than 80,000.
https://github.com/gabemagee/gunshot_detection/blob/develop/raspberry_pi/indexes/training_set_indexes.npy
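
A quick check of what I am seeing (sketch):

    import numpy as np

    train_index = np.load("training_set_indexes.npy")
    test_index = np.load("testing_set_indexes.npy")
    # Compare the provided index counts against the ~89,592 total samples.
    print(len(train_index), len(test_index), len(train_index) + len(test_index))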

I still need to study Ryan's three training scripts at the locations you provided. Thank you again for your patience and professionalism in answering my questions.

jayer95 commented Apr 6, 2021

@amorehead

Hi, regarding the 2D model (128 x 128) training script:

2D model (128 x 128) - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram_train.py

train_wav = np.load(BASE_DIRECTORY + "128_128_augmented_training_spectrograms.npy")
test_wav = np.load(BASE_DIRECTORY + "128_128_augmented_testing_spectrograms.npy")
valid_wav = np.load(BASE_DIRECTORY + "128_128_augmented_validation_spectrograms.npy")

train_label = np.load(BASE_DIRECTORY + "augmented_training_labels.npy")
test_label = np.load(BASE_DIRECTORY + "augmented_testing_labels.npy")
valid_label = np.load(BASE_DIRECTORY + "augmented_validation_labels.npy")

Which script did you use to split the dataset into the training, testing, and validation sets?

I looked at the following three scripts, which appear to be the data-augmentation pipeline for the training, testing, and validation sets:
Training: https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/Data%20Preprocessing%20(Training)/data_augmentation.py
Testing: https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/Data%20Preprocessing%20(Testing)/data_augmentation.py
Validation: https://github.com/gabemagee/gunshot_detection/blob/develop/python/Alex/Data%20Preprocessing%20(Validation)/data_augmentation.py

However, the 32 GB .npy file I downloaded from dataverse_files.zip appears to have already been augmented.
How can I split augmented_samples.npy (32 GB) and augmented_labels.npy into training, testing, and validation sets with the proportions mentioned in the paper?

In addition, the 128 x 64 training script loads the training, testing, and validation data in a different way. Which loading method did you use for your final training?

2D model (128 x 64) - https://github.com/gabemagee/gunshot_detection/blob/develop/python/Ryan/spectrogram.py

# ## Loading augmented NumPy files as NumPy arrays

# In[ ]:

samples = np.load(BASE_DIRECTORY + "gunshot_augmented_sample_spectrograms.npy")
labels = np.load(BASE_DIRECTORY + "gunshot_augmented_sound_labels.npy")

print("Successfully loaded all spectrograms and labels as NumPy arrays...")
print("Type of the spectrograms array:", samples.dtype)

# ## Instantiating a sample weights NumPy array

# In[ ]:

# The last 660 samples were recorded on a Raspberry Pi and are weighted more heavily.
sample_weights = np.array(
    [1 for normally_recorded_sample in range(len(samples) - 660)]
    + [15 for raspberry_pi_recorded_sample in range(660)])
# print("Shape of samples weights before splitting:", sample_weights.shape)

# ## Restructuring the label data

# In[ ]:

labels = np.array([("gun_shot" if label == 1 else "other") for label in labels])
label_binarizer = LabelBinarizer()
labels = label_binarizer.fit_transform(labels)
labels = np.hstack((labels, 1 - labels))

# ### Debugging of the sample and label data's shape (optional)

# In[ ]:

# print("Shape of samples array:", samples.shape)
# print("Shape of labels array:", labels.shape)

# ## Arranging the data

# In[ ]:

'''
kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf.split(samples):
    train_wav, test_wav = samples[train_index], samples[test_index]
    train_label, test_label = labels[train_index], labels[test_index]
    train_weights, test_weights = sample_weights[train_index], sample_weights[test_index]
'''

all_index = np.arange(len(samples))
train_index = np.load("training_set_indexes.npy")
test_index = np.load("testing_set_indexes.npy")
valid_index = np.delete(all_index, list(train_index) + list(test_index))

print(train_index)
print(test_index)
print(valid_index)

train_wav, test_wav, valid_wav = samples[train_index], samples[test_index], samples[valid_index]
train_label, test_label, valid_label = labels[train_index], labels[test_index], labels[valid_index]
train_weights, test_weights, valid_weights = sample_weights[train_index], sample_weights[test_index], sample_weights[valid_index]

How can I obtain (or regenerate) the following two .npy index files?
https://github.com/gabemagee/gunshot_detection/blob/develop/raspberry_pi/indexes/training_set_indexes.npy
https://github.com/gabemagee/gunshot_detection/blob/develop/raspberry_pi/indexes/testing_set_indexes.npy
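
If those exact files cannot be recovered, a minimal sketch of generating comparable index files myself - this is a random split that will not reproduce the original one, and the 80/10 proportions below are placeholders rather than the paper's figures:

    import numpy as np

    labels = np.load("augmented_labels.npy")
    num_samples = len(labels)  # total number of spectrograms/labels
    rng = np.random.default_rng(seed=42)
    shuffled = rng.permutation(num_samples)

    # Placeholder proportions; substitute the splits reported in the paper.
    n_train = int(0.8 * num_samples)
    n_test = int(0.1 * num_samples)

    np.save("training_set_indexes.npy", shuffled[:n_train])
    np.save("testing_set_indexes.npy", shuffled[n_train:n_train + n_test])
    # The remaining indexes become the validation set, matching the
    # np.delete(all_index, ...) logic in spectrogram.py quoted above.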

amorehead (Collaborator) commented

@jayer95,

Thank you for your patience once again. You will most likely need to reach out to Ryan Hosler (@rjhosler), one of the coauthors of the paper, to get your latest questions answered. Since Ryan did most of the work in the "Ryan" directory, he will know best how to reproduce the dataset splits and the model training procedure. If it helps, the paper lists what I believe is his current email address ([email protected]). Please let me know if he is able to answer your questions.

Thanks once again.

jayer95 commented Apr 25, 2021

@amorehead
Thank you, I will prepare my questions and email him.
