Storing raw data in non-audible format #976

ahmaeldesoky · 2024-04-14T13:02:55Z

Hi all,

I'm new to the package and to bioacoustics in general. My question is simple: I need to save the raw audio data that I collect in the field in a non-audible format (for privacy and GDPR requirements); however, I still need a format that can be used later as input for different ML/CNN models (the pre-trained ones available in bioacoustics-model-zoo, but also custom ones). What would be the best way to handle this? I thought of spectrogram objects (opensoundscape.spectrogram.Spectrogram) as a way to do it, but I'm not sure how practical this would be. Of course, there will always be the limitation that no ground truth data will be available for validating the model outputs by ear, but I'm trying to optimize as much as possible.

Thanks in advance

Ahmed

sammlapp · 2024-04-15T20:06:15Z

Maintainer

Hi @ahmaeldesoky, that is an interesting question. I'm not familiar with the privacy requirements related to "non-audible formats", but if you store the data as spectrograms, it wouldn't be very hard for someone to re-create audio files from them. I wonder if encrypting the audio files on your computer would satisfy the requirements? Then they would still be audio files, but they could be password-protected.

0 replies

fherb2 · 2024-10-08T21:37:08Z

Hi,

Although somewhat off-topic with regard to the specific enquiry, I would like to take the title of the enquiry as an opportunity for the following idea:

I have been recording environmental sound for bird prediction 24/7 at one location for a year and a half, and I am preparing to add further recordings via AudioMoth. So I am also looking for a better way to collect the sound data than a folder with tens of thousands of WAV files. We should also think about use cases where the metadata goes beyond creator, GPS, and time data. Users may want to include time-stamped weather data for better predictions, or for downstream use of the sound source predictions: recognizing animal species is not an end in itself.

Possible database solution

Of course, you can collect such metadata in tens of thousands of additional text or CSV files. But scientists in other disciplines, who also have to work with large amounts of data combined with metadata, have already created a good solution for this: HDF5 (BSD-like license for general use), https://www.hdfgroup.org/solutions/hdf5/

HDF5 offers some really good advantages for data such as environmental sound recordings:

Data inside such HDF5 database files:

made for data collections in the form of tables of sampled data, named 'datasets' (for example, audio samples of one or more synchronized channels, with a start timestamp and a sample rate)
made for attaching additional metadata to such datasets
made for organizing datasets in folder-like structures, where the folders can also carry metadata
structure, datasets, and metadata can be used to make an HDF5 database file completely self-explanatory
there are viewers that can display the contents of HDF5 files, so you can inspect the data and check how self-explanatory it is before you start writing processing scripts
ideal for storing data in line with the principles of 'good scientific practice' when you publish your results and make the data sources available to others. (In this context, I do not mean data that is well suited to Pandas files; I mean only the source sample data with all its metadata.)

The result is not a zip file with a flat or deep folder structure of WAV files (for example) plus a document explaining how to use the file names and how to interpret text or CSV files as metadata. No, the result is one HDF5 file.

Processing of such data:

The I/O implementation is optimized for fast access, even in big files. (The development goal was to access data much faster than via table files stored in file folders.)
Access to parts of the data without copying all of it into memory. In Python, you say what you want and you get a NumPy array of the samples.
API bindings for many languages, including Python and R
An easy-to-understand API, so it is quick to learn.
Transparent lossless compression via entropy-coding methods is possible
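
To make the points above concrete, here is a minimal sketch using the h5py bindings. It is not the library described in this post. The group layout, dataset name, attribute names, and coordinates are hypothetical, and the audio is synthetic. It shows a folder-like group with metadata, a compressed dataset with its own metadata, and a partial read that returns a NumPy array.

```python
import os
import tempfile

import h5py
import numpy as np

sample_rate = 48_000
# Ten seconds of synthetic 16-bit mono audio standing in for a field recording.
t = np.arange(10 * sample_rate) / sample_rate
samples = (0.3 * np.sin(2 * np.pi * 2_000 * t) * 32767).astype(np.int16)

path = os.path.join(tempfile.mkdtemp(), "recordings.h5")

# Write: groups behave like folders; groups and datasets can both carry metadata.
with h5py.File(path, "w") as f:
    site = f.create_group("site_a")      # hypothetical layout: one group per site
    site.attrs["latitude"] = 51.05       # hypothetical coordinates
    site.attrs["longitude"] = 13.74
    dset = site.create_dataset(
        "2024-10-08T21-00-00",           # hypothetical naming scheme
        data=samples,
        chunks=(sample_rate,),           # one-second chunks
        compression="gzip",              # lossless, applied transparently per chunk
    )
    dset.attrs["sample_rate"] = sample_rate
    dset.attrs["start_time"] = "2024-10-08T21:00:00Z"
    dset.attrs["recorder"] = "AudioMoth"

# Read: slicing the dataset loads only the chunks covering seconds 3 to 5.
with h5py.File(path, "r") as f:
    dset = f["site_a/2024-10-08T21-00-00"]
    sr = int(dset.attrs["sample_rate"])
    segment = dset[3 * sr : 5 * sr]      # NumPy array with 2 s of samples
    start_time = dset.attrs["start_time"]

print(segment.shape, segment.dtype, start_time)
```

Chunking by whole seconds is only one possible choice; the chunk size trades compression ratio against the cost of reading short excerpts.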

This is why I decided to collect my audio data in such HDF5 database files. I have written functions to insert audio files into HDF5 files, including metadata, as well as functions to restore the original files with name and content, or just to compare them with files in folders if necessary. The metadata contains all the original information, and any further metadata can be added. I am now implementing access to this data so that it can be treated like very large audio files spanning weeks or months.

Question to the developers of opensoundscape

Is there general interest among the opensoundscape developers in supporting large sound datasets via HDF5? Maybe I could contribute by developing a version of my library that is customised for opensoundscape. Since I want to use opensoundscape to analyse my data, I will write some functions to bring HDF5 and opensoundscape together anyway.

Please let me know in case there is some interest.

Best regards,
Frank

3 replies

sammlapp Oct 8, 2024
Maintainer

Hi Frank, this sounds like an interesting project. I think making a single implementation of metadata storage that will generally work for any project is nearly impossible, but if you develop an open-source implementation for your project, perhaps others could adapt it for other use cases.

What I would recommend is using OpenSoundscape as a dependency in your package and writing interfaces for your database. In particular, you can subclass the Audio class to implement a .from_hdf() class method (see from_file implementation for reference), and include some custom parsing of the metadata that gets added to the audio object. You might also subclass BoxedAnnotations or other pieces of the annotations module if your file contains some sort of audio annotations.
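
A rough sketch of what such a class method could look like, written as a mixin so the loading logic can be tried without OpenSoundscape installed. Everything here is an assumption for illustration: the name `HDF5AudioMixin`, the `sample_rate` attribute on the dataset, the `offset`/`duration` arguments, and the premise that the `Audio` constructor can be called as `Audio(samples, sample_rate, metadata=...)`.

```python
import h5py
import numpy as np


class HDF5AudioMixin:
    """Adds from_hdf() to a class constructed as cls(samples, sample_rate, metadata=...)."""

    @classmethod
    def from_hdf(cls, path, dataset, offset=0.0, duration=None):
        """Load `duration` seconds, starting `offset` seconds into an HDF5 dataset."""
        with h5py.File(path, "r") as f:
            dset = f[dataset]
            # Assumption: the sample rate is stored as a dataset attribute.
            sample_rate = int(dset.attrs["sample_rate"])
            start = int(round(offset * sample_rate))
            if duration is None:
                stop = dset.shape[0]
            else:
                stop = min(start + int(round(duration * sample_rate)), dset.shape[0])
            samples = dset[start:stop]        # only this slice is read from disk
            metadata = dict(dset.attrs)       # copy attributes before the file closes
        metadata["offset_seconds"] = offset   # remember where the clip came from

        if np.issubdtype(samples.dtype, np.integer):
            # Scale integer PCM to floats in [-1, 1).
            scale = float(np.iinfo(samples.dtype).max) + 1.0
            samples = samples.astype(np.float32) / scale
        return cls(samples, sample_rate, metadata=metadata)


# With OpenSoundscape installed, the subclass could then look like this
# (assumption: Audio accepts samples, sample_rate and a metadata keyword):
#
#     from opensoundscape import Audio
#
#     class HDFAudio(HDF5AudioMixin, Audio):
#         pass
#
#     clip = HDFAudio.from_hdf("recordings.h5", "site_a/rec_001", offset=60, duration=10)
```

Storing the offset in the metadata keeps a clip traceable to its position in a weeks-long dataset; converting the recording's start timestamp plus the offset into an absolute time would be a natural next step.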

If you're planning to do some machine learning with the data, you'll likely also want to write a custom Preprocessor class subclassing BasePreprocessor, which specifies the sample loading operations.

Please share if you come up with something that you think will be useful to others, and let us know if you have questions about opensoundscape.

fherb2 Oct 12, 2024

Ok. I will follow your recommendations and share the results once everything is in a state suitable for reuse.

Best regards, Frank

sammlapp Oct 15, 2024
Maintainer

Came across this today: the use of the npy format for fast out-of-memory access to audio + metadata seems similar to your use case. https://janclemenslab.org/das/technical/data_formats.html
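
For illustration, the general idea of out-of-memory access with npy files can be sketched with plain NumPy. This is not the layout described on the linked page: the file names and the JSON sidecar for metadata are hypothetical, and the audio is synthetic.

```python
import json
import os
import tempfile

import numpy as np

sample_rate = 32_000
rng = np.random.default_rng(0)
# One minute of synthetic mono audio standing in for a long recording.
samples = rng.standard_normal(60 * sample_rate).astype(np.float32)

folder = tempfile.mkdtemp()
npy_path = os.path.join(folder, "samples.npy")
meta_path = os.path.join(folder, "metadata.json")   # hypothetical sidecar file

np.save(npy_path, samples)
with open(meta_path, "w") as fh:
    json.dump({"sample_rate": sample_rate, "start_time": "2024-10-15T06:00:00Z"}, fh)

# mmap_mode="r" maps the file into memory instead of loading it all at once.
mapped = np.load(npy_path, mmap_mode="r")
with open(meta_path) as fh:
    meta = json.load(fh)

sr = meta["sample_rate"]
# Copy only seconds 20 to 25 into memory as a regular array.
clip = np.array(mapped[20 * sr : 25 * sr])

print(type(mapped).__name__, clip.shape)
```

Compared with HDF5, this keeps data and metadata in separate files and offers no built-in compression, but it needs nothing beyond NumPy.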
