Spoken Italian Datasets Collection

Overview

This repository hosts a curated collection of datasets focused on spoken Italian. The collection is maintained as a Google Sheet to provide easy access and encourage contributions.

Features

  • Rich Metadata: Information about dataset sources, sizes, formats, licenses, and more.
  • Interactive Access: Explore the collection interactively via a Jupyter Notebook.
  • Community Contributions: Open to contributions for adding or updating datasets.

Dataset Features Mapped

  • Name
  • Publisher/Promoter
  • Size: number of hours, size on disk (MB, GB)...
  • Format: encoding, meta-file availability...
  • Source and Context: broadcast media, telephone speech, social media and online platforms, field recordings...
  • Type of Speech: conversational, monologues, spontaneous, read...
  • Regional Varieties: dialects, idioms...
  • Sociolinguistic Variation: age, gender, socio-economic status.
  • Multilingual Corpus: whether the dataset belongs to a wider multilingual corpus.
  • Data Collection: Characteristics of data collection process.
  • Recording Methods: microphone type, sampling rate, bit depth, environment control.
  • Participant Selection: age, region, language proficiency.
  • Transcription Standard: orthographic, phonetic...
  • Annotation Levels: Lexical, syntactic, prosodic...
  • Tools and Software Used
  • Quality Control and Validation: data quality, consistency, reliability
  • Availability: type of access (open or restricted) and type of license.
  • URL
  • Paper
  • Other

Access the Collection

The full collection is available at this link:

View the collection

A dedicated query interface is planned.
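
In the meantime, a local copy of the sheet can be filtered with a few lines of Python. The following is only a minimal sketch: it assumes the sheet has been exported as CSV and that its column headers match the feature names listed above. The `filter_datasets` helper, the column names, and the sample rows are hypothetical placeholders, not actual entries from the collection.

```python
import csv
import io


def filter_datasets(csv_text, column, keyword):
    """Return the rows whose `column` value contains `keyword` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    keyword = keyword.lower()
    # Rows that lack the column are skipped rather than raising an error.
    return [row for row in reader if keyword in (row.get(column) or "").lower()]


# Placeholder rows for illustration only; NOT real entries from the collection.
SAMPLE_CSV = """Name,Type of Speech,Availability
Example Corpus A,read,open
Example Corpus B,"conversational, spontaneous",restricted
"""

if __name__ == "__main__":
    for row in filter_datasets(SAMPLE_CSV, "Type of Speech", "spontaneous"):
        print(row["Name"])
```

To use it on a real export, read the downloaded file into a string and pass it in place of `SAMPLE_CSV`, adjusting the column name to the actual header in the sheet.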



Licensing

This repository contains two types of content, each with its own license.

By using or contributing to this repository, you agree to abide by the respective licenses.


Contributing

Contributions are welcome! To add a new dataset to the collection:

  1. Fork this repository.
  2. Add your dataset details to the Google Sheet or propose edits via GitHub Issues.
  3. Submit a pull request.

See CONTRIBUTING.md for more details.


Citation

If you use this collection in your work, please cite it as: Giordano M., Rinaldi C. "Spoken Italian Datasets Collection." GitHub, 2024. https://github.com/marco-giordano/spoken-italian-datasets


Acknowledgments

Thanks to all contributors and the community for supporting open data in spoken language research.
