Add recipe for the Santa Barbara Corpus of Spoken American English (SBCSAE) #1395

Merged
merged 14 commits into lhotse-speech:master on Oct 4, 2024


mmaciej2
Contributor

This PR adds a recipe for the SBCSAE corpus, including our cleanup efforts (further details here).

The bulk of the work is in the _filename_to_supervisions() and _parse_raw_transcripts() functions, which parse the transcript files. Those files were designed to be human-readable, which makes them challenging to process automatically.
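
To give a rough idea of the kind of parsing involved, here is a minimal sketch. It is not the code in this PR. It assumes a simplified line layout of `start end<TAB>SPEAKER:<TAB>text`, where continuation lines omit the speaker label; the `Segment` class, the regular expression, and the `parse_transcript_lines()` helper are all hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Segment:
    start: float
    end: float
    speaker: str
    text: str


# Assumed layout: two timestamps, an optional "SPEAKER:" label, then the text.
LINE_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(?:([A-Z][A-Z0-9_#]*):)?\s*(.*)$"
)


def parse_transcript_lines(lines: Iterable[str]) -> List[Segment]:
    """Turn human-readable transcript lines into timed segments."""
    segments: List[Segment] = []
    speaker: Optional[str] = None
    for raw in lines:
        match = LINE_RE.match(raw)
        if match is None:
            continue  # skip lines that do not look like timed entries
        start, end = float(match.group(1)), float(match.group(2))
        if match.group(3):
            speaker = match.group(3)  # continuation lines reuse the last speaker
        text = match.group(4).strip()
        if speaker is None or end <= start or not text:
            continue  # drop entries that cannot form a valid segment
        segments.append(Segment(start, end, speaker, text))
    return segments
```

The real transcripts contain many more irregularities than this handles (inconsistent whitespace, malformed timestamps, annotation markup), which is where most of the effort went.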

The other significant processing is in the apply_aligned_stms() function, which downloads .stm files produced by a realignment procedure that improves segmentation boundaries, and merges the improved segmentation into the supervisions generated directly from the corpus.
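
The merge step can be pictured with the sketch below. Again, this is not the implementation in this PR: it uses a plain dataclass instead of Lhotse's supervision type, skips the download, and assumes that an aligned STM entry can be matched to a supervision by recording, speaker, and exact text. The `Supervision` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]  # (recording_id, speaker, text)


@dataclass(frozen=True)
class Supervision:
    id: str
    recording_id: str
    speaker: str
    start: float
    duration: float
    text: str


def parse_stm(lines: Iterable[str]) -> Dict[Key, Tuple[float, float]]:
    """Read STM lines: <recording> <channel> <speaker> <start> <end> [<label>] text."""
    aligned: Dict[Key, Tuple[float, float]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue  # blank line or STM comment
        parts = line.split(maxsplit=5)
        if len(parts) < 6:
            continue  # no transcript text on this line
        rec, _channel, spk, start, end, text = parts
        if text.startswith("<"):
            # Drop the optional <label> field that may precede the text.
            _label, _, text = text.partition(">")
            text = text.strip()
        aligned[(rec, spk, text)] = (float(start), float(end))
    return aligned


def apply_alignment(
    supervisions: Iterable[Supervision],
    aligned: Dict[Key, Tuple[float, float]],
) -> List[Supervision]:
    """Replace boundaries where an aligned entry exists; keep the rest as-is."""
    merged: List[Supervision] = []
    for sup in supervisions:
        key = (sup.recording_id, sup.speaker, sup.text)
        if key in aligned:
            start, end = aligned[key]
            if end > start:
                sup = replace(sup, start=start, duration=round(end - start, 3))
        merged.append(sup)
    return merged
```

Matching on exact text is the weakest assumption here, since a speaker repeating the same short utterance would collide on one key; a real merge needs a more robust way to pair segments.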

Collaborator

@pzelasko pzelasko left a comment

Looks great! Can you also add an entry in docs/corpus.rst table of recipes?

@pzelasko pzelasko added this to the v1.28.0 milestone Oct 1, 2024
@mmaciej2
Contributor Author

mmaciej2 commented Oct 3, 2024

I've updated the docs/corpus.rst file and also added a fix for the unit test that failed on Python 3.8.

pzelasko
pzelasko previously approved these changes Oct 3, 2024
@pzelasko pzelasko enabled auto-merge (squash) October 3, 2024 23:39
@pzelasko
Collaborator

pzelasko commented Oct 3, 2024

Thanks!

auto-merge was automatically disabled October 4, 2024 14:31

Head branch was pushed to by a user without write access

@pzelasko pzelasko enabled auto-merge (squash) October 4, 2024 17:41
@pzelasko pzelasko merged commit d1b078b into lhotse-speech:master Oct 4, 2024
18 checks passed
4 participants