Pre-Flight Analysis Tool: Create a stand-alone utility to qualify plaintext files prior to processing #10

Open
turnkit opened this issue Jan 6, 2023 · 4 comments


turnkit commented Jan 6, 2023

Create a Python script to qualify and sort the files in the whisp_out .txt source folder and bin them into the following sub-folders:

  • 3_problem folder: files that look like failed transcriptions. Examples: the file contains no periods at all; alternatively (or additionally), there are no periods in the first three lines; or the entire file is smaller than 1024 bytes. We may discover additional ways in the future to determine whether a transcription has likely failed. Any likely transcription failure should be screened out and dumped in here so that it doesn't get processed for paragraph chunking.

  • 1_plain folder: the .txt file passes the step above (it seems good) AND it contains only ASCII text, with no non-ASCII Unicode characters.

  • 2_unicode folder: .txt files which pass the first step and DO contain at least one non-ASCII Unicode character.
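
The three bins above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the helper names (`looks_problematic`, `choose_bin`, `sort_folder`) are made up for illustration, the "no periods in the first three lines" rule is applied in addition to the whole-file rule, and the sub-folders are created inside the source folder itself.

```python
import shutil
from pathlib import Path

MIN_BYTES = 1024  # size threshold taken from the description above


def looks_problematic(raw: bytes) -> bool:
    """Return True if the file content looks like a failed transcription."""
    if len(raw) < MIN_BYTES:
        return True
    text = raw.decode("utf-8", errors="replace")
    if "." not in text:
        return True
    # Assumption: also reject files with no period in the first three lines.
    first_three = text.splitlines()[:3]
    return not any("." in line for line in first_three)


def choose_bin(raw: bytes) -> str:
    """Pick the sub-folder name for one file's content."""
    if looks_problematic(raw):
        return "3_problem"
    return "1_plain" if raw.isascii() else "2_unicode"


def sort_folder(src: Path) -> None:
    """Move every .txt file directly inside src into its bin (hypothetical layout)."""
    for path in sorted(src.glob("*.txt")):
        dest = src / choose_bin(path.read_bytes())
        dest.mkdir(exist_ok=True)
        shutil.move(str(path), str(dest / path.name))
```

Calling `sort_folder(Path("whisp_out"))` would then bin the files in place. `bytes.isascii()` needs Python 3.7 or newer.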

There is probably a broad way to test for non-ASCII characters, but some examples are the music symbols shown here:

[04:31.880 --> 04:38.880] ♪ Open my eyes, invade my life, my Lord ♪
[04:38.880 --> 04:51.880] ♪♪
[04:51.880 --> 04:59.880] ♪ I can't do this without You, You're my fire, You're my fire ♪

or Chinese such as found in the file SID1193.mp3.txt:

I have started on this plan,
或者一次過一次的開始來畫, and I have not found that I could get on very far.
我發現我不能夠進行太多。 It was so I was trying to do something that the Lord did not want me to do.
好像我一直要做一點,就說不要我做的事。 I wonder if you have had that experience.
我不知道你們有沒有這種經歷。 You try to do something, but you just have no life in it at all.
你想要做一點這樣的事,可是你的一隻不感覺有生命。 The thing becomes dead.
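
One broad test that might work is simply checking for any code point above 127, which would catch both the music symbols and the Chinese text. A minimal sketch, where `non_ascii_chars` is a hypothetical helper name that also reports which characters triggered the match:

```python
from typing import List


def non_ascii_chars(text: str) -> List[str]:
    """Return the distinct non-ASCII characters in text, in first-seen order."""
    seen = {}
    for ch in text:
        if ord(ch) > 127:  # anything outside the 7-bit ASCII range
            seen.setdefault(ch, None)
    return list(seen)


def has_non_ascii(text: str) -> bool:
    """True if the text contains at least one non-ASCII character."""
    return not text.isascii()  # str.isascii() requires Python 3.7+
```

Listing the offending characters could help when deciding later whether a file in 2_unicode only has stray music symbols or is actually in another language.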


turnkit commented Jan 6, 2023

Another test for "problematic" source whisper files is when the first line is repeated three times.

Here's the .srt output, but the .txt file would just have the plaintext repeated three times.

processing: 0/SID0607.mp3 Fri Jan 6 03:55:10 CST 2023
[00:00.000 --> 00:02.060] you
[00:30.000 --> 00:32.060] you
[01:00.000 --> 01:02.060] you
[01:30.000 --> 01:32.060] you
processing: 0/SID0608.mp3 Fri Jan 6 03:56:04 CST 2023
[00:00.000 --> 00:02.000] You
[00:30.000 --> 00:32.000] You
[01:00.000 --> 01:02.000] You
[01:30.000 --> 01:32.000] You
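
A rough sketch of this check on the plaintext .txt content, assuming blank lines are skipped and the comparison is exact (so "you" and "You" would not match each other); `first_line_repeats` is a made-up name:

```python
def first_line_repeats(text: str, times: int = 3) -> bool:
    """True if the first `times` non-blank lines are all identical."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < times:
        return False
    head = lines[:times]
    return all(ln == head[0] for ln in head)
```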


turnkit commented Jan 6, 2023

A better (or another) pre-flight test for "problematic" files: check each text line of the file. If the same line repeats five times in a row (not just at the beginning, but any five consecutive times), then dump it in the problematic bin.

Here's an example found in: SID1285.mp3.txt

definitive focus what has been the argument of the apostle all the way along the line
from verse 17. And again, verse 22 confirms the interpretation of verses 17 to 21, which
I have been presenting. And it's not really compatible with any other construction of
the relation which the mosaic covenant sustains to the Abrahamic. In a word, it is
that the Abrahamic covenant is the fabric around which is woven the web of the mosaic
ritual and ordinance. Mosaic ritual and ordinance. Now that was the relation between
the Abrahamic covenant and the mosaic ritual. And it's not really compatible with any other
construction of the mosaic ritual, but it's not really compatible with any other
construction of the mosaic ritual, but it's not really compatible with any other
construction of the mosaic ritual, but it's not really compatible with any other
construction of the mosaic ritual, but it's not really compatible with any other
construction of the mosaic ritual, but it's not really compatible with any other
construction of the mosaic ritual, but it's not really compatible with any other
construction of the mosaic ritual, but it's not really compatible with any other
construction of the mosaic ritual, but it's not really compatible with any other
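
The consecutive-repeat test could look something like the sketch below. It assumes lines are compared after stripping whitespace and that blank lines never count as a repeat; the function names are hypothetical.

```python
from itertools import groupby
from typing import Tuple


def longest_repeated_run(text: str) -> Tuple[str, int]:
    """Return (line, count) for the longest run of identical consecutive non-blank lines."""
    best_line, best_count = "", 0
    lines = (ln.strip() for ln in text.splitlines())
    # groupby() groups adjacent equal items, which is exactly "in a row".
    for line, group in groupby(lines):
        count = sum(1 for _ in group)
        if line and count > best_count:
            best_line, best_count = line, count
    return best_line, best_count


def has_repeated_run(text: str, run_length: int = 5) -> bool:
    """True if any non-blank line repeats run_length or more times in a row."""
    return longest_repeated_run(text)[1] >= run_length
```

Returning the line along with the count makes it easier to log why a file was sent to the problematic bin.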

@turnkit turnkit changed the title Create a stand-alone utility to qualify plaintext files prior to processing Pre-Flight Analysis Tool: Create a stand-alone utility to qualify plaintext files prior to processing Jan 6, 2023

turnkit commented Jan 6, 2023


turnkit commented Jan 6, 2023

Checking to see if the same line is repeated in succession three to five times might not always work.

Here's the end of one file that is probably legit. Maybe we can ignore the line if it is "..."? (Assuming this is SID0661.mp3)

00] In the last...
[59:23.200 --> 59:33.200] ...
[59:33.200 --> 59:43.200] ...
[59:43.200 --> 59:53.200] ...
[59:53.200 --> 01:00:03.200] ...
[01:00:03.200 --> 01:00:13.200] ...
[01:00:13.200 --> 01:00:23.200] ...
[01:00:23.200 --> 01:00:32.200] ...
processing: 0/SID0662.mp3 Fri Jan 6 07:25:20 CST 2023
[00:00.000 --> 00:21.000] Good morning.
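
One way to handle this might be an ignore list for the repeat check, so that runs of "..." do not flag the file. A minimal sketch, where `IGNORED_LINES` and `has_suspicious_run` are hypothetical names and the ignore set only contains the one value mentioned above:

```python
from itertools import groupby

# Assumption: only "..." is known to be a harmless repeated line so far.
IGNORED_LINES = frozenset({"..."})


def has_suspicious_run(text, run_length=5, ignored=IGNORED_LINES):
    """True if a non-blank, non-ignored line repeats run_length+ times in a row."""
    lines = (ln.strip() for ln in text.splitlines())
    for line, group in groupby(lines):
        if not line or line in ignored:
            continue  # blank lines and ignored fillers never flag a file
        if sum(1 for _ in group) >= run_length:
            return True
    return False
```

More filler lines can be added to the set if other legitimate repeats turn up.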
