Pre-Flight Analysis Tool: Create a stand-alone utility to qualify plaintext files prior to processing #10
Comments
Another test for "problematic" Whisper source files: the first line is repeated three times. Here's the .srt output; the .txt file would just have the plaintext repeated three times. processing: 0/SID0607.mp3 Fri Jan 6 03:55:10 CST 2023
A better (or additional) pre-flight test for "problematic" files: check each text line of the file. If the same line repeats five times in a row (any five consecutive times, not just at the beginning), then dump the file in the problematic bin. Here's an example found in SID1285.mp3.txt: definitive focus what has been the argument of the apostle all the way along the line
Checking whether the same line is repeated in succession three to five times might not always work. Here's the end of one file that is probably legit. Maybe we can ignore the repetition if the line is "..."? (Assuming this is SID0661.mp3) 00] In the last...
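A minimal sketch of the repeated-line check discussed in these comments. The function name `has_repeated_run`, the default threshold of five, and the choice to skip blank lines are assumptions for illustration; lines equal to "..." are ignored and break the run, as suggested above.

```python
def has_repeated_run(lines, threshold=5, ignore=("...",)):
    """Return True if the same line occurs `threshold` times in a row.

    Blank lines are skipped (assumption). Lines listed in `ignore`
    never count as repeats and reset the current run.
    """
    previous = None
    run = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines without breaking the run
        if line in ignore:
            previous, run = None, 0  # ignored filler such as "..."
            continue
        if line == previous:
            run += 1
        else:
            previous, run = line, 1
        if run >= threshold:
            return True
    return False
```

Calling it with `threshold=3` would cover the "first line repeated three times" case; which threshold gives the fewest false positives is still an open question in this thread.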
Create a Python script to qualify and sort the files in the whisp_out .txt source folder and bin them into the following sub-folders:
3_problem folder: problematic files go in this folder, e.g. files with no periods in the entire file or, alternatively (also?), no periods in the first three lines. If the entire file is less than 1024 bytes, it should also go in here. We may discover additional ways in the future to determine whether these transcriptions have likely failed. Any likely transcription failures should be screened and dumped in here so that they don't get processed for paragraph chunking.
1_plain folder: the .txt file passes the step above (it seems good) AND it contains only ASCII text, with no non-ASCII Unicode characters.
2_unicode folder: the .txt file passes the first step AND it contains at least one non-ASCII Unicode character.
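The three bins above could be sketched roughly as follows. This is only one possible reading of the rules: the helper names (`looks_problematic`, `bin_name`, `sort_folder`) are hypothetical, both period checks are applied, files are assumed to be UTF-8, and files are moved rather than copied. Note that a timestamp such as `04:31.880` would count as containing a period, so timestamped lines may need stripping first.

```python
import shutil
from pathlib import Path

MIN_BYTES = 1024  # files smaller than this are treated as failed transcriptions


def looks_problematic(path):
    """Apply the 3_problem heuristics: tiny file or missing periods."""
    if path.stat().st_size < MIN_BYTES:
        return True
    text = path.read_text(encoding="utf-8", errors="replace")
    if "." not in text:
        return True
    # Assumption: "first three lines" means the first three non-blank lines.
    first_three = [ln for ln in text.splitlines() if ln.strip()][:3]
    return not any("." in ln for ln in first_three)


def bin_name(path):
    """Return the sub-folder name a .txt file belongs in."""
    if looks_problematic(path):
        return "3_problem"
    # Undecodable bytes become U+FFFD, so such files land in 2_unicode.
    text = path.read_text(encoding="utf-8", errors="replace")
    return "1_plain" if text.isascii() else "2_unicode"


def sort_folder(source):
    """Move every .txt file directly inside `source` into its bin."""
    source = Path(source)
    for path in sorted(source.glob("*.txt")):
        dest = source / bin_name(path)
        dest.mkdir(exist_ok=True)
        shutil.move(str(path), str(dest / path.name))
```

Running `sort_folder("whisp_out")` would bin the files in place. Swapping `shutil.move` for `shutil.copy2` would keep the originals untouched while the heuristics are still being tuned.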
There is probably a broad way to test for Unicode characters, but some examples are the music symbols included here:
[04:31.880 --> 04:38.880] ♪ Open my eyes, invade my life, my Lord ♪
[04:38.880 --> 04:51.880] ♪♪
[04:51.880 --> 04:59.880] ♪ I can't do this without You, You're my fire, You're my fire ♪
or Chinese such as found in the file SID1193.mp3.txt:
I have started on this plan,
或者一次過一次的開始來畫, and I have not found that I could get on very far.
我發現我不能夠進行太多。 It was so I was trying to do something that the Lord did not want me to do.
好像我一直要做一點,就說不要我做的事。 I wonder if you have had that experience.
我不知道你們有沒有這種經歷。 You try to do something, but you just have no life in it at all.
你想要做一點這樣的事,可是你的一隻不感覺有生命。 The thing becomes dead.
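One broad way to test for such characters, sketched here as a suggestion rather than a settled approach: Python 3.7+ provides `str.isascii()`, which returns False as soon as a string contains any character outside the ASCII range. The helper `non_ascii_chars` is a hypothetical name, added only to report which characters triggered the result.

```python
def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in `text`, in first-seen order."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return seen


sample = "[04:38.880 --> 04:51.880] ♪♪"
print(sample.isascii())         # False
print(non_ascii_chars(sample))  # ['♪']
```

The same check catches both the music symbols and the Chinese text, so no per-script list of characters should be needed.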