Fixing timing of index events without causing crackling #96
base: master
Conversation
Hi, this seems like a very good solution! I was testing it, but it adds unwanted pauses; you can test this by spelling a word, for example. It's also noticeable in other places where punctuation is present, and more noticeable on NVDA's alpha versions.
Yep, still noticeable for me here as well.
I've spent quite some time investigating the longer pauses when spelling. According to my measurements, my version introduces a 38% slowdown when spelling a word (695 ms vs. 502 ms).
Hi, I would not like to add latency to the synth. In eSpeak the problem when spelling is not noticeable; I will analyze the reasons. I haven't been able to test lately due to lack of time, but I will definitely do it. There is an alternative I'm considering now: an extra setting to choose between the current approach and the accurate indexing method.
If this introduces latency via the expedient of 3300 or more samples being buffered, then at 11025 Hz this would introduce a delay of nearly 300 ms. That is unacceptable if it actually happens this way, although I suppose that depends on how much faster than real time ECI actually is. I think we should stop trying to work around flaws in WinMM and look at the WASAPI alternative. Does feeding small chunks also cause crackling? What about the direct adaptation of passing a ctypes pointer to the buffer rather than storing it in a bytes object?
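For reference, a quick check of that worst-case figure (my own arithmetic, assuming exactly one 3300-sample buffer held back at 11025 Hz):

```python
# Worst-case delay from holding back one full buffer (assumed 3300 samples at 11025 Hz).
buffer_samples = 3300
sample_rate_hz = 11025
print(f"{buffer_samples / sample_rate_hz * 1000:.0f} ms")  # ~299 ms
```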
The latency is not perceptible except in some cases; it would never reach 300 ms, as that would be very noticeable. I don't think it's a good idea to focus only on WASAPI; that would mean forgetting all the users who for some reason cannot update to the latest version of NVDA.
We'd leave the existing code as a backward compatibility layer, or maybe an option; since the WASAPI build isn't even out of alpha yet, I agree that we can't just drop all WinMM support. I'm just saying we might require the WASAPI build for accurate indexing.
I think I found why this PR was causing extra pauses when spelling. I've just pushed a commit fixing this.
This is my second attempt at fixing indexing without regressing sound quality. It supersedes my previous PR #94. This would fix #22 if accepted.
My investigation
I dug into the issue of crackling, and here is my best understanding of why it happens.
Why doesn't crackling happen on the current master version? After consuming a chunk of size 11, the current version doesn't send the short chunk to the player right away, but rather buffers it and eventually combines it with the following chunk, which happens to be just silence. This comes at the cost of not catching the precise moment when the 11-sample chunk has finished playing, which degrades index firing precision.
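For illustration, the behaviour described above amounts to something like the following sketch (the names and the threshold are my assumptions, not the actual master code):

```python
# Rough sketch of master's behaviour: short chunks are held back and merged with
# following audio, so the player never receives a tiny buffer - but the exact moment
# the short chunk finishes playing is no longer observable for index timing.
MIN_FEED_BYTES = 4096  # hypothetical "large enough" threshold

pending = b""

def on_audio(chunk: bytes, player) -> None:
    global pending
    pending += chunk
    if len(pending) >= MIN_FEED_BYTES:
        player.feed(pending)  # large enough to avoid crackling
        pending = b""
    # otherwise keep buffering; index events tied to the short chunk fire late
```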
Even though NVDA is inevitably going to switch from WinMM to WASAPI, I thought it was still worthwhile to gain a better understanding of the crackling, and maybe this will also help in some way with future WASAPI versions.
How this PR fixes indexing without introducing crackling
Indexing is fixed in a similar way to my previous PR #94, by setting an `onDone` callback on the `player.feed()` call. If more than one index is sent together, the `onDone` callback triggers all of them, so no index event is lost.
On a high level, the crackling problem is fixed by delayed flushing of audio chunks. When we come across a chunk of size 11 - as in my example above - we find ourselves in a situation where the previous chunk hasn't been flushed yet (in other words, `player.feed()` hasn't been called on it yet). This allows us to combine the chunk of size 11 with the previous chunk of size 3300 and flush both of them in a single call to `player.feed()`.
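To illustrate the index half of this, here is a simplified, self-contained sketch of the idea (the player class and the notification helper are stand-ins for the real NVDA plumbing, not the code in this PR):

```python
from typing import Callable, Optional

class DummyPlayer:
    """Stand-in for the audio player; the real player invokes onDone when the fed data finishes playing."""
    def feed(self, data: bytes, onDone: Optional[Callable[[], None]] = None) -> None:
        if onDone is not None:
            onDone()  # the real player would call this only after the audio has played

def notify_index_reached(index: int) -> None:
    """Placeholder for however the driver reports an index back to NVDA."""
    print(f"index {index} reached")

def flush(audio_stream: bytearray, indexes: list[int], player: DummyPlayer) -> None:
    """Feed the buffered audio and fire every pending index once that audio is done playing."""
    data = bytes(audio_stream)
    fired = list(indexes)      # snapshot: more than one index may be pending
    indexes.clear()
    audio_stream.clear()
    # A single onDone callback reports all indexes belonging to this chunk,
    # so none of them are lost even if several were sent together.
    player.feed(data, onDone=lambda: [notify_index_reached(i) for i in fired])

flush(bytearray(b"\x00" * 6600), [1, 2], DummyPlayer())  # prints "index 1 reached", "index 2 reached"
```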
More precisely, when we receive a new chunk of audio in the `eciCallback()` function, we check how much audio has already been buffered (in `audioStream`) and how much audio is coming in with the new chunk (variable `lp`). We only flush `audioStream` if both values are at least `samples*2`, or in other words if both values are at least as big as the buffer we use to communicate with Eloquence. If this condition is satisfied, we flush the old contents of `audioStream` (which are long enough not to cause crackling), and then we truncate `audioStream` and buffer the incoming audio chunk in it. One more case where we end up flushing is when there is an index event recorded (i.e. `len(indexes) > 0`) - in this case we have to flush to respect accurate index timing. This complex condition is written in _ibmeci.py lines 369...375 and I will refer to it as "the condition" below.
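Put as code, my reading of the condition is roughly the following (a paraphrase with simplified names and an assumed buffer size, not the exact code from _ibmeci.py):

```python
# Paraphrase of "the condition" (buffer size assumed; names simplified).
SAMPLES = 3300              # assumed size of the ECI communication buffer, in samples
BUFFER_BYTES = SAMPLES * 2  # samples*2: the same buffer expressed in bytes (16-bit audio)

def should_flush(buffered_bytes: int, incoming_bytes: int, pending_indexes: int) -> bool:
    """Flush only when both the buffered audio and the incoming chunk are at least one
    full buffer long, or when an index event is pending and must be timed accurately."""
    big_enough = buffered_bytes >= BUFFER_BYTES and incoming_bytes >= BUFFER_BYTES
    return big_enough or pending_indexes > 0
```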
To illustrate this further, consider the "rate 0" example I mentioned above, where Eloquence generates audio in 4 chunks: 3300 samples, 3300 samples, 11 samples, and 847 samples.
Here is how the new algorithm would process this sequence:
1. Initially `audioStream` is empty.
2. The first chunk of 3300 samples arrives in `eciCallback()`. Since `audioStream` is empty, the condition evaluates to `False`, so we don't flush anything and simply store the new chunk of audio in `audioStream`.
3. The second chunk of 3300 samples arrives. Now the condition evaluates to `True`, since both the buffered data and the incoming chunk are equal to the buffer size. So we call `playStream()` to flush the contents of `audioStream`, and after that we clear `audioStream` and store the new chunk in it.
4. The third chunk of 11 samples arrives. The condition evaluates to `False`, so we just append it to `audioStream` and don't flush it. At this point, `audioStream` contains the two most recent chunks, or 3311 samples.
5. An index event arrives and is recorded in `indexes`.
6. The fourth chunk of 847 samples arrives. The condition evaluates to `True`, since we have an index stored in the global variable `indexes`. So we call `playStream()` to flush `audioStream`, which has 3311 samples at this moment - please note that we have just prevented crackling by sending the small chunk together with the previous one. Then we again clear `audioStream` and store the new chunk there.
7. At the end of the phrase we call `playStream()` to flush any data still present in `audioStream`, so we flush the remaining 847 samples.
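The same walkthrough can be traced with a tiny simulation (sizes in samples; the names and helper are stand-ins, not the PR's code):

```python
BUFFER = 3300            # assumed ECI buffer size, in samples

buffered = 0             # samples currently sitting in audioStream
pending_indexes = 0      # index events recorded since the last flush

def on_chunk(incoming: int) -> None:
    """Mimic the flush-or-buffer decision made for one audio callback."""
    global buffered, pending_indexes
    if (buffered >= BUFFER and incoming >= BUFFER) or pending_indexes > 0:
        print(f"flush {buffered} samples ({pending_indexes} index events)")
        buffered = 0
        pending_indexes = 0
    buffered += incoming

on_chunk(3300)           # step 2: nothing buffered yet, so just buffer
on_chunk(3300)           # step 3: flush 3300 samples, then buffer the new chunk
on_chunk(11)             # step 4: too small to flush; audioStream now holds 3311
pending_indexes += 1     # step 5: index event recorded
on_chunk(847)            # step 6: index pending, so flush 3311 samples, then buffer 847
print(f"end of phrase: flush the remaining {buffered} samples")  # step 7: 847
```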
Alternatives considered
The algorithm presented above is more complicated and harder to reason about than the alternatives. Another approach I considered was playing with the buffer size. However, no matter what the buffer size is, it will always be possible to find an example where the last chunk generated for a phrase is small, so this would not ultimately solve the crackling problem. The algorithm in this PR therefore seems to be the only solution that effectively prevents sending chunks that are too small to the player, while also respecting the timing of index events.
Testing performed