Added sound recognition: emoji and morse code #99

Open
wants to merge 28 commits into
base: master

@vladturcuman vladturcuman commented May 1, 2021

This was forked from the JoshuaAHill:fft_josh branch of the JoshuaAHill/codal-microbit-v2 fork.

The changes here allow the micro:bit to recognise sounds and to transmit and receive morse code data. Sound recognition is a generic implementation, with samples added for the happy, sad, hello, twinkle and soaring emoji sounds.

Overview of the data pipeline

[image: diagram of the data pipeline]

Audio processor

Overview

The audio processor takes in the microphone data (sampled at 10^6 / MIC_SAMPLE_DELTA Hz, currently 11000 Hz) and produces AudioFrameAnalysis data.

An AudioFrameAnalysis holds the fundamental frequencies of a frame - at most MAXIMUM_NUMBER_OF_FREQUENCIES of them, ordered from most to least likely.

Algorithm

The audio processor accumulates microphone data as it comes in and, after collecting audio_samples_number samples, it processes the frame.

It transforms the data from the time domain to the frequency domain using the CMSIS FFT: arm_rfft_fast_f32.

If the standard deviation (std) is lower than std_threshold then the frame is considered silent - no fundamental frequency.

It then filters out the frequencies that have magnitude lower than the mean + std_threshold * std. This ensures that only outlier frequencies are being considered.

It then filters out the neighbour frequencies around the peaks.

Some of these operations are implemented together to optimize the algorithm.
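
The filtering steps above can be sketched in standalone C++ as follows. This is a minimal illustration, not the code in this PR: it assumes the FFT output has already been reduced to a vector of magnitudes, and findPeaks, neighbourRadius and maxPeaks are hypothetical names (maxPeaks plays the role of MAXIMUM_NUMBER_OF_FREQUENCIES). It also performs the steps one after another, whereas the PR combines some of them.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the indices of the dominant bins of a magnitude spectrum,
// strongest first. An empty result means the frame is treated as silent.
// Hypothetical helper: names and signature are illustrative only.
std::vector<std::size_t> findPeaks(const std::vector<float> &mag,
                                   float stdThreshold,
                                   std::size_t neighbourRadius,
                                   std::size_t maxPeaks) {
    assert(!mag.empty() && maxPeaks > 0);

    // Mean and (population) standard deviation of the magnitudes.
    float mean = 0.0f;
    for (float m : mag) mean += m;
    mean /= static_cast<float>(mag.size());
    float variance = 0.0f;
    for (float m : mag) variance += (m - mean) * (m - mean);
    variance /= static_cast<float>(mag.size());
    const float sd = std::sqrt(variance);

    // A flat spectrum has no outliers: the frame is considered silent.
    if (sd < stdThreshold) return {};

    // Keep only the outlier bins: magnitude >= mean + std_threshold * std.
    const float cutoff = mean + stdThreshold * sd;
    std::vector<std::size_t> candidates;
    for (std::size_t i = 0; i < mag.size(); ++i)
        if (mag[i] >= cutoff) candidates.push_back(i);

    // Strongest candidates first.
    std::stable_sort(candidates.begin(), candidates.end(),
                     [&mag](std::size_t a, std::size_t b) { return mag[a] > mag[b]; });

    // Drop candidates that sit next to an already accepted, stronger peak.
    std::vector<std::size_t> peaks;
    for (std::size_t c : candidates) {
        bool nearStronger = false;
        for (std::size_t p : peaks) {
            const std::size_t distance = p > c ? p - c : c - p;
            if (distance <= neighbourRadius) nearStronger = true;
        }
        if (!nearStronger) {
            peaks.push_back(c);
            if (peaks.size() == maxPeaks) break;
        }
    }
    return peaks;
}
```

A bin index converts to a frequency as index * sample_rate / fft_size, so the returned order corresponds to the "most likely first" ordering of AudioFrameAnalysis.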

Emoji

The process of adding new sounds to be recognised is described in this issue.

Sound recogniser

Overview

The sound recogniser takes in data from the audio processor - in the form of AudioFrameAnalysis. It then tries to match the history of frames against samples of sounds.

Sound fingerprint

A sound has multiple sequences, each sequence having multiple samples to account for the randomness. Each sequence also has:

  • a threshold - the maximum absolute difference between the sampled frequency and the heard frequency.
  • maximum number of deviations - the maximum number of data points that can be more than the threshold away from the sampled frequency

A sound also has a maximum number of deviations - the total maximum deviations across all sequences.
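
The threshold and deviation rules above can be illustrated with a small standalone sketch. It assumes data points are integer frequencies and that the heard history has already been aligned with a sample of the same length; countDeviations and matchSequence are hypothetical names, not functions from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Counts the data points of `heard` that are more than `threshold`
// away from the corresponding point of `sample`.
std::size_t countDeviations(const std::vector<int> &heard,
                            const std::vector<int> &sample,
                            int threshold) {
    assert(heard.size() == sample.size());
    std::size_t deviations = 0;
    for (std::size_t i = 0; i < heard.size(); ++i)
        if (std::abs(heard[i] - sample[i]) > threshold) ++deviations;
    return deviations;
}

// A sequence has several samples to account for randomness. Returns the
// smallest deviation count among the samples that stay within
// `maxDeviations`, or -1 if no sample matches.
int matchSequence(const std::vector<int> &heard,
                  const std::vector<std::vector<int>> &samples,
                  int threshold,
                  int maxDeviations) {
    int best = -1;
    for (const std::vector<int> &sample : samples) {
        const int d = static_cast<int>(countDeviations(heard, sample, threshold));
        if (d <= maxDeviations && (best < 0 || d < best)) best = d;
    }
    return best;
}
```

To apply the sound-level limit, a caller would add up the counts returned for each sequence and compare the total with the sound's maximum number of deviations.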

Emoji recogniser

Overview

The emoji recogniser is a subclass of the sound recogniser that defines the actual samples for the emoji sounds. The samples are just the parts of the emoji sounds that can be recognised, i.e. the parts that remain quite consistent across multiple plays of the sound.

Example

Taking the happy sound as an example, there are a few constants defined:

  • happy_sequences - the number of sequences in the happy sound
  • happy_max_deviations - the maximum number of deviations in the sound, where a deviation is a data point that is more than the allowed threshold away from the sampled frequency
  • happy_samples - a 3-dimensional array with the sampled sound:
    • the first dimension is the different sequences
    • the second is the samples in each sequence
    • the third is the data points in each sample of each sequence
  • happy_thresholds - an array with the thresholds for each of the sequences
  • happy_deviations - an array with the maximum deviations for each sequence
  • happy_nr_samples - an array with the number of samples in each sequence

All these are packaged in a Sound struct.

Morse Code

We use the following elements for morse code:

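| Morse code element  | char representation | duration  |
|---------------------|---------------------|-----------|
| dot                 | '.'                 | unit      |
| dash                | '-'                 | unit * 3  |
| element gap         | not represented     | unit      |
| letter gap          | ' '                 | unit * 3  |
| word gap            | ';'                 | unit * 7  |
| end of transmission | '#'                 | unit * 10 |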

Dot and dash are audible, the rest are silent.

The reason 'element gap' is not represented is that it can be easily added appropriately at the time that the morse code is played over the speaker. When interpreting morse played by another source, it can just be ignored.

Morse Encoder

The morse encoder has two uses:

  • it can encode a string of characters into its morse representation.

    • This is done by transforming every character using a map and adding spaces between characters and words when appropriate.
    • At the end, the end of transmission character is added.
    • Whenever a lowercase letter is encountered, it is converted to uppercase.
    • Whenever an unknown character is encountered, it is converted to '&'.
  • it can decode a series of morse elements into the corresponding string.

    • Decoding takes every continuous sequence of dots and dashes and transforms it into the appropriate character using another map.
    • To do this, each continuous sequence of dots and dashes is built up in a small buffer. When any other element is encountered, the buffer is looked up in the map and then reset.
    • All of the generated characters are added to the output string.
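
Both directions can be sketched in standalone C++ as below. This is an illustration under assumptions, not the encoder in this PR: the table only covers letters and '&', the function names are hypothetical, and the exact placement of letter and word gaps (a ' ' between letters, a single ';' between words, a space in the decoded output for ';') is a guess at the behaviour described above.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Partial lookup table (letters and '&' only) used for illustration.
static const std::map<char, std::string> kToMorse = {
    {'A', ".-"},   {'B', "-..."}, {'C', "-.-."}, {'D', "-.."},  {'E', "."},
    {'F', "..-."}, {'G', "--."},  {'H', "...."}, {'I', ".."},   {'J', ".---"},
    {'K', "-.-"},  {'L', ".-.."}, {'M', "--"},   {'N', "-."},   {'O', "---"},
    {'P', ".--."}, {'Q', "--.-"}, {'R', ".-."},  {'S', "..."},  {'T', "-"},
    {'U', "..-"},  {'V', "...-"}, {'W', ".--"},  {'X', "-..-"}, {'Y', "-.--"},
    {'Z', "--.."}, {'&', ".-..."}};

// Encodes text as morse elements, ending with '#'.
std::string encodeMorse(const std::string &text) {
    std::string out;
    bool pendingWordGap = false;  // a space was seen after the last letter
    for (char raw : text) {
        if (raw == ' ') {
            pendingWordGap = !out.empty();
            continue;
        }
        // Lowercase letters are converted to uppercase.
        const char c = static_cast<char>(std::toupper(static_cast<unsigned char>(raw)));
        auto it = kToMorse.find(c);
        if (it == kToMorse.end()) it = kToMorse.find('&');  // unknown -> '&'
        assert(it != kToMorse.end());
        if (!out.empty()) out += pendingWordGap ? ';' : ' ';
        pendingWordGap = false;
        out += it->second;
    }
    out += '#';
    return out;
}

// Decodes morse elements back into text. Dots and dashes are buffered
// and converted when any other element arrives.
std::string decodeMorse(const std::string &elements) {
    std::string out, buffer;
    for (char e : elements) {
        if (e == '.' || e == '-') {
            buffer += e;
            continue;
        }
        if (!buffer.empty()) {
            char decoded = '&';  // assumption: unknown patterns become '&'
            for (const auto &entry : kToMorse)
                if (entry.second == buffer) decoded = entry.first;
            out += decoded;
            buffer.clear();  // the buffer is reset after every lookup
        }
        if (e == ';') out += ' ';
        if (e == '#') break;
    }
    return out;
}
```

With this sketch, encodeMorse("sos") produces "... --- ...#". Because a letter is only emitted when a non-audible element follows it, input that does not end in a gap or '#' loses its last letter.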

Morse Player

The morse player takes strings of characters and converts them into morse code using a morse encoder.

Then, it takes every element one by one and plays it over the speaker.

It also adds 'element gap' frames between consecutive dots and dashes, in order to help differentiate between audible elements.
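
As a rough sketch of that step, the function below turns a string of morse elements into a play schedule, using the durations from the element table and inserting a one-unit silent gap between consecutive audible elements. Tone, elementUnits and buildSchedule are hypothetical names; the real player drives the speaker rather than returning a list.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One step of the schedule: sound or silence for a number of time units.
struct Tone {
    bool audible;
    int units;
};

// Duration of each element in time units; 0 for an unknown element.
int elementUnits(char element) {
    switch (element) {
        case '.': return 1;
        case '-': return 3;
        case ' ': return 3;
        case ';': return 7;
        case '#': return 10;
        default:  return 0;
    }
}

// Builds the schedule for a string of morse elements.
std::vector<Tone> buildSchedule(const std::string &elements) {
    std::vector<Tone> schedule;
    bool previousAudible = false;
    for (char e : elements) {
        const bool audible = (e == '.' || e == '-');
        const int units = elementUnits(e);
        assert(units > 0);  // only the six known elements are accepted
        // Element gap: one silent unit between two audible elements.
        if (audible && previousAudible) schedule.push_back({false, 1});
        schedule.push_back({audible, units});
        previousAudible = audible;
    }
    return schedule;
}
```

Multiplying each units value by the chosen time unit gives the actual duration of each step.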

Morse recogniser

Overview

The morse recogniser takes in data from the audio processor - in the form of AudioFrameAnalysis. It then tries to identify whether the listening frequency was played during that frame. Based on the last frames it sends dots ('.'), dashes ('-'), letter gaps (' '), word gaps (';'), and end-of-transmission ('#') to the connected data sink.

Algorithm

The recogniser first converts the AudioFrameAnalysis data into booleans which represent whether or not the listening frequency was being played in that frame. A timeUnit is calculated based on the sampling frequency and the number of samples taken before analysis by the audio processor - it represents the number of frames a dot should take.

It keeps two counters, zeros and ones, which count for how many timeUnits the frequency has not been played or has been played, respectively. The value synchronised keeps track of the last transition - true if the last transition was from the frequency not being played to it being played, false otherwise. At each transition, the value of zeros or ones is interpreted as the right character and then reset.

All characters are accumulated in a buffer and sent together when the end-of-transmission character ('#') is found.
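
The run-length idea can be shown with a simplified, batch version of the algorithm. This sketch is not the recogniser in this PR: it works on a complete vector of per-frame booleans instead of a stream, it has no synchronised flag, and the cut-off values between elements (2, 5 and 8.5 units) are assumptions chosen as rough midpoints of the durations in the element table. framesToMorse and framesPerUnit are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Converts per-frame "frequency present" flags into morse elements.
// framesPerUnit is the number of frames a dot should take.
std::string framesToMorse(const std::vector<bool> &frames, int framesPerUnit) {
    assert(framesPerUnit > 0);
    std::string out;
    std::size_t start = 0;
    while (start < frames.size()) {
        // Find the end of the current run of identical frames.
        std::size_t end = start;
        while (end < frames.size() && frames[end] == frames[start]) ++end;
        const double units = static_cast<double>(end - start) / framesPerUnit;

        if (frames[start]) {
            // Run of "ones": about 1 unit is a dot, about 3 units a dash.
            out += (units < 2.0) ? '.' : '-';
        } else if (!out.empty()) {
            // Run of "zeros" (silence before the first tone is ignored).
            if (units >= 8.5) {
                out += '#';  // end of transmission
                break;
            }
            if (units >= 5.0) out += ';';       // word gap
            else if (units >= 2.0) out += ' ';  // letter gap
            // Shorter silences are element gaps and produce nothing.
        }
        start = end;
    }
    return out;
}
```

For example, with framesPerUnit = 2, a tone lasting 2 frames is read as a dot and a silence lasting 6 frames as a letter gap.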

Morse Interpreter

This class takes morse data from a morse recogniser and uses a morse encoder to decode it.

It then raises an event to signal that it has finished processing some data.

The last processed data can then be taken from a public field.

Sample main.cpp for emoji recognising

#include "MicroBit.h"
#include "MicroBitAudioProcessor.h"
#include "EmojiRecogniser.h"
#include "StreamNormalizer.h"
#include "StreamSplitter.h"

MicroBitAudioProcessor *fft = NULL;
MicroBitSoundRecogniser *recogniser = NULL;

MicroBit uBit;

void onSound(MicroBitEvent event) {
    recogniser->stopAnalysing();
    ManagedString sound = EmojiRecogniser::getSoundName(event);
    uBit.display.scroll(sound);
    recogniser->startAnalysing();
}

void play_hello(MicroBitEvent) {
    recogniser->stopAnalysing();
    uBit.audio.soundExpressions.play(ManagedString("hello"));
    recogniser->startAnalysing();
}

int main()
{
    uBit.init();
    
    NRF52ADCChannel *mic = uBit.adc.getChannel(uBit.io.microphone);
    mic->setGain(7,0);
    uBit.io.runmic.setDigitalValue(1);
    uBit.io.runmic.setHighDrive(true);

    StreamNormalizer *processor = new StreamNormalizer(mic->output, 1.2f, true, DATASTREAM_FORMAT_8BIT_SIGNED, 10);
    StreamSplitter *splitter = new StreamSplitter(processor->output);
    fft         = new MicroBitAudioProcessor(*splitter, EMOJI_AUDIO_SAMPLES_NUMBER, EMOJI_STD_THRESHOLD);
    recogniser  = new EmojiRecogniser(*fft);

    recogniser->startAnalysing();

    uBit.messageBus.listen(DEVICE_ID_BUTTON_A, DEVICE_BUTTON_EVT_CLICK, play_hello);

    uBit.messageBus.listen(DEVICE_ID_EMOJI_RECOGNISER, DEVICE_EMOJI_RECOGNISER_EVT_HAPPY,   onSound);
    uBit.messageBus.listen(DEVICE_ID_EMOJI_RECOGNISER, DEVICE_EMOJI_RECOGNISER_EVT_HELLO,   onSound);
    uBit.messageBus.listen(DEVICE_ID_EMOJI_RECOGNISER, DEVICE_EMOJI_RECOGNISER_EVT_SAD,     onSound);
    uBit.messageBus.listen(DEVICE_ID_EMOJI_RECOGNISER, DEVICE_EMOJI_RECOGNISER_EVT_SOARING, onSound);
    uBit.messageBus.listen(DEVICE_ID_EMOJI_RECOGNISER, DEVICE_EMOJI_RECOGNISER_EVT_TWINKLE, onSound);
    
    while(1){
        uBit.display.print("-");
        uBit.sleep(1000);
    }   
}

Sample main.cpp for morse transmission

#include "MicroBit.h"
#include "MicroBitAudioProcessor.h"
#include "StreamNormalizer.h"
#include "StreamSplitter.h"
#include "MicroBitMorseInterpreter.h"
#include "MicroBitMorsePlayer.h"

MicroBitAudioProcessor *fft = NULL;
MicroBitMorseRecogniser *recogniser = NULL;
MicroBitMorseInterpreter *interpreter = NULL;

MicroBit uBit;

#define FREQUENCY 2700
#define TIME_UNIT 186

void onMessage(MicroBitEvent){
    interpreter->stopInterpreting();
    uBit.display.scroll( interpreter -> lastMessage );
    interpreter->startInterpreting();
}

void sendMessage(MicroBitEvent){
    interpreter->stopInterpreting();
    MicroBitMorsePlayer mp = MicroBitMorsePlayer(uBit, FREQUENCY, TIME_UNIT);
    mp.play("microbit");
    interpreter->startInterpreting();
}


int main(){
    uBit.init();

    NRF52ADCChannel *mic = uBit.adc.getChannel(uBit.io.microphone);
    mic->setGain(7,0);
    uBit.io.runmic.setDigitalValue(1);
    uBit.io.runmic.setHighDrive(true);

    StreamNormalizer *processor = new StreamNormalizer(mic->output, 1.2f, true, DATASTREAM_FORMAT_8BIT_SIGNED, 10);
    StreamSplitter *splitter = new StreamSplitter(processor->output);
    fft         = new MicroBitAudioProcessor(*splitter, MORSE_AUDIO_SAMPLES_NUMBER, MORSE_STD_THRESHOLD);
    recogniser  = new MicroBitMorseRecogniser(*fft, FREQUENCY, TIME_UNIT);
    interpreter = new MicroBitMorseInterpreter(*recogniser, uBit);

    interpreter->startInterpreting();

    uBit.messageBus.listen(DEVICE_ID_BUTTON_A, DEVICE_BUTTON_EVT_CLICK, sendMessage);
    uBit.messageBus.listen(DEVICE_ID_MORSE_INTERPRETER, DEVICE_MORSE_INTERPRETER_EVT_NEW_MESSAGE, onMessage);


    while(1){
        uBit.display.print(" ");
        uBit.sleep(91500);
    }

}

JoshuaAHill and others added 9 commits March 3, 2021 18:16
[inputpipeline] Set default SPL level to not be active on start
Changed the MicroBitAudioProcessor to also be a DataSource. Added a Hann Window and Harmonic Product Spectrum before and after fft to get more accurate results for square wave detection.

Added MicroBitSoundRecogniser as a DataSink, which recognises given patterns in the frequency data. It connects to the audio processor to get the frequency analysis. The constructor is protected such that the class becomes abstract. EmojiRecogniser inherits it and implements the constructor - adds the happy sound as the sound to be recognised.

To recognise the happy sound the following main.cpp can be used:

#include "MicroBitSerial.h"
#include "MicroBit.h"
#include "CodalDmesg.h"
#include "MicroBitAudioProcessor.h"
#include "EmojiRecogniser.h"
#include "StreamNormalizer.h"
#include "Tests.h"

#include "LevelDetector.h"
#include "StreamSplitter.h"

static NRF52ADCChannel *mic = NULL;
static StreamNormalizer *processor = NULL;
static MicroBitAudioProcessor *fft = NULL;

static LevelDetector *level = NULL;
static StreamSplitter *splitter = NULL;

MicroBitSoundRecogniser *recogniser = NULL;

MicroBit uBit;

void onSound(ManagedString sound) {
    recogniser->stopAnalisying();

    uBit.display.scroll(sound);

    recogniser->startAnalisying(onSound);
}

int main()
{
    uBit.init();

    NRF52ADCChannel *mic = uBit.adc.getChannel(uBit.io.microphone);
    mic->setGain(7,0);
    uBit.io.runmic.setDigitalValue(1);
    uBit.io.runmic.setHighDrive(true);

    StreamNormalizer *processor = new StreamNormalizer(mic->output, 1.2f, true, DATASTREAM_FORMAT_8BIT_SIGNED, 10);
    StreamSplitter *splitter = new StreamSplitter(processor->output);
    fft         = new MicroBitAudioProcessor(*splitter);
    recogniser  = new EmojiRecogniser(*fft, uBit);

    recogniser->startAnalisying(onSound);

    while(1){
        uBit.display.print("-");
        uBit.sleep(1000);
    }

}
@vladturcuman vladturcuman changed the title Added the sound recognition: emoji and morse code Added sound recognition: emoji and morse code May 1, 2021
@vladturcuman vladturcuman marked this pull request as ready for review May 6, 2021 10:11
@JohnVidler JohnVidler added this to the future-placeholder milestone Jul 4, 2022
@JohnVidler
Collaborator

I think this needs review and some further development before we can merge it. It's been a while and the base classes for the audio pipeline have changed.

It all looks very cool in theory however, so I'll add this to my 'to test' list to see how it works with the upstream changes.

@JohnVidler JohnVidler self-assigned this Jan 25, 2023
@JohnVidler JohnVidler added the enhancement New feature or request label Jan 25, 2023
@microbit-carlos microbit-carlos removed this from the future-placeholder milestone Feb 27, 2024