Speech detection with Tensorflow 1.4 on Raspberry Pi 3 – Part 2: Live audio inferencing using PyAudio

Here is a link to Part 1.

Now we know how to loop around the inference function, capture voice for a fixed time, and process it. What we need next is a program that listens to the input stream and measures the audio level. This will help us decide whether we need to capture the audio data or not.

File 1: audio_intensity.py
The following code reads chunks of data from the stream, measures their average intensity, and prints it out so that we know how much ambient noise there is in the background. First we need to figure out the average intensity level (INTENSITY) so that we have a threshold number to check against.

import pyaudio
import wave
import math
import audioop
import time

FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 512
RECORD_SECONDS = 1
WAVE_OUTPUT_FILENAME = "file.wav"
INTENSITY = 11000

def audio_int(num_samples=50):
    print('Getting intensity values from mic.')
    p = pyaudio.PyAudio()

    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    #----------------------checks average noise level-------------------------
    # read num_samples chunks from the mic and compute a loudness value
    # for each (audioop.avg() is kept at a width of 4 to match the original
    # script, although paInt16 samples are 2 bytes wide)
    values = [math.sqrt(abs(audioop.avg(stream.read(CHUNK), 4)))
              for x in range(num_samples)]
    # average the loudest 20% of the readings
    values = sorted(values, reverse=True)
    r = sum(values[:int(num_samples * 0.2)]) / int(num_samples * 0.2)
    #---------------------prints out avg noise level--------------------------
    print(' Average audio intensity is', r)
    time.sleep(.1)

    stream.close()
    p.terminate()
    return r


if __name__ == '__main__':
    while True:
        audio_int()  # To measure your mic levels

File 2: audio_intensity_trigger.py
In this program, I have added an infinite loop and a check against the INTENSITY level before printing the average audio level. If the room is silent, or there is only background noise, nothing is triggered. I have kept it at '11000'. Make sure you change it according to the output of audio_intensity.py: if that prints, say, 8000, set INTENSITY to 9000 or 10000.

......
......
......
while True:
  cur_data = stream.read(CHUNK)
  # loudness of the current chunk, measured as in audio_intensity.py
  r = math.sqrt(abs(audioop.avg(cur_data, 4)))
  if r > INTENSITY:
    print(' Average audio intensity is', r)

stream.close()
......
......

File 3: audio_trigger_save_wav.py
This one waits for the threshold to be crossed; once triggered, it saves 1 second of audio to a file in wave format, together with the 5 previous voice chunks. This is important: without them the recording would miss the start of the word, or the word would be squeezed into the first half of the 1 second with the remaining half empty, and the spectrogram generated by TensorFlow would look chopped off. At 16 kHz and 512-sample chunks, those 5 chunks add 5 × 512 / 16000 ≈ 160 ms of pre-trigger audio.

......
......
......
    # previous chunks start as empty bytes so that b''.join() works
    # on the first trigger, before the FIFO has filled up
    prev_data0 = b''
    prev_data1 = b''
    prev_data2 = b''
    prev_data3 = b''
    prev_data4 = b''
    while True:
      # reading current data
      cur_data = stream.read(CHUNK)
      # loudness of the current chunk
      r = math.sqrt(abs(audioop.avg(cur_data, 4)))
      if r > INTENSITY:
        #---------------if triggered: file.wav = 5 previous chunks + 1 sec of voice---------------
        print(' Average audio intensity is', r)
        frames = [prev_data0, prev_data1, prev_data2,
                  prev_data3, prev_data4, cur_data]
        #---------------getting 1 second of voice data-----------------
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
          data = stream.read(CHUNK)
          frames.append(data)
        print('finished recording')
        #--------------------saving wave file--------------------------
        waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        waveFile.setnchannels(CHANNELS)
        waveFile.setsampwidth(p.get_sample_size(FORMAT))
        waveFile.setframerate(RATE)
        waveFile.writeframes(b''.join(frames))
        waveFile.close()
      #---------------shift previous chunks through a 5-level FIFO (runs every iteration)---------------
      prev_data0 = prev_data1
      prev_data1 = prev_data2
      prev_data2 = prev_data3
      prev_data3 = prev_data4
      prev_data4 = cur_data
    stream.close()
......
......
......
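As an aside, shifting five separate prev_dataN variables by hand works, but Python's collections.deque with maxlen gives the same fixed-length pre-buffer more compactly. A minimal sketch of the idea (my rewrite, not from the original script; stream, CHUNK, RATE, RECORD_SECONDS and INTENSITY are as above):

import collections

prev_chunks = collections.deque(maxlen=5)  # keeps only the 5 newest chunks

while True:
    cur_data = stream.read(CHUNK)
    r = math.sqrt(abs(audioop.avg(cur_data, 4)))
    if r > INTENSITY:
        # pre-trigger chunks first, then the current one
        frames = list(prev_chunks) + [cur_data]
        for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
            frames.append(stream.read(CHUNK))
        # ...save frames to a wave file exactly as above...
    prev_chunks.append(cur_data)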

File 4: wav_trigger_inference.py
This is the modified TensorFlow inference file (label_wav.py). I have fused the program audio_trigger_save_wav.py into label_wav.py. The usage is:

cd /tensorflow/examples/speech_commands
touch file.wav  # create a dummy file for the first pass
python3 wav_trigger_inference.py --graph=./my_frozen_graph.pb --labels=./conv_labels.txt --wav=file.wav

The while loop is wrapped around run_graph(). If audio is detected above the threshold, a wave file is captured and handed over for inference. Once the results are printed, the script continues listening for the next audio.

....
....
....
      waveFile.writeframes(b''.join(frames))
      waveFile.close()
      with open(wav, 'rb') as wav_file:
        wav_data = wav_file.read()
      run_graph(wav_data, labels_list, input_name, output_name, how_many_labels)
    prev_data0=prev_data1
    prev_data1=prev_data2
....
....
....
  parser.add_argument(
      '--how_many_labels',
      type=int,
      default=1,  # print only the single label with the highest probability
      help='Number of results to show.')
....
....
....

Here is the result. There are some errors while processing, since the graph is not accurate: I could train it only to 88% accuracy. More data augmentation is needed to improve the accuracy, and I may need to fiddle around with all the switches that TensorFlow provides for training. But this is good enough to create a speech-controlled device using a Raspberry Pi.

Speech detection with Tensorflow 1.4 on Raspberry Pi 3 – Part 1: Getting audio file using PyAudio

In the previous posts about speech detection using TensorFlow, it was shown how to run inference on a 1 sec audio sample using the graph trained in TensorFlow, by running label_wav.py. This series of posts looks into inferencing a continuous stream of audio. There is an excellent post by Allan on medium.com that shows how to do the same, but I was not happy with the results, and there was quite a lot of code to understand. It uses TensorFlow's audio functions to process the audio. I will be using PyAudio instead, since it is easy to understand; later I may move to TensorFlow audio processing. The code posted here runs on a Raspberry Pi 3, but it should run on any Linux system without modification.

To get the audio, you need to buy a USB sound card like the one shown in the figure below; these are available on eBay, AliExpress, or Amazon. Connect a 2.5mm mic to it or, like I did, scavenge a mic from old electronics, wire it to a 2.5mm audio jack, and connect that.

[Image: USB sound card with microphone, Pi NoIR camera, and earphones attached to a Raspberry Pi 3]

The following Python code records 1 sec of audio and saves it as a .wav file. For TensorFlow speech recognition we use a sampling rate of 16 kHz (RATE), a single channel (CHANNELS), and a 1 sec duration (RECORD_SECONDS). Note that the recording loop runs int(RATE / CHUNK * RECORD_SECONDS) = int(16000 / 512) = 31 times, capturing 31 × 512 = 15,872 samples, or just under a full second.

import pyaudio
import wave
 
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 512 
RECORD_SECONDS = 1
WAVE_OUTPUT_FILENAME = "file.wav"
 
audio = pyaudio.PyAudio()
 
# start Recording
stream = audio.open(format=FORMAT, channels=CHANNELS,
                rate=RATE, input=True,
                frames_per_buffer=CHUNK)
print "recording..."
frames = []
 
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print "finished recording"
 
 
# stop Recording
stream.stop_stream()
stream.close()
audio.terminate()
 
waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
waveFile.setnchannels(CHANNELS)
waveFile.setsampwidth(audio.get_sample_size(FORMAT))
waveFile.setframerate(RATE)
waveFile.writeframes(b''.join(frames))
waveFile.close()
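
To sanity-check the recording, the file can be read back with the standard wave module (a quick check I added; not part of the original script):

import wave

with wave.open("file.wav", 'rb') as wf:
    # expect 1 channel, 16000 Hz and 31 * 512 = 15872 frames
    print(wf.getnchannels(), wf.getframerate(), wf.getnframes())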

When you run pyaudio.PyAudio(), ALSA may print out errors like the ones shown below.

[Image: ALSA error messages printed on Raspberry Pi 3 when PyAudio starts]

The errors can be removed by commenting out the corresponding devices in /usr/share/alsa/alsa.conf.
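
The exact entries depend on which devices ALSA complains about; typically they are device definitions the sound card does not provide, which can be commented out like this (illustrative lines, check your own alsa.conf):

#pcm.rear cards.pcm.rear
#pcm.center_lfe cards.pcm.center_lfe
#pcm.side cards.pcm.side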

[Image: the corresponding device entries commented out in /usr/share/alsa/alsa.conf]

The next step is to integrate this with label_wav.py from the TensorFlow tree: tensorflow/examples/speech_commands/label_wav.py. In the updated file, mod_label_wav.py, I have added a loop around run_graph() to record a 1 sec wav audio. A new audio sample is recorded every time the loop runs, and the audio is rewritten to the same file name.
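
A minimal sketch of the modification (record_wav() is a hypothetical helper standing in for the PyAudio recording code above; run_graph(), labels_list, input_name, output_name and how_many_labels are from label_wav.py):

# inside label_wav(), after the graph and labels are loaded
while True:  # loop; the actual mod_label_wav.py may bound this differently
    record_wav(WAVE_OUTPUT_FILENAME)  # hypothetical helper: re-records file.wav
    with open(WAVE_OUTPUT_FILENAME, 'rb') as wav_file:
        wav_data = wav_file.read()
    run_graph(wav_data, labels_list, input_name, output_name, how_many_labels)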

[Image: mod_label_wav.py running audio inference on Raspberry Pi 3]

Here is the output. The input file is given as file.wav in the same directory, …../speech_commands, and the file is overwritten each time a recording finishes. To start with, create a dummy file.wav so the script has something to read on the first pass:

touch file.wav
wget https://github.com/kiranjose/python-tensorflow-speech-recognition/blob/master/mod_label_wav.py
python3 mod_label_wav.py --graph=./my_frozen_graph.pb --labels=./conv_labels.txt --wav=./file.wav

[Image: inference results printed on Raspberry Pi 3]

Files:
1. pyAudio_wav.py

2. mod_label_wav.py

This is by no means a great method for speech inferencing: we have to wait for the script to record, and then wait again for the next command. But it is a start. In the next post I will explain how to detect an audio threshold to activate the recording/inferencing. For this I have forked a Google Assistant speech-invoking script written by jeysonmc, which will be the starting point.