Speech detection with Tensorflow 1.4 on Raspberry Pi 3 – Part 2: Live audio inferencing using PyAudio

Here is a link to Part 1.

Now we know how to loop around the inferencing function, capture voice for a fixed time, and process it. What we need next is a program that listens to the input stream and measures the audio level. This will help us decide whether we need to capture the audio data or not.

File 1: audio_intensity.py
The following code reads a CHUNK of data from the stream, measures the average intensity, and prints it out so that we know how much ambient noise there is in the background. First we need to figure out the average intensity level (INTENSITY) so that we have a threshold number to check against.

import pyaudio
import wave
import math
import audioop
import time
 
p = pyaudio.PyAudio() 
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 512 
RECORD_SECONDS = 1
WAVE_OUTPUT_FILENAME = "file.wav"
INTENSITY=11000
 
def audio_int(num_samples=50):
    print ('Getting intensity values from mic.')
    p = pyaudio.PyAudio()

    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    #----------------------checks average noise level-------------------------
    # read num_samples chunks and compute an intensity value for each
    values = [math.sqrt(abs(audioop.avg(stream.read(CHUNK), 4)))
                for x in range(num_samples)]
    values = sorted(values, reverse=True)
    r = sum(values[:int(num_samples * 0.2)]) / int(num_samples * 0.2)
    #---------------------prints out avg noise level--------------------------
    print ('Average audio intensity is', r)
    time.sleep(.1)

    stream.close()
    p.terminate()
    return r


if __name__ == '__main__':
    while True:
        audio_int()  # To measure your mic levels
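The intensity formula can be sanity-checked without a microphone. The sketch below reimplements what `audioop.avg(data, 4)` computes (the buffer read as little-endian signed 32-bit samples, as on the Pi) using only `struct`, and applies the same square-root-of-the-average metric to two synthetic byte buffers. The sample values are made up purely for illustration.

```python
import math
import struct

def intensity(chunk_bytes):
    # Same metric as audio_int(): sqrt of the absolute mean of the
    # buffer interpreted as signed 32-bit samples.
    n = len(chunk_bytes) // 4
    samples = struct.unpack('<%di' % n, chunk_bytes)
    return math.sqrt(abs(sum(samples) // n))

silence = bytes(1024)                            # 256 zero samples
loud = struct.pack('<256i', *([1000000] * 256))  # constant positive level

print(intensity(silence))  # 0.0
print(intensity(loud))     # 1000.0
```

Silence scores 0, while a sustained signal scores high; your mic's real readings will land somewhere in between, which is what the INTENSITY threshold is calibrated against.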

File 2: audio_intensity_trigger.py
In this program, I have added an infinite loop and a check against the INTENSITY level before printing the average audio level. If the room is silent, or there is only background noise, nothing is triggered. I have kept it at '11000'. Make sure you change it according to the output of audio_intensity.py. If its output is, say, 8000, set the intensity to 9000 or 10000.

......
......
......
while True:
  cur_data = stream.read(CHUNK)
  # intensity of the current chunk, same formula as in audio_intensity.py
  r = math.sqrt(abs(audioop.avg(cur_data, 4)))
  if r > INTENSITY:
    print ('Average audio intensity is', r)

stream.close()
......
......

File 3: audio_trigger_save_wav.py
This one will wait for the threshold and, once triggered, save 1 second of audio to a file in wave format together with 5 frames of previous voice chunks. This is important; otherwise our recording will not contain the start of words, or the words will be biased towards the first half of the 1 second while the remaining half stays empty. The spectrogram generated by TensorFlow would then look chopped off.

......
......
......
    # initialize as empty byte strings so b''.join(frames) works on the
    # first trigger, before 5 chunks have been captured
    prev_data0 = b''
    prev_data1 = b''
    prev_data2 = b''
    prev_data3 = b''
    prev_data4 = b''
    while True:
      #reading current data
      cur_data = stream.read(CHUNK)
      # intensity of the current chunk, same formula as in audio_intensity.py
      r = math.sqrt(abs(audioop.avg(cur_data, 4)))
      if (r > INTENSITY):
        #-------------------------------------------------if triggered; file.wav = 5 previous frames + capture 1 sec of voice-------------------------------
        print (' Average audio intensity is r', r)
        frames = []
        frames.append(prev_data0)
        frames.append(prev_data1)
        frames.append(prev_data2)
        frames.append(prev_data3)
        frames.append(prev_data4)
        frames.append(cur_data)
        #---------------getting 1 second of voice data-----------------
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
          data = stream.read(CHUNK)
          frames.append(data)
        print ('finished recording')
        #--------------------saving wave file--------------------------
        waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        waveFile.setnchannels(CHANNELS)
        waveFile.setsampwidth(p.get_sample_size(FORMAT))
        waveFile.setframerate(RATE)
        waveFile.writeframes(b''.join(frames))
        waveFile.close()
      #------------------------------------------------------if not triggered; saving previous values to a FIFO of 5 levels----------------------------------
      prev_data0=prev_data1
      prev_data1=prev_data2
      prev_data2=prev_data3
      prev_data3=prev_data4
      prev_data4=cur_data
    stream.close()
......
......
......
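As a side note, the five prev_data variables above implement a small FIFO of the most recent chunks. The standard library's `collections.deque` with `maxlen=5` performs the same rotation automatically; the short byte strings below are stand-ins for real microphone chunks.

```python
from collections import deque

PRE_TRIGGER_CHUNKS = 5  # matches prev_data0..prev_data4 above
prev = deque([b''] * PRE_TRIGGER_CHUNKS, maxlen=PRE_TRIGGER_CHUNKS)

# simulate seven consecutive reads from the stream
for chunk in (b'c1', b'c2', b'c3', b'c4', b'c5', b'c6', b'c7'):
    prev.append(chunk)  # the oldest chunk falls off automatically

# on a trigger, the recording starts with the buffered chunks
frames = list(prev)
print(frames)  # the five most recent chunks, oldest first
```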

File 4: wav_trigger_inference.py
This is the modified TensorFlow inference file (label_wav.py). I have fused the program audio_trigger_save_wav.py into label_wav.py. The usage is:

cd /tensorflow/examples/speech_commands
touch file.wav  # create a dummy file for the first pass
python3 wav_trigger_inference.py --graph=./my_frozen_graph.pb --labels=./conv_labels.txt --wav=file.wav

The while loop is around run_graph(). If audio is detected above the threshold, a wave file is captured and given for inferencing. Once the results are printed out, it continues listening for the next audio.

....
....
....
      waveFile.writeframes(b''.join(frames))
      waveFile.close()
      with open(wav, 'rb') as wav_file:
        wav_data = wav_file.read()
      run_graph(wav_data, labels_list, input_name, output_name, how_many_labels)
    prev_data0=prev_data1
    prev_data1=prev_data2
....
....
....
  parser.add_argument(
      '--how_many_labels',
      type=int,
      default=1,  # print only the single result with the highest probability
      help='Number of results to show.')
....
....
....
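What --how_many_labels controls can be illustrated in isolation: run_graph() sorts the softmax scores and prints the top k label/score pairs. The labels and scores below are made-up placeholders, not output from the real graph.

```python
labels = ['_silence_', '_unknown_', 'yes', 'no', 'up', 'down']
scores = [0.02, 0.05, 0.81, 0.04, 0.05, 0.03]  # fake softmax output

how_many_labels = 1  # the default: only the most probable label
top = sorted(zip(scores, labels), reverse=True)[:how_many_labels]
for score, label in top:
    print('%s (score = %.5f)' % (label, score))
```

With the default of 1 this prints a single line; raising --how_many_labels to 3 would also show the two runner-up labels.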

Here is the result. There are some errors while processing since the graph is not accurate; I could train it only to 88% accuracy. More data augmentation is needed to improve the accuracy, and I may need to fiddle with all the switches that TensorFlow provides for training. But this is good enough to create a speech-controlled device using a Raspberry Pi.

7 thoughts on “Speech detection with Tensorflow 1.4 on Raspberry Pi 3 – Part 2: Live audio inferencing using PyAudio”

  1. Great! Thanks for your explanation!
    By the way, would there be any way to increase the number of words
    this code can understand?
    It would be much better if it can simply understand one thru ten

    1. Just open tensorflow/examples/speech_commands/train.py

      ———————————————————————————
      If you want to train on your own data, you’ll need to create .wavs with your
      recordings, all at a consistent length, and then arrange them into subfolders
      organized by label. For example, here’s a possible file structure:
      my_wavs >
        up >
          audio_0.wav
          audio_1.wav
        down >
          audio_2.wav
          audio_3.wav
        other >
          audio_4.wav
          audio_5.wav
      You’ll also need to tell the script what labels to look for, using the
      `--wanted_words` argument. In this case, ‘up,down’ might be what you want, and
      the audio in the ‘other’ folder would be used to train an ‘unknown’ category.
      —————————————————————————————

      The difficult part is getting the data to train on. In your case, you need to collect several hundred one-second audio samples for each word from one to ten; the more data samples, the better the result.

      Check out more details here https://www.tensorflow.org/versions/master/tutorials/audio_recognition#custom_training_data

      Happy to see if you manage to get results with your custom data.

      Cheers.
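      The layout quoted above can be created with a few lines. The folder and label names here follow that example, and --data_dir and --wanted_words are train.py's own flags:

      ```python
      import os

      base = 'my_wavs'
      for label in ('up', 'down', 'other'):
          # one subfolder per label, holding that word's .wav recordings
          os.makedirs(os.path.join(base, label), exist_ok=True)

      # drop your one-second 16 kHz recordings into the matching folders, then:
      #   python train.py --data_dir=my_wavs --wanted_words=up,down
      ```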

  2. Sir,
    I have had this problem since the first time I ran your code on the Pi.
    The error is “-9997 Invalid sample rate”, but the code defaults to 16000. Can you help me with it?

      1. I have used a less expensive USB audio adapter from AliExpress. The supported sample rates depend on the hardware used. Which hardware are you using to convert the audio?
