Speech detection with TensorFlow 1.4 on Raspberry Pi 3 – Part 2: Live audio inferencing using PyAudio

Here is the link to Part 1.

Now we know how to loop around the inferencing function, capture voice for a fixed time and process it. What we need next is a program that listens to the input stream and measures the audio level. This will help us decide whether to capture the audio data or not.

File 1: audio_intensity.py
The following code reads a CHUNK of data from the stream, measures its average intensity and prints it, so that we know how much ambient noise there is in the background. We first need to figure out this average intensity level so that we have a threshold number (INTENSITY) to check against.

import pyaudio
import wave
import math
import audioop
import time
 
p = pyaudio.PyAudio() 
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 512 
RECORD_SECONDS = 1
WAVE_OUTPUT_FILENAME = "file.wav"
INTENSITY=11000
 
def audio_int(num_samples=50):
    print ('Getting intensity values from mic.')
    p = pyaudio.PyAudio()

    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
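    #----------------------checks average noise level-------------------------
    cur_data = stream.read(CHUNK)
    # Note: width 4 treats the 16-bit stream as 32-bit values; the thresholds
    # in this post were measured with this setting, so it is kept as-is.
    values = [math.sqrt(abs(audioop.avg(cur_data, 4)))
                for x in range(num_samples)]
    values = sorted(values, reverse=True)
    # Average of the loudest 20% of the values
    r = sum(values[:int(num_samples * 0.2)]) / int(num_samples * 0.2)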
    #---------------------prints out avg noise level--------------------------
    print (' Average audio intensity is r', r)
    time.sleep(.1)

    stream.close()
    p.terminate()
    return r


if __name__ == '__main__':
    while True:
        audio_int()  # To measure your mic levels
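
The measure used above sorts the readings and averages the loudest 20% of them. Here is a minimal standalone sketch of just that step, using made-up readings instead of microphone data. The function name `loudest_fraction_mean` is hypothetical and is not part of audio_intensity.py.

```python
def loudest_fraction_mean(values, fraction=0.2):
    """Average the loudest `fraction` of the readings."""
    count = int(len(values) * fraction)
    if count == 0:
        raise ValueError("need enough readings to take a non-empty fraction")
    loudest = sorted(values, reverse=True)[:count]
    return sum(loudest) / count


# Made-up readings, for illustration only
readings = [100, 900, 300, 800, 200, 700, 400, 600, 500, 1000]
print(loudest_fraction_mean(readings))
```

Note that audio_int() as listed reads a single chunk before the list comprehension, so all of its values are identical and the result is simply that chunk's reading.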

File 2: audio_intensity_trigger.py
In this program, I have added an infinite loop and a check against the INTENSITY level before printing the average audio level. If the room is silent or there is only background noise, nothing is triggered. I have kept it at 11000. Make sure that you change it according to the output of audio_intensity.py. If its output is, say, 8000, set INTENSITY to 9000 or 10000.

......
......
......
while True:
  cur_data = stream.read(CHUNK)
  values = [math.sqrt(abs(audioop.avg(cur_data, 4)))
            for x in range(num_samples)]
  values = sorted(values, reverse=True)
  r = sum(values[:int(num_samples * 0.2)]) / int(num_samples * 0.2)
  #print " Finished "
  if (r > INTENSITY):
    print (' Average audio intensity is r', r)

stream.close()
......
......
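
One possible way to turn the ambient readings into a threshold is to take the loudest quiet-room reading and add a margin. This is only a sketch: the helper name `suggest_threshold` and the 20% margin are my assumptions, chosen so that a reading of 8000 lands between 9000 and 10000 as suggested above.

```python
def suggest_threshold(ambient_readings, margin=1.2):
    """Return a trigger level a fixed margin above the loudest ambient reading."""
    if not ambient_readings:
        raise ValueError("no ambient readings supplied")
    return max(ambient_readings) * margin


# Hypothetical values printed by audio_intensity.py in a quiet room
ambient = [7600, 8000, 7900]
INTENSITY = suggest_threshold(ambient)
print(INTENSITY)
```

If normal speech does not trigger, lower the margin; if background noise triggers, raise it.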

File 3: audio_trigger_save_wav.py
This one waits for the threshold and, once triggered, saves 1 second of audio to a file in wave format together with the 5 previous chunks of audio. This is important; otherwise our recording will not contain the start of words, or the words will be biased towards the first half of the 1 second and the remaining half will be empty. The spectrogram generated by TensorFlow will then look chopped off.

......
......
......
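    # Start with empty byte strings so b''.join(frames) also works
    # when the trigger fires before five chunks have been read
    prev_data0=b''
    prev_data1=b''
    prev_data2=b''
    prev_data3=b''
    prev_data4=b''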
    while True:
      #reading current data
      cur_data = stream.read(CHUNK)
      values = [math.sqrt(abs(audioop.avg(cur_data, 4)))
                for x in range(num_samples)]
      values = sorted(values, reverse=True)
      r = sum(values[:int(num_samples * 0.2)]) / int(num_samples * 0.2)
      if (r > INTENSITY):
        #-------------------------------------------------if triggered; file.wav = 5 previous frames + capture 1 sec of voice-------------------------------
        print (' Average audio intensity is r', r)
        frames = []
        frames.append(prev_data0)
        frames.append(prev_data1)
        frames.append(prev_data2)
        frames.append(prev_data3)
        frames.append(prev_data4)
        frames.append(cur_data)
        #---------------getting 1 second of voice data-----------------
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
          data = stream.read(CHUNK)
          frames.append(data)
        print ('finished recording')
        #-------------     ---saving wave file-------------------------
        waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        waveFile.setnchannels(CHANNELS)
        waveFile.setsampwidth(p.get_sample_size(FORMAT))
        waveFile.setframerate(RATE)
        waveFile.writeframes(b''.join(frames))
        waveFile.close()
      #------------------------------------------------------if not triggered; saving previous values to a FIFO of 5 levels----------------------------------
      prev_data0=prev_data1
      prev_data1=prev_data2
      prev_data2=prev_data3
      prev_data3=prev_data4
      prev_data4=cur_data
    stream.close()
......
......
......
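
The five prev_data variables act as a small FIFO. The same idea can be sketched with `collections.deque`, which drops the oldest entry automatically once `maxlen` is reached. This is an alternative sketch, not the code from audio_trigger_save_wav.py; the helper `assemble_recording` is hypothetical and the chunks are stand-in bytes rather than microphone data.

```python
from collections import deque

PRE_ROLL_CHUNKS = 5  # same depth as prev_data0..prev_data4


def assemble_recording(pre_roll, trigger_chunk, following_chunks):
    """Join pre-roll, the chunk that crossed the threshold, and what follows."""
    return b''.join(list(pre_roll) + [trigger_chunk] + list(following_chunks))


# Stand-in chunks; in the real loop each would come from stream.read(CHUNK)
pre_roll = deque(maxlen=PRE_ROLL_CHUNKS)
for i in range(8):
    pre_roll.append(bytes([i]) * 4)  # oldest entries fall off automatically

recording = assemble_recording(pre_roll, b'\xff' * 4, [b'\xee' * 4])
print(len(pre_roll), len(recording))
```

In the real loop, `pre_roll.append(cur_data)` would go at the end of every iteration, where the prev_data shuffle is now. With the constants used in this post, a full recording holds 5 + 1 + 31 = 37 chunks of 512 samples, which is about 1.18 seconds at 16000 Hz.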

File 4: wav_trigger_inference.py
This is the modified TensorFlow inference file (label_wav.py). I have fused the program audio_trigger_save_wav.py into label_wav.py. The usage is:

cd /tensorflow/examples/speech_commands
touch file.wav  # create a dummy file for the first pass
python3 wav_trigger_inference.py --graph=./my_frozen_graph.pb --labels=./conv_labels.txt --wav=file.wav

The while loop is around run_graph(). If audio is detected and is above the threshold, a wave file is captured and given for inferencing. Once the results are printed out, it continues listening for the next audio.

....
....
....
      waveFile.writeframes(b''.join(frames))
      waveFile.close()
      with open(wav, 'rb') as wav_file:
        wav_data = wav_file.read()
      run_graph(wav_data, labels_list, input_name, output_name, how_many_labels)
    prev_data0=prev_data1
    prev_data1=prev_data2
....
....
....
  parser.add_argument(
      '--how_many_labels',
      type=int,
      default=1,  # print only the single result with the highest probability
      help='Number of results to show.')
....
....
....
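
To illustrate what limiting the output to one label means, here is a small sketch that ranks labels by score and keeps the best few. It is not the code from label_wav.py; the function name `top_labels`, the labels and the scores are all made up for illustration.

```python
def top_labels(scores, labels, how_many=1):
    """Return the `how_many` highest-scoring (label, score) pairs, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:how_many]


# Made-up labels and scores, for illustration only
labels = ['_silence_', '_unknown_', 'yes', 'no']
scores = [0.02, 0.08, 0.85, 0.05]
for label, score in top_labels(scores, labels, how_many=1):
    print('%s (score = %.5f)' % (label, score))
```

With how_many=1 only the most probable word is reported, which is usually what a voice-controlled device needs.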

Here is the result. There are some errors while processing since the graph is not accurate. I could train it only to 88% accuracy. More data augmentation is needed to improve the accuracy, and I may need to fiddle with all the switches that TensorFlow provides for training. But this is good enough to create a speech-controlled device using a Raspberry Pi.

Video inferencing on a neural network trained using NVIDIA DIGITS with OpenCV

I have been playing with the inferencing code for some time. Here is real-time video inferencing that uses OpenCV to capture video and step through the frames. The overall frame rate is low because the system is slow. In the video, ‘frame’ is the normalised image the Caffe network sees after the mean image is subtracted, and ‘frame2’ is the input image.

The Caffe model was trained in NVIDIA DIGITS using GoogLeNet (SGD, 100 epochs); it reached 100% accuracy by epoch 76.

Here is the inferencing code.


import numpy as np
import matplotlib.pyplot as plt
import caffe
import time
import cv2
from skimage import io

# Open the default camera
cap = cv2.VideoCapture(0)

MODEL_FILE = './deploy.prototxt'
PRETRAINED = './snapshot_iter_4864.caffemodel'
MEAN_IMAGE = './mean.jpg'
# Caffe: load the mean image and the trained network
mean_image = caffe.io.load_image(MEAN_IMAGE)
caffe.set_mode_gpu()
net = caffe.Classifier(MODEL_FILE, PRETRAINED,
                       channel_swap=(2, 1, 0),
                       raw_scale=255,
                       image_dims=(256, 256))
# OpenCV capture loop
while(True):
    start = time.time()
    ret, frame = cap.read()
    if not ret:
        # No frame was returned by the camera; stop the loop
        break
    resized_image = cv2.resize(frame, (256, 256)) 
    cv2.imwrite("frame.jpg", resized_image)
    IMAGE_FILE = './frame.jpg'
    im2 = caffe.io.load_image(IMAGE_FILE)
    inferImg = im2 - mean_image
    #print ("Shape------->",inferImg.shape)
    #Inferencing
    prediction = net.predict([inferImg])
    end = time.time()
    pred=prediction[0].argmax()
    #print ("prediction -> ",prediction[0]) 
    if pred == 0:
       print("cat")
    else:
       print("dog")
    #Opencv display
    cv2.imshow('frame',inferImg)
    cv2.imshow('frame2',im2)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
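
The loop above writes every frame to frame.jpg and reloads it with caffe.io.load_image, which returns an RGB float image with values in [0, 1]. A possible way to skip the round trip through the disk is to do that conversion in memory with NumPy. This is an untested sketch under the assumption that the frame is OpenCV's usual BGR uint8 array and has already been resized to match the mean image; the function name `bgr_frame_to_caffe_input` is hypothetical. Because it skips JPEG compression, the pixel values will differ slightly from the original code.

```python
import numpy as np


def bgr_frame_to_caffe_input(frame_bgr, mean_image):
    """Convert a BGR uint8 frame to RGB floats in [0, 1] and subtract the mean."""
    rgb = frame_bgr[:, :, ::-1].astype(np.float32) / 255.0
    return rgb - mean_image


# Synthetic 2x2 pure-blue frame (BGR order) and a zero mean image
frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame[:, :, 0] = 255
mean = np.zeros((2, 2, 3), dtype=np.float32)
result = bgr_frame_to_caffe_input(frame, mean)
print(result[0, 0])
```

In the capture loop this would replace the cv2.imwrite and caffe.io.load_image pair, with resized_image and mean_image as the arguments.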

 

 

Inferencing on the trained caffe model from NVIDIA DIGITS

In this post I will explain how to run inference from the command line on a trained network created with NVIDIA DIGITS. Here is the link to the previous post.

In the DIGITS UI, we have to upload a file on the model web page to do inferencing. This is time-consuming and not practical for real-world applications. We need to deploy the trained model as a standalone Python application.
To achieve this, we need to download the trained model from the NVIDIA DIGITS model page. This will download a .tgz file to your computer. Extract the .tgz file using this command:

tar -xvzf filename.tgz
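
If you would rather do this step from Python, the standard tarfile module can extract the archive and list what it contains. The sketch below builds a tiny stand-in archive so that it runs on its own; the helper name `extract_model_archive` and the placeholder file are my assumptions, not something DIGITS provides.

```python
import os
import tarfile
import tempfile


def extract_model_archive(archive_path, dest_dir):
    """Extract a .tgz archive and return the sorted names of its members."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as archive:
        names = sorted(archive.getnames())
        archive.extractall(dest_dir)
    return names


# Self-contained demo: build a tiny stand-in archive, then extract it
work_dir = tempfile.mkdtemp()
sample = os.path.join(work_dir, 'deploy.prototxt')
with open(sample, 'w') as handle:
    handle.write('# placeholder\n')
archive_path = os.path.join(work_dir, 'filename.tgz')
with tarfile.open(archive_path, 'w:gz') as archive:
    archive.add(sample, arcname='deploy.prototxt')

out_dir = os.path.join(work_dir, 'model')
extracted = extract_model_archive(archive_path, out_dir)
print(extracted)
```

Only extract archives you trust, such as the one you downloaded from your own DIGITS server.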
 

Save the ‘Image mean’ image file from the datasets page of NVIDIA DIGITS to your computer.


Provide the paths for the following in the Python script below:

'Image mean' file -> e.g. '/home/catsndogs/mean.jpg'
deploy.prototxt -> e.g. '/home/catsndogs/deploy.prototxt'
caffemodel -> e.g. '/home/catsndogs/snapshot_iter_480.caffemodel'
input image to test -> e.g. '/home/catsndogs/image_to_test.jpg'

import numpy as np
import matplotlib.pyplot as plt
import caffe
import time
from PIL import Image

MODEL_FILE = '/home/catsndogs/deploy.prototxt'
PRETRAINED = '/home/catsndogs/snapshot_iter_480.caffemodel'
MEAN_IMAGE = '/home/catsndogs/mean.jpg'
# Load the mean image (float RGB values in [0, 1])
mean_image = caffe.io.load_image(MEAN_IMAGE)
# Path of the image file to be tested
IMAGE_FILE = '/home/catsndogs/image_to_test.jpg'
im1 = Image.open(IMAGE_FILE).convert('RGB')
# Tell Caffe to use the GPU
caffe.set_mode_gpu()
# Initialize the Caffe model using the model trained in DIGITS
net = caffe.Classifier(MODEL_FILE, PRETRAINED,
                       channel_swap=(2, 1, 0),
                       raw_scale=255,
                       image_dims=(256, 256))
# Display the input image (call plt.show() to open the window when run as a script)
plt.imshow(im1)
# Resize the image, subtract the mean and make a class prediction
start = time.time()
resized = im1.resize((256, 256), Image.NEAREST)
# Scale to floats in [0, 1] so the image matches mean_image and raw_scale=255
inferImg = np.asarray(resized, dtype=np.float32) / 255.0
inferImg -= mean_image
prediction = net.predict([inferImg])
end = time.time()
pred = prediction[0].argmax()
print(pred)
if pred == 0:
    print("cat")
else:
    print("dog")
# Display total time to perform inference
print('Total inference time: ' + str(end - start) + ' seconds')

Run the file with

python catsndogs.py

for inferencing.