Here is the link to Part 1.
Now we know how to loop around the inference function, capture voice for a fixed time, and process it. What we need next is a program that listens to the input stream and measures the audio level. This will help us decide whether we need to capture the audio data or not.
File 1: audio_intensity.py
The following code reads CHUNKs of data from the stream, measures the average intensity, and prints it out so that we know how much ambient noise there is in the background. First we need to figure out the average intensity level (INTENSITY), which gives us a threshold number to check against.
import pyaudio
import wave
import math
import audioop
import time

p = pyaudio.PyAudio()
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 512
RECORD_SECONDS = 1
WAVE_OUTPUT_FILENAME = "file.wav"
INTENSITY = 11000

def audio_int(num_samples=50):
    print('Getting intensity values from mic.')
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    #----------------------checks average noise level-------------------------
    # read num_samples chunks and convert each one to an intensity value
    values = [math.sqrt(abs(audioop.avg(stream.read(CHUNK), 4)))
              for x in range(num_samples)]
    values = sorted(values, reverse=True)
    r = sum(values[:int(num_samples * 0.2)]) / int(num_samples * 0.2)
    #---------------------prints out avg noise level--------------------------
    print('Average audio intensity is', r)
    time.sleep(.1)
    stream.close()
    p.terminate()
    return r

if __name__ == '__main__':
    while True:
        audio_int()  # To measure your mic levels
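The statistic used above (sort the per-chunk intensities and average the loudest 20%) can be tried in isolation on plain numbers, without a microphone. This is just a sketch; the function name `average_of_loudest` is mine, not from the script:

```python
def average_of_loudest(values, fraction=0.2):
    """Average the loudest `fraction` of intensity readings.

    Mirrors the calculation in audio_int(): sort descending and
    take the mean of the top 20% of samples, so a few loud spikes
    dominate the background hum.
    """
    values = sorted(values, reverse=True)
    k = int(len(values) * fraction)
    return sum(values[:k]) / k

# 50 readings: mostly quiet (~100) with five loud spikes
readings = [100.0] * 45 + [9000.0] * 5
print(average_of_loudest(readings))  # 4550.0 (mean of the loudest 10)
```

Because the loudest fifth is averaged rather than the whole run, occasional speech during calibration inflates the result, so calibrate in a quiet room.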
File 2: audio_intensity_trigger.py
In this program, I have added an infinite loop and a check against the INTENSITY level before printing the average audio level. If the room is silent, or there is only background noise, nothing is triggered. I have kept it at '11000'. Make sure you change it according to the output of audio_intensity.py: if its output is, say, 8000, set INTENSITY to 9000 or 10000.
......
......
......
while True:
    cur_data = stream.read(CHUNK)
    # intensity of the current chunk
    r = math.sqrt(abs(audioop.avg(cur_data, 4)))
    if r > INTENSITY:
        print('Average audio intensity is', r)
stream.close()
......
......
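The control flow of the trigger loop can be sketched without PyAudio by feeding it synthetic chunks. Here `intensity()` is a pure-Python stand-in for the `math.sqrt(abs(audioop.avg(...)))` expression, and both function names are mine:

```python
import math

def intensity(chunk_samples):
    # stand-in for math.sqrt(abs(audioop.avg(...))): root of the mean sample
    return math.sqrt(abs(sum(chunk_samples) / len(chunk_samples)))

def triggered_chunks(chunks, threshold):
    # yield only the chunks whose level exceeds the threshold,
    # mirroring the `if r > INTENSITY` check in the loop above
    for chunk in chunks:
        if intensity(chunk) > threshold:
            yield chunk

quiet = [0] * 4
loud = [10000] * 4
print(list(triggered_chunks([quiet, loud, quiet], 50)))  # only the loud chunk
```

With real audio the chunks would come from `stream.read(CHUNK)` instead of a list, but the trigger decision is the same comparison.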
File 3: audio_trigger_save_wav.py
This one waits for the threshold and, once triggered, saves 1 second of audio to a file in wave format, together with the 5 previous voice chunks. This is important: otherwise our recording would not contain the start of the word, or the word would be biased towards the first half of the 1 second with the remaining half empty, and the spectrogram generated by TensorFlow would look chopped off.
......
......
......
# FIFO of the 5 previous chunks (bytes, so they can be joined into the wav)
prev_data0 = b''
prev_data1 = b''
prev_data2 = b''
prev_data3 = b''
prev_data4 = b''
while True:
    # reading current data
    cur_data = stream.read(CHUNK)
    r = math.sqrt(abs(audioop.avg(cur_data, 4)))
    if r > INTENSITY:
        #---------if triggered: file.wav = 5 previous frames + 1 sec of voice---------
        print('Average audio intensity is', r)
        frames = []
        frames.append(prev_data0)
        frames.append(prev_data1)
        frames.append(prev_data2)
        frames.append(prev_data3)
        frames.append(prev_data4)
        frames.append(cur_data)
        #---------------getting 1 second of voice data-----------------
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
        print('finished recording')
        #----------------saving wave file------------------------------
        waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        waveFile.setnchannels(CHANNELS)
        waveFile.setsampwidth(p.get_sample_size(FORMAT))
        waveFile.setframerate(RATE)
        waveFile.writeframes(b''.join(frames))
        waveFile.close()
    #---------shifting previous values through the FIFO of 5 levels---------
    prev_data0 = prev_data1
    prev_data1 = prev_data2
    prev_data2 = prev_data3
    prev_data3 = prev_data4
    prev_data4 = cur_data
stream.close()
......
......
......
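The five prev_data variables implement a FIFO of the last five chunks by hand. A `collections.deque` with `maxlen=5` gives the same behaviour more compactly; this is a sketch of the idea, not the author's code, and `frames_for_trigger` is my name:

```python
from collections import deque

# ring buffer of the last five chunks; the oldest one falls off
# automatically, replacing the prev_data0 ... prev_data4 shuffle
prev_chunks = deque(maxlen=5)

def frames_for_trigger(cur_data):
    """Frames to write when the threshold is crossed: history + current chunk."""
    return list(prev_chunks) + [cur_data]

# quiet chunks just get buffered
for chunk in [b'a', b'b', b'c', b'd', b'e', b'f']:
    prev_chunks.append(chunk)

print(frames_for_trigger(b'LOUD'))
# [b'b', b'c', b'd', b'e', b'f', b'LOUD']
```

Since the deque is bounded, there is no risk of the history buffer growing while the room stays quiet, and the saved wave file still starts five chunks before the trigger point.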
File 4: wav_trigger_inference.py
This is the modified TensorFlow inference file (label_wav.py). I have fused the program audio_trigger_save_wav.py into label_wav.py. The usage is:
cd /tensorflow/examples/speech_commands
touch file.wav    # create a dummy file for the first pass
python3 wav_trigger_inference.py --graph=./my_frozen_graph.pb --labels=./conv_labels.txt --wav=file.wav
The while loop is wrapped around run_graph(). If audio is detected above the threshold, a wave file is captured and handed over for inference. Once the results are printed out, the program continues listening for the next audio.
....
....
....
        waveFile.writeframes(b''.join(frames))
        waveFile.close()
        with open(wav, 'rb') as wav_file:
            wav_data = wav_file.read()
        run_graph(wav_data, labels_list, input_name, output_name, how_many_labels)
    prev_data0 = prev_data1
    prev_data1 = prev_data2
....
....
....
parser.add_argument(
    '--how_many_labels',
    type=int,
    default=1,  # print only the single result with the highest probability
    help='Number of results to show.')
....
....
....
Here is the result. There are some errors during processing since the graph is not accurate; I could train it only to 88% accuracy. More data augmentation is needed to improve the accuracy, and I may need to fiddle with all the switches that TensorFlow provides for training. But this is good enough to create a speech-controlled device using a Raspberry Pi.