In previous posts 1 and 2 about speech detection using tensorflow, it is shown how to inference a 1 sec audio sample using the graph that is trained in tensorflow by running label_wav.py. This series of posts will look into inferencing a continuous stream of audio. there is an excellent post by Allan in medium.com which shows how to do the same but, I was not happy with the results and the code was quiet a lot to understand. It uses tensorflow audio functions to process the audio. I will be using pyAudio to process audio since it is easy to understand and later, I may move into tensorflow audio processing. The code posted is running on raspberry pi 3 but it should be able to run on any linux system without any modification.
To get the audio, you need to purchase a usb sound card as shown in the figure below, this is available in ebay/aliexpress or amazon. Connect a 2.5mm mic to it or like I did, scavenge a mic from old electronics and a 2.5mm audio jack and connect it together.
The following python code will record a 1 sec audio and save it as a .wav file. For tensorflow speech recognition we use a sampling rate of 16K (RATE), single channel (CHANNELS) and 1 sec duration (RECORD_SECONDS).
import pyaudio import wave FORMAT = pyaudio.paInt16 CHANNELS = 1 RATE = 16000 CHUNK = 512 RECORD_SECONDS = 1 WAVE_OUTPUT_FILENAME = "file.wav" audio = pyaudio.PyAudio() # start Recording stream = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK) print "recording..." frames =  for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)): data = stream.read(CHUNK) frames.append(data) print "finished recording" # stop Recording stream.stop_stream() stream.close() audio.terminate() waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb') waveFile.setnchannels(CHANNELS) waveFile.setsampwidth(audio.get_sample_size(FORMAT)) waveFile.setframerate(RATE) waveFile.writeframes(b''.join(frames)) waveFile.close()
When you run pyaudio.PyAudio() ALSA may print out errors like the one shown below.
The errors can be removed by commenting out the corresponding devices in /usr/share/alsa/alsa.conf.
Next step is to integrate this to label_wav.py in tensorflow branch: tensorflow/examples/speech_commands/label_wav.py. In the updated file; mod_label_wav.py, I have added a for loop around run_graph() to record a 1 sec wav audio. A new audio sample will be recorded every time when the loop runs and the audio is rewritten with the same file name.
Here is the output. The input file is given as file.wav from the same directory …../speech_commands, the file will be overwritten each time when recording finishes. To start with, create a dummy file file.wav to run the script.
touch file.wav wget https://github.com/kiranjose/python-tensorflow-speech-recognition/blob/master/mod_label_wav.py python3 mod_label_wav.py --graph=./my_frozen_graph.pb --labels=./conv_labels.txt --wav=./file.wav
This is by no means a great method to do speech inferencing. We need to wait for the script to record and again for the next command. But, this is the start. In the next post I will explain how to detect an audio threshold to activate the recording/inferencing. For this I have forked a google assistant speech invoking script written by jeysonmc, this will be the starting point.