Wednesday, 11 January 2017

Building an Amazon Echo-Like Device with a Raspberry Pi and the Google Cloud Speech API

In my previous post I showed how I wrote a Python script to read out the latest news headlines using Google's text-to-speech API. As I commented in that post, voice recognition and talking devices seem to be the in thing with the release of the Amazon Echo and Google Home.

In this post I show how I created a Python script to record sound on your Raspberry Pi, invoke the Google Cloud Speech API to interpret what was said, and then perform a command on your Raspberry Pi - so a bit like a basic Amazon Echo.

Setting up your mic

Before I get into the Python code, you need a mic set up. As the Raspberry Pi does not have an audio input, you will need a USB mic or a webcam with a built-in mic. I went for the latter and used a basic webcam from Logitech.

Once you have your mic plugged in, follow the instructions under "Step 1: Checking Your Microphone" at: https://diyhacking.com/best-voice-recognition-software-for-raspberry-pi/

Install prerequisites

There is one Python library you need, PycURL, which is used to send data to the Google Cloud Speech API. Follow the installation instructions here: http://pycurl.io/docs/latest/install.html

You will also need to install SoX, an open-source tool for processing and analysing sound files. The script uses it to check whether the recorded audio contains any sound before sending it to the Google API.

You can install this by running:

 sudo apt-get install sox

One more thing to install is flac. It is used to encode the recording as FLAC, a lossless format that the Google API accepts.

You can install this by running:

 sudo apt-get install flac
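
To show how these tools fit together, here is a rough sketch of the recording step: arecord captures from the mic and pipes the audio into flac. The flags are the ones quoted in the comments at the end of this post. The function names and the default device plughw:1,0 are my own assumptions for illustration, and your mic may well sit on a different card number.

```python
import subprocess


def build_record_command(duration, filename, device="plughw:1,0"):
    """Return a shell pipeline that records `duration` seconds of mono
    16 kHz audio with arecord and encodes it to FLAC.

    `device` is an assumed default; check `arecord -l` for your own mic.
    """
    return (
        "arecord -D " + device + " -f cd -c 1 -t wav -d " + str(duration)
        + " -q -r 16000 | flac - -s -f --best --sample-rate 16000 -o " + filename
    )


def record(duration, filename="test.flac"):
    """Run the pipeline through the shell and return its exit status."""
    # The filename is not shell-quoted, so keep it free of spaces.
    return subprocess.call(build_record_command(duration, filename), shell=True)
```

Printing the result of build_record_command(2, "test.flac") and pasting it into a terminal is a quick way to check that both arecord and flac are installed.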

Setup Google Cloud Speech Api

To do the speech-to-text processing I am using the Speech API, which is part of Google Cloud. It is in beta at the moment and offers a free trial.

Follow the instructions on their site to get your API key, which will be needed in the script:

The current downside I've found with this API is the latency. It currently takes 5-6 seconds to get a response for a 2 second audio file. The Google help files say the response time should be similar to the length of the audio being processed.

Python Script

Now to the actual python code. 

All the files required can be downloaded from here:


The main file to look at is speechAnalyser.py.

This script does the following:

1. If no audio is playing (you don't want to record if you're playing something on your speakers), records sound from your microphone for 2 seconds.
2. Uses SoX to check whether the file contains any sound above a certain amplitude - this avoids processing silence or background noise.
3. If there is sound at a sufficient amplitude, sends the audio to the Google API in a JSON message. As said earlier, the Google API takes 5-6 seconds and returns a JSON message with the words detected.
4. If the trigger word, in this case "Jarvis", was said during these two seconds, plays a beep sound.
5. Records another 3 seconds to listen for the user speaking a command, and sends it to the Google API as in step 3.
6. Checks whether a keyword is found in the returned text and executes the appropriate command. For example, if "news" is mentioned it invokes the GetNews script which I described in my previous post.
7. Loops back to step 1.
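
As an illustration of step 2, here is a minimal sketch of the silence check. It assumes the peak level is read from the "Maximum amplitude" line that `sox <file> -n stat` prints. The function names and the 0.1 threshold are made up for this example and are not taken from speechAnalyser.py.

```python
import re
import subprocess


def parse_max_amplitude(stat_output):
    """Extract the 'Maximum amplitude' value from the text printed by
    `sox <file> -n stat`. Returns None if the line is missing, for
    example when sox could not open the file."""
    match = re.search(r"Maximum amplitude:\s*(-?\d+(?:\.\d+)?)", stat_output)
    return float(match.group(1)) if match else None


def is_loud_enough(filename, threshold=0.1):
    """Return True if the recording's peak amplitude exceeds `threshold`.

    The threshold is an assumed value; tune it for your mic and room.
    """
    proc = subprocess.Popen(
        ["sox", filename, "-n", "stat"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # sox writes the stat report to stderr
        universal_newlines=True,
    )
    output, _ = proc.communicate()
    amplitude = parse_max_amplitude(output)
    return amplitude is not None and amplitude > threshold
```

Returning None when the line is missing means a failed recording is treated as silence rather than crashing on the float conversion.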

Remember to fill in the key line below with the API key which was provided when you set up the Google Cloud Speech API:


key = ''
stt_url = 'https://speech.googleapis.com/v1beta1/speech:syncrecognize?key=' + key

Also you should customise your commands in the following section of code:

# Python 2 excerpt from speechAnalyser.py; relies on: import os, subprocess, time
def listenForCommand():

    # Record 3 seconds of audio and get back the recognised text
    command = transcribe(3)

    print time.strftime("%Y-%m-%d %H:%M:%S ") + "Command: " + command

    success = True

    # Lower-case once so the keyword checks are case-insensitive
    text = command.lower()

    if "light" in text and "on" in text:
        subprocess.call(["/usr/local/bin/tdtool", "-n 1"])
    elif "light" in text and "off" in text:
        subprocess.call(["/usr/local/bin/tdtool", "-f 1"])
    elif "news" in text:
        os.system('python getNews.py')
    elif "weather" in text:
        os.system('python getWeather.py')
    elif "pray" in text:
        os.system('python sayPrayerTimers.py')
    elif "time" in text:
        subprocess.call(["/home/pi/Documents/speech.sh", time.strftime("%H:%M")])
    elif "tube" in text:
        os.system('python getTubeStatus.py')
    else:
        # No keyword recognised
        subprocess.call(["aplay", "i-dont-understand.wav"])
        success = False

    return success
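
If the list of commands grows, one option is to keep the keywords and commands in a table instead of a chain of elif branches. The sketch below is only an illustration of that idea, using a subset of the commands above; the names COMMANDS and match_command are hypothetical and do not appear in speechAnalyser.py.

```python
# Each entry pairs a group of keywords with the argument list to run.
# Entries are checked in order, and every keyword in a group must appear.
COMMANDS = [
    (("light", "on"), ["/usr/local/bin/tdtool", "-n 1"]),
    (("light", "off"), ["/usr/local/bin/tdtool", "-f 1"]),
    (("news",), ["python", "getNews.py"]),
    (("weather",), ["python", "getWeather.py"]),
    (("tube",), ["python", "getTubeStatus.py"]),
]


def match_command(text):
    """Return the argument list for the first matching entry, or None."""
    lowered = text.lower()
    for keywords, argv in COMMANDS:
        if all(keyword in lowered for keyword in keywords):
            return argv
    return None
```

The result can be passed straight to subprocess.call, with None meaning "play the i-dont-understand.wav sound". Note that this is still plain substring matching, so "on" also matches inside a word such as "second"; matching on whole words would be stricter.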

The other interesting part of the script is where it sends the data over to the Google Cloud Speech API.

It encodes the audio in base64 and embeds it in a JSON message.

Within the outgoing JSON message there is a phrases section, where I've included my trigger word "Jarvis", which makes it more likely that the speech engine recognises it.

The final bit then gets the text from the response.


    # Send sound to Google Cloud Speech API to interpret
    # ----------------------------------------------------
    # (Python 2 excerpt from transcribe(); relies on:
    #  import base64, pycurl, StringIO, time)

    print time.strftime("%Y-%m-%d %H:%M:%S ") + "Sending to google api"

    # Set up the HTTP POST request; the response body is collected in fout
    c = pycurl.Curl()
    c.setopt(pycurl.VERBOSE, 0)
    c.setopt(pycurl.URL, stt_url)
    fout = StringIO.StringIO()
    c.setopt(pycurl.WRITEFUNCTION, fout.write)

    c.setopt(pycurl.POST, 1)
    c.setopt(pycurl.HTTPHEADER, ['Content-Type: application/json'])

    with open(filename, 'rb') as speech:
        # Base64 encode the binary audio file for inclusion in the JSON
        # request.
        speech_content = base64.b64encode(speech.read())

    # XXX is a placeholder which is swapped for the encoded audio below
    jsonContentTemplate = """{
        "config": {
            "encoding": "FLAC",
            "sampleRate": 16000,
            "languageCode": "en-GB",
            "speechContext": {
                "phrases": [
                    "jarvis"
                ]
            }
        },
        "audio": {
            "content": "XXX"
        }
    }"""

    jsonContent = jsonContentTemplate.replace("XXX", speech_content)

    #print jsonContent

    start = time.time()

    c.setopt(pycurl.POSTFIELDS, jsonContent)
    c.perform()

    # Extract text from returned message from Google
    # ----------------------------------------------
    response_data = fout.getvalue()

    end = time.time()
    #print "Time to run:"
    #print(end - start)

    #print response_data

    c.close()

    # The transcript value starts 14 characters after the start of the
    # "transcript" key and ends at the next '",'
    start_loc = response_data.find("transcript")
    temp_str = response_data[start_loc + 14:]
    #print "temp_str: " + temp_str
    end_loc = temp_str.find('",')
    final_result = temp_str[:end_loc]
    #print "final_result: " + final_result
    return final_result
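
As an alternative to the string template and the find() calls, the request and response could be handled with Python's json module. This is a sketch rather than what speechAnalyser.py does. It assumes the response has the shape results -> alternatives -> transcript, and the two function names are invented for the example.

```python
import base64
import json


def build_request_body(audio_bytes, phrases=("jarvis",)):
    """Return the JSON request body for the given FLAC audio, as a string."""
    return json.dumps({
        "config": {
            "encoding": "FLAC",
            "sampleRate": 16000,
            "languageCode": "en-GB",
            "speechContext": {"phrases": list(phrases)},
        },
        # b64encode returns bytes; decode so it can be placed in the JSON
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    })


def extract_transcript(response_text):
    """Return the first transcript in the response, or '' if there is none.

    Assumes the response looks like:
    {"results": [{"alternatives": [{"transcript": "..."}]}]}
    """
    try:
        results = json.loads(response_text).get("results", [])
        return results[0]["alternatives"][0]["transcript"]
    except (ValueError, IndexError, KeyError, TypeError, AttributeError):
        return ""
```

With this approach an empty or error response simply gives back an empty string, so the caller can treat it the same way as silence.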






I have to give a big shout out to the following sites, which gave me ideas on how to write this script:

https://diyhacking.com/best-voice-recognition-software-for-raspberry-pi/ - This contains the instructions on how to set up a microphone on the Raspberry Pi
https://github.com/StevenHickson/PiAUISuite - A full application which does what the above script does but is configurable. I'm not sure if it still works with the new Google Speech API, though.

20 comments:

  1. Hey, this is really cool. I installed everything but when I try to run it I get these errors output:

    sh: 1: flac: not found
    arecord: begin_wave:2516: write error
    Traceback (most recent call last):
    File "speechAnalyser.py", line 199, in <module>
    spokenText = transcribe(2) ;
    File "speechAnalyser.py", line 57, in transcribe
    maxAmpValue = float(maxAmpValueText)
    ValueError: could not convert string to float: open i

    Have you experienced this error? It seems like something really simple to fix but I'm pretty new with this.

  2. Uncomment some of the print statements to help debug (remove the '#' symbols).

    Also, did sox install properly? Check by running it on the command line. Sox is used to check if the file is silent by looking at the maximum amplitude.

    Also is the test.flac file saved down?

  3. Ok, so I tested Sox by converting a .wav file to a .au file, so it seems it's been installed properly. I uncommented the print statements but I'm not sure what you mean by the test.flac being "saved down". I now get this error:

    listening ..
    sh: 1: flac: not found
    arecord: begin_wave:2516: write error
    Popen outputsox FAIL formats: can't open input file `test.flac': No such file or directory

    Max Amp Start: 23
    Max Amop Endp: 30
    Max Amp: open i
    Traceback (most recent call last):
    File "speechAnalyser.py", line 199, in <module>
    spokenText = transcribe(2) ;
    File "speechAnalyser.py", line 57, in transcribe
    maxAmpValue = float(maxAmpValueText)
    ValueError: could not convert string to float: open i

    Replies
    1. Looks like the arecord command isn't creating the sound file "test.flac" as expected.

      Can you see a file called "test.flac" exists in the same folder as where speechAnalyser.py is stored?

      Try running arecord from the command line to test if it is working.

  4. I tested arecord on its own and it definitely works. I can't see "test.flac" stored in the folder so that line definitely isn't running properly. Are there meant to be apostrophes in line 34?

  5. Add a line before 33 with this:
    print('arecord -D plughw:1,0 -f cd -c 1 -t wav -d ' + str(duration) + ' -q -r 16000 | flac - -s -f --best --sample-rate 16000 -o ' + filename)

    This will print out the actual command being sent to arecord. Then you can try running that on the command line separately.

    My suspicion is that flac isn't installed (which might be missing from my instructions)

    you will need to run :
    sudo apt-get install flac

  6. You were right, I just needed to install flac. The only problem now is that it won't accept my API key for google cloud speech. Every time it tries to send the file to google it says "API key not valid. Please pass a valid API key."
    Any idea what's going wrong there?

  7. In the Google Cloud Platform console, get into the API Manager and select "Credentials". Click on create credentials and select "API Key".

    Did you sign up fully to the Google Cloud Platform? You need to provide payment details even though it's a limited-time free beta trial.

  8. Ye I'm sure I'm fully setup and everything. I'll regenerate the API key and have another go later.

  9. Is it definitely an API key you're using, not a service account key?

    Replies
    1. Yes, API key for sure. See this screenshot:
      https://dl.dropboxusercontent.com/u/427946/Rpi%20Speech/googlecloudapikey.JPG

  10. Hi! Great tutorial. I've set up everything without any problems and also changed google cloud speech to wit.ai - working like a charm :)

    Replies
    1. That's great. I had a play with wit.ai. It's very powerful at interpreting text, but I found its speech recognition wasn't as good as Google's.

    2. After a few days I can say that you are 100% right :) Switched back to the Google API

    3. What kind of latency do you see with google cloud speech? For me it's consistently been ~5 seconds.

    4. I'm now using Google Speech Recognition instead of the Cloud Speech API. It's much faster than Google Cloud Services. Check out https://github.com/Uberi/speech_recognition and the r.recognize_google(audio) method

  11. Turned out I hadn't actually enabled the api key. It works perfectly. Thanks for all your help!

    Replies
    1. Great news. Let us know what you use it for

  12. Is there anything other than the Google Cloud Platform Speech API I can use? I can't create an account.

    Replies
    1. Hi, you can try out these:

      1) wit.ai : https://wit.ai/docs/http/20160330#get-intent-via-speech-link

      2) Jasper : https://jasperproject.github.io/

      3) Microsoft Bing Speech API: https://www.microsoft.com/cognitive-services/en-us/speech-api

      I've only tried out wit.ai and it's good, but I didn't find its speech recognition as good as Google's. I want to try out the Microsoft Bing API as the latency (~5 secs) with Google is annoying.
