Modeling a Custom Wake Word: A Guide for the Dazed, Confused, and Programming-Illiterate (Like Myself)
I have recently become enamored with an open-source virtual assistant project, name of Mycroft. One of the advantages of this system, besides its focus on privacy and security, is that you can train your own wake word, as well as perform various other customizations. However, while going through the process I noticed all the things I wished the documentation had included, so once I figured it out, I decided to write it down so that others may have a guide to help them. This guide assumes you want to use the Picroft version of Mycroft. Other builds on other OSes might be different, but perhaps some information and instructions will transfer?
First, it’s important to know that while the process isn’t particularly difficult, it is a little time-consuming. You’ll want to set aside some time to make your recordings in a quiet place. Further, this guide will assume a level of familiarity with basic Linux / Unix / *nix commands, some familiarity with using a command line, and knowledge of how to use a computer. None of this gets programmer-y or difficult, but some people see a command line and head for the hills. Embrace it, it’s fun.
Things you will need!
- A Raspberry Pi (obviously) running Mycroft. I’m using core 20.8.1 (Buster Keaton). If they update things in the future, I’ll try to remember this exists and update it as well.
- Another computer to do the modeling on. I’m using a MacBook Pro, but any machine will do, because we’re going to run things in a virtual machine (VM). If you have a machine with Ubuntu 18.04 installed on it, that will probably work as well.
- A virtual machine. I recommend VirtualBox from Oracle, for its price (none) and usability across just about every operating system that exists. Of course, you can run this straight on Linux, too, but I’m recommending a specific version of Ubuntu, so the VM is a good place to start unless you’re experienced enough to know exactly what you’re doing.
- An ISO of Ubuntu 18.04
- A fairly decent USB microphone. This doesn’t have to be expensive, but it’s a bit of an investment. Expect to spend at least $80 for a decent one. Under that, the quality can get a bit shoddy. This one is fine. If you’re going to use a USB mic with VirtualBox, remember to select Devices->USB->The USB mic from the Devices menu, or the device will not show up in the VM due to the OS’s inability to “share” the USB device. Make sure you’re getting good audio from the sound settings section of the control panel (or with the quick check just below).
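If you’d rather check from the terminal, here’s a quick mic sanity test inside the VM (a sketch; the arecord flags match what we’ll use later for room noise):
arecord -l
arecord -d 5 -f S16_LE -r 16000 mic-test.wav
aplay mic-test.wav
The first command lists capture devices (your USB mic should appear), the second records five seconds of audio, and the third plays it back. If you can hear yourself clearly, you’re set.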
Second: I DO NOT RECOMMEND doing this on the Raspberry Pi, for a lot of reasons; if you have a laptop or other computer, it will likely run much faster there. I had some trouble getting this to work on Ubuntu 20, so I went back to 18.04, which has worked very well for me so far. Let’s go through how I did all this.
Step zero: If you know all of what I’m about to say, and want to skip to why precise won’t work, scroll down to step whatever.
Step one: prepare your environment. I did this by getting VirtualBox installed, and then setting up an Ubuntu 18.04 virtual environment. You’ll want…quite a lot of disk space for this. I ended up allocating 10GB for the image to use, and that turned out to be too small. 20GB is ideal. Why 18.04? Well, because it’s what I used, it’s fairly recent, and it worked. I tried on 20.04, and ended up with a lot of issues getting precise to model and run correctly. Your mileage may vary, but since this is a guide for the confused and frustrated, it’s what I’m recommending that you use. Get it installed in your VM with all the default software. You probably don’t need any add-ons.
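If you prefer the terminal to clicking through the VirtualBox GUI, the VM can be stamped out with VBoxManage — a rough sketch, with my suggested name and sizes (you’ll still need to attach the 18.04 ISO to an optical drive, which is honestly easier in the GUI):
VBoxManage createvm --name ubuntu1804 --ostype Ubuntu_64 --register
VBoxManage modifyvm ubuntu1804 --memory 4096 --cpus 2
VBoxManage createmedium disk --filename ubuntu1804.vdi --size 20480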
Step two: you’ll want to install, well, a bunch of things. Starting with updating apt. It doesn’t really matter where you install this, but the home directory (“cd ~”) is where I installed my copy, and that works just fine.
sudo apt-get update
That will run; now install git:
sudo apt install git
Congratulations, you have installed the world’s premier file change-tracking software. That was fun. Let’s move on.
For some reason, we have to add another repository to apt before we’re allowed to install all of the software we plan to install.
sudo add-apt-repository universe
Step three: We may now advance to the reason for our exercise – getting precise installed. Git makes it easy to grab a copy.
git clone https://github.com/MycroftAI/mycroft-precise.git
It will download a copy of precise 0.3.0, and you will now see a new directory in your home folder, “mycroft-precise”. Change into that directory, and edit setup.py with the following command:
sudo pico setup.py
Why pico and not nano? Because I learned Linux on SuSE 9 a loooong time ago, and it will open nano for you anyway. 😉
Okay, you’re going to have a lot of text here. But there’s one line we need to change in this file, way down at the bottom.
'numpy==1.16',
'tensorflow>=1.13,<1.14', # Must be on piwheels
'sonopy',
'pyaudio',
'keras<=2.1.5',
'h5py',
'wavio',
'typing',
'prettyparse>=1.1.0',
'precise-runner',
'attrs',
'fitipy<1.0',
'speechpy-fast',
'pyache'
See the line that says ‘h5py’? Change that to read
'h5py<3.0.0',
Save the file, and exit pico or nano or whatever you’re using. (Vi? Emacs? Vim? Editing the inodes with a magnetic needle?)
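If you’d rather not open an editor at all, a one-line sed from inside the mycroft-precise directory should make the same change (a sketch — double-check setup.py afterwards):
sed -i "s/'h5py',/'h5py<3.0.0',/" setup.py
Hooray, we can now install precise. This is done by running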
./setup.sh
Step four: This will take some time to complete. It will download everything for you, do all sorts of magic, and eventually, return you (hopefully) to a command prompt. Congratulations, you have now installed precise. Make it usable by running
source .venv/bin/activate
Run that from within the mycroft-precise directory; it activates the virtual environment so Python can find the precise libraries.
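A quick sanity check that everything landed on your PATH (with the venv active):
which precise-train precise-listen precise-test precise-collect
If those come back with paths inside mycroft-precise/.venv, you’re in business.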
Step five: and one of the most important ones. Download the zip of mycroft-precise binary data, from this page. The file you need will change depending on the architecture you’re running Mycroft on. For a Picroft, it’s the armv7l files. Grab precise-engine.tar.gz, and extract it. Within, you will find a folder called “precise-engine”. SSH (or SFTP, or whatever) into your Picroft, and navigate to
/home/pi/.mycroft/precise
And replace the existing precise-engine folder there with the one you just downloaded. Also, I’ve found you need to do the following bit of dumbassery:
chmod 777 /home/pi/.mycroft/precise/precise-engine/precise-engine
That allows all users and all groups to read, write, and execute that file. I tried with lesser permissions and ran into errors. Probably not a big deal, you are behind a decent firewall and not exposing your Pi to unknown incoming connections, right? Right?? Reboot your Pi. Strictly necessary? …probably not? At the very least, do:
mycroft-start all restart
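For reference, the whole engine-swap dance from the modeling machine looks roughly like this (a sketch; substitute your Picroft’s address, and move the old precise-engine folder on the Pi out of the way first so the copies don’t nest):
tar xzf precise-engine.tar.gz
scp -r precise-engine pi@<your-picroft-ip>:/home/pi/.mycroft/precise/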
Okay, why did we replace all of that? Well, the current image (maybe updated by now) of Picroft does not include the latest version of precise; it’s running an older 0.2.0 version. This mismatch will cause your models to either: activate at the sound of an ant walking across the carpeting, or alternatively, it just won’t hear the wake word…at all. If either of these things is happening to you, it’s almost certainly due to a version mismatch. (Or a terrible-quality data set, but let’s address that below.)
On to modeling. There are several guides to running this software. The official Mycroft one is here, and some additional thoughts by contributor El-Tocino are here. Here are my thoughts, having messed with this thing for a while.
The official Mycroft-precise training page is a good start, but some additional information is needed. I say this because once I got the hang of creating models, I was able to do so using almost nothing beyond precise-train, precise-listen, precise-test, and precise-collect. You should also go ahead and make all of the directories that it asks you to (there’s a one-liner for this just after the list). In my case, I called my model tng-computer (I’m a Trek fan, can you tell?) and so my dirs were
- tng-computer
  - not-wake-word
  - test
    - not-wake-word
    - wake-word
  - wake-word
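If you want to stamp all of those out in one go, bash brace expansion does it (assuming you’re calling yours tng-computer too):
mkdir -p tng-computer/{wake-word,not-wake-word,test/{wake-word,not-wake-word}}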
Let’s cover the data you’ll need first.
The official Mycroft-precise instructions tell you that you need around twelve samples. In my experience, this is too few, and indeed, El-Tocino would agree with us here. For my model, I recorded about 300 samples of the wake word, which in my case, was “computer”. Half of these are me, half are my partner, and I sprinkled in all my kids, too, to help get a precise model. Everyone you want to be able to use Mycroft should contribute. It will help. Try to get lots of different inflections – rising, falling, neutral, happy, sad, disgusted. Think of any pronunciations that might be considered alternate. For instance, when saying “computer”, a lot of United States central Midwest accents tend to drop the explicit “t”, so it sounds more like “compu-er”, without the explicit tapped “t” on the hard palate. I realized after I recorded that I needed to update my wake-words to include that.
Now, the not-wake-words…
You need a lot of not-wake data. Oodles of it. In addition to the Google Speech Commands (side note: it’s like they’re trying to camouflage that link) and the Public Domain Sounds Backup, you should record many, many instances of yourself saying every conceivable rhyme or word that might sound like your wake word. For my model, that was:
- intruder, abuser, bruiser, hooter, cooter, looter, scooter, shooter, suiter, commuter, consumer, confuser, polluter, recruiter, persecutor, prosecutor, you hear her, trouble shooter, sewer, stupor, etc.
Record lots of those. In addition, you will find that random noises that you didn’t expect will activate a poorly-trained model. To combat this, I recommend at least an hour of room noise, and possibly more, depending on all the activities that take place in your room. Have kids? Record an hour of them horsing around in the room. Record some of their television shows. Record conversations with your partner. For some reason, my daughter’s whistling tunes at the Mycroft set it off after I thought I had made a pretty good model. Add some of that and try again.
As for how to do this…
The downloadable sound sets are pretty self-explanatory. Download and unzip them into your not-wake-word directory. (Precise-train looks in directories recursively, so don’t worry about having folders within folders.) Recordings of the actual wake-word, and rhymes, are best done with the precise-collect tool. Just give them descriptive names so you can remember what they are, and start slappin’ that spacebar. Don’t leave silence at the beginning or end of the recording; this will mess up the training.
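From memory, a collection session goes something like this (the exact prompts may differ a bit between versions):
cd tng-computer/wake-word
precise-collect
It asks you for a file name pattern (something like computer-dad-##, where the ## becomes a counter), then you press space to record each sample and esc when you’re done. For room noise recordings, you can use arecord: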
arecord -f S16_LE -t wav -r 16000 --max-file-time 30 roomnoise_may16.wav
arecord is a little dumb in that it doesn’t give you feedback that it’s saving additional files, but it is. This command tells arecord to start recording in 30-second increments. This is useful because if someone says the wake-word accidentally, you can go back and pull that one file without losing the rest of your data. The name at the end is just…whatever you want to name it. I find the date and a description of the contents to be helpful for organizing purposes.
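Curious how big your pile has gotten? A quick count of the wav files will tell you:
find tng-computer/not-wake-word -name '*.wav' | wc -l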
With the Google Speech Commands, and Public Domain sounds, and all my room noise and rhymes, I had about 51K not-wake-sounds. This is probably sufficient. You can now run precise-train. I didn’t bother with the really long commands that El-Tocino posted – in part because I couldn’t find documentation as to what those switches do. (Maybe they’ll chime in here…) For my purposes, it was enough to do:
precise-train tng-computer.net tng-computer/ -e 150
It’s not many epochs (cycles of training), and that’s okay! I started hitting val_acc of 1.0 really quickly, after about 100 epochs, and I suspect that if you have a high-quality dataset, you’ll see the same results. You’ll want to be hitting in the upper .9s, or you won’t have a good model.
Once you think you have a good model, you can use
precise-test tng-computer.net tng-computer/
Replacing those file names with yours, obviously, to see how well the model performs with the testing data. I recommend more testing data than the Mycroft-precise page does; I think it’s more informative. Precise-listen is also extremely helpful here:
precise-listen tng-computer.net
Which basically “auditions” your model with the currently plugged-in microphone. Belly on up to your mic, clear your throat, throw back a shot of whiskey if you’re of the legal drinking age in the country of your residence and if doing so complies with all relevant local and national laws, and say your wake word a few times. It will print a bunch of “XXXXXxxxxxx” on the screen as a sort of realtime graph. It’s kinda hacky, but also, props to whoever coded it, because it also works just fine. If it works, you should (assuming your sound is up) hear a happy little major third ding and see lots of uppercase Xs.
If you get good numbers and test results, congrats, you can now do the exciting part and use:
precise-convert tng-computer.net
Obviously replacing the .net file name with your own. You’ll get a .pb and .pb.params file. Dump those into your home dir on your Picroft. Follow the instructions here to edit your Mycroft config to tell it about the new file. I found that for me, the trigger_level and sensitivity had to be set quite permissively, at 1 and 0.9, respectively. It unintentionally activates a few times a day, at present, but that’s because this is a new model and I need to add some more not-wake-word data. It’s quite usable!
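For reference, here’s roughly what the deploy and config ended up looking like for me — a sketch, with my file names and a placeholder address:
scp tng-computer.pb tng-computer.pb.params pi@<your-picroft-ip>:/home/pi/
And the relevant chunk of the Mycroft config looked something like this (based on my setup; check the official docs for the exact keys in your version):
{
  "listener": {
    "wake_word": "tng computer"
  },
  "hotwords": {
    "tng computer": {
      "module": "precise",
      "local_model_file": "/home/pi/tng-computer.pb",
      "sensitivity": 0.9,
      "trigger_level": 1
    }
  }
}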
Some post-training stuff: you’re going to have to re-model. No, really, you will, and that’s okay. Your model might be great, but there will be sounds that you haven’t anticipated. Keep a little notebook of some of the things it unintentionally activates to, and do some additional recordings. You can also turn on wake-word saving on the Picroft (or the Mycroft) using the setting “record_wake_words” (there’s a snippet below, which is easier to show than to describe).
You may summon these settings by using:
mycroft-config edit user
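The setting itself is just a flag in the listener section, something like this (merge it with whatever is already in your user config):
{
  "listener": {
    "record_wake_words": true
  }
}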
By default, the Mycroft / Picroft will save all activations of the wake-word as a wav and store them in /tmp/mycroft_wake_words. From there, you can SSH in (or whatever) and copy all the unintentional activations over to your not-wake-word directory for retraining.
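In shell terms, the retraining loop is basically this (a sketch; weed out any recordings that really were you saying the wake word before copying, and substitute your own names and address):
scp 'pi@<your-picroft-ip>:/tmp/mycroft_wake_words/*.wav' tng-computer/not-wake-word/
precise-train tng-computer.net tng-computer/ -e 150
precise-convert tng-computer.net
Then copy the fresh .pb and .pb.params files back to the Pi and restart Mycroft.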
You’ll have to retrain a few times, and that’s okay. It’s quite rewarding to keep zeroing in on getting the thing as close to perfect as you can, and besides, if you weren’t into this, you’d just be using “Hey Google” anyway. 😉
You may see an error that the .params file isn’t working. This isn’t a problem…because reasons. Enjoy your new wake word, and LLAP. 🖖🏻 (Many thanks to El-Tocino for their help in getting this software working.)