Some thoughts on how best to create the large number of audio files necessary for a custom wake word in Mycroft / precise

One of the things that’s been somewhat difficult to pin down is the source of the false activations I’ve been seeing with Mycroft / precise. It’s pretty damn good at detecting when I actually say the wake-word, but sometimes it activates at just…the oddest, small sounds, or little blips that sound nothing like it. So, I’ve been doing some digging, and here’s what I found.

First off, precise-collect is a tool within precise that allows the user to very quickly record lots of samples of whatever – wake-words, not-wake-words, or anything else you want to record. It’s a nice interface, command-line of course, and it’s reasonably fast. However, upon investigating the sound files it was generating, I discovered a startling number of audio artifacts. Small pops, blips, and bloops would show up about every twenty files or so, and sometimes they were significant enough that I could easily see them confusing the learning that precise was attempting to do.

Also somewhat less than ideal is that it’s difficult to know when, exactly, to start speaking after you hit the spacebar. In general, I’ve found that there’s more space on the front end of my audio files than I’d like when using precise-collect, and I suspect that the audio anomalies, along with the unintentional silence, were contributing to a lower-quality dataset than I’d like to have. The “dead air” at the front of the files is particularly troublesome, as the microphone would often pick up anything else happening in the room. This is a pretty inexpensive USB mic that doesn’t have much off-axis rejection to speak of. Of course, I could set up my USB interface and try to get it working with the VM, then keep using precise-collect, but I got to thinking: perhaps precise-collect is the wrong tool to bring to this particular problem.

To see why requires understanding that most modern professional audio editing is done looking at the waveform on a screen, in – obviously – a GUI. As a podcaster and long-ago live sound engineer, this is intuitive to me. A visual representation of a waveform makes it easy to see the data in a file without having to listen to it, and this is where precise-collect fails. A command-line utility could make some sense if it had some sort of “threshold” feature wherein it did not start to record until some arbitrary level was reached (defined in, say, decibels), but precise-collect is a simple utility whose purpose is to record the audio input into the correct format for precise to start modeling from.

Now, I do know about the long-abandoned SoX; I considered it as a possibility, and subsequently rejected it. Firstly, SoX has one HELL of an esoteric (read: beginner-unfriendly) command-line syntax, and while some searching eventually turned up the command I’d need to get SoX to strip silence from the beginning and end of a file, the man pages are…written like someone wanted to deliberately confuse the user.
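
For the curious, the invocation I eventually dug up is a well-known SoX idiom that looks something like this (the thresholds are illustrative, and the reverse-it-twice trick is exactly the kind of syntax I mean):

sox input.wav output.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse

The first silence effect trims the front of the file until it hears 0.1 seconds of audio above 1% of full scale; reversing the file and repeating the effect trims the tail, and the final reverse puts everything back in order.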

Secondly, SoX has an unfortunate habit of corrupting files. I tried twiddling the knobs, and it would still take about a full third of the input files and output uselessly empty 44-byte (yes, byte) files – 44 bytes being exactly the size of a WAV header with no audio data after it. No matter how much I adjusted, I could not get it to stop doing this. It might have been easier to troubleshoot had the man page been useful, or had there been a website with some support information, but there isn’t. There are just old SourceForge threads, miles long, post after post.

Thirdly, all of this is happening via a command line. There’s no graphical preview of what the output files might look like, which would make adjusting the input parameters far easier. For the reasons I discussed above, audio tools are probably best used via a GUI, because of the intuitive nature of looking at a graphical waveform. If SoX had a graphical frontend, it would make a lot more sense, because command-line tools have obvious advantages when it comes to scripting operations over large numbers of files that all need the same thing done to them (see the sketch below). But it doesn’t, so even the files SoX produced that were usable weren’t really usable, because it would cut off too much or too little of the silence, and I eventually lost patience trying to get this fiddly method to work. Finally, editing hundreds of individual files is a nightmare, even with a GUI. It would be easier to consolidate all of the files into one cohesive whole, edit THAT, and then re-split it. Combine all this with the audio-artifact issue in precise-collect that I simply did not have the time or patience to track down, and this workflow was unacceptable to me.
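
To be fair to the command line, here’s the kind of batch pass where SoX would shine in theory – a loop over a whole directory, using the same illustrative thresholds as above:

mkdir -p trimmed
for f in *.wav; do
    sox "$f" "trimmed/$f" silence 1 0.1 1% reverse silence 1 0.1 1% reverse
done

In practice, this is exactly the loop that kept handing me those empty 44-byte files.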

So, what to do? Well, I came up with a process that seems to be working very well, and that allowed me to produce a lot of really high-quality wake-word files in a very short amount of time.

For this, I’m using two bits of software: Audacity, which I hope everyone is familiar with, and a new piece of software I found, WavePad.

The process is quite straightforward. First, I record in Audacity. I could do this with my Nice Microphone setup in Logic, but there’s simply no need at the sample rate we’re working with here (precise’s training data is plain 16 kHz, 16-bit mono WAV, as far as I can tell). Plugging in my USB mic in a quiet room is just fine. Now, I record my wake-word just as I would with precise-collect, except I leave a few seconds of silence between each sample, so I end up with a five-minute file of me saying “Computer” in various inflections / distances from the microphone / tones. Not having to record these all as separate instances makes the recording process go very quickly, and I could record a few hundred instances of my wake word in about ten minutes.

[Screenshot: Audacity]

And because we’re in Audacity, it’s very easy to spot any extraneous noise that needs to be cleaned up – chairs creaking, etc. All of this can be done quickly by looking at the waveform, and any other adjustments can be made at this point – de-noising, correcting gain or clipping, and so on – before exporting the audio.

In WavePad, it’s a simple task to open the long file and then use the Edit -> Split File -> Split at silences command to generate the individual files. It works like a charm – and WavePad is free for non-commercial use. Using this method, once I dialed in the proper threshold for silences, the program output perfectly chopped-up instances of my wake-word.

[Screenshots: WavePad]

I may find that I’m totally wrong about SoX, but at least for me, for the moment, this workflow has saved a lot of time and neatly avoided the issues I was having with precise-collect and SoX.


Training Precise on Mycroft / Picroft

Modeling a Custom Wake Word: A Guide for the Dazed, Confused, and Programming-Illiterate

(Like Myself)

I have recently become enamored with an open-source virtual assistant project, name of Mycroft. One of the advantages of this system, besides its focus on privacy and security, is that you can train your own wake word, as well as perform various other customizations. However, going through the process, I noticed all of the things I wished the documentation included, so once I figured it out, I decided to write it down so that others may have a guide to help them. This guide assumes you want to use the Picroft version of Mycroft. Other builds on other OSes might be different, but perhaps some of the information and instructions will transfer?

First, it’s important to know that while the process isn’t particularly difficult, it is a little time-consuming. You’ll want to set aside some time to make your recordings in a quiet place. Further, this guide assumes a level of familiarity with basic Linux / Unix / *nix commands, some familiarity with using a command line, and knowledge of how to use a computer. None of this gets programmer-y or difficult, but some people see a command line and head for the hills. Embrace it, it’s fun.

Things you will need!

  • A Raspberry Pi (obviously) running Mycroft. I’m using core 20.8.1 (Buster Keaton). If they update things in the future, I’ll try to remember this exists and update it as well.
  • Another computer to do the modeling on. I’m using a MacBook Pro, but it will work on just about anything because we’re going to run things in a virtual machine (VM). If you have a machine with Ubuntu 18.04 installed on it, that will probably work as well.
  • A virtual machine. I recommend VirtualBox from Oracle, for its price (none) and usability across just about every operating system that exists. Of course, you can run this straight on Linux, too, but I’m recommending a specific version of Ubuntu, so the VM is a good place to start unless you’re experienced enough to know exactly what you’re doing.
  • An ISO of Ubuntu Linux 18.04
  • A fairly decent USB microphone. This doesn’t have to be expensive, but it’s a bit of an investment. Expect to spend at least $80 for a decent one; under that, the quality can get a bit shoddy. This one is fine. If you’re going to use a USB mic with VirtualBox, remember to select Devices -> USB -> The USB mic from the Devices menu, or the device will not show up in the VM due to the OS’s inability to “share” the USB device. Make sure you’re getting good audio from the sound settings section of the control panel (or via the quick check below).
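
If you’d rather confirm the mic from the terminal, ALSA’s tools make for a quick sanity check (arecord and aplay come with the alsa-utils package; the file name is arbitrary):

arecord -l
arecord -d 3 -f S16_LE -r 16000 mic-test.wav && aplay mic-test.wav

The first command lists every capture device the VM can see; the second records three seconds and plays it back so you can judge level and noise.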

Second: I DO NOT RECOMMEND doing this on the Raspberry Pi, for a lot of reasons; if you have a laptop or other computer, it will likely run much faster there. I had some trouble getting this to work on Ubuntu 20.04, so I went back to 18.04, which has worked very well for me so far. Let’s go through how I did all this.

Step zero: If you know all of what I’m about to say, and want to skip to why precise won’t work, scroll down to step whatever.

Step one: prepare your environment. I did this by installing VirtualBox, and then setting up an Ubuntu 18.04 virtual environment. You’ll want…quite a lot of disk space for this. I ended up allocating 10GB for the image to use, and that turned out to be too small; 20GB is ideal. Why 18.04? Well, because it’s what I used, it’s fairly recent, and it worked. I tried 20.04 and ran into a lot of issues getting precise to model and run correctly. Your mileage may vary, but since this is a guide for the confused and frustrated, it’s what I’m recommending you use. Get it installed in your VM with all the default software. You probably don’t need any add-ons.

Step two: you’ll want to install, well, a bunch of things, starting with updating apt. It doesn’t really matter where you put all of this; the home directory (“cd ~”) is where I installed my copy, and that works just fine.

sudo apt-get update

That will run; now install git:

sudo apt install git

Congratulations, you have installed the world’s premier file change-tracking software. That was fun. Let’s move on.

For some reason, we have to add another repository to apt before we’re allowed to install all of the software we plan to install.

sudo add-apt-repository universe

Step three: We may now advance to the reason for our exercise – getting precise installed. Git makes it easy to grab a copy.

git clone https://github.com/MycroftAI/mycroft-precise.git

It will download a copy of precise 0.3.0, and you will now see a new directory in your home folder, “mycroft-precise”. Change into that directory, and edit setup.py with the following command:

sudo pico setup.py

Why pico and not nano? Because I learned Linux on SuSE 9 a loooong time ago, and it will open nano for you anyway. 😉

Okay, you’re going to have a lot of text here. But there’s one line we need to change in this file, way down at the bottom.

'numpy==1.16',
'tensorflow>=1.13,<1.14', # Must be on piwheels
'sonopy',
'pyaudio',
'keras<=2.1.5',
'h5py',
'wavio',
'typing',
'prettyparse>=1.1.0',
'precise-runner',
'attrs',
'fitipy<1.0',
'speechpy-fast',
'pyache'

See the line that says ‘h5py’? Change that to read

'h5py<3.0.0',

Save the file, and exit pico or nano or whatever you’re using. (Vi? Emacs? Vim? Editing the inodes with a magnetic needle?) Hooray, we can now install precise. This is done by running

./setup.sh

Step four: This will take some time to complete. It will download everything for you, do all sorts of magic, and eventually, return you (hopefully) to a command prompt. Congratulations, you have now installed precise. Make it usable by running

source .venv/bin/activate

From within the mycroft-precise directory. This activates the Python virtual environment that setup.sh created, putting the precise tools on your PATH.
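
A quick way to confirm the environment took: ask one of the tools for its help text. If the activation worked, you should get usage information rather than a “command not found”:

precise-train --help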

Step five: and one of the most important ones. Download the zip of mycroft-precise binary data from this page. The file you need will change depending on the architecture you’re running Mycroft on. For a Picroft, it’s the armv7l files. Grab precise-engine.tar.gz and extract it. Within, you will find a folder called “precise-engine”. SSH (or SFTP, or whatever) into your Picroft, and navigate to

/home/pi/.mycroft/precise

And replace the existing precise-engine folder there with the one you just downloaded. Also, I’ve found you need to do the following bit of dumbassery:

chmod 777 /home/pi/.mycroft/precise/precise-engine/precise-engine

Which allows all users and all groups the ability to read, write, and execute that file. I tried with lesser permissions and ran into errors. Probably not a big deal – you are behind a decent firewall and not exposing your Pi to unknown incoming connections, right? Right?? Reboot your Pi. Strictly necessary? …probably not? At the very least, do:

mycroft-start all restart

Okay, why did we replace all of that? Well, the current image (maybe updated by now) of Picroft does not include the latest version of precise; it’s running an older 0.2.0 version. This mismatch will cause your models to either activate at the sound of an ant walking across the carpeting or, alternatively, not hear the wake word…at all. If either of these things is happening to you, it’s almost certainly due to a version mismatch. (Or a terrible-quality data set, but let’s address that below.)
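
If you’d rather script the swap than click around in an SFTP client, the whole dance looks roughly like this from the VM (the picroft.local hostname and the default “pi” user are assumptions – adjust for your setup):

tar -xzf precise-engine.tar.gz
ssh pi@picroft.local "rm -rf /home/pi/.mycroft/precise/precise-engine"
scp -r precise-engine pi@picroft.local:/home/pi/.mycroft/precise/
ssh pi@picroft.local "chmod 777 /home/pi/.mycroft/precise/precise-engine/precise-engine && mycroft-start all restart"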

Onto modeling. There are several guides to running this software. The official Mycroft one is here, and some additional thoughts by contributor El-Tocino are here. Here are my thoughts, having messed with this thing for a while.

The official Mycroft-precise training page is a good start, but some additional information is needed. I say this because once I got the hang of creating models, I was able to do so using almost none of the tools other than precise-train, precise-listen, precise-test, and precise-collect. You should also go ahead and make all of the directories that it asks you to (there’s a one-liner for this just below the list). In my case, I called my model tng-computer (I’m a Trek fan, can you tell?) and so my dirs were

  • tng-computer
    • not-wake-word
    • test
      • not-wake-word
      • wake-word
    • wake-word
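
If you want to save a little typing, bash brace expansion will create the whole tree in one go (assuming your model name matches mine):

mkdir -p tng-computer/{wake-word,not-wake-word,test/{wake-word,not-wake-word}}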

Let’s cover the data you’ll need first.

The official Mycroft-precise instructions tell you that you need around twelve samples. In my experience, this is too few, and indeed, El-Tocino would agree with us here. For my model, I recorded about 300 samples of the wake word, which in my case was “computer”. Half of these are me, half are my partner, and I sprinkled in all my kids, too, to help get a precise model. Everyone you want to be able to use Mycroft should contribute. It will help. Try to get lots of different inflections – rising, falling, neutral, happy, sad, disgusted. Think of any pronunciations that might be considered alternate. For instance, when saying “computer”, a lot of United States central midwest accents tend to drop the explicit “t”, so it sounds more like “compu-er”, without the explicit tapped “t” on the hard palate. I realized after I recorded that I needed to update my wake-words to include that.

Now, the not-wake-words…

You need a lot of not-wake data. Oodles of it. In addition to the Google Speech Commands (side note: it’s like they’re trying to camouflage that link) and the Public Domain Sounds Backup, you should record many, many instances of yourself saying every conceivable rhyme or word that might sound like your wake word. For my model, that was:

  • intruder, abuser, bruiser, hooter, cooter, looter, scooter, shooter, suiter, commuter, consumer, confuser, polluter, recruiter, persecutor, prosecutor, you hear her, trouble shooter, sewer, stupor, etc.

Record lots of those. In addition, you will find that random noises you didn’t expect will activate a poorly-trained model. To combat this, I recommend at least an hour of room noise, and possibly more, depending on all the activities that take place in your room. Have kids? Record an hour of them horsing around in the room. Record some of their television shows. Record conversations with your partner. For some reason, my daughter whistling tunes at the Mycroft set it off after I thought I had made a pretty good model. Add some of that and try again.

As for how to do this…

The downloadable sound sets are pretty self-explanatory. Download and unzip them into your not-wake-word directory. (Precise-train looks in directories recursively, so don’t worry about having folders within folders.) Recordings of the actual wake-word, and rhymes, are best done with the precise-collect tool. Just give them descriptive names so you can remember what they are, and start slappin’ that spacebar. Don’t leave silence at the beginning or end of the recordings; it will mess up the training. For room noise recordings, you can use arecord:

arecord -f S16_LE -t wav -r 16000 --max-file-time 30 roomnoise_may16.wav

arecord is a little dumb in that it doesn’t give you feedback that it’s saving additional files, but it is. This command tells arecord to record 16 kHz, 16-bit WAV in 30-second increments. This is useful because if someone says the wake-word accidentally, you can go back and pull that one file without losing the rest of your data. The name at the end is just…whatever you want to name it. I find the date and a description of the contents helpful for organizing purposes.
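
Once everything is piled into the directories, a quick count makes a nice sanity check before training (the path assumes my model name):

find tng-computer/not-wake-word -name '*.wav' | wc -l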

With the Google Speech Commands, the Public Domain sounds, and all my room noise and rhymes, I had about 51K not-wake-sounds. This is probably sufficient. You can now run precise-train. I didn’t bother with the really long commands that El-Tocino posted – in part because I couldn’t find documentation as to what those switches do. (Maybe they’ll chime in here…) For my purposes, it was enough to do:

precise-train tng-computer.net tng-computer/ -e 150

It’s not many epochs (cycles of training), and that’s okay! I started hitting a val_acc of 1.0 pretty quickly, after about 100 epochs, and if you have a high-quality dataset, I suspect you’ll see the same results. You’ll want to be hitting the upper .9s, or you won’t have a good model.

Once you think you have a good model, you can use

precise-test tng-computer.net tng-computer/

Replacing those file names with yours, obviously, to see how well the model performs on the testing data. I recommend using more testing data than the Mycroft-precise page suggests; I find it more informative. Precise-listen is also extremely helpful here:

precise-listen tng-computer.net

Which basically “auditions” your model with the currently plugged-in microphone. Belly on up to your mic, clear your throat, throw back a shot of whiskey if you’re of the legal drinking age in the country of your residence and if doing so complies with all relevant local and national laws, and say your wake word a few times. It will print a bunch of “XXXXXxxxxxx” on the screen as a sort of realtime graph. It’s kinda hacky, but also, props to whoever coded it, because it works just fine. If it works, you should (assuming your sound is up) hear a happy little major-third ding and see lots of uppercase Xs.

If you get good numbers and test results, congrats, you can now do the exciting part and use:

precise-convert tng-computer.net

Obviously replacing the .net file name with your own. You’ll get a .pb and a .pb.params file. Dump those into your home dir on your Picroft. Follow the instructions here to edit your Mycroft config to tell it about the new file. I found that, for me, the trigger_level and sensitivity had to be set quite permissively, at 1 and 0.9, respectively. It unintentionally activates a few times a day at present, but that’s because this is a new model and I need to add some more not-wake-word data. It’s quite usable!
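
For reference, the relevant chunk of my user config ends up looking something like this – the model path and wake-word name are mine, and the key names follow Mycroft’s hotword configuration:

"hotwords": {
  "computer": {
    "module": "precise",
    "local_model_file": "/home/pi/tng-computer.pb",
    "sensitivity": 0.9,
    "trigger_level": 1
  }
},
"listener": {
  "wake_word": "computer"
}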

Some post-training stuff: you’re going to have to re-model. No, really, you will, and that’s okay. Your model might be great, but there will be sounds that you haven’t anticipated. Keep a little notebook of some of the things it unintentionally activates on, and do some additional recordings. You can also turn on wake-word saving on the Picroft (or the Mycroft) using the setting “record_wake_words”, which is easier to show a screenshot of than to write:

[Screenshot: settings]
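
In text form, the setting lives in the listener block of the same config file (a minimal sketch; anything else already in that block can stay put):

"listener": {
  "record_wake_words": true
}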

You may summon these settings by using:

mycroft-config edit user

By default, the Mycroft / Picroft will save all activations of the wake-word as a wav and store them in /tmp/mycroft_wake_words. From there, you can SSH in (or whatever) and copy all the unintentional activations over to your not-wake-word directory for retraining.
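
From the VM, that copy can be a one-liner, something like this (same hostname assumption as before):

scp 'pi@picroft.local:/tmp/mycroft_wake_words/*.wav' tng-computer/not-wake-word/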

You’ll have to retrain a few times, and that’s okay. It’s quite rewarding to keep zeroing in on getting the thing as close to perfect as you can, and besides, if you weren’t into this, you’d just be using “Hey Google” anyway. 😉

You may see an error that the .params file isn’t working. This isn’t a problem…because reasons. Enjoy your new wake word, and LLAP. 🖖🏻 (Many thanks to El-Tocino for their help in getting this software working.)
