We've all been there. Your program isn't doing what you want it to, even though nothing seems off about the code.
In a scenario like this, one has no choice but to turn to the rubber duck. You explain your code to it, and (seemingly out of thin air) you find out exactly what the problem is.
Why not take this mundane, one-way interaction to the next level? (seriously, why not?)
Speech Recognition
To recognize speech, we're using the speech_recognition library with faster_whisper.
import os
import time
import speech_recognition as sr
import torch
from faster_whisper import WhisperModel
First, let's load the model weights required for faster_whisper:
MODEL = "small.en"

model = WhisperModel(
    MODEL,
    device="cpu",
    compute_type="int8"
)
pipeline = model.transcribe
Next, let's initialize the mic and recognizer:
r = sr.Recognizer()
m = sr.Microphone()

with m as source:
    print("SIIIIIILLLLEENNCE")
    r.adjust_for_ambient_noise(source)
Finally, let's create the recognize function. I don't know why I'm doing it this way, even though the speech_recognition library has its own faster_whisper integration (I yoinked this code from an older repo of mine).
def recognize():
    with m as source:
        audio = r.listen(source)

    with open("audio.wav", "wb") as f:
        f.write(audio.get_wav_data())

    t = time.perf_counter()
    with torch.inference_mode():
        segments, info = pipeline(
            "audio.wav",
            vad_filter=True
        )

    recognized = ""
    for segment in segments:
        recognized += segment.text

    os.remove("audio.wav")
    return recognized
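For the record, if you'd rather use the built-in integration I mentioned, something along these lines should work. I haven't tested this exact snippet, and the recognizer's name and arguments may differ between speech_recognition versions, so treat it as a rough sketch:

def recognize_builtin():
    # Untested sketch: newer speech_recognition releases ship a
    # faster-whisper recognizer. Check your version's docs for the exact API.
    with m as source:
        audio = r.listen(source)
    return r.recognize_faster_whisper(audio, model="small.en")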
Text-To-Speech
I was just strolling around huggingface the other day when I found this intriguing text-to-speech model called kokoro. I've been wanting to try it out for quite a while now, so I decided to use it for this project. It's REALLY fast, even on my potato PC.

Anyway, back to the code. Let's import what we need real quick and set up some stuff:
from kokoro import KPipeline
import soundfile as sf
from pydub import AudioSegment
from pydub.playback import play
import os
pipeline = KPipeline(lang_code='a')  # 'a' = American English voices
FILE = "s.wav"
For the speak function, we just run inference, save the output to an audio file, play the file, and delete it.
def speak(text):
    generator = pipeline(
        text,
        voice='af_heart'
    )
    for i, (gs, ps, audio) in enumerate(generator):
        sf.write(
            FILE,
            audio,
            24000
        )
        audio = AudioSegment.from_wav(FILE)
        play(audio)
        os.remove(FILE)
        break
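To sanity-check it, a quick throwaway call like this (the wording is just an example) should talk back at you:

speak("Quack. Tell me what the code is supposed to do.")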
Putting It All Together
For the actual responses, let's use ollama. Since my PC can't handle large models very well, I'll just stick to llama3.2:3b for now.
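One assumption here: the model is already pulled locally. If it isn't, the standard ollama CLI command takes care of that:

ollama pull llama3.2:3b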
Let's import ollama, as well as the other two files, which I saved as microphone.py and tts.py.
import ollama
from microphone import recognize
from tts import speak
I'm no prompt engineer, but this prompt works pretty well:
SYSPROMPT = """I'm a programmer. You are my rubber duck. Your responses should be short, concise, insightful and motivating. Your responses should be REALLY short. One sentence, not more than 15 words. You're not allowed to see the code. Be a good listener, give insightful hints."""
Now we can initialize the ollama client and set up some stuff:
client = ollama.Client()
model = "llama3.2"

messages = [
    {
        "role": "system",
        "content": SYSPROMPT,
    },
]
And now for the mainloop!
while True:
    print("Say something!")
    messages.append({
        "role": "user",
        "content": recognize()
    })
    response = client.chat(
        model=model,
        messages=messages
    )
    messages.append({
        "role": "assistant",
        "content": response["message"].content
    })
    speak(response["message"].content)
    print(messages[-1]["content"])
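The loop above runs forever, so Ctrl+C is the off switch. If you want a politer exit, one small tweak I'd try is bailing out when the transcript contains a stop word (the word itself is arbitrary, pick whatever you like):

while True:
    print("Say something!")
    heard = recognize()

    # Say "goodbye" to end the session (the stop word is arbitrary).
    if "goodbye" in heard.lower():
        speak("Good luck. Quack.")
        break

    messages.append({"role": "user", "content": heard})
    response = client.chat(model=model, messages=messages)
    messages.append({
        "role": "assistant",
        "content": response["message"].content
    })
    speak(response["message"].content)
    print(messages[-1]["content"])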
It's never been easier to build AI applications. Seriously, it didn't even take me TWO HOURS to make this! And it works surprisingly well (it's still dumb tho). Demo video coming soon!
All the code is available on GitHub. Feel free to check it out! Fair warning: I'm new to NixOS, and even by my standards, the dependency management on the repo is quite trash. Most deps are managed through nix-shell, except for the kokoro inference library, which uses venv.