We've all been there. Your program isn't doing what you want it to, even though nothing seems off about the code.
In a scenario like this, one has no choice but to turn to the rubber duck. You explain your code to it, and (seemingly out of thin air) you find out exactly what the problem is.
Why not take this mundane, one-way interaction to the next level? (seriously, why not?)
Speech Recognition
To recognize speech, we're using the speech_recognition library with faster_whisper.
import os
import time
import speech_recognition as sr
import torch
from faster_whisper import WhisperModel
First, let's load the model weights required for faster_whisper:
MODEL = "small.en"
model = WhisperModel(
    MODEL,
    device="cpu",
    compute_type="int8"
)
pipeline = model.transcribe
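If you want to sanity-check the model before wiring up the mic, here's a minimal sketch. (The file name test.wav is just a placeholder; use any recording you have lying around.)

# Hypothetical quick test: transcribe a pre-recorded file.
segments, info = pipeline("test.wav", vad_filter=True)
print(f"Detected language: {info.language} ({info.language_probability:.2f})")
for segment in segments:
    print(segment.text)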
Next, let's initialize the mic and recognizer:
r = sr.Recognizer()
m = sr.Microphone()
with m as source:
    print("SIIIIIILLLLEENNCE")
    r.adjust_for_ambient_noise(source)
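Optional: if the recognizer keeps cutting you off mid-sentence (or never stops listening), speech_recognition exposes a few knobs on the Recognizer. The values below are just examples, not part of my actual setup:

# Optional tuning, example values only:
r.pause_threshold = 0.8            # seconds of silence that end a phrase
r.energy_threshold = 300           # minimum volume to count as speech
r.dynamic_energy_threshold = True  # keep adapting to ambient noise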
Finally, let's create the recognize function. I don't know why I'm doing it this way, even though the speech_recognition library has its own faster_whisper integration (I yoinked this code from an older repo of mine).
def recognize():
    # Record a phrase from the mic and dump it to a temporary wav file.
    with m as source:
        audio = r.listen(source)
    with open("audio.wav", "wb") as f:
        f.write(audio.get_wav_data())
    t = time.perf_counter()
    # Run faster_whisper on the recording.
    with torch.inference_mode():
        segments, info = pipeline(
            "audio.wav",
            vad_filter=True
        )
    recognized = ""
    for segment in segments:
        recognized += segment.text
    os.remove("audio.wav")
    return recognized
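A quick way to test this file on its own (just a sketch):

if __name__ == "__main__":
    # Say something into the mic and see what comes back.
    print("You said:", recognize())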
Text-To-Speech
I was just strolling around huggingface the other day when I found this intriguing text-to-speech model called kokoro. I've been wanting to try it out for quite a while now, so I decided to use it for this project.
It's REALLY fast, even on my potato PC. Anyway, back to the code. Let's import what we need real quick and set up some stuff:
from kokoro import KPipeline
import soundfile as sf
from pydub import AudioSegment
from pydub.playback import play
import os
pipeline = KPipeline(lang_code='a')
FILE = "s.wav"
For the speak function, we just run inference, save the output to an audio file, play it, and delete it.
def speak(text):
    generator = pipeline(
        text,
        voice='af_heart'
    )
    for i, (gs, ps, audio) in enumerate(generator):
        # Write the generated audio, play it back, then clean up.
        sf.write(
            FILE,
            audio,
            24000
        )
        audio = AudioSegment.from_wav(FILE)
        play(audio)
        os.remove(FILE)
        break
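Same deal here, a tiny standalone test:

if __name__ == "__main__":
    # Should play a short clip through your speakers.
    speak("Quack. I am listening.")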
Putting It All Together
For the actual responses, let's use ollama. Since my PC can't handle large models very well, I'll just stick to llama3.2:3b for now.
Let's import ollama, as well as the other two files, which I saved as microphone.py and tts.py.
import ollama
from microphone import recognize
from tts import speak
I'm no prompt engineer, but this prompt works pretty well:
SYSPROMPT = """I'm a programmer. You are my rubber duck. Your responses should be short, concise, insightful and motivating. Your responses should be REALLY short. One sentence, not more than 15 words. You're not allowed to see the code. Be a good listener, give insightful hints."""
Now, we can initialize the ollama client and set up some stuff:
client = ollama.Client()
model = "llama3.2"
messages = [
    {
        "role": "system",
        "content": SYSPROMPT,
    },
]
And now for the mainloop!
while True:
    print("Say something!")
    # Listen to the user and add what they said to the conversation.
    messages.append({
        "role": "user",
        "content": recognize()
    })
    response = client.chat(
        model=model,
        messages=messages
    )
    # Remember the duck's reply, then say it out loud.
    messages.append({
        "role": "assistant",
        "content": response["message"].content
    })
    speak(response["message"].content)
    print(messages[-1]["content"])
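If you'd rather not exit by mashing Ctrl+C into a traceback, you could wrap the loop like this. This is just a sketch, not what's in the repo:

# Hypothetical variant: same loop, but Ctrl+C exits cleanly.
try:
    while True:
        print("Say something!")
        messages.append({"role": "user", "content": recognize()})
        response = client.chat(model=model, messages=messages)
        reply = response["message"].content
        messages.append({"role": "assistant", "content": reply})
        speak(reply)
        print(reply)
except KeyboardInterrupt:
    print("Bye, duck.")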
It's never been easier to build AI applications. Seriously, it didn't even take me TWO HOURS to make this! And it works surprisingly well (it's still dumb tho). Demo video coming soon!
All the code is available on GitHub. Feel free to check it out! Fair warning: I'm new to NixOS. Even by my standards, the dependency management in the repo is quite trash. Most deps are managed through nix-shell, except for the kokoro inference library, which uses venv.