We've all been there. Your program isn't doing what you want it to, even though nothing seems off about the code.
In a scenario like this, one has no choice but to turn to the rubber duck. You explain your code to it, and (seemingly out of thin air) you find out exactly what the problem is.
Why not take this mundane, one-way interaction to the next level? (seriously, why not?)
Speech Recognition
To recognize speech, we're using the speech_recognition library with faster_whisper.
import os
import time
import speech_recognition as sr
import torch
from faster_whisper import WhisperModel
First, let's load the model weights required for faster_whisper:
MODEL = "small.en"
model = WhisperModel(
    MODEL,
    device="cpu",
    compute_type="int8"
)
pipeline = model.transcribe
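If you want to sanity-check the model before wiring up the mic, here's a minimal sketch. (The file name test.wav is just a placeholder; use any recording you have lying around.)

# Hypothetical quick test: transcribe a pre-recorded file.
segments, info = pipeline("test.wav", vad_filter=True)
print(f"Detected language: {info.language} ({info.language_probability:.2f})")
for segment in segments:
    print(segment.text)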
Next, let's initialize the mic and recognizer:
r = sr.Recognizer()
m = sr.Microphone()
with m as source:
    print("SIIIIIILLLLEENNCE")
    r.adjust_for_ambient_noise(source)
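Optional: if the recognizer keeps cutting you off mid-sentence (or never stops listening), speech_recognition exposes a few knobs on the Recognizer. The values below are just examples, not part of my actual setup:

# Optional tuning, example values only:
r.pause_threshold = 0.8            # seconds of silence that end a phrase
r.energy_threshold = 300           # minimum volume to count as speech
r.dynamic_energy_threshold = True  # keep adapting to ambient noise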
Finally, let's create the recognize function. I don't know why I'm doing it this way, even though the speech_recognition library has its own faster_whisper integration (I yoinked this code from an older repo of mine).
def recognize():
    # Record a phrase from the mic and dump it to a temporary wav file.
    with m as source:
        audio = r.listen(source)
    with open("audio.wav", "wb") as f:
        f.write(audio.get_wav_data())
    t = time.perf_counter()
    # Run faster_whisper on the recording.
    with torch.inference_mode():
        segments, info = pipeline(
            "audio.wav",
            vad_filter=True
        )
    recognized = ""
    for segment in segments:
        recognized += segment.text
    os.remove("audio.wav")
    return recognized
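A quick way to test this file on its own (just a sketch):

if __name__ == "__main__":
    # Say something into the mic and see what comes back.
    print("You said:", recognize())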
Text-To-Speech
I was just strolling around huggingface the other day when I found this intriguing text-to-speech model called kokoro. I've been wanting to try it out for quite a while now, so I decided to use it for this project.
It's REALLY fast, even on my potato PC. Anyway, back to the code. Let's import what we need real quick and set up some stuff:
from kokoro import KPipeline
import soundfile as sf
from pydub import AudioSegment
from pydub.playback import play
import os
pipeline = KPipeline(lang_code='a')
FILE = "s.wav"
For the speak function, we just run inference, save the output to an audio file, play it, and delete it.
def speak(text):
    generator = pipeline(
        text,
        voice='af_heart'
    )
    for i, (gs, ps, audio) in enumerate(generator):
        # Write the generated audio, play it back, then clean up.
        sf.write(
            FILE,
            audio,
            24000
        )
        audio = AudioSegment.from_wav(FILE)
        play(audio)
        os.remove(FILE)
        break
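Same deal here, a tiny standalone test:

if __name__ == "__main__":
    # Should play a short clip through your speakers.
    speak("Quack. I am listening.")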
Putting It All Together
For the actual responses, let's use ollama. Since my PC can't handle large models very well, I'll just stick to llama3.2:3b for now.
Let's import ollama, as well as the other two files, which I saved as microphone.py and tts.py.
import ollama
from microphone import recognize
from tts import speak
I'm no prompt engineer, but this prompt works pretty well:
SYSPROMPT = """I'm a programmer. You are my rubber duck. Your responses should be short, concise, insightful and motivating. Your responses should be REALLY short. One sentence, not more than 15 words. You're not allowed to see the code. Be a good listener, give insightful hints."""
Now, we can initialize the ollama client and set up some stuff:
client = ollama.Client()
model = "llama3.2"
messages = [
    {
        "role": "system",
        "content": SYSPROMPT,
    },
]
And now for the mainloop!
while True:
    print("Say something!")
    # Listen to the user and add what they said to the conversation.
    messages.append({
        "role": "user",
        "content": recognize()
    })
    response = client.chat(
        model=model,
        messages=messages
    )
    # Remember the duck's reply, then say it out loud.
    messages.append({
        "role": "assistant",
        "content": response["message"].content
    })
    speak(response["message"].content)
    print(messages[-1]["content"])
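If you'd rather not exit by mashing Ctrl+C into a traceback, you could wrap the loop like this. This is just a sketch, not what's in the repo:

# Hypothetical variant: same loop, but Ctrl+C exits cleanly.
try:
    while True:
        print("Say something!")
        messages.append({"role": "user", "content": recognize()})
        response = client.chat(model=model, messages=messages)
        reply = response["message"].content
        messages.append({"role": "assistant", "content": reply})
        speak(reply)
        print(reply)
except KeyboardInterrupt:
    print("Bye, duck.")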
It's never been easier to build AI applications. Seriously, it didn't even take me TWO HOURS to make this! And it works surprisingly well (it's still dumb tho). Demo video coming soon!
All the code is available on GitHub. Feel free to check it out! Fair warning: I'm new to NixOS. Even by my standards, the dependency management in the repo is quite trash. Most deps are managed through nix-shell, except for the kokoro inference library, which uses venv.