It took some time, but we finally got Kokoro TTS (v1.0) running in-browser w/ WebGPU acceleration! This enables real-time text-to-speech without the need for a server. Looking forward to your feedback!
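For the curious: a TTS model like this emits raw PCM samples (Kokoro outputs 24 kHz mono audio), and turning those into something playable is just a matter of wrapping them in a WAV container. A minimal sketch of that step — not the demo's actual code:

```javascript
// Wrap mono Float32Array PCM samples (values in [-1, 1]) in a 16-bit WAV container.
// Kokoro outputs 24 kHz mono; the container itself is the standard RIFF/WAVE layout.
function encodeWav(samples, sampleRate = 24000) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeStr = (offset, s) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeStr(0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true); // RIFF chunk size
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt subchunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate (mono, 16-bit)
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeStr(36, "data");
  view.setUint32(40, samples.length * 2, true);

  // Clamp and convert float samples to signed 16-bit little-endian PCM.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, (s < 0 ? s * 0x8000 : s * 0x7fff) | 0, true);
  }
  return new Uint8Array(buffer);
}
```

In a browser you could then do `new Blob([encodeWav(samples)], { type: "audio/wav" })` and point an `<audio>` element at `URL.createObjectURL(blob)`.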
amelius 34 days ago [-]
Now that's what I call "server-less" computing!
deivid 34 days ago [-]
Amazing! I'm interested in models running locally, and Kokoro looks very promising. Are you aware of similar models, but for speech-to-text?
xenova 34 days ago [-]
We have released a bunch of speech recognition demos (using Whisper, Moonshine, and others). For example:

- https://huggingface.co/spaces/Xenova/whisper-web
- https://huggingface.co/spaces/Xenova/whisper-webgpu
- https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu
- https://huggingface.co/spaces/webml-community/moonshine-web
How can I understand what's in the compiled JS though? Is there some source for that?
Ono-Sendai 34 days ago [-]
whisper
sebastiennight 34 days ago [-]
This is brilliant. All we need now is for someone to code a frontend for it so we can input an article's URL and have this voice read it out loud... built-in local voices on macOS are not even close to this Kokoro model.
satvikpendem 34 days ago [-]
There are a few already; I assume MacWhisper will add it. That being said, I am also working on a cross-platform UI for this (in Flutter).
sebastiennight 33 days ago [-]
My understanding is that MacWhisper is a front-end for Whisper.cpp so... it does Speech-to-text? (transcribing what you dictate)
Here I'm talking about the model shared in this thread, which is text-to-speech (reading out loud content from the web).
I made https://app.readaloudto.me/ as a hobby thing, and now it could be enhanced with a local TTS option!
satvikpendem 33 days ago [-]
Yes, I am saying they might include features for TTS in addition to their current STT feature set. Seems like many of these sorts of apps are looking to add both to be more full-fledged.
waynenilsen 34 days ago [-]
Incredible work! I have listened to several TTS systems, and to have this be free and completely under the customer's control is absolutely incredible. This will unlock new use cases.
Brilliant job! Love how fast it is. If the rapid pace of speech ML continues, I'm sure we'll have speech-to-speech models running directly in our browsers!
dust42 34 days ago [-]
It's already there: Hibiki, by Kyutai.org, was released yesterday with speech-to-speech, French to English, on iPhone:
https://x.com/neilzegh/status/1887498102455869775
https://github.com/kyutai-labs/hibiki
(I get the joke that for some definition of real-time this is real-time).
The reason why I use an API is because time to first byte is the most important metric in the apps I'm working on.
That aside, kudos for the great work and I'm sure one day the latency on this will be super low as well.
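On the time-to-first-byte point: with a local model, one common mitigation is to split the input at sentence boundaries and synthesize chunk by chunk, so playback can begin after the first sentence instead of after the whole document. A sketch of the splitting step — the `tts.generate` call in the trailing comment is a hypothetical API, not this demo's actual interface:

```javascript
// Split text into sentence-sized chunks so synthesis can start streaming early.
// Naive boundary detection; real code would handle abbreviations, quotes, etc.
function splitIntoChunks(text, maxLen = 300) {
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [];
  const chunks = [];
  let current = "";
  for (const s of sentences) {
    if (current && current.length + s.length > maxLen) {
      chunks.push(current.trim());
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Hypothetical usage: play each chunk as soon as it is ready.
// for (const chunk of splitIntoChunks(article)) {
//   const audio = await tts.generate(chunk, { voice });  // assumed API
//   enqueueForPlayback(audio);
// }
```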
itishappy 34 days ago [-]
Sounds terrible on Chrome with an AMD 5700XT.
Sounds great on Chrome with an Nvidia 1650Ti.
Sounds great on Chrome on a Pixel 6.
Sounds like it's being bitcrushed. Maybe a 64- vs 32-bit error? Solid results when working.
ASalazarMX 33 days ago [-]
Ubuntu 24.04 LTS. Works great on Firefox, on Chromium audio files are silent, even when downloaded and opened with a media player.
Edit: Sorry, it was a problem of my specific audio setup, it works equally well on Chromium.
SubiculumCode 34 days ago [-]
Kokoro gives pretty good voices and is quite light, making it useful despite its lack of voice-cloning capability. However, I haven't figured out how to run it in the context of a TTS server without homebrewing the server... which maybe is easy? IDK.
Fantastic work. My dream would be to use this for a browser audiobook generator for epubs. I made a cli audiobook generator with Piper [0] that got some traction and I wanted to port it to the browser, but there were too many issues. [1]
Is there source anywhere? Seems the assets/ folder is bundled js.
In my opinion, there's a ton of opportunity for private, progressive web apps with this while WebGPU is still relatively newly implemented.
Would love to collaborate in some way if others are also interested in this.

[0] https://github.com/C-Loftus/QuickPiperAudiobook/
[1] https://github.com/rhasspy/piper/issues/352
Sounds horrible in Chrome with an AMD GPU, why is that?
mdaniel 34 days ago [-]
Are you somehow implying that everyone in the AI arms race believes that only CUDA exists?! /s
But, in a more serious tone: the story that I hear about AMD GPUs is that they are, in fact, shittier because AMD themselves give fewer shits. GIGO
CyberDildonics 34 days ago [-]
What is this comment saying? You think the results are different just because of AMD hardware? If there is a difference it would be a software bug.
dragonwriter 34 days ago [-]
Everyone in the space only caring about (and therefore testing on) Nvidia/CUDA, as suggested in the GP, is exactly why a software bug that seriously impacts results but only affects AMD GPUs would get through into released software very easily.
CyberDildonics 34 days ago [-]
That would be a webgpu bug or an AMD bug, not a bug in this software.
realsid 34 days ago [-]
Amazing! This is my first time witnessing a model of such prowess running in the browser. Curious about quantization and WebML.
yawnxyz 34 days ago [-]
holy cow, how did they get the OpenAI voices like Alloy and Echo generated in-browser and sounding 99% the same?
this is astounding
djeastm 34 days ago [-]
Fyi I tried this on my Galaxy S21 with both Brave and Chrome browsers and just got screeching noises in the audio
mewse-hn 34 days ago [-]
the mere idea of voice software's error mode being uncontrollable screeching is the most hilarious thing to me
WebGPU actually generates the speech entirely in the browser. Web Speech is great too, but less practical if the model is complicated to set up and integrate with the speech API on the host.
scarface_74 34 days ago [-]
I don’t understand. From what I can tell, it’s natively supported on all modern browsers and on Windows, Macs, iOS and Android
moron4hire 33 days ago [-]
The implementation of the Web Speech API usually involves the specific browser vendor calling out to their own, proprietary, cloud-based TTS APIs. I say "usually" because, for a time, Microsoft used their local Windows Speech API in Edge, but I believe they've stopped that and have largely deprecated Windows Speech for Azure Speech even at the OS level.
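One detail worth knowing here: the Web Speech API does report whether a given voice is synthesized on-device, via the `localService` property on `SpeechSynthesisVoice`. A minimal sketch of filtering for local voices — in a real page the voice list comes from `speechSynthesis.getVoices()`, which may be empty until the `voiceschanged` event fires:

```javascript
// Filter speech-synthesis voice objects down to on-device ones.
// SpeechSynthesisVoice.localService is false for voices backed by a network service.
function localVoices(voices) {
  return voices.filter((v) => v.localService);
}

// In a browser:
// speechSynthesis.addEventListener("voiceschanged", () => {
//   console.log(localVoices(speechSynthesis.getVoices()).map((v) => v.name));
// });
```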
scarface_74 33 days ago [-]
Just to be clear, are you really saying that text-to-speech is server-hosted and not on-device for Windows?
https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
You could do text-to-speech on a 1 MHz Apple //e using the 1-bit speaker back in the '80s (Software Automated Mouth), and MacinTalk was built into the Mac in 1984. I know it's built into both Mac and iOS devices and runs offline.
But I do see how cross-platform browsers like Firefox would want a built-in solution that doesn't depend on the vendor.
moron4hire 33 days ago [-]
If the application is still using the deprecated Microsoft Speech API (SAPI), it's being done locally, but that API hasn't received updates in like a decade and the output is considerably lower quality than what people expect to hear today.
Firefox on Windows is one such application that still uses SAPI. I don't know what it uses on other operating systems. On Android, I imagine it uses whatever the built-in OS TTS API is, which likely goes through Google Cloud.
But anything that sounds at all natural, from any of the OS or browser vendors, is going through some cloud TTS API now.
asqueella 29 days ago [-]
I’m pretty sure that built-in TTS on Mac and iPhone is local (and has been for ages).
scarface_74 33 days ago [-]
Any luck with getting this running on iOS 18.2.1 running Safari? I have the WebGPU feature flag turned on (Settings -> Safari -> Advanced) and I've tried a few other WebGPU demos successfully.
jasonjmcghee 33 days ago [-]
Loads for a while then crashes for me. Guessing too much RAM usage
butz 34 days ago [-]
Generating audio takes a bit, but wow, 92MB model for really decent sounding speech. Is there a way to plug this thing into speech dispatcher on Linux and use for accessibility?
ranger_danger 34 days ago [-]
How do I download this and run it actually offline?
Quality sounded good compared to a lot of other small TTS models I've tried.