Download "Transcribe Text via SFSpeechRecognizer (Lesson 05)"

Download this video with UDL Client
  • Video mp4 HD+ with sound
  • Mp3 in the best quality
  • Any size files
Video tags
swift, swiftui, combine, xcode, ios, app, development, programming, code, apple, academy, tutorial, build, iphone
Subtitles
00:00:00
In today's lesson we will be using SFSpeechRecognizer, the Speech framework API, to allow voice input for our local AI chat app. This topic is a bit more complex to implement, as we won't really be living in SwiftUI land for this lesson. So I've already prepared all the code, and I will walk you through it during the lesson. Parts of the code are also linked in the video description for your own reference, as I won't go through every single detail today. Basically, there are three things we need to do. First of all, we of course need to implement a transcription service, which I've called SpeechToTextService in this project. Then we need to add this little dictation button down here, which is just an SF Symbol button. And lastly, we need to pull everything together and use the speech recognizer in our existing business logic. So, let's get right started.
00:00:55
In our UI, we already have our safe-area inset at the bottom, with our text field and our send button, so these two things down here. And then we also added a second button with a microphone. While we are recording (this basically just flips an isRecording Boolean) it shows one SF Symbol; let me quickly turn that on. So this is the SF Symbol while we are recording, and this is the SF Symbol when we stop recording. Of course, microphone stuff doesn't work in the simulator, so you'll have to run this on a real device. And before I forget about it, we'll of course need the Privacy - Microphone Usage Description string (NSMicrophoneUsageDescription) in our Info.plist file in order to get the hardware access needed for this feature. So now that we have our little microphone button in there, we either call startRecording or stopRecording, depending on what the current state is.
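As a rough sketch, the button might look something like the following (the view name and bindings are mine, not the lesson's exact code). Note that speech recognition authorization additionally needs the Privacy - Speech Recognition Usage Description key (NSSpeechRecognitionUsageDescription) in Info.plist:

    import SwiftUI

    // Hypothetical sketch of the dictation toggle described above.
    struct DictationButton: View {
        @Binding var isRecording: Bool
        let start: () -> Void
        let stop: () -> Void

        var body: some View {
            Button {
                // Flip between starting and stopping based on the current state.
                if isRecording { stop() } else { start() }
            } label: {
                // One SF Symbol while recording, another while idle.
                Image(systemName: isRecording ? "stop.circle.fill" : "mic.fill")
            }
            .accessibilityLabel(isRecording ? "Stop dictation" : "Start dictation")
        }
    }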
00:01:53
I've moved all of the new variables and both of the functions to the bottom of the file just for our reference, but I would advise you to keep all of the state variables, and also the SpeechToTextService, at the top of the file so you have a nice structure. Basically, we have our isRecording flag. Then we have a potential error message to show. We also have a transcription task, and we need to store it here, but we'll get into that in just a second. And we also have our SpeechToTextService, which we'll also go through in a minute. Then we have our startRecording function and our stopRecording function.

00:02:29
Let's start by looking at the startRecording function. What does it actually do? Because we're actually streaming audio here. First, we make sure that we're not recording already, so this button can't be pressed twice. This is already somewhat enforced by the UI, or by the button itself, but it's just to make sure nothing bad happens here. Then, of course, we set isRecording to true. We create our transcription task. This is just a Swift concurrency Task, but we're storing it in the variable so it doesn't get cancelled when this function returns, because we are streaming audio: this task potentially lasts for quite a long time, until we cancel it manually in our stopRecording function. In there, we await authorization to use the microphone and speech-to-text. Then we create a stream (we'll look into what this transcribe function does in a minute). We have a for try await loop, so we are streaming in the partial results from the microphone and assigning each new value to self.input. And self.input is just the variable that our text field writes to, and the variable holding the message that we send to our large language model. So basically, we take the partial result and assign it to our input; that way it also automatically gets displayed in the text field. And if there's any error, we assign it to our error message, and you can show that on your screen whichever way you'd like.
00:03:55
Since we're already here, let's also briefly look into the stopRecording function before we go into how the transcription with SFSpeechRecognizer actually works. We make sure that we are actually recording, then we set isRecording to false. We cancel and nil out our transcription task, making sure that no new audio gets streamed into our text field variable. And then we also call stopTranscribing on our SpeechToTextService to clean up everything over there.
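Put together, the state and the two functions described so far might look roughly like this sketch. It is a hypothetical reconstruction: SpeechToTextService and its authorize()/transcribe()/stopTranscribing() methods are assumed from the walkthrough, and the real file is linked in the description:

    import SwiftUI

    struct ChatInputView: View {
        // Keep the state and the service at the top of the file, as advised above.
        @State private var input = ""
        @State private var isRecording = false
        @State private var errorMessage: String?
        @State private var transcriptionTask: Task<Void, Never>?
        @State private var speechToTextService = SpeechToTextService()

        var body: some View {
            HStack {
                TextField("Message", text: $input)
                DictationButton(isRecording: $isRecording,
                                start: startRecording,
                                stop: stopRecording)
            }
        }

        private func startRecording() {
            guard !isRecording else { return }  // guard against double taps
            isRecording = true
            // Store the task so it outlives this function call; it keeps
            // streaming until stopRecording() cancels it.
            transcriptionTask = Task {
                do {
                    try await speechToTextService.authorize()
                    for try await partial in speechToTextService.transcribe() {
                        input = partial  // partial results show up in the text field
                    }
                } catch {
                    errorMessage = error.localizedDescription
                }
            }
        }

        private func stopRecording() {
            guard isRecording else { return }
            isRecording = false
            transcriptionTask?.cancel()  // stop streaming into `input`
            transcriptionTask = nil
            speechToTextService.stopTranscribing()
        }
    }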
00:04:24
So, let's have a look at our SpeechToTextService. This file is linked via a GitHub gist in the video description, so you can check it out and download it over there, or reference it for your own implementation. At the top, we have two wrappers for some old completion-handler-based APIs of SFSpeechRecognizer and AVAudioSession. I won't go into detail; you can look these up in the gist. These are just concurrency wrappers to make our API a bit nicer to use.
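For reference, such wrappers typically bridge the completion-handler APIs into async/await with checked continuations, along these lines (a sketch of the general pattern, not necessarily the gist's exact code):

    import AVFoundation
    import Speech

    extension SFSpeechRecognizer {
        // Bridge the callback-based authorization API into async/await.
        static func hasAuthorizationToRecognize() async -> Bool {
            await withCheckedContinuation { continuation in
                SFSpeechRecognizer.requestAuthorization { status in
                    continuation.resume(returning: status == .authorized)
                }
            }
        }
    }

    extension AVAudioSession {
        // Same idea for the microphone permission prompt.
        func hasPermissionToRecord() async -> Bool {
            await withCheckedContinuation { continuation in
                requestRecordPermission { authorized in
                    continuation.resume(returning: authorized)
                }
            }
        }
    }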
00:04:52
As I said, I won't go through this line by line, because it is quite complex, but I will try to explain the logic behind how it works. If you want to dig into it yourself, you're free to check out the file, or the documentation that I've linked at the top of the file. As you can see, there is quite a bit of code in here. We have an authorize function. We have a transcribe function, which is probably the most interesting one. And we have a stopTranscribing function, reset, and prepareEngine. Let's look into transcribe, because this is the only function that actually gets called in our UI to do the transcription.

00:05:22
If you remember from our UI implementation, we have a stream that we get from the SpeechToTextService's transcribe function, so it has to return an AsyncThrowingStream, which lets us use the for try await syntax. This is how you create one of these: you get a continuation, and you can yield values into the continuation to add a new entry to the stream. This is very similar to how Combine was used back in the day.
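The basic shape is something like this minimal sketch; the body of the task is filled in by the engine and recognition code we look at next:

    import Speech

    func transcribe() -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            let task = Task {
                // Set up the audio engine and recognition task (see below),
                // then call continuation.yield(partialText) as results arrive
                // and continuation.finish() once the result is final.
            }
            continuation.onTermination = { _ in
                task.cancel()  // tear down if the consumer stops listening
            }
        }
    }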
00:05:59
In there we create a task, because this is an async throwing stream, and we have some safety checks that our speech recognizer has been created and that it is available on the device. And then comes the interesting part. As you can see, it's actually not that much code; there's just quite a bit of overhead and some error handling in here. The key thing is that to do the speech transcription, we need an audio engine and a speech recognition request, and we create these in the prepareEngine function, just to keep this code a bit more organized.
00:06:33
So, let's have a look at the prepareEngine function. You can already see it returns an AVAudioEngine and an SFSpeechAudioBufferRecognitionRequest, quite a mouthful. It's basically just a bunch of setup code for an audio session: we make sure to set the category to record, with the measurement mode and the duckOthers option, so that the audio is clean and usable for our use case here. Then we create our audio engine. We do a bit more setup here, but that's not too interesting, I believe. The second thing we return, aside from our engine, is our recognition request, or speech recognition request, and we create that over here. It's important that we set shouldReportPartialResults to true, because that's what we want to do: we want to stream in the audio. We also want to add punctuation, because the user might speak multiple sentences and use periods or question marks, for example. And we also tell the recognition request that this is in fact a dictation; this just helps it internally to be a bit more accurate. If we want to, we can set requiresOnDeviceRecognition to true to only use the on-device models, but in our use case it doesn't really matter. All right, then we do some more setup, we prepare our engine, then we start it, and we return our engine and our SFSpeechAudioBufferRecognitionRequest.
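As a reference point, the standard SFSpeechRecognizer setup matching this description looks roughly like the following sketch, based on Apple's documented APIs (the gist may differ in detail):

    import AVFoundation
    import Speech

    func prepareEngine() throws -> (AVAudioEngine, SFSpeechAudioBufferRecognitionRequest) {
        let audioEngine = AVAudioEngine()

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true  // stream partial text as it is recognized
        request.addsPunctuation = true             // periods, question marks, etc.
        request.taskHint = .dictation              // hint for better dictation accuracy
        // request.requiresOnDeviceRecognition = true  // optional: on-device models only

        // Configure the shared audio session for clean microphone input.
        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

        // Feed microphone buffers from the input node into the request.
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()
        return (audioEngine, request)
    }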
00:07:59
So once we have called prepareEngine here, we are now locally storing our audio engine and our request, and then we're creating our real recognition task. Once again, there's a lot of state checking and error handling up here, but it's actually pretty simple. With our recognition task, we get a result and an optional error. If we have an error, of course, we show some error state and we reset the speech-to-text service; that's not really important right now, and you can of course change the implementation for your own app. What's interesting is that we now use the continuation I mentioned beforehand from our async throwing stream: we iterate over these results, and we yield the best transcription as a formatted string. This works because we set shouldReportPartialResults to true. And just as a reminder, we get this transcription within the closure of our recognition task. This result object is actually of type SFSpeechRecognitionResult, and it doesn't only have a bestTranscription; it also has an isFinal property. So if the transcription is final, we finish our continuation, so the for loop will stop, and we also reset our SpeechToTextService to be clean again for the next dictation.
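That wiring might look something like this sketch; the function and parameter names are mine, and it just shows how the recognition callback feeds the stream's continuation:

    import Speech

    func startRecognition(
        recognizer: SFSpeechRecognizer,
        request: SFSpeechAudioBufferRecognitionRequest,
        continuation: AsyncThrowingStream<String, Error>.Continuation
    ) -> SFSpeechRecognitionTask {
        recognizer.recognitionTask(with: request) { result, error in
            if let result {
                // Yield each (partial or final) transcription into the stream.
                continuation.yield(result.bestTranscription.formattedString)
                if result.isFinal {
                    continuation.finish()  // ends the `for try await` loop in the UI
                }
            }
            if let error {
                continuation.finish(throwing: error)  // surfaces as errorMessage in the UI
            }
        }
    }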
00:09:26
Then, of course, we also have a stopTranscribing function and a reset function. You can look into these yourself if you're interested. Once again, this file is linked as a GitHub gist in the video description.
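For completeness, a teardown along these lines is typical (again a hypothetical sketch, not the gist's exact code):

    import AVFoundation
    import Speech

    final class SpeechToTextService {
        private var audioEngine: AVAudioEngine?
        private var request: SFSpeechAudioBufferRecognitionRequest?
        private var recognitionTask: SFSpeechRecognitionTask?

        func stopTranscribing() {
            reset()
        }

        // Tear everything down so the next dictation starts clean.
        func reset() {
            recognitionTask?.cancel()
            audioEngine?.stop()
            audioEngine?.inputNode.removeTap(onBus: 0)
            audioEngine = nil
            request = nil
            recognitionTask = nil
        }
    }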
00:09:38
And there we have it: we have successfully added dictation to our local AI chat app using only on-device, first-party Apple frameworks. So this is actually super easy. SFSpeechRecognizer also works pretty accurately, as far as my tests and the community feedback I've heard are concerned. This is a big improvement over the APIs we had in the early days. Of course, our setup targets iOS 26 and up, but that's totally fine for our use case, as we're using Foundation Models, which is only available in iOS 26 and up anyway.

Description:

This lesson walks through an example of integrating SFSpeechRecognizer from Apple's Speech framework into an iOS 26 local AI chat app.

Example Code: https://gist.github.com/chFlorian/0bc373278b11cff5ea547d112a5b5ac6
Join this channel to get access to perks: https://www.youtube.com/channel/UCYt_AtiKPyda44NYzwABvQQ/join

🚀 LaunchBuddy: https://apple.co/3iFcjjW
📚 Try CWC+: https://bit.ly/cwc_flo
🔭 Astro for ASO: https://flowritesco.de/astro
☕️ Buy me a coffee: https://ko-fi.com/flowritescode
👋 Links: https://flowritesco.de
🛠 Forge: https://apple.co/3riG8MQ

Affiliate Links ❤
📕 SwiftUI & Combine Books: https://www.bigmountainstudio.com/a/tpgmp
🔬 Get Reports about your apps: https://appfigures.com/r/5by3g
📊 Privacy focused analytics: https://dashboard.telemetrydeck.com/registration/organization?referralCode=27AOWO4R1TTEJBST
💻 The most powerful mac app for developers: https://devutils.app/?ref=flo
☕️ Support me: https://ko-fi.com/flowritescode

If you have any video suggestions, please feel free to let me know in a comment. Get in contact via Twitter: https://twitter.com/FloWritesCode
