Advent of 2024, Day 7 – Microsoft Azure AI – Speech service in AI Services
This article is originally published at https://tomaztsql.wordpress.com
In this Microsoft Azure AI series:
- Dec 01: Microsoft Azure AI – What is Foundry?
- Dec 02: Microsoft Azure AI – Working with Azure AI Foundry
- Dec 03: Microsoft Azure AI – Creating project in Azure AI Foundry
- Dec 04: Microsoft Azure AI – Deployment in Azure AI Foundry
- Dec 05: Microsoft Azure AI – Deployment parameters in Azure AI Foundry
- Dec 06: Microsoft Azure AI – AI Services in Azure AI Foundry
In Azure AI Foundry you will find the Speech playground, with a wide variety of solutions to enhance your applications and add new functionality to them.
The Speech service gives you capabilities such as speech to text, real-time translation, fast transcription, voice assistants and more.
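To give a feel for the simplest of these capabilities, here is a minimal speech-to-text sketch with the Python Speech SDK; the key, region, locale and file name are placeholders you would swap for your own values:
import azure.cognitiveservices.speech as speechsdk

# Placeholder key, region and file name -- replace with your own values
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
speech_config.speech_recognition_language = "sl-SI"  # Slovenian, to match the test below
audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = recognizer.recognize_once()  # single-utterance recognition

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)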
The Speech playground can redirect you to the Speech portal at https://speech.microsoft.com/portal, where you can build solutions with containers and connect them to other Azure services.
In the playground environment you can test the quality of real-time transcription. Since I am a native Slovenian speaker, I wanted to check how well it works and how I can use multi-speaker transcription, as well as the phrase list.
Speaker diarization separates speakers in audio data when there are multiple speakers. When enabled, the number of speakers is automatically detected and each speaker is labeled in the transcription results. Audio with fewer than 20 speakers is recommended.
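The ConversationTranscriber used in the full script at the end of this post surfaces these labels in its transcribed events; a minimal callback sketch (assuming a transcriber object like the one created there) could look like this:
def transcribed_cb(evt: speechsdk.transcription.ConversationTranscriptionEventArgs):
    # speaker_id carries the automatically assigned label, e.g. "Guest-1"
    print('{}: {}'.format(evt.result.speaker_id, evt.result.text))

transcriber.transcribed.connect(transcribed_cb)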
A phrase list helps you identify known phrases in audio data, like a person's name or a specific location. By providing a list of these phrases, you improve the accuracy of speech recognition. There are some limitations in language availability, but the set of supported languages is updated regularly.
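In the SDK, the same idea can be sketched with a phrase list grammar attached to a recognizer (reusing the recognizer from the snippet above; the phrases are only illustrative):
# Bias recognition towards known names and terms
phrase_list = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
phrase_list.addPhrase("Tomaž")
phrase_list.addPhrase("transkripcija")
# phrase_list.clear()  # removes the hints again if needed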
The transcription can be downloaded as JSON (only an excerpt is shown here):
[
    {
        "Id": "d6bbe9568b9e47ce9645a32a4193775d",
        "RecognitionStatus": 0,
        "Offset": 500000,
        "Duration": 40200000,
        "Channel": 0,
        "Type": "ConversationTranscription",
        "SpeakerId": "Guest-1",
        "DisplayText": "Živjo moje ime Tomaž in tole je en hiter test transkripcije.",
        "NBest": [
            {
                "Confidence": 0.882362,
                "Lexical": "živjo moje ime tomaž in tole je en hiter test transkripcije",
                "ITN": "živjo moje ime tomaž in tole je en hiter test transkripcije",
                "MaskedITN": "živjo moje ime tomaž in tole je en hiter test transkripcije",
                "Display": "Živjo moje ime Tomaž in tole je en hiter test transkripcije.",
                "Words": [
                    {
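To work with the downloaded transcription programmatically, a small sketch like the one below (assuming the file was saved as transcription.json, a name I made up) pulls out the speaker, the display text and the confidence of the best hypothesis per phrase:
import json

# Load the downloaded transcription (the file name is just an example)
with open("transcription.json", encoding="utf-8") as f:
    phrases = json.load(f)

for phrase in phrases:
    # Each entry carries the detected speaker and the formatted text
    print('{}: {}'.format(phrase["SpeakerId"], phrase["DisplayText"]))
    print('  confidence: {}'.format(phrase["NBest"][0]["Confidence"]))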
And the sample audio file:
Going into the Speech portal (outside the playground), there are plenty of SDKs available to get you started, along with options to get the API keys needed to connect the services.
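The Python SDK itself installs with pip (the package is azure-cognitiveservices-speech; the script below also needs scipy for reading the wav file). Rather than hard-coding the key and region from the portal, you can keep them in environment variables; the variable names below are just a convention, not something the SDK requires:
import os

# Read the key and region from the environment instead of hard-coding them
speech_key = os.environ.get("SPEECH_KEY", "YourSubscriptionKey")
service_region = os.environ.get("SPEECH_REGION", "YourServiceRegion")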
And the Python script:
import time
from scipy.io import wavfile

try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:
    print("""
    Importing the Speech SDK for Python failed.
    Refer to
    https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstart-python for
    installation instructions.
    """)
    import sys
    sys.exit(1)

# Set up the subscription info for the Speech Service:
# Replace with your own subscription key and service region (e.g., "centralus").
# See the limitations in supported regions,
# https://docs.microsoft.com/azure/cognitive-services/speech-service/how-to-use-conversation-transcription
speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"

# This sample uses a wav file captured with a supported Speech SDK device (8 channel, 16 kHz, 16-bit PCM)
# See https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-devices-sdk-microphone
conversationfilename = "YourConversationWavFile"


# This sample demonstrates how to use conversation transcription from a wav file.
def conversation_transcription():
    """transcribes a conversation from a wav file"""
    # Creates speech configuration with subscription information
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

    channels = 1
    bits_per_sample = 16
    samples_per_second = 16000

    # Create audio configuration using the push stream
    wave_format = speechsdk.audio.AudioStreamFormat(samples_per_second, bits_per_sample, channels)
    stream = speechsdk.audio.PushAudioInputStream(stream_format=wave_format)
    audio_config = speechsdk.audio.AudioConfig(stream=stream)

    transcriber = speechsdk.transcription.ConversationTranscriber(speech_config, audio_config)

    done = False

    def stop_cb(evt: speechsdk.SessionEventArgs):
        """callback that signals to stop continuous transcription upon receiving an event `evt`"""
        print('CLOSING {}'.format(evt))
        nonlocal done
        done = True

    # Subscribe to the events fired by the conversation transcriber
    transcriber.transcribed.connect(lambda evt: print('TRANSCRIBED: {}'.format(evt)))
    transcriber.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    transcriber.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    transcriber.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # Stop continuous transcription on either session stopped or canceled events
    transcriber.session_stopped.connect(stop_cb)
    transcriber.canceled.connect(stop_cb)

    transcriber.start_transcribing_async()

    # Read the whole wave file at once and stream it to the SDK
    _, wav_data = wavfile.read(conversationfilename)
    stream.write(wav_data.tobytes())
    stream.close()
    while not done:
        time.sleep(.5)

    transcriber.stop_transcribing_async()


# This sample demonstrates how to use conversation transcription from the microphone.
def conversation_transcription_from_microphone():
    """transcribes a conversation from the default microphone"""
    # Creates speech configuration with subscription information
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    transcriber = speechsdk.transcription.ConversationTranscriber(speech_config)

    done = False

    def stop_cb(evt: speechsdk.SessionEventArgs):
        """callback that signals to stop continuous transcription upon receiving an event `evt`"""
        print('CLOSING {}'.format(evt))
        nonlocal done
        done = True

    # Subscribe to the events fired by the conversation transcriber
    transcriber.transcribed.connect(lambda evt: print('TRANSCRIBED: {}'.format(evt)))
    transcriber.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    transcriber.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    transcriber.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # Stop continuous transcription on either session stopped or canceled events
    transcriber.session_stopped.connect(stop_cb)
    transcriber.canceled.connect(stop_cb)

    transcriber.start_transcribing_async()

    while not done:
        # No real sample parallel work to do on this thread, so just wait for user to type stop.
        # Can't exit function or transcriber will go out of scope and be destroyed while running.
        print('type "stop" then enter when done')
        stop = input()
        if stop.lower() == "stop":
            print('Stopping async recognition.')
            transcriber.stop_transcribing_async()
            break
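The sample only defines the two functions; a minimal way to actually run the file-based variant (assuming the wav file is in the expected format) would be:
if __name__ == "__main__":
    conversation_transcription()
    # or, to transcribe live from the default microphone:
    # conversation_transcription_from_microphone()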
Tomorrow we will look in more detail at using the Speech service together with additional services in Azure and AI Services.
All of the code samples will be available on my GitHub.