We develop core speech processing technologies, such as multilingual,
unlimited-vocabulary speech recognizers for automatic speech
transcription, audio indexing, and speech-text alignment.
The VoxSigma software suite provides audio segmentation and partitioning, speaker
identification, language recognition, and large vocabulary speech recognition
capabilities in many languages.
VoxSigma is designed for professional users who need to process large
quantities of audio and video documents, such as broadcast data or call-center
communications, either in batch mode or in real time.
VoxSigma™ Software Suite
The Vocapia Research VoxSigma software suite for Linux offers state-of-the-art
performance for broadcast data and conversational data in many languages. The
VoxSigma API includes Unix/Linux commands, C and C++ libraries, a REST API, and
import and export in XML format. The VoxSigma software is available both via
licensing and via our web service.
In addition, the Vocapia Scribe3 GUI offers a convenient way to upload, process, and
post-edit audio and video documents.
Audio partitioning and speaker diarization
The first step in the VoxSigma processing chain is audio partitioning, which includes
the separation between speech and non-speech audio (such as noise or
music) and speaker diarization. Speaker diarization, also called
speaker segmentation and clustering, is the process of partitioning an
input audio stream into homogeneous segments according to speaker
identity.
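As an illustration of what a diarization result looks like, the sketch below represents it as a list of segments, each with a start time, an end time, and a speaker label, and sums the talk time per speaker. The `Segment` class, its field names, and the `spk1`/`spk2` labels are hypothetical and are not the VoxSigma output format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One homogeneous stretch of audio (hypothetical structure)."""
    start: float   # seconds from the beginning of the audio
    end: float     # seconds from the beginning of the audio
    speaker: str   # cluster label assigned by diarization, e.g. "spk1"


def talk_time(segments):
    """Sum the speech duration of each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Made-up example: two speakers taking turns.
segments = [
    Segment(0.0, 4.0, "spk1"),
    Segment(4.0, 10.0, "spk2"),
    Segment(10.0, 12.5, "spk1"),
]
print(talk_time(segments))
```

Non-speech portions (noise or music) would simply not appear in such a list, since only speech segments carry a speaker label.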
Spoken language identification
Spoken language identification (LID) is the process of recognizing
the language spoken in an audio document (broadcast audio,
podcast, or telephone). The standard VoxSigma language
identification component can recognize one of 100 languages. By
default, it is assumed that each document contains only one
language. However, this can be adjusted by specifying the maximum
number of languages and optionally providing a limited list of
possible languages. The LID module includes a customization tool
that allows users to tailor the models to their specific data,
such as adding a missing language or dialect.
Speech Transcription
Speech transcription (also called speech-to-text, or STT)
is the process of recognizing the words spoken in an audio
document, while also providing a timecode, duration, and confidence
score for each identified word. The speech transcription module can
handle various audio types, including multilingual and dual-channel
audio. It supports an unlimited vocabulary for multiple languages,
including Arabic, Cantonese, Czech, Dutch, English, Finnish, French, German,
Greek, Hebrew, Hindi, Hungarian, Italian, Latvian, Lithuanian, Mandarin, Pashto,
Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swahili, Swedish,
Turkish, and Urdu.
The STT module includes a customization tool
that allows users to tailor the models to their specific data or task.
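The per-word timecode, duration, and confidence score described above can be used, for example, to direct post-editing toward the words most likely to be wrong. The following is a minimal sketch under stated assumptions: the `Word` class, its field names, the 0.0 to 1.0 confidence range, and the 0.6 threshold are all illustrative choices, not the VoxSigma output schema.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word (hypothetical structure)."""
    text: str
    start: float       # timecode in seconds
    duration: float    # seconds
    confidence: float  # assumed to lie between 0.0 and 1.0


def low_confidence(words, threshold=0.6):
    """Return the words whose confidence falls below the threshold."""
    return [w for w in words if w.confidence < threshold]


# Made-up example transcript.
words = [
    Word("the", 0.00, 0.12, 0.98),
    Word("weather", 0.12, 0.40, 0.91),
    Word("forecast", 0.52, 0.55, 0.47),
]
for w in low_confidence(words):
    print(f"{w.start:.2f}s  {w.text}  ({w.confidence:.2f})")
```

A suitable threshold depends on the data and on how much manual review is affordable, so it would normally be tuned on a sample of corrected transcripts.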
Keyword spotting and audio indexing
Large vocabulary continuous speech recognition is a key technology
that can be used to enable content-based information access in audio
and video documents. Most of the linguistic information is encoded in
the audio channel of audiovisual data, which once transcribed can be
accessed using text-based tools. By combining language identification, speech
recognition, and speaker recognition, spoken document retrieval can
support random access to relevant portions of audio documents based on
specific criteria, reducing the time needed to identify recordings in
large audio/video databases.
Applications include data mining, news on demand, media monitoring, and telephone
speech analytics.
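One simple way to picture text-based access to transcribed audio is an inverted index that maps each word to the documents and timecodes where it was spoken. The sketch below assumes transcripts are already available as lists of `(word, start_seconds)` pairs; the function names, document identifiers, and sample data are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to a list of (doc_id, start_seconds) hits.

    documents: dict of doc_id -> list of (word, start_seconds) pairs.
    """
    index = defaultdict(list)
    for doc_id, words in documents.items():
        for text, start in words:
            index[text.lower()].append((doc_id, start))
    return index


def search(index, keyword):
    """Return every (doc_id, start_seconds) where the keyword was spoken."""
    return index.get(keyword.lower(), [])


# Made-up transcripts for two recordings.
documents = {
    "news_01": [("Election", 12.4), ("results", 12.9), ("tonight", 13.5)],
    "call_07": [("billing", 3.1), ("election", 45.0)],
}
index = build_index(documents)
print(search(index, "election"))
```

Each hit gives a playback position, which is what allows a user to jump directly to the relevant portion of a recording instead of listening to it in full.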
Speech-Text Alignment
Speech-text alignment is the process of synchronizing a speech signal with a
speech transcript or closely related text, providing timecodes for words and
sentences. The alignment process assigns a timecode to each word and each
punctuation mark in the transcript and provides confidence scores to
identify areas where the alignment may be imperfect, in particular where the
provided transcript differs from what was actually said. There are many uses
of this technology, including audio books, language learning, and video
subtitling.
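To illustrate the subtitling use case, the sketch below turns aligned sentences into SubRip (SRT) subtitle cues. It assumes the alignment result has already been reduced to `(start_seconds, end_seconds, text)` tuples; the function names and sample cues are hypothetical and do not reflect a VoxSigma export format.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues):
    """Render (start_seconds, end_seconds, text) cues as SRT text."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{number}\n{timing}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


# Made-up aligned sentences.
cues = [
    (0.0, 1.5, "Good evening."),
    (1.8, 4.25, "Here are tonight's headlines."),
]
print(to_srt(cues))
```

In practice, cues whose alignment confidence is low would be worth reviewing before publication, since those are the places where the text and the audio are most likely to disagree.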