We develop speech processing technologies, including core multilingual,
unlimited-vocabulary speech recognizers for automatic speech
transcription, audio indexing, and speech-text alignment.
The VoxSigma software suite provides audio segmentation and partitioning,
speaker identification, language recognition, and large vocabulary speech
recognition capabilities in many languages.
VoxSigma has been designed for professional users needing to process large
quantities of audio and video documents such as broadcast data or call-center
communications, either in batch mode or in real-time.
VoxSigma™ Software Suite
The Vocapia Research VoxSigma software suite for Linux offers
state-of-the-art performance for broadcast data and conversational data in
many languages. The VoxSigma API includes Unix/Linux commands, C and C++
libraries, a REST API, XML import, and export in XML or JSON formats. The
VoxSigma software is available both via licensing and via our web service.
In addition, the Vocapia Scribe3 GUI offers a convenient way to upload,
process, and post-edit audio and video documents.
Audio partitioning and speaker diarization
The first step in the VoxSigma processing chain is audio partitioning,
which includes
the separation between speech and non-speech audio (such as noise or
music) and speaker diarization. Speaker diarization, also called
speaker segmentation and clustering, is the process of partitioning an
input audio stream into homogeneous segments according to speaker
identity.
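
As a rough illustration of what a partitioning result looks like once it has been produced, the sketch below represents each homogeneous segment as a start time, an end time, and a label, then merges adjacent segments that share a label and totals the speech time per speaker. The `Segment` class, the `"non-speech"` label, and both helper functions are hypothetical names chosen for this example; they do not reflect the VoxSigma output format or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    label: str    # hypothetical speaker label, or "non-speech"


def merge_adjacent(segments, max_gap=0.0):
    """Merge consecutive segments with the same label separated by at most max_gap seconds."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = merged[-1] if merged else None
        if last and last.label == seg.label and seg.start - last.end <= max_gap:
            merged[-1] = Segment(last.start, max(last.end, seg.end), seg.label)
        else:
            merged.append(seg)
    return merged


def speaking_time(segments):
    """Total duration in seconds per speaker, ignoring non-speech segments."""
    totals = defaultdict(float)
    for seg in segments:
        if seg.label != "non-speech":
            totals[seg.label] += seg.end - seg.start
    return dict(totals)


if __name__ == "__main__":
    raw = [
        Segment(0.0, 2.0, "spk1"),
        Segment(2.0, 5.0, "spk1"),
        Segment(5.0, 6.0, "non-speech"),
        Segment(6.0, 9.0, "spk2"),
    ]
    for seg in merge_adjacent(raw):
        print(seg)
    print(speaking_time(raw))
```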
Spoken language identification
Spoken language identification (LID) is the process of recognizing
the language spoken in an audio document (broadcast audio,
podcast, or telephone). The standard VoxSigma language
identification component can recognize one of 100 languages. By
default, it is assumed that each document contains only one
language. However, this can be adjusted by specifying the maximum
number of languages and optionally providing a limited list of
possible languages. The LID module includes a customization tool
that allows users to tailor the models to their specific data,
such as adding a missing language or dialect.
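
The effect of the two options described above (a maximum number of languages and an optional restricted list) can be sketched with a small selection function. This is only an illustration under assumed inputs: the per-language scores, the language codes, and the `pick_languages` function are hypothetical and do not describe how the VoxSigma LID component is implemented or configured.

```python
def pick_languages(scores, max_languages=1, allowed=None):
    """Return up to max_languages language codes, best score first.

    scores:  dict mapping a language code to a score (higher is better).
    allowed: optional collection of language codes to restrict the choice to.
    """
    candidates = {
        lang: score
        for lang, score in scores.items()
        if allowed is None or lang in allowed
    }
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [lang for lang, _ in ranked[:max_languages]]


if __name__ == "__main__":
    scores = {"eng": 0.6, "fre": 0.3, "ger": 0.1}  # made-up values
    print(pick_languages(scores))
    print(pick_languages(scores, max_languages=2, allowed={"fre", "ger"}))
```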
Speech Transcription
Speech transcription (also called speech-to-text, or STT) is the process of
recognizing the words spoken in an audio document, while also providing a
timecode, duration, and confidence score for each identified word. The
speech transcription module can handle various audio types, including
multilingual and dual-channel audio. It supports an unlimited vocabulary for
multiple languages, including Arabic, Cantonese, Czech, Dutch, English,
Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Italian, Latvian,
Lithuanian, Mandarin, Pashto, Persian, Polish, Portuguese, Romanian,
Russian, Spanish, Swahili, Swedish, Turkish and Urdu.
The STT module includes a customization tool
that allows users to tailor the models to their specific data or task.
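
One common use of the per-word timecode, duration, and confidence score is to locate passages that may need manual review. The sketch below assumes the transcript has already been loaded into a list of hypothetical `Word` records (the class, its field names, and the 0.5 threshold are illustrative assumptions, not the VoxSigma XML or JSON schema) and groups consecutive low-confidence words into time spans.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float       # timecode in seconds
    duration: float    # seconds
    confidence: float  # assumed to lie between 0 and 1

    @property
    def end(self):
        return self.start + self.duration


def low_confidence_spans(words, threshold=0.5):
    """Group consecutive words below the threshold into (start, end, text) spans."""
    spans, current = [], []
    for word in words:
        if word.confidence < threshold:
            current.append(word)
        elif current:
            spans.append((current[0].start, current[-1].end,
                          " ".join(w.text for w in current)))
            current = []
    if current:  # close a span that runs to the end of the transcript
        spans.append((current[0].start, current[-1].end,
                      " ".join(w.text for w in current)))
    return spans


if __name__ == "__main__":
    words = [
        Word("the", 0.0, 0.25, 0.9),
        Word("quick", 0.25, 0.5, 0.3),
        Word("brown", 0.75, 0.25, 0.4),
        Word("fox", 1.0, 0.5, 0.95),
    ]
    print(low_confidence_spans(words))
```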
Keyword spotting and audio indexing
Large vocabulary continuous speech recognition is a key technology
that can be used to enable content-based information access in audio
and video documents. Most of the linguistic information is encoded in
the audio channel of audiovisual data, which once transcribed can be
accessed using text-based tools. By combining language identification,
speech recognition, and speaker recognition, spoken document retrieval can
provide random access to relevant portions of audio documents based on
specific criteria, reducing the time needed to identify recordings in
large audio/video databases.
Applications include data mining, news on demand, media monitoring, and
telephone speech analytics.
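
To make the idea of text-based access to transcribed audio concrete, here is a minimal sketch of an inverted index built from timed transcripts: each word maps to the documents and time offsets where it was recognized, so a keyword query returns positions to jump to in the audio. The input layout, the document identifiers, and the function names are assumptions made for this example and are not part of the VoxSigma software.

```python
from collections import defaultdict


def build_index(transcripts):
    """Build a word -> [(doc_id, start_seconds), ...] index.

    transcripts: dict mapping a document id to a list of (word, start_seconds).
    """
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index


def search(index, keyword):
    """Return every (doc_id, start_seconds) occurrence of the keyword, sorted."""
    return sorted(index.get(keyword.lower(), []))


if __name__ == "__main__":
    transcripts = {
        "news1": [("Election", 12.5), ("results", 13.0)],
        "call7": [("election", 3.2)],
    }
    index = build_index(transcripts)
    print(search(index, "ELECTION"))
```

Matching here is exact and case-insensitive; a real retrieval system would typically add normalization, phrase queries, and ranking.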
Speech-Text Alignment
Speech-text alignment is the process of synchronizing a speech signal with a
speech transcript or closely related text, providing timecodes for words and
sentences. The alignment process assigns a timecode to each word and each
punctuation mark in the transcript and provides confidence scores that
identify areas where the alignment may be imperfect, in particular where the
provided transcript differs from what was actually said. There are many uses
of this technology, including audio books, language learning, and video
subtitling.
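
For the subtitling use case, word-level timecodes can be turned into subtitle cues. The sketch below assumes the alignment is available as a list of `(text, start, end)` tuples in seconds (a hypothetical layout, not the VoxSigma export format) and writes cues in the widely used SubRip (SRT) layout, with a fixed number of words per cue as a simplifying assumption.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group aligned words into numbered cues of at most max_words words.

    words: list of (text, start_seconds, end_seconds) tuples in spoken order.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w[0] for w in chunk)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    aligned = [("Hello", 0.0, 0.4), ("world", 0.5, 0.9), ("again", 1.0, 1.5)]
    print(words_to_srt(aligned, max_words=2))
```

A production subtitler would usually break cues at sentence boundaries and punctuation, which the alignment also timestamps, rather than at a fixed word count.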
Translation
Vocapia Research also provides automatic translation technologies capable of
processing both speech and text inputs. Combined with our speech recognition
modules, these tools allow users to automatically transcribe and translate
multilingual audio content with high accuracy. Whether translating a conversation,
a broadcast segment, or written documents, the system ensures reliable results by
leveraging advanced language models. This feature is particularly suited for
applications such as international media monitoring or cross-lingual content analysis.
Summarization
In addition to transcription and translation, Vocapia Research offers automatic
summarization technologies that can generate concise and coherent summaries and
titles from either audio or text sources. By combining speech recognition,
natural language processing, and semantic analysis, this module extracts the
essential information from long documents, meetings, or interviews, allowing
users to save time and focus on key insights. This functionality is ideal for
news aggregation, corporate intelligence, and large-scale information indexing.