
Converts an audio stream to plain text by transcribing the speech detected in the audio stream.

Parameters

Configuration

The following table summarizes the configuration properties (access credentials) you must set in order to use this AI task.

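Provider type | Id | Key | SecretKey
Alibaba | 智能语音交互 (Intelligent Speech Interaction) app-key | 用户AccessKey (user AccessKey) | 用户AccessKey (user AccessKey)
Amazon | - | Transcribe | Transcribe
Baidu | 百度语音 (Baidu Speech) | 百度语音 (Baidu Speech) | 百度语音 (Baidu Speech)
Google | - | Cloud Speech API | -
IBM | - | SpeechToText API | -
Microsoft | - | Speech API | -
SAP | - | - | -
Tencent | 语音识别 (Speech Recognition) | 语音识别 (Speech Recognition) | -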

Sample

For a given spoken audio input, the table below shows the transcription produced by each provider and the time taken to process it.

Provider Output Benchmark
Alibaba GXAI4101 - Parameter '&Locale' is malformed. Expected values: [Chinese (Simplified, Mainland), Mandarin (Simplified, Mainland), Cantonese (Traditional, Hong Kong)] 9040ms
Amazon
{
    "Text": "The first question that comes to mind is, What is the nexus you Nexus is a tool that automatically generate software programs such as applications for Windows, the Web and, smart devices which are always at the forefront of technological evolution.",
    "Confidence": 0.982,
    "Info":[
        {
            "Property":"The", 
            "Value": "{\"start\":0.70000,\"duration\":0.17000}"
        },
        ...
        {
            "Property":"evolution", 
            "Value": "{\"start\":14.51000,\"duration\":0.60000}"
        },
        {
            "Property":".", 
            "Value": "{\"start\":15.11000,\"duration\":0.00000}"
        }
    ]
}
202627ms
Baidu GXAI4101 - Parameter '&Locale' is malformed. Expected values: [Chinese (Simplified, Mainland), Mandarin (Simplified, Mainland), Cantonese (Traditional, Hong Kong)] N/A
Google
{
    "Text": "The first question that comes to mind is what is Genesis. The Nexus is a tool that automatically generate software program such as applications for Windows.",
    "Confidence": 0.982,
    "Info":[
        {
            "Property":"the",
            "Value": "{\"start\":0.00000,\"duration\":0.90000}"
        },
        ...
        {
            "Property":"Windows",
            "Value": "{\"start\":10.10000,\"duration\":0.50000}"
        },
        {
            "Property":".",
            "Value": "{\"start\":10.60000,\"duration\":0.00000}"
        }
    ]
}
6986ms
IBM
{
    "Text": "The first question that comes to mind is. What is your nexus. Next is a tool that automatically generate software programs such as applications for windows the web and smart devices which are always at the forefront technological evolution.",
    "Confidence": 0.982,
    "Info":[
        {
            "Property":"The",
            "Value": "{\"start\":0.71000,\"duration\":0.12000}"
        },
        ...
        {
            "Property":"evolution",
            "Value": "{\"start\":14.48000,\"duration\":0.63000}"
        },
        {
            "Property":".",
            "Value": "{\"start\":15.11000,\"duration\":0.00000}"
        }
    ]
}
8682ms
Microsoft 
{
    "Text": "The first question that comes to mind is what is genexus.",
    "Confidence": 1.0
}
3412ms
SAP N/A N/A
Tencent GXAI4101 - Parameter '&Locale' is malformed. Expected values: [Chinese (Simplified, Mainland), Mandarin (Simplified, Mainland), Cantonese (Traditional, Hong Kong)] N/A
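
In the outputs above, each "Info" item carries the word timing as a JSON string nested inside "Value", so reading it takes a second decoding step. The following is a minimal Python sketch of that step, assuming the task output has been obtained as valid JSON text. The word_timings helper is a hypothetical name used for illustration and is not part of GeneXusAI.

```python
import json


def word_timings(output_json):
    """Return (word, start, duration) tuples from a transcription output.

    Assumes the structure shown in the samples: an optional "Info" list whose
    items hold the word in "Property" and a JSON-encoded string in "Value".
    """
    result = json.loads(output_json)
    timings = []
    for item in result.get("Info", []):
        # Second decode: "Value" is itself a JSON document stored as a string.
        value = json.loads(item["Value"])
        timings.append((item["Property"], value["start"], value["duration"]))
    return timings


# Hypothetical output, shaped like the Amazon sample above.
sample = json.dumps({
    "Text": "The first question",
    "Confidence": 0.982,
    "Info": [
        {"Property": "The", "Value": "{\"start\":0.70000,\"duration\":0.17000}"},
    ],
})
print(word_timings(sample))
```

Outputs without an "Info" list, such as the Microsoft sample, simply yield an empty list.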

Considerations

Short transcriptions

The transcription is made only for short audio (up to 15 seconds) and short utterances. Because of this second condition, the text output can be "incomplete" with respect to the audio input: the transcription stops at the first "silence mark" (as Microsoft does, for example). The aim is to identify a voice command.
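
One way to respect the 15-second limit is to check the audio duration before calling the task. The sketch below does this for WAV input using Python's standard wave module. The helper names and the idea of validating on the client side are assumptions made for illustration, not something the task itself provides.

```python
import io
import wave

MAX_SECONDS = 15.0  # limit stated in the text above


def wav_duration_seconds(source):
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(source, limit=MAX_SECONDS):
    """True when the audio fits within the short-transcription limit."""
    return wav_duration_seconds(source) <= limit


def make_silent_wav(seconds, rate=8000):
    """Build an in-memory mono 16-bit WAV of silence (demo data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


print(is_short_enough(make_silent_wav(2)))   # 2 seconds of audio
print(is_short_enough(make_silent_wav(20)))  # 20 seconds of audio
```

A duration check says nothing about silences inside the audio, so a clip under the limit can still be cut at the first pause.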

Chinese providers

Chinese providers only support audio spoken in Chinese. A GXAI5000 error is raised when the audio is provided in another (unknown) language.

For example, for a Chinese-spoken audio input, you get the results detailed in the table below.

Provider Output Benchmark
Alibaba
{
    "Text": "提出的第一个问题是什么是冰山冰山是一个自动生成软件,程序的工具,例如为应用程序和智能设备始终处于技术发展的最前沿。",
    "Confidence": 1.0
}
13160ms
Baidu
{
    "Text": "提出的第一个问题是一个自动生成软件程序的工具,例如应用程序,智能设备始终处于技术发展的最前沿",
    "Confidence": 1.0
}
101354ms
Tencent 
{
    "Text": "提出的第一个问题是什么仅三十一个自动生成软件程序的工具例如应用程序可智能设备始终处于技术发展的最前沿",
    "Confidence": 1.0
}
98457ms

Amazon provider

The audio file must be uploaded to Amazon S3. If you have set the Storage Provider property with your Amazon credentials, any audio is automatically stored in your S3 bucket for processing. Otherwise, you must provide a URL matching one of the following patterns:
+ http://{bucket}.s3.amazonaws.com/{path/to/filename.ext}
+ http://{bucket}.s3-{region}.amazonaws.com/{path/to/filename.ext}
+ http://s3.amazonaws.com/{bucket}/{path/to/filename.ext}
+ http://s3-{region}.amazonaws.com/{bucket}/{path/to/filename.ext}
The {region} must match the region of your access credentials (it can be omitted only when your region is 'us-east-1').
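
The four URL patterns and the region rule can be checked before the URL is passed to the task. The following Python sketch is one possible way to do so. The parse_s3_url helper is hypothetical, the bucket-name character set is simplified, and accepting https in addition to the listed http scheme is an assumption.

```python
import re

# {bucket}.s3.amazonaws.com/... and {bucket}.s3-{region}.amazonaws.com/...
_VIRTUAL_HOSTED = re.compile(
    r"^https?://(?P<bucket>[a-z0-9][a-z0-9.-]*?)"
    r"\.s3(?:-(?P<region>[a-z0-9-]+))?\.amazonaws\.com/(?P<key>.+)$"
)
# s3.amazonaws.com/{bucket}/... and s3-{region}.amazonaws.com/{bucket}/...
_PATH_STYLE = re.compile(
    r"^https?://s3(?:-(?P<region>[a-z0-9-]+))?\.amazonaws\.com/"
    r"(?P<bucket>[^/]+)/(?P<key>.+)$"
)


def parse_s3_url(url, credentials_region="us-east-1"):
    """Return (bucket, key) when the URL fits a listed pattern, else None.

    A URL without a region is treated as 'us-east-1', following the rule
    that the region may be omitted only for that region.
    """
    for pattern in (_VIRTUAL_HOSTED, _PATH_STYLE):
        match = pattern.match(url)
        if match:
            region = match.group("region") or "us-east-1"
            if region != credentials_region:
                return None
            return match.group("bucket"), match.group("key")
    return None


print(parse_s3_url("http://media.s3.amazonaws.com/audio/command.wav"))
print(parse_s3_url("http://s3-eu-west-1.amazonaws.com/media/command.wav", "eu-west-1"))
print(parse_s3_url("http://media.s3.amazonaws.com/command.wav", "eu-west-1"))
```

The bucket name "media" and the file paths are made-up values for the demonstration.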

Notes

  • Input audio format depends on the provider type.
    - Amazon WS supports mp3, mp4, wav and flac.
    - Baidu AI supports pcm, wav, and amr.
    - IBM Watson supports mp3, mp4, wav, ogg, flac and webm (GeneXus 16 Upgrade 0 only supports mp3)
    - Microsoft Azure supports wav only.
    - Google Cloud AI supports mp3, wav and ogg.
    - Tencent AI supports wav only.
    Use this site to verify your audio has an appropriate mime-type.

  • Audio format conversion can be done by using an external tool.
    GeneXusAI integrates this feature automatically as an experimental feature if you follow these steps:
    1) Download the ffmpeg tool for your server's operating system (i.e. Linux or Windows).
    2) Attach the binary file to your knowledge base as a File object.
    3) Set the Extract for {gen} Generator property to True, where '{gen}' is your working generator (Java, .NET or .NET Core).
    4) Set the {gen} Generator Extract Directory property to "Resources".
    5) Ensure that the extracted binary file has execution permission where the webapp is running.
    6) Give an &audio input to this task without worrying about its format.
    Keep in mind that enabling this feature can degrade performance.

    IMPORTANT: As an experimental feature, no support is provided and it is subject to breaking changes without notice. Use it at your own risk.

  • Microsoft's Bing Speech API will be deprecated and its credentials must be updated to Speech API before October 2019.
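
The format list and the external conversion step in the notes above can be sketched in Python as follows. The table is copied from the first note, and the extension check only approximates a real mime-type check. The function names are hypothetical, and this is not how GeneXusAI performs its built-in conversion. The command uses ffmpeg's -i (input) and -y (overwrite output) options and lets ffmpeg infer the output format from the file extension.

```python
from pathlib import Path

# Formats per provider, as listed in the note above.
SUPPORTED_FORMATS = {
    "Amazon": {"mp3", "mp4", "wav", "flac"},
    "Baidu": {"pcm", "wav", "amr"},
    "IBM": {"mp3", "mp4", "wav", "ogg", "flac", "webm"},
    "Microsoft": {"wav"},
    "Google": {"mp3", "wav", "ogg"},
    "Tencent": {"wav"},
}


def is_supported(provider, filename):
    """Check the file extension against the provider's listed formats."""
    extension = Path(filename).suffix.lstrip(".").lower()
    return extension in SUPPORTED_FORMATS[provider]


def conversion_command(filename, target_format="wav", ffmpeg="ffmpeg"):
    """Build (but do not run) an ffmpeg command that converts the file."""
    output = str(Path(filename).with_suffix("." + target_format))
    return [ffmpeg, "-y", "-i", filename, output]


audio = "command.ogg"  # hypothetical input file
if not is_supported("Microsoft", audio):
    print(conversion_command(audio))
```

The default target is wav because it is the one format that every provider in the list accepts. The resulting list can be passed to subprocess.run on a machine where ffmpeg is installed.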

Scope

Generators: .NET, .NET Framework, Java, Apple, Android, Angular
Connectivity:  Online

Availability

This procedure is available as of GeneXus 16.


Last update: February 2024 | © GeneXus. All rights reserved. GeneXus Powered by Globant