Converts an audio stream to plain text by transcribing the speech detected in the audio stream.
The following table summarizes the configuration properties (access credentials) you must set to use this AI task.
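| ProviderType \ PropertyKey | Id | Key | SecretKey |
|---|---|---|---|
| Alibaba | Intelligent Speech Interaction (智能语音交互) app-key | User AccessKey (用户AccessKey) | User AccessKey (用户AccessKey) |
| Amazon | - | Transcribe | Transcribe |
| Baidu | Baidu Speech (百度语音) | Baidu Speech (百度语音) | Baidu Speech (百度语音) |
| Google | - | Cloud Speech API | - |
| IBM | - | SpeechToText API | - |
| Microsoft | - | Speech API | - |
| SAP | - | - | - |
| Tencent | Speech Recognition (语音识别) | Speech Recognition (语音识别) | - |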
For the sample spoken audio input, the table below shows the transcription returned by each provider and the time taken to process it.
Provider |
Output |
Benchmark |
Alibaba |
GXAI4101 - Parameter '&Locale' is malformed. Expected values: [Chinese (Simplified, Mainland), Mandarin (Simplified, Mainland), Cantonese (Traditional, Hong Kong)] |
9040ms |
Amazon |
{
"Text": "The first question that comes to mind is, What is the nexus you Nexus is a tool that automatically generate software programs such as applications for Windows, the Web and, smart devices which are always at the forefront of technological evolution.",
"Confidence": 0.982,
"Info":[
{
"Property":"The",
"Value": "{\"start\":0.70000,\"duration\":0.17000}"
},
...
{
"Property":"evolution",
"Value": "{\"start\":14.51000,\"duration\":0.60000}"
},
{
"Property":".",
"Value": "{\"start\":15.11000,\"duration\":00.00000}"
}
]
}
|
202627ms |
Baidu |
GXAI4101 - Parameter '&Locale' is malformed. Expected values: [Chinese (Simplified, Mainland), Mandarin (Simplified, Mainland), Cantonese (Traditional, Hong Kong)] |
N/A |
Google |
{
"Text": "The first question that comes to mind is what is Genesis. The Nexus is a tool that automatically generate software program such as applications for Windows.",
"Confidence": 0.982,
"Info":[
{
"Property":"the",
"Value": "{\"start\":0.00000,\"duration\":0.90000}"
},
...
{
"Property":"Windows",
"Value": "{\"start\":10.10000,\"duration\":0.50000}"
},
{
"Property":".",
"Value": "{\"start\":10.60000,\"duration\":0.00000}"
}
]
}
|
6986ms |
IBM |
{
"Text": "The first question that comes to mind is. What is your nexus. Next is a tool that automatically generate software programs such as applications for windows the web and smart devices which are always at the forefront technological evolution.",
"Confidence": 0.982,
"Info":[
{
"Property":"The",
"Value": "{\"start\":0.71000,\"duration\":0.12000}"
},
...
{
"Property":"evolution",
"Value": "{\"start\":14.48000,\"duration\":0.63000}"
},
{
"Property":".",
"Value": "{\"start\":15.11000,\"duration\":0.00000}"
}
]
}
|
8682ms |
Microsoft |
{
"Text": "The first question that comes to mind is what is genexus.",
"Confidence": 1.0
}
|
3412ms |
SAP |
N/A |
N/A |
Tencent |
GXAI4101 - Parameter '&Locale' is malformed. Expected values: [Chinese (Simplified, Mainland), Mandarin (Simplified, Mainland), Cantonese (Traditional, Hong Kong)] |
N/A |
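The outputs above share a common shape: a Text string, a Confidence score and, for some providers, an Info array whose Value fields are themselves JSON-encoded strings holding start and duration. The following Python sketch reads the per-token timings from such a result. It assumes the result has been serialized as valid JSON (the samples above are abbreviated with "..."), and the helper name `word_timings` and the sample data are illustrative only, not part of GeneXusAI.

```python
import json

def word_timings(result_json: str) -> list[tuple[str, float, float]]:
    """Return (token, start, duration) for each entry of the Info array."""
    result = json.loads(result_json)
    timings = []
    # Info is absent for some providers (e.g. the Microsoft sample), so default to [].
    for item in result.get("Info", []):
        # Value is a JSON document stored as a string, so it needs a second decode.
        value = json.loads(item["Value"])
        timings.append((item["Property"], float(value["start"]), float(value["duration"])))
    return timings

# Hand-written sample in the shape shown above (not real provider output).
sample = json.dumps({
    "Text": "The evolution.",
    "Confidence": 0.982,
    "Info": [
        {"Property": "The", "Value": '{"start":0.70000,"duration":0.17000}'},
        {"Property": "evolution", "Value": '{"start":14.51000,"duration":0.60000}'},
    ],
})

for token, start, duration in word_timings(sample):
    print(f"{token}: starts at {start}, lasts {duration}")
```

The double decoding is the point of the sketch: parsing the outer document alone leaves each Value as an opaque string.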
Transcription is performed only for short audio (up to 15 seconds) and short utterances. Because of the second condition, the text output can be "incomplete" with respect to the audio input, since transcription stops at the first "silence mark" (as Microsoft does, for example). The aim is to identify a voice command.
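Given the 15-second limit, it can be useful to check a clip's length before submitting it. The sketch below is one way to do that in Python, assuming the clip is an uncompressed WAV file readable by the standard `wave` module. The names `MAX_SECONDS`, `is_short_enough` and `make_silence` are hypothetical helpers for illustration, not part of GeneXusAI.

```python
import io
import wave

MAX_SECONDS = 15.0  # limit stated in the note above

def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV file given a path or binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_short_enough(source, limit: float = MAX_SECONDS) -> bool:
    """True when the clip does not exceed the duration limit."""
    return wav_duration_seconds(source) <= limit

def make_silence(seconds: float, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, used here only as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer

print(is_short_enough(make_silence(2)))   # True
print(is_short_enough(make_silence(16)))  # False
```

This only measures total duration; it does not detect the silence mark at which a provider may stop transcribing.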
Only Chinese-spoken audio is supported. If the audio is provided in another (unknown) language, a GXAI5000 error is raised.
For example, a Chinese-spoken audio sample produces the results detailed in the table below.
Provider |
Output |
Benchmark |
Alibaba |
{
"Text": "提出的第一个问题是什么是冰山冰山是一个自动生成软件,程序的工具,例如为应用程序和智能设备始终处于技术发展的最前沿。",
"Confidence": 1.0
}
|
13160ms |
Baidu |
{
"Text": "提出的第一个问题是一个自动生成软件程序的工具,例如应用程序,智能设备始终处于技术发展的最前沿",
"Confidence": 1.0
}
|
101354ms |
Tencent |
{
"Text": "提出的第一个问题是什么仅三十一个自动生成软件程序的工具例如应用程序可智能设备始终处于技术发展的最前沿",
"Confidence": 1.0
}
|
98457ms |
The audio file must be uploaded to Amazon S3. If you have set the Storage Provider property with your Amazon credentials, any audio is automatically stored in your S3 bucket for processing. Otherwise, you must provide a URL that matches one of the following patterns:
+ http://{bucket}.s3.amazonaws.com/{path/to/filename.ext}
+ http://{bucket}.s3-{region}.amazonaws.com/{path/to/filename.ext}
+ http://s3.amazonaws.com/{bucket}/{path/to/filename.ext}
+ http://s3-{region}.amazonaws.com/{bucket}/{path/to/filename.ext}
The {region} must match the region of your access credentials (it can be omitted only when your region is 'us-east-1').
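The four URL patterns and the region rule can be checked before calling the task. The Python sketch below is one possible reading of those rules: it extracts the bucket, region and path from a URL and compares the region with that of the credentials. The names `S3Location`, `parse_s3_url` and `region_matches` are hypothetical, and accepting `https` in addition to `http` is an assumption not stated above.

```python
import re
from typing import NamedTuple, Optional

class S3Location(NamedTuple):
    bucket: str
    region: Optional[str]  # None when the URL carries no region
    key: str

# Path style: http://s3[-{region}].amazonaws.com/{bucket}/{path}
_PATH_STYLE = re.compile(
    r"^https?://s3(?:-(?P<region>[a-z0-9-]+))?\.amazonaws\.com/"
    r"(?P<bucket>[^/]+)/(?P<key>.+)$"
)
# Bucket-in-host style: http://{bucket}.s3[-{region}].amazonaws.com/{path}
_HOST_STYLE = re.compile(
    r"^https?://(?P<bucket>[^/]+?)\.s3(?:-(?P<region>[a-z0-9-]+))?\.amazonaws\.com/"
    r"(?P<key>.+)$"
)

def parse_s3_url(url: str) -> Optional[S3Location]:
    """Return the parts of an S3 URL, or None if it matches none of the patterns."""
    for pattern in (_PATH_STYLE, _HOST_STYLE):
        match = pattern.match(url)
        if match:
            return S3Location(match["bucket"], match["region"], match["key"])
    return None

def region_matches(location: S3Location, credentials_region: str) -> bool:
    """Apply the rule above: a URL without a region is valid only for us-east-1."""
    if location.region is None:
        return credentials_region == "us-east-1"
    return location.region == credentials_region

location = parse_s3_url("http://my-bucket.s3-eu-west-1.amazonaws.com/audio/clip.mp3")
print(location)
print(region_matches(location, "eu-west-1"))
```

The bucket name `my-bucket` and the path `audio/clip.mp3` are placeholders.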
- Input audio format depends on the provider type:
  - Amazon WS supports mp3, mp4, wav and flac.
  - Baidu AI supports pcm, wav and amr.
  - IBM Watson supports mp3, mp4, wav, ogg, flac and webm (GeneXus 16 Upgrade 0 supports mp3 only).
  - Microsoft Azure supports wav only.
  - Google Cloud AI supports mp3, wav and ogg.
  - Tencent AI supports wav only.
  Use this site to verify that your audio has an appropriate MIME type.
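The per-provider format list above can be expressed as a lookup table for a quick pre-check. The Python sketch below judges a file by its extension only, which is a simplifying assumption: it does not inspect the actual MIME type or audio encoding. The names `SUPPORTED_FORMATS` and `is_supported` are illustrative.

```python
from pathlib import Path

# Formats per provider, copied from the list above.
SUPPORTED_FORMATS = {
    "Amazon": {"mp3", "mp4", "wav", "flac"},
    "Baidu": {"pcm", "wav", "amr"},
    "IBM": {"mp3", "mp4", "wav", "ogg", "flac", "webm"},
    "Microsoft": {"wav"},
    "Google": {"mp3", "wav", "ogg"},
    "Tencent": {"wav"},
}

def is_supported(provider: str, filename: str) -> bool:
    """Rough check: is the file extension in the provider's supported set?"""
    extension = Path(filename).suffix.lower().lstrip(".")
    # Providers missing from the table (e.g. SAP) are treated as unsupported.
    return extension in SUPPORTED_FORMATS.get(provider, set())

print(is_supported("Microsoft", "command.wav"))  # True
print(is_supported("Microsoft", "command.mp3"))  # False
```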
- Audio format conversion can be done with an external tool. GeneXusAI integrates this automatically, as an experimental feature, if you follow these steps:
  1) Download the ffmpeg tool for your server's operating system (Linux or Windows).
  2) Attach the binary file to your knowledge base as a File object.
  3) Set the Extract for {gen} Generator property to True, where '{gen}' is your working generator (Java, .NET or .NET Core).
  4) Set the {gen} Generator Extract Directory property to "Resources".
  5) Ensure that the extracted binary file has execute permission on the server where the web application runs.
  6) Pass an &audio input to this task without worrying about its format.
  Take into account that enabling this feature can degrade performance.
  IMPORTANT: As an experimental feature, it is unsupported and subject to breaking changes without notice. Use it at your own risk.
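For readers who prefer to convert audio themselves before calling the task, the Python sketch below shows what a manual ffmpeg conversion might look like. It is not a description of how GeneXusAI invokes ffmpeg internally. The function names are hypothetical, and the sketch assumes an `ffmpeg` binary that is on the PATH (or given by path) and executable, which corresponds to step 5 above.

```python
import shutil
import subprocess
from pathlib import Path

def build_ffmpeg_command(ffmpeg: str, source: Path, target: Path) -> list[str]:
    """Build the argument list for a plain format conversion."""
    # -y overwrites the target if it exists; -i names the input file.
    # ffmpeg chooses the output format from the target's file extension.
    return [ffmpeg, "-y", "-i", str(source), str(target)]

def convert_audio(source: Path, target: Path, ffmpeg: str = "ffmpeg") -> Path:
    """Convert source to target with ffmpeg and return the target path."""
    # shutil.which also verifies that the binary is executable.
    if shutil.which(ffmpeg) is None:
        raise FileNotFoundError(f"ffmpeg binary not found or not executable: {ffmpeg}")
    subprocess.run(
        build_ffmpeg_command(ffmpeg, source, target),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # keep ffmpeg's log output out of the console
    )
    return target

print(build_ffmpeg_command("ffmpeg", Path("command.mp3"), Path("command.wav")))
```

Converting to wav is a reasonable default here because it is the format accepted by most providers in the list above.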
- Microsoft's Bing Speech API will be deprecated and its credentials must be updated to Speech API before October 2019.
This procedure is available as of GeneXus 16.