MIL OSI Translation. Region: Russian Federation –

Source: Novosibirsk State University

Specialists from the Artificial Intelligence Research Institute (AIRI) and the Moscow Institute of Physics and Technology (MIPT) have jointly fine-tuned the Wav2Vec2-Large-Ru-Golos neural network, developed by NSU scientists, to recognize voice commands for controlling an autonomous robot.

Ivan Bondarenko, a researcher at the Laboratory of Applied Digital Technologies of the International Scientific and Educational Mathematical Center at NSU, noted that the Wav2Vec2-Large-Ru-Golos and Wav2Vec2-Large-Ru-Golos-With-LM models, thanks to their high speech recognition quality and their ease of use and modification, have proved popular among specialists in Russian speech recognition. By his estimates, total downloads of these models have at times reached several thousand per month. NSU scientists have no technical means of tracking who uses these models so widely or for what purposes, but some use cases do become known to them, and fine-tuning for recognizing voice commands to control an autonomous robot is one such case.

— The essence of our colleagues’ work was as follows: they proposed using open large language models similar to ChatGPT (LLaMA2 and MiniGPT4) to automatically generate an action plan for an autonomous robot, based on the tasks a person assigns to it and on changing environmental conditions. At first glance, the idea of generating an action plan (that is, solving an automatic control problem) with neural language models instead of specialized algorithms seems surprising, since automatic control and natural language are very different scientific subjects. In fact, however, both can be regarded as sequences of elements of some sign system. Accordingly, a deep neural network that “understands” language can well be trained to transform a command given by a person into a chain of visuomotor control instructions that ensure the robot executes that command. For example, the simple human command “give me a glass of water” must be transformed into a fairly long chain of object manipulations and movements in space performed by the robot, explained Ivan Bondarenko.
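The expansion of a high-level command into a chain of low-level instructions can be sketched as follows. This is a toy illustration, not the AIRI/MIPT system: in the real work a large language model generates the plan, whereas here a hard-coded lookup table stands in for it, and every instruction name is invented for the example.

```python
# Toy sketch: a high-level human command is expanded into a chain of
# low-level visuomotor instructions. A hard-coded table stands in for
# the LLM planner; all instruction names below are hypothetical.

PLANS = {
    "give me a glass of water": [
        "locate(glass)",
        "navigate_to(glass)",
        "grasp(glass)",
        "locate(water_source)",
        "navigate_to(water_source)",
        "fill(glass, water)",
        "navigate_to(human)",
        "hand_over(glass)",
    ],
}

def plan(command: str) -> list:
    """Return the low-level instruction chain for a high-level command."""
    return PLANS.get(command.strip().lower(), [])

if __name__ == "__main__":
    for step in plan("Give me a glass of water"):
        print(step)
```

In a real system, the table lookup would be replaced by a call to the language model, whose output is parsed into the same kind of instruction sequence.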

The scientist noted that at this stage another problem arises: in addition to generating a detailed plan of low-level control instructions from a high-level human command, the robot’s on-board intelligence must also be able to correctly hear that command when a person speaks it aloud. This is exactly the problem that colleagues from AIRI solved using the neural speech recognition models Wav2Vec2-Large-Ru-Golos and Wav2Vec2-Large-Ru-Golos-With-LM.

— Our colleagues compared these neural networks with OpenAI’s Whisper-Medium model on the open speech corpus Sberdevices Golos and concluded that both of our Wav2Vec2 variants recognize Russian speech better than the OpenAI solution. Moreover, if a typo correction module is added as an extra stage of processing the recognition results, the error rate of our models drops by three to four percentage points: for example, from 12.4% for plain Wav2Vec2-Large-Ru-Golos to 9% for the combination of Wav2Vec2-Large-Ru-Golos with the YaSpeller typo correction module. However, on the recordings of voice commands that colleagues from AIRI and MIPT collected under the robot’s actual operating conditions, the speech recognition error rises to 50% or even more, said Ivan Bondarenko.
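The percentages quoted here are word error rates (WER): the word-level edit distance between the reference transcript and the recognizer’s output, divided by the number of reference words. A minimal self-contained implementation (the example sentences are invented):

```python
# Minimal word error rate (WER) computation: word-level Levenshtein
# distance between reference and hypothesis, divided by the number of
# reference words. Assumes a non-empty reference.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For instance, dropping one word out of six (`wer("give me a glass of water", "give me glass of water")`) yields a WER of 1/6, or about 16.7%.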

The scientist attributed these errors to the extremely high level of acoustic noise and to the specifics of the microphone system installed on the robot. He noted that after fine-tuning Wav2Vec2-Large-Ru-Golos on just half an hour of annotated audio recordings of voice commands, the word error rate dropped to 20% without typo correction and to 11% with it. For comparison, the error rate of one person transcribing another person’s speech is believed to average about 25%. Thus, the ability of the speech recognition models developed by NSU scientists to be effectively fine-tuned for more specialized recognition tasks, even on small training samples, proved useful to their robotics colleagues.
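A typo correction stage of the kind described can be sketched as snapping each recognized word to the closest entry in a known command vocabulary by edit distance. This is a simplified stand-in for a real spellchecker such as YaSpeller; the vocabulary and distance threshold below are invented for the example.

```python
# Toy post-processing stage: snap each recognized word to the nearest
# command-vocabulary entry by character-level edit distance. A stand-in
# for a real spellchecker; VOCAB and max_dist are hypothetical.

VOCAB = {"robot", "stop", "forward", "left", "right", "turn"}

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (single-row DP)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (ca != cb))   # substitution
    return d[len(b)]

def correct(word: str, max_dist: int = 2) -> str:
    """Replace a word with the closest vocabulary entry, if close enough."""
    best = min(VOCAB, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_dist else word
```

For example, a recognizer output like "forwrd" would be corrected to "forward", while a word far from every vocabulary entry is left unchanged.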

— It is my deep conviction that research in the field of artificial intelligence should be open. Openness not only mitigates the problem of reproducibility in scientific experiments. It also plays an important social role, making the results of individual scientific groups a common asset. Openness ensures the relay of scientific knowledge, allows some scientists to continue where others left off, and thereby accelerates the advance of science. That is why we make the results of our research open, in the hope that they will be useful to colleagues from other scientific teams. And our hopes are coming true! — summed up Ivan Bondarenko.

Note; This information is raw content directly from the source of the information. This is exactly what the source states and does not reflect the position of MIL-OSI or its clients.

EDITOR’S NOTE: This article is a translation. Apologies should the grammar and/or sentence structure not be perfect.

MIL OSI News (multilanguage service)