22 August 2024

We are bombarded with a multitude of impressions in our everyday lives, and it can be difficult to keep track of them all. Every impression must not only be perceived but also interpreted, which in turn opens up a wide range of possible actions. This is where the LUMINOUS (Language Augmentation for Humanverse) system, developed at the German Research Centre for Artificial Intelligence (DFKI), comes in. The technology collects these countless impressions, interprets them and can suggest an appropriate action using generative, multimodal large language models (MLLMs).

Didier Stricker, head of the Augmented Reality research department at DFKI, explains: ‘The technology we have developed makes virtual worlds more intelligent. Intuitive interaction with the system via text, and the automatic generation of complex behaviours and processes through generative AI and so-called multimodal large language models, enable us not only to experience these worlds but also to test them. To achieve this, we at LUMINOUS are working in parallel on several approaches, such as automatic code generation, the rapid integration of new data and other solutions.’
System observes, interprets – and makes recommendations for action
In the new LUMINOUS project, DFKI is working on next-generation extended reality (XR) systems. MLLMs will join the existing technical augmentations of our visually perceived reality, such as text overlays, animations and superimposed virtual objects, and redefine how we interact with XR technology.
Muhammad Zeshan Afzal, a researcher from the Augmented Reality department at the German Research Centre for Artificial Intelligence (DFKI), uses a scenario to explain what this can look like in practice:
‘A fire starts in a room. In this case, our system first determines where the person – who is equipped with our technology – is currently located. It then collects relevant data from its immediate surroundings, such as the presence of a fire extinguisher or an emergency exit, in order to pass this on to the generative and multimodal language model. This then determines a suitable recommendation for action, such as initiating the extinguishing process using a fire extinguisher, closing windows or getting to safety.’
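The Python sketch below illustrates such a pipeline under stated assumptions: the scene objects, the prompt format and the `query_mllm` stub are hypothetical placeholders rather than the project's actual interfaces.

```python
from dataclasses import dataclass

# Hypothetical scene observation produced by the XR headset's perception stack.
@dataclass
class SceneObject:
    label: str         # e.g. "fire extinguisher"
    distance_m: float  # estimated distance from the user in metres

def build_prompt(location: str, hazard: str, objects: list[SceneObject]) -> str:
    """Turn tracked scene context into a text prompt for a multimodal LLM."""
    inventory = ", ".join(f"{o.label} ({o.distance_m:.1f} m away)" for o in objects)
    return (
        f"The user is in the {location}. Detected hazard: {hazard}. "
        f"Nearby objects: {inventory}. "
        "Recommend one safe, concrete action."
    )

def query_mllm(prompt: str) -> str:
    """Placeholder for a call to a generative multimodal language model."""
    # In a real system this would send the prompt (plus camera frames) to an
    # MLLM endpoint; here we return a canned answer purely for illustration.
    return "Take the fire extinguisher 2.0 m to your left and aim at the base of the fire."

if __name__ == "__main__":
    scene = [SceneObject("fire extinguisher", 2.0), SceneObject("emergency exit", 8.5)]
    prompt = build_prompt("kitchen", "small fire near the stove", scene)
    print(query_mllm(prompt))
```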
Learning from descriptions creates flexibility
Until now, research and development efforts have largely been limited to the spatial tracking of users and their environment. The result: very specific, limited and non-generalisable representations, along with predefined graphic visualisations and animations. ‘Language Augmentation for Humanverse’ aims to change this.
To achieve this, the researchers at DFKI are developing a language-enabled platform that adapts to individual, non-predefined user needs and to previously unknown XR environments. The adaptable concept builds on zero-shot learning (ZSL), a technique in which an AI model learns to recognise and categorise objects and scenarios without having seen exemplary reference material for them in advance. Once implemented, LUMINOUS will use its database of image descriptions to build up a flexible image and text vocabulary that makes it possible to recognise even unknown objects or scenes in images and videos.
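A minimal sketch of this zero-shot idea, using the publicly available CLIP model via the Hugging Face `transformers` pipeline; the model choice, image path and candidate labels are illustrative assumptions, not the project's actual setup.

```python
from transformers import pipeline

# Zero-shot image classification: the model matches an image against free-text
# labels it was never explicitly trained on, by comparing image and text embeddings.
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# Candidate labels can be extended at runtime; no retraining is needed.
candidate_labels = ["fire extinguisher", "emergency exit sign", "office chair", "window"]

results = classifier("scene_snapshot.jpg", candidate_labels=candidate_labels)
for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
```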
‘We are currently investigating possible applications for the everyday care of sick people, implementation of training programmes, performance monitoring and motivation,’ says Zeshan Afzal.
Acting as a kind of translator, the LUMINOUS language model should be able to describe everyday activities on command and relay them to users via a voice interface or an avatar. The visual assistance and recommended actions provided in this way then support everyday activities in real time.
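As a rough illustration of the voice-interface side, the snippet below reads a generated instruction aloud with the off-the-shelf `pyttsx3` library; this is only one possible back end and not necessarily what LUMINOUS uses.

```python
import pyttsx3  # offline text-to-speech; one of many possible voice back ends

def speak(instruction: str) -> None:
    """Read a generated instruction aloud so the user can keep their hands free."""
    engine = pyttsx3.init()
    engine.say(instruction)
    engine.runAndWait()

speak("Next, place the kettle on the stove and turn the dial to medium heat.")
```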
LUMINOUS in practice
The results of the project will be tested in three pilot projects focussing on neurorehabilitation (support for stroke patients with speech disorders), immersive safety training in the workplace and the testing of 3D architectural designs.
In the neurorehabilitation of stroke patients with severe communication deficits (aphasia), realistic virtual characters (avatars) support the initiation of conversation through image-guided models. These models are based on natural language and generalise to other activities of daily life. Objects in the scene (including people) are recognised in real time using eye tracking and object-recognition algorithms.
Patients can then ask the avatar or MLLM to articulate either the name of the object, the whole word to be produced, the first phoneme or the first speech sound.
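A toy sketch of such a cueing helper is shown below; the cue levels, the phoneme table and the function name are illustrative assumptions, and a real system would draw on a proper pronunciation lexicon and the avatar's speech output.

```python
# Illustrative cueing helper for the aphasia scenario: given the object the
# patient is looking at (from eye tracking + object recognition), produce a
# cue at the requested level. The phoneme table is a stand-in for a real
# pronunciation dictionary.
PHONEMES = {"cup": ["k", "ʌ", "p"], "keys": ["k", "iː", "z"]}  # assumed entries

def make_cue(object_name: str, level: str) -> str:
    if level == "whole word":
        return f"The word is '{object_name}'."
    if level == "first phoneme":
        phones = PHONEMES.get(object_name.lower())
        return f"It starts with the sound '{phones[0]}'." if phones else "No phoneme entry available."
    if level == "first letter":
        return f"It starts with the letter '{object_name[0].upper()}'."
    return f"You are looking at a {object_name}."  # default: name the object

print(make_cue("cup", "first phoneme"))
```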
To use the speech models in each patient’s unique environment, patients undergo personalised and intensive XR-based training. The LUMINOUS project captures the movements and style of the human trainer with a minimal number of sensors in order to model and instantiate three-dimensional avatars. The aim is to rely exclusively on kinematic information derived from the headset’s input, namely the positions of the head and the hands during training, as sketched below.
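A toy sketch of deriving a rough upper-body pose from nothing but head and hand positions, assuming fixed body proportions; the project itself would more likely use learned motion models or full inverse kinematics rather than these hand-tuned offsets.

```python
import numpy as np

# Toy sketch: derive a rough upper-body pose for an avatar from the only
# kinematic inputs a headset provides, namely head and hand positions.
def estimate_upper_body(head: np.ndarray, left_hand: np.ndarray, right_hand: np.ndarray) -> dict:
    neck = head - np.array([0.0, 0.12, 0.0])    # assume neck ~12 cm below head
    chest = neck - np.array([0.0, 0.20, 0.0])   # chest ~20 cm below neck
    left_shoulder = chest + np.array([-0.18, 0.05, 0.0])
    right_shoulder = chest + np.array([0.18, 0.05, 0.0])
    # Place elbows halfway between shoulder and hand, dropped slightly downward.
    left_elbow = (left_shoulder + left_hand) / 2 - np.array([0.0, 0.05, 0.0])
    right_elbow = (right_shoulder + right_hand) / 2 - np.array([0.0, 0.05, 0.0])
    return {
        "neck": neck, "chest": chest,
        "left_shoulder": left_shoulder, "right_shoulder": right_shoulder,
        "left_elbow": left_elbow, "right_elbow": right_elbow,
    }

pose = estimate_upper_body(
    head=np.array([0.0, 1.70, 0.0]),
    left_hand=np.array([-0.35, 1.10, 0.30]),
    right_hand=np.array([0.35, 1.15, 0.30]),
)
print({k: v.round(2).tolist() for k, v in pose.items()})
```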
Future users of these new XR systems will be able to interact seamlessly with their environment using language models while having access to constantly updated global and domain-specific knowledge sources.
In this way, new XR technologies can be used in the future for distance learning and training, entertainment or healthcare services, for example. While providing assistance, LUMINOUS keeps learning and expanding its knowledge beyond its training data: when names and text descriptions are supplied to the LLM, it can in turn generate names for unknown objects in images, because recognised image features are linked to the corresponding text descriptions.
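One way to picture this image-to-text linking, sketched here with CLIP embeddings; the model, the description vocabulary and the image path are placeholder assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Sketch of the image-text linking idea: embed a small vocabulary of text
# descriptions and an image of an unseen object into the same space, then
# name the object via the closest description.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

descriptions = [
    "a red fire extinguisher mounted on a wall",
    "a green emergency exit sign above a door",
    "a wheeled office chair",
]

image = Image.open("unknown_object.jpg")  # placeholder image path
inputs = processor(text=descriptions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; the highest score wins.
best = outputs.logits_per_image.softmax(dim=1).argmax().item()
print("Closest description:", descriptions[best])
```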