ImageBind a new AI Model that combine different senses just like people do
ImageBind a new AI Model that combine different senses just like people do. It unnderstand images, video,audio,depth, thermal and spatial movement
When humans absorb information from the world, we innately use multiple senses, such as seeing a busy street and hearing the sounds of car engines. Today, we’re introducing an approach that brings machines one step closer to humans’ ability to learn simultaneously, holistically, and directly from many different forms of information — without the need for explicit supervision (the process of organizing and labeling raw data). We have built and are open-sourcing Imagebind the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units, which calculate motion and position. ImageBind equips machines with a holistic understanding that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move.
ImageBind can outperform prior specialist models trained individually for one particular modality, as described in our paper But most important, it helps advance AI by enabling machines to better analyze many different forms of information together. For example, using ImageBind, meta make a sense could create images from audio, such as creating an image based on the sounds of a rain forest or a bustling market. Other future possibilities include more accurate ways to recognize, connect, and moderate content, and to boost creative design, such as generating richer media more seamlessly and creating wider multimodal search functions.
ImageBind is part of Meta’s efforts to create a multimodal AI system that learn from all possible types of data around them. As the number of modalities increases, ImageBind opens the floodgates for researchers to try to develop new, holistic systems, such as combining 3D and IMU sensors to design or experience immersive, virtual worlds. ImageBind could also provide a rich way to explore memories — searching for pictures, videos, audio files or text messages using a combination of text, audio, and image.
In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each respective modality. ImageBind shows that it’s possible to create a joint embedding space across multiple modalities without needing to train on data with every different combination of modalities. This is important because it’s not feasible for researchers to create datasets with samples that contain, for example, audio data and thermal data from a busy city street, or depth data and a text description of a seaside cliff.