MIT Researchers Introduce MAGE Framework for Enhanced Image Recognition and Generation


In a significant breakthrough, MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has unveiled a framework capable of handling both image recognition and image generation tasks with remarkable accuracy. Known as the Masked Generative Encoder (MAGE), this unified computer vision system offers a wide range of applications and eliminates the need to train two separate systems for image identification and generation.

The introduction of MAGE comes at a time when enterprises are increasingly embracing AI, particularly generative technologies, to enhance their workflows. However, the MIT researchers acknowledge that the system still has certain imperfections and requires further refinement in the coming months to ensure its widespread adoption.

So, how does MAGE work?

Today, building image generation and recognition systems largely revolves around two processes: state-of-the-art generative modeling and self-supervised representation learning. In the former, the system learns to produce high-dimensional data from low-dimensional inputs such as class labels, text embeddings or random noise. In the latter, a high-dimensional image is used as an input to create a low-dimensional embedding for feature detection or classification. 

These two techniques, currently used independently of each other, both require a visual and semantic understanding of data. So the team at MIT decided to bring them together in a unified architecture. MAGE is the result. 

To develop the system, the group used a pre-training approach called masked token modeling. They converted sections of image data into abstracted versions represented by semantic tokens. Each of these tokens represented a 16×16-pixel patch of the original image, acting like mini jigsaw puzzle pieces.
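To make the jigsaw analogy concrete, here is a minimal sketch of the grid arithmetic only. It assumes a 256×256 input image, a resolution the article does not state, and the helper name `patch_grid` is hypothetical. The real system derives each token with a learned tokenizer rather than by simply cutting the image into squares.

```python
def patch_grid(height, width, patch=16):
    """Return the (top, left) pixel corner of every patch, in row-major order."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [
        (top, left)
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


# Assumed 256x256 image: 16 patches per side, one token position per patch.
positions = patch_grid(256, 256)
print(len(positions))  # 256
```

Under that assumption, an image becomes a sequence of 256 token positions, which is the unit the masking step operates on.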

Once the tokens were ready, some of them were randomly masked and a neural network was trained to predict the hidden ones by gathering the context from the surrounding tokens. That way, the system learned to understand the patterns in an image (image recognition) as well as generate new ones (image generation).
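The masking step can be sketched roughly as follows. This is an illustration, not the researchers' code: the `MASK` sentinel, the function name `mask_tokens` and the example token ids are all assumptions, and the neural network that predicts the hidden tokens is left out.

```python
import random

MASK = -1  # hypothetical sentinel id standing in for a hidden token


def mask_tokens(tokens, ratio, rng):
    """Hide a random fraction of tokens and return (masked sequence, targets).

    `targets` maps each hidden position to the original token the model
    would be trained to predict from the visible context.
    """
    n_mask = round(len(tokens) * ratio)
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in hidden}
    return masked, targets


# Made-up token ids for a tiny 16-token "image".
tokens = list(range(100, 116))
masked, targets = mask_tokens(tokens, ratio=0.5, rng=random.Random(0))
print(masked.count(MASK))  # 8
```

With `ratio=1.0` every token is hidden and the model must produce an image from nothing; with `ratio=0.0` nothing is hidden and the model simply encodes what it sees.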

“Our key insight in this work is that generation is viewed as ‘reconstructing’ images that are 100% masked, while representation learning is viewed as ‘encoding’ images that are 0% masked,” the researchers wrote in a paper detailing the system. “The model is trained to reconstruct over a wide range of masking ratios covering high masking ratios that enable generation capabilities, and lower masking ratios that enable representation learning. This simple but very effective approach allows a smooth combination of generative training and representation learning in the same framework: same architecture, training scheme, and loss function.”

In addition to producing images from scratch, the system supports conditional image generation, where users can specify criteria for the images and the tool will cook up the appropriate image.

“The user can input a whole image and the system can understand and recognize the image, outputting the class of the image,” Tianhong Li, one of the researchers behind the system, told VentureBeat. “In other scenarios, the user can input an image with partial crops, and the system can recover the cropped image. They can also ask the system to generate a random image or generate an image given a certain class, such as a fish or dog.”

Potential for many applications

When pre-trained on data from the ImageNet image database, which consists of 1.3 million images, the model obtained a Fréchet inception distance score (a measure of generated image quality, where lower is better) of 9.1, outperforming previous models. For recognition, it achieved an 80.9% accuracy rating in linear probing and a 71.9% 10-shot accuracy rating when it had only 10 labeled examples from each class.
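For readers unfamiliar with the metric, the Fréchet inception distance is usually defined as below. The symbols are not from the article: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception-network features computed on real and generated images, respectively.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Identical feature statistics give a score of zero, so a lower number means the generated images are statistically closer to real ones.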

“Our method can naturally scale up to any unlabeled image dataset,” Li said, noting that the model’s image understanding capabilities can be beneficial in scenarios where limited labeled data is available, such as in niche industries or emerging technologies.

Similarly, he said, the generation side of the model can help in industries like photo editing, visual effects and post-production with its ability to remove elements from an image while maintaining a realistic appearance, or, given a specific class, replace an element with another generated element.

“It has [long] been a dream to achieve image generation and image recognition in one single system. MAGE is a [result of] groundbreaking research which successfully harnesses the synergy of these two tasks and achieves the state of the art of them in one single system,” said Huisheng Wang, senior software engineer for research and machine intelligence at Google, who participated in the MAGE project.

“This innovative system has wide-ranging applications, and has the potential to inspire many future works in the field of computer vision,” he added.

More work needed

Moving ahead, the team plans to streamline the MAGE system, especially the token conversion part of the process. Currently, when the image data is converted into tokens, some of the information is lost. Li and his team plan to address that by exploring other compression methods.

Beyond this, Li said they also plan to scale up MAGE on real-world, large-scale unlabeled image datasets, and to apply it to multi-modality tasks, such as image-to-text and text-to-image generation.