Meta Unveils CM3leon: Advancing Text-to-Image Generation

Meta has unveiled CM3leon (pronounced “chameleon”), the latest project in its generative AI research. CM3leon is a multimodal foundation model that handles both text-to-image and image-to-text generation; the latter is useful for tasks such as automatically captioning images.

AI-generated images are not new: popular tools such as Stable Diffusion, DALL-E, and Midjourney are already available. What sets CM3leon apart is the technique Meta uses and the performance it claims.

Text-to-image generation typically relies on diffusion models, as in Stable Diffusion. CM3leon takes a different approach: it uses a token-based autoregressive model.

In their paper, “Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning,” Meta researchers explain that diffusion models have dominated recent image generation work because of their strong performance and relatively modest computational cost. Token-based autoregressive models, by contrast, offer better global image coherence but are more expensive to train and to use for inference.
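
To make the contrast concrete, here is a minimal Python sketch of what token-based autoregressive generation looks like in general. It is not CM3leon's code: the `toy_logits` function stands in for a trained transformer, and the vocabulary size, grid size, and function names are all illustrative assumptions. The point is only that an image is produced as a sequence of discrete tokens, one at a time, each conditioned on the text prompt and on every token generated so far.

```python
import math
import random

VOCAB_SIZE = 16  # hypothetical codebook size of an image tokenizer
GRID = 4         # toy image: a 4x4 grid of discrete image tokens


def toy_logits(prefix):
    """Stand-in for a transformer (assumption, not a real model).

    Scores every candidate next token given the prefix. This toy rule
    favours tokens close to the previous one, so each step depends on
    what has already been generated.
    """
    last = prefix[-1] if prefix else 0
    return [-abs(tok - last) for tok in range(VOCAB_SIZE)]


def sample(logits, rng, temperature=1.0):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate_image_tokens(text_tokens, seed=0):
    """Append GRID*GRID image tokens to the text prompt, one per step."""
    rng = random.Random(seed)
    sequence = list(text_tokens)
    for _ in range(GRID * GRID):
        # One model call per generated token.
        sequence.append(sample(toy_logits(sequence), rng))
    image_tokens = sequence[len(text_tokens):]
    # Reshape the flat token list into rows of the image grid.
    return [image_tokens[r * GRID:(r + 1) * GRID] for r in range(GRID)]


if __name__ == "__main__":
    for row in generate_image_tokens([3, 7, 1]):
        print(row)
```

Each image token requires its own pass through the model, which is one reason this family of models has been considered costly at inference time. In a real system, a separate image tokenizer would decode the resulting grid back into pixels.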

Surprisingly, CM3leon demonstrates that the token-based autoregressive approach can be more efficient than the diffusion approach. Meta researchers stated in a blog post, “CM3leon achieves state-of-the-art performance for text-to-image generation, despite being trained with five times less compute than previous transformer-based methods.”

Meta’s approach to image training with CM3leon is centered on ethical considerations. Instead of scraping images from the internet, which has raised legal concerns for diffusion-based models, Meta sourced licensed images from Shutterstock. This decision allows them to sidestep issues related to image ownership and attribution without compromising performance.

CM3leon is trained in two stages: retrieval-augmented pre-training, followed by supervised fine-tuning (SFT). The pre-training stage builds the model’s core capabilities, while SFT improves resource utilization and image quality. Meta compares its use of SFT to OpenAI’s use of the same approach in training ChatGPT, and emphasizes its effectiveness for generative tasks that require understanding complex prompts.
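
The article does not describe how retrieval works inside CM3leon, so the following is only a generic Python sketch of the idea behind retrieval augmentation: before a training example is shown to the model, similar items are looked up in an external memory and placed in front of it as extra context. The embedding vectors, the cosine-similarity scoring, the `<sep>` separator, and all function names here are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, memory, k=2):
    """Return the k memory entries most similar to the query vector."""
    ranked = sorted(memory, key=lambda doc: cosine(query_vec, doc["vec"]),
                    reverse=True)
    return ranked[:k]


def build_training_sequence(example, memory, k=2):
    """Prepend the retrieved neighbours to the example as extra context."""
    neighbours = retrieve(example["vec"], memory, k)
    context = [doc["text"] for doc in neighbours]
    return " <sep> ".join(context + [example["text"]])


# Hypothetical memory bank; the vectors are hand-made toy embeddings.
MEMORY = [
    {"text": "a red fox in snow", "vec": [0.9, 0.1, 0.0]},
    {"text": "a city skyline at night", "vec": [0.0, 0.2, 0.9]},
    {"text": "a fox cub sleeping", "vec": [0.8, 0.3, 0.1]},
]
example = {"text": "a fox running through a forest", "vec": [1.0, 0.2, 0.0]}

if __name__ == "__main__":
    print(build_training_sequence(example, MEMORY))
```

In a real system the vectors would come from a learned encoder and the retrieved items could include images as well as text; the toy version only shows where the retrieved context sits relative to the training example.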

The research paper highlights that instruction tuning significantly enhances multi-modal model performance across various tasks, such as image caption generation, visual question answering, text-based editing, and conditional image generation.

The sample images Meta shared in its blog post show CM3leon’s ability to understand complex, multi-stage prompts and produce high-resolution images.

As of now, CM3leon is a research project, and it remains unclear when or if Meta will make this technology available to the public on its platforms. However, considering its impressive capabilities and increased efficiency in image generation, it seems highly likely that CM3leon and its approach to generative AI will eventually extend beyond research.