Meta, formerly known as Facebook, has been at the forefront of artificial intelligence (AI) for more than a decade, using it to power its products and services, including News Feed, Facebook Ads, Messenger and virtual reality. As demand grows for more advanced and scalable AI, Meta recognizes the need for more innovative and efficient AI infrastructure.
At the recent AI Infra @ Scale event, a virtual conference organized by Meta’s engineering and infrastructure teams, the company made several announcements regarding new hardware and software projects aimed at supporting the next generation of AI applications. The event featured Meta speakers who shared their valuable insights and experiences in building and deploying large-scale AI systems.
One significant announcement was a new AI data center design optimized for both AI training and inference, the two primary stages of developing and running AI models. These data centers will use Meta’s own silicon, the Meta Training and Inference Accelerator (MTIA), a chip designed to accelerate AI workloads across domains including computer vision, natural language processing and recommendation systems.
Meta also unveiled the Research SuperCluster (RSC), an AI supercomputer that integrates 16,000 GPUs. The supercomputer has been used to train large language models (LLMs), including LLaMA, which Meta announced in February.
“We have been tirelessly building advanced AI infrastructure for years, and this ongoing work represents our commitment to enabling further advancements and more effective utilization of this technology across all aspects of our operations,” stated Meta CEO Mark Zuckerberg.
Meta’s investment in AI infrastructure reflects its long-term plan to apply AI more widely across its products and services. As demand for AI continues to grow, the company is positioning itself to keep pace with it.
Building AI infrastructure is table stakes in 2023
Meta is far from the only hyperscaler or large IT vendor thinking about purpose-built AI infrastructure. In November, Microsoft and Nvidia announced a partnership for an AI supercomputer in the cloud. The system (not surprisingly) uses Nvidia GPUs, connected with Nvidia’s Quantum-2 InfiniBand networking technology.
A few months later, in February, IBM outlined details of its AI supercomputer, codenamed Vela. IBM’s system uses x86 silicon alongside Nvidia GPUs and Ethernet-based networking. Each node in the Vela system is packed with eight 80GB A100 GPUs. IBM’s goal is to build out new foundation models that can help serve enterprise AI needs.
Not to be outdone, Google has also jumped into the AI supercomputer race with an announcement on May 10. The Google system uses Nvidia GPUs along with custom-designed infrastructure processing units (IPUs) to enable rapid data flow.
What Meta’s new AI inference accelerator brings to the table
Meta is now also jumping into the custom silicon space with its MTIA chip. Custom-built AI inference chips are not new. Google has been building out its tensor processing unit (TPU) for several years, and Amazon has had its own AWS Inferentia chips since 2018.
For Meta, the need for AI inference spans multiple aspects of its operations for its social media sites, including news feeds, ranking, content understanding and recommendations. In a video outlining the MTIA silicon, Meta research scientist for infrastructure Amin Firoozshahian commented that traditional CPUs are not designed to handle the inference demands from the applications that Meta runs. That’s why the company decided to build its own custom silicon.
“MTIA is a chip that is optimized for the workloads we care about and tailored specifically for those needs,” Firoozshahian said.
Meta is also a big user of the open source PyTorch machine learning (ML) framework, which it originally created. Since 2022, PyTorch has been under the governance of the Linux Foundation’s PyTorch Foundation effort. Part of the goal with MTIA is to have highly optimized silicon for running PyTorch workloads at Meta’s large scale.
The MTIA silicon is built on a 7nm (nanometer) process and can provide up to 102.4 TOPS (trillion operations per second) at INT8 precision. The MTIA is part of a highly integrated approach within Meta to optimize AI operations, including networking, data center optimization and power utilization.
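To put the 102.4 TOPS figure in perspective, the short Python sketch below turns a peak throughput number into a theoretical ceiling on inferences per second. Only the 102.4 TOPS value comes from Meta's stated specification. The operations-per-inference count, the utilization levels and the function name are hypothetical values chosen for illustration, not figures from Meta, and real-world throughput also depends on memory bandwidth, batching and software overhead.

```python
# Back-of-the-envelope sketch: peak TOPS -> upper bound on inferences per second.
# Only PEAK_TOPS is taken from the article; all other numbers are assumptions.

PEAK_TOPS = 102.4  # peak throughput quoted for MTIA, in trillions of ops per second


def max_inferences_per_second(peak_tops: float,
                              ops_per_inference: float,
                              utilization: float = 1.0) -> float:
    """Return the theoretical ceiling on inferences per second.

    peak_tops         -- peak throughput in trillions of operations per second
    ops_per_inference -- operations needed for one inference (hypothetical input)
    utilization       -- fraction of peak actually sustained, in (0, 1]
    """
    if ops_per_inference <= 0:
        raise ValueError("ops_per_inference must be positive")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    ops_per_second = peak_tops * 1e12 * utilization
    return ops_per_second / ops_per_inference


if __name__ == "__main__":
    hypothetical_ops = 2e9  # assumed: 2 billion operations per inference
    for util in (1.0, 0.5, 0.25):
        rate = max_inferences_per_second(PEAK_TOPS, hypothetical_ops, util)
        print(f"utilization {util:.0%}: about {rate:,.0f} inferences/s")
```

Under these assumed numbers, a model needing 2 billion operations per inference would top out at roughly 51,200 inferences per second at full utilization, and proportionally fewer as sustained utilization drops.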