Democratizing the hardware side of large language models

Image credit: 123RF

There’s growing concern that artificial intelligence—namely deep learning—is becoming centralized within a few very wealthy companies. This shift does not apply to all areas of AI, but it is certainly the case for large language models, deep learning systems composed of billions of parameters and trained on terabytes of text data.

Accordingly, there has been growing interest in democratizing LLMs and making them available to a broader audience. However, while there have been impressive initiatives in open-sourcing models, the hardware barriers of large language models have gone mostly unaddressed.

This is one of the problems that Cerebras, a startup that specializes in AI hardware, aims to solve with its Wafer Scale processor. In an interview with TechTalks, Cerebras CEO Andrew Feldman discussed the hardware challenges of LLMs and his company’s vision to reduce the costs and complexity of training and running large neural networks.

Large language models are hard to run

Image credit: 123RF

The industry’s growing interest in creating larger neural networks has made it more challenging for cash- and resource-constrained organizations to enter the field. Today, training and running LLMs at the scale of models such as GPT-3 and Gopher costs millions of dollars and requires huge amounts of compute resources.

Even running a trained model such as the open-source BLOOM or Facebook’s OPT-175B requires substantial investment in GPUs and specialized hardware.

But one aspect that is often less talked about is the technical difficulties of training and running very large deep learning models. Even if you can secure the millions of dollars required to train an LLM, you’ll still need expertise in parallel and distributed computing, which is very hard to come by.

“What’s hard about [training LLMs] isn’t the machine learning. It is the distributed compute necessitated by the GPU, the fact that it’s a small compute engine,” Feldman said. “You have to break up this big problem and spread it out over lots of little engines. That work, distributed parallel computation, is obscure and it’s rare and only a few organizations in the world are good at it.”

Distributing a model involves trade-offs along three dimensions: memory, compute, and communication. Finding the right mode of distribution and hardware configuration becomes extremely difficult as LLMs grow bigger and bigger. And there’s no one-size-fits-all approach that can be repeated across different ML models and hardware stacks.

“Once you do it for one group of a thousand GPUs or ten thousand GPUs, you have to do it again for a different cluster of GPUs. It’s a bespoke solution,” Feldman said.

Three ways to distribute ML models

If a neural network is small enough to fit on a single GPU, training it is straightforward: the GPU has enough memory and compute cores to hold the model’s parameters and perform the enormous number of matrix multiplications that training requires. This is the best-case scenario for training deep neural networks.
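To make this concrete, here is a minimal single-GPU training sketch in PyTorch. It assumes a CUDA device is available, and the model, data, and hyperparameters are stand-ins chosen for illustration: the point is simply that the weights, activations, and gradients all fit on one device.

```python
# Minimal single-GPU training sketch (illustrative; model and data are stand-ins,
# and a CUDA GPU is assumed to be available).
import torch
import torch.nn as nn

device = torch.device("cuda:0")

# A toy model small enough to live entirely on one GPU.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for step in range(100):
    x = torch.randn(32, 1024, device=device)   # stand-in batch
    y = torch.randn(32, 1024, device=device)   # stand-in targets
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)                # forward pass on the single device
    loss.backward()                            # backward pass: all gradients fit in GPU memory
    optimizer.step()
```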

To speed up training, you can use a “data parallel” configuration: add more GPUs, each holding a full copy of the neural network, train each copy on a different slice of the training data, and average the resulting gradients across copies to keep the weights in sync.
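A hedged sketch of what this looks like with PyTorch’s DistributedDataParallel follows. It assumes the script is launched with torchrun, one process per GPU; the script name, model, and data are placeholders for the example.

```python
# Hedged sketch of data parallelism with PyTorch DistributedDataParallel (DDP).
# Assumes a launch like `torchrun --nproc_per_node=<num_gpus> train_ddp.py`
# (script name is hypothetical); the model and data are placeholders.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                # one process per GPU
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda(rank)
model = DDP(model, device_ids=[rank])          # each rank holds a full copy of the weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    x = torch.randn(32, 1024, device=rank)     # in practice: this rank's shard of the data
    y = torch.randn(32, 1024, device=rank)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                            # DDP averages gradients across ranks here
    optimizer.step()

dist.destroy_process_group()
```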

Current GPUs can fit deep learning models with several hundred million parameters.

However, when the ML model grows to billions of parameters, training becomes much more complicated because the model won’t fit on a single GPU. In this case, engineers create a “pipelined model parallel” architecture, in which the model is distributed across several GPUs. The ML engineer must work out how many GPUs are needed, how to split the model, and which layers to place on each processor, and then write the code that pushes the training computation through the GPU cluster stage by stage. In this configuration, memory and IO bandwidth become bottlenecks.
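As a rough illustration, here is a minimal two-GPU version of this idea in plain PyTorch. The layer sizes and device assignments are assumptions, and real pipelined systems also slice each batch into micro-batches so that all stages stay busy; this stripped-down sketch only shows the sequential hand-off Feldman describes.

```python
# Hedged sketch of splitting a model's layers across two GPUs (naive model parallelism).
# Real pipelined model parallelism also uses micro-batching; this version only shows
# the sequential hand-off of activations from one device to the next.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))        # first layers run on GPU 0
        return self.stage1(h.to("cuda:1"))     # activations cross the interconnect to GPU 1

model = TwoStageModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024)
y = torch.randn(32, 1024, device="cuda:1")
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                                # gradients flow back across the same device boundary
optimizer.step()
```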

“The problem is that your communication got long and is now hard. You have to run the process sequentially,” Feldman said. “You must complete the first layer before you move on to the next. The processors are connected by a switch. You have to figure out the latency. This is a real problem if you’re doing it for a hundred-layer network.”

Andrew Feldman Cerebras CEO
Andrew Feldman, CEO at Cerebras

Things get even more complicated for very large neural networks such as current state-of-the-art LLMs. In some cases, individual layers of the model become so large that they won’t fit on a single GPU. You must then split the layer itself and spread it across two or more GPUs, known as the “tensor model parallel” architecture. Like pipelined model parallel, it requires manual coding and configuration, it is a bespoke solution, and it is extremely difficult to get right.
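The sketch below illustrates the core idea in plain PyTorch: a single oversized linear layer whose weight matrix is split column-wise across two GPUs, with the partial outputs gathered afterwards. Production systems such as Megatron-LM do this with collective communication across processes; the sizes and devices here are assumptions for illustration only.

```python
# Hedged sketch of tensor (intra-layer) model parallelism: one large linear layer's
# weight matrix is split column-wise across two GPUs and the partial outputs are
# concatenated. Single-process, two-device toy version; not a production implementation.
import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features=1024, out_features=8192):
        super().__init__()
        # Each GPU holds half of the output columns of the weight matrix.
        self.shard0 = nn.Linear(in_features, out_features // 2).to("cuda:0")
        self.shard1 = nn.Linear(in_features, out_features // 2).to("cuda:1")

    def forward(self, x):
        y0 = self.shard0(x.to("cuda:0"))                 # partial output on GPU 0
        y1 = self.shard1(x.to("cuda:1"))                 # partial output on GPU 1
        return torch.cat([y0, y1.to("cuda:0")], dim=-1)  # gather the pieces (the communication cost)

layer = ColumnParallelLinear()
out = layer(torch.randn(32, 1024))   # full 8192-wide output; no single GPU held the whole layer
```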

To optimize the process of training large language models, researchers have to use all three types of parallel computing techniques at the same time. Recently, Meta AI released the timeline and logs for its OPT-175B model, which was trained on 992 A100 GPUs. The details show some of the trial and error, continuous tweaking, and failures that engineers face when they train large language models on huge clusters of GPUs.

“By the time you get to these big models, 20 billion parameters, you are using every tool in your toolchest,” Feldman said. “And by the time you get to spreading these models to over 2,000 or 3,000 GPUs, imagine the complexity of having to measure the latency between processors, to think about how to break up every layer, spread it over six or eight or 20 or 50 processors. This is brutal. This is the fundamental challenge that is faced by putting a big problem like a large neural network on a cluster of small processors.”
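As a back-of-the-envelope illustration of how these degrees of parallelism multiply together, consider the arithmetic below. The split is hypothetical, chosen only to show the bookkeeping, and is not Meta’s actual OPT-175B configuration.

```python
# Hypothetical decomposition of a GPU cluster into the three parallelism degrees.
# These numbers are illustrative, not Meta's actual OPT-175B setup.
tensor_parallel   = 8    # ways each layer is split across GPUs
pipeline_parallel = 12   # groups of layers (pipeline stages)
data_parallel     = 10   # full model replicas trained on different data shards

gpus_needed = tensor_parallel * pipeline_parallel * data_parallel
print(gpus_needed)  # 960 -- in the same ballpark as the 992 A100s used for OPT-175B
```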

What makes things worse is the brittleness of the system. As soon as you make a change to the architecture of your neural network, you’ll have to redesign and reconfigure large parts of your hardware architecture.

The single-processor solution

Cerebras Wafer Scale Engine (source: Cerebras)

All the challenges of parallel and distributed computing can be overcome if the model fits on a single processor. This is what Cerebras aims to do with its Wafer Scale Engine 2 (WSE-2) processor and its CS-2 compute cluster.

The WSE-2 is a super-large processor (46,225 square millimeters, or about 462 square centimeters) that has been specially designed for AI. It has 850,000 programmable cores for tensor operations, the computations that underlie deep neural networks. The processor comes with 40 gigabytes of on-chip memory, with memory bandwidth roughly 1,000 times that of traditional GPUs.

One important feature of WSE-2 is its Weight Streaming architecture, which disaggregates memory from compute. Weight Streaming allows neural network weight memory to be scaled independently of the computation cores. This allows a single CS-2 system to run arbitrarily large models, up to 100 trillion parameters, according to Feldman.

“When you buy a GPU, on the back of the interposer is the memory. You can’t buy more memory for that GPU. If you want more memory, you have to buy another GPU. If you want another GPU, the memory comes with it. By disaggregating it, we can support enormous parameter networks on a single system,” Feldman said.
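To get an intuition for the idea, and only the idea, here is a toy sketch in plain PyTorch in which the weights live in host memory and are streamed to the compute device one layer at a time. This is not Cerebras’ actual Weight Streaming implementation or API; it merely illustrates why model size can be bounded by the external memory pool rather than by on-device memory.

```python
# Conceptual toy sketch of "weight streaming": parameters live off-device and are
# moved to the compute engine one layer at a time. Illustrative only; this is NOT
# Cerebras' implementation or API.
import torch
import torch.nn as nn

# Weights are kept in (abundant, cheaper) host memory...
layers_on_host = [nn.Linear(1024, 1024) for _ in range(48)]

def forward_streaming(x):
    x = x.to("cuda:0")
    for layer in layers_on_host:
        layer = layer.to("cuda:0")   # stream this layer's weights to the compute device
        x = torch.relu(layer(x))
        layer.to("cpu")              # free on-device memory before the next layer arrives
    return x

out = forward_streaming(torch.randn(32, 1024))
```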

In a video posted online, Cerebras engineers show LLMs with 1.5, 6, and 20 billion parameters being trained without any extra hardware configuration.

The computation cost of training large neural networks on WSE-2 is about a fifth of what it would be on GPUs, according to Feldman. And this does not account for the cost of hiring distributed and parallel computing talent. The WSE-2 also consumes much less electricity than GPUs, which matters given the environmental concerns raised by the energy costs of training neural networks.

What does this mean for applied AI and academic research?

Feldman hopes that by removing the technical barriers, Cerebras will make it possible for smaller organizations that don’t have access to distributed computing talent to do research on large neural networks. The company has already collaborated with several institutions on scientific projects, including epigenomics, cancer research, and fluid dynamics modeling.

In addition, the WSE can be very useful in the applied machine learning sector. Many organizations are interested in putting advances in LLMs to work in real-world applications. But in commercial settings, the cost-efficiency of the ML model is a critical factor.

One of the popular ways to reduce the costs of LLMs is to train or finetune a smaller model for a specific application. For example, instead of using a 175-billion-parameter model, a company can take a pre-trained model that is significantly smaller (e.g., 6 billion parameters) and perform a few epochs of extra training on application-specific examples. Experiments show that smaller finetuned models can outperform much larger general-purpose models on application-specific tasks. And they cost a lot less to run.
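As a rough sketch of this route, here is what finetuning a small pretrained checkpoint might look like with Hugging Face Transformers. The checkpoint, dataset, and hyperparameters below are placeholders chosen so the example is self-contained; a ~6-billion-parameter model would follow the same pattern with more hardware.

```python
# Hedged sketch of finetuning a smaller pretrained model with Hugging Face Transformers.
# Checkpoint, dataset, and hyperparameters are placeholders for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder small checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token                # gpt2 has no pad token by default

# A tiny public dataset stands in for your application-specific examples.
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
data = data.map(lambda b: tokenizer(b["text"], truncation=True,
                                    padding="max_length", max_length=128), batched=True)
# For causal LM finetuning, labels are the input tokens (padding would normally be masked).
data = data.map(lambda b: {"labels": b["input_ids"]}, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()   # a few epochs on domain data, instead of training a 175B model from scratch
```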

However, finetuning an ML model with several billion parameters still involves the same distributed computing challenges as training the largest LLMs.

“Most organizations can’t afford either the amount of time on a thousand GPUs or they don’t have the ability to do the distributed compute. Even finetuning requires distributed computing expertise,” Feldman said. “A system like ours makes that finetuning and training from scratch into a sort of push button. You don’t need any of the distributed compute expertise. That’s how we bring to a much broader audience the benefits of these very large models. Otherwise, they’re going to stay in the domain of OpenAI and Azure, DeepMind and Google, a very small group of companies.”

Democratizing large-scale deep learning

Feldman draws a parallel between the history of deep learning and electrical engineering.

“Thirty or forty years ago, we used to do research in universities in electrical engineering in circuit and chip design,” he said.

But producing chips became so expensive that you couldn’t do it at universities.

“The only place you could do chip design and circuit design was a very small number of big processor companies. That was really bad for the industry and stunted the industry’s growth and innovation,” Feldman said. “And large language models begin exactly where hardware engineering, circuit design, ended. It begins with this research being done in a very small number of very well-funded public companies. They’re the only ones pushing the boundaries. And even rich universities like Stanford and MIT are cut out. What’s the impact on state schools and other institutions? It’s really hard. And those are real fundamental problems.”

Democratizing large language models—and large neural networks in general—will ultimately depend on breaking the financial and technical barriers.

“The way you get more democracy with the technology, the way you get more users, the way you get it out of a very narrow group of super-large technology companies is to drive down the costs and make it simpler to use,” Feldman said. “If you need to have expertise in distributed computing across thousands of GPUs using data parallel, pipelined model parallel, and tensor model parallel, I can assure you it’s going to stay in a very select group of organizations. If instead, you can stay strictly in the data parallel group, most organizations, universities, national laboratories, large enterprises, all of them have access to the technologies.”
