A popular approach to making AI more efficient has its drawbacks


One of the most widely used techniques to make AI models more efficient, quantization, has limitations, and the industry may be quickly approaching them.

In the context of artificial intelligence, quantization refers to reducing the number of bits (the smallest units a computer can process) needed to represent information. Consider this analogy: when someone asks you what time it is, you'd probably say "noon" rather than "oh twelve hundred, one second, and four milliseconds." That's quantizing. Both answers are correct, but one is slightly more precise; how much precision you actually need depends on the context.

AI models consist of several components that can be quantized, in particular parameters: the internal variables models use to make predictions or decisions. This is convenient, considering that models perform millions of calculations when they run. Quantized models with fewer bits representing their parameters are less demanding mathematically, and therefore computationally. (To be clear, this is a different process from "distillation," which is a more involved and selective pruning of parameters.)
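For readers who want to see what that looks like concretely, here is a minimal sketch of one common scheme, symmetric per-tensor int8 quantization, written in NumPy. It is an illustration of the general idea rather than the exact method any particular lab uses; the random array simply stands in for one layer's parameters.

```python
# Minimal sketch: symmetric per-tensor int8 quantization of model parameters.
# Illustrative only; real quantization schemes are more sophisticated.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 values plus a single scale factor."""
    scale = np.abs(weights).max() / 127.0          # largest weight maps to +/-127
    q = np.round(weights / scale).astype(np.int8)  # 8 bits per parameter
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values for use at inference time."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1024).astype(np.float32)  # stand-in for one layer
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)

print("bytes before:", weights.nbytes)   # 4096 (32 bits per value)
print("bytes after: ", q.nbytes)         # 1024 (8 bits per value)
print("mean absolute error:", np.abs(weights - approx).mean())
```

The storage drops by a factor of four; the price is the small rounding error printed on the last line.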

But quantization may have more trade-offs than previously assumed.

The ever-shrinking model

According to a study from researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon, quantized models perform worse if the original, unquantized version of the model was trained over a long period on a lot of data. In other words, at a certain point, it may actually be better to train a smaller model outright than to shrink down a large one.

That could spell bad news for AI companies that train extremely large models (known to improve the quality of answers) and then quantize them in an effort to make them less expensive to serve.

The effects are already starting to appear. A few months ago, developers and academics reported that quantizing Meta's Llama 3 model tended to be "more harmful" compared to other models, perhaps because of the way it was trained.

"In my opinion, the number one cost for everyone in AI is and will remain inference, and our work shows that one important way to reduce it will not work forever," Tanishq Kumar, a Harvard mathematics student and first author of the paper, told TechCrunch.

Contrary to popular belief, AI model inference (running a model, such as when ChatGPT answers a question) is often more expensive in aggregate than training the model. Consider, for example, that Google spent an estimated $191 million to train one of its flagship Gemini models, certainly a steep sum. But if the company were to use a model to generate 50-word answers for just half of all Google Search queries, it would spend roughly $6 billion annually.
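To get a feel for how that kind of estimate comes together, here is a rough back-of-the-envelope sketch. Every input below (queries per day, tokens per answer, serving cost per million tokens) is a hypothetical placeholder rather than the figures behind the estimate above; the point is only how quickly per-query costs compound at search scale.

```python
# Back-of-the-envelope sketch of how inference costs compound at search scale.
# All inputs are hypothetical placeholders, not the figures behind the
# estimate quoted above.
QUERIES_PER_DAY = 8.5e9          # assumed total daily search queries
SHARE_ANSWERED = 0.5             # answer half of them with a model
TOKENS_PER_ANSWER = 67           # ~50 words at roughly 0.75 words per token
COST_PER_MILLION_TOKENS = 10.0   # assumed serving cost, in dollars

daily_tokens = QUERIES_PER_DAY * SHARE_ANSWERED * TOKENS_PER_ANSWER
annual_cost = daily_tokens / 1e6 * COST_PER_MILLION_TOKENS * 365
print(f"~${annual_cost / 1e9:.1f}B per year")  # a ballpark in the billions
```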

Major AI labs have embraced training models on huge datasets on the assumption that "scaling up" (increasing the amount of data and compute used in training) will lead to increasingly capable AI.

For example, Meta trained Llama 3 on a set of 15 trillion tokens. (Tokens represent chunks of raw data; 1 million tokens is equal to about 750,000 words.) The previous generation, Llama 2, was trained on "only" 2 trillion tokens.

Evidence suggests that scaling up eventually yields diminishing returns; Anthropic and Google reportedly trained enormous models recently that fell short of internal benchmark expectations. But there is little sign that the industry is ready to move meaningfully away from these entrenched scaling approaches.

How precise, exactly?

So, if labs are reluctant to train models on smaller datasets, is there a way to make models less susceptible to degradation? Possibly. Kumar says that he and his co-authors found that training models in "low precision" can make them more robust. Bear with us for a moment as we dive in a bit.

"Precision" here refers to the number of digits a numeric data type can represent accurately. Data types are collections of data values, usually specified by a set of possible values and allowed operations; the FP8 data type, for example, uses only 8 bits to represent a floating-point number.
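A quick way to see what precision means in concrete terms is to ask NumPy how many bits and decimal digits each of its floating-point types carries. (Base NumPy does not ship an FP8 type, so the sketch below stops at 16 bits.)

```python
# How many bits, and roughly how many decimal digits, each floating-point
# type can represent. Base NumPy has no FP8 type, so only 64/32/16-bit
# formats are shown.
import numpy as np

for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{dtype.__name__:>8}: {info.bits:2d} bits, "
          f"~{info.precision} decimal digits, max value {info.max:.3g}")
```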

Most models today are trained at 16-bit or "half" precision and "post-train quantized" to 8-bit precision. Certain model components (for example, its parameters) are converted to a lower-precision format at the cost of some accuracy. Think of it like doing the math to several decimal places and then rounding off to the nearest tenth, often giving you the best of both worlds.
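Here is a small sketch of that trade-off: a layer's weights stored in half precision, then rounded onto an 8-bit grid, with the resulting change in the layer's output measured. The rounding scheme is simplified for illustration and is not any vendor's specific format.

```python
# Sketch of post-training quantization: store weights in half precision, then
# round them to an 8-bit grid and measure how much a layer's output shifts.
# Simplified for illustration; not any vendor's actual format.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256)).astype(np.float16)    # activations
w = rng.standard_normal((256, 256)).astype(np.float16)  # "trained" weights

scale = float(np.abs(w).max()) / 127.0
w_q8 = np.round(w.astype(np.float32) / scale) * scale   # snap to 8-bit grid

y_half = x.astype(np.float32) @ w.astype(np.float32)    # half-precision weights
y_q8 = x.astype(np.float32) @ w_q8                      # quantized weights

rel_err = np.abs(y_half - y_q8).mean() / np.abs(y_half).mean()
print(f"mean relative output error after 8-bit rounding: {rel_err:.3%}")
```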

Hardware vendors like Nvidia are pushing for lower precision for quantized model inference. The company's new Blackwell chip supports 4-bit precision, specifically a data type called FP4; Nvidia has pitched this as a boon for memory- and power-constrained data centers.

But extremely low quantization precision may not be desirable. According to Kumar, unless the original model is incredibly large in terms of its parameter count, precisions lower than 7 or 8 bits may bring a noticeable drop in quality.
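A toy experiment makes the point. Uniformly quantizing a Gaussian weight tensor at progressively smaller bit widths shows the rounding error climbing sharply once the budget drops below a handful of bits; this is only an illustration of the arithmetic involved, not a reproduction of the paper's quality measurements.

```python
# Toy demo: uniform quantization error on a Gaussian weight tensor grows
# sharply as the bit budget shrinks. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000).astype(np.float32)

for bits in (8, 6, 4, 3, 2):
    levels = 2 ** (bits - 1) - 1             # signed range, e.g. 127 for 8 bits
    scale = np.abs(w).max() / levels
    w_hat = np.round(w / scale) * scale      # quantize, then dequantize
    rmse = np.sqrt(np.mean((w - w_hat) ** 2))
    print(f"{bits} bits -> RMSE {rmse:.4f}")
```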

If all this sounds a bit technical, don't worry, it is. But the takeaway is simply that AI models are not fully understood, and well-known shortcuts that work in many kinds of computation don't work here. You wouldn't say "noon" if someone asked when you started a 100-meter dash, would you? It's not quite so obvious as that, of course, but the idea is the same:

"The key point of our work is that there are limitations you cannot naively get around," Kumar concluded. "We hope our work adds nuance to a discussion that often seeks increasingly low-precision defaults for training and inference."

Kumar admits that his and his colleagues' study was on a relatively small scale, and they plan to test it with more models in the future. But he believes at least one insight will hold up: there's no free lunch when it comes to reducing inference costs.

"Bit precision matters, and it isn't free," he said. "You can't reduce it forever without the models suffering. Models have finite capacity, so rather than trying to fit a quadrillion tokens into a small model, in my opinion much more effort will be put into meticulous data curation and filtering, so that only the highest-quality data is put into smaller models. I am optimistic that new architectures that deliberately aim to make low-precision training stable will be important in the future."

