AWS brings prompt routing and caching to its Bedrock LLM service


As companies move from experimenting with generative AI in limited prototypes to putting it into production, they are becoming increasingly price conscious. Using large language models isn’t cheap, after all. One way to reduce cost is to go back to an old concept: caching. Another is to route simpler queries to smaller, more cost-effective models. At its re:Invent conference in Las Vegas, AWS today announced both of these features for its Bedrock LLM hosting service.

Let’s talk about the caching service first. “Say there’s a document, and multiple people are asking questions about the same document. Every single time, you’re paying,” Atul Deo, director of product for Bedrock, told me. “And those context windows get longer and longer. For example, with Nova, we’re going to have 300,000 [tokens of] context and 2 million [tokens of] context. I think by next year, it could go up even much higher.”

Image Credits: AWS

Caching ensures that you don’t have to pay for the model to do repetitive work and reprocess the same (or very similar) queries over and over again. According to AWS, this can reduce costs by up to 90%, and a nice byproduct is that the latency for getting an answer back from the model is significantly lower as well (by up to 85%, AWS says). Adobe, which tested prompt caching for some of its generative AI applications on Bedrock, saw a 72% reduction in response time.
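
To make that concrete, here is a minimal sketch of what prompt caching can look like from the client side, using boto3’s Converse API. The cachePoint content block, the model ID, and the file name are assumptions for illustration, not a definitive recipe; check the current Bedrock documentation for the exact shape the feature ships with.

```python
import boto3

# Sketch: mark a shared document as a cacheable prefix so repeat
# questions don't reprocess (or re-bill) the same context tokens.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

document_text = open("contract.txt").read()  # the context everyone queries

def ask(question: str) -> str:
    response = client.converse(
        modelId="amazon.nova-pro-v1:0",  # placeholder model ID
        messages=[{
            "role": "user",
            "content": [
                {"text": document_text},
                # Assumed cache marker: everything above this block is
                # treated as a reusable prefix across calls.
                {"cachePoint": {"type": "default"}},
                {"text": question},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

# The second and later questions reuse the cached document prefix,
# so only the short question tokens are processed from scratch.
print(ask("What is the termination clause?"))
print(ask("Who are the parties to this agreement?"))
```

The cache marker is where the savings come from: every question after the first reuses the already-processed document tokens instead of paying to re-ingest them, which is the scenario behind AWS’s up-to-90% figure.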

The other major new feature is Bedrock’s Intelligent Prompt Routing. With this, Bedrock can automatically route prompts to different models within the same model family to help companies strike the right balance between performance and cost. The system automatically predicts (using a small language model) how each model will perform for a given query and then routes the request accordingly.

Image Credits: AWS

“Sometimes, my query could be very simple. Do I really need to send that query to the most capable model, which is extremely expensive and slow? Probably not,” Deo explained. “So basically, you want to create this notion of ‘Hey, at runtime, based on the incoming prompt, send the right query to the right model.’”
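
AWS hasn’t detailed the router’s internals beyond saying a small model predicts per-query performance, so here is a conceptual sketch of the idea rather than AWS’s implementation: a cheap model first judges whether it can handle the query itself, and only hard prompts get escalated to the expensive model. The model IDs and the judging prompt are hypothetical placeholders.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

SMALL_MODEL = "amazon.nova-lite-v1:0"  # cheap, fast (placeholder ID)
LARGE_MODEL = "amazon.nova-pro-v1:0"   # capable, pricier (placeholder ID)

def converse(model_id: str, prompt: str) -> str:
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

def route(prompt: str) -> str:
    # Ask the small model to predict its own adequacy for this query.
    # Bedrock instead uses a dedicated small predictor model; this
    # YES/NO self-assessment just stands in for that step.
    verdict = converse(
        SMALL_MODEL,
        "Answer YES or NO only: can a small general-purpose model "
        f"answer this query well?\n\nQuery: {prompt}",
    )
    chosen = SMALL_MODEL if verdict.strip().upper().startswith("YES") else LARGE_MODEL
    return converse(chosen, prompt)

print(route("What's the capital of France?"))  # likely stays on the small model
print(route("Draft a clause-by-clause risk analysis of this merger agreement."))
```

A production router would use a predictor trained on model-quality data rather than a self-assessment, but the cost logic is the same: most traffic stays on the cheap path, and only queries predicted to need it hit the expensive model.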

LLM routing is not a new concept, of course. Startups like Martian, as well as a number of open source projects, tackle this too, but AWS would likely argue that what differentiates its offering is that the router can intelligently direct queries without a lot of human input. It is also limited, though, in that it can only route queries to models in the same model family. In the long run, Deo told me, the team plans to expand the system and give users more ability to customize it.

Image Credits: AWS

Finally, AWS is also launching a new marketplace for Bedrock. The idea here is that while Amazon partners with many of the larger model providers, there are now hundreds of specialized models that may only have a few dedicated users, Deo said. Since those customers are asking the company to support them, AWS is launching a marketplace for these models, where the only major difference is that users will have to provision and manage their infrastructure capacity themselves, something Bedrock typically handles automatically. In total, AWS will offer about 100 of these emerging and specialized models, with more to come.
