As businesses shift from experimenting with generative AI prototypes to deploying them in production, cost efficiency is becoming a priority. Running large language models (LLMs) can be expensive, prompting companies to explore strategies for reducing costs. Two such approaches are prompt caching and routing simpler queries to smaller, more affordable models. AWS introduced both features for its Bedrock LLM hosting service at its re:Invent conference in Las Vegas.
Caching to Reduce Costs and Latency
Caching avoids repeatedly processing the same or similar prompt content, significantly cutting costs. For instance, when multiple users ask questions about the same document, caching lets the model reuse the already-processed document rather than running it through the model again for every query, reducing expenses by up to 90%, according to AWS. It also speeds up response times, with AWS claiming latency reductions of up to 85%. Adobe, which tested caching with its generative AI applications on Bedrock, reported a 72% improvement in response time.
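For developers, using the feature could look something like the minimal Python sketch below, which asks two questions about the same document through Bedrock's Converse API. It assumes the cachePoint content block from AWS's prompt-caching documentation; the model ID, region, and document file are illustrative placeholders, not recommendations.

```python
import boto3

# Bedrock runtime client; region and model ID are illustrative assumptions
client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"

# Hypothetical document that many users will ask questions about
with open("annual_report.txt") as f:
    document_text = f.read()

def ask(question: str) -> str:
    """Ask a question about the shared document. The cache checkpoint marks
    the document prefix, so repeated calls can reuse its processed state
    instead of paying to reprocess it."""
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"text": document_text},
                {"cachePoint": {"type": "default"}},  # cache everything above
                {"text": question},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

# The first call pays full price to process the document; later questions
# about the same document hit the cached prefix.
print(ask("Summarize the key findings."))
print(ask("What risks does the report highlight?"))
```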
Intelligent Prompt Routing
The second feature, intelligent prompt routing, optimizes the balance between cost and performance by automatically directing prompts to the most suitable models within the same family. A smaller model may handle simpler queries, while more complex requests go to larger, more powerful models. This system uses a smaller language model to predict which model will best handle a given query, minimizing unnecessary expenses.
Atul Deo, AWS's director of product for Bedrock, explained, "For simple queries, there’s no need to use the most expensive, slowest model. Instead, the system identifies the appropriate model at runtime based on the incoming prompt." While similar technologies exist, AWS emphasizes its solution’s ability to route intelligently with minimal human intervention, though it currently only works within a single model family. The company plans to expand this capability in the future.
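From the caller's side, routing is meant to be transparent. The sketch below assumes a default prompt-router ARN can be passed in place of a model ID in the Converse API; the account ID, region, and trace field names are illustrative assumptions rather than confirmed details.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# A prompt router is invoked like a model: its ARN goes in the modelId field.
# This ARN is illustrative; real ones come from the Bedrock console or API.
ROUTER_ARN = (
    "arn:aws:bedrock:us-east-1:123456789012:"
    "default-prompt-router/anthropic.claude:1"
)

response = client.converse(
    modelId=ROUTER_ARN,
    messages=[{"role": "user", "content": [{"text": "What is 2 + 2?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])

# The routing trace reports which model in the family actually served the
# request (field names assumed); a simple prompt like this should land on
# the smaller, cheaper model.
trace = response.get("trace", {}).get("promptRouter", {})
print(trace.get("invokedModelId"))
```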
Bedrock Model Marketplace
AWS is also launching a marketplace for Bedrock, catering to the growing number of specialized models with smaller user bases. Unlike models hosted on the standard Bedrock service, these require customers to provision and manage their own infrastructure capacity, but the marketplace aims to meet demand for niche solutions. Initially, about 100 specialized models will be available, with more expected to follow.
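Based on AWS's description, a deployed marketplace model would then be addressed much like any other Bedrock model. The sketch below assumes the endpoint ARN backing a deployed marketplace model can stand in for a model ID in the Converse API; the ARN itself is a placeholder.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN for the endpoint behind a deployed marketplace model;
# unlike standard Bedrock models, the capacity behind it is provisioned
# and paid for by the customer.
ENDPOINT_ARN = (
    "arn:aws:sagemaker:us-east-1:123456789012:"
    "endpoint/my-specialized-model-endpoint"
)

response = client.converse(
    modelId=ENDPOINT_ARN,  # assumption: endpoint ARN accepted as modelId
    messages=[{
        "role": "user",
        "content": [{"text": "Rewrite this clause in plain English."}],
    }],
)
print(response["output"]["message"]["content"][0]["text"])
```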
These advancements underline AWS's commitment to making generative AI more cost-effective and accessible while meeting diverse business needs.