
DigitalOcean launches Inference Engine with a router that matches AI requests to models

The platform's four inference services include a router that sends each request to the most suitable model, with one customer reporting inference costs down more than 40%

Redação Portal ERP
Jun 04, 2026

DigitalOcean, a cloud computing provider that serves developers and small to mid-sized businesses, has launched its Inference Engine, a set of services for running AI inference workloads in production. The engine brings together four capabilities: an Inference Router, Batch Inference, Serverless Inference and Dedicated Inference. Development teams can match each type of workload to a performance and cost profile through one provider rather than combining separate services.

The component the company emphasised is the Inference Router, which addresses a common inefficiency in agentic AI systems: every request is sent to the most expensive model, regardless of how demanding the task is. With the router, a developer defines a pool of models and describes tasks and priorities in natural language, and the system matches each request to an appropriate model based on cost and latency.

The router runs on a Mixture of Experts router model that DigitalOcean built, which removes the need for teams to build and maintain their own routing infrastructure. LawVo, a legal technology company, reported that the router reduced its inference costs by more than 40% by routing each request based on complexity.
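To make the routing idea concrete, here is a minimal sketch of complexity-based model selection. It is not DigitalOcean's API or its Mixture of Experts router: the model names, prices, latencies, tiers and the keyword heuristic in `required_tier` are all hypothetical, chosen only to illustrate how a request might be matched to the cheapest model that is capable enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                  # hypothetical model identifier
    cost_per_1k_tokens: float  # hypothetical price in USD
    latency_ms: int            # hypothetical typical latency
    tier: int                  # 1 = small, 2 = mid, 3 = frontier


# Hypothetical pool; a real pool would be defined by the developer.
POOL = [
    Model("small-model", 0.10, 150, 1),
    Model("mid-model", 0.50, 400, 2),
    Model("frontier-model", 3.00, 1200, 3),
]


def required_tier(prompt: str) -> int:
    """Crude keyword heuristic standing in for a learned complexity estimate."""
    text = prompt.lower()
    if any(k in text for k in ("prove", "draft a contract", "multi-step")):
        return 3
    if "summarize" in text or len(prompt.split()) > 40:
        return 2
    return 1


def route(prompt: str, pool=POOL, latency_weight: float = 0.0) -> Model:
    """Pick the cheapest model whose tier meets the estimated requirement.

    latency_weight (assumed knob) adds a penalty per millisecond of latency.
    """
    tier = required_tier(prompt)
    # Fall back to the whole pool if no model reaches the required tier.
    candidates = [m for m in pool if m.tier >= tier] or list(pool)
    return min(
        candidates,
        key=lambda m: m.cost_per_1k_tokens + latency_weight * m.latency_ms,
    )


if __name__ == "__main__":
    for p in ("What is the capital of France?",
              "Summarize this filing",
              "Prove the lemma"):
        print(p, "->", route(p).name)
```

In this sketch a simple lookup goes to the small model and only the hardest prompt reaches the frontier model, which is the cost-saving behaviour the article describes. A production router would replace the keyword heuristic with a trained classifier.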

The other three services target different workload patterns. Dedicated Inference provides reserved capacity for teams running sustained, high-volume workloads that need predictable performance. Serverless Inference offers access to dozens of models through a single API key with scale-to-zero elasticity, meaning teams do not pay for idle capacity, and includes off-peak pricing. Batch Inference handles offline workloads that do not need real-time responses, reducing their cost by 50% through asynchronous execution and a guaranteed 24-hour completion window.

DigitalOcean said the engine was built around hardware and software integrations including vLLM, TensorRT and SGLang to increase token throughput, along with optimisations at the request and model level and distributed scaling for the uneven demand of production AI applications.

According to Artificial Analysis, an independent AI inference benchmarking platform, DigitalOcean delivered three times faster time-to-first-answer-token and three times higher output speed than Amazon Bedrock on the DeepSeek V3.2 model at 10,000 input tokens, and ranked as one of three providers in the most favourable quadrant on the platform's latency versus output speed chart.

The Inference Engine was developed alongside early design partners running production workloads. Hippocratic AI, a company that builds patient-facing healthcare AI agents, reported two times the production throughput and 40% lower P99 latency across more than 20 million patient interactions. Workato's Research Lab, which processes more than 1 trillion automated workloads, reported 77% faster time-to-first-token, 79% lower end-to-end latency and 67% lower inference costs on the platform.

"DigitalOcean's Inference Router gives us the kind of intelligent model selection we would otherwise have had to build ourselves. It routes each request to the right model based on complexity, helping us reduce inference costs by more than 40% while maintaining the accuracy, speed, and reliability our users expect," said Hovsep Seraydarian, Co-Founder and CTO, LawVo.

"Most teams building agentic systems today make a single model decision and apply it uniformly across their agentic workflows. They default to a frontier model and pay the generalization tax: premium prices and higher latency for work that often does not require the most expensive closed source model. Inference Router is the essential AI middleware that removes that tax by intelligently matching requests to the right model based on task, context, and developer-defined preferences. The result is a smarter operating model for inference - one that gives developers more control over quality, speed, and cost while helping AI-native builders move faster and build more durable businesses on DigitalOcean," said Vinay Kumar, CPTO, DigitalOcean.

"In healthcare AI, a node going down isn't just an SLA issue, it impacts patient experience. We've pressed DigitalOcean hard on reliability, access to the newest hardware, and the ability to scale efficiently. They've delivered," said Debajyoti Datta, Co-Founder, Hippocratic AI.

"Through close collaboration on performance optimization, DigitalOcean helped us accelerate our inference performance and overall progress by two to three times," said Oscar Wu, AI Research Scientist, Technical Lead, Workato.
