Design & Reuse

Will Robots Use Reasoning At The Edge?

Aug. 14, 2025

Techniques like reasoning will have limited usefulness for resource-constrained edge applications like robotics and autonomous driving, Ambarella CTO Les Kohn told EE Times.
Reasoning improves inference quality but requires many times more tokens than basic inference.
“You can’t rely on reasoning nearly as much at the edge, especially if you’re trying to do some real-time processing,” Kohn said.

LLMs already have to be distilled to work with edge hardware. Ambarella showed EE Times several DeepSeek distillations running on its N1 chip. At 671B parameters, the full DeepSeek-R1 model would have been too big to run on a single edge chip, though running it across many chips is “doable in principle,” Kohn said.

Growing model sizes, combined with reasoning and chain-of-thought techniques, are rapidly increasing the computational requirements of modern models, making them incompatible with most edge applications.
For robotics and autonomous vehicles, Ambarella’s research teams have instead been investigating integrating VLMs with the company’s real-time software stack. Kohn said the solution may be to combine a fast, real-time model responsible for the system’s basic safety with a larger model that performs higher-level planning.

In his example – an autonomous vehicle navigating out of a parking lot where it doesn’t have a map – a higher-level planning model can look for hints, like an exit sign, to deduce how to get out. It can then hand off its plan to a real-time model to execute and ensure safety.

“And then, of course, you can iterate on that as you make progress through the parking lot,” he said. “That’s an example of this hybrid approach, which tries to leverage a large model for difficult things that require knowledge of the overall world that a real-time network wouldn’t have.”

These two models could run on the same chip, he said, given that the high-level planner only needs to run once every few seconds.
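The split Kohn describes can be sketched as a simple control loop: a fast model runs on every control tick, while the large planner is consulted only once every few seconds and its plan is reused in between. A minimal illustration, in which all names, rates, and the toy planner/controller logic are assumptions for the sake of the example, not Ambarella’s software:

```python
# Hypothetical hybrid loop: a slow high-level planner plus a fast
# real-time controller sharing one processing budget.

PLAN_INTERVAL_TICKS = 100  # e.g. re-plan at ~0.3 Hz against a 30 Hz control loop

def slow_planner(observation):
    """Stand-in for a large VLM: produces a coarse plan from scene context
    (e.g. "head for the exit sign"). Runs only occasionally."""
    return {"goal": "exit_sign", "route": ["aisle_3", "gate"]}

def fast_controller(plan, observation):
    """Stand-in for the small real-time model: tracks the current plan
    every tick and enforces basic safety regardless of the plan."""
    if observation.get("obstacle"):
        return "brake"                      # safety overrides planning
    return "follow:" + plan["route"][0]     # otherwise execute the plan

def run(observations):
    plan, actions = None, []
    for tick, obs in enumerate(observations):
        if tick % PLAN_INTERVAL_TICKS == 0:  # planner fires only occasionally,
            plan = slow_planner(obs)         # so both fit on one chip
        actions.append(fast_controller(plan, obs))
    return actions
```

Because the planner runs on a small fraction of ticks, its cost amortizes to near zero per control cycle, which is why both models can share the same chip, and iterating the outer loop corresponds to re-planning "as you make progress through the parking lot."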

Algorithm-first

Proper hardware design for LLM and VLM workloads can improve computational efficiency to help cater to larger models, Kohn said, noting that Ambarella has further optimizations for LLMs coming in the next generation of its hardware.

While many model optimization techniques have tried to reduce the number of matrix multiplications (or multiply-accumulate operations) a model requires, there is no point doing this without also considering other hardware bottlenecks, Kohn said...
