Local LLM
Overview
AgentOpera supports interaction with an On-Device LLM. The on-device model is integrated into our AI Terminal application. Each time a user submits a chat message, the system first calls the local model router to check whether the local model is capable of answering the question. If it is, the local model's output is used directly; if not, the system falls back to the cloud-based AgentOpera to answer the question. For the local model, we also support local RAG to improve the accuracy of answers.
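As a rough sketch of this routing flow, the snippet below pictures the decision between the local and cloud paths. The helper functions here are hypothetical placeholders, not SDK APIs; the concrete SDK call is shown under Examples.

// Hypothetical sketch of the hybrid routing flow described above. None of the
// helpers below are SDK APIs; they only stand in for the three paths involved.
import Foundation

func localRouterCanAnswer(_ question: String) async -> Bool { return false }   // placeholder router check
func runLocalInference(_ question: String) async -> String { return "local answer" }  // placeholder on-device call (optionally RAG-augmented)
func askCloudAgentOpera(_ question: String) async -> String { return "cloud answer" } // placeholder cloud call

func answer(_ question: String) async -> String {
    if await localRouterCanAnswer(question) {
        // The local router judged the on-device model capable: answer locally.
        return await runLocalInference(question)
    }
    // Otherwise fall back to the cloud-based AgentOpera.
    return await askCloudAgentOpera(question)
}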
Architecture
The overall hybrid architecture combining the local LLM and the cloud-based AgentOpera is shown below.

Examples
We provide the TensorOpera Edge AI SDK for AI Terminal applications. The following example shows how to invoke the local model router and run local inference.
// Run inference through the TensorOpera Edge AI SDK, letting the local
// model router decide whether the on-device model can handle the prompt.
await TensorOperaEdgeManager.getTensorOperaEdgeApi().runInference(
    prompt: prompt,
    completionPromptMessage: [],
    enableLocalRouter: true,
    didInfer: { inferredResults, success in
        if success {
            // The local model answered the prompt; use its output directly.
            self.response = inferredResults
            print("Inference succeeded using the local model: \(inferredResults)")
        } else {
            // The local model cannot handle this prompt; fall back to cloud inference.
            print("Local inference failed, falling back to cloud inference.")
            self.response = "We need to use cloud inference"
        }
    },
    didFinish: { tps in
        // Record the decoding throughput (tokens per second) of the local model.
        self.tokensPerSecond = TensorOperaEdgeManager.getTensorOperaEdgeApi().getInferenceTps() ?? 0.0
    },
    didGetFirstToken: { ttft in
        // Record the time to first token.
        self.timeToFirstToken = ttft
    })
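In this example, enableLocalRouter: true asks the local model router to decide whether the on-device model should handle the prompt. The didInfer callback returns the inference result together with a success flag; when success is false, the application should route the request to the cloud-based AgentOpera. The didFinish and didGetFirstToken callbacks report performance metrics: decoding throughput in tokens per second (also available via getInferenceTps()) and time to first token, respectively.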