Local LLM

Overview

AgentOpera supports interaction with an on-device LLM. In our AI Terminal application, the on-device LLM is integrated directly: each time a user submits a chat message, the system first calls the local model router to decide whether the local model is capable of answering the question. If it is, the local model's output is used directly; if it is not, the system falls back to the cloud-based AgentOpera to answer the question. For the local model, we also support local RAG to improve the accuracy of the answers.
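
The routing decision described above can be summarized with a short sketch. The code below is illustrative only: localRouterCanAnswer, runLocalInference, and callCloudAgentOpera are hypothetical placeholder functions, not TensorOpera Edge AI SDK calls; the actual SDK call is shown in the Examples section.

// Hypothetical sketch of the hybrid routing flow. These three helpers are
// illustrative placeholders, not part of the TensorOpera Edge AI SDK.
func localRouterCanAnswer(_ prompt: String) async -> Bool { prompt.count < 200 }
func runLocalInference(_ prompt: String) async -> String { "local answer" }
func callCloudAgentOpera(_ prompt: String) async -> String { "cloud answer" }

// Route each chat message: try the on-device model first, otherwise fall back
// to the cloud-based AgentOpera.
func answer(_ prompt: String) async -> String {
    if await localRouterCanAnswer(prompt) {
        // The local router judged the on-device model capable of answering.
        return await runLocalInference(prompt)
    } else {
        // The local model cannot handle this prompt; use cloud inference.
        return await callCloudAgentOpera(prompt)
    }
}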

Architecture

The overall hybrid architecture combining the local LLM and the cloud-based AgentOpera is shown below.

Examples

We provide the TensorOpera Edge AI SDK for AI Terminal applications. The following example shows how to call the local model router and run local inference.

// Run inference with the local model router enabled. When the router decides
// the on-device model can answer, the results come back through didInfer with
// success == true; otherwise fall back to cloud inference.
await TensorOperaEdgeManager.getTensorOperaEdgeApi().runInference(
    prompt: prompt,
    completionPromptMessage: [],
    enableLocalRouter: true,
    didInfer: { inferredResults, success in
        if success {
            // The local model answered the prompt directly.
            self.response = inferredResults
            print("Infer successfully using local inference: \(inferredResults)")
        } else {
            // The local model cannot handle this prompt; use cloud inference instead.
            print("Infer failed, we need to use cloud inference.")
            self.response = "We need to use cloud inference"
        }
    },
    didFinish: { tps in
        // Reported when inference finishes; record the tokens-per-second throughput.
        self.tokensPerSecond = TensorOperaEdgeManager.getTensorOperaEdgeApi().getInferenceTps() ?? 0.0
    },
    didGetFirstToken: { ttft in
        // Reported when the first token arrives; record the time to first token.
        self.timeToFirstToken = ttft
    }
)
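
For local RAG, the prompt can be augmented with locally retrieved context before it is passed to runInference. The sketch below is a minimal illustration of that idea: retrieveLocalContext and runLocalRagInference are hypothetical helpers introduced here, not SDK functions; only the runInference call itself comes from the example above.

// Hypothetical local RAG sketch: retrieveLocalContext is an illustrative
// placeholder (e.g. an on-device vector index lookup), not an SDK call.
func retrieveLocalContext(for prompt: String) -> [String] {
    // In a real application this would query a local document or vector store.
    return []
}

func runLocalRagInference(prompt: String) async {
    // Prepend locally retrieved snippets to the user's question.
    let context = retrieveLocalContext(for: prompt).joined(separator: "\n")
    let augmentedPrompt = context.isEmpty
        ? prompt
        : "Context:\n\(context)\n\nQuestion: \(prompt)"

    await TensorOperaEdgeManager.getTensorOperaEdgeApi().runInference(
        prompt: augmentedPrompt,
        completionPromptMessage: [],
        enableLocalRouter: true,
        didInfer: { inferredResults, success in
            print(success
                ? "Local RAG answer: \(inferredResults)"
                : "Local model not capable; falling back to cloud inference.")
        },
        didFinish: { _ in },
        didGetFirstToken: { _ in }
    )
}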
