Local Model Inference Framework

đź’ˇ Why Local Model Inference Framework

The local model inference framework is a general-purpose, hardware-agnostic on-device inference framework. It provides an extensive set of built-in modules for non-neural network computation, offers efficient memory management, and significantly lowers the development cost of on-device models. Other solutions, such as TFLite and PyTorch Mobile, offer efficient local inference but lack the programming flexibility and model generality provided by this framework.

  • Programming flexibility: Supports running Python code directly on the device, enabling flexible debugging and efficient development of edge-side models across diverse hardware environments. This approach ensures both development efficiency and data accuracy.

  • Model generality: Unlike existing model compilation solutions that often struggle with general-purpose model deployment—especially for non-neural network modules, which typically lack equivalent C++ ecosystem support—our framework offers a rich set of non-neural network computational modules. This ensures broad compatibility for product-grade model deployment across various devices.

  • Efficient memory management: Delivers optimized memory usage for resource-constrained devices, achieving maximum inference throughput under the same power envelope.

  • Offline & Weak Network Friendly: Enables fully end-to-end model execution without relying on network connectivity, ensuring smooth user experiences even in offline or poor network environments.

🔍 Use cases: Model Deployment on Samsung Mobile Devices (ADL, InstantMesh, XTTS), showcased at CES.

In collaboration with Samsung, we successfully deployed Transformer variants (XTTS) and Stable Diffusion variants (ADL, InstantMesh), powered by Samsung’s latest SoC. These models were seamlessly integrated into apps and showcased at CES.

Challenges

  1. The on-device ecosystem of non-neural network computation libraries is underdeveloped.

  2. Due to limited resources on devices, efficient memory management is essential.

Solution

We have implemented a rich set of non-neural network computation functions on-device (such as tokenizers, PyMeshLab, and more), along with advanced schedulers (including Karras Diffusion Schedulers, Top-K samplers, etc.). We have also applied careful quantization, distillation, partitioning, and fine-grained memory management to enable efficient deployment of large models on resource-constrained edge devices, and we have abstracted these capabilities into a framework for greater reusability.
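To make the scheduler side of this concrete, here is a minimal sketch of the noise schedule used by Karras-style diffusion schedulers. The formula follows Karras et al. (2022); the function name and default values are illustrative assumptions, not this framework's actual API.

```python
import numpy as np

def karras_sigmas(n_steps: int, sigma_min: float = 0.002,
                  sigma_max: float = 80.0, rho: float = 7.0) -> np.ndarray:
    """Noise levels for a Karras-style diffusion scheduler (illustrative).

    Interpolates between sigma_max and sigma_min in rho-space, as described
    by Karras et al. (2022), and appends a final 0.0 so the last denoising
    step lands on the clean sample.
    """
    ramp = np.linspace(0.0, 1.0, n_steps)
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho
    return np.append(sigmas, 0.0)

# Example: a 20-step schedule, decreasing from sigma_max toward 0.
print(karras_sigmas(20))
```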

XTTS (Voice Cloning)

XTTS is a Transformer variant. We leverage a scheduler to efficiently manage KV-cache memory, and we have implemented a wide range of audio-related and sampling-related operators on ARM-powered mobile devices.
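As a rough illustration of the KV-cache bookkeeping involved, the sketch below pre-allocates the cache to a fixed maximum length so that autoregressive decoding never triggers per-step allocations. The class, shapes, and method names are illustrative assumptions, not the framework's actual implementation.

```python
import numpy as np

class StaticKVCache:
    """Pre-allocated key/value cache for one attention layer (illustrative)."""

    def __init__(self, max_len: int, n_heads: int, head_dim: int, dtype=np.float16):
        # Allocating the full buffer up front keeps peak memory predictable
        # on a resource-constrained device and avoids per-step reallocation.
        self.k = np.zeros((n_heads, max_len, head_dim), dtype=dtype)
        self.v = np.zeros((n_heads, max_len, head_dim), dtype=dtype)
        self.length = 0  # number of valid positions currently stored

    def append(self, k_step: np.ndarray, v_step: np.ndarray) -> None:
        """Write one decoded token's keys/values into the next free slot."""
        if self.length >= self.k.shape[1]:
            raise RuntimeError("KV cache is full; increase max_len or evict.")
        self.k[:, self.length] = k_step
        self.v[:, self.length] = v_step
        self.length += 1

    def view(self):
        """Return only the valid prefix for use in attention."""
        return self.k[:, : self.length], self.v[:, : self.length]

# Example: a cache for 8 heads of dimension 64, capped at 512 tokens.
cache = StaticKVCache(max_len=512, n_heads=8, head_dim=64)
cache.append(np.zeros((8, 64), dtype=np.float16), np.zeros((8, 64), dtype=np.float16))
k, v = cache.view()  # shapes: (8, 1, 64)
```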

InstantMesh (3D Generation)

InstantMesh is a Stable Diffusion variant. We leverage a scheduler to efficiently manage U-Net memory, and we have implemented a wide range of image-related and 3D-related operators on ARM-powered mobile devices.
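One common way to bound the U-Net's peak weight memory on a device is to keep only the currently executing block resident and release the rest. The sketch below illustrates that idea with placeholder load/unload hooks; it is a simplification, not the framework's actual scheduler.

```python
class OffloadingUNetRunner:
    """Runs a U-Net one block at a time to cap resident weight memory (illustrative).

    `blocks` is an ordered list of objects exposing load(), unload(), and
    __call__(x); the hook names are placeholders.
    """

    def __init__(self, blocks):
        self.blocks = blocks

    def __call__(self, x):
        for block in self.blocks:
            block.load()      # bring this block's weights into RAM
            x = block(x)      # run the block
            block.unload()    # release the weights before loading the next one
        return x

# Tiny stand-in block so the sketch is runnable end to end.
class DummyBlock:
    def load(self): pass
    def unload(self): pass
    def __call__(self, x): return x * 2

runner = OffloadingUNetRunner([DummyBlock(), DummyBlock(), DummyBlock()])
print(runner(1.0))  # 8.0 after three doubling blocks
```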

ADL (Video Generation)

ADL is a Stable Diffusion variant. We leverage a scheduler to efficiently manage U-Net memory, and we have implemented a wide range of text-related and video-related operators on ARM-powered mobile devices.
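For video, peak activation memory also grows with the number of frames, so decoding latents in small frame chunks rather than all at once is one way to keep memory bounded. The sketch below is a generic illustration with a dummy decoder; it is not ADL's actual decode path.

```python
import numpy as np

def decode_latents_chunked(latents: np.ndarray, decode_fn, chunk: int = 4) -> np.ndarray:
    """Decode video latents a few frames at a time to cap peak memory (illustrative).

    latents: array of shape (num_frames, ...) holding per-frame latents.
    decode_fn: function mapping a latent chunk to decoded frames.
    chunk: frames decoded per call; smaller values lower peak memory.
    """
    frames = []
    for start in range(0, latents.shape[0], chunk):
        frames.append(decode_fn(latents[start:start + chunk]))
    return np.concatenate(frames, axis=0)

# Example with a dummy decoder that just duplicates the latent channels.
dummy_decode = lambda z: np.repeat(z, 2, axis=-1)
out = decode_latents_chunked(np.zeros((16, 8, 8, 4), dtype=np.float16), dummy_decode)
print(out.shape)  # (16, 8, 8, 8)
```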
