Running a DeepSeek R1 Distilled Model Locally on a Mobile Device with React Native and llama.rn
Hello everyone. The hype around the new open-source Chinese LLM keeps growing. We already know we can run this model* locally on any computer, but what about mobile devices?
* We will use a distilled and quantized model of the original DeepSeek R1.
First, let’s talk a little bit about distillation and quantization.
1. Distillation creates a smaller version of an LLM. The distilled LLM generates predictions much faster and requires fewer computational and environmental resources than the full LLM. However, the distilled model's predictions are generally not quite as good as the original LLM's.
2. Quantization is a compression technique that maps high-precision values to lower-precision ones. For an LLM, that means reducing the precision of its weights and activations, making the model less memory-intensive. This does have an impact on the model's capabilities, including its accuracy.
In other words, a less distilled and less quantized model will produce higher-quality, more precise output; the trade-offs are a larger size and higher computational and environmental costs.
For example, the original DeepSeek R1 model has 671 billion parameters and requires at least 800GB of HBM (High Bandwidth Memory).
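To build some intuition for how much quantization saves, here is a rough back-of-the-envelope estimate (a sketch only: real GGUF files mix quantization types per tensor and add metadata, so actual sizes differ):
// Rough estimate: model size in GB ≈ parameter count × bits per weight / 8 / 1e9
const estimateSizeGB = (params: number, bitsPerWeight: number) =>
  (params * bitsPerWeight) / 8 / 1e9;

console.log(estimateSizeGB(1.5e9, 16)); // ~3 GB at 16-bit precision
console.log(estimateSizeGB(1.5e9, 5)); // ~0.94 GB at ~5-bit quantization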
We will start our project with the smallest distilled R1 version provided by unsloth: DeepSeek-R1-Distill-Qwen-1.5B.
The Stack:
So let's choose the tech stack we will run on:
1. React Native (expo) + react-native-gifted-chat + ExpoFileSystem
2. llama.rn
And that's it! That's all you need to run an LLM locally!
Preparation:
1. Let’s start a new project
npx create-expo-app@latest ./DeepSeekMobile
2. Install necessary dependencies:
npx expo install react-native-gifted-chat react-native-reanimated react-native-safe-area-context react-native-get-random-values llama.rn expo-file-system
llama.rn is the library that will allow us to run our model on the device.
expo-file-system will help us download and store the model in the device's memory.
All other dependencies provide a better UX when interacting with the model.
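One caveat worth mentioning: llama.rn ships native code, so the app won't run inside the stock Expo Go client. You'll typically generate the native projects and run a development build instead, for example:
npx expo prebuild
npx expo run:ios
# or: npx expo run:android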
We also need to pick the model we will use, in its GGUF version. I will take it from the Hugging Face unsloth repo:
https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf
Implementation:
Let's start the implementation.
First, we need to get the model and save it on the device. Let's use expo-file-system.
Here we set up the download itself; checking whether the file already exists (so we don't download it twice) happens in the screen component below:
import * as FileSystem from "expo-file-system";

const downloadLink =
  "https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf";

const downloadResumable = FileSystem.createDownloadResumable(
  downloadLink,
  FileSystem.documentDirectory + "model.gguf",
  {},
  (progress) => {
    console.log("downloading", progress);
  },
);

const downloadModel = async () => {
  try {
    const res = await downloadResumable.downloadAsync();
    return res;
  } catch (e) {
    return null;
  }
};
The model we are using weighs about 700MB, so the download will take a little while.
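If you want to show real progress instead of just logging, the download callback receives totalBytesWritten and totalBytesExpectedToWrite, so a percentage is easy to derive. A small sketch, assuming you keep the value in React state inside your component:
const [progress, setProgress] = useState(0);

const downloadResumable = FileSystem.createDownloadResumable(
  downloadLink,
  FileSystem.documentDirectory + "model.gguf",
  {},
  ({ totalBytesWritten, totalBytesExpectedToWrite }) => {
    // Percentage of the GGUF file downloaded so far
    setProgress(Math.round((totalBytesWritten / totalBytesExpectedToWrite) * 100));
  },
);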
Let’s configure llama.rn to load our model correctly.
I created a file called @/llama/llama.config.ts:
// @/llama/llama.config
import { initLlama, LlamaContext } from "llama.rn";

...

export const loadModel = async (modelPath: string) => {
  const context = await initLlama({
    model: modelPath,
    use_mlock: true,
    n_ctx: 131072,
    n_gpu_layers: 1, // > 0: enable Metal on iOS
    // embedding: true, // use embedding
  });
  return context;
};
The initLlama() function returns a context object that exposes the API for communicating with the loaded model. You can find more configuration options here: https://github.com/mybigday/llama.rn
And don't forget to star their project, they did an amazing job!
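One more detail: the context keeps the whole model loaded in memory, so when you no longer need it (for example when the user leaves the chat screen) you can free it. A minimal sketch, assuming llama.rn's context.release() helper:
// @/llama/llama.config
export const unloadModel = async (context: LlamaContext) => {
  // Frees the native memory held by the loaded model
  await context.release();
};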
Let's connect it all together in a React Native screen:
import React, { useEffect, useState } from "react";
import { Button, View } from "react-native";
import * as FileSystem from "expo-file-system";
import { loadModel } from "../llama/llama.config";
import Chat from "@/components/Chat";
import { SafeAreaProvider, SafeAreaView } from "react-native-safe-area-context";
import { LlamaContext } from "llama.rn";

const downloadLink =
  "https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf";

export default () => {
  const [context, setContext] = useState<LlamaContext | null | undefined>(
    null,
  );

  const downloadResumable = FileSystem.createDownloadResumable(
    downloadLink,
    FileSystem.documentDirectory + "model.gguf",
    {},
    () => {
      console.log("downloading");
    },
  );

  const downloadModel = async () => {
    try {
      // If the model is already on the device, load it straight away
      const isExists = (await FileSystem.getInfoAsync(
        FileSystem.documentDirectory + "model.gguf",
      )).exists;
      if (isExists) {
        const context = await loadModel(
          FileSystem.documentDirectory + "model.gguf",
        );
        setContext(context);
        return;
      }
      // Otherwise download it first, then load it
      const res = await downloadResumable.downloadAsync();
      console.log("Finished downloading to ", res?.uri);
      if (!res?.uri) {
        console.log("no uri");
        return;
      }
      const context = await loadModel(res.uri);
      setContext(context);
    } catch (e) {
      console.error(e);
    }
  };

  useEffect(() => {
    downloadModel();
  }, []);

  return (
    <SafeAreaProvider>
      <SafeAreaView style={{ flex: 1, backgroundColor: "black" }}>
        {context && <Chat context={context} />}
      </SafeAreaView>
    </SafeAreaProvider>
  );
};
A simple useEffect() downloads the necessary files and loads our model. We set the context state with the LLM context and pass it to a component called Chat.
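Since downloading and loading can take a while, it may be worth rendering a simple fallback while context is still null. A small sketch using React Native's built-in ActivityIndicator:
import { ActivityIndicator } from "react-native";

// Inside the screen component's return:
<SafeAreaView style={{ flex: 1, backgroundColor: "black" }}>
  {context ? (
    <Chat context={context} />
  ) : (
    // Shown while the model is still downloading / loading
    <ActivityIndicator size="large" color="white" style={{ flex: 1 }} />
  )}
</SafeAreaView>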
Before we create the Chat.tsx component, let's define a sendMessage() function that will run a completion and return the generated text to the user:
// @/llama/llama.config
...
// Example stop sequences (the original project defines its own list; adjust for your model)
const stopWords = ["</s>", "<|end|>", "<|im_end|>", "<|EOT|>", "<|endoftext|>"];

export const sendMessage = async (context: LlamaContext, message: string) => {
  const msgResult = await context.completion(
    {
      messages: [
        {
          role: "user",
          content: message,
        },
      ],
      n_predict: 1000,
      stop: stopWords,
    },
    (data) => {
      // Partial-result callback: fires once per generated token (unused here)
      const { token } = data;
    },
  );
  return msgResult.text;
};
This function takes the llama context we created above in loadModel(), plus the message text that we will use for the completion.
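The second argument of context.completion() is the partial-result callback that fires once per generated token. The version above ignores it, but you could stream the answer into the UI as it is generated. A rough sketch, where onToken is a hypothetical callback you would wire into your chat state:
// @/llama/llama.config
export const sendMessageStreaming = async (
  context: LlamaContext,
  message: string,
  onToken: (token: string) => void, // hypothetical: hook this into your chat UI
) => {
  const msgResult = await context.completion(
    {
      messages: [{ role: "user", content: message }],
      n_predict: 1000,
      stop: stopWords,
    },
    (data) => {
      // Called for each new token as it is generated
      onToken(data.token);
    },
  );
  return msgResult.text;
};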
And now we are ready to create the UI for communicating with the LLM:
import { sendMessage } from "@/llama/llama.config";
import { LlamaContext } from "llama.rn";
import React, { useCallback, useState } from "react";
import { GiftedChat, IMessage } from "react-native-gifted-chat";
import { v4 as uuid } from "uuid"; // requires react-native-get-random-values

export default ({ context }: { context: LlamaContext }) => {
  const [messages, setMessages] = useState<IMessage[]>([]);

  const onSend = useCallback(async (newMessages: IMessage[]) => {
    const m = [...newMessages];
    // Run the completion on the user's latest message
    const text = await sendMessage(context, newMessages[0].text);
    const messObj = {
      _id: uuid(),
      createdAt: Date.now(),
      text,
      user: {
        _id: 0, // the model's user id
      },
    };
    m.push(messObj);
    setMessages((previousMessages) =>
      GiftedChat.append(previousMessages, m)
    );
  }, [context]);

  return (
    <GiftedChat
      messages={messages}
      onSend={(messages) => onSend(messages)}
      user={{
        _id: 1, // the human user's id
      }}
    />
  );
};
And here we are. Now let's see the results:
It's just amazing that you can run your own LLM locally, even on an emulator.
Of course, the downside of this approach is that mobile devices are not yet powerful enough to run full-scale models. Still, it's just the first step.
You can find the full code in the REPO.
Big kudos to DeepSeek, Unsloth, and of course llama.rn.
Hit the clap if it was useful and see you in the comments :)