Running a DeepSeek R1 distilled model locally on a mobile device using React Native and llama.rn


Hello everyone. The hype around the new open-source Chinese LLM keeps growing. We already know we can run this model* locally on any computer, but what about mobile devices?

* We will use a distilled and quantized version of the original DeepSeek R1.

First, let’s talk a little bit about distillation and quantization.

1. Distillation creates a smaller version of an LLM. The distilled LLM generates predictions much faster and requires fewer computational and environmental resources than the full LLM. However, the distilled model’s predictions are generally not quite as good as the original LLM’s.

2. Quantization is a compression technique that maps high-precision values to lower-precision ones. For an LLM, that means reducing the precision of its weights and activations, making the model less memory-intensive. This does have an impact on the model’s capabilities, including its accuracy.

In other words, a less distilled and less quantized model will produce results with better quality and precision; the trade-offs are a larger size and higher computational and environmental costs (a toy sketch of the quantization idea follows below).
For example, the original DeepSeek R1 model has 671 billion parameters and requires at least 800GB of HBM (High Bandwidth Memory).
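
To make the quantization idea concrete, here is a toy sketch in TypeScript that maps float weights to 8-bit integers and back. It is only an illustration; real GGUF quantization schemes such as Q5_K_M work block-wise with per-block scales.

// Toy illustration of quantization, not the actual GGUF/Q5_K_M algorithm.
const quantize = (weights: number[]) => {
  const maxAbs = Math.max(...weights.map(Math.abs));
  const scale = maxAbs / 127; // one scale factor for the whole tensor
  const q = weights.map((w) => Math.round(w / scale)); // int8 values in [-127, 127]
  return { q, scale };
};

const dequantize = (q: number[], scale: number) => q.map((v) => v * scale);

// 32-bit floats become 8-bit ints: roughly 4x smaller, at the cost of rounding error.
const { q, scale } = quantize([0.12, -0.53, 0.97, -0.08]);
console.log(dequantize(q, scale)); // approximately the original weights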

We will start our project with the smallest distilled R1 version provided by Unsloth: DeepSeek-R1-Distill-Qwen-1.5B

The Stack:

So let’s choose the tech stack we will run on:
1. React Native (expo) + react-native-gifted-chat + ExpoFileSystem
2. llama.rn

And that’s it! This is all you need to run an LLM locally!

Preparation:

1. Let’s start a new project

npx create-expo-app@latest ./DeepSeekMobile

2. Install necessary dependencies:

npx expo install react-native-gifted-chat react-native-reanimated react-native-safe-area-context react-native-get-random-values llama.rn expo-file-system

llama.rn is the library that will allow us to run the model on the device.

expo-file-system will help us download the model and store it in the device’s storage.

All the other dependencies provide a better UX for interacting with the model.

We also need to pick the model we will use, in its GGUF version. I will take it from the Unsloth repo on Hugging Face:

https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf

Implementation:

Let’s start implementation.

First, we need to get the model and save it on the device. Let’s use expo-file-system.

Downloading the model with a resumable download (we will add the check for an already-downloaded file in the screen component below):

import * as FileSystem from "expo-file-system";

const downloadLink =
  "https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf";

const downloadResumable = FileSystem.createDownloadResumable(
  downloadLink,
  FileSystem.documentDirectory + "model.gguf",
  {},
  (progress) => {
    console.log("downloading", progress);
  },
);

const downloadModel = async () => {
  try {
    const res = await downloadResumable.downloadAsync();
    return res;
  } catch (e) {
    return null;
  }
};

The model we are using weighs about 700MB, so the download will take a little while.
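
Since the file is that large, it is worth surfacing the download progress to the user. The progress callback of createDownloadResumable receives an object with totalBytesWritten and totalBytesExpectedToWrite, which can be turned into a percentage. Below is a minimal sketch; setProgress is a hypothetical state setter from your own component, and downloadLink is the same URL as above.

import * as FileSystem from "expo-file-system";

// Hypothetical pieces from your component: the model URL from above and a
// state setter such as the one returned by useState(0).
declare const downloadLink: string;
declare const setProgress: (percent: number) => void;

const downloadWithProgress = FileSystem.createDownloadResumable(
  downloadLink,
  FileSystem.documentDirectory + "model.gguf",
  {},
  ({ totalBytesWritten, totalBytesExpectedToWrite }) => {
    // Convert the byte counters into a 0-100 percentage for the UI.
    setProgress(Math.round((totalBytesWritten / totalBytesExpectedToWrite) * 100));
  },
);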

Let’s configure llama.rn to load our model correctly.
I created a file called @/llama/llama.config.ts:

// @/llama/llama.config

import { initLlama, LlamaContext } from "llama.rn";

...

export const loadModel = async (modelPath: string) => {
  const context = await initLlama({
    model: modelPath,
    use_mlock: true,
    n_ctx: 131072,
    n_gpu_layers: 1, // > 0: enable Metal on iOS
    // embedding: true, // use embedding
  });

  return context;
};

The initLlama function returns a context object, which exposes API handlers for communicating with the loaded model. You can find more configuration options here: https://github.com/mybigday/llama.rn

And don't forget to star their project; they did an amazing job!
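
One more thing worth adding to the config file: the context keeps the whole model in memory, so it should be freed when it is no longer needed. llama.rn exposes release helpers for this (see its README); here is a minimal cleanup sketch:

// @/llama/llama.config (continued)

import { releaseAllLlama, LlamaContext } from "llama.rn";

// Free the memory held by a single context once you are done with it.
export const unloadModel = async (context: LlamaContext) => {
  await context.release();
};

// Or tear down every loaded context at once, e.g. when leaving the screen.
export const unloadAll = async () => {
  await releaseAllLlama();
};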

Let’s connect it all together into a React Native screen:

import React, { useEffect, useState } from "react";
import * as FileSystem from "expo-file-system";
import { loadModel } from "../llama/llama.config";
import Chat from "@/components/Chat";
import { SafeAreaProvider, SafeAreaView } from "react-native-safe-area-context";
import { LlamaContext } from "llama.rn";

const downloadLink =
  "https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf";

export default () => {
  const [context, setContext] = useState<LlamaContext | null | undefined>(null);

  const downloadResumable = FileSystem.createDownloadResumable(
    downloadLink,
    FileSystem.documentDirectory + "model.gguf",
    {},
    () => {
      console.log("downloading");
    },
  );

  const downloadModel = async () => {
    try {
      // Reuse the model if it has already been downloaded.
      const isExists = (
        await FileSystem.getInfoAsync(FileSystem.documentDirectory + "model.gguf")
      ).exists;

      if (isExists) {
        const context = await loadModel(FileSystem.documentDirectory + "model.gguf");
        setContext(context);
        return;
      }

      const res = await downloadResumable.downloadAsync();
      console.log("Finished downloading to ", res?.uri);

      if (!res?.uri) {
        console.log("no uri");
        return;
      }

      const context = await loadModel(res.uri);
      setContext(context);
    } catch (e) {
      console.error(e);
    }
  };

  useEffect(() => {
    downloadModel();
  }, []);

  return (
    <SafeAreaProvider>
      <SafeAreaView style={{ flex: 1, backgroundColor: "black" }}>
        {context && <Chat context={context} />}
      </SafeAreaView>
    </SafeAreaProvider>
  );
};

A simple useEffect() downloads the necessary file and loads our model.
We set the context state to our LLM context and pass it to a component called Chat.

Before we create the Chat.tsx component, let’s define a sendMessage() function that creates a completion and returns the text to the user:

// @/llama/llama.config

...

// stopWords was not defined in the original snippet; these are example stop
// tokens for the DeepSeek R1 distill chat template.
const stopWords = ["</s>", "<｜end▁of▁sentence｜>", "<｜User｜>", "<｜Assistant｜>"];

export const sendMessage = async (context: LlamaContext, message: string) => {
  const msgResult = await context.completion(
    {
      messages: [
        {
          role: "user",
          content: message,
        },
      ],
      n_predict: 1000,
      stop: stopWords,
    },
    (data) => {
      // Partial completion callback: receives each token as it is generated.
      const { token } = data;
    },
  );

  return msgResult.text;
};

This function receives the llama context that we created above in the loadModel() function, and the message text that we will use for the completion.
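
The second argument to context.completion() is a partial-result callback that fires for every generated token, so instead of waiting for the full reply you can stream it into the UI as it is produced. Here is a minimal streaming variant of the same function; onToken is a hypothetical callback that, for example, appends tokens to the chat state.

// @/llama/llama.config (continued)

export const sendMessageStreaming = async (
  context: LlamaContext,
  message: string,
  onToken: (token: string) => void, // hypothetical: e.g. appends tokens to UI state
) => {
  const msgResult = await context.completion(
    {
      messages: [{ role: "user", content: message }],
      n_predict: 1000,
      stop: stopWords,
    },
    (data) => {
      // Called once per generated token while the completion is running.
      onToken(data.token);
    },
  );

  return msgResult.text;
};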

And now we are ready to create the UI for communicating with the LLM:

import "react-native-get-random-values"; // required for uuid on React Native
import { sendMessage } from "@/llama/llama.config";
import { LlamaContext } from "llama.rn";
import React, { useCallback, useState } from "react";
import { GiftedChat } from "react-native-gifted-chat";
import { v4 as uuid } from "uuid";

export default ({ context }: { context: LlamaContext }) => {
  const [messages, setMessages] = useState<any[]>([]);

  const onSend = useCallback(async (messages: any) => {
    const m = [...messages];
    // Ask the model for a completion of the user's message.
    const text = await sendMessage(context, messages[0].text);

    // The model's reply, rendered as a message from user _id: 0.
    const messObj = {
      _id: uuid(),
      createdAt: Date.now(),
      text,
      user: {
        _id: 0,
      },
    };

    m.push(messObj);

    setMessages((previousMessages) =>
      GiftedChat.append(previousMessages, m),
    );
  }, [context]);

  return (
    <GiftedChat
      messages={messages}
      onSend={(messages) => onSend(messages)}
      user={{
        _id: 1,
      }}
    />
  );
};

And here we are. Now let’s see the results:

It’s just amazing that you can run your own LLM locally, even on an emulator.

Of course, the downside of this implementation is that mobile devices are not yet powerful enough to run full-scale models. Still, it’s a first step.

You can find the full code in the REPO.

Big kudos to DeepSeek, Unsloth, and of course llama.rn.

Hit the clap if it was useful and see you in the comments :)
