Simplified Tutorial on Running LLMs (Llama 3) Locally with llama.cpp

The landscape of AI is evolving rapidly, and with it the accessibility of powerful large language models (LLMs). No longer are these models confined to cloud servers; running them locally is becoming increasingly feasible. This guide focuses on leveraging llama.cpp, a remarkable open-source C++ library, to run LLMs such as Llama 3 on your own hardware. Whether you’re a seasoned developer, a curious machine learning enthusiast, or someone simply looking to explore the capabilities of LLMs without relying on external servers, this tutorial provides a clear, step-by-step path to getting started with llama.cpp. This easy-to-install library is designed to optimize LLM inference on various hardware configurations, from your everyday desktop computer to more robust cloud-based infrastructure. The power of LLMs at your fingertips!
What You Need to Get Started
Before embarking on this journey, it’s crucial to ensure your system meets the minimum requirements. A recommended starting point is having at least 8 GB of VRAM. However, llama.cpp shines in its ability to adapt to resource constraints through various optimizations, particularly model quantization, which we will delve into later in this guide.
This tutorial is tailored for models like Llama-3-8B-Instruct, but the principles and techniques discussed are applicable to a wide range of models available on platforms like Hugging Face.
Understanding llama.cpp
So, what exactly is llama.cpp? At its core, it’s a streamlined C++ library engineered to simplify the process of running LLMs locally. It prioritizes efficient inference across diverse hardware setups, spanning from basic desktop configurations to high-performance cloud servers.
By choosing llama.cpp, you gain access to a suite of benefits:
- Cross-Platform Compatibility: Works seamlessly across various operating systems, including Windows, macOS, and Linux.
- Optimization: Focuses on efficient memory usage and accelerated computation, enabling LLMs to run on consumer-grade hardware.
- Quantization Support: Enables you to reduce the memory footprint of LLMs, making them accessible to systems with limited resources.
- Community Support: Benefits from a vibrant and active open-source community, constantly contributing to its improvement and expansion.
Challenges You Might Face
While llama.cpp is a powerful tool, it’s important to acknowledge potential hurdles:
- Technical Proficiency: Requires familiarity with command-line interfaces and basic software compilation.
- Hardware Limitations: Performance is still constrained by the capabilities of your hardware. Larger models may require more powerful systems.
- Setup Complexity: Initial setup can be somewhat involved, requiring careful attention to detail.
- Dependency Management: Managing dependencies can be challenging, especially on Windows.
Step-by-Step Setup
Let’s walk through the installation and setup process, ensuring you can get llama.cpp up and running smoothly with the Llama 3 model.
1. Cloning the Repository
First, you’ll need to download the llama.cpp repository from GitHub using the following commands in your terminal:
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
2. Building llama.cpp
Next, you’ll need to build the project. The build process varies slightly depending on your operating system.
For macOS with Metal support, execute the following command:
$ make llama-server
Alternatively, you can use CMake:
$ cmake -B build
$ cmake --build build --config Release
For Windows with CUDA support, the commands are as follows:
C:\Users\Bob> make llama-server LLAMA_CUDA=1
Or, using CMake:
C:\Users\Bob> cmake -B build -DLLAMA_CUDA=ON
C:\Users\Bob> cmake --build build --config Release
Alternatively, you can download a prebuilt release from the official repository.
Once the build process is complete, you’re ready to start using the model.
Downloading and Preparing the Model
In this example, we’ll use the Meta-Llama-3-8B-Instruct model, but you can adapt these steps for any model you prefer.
1. Install Hugging Face CLI
To begin, install the Hugging Face command-line interface using pip:
$ pip install -U "huggingface_hub[cli]"
Create an account on Hugging Face if you don’t already have one, and generate your access token from their settings page. You’ll need this to access the models.
2. Login to Hugging Face
Log in to Hugging Face using your credentials:
$ huggingface-cli login
After logging in, accept the terms for the Llama-3-8B-Instruct model and await access approval.
3. Downloading the Model
You have two main options when downloading the model: non-quantized or GGUF quantized.
- Non-Quantized: This option downloads the model in its original precision, offering the highest possible accuracy but requiring more resources.
- GGUF Quantized: This option downloads a pre-quantized version of the model, reducing its size and memory footprint, making it suitable for systems with limited resources.
To download the non-quantized model:
$ huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --exclude "original/*" --local-dir models/Meta-Llama-3-8B-Instruct
After downloading, install Python dependencies and convert the model to GGUF format:
$ python -m pip install -r requirements.txt
$ python convert-hf-to-gguf.py models/Meta-Llama-3-8B-Instruct
If you’re working with hardware constraints, you can download a quantized version directly:
$ huggingface-cli download path_to_gguf_model --exclude "original/*" --local-dir models/Meta-Llama-3-8B-Instruct
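As an alternative to the CLI, the downloads can also be scripted with the huggingface_hub Python library, which the CLI wraps. A minimal sketch for the non-quantized repository, assuming you are already logged in and have been granted access, looks like this:
from huggingface_hub import snapshot_download

# Download the full (non-quantized) model repository, skipping the original/*
# checkpoint files, into the same local directory used by the CLI command above.
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="models/Meta-Llama-3-8B-Instruct",
    ignore_patterns=["original/*"],
)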
Using Quantization for Hardware Optimization
Quantization is a crucial technique that allows you to run models on devices with less memory, such as systems with less than 16 GB of VRAM. By reducing the precision of model weights (e.g., from 16-bit to 4-bit), you can significantly reduce memory consumption without drastically sacrificing performance.
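To get a feel for the savings, here is a rough back-of-the-envelope estimate for an 8-billion-parameter model; the numbers cover weights only and ignore the KV cache and runtime buffers:
# Back-of-the-envelope weight-memory estimate for an 8B-parameter model.
params = 8e9
gb = 1e9

fp16_bytes = params * 2      # 16 bits = 2 bytes per weight -> ~16 GB
q4_bytes = params * 4 / 8    # a pure 4-bit encoding -> ~4 GB
# K-quants such as Q4_K_M store extra scales and metadata, so real files land
# somewhat higher (the Q4_K_M Llama 3 8B GGUF is roughly 4.5-5 GB).

print(f"FP16 : ~{fp16_bytes / gb:.0f} GB")
print(f"4-bit: ~{q4_bytes / gb:.0f} GB")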
To quantize the model:
$ ./llama-quantize ./models/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf ./models/Meta-Llama-3-8B-Instruct/ggml-model-Q4_K_M.gguf Q4_K_M
This command will generate a quantized model ready for local inference.
Running llama-server
Once your model is set up, you can launch the llama-server to handle HTTP requests, allowing you to interact with the model using standard APIs.
For macOS:
$ ./llama-server -m models/Meta-Llama-3-8B-Instruct/ggml-model-Q4_K_M.gguf -c 2048
On Windows, use:
C:\Users\Bob> llama-server.exe -m models\Meta-Llama-3-8B-Instruct\ggml-model-Q4_K_M.gguf -c 2048
You can now start sending requests to http://localhost:8080.
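Before building anything more elaborate, it’s worth confirming that the server is reachable. Recent llama.cpp builds expose a /health endpoint and an OpenAI-compatible /v1/chat/completions endpoint; a quick smoke test from Python, assuming the default port and those endpoint paths, could look like this:
import requests

server_url = "http://localhost:8080"

# Check that llama-server is up and has finished loading the model.
health = requests.get(f"{server_url}/health")
print(health.status_code, health.text)

# Send a single chat completion request to the OpenAI-compatible endpoint.
payload = {
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
reply = requests.post(f"{server_url}/v1/chat/completions", json=payload)
print(reply.json()["choices"][0]["message"]["content"])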
Building a Python Chatbot
You can create a simple Python chatbot to interact with your model. Here’s a basic script to send requests to the llama-server:
import requests

def get_response(server_url, messages, temperature=0.7, max_tokens=4096):
    # Send a chat completion request to llama-server's OpenAI-compatible endpoint.
    data = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": False,
    }
    response = requests.post(f"{server_url}/v1/chat/completions", json=data)
    response.raise_for_status()
    # Extract the assistant's reply from the OpenAI-style response payload.
    return response.json()["choices"][0]["message"]["content"]

def chatbot(server_url):
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break
        messages.append({"role": "user", "content": user_input})
        reply = get_response(server_url, messages)
        # Keep the assistant's reply in the history so the model retains context.
        messages.append({"role": "assistant", "content": reply})
        print("Assistant:", reply)

if __name__ == "__main__":
    chatbot("http://localhost:8080")
This script establishes a chatbot that communicates with the llama-server to generate responses based on user input. Running LLMs locally using llama.cpp opens a world of possibilities.
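The get_response function above waits for the complete response before printing anything. If you would rather see tokens appear as they are generated, you can set stream to true and read the server’s server-sent events. Here is a minimal sketch, assuming llama-server follows the OpenAI-style streaming format (lines prefixed with data: and a final data: [DONE] marker):
import json
import requests

def stream_response(server_url, messages, temperature=0.7, max_tokens=4096):
    # Request a streaming chat completion and print tokens as they arrive.
    data = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": True,
    }
    with requests.post(f"{server_url}/v1/chat/completions", json=data, stream=True) as r:
        r.raise_for_status()
        full_reply = []
        for line in r.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            # Each chunk carries an incremental "delta" in OpenAI streaming format.
            delta = json.loads(chunk)["choices"][0]["delta"]
            token = delta.get("content") or ""
            print(token, end="", flush=True)
            full_reply.append(token)
        print()
        return "".join(full_reply)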
Conclusion
Utilizing llama.cpp to run large language models like Llama 3 locally presents a potent and efficient solution, particularly when high-performance inference is paramount. Despite its inherent complexity, llama.cpp offers significant flexibility with multi-platform support, quantization techniques, and hardware acceleration. While it may pose initial challenges for newcomers, its thriving community and comprehensive features make it a leading choice for developers and researchers alike.
For those willing to dive deep, llama.cpp unlocks endless opportunities for local and cloud-based LLM inference.
Alternative Solutions
While llama.cpp is a robust solution for running LLMs locally, it’s not the only option. Here are two alternative approaches:
1. Using Transformers and Optimum Intel
The Hugging Face transformers library provides a high-level interface for working with LLMs. Coupled with optimum-intel, which optimizes models for Intel CPUs, this offers a potentially simpler path for local inference, especially if you are already familiar with the Hugging Face ecosystem.
Explanation:
- transformers provides easy access to various LLMs.
- optimum-intel leverages Intel’s Deep Learning Boost (DL Boost) instruction set for faster inference.
- This approach avoids the need for manual compilation of C++ code, simplifying the setup process.
Code Example:
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.intel import OVModelForCausalLM
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
# Optimize the model for Intel CPUs (optional)
#model = OVModelForCausalLM.from_pretrained(model_id, export=True)
# Generate text
prompt = "Write a short story about a cat who goes on an adventure."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=200, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
Note: The OVModelForCausalLM class is optimized for Intel CPUs and can dramatically improve speed. Remove this if you are using a GPU.
2. Using MLC LLM
MLC LLM is a universal deployment solution that allows you to deploy any language model natively on a diverse set of hardware backends. It’s designed to be efficient and portable.
Explanation:
- MLC LLM provides a unified interface for running LLMs on various hardware, including CPUs, GPUs (Nvidia, AMD, Apple Silicon), and even web browsers.
- It uses a compiler-driven approach to optimize the model for the target hardware.
- MLC LLM supports a wide range of models and quantization techniques.
Code Example:
While a full code example would be extensive, the general steps are:
- Install MLC LLM: Follow the installation instructions on the MLC LLM website, which usually involves installing a Python package and potentially some system dependencies.
- Download Model Artifacts: MLC LLM requires specific model artifacts. These are usually downloaded using a provided command-line tool.
- Run the Model: Use the MLC LLM Python API or command-line interface to load and run the model.
The precise code will depend on the specific model and hardware you’re targeting, so consult the MLC LLM documentation for detailed instructions; a rough sketch of the Python side is shown below. MLC LLM is another powerful way to run Llama 3 locally.
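The sketch below uses MLC LLM’s OpenAI-style Python engine API. The MLCEngine class and the prebuilt model ID (HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC) are taken from the project’s quick-start examples and may change between releases, so treat this as illustrative rather than definitive and verify the names against the current documentation:
from mlc_llm import MLCEngine

# Prebuilt, pre-quantized Llama 3 8B artifact published by the MLC team.
# The exact repository name is an assumption; check the MLC LLM model list.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# The engine mirrors the OpenAI chat completions API.
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    model=model,
    stream=False,
)
print(response.choices[0].message.content)

# Shut down the background engine threads when done.
engine.terminate()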
These alternatives offer different trade-offs between ease of use, performance, and hardware compatibility. Depending on your specific needs and technical expertise, one of these options might be a better fit than llama.cpp.