Running Local LLMs
A comprehensive guide to running Large Language Models (LLMs) on your local machine using various frameworks and tools.
Table of Contents
- Overview
- Requirements
- Frameworks and Tools
- llama.cpp
- Ollama
- HuggingFace Transformers
- HuggingFace Transformers - Quantized (BitsAndBytes)
- TorchAO
- vLLM
- GPT-NeoX
- Triton Inference Server (TensorRT backend)
- LM Studio
- Performance Comparison
- Contributing
- License
Overview
Running LLMs locally offers several advantages including privacy, offline access, and cost efficiency. This repository provides step-by-step guides for setting up and running LLMs using various frameworks, each with its own strengths and optimization techniques.
Requirements
General requirements for running LLMs locally:
- Hardware:
- CPU: Modern multi-core processor (8+ cores recommended)
- RAM: 16GB minimum, 32GB+ recommended
- GPU: NVIDIA GPU with 8GB+ VRAM (16GB+ recommended for larger models)
- Storage: 20GB+ free space (varies by model size)
- Software:
- Python 3.8+
- CUDA 11.7+ and cuDNN (for GPU acceleration)
- Git
Specific requirements are listed in each framework section.
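To sanity-check a machine against these requirements, a short script such as the following can help; this is a minimal sketch that assumes PyTorch is already installed.
import platform
import shutil
import torch
# Report Python version, GPU availability, VRAM, and free disk space
print("Python:", platform.python_version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
print(f"Free disk space: {shutil.disk_usage('.').free / 1e9:.1f} GB")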
Frameworks and Tools
llama.cpp
llama.cpp is a C/C++ implementation of LLaMA that’s optimized for CPU and GPU inference.
Installation
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build the project
make
# Download and convert a model (example with TinyLlama)
python3 -m pip install torch numpy sentencepiece
python3 convert.py <path_to_hf_model>
# Quantize the model (optional)
./quantize <path_to_model>/ggml-model-f16.bin <path_to_model>/ggml-model-q4_0.bin q4_0
Usage
# Run inference
./main -m <path_to_model>/ggml-model-q4_0.bin -n 512 -p "Write a short poem about programming:"
Advantages
- Extremely memory-efficient through quantization
- Works well on CPU-only setups
- Supports various model architectures (LLaMA, Mistral, Falcon, etc.)
- Available as a library for integration into other applications (a Python-bindings sketch follows below)
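For the library route, the separate llama-cpp-python bindings (installed with `pip install llama-cpp-python`) wrap the same engine from Python. A minimal sketch, with an illustrative model path:
from llama_cpp import Llama
# Load a locally quantized model (the path is an example; point it at your own file)
llm = Llama(model_path="./models/ggml-model-q4_0.bin", n_ctx=2048)
# Run a single completion
result = llm("Write a short poem about programming:", max_tokens=128)
print(result["choices"][0]["text"])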
Ollama
Ollama provides an easy way to run open-source LLMs locally with a simple API.
Installation
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download/windows
Usage
# Pull a model
ollama pull mistral
# Run a model
ollama run mistral
# API usage
curl -X POST http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "What is computational linguistics?"
}'
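The same endpoint can be called from Python; this is a minimal sketch that assumes the `requests` package and an Ollama server running on its default port.
import requests
# Single non-streaming generation request against the local Ollama server
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "What is computational linguistics?", "stream": False},
)
print(response.json()["response"])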
Advantages
- Simplified setup and usage
- Integrated model library with one-command downloads
- REST API for easy integration
- Cross-platform support
- No Python environment needed
HuggingFace Transformers
HuggingFace Transformers is a popular library that provides thousands of pre-trained models.
Installation
pip install transformers torch
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf" # Requires HF auth token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Generate text
inputs = tokenizer("Write a short story about:", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
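The same model can also be driven through the higher-level pipeline API, which bundles tokenization, generation, and decoding. A minimal sketch, assuming the accelerate package is installed so device_map can place the model automatically:
from transformers import pipeline
# The pipeline wraps tokenizer + model + decoding in one object
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf", device_map="auto")
print(generator("Write a short story about:", max_length=100)[0]["generated_text"])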
Advantages
- Extensive model support
- Easy integration with PyTorch ecosystem
- Rich documentation and community support
- Seamless model switching
HuggingFace Transformers - Quantized (BitsAndBytes)
Quantization with BitsAndBytes allows running larger models with reduced memory requirements.
Installation
pip install transformers torch bitsandbytes accelerate
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Configure quantization
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True
)
# Load model with quantization
model_name = "meta-llama/Llama-2-13b-hf" # Requires HF auth token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=quantization_config,
device_map="auto"
)
# Generate text
inputs = tokenizer("Explain quantum computing:", return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Advantages
- Run larger models on consumer hardware
- Minimal performance impact despite compression
- Compatible with most HuggingFace models
- 4-bit and 8-bit quantization options (an 8-bit variant is sketched below)
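For comparison, the 8-bit path only changes the quantization config; a minimal sketch using the same model as above:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# 8-bit weights: roughly half the footprint of fp16, less aggressive than 4-bit
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)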
TorchAO
TorchAO is PyTorch's native library for quantization and sparsity, enabling efficient inference through post-training optimization techniques.
Installation
pip install torch torchao transformers accelerate
Usage
import torch
from torchao.quantization import quantize_, int8_weight_only
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer (bfloat16 on GPU is the typical TorchAO setup)
model_name = "meta-llama/Llama-2-7b-hf"  # Requires HF auth token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Apply weight-only int8 quantization in place
# (the exact API names vary slightly between torchao versions)
quantize_(model, int8_weight_only())
# Generate text
inputs = tokenizer("Explain how solar panels work:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Advantages
- Advanced quantization techniques
- Hardware-specific optimizations
- Compatible with PyTorch ecosystem
- Flexible configuration options
GPT-NeoX
GPT-NeoX is EleutherAI's framework for training large language models, built on Megatron-LM and DeepSpeed. Its flagship open model, GPT-NeoX-20B, has 20 billion parameters; smaller EleutherAI models such as GPT-Neo (125M-2.7B) and GPT-J (6B) are also freely available on the Hugging Face Hub and load through the Transformers library.
Installation
pip install transformers torch accelerate
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load an EleutherAI model from the Hugging Face Hub
# (use "EleutherAI/gpt-neox-20b" if you have the VRAM for it)
model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
# Generate text
inputs = tokenizer("Write a short story about:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Advantages
- Large model capacity
- Open source and transparent development
- Built on Megatron-LM and DeepSpeed for efficient training and inference at scale
- Strong performance on various tasks
Triton Inference Server (TensorRT backend)
Triton Inference Server is an open-source, extensible AI inference server developed by NVIDIA. It supports a wide range of model types and backends (TensorRT, ONNX Runtime, PyTorch, Python), including transformer models such as BERT, RoBERTa, DistilBERT, and Electra, and it can serve LLMs through the TensorRT-LLM backend.
Installation
- Install Triton Inference Server: follow the quickstart in the official [Triton Inference Server repository](https://github.com/triton-inference-server/server).
Usage
- Prepare the model: convert your Hugging Face Transformers model to ONNX format, for example with the [Hugging Face Optimum exporter](https://github.com/huggingface/optimum) or torch.onnx.
- Export the ONNX model for TensorRT: Use the TensorRT optimizer to convert your ONNX model to a TensorRT engine.
- Deploy the model in Triton Inference Server: create a config.pbtxt file for your model, then start the server pointing at a model repository that contains the configuration file and the TensorRT engine.
- Use the model in your application: integrate Triton Inference Server using the Triton Inference Client libraries or any other method that suits your needs (a Python client sketch follows below).
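For the last step, NVIDIA's tritonclient package provides HTTP and gRPC clients. The sketch below is illustrative only: the model name and the input/output tensor names and shapes ("my_model", "input_ids", "logits") are assumptions and must match your exported model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient
# Connect to a locally running Triton server (HTTP endpoint defaults to port 8000)
client = httpclient.InferenceServerClient(url="localhost:8000")
# Build the request; tensor names, shapes, and dtypes must match your config.pbtxt
token_ids = np.array([[101, 2054, 2003, 1037, 2312, 2653, 2944, 1029, 102]], dtype=np.int64)
infer_input = httpclient.InferInput("input_ids", list(token_ids.shape), "INT64")
infer_input.set_data_from_numpy(token_ids)
# Run inference and read back the assumed output tensor
result = client.infer(model_name="my_model", inputs=[infer_input])
print(result.as_numpy("logits").shape)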
Advantages
- High performance through the use of TensorRT optimizations
- Flexible deployment options (e.g., on-premises, in the cloud)
- Supports a wide variety of models and customization
- Efficient handling of multiple models at scale
vLLM
vLLM is a high-throughput and memory-efficient inference engine.
Installation
pip install vllm
Usage
from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM(model="meta-llama/Llama-2-7b-hf") # Requires HF auth token
# Set sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=100
)
# Generate text
prompts = ["Write a short story about space exploration:"]
outputs = llm.generate(prompts, sampling_params)
# Print generated text
for output in outputs:
    print(output.outputs[0].text)
Command-line usage
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf
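Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using the openai Python package; the server listens on port 8000 by default, and the API key can be any placeholder unless you configure one:
from openai import OpenAI
# Point the OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="Write a short story about space exploration:",
    max_tokens=100,
)
print(completion.choices[0].text)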
Advantages
- PagedAttention for efficient memory usage
- Multi-GPU support with tensor parallelism
- OpenAI-compatible API
- High throughput for batch processing
- Efficient continuous batching
LM Studio
LM Studio is a desktop application for running local LLMs with a graphical interface.
Installation
- Download the installer from https://lmstudio.ai/
- Install and launch the application
Usage
- Download models from the built-in model library
- Configure inference parameters using the GUI
- Chat with the model through the interface
- Optionally expose an OpenAI-compatible API server (see the client sketch below)
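The local server speaks the same OpenAI-compatible protocol as the vLLM example above. The sketch below assumes the server address LM Studio displays in its server panel (commonly http://localhost:1234/v1) and uses a placeholder model identifier; substitute whatever the app reports.
from openai import OpenAI
# Address and model name come from the LM Studio server panel
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
reply = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier LM Studio displays
    messages=[{"role": "user", "content": "Summarize what an LLM is in one sentence."}],
)
print(reply.choices[0].message.content)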
Advantages
- User-friendly GUI
- No coding required
- Built-in model discovery and management
- Visual parameter tuning
- Performance metrics visualization
- Compatible with various model formats
Performance Comparison
| Framework | Memory Usage | Inference Speed | Setup Complexity | GPU Support | CPU Support |
|---|---|---|---|---|---|
| llama.cpp | Very Low | Moderate | Moderate | Good | Excellent |
| Ollama | Low | Good | Very Low | Good | Good |
| HF Transformers | High | Moderate | Low | Excellent | Good |
| HF - BitsAndBytes | Moderate | Good | Low | Excellent | Limited |
| TorchAO | Moderate | Good | Moderate | Excellent | Good |
| vLLM | Moderate | Excellent | Moderate | Excellent | Limited |
| LM Studio | Varies | Good | Very Low | Good | Good |
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.