S-LoRA: Serving Thousands of Concurrent LoRA Adapters [paper]

Latest News

A fair scheduler VTC (paper) has been integrated into S-LoRA. See file slora/server/router/vtc_req_queue.py.

Abstract

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.

Requirements

CUDA 11.8 compatible GPU
- Recommended: GPUs from the Ampere family, like the A100, which support bfloat16 operations.
- Note: Older GPUs from the Turing family like the T4, which do not support bfloat16, are not supported.
1.13 <= PyTorch <= 2.0.1

Installation

conda create -n slora python=3.9
conda activate slora
# Optional: Install CUDA via conda for a smoother installation experience,
# but you may need to manually set the Anaconda path variables.
# conda install cuda -c nvidia/label/cuda-11.8.0
# set environment variables: export TORCH_CUDA_ARCH_LIST="8.0 8.6"
pip install torch==2.0.1
pip install -e .

Make sure triton==2.1.0

For more details on installing CUDA via conda, refer to the CUDA Installation Guide by NVIDIA.

Example Run

Real model weights

cd benchmarks
python launch_server.py --num-adapter 100 --num-token 10000 --model-setting Real
python run_exp.py --debug --model-setting Real

Dummy weights

cd benchmarks
python launch_server.py --num-adapter 100 --num-token 10000 --dummy
python run_exp.py --debug

Test