AI & Machine Learning
14.0k Stars · 1.4k Forks · 127 Issues · Language: Python
AirLLM is a memory-optimized inference framework that runs 70B+ parameter LLMs on a single 4GB GPU without quantization. It decomposes a model into layer-wise shards that are loaded and unloaded dynamically during inference, so only a small slice of the model is resident in GPU memory at any time. It supports Llama, Mistral, QWen, and ChatGLM architectures, with optional 4-bit/8-bit compression for roughly 3x speed gains, and is cross-platform, including macOS. This makes massive AI models accessible on consumer hardware.
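The layer-wise idea above can be illustrated with a minimal, self-contained sketch (not AirLLM's actual code): only one "layer" is in memory at a time, and each is loaded, applied, then discarded before the next. The `load_layer` helper here is a hypothetical stand-in for reading a shard from disk.

python

# Minimal sketch of layer-wise inference, assuming layers can be
# loaded independently. In AirLLM the shards are real transformer
# layers read from disk; here each "layer" is a simple function.

def load_layer(i):
    # Stand-in for loading shard i from disk; returns a callable layer.
    return lambda x: x * 2 + i

def run_layerwise(x, num_layers):
    for i in range(num_layers):
        layer = load_layer(i)   # load one shard into memory
        x = layer(x)            # forward pass through that layer
        del layer               # unload it before loading the next
    return x

print(run_layerwise(1, 3))  # ((1*2+0)*2+1)*2+2 = 12

Peak memory is bounded by the largest single layer rather than the whole model, which is what lets a 70B model fit on a 4GB GPU at the cost of repeated disk I/O per token.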
git clone https://github.com/lyogavin/airllm.git
Quick Start Example
python
from airllm import AutoModel
# Run 70B model on 4GB GPU
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-70b-hf"
)
input_text = "Explain quantum computing"
output = model.generate(input_text, max_length=200)
print(output)