python convert.py models/llama-13b/
./quantize models/llama-13b/ggml-model-f16.gguf models/llama-13b/q4_k_m.gguf q4_k_m
This is where quantization comes in. It’s a compression technique that reduces the precision of the model's numbers (weights) from high-precision floating-point values (like 32-bit floats) down to small integers (like 4-bit integers). It’s like taking a high-resolution RAW photo and converting it to a compressed JPEG. You lose some nuance, but the file size drops by roughly 90%, and for most people, the picture looks the same.
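To make the idea concrete, here is a minimal sketch of the precision reduction described above, in plain Python. It uses simple symmetric quantization with one shared scale for a small block of weights. This is an illustrative assumption, not the actual q4_k_m scheme, which is more elaborate. The function names and the sample weights are made up for the example.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-8, 7] using one shared scale.

    Illustrative symmetric quantization; not the real q4_k_m format.
    """
    max_abs = max(abs(w) for w in weights)
    # Pick the scale so the largest weight lands on +/-7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [v * scale for v in quantized]


# Hypothetical sample weights, chosen only for demonstration.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.64]

q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)

# Worst-case difference between an original weight and its restored value.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("4-bit integers:", q)
print("largest rounding error:", max_error)
```

Each weight now fits in 4 bits instead of 32, which is an 8x reduction before counting the small overhead of storing the scale. The rounding error per weight is at most half the scale, which is the "lost nuance" in the photo analogy.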