Quantize with conviction.
litmus-lab benchmarks your model across FP16, INT8 and INT4 on your own GPU — then a deterministic offline engine tells you exactly which precision to ship.
// the problem
Every tool tells you “INT4 uses less memory.”
That number decides nothing.
Memory reduction alone doesn't determine deployment quality. The same quantized model can quietly:
become slower
Lower precision can tank throughput instead of helping it.
lose coherence
Quantization error accumulates in the weights and generations fall apart.
spike TTFT
Latency to first token balloons under some kernels.
destabilize logits
4-bit quantization can push fragile architectures into numerical instability.
save almost nothing
Sometimes the VRAM you reclaim isn't worth the speed and quality you give up.
litmus-lab exists to mathematically decide whether quantization is actually worth deploying — on your hardware.
// capabilities
Everything you need to ship quantized — with proof.
Multi-precision benchmarking
Profile Native FP16, INT8 and INT4 (NF4) on the exact same prompt and architecture — measured side by side, not estimated.
Offline recommendation engine
A deterministic, rule-based heuristic weighs VRAM, speed and perplexity to output a single deployment verdict. No APIs. No hallucinations.
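To make the idea concrete, here is a minimal sketch of what a deterministic, rule-based verdict could look like. It is not litmus-lab's actual engine: the `PrecisionResult` type, the `recommend` function and all three thresholds are hypothetical, and the numbers in the usage lines are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrecisionResult:
    """One benchmark pass (hypothetical structure, not litmus-lab's own)."""
    name: str
    peak_vram_gb: float
    tokens_per_sec: float
    perplexity: float


def recommend(baseline, candidates, *,
              max_ppl_increase=0.05,   # assumed threshold: at most +5% perplexity
              max_slowdown=0.10,       # assumed threshold: at most 10% slower
              min_vram_saving=0.20):   # assumed threshold: at least 20% VRAM saved
    """Return the name of the precision to deploy.

    Candidates are tried from smallest to largest memory footprint; the first
    one that passes every rule wins, otherwise the baseline is kept.
    """
    for c in sorted(candidates, key=lambda r: r.peak_vram_gb):
        ppl_increase = c.perplexity / baseline.perplexity - 1.0
        slowdown = 1.0 - c.tokens_per_sec / baseline.tokens_per_sec
        vram_saving = 1.0 - c.peak_vram_gb / baseline.peak_vram_gb
        if (ppl_increase <= max_ppl_increase
                and slowdown <= max_slowdown
                and vram_saving >= min_vram_saving):
            return c.name
    return baseline.name


# Made-up numbers, for illustration only.
fp16 = PrecisionResult("FP16", peak_vram_gb=15.0, tokens_per_sec=40.0, perplexity=8.0)
int8 = PrecisionResult("INT8", peak_vram_gb=8.5, tokens_per_sec=22.0, perplexity=8.1)
int4 = PrecisionResult("INT4 (NF4)", peak_vram_gb=5.5, tokens_per_sec=38.0, perplexity=8.3)
print(recommend(fp16, [int8, int4]))  # INT4 (NF4)
```

Because the rules are plain comparisons against fixed thresholds, the same measurements always produce the same verdict.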
VRAM isolation & cleanup
Every pass runs in an isolated worker with aggressive CUDA cache flushing, GC and IPC clearing — so memory leaks never fake your VRAM readings.
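As a rough sketch of the cleanup half of this, the function below shows one plausible way to flush GPU state between passes using standard PyTorch calls. The function name `release_gpu_memory` is hypothetical, and whether litmus-lab performs exactly these steps is an assumption; the worker isolation itself is not shown.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def release_gpu_memory() -> bool:
    """Free cached GPU memory between passes (hypothetical helper).

    Returns True if CUDA cleanup ran, False if no CUDA device was usable.
    """
    gc.collect()  # drop unreachable Python objects that still hold tensors
    if torch is None or not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()              # return cached blocks to the driver
    torch.cuda.ipc_collect()              # release memory held by CUDA IPC handles
    torch.cuda.reset_peak_memory_stats()  # next pass starts its peak from zero
    return True
```

Resetting the peak statistics matters as much as freeing memory: without it, a later pass would report the high-water mark left by an earlier one.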
Context-length protection
Reads max_position_embeddings and scales test sequences safely — so fragile older architectures never crash on out-of-bound indices.
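The clamping step can be sketched as follows. This is an illustration under stated assumptions, not litmus-lab's code: the helper name `safe_prompt_length` is hypothetical, and the limit is assumed to come from the `max_position_embeddings` field of the model's Hugging Face config.

```python
def safe_prompt_length(requested_tokens: int,
                       max_new_tokens: int,
                       max_position_embeddings: int) -> int:
    """Clamp the prompt length so prompt + generated tokens fit the context.

    max_position_embeddings is assumed to be read from the model config.
    """
    if requested_tokens < 1 or max_new_tokens < 0:
        raise ValueError("token counts must be positive")
    budget = max_position_embeddings - max_new_tokens
    if budget < 1:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    return min(requested_tokens, budget)


# A 1024-position model asked for a 4096-token prompt plus 128 new tokens.
print(safe_prompt_length(4096, 128, 1024))   # 896
# A long-context model leaves the request untouched.
print(safe_prompt_length(512, 128, 32768))   # 512
```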
Beautiful terminal dashboard
Every benchmark renders as a clean, rich-formatted table right in your CLI — readable at a glance, copy-paste ready for a report.
// four signals
The numbers that actually decide a deployment.
Peak GPU memory allocated
reclaimed at INT4
Generation throughput
higher is better
Time to first token
lower is better
Linguistic degradation
small is safe
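For readers who want to see how three of these signals are typically derived, here is a small sketch. It assumes "linguistic degradation" means a perplexity difference against the FP16 baseline, and that tokens arrive through some streaming iterator; the names `measure_stream`, `StreamStats` and `perplexity_delta` are hypothetical.

```python
import math
import time
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class StreamStats:
    tokens: int
    ttft_s: Optional[float]   # time to first token, None if nothing was generated
    tokens_per_sec: float


def measure_stream(token_iter: Iterable) -> StreamStats:
    """Time any iterator that yields generated tokens one by one."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    rate = count / total if total > 0 else 0.0
    return StreamStats(tokens=count, ttft_s=ttft, tokens_per_sec=rate)


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def perplexity_delta(baseline_nlls: Sequence[float],
                     quantized_nlls: Sequence[float]) -> float:
    """Positive values mean the quantized model is more 'surprised' by the text."""
    return perplexity(quantized_nlls) - perplexity(baseline_nlls)
```

Peak GPU memory is not covered here because it has to be read from the GPU runtime rather than computed.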
// three steps
From pip install to a deployment verdict.
Install
One pip command. Zero config, no API keys, nothing leaves your machine.
$ pip install litmus-lab
Profile
Point it at any Hugging Face causal LM and a prompt. It runs FP16, INT8 and INT4 in isolated passes.
$ litmus-lab --model Qwen/Qwen2.5-7B-Instruct \
    --prompt "Explain transformers"
Deploy with a verdict
Read the table, get a deterministic recommendation, and ship the precision that's actually best for your GPU.
$ → Recommendation: Deploy INT4 (NF4)
Works with most Hugging Face causal language models
// early access
Stop guessing.
Join the waitlist.
Be first to profile FP16 · INT8 · INT4 on your own GPU and ship the precision your hardware actually wants.
No spam. Just one email when it's ready.