zero-dependency · runs fully offline

Quantize with
conviction.

litmus-lab benchmarks your model across FP16, INT8 and INT4 on your own GPU — then a deterministic offline engine tells you exactly which precision to ship.

$ pip install litmus-lab
no cloud APIs · no hallucinated advice
litmus-lab — profiling session

// the problem

Every tool tells you “INT4 uses less memory.”
That number decides nothing.

Memory reduction alone doesn't determine deployment quality. The same quantized model can quietly:

01

become slower

Dequantization overhead can tank throughput instead of helping it.

02

lose coherence

Rounding error builds up in the weights and generations quietly fall apart.

03

spike TTFT

Latency to first token balloons under some kernels.

04

destabilize logits

4-bit pushes fragile architectures into instability.

05

save almost nothing

Sometimes the VRAM you reclaim isn't worth the tax.

litmus-lab exists to decide, from measurements on your own hardware, whether quantization is actually worth deploying.

// capabilities

Everything you need to ship quantized — with proof.

Multi-precision benchmarking

Profile native FP16, INT8 and INT4 (NF4) on the exact same prompt and architecture: measured side by side, not estimated.

Offline recommendation engine

A deterministic, rule-based heuristic weighs VRAM, speed and perplexity to output a single deployment verdict. No APIs. No hallucinations.
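
A minimal sketch of what a deterministic, rule-based verdict over VRAM, speed and perplexity could look like. The `Result` fields, the `recommend` function and every threshold are illustrative assumptions, not litmus-lab's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    precision: str         # "FP16", "INT8" or "INT4"
    peak_vram_mb: float    # peak GPU memory allocated
    tokens_per_sec: float  # generation throughput
    perplexity: float      # lower is better


def recommend(results, max_ppl_increase=0.05, max_slowdown=0.10,
              min_vram_saving=0.20):
    """Return the most aggressive precision that passes every rule.

    All thresholds are hypothetical fractions relative to the FP16 baseline.
    Falls back to "FP16" when no quantized candidate qualifies.
    """
    by_name = {r.precision: r for r in results}
    base = by_name["FP16"]
    # Try the smallest format first, then the next one up.
    for name in ("INT4", "INT8"):
        cand = by_name.get(name)
        if cand is None:
            continue
        ppl_increase = (cand.perplexity - base.perplexity) / base.perplexity
        slowdown = (base.tokens_per_sec - cand.tokens_per_sec) / base.tokens_per_sec
        vram_saving = (base.peak_vram_mb - cand.peak_vram_mb) / base.peak_vram_mb
        if (ppl_increase <= max_ppl_increase
                and slowdown <= max_slowdown
                and vram_saving >= min_vram_saving):
            return name
    return "FP16"
```

Because the rules are plain comparisons, the same measurements always yield the same verdict, which is the property a rule-based engine has over a generated explanation.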

VRAM isolation & cleanup

Every pass runs in an isolated worker with aggressive CUDA cache flushing, GC and IPC clearing — so memory leaks never fake your VRAM readings.
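
A hedged sketch of the cleanup half of that idea, assuming PyTorch is the backend. The helper names `flush_gpu_memory` and `peak_vram_mb` are hypothetical; the `torch.cuda` calls are standard PyTorch, and the sketch does not show the worker isolation itself.

```python
import gc


def flush_gpu_memory():
    """Release cached GPU memory between passes; return True if CUDA was flushed."""
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.synchronize()              # wait for pending kernels to finish
    torch.cuda.empty_cache()              # release unused cached blocks
    torch.cuda.ipc_collect()              # free memory held by CUDA IPC handles
    torch.cuda.reset_peak_memory_stats()  # start the next pass from a clean peak
    return True


def peak_vram_mb():
    """Peak tensor memory since the last reset, in MB (0.0 without CUDA)."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / (1024 ** 2)
```

Calling `flush_gpu_memory()` before each pass and `peak_vram_mb()` after it keeps one precision's leftovers from inflating the next reading. Running each pass in its own process is the stronger guarantee, since all of that process's GPU memory is released when it exits.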

Context-length protection

Reads max_position_embeddings and scales test sequences safely — so fragile older architectures never crash on out-of-bound indices.
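
One way such a guard could work, as a sketch: clamp the test sequence so that the prompt plus generated tokens never exceeds the model's position table. The function name, the `fallback` default and the stand-in config object are assumptions for illustration; in practice the config would typically come from the model's Hugging Face configuration.

```python
from types import SimpleNamespace


def safe_sequence_length(config, requested_len, max_new_tokens=0, fallback=512):
    """Clamp a test sequence so prompt + new tokens fit the position table."""
    limit = getattr(config, "max_position_embeddings", None)
    if not isinstance(limit, int) or limit <= 0:
        limit = fallback  # hypothetical default when the config has no usable value
    budget = max(1, limit - max_new_tokens)  # positions left for the prompt
    return max(1, min(requested_len, budget))


# Stand-in for a model config with a 2048-position limit.
config = SimpleNamespace(max_position_embeddings=2048)
prompt_len = safe_sequence_length(config, requested_len=4096, max_new_tokens=128)
```

Reserving room for `max_new_tokens` matters because generation keeps appending positions after the prompt; clamping the prompt alone can still run past the limit.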

Beautiful terminal dashboard

Every benchmark renders as a clean, rich-formatted table right in your CLI — readable at a glance, copy-paste ready for a report.

// four signals

The numbers that actually decide a deployment.

VRAM (MB) · lower

Peak GPU memory allocated

reclaimed at INT4

Tokens / sec · higher

Generation throughput

higher is better

TTFT (s) · lower

Time to first token

lower is better

Perplexity (Δ) · lower

Linguistic degradation

small is safe
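
For reference, three of these signals reduce to short formulas. The sketch below uses the standard definition of perplexity (the exponential of the mean per-token negative log-likelihood); the function names are illustrative and say nothing about how litmus-lab computes them internally.

```python
import math
import time


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def perplexity_delta(baseline_nlls, quantized_nlls):
    """Positive values mean the quantized model predicts the text worse."""
    return perplexity(quantized_nlls) - perplexity(baseline_nlls)


def tokens_per_second(new_tokens, seconds):
    """Generation throughput over one timed run."""
    return new_tokens / seconds


def time_to_first_token(token_stream):
    """Seconds until a streaming generator yields its first token."""
    start = time.perf_counter()
    first = next(iter(token_stream))
    return time.perf_counter() - start, first
```

Peak VRAM is the one signal that has to be read from the GPU runtime rather than computed.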

// three steps

From pip install to a deployment verdict.

01

Install

One pip command. Zero config, no API keys, nothing leaves your machine.

$ pip install litmus-lab
02

Profile

Point it at any Hugging Face causal LM and a prompt. It runs FP16, INT8 and INT4 in isolated passes.

$ litmus-lab --model Qwen/Qwen2.5-7B-Instruct \
  --prompt "Explain transformers"
03

Deploy with a verdict

Read the table, get a deterministic recommendation, and ship the precision that's actually best for your GPU.

$ → Recommendation: Deploy INT4 (NF4)
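
For readers who want to reproduce the three passes by hand, here is a sketch of loading one model at each precision. It assumes the `transformers` and `bitsandbytes` route (with `accelerate` for `device_map="auto"`); that is an assumption about a typical setup, not a statement of what litmus-lab does internally, and `load_model` is a hypothetical helper.

```python
PRECISIONS = ("FP16", "INT8", "INT4")


def load_model(model_id, precision):
    """Load a Hugging Face causal LM at the requested precision."""
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")

    # Heavy imports are deferred so argument errors surface immediately.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = {"device_map": "auto"}
    if precision == "FP16":
        kwargs["torch_dtype"] = torch.float16
    elif precision == "INT8":
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
    else:  # INT4 with NF4 quantization
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )
    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
```

Loading, timing and freeing each precision in turn with the same prompt gives the side-by-side numbers the table reports.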

Works with most Hugging Face causal language models

Phi · Qwen · Gemma · Mistral · Llama · OPT · Falcon · TinyLlama · DeepSeek

// early access

Stop guessing.
Join the waitlist.

Be first to profile FP16 · INT8 · INT4 on your own GPU and ship the precision your hardware actually wants.

No spam. Just one email when it's ready.