Context compression

Long threads without long bills

Context compression intelligently strips tokens from long inputs, lowering costs, improving latency, and preserving quality.

Request

{
  "model": "google/gemma-4-31b-it",
  "messages": [
    {
      "role": "user",
      "content": "Q1 planning kicked off in early January with a cross-functional workshop that aligned on three strategic pillars: Scale, Stability, and Sustainability. Under the Scale pillar, we successfully provisioned an additional 40% compute capacity across three new AWS regions (us-east-2, eu-central-1, and ap-southeast-2), enabling us to handle projected 2.3x traffic growth through Q3."
    }
  ],
  "max_tokens": 1024,
  "sansa": {
    "compression": {
      "level": 0.5,
      "threshold": 2048
    }
  }
}
Response

{
  "model": "google/gemma-4-31b-it",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "..."
      }
    }
  ],
  "sansa": {
    "compression": {
      "compressed": "Q1 planning in January aligned Scale Stability Sustainability. provisioned additional 40% compute AWS us-east-2 eu-central-1 ap-southeast-2 projected 2.3x traffic growth through Q3."
    }
  }
}
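A request like the one above can be assembled in a few lines. This is a minimal sketch: the payload fields mirror the example exactly, but the function name and the idea of wrapping construction in a helper are illustrative, not part of the API.

```python
import json

def build_request(prompt: str, level: float = 0.5, threshold: int = 2048) -> dict:
    """Assemble a chat-completion payload with compression options.

    `level` is the compression setting and `threshold` the token count
    from the request example above; both are passed through verbatim
    under the `sansa.compression` key.
    """
    return {
        "model": "google/gemma-4-31b-it",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "sansa": {
            "compression": {"level": level, "threshold": threshold},
        },
    }

payload = build_request("Summarize the Q1 planning notes.")
print(json.dumps(payload, indent=2))
```

The body is ordinary JSON, so any HTTP client can send it; only the `sansa` block is specific to compression.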
Keep signal & trim noise from long inputs

TOKEN SAVINGS

40%+

QUALITY

99%+

LATENCY

~20ms
The insight

Less input. Better answers.

Long inputs mix signal with boilerplate and noise. In benchmarks, context-aware compression improved reading comprehension and output quality. Less noise in, better answers out.


Context compression analyzes every token in your prompt to evaluate its semantic weight and relative importance. When a large language model processes its input, every word contributes to its overall understanding, but not equally: many filler words can be stripped away entirely while the core meaning stays intact. To achieve this, a scoring function evaluates each input token's significance and assigns it a retention probability; tokens that fall below the retention threshold are pruned from the text. The resulting output is denser, preserving the original intent and contextual meaning in fewer tokens, which means lower cost, higher-quality responses, and improved long-context understanding.
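The score-and-prune loop can be sketched as a toy. This is an illustration only: the production scorer is a learned model, while the heuristic below simply treats rare, longer tokens as more significant and squashes that into a (0, 1) retention probability.

```python
import math
from collections import Counter

def compress(text: str, retention_threshold: float = 0.5) -> str:
    """Toy score-and-prune: keep tokens whose score clears the threshold.

    A crude stand-in for the learned semantic-weight model described
    above; rarity (inverse frequency) times a mild length bonus.
    """
    tokens = text.split()
    counts = Counter(t.lower() for t in tokens)

    def score(tok: str) -> float:
        # Rare and longer tokens score higher; squash into (0, 1)
        # so the value reads as a retention probability.
        raw = (1.0 / counts[tok.lower()]) * math.log(1 + len(tok))
        return raw / (1.0 + raw)

    # Prune every token that falls below the retention threshold.
    kept = [t for t in tokens if score(t) >= retention_threshold]
    return " ".join(kept)
```

Repeated filler ("the", "and") scores low and is dropped; distinctive content words survive, so the output stays readable to the downstream model while using fewer tokens.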

How it works

Semantic scoring, not truncation

Our compression model scores every span of text for relevance, then strips low-importance text before your LLM sees it. You control the compression level: low for precision tasks, heavy for maximum savings.
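One way to picture the level knob (an illustrative interpretation, not the documented semantics): treat level as the fraction of lowest-scoring spans to drop, keeping the top (1 − level) share in original document order.

```python
def apply_level(scored_spans: list[tuple[str, float]], level: float) -> str:
    """Keep the highest-scoring (1 - level) fraction of spans, in order.

    level=0.0 keeps everything; level=0.5 roughly halves the input;
    higher levels trade quality headroom for larger token savings.
    """
    keep_n = max(1, round(len(scored_spans) * (1.0 - level)))
    # Rank spans by score, take the top keep_n, then restore
    # document order so the compressed text still reads naturally.
    ranked = sorted(range(len(scored_spans)),
                    key=lambda i: scored_spans[i][1], reverse=True)
    keep = sorted(ranked[:keep_n])
    return " ".join(scored_spans[i][0] for i in keep)
```

Restoring document order after ranking matters: the model downstream sees a shorter but still coherent narrative, not a bag of high-scoring fragments.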

Performance

Faster, leaner, sharper

~40% faster end-to-end latency. ~30% fewer input tokens. Compression overhead sits below 20 ms. Speed and savings without a quality trade-off.


Your all-in-one AI backend.

Get started for free.