I Reduced My Local LLM Response Time by 83% by Changing One Prompting Strategy



When developers talk about optimizing LLM applications, the conversation usually revolves around model size, quantization, GPU memory, or inference speed.

However, one of the biggest performance bottlenecks often sits much higher in the stack: how we construct the prompt payload sent to the model.

Every message sent to an LLM includes some combination of:

  • System instructions
  • Conversation history
  • The user’s latest message

The larger that payload becomes, the more work the model must perform before it can generate a response.

For applications that process thousands or millions of requests, inefficient context management can increase both latency and operational costs significantly.

To better understand the impact of context management strategies, I benchmarked three common approaches using a local LLM-powered grammar correction application (Ollama + Gemma3:1b).

The results were surprising.


The Test Scenario

This benchmark was performed using a grammar correction workflow.

A typical request looked like this:

Input

i am going to market yesterday

Output

I went to the market yesterday.

Unlike a conversational assistant, grammar correction is largely a single-turn task. In most cases, previous messages provide little or no value when correcting the current sentence.

This makes grammar correction an ideal workload for evaluating whether conversation history is actually worth sending to the model.

I compared three different context management strategies across an 11-turn session.

Strategy 1: Full History

Every message in the conversation is appended to the prompt.

[
  systemPrompt,
  ...allPreviousMessages,
  currentMessage
]

This is the default behavior used by many chat applications and LLM clients.

Strategy 1


Strategy 2: Full History with Prompt Caching

The entire conversation history is still sent, but the runtime is allowed to reuse previously processed context whenever possible.

[
  systemPrompt,
  ...allPreviousMessages,
  currentMessage
]

The payload remains identical to Strategy 1, but the runtime can avoid repeating some prompt-processing work.

Strategy 2


Strategy 3: Stateless Prompting

Only the system prompt and the current user message are sent.

[
  systemPrompt,
  currentMessage
]

No conversation history is preserved.

Because grammar correction is inherently a single-turn task, this approach is expected to perform well while minimizing context size.

Strategy 3


Why Context Size Matters

Before an LLM generates a response, it must process the input context.

As conversation history grows, the amount of text included in each request grows as well.

Without any optimization, larger prompts generally mean:

  • More prompt-processing work
  • Higher latency
  • More tokens consumed
  • Increased operating costs when using hosted APIs

Prompt caching can reduce some of this overhead by allowing the runtime to reuse previously processed context.

Stateless prompting takes a different approach: instead of optimizing large prompts, it avoids sending them in the first place.


Benchmark Results

Cumulative Session Latency

The difference between the three strategies became increasingly visible as the session progressed.

TurnFull History (No Cache)Full History (1h Cache)Stateless (Prompt Only)
Turn 119.13 seconds2.21 seconds2.33 seconds
Turn 28.84 seconds2.48 seconds2.72 seconds
Turn 311.53 seconds2.87 seconds2.69 seconds
Turn 412.12 seconds3.12 seconds2.88 seconds
Turn 515.99 seconds4.65 seconds4.06 seconds
Turn 616.72 seconds3.72 seconds3.46 seconds
Turn 718.35 seconds3.34 seconds3.04 seconds
Turn 820.38 seconds3.56 seconds3.13 seconds
Turn 922.46 seconds4.05 seconds3.30 seconds
Turn 1023.51 seconds3.34 seconds2.91 seconds
Turn 1127.33 seconds3.52 seconds3.15 seconds
Total Session Wait🛑 196.36s36.86s🚀 33.67s

The Full History approach experienced steadily increasing processing times as the conversation grew.

Both caching and stateless prompting dramatically reduced cumulative waiting time.

Compared to Full History, Stateless Prompting reduced total session latency by approximately 83%.

Strategy Compare


Context Token Usage

Latency tells only part of the story.

The benchmark also revealed substantial differences in context token consumption.

StrategyTotal Context TokensAverage Context Tokens per Request
Full History~3047~277
Full History + Cache~3012~274
Stateless~1149~104

Stateless prompting reduced context token usage by roughly 62%.

This reduction occurred because previous messages were no longer included in every request.

Importantly, for this grammar correction workload, removing conversation history did not meaningfully affect the quality of the corrections.


Understanding the Difference

The benchmark highlights two fundamentally different optimization techniques.

Prompt Caching

Prompt caching keeps historical context available for reuse.

Advantages:

  • Maintains conversational awareness
  • Reduces repeated prompt-processing work
  • Ideal for chat assistants and long conversations

Limitations:

  • Full context still exists
  • Cache effectiveness depends on implementation details
  • Context size continues growing

Stateless Prompting

Stateless prompting eliminates historical context entirely.

Advantages:

  • Minimal context size
  • Predictable token usage
  • Consistent latency
  • Simplified architecture

Limitations:

  • No conversational memory
  • Unsuitable for multi-turn reasoning tasks

For workloads where previous messages provide little value, stateless prompting can deliver substantial performance gains with virtually no downside.


When Stateless Architectures Make Sense

Not every AI application needs memory.

Stateless prompting often works well for:

  • Grammar correction
  • Translation
  • Text rewriting
  • Classification
  • Formatting
  • Sentiment analysis
  • Content moderation

Conversational history is often more important for:

  • General-purpose chat assistants
  • Coding copilots
  • Research assistants
  • Customer support agents
  • Long-running AI workflows

The key lesson is simple:

Context should be treated as a tool, not a default.

If your application does not benefit from historical messages, sending them anyway only increases latency and token consumption.


Building Cache-Friendly Systems

For applications that genuinely require history, prompt caching can still provide significant improvements.

To maximize cache effectiveness:

Keep Historical Messages Immutable

Avoid modifying older messages after they have been created.

Bad:

messages.push({
  role: "user",
  content: userInput,
  timestamp: Date.now()
});

Dynamic metadata can prevent cache reuse.

Better:

messages.push({
  role: "user",
  content: userInput
});

Keeping historical context stable increases the likelihood of cache hits.


Final Thoughts

This benchmark demonstrates that context management can have a larger impact on application performance than many developers realize.

For a grammar correction workload:

  • Stateless prompting reduced total latency by roughly 83%.
  • Context token usage dropped by approximately 62%.
  • Output quality remained effectively unchanged.

The most important takeaway is not that conversation history is bad.

The takeaway is that different workloads require different context strategies.

Many AI applications inherit chat-style architectures by default, even when their tasks are fundamentally single-turn operations.

By choosing a context strategy that matches the actual problem being solved, developers can significantly reduce latency, lower token consumption, and build more efficient AI systems.

After comparing these strategies, I believe every AI application should display context statistics directly in the chat interface. If users can see CPU usage, memory consumption, and network activity in other software, they should also be able to see how many context tokens are being sent to the model.

A simple panel showing Context Tokens, Response Tokens, Total Conversation Tokens, and a Send Full History checkbox would make context management transparent and help users understand the performance and cost implications of their conversations.