Prompt Engineering

Also known as: prompting, prompt design

TL;DR

Prompt engineering is the practice of writing inputs to an LLM that reliably produce the outputs you want. It includes structure (system prompts, few-shot examples) and reasoning patterns (chain-of-thought).

Prompt engineering is the discipline of writing inputs to an that produce reliable, useful outputs. It looks deceptively easy — you write English, the model writes English back — but doing it well, especially at production scale, is a real skill.

PROMPT ENGINEERINGAnatomy of a production prompt.ONE PROMPT↓ ORDER MATTERS · MODEL READS TOP-DOWNROLE · SYSTEMPersona and policy.L1"You are a senior tax analyst. Be concise.Refuse if the question is not about tax law."INSTRUCTIONWhat to do this turn.L2"Extract every taxable event from the document.Return one row per event, sorted by date."EXAMPLES · FEW-SHOTDemonstrate the format.L3IN : "Sold AAPL on 2024-03-12, gain $410."OUT : 2024-03-12 | AAPL | +410IN : "BTC fork received, basis $0."OUT : (varies) | BTC | +basisFORMAT CONSTRAINTSShape the output channel.L4pipe-separated · ISO dates · no preamblenever invent rows · "(varies)" for unknown datesINPUTThe actual document.L5<document> 2024 brokerage statement, 18 pages…</document>LLMreads top-downINSTRUCTION / FORMATEXAMPLESINPUTFIVE LAYERS · MOST PRODUCTION PROMPTS HAVE THIS SHAPE

The basic anatomy of a prompt

Most production prompts have a shape:

  • System prompt — sets the overall behavior, tone, persona, and any global constraints. The model treats this as more authoritative than user input.
  • Few-shot examples (optional) — input/output pairs that demonstrate the desired pattern. The model learns the format from the examples even when it can’t articulate the rule.
  • Task instructions — what specifically to do for this call.
  • Input — the actual content the model is operating on.
  • Output format — when you need structured output, specify the shape (JSON schema, XML tags, markdown sections).

A well-structured prompt makes all five explicit. A poorly-structured one mixes them and gets unreliable behavior.

Patterns that actually work

FEW-SHOT PROMPTING · ROLE BOUNDARIESOne prompt, many turns.ONE MODEL CALL · TOP-DOWN↓ EACH TURN HAS A ROLESHOT 1SHOT 2SHOT 3SYSTEMSYSYou classify customer messages intoone of: BILLING, BUG, FEATURE.Reply with only the label.task framingUSERT1"My invoice for March is wrong."demonstration 1 · inputASSISTANTT2BILLINGdemonstration 1 · outputUSERT3"App crashes on login since v4.2."demonstration 2 · inputASSISTANTT4BUGdemonstration 2 · outputUSERT5"Can you add dark mode?"demonstration 3 · inputASSISTANTT6FEATUREin-context pattern lockedUSERQRY"Why was I charged twice today?"final query · model answers ↓MODEL CONTINUES THE PATTERN→ BILLINGROLE BOUNDARIESSYSTEM · framingUSER · shotsASSISTANT · shotsUSER · final queryA FEW-SHOT PROMPT IS A SCRIPTED CONVERSATION · ROLES MATTER
Patterns that actually work
  • Few-shot prompting — show 2-5 input/output examples. Massively more reliable than describing the rule in words.
  • Chain-of-thought — ask the model to reason step by step before producing the final answer. The intermediate reasoning often saves it from arithmetic and logical mistakes it would otherwise make.
  • Role assignment — “You are a senior legal analyst…” biases the output toward that persona’s tone and rigor. Cliché, but works.
  • Output constraints — “Respond in JSON with keys X, Y, Z” produces dramatically more parseable output than asking in natural language.
  • Negative constraints — “Do not include any text outside the JSON” reliably suppresses chatty preambles.

Patterns that look right but break

  • Asking the model to be brief. It usually isn’t. Truncation is more reliable than asking nicely.
  • Treating the model as deterministic. Even at temperature 0, identical inputs can produce different outputs in production due to batching effects. Don’t depend on byte-for-byte stability.
  • Long instructions. Past ~500 tokens of system prompt, the model starts ignoring later parts. Front-load the critical instructions.
  • Mixing data and instructions in the same channel. If you concatenate user input directly into instructions, you’ve opened a prompt-injection vector. Use clear delimiters or, better, separate roles.

Why this is brittle

Prompts are inherently model-specific. A prompt that works on GPT-5 may fail on Claude, and a model version bump can subtly break a prompt that worked yesterday. Production systems that are 80% prompt and 20% application logic are accumulating maintenance debt — every model upgrade is a regression-test cycle.

Chain-of-thought (CoT) works because it expands the amount of computation the model gets to spend before committing to a final answer. A transformer with N layers can do at most N “steps of reasoning” before producing a token; if the answer requires more steps than that, the model has to compress them into a single forward pass and tends to fail. CoT lets the model use output tokens themselves as a scratchpad — each reasoning token gets its own forward pass, effectively unrolling the computation graph in time.

This explains the asymmetry. Tasks that genuinely require multi-step computation (multi-digit arithmetic, multi-hop logic, complex algorithm tracing) benefit dramatically from CoT — sometimes 30+ percentage points on benchmark accuracy. Tasks that are essentially recognition or shallow lookup (factual recall, single-step classification, sentiment analysis) gain nothing because they didn’t need extra computation in the first place; CoT just adds latency and potential for the model to talk itself out of the right answer.

The gotcha is that CoT can actively hurt on tasks where the model’s intuition outperforms its explicit reasoning. Asking a strong model to “think step by step” before classifying sentiment can produce labored, second-guessing reasoning that lands on a worse answer than the immediate response would have. The empirical pattern: CoT helps if the task error is “couldn’t compute the answer”, and hurts if the task error is “overthought a simple judgment”.

Modern models trained with CoT data baked into post-training (o1, Claude with extended thinking) blur the distinction further — they decide internally whether to expand reasoning, often outperforming explicit CoT prompts on the underlying task.

The alternative, when the task is stable and runs at scale, is . A fine-tuned model has the desired behavior baked into the weights, not pasted into the prompt. It survives provider migrations, model upgrades, and cost-pressure to switch to a smaller model — because the smaller fine-tuned model often outperforms the bigger prompted one anyway.

The pragmatic stance: prompt to validate the idea, fine-tune to ship the production version.

Prompt engineering is the cheap way to specify behavior. Fine-tuning is the durable way. Prototype with prompts; ship with weights.

Go further

When should I prompt vs fine-tune?

Prompt for low-volume, exploratory, or evolving tasks. Fine-tune for high-volume, stable, narrow tasks where per-call cost and consistency dominate. Prompting is right for one-off use; fine-tuning beats prompting on every metric (latency, cost, accuracy, calibration) once the task is well-defined and runs at scale.

Do prompting tricks actually generalize across models?

Not as much as people think. A prompt that works beautifully on GPT-5 may underperform on Claude or Gemini, and even a model upgrade within the same family can break a carefully-tuned prompt. Prompts are brittle artifacts; if your system depends on a specific phrasing, you have a maintenance liability.

What's chain-of-thought and when does it help?

Chain-of-thought (CoT) is the technique of asking the model to 'think step by step' before producing the final answer — letting it work through reasoning chains in its output before committing. Helps disproportionately on math, logic, and multi-step tasks. Doesn't help (and may hurt) on simple factual recall.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord