Prompt engineering is the practice of writing inputs to an LLM that reliably produce the outputs you want. It includes structure (system prompts, few-shot examples) and reasoning patterns (chain-of-thought).
Prompt engineering is the discipline of writing inputs to an LLM that produce reliable, useful outputs. It looks deceptively easy — you write English, the model writes English back — but doing it well, especially at production scale, is a real skill.
The basic anatomy of a prompt
Most production prompts have a shape:
System prompt — sets the overall behavior, tone, persona, and any global constraints. The model treats this as more authoritative than user input.
Few-shot examples (optional) — input/output pairs that demonstrate the desired pattern. The model learns the format from the examples even when it can’t articulate the rule.
Task instructions — what specifically to do for this call.
Input — the actual content the model is operating on.
Output format — when you need structured output, specify the shape (JSON schema, XML tags, markdown sections).
A well-structured prompt makes all five explicit. A poorly-structured one mixes them and gets unreliable behavior.
Patterns that actually work
Patterns that actually work
Few-shot prompting — show 2-5 input/output examples. Massively more reliable than describing the rule in words.
Chain-of-thought — ask the model to reason step by step before producing the final answer. The intermediate reasoning often saves it from arithmetic and logical mistakes it would otherwise make.
Role assignment — “You are a senior legal analyst…” biases the output toward that persona’s tone and rigor. Cliché, but works.
Output constraints — “Respond in JSON with keys X, Y, Z” produces dramatically more parseable output than asking in natural language.
Negative constraints — “Do not include any text outside the JSON” reliably suppresses chatty preambles.
Patterns that look right but break
Asking the model to be brief. It usually isn’t. Truncation is more reliable than asking nicely.
Treating the model as deterministic. Even at temperature 0, identical inputs can produce different outputs in production due to batching effects. Don’t depend on byte-for-byte stability.
Long instructions. Past ~500 tokens of system prompt, the model starts ignoring later parts. Front-load the critical instructions.
Mixing data and instructions in the same channel. If you concatenate user input directly into instructions, you’ve opened a prompt-injection vector. Use clear delimiters or, better, separate roles.
Why this is brittle
Prompts are inherently model-specific. A prompt that works on GPT-5 may fail on Claude, and a model version bump can subtly break a prompt that worked yesterday. Production systems that are 80% prompt and 20% application logic are accumulating maintenance debt — every model upgrade is a regression-test cycle.
Chain-of-thought (CoT) works because it expands the amount of computation the model gets to spend before committing to a final answer. A transformer with N layers can do at most N “steps of reasoning” before producing a token; if the answer requires more steps than that, the model has to compress them into a single forward pass and tends to fail. CoT lets the model use output tokens themselves as a scratchpad — each reasoning token gets its own forward pass, effectively unrolling the computation graph in time.
This explains the asymmetry. Tasks that genuinely require multi-step computation (multi-digit arithmetic, multi-hop logic, complex algorithm tracing) benefit dramatically from CoT — sometimes 30+ percentage points on benchmark accuracy. Tasks that are essentially recognition or shallow lookup (factual recall, single-step classification, sentiment analysis) gain nothing because they didn’t need extra computation in the first place; CoT just adds latency and potential for the model to talk itself out of the right answer.
The gotcha is that CoT can actively hurt on tasks where the model’s intuition outperforms its explicit reasoning. Asking a strong model to “think step by step” before classifying sentiment can produce labored, second-guessing reasoning that lands on a worse answer than the immediate response would have. The empirical pattern: CoT helps if the task error is “couldn’t compute the answer”, and hurts if the task error is “overthought a simple judgment”.
Modern models trained with CoT data baked into post-training (o1, Claude with extended thinking) blur the distinction further — they decide internally whether to expand reasoning, often outperforming explicit CoT prompts on the underlying task.
The alternative, when the task is stable and runs at scale, is fine-tuning . A fine-tuned model has the desired behavior baked into the weights, not pasted into the prompt. It survives provider migrations, model upgrades, and cost-pressure to switch to a smaller model — because the smaller fine-tuned model often outperforms the bigger prompted one anyway.
The pragmatic stance: prompt to validate the idea, fine-tune to ship the production version.
Prompt engineering is the cheap way to specify behavior. Fine-tuning is the durable way. Prototype with prompts; ship with weights.
Go further
When should I prompt vs fine-tune?
Prompt for low-volume, exploratory, or evolving tasks. Fine-tune for high-volume, stable, narrow tasks where per-call cost and consistency dominate. Prompting is right for one-off use; fine-tuning beats prompting on every metric (latency, cost, accuracy, calibration) once the task is well-defined and runs at scale.
Do prompting tricks actually generalize across models?
Not as much as people think. A prompt that works beautifully on GPT-5 may underperform on Claude or Gemini, and even a model upgrade within the same family can break a carefully-tuned prompt. Prompts are brittle artifacts; if your system depends on a specific phrasing, you have a maintenance liability.
Chain-of-thought (CoT) is the technique of asking the model to 'think step by step' before producing the final answer — letting it work through reasoning chains in its output before committing. Helps disproportionately on math, logic, and multi-step tasks. Doesn't help (and may hurt) on simple factual recall.