Wednesday, June 3, 2026

Designing Agentic Workflows: Core Considerations

When you build an agentic workflow, you are really designing a system where an LLM can plan, act, observe results, and iterate — not just answer a single prompt. The core aspects below are the ones that usually determine whether it works reliably in production.

---

## 1. Define the job boundary clearly

Start with what the agent is allowed to accomplish, and what it must never do.

- **Scope:** One well-defined outcome (e.g. “triage this alert and propose a fix”) beats “handle anything related to infra.”

- **Success criteria:** What does “done” look like? A merged PR? A Jira ticket? A human-approved plan?

- **Escalation:** When should the agent stop and ask a person instead of continuing?

Ambiguous goals are a common reason agent workflows look impressive in demos but fail in real use.
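
One way to make the boundary concrete is to write it down as data the runtime can check. The sketch below is illustrative only: `JobBoundary`, its fields, and the threshold of three failed attempts are assumptions, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobBoundary:
    """Hypothetical record of scope, success criteria, and escalation rules."""
    scope: str
    success_criteria: tuple[str, ...]
    forbidden_actions: frozenset[str] = frozenset()
    max_failed_attempts: int = 3  # assumed escalation threshold

    def is_allowed(self, action: str) -> bool:
        # Anything on the "must never do" list is rejected outright.
        return action not in self.forbidden_actions

    def should_escalate(self, failed_attempts: int, requested_action: str) -> bool:
        # Hand off to a person when the agent asks for a forbidden action
        # or has failed too many times in a row.
        return (not self.is_allowed(requested_action)
                or failed_attempts >= self.max_failed_attempts)


boundary = JobBoundary(
    scope="triage this alert and propose a fix",
    success_criteria=("human-approved plan",),
    forbidden_actions=frozenset({"deploy", "delete"}),
)
```

Keeping the boundary in one immutable object means the same definition can drive the prompt, the runtime checks, and the audit log.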

---

## 2. Choose the right orchestration model

Not every task needs a fully autonomous agent.

| Pattern | Best for |
|---|---|
| **Fixed pipeline** | Predictable steps with known tools |
| **Planner + executor** | Multi-step tasks with branching |
| **Multi-agent** | Parallel research, review, or specialization |
| **Human-in-the-loop** | High-risk or irreversible actions |

A common mistake is making everything “fully agentic” when a deterministic workflow with one LLM step would be simpler and more reliable.
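
As a rough illustration of the simpler option, here is a fixed pipeline with exactly one model step. The `classify` callable stands in for an LLM call, and the `ROUTES` table and queue names are hypothetical.

```python
from typing import Callable

# Hypothetical routing table; a real one would come from configuration.
ROUTES = {"disk": "storage-oncall", "network": "netops-oncall"}


def triage_pipeline(alert_text: str, classify: Callable[[str], str]) -> dict:
    """Deterministic steps around a single model call."""
    cleaned = " ".join(alert_text.split()).lower()  # deterministic: normalize whitespace and case
    label = classify(cleaned).strip().lower()       # the only LLM step (injected callable)
    queue = ROUTES.get(label, "human-review")       # deterministic: unknown labels go to a person
    return {"alert": cleaned, "label": label, "queue": queue}
```

Because the model only picks a label, an unexpected answer degrades to human review instead of triggering an unplanned action.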

---

## 3. Tool design and permissions

Agents are only as good as the tools they can call.

- **Least privilege:** Give only the tools needed for the task.

- **Safe defaults:** Read-only first; require explicit approval for writes, deploys, deletes, or network calls.

- **Structured outputs:** Tools should return predictable JSON, not free-form text the agent must reinterpret.

- **Idempotency:** Assume the agent may retry; side effects should be safe to repeat.

Your Anaplan-style allowlists and protection toggles are a good example of this principle applied in practice.
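
The four properties above can be combined in a small tool wrapper. This is a minimal sketch under stated assumptions: `ToolRegistry`, the `approved` flag, and the caller-supplied `idempotency_key` are invented for illustration and do not correspond to a specific library.

```python
import json
from typing import Callable


class ToolRegistry:
    """Hypothetical registry: allowlist, approval for writes, JSON results, idempotent retries."""

    def __init__(self, allowed: set[str]):
        self._allowed = allowed                      # least privilege: only these tools may run
        self._tools: dict[str, tuple[Callable[..., dict], bool]] = {}
        self._completed: dict[str, str] = {}         # idempotency key -> cached JSON result

    def register(self, name: str, func: Callable[..., dict], writes: bool = False) -> None:
        self._tools[name] = (func, writes)

    def call(self, name: str, *, idempotency_key: str, approved: bool = False, **kwargs) -> str:
        if name not in self._allowed or name not in self._tools:
            return json.dumps({"ok": False, "error": f"tool '{name}' is not allowed"})
        func, writes = self._tools[name]
        if writes and not approved:
            # Safe default: tools with side effects need explicit approval.
            return json.dumps({"ok": False, "error": "approval required"})
        if idempotency_key in self._completed:
            # A retry with the same key returns the cached result without re-running the tool.
            return self._completed[idempotency_key]
        result = json.dumps({"ok": True, "data": func(**kwargs)})
        self._completed[idempotency_key] = result
        return result
```

Every path returns the same JSON envelope (`ok` plus `data` or `error`), so the agent never has to interpret free-form text.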

---

## 4. State, memory, and context management

Agents fail when they lose track of what already happened.

- **Working memory:** Current task state, intermediate results, open questions.

- **External memory:** Docs, tickets, repo context, prior runs — retrieved on demand rather than stuffed into every prompt.

- **Context budget:** Summarize or drop stale history instead of sending the full transcript forever.

- **Handoffs:** If multiple agents are involved, define exactly what each one receives and returns.
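
The context budget idea can be sketched as a trimming function. The version below is an assumption-heavy simplification: it counts characters as a stand-in for tokens, and the summary is a placeholder where a real system would probably ask a model to summarize.

```python
def trim_history(messages: list[str], max_chars: int, keep_recent: int = 4) -> list[str]:
    """Keep recent messages verbatim and collapse older ones into a single summary line."""
    total = sum(len(m) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return list(messages)  # already within budget, or nothing old enough to drop
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    # Placeholder summary (assumption): it only records how much was dropped.
    summary = f"[summary of {len(older)} earlier messages omitted]"
    return [summary] + recent
```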

---

## 5. Prompting, skills, and guardrails

Instructions should be layered, not one giant system prompt.

- **System rules:** Security, tone, non-negotiable constraints.

- **Skills/playbooks:** Reusable procedures for recurring tasks.

- **Task prompt:** The specific user request and current state.

- **Examples:** Few-shot examples for brittle formats or decision boundaries.

Also treat all external inputs — tool responses, web fetches, MCP output, user files — as **untrusted**. Validate before acting on them.
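
A minimal sketch of both ideas follows: assembling the prompt from separate layers, and validating tool output before acting on it. The function names, section labels, and the simple type-based schema are assumptions chosen for illustration.

```python
import json


def build_prompt(system_rules: str, skill: str, task: str, examples: tuple[str, ...] = ()) -> str:
    """Assemble layers in a fixed order so each can be changed and tested independently."""
    sections = [("SYSTEM RULES", system_rules), ("SKILL", skill)]
    if examples:
        sections.append(("EXAMPLES", "\n".join(examples)))
    sections.append(("TASK", task))
    return "\n\n".join(f"[{title}]\n{body}" for title, body in sections)


def parse_untrusted_tool_output(raw: str, required: dict[str, type]) -> dict:
    """Accept only a JSON object with the expected fields and types; reject everything else."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("tool output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("tool output must be a JSON object")
    for key, expected in required.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"field '{key}' is missing or not a {expected.__name__}")
    return {key: data[key] for key in required}  # unexpected fields are dropped
```

Dropping unexpected fields keeps stray text in a tool response from reaching the next prompt unnoticed.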

---

## 6. Reliability and failure handling

Agentic systems must assume things will go wrong.

- **Retries with limits:** Retry transient tool failures, not logical mistakes.

- **Checkpoints:** Save progress so a run can resume after interruption.

- **Verification steps:** Have the agent confirm outcomes (“did the test pass?”, “does the diff match the request?”).

- **Fallbacks:** Smaller model, simpler workflow, or human takeover.

A workflow that cannot recover gracefully from one bad tool call is not production-ready.
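
Retries with limits might look like the sketch below. `TransientToolError` is a hypothetical exception type that tools would raise for timeouts and similar failures; any other exception is treated as a logical mistake and is not retried.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientToolError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_retries(
    tool: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except TransientToolError:
            if attempt == max_attempts:
                raise  # limit reached: surface the failure to a fallback or a human
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff between attempts
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the function testable without real delays.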

---

## 7. Observability and auditability

You need to answer: *What did the agent do, why, and with what result?*

- **Trace each step:** Prompt, tool call, tool result, model decision.

- **Attribute AI actions:** Especially for commits, PRs, and operational changes.

- **Metrics:** Success rate, retries, cost, latency, human intervention rate.

- **Replay/debug:** Ability to inspect a failed run without guessing.

Without this, debugging agent behavior is mostly speculation.
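
A step-level trace can be as simple as an append-only list of structured events. The sketch below assumes a hypothetical `RunTrace` class and event fields; production systems often use an existing tracing stack instead.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TraceEvent:
    run_id: str
    step: int
    kind: str        # "prompt", "tool_call", "tool_result", or "decision"
    payload: dict
    actor: str = "agent"  # attribution: which agent or human produced this event
    timestamp: float = field(default_factory=time.time)


class RunTrace:
    """Hypothetical append-only trace for one agent run."""

    def __init__(self, run_id: Optional[str] = None):
        self.run_id = run_id or uuid.uuid4().hex
        self.events: list[TraceEvent] = []

    def record(self, kind: str, payload: dict, actor: str = "agent") -> TraceEvent:
        event = TraceEvent(self.run_id, len(self.events) + 1, kind, payload, actor)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        # One JSON object per line, so a failed run can be replayed or searched later.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)
```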

---

## 8. Evaluation before and after launch

Agent quality is behavioral, not just “the code compiles.”

- **Golden tasks:** A curated set of real scenarios with expected outcomes.

- **Regression evals:** Run after prompt, tool, or model changes.

- **Failure taxonomy:** Hallucinated tool use, wrong plan, unsafe action, incomplete task.

- **Continuous monitoring:** In production, sample live runs and review drift over time.
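
A regression eval over golden tasks can start very small. In the sketch below, `GoldenTask`, `run_regression`, and exact string matching are simplifying assumptions; real evals often need fuzzier comparison or a grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenTask:
    name: str
    prompt: str
    expected: str


def run_regression(agent: Callable[[str], str], tasks: list[GoldenTask]) -> dict:
    """Run every golden task and report the pass rate plus the names of failing tasks."""
    failures = [t.name for t in tasks if agent(t.prompt).strip() != t.expected]
    passed = len(tasks) - len(failures)
    pass_rate = passed / len(tasks) if tasks else 0.0
    return {"pass_rate": pass_rate, "failures": failures}
```

Running this after each prompt, tool, or model change turns "it seems fine" into a number that can be compared across versions.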

---

## 9. Cost, latency, and model selection

Agentic workflows multiply token and tool usage quickly.

- Use **smaller/faster models** for classification, routing, and summarization.

- Reserve **stronger models** for planning, synthesis, and ambiguous reasoning.

- Cache retrieval and repeated context where possible.

- Cap max steps, tool calls, and runtime per task.
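
The routing and capping ideas above could be sketched as follows. The tier labels, task kinds, and default limits are assumptions chosen for illustration, not recommended values.

```python
import time
from dataclasses import dataclass, field

# Assumed task kinds that a smaller, faster model can handle.
SMALL_MODEL_TASKS = {"classification", "routing", "summarization"}


def pick_model_tier(task_kind: str) -> str:
    """Return a hypothetical tier label; mapping tiers to real models is left to configuration."""
    return "small" if task_kind in SMALL_MODEL_TASKS else "strong"


class BudgetExceeded(Exception):
    pass


@dataclass
class RunBudget:
    max_steps: int = 10
    max_tool_calls: int = 20
    max_seconds: float = 120.0
    steps: int = 0
    tool_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, *, steps: int = 0, tool_calls: int = 0) -> None:
        # Record usage, then stop the run as soon as any cap is exceeded.
        self.steps += steps
        self.tool_calls += tool_calls
        if self.steps > self.max_steps:
            raise BudgetExceeded("step limit reached")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool call limit reached")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("runtime limit reached")
```

Calling `charge` once per loop iteration gives the run a hard stop even when the agent never decides it is done.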

---

## 10. Security and governance

This becomes critical once agents can modify systems.

- No hardcoded secrets; use scoped credentials.

- Approval gates for destructive or privileged operations.

- Sandboxing for command execution.

- Clear ownership: who is accountable when an agent opens a PR or changes config?

---

## A practical mental model

```mermaid
flowchart LR
  Goal[Clear goal] --> Plan[Plan / decompose]
  Plan --> Act[Use tools]
  Act --> Observe[Observe results]
  Observe --> Verify[Verify progress]
  Verify -->|Not done| Plan
  Verify -->|Blocked| Human[Human escalation]
  Verify -->|Done| Complete[Deliver outcome]
```

The hardest parts are usually not the LLM itself, but:

- Clear termination conditions

- Safe, well-scoped tools

- Verification loops

- Human checkpoints for risky actions