1. What Phase 4 Actually Is
Agent systems move beyond single-prompt interactions to automated workflows where the model decides which tools to use, executes actions, and adapts based on results. You’re no longer writing prompts. You’re designing systems.
The mindset shift? The model becomes an autonomous component in a larger system. It receives a goal, breaks it down into tasks, selects appropriate tools, executes those tools, and iterates until the goal is met. You define the objective and available tools, then the model figures out the path.

This phase assumes you’ve mastered reasoning from Phase 3. If the model can’t reliably think through multi-step problems with human guidance, it definitely can’t handle autonomous decision-making. Research from Yao et al. on ReAct showed that models need both reasoning and action capabilities working together. Neither alone produces reliable autonomous behavior.
2. Core Goal of This Phase
Create autonomous workflows where the model uses tools to achieve objectives without step-by-step human guidance.
You’re building systems that run independently. The model searches the web when it needs current information, executes Python when it needs calculations, queries databases when it needs data, and chains these actions together based on intermediate results.
This matters because manual prompting doesn’t scale. You can’t babysit every interaction when you’re processing thousands of requests or running continuous monitoring. Agent systems handle routine tasks while escalating edge cases to humans.
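The chaining described above depends on one structural decision: the model names a tool and arguments, but the system executes them. A minimal sketch of that dispatch layer, where all function names and return values are illustrative placeholders rather than a real API:

```python
# Minimal tool-dispatch sketch. The model emits a tool name plus an
# argument; the registry maps that name to an actual function, so the
# system (not the model) runs the code.

def web_search(query: str) -> str:
    """Placeholder: a real system would call a search API here."""
    return f"results for: {query}"

def run_python(code: str) -> str:
    """Placeholder: a real system would execute this in a sandbox."""
    return f"executed: {code}"

TOOLS = {
    "web_search": web_search,
    "run_python": run_python,
}

def dispatch(tool_name: str, argument: str) -> str:
    # Surface unknown tool names as error observations instead of crashing.
    if tool_name not in TOOLS:
        return f"error: unknown tool '{tool_name}'"
    return TOOLS[tool_name](argument)
```

Keeping execution behind a registry like this also makes the "start with 3-5 tools" advice later in this phase easy to enforce: the model can only invoke what the dictionary contains.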
3. Key Skills You Must Master
4. Practical Examples
Example 1: ReAct for Information Gathering
Task: Determine if a specific CVE affects your organization’s infrastructure.
ReAct loop:
Thought: I need to find details about CVE-2024-1234 to understand what systems it affects.
Action: web_search("CVE-2024-1234 vulnerability details")
Observation: CVE-2024-1234 affects Apache Tomcat versions 9.0.0 through 9.0.82. Remote code execution vulnerability.
Thought: Now I need to check what Tomcat version the organization runs.
Action: database_query("SELECT version FROM software_inventory WHERE name='Apache Tomcat'")
Observation: Organization runs Tomcat version 9.0.75
Thought: Version 9.0.75 falls within the affected range (9.0.0-9.0.82). Need to check if patches are available.
Action: web_search("CVE-2024-1234 patch Apache Tomcat")
Observation: Fixed in version 9.0.83, released January 2025.
Thought: I have enough information to provide a complete assessment.
Action: generate_report()
Final Output:
- Vulnerability: CVE-2024-1234 affects your Tomcat installation
- Current version: 9.0.75 (vulnerable)
- Recommended action: Upgrade to 9.0.83 or later
- Severity: High (RCE capability)
Why it works: Each reasoning step determines the next action. The model doesn’t guess at your infrastructure. It queries the actual inventory database. It doesn’t fabricate patch information. It searches for current data. Each observation informs subsequent reasoning.
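The Thought → Action → Observation cycle above can be driven by a short control loop. This is a hedged sketch, not a production implementation: `call_model` and the tool functions are stand-ins for real model and service calls, and the step format (`thought`, `action`, `input`) is an assumed convention.

```python
# Sketch of a ReAct control loop: the model proposes a thought and an
# action, the system executes the action, and the observation is appended
# to the transcript so it informs the next reasoning step.

def react_loop(goal, call_model, tools, max_steps=10):
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        step = call_model(transcript)          # model emits thought + action
        transcript += "Thought: " + step["thought"] + "\n"
        if step["action"] == "finish":
            return step.get("answer")          # goal met, exit the loop
        tool = tools.get(step["action"])
        observation = tool(step["input"]) if tool else "error: unknown tool"
        transcript += f"Observation: {observation}\n"  # feed result back in
    return None  # step budget exhausted without reaching the goal
```

Note the `max_steps` cap: it is the same exit-condition discipline discussed under common mistakes, applied at the loop level.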
Example 2: Programmatic Prompt Generation
Scenario: Generate security scan prompts dynamically based on asset type.
```python
def generate_scan_prompt(asset_type, scan_depth, compliance_frameworks):
    base_prompt = f"Perform a {scan_depth} security scan on the {asset_type}.\n\n"

    # Add framework-specific checks based on asset type
    if asset_type == "web_application":
        checks = ["OWASP Top 10", "authentication mechanisms", "API security"]
    elif asset_type == "database":
        checks = ["access controls", "encryption at rest", "backup integrity"]
    elif asset_type == "cloud_infrastructure":
        checks = ["IAM policies", "network segmentation", "logging coverage"]
    else:
        checks = []  # unknown asset type: no specialized checks

    base_prompt += "Focus on:\n"
    for check in checks:
        base_prompt += f"- {check}\n"

    # Add compliance requirements
    if compliance_frameworks:
        base_prompt += f"\nEnsure compliance with: {', '.join(compliance_frameworks)}\n"

    base_prompt += """
Output format:
1. Findings (Critical/High/Medium/Low)
2. Evidence for each finding
3. Remediation steps
4. Compliance gaps identified
"""
    return base_prompt
```
Usage:
```python
prompt1 = generate_scan_prompt("web_application", "comprehensive", ["PCI DSS", "SOC 2"])
prompt2 = generate_scan_prompt("database", "targeted", ["HIPAA"])
```
Why it works: Different asset types need different security checks. Hardcoding every variation is unmaintainable. The function adapts the prompt based on runtime parameters. Same code handles web apps, databases, and cloud infrastructure with appropriate focus areas.
Example 3: Tree of Thoughts for Problem Solving
Task: Design a backup strategy for a distributed system with multiple constraints.
Tree exploration:
Goal: Design backup strategy for distributed database cluster

Step 1: Generate possible approaches
- Approach A: Full daily backups to cold storage
- Approach B: Incremental backups with weekly full backups
- Approach C: Continuous replication to secondary region

Step 2: Evaluate each approach

Approach A evaluation:
- Storage cost: High (full copies daily)
- Recovery time: Fast (single restore point)
- Bandwidth: Very high
- Assessment: Expensive, not feasible for large datasets

Approach B evaluation:
- Storage cost: Medium (incremental saves space)
- Recovery time: Medium (reconstruct from incrementals)
- Bandwidth: Low
- Assessment: Promising, need to verify backup window

Approach C evaluation:
- Storage cost: High (full secondary cluster)
- Recovery time: Fastest (live failover)
- Bandwidth: Continuous but moderate
- Assessment: Best availability, highest cost

Step 3: Select Approach B, verify constraints
Thought: Check if incremental backup window fits maintenance window.
Action: calculate_backup_time(dataset_size=5TB, incremental_rate=500GB/hr)
Observation: Incremental backup takes 2 hours on average
Thought: Maintenance window is 4 hours. This fits.
Decision: Proceed with Approach B

If backup time had exceeded the window:
- Backtrack to Step 2
- Try Approach C despite higher cost
- Or modify Approach A with compression
Why it works: The model explores multiple approaches before committing. It evaluates feasibility at each branch. When constraints eliminate an option, it backtracks rather than forcing a bad solution. This matches how systems architects actually design: explore options, evaluate tradeoffs, backtrack when needed.
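The generate-evaluate-backtrack pattern above can be sketched as a small search routine. The `generate`, `evaluate`, and `feasible` callables are illustrative stand-ins for model calls or real constraint calculations like `calculate_backup_time`:

```python
# Minimal Tree-of-Thoughts sketch: branch into candidate approaches, rank
# them by evaluation score, and backtrack to the next-best branch whenever
# a constraint check eliminates the current one.

def tree_of_thoughts(goal, generate, evaluate, feasible):
    candidates = generate(goal)                        # Step 1: branch
    ranked = sorted(candidates, key=evaluate, reverse=True)  # Step 2: rank
    for approach in ranked:                            # best-first order
        if feasible(approach):                         # Step 3: verify
            return approach
        # infeasible: backtrack and try the next-best branch
    return None  # every branch eliminated; widen the search or escalate
```

In the backup example, the feasibility check is the backup-window calculation: Approach C would only be tried if Approach B failed it, which is exactly the backtracking behavior described above.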
5. Common Mistakes at This Phase
Building agents before mastering reasoning. If your model can’t reliably solve problems with human guidance, autonomous operation will fail worse. Phase 4 requires Phase 3 competency.
Giving agents too many tools. Start with 3-5 essential tools. Models struggle to select appropriately from 20+ options. Expand the toolkit gradually as you verify selection accuracy.
Skipping error handling. Tools fail. APIs timeout. Searches return no results. Your agent needs explicit instructions for error conditions: retry logic, fallback tools, or human escalation.
No observation validation. The model receives tool output and acts on it. If that output is malformed, null, or error text, the agent will make bad decisions. Validate observations before feeding them back into reasoning loops.
Infinite loops from poor exit conditions. Define clear success criteria and maximum iteration counts. Without them, agents loop indefinitely when they can’t solve a problem.
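The last three mistakes share one remedy: guard every loop iteration. A sketch combining observation validation, bounded retries, and a hard iteration cap with human escalation, where every name and the step format are illustrative assumptions:

```python
# Guardrail sketch: validate observations before reasoning on them, retry
# flaky tools a bounded number of times, and cap total iterations so the
# agent escalates instead of looping forever.

def validate(observation):
    # Reject the outputs that most often poison reasoning loops:
    # null results, empty strings, and raw error text.
    return (observation is not None
            and observation != ""
            and not str(observation).lower().startswith("error"))

def guarded_call(tool, argument, retries=2):
    for _attempt in range(retries + 1):
        result = tool(argument)
        if validate(result):
            return result
    return None  # persistent failure: caller decides fallback or escalation

def run_agent(plan_next, tools, max_iterations=8):
    for _ in range(max_iterations):
        step = plan_next()
        if step["action"] == "done":            # explicit success criterion
            return {"status": "success", "answer": step["answer"]}
        observation = guarded_call(tools[step["action"]], step["input"])
        if observation is None:
            return {"status": "escalate", "reason": f"{step['action']} failed"}
    return {"status": "escalate", "reason": "iteration limit reached"}
```

The escalation return value is the point: a guarded agent that gives up loudly is far cheaper to operate than one that loops or acts on malformed observations.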
6. How to Know You Are Ready for the Next Phase
There is no Phase 5. This is the current frontier of production prompt engineering.
The next frontier isn’t another phase. It’s building increasingly sophisticated agent architectures: multi-agent collaboration where specialized agents coordinate on complex tasks, meta-learning where agents improve their own prompts based on feedback, or hybrid systems where models and humans collaborate dynamically.

Agent systems represent the practical application of every prior phase. Zero-shot clarity ensures tool instructions are unambiguous. Few-shot patterns teach consistent tool output formatting. Reasoning strategies enable reliable multi-step planning. Only when all three foundations are solid can autonomous execution succeed.
Sources
Research Papers
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. Princeton University & Google Research. https://arxiv.org/abs/2210.03629
- Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Princeton University & Google DeepMind. https://arxiv.org/abs/2305.10601
- Zhou, Y., et al. (2022). Large Language Models Are Human-Level Prompt Engineers. Google Research. https://arxiv.org/abs/2211.01910
- Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Meta AI Research. https://arxiv.org/abs/2302.04761
Technical Documentation
- OpenAI. Function Calling Guide. https://platform.openai.com/docs/guides/function-calling
- LangChain. Agent Documentation. https://python.langchain.com/docs/modules/agents/
Standards & Frameworks
- OWASP Foundation. OWASP Top 10. https://owasp.org/www-project-top-ten/
- PCI Security Standards Council. PCI DSS Requirements. https://www.pcisecuritystandards.org/
- NIST. Cybersecurity Framework. https://www.nist.gov/cyberframework