How to Use AWS DevOps Agent: Complete Setup Guide (2026)
Last verified: May 14, 2026 · Format: Guide · Est. time: 25-35 min
It is 3 AM and your pager fires. A payment service is returning 500 errors across three availability zones. By the time you open your laptop, the AWS DevOps Agent has already correlated CloudWatch metrics with Datadog traces, identified that a recent deployment changed a database connection pool setting, built a topology graph of every affected downstream service, and drafted a mitigation plan with rollback steps waiting for your approval. That investigation took it four minutes. Manually, you would have needed forty.
This guide walks you through setting up AWS DevOps Agent from scratch: creating your first Agent Space, configuring IAM permissions, connecting your observability and ticketing tools, running your first investigation, and setting up proactive evaluations that catch problems before they become outages. Every step includes a verification checkpoint so you know the configuration is working before moving on. Hands-on configuration takes about 90 minutes for a single application, although topology discovery and the first evaluation findings need another 24-48 hours of background learning. The 2-month free trial lets you validate the tool against your own environment before spending a dollar.
What You Need Before Starting
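AWS DevOps Agent is an autonomous SRE agent built on Amazon Bedrock AgentCore (official docs) and has been generally available since March 31, 2026. It runs in a dedicated console experience, separate from Amazon Q Developer. Before you begin, make sure you have IAM permissions to create roles in every AWS account you plan to monitor, workloads in a supported region, and credentials for each observability, source control, and collaboration tool you intend to connect. A missing prerequisite will block you at a specific step, and backtracking wastes time. The setup follows eight steps: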
- Step 1: Create Your Agent Space
- Step 2: Configure IAM Roles
- Step 3: Connect Observability
- Step 4: Application Topology
- Step 5: Incident Response
- Step 6: Collaboration Tools
- Step 7: On-Demand SRE Tasks
- Step 8: Set Up Evaluations
Step 1: Creating Your First Agent Space
An Agent Space is the foundational unit of AWS DevOps Agent. It defines the scope of what the agent can see and act on: which AWS accounts, which third-party integrations, which operators have access, and what operational boundaries apply. You can run up to 100 Agent Spaces per account per region (adjustable via service quotas). Each space is an isolated boundary, meaning the agent working on your payment service cannot access the topology of your marketing platform unless you explicitly configure cross-space visibility.
- Open the AWS Management Console and navigate to the DevOps Agent service. It has its own console entry separate from Amazon Q Developer.
- Click Create Agent Space.
- Enter a descriptive name that maps to a logical application or service boundary, such as payments-prod, checkout-service-us-east, or platform-staging.
- Select the AWS accounts you want this Agent Space to monitor. You can associate multiple accounts with a single space for cross-account visibility.
- Choose the supported region closest to your workloads: US East (N. Virginia), US West (Oregon), Europe (Frankfurt or Ireland), or Asia Pacific (Sydney or Tokyo).
- Click Create. The Agent Space initializes within 30-60 seconds.
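If you plan to create several Agent Spaces, a short lint script can keep names consistent before you click Create. The sketch below assumes a house convention of two or more lowercase, hyphen-separated parts (all three example names above pass) and checks your plan against the default quota of 100 Agent Spaces per account per region. The convention and the function name are illustrative choices, not AWS requirements.

```python
import re
from collections import Counter

# Assumed house convention (not an AWS rule): two or more lowercase,
# hyphen-separated parts, e.g. "payments-prod" or "checkout-service-us-east".
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)+$")
# Default quota from this guide: 100 Agent Spaces per account per region (adjustable).
MAX_SPACES_PER_REGION = 100


def lint_agent_space_plan(plan):
    """Check a planned list of (name, region) pairs and return any problems found."""
    problems = []
    for name, _region in plan:
        if not NAME_PATTERN.match(name):
            problems.append(f"{name}: does not match naming convention")
    # The same name twice in one region is almost always a copy-paste error.
    for (name, region), count in Counter(plan).items():
        if count > 1:
            problems.append(f"{name}: duplicated {count} times in {region}")
    for region, count in Counter(r for _, r in plan).items():
        if count > MAX_SPACES_PER_REGION:
            problems.append(
                f"{region}: {count} spaces exceeds default quota of {MAX_SPACES_PER_REGION}")
    return problems


if __name__ == "__main__":
    plan = [
        ("payments-prod", "us-east-1"),
        ("checkout-service-us-east", "us-east-1"),
        ("platform-staging", "eu-west-1"),
        ("Payments_Prod", "us-east-1"),  # uppercase and underscore: flagged
    ]
    for problem in lint_agent_space_plan(plan):
        print(problem)
```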
Verification: After creation, the Agent Space should appear in your console list with a status of "Active." Click into it and confirm the associated AWS accounts are listed correctly. If the status shows "Initializing" for more than 2 minutes, check your region selection and IAM permissions.
Step 2: Configuring IAM Roles and Permissions
AWS DevOps Agent needs IAM roles with specific permissions to read telemetry, access CloudWatch logs, query your infrastructure, and interact with connected tools. The principle of least privilege applies here: grant only what the agent needs, nothing more.
- Navigate to IAM in the AWS Management Console.
- Create a new IAM role for the DevOps Agent service. Select AWS Service as the trusted entity type and choose DevOps Agent from the service list.
- Attach the AWS-managed policy AWSDevOpsAgentServiceRolePolicy. This grants the baseline permissions the agent needs for CloudWatch metrics, logs, X-Ray traces, and CloudFormation stack discovery.
- If you are connecting multiple AWS accounts, create a cross-account IAM role in each target account that trusts the agent's primary account. The agent uses sts:AssumeRole to access cross-account resources.
- For operator access, configure IAM Identity Center (SSO) or individual IAM users with permissions to the DevOps Agent Web App. Operators interact with the agent through this separate web application, not the AWS Management Console.
- Review the trust policy to confirm only the DevOps Agent service principal can assume the role. Do not use wildcard principals.
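The two trust policies described above can be generated and checked before you paste them into IAM. This sketch builds the service-role trust policy and the cross-account trust policy (which lets the agent's primary account call sts:AssumeRole), then flags any wildcard principal. The service principal string is a deliberate placeholder, not the real value: copy the exact principal from the AWS DevOps Agent documentation or from the role the console creates. The account ID is made up. Note that trusting an account's root ARN lets any IAM principal in that account that is granted sts:AssumeRole on the role assume it, so narrow it to a specific role ARN if you know which one the agent uses.

```python
import json

# Placeholder (assumption): copy the exact service principal from the AWS DevOps Agent docs.
DEVOPS_AGENT_SERVICE_PRINCIPAL = "REPLACE-WITH-DEVOPS-AGENT-SERVICE-PRINCIPAL"


def service_role_trust_policy(service_principal):
    """Trust policy that lets only one AWS service principal assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service_principal},
            "Action": "sts:AssumeRole",
        }],
    }


def cross_account_trust_policy(primary_account_id):
    """Trust policy for a role in a target account that trusts the agent's primary account."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{primary_account_id}:root"},
            "Action": "sts:AssumeRole",
        }],
    }


def wildcard_principals(policy):
    """Return principal values in Allow statements that contain a '*' wildcard."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM also accepts a single statement object
        statements = [statements]
    found = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if principal == "*":
            found.append("*")
            continue
        for values in principal.values():
            # A principal value can be one string or a list of strings.
            for value in [values] if isinstance(values, str) else values:
                if "*" in value:
                    found.append(value)
    return found


if __name__ == "__main__":
    for policy in (service_role_trust_policy(DEVOPS_AGENT_SERVICE_PRINCIPAL),
                   cross_account_trust_policy("111122223333")):
        assert not wildcard_principals(policy), "wildcard principal found"
        print(json.dumps(policy, indent=2))
```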
Verification: In the Agent Space settings, navigate to the Permissions tab. The console should show a green checkmark next to each required permission. If any show a red X, the IAM role is missing that specific policy. Use the console's built-in permission analyzer to identify the exact missing action.
Step 3: Connecting Observability Tools
The agent needs access to your monitoring and logging data to perform investigations. AWS DevOps Agent natively supports CloudWatch (built-in), Datadog, Dynatrace, New Relic, Splunk, and Grafana with Prometheus. Any tool without a native integration can be connected via MCP (Model Context Protocol) servers.
CloudWatch (Native, No Extra Setup)
If your workloads already emit metrics and logs to CloudWatch, the agent picks them up automatically through the IAM role you configured in Step 2. No additional integration step is required. CloudWatch is the default telemetry source.
Third-Party Tools (Datadog, Dynatrace, Splunk, New Relic, Grafana)
- In the Agent Space settings, navigate to Integrations.
- Click Add Integration and select your observability provider.
- Enter the required credentials: API key and application key for Datadog, API token for Dynatrace, HEC token for Splunk, or API key for New Relic.
- Configure the data scope. Limit the integration to specific services or environments rather than granting access to your entire observability tenant.
- Test the connection. The console runs a connectivity check that confirms the agent can query metrics and logs from the connected tool.
MCP Servers (Custom Tools)
For tools without native integration, connect via MCP (Model Context Protocol) servers using Streamable HTTP transport. The agent supports OAuth 2.0, API Key, and SigV4 authentication methods. MCP servers are registered at the AWS account level and shared across all Agent Spaces in that account. Private and on-premises MCP servers connect through VPC. AWS Labs maintains 56+ open-source MCP servers on GitHub for common operational tools.
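Before registering a custom MCP server, it helps to confirm that it speaks Streamable HTTP. Under the MCP specification, a client POSTs JSON-RPC 2.0 messages to a single endpoint, accepts either application/json or text/event-stream in response, and opens the session with an initialize request. This sketch only builds that request with the standard library. The endpoint URL, the X-API-Key header name, and the protocol version string are assumptions: API-key header names vary by server, and the version should be one your server supports.

```python
import json
import urllib.request

# Hypothetical values: substitute your own server's endpoint and auth header.
MCP_ENDPOINT = "https://mcp.internal.example.com/mcp"
API_KEY_HEADER = "X-API-Key"      # assumption: header names differ between servers
PROTOCOL_VERSION = "2025-03-26"   # assumption: use a version your server supports


def build_initialize_request(endpoint, api_key):
    """Build (but do not send) an MCP initialize request over Streamable HTTP."""
    message = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {},
            "clientInfo": {"name": "connectivity-check", "version": "0.1.0"},
        },
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(message).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Streamable HTTP servers may reply with plain JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
            API_KEY_HEADER: api_key,
        },
    )


if __name__ == "__main__":
    req = build_initialize_request(MCP_ENDPOINT, "test-key")
    # To actually probe the server: urllib.request.urlopen(req, timeout=10)
    print(req.get_method(), req.full_url)
```

A successful reply to the initialize request carries the server's chosen protocol version and capabilities; if the server rejects it, fix that before registering the server with the agent.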
Verification: After connecting each tool, go to the Agent Space dashboard and look for the integration status indicator. Each connected source should show "Connected" with a recent timestamp. Ask the agent an on-demand question like "Show me the top 5 metrics from [tool name] for the last hour" to confirm data is flowing.
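The "Connected with a recent timestamp" check is easy to repeat on a schedule if you record when each integration last returned data. This hypothetical helper flags sources whose newest data point is older than a threshold, which also catches the silent data gaps an expired API key causes. The integration names, the 15-minute threshold, and how you collect the timestamps are assumptions, not a DevOps Agent feature.

```python
from datetime import datetime, timedelta, timezone


def stale_integrations(last_seen, now, max_age=timedelta(minutes=15)):
    """Return, sorted, the integrations with no data newer than max_age.

    last_seen maps integration name -> timezone-aware datetime of the newest data
    point you observed, or None if the source has never returned data. How you
    collect these timestamps is up to you; this is a hypothetical helper.
    """
    return sorted(
        name for name, ts in last_seen.items()
        if ts is None or now - ts > max_age
    )


if __name__ == "__main__":
    now = datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)
    last_seen = {
        "cloudwatch": now - timedelta(minutes=2),
        "datadog": now - timedelta(hours=3),  # e.g. an expired API key
        "splunk": None,                       # never returned data
    }
    print(stale_integrations(last_seen, now))  # ['datadog', 'splunk']
```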
Never use wildcard principals in the DevOps Agent trust policy. Only the DevOps Agent service principal should be allowed to assume the role. Overly permissive trust policies create a privilege escalation path.
If you are monitoring multiple AWS accounts, each target account needs its own IAM role that trusts the agent's primary account via sts:AssumeRole. A missing cross-account role causes silent gaps in visibility.
When connecting third-party tools like Datadog or Splunk, limit the integration to specific services or environments. Granting full-tenant access exposes telemetry the agent does not need and increases your attack surface.
Step 4: Setting Up Application Topology
The topology engine is what makes AWS DevOps Agent more than a log search tool. It builds a continuously updated graph of your services, dependencies, and communication patterns. When something breaks, the agent already knows what depends on what, which is how it can identify root causes in minutes instead of hours.
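To see why a dependency graph shortens investigations, consider the question "if this service fails, what else is affected?" Given a map of which service calls which, the answer is a breadth-first walk over the reversed edges. The sketch below only illustrates that idea with made-up service names; it is not how the agent's topology engine is implemented.

```python
from collections import deque


def affected_by(failed, calls):
    """Return services that directly or transitively depend on `failed`.

    Illustration only. `calls` maps each service to the services it calls.
    """
    # Reverse the edges: for each dependency, record which services call it.
    callers = {}
    for service, deps in calls.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller != failed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    calls = {
        "web-frontend": ["checkout"],
        "checkout": ["payments", "inventory"],
        "payments": ["payments-db"],
        "reporting": ["payments-db"],
    }
    print(sorted(affected_by("payments-db", calls)))
```

For the sample graph, a payments-db failure reaches checkout, payments, reporting, and web-frontend, while inventory is untouched. That is the kind of scoping the agent can do up front because the graph already exists when the alert fires.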
- The agent begins building topology automatically once your Agent Space has connected AWS accounts and observability tools. No manual setup is required for the initial discovery.
- The agent discovers services from multiple sources: CloudFormation stacks, resource tags, CI/CD pipeline configurations, and actual traffic patterns observed through background learning.
- To improve topology accuracy sooner, tag your AWS resources consistently. The agent uses tags such as service, environment, team, and application to group related resources.
- Connect your source control (GitHub, GitLab, or Azure DevOps) through the Integrations panel. This gives the agent visibility into deployment events, which it uses to correlate incidents with recent code changes.
- Review the topology graph in the DevOps Agent Web App. You can view services, dependencies, and traffic flow visually. The agent updates this graph continuously as your environment changes.
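Because the agent groups resources by tags such as service, environment, team, and application, a quick tag audit can explain a sparse or fragmented topology graph. This sketch works on records shaped like the ResourceTagMappingList entries returned by the Resource Groups Tagging API (for example, exported ahead of time); the sample ARNs and values are invented. It reports missing tags and values that differ only in case or surrounding whitespace, such as "Prod" versus "prod".

```python
from collections import defaultdict

# Tags this guide says the agent uses for grouping; adjust to your own scheme.
REQUIRED_TAGS = ("service", "environment", "team", "application")


def audit_tags(resources, required=REQUIRED_TAGS):
    """Return (missing, inconsistent) tag reports for a list of resources.

    Each resource mirrors a ResourceTagMappingList entry from the Resource Groups
    Tagging API: {"ResourceARN": str, "Tags": [{"Key": str, "Value": str}, ...]}.
    """
    missing = {}
    # tag key -> normalized value -> set of raw spellings seen
    spellings = defaultdict(lambda: defaultdict(set))
    for resource in resources:
        tags = {tag["Key"]: tag["Value"] for tag in resource.get("Tags", [])}
        absent = [key for key in required if key not in tags]
        if absent:
            missing[resource["ResourceARN"]] = absent
        for key in required:
            if key in tags:
                spellings[key][tags[key].strip().lower()].add(tags[key])
    inconsistent = {}
    for key, groups in spellings.items():
        variants = [sorted(raw) for raw in groups.values() if len(raw) > 1]
        if variants:
            inconsistent[key] = variants
    return missing, inconsistent


if __name__ == "__main__":
    resources = [
        {"ResourceARN": "arn:aws:lambda:us-east-1:111122223333:function:charge",
         "Tags": [{"Key": "service", "Value": "payments"}, {"Key": "environment", "Value": "prod"},
                  {"Key": "team", "Value": "billing"}, {"Key": "application", "Value": "checkout"}]},
        {"ResourceARN": "arn:aws:dynamodb:us-east-1:111122223333:table/ledger",
         "Tags": [{"Key": "service", "Value": "payments"}, {"Key": "environment", "Value": "Prod"}]},
    ]
    missing, inconsistent = audit_tags(resources)
    print(missing)
    print(inconsistent)
```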
Verification: Open the Topology view in the DevOps Agent Web App. You should see your services and their dependencies mapped out. If the graph appears sparse, give the agent 24-48 hours of background learning time. The topology engine improves its accuracy as it observes more traffic patterns and incidents. If services are missing, check that the corresponding AWS accounts are associated with the Agent Space and that resource tags are consistent.
Step 5: Configuring Incident Response
Investigations are the primary operational mode. When an alert fires or an anomaly is detected, the agent launches an autonomous investigation that follows a five-phase cycle: detection, hypothesis generation, telemetry analysis, root cause analysis, and mitigation planning. The agent runs 24/7, which means incidents at 3 AM on a Saturday receive the same investigation quality as those during business hours.
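One way to picture the five-phase cycle is as an ordered state machine. This simplified sketch treats an investigation as a single linear pass through the phases in the order listed above, ending at mitigation planning. The class and phase names are illustrative and are not the agent's API.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The five investigation phases, in the order this guide lists them."""
    DETECTION = 1
    HYPOTHESIS_GENERATION = 2
    TELEMETRY_ANALYSIS = 3
    ROOT_CAUSE_ANALYSIS = 4
    MITIGATION_PLANNING = 5


class Investigation:
    """Illustrative model of one investigation; not the agent's API."""

    def __init__(self, title):
        self.title = title
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self):
        """Move to the next phase in order; mitigation planning is the last one."""
        if self.phase is Phase.MITIGATION_PLANNING:
            raise ValueError("mitigation planning is the final phase")
        self.phase = Phase(self.phase + 1)
        self.history.append(self.phase)
        return self.phase


if __name__ == "__main__":
    investigation = Investigation("payments-service 500 errors")
    while investigation.phase is not Phase.MITIGATION_PLANNING:
        print(investigation.advance().name)
```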
- In the Agent Space settings, navigate to Investigation Settings.
- Configure alert sources. Connect PagerDuty, CloudWatch Alarms, or any alerting tool that can send webhooks. When an alert triggers, the agent automatically opens an investigation.
- Set the concurrency limit. The default is 3 concurrent investigations per Agent Space (adjustable). For high-incident environments, consider creating multiple Agent Spaces or increasing this quota.
- Configure the mitigation approval workflow. The agent produces mitigation plans with rollback procedures, but human approval is always required before execution. Set up notification channels (Slack, email, or the Web App) where operators receive approval requests.
- Confirm the immutable audit journal is enabled. It is on by default and should stay on. Every investigation phase is recorded in a tamper-resistant log that integrates with AWS CloudTrail for compliance purposes, and the agent cannot modify journal entries after they are written. For broader AI safety controls on the Bedrock platform, see our Bedrock Guardrails breakdown.
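The audit journal's key property is that earlier entries cannot be changed without detection. A common way to get that property is a hash chain, shown here purely as an illustration and not as AWS's implementation: each entry stores the SHA-256 hash of the previous entry, so editing any record breaks every hash after it. The sketch also treats mitigation approval as an explicit journal event and refuses to execute without it, mirroring the human-approval rule above. The event names are made up.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def _digest(event, details, prev_hash):
    # Hash a canonical JSON encoding so identical content always hashes the same.
    body = {"event": event, "details": details, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditJournal:
    """Append-only, hash-chained log (illustrative; not AWS's implementation)."""

    def __init__(self):
        self.entries = []

    def append(self, event, **details):
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        entry = {"event": event, "details": details, "prev_hash": prev_hash,
                 "hash": _digest(event, details, prev_hash)}
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute the chain; return False if any entry was altered or reordered."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            expected = _digest(entry["event"], entry["details"], prev_hash)
            if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


def execute_mitigation(journal, plan_id):
    """Refuse to run a mitigation unless the journal records a human approval for it."""
    approved = any(e["event"] == "mitigation_approved" and e["details"].get("plan_id") == plan_id
                   for e in journal.entries)
    if not approved:
        raise PermissionError(f"plan {plan_id} has no recorded human approval")
    journal.append("mitigation_executed", plan_id=plan_id)
```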
Verification: Trigger a test investigation. The simplest way is to create a CloudWatch Alarm with a deliberately low threshold that your application will breach. When the alarm fires, the agent should begin an investigation within 60 seconds. Check the Investigations panel in the Web App for a new entry with status "In Progress." Once it completes, review the root cause analysis and mitigation plan to confirm the agent correctly identified the source of the alarm.
Step 6: Connecting Collaboration Tools
The agent can push investigation results, mitigation approval requests, and evaluation findings to your team's existing communication channels. This keeps the feedback loop tight and prevents operators from needing to constantly check the Web App.
Slack Integration
- In the Agent Space settings, go to Integrations and click Add Integration.
- Select Slack and authorize the DevOps Agent Slack app in your workspace.
- Choose the channels where investigation summaries and approval requests should be posted. Use a dedicated channel such as #devops-agent-alerts rather than a general operations channel.
- Configure notification preferences: all investigations, only critical severity, or only when human approval is needed.
ServiceNow Integration
- Select ServiceNow from the integrations list.
- Enter your ServiceNow instance URL and provide API credentials with incident creation permissions.
- Map investigation findings to ServiceNow incident fields. The agent can automatically create incidents and populate root cause analysis fields.
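To preview what a mapped incident looks like, or to test your ServiceNow credentials independently of the agent, you can create a record through ServiceNow's Table API (POST /api/now/table/incident). The fields short_description, description, impact, and urgency are standard incident fields, and 1 means High for impact and urgency. The finding dictionary, the severity-to-impact/urgency mapping, and the instance URL are assumptions for illustration, not the agent's built-in mapping.

```python
import base64
import json
import urllib.request

# Assumed mapping from finding severity to ServiceNow (impact, urgency); 1 = High, 3 = Low.
SEVERITY_TO_PRIORITY = {
    "Critical": ("1", "1"),
    "High": ("1", "2"),
    "Medium": ("2", "2"),
    "Low": ("3", "3"),
}


def incident_payload(finding):
    """Map a hypothetical investigation finding dict to ServiceNow incident fields."""
    impact, urgency = SEVERITY_TO_PRIORITY[finding["severity"]]
    return {
        "short_description": finding["title"],
        "description": f"Root cause: {finding['root_cause']}\nReport: {finding['report_url']}",
        "impact": impact,
        "urgency": urgency,
    }


def build_incident_request(instance_url, user, password, finding):
    """Build (but do not send) a Table API request that creates an incident."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{instance_url.rstrip('/')}/api/now/table/incident",
        data=json.dumps(incident_payload(finding)).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```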
PagerDuty Integration
- Select PagerDuty and provide your integration key.
- Configure bidirectional sync: PagerDuty alerts trigger investigations, and investigation results update the PagerDuty incident with root cause details.
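To exercise the PagerDuty-to-investigation path without waiting for a real incident, you can send a test event through PagerDuty's Events API v2. This assumes you have an Events API v2 integration (routing) key for the PagerDuty service that feeds the agent; that may or may not be the same key you entered above. The summary and source values are examples. A trigger event pages whoever is on call for that service, so route tests to a non-paging service where possible.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_test_event(routing_key, summary, source, severity="warning"):
    """Build (but do not send) a PagerDuty Events API v2 trigger event."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    return urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_test_event("YOUR-ROUTING-KEY", "DevOps Agent test: payments 5xx", "payments-alb")
    # To send: urllib.request.urlopen(req, timeout=10)
    print(req.full_url)
```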
Verification: After connecting Slack, run another test investigation (or wait for the next real alert). Within 2-3 minutes of the investigation completing, you should receive a summary message in your configured Slack channel. The message should include the investigation title, root cause summary, severity, and a link to the full report in the Web App.
Step 7: Running Your First On-Demand SRE Task
On-Demand SRE tasks are the conversational mode of the agent. Operators can ask natural language questions, request custom reports, or assign the agent ad hoc analysis. Each Agent Space supports up to 10 concurrent on-demand tasks (adjustable). This is where teams interact with the agent outside of incident response.
- Open the DevOps Agent Web App (this is separate from the AWS Management Console).
- Navigate to your Agent Space and click New Task or open the chat interface.
- Start with a simple query to confirm connectivity: "Show me the top 5 services by error rate over the past 24 hours."
- Try a comparison query: "Compare latency for the checkout service between the last two deployments."
- Request a custom report: "Generate a weekly reliability summary for all services in this Agent Space, including error rates, latency p99, and deployment frequency."
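During the trial, it is worth spot-checking the agent's answers against your own arithmetic. This sketch computes the two figures the example queries ask for: the top services by error rate, and a nearest-rank p99 latency. The input numbers are made up, and your monitoring tool may use a different percentile interpolation, so expect small differences rather than exact matches.

```python
def top_by_error_rate(counts, n=5):
    """counts maps service -> (errors, requests); returns the n highest error rates."""
    rates = [(service, errors / requests)
             for service, (errors, requests) in counts.items() if requests > 0]
    return sorted(rates, key=lambda item: item[1], reverse=True)[:n]


def p99_nearest_rank(latencies_ms):
    """Nearest-rank p99: the value at rank ceil(0.99 * N) of the sorted sample."""
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("need at least one latency sample")
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(99N / 100), no float rounding
    return ordered[rank - 1]


if __name__ == "__main__":
    counts = {"checkout": (50, 10_000), "payments": (120, 8_000), "search": (5, 20_000)}
    print(top_by_error_rate(counts))
    print(p99_nearest_rank(range(1, 101)))  # 99
```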
The agent draws on all connected observability tools, the topology graph, and historical investigation data to answer these queries. It produces charts, tables, and narrative summaries depending on the question format.
Verification: The agent should respond within 30-60 seconds for simple queries and 2-5 minutes for complex analysis. Responses should reference specific metrics from your connected tools. If the agent returns "No data available," verify that the observability integrations from Step 3 are connected and that the time range you requested has data.
Step 8: Setting Up Evaluations
Evaluations are the proactive mode. Instead of waiting for alerts, the agent analyzes historical incident patterns, operational telemetry, and configuration drift to identify problems before they cause outages. Think of evaluations as a continuous reliability review that runs without human prompting. Each Agent Space can run only one evaluation at a time, and this limit is not adjustable.
- In the Agent Space settings, navigate to Evaluation Settings.
- Enable evaluations for the Agent Space. This starts the background analysis process.
- Set evaluation scope. You can limit evaluations to specific services, environments, or risk categories.
- Configure notification preferences for evaluation findings. Each finding includes a severity rating and supporting evidence.
- Review the first batch of findings. The agent typically produces initial recommendations within 24-48 hours as it analyzes your historical patterns and current configuration.
Evaluations surface issues like: configuration drift between environments, services approaching resource limits, repeated incident patterns that suggest an underlying systemic issue, and deployments that correlated with reliability regressions. Every recommendation comes with evidence and suggested remediation steps.
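Configuration drift is the easiest of these findings to reproduce by hand, which makes it a good way to sanity-check an evaluation result. This sketch flattens two nested configuration dictionaries (for example, staging and production parameters exported from your infrastructure-as-code or parameter store) and lists keys whose values differ or exist on only one side. The sample keys and values are invented.

```python
_ABSENT = "<absent>"


def flatten(config, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} pairs."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def drift(left, right):
    """Return {path: (left_value, right_value)} for settings that differ or exist on one side only."""
    a, b = flatten(left), flatten(right)
    result = {}
    for path in sorted(a.keys() | b.keys()):
        if path not in a or path not in b or a[path] != b[path]:
            result[path] = (a.get(path, _ABSENT), b.get(path, _ABSENT))
    return result


if __name__ == "__main__":
    staging = {"db": {"pool_size": 20, "timeout_s": 5}, "replicas": 2}
    production = {"db": {"pool_size": 5, "timeout_s": 5}, "replicas": 6, "waf": {"enabled": True}}
    for path, (stg, prod) in drift(staging, production).items():
        print(f"{path}: staging={stg} production={prod}")
```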
Verification: After enabling evaluations, check the Evaluations panel daily for the first week. You should see findings appear within 24-48 hours. Each finding should include a severity level (Critical, High, Medium, Low), a description of the issue, supporting telemetry evidence, and a recommended action. If no findings appear after 48 hours and your environment is actively generating telemetry, check that the evaluation scope is not too narrow.
Pricing Reference
All three operational modes (Investigations, Evaluations, On-Demand SRE tasks) are billed at the same rate: $0.0083 per agent-second ($0.498 per agent-minute). There are no separate charges per mode.
Free trial: 2 months with 10 Agent Spaces, 20 hours of investigations, 15 hours of evaluations, and 20 hours of on-demand SRE tasks per month. No credit card required to start. AWS Support credits reduce costs further: 100% for Unified Operations, 75% for Enterprise, 30% for Business+.
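Because every mode bills per agent-second, estimating cost is simple arithmetic. This sketch uses Decimal to avoid floating-point rounding and applies an optional support-credit percentage as a flat reduction of the agent charge. How credits actually apply is an assumption here, so check your bill. Free-trial allowances and other AWS charges such as CloudWatch are not modeled.

```python
from decimal import Decimal, ROUND_HALF_UP

# List price from this guide: same rate for Investigations, Evaluations, and On-Demand tasks.
RATE_PER_AGENT_SECOND = Decimal("0.0083")


def estimate_cost(agent_seconds, credit_percent=0):
    """Estimated USD charge for an integer number of agent-seconds, rounded to cents.

    credit_percent is applied as a flat reduction of the agent charge; that is an
    assumption about how support credits apply, so confirm it against your bill.
    """
    gross = RATE_PER_AGENT_SECOND * agent_seconds
    net = gross * (Decimal(100) - Decimal(credit_percent)) / Decimal(100)
    return net.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(estimate_cost(240))                      # 1.99: a four-minute investigation
    print(estimate_cost(20 * 3600))                # 597.60: 20 investigation hours
    print(estimate_cost(3600, credit_percent=75))  # 7.47
```

For example, the four-minute investigation from the opening scenario is 240 agent-seconds, about $1.99, and the 20 monthly investigation hours in the free trial correspond to 72,000 agent-seconds, about $597.60 at list price.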
Pricing verified from AWS documentation as of May 2026. All figures are estimates. Additional AWS service charges (data transfer, CloudWatch, etc.) apply separately.
Missing IAM permissions are the most common deployment blocker. If the agent's IAM role is missing the AWSDevOpsAgentServiceRolePolicy or lacks cross-account sts:AssumeRole permissions, investigations silently fail or return incomplete root cause analysis.
Teams often expect the topology graph to be complete immediately. The agent needs 24-48 hours of background learning to observe traffic patterns, map dependencies, and build an accurate service graph. A sparse topology at hour 2 is normal.
Expired or too narrowly scoped API keys for Datadog, Dynatrace, or Splunk cause silent data gaps. Always test the connection after setup, and re-verify it if the agent's investigation results suddenly exclude a data source.
Troubleshooting and FAQ
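Services are missing from the topology graph. Confirm that (1) the AWS accounts hosting those services are associated with the Agent Space, (2) the resources are tagged consistently (for example with service, environment, and team), and (3) the IAM role has permissions to read CloudFormation stacks and resource metadata in those accounts.
Next Step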
Start with a single Agent Space scoped to your most incident-prone application. Run it for two weeks during the free trial, collecting data on investigation accuracy and MTTR improvement. Compare the agent's root cause findings against your team's manual investigations during the same period. That comparison gives you concrete data for a business case. When you are ready to scale, create additional Agent Spaces for other services and explore the skills system to extend the agent's capabilities with custom operational procedures. For a deeper understanding of the architecture and limitations, read our What Is AWS DevOps Agent breakdown.