Building the N1 Agent: An AI-Powered Kubernetes Assistant

Summary
The N1 agent transforms how DevOps teams interact with Kubernetes infrastructure by combining conversational prompts with advanced automation, all powered by artificial intelligence. Concretely, it enables teams to communicate with their K8s clusters in natural language, while automatically and autonomously analyzing incidents reported in ITSM tools.
Introduction
What if managing Kubernetes were as simple as having a conversation? For most DevOps teams, diagnosing a production incident means:
Switching between multiple tools (Rancher, Lens, monitoring dashboards, etc.);
Executing 20+ kubectl commands to isolate the root cause;
Spending 30-60 minutes on what should be resolved in 5.
This is the core idea driving the N1 Agent project: making infrastructure accessible through natural language and intelligently automating incident handling.
To achieve this, we deployed Open WebUI, an open-source web interface for interacting with AI models through a chat interface, for the conversational component. For automated incident management, we set up an n8n workflow; n8n is an open-source automation tool built on the principle of connected visual nodes. The workflow retrieves incidents from the ITSM platform, then queries our Kubernetes clusters to perform contextual analysis and generate recommendations.
Both tools rely on an AI model hosted on Azure Foundry, Microsoft's platform for developing and deploying AI models, which allows us to leverage advanced artificial intelligence capabilities while maintaining control over our data in a sovereign cloud environment.
But then, does the AI directly access our Kubernetes clusters to execute kubectl commands in real time? No, it doesn't access them directly. This is where the final component comes in: MCPO, developed by the Open WebUI team, a proxy that bridges HTTP requests and the MCP protocol, thereby securing and standardizing communication between the conversational interface and Kubernetes resources.
What is the Model Context Protocol (MCP)?
The Model Context Protocol, or MCP, is an open-source standard protocol for connecting AI applications to external systems, developed by Anthropic and released in November 2024.
This protocol enables AI models such as GPT-4, Claude, or other LLMs to securely interact with external data sources, APIs, and tools in a standardized way. Rather than each AI application implementing its own custom integration methods, MCP provides a unified framework that allows models to access databases, execute commands, retrieve files, or interact with various services while maintaining security and control over these interactions.
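To illustrate what this standardization looks like in practice, here is a minimal sketch of the MCP wire format, assuming the mcp-server-kubernetes package used later in this article (MCP uses JSON-RPC 2.0, here over stdio; the protocolVersion shown is the initial 2024 revision):

# Sketch: MCP handshake followed by a request listing the tools the
# server exposes. Assumes Node.js is installed so npx can run the server.
printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"demo-client","version":"0.0.1"}}}' \
  '{"jsonrpc":"2.0","method":"notifications/initialized"}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/list"}' \
  | npx -y mcp-server-kubernetes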
An official list of MCP Servers for various platforms has been compiled and is available here.
Architecture Overview
REMARK: This is the configuration we have implemented in our infrastructure. However, there are many features available across the different components to improve or adapt this setup to your specific needs.
Here's the architecture we've implemented:
MCPO is the heart of this architecture, acting as a bidirectional translator between HTTP and the MCP protocol. It loads MCP servers (in our case, the Kubernetes MCP Server) and exposes their native capabilities as HTTP endpoints. This design allows any HTTP-capable client to interact with Kubernetes through standardized REST calls, while MCPO handles all the protocol conversion behind the scenes.
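As an illustration, here is roughly what those standardized REST calls look like with curl; the route name matches the mcpo-config.json shown later, while the tool name and parameters are hypothetical, since the actual names depend on the MCP server's schema:

# Discover the endpoints MCPO exposes for a configured MCP server
# (MCPO publishes an OpenAPI schema per server route):
curl -s http://mcpo-host:8000/your-kubernetes-cluster/openapi.json

# Invoke one capability as a plain REST call; MCPO translates it into
# MCP messages behind the scenes. "pods_list" is a hypothetical tool name:
curl -s -X POST http://mcpo-host:8000/your-kubernetes-cluster/pods_list \
  -H "Content-Type: application/json" \
  -d '{"namespace": "default"}'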
For security reasons, since this component holds sensitive information, notably the kubeconfigs of the Kubernetes clusters, we applied dedicated host and network segmentation. This machine is completely isolated and exposed only through a specific port, accessible exclusively from the host running Open WebUI and n8n.
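As a sketch, such segmentation could be enforced on the MCPO host with a host firewall; this example assumes ufw, with 10.0.0.10 standing in as a placeholder for the host running Open WebUI and n8n:

# Deny everything inbound by default, then allow only the Open WebUI/n8n
# host to reach MCPO's port. 10.0.0.10 is a placeholder address to adapt:
sudo ufw default deny incoming
sudo ufw allow from 10.0.0.10 to any port 8000 proto tcp
sudo ufw enable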
To simplify kubeconfig exports and user access management on the Kubernetes clusters, kubeconfigs are created and exported using Rancher, our Kubernetes management platform. Rancher provides a centralized interface to manage multiple Kubernetes clusters, handle authentication and role-based access control (RBAC), and securely generate kubeconfig files for users and services.
On a separate host, for the conversational component, users (operators) connect to Open WebUI, which is configured with two key integrations: an AI model on Azure Foundry and MCPO as a tool provider. When a user submits a prompt requesting Kubernetes operations, the AI model interprets the intent and generates an HTTP call to the appropriate MCPO endpoint. MCPO converts this into MCP protocol messages, forwards them to the Kubernetes MCP Server for execution, and returns the results along the reverse path.
n8n follows a similar pattern, but through visual workflow nodes. The workflow first retrieves Kubernetes-related incidents from the ITSM tool. A node then calls the Azure Foundry model to analyze them, while HTTP Request nodes interact with MCPO: first discovering the available endpoints, then executing specific Kubernetes commands. Once the analysis is complete, the incident is updated with the findings and recommendations (or even resolved automatically if the workflow is granted more than read-only permissions!).
Finally, Traefik is a reverse proxy that routes incoming traffic to the appropriate service based on the domain name, enabling multiple applications (Open WebUI, n8n) to be accessed through different URLs while running on the same infrastructure.
Technical Implementation
All components run in Docker containers. Below is the docker-compose.yaml file for the machine that hosts Open WebUI, n8n, and Traefik:
version: '3.8'

services:
  traefik:
    image: traefik:latest
    container_name: traefik
    restart: unless-stopped
    command:
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.file.filename=/traefik-dynamic.yml"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./certs:/certs:ro
      - ./traefik-dynamic.yml:/traefik-dynamic.yml:ro
    networks:
      - mcphub-network

  open-webui:
    image: ghcr.io/open-webui/open-webui:0.6.41
    container_name: open-webui
    environment:
      # SSO feature (Microsoft)
      - MICROSOFT_CLIENT_ID=${MICROSOFT_CLIENT_ID}
      - MICROSOFT_CLIENT_SECRET=${MICROSOFT_CLIENT_SECRET}
      - MICROSOFT_CLIENT_TENANT_ID=${MICROSOFT_CLIENT_TENANT_ID}
      - MICROSOFT_REDIRECT_URI=${MICROSOFT_REDIRECT_URI}
      - OPENID_PROVIDER_URL=${OPENID_PROVIDER_URL}
      - ENABLE_OAUTH_SIGNUP=true
    env_file:
      - .env
    volumes:
      - open-webui-data:/app/backend/data
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped
    networks:
      - mcphub-network
    labels:
      - "traefik.enable=true"
      # Adapt this to your domain
      - "traefik.http.routers.owui.rule=Host(`owui.your.domain`)"
      - "traefik.http.routers.owui.entrypoints=websecure"
      - "traefik.http.routers.owui.tls=true"
      - "traefik.http.services.owui.loadbalancer.server.port=8080"
      # Adapt this to your domain
      - "traefik.http.routers.owui-http.rule=Host(`owui.your.domain`)"
      - "traefik.http.routers.owui-http.entrypoints=web"
      - "traefik.http.routers.owui-http.middlewares=redirect-to-https"
      - "traefik.http.middlewares.redirect-to-https.redirectscheme.scheme=https"

  n8n:
    image: n8nio/n8n:stable
    container_name: n8n
    environment:
      # Adapt this to your domain
      - N8N_HOST=n8n.your.domain
      - N8N_PORT=5678
      - N8N_PROTOCOL=https
      # Adapt this to your domain
      - WEBHOOK_URL=https://n8n.your.domain/
      - GENERIC_TIMEZONE=Europe/Zurich
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=n8n-postgres
      - DB_POSTGRESDB_PORT=5432
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=${N8N_POSTGRES_PASSWORD}
      - N8N_CUSTOM_EXTENSIONS=/app/custom
      - N8N_SECURE_COOKIE=false
      - N8N_SKIP_AUTH_ON_OAUTH_CALLBACK=true
      - N8N_ENFORCE_SETTINGS_FILE_PERMISSIONS=true
      - N8N_RUNNERS_MODE=external
      - N8N_RUNNERS_BROKER_LISTEN_ADDRESS=0.0.0.0
      - N8N_RUNNERS_BROKER_PORT=5679
      - N8N_RUNNERS_AUTH_TOKEN=${N8N_RUNNER_TOKEN}
      - N8N_NATIVE_PYTHON_RUNNER=true
    env_file: ".env"
    volumes:
      - n8n-data:/home/node/.n8n
      - ./n8n-custom:/app/custom:ro
    restart: unless-stopped
    networks:
      - mcphub-network
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      - n8n-postgres
    labels:
      - "traefik.enable=true"
      # Adapt this to your domain
      - "traefik.http.routers.n8n.rule=Host(`n8n.your.domain`)"
      - "traefik.http.routers.n8n.entrypoints=websecure"
      - "traefik.http.routers.n8n.tls=true"
      - "traefik.http.services.n8n.loadbalancer.server.port=5678"
      # Adapt this to your domain
      - "traefik.http.routers.n8n-http.rule=Host(`n8n.your.domain`)"
      - "traefik.http.routers.n8n-http.entrypoints=web"
      - "traefik.http.routers.n8n-http.middlewares=redirect-to-https"

  n8n-postgres:
    image: postgres:15-alpine
    container_name: n8n-postgres
    environment:
      - POSTGRES_DB=n8n
      - POSTGRES_USER=n8n
      - POSTGRES_PASSWORD=${N8N_POSTGRES_PASSWORD}
    env_file: ".env"
    volumes:
      - n8n-postgres-data:/var/lib/postgresql/data
    restart: unless-stopped
    networks:
      - mcphub-network

  n8n-runners:
    image: n8nio/runners:stable
    container_name: n8n-runners
    environment:
      - N8N_RUNNERS_TASK_BROKER_URI=http://n8n:5679
      - N8N_RUNNERS_AUTH_TOKEN=${N8N_RUNNER_TOKEN}
    env_file: ".env"
    restart: unless-stopped
    networks:
      - mcphub-network
    depends_on:
      - n8n

volumes:
  open-webui-data:
  n8n-data:
  n8n-postgres-data:

networks:
  mcphub-network:
    driver: bridge
The n8n-runners service executes workflows and tasks outside the main n8n process, allowing better scalability and parallel execution. It is therefore recommended in any deployment where workflow performance and load distribution matter.
Here is the traefik-dynamic.yml file for the SSL certificates, which also needs to be configured to enable HTTPS:
tls:
  certificates:
    # Adapt this to your domain
    - certFile: /certs/your.domain.crt
      keyFile: /certs/your.domain.key
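If you do not yet have certificates issued by your PKI, a self-signed pair can be generated for testing purposes; this is only a sketch (the -addext flag requires OpenSSL 1.1.1+), and production traffic should of course use properly issued certificates:

# Self-signed certificate covering both subdomains, for testing only:
mkdir -p certs
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout certs/your.domain.key -out certs/your.domain.crt \
  -subj "/CN=your.domain" \
  -addext "subjectAltName=DNS:owui.your.domain,DNS:n8n.your.domain"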
Then, here is the docker-compose.yaml for the isolated host, with the MCPO component:
version: '3.8'

services:
  mcpo:
    build:
      context: .
      dockerfile: Dockerfile.mcpo
    container_name: mcpo
    command: mcpo --config /app/mcpo-config.json
    ports:
      - "8000:8000"
    volumes:
      - ./mcpo-config.json:/app/mcpo-config.json:ro
      - ./configs:/app/configs:ro
    restart: unless-stopped
    networks:
      - mcpproxy-network
    extra_hosts:
      - "host.docker.internal:host-gateway"
      # Connection to Rancher
      - "rancher.your.domain:IP"

networks:
  mcpproxy-network:
    driver: bridge
Next, you need to create a Dockerfile.mcpo for the MCPO container, which adds kubectl to the container so it can interact with Kubernetes clusters:
FROM ghcr.io/open-webui/mcpo:main

RUN apt-get update && \
    apt-get install -y curl && \
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && \
    chmod +x kubectl && \
    mv kubectl /usr/local/bin/ && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
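A quick, optional way to confirm the image builds and that kubectl is actually available inside the container (the --entrypoint override is only for this one-off check):

# Build the image and check the kubectl binary inside it:
docker compose build mcpo
docker compose run --rm --entrypoint kubectl mcpo version --client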
You also need to create the mcpo-config.json file, which contains the configuration for the MCP servers. This file defines each MCP server, the command to run, its arguments, and the environment variables required to connect to it:
{
  "mcpServers": {
    "your-kubernetes-cluster": {
      "command": "npx",
      "args": ["-y", "mcp-server-kubernetes"],
      "env": {
        "KUBECONFIG": "/app/configs/kubernetes/your-kubernetes-kubeconfig"
      }
    }
  }
}
For Kubernetes, we use the mcp-server-kubernetes project. In this setup, npx (a Node.js package runner that allows you to execute npm packages without installing them globally) downloads and runs the mcp-server-kubernetes package inside the container. This exposes the necessary tools to interact with the Kubernetes clusters defined in the configuration, allowing MCPO to manage and execute commands on them.
To allow MCPO to connect to your Kubernetes cluster, you need to export your cluster’s kubeconfig and place it in the appropriate directory, ./configs/kubernetes/ in this setup. This directory is then mounted into the MCPO container.
Finally, you can simply run docker compose to start all the services.
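A minimal sketch of these final steps, assuming the file layout from the configuration above:

# Place the kubeconfig exported from Rancher where the container expects it
# (the filename must match the KUBECONFIG path in mcpo-config.json):
mkdir -p ./configs/kubernetes
cp /path/to/exported-kubeconfig ./configs/kubernetes/your-kubernetes-kubeconfig

# Start the stack (run this on each host, from its compose directory):
docker compose up -d

# Optional sanity check on the MCPO host: each configured MCP server gets
# its own route, with interactive API docs served by MCPO:
curl -s http://localhost:8000/your-kubernetes-cluster/docs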
Open WebUI
Once the services are running, you can access the Open WebUI interface at https://owui.your.domain/. The first step is to create an administrator account.
Importing a Model
To integrate a pre-generated LLM from Azure AI Foundry, follow these steps:
Go to Settings > Admin Settings > Connections, then click Add Connection;
Change the Provider Type to Azure OpenAI (if you are using Azure Foundry);
Fill in the fields: URL, Auth Bearer, API Version, and Model IDs;
Save the configuration.
To control the LLM’s behavior and handle errors (so it does not stop if a command fails), you need to activate specific settings and provide a system prompt. For the Kubernetes agent, use the following system prompt:
You are a Senior SRE with over 10 years of experience.
You have access to MCP tools for Kubernetes.
ABSOLUTE RULE: In a SINGLE response, make multiple attempts if a tool fails.
If a tool fails:
Try 3-5 different variants IN THE SAME RESPONSE.
Never say "I’ll try later" or "I will come back."
Call the tools successively until one succeeds.
Example:
❌ Attempt 1: kubectl_logs_post (failed: unsupported resource)
❌ Attempt 2: kubectl_get_logs (failed: tool not found)
✅ Attempt 3: get_pod_logs (success!)
Result: [display the result]
You must perform all attempts in the same message, not across multiple turns.
A system prompt allows you to define the AI's behavior, role, and constraints, essentially providing instructions that guide how the model should respond and interact with users.
The configuration of these parameters and the system prompt is done as follows:
Go to Settings > Admin Settings > Connections, then Models, and select your model;
Under System Prompt, add the above message to frame the LLM's behavior;
In Advanced Params on the same page:
Temperature: 0.0;
Function Calling: Native;
Save the configuration.
The temperature setting controls the randomness of the model's responses: a lower value (like 0.0) makes outputs more deterministic and focused, while higher values increase creativity and variability.
Native function calling enables the AI to invoke multiple tool endpoints within the same response, which is what allows it to retry commands in case of failure.
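For context, these settings map onto the underlying chat completions request that Open WebUI sends. Here is a hedged sketch of the equivalent direct call to Azure OpenAI; the resource name, deployment name, and api-version are placeholder assumptions to adapt:

# Direct call to an Azure OpenAI deployment with temperature 0.
# Endpoint, deployment name, and api-version are placeholders:
curl -s "https://your-resource.openai.azure.com/openai/deployments/your-deployment/chat/completions?api-version=2024-02-01" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a Senior SRE with over 10 years of experience."},
          {"role": "user", "content": "List the pods in the demo namespace."}
        ],
        "temperature": 0.0
      }'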
Integrating an MCP Server
To add the MCP server for a Kubernetes cluster, follow these steps:
Go to Settings > Admin Settings > External Tools, then click Add Connection;
Select the OpenAPI type, then configure:
URL: http://mcpo-host:8000/XXX (the MCPO URL followed by the cluster name defined in mcpo-config.json);
Name: a descriptive name;
Save the configuration.
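To sanity-check that the URL entered above is reachable from the Open WebUI host and actually serves an OpenAPI schema, a quick test might look like this:

# The route name is the cluster key defined in mcpo-config.json:
curl -s http://mcpo-host:8000/your-kubernetes-cluster/openapi.json | head -c 300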
You can then select the cluster in a chat by clicking on the Integration > Tools button.
n8n
As with Open WebUI, once the services are started, you can access n8n at https://n8n.your.domain/ and create an administrator account.
Key components
The n8n configuration obviously depends on your infrastructure and the tools you use. However, the three key nodes to use are as follows:
Azure OpenAI Chat Model (or equivalent, depending on what you use): this node connects to Azure's OpenAI service, allowing the workflow to send prompts and receive AI-generated responses for tasks like incident analysis or generating recommendations.
HTTP Request: this node makes HTTP calls to external APIs, in this case to interact with MCPO endpoints for discovering and executing Kubernetes operations.
AI Agent: this node lets the workflow leverage AI capabilities for decision-making, analysis, and intelligent task execution within the automation pipeline. You can define instructions for it, such as a system prompt to frame the AI's behavior and a user message to provide the specific context of the request. You can then connect it to other components, such as the HTTP Request node so it can retrieve endpoints, and to your Azure OpenAI model (or equivalent).
Although n8n has an MCP node, we cannot use it in our case because, as a reminder, MCPO serves to translate between HTTP ←→ MCP. In practice, the AI therefore never speaks the MCP protocol directly.
You should be able to build the entire automation without writing any code (except for very specific configurations). Indeed, if you manage your prompts correctly, the AI Agent node completely replaces the coding part.
For more details about n8n, we have written a dedicated article on this technology here.
Use Cases and Demonstrations
The main use case for an N1 Agent is improving DevOps team efficiency: whether through the chat system, where you simply interact with the AI for debugging or deployment, or through n8n incident automation, where the initial analysis, or even the resolution, of an incident happens autonomously.
Additionally, if we want to open access to our Kubernetes clusters to other teams that may lack the necessary system knowledge, they can attempt application debugging by chatting with an AI configured as a "Senior SRE".
Beyond basic operations, the N1 Agent can handle more advanced scenarios such as automated scaling decisions based on resource usage patterns, security audits and compliance checks across your infrastructure, or cost optimization recommendations by identifying underutilized resources.
Here is a demonstration of an interaction with the AI on a Kubernetes cluster through Open WebUI:
In this example, the AI successfully:
Listed the existing pods in a specific namespace;
Created a new pod with the requested configuration;
Re-listed the pods to verify successful creation;
Provided clear explanations at each step.
Then, for the n8n part, here is a demonstration of our workflow, with an alert raised in our ITSM tool due to a pod in error:
In order, here are the steps:
Incidents related to Kubernetes clusters are retrieved from the ITSM tool;
A first AI Agent, “Structuration”, connected to Azure, analyzes the incident to structure it clearly for the next step;
Then “Analyze (K8s)”, configured as a “Senior SRE” and also using the Azure model, stores the information and retrieves the MCPO endpoints for the MCP Server;
Next, the command is sent to the “Execute” AI Agent, which runs it against MCPO through an HTTP Request (POST); the returned result is sent back to “Analyze (K8s)”;
This loop continues until the AI determines it has found the source of the problem, at which point it proceeds to the next step;
The “Final Answer” node simply reformats the response and posts it to the ITSM tool.
Challenges and Conclusion
What’s the final result? To measure the real-world impact of the N1 Agent, we compared incident resolution times between traditional manual workflows and our automated approach:
The results speak for themselves : what traditionally takes 45 minutes is now resolved in approximately 8 minutes, an 82% time reduction. The most significant gain occurs during the diagnostic phase, where the AI simultaneously executes multiple checks that would normally require sequential manual investigation.
Beyond the time savings, this automation eliminates the need to switch between multiple tools and enables continuous incident response, even outside business hours.
However, the AI era is still in its infancy, and even more so the MCP protocol, which celebrated its first anniversary in November 2025. Although development is moving fast, we still question whether we can truly grant the AI more than read-only access to our clusters, and whether we can trust its analysis as much as that of an expert SRE. Indeed, during our tests we encountered instances where the AI hallucinated, for example providing information about the wrong cluster.
Additionally, we have seen that there is an extensive list of MCP servers available for various use cases. Our objective moving forward is to extend this technology to other teams, enabling broader automation capabilities and improving operational efficiency across the organization. As the ecosystem matures and best practices emerge, we anticipate gradually expanding the AI's permissions while maintaining strict safeguards and monitoring.