Let AI Handle Kubernetes Diagnostics—So You Don’t Have To


Diagnosing Kubernetes clusters isn’t rocket science. It’s worse: it’s repetitive, tedious, and eats up precious hours better spent on creative or impactful work. Scanning through logs, running the same kubectl commands, piecing together what went wrong—nobody signed up for this monotony.

So, we automated it. By combining the power of Large Language Models (LLMs) with Kubernetes' diagnostic tools, we’ve built a system that lets AI do the boring work while you focus on innovation.

Let’s dive into what we built, how we built it, and why this is your next must-have tool.

Overview

1. The Problem

Diagnosing Kubernetes clusters is a process filled with repetition:

  • Running kubectl get pods again and again.

  • Parsing logs line by line for the needle in the haystack.

  • Writing mental notes about what went wrong, only to rewrite them as fixes.

The real problem? It’s not hard—it’s boring and extremely time-consuming. When you’re firefighting production issues, the last thing you need is a manual slog through the diagnostics process.

2. Our Solution

We built a Kubernetes diagnostic assistant powered by OpenAI’s LLMs through the LangChain framework. It automates repetitive tasks like:

  • Identifying problematic pods (CrashLoopBackOff, Failed, Pending).

  • Analyzing logs, events, and dependencies.

  • Summarizing findings into actionable solutions.

In short, it turns Kubernetes troubleshooting into a hands-off operation.

3. Our Objectives

  1. Eliminate the Boredom: Let AI handle repetitive commands and diagnostics.

  2. Save Time: Quickly identify and resolve critical issues.

  3. Actionable Insights: Provide a clear summary with solutions—not just a data dump.

  4. Automation at Scale: Diagnose clusters, no matter how large, without manual intervention.

Technical Implementation

1. Automating Kubernetes Diagnostics with Python and LangChain

Our tool uses OpenAI’s GPT-3.5 model to interact with Kubernetes. Here’s the workflow:

Key Features

  1. LLM-Powered Command Generation: The AI proposes and executes Kubernetes commands dynamically, such as:

     kubectl get pods --all-namespaces --field-selector=status.phase!=Running --kubeconfig=~/.kube/config --insecure-skip-tls-verify
    
  2. Iterative Diagnostics: Each problematic pod is analyzed for:

    • Logs (kubectl logs <pod>).

    • Events (kubectl describe pod <pod>).

    • Dependencies (ConfigMaps, Secrets, etc.).

  3. Actionable Summaries: AI synthesizes findings into a Markdown table with:

    • Pod Name

    • Namespace

    • Status

    • Cause

    • Solution
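To make the first feature concrete, here is a sketch of how problematic pods could be picked out of `kubectl get pods --all-namespaces -o json` output. The helper name, the phase set, and the row schema are illustrative assumptions, not the tool's actual internals:

```python
import json

# Pod states we treat as problematic (mirrors the list above).
PROBLEM_PHASES = {"CrashLoopBackOff", "Failed", "Pending"}

def find_problem_pods(kubectl_json: str) -> list[dict]:
    """Return pods in a problematic state from `kubectl ... -o json` output."""
    data = json.loads(kubectl_json)
    findings = []
    for item in data.get("items", []):
        phase = item["status"].get("phase", "Unknown")
        # CrashLoopBackOff surfaces as a container waiting reason,
        # not as the top-level pod phase.
        waiting_reasons = {
            cs.get("state", {}).get("waiting", {}).get("reason")
            for cs in item["status"].get("containerStatuses", [])
        }
        reason = (PROBLEM_PHASES & waiting_reasons) or ({phase} & PROBLEM_PHASES)
        if reason:
            findings.append({
                "pod": item["metadata"]["name"],
                "namespace": item["metadata"]["namespace"],
                "status": next(iter(reason)),
            })
    return findings
```

The same filtering can also be pushed into kubectl itself via `--field-selector`, as the generated command above does; the JSON route simply gives the LLM richer context per pod.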

Snippet of the Main Function

def monitor_kubernetes():
    print("\n--- Starting Kubernetes Dynamic HealthCheck ---\n")
    executed_commands = interact_with_llm()
    print("\n--- Executed Commands ---\n")
    for cmd in executed_commands:
        status = "Success" if cmd else "Failed"
        print(f"{cmd}: {status}")
    print("\n--- HealthCheck Completed ---\n")
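The loop above executes whatever commands the LLM interaction returns, so a guardrail before execution is prudent. A minimal read-only allowlist check might look like this (the helper name and verb list are our assumptions, not part of the repo):

```python
import shlex

# kubectl verbs that only read cluster state; anything else is rejected.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain"}

def is_safe_kubectl(command: str) -> bool:
    """Accept only read-only kubectl invocations proposed by the LLM."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "kubectl":
        return False
    # The first non-flag token after `kubectl` is the verb.
    verb = next((t for t in tokens[1:] if not t.startswith("-")), None)
    return verb in READ_ONLY_VERBS
```

Rejected commands can simply be logged and skipped, which keeps a misbehaving prompt from mutating the cluster.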

2. Environment Configuration

We keep sensitive information like API keys and kubeconfig paths in an .env file:

OPENAI_API_KEY="XXXXXXXXXXXXXXXXXXXXXXXXXXX"
KUBECONFIG="~/.kube/config"

3. Dependency Management

The tool uses:

  • LangChain: To integrate the LLM and manage interactions.

  • pandas: For formatting findings.

  • urllib3: To handle Kubernetes HTTPS connections.

  • rich: To improve terminal output.

Install dependencies via:

pip install -r requirements.txt
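For reference, a requirements.txt matching the list above might look like the following; `langchain-openai` and `python-dotenv` are our assumptions (inferred from the OpenAI integration and the .env file), and you would normally pin versions:

```text
langchain
langchain-openai
pandas
urllib3
rich
python-dotenv
```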

4. Git Hygiene

A clean .gitignore ensures no sensitive files (like .env) are committed:

# Ignore environment files and virtual environment directories
env/
.env
venv/
.venv/

How It Works

1. Step-by-Step Workflow

  1. Let AI Take the Lead: The LLM proposes commands to diagnose the cluster:

     kubectl get pods --all-namespaces --field-selector=status.phase!=Running --kubeconfig=~/.kube/config --insecure-skip-tls-verify
    
  2. Iterative Analysis:

    • For each problematic pod, the AI:

      • Retrieves logs.

      • Examines events.

      • Checks dependencies.

    • Findings are returned as a Markdown table.

  3. Summarized Output: The AI synthesizes results into an actionable summary, like:

    | Pod Name | Namespace | Status | Cause Identified | Solution Proposed |
    | --- | --- | --- | --- | --- |
    | my-pod-1 | default | CrashLoopBackOff | Missing ConfigMap | Add the required ConfigMap |
    | my-pod-2 | kube-system | Pending | Node resource issue | Allocate more resources to the cluster |

  4. Output Example:

     [DEBUG] LLM Proposes: ### Anomalies Detected in the Kubernetes Cluster:
    
     | Pod Name                  | Namespace         | Status           | Cause Identified                  | Solution Proposed                                      |
     |---------------------------|-------------------|------------------|-----------------------------------|--------------------------------------------------------|
     | loki-test-0               | loki              | CrashLoopBackOff | Failed to start properly          | Check pod logs for specific error messages and fix     |
     | k8s-monitor-54c8bfc448-x4bjh | monitoring      | CrashLoopBackOff | Continuous crashing               | Investigate logs and events for root cause and resolve |
     | loki-test-promtail-8bvb5  | loki              | Running          | -                                 | -                                                      |
     | loki-test-promtail-kv2cd  | loki              | Running          | -                                 | -                                                      |
     | loki-test-promtail-x6jsm  | loki              | Running          | -                                 | -                                                      |
    
     ### Recommendations:
     1. **loki-test-0 (loki)**:
        - **Status**: CrashLoopBackOff
        - **Cause**: Pod is failing to start properly.
        - **Solution**: Check pod logs for specific error messages and take necessary actions to resolve the issue.
    
     2. **k8s-monitor-54c8bfc448-x4bjh (monitoring)**:
        - **Status**: CrashLoopBackOff
        - **Cause**: Pod is continuously crashing.
        - **Solution**: Investigate logs and events for the root cause of the crashes and take corrective actions.
    
     3. **loki-test-promtail-8bvb5, loki-test-promtail-kv2cd, loki-test-promtail-x6jsm (loki)**:
        - **Status**: Running
        - **Cause**: No issues identified currently.
        - **Solution**: Monitor for any future issues.
    
     Ensure to address the critical anomalies promptly to stabilize the cluster.
    
     [DEBUG] No new commands or synthesis proposed. Ending interaction.
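The summary table itself is plain Markdown, so findings like those above can be rendered without any heavy machinery. A small sketch (the function name and row keys are illustrative; the repo uses pandas for this kind of formatting):

```python
def to_markdown_table(findings: list[dict]) -> str:
    """Render diagnostic findings as a Markdown summary table."""
    headers = ["Pod Name", "Namespace", "Status", "Cause Identified", "Solution Proposed"]
    keys = ["pod", "namespace", "status", "cause", "solution"]
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join([" --- "] * len(headers)) + "|",
    ]
    for row in findings:
        # Missing fields render as "-", matching the healthy-pod rows above.
        lines.append("| " + " | ".join(str(row.get(k, "-")) for k in keys) + " |")
    return "\n".join(lines)
```

Emitting Markdown keeps the output equally readable in a terminal, a chat message, or a ticket.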
    

Reflections and Challenges

1. What Worked

  • Boredom Eliminated: AI took over repetitive tasks, freeing us to focus on value-added work.

  • Accuracy: The LLM excelled at identifying and prioritizing issues.

  • Time Saved: Diagnosis time for complex clusters dropped significantly.

2. Challenges

  • Parsing Logs: Kubernetes errors can be cryptic, requiring prompt tuning to improve AI responses.

  • Scalability: Handling large clusters requires additional optimization.

Get Started

Ready to automate Kubernetes diagnostics? Clone the repo and start today:

git clone https://github.com/cisel-dev/k8s-healthcheck-llm.git
cd k8s-healthcheck-llm
python k8s_agent.py

A Humble Beginning

We know this tool isn’t perfect—far from it. It’s a starting point, a framework to help you tackle the often tedious task of Kubernetes diagnostics. While the functionality is practical and the prompts are designed to cover common use cases, every environment is unique, and so are its challenges.

We encourage you to adjust the prompts, tweak the workflows and the model, and extend the capabilities to fit your specific needs. This is your opportunity to make it better: more accurate, more insightful, and more tailored to your cluster.

So, get creative, dive in, and make it yours. Let’s automate the boring stuff.
