DevOps VN

Let's share knowledge

I enjoy writing, and I created this page to share what I know with as many people in the community as possible.

20/05/2026

Free Book: Practical DevOps AI, Chapter 14 Released

[Chapter 13: Multi-Source Log Integration]
- Understanding the challenge: Moving from single log files to real infrastructure.
- Building API clients: Connect to Elasticsearch, Kubernetes, and AWS CloudWatch.
- Authentication and security: Handle API keys, IAM roles, and service accounts properly.
- Query optimization: Fetch logs efficiently without overwhelming your systems.
- Error handling: Deal with API rate limits, timeouts, and service unavailability.
- Log format normalization: Create a unified structure from different log formats.
- Testing each connector: Verify each integration works before combining them.
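
The log format normalization step listed above can be sketched as follows. This is a minimal illustration, not code from the book: the unified record fields (`timestamp`, `source`, `level`, `message`) and both function names are assumptions, and the Elasticsearch document is assumed to carry `@timestamp`, `level`, and `message` fields. CloudWatch Logs events do carry a millisecond epoch `timestamp` and a `message`.

```python
from datetime import datetime, timezone


def normalize_elasticsearch(hit: dict) -> dict:
    """Map one Elasticsearch hit to the unified record (field names are assumed)."""
    src = hit["_source"]
    # Replace a trailing "Z" so fromisoformat also works on older Python versions.
    ts = datetime.fromisoformat(src["@timestamp"].replace("Z", "+00:00"))
    return {
        "timestamp": ts,
        "source": "elasticsearch",
        "level": src.get("level", "INFO").upper(),
        "message": src.get("message", ""),
    }


def normalize_cloudwatch(event: dict) -> dict:
    """Map one CloudWatch Logs event (millisecond epoch timestamp) to the unified record."""
    message = event["message"]
    return {
        "timestamp": datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc),
        "source": "cloudwatch",
        # Naive level detection, for illustration only.
        "level": "ERROR" if "ERROR" in message else "INFO",
        "message": message,
    }
```

Because both functions return timezone-aware timestamps under the same key, records from either source can be merged and sorted with `sorted(records, key=lambda r: r["timestamp"])`.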

[Chapter 14: Cross-System Correlation and Analysis]
- The power of correlation: Understanding how events connect across systems.
- Building the aggregation pipeline: Combine logs from multiple sources into a unified view.
- Teaching correlation: Write prompts that instruct the AI to link related events.
- Time-based correlation: Match events that happened around the same time across different systems.
- Contextual analysis: Build narratives like "service crashed because database hit connection limits after deployment changed timeout settings."
- Implementing the full analysis loop: Pull logs, aggregate, correlate, analyze, and report.
- Testing correlation logic: Verify the agent correctly identifies related events.
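
Time-based correlation, as described in the list above, might look like the sketch below. It assumes each event is a dict with a timezone-consistent `timestamp` and a `source` key (hypothetical field names), and it uses a simple gap-based grouping; the book may use a different approach.

```python
from datetime import timedelta


def correlate_by_time(events, window=timedelta(seconds=30)):
    """Group events separated by at most `window`, keeping only cross-system groups."""
    groups, current = [], []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        # Start a new group when the gap to the previous event exceeds the window.
        if current and event["timestamp"] - current[-1]["timestamp"] > window:
            groups.append(current)
            current = []
        current.append(event)
    if current:
        groups.append(current)
    # A group is only interesting if it spans more than one source system.
    return [g for g in groups if len({e["source"] for e in g}) > 1]
```

Each returned group is a candidate set of related events that can then be handed to the AI with a prompt asking it to explain how they connect.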

18/05/2026

Catching Errors You Don't Know with SRE Agent 🔥

You already know what to look for, but the errors that actually take down production are the ones you don't know about.

A new log line nobody has seen before. A familiar warning that's suddenly firing 100x more than usual. A service that's been running fine for a year and just started spitting out something that looks weird but doesn't match any rule.
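
One common way to notice "a log line nobody has seen before" is to reduce each line to a template and compare it against the templates seen so far. The sketch below is an illustration under that assumption, not the SRE Agent's actual implementation; `fingerprint` and `UnknownLineDetector` are hypothetical names.

```python
import re


def fingerprint(line: str) -> str:
    """Collapse variable parts (hex values, numbers) so similar lines share one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line.strip().lower()


class UnknownLineDetector:
    """Remembers every template seen so far and flags first occurrences."""

    def __init__(self):
        self.seen = set()

    def is_new(self, line: str) -> bool:
        fp = fingerprint(line)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        return True
```

Lines that differ only in numbers or IDs map to the same template, so only genuinely new message shapes are flagged.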

17/05/2026

CRI, CNI, CSI -- three interfaces holding your entire Kubernetes cluster together.

CRI - Container Runtime Interface: The kubelet doesn't run containers directly. It talks to a container runtime over gRPC, and the runtime handles image pulls and the container lifecycle.

CNI - Container Network Interface: Kubernetes does not implement pod networking itself. Every pod creation triggers a CNI call to assign an IP, set up routes, and connect the pod to the network. When the pod is deleted, the CNI plugin cleans up.

CSI - Container Storage Interface: A standard interface that lets Kubernetes connect to different storage systems without storage-specific code built into Kubernetes. Instead of Kubernetes developers adding support for every storage platform directly, storage vendors write their own CSI drivers.

14/05/2026

Learn this Kubernetes Troubleshooting Scenario

When deploying a kubeadm-based Kubernetes cluster on AWS with the Calico CNI, you may encounter connection timeouts between Pods and CoreDNS.

We ran into this issue and wrote a detailed blog post that explains:
- Why the issue happens
- How to troubleshoot it step by step
- The actual root cause
- How AWS networking interacts with Calico
- How to fix it properly

13/05/2026

New Incident Detail Page UI for the Versus SRE Agent. This is a free, open-source incident management tool with an AI SRE Agent in beta mode, designed to automatically detect new problems and alert users before they are aware of them.

In the future, we plan to add problem analysis and automated post-mortems to speed up investigation and reporting after an incident.

12/05/2026

AWS Local Emulator. Supported services:
- EC2
- Auto Scaling
- RDS
- EKS
- Cognito
- ElastiCache
- MSK
- Athena
- Glue Data Catalog + Schema Registry
- Firehose
- S3
- DynamoDB
- IAM

11/05/2026

We just shipped Versus Incident v1.4.0, and it includes something we've been building toward for a while: an AI SRE agent that detects problems before you even notice them.

10/05/2026

This article explains how Netflix traced severe container launch slowdowns to Linux mount lock contention, image layer mount storms, and CPU architecture differences while scaling containers on modern Kubernetes infrastructure.

08/05/2026

Introducing an enhanced UI and SRE Agent for Versus Incident, with visualizations of incident occurrences and of the SRE Agent's learning progress.

05/05/2026

AI SRE Agent with Spike Detection

Spike detection answers a question that the normal "known/unknown" check cannot: "This error is normal, but why is it happening 50 times a minute instead of the usual 2?"
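
A minimal sketch of the idea, assuming per-minute counts for one known error pattern are already available. The class name, the rolling-average baseline, and the `factor` and `min_count` thresholds are illustrative assumptions, not the Versus agent's actual algorithm.

```python
from collections import deque


class SpikeDetector:
    """Flags a minute whose count is far above the rolling average of recent normal minutes."""

    def __init__(self, history=10, factor=5.0, min_count=10):
        self.counts = deque(maxlen=history)  # recent non-spike counts
        self.factor = factor                 # how many times the baseline counts as a spike
        self.min_count = min_count           # ignore tiny absolute counts

    def observe(self, count_per_minute: int) -> bool:
        baseline = sum(self.counts) / len(self.counts) if self.counts else None
        is_spike = (
            baseline is not None
            and count_per_minute >= self.min_count
            and count_per_minute > self.factor * max(baseline, 1.0)
        )
        if not is_spike:
            # Only normal minutes update the baseline, so a spike does not hide the next one.
            self.counts.append(count_per_minute)
        return is_spike
```

With a baseline of about 2 occurrences per minute, a minute with 50 occurrences is flagged, while a minute with 3 is not.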

02/05/2026

Versus AI SRE Agent - Shadow Mode

We are taking steps to enhance our open-source project to better support various use cases in production. Our first phase adds the SRE Agent feature, which automatically detects new problems and alerts engineers so that incidents we have not encountered before can be resolved quickly.

Additionally, we now support the AI Agent in Shadow Mode. This feature simulates abnormal logs, metrics, and traces that should trigger alerts.

Address

Ho Chi Minh City
70000
