kubernetes, KQL, azure, application insights

Set alerts on crashed Kubernetes pods in Azure Kubernetes Service

Published at 2019-08-25

It's not always easy to tell what is happening inside a Kubernetes cluster. There are many alternatives, such as Prometheus, that can handle alerting and monitoring for you. However, by using AKS (Azure Kubernetes Service) you get a lot of Azure tooling included, most prominently centralized logging with Azure Monitor.

<figure> <img src="https://blog.novacare.no/content/images/2019/08/mael-santiago-concept-1.jpg" alt="Broken pod"> <figcaption>Credit: Mael Santiago. <a href="https://www.artstation.com/artwork/OYvQb">Source</a>.</figcaption> </figure>

To activate an alert you first need to create a query. To create a query, go to the AKS cluster in the Azure portal and open "Logs" under "Monitoring". You'll see a window that looks something like this:

[Screenshot: the "Logs" query window in the Azure portal]

It's time to write some KQL! In this query, we get all events that involve a crashing pod. To retrieve all pods with the status "CrashLoopBackOff", filter the events by the reason "BackOff". "CrashLoopBackOff" basically means that a container in the pod keeps crashing and Kubernetes keeps trying to restart it, waiting a little longer between attempts.

KubeEvents 
| where ClusterName =~ '<ClusterName>'
| where ObjectKind =~ 'Pod'
| where Reason =~ 'BackOff'
| project TimeGenerated, Name, ObjectKind, Reason, Message, Namespace, Count
| order by TimeGenerated desc
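
The raw event list can get long when a pod restarts many times. As a rough sketch (not part of the original setup), the same filter can be grouped per pod to see which ones are affected most. The one-hour window and the output column names `BackOffEvents` and `LatestEvent` are assumptions chosen for illustration; the aggregation mirrors the `sum(Count)` used for the alert query in this post.

```kusto
// Sketch: summarize BackOff events per pod over the last hour (window is an assumption)
KubeEvents
| where TimeGenerated > ago(1h)
| where ClusterName =~ '<ClusterName>'
| where ObjectKind =~ 'Pod'
| where Reason =~ 'BackOff'
// BackOffEvents and LatestEvent are illustrative column names
| summarize BackOffEvents = sum(Count), LatestEvent = max(TimeGenerated) by Namespace, Name
| order by BackOffEvents desc
```

Running this in the same "Logs" window should give one row per pod, which makes it easier to decide whether an alert on the whole cluster is fine-grained enough.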

After that, you can create an alert rule that triggers on unhealthy pods! The rule needs the query result in the form of an aggregated value, so we need to change the query so that the alert can trigger. I chose to aggregate the sum of Count over five-minute intervals.

KubeEvents 
| where ClusterName =~ '<ClusterName>'
| where ObjectKind =~ 'Pod'
| where Reason =~ 'BackOff'
| project TimeGenerated, Count
| summarize AggregatedValue=sum(Count) by bin(TimeGenerated, 5m) 

Set the threshold to 100 and the rule would look like this:

[Screenshot: the alert rule configured with a threshold of 100]
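
Before saving the rule, it can help to check in the "Logs" window which five-minute bins would have crossed the threshold. The query below is a sketch for that preview only, not the query to use in the alert rule itself. It assumes the rule compares with "greater than", which the post does not state, and it reuses the threshold of 100.

```kusto
// Sketch: list the 5-minute bins whose summed Count exceeds the threshold of 100
// (the "greater than" comparison is an assumption about how the rule is configured)
KubeEvents
| where ClusterName =~ '<ClusterName>'
| where ObjectKind =~ 'Pod'
| where Reason =~ 'BackOff'
| summarize AggregatedValue = sum(Count) by bin(TimeGenerated, 5m)
| where AggregatedValue > 100
| order by TimeGenerated desc
```

If this returns no rows for a period where pods were known to be crashing, the threshold is probably too high for the cluster.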

After that you should have an alert that triggers on "CrashLoopBackOff"! Enjoy!

Karl Solgård

Norwegian software developer. Eager to learn and to share knowledge. Sharing is caring! Follow on social: Twitter and LinkedIn. Email me: karl@solgard.solutions