Monitoring & Alerts

What metrics do we monitor?

We monitor servers and Kubernetes clusters with our open-source operations frameworks, LinuxAid and KubeAid, both powered by Prometheus.

1. Server Monitoring (LinuxAid)

For servers, all metrics and alerting rules are defined in our LinuxAid stack here:

These include:

Host and hardware metrics
CPU, RAM, I/O, disk health
Filesystem and storage monitoring
Network performance

These rules are open source, updated regularly, and covered by a test suite.
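To give a sense of what testing an alerting rule involves, here is a minimal Python sketch. It is not the actual LinuxAid rule set or test suite. It models a hypothetical filesystem rule that fires only when usage stays above a threshold for a sustained period, similar in spirit to the `for` clause of a Prometheus alerting rule. The function name `fires`, the 90% threshold, and the sample data are assumptions made for illustration.

```python
from typing import List, Tuple

# (unix_timestamp_seconds, used_fraction) samples, oldest first.
Sample = Tuple[float, float]


def fires(samples: List[Sample], threshold: float, hold_seconds: float) -> bool:
    """Return True if usage has been above `threshold` continuously, up to the
    latest sample, for at least `hold_seconds`. Hypothetical rule for illustration."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # the breach begins here
        else:
            breach_start = None  # condition cleared, reset the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


# Test-style checks: a sustained breach fires, a short spike does not.
sustained = [(0, 0.95), (60, 0.96), (120, 0.97), (300, 0.97)]
spike = [(0, 0.50), (60, 0.95), (120, 0.50)]
assert fires(sustained, threshold=0.90, hold_seconds=300)
assert not fires(spike, threshold=0.90, hold_seconds=300)
```

Requiring the condition to hold for a period is a common way to keep short spikes from paging anyone, and test cases like the two above make that behaviour explicit.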

2. Kubernetes Monitoring (KubeAid)

For Kubernetes environments, we use kube-prometheus as our foundation and extend it with additional rules and mixins.

Baseline alerting rules:

Additional alerts and mixins:

Our Kubernetes alerting includes:

Node, pod, and container metrics
Control-plane and API server health
etcd performance and availability
Scheduler and controller-manager signals
Ingress, service and network health
Application-level alerts for selected open-source components
Optional mixin-based alert packs that can be enabled or disabled per cluster
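The idea of a fixed baseline plus optional alert packs per cluster can be sketched as follows. This is an illustration only: the rule and pack names below are hypothetical placeholders, and the real definitions and the real enable/disable mechanism live in the KubeAid repositories.

```python
from typing import Dict, List, Set

# Hypothetical rule groups, named for illustration only.
BASELINE_RULES: List[str] = ["node", "pod", "control-plane", "etcd"]
OPTIONAL_PACKS: Dict[str, List[str]] = {
    "ingress": ["ingress-latency", "ingress-errors"],
    "database": ["db-replication-lag"],
}


def rules_for_cluster(enabled_packs: Set[str]) -> List[str]:
    """Baseline rules always apply; optional packs are added only when enabled."""
    unknown = enabled_packs - set(OPTIONAL_PACKS)
    if unknown:
        raise ValueError(f"unknown alert packs: {sorted(unknown)}")
    rules = list(BASELINE_RULES)
    for pack in sorted(enabled_packs):  # sorted for a deterministic result
        rules.extend(OPTIONAL_PACKS[pack])
    return rules


# A cluster with only the hypothetical "ingress" pack enabled.
print(rules_for_cluster({"ingress"}))
```

Rejecting unknown pack names up front means a typo in a cluster's configuration fails loudly instead of silently leaving alerts disabled.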

This layered approach ensures platform-wide visibility with the flexibility to adapt to each customer’s stack.

How we handle alerts and incident escalation

Prometheus routes alerts to our central alert ingestion endpoint (with the option to exclude hosts or components upon request). Alerts are then processed by our internal operations platform, which:

Automatically creates or updates alert tickets
Notifies our 24/7 operations team
Applies your SLA to set each ticket's priority and deadline
Tracks acknowledgment and resolution progress in real time
Escalates automatically, first to the on-call engineer, then to management if needed

Some of our alerts are preemptive: they fire before a component actually fails or goes down.
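The processing steps listed above (create or update a ticket, apply the SLA, escalate) can be sketched in Python. This is not the internal operations platform. The class and field names, the SLA durations, and the escalation rule (hand over to management once the deadline passes without acknowledgment) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict

# Hypothetical SLA table: severity -> time allowed before the deadline.
SLA_DEADLINES = {
    "critical": timedelta(minutes=30),
    "warning": timedelta(hours=4),
}


@dataclass
class Ticket:
    fingerprint: str  # identifies "the same alert" across repeated firings
    severity: str
    opened_at: datetime
    deadline: datetime
    occurrences: int = 1
    acknowledged: bool = False


class AlertDesk:
    def __init__(self) -> None:
        self.tickets: Dict[str, Ticket] = {}

    def ingest(self, fingerprint: str, severity: str, now: datetime) -> Ticket:
        """Create a ticket for a new alert, or update the existing one."""
        ticket = self.tickets.get(fingerprint)
        if ticket is None:
            ticket = Ticket(fingerprint, severity, now, now + SLA_DEADLINES[severity])
            self.tickets[fingerprint] = ticket
        else:
            ticket.occurrences += 1
        return ticket

    def escalation_target(self, ticket: Ticket, now: datetime) -> str:
        """Assumed rule: on-call first, management once the deadline has passed."""
        if ticket.acknowledged:
            return "none"
        return "management" if now >= ticket.deadline else "on-call"


desk = AlertDesk()
t0 = datetime(2024, 1, 1, 12, 0)
ticket = desk.ingest("host1/disk", "critical", t0)
desk.ingest("host1/disk", "critical", t0 + timedelta(minutes=5))  # same alert, same ticket
assert ticket.occurrences == 2
assert desk.escalation_target(ticket, t0 + timedelta(minutes=10)) == "on-call"
assert desk.escalation_target(ticket, t0 + timedelta(minutes=45)) == "management"
```

Keying tickets on a fingerprint is what lets a repeatedly firing alert update one ticket instead of opening a new one each time.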

We always prioritize restoring service as quickly as possible, then address the root cause to prevent recurrence.