Monitoring & Alerts
What metrics do we monitor?
We provide comprehensive monitoring across servers and Kubernetes clusters using our open-source operations frameworks - LinuxAid and KubeAid - powered by Prometheus.
1. Server Monitoring (LinuxAid)
For servers, all metrics and alerting rules are defined in our LinuxAid stack here:
These include:
- Host and hardware metrics
- CPU, RAM, I/O, disk health
- Filesystem and storage monitoring
- Network performance
These rules are maintained as open source, updated regularly, and covered by a test suite.
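To illustrate what testing an alerting rule can involve, here is a minimal sketch in Python. It is not taken from LinuxAid: the rule name, the 90% threshold, and the 120-second hold time are assumptions chosen for the example, and the hold time only loosely mimics the `for:` clause of a Prometheus alerting rule.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical stand-in for one alerting rule (not a LinuxAid definition)."""
    name: str
    threshold: float     # condition is true when the value is strictly above this
    hold_seconds: float  # condition must stay true this long before firing


def first_firing_time(rule: ThresholdRule,
                      samples: Iterable[Tuple[float, float]]) -> Optional[float]:
    """Return the timestamp at which the rule first fires, or None.

    `samples` holds (timestamp_seconds, value) pairs in time order.
    """
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # condition just became true
            if ts - breach_start >= rule.hold_seconds:
                return ts
        else:
            breach_start = None  # condition cleared, so the timer resets
    return None


# Assumed example rule: filesystem usage above 90% for two minutes.
rule = ThresholdRule(name="FilesystemAlmostFull", threshold=90.0, hold_seconds=120)

# A short spike that recovers should not fire.
spike = [(0, 85.0), (60, 95.0), (120, 80.0), (180, 85.0)]
# A sustained breach should fire once it has lasted hold_seconds.
sustained = [(0, 85.0), (60, 95.0), (120, 96.0), (180, 97.0)]

print(first_firing_time(rule, spike))      # None
print(first_firing_time(rule, sustained))  # 180
```

A test suite for real rules would typically feed recorded or synthetic series through the actual rule expressions; the two cases above only show the kind of behavior such tests pin down.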
2. Kubernetes Monitoring (KubeAid)
For Kubernetes environments, we use kube-prometheus as our foundation and extend it with additional rules and mixins.
Baseline alerting rules:
Additional alerts and mixins:
Our Kubernetes alerting includes:
- Node, pod, and container metrics
- Control-plane and API server health
- etcd performance and availability
- Scheduler and controller-manager signals
- Ingress, service and network health
- Application-level alerts for selected open-source components
- Optional mixin-based alert packs that can be enabled or disabled per cluster
This layered approach ensures platform-wide visibility with the flexibility to adapt to each customer’s stack.
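The layering described above (a baseline that always applies, plus optional packs toggled per cluster) can be sketched roughly as follows. All rule and pack names here are hypothetical placeholders, not KubeAid identifiers, and the real configuration mechanism may look quite different.

```python
from typing import Dict, FrozenSet, List

# Hypothetical names, for illustration only.
BASELINE_RULES: List[str] = ["NodeDown", "PodCrashLooping", "EtcdNoLeader"]
MIXIN_PACKS: Dict[str, List[str]] = {
    "ingress": ["IngressHighErrorRate"],
    "database": ["DatabaseReplicationLag"],
}


def rules_for_cluster(enabled_packs: FrozenSet[str]) -> List[str]:
    """Return the baseline rules plus the rules of every enabled pack."""
    unknown = enabled_packs - set(MIXIN_PACKS)
    if unknown:
        raise ValueError(f"unknown mixin packs: {sorted(unknown)}")
    rules = list(BASELINE_RULES)  # baseline always applies
    for pack in sorted(enabled_packs):  # sorted for a stable rule order
        rules.extend(MIXIN_PACKS[pack])
    return rules


# One cluster runs only the baseline, another also enables the ingress pack.
print(rules_for_cluster(frozenset()))
print(rules_for_cluster(frozenset({"ingress"})))
```

Rejecting unknown pack names, rather than silently ignoring them, is a design choice in this sketch: a typo in a cluster's configuration would otherwise leave that cluster without the alerts its owner expected.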
How we handle alerts and incident escalation
Prometheus routes alerts to our central alert ingestion endpoint (with the option to exclude hosts or components upon request).
Alerts are then processed by our internal operations platform, which:
- Automatically creates or updates alert tickets
- Notifies our 24/7 operations team
- Applies your SLA to prioritize each ticket and set its deadline
- Tracks acknowledgment and resolution progress in real time
- Performs automated escalations - first to the on-call engineer, then to management if needed
Some of our alerts are preemptive: they fire before a failure or outage occurs.
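The processing steps above (create or update a ticket, apply an SLA deadline, escalate if nobody responds) can be sketched as below. This is an illustration under stated assumptions, not our internal platform: the `AlertDesk` class, the SLA durations, and the 30-minute escalation threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Hypothetical SLA table: severity -> time allowed to resolve.
SLA_DEADLINES: Dict[str, timedelta] = {
    "critical": timedelta(hours=1),
    "warning": timedelta(hours=8),
}
# Hypothetical threshold for escalating an unacknowledged ticket.
ESCALATE_TO_MANAGEMENT_AFTER = timedelta(minutes=30)


@dataclass
class Ticket:
    alert_key: str
    severity: str
    opened_at: datetime
    deadline: datetime
    occurrences: int = 1
    acknowledged: bool = False


class AlertDesk:
    """Toy in-memory model of alert-to-ticket processing."""

    def __init__(self) -> None:
        self.open_tickets: Dict[str, Ticket] = {}

    def ingest(self, alert_key: str, severity: str, now: datetime) -> Ticket:
        """Create a ticket for a new alert, or update the existing one."""
        ticket = self.open_tickets.get(alert_key)
        if ticket is not None:
            ticket.occurrences += 1  # repeat alert: update, do not duplicate
            return ticket
        ticket = Ticket(alert_key, severity, now, now + SLA_DEADLINES[severity])
        self.open_tickets[alert_key] = ticket
        return ticket

    def escalation_target(self, alert_key: str, now: datetime) -> Optional[str]:
        """Return who should be notified next, or None once acknowledged."""
        ticket = self.open_tickets[alert_key]
        if ticket.acknowledged:
            return None
        if now - ticket.opened_at >= ESCALATE_TO_MANAGEMENT_AFTER:
            return "management"
        return "on-call engineer"


desk = AlertDesk()
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first = desk.ingest("web01/disk", "critical", t0)
again = desk.ingest("web01/disk", "critical", t0 + timedelta(minutes=5))

print(first is again, again.occurrences)  # True 2
print(desk.escalation_target("web01/disk", t0 + timedelta(minutes=10)))  # on-call engineer
print(desk.escalation_target("web01/disk", t0 + timedelta(minutes=45)))  # management
```

Keying tickets on a stable alert identifier is what lets a repeated alert update the existing ticket instead of opening a new one.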
We always prioritize restoring service as quickly as possible, then address the root cause to prevent recurrence.