Monitoring a Homelab with Prometheus and Grafana
Implementing monitoring in a homelab using Prometheus, Grafana, Node Exporter, and cAdvisor for system and container visibility.
This post covers how monitoring and observability are implemented in my homelab using Prometheus and Grafana.
Once multiple services are running, visibility becomes critical. Without monitoring, it is difficult to understand system behaviour, detect issues early, or troubleshoot effectively.
Why Monitoring Matters
Monitoring provides insight into:
- System performance (CPU, memory, disk)
- Service health and availability
- Resource usage trends over time
Instead of only reacting to failures after they happen, monitoring makes it possible to detect problems early and troubleshoot proactively.
Monitoring Stack Overview
The monitoring stack consists of:
- Prometheus → metrics collection and storage
- Node Exporter → host-level metrics
- cAdvisor → container-level metrics
- Grafana → visualisation and dashboards
Each component plays a specific role in collecting and presenting data.
Architecture
Metrics flow through the system as follows:
 Host System        Docker Containers
      ↓                    ↓
Node Exporter          cAdvisor
        \                /
         \              /
            Prometheus
                ↓
             Grafana
Prometheus Configuration
Prometheus scrapes metrics from defined targets.
Example configuration:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['172.17.0.1:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
Key points:
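- Node Exporter runs on the host, so Prometheus reaches it through the Docker bridge gateway address (172.17.0.1:9100)
- cAdvisor runs as a container and is scraped by its container name (cadvisor:8080)
- Prometheus scrapes both targets every 15 seconds, as set by scrape_interval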
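To make the targets in the configuration concrete: unless told otherwise, Prometheus scrapes each target over plain HTTP at the /metrics path, so every host:port entry corresponds to a URL that can also be opened by hand when debugging. The minimal Python sketch below builds those URLs for the two jobs. The helper names scrape_url and all_scrape_urls are hypothetical, and the http scheme and /metrics path are Prometheus defaults assumed here because the configuration does not override them.

```python
def scrape_url(target: str, scheme: str = "http", metrics_path: str = "/metrics") -> str:
    """Return the URL Prometheus would scrape for a static target (hypothetical helper).

    Assumes the Prometheus defaults: scheme "http" and metrics_path "/metrics".
    """
    return f"{scheme}://{target}{metrics_path}"


# Targets copied from the example configuration.
JOBS = {
    "node": ["172.17.0.1:9100"],
    "cadvisor": ["cadvisor:8080"],
}


def all_scrape_urls(jobs: dict) -> dict:
    """Map each job name to the list of URLs scraped for it."""
    return {job: [scrape_url(t) for t in targets] for job, targets in jobs.items()}


if __name__ == "__main__":
    for job, urls in all_scrape_urls(JOBS).items():
        for url in urls:
            print(f"{job}: {url}")
```

Requesting one of these URLs from inside the Prometheus container is a quick way to confirm that a target is reachable from where Prometheus actually runs.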
Docker Networking Considerations
One issue encountered was how containers access host services.
On Linux, host.docker.internal is not always available, so:
- The Docker bridge gateway (172.17.0.1) was used
- Prometheus connects to Node Exporter via this address
This detail matters whenever a containerised Prometheus has to scrape an exporter that runs directly on the host.
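The gateway address is not guaranteed to be 172.17.0.1 on every machine, so it is worth confirming rather than hard-coding blindly. As a rough sketch, the Python below reads the JSON printed by `docker network inspect bridge` and pulls out the gateway. The function name bridge_gateway and the sample JSON are illustrative; the fallback to the first usable address of the subnet is an assumption about Docker's default behaviour, not something taken from this setup.

```python
import ipaddress
import json


def bridge_gateway(inspect_json: str):
    """Extract the gateway from `docker network inspect bridge` JSON output.

    Returns the gateway as a string, or None if no IPAM config is present.
    """
    networks = json.loads(inspect_json)
    for config in networks[0].get("IPAM", {}).get("Config") or []:
        gateway = config.get("Gateway")
        if gateway:
            return gateway
        subnet = config.get("Subnet")
        if subnet:
            # Assumption: with no explicit gateway listed, Docker uses the
            # first usable host address of the subnet.
            return str(next(iter(ipaddress.ip_network(subnet).hosts())))
    return None


# Illustrative sample, trimmed to the fields the function reads.
SAMPLE = (
    '[{"Name": "bridge", "IPAM": {"Config": '
    '[{"Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1"}]}}]'
)

if __name__ == "__main__":
    print(bridge_gateway(SAMPLE))
```

In practice the input would come from running `docker network inspect bridge` on the Docker host and passing its output to the function.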
Grafana Dashboards
Grafana is used to visualise metrics collected by Prometheus.
Dashboards include:
- CPU usage (rate-based calculations)
- Memory utilisation
- Disk usage by mount point
- Container resource usage
Prebuilt dashboards were used initially and then customised to focus on relevant metrics.
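The exact panel queries are not reproduced here, but memory and disk panels commonly reduce to the same arithmetic: one minus the available fraction, times 100. The Python sketch below shows that calculation on made-up numbers. It assumes the values come from the Node Exporter metrics node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes, node_filesystem_size_bytes, and node_filesystem_avail_bytes; the mount points and sizes are hypothetical.

```python
def used_percent(total_bytes: float, available_bytes: float) -> float:
    """Percentage in use, given a total and the amount still available."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (1 - available_bytes / total_bytes) * 100


GIB = 1024 ** 3

# Hypothetical readings: 16 GiB of RAM with 4 GiB available.
memory_used = used_percent(16 * GIB, 4 * GIB)

# Hypothetical mount points as (size, available) pairs, one entry per
# filesystem, mirroring a "disk usage by mount point" panel.
FILESYSTEMS = {
    "/": (100 * GIB, 50 * GIB),
    "/mnt/storage": (1000 * GIB, 100 * GIB),
}
disk_used = {
    mount: used_percent(size, avail) for mount, (size, avail) in FILESYSTEMS.items()
}

if __name__ == "__main__":
    print(f"memory: {memory_used:.1f}%")
    for mount, pct in disk_used.items():
        print(f"{mount}: {pct:.1f}%")
```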
Example Metric Query
CPU usage is calculated using PromQL:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
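The query takes the per-second rate of idle CPU time over the last minute, averages it across cores for each instance, and subtracts the resulting idle percentage from 100. The result is the percentage of CPU in active use.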
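The arithmetic behind the query can be worked through by hand. The Python sketch below is a simplified model: it takes the idle counter for each core at the start and end of a window, divides the increase by the window length, averages across cores, and subtracts from 100. It ignores what Prometheus's real rate() also handles (counter resets and extrapolation to the window edges), and the counter values are invented for illustration.

```python
def cpu_usage_percent(idle_start: dict, idle_end: dict, window_seconds: float) -> float:
    """Approximate the PromQL CPU query for a single instance.

    idle_start / idle_end map a core label to its idle counter value
    (node_cpu_seconds_total with mode="idle") at the window boundaries.
    """
    # Per-core idle rate: seconds spent idle per second of wall-clock time.
    rates = [
        (idle_end[cpu] - idle_start[cpu]) / window_seconds for cpu in idle_start
    ]
    # avg by(instance): average the idle fraction across all cores.
    avg_idle = sum(rates) / len(rates)
    return 100 - avg_idle * 100


# Invented sample: over 60 s, core 0 was idle for 45 s and core 1 for 51 s.
START = {"0": 1000.0, "1": 2000.0}
END = {"0": 1045.0, "1": 2051.0}

if __name__ == "__main__":
    # Idle fractions are 0.75 and 0.85, so usage comes out at about 20%.
    print(f"{cpu_usage_percent(START, END, 60):.1f}%")
```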
Observability Approach
The goal is not just to collect data, but to make it actionable.
This includes:
- Highlighting critical metrics clearly
- Avoiding overly complex dashboards
- Focusing on signals that indicate real issues
For example:
- High disk usage on storage volumes
- Sustained high CPU load
- Container resource exhaustion
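One way to separate a real signal from noise is to require that a value stays above a threshold for several consecutive samples before treating it as a problem. The Python sketch below illustrates that idea only; the function name is_sustained, the 80% threshold, and the sample values are all hypothetical and not taken from this setup.

```python
def is_sustained(samples, threshold: float, min_consecutive: int) -> bool:
    """True if at least min_consecutive samples in a row exceed threshold."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it when a sample drops back down.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical CPU usage samples (percent), one per scrape.
BRIEF_SPIKE = [20, 95, 30, 25, 22]
HEAVY_LOAD = [85, 90, 92, 88, 91]

if __name__ == "__main__":
    print(is_sustained(BRIEF_SPIKE, threshold=80, min_consecutive=3))
    print(is_sustained(HEAVY_LOAD, threshold=80, min_consecutive=3))
```

With a 15-second scrape interval, requiring 20 consecutive samples corresponds to roughly five minutes of continuous load.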
Issues Encountered
Several practical issues came up during setup:
- Docker networking and host access
- Missing or stale metrics
- Incorrect Prometheus scrape targets
- Dashboard data not updating correctly
These required debugging across multiple layers, including containers, networking, and configuration.
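For the scrape-target problems in particular, Prometheus reports the state of every target through its HTTP API at /api/v1/targets, which is often the quickest place to start. The Python sketch below filters such a response down to the targets that are not healthy. The helper names, the localhost:9090 address, and the sample response (including its error text) are assumptions for illustration; only the field names activeTargets, labels, scrapeUrl, health, and lastError follow the API's response format.

```python
import json
import urllib.request


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets API response. The default base_url is an assumption."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def unhealthy_targets(api_response: dict) -> list:
    """Return (job, scrape_url, last_error) for each active target not 'up'."""
    targets = api_response.get("data", {}).get("activeTargets", [])
    return [
        (
            t.get("labels", {}).get("job", "?"),
            t.get("scrapeUrl", "?"),
            t.get("lastError", ""),
        )
        for t in targets
        if t.get("health") != "up"
    ]


# Illustrative response, trimmed to the fields used above.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "cadvisor"},
                "scrapeUrl": "http://cadvisor:8080/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "node"},
                "scrapeUrl": "http://172.17.0.1:9100/metrics",
                "health": "down",
                "lastError": "example error text",
            },
        ]
    },
}

if __name__ == "__main__":
    for job, url, error in unhealthy_targets(SAMPLE_RESPONSE):
        print(f"{job} {url}: {error}")
```

Against a live instance, unhealthy_targets(fetch_targets()) would replace the sample data.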
Key Learnings
- Monitoring is essential once systems grow beyond a few services
- Prometheus requires careful configuration of scrape targets
- Docker networking can introduce unexpected complexity
- Dashboards should prioritise clarity over quantity
What’s Next
The next step is exploring storage and data management, including RAID, filesystem design, and ensuring data integrity.