The CertWatch Agent exposes Prometheus metrics and Kubernetes-compatible health endpoints for comprehensive observability.

Enabling Metrics

Metrics and health endpoints are enabled by default on port 8080. Configure via:
agent:
  metrics_port: 8080  # Set to 0 to disable
Or via environment variable:
export CW_METRICS_PORT=8080

Prometheus Metrics

Metrics are exposed at http://localhost:8080/metrics in Prometheus format.

Certificate Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| certwatch_certificate_days_until_expiry | Gauge | hostname, port | Days until certificate expires |
| certwatch_certificate_valid | Gauge | hostname, port | Certificate validity (1=valid, 0=invalid) |
| certwatch_certificate_chain_valid | Gauge | hostname, port | Chain validity (1=valid, 0=invalid) |
| certwatch_certificate_expiry_timestamp_seconds | Gauge | hostname, port | Certificate expiry as Unix timestamp |
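As a quick way to inspect these gauges without a Prometheus server, the sketch below parses Prometheus text exposition output and lists certificates that are close to expiry. It is a minimal illustration, not part of the agent: the helper names `parse_metrics` and `expiring_within` and the sample hostnames are hypothetical, and the parser handles only simple `name{labels} value` lines.

```python
import re

# Matches lines such as: metric_name{label="value",other="x"} 12.5
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def expiring_within(samples, days):
    """Return (hostname, port, days_left) tuples below `days`, soonest first."""
    hits = [
        (labels.get("hostname"), labels.get("port"), value)
        for name, labels, value in samples
        if name == "certwatch_certificate_days_until_expiry" and value < days
    ]
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical sample; real output comes from http://localhost:8080/metrics.
SAMPLE = """\
# TYPE certwatch_certificate_days_until_expiry gauge
certwatch_certificate_days_until_expiry{hostname="example.com",port="443"} 12
certwatch_certificate_days_until_expiry{hostname="internal.example.com",port="8443"} 90
"""

if __name__ == "__main__":
    for hostname, port, days_left in expiring_within(parse_metrics(SAMPLE), 30):
        print(f"{hostname}:{port} expires in {days_left:g} days")
```

To run it against a live agent, replace `SAMPLE` with the body fetched from the metrics URL, for example with `urllib.request.urlopen`.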

Scan Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| certwatch_scan_total | Counter | status | Total scans by status (success/failure) |
| certwatch_scan_duration_seconds | Histogram | (none) | Time taken to complete scans |

Sync Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| certwatch_sync_total | Counter | status | Total syncs by status (success/failure) |
| certwatch_sync_duration_seconds | Histogram | (none) | Time taken to sync with cloud |

Heartbeat Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| certwatch_heartbeat_total | Counter | status | Total heartbeats by status |

Agent Info

| Metric | Type | Labels | Description |
|---|---|---|---|
| certwatch_agent_info | Gauge | version, name, agent_id | Agent metadata |
| certwatch_agent_certificates_configured | Gauge | (none) | Number of certificates configured |

Health Endpoints

The agent exposes Kubernetes-compatible health check endpoints:
| Endpoint | Description | Use Case |
|---|---|---|
| /healthz | Basic liveness check | Always returns OK if server is running |
| /readyz | Readiness probe | Returns 503 during initialization |
| /livez | Deep liveness check | Returns 503 if no successful scans in 10 minutes |
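The endpoints can also be checked by hand, which helps when debugging probe failures. The following sketch requests each endpoint and reports the HTTP status code. It assumes the default port 8080 and treats HTTP 200 as the "OK" response; the function names `probe` and `check_agent` are illustrative and not part of the agent.

```python
import urllib.error
import urllib.request


def probe(base_url, path, timeout=2.0):
    """Return the HTTP status code for base_url + path, or None if unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # non-2xx responses such as 503 are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, or DNS failure


def check_agent(base_url="http://localhost:8080"):
    """Probe the three health endpoints and map each path to its status code."""
    return {path: probe(base_url, path) for path in ("/healthz", "/readyz", "/livez")}


if __name__ == "__main__":
    for path, status in check_agent().items():
        if status is None:
            state = "unreachable"
        elif status == 200:
            state = "ok"
        else:
            state = f"HTTP {status}"
        print(f"{path}: {state}")
```

A 503 from /readyz shortly after startup is expected. A 503 from /livez while /healthz still responds suggests the server is running but scans are not succeeding.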

Kubernetes Probe Configuration

livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
The Helm chart configures these probes for you.

Heartbeat & Offline Alerts

The agent sends periodic heartbeats to CertWatch to enable offline detection:
agent:
  heartbeat_interval: 30s  # Set to 0 to disable
When heartbeats stop arriving, CertWatch can alert you that an agent has gone offline. This is useful for:
  • Detecting network issues between agent and cloud
  • Monitoring agent health across distributed infrastructure
  • Alerting when agents crash or are terminated
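To make the idea of offline detection concrete, the sketch below shows one way a receiver could decide that an agent is offline based on its last heartbeat. This is an assumption for illustration only: the threshold of three missed intervals and the function name `is_offline` are hypothetical, and the actual rule CertWatch applies is not described here.

```python
from datetime import datetime, timedelta, timezone


def is_offline(last_heartbeat, now, interval=timedelta(seconds=30), missed_allowed=3):
    """Return True once more than `missed_allowed` heartbeat intervals have elapsed.

    `interval` mirrors the agent's heartbeat_interval setting; `missed_allowed`
    is a hypothetical tolerance for late or dropped heartbeats.
    """
    return now - last_heartbeat > interval * missed_allowed


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    # 45 seconds of silence is within the 90-second tolerance.
    print(is_offline(now - timedelta(seconds=45), now))
    # 120 seconds of silence exceeds the tolerance.
    print(is_offline(now - timedelta(seconds=120), now))
```

Tolerating a few missed intervals avoids flagging an agent as offline because of a single delayed heartbeat.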

Prometheus Scrape Config

Add the agent to your Prometheus configuration:
scrape_configs:
  - job_name: 'certwatch-agent'
    static_configs:
      - targets: ['localhost:8080']
    scrape_interval: 30s

Kubernetes ServiceMonitor

If using the Prometheus Operator with our Helm chart:
# values.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  labels:
    release: prometheus  # Match your Prometheus selector
See the Kubernetes deployment guide for full details.

Alerting Examples

Prometheus Alerting Rules

groups:
  - name: certwatch
    rules:
      # Certificate expiring soon
      - alert: CertificateExpiringSoon
        expr: certwatch_certificate_days_until_expiry < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate expiring soon"
          description: "{{ $labels.hostname }}:{{ $labels.port }} expires in {{ $value }} days"

      # Certificate expiring very soon
      - alert: CertificateExpiringCritical
        expr: certwatch_certificate_days_until_expiry < 7
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Certificate expiring critically soon"
          description: "{{ $labels.hostname }}:{{ $labels.port }} expires in {{ $value }} days"

      # Certificate invalid
      - alert: CertificateInvalid
        expr: certwatch_certificate_valid == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Certificate is invalid"
          description: "{{ $labels.hostname }}:{{ $labels.port }} certificate validation failed"

      # Chain validation failed
      - alert: CertificateChainInvalid
        expr: certwatch_certificate_chain_valid == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Certificate chain validation failed"
          description: "{{ $labels.hostname }}:{{ $labels.port }} has chain issues"

      # Agent not scanning
      - alert: AgentNotScanning
        expr: sum without (status) (rate(certwatch_scan_total[10m])) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CertWatch agent not scanning"
          description: "No scans have occurred in the last 15 minutes"
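The expiry thresholds in these rules can be checked offline before deploying them. The sketch below converts an expiry timestamp (as reported by certwatch_certificate_expiry_timestamp_seconds) into days remaining and maps it to the severity tiers used above. It is a simplified model: the helper names are hypothetical, the `for:` durations are not modelled, and the agent's own rounding of days_until_expiry may differ from the fractional value computed here.

```python
import time

SECONDS_PER_DAY = 86400


def days_until(expiry_timestamp, now=None):
    """Convert a Unix expiry timestamp into fractional days remaining."""
    if now is None:
        now = time.time()
    return (expiry_timestamp - now) / SECONDS_PER_DAY


def expiry_severity(days_left, warning_days=30, critical_days=7):
    """Return the highest severity whose threshold is crossed, or None."""
    if days_left < critical_days:
        return "critical"
    if days_left < warning_days:
        return "warning"
    return None


if __name__ == "__main__":
    now = 1_700_000_000  # fixed reference time so the output is reproducible
    for offset_days in (3, 10, 60):
        expiry = now + offset_days * SECONDS_PER_DAY
        print(offset_days, expiry_severity(days_until(expiry, now)))
```

In Prometheus itself, a certificate below 7 days satisfies both expressions, so the warning and critical alerts fire together unless Alertmanager inhibition is configured.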

Grafana Dashboard

Create a Grafana dashboard with these panels:

Certificate Overview

# Certificates by days until expiry
sort_desc(certwatch_certificate_days_until_expiry)

# Certificates expiring within 30 days
count(certwatch_certificate_days_until_expiry < 30)

# Invalid certificates
count(certwatch_certificate_valid == 0)

Agent Health

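# Scan success rate (aggregate both sides so the success series is divided by all scans)
sum(rate(certwatch_scan_total{status="success"}[5m])) / sum(rate(certwatch_scan_total[5m]))

# Sync success rate (aggregate both sides so the success series is divided by all syncs)
sum(rate(certwatch_sync_total{status="success"}[5m])) / sum(rate(certwatch_sync_total[5m]))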

# Scan duration (p99)
histogram_quantile(0.99, rate(certwatch_scan_duration_seconds_bucket[5m]))
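The same queries can be run outside Grafana through the Prometheus HTTP API, for example in a scheduled report. The sketch below builds an instant-query URL for `/api/v1/query` and parses the JSON response. The Prometheus address `http://localhost:9090` and the sample response are assumptions for illustration, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def query_url(prometheus_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{params}"


def parse_vector(payload):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample value is a [timestamp, "value-as-string"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def instant_query(prometheus_url, promql, timeout=5.0):
    """Run an instant query against a Prometheus server and return parsed samples."""
    with urllib.request.urlopen(query_url(prometheus_url, promql), timeout=timeout) as response:
        return parse_vector(json.load(response))


# Hypothetical response for: count(certwatch_certificate_valid == 0)
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000.0, "2"]}],
    },
}

if __name__ == "__main__":
    print(query_url("http://localhost:9090", "count(certwatch_certificate_valid == 0)"))
    print(parse_vector(SAMPLE_RESPONSE))
```

Calling `instant_query` requires a reachable Prometheus server. An empty result list means no series matched, which for a `count(...)` query indicates that nothing met the condition.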

Disabling Observability

To run without metrics and health endpoints:
agent:
  metrics_port: 0
  heartbeat_interval: 0
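Setting metrics_port to 0 turns off the health endpoints along with the metrics, so Kubernetes HTTP probes against /healthz, /readyz, and /livez can no longer be used and Kubernetes can only detect that the agent process has exited. Setting heartbeat_interval to 0 stops heartbeats, so CertWatch can no longer detect that the agent has gone offline.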