Phase 1 - Observability with Prometheus, Grafana & Loki

Series: Kubernetes Homelab on VMware Workstation
Prerequisites: Phase 0 - Argo CD & GitOps complete
Source Code: jmartinez-homelab-gitops

What We’re Building

By the end of this guide, you will have:

  • Prometheus collecting metrics from nodes, pods, and Kubernetes objects
  • Grafana with dashboards for cluster and application monitoring
  • Loki + Promtail for centralized log aggregation
  • All accessible via Traefik ingress at grafana.lab.local

Why Observability

You can’t manage what you can’t measure. In Kubernetes, you need visibility into:

  • Metrics — CPU, memory, network, request rates, error rates
  • Logs — Application output, system events, errors
  • Dashboards — Visual representation of cluster health

Prometheus + Grafana is the de facto standard for Kubernetes monitoring. Loki provides logging without the resource overhead of the ELK stack — ideal for a homelab.

Before You Start

Verify Phase 0 is complete:

# Argo CD running
kubectl get pods -n argocd
 
# Online Boutique deployed and synced
kubectl get application -n argocd
# NAME              SYNC STATUS   HEALTH STATUS
# online-boutique   Synced        Healthy

Step 0: Resize VMs

The monitoring stack needs more memory than the default VM allocations provide. If you’re running VMware Workstation on a laptop with 40 GB of RAM, allocate:

| VM | Memory | Rationale |
|---|---|---|
| k3s-server | 8 GB | Control-plane + Argo CD + scheduling |
| k3s-agent-1 | 4 GB | Monitoring stack (Prometheus, Grafana) |
| k3s-agent-2 | 4 GB | Application workloads |
| k3s-agent-3 | 4 GB | Stateful workloads + overflow |
| Host | ~20 GB | VMware + OS overhead |

Total VM allocation: 20 GB, which leaves about 20 GB for the host OS and VMware itself.
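
If your laptop or VM sizes differ, the same arithmetic is easy to redo. The helper below is only an illustrative sketch (the function name vm_budget is made up for this guide): it sums the per-VM allocations and reports what remains for the host.

```shell
# Sum per-VM memory allocations (in whole GB) and report the host remainder.
# Usage: vm_budget HOST_TOTAL_GB VM_GB [VM_GB ...]
vm_budget() {
  host_total=$1
  shift
  vm_total=0
  for gb in "$@"; do
    vm_total=$((vm_total + gb))
  done
  echo "VMs: ${vm_total} GB, host remainder: $((host_total - vm_total)) GB"
}

# Values from the table above: one 8 GB server and three 4 GB agents on a 40 GB laptop.
vm_budget 40 8 4 4 4
# prints "VMs: 20 GB, host remainder: 20 GB"
```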

How to Resize

  1. Shut down the VM: sudo shutdown -h now
  2. VMware Workstation → right-click VM → Settings → Hardware → Memory
  3. Adjust to recommended value
  4. Start the VM
  5. Verify: kubectl describe node <node-name> | grep -A 5 Capacity
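
kubectl reports memory capacity in Ki, which is awkward to compare with the GB values set in VMware. The sketch below shows one way to convert it; ki_to_gib is a hypothetical helper name, and the commented kubectl pipeline assumes a reachable cluster. Reported capacity is typically a little below the configured VM memory, so expect something like 7.7 rather than exactly 8.0.

```shell
# Convert a memory quantity in Ki, as kubectl prints it, to GiB with one decimal.
# Usage: ki_to_gib 8388608Ki
ki_to_gib() {
  echo "$1" | awk '{ sub(/Ki$/, ""); printf "%.1f\n", $0 / 1048576 }'
}

# Against a live cluster (not run here), list every node's capacity in GiB:
# kubectl get nodes --no-headers \
#   -o custom-columns=NAME:.metadata.name,MEM:.status.capacity.memory |
#   while read -r name mem; do echo "$name $(ki_to_gib "$mem") GiB"; done

ki_to_gib 8388608Ki   # exactly 8 GiB, prints 8.0
```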

Step 1: Install Helm

Helm is the package manager for Kubernetes — similar to apt or brew, but for cluster applications.

curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version

Step 2: Add Helm Repos

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Step 3: Deploy Prometheus + Grafana

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, kube-state-metrics, and node-exporter in a single install.

kubectl create namespace monitoring
 
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.resources.requests.memory=512Mi \
  --set prometheus.prometheusSpec.resources.limits.memory=1Gi \
  --set prometheus.prometheusSpec.resources.requests.cpu=200m \
  --set prometheus.prometheusSpec.resources.limits.cpu=500m \
  --set prometheus.prometheusSpec.retention=7d \
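  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=5Gi \
  --set 'prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce' \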
  --set grafana.resources.requests.memory=128Mi \
  --set grafana.resources.limits.memory=256Mi \
  --set grafana.resources.requests.cpu=100m \
  --set alertmanager.alertmanagerSpec.resources.requests.memory=128Mi \
  --set alertmanager.alertmanagerSpec.resources.limits.memory=256Mi \
  --set kube-state-metrics.resources.requests.memory=64Mi \
  --set prometheus-node-exporter.resources.requests.memory=32Mi

Resource limits are tuned for a homelab with ~18 GB total cluster RAM. Adjust if your setup differs.
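
A long chain of --set flags is hard to review and easy to mistype, so the same settings can live in a values file instead. The following is a sketch rather than the repository's actual prometheus-values.yaml: the file name kps-values.yaml is an assumption, and the kube-state-metrics and node-exporter settings are placed under their subchart names (kube-state-metrics, prometheus-node-exporter), which is where the chart reads subchart values from.

```shell
# Write a values file that mirrors the --set flags from the command above.
VALUES_FILE=${VALUES_FILE:-kps-values.yaml}   # assumed file name

cat > "$VALUES_FILE" <<'EOF'
prometheus:
  prometheusSpec:
    retention: 7d
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          # Listed explicitly; the troubleshooting section notes that
          # Prometheus may not deploy without accessModes.
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi
grafana:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      memory: 256Mi
alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        memory: 128Mi
      limits:
        memory: 256Mi
# Subchart settings live under the subchart's own name.
kube-state-metrics:
  resources:
    requests:
      memory: 64Mi
prometheus-node-exporter:
  resources:
    requests:
      memory: 32Mi
EOF

# Then install with the file instead of the flags (needs a cluster, not run here):
# helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
#   --namespace monitoring -f "$VALUES_FILE"
```

Keeping the values in a file also makes the later move to GitOps simpler, since the file can be committed as-is.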

What this deploys:

| Component | Purpose |
|---|---|
| Prometheus | Scrapes and stores time-series metrics |
| Grafana | Dashboarding and visualization |
| Alertmanager | Routes alerts to notification channels |
| kube-state-metrics | Exposes Kubernetes object states as metrics |
| node-exporter | DaemonSet collecting hardware/OS metrics from each node |

Verify:

kubectl get pods -n monitoring
# All pods should be Running

Step 4: Access Grafana

Get the Admin Password

kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo

Default username: admin

Option A: Port-forward (quick)

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Open http://localhost:3000

Option B: Ingress (permanent)

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
  rules:
    - host: grafana.lab.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kube-prometheus-stack-grafana
                port:
                  number: 80
EOF

Update your /etc/hosts file:

<NODE_IP> boutique.lab.local argocd.lab.local grafana.lab.local

Access at http://grafana.lab.local
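
If the hosts file already has entries from Phase 0, only the new hostname needs adding. The helper below is a hedged sketch of an idempotent update: add_hosts_entry is a hypothetical name, 192.0.2.10 is a placeholder standing in for <NODE_IP>, and the demo writes to a scratch copy rather than the real /etc/hosts (which requires root).

```shell
# Append any hostnames that are not yet present in a hosts file.
# Usage: add_hosts_entry FILE IP HOST [HOST ...]
add_hosts_entry() {
  file=$1; ip=$2; shift 2
  missing=""
  for host in "$@"; do
    # -F: match the name literally, -w: match whole words only
    grep -qwF -- "$host" "$file" 2>/dev/null || missing="$missing $host"
  done
  if [ -n "$missing" ]; then
    echo "$ip$missing" >> "$file"
  fi
}

# Demo on a scratch copy that already holds the Phase 0 names.
hosts_copy=$(mktemp)
echo "192.0.2.10 boutique.lab.local argocd.lab.local" > "$hosts_copy"
add_hosts_entry "$hosts_copy" 192.0.2.10 boutique.lab.local argocd.lab.local grafana.lab.local
cat "$hosts_copy"
```

Running the function a second time leaves the file unchanged, so it is safe to re-run.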

Step 5: Deploy Loki + Promtail

Loki is a log aggregation system designed to be Prometheus-like but for logs. Promtail is the agent that ships logs from nodes to Loki.

# Loki — single-binary mode for small clusters
helm install loki grafana/loki \
  --namespace monitoring \
  --set deploymentMode=SingleBinary \
  --set loki.commonConfig.replication_factor=1 \
  --set loki.storage.type=filesystem \
  --set singleBinary.replicas=1 \
  --set singleBinary.resources.requests.memory=256Mi \
  --set singleBinary.resources.limits.memory=512Mi \
  --set singleBinary.resources.requests.cpu=100m \
  --set singleBinary.persistence.size=5Gi \
  --set monitoring.selfMonitoring.grafanaAgent.installOperator=false \
  --set gateway.enabled=false \
  --set read.replicas=0 \
  --set write.replicas=0 \
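  --set backend.replicas=0 \
  --set loki.useTestSchema=true \
  --set chunksCache.enabled=false \
  --set resultsCache.enabled=false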
 
# Promtail — collects and ships logs
helm install promtail grafana/promtail \
  --namespace monitoring \
  --set 'config.clients[0].url=http://loki.monitoring.svc:3100/loki/api/v1/push' \
  --set resources.requests.memory=64Mi \
  --set resources.limits.memory=128Mi
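
To confirm Loki accepts writes independently of Promtail, you can push a single log line by hand through the same push endpoint Promtail uses. This is a minimal sketch: loki_payload is a hypothetical helper, the job label smoke-test is arbitrary, and the message must not contain double quotes because no JSON escaping is done. If multi-tenancy (auth_enabled) is switched on in your Loki config, the request also needs an X-Scope-OrgID header.

```shell
# Build a Loki push-API payload with one stream and one log line.
# Usage: loki_payload JOB MESSAGE TIMESTAMP_NS
loki_payload() {
  printf '{"streams":[{"stream":{"job":"%s"},"values":[["%s","%s"]]}]}\n' "$1" "$3" "$2"
}

# Timestamps are Unix epoch nanoseconds, passed as a string.
loki_payload smoke-test "hello from curl" "$(date +%s)000000000"

# With a port-forward running (kubectl port-forward -n monitoring svc/loki 3100:3100),
# send it to Loki (needs a cluster, not run here):
# curl -s -X POST -H 'Content-Type: application/json' \
#   --data "$(loki_payload smoke-test 'hello from curl' "$(date +%s)000000000")" \
#   http://localhost:3100/loki/api/v1/push
```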

Step 6: Connect Loki to Grafana

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki.monitoring.svc:3100
        isDefault: false
        editable: true
EOF

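The Grafana sidecar that ships with kube-prometheus-stack watches for ConfigMaps labeled grafana_datasource: "1" and loads this one as a Loki data source, so no manual configuration in the Grafana UI should be needed.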

Step 7: Import Dashboards

In Grafana → Dashboards → Import, enter these community dashboard IDs:

| Dashboard | ID | Description |
|---|---|---|
| Kubernetes Cluster Monitoring | 315 | Node/pod CPU, memory, network overview |
| Node Exporter Full | 1860 | Detailed hardware and OS metrics |
| Kubernetes Pods | 6417 | Per-pod resource usage |
| Loki Logs | 13639 | Log search and filtering interface |
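
The same dashboards can also be fetched from the command line, which is useful when they are later stored as ConfigMaps in Git. The sketch below assumes the download URL pattern that grafana.com exposes for community dashboards; dashboard_url is a hypothetical helper, and the ConfigMap and file names in the comments are illustrative only. Downloaded JSON may contain data source placeholders such as ${DS_PROMETHEUS} that need replacing before the dashboard renders.

```shell
# Build the grafana.com download URL for the latest revision of a dashboard ID.
# Usage: dashboard_url 1860
dashboard_url() {
  echo "https://grafana.com/api/dashboards/$1/revisions/latest/download"
}

dashboard_url 1860

# Download and hand it to the Grafana sidecar (needs network and a cluster, not run here):
# curl -fsSL "$(dashboard_url 1860)" -o node-exporter-full.json
# kubectl create configmap grafana-dashboard-1860 -n monitoring \
#   --from-file=node-exporter-full.json
# kubectl label configmap grafana-dashboard-1860 -n monitoring grafana_dashboard=1
```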

Step 8: Verify

# All monitoring pods running
kubectl get pods -n monitoring
 
# Prometheus scraping targets
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets — all targets should be UP
 
# Loki is ready
kubectl port-forward -n monitoring svc/loki 3100:3100
curl -s http://localhost:3100/ready
 
# Resource usage
kubectl top nodes
kubectl top pods -n monitoring
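
Scanning a long pod list by eye is error-prone, so a small filter can print only the pods that need attention. This is a sketch that assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE); unhealthy_pods is a hypothetical helper name. Note that a Running status does not guarantee every container is ready.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print every pod
# whose STATUS column is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Sample input standing in for real kubectl output:
printf '%s\n' \
  'loki-0                1/1  Running  0  5m' \
  'loki-chunks-cache-0   0/2  Pending  0  5m' | unhealthy_pods
# prints "loki-chunks-cache-0 (Pending)"

# Against a live cluster (not run here):
# kubectl get pods -n monitoring --no-headers | unhealthy_pods
```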

Expected Resource Usage (Homelab Optimized)

| Component | Memory | CPU |
|---|---|---|
| Prometheus | 256 – 512 Mi | 100m – 250m |
| Grafana | 64 – 128 Mi | 50m |
| Alertmanager | 64 – 128 Mi | |
| Loki | 128 – 256 Mi | 50m |
| Promtail | 64 – 128 Mi | |
| node-exporter (4 pods) | ~64 Mi | ~40m |
| kube-state-metrics | 32 – 64 Mi | 10m – 50m |
| Total | ~0.56 – 1.2 Gi | ~200m – 400m |

Note: Original values were ~1.2 – 2.3 Gi memory. Reduced for homelab with limited resources.

Access Summary

| Service | URL | Credentials |
|---|---|---|
| Online Boutique | http://boutique.lab.local | |
| Argo CD UI | http://argocd.lab.local | admin / bootstrap password |
| Grafana | http://grafana.lab.local | admin / helm-generated password |

Troubleshooting

| Issue | Fix |
|---|---|
| Grafana ingress returns 404 | Verify ingress exists: kubectl get ingress -n monitoring |
| Prometheus targets DOWN | Check pod logs: kubectl logs -n monitoring statefulset/prometheus-kube-prometheus-stack-prometheus -c prometheus |
| Loki not receiving logs | Verify Promtail: kubectl logs -n monitoring daemonset/promtail |
| Pods stuck in Pending | Node out of resources: kubectl describe pod <name> -n monitoring |
| Argo CD stuck Progressing | Delete and recreate: kubectl delete application monitoring -n argocd |
| StatefulSet update error | Delete StatefulSet: kubectl delete statefulset <name> -n monitoring |

Full troubleshooting guide: Fix: Monitoring Stack Sync & Resource Issues

Common Issues & Solutions (GitOps Implementation)


1. Loki Schema Configuration Error

Error: You must provide a schema_config for Loki
Solution: Add loki.useTestSchema: true to the Loki Helm values for testing:

loki:
  useTestSchema: true
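
For anything longer-lived than a test, an explicit schema is the usual alternative to useTestSchema. The sketch below writes a values fragment using the chart's loki.schemaConfig key with a TSDB index on filesystem storage, matching the storage type used earlier. The file name and the from date are placeholders, not values taken from this repository; check the schema version against the Loki release you deploy.

```shell
# Write an example values fragment with an explicit Loki schema.
SCHEMA_FILE=${SCHEMA_FILE:-loki-schema-values.yaml}   # assumed file name

cat > "$SCHEMA_FILE" <<'EOF'
loki:
  schemaConfig:
    configs:
      - from: "2024-04-01"        # placeholder start date for this schema period
        store: tsdb
        object_store: filesystem  # matches loki.storage.type=filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
EOF

# Pass it alongside the other Loki values (needs a cluster, not run here):
# helm upgrade --install loki grafana/loki --namespace monitoring -f "$SCHEMA_FILE"
```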

1a. Loki chunks-cache Stuck in Pending

Error: loki-chunks-cache-0 stuck in Pending, Argo CD waiting for healthy state
Solution: Disable chunks-cache and results-cache in the Loki values:

chunksCache:
  enabled: false
resultsCache:
  enabled: false

Then delete existing StatefulSets: kubectl delete statefulset loki-chunks-cache loki-results-cache -n monitoring

2. Kustomize helmCharts CRDs Not Installed

Error: The Kubernetes API could not find monitoring.coreos.com/PrometheusRule
Solution: Add includeCRDs: true to each helmCharts entry in kustomization.yaml:

helmCharts:
  - name: kube-prometheus-stack
    includeCRDs: true
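
For context, a fuller helmCharts entry might look like the sketch below. The repo URLs come from Step 2 and the values file names from the overlay layout described in this guide; the release names and the output file name are assumptions, and no chart version is pinned here although pinning one is generally advisable. Kustomize only renders helmCharts when Helm support is enabled, which in Argo CD is typically done by setting kustomize.buildOptions: --enable-helm in the argocd-cm ConfigMap.

```shell
# Write an example kustomization with two Helm charts and CRDs included.
KUSTOMIZATION_FILE=${KUSTOMIZATION_FILE:-kustomization.example.yaml}   # assumed name

cat > "$KUSTOMIZATION_FILE" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
helmCharts:
  - name: kube-prometheus-stack
    repo: https://prometheus-community.github.io/helm-charts
    releaseName: kube-prometheus-stack   # assumed release name
    namespace: monitoring
    valuesFile: prometheus-values.yaml
    includeCRDs: true
  - name: loki
    repo: https://grafana.github.io/helm-charts
    releaseName: loki                    # assumed release name
    namespace: monitoring
    valuesFile: loki-values.yaml
    includeCRDs: true
EOF

# Render locally to check the output (needs kustomize and helm, not run here):
# kustomize build --enable-helm .
```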

3. Large CRDs Failing with Annotation Size Limit

Error: metadata.annotations: Too long: may not be more than 262144 bytes
Solution: Enable ServerSideApply in the Argo CD Application:

syncOptions:
  - CreateNamespace=true
  - ServerSideApply=true
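
The limit bites because client-side apply stores a copy of the whole object in the kubectl.kubernetes.io/last-applied-configuration annotation, and annotations are capped at 262144 bytes in total. Server-side apply does not write that annotation. As a rough pre-check, the sketch below compares a manifest's file size with the cap; fits_client_side_apply is a hypothetical helper, and file size is only a proxy because the annotation holds a JSON serialization rather than the YAML file itself.

```shell
# Total size cap for metadata.annotations, in bytes (256 KiB).
ANNOTATION_LIMIT=262144

# Succeed if the manifest file is smaller than the annotation cap.
# Usage: fits_client_side_apply FILE
fits_client_side_apply() {
  size=$(wc -c < "$1" | tr -d ' ')
  [ "$size" -lt "$ANNOTATION_LIMIT" ]
}

# Example with a tiny manifest written to a scratch file.
manifest=$(mktemp)
echo 'kind: ConfigMap' > "$manifest"
if fits_client_side_apply "$manifest"; then
  echo "small enough for client-side apply"
else
  echo "use ServerSideApply=true"
fi
```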

4. Argo CD IngressRoute Configuration

Issue: Login page refreshes back to login, 404 errors
Solution: Use port 80 (HTTP) instead of 443 with serversTransport. Also enable insecure mode on the Argo CD server:

# Set server.insecure in argocd-cmd-params-cm
kubectl patch configmap -n argocd argocd-cmd-params-cm \
  --type merge -p '{"data":{"server.insecure":"true"}}'
kubectl rollout restart deployment -n argocd argocd-server

Updated IngressRoute should use port 80:

services:
  - name: argocd-server
    port: 80  # Not 443

5. Dashboard ConfigMap Annotation Size

Error: ConfigMap "grafana-dashboard-1860" is invalid: metadata.annotations: Too long
Solution: This is a known issue with large dashboard JSON. Consider using the Grafana sidecar with the label grafana_dashboard: "1", which doesn’t have this limitation.

6. Pods Stuck in Pending - Insufficient Memory

Error: 0/4 nodes are available: 4 Insufficient memory
Solution: Reduce memory requests/limits for the homelab. See the "Expected Resource Usage (Homelab Optimized)" table above.

7. StatefulSet Update Forbidden

Error: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy'... are forbidden
Solution: StatefulSets have immutable fields. Delete and recreate:

kubectl delete statefulset <name> -n monitoring
# Argo CD will recreate with new configuration

8. Prometheus PVC Missing accessModes

Error: Prometheus pod not deploying
Solution: Add accessModes: [ReadWriteOnce] to volumeClaimTemplate:

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce

9. Argo CD Sync Stuck on Old Revision

Error: Sync operation stuck on old commit, not updating to latest
Solution: Delete and recreate the Argo CD Application:

kubectl delete application monitoring -n argocd
kubectl apply -f bootstrap/argocd/apps/monitoring-app.yaml

GitOps Integration (Implemented)

The monitoring stack is managed via Argo CD using Kustomize overlays. Actual structure:

infrastructure/
└── monitoring/
    ├── base/
    │   ├── kustomization.yaml
    │   └── namespace.yaml
    └── overlays/lab/
        ├── kustomization.yaml           # Includes helmCharts with includeCRDs: true
        ├── prometheus-values.yaml       # kube-prometheus-stack values
        ├── loki-values.yaml             # Loki values (with useTestSchema: true)
        ├── promtail-values.yaml         # Promtail values
        ├── ingress.yaml                 # Grafana ingress
        ├── loki-datasource.yaml         # Grafana datasource ConfigMap
        └── dashboard-*.yaml             # Community dashboard ConfigMaps

Argo CD Application (bootstrap/argocd/apps/monitoring-app.yaml):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/james-martinez0/jmartinez-homelab-gitops.git
    targetRevision: HEAD
    path: infrastructure/monitoring/overlays/lab
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
    - ServerSideApply=true

Key configurations:

  • includeCRDs: true in helmCharts to install Prometheus CRDs
  • loki.useTestSchema: true for testing without schema config
  • ServerSideApply=true in syncOptions to handle large CRD annotations