Phase 1 - Observability with Prometheus, Grafana & Loki

Series: Kubernetes Homelab on VMware Workstation
Prerequisites: Phase 0 - Argo CD & GitOps complete
Source Code: jmartinez-homelab-gitops

What We’re Building

By the end of this guide, you will have:

Prometheus collecting metrics from nodes, pods, and Kubernetes objects
Grafana with dashboards for cluster and application monitoring
Loki + Promtail for centralized log aggregation
Grafana reachable through the Traefik ingress at grafana.lab.local

Why Observability

You can’t manage what you can’t measure. In Kubernetes, you need visibility into:

Metrics — CPU, memory, network, request rates, error rates
Logs — Application output, system events, errors
Dashboards — Visual representation of cluster health

Prometheus + Grafana is the de facto standard for Kubernetes monitoring. Loki provides logging without the resource overhead of the ELK stack — ideal for a homelab.

Before You Start

Verify Phase 0 is complete:

# Argo CD running
kubectl get pods -n argocd
 
# Online Boutique deployed and synced
kubectl get application -n argocd
# NAME              SYNC STATUS   HEALTH STATUS
# online-boutique   Synced        Healthy

Step 0: Resize VMs

The monitoring stack needs more memory than the default VM allocations provide. If you’re running VMware Workstation on a laptop with 40 GB of RAM, allocate:

VM	Memory	Rationale
k3s-server	8 GB	Control-plane + Argo CD + scheduling
k3s-agent-1	4 GB	Monitoring stack (Prometheus, Grafana)
k3s-agent-2	4 GB	Application workloads
k3s-agent-3	4 GB	Stateful workloads + overflow
Host	~20 GB	VMware + OS overhead

Total VM allocation: 20 GB, which leaves about 20 GB for the host OS and VMware.
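The budget above is simple arithmetic, but it is worth re-running if your laptop or VM sizes differ. A minimal sketch, using the table's VM sizes and the 40 GB host as inputs (the variable names are illustrative):

```shell
#!/usr/bin/env bash
# Sum the planned VM memory and report what is left for the host.
# Sizes are in GB and come from the table above; edit them to match your plan.
host_total_gb=40
vm_plan="k3s-server:8 k3s-agent-1:4 k3s-agent-2:4 k3s-agent-3:4"

vm_total_gb=0
for entry in $vm_plan; do
  size_gb=${entry##*:}          # keep the text after the last colon
  vm_total_gb=$(( vm_total_gb + size_gb ))
done

host_left_gb=$(( host_total_gb - vm_total_gb ))
echo "VMs: ${vm_total_gb} GB, host: ${host_left_gb} GB"
# prints "VMs: 20 GB, host: 20 GB"
```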

How to Resize

Shut down the VM: sudo shutdown -h now
VMware Workstation → right-click VM → Settings → Hardware → Memory
Adjust to recommended value
Start the VM
Verify: kubectl describe node <node-name> | grep -A 5 Capacity
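To check every node in one pass instead of describing them one by one, something like the following may help. It is a sketch that assumes kubectl is already configured for the cluster and that capacity is reported in Ki, which is the usual unit; the function names are made up for this example.

```shell
#!/usr/bin/env bash
# Convert a Kubernetes memory quantity in Ki (e.g. "8388608Ki") to GiB.
ki_to_gib() {
  awk -v q="${1%Ki}" 'BEGIN { printf "%.1f\n", q / 1048576 }'
}

# Print each node with its memory capacity in GiB.
# Assumes kubectl access and that capacity is reported in Ki.
show_node_memory() {
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,MEM:.status.capacity.memory |
  while read -r name mem; do
    printf '%-14s %s GiB\n' "$name" "$(ki_to_gib "$mem")"
  done
}

# Usage: show_node_memory
```

Expect each figure to come out slightly below the VM setting, since the guest kernel reserves some memory for itself.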

Step 1: Install Helm

Helm is the package manager for Kubernetes — similar to apt or brew, but for cluster applications.

curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version

Step 2: Add Helm Repos

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Step 3: Deploy Prometheus + Grafana

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, kube-state-metrics, and node-exporter in a single install.

kubectl create namespace monitoring
 
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.resources.requests.memory=512Mi \
  --set prometheus.prometheusSpec.resources.limits.memory=1Gi \
  --set prometheus.prometheusSpec.resources.requests.cpu=200m \
  --set prometheus.prometheusSpec.resources.limits.cpu=500m \
  --set prometheus.prometheusSpec.retention=7d \
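  --set 'prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]=ReadWriteOnce' \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=5Gi \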
  --set grafana.resources.requests.memory=128Mi \
  --set grafana.resources.limits.memory=256Mi \
  --set grafana.resources.requests.cpu=100m \
  --set alertmanager.alertmanagerSpec.resources.requests.memory=128Mi \
  --set alertmanager.alertmanagerSpec.resources.limits.memory=256Mi \
  --set kube-state-metrics.resources.requests.memory=64Mi \
  --set prometheus-node-exporter.resources.requests.memory=32Mi

Resource limits are tuned for a homelab with ~18 GB total cluster RAM. Adjust if your setup differs.
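A long list of --set flags is hard to review and easy to mistype, so the same settings can live in a values file instead. The sketch below writes one to a scratch directory; the file name and the install_stack helper are assumptions for illustration. Settings for kube-state-metrics and node-exporter sit under their subchart keys, kube-state-metrics and prometheus-node-exporter.

```shell
#!/usr/bin/env bash
# Write the resource settings from this step into a values file.
# The location is an assumption; any path works with `helm -f`.
values_file="$(mktemp -d)/prometheus-values.yaml"

cat > "$values_file" <<'EOF'
prometheus:
  prometheusSpec:
    retention: 7d
    resources:
      requests: { cpu: 200m, memory: 512Mi }
      limits:   { cpu: 500m, memory: 1Gi }
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 5Gi
grafana:
  resources:
    requests: { cpu: 100m, memory: 128Mi }
    limits:   { memory: 256Mi }
alertmanager:
  alertmanagerSpec:
    resources:
      requests: { memory: 128Mi }
      limits:   { memory: 256Mi }
kube-state-metrics:
  resources:
    requests: { memory: 64Mi }
prometheus-node-exporter:
  resources:
    requests: { memory: 32Mi }
EOF

echo "Wrote $values_file"

# Install or upgrade the release from the file (needs helm and cluster access).
install_stack() {
  helm upgrade --install kube-prometheus-stack \
    prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    -f "$values_file"
}

# Usage: install_stack
```

Using helm upgrade --install keeps the command re-runnable: it installs the release the first time and applies changes on later runs.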

What this deploys:

Component	Purpose
Prometheus	Scrapes and stores time-series metrics
Grafana	Dashboarding and visualization
Alertmanager	Routes alerts to notification channels
kube-state-metrics	Exposes Kubernetes object states as metrics
node-exporter	DaemonSet collecting hardware/OS metrics from each node

Verify:

kubectl get pods -n monitoring
# All pods should be Running
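Scanning the pod list by eye works, but a small filter makes the check scriptable. This is a sketch only: not_ready_pods is a hypothetical helper that reads the standard kubectl get pods table from stdin.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods` output on stdin and print the names of pods whose
# STATUS column is neither Running nor Completed.
# Exit status is non-zero if any such pod is found.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1; bad = 1 }
       END { exit bad }'
}

# Usage (assumes kubectl points at the cluster):
#   kubectl get pods -n monitoring | not_ready_pods && echo "all pods OK"
```

If you would rather block until everything is up, kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s is an alternative.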

Step 4: Access Grafana

Get the Admin Password

kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo

Default username: admin

Option A: Port-forward (quick)

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Open http://localhost:3000

Option B: Ingress (permanent)

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
  rules:
    - host: grafana.lab.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kube-prometheus-stack-grafana
                port:
                  number: 80
EOF

Update your /etc/hosts file:

<NODE_IP> boutique.lab.local argocd.lab.local grafana.lab.local

Access at http://grafana.lab.local
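Editing /etc/hosts by hand tends to leave duplicate lines after a few phases of this series. One way to keep the edit idempotent is sketched below; hosts_line is a hypothetical helper and 192.168.1.50 is a placeholder IP, not a value from this lab.

```shell
#!/usr/bin/env bash
# Build the /etc/hosts entry for the lab hostnames used in this series.
hosts_line() {
  local node_ip=$1
  printf '%s boutique.lab.local argocd.lab.local grafana.lab.local\n' "$node_ip"
}

# Append the entry only if that exact line is not already present.
# 192.168.1.50 is a placeholder; substitute the IP of a node running Traefik.
#   line=$(hosts_line 192.168.1.50)
#   grep -qxF "$line" /etc/hosts || echo "$line" | sudo tee -a /etc/hosts
#
# Then check that Grafana answers through the ingress:
#   curl -s http://grafana.lab.local/api/health
```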

Step 5: Deploy Loki + Promtail

Loki is a log aggregation system designed to be Prometheus-like but for logs. Promtail is the agent that ships logs from nodes to Loki.

# Loki — single-binary mode for small clusters
helm install loki grafana/loki \
  --namespace monitoring \
  --set deploymentMode=SingleBinary \
  --set loki.commonConfig.replication_factor=1 \
  --set loki.storage.type=filesystem \
  --set singleBinary.replicas=1 \
  --set singleBinary.resources.requests.memory=256Mi \
  --set singleBinary.resources.limits.memory=512Mi \
  --set singleBinary.resources.requests.cpu=100m \
  --set singleBinary.persistence.size=5Gi \
  --set monitoring.selfMonitoring.grafanaAgent.installOperator=false \
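  --set gateway.enabled=false \
  --set loki.useTestSchema=true \
  --set chunksCache.enabled=false \
  --set resultsCache.enabled=false \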
  --set read.replicas=0 \
  --set write.replicas=0 \
  --set backend.replicas=0
 
# Promtail — collects and ships logs
helm install promtail grafana/promtail \
  --namespace monitoring \
  --set 'config.clients[0].url=http://loki.monitoring.svc:3100/loki/api/v1/push' \
  --set resources.requests.memory=64Mi \
  --set resources.limits.memory=128Mi

Step 6: Connect Loki to Grafana

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki.monitoring.svc:3100
        isDefault: false
        editable: true
EOF

The Grafana sidecar deployed by kube-prometheus-stack watches for ConfigMaps labeled grafana_datasource: "1", so Loki appears as a data source without manual setup.

Step 7: Import Dashboards

In Grafana → Dashboards → Import, enter these community dashboard IDs:

Dashboard	ID	Description
Kubernetes Cluster Monitoring	315	Node/pod CPU, memory, network overview
Node Exporter Full	1860	Detailed hardware and OS metrics
Kubernetes Pods	6417	Per-pod resource usage
Loki Logs	13639	Log search and filtering interface

Step 8: Verify

# All monitoring pods running
kubectl get pods -n monitoring
 
# Prometheus scraping targets (port-forward blocks, so run it in its own terminal)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets and check that the targets are UP
 
# Loki is ready (start the port-forward in a second terminal, then run curl in a third)
kubectl port-forward -n monitoring svc/loki 3100:3100
curl -s http://localhost:3100/ready
 
# Resource usage
kubectl top nodes
kubectl top pods -n monitoring
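The /ready check only shows that Loki has started. To confirm it can also ingest and return logs, you can push one line through its HTTP API and query it back. The sketch below assumes the Loki port-forward from this step is still running in another terminal; the function names and the smoke-test job label are arbitrary.

```shell
#!/usr/bin/env bash
# Build a Loki push payload containing one log line.
# Arguments: job label, message, timestamp in nanoseconds since the epoch.
loki_payload() {
  printf '{"streams":[{"stream":{"job":"%s"},"values":[["%s","%s"]]}]}\n' \
    "$1" "$3" "$2"
}

# Push one line, then query it back over the last hour (the API default).
# Assumes Loki is reachable on localhost:3100 via the port-forward.
loki_smoke_test() {
  local ts_ns
  ts_ns="$(date +%s)000000000"   # seconds padded to nanoseconds
  curl -s -H 'Content-Type: application/json' \
    -X POST http://localhost:3100/loki/api/v1/push \
    --data "$(loki_payload smoke-test 'hello from the homelab' "$ts_ns")"
  curl -s -G http://localhost:3100/loki/api/v1/query_range \
    --data-urlencode 'query={job="smoke-test"}'
}

# Usage: loki_smoke_test
```

If Loki is running with multi-tenancy enabled, both requests also need an X-Scope-OrgID header.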

Expected Resource Usage (Homelab Optimized)

Component	Memory	CPU
Prometheus	256 – 512 Mi	100m – 250m
Grafana	64 – 128 Mi	50m
Alertmanager	64 – 128 Mi	—
Loki	128 – 256 Mi	50m
Promtail	64 – 128 Mi	—
node-exporter (4 pods)	~64 Mi	~40m
kube-state-metrics	32 – 64 Mi	10m – 50m
Total	~0.66 – 1.25 Gi	~250m – 440m

Note: The original values totaled ~1.2 – 2.3 Gi of memory; they were reduced to fit a homelab with limited resources.

Access Summary

Service	URL	Credentials
Online Boutique	http://boutique.lab.local	—
Argo CD UI	http://argocd.lab.local	admin / bootstrap password
Grafana	http://grafana.lab.local	admin / helm-generated password

Troubleshooting

Issue	Fix
Grafana ingress returns 404	Verify ingress exists: `kubectl get ingress -n monitoring`
Prometheus targets DOWN	Check pod logs: `kubectl logs -n monitoring statefulset/prometheus-kube-prometheus-stack-prometheus -c prometheus`
Loki not receiving logs	Verify Promtail: `kubectl logs -n monitoring daemonset/promtail`
Pods stuck in Pending	Node out of resources: `kubectl describe pod <name> -n monitoring`
Argo CD stuck Progressing	Delete and recreate: `kubectl delete application monitoring -n argocd`
StatefulSet update error	Delete StatefulSet: `kubectl delete statefulset <name> -n monitoring`

Full troubleshooting guide: Fix: Monitoring Stack Sync & Resource Issues

Common Issues & Solutions (GitOps Implementation)

1. Loki Schema Configuration Error

Error: You must provide a schema_config for Loki
Solution: Add loki.useTestSchema: true to the Loki Helm values (suitable for testing):

loki:
  useTestSchema: true

1a. Loki chunks-cache Stuck in Pending

Error: loki-chunks-cache-0 stuck in Pending, Argo CD waiting for healthy state
Solution: Disable chunks-cache and results-cache in the Loki values:

chunksCache:
  enabled: false
resultsCache:
  enabled: false

Then delete existing StatefulSets: kubectl delete statefulset loki-chunks-cache loki-results-cache -n monitoring

2. Kustomize helmCharts CRDs Not Installed

Error: The Kubernetes API could not find monitoring.coreos.com/PrometheusRule
Solution: Add includeCRDs: true to each helmCharts entry in kustomization.yaml:

helmCharts:
  - name: kube-prometheus-stack
    includeCRDs: true
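Before pushing the change, you can render the overlay locally and confirm that CRDs now appear in the output. This is a sketch: count_crds is a hypothetical helper, and the usage line assumes a kustomize binary with Helm on the PATH and the repository layout used in this series.

```shell
#!/usr/bin/env bash
# Count CustomResourceDefinition documents in rendered YAML read from stdin.
count_crds() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c '^kind: CustomResourceDefinition$' || true
}

# Usage, from the repository root:
#   kustomize build --enable-helm infrastructure/monitoring/overlays/lab | count_crds
```

A count of 0 means the CRDs are still missing from the rendered manifests.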

3. Large CRDs Failing with Annotation Size Limit

Error: metadata.annotations: Too long: may not be more than 262144 bytes
Solution: Enable ServerSideApply in the Argo CD Application:

syncOptions:
  - CreateNamespace=true
  - ServerSideApply=true
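The limit is hit because a client-side apply stores a copy of the whole object in the kubectl.kubernetes.io/last-applied-configuration annotation, while server-side apply does not write that annotation. To get a rough idea of which manifests are at risk, compare their size to the limit. This sketch uses a made-up function name and a placeholder file name.

```shell
#!/usr/bin/env bash
# Kubernetes caps the total size of an object's annotations at 262144 bytes.
ANNOTATION_LIMIT_BYTES=262144

# Succeeds if the manifest file is larger than the limit, which suggests a
# client-side apply would fail. This is only a rough check: the stored
# annotation is JSON, so its size differs somewhat from the YAML file size.
too_big_for_client_apply() {
  local size
  size=$(( $(wc -c < "$1") ))    # arithmetic expansion strips any padding
  [ "$size" -gt "$ANNOTATION_LIMIT_BYTES" ]
}

# Usage (the file name is a placeholder):
#   too_big_for_client_apply some-crd.yaml && echo "use ServerSideApply=true"
```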

4. Argo CD IngressRoute Configuration

Issue: The Argo CD login page redirects back to itself, or the route returns 404.
Solution: Point the IngressRoute at port 80 (HTTP) rather than port 443 with a serversTransport, and run the Argo CD server in insecure mode so it serves plain HTTP:

# Set server.insecure in argocd-cmd-params-cm
kubectl patch configmap -n argocd argocd-cmd-params-cm \
  --type merge -p '{"data":{"server.insecure":"true"}}'
kubectl rollout restart deployment -n argocd argocd-server

Updated IngressRoute should use port 80:

services:
  - name: argocd-server
    port: 80  # Not 443

5. Dashboard ConfigMap Annotation Size

Error: ConfigMap "grafana-dashboard-1860" is invalid: metadata.annotations: Too long
Solution: Large dashboard JSON runs into the same annotation size limit as issue 3, because a client-side apply copies the whole object into an annotation. Sync with ServerSideApply=true, and let the Grafana sidecar load dashboards from ConfigMaps labeled grafana_dashboard: "1".

6. Pods Stuck in Pending - Insufficient Memory

Error: 0/4 nodes are available: 4 Insufficient memory
Solution: Reduce memory requests and limits to fit the homelab. See the Expected Resource Usage table above.

7. StatefulSet Update Forbidden

Error: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy'... are forbidden
Solution: StatefulSets have immutable fields. Delete and recreate:

kubectl delete statefulset <name> -n monitoring
# Argo CD will recreate with new configuration

8. Prometheus PVC Missing accessModes

Error: Prometheus pod not deploying
Solution: Add accessModes: [ReadWriteOnce] to the volumeClaimTemplate:

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce

9. Argo CD Sync Stuck on Old Revision

Error: Sync operation stuck on old commit, not updating to latest
Solution: Delete and recreate the Argo CD Application:

kubectl delete application monitoring -n argocd
kubectl apply -f bootstrap/argocd/apps/monitoring-app.yaml

GitOps Integration (Implemented)

The monitoring stack is managed via Argo CD using Kustomize overlays. Actual structure:

infrastructure/
└── monitoring/
    ├── base/
    │   ├── kustomization.yaml
    │   └── namespace.yaml
    └── overlays/lab/
        ├── kustomization.yaml           # Includes helmCharts with includeCRDs: true
        ├── prometheus-values.yaml       # kube-prometheus-stack values
        ├── loki-values.yaml             # Loki values (with useTestSchema: true)
        ├── promtail-values.yaml         # Promtail values
        ├── ingress.yaml                 # Grafana ingress
        ├── loki-datasource.yaml         # Grafana datasource ConfigMap
        └── dashboard-*.yaml             # Community dashboard ConfigMaps

Argo CD Application (bootstrap/argocd/apps/monitoring-app.yaml):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/james-martinez0/jmartinez-homelab-gitops.git
    targetRevision: HEAD
    path: infrastructure/monitoring/overlays/lab
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
    - ServerSideApply=true
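Once the Application is applied, its sync and health status can be read from the resource itself instead of the UI. A minimal sketch, assuming kubectl access and the Application name and namespace from the manifest above; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds only when the given status string is exactly "Synced Healthy".
is_synced_healthy() {
  [ "$1" = "Synced Healthy" ]
}

# Read the sync and health status of the Application from the cluster.
# Assumes kubectl access and the Application name/namespace shown above.
monitoring_app_status() {
  kubectl get application monitoring -n argocd \
    -o jsonpath='{.status.sync.status} {.status.health.status}'
}

# Usage:
#   is_synced_healthy "$(monitoring_app_status)" && echo "monitoring is in sync"
```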

Key configurations:

includeCRDs: true in helmCharts to install Prometheus CRDs
loki.useTestSchema: true for testing without schema config
ServerSideApply=true sync option to avoid the annotation size limit on large CRDs
