Fix: Monitoring Stack Sync & Resource Issues

Date: 2026-04-08
Context: Monitoring stack (Prometheus, Loki) stuck in Progressing/OutOfSync state in Argo CD
Source: jmartinez-homelab-gitops

Issues Encountered

  1. Argo CD Application stuck in Progressing state
  2. loki-chunks-cache pod stuck in Pending (0/2)
  3. kube-prometheus-stack-prometheus pod not deploying
  4. StatefulSet update forbidden error

Root Causes & Fixes

1. Missing kustomize.buildOptions for Helm Charts

Error: Kustomize failed to build with helmCharts

error: trouble configuring builtin HelmChartInflationGenerator
`: must specify --enable-helm

Root Cause: Kustomize only inflates the helmCharts entries in kustomization.yaml when it is invoked with --enable-helm, and Argo CD does not pass that flag by default.

Fix: Initially tried adding kustomize.buildOptions to the Argo CD Application, but that field is not part of the Application spec. Build options are an Argo CD-wide setting: the kustomize.buildOptions key in the argocd-cm ConfigMap.

Resolution: Removed the invalid field from the Application. ServerSideApply and includeCRDs: true do not enable Helm inflation on their own, so --enable-helm still has to reach Kustomize through Argo CD's own configuration.

# This is INVALID - do not use
spec:
  source:
    kustomize:
      buildOptions: "--enable-helm"
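
For contrast with the invalid Application field above, here is a minimal sketch of setting the option where Argo CD reads it. It assumes Argo CD runs in the argocd namespace and that kustomize.buildOptions is not already set to another value in argocd-cm, because a merge patch overwrites the existing value. The function name is illustrative only. If argocd-cm is itself managed from the GitOps repository, the same key belongs in that manifest instead of a live patch.

```shell
# Sketch: enable Helm inflation for every Kustomize build that Argo CD runs.
# Assumptions: namespace "argocd"; no other kustomize.buildOptions value to keep.
enable_kustomize_helm() {
  kubectl patch configmap argocd-cm -n argocd --type merge \
    -p '{"data":{"kustomize.buildOptions":"--enable-helm"}}'
}

# Usage:
#   enable_kustomize_helm
```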

2. Insufficient Memory for Pods

Error: Pods stuck in Pending with Insufficient memory

0/4 nodes are available: 4 Insufficient memory

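Root Cause: Memory requests were too high for the homelab nodes. The scheduler places pods by their requests, so no node had enough unreserved memory left; the limits were lowered alongside the requests.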
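
Before changing values, it can help to see which pods are unscheduled and how much memory each node can hand out at all. The helpers below are a sketch with illustrative function names; the namespace defaults to monitoring as used throughout this note.

```shell
# List pods the scheduler has not been able to place yet.
pending_pods() {
  kubectl get pods -n "${1:-monitoring}" --field-selector=status.phase=Pending
}

# Show the allocatable memory of every node (what the scheduler can assign).
node_allocatable_memory() {
  kubectl get nodes \
    -o custom-columns=NODE:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory
}

# Usage:
#   pending_pods
#   node_allocatable_memory
```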

Fix: Reduced resource allocations in prometheus-values.yaml and loki-values.yaml

Before → After:

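| Component        | Memory Request | Memory Limit  |
|------------------|----------------|---------------|
| Prometheus       | 512Mi → 256Mi  | 1Gi → 512Mi   |
| Grafana          | 128Mi → 64Mi   | 256Mi → 128Mi |
| Alertmanager     | 128Mi → 64Mi   | 256Mi → 128Mi |
| Loki             | 256Mi → 128Mi  | 512Mi → 256Mi |
| kubeStateMetrics | 64Mi → 32Mi    | none → 64Mi   |
| node-exporter    | 32Mi → 16Mi    | none → 32Mi   |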

Total memory requests reduced: 1120Mi → 560Mi (roughly 1.1Gi → 0.55Gi), counting one pod per component.
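
The request totals can be recomputed from the request column of the table. This is a small sketch with an illustrative helper name. It counts one pod per component, so it understates the cluster-wide figure for node-exporter, which runs once per node.

```shell
# Sum memory requests given in Mi.
sum_mi() {
  total=0
  for value in "$@"; do
    total=$((total + ${value%Mi}))   # strip the Mi suffix before adding
  done
  echo "${total}Mi"
}

# Order: Prometheus, Grafana, Alertmanager, Loki, kubeStateMetrics, node-exporter
echo "before: $(sum_mi 512Mi 128Mi 128Mi 256Mi 64Mi 32Mi)"   # before: 1120Mi
echo "after:  $(sum_mi 256Mi 64Mi 64Mi 128Mi 32Mi 16Mi)"     # after:  560Mi
```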


3. Loki chunks-cache & results-cache Stuck

Error: loki-chunks-cache-0 stuck in Pending, Argo CD waiting for healthy state

Root Cause: The Loki Helm chart deploys the chunks-cache and results-cache StatefulSets (memcached) by default, even in SingleBinary mode. Each adds its own memory request on top of Loki itself.

Fix: Explicitly disable in loki-values.yaml:

chunksCache:
  enabled: false
 
resultsCache:
  enabled: false

Important: Delete existing StatefulSets after changing values:

kubectl delete statefulset loki-chunks-cache -n monitoring
kubectl delete statefulset loki-results-cache -n monitoring
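
To confirm the cleanup worked, the following sketch prints any cache StatefulSets that still exist and prints nothing once they are gone. The function name is illustrative, and the name pattern assumes the release is called loki, as in the commands above.

```shell
# Print leftover Loki cache StatefulSets; no output means the cleanup is done.
remaining_loki_caches() {
  kubectl get statefulset -n "${1:-monitoring}" -o name \
    | grep -E '/loki-(chunks|results)-cache$' || true
}

# Usage:
#   remaining_loki_caches
```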

4. Prometheus PVC Missing accessModes

Error: Prometheus pod not deploying, PVC issues

Fix: Added accessModes to volumeClaimTemplate in prometheus-values.yaml:

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi

Also set prometheusSpec.podAntiAffinity: "soft", so the scheduler prefers, but does not require, placing Prometheus pods on different nodes.
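
After the volumeClaimTemplate change, a quick check is whether every claim in the namespace reached the Bound state. The helper below is a sketch with an illustrative name. It relies on the default column layout of kubectl get pvc, where the second column is STATUS.

```shell
# Print PVCs whose STATUS column is anything other than "Bound".
unbound_pvcs() {
  kubectl get pvc -n "${1:-monitoring}" --no-headers \
    | awk '$2 != "Bound" { print $1 " (" $2 ")" }'
}

# Usage:
#   unbound_pvcs
```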


5. StatefulSet Update Forbidden

Error:

StatefulSet.apps "loki" is invalid: spec: Forbidden: updates to statefulset spec
for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy',
'revisionHistoryLimit', 'persistentVolumeClaimRetentionPolicy' and
'minReadySeconds' are forbidden

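Root Cause: Most StatefulSet spec fields are immutable, including volumeClaimTemplates. Changing the storage size or any other field outside the list in the error message requires deleting and recreating the StatefulSet.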

Fix: Delete and recreate the StatefulSet:

kubectl delete statefulset loki -n monitoring
# Warning: the PVC and the logs stored on it are deleted as well, because this
# StatefulSet sets persistentVolumeClaimRetentionPolicy.whenDeleted: Delete
# Argo CD will recreate with new configuration
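
Because deleting the StatefulSet can take its data with it, a guarded variant may be useful. This is only a sketch: both function names and the --allow-data-loss argument are hypothetical, not kubectl options. An empty policy field is treated as Retain, the Kubernetes default.

```shell
# Print what happens to the PVCs when the StatefulSet is deleted.
pvc_policy_on_delete() {
  policy=$(kubectl get statefulset "$1" -n "$2" \
    -o jsonpath='{.spec.persistentVolumeClaimRetentionPolicy.whenDeleted}')
  echo "${policy:-Retain}"
}

# Delete the StatefulSet, but refuse when its PVCs would be removed too,
# unless the caller passes the hypothetical --allow-data-loss argument.
recreate_statefulset() {
  name="$1"; ns="$2"
  if [ "$(pvc_policy_on_delete "$name" "$ns")" = "Delete" ] \
     && [ "$3" != "--allow-data-loss" ]; then
    echo "refusing: PVCs of $name would be deleted; pass --allow-data-loss" >&2
    return 1
  fi
  kubectl delete statefulset "$name" -n "$ns"
}

# Usage:
#   recreate_statefulset loki monitoring --allow-data-loss
```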

6. Argo CD Sync Stuck on Old Revision

Error: Sync operation stuck on old commit revision

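Fix: Delete and recreate the Argo CD Application. Note that if the Application carries the resources-finalizer.argocd.argoproj.io finalizer, the delete cascades to every resource it manages before the Application is recreated: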

kubectl delete application monitoring -n argocd
kubectl apply -f bootstrap/argocd/apps/monitoring-app.yaml
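
Two gentler steps are worth trying before deleting the Application. The sketch below uses illustrative function names and assumes Argo CD runs in the argocd namespace. The first helper requests a hard refresh through the argocd.argoproj.io/refresh annotation, and the second shows the finalizers so the effect of a delete is known in advance.

```shell
# Ask Argo CD to re-resolve the target revision and rebuild the manifests.
hard_refresh() {
  kubectl annotate application "$1" -n "${2:-argocd}" \
    argocd.argoproj.io/refresh=hard --overwrite
}

# Show the Application's finalizers. If resources-finalizer.argocd.argoproj.io
# is listed, deleting the Application also deletes the resources it manages.
app_finalizers() {
  kubectl get application "$1" -n "${2:-argocd}" \
    -o jsonpath='{.metadata.finalizers}'
}

# Usage:
#   hard_refresh monitoring
#   app_finalizers monitoring
```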

Diagnostic Commands

# Check Argo CD application status
kubectl get application monitoring -n argocd
 
# Check sync revision
kubectl get application monitoring -n argocd -o jsonpath='{.status.sync.revision}'
 
# Check operation state
kubectl get application monitoring -n argocd -o jsonpath='{.status.operationState}'
 
# Check pod events
kubectl describe pod <pod-name> -n monitoring
 
# Check PVC status
kubectl get pvc -n monitoring
 
# Check StatefulSets
kubectl get statefulset -n monitoring
 
# Clear a stuck operation state so that a new sync can start
kubectl patch application monitoring -n argocd --type merge -p '{"status":{"operationState":null}}'
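
The sync-revision check above can be turned into a pass/fail test against an expected commit. This is a sketch with an illustrative function name. It assumes the Application tracks the branch that is checked out locally and that the local checkout is up to date with the remote.

```shell
# Succeed when Argo CD's synced revision equals the expected commit SHA.
synced_to() {
  app="$1"; expected="$2"
  actual=$(kubectl get application "$app" -n argocd \
    -o jsonpath='{.status.sync.revision}')
  if [ "$actual" = "$expected" ]; then
    echo "in sync at $actual"
  else
    echo "stale: synced=$actual expected=$expected"
    return 1
  fi
}

# Usage, run from a checkout of the GitOps repository:
#   synced_to monitoring "$(git rev-parse HEAD)"
```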

Final Working Configuration

prometheus-values.yaml

prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
        cpu: 250m
    retention: 7d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
    podAntiAffinity: "soft"
 
grafana:
  resources:
    requests:
      memory: 64Mi
      cpu: 50m
    limits:
      memory: 128Mi
 
alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 128Mi
 
kubeStateMetrics:
  resources:
    requests:
      memory: 32Mi
      cpu: 10m
    limits:
      memory: 64Mi
      cpu: 50m
 
prometheus-node-exporter:
  resources:
    requests:
      memory: 16Mi
      cpu: 10m
    limits:
      memory: 32Mi
      cpu: 50m

loki-values.yaml

deploymentMode: SingleBinary
 
loki:
  useTestSchema: true
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
 
singleBinary:
  replicas: 1
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
  persistence:
    size: 2Gi
 
monitoring:
  selfMonitoring:
    grafanaAgent:
      installOperator: false
 
gateway:
  enabled: false
 
read:
  replicas: 0
 
write:
  replicas: 0
 
backend:
  replicas: 0
 
chunksCache:
  enabled: false
 
resultsCache:
  enabled: false

Lessons Learned

  1. Most StatefulSet fields are immutable: storage size and other volumeClaimTemplates fields cannot be updated in place. Delete and recreate the StatefulSet.
  2. Argo CD sync can get stuck on an old revision: deleting and recreating the Application forces a fresh sync, but check its finalizers first.
  3. Memory is scarce in a homelab: start with minimal resource requests and increase them as needed.
  4. Disable unused components: Loki’s cache components reserve memory even when they are not needed.
  5. Check PVC accessModes: set them explicitly in the volumeClaimTemplate. ReadWriteOnce fits a volume that is mounted by a single node.