SJVIK Labs · sjviklabs.com

Production infrastructure,
run from a 3-node cluster.

Proxmox VE on three HP EliteDesks. 14 LXC services, monitored, backed up, and recoverable from runbooks. The whole thing is documented as code in a private GitHub repo and published as a public PDF every time main moves.

Stack · what's actually on

Three nodes, fourteen services.

All three are HP EliteDesk 800 G3 mini desktops. Quiet, low-power, enough headroom to run the whole thing without straining. Cluster runs Proxmox VE 9.1.6 with knet + secauth, quorate.
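What "quorate" means for three nodes can be shown with a little arithmetic. The sketch below assumes each node holds exactly one vote (the usual default, but an assumption here); the function names are illustrative and not part of any Proxmox API.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if total_votes < 1:
        raise ValueError("total_votes must be positive")
    return total_votes // 2 + 1


def is_quorate(nodes_online: int, total_nodes: int = 3) -> bool:
    """True if the online nodes hold a majority, assuming one vote per node."""
    return nodes_online >= quorum_threshold(total_nodes)


if __name__ == "__main__":
    for online in range(4):
        state = "quorate" if is_quorate(online) else "no quorum"
        print(f"{online} of 3 nodes online: {state}")
```

Under that assumption the cluster tolerates the loss of any one node: two of three votes is still a majority, one of three is not.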

Nodes

  • nx-core-01

    Cluster leader

    i5-7500T · 64 GB RAM · 1 TB NVMe + 1 TB SATA

    11 LXCs (Traefik, AdGuard, monitoring, web apps)

  • nx-ai-01

    Inference

    i5-7500T · 32 GB RAM · 500 GB NVMe + 1 TB SATA

    Ollama CPU inference, content services

  • nx-store-01

    Storage & backup

    i5-7500T · 32 GB RAM · NVMe + 1 TB SATA

    Samba shares, Proxmox Backup Server

Services

  • DNS · AdGuard Home · P0: LAN-wide name resolution
  • Reverse proxy · Traefik v3 · TLS termination, file-watcher
  • Observability · Grafana + Prometheus · 5 alert rules, email contact point
  • Status · Uptime Kuma · 13 HTTP monitors
  • SIEM · Wazuh 4.14.4 · 14 agents, MITRE rules
  • Backup · Proxmox Backup Server · 7 daily + 4 weekly snapshots
  • Backup (offsite) · Restic · Nightly to 3 repos
  • Inference · Ollama · qwen2.5 family on CPU
  • IaC · Ansible · 13 roles, GitHub Actions CI

Architecture · how a request gets served

Two single points of failure, runbooks for both.

Every internal URL is resolved by AdGuard Home first (DNS), then served through Traefik (TLS + reverse proxy), which forwards the request to the backend LXC. If either of the first two goes down, every internal URL fails, which is why each has its own recovery runbook with copy-pasteable triage commands and a phased fix tree.

┌──────────┐    DNS query     ┌──────────┐    HTTPS      ┌─────────────┐
│  Client  │ ───────────────▶ │ AdGuard  │ ──────────▶  │   Traefik   │ ───┐
│ (browser)│                  │ (LXC 100)│              │  (LXC 104)  │    │
└──────────┘                  └──────────┘              └─────────────┘    │
                                                                            ▼
                                                                    ┌──────────────┐
                                                                    │  Backend LXC │
                                                                    │ (e.g. .31)   │
                                                                    └──────────────┘

  AdGuard down → all *.lan fail.       Recovery: sops/recovery/adguard-lxc-100.md
  Traefik down → 502/connection-refused. Recovery: sops/recovery/traefik-lxc-104.md
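
The two failure modes above suggest a first triage step: test DNS, then test the proxy, and only then suspect the backend. The following is a minimal sketch of that ordering, not the contents of the actual runbooks. It assumes the client's system resolver points at AdGuard and that *.lan names resolve to Traefik; the function names and the example hostname are hypothetical.

```python
import socket
from typing import Callable, Optional


def resolve(hostname: str) -> Optional[str]:
    """Stage 1 (DNS): return the first resolved address, or None on failure."""
    try:
        infos = socket.getaddrinfo(hostname, 443, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return None
    return infos[0][4][0] if infos else None


def port_open(address: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Stage 2 (proxy): True if a TCP connection to the address can be opened."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def triage(
    hostname: str,
    resolver: Callable[[str], Optional[str]] = resolve,
    prober: Callable[[str], bool] = port_open,
) -> str:
    """Walk the request path in order and name the first stage that fails."""
    address = resolver(hostname)
    if address is None:
        return "dns-failed: see sops/recovery/adguard-lxc-100.md"
    if not prober(address):
        return "proxy-unreachable: see sops/recovery/traefik-lxc-104.md"
    return "dns and proxy ok: check the backend LXC"


if __name__ == "__main__":
    # "grafana.lan" is a made-up hostname used only for illustration.
    print(triage("grafana.lan"))
```

The resolver and prober are injectable so the decision logic can be exercised without touching the network.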
  

Recovery time target

< 5 min for AdGuard, < 10 min for Traefik

Backup retention

7 daily + 4 weekly (PBS), nightly Restic

Cluster firewall

DROP inbound default · per-LXC overrides

SSH posture

Key-only · ed25519 mesh · fail2ban active
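
To make the "7 daily + 4 weekly" retention figure concrete, here is a simplified sketch of which snapshots such a policy keeps. It is an approximation written for illustration: it assumes at most one snapshot per day and ISO weeks, and Proxmox Backup Server's real prune logic may differ in detail. The function name is hypothetical.

```python
from datetime import date, timedelta


def select_kept(
    snapshots: list[date], keep_daily: int = 7, keep_weekly: int = 4
) -> list[date]:
    """Return the snapshot dates to keep, newest first.

    Keeps the newest `keep_daily` days, then the newest snapshot from each
    of the next `keep_weekly` ISO weeks not already covered by a daily.
    """
    ordered = sorted(set(snapshots), reverse=True)
    daily = ordered[:keep_daily]
    covered = {snap.isocalendar()[:2] for snap in daily}  # (ISO year, ISO week)
    weekly = []
    for snap in ordered[keep_daily:]:
        week = snap.isocalendar()[:2]
        if week in covered:
            continue  # this week already has a kept snapshot
        if len(weekly) == keep_weekly:
            break  # weekly budget used up; everything older is pruned
        covered.add(week)
        weekly.append(snap)
    return daily + weekly


if __name__ == "__main__":
    # Sixty consecutive nightly snapshots ending on an arbitrary example date.
    newest = date(2024, 6, 30)
    history = [newest - timedelta(days=i) for i in range(60)]
    for snap in select_kept(history):
        print(snap.isoformat())
```

With sixty nightly snapshots as input, the sketch keeps eleven: the last seven days plus one per week for the four weeks before that.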

Practices · how it stays trustworthy

Built like work, not like a hobby.

  • Documentation as code

    Every infra change lands as a PR with a state-doc update and a change-log entry. The handbook PDF is auto-generated from that source on every merge.

  • Recovery before reaction

    P0 / P1 services have copy-pasteable runbooks. Decommissioning changes leave dated config backups in place, so rollback is one cp away.

  • Defense-in-depth

    DROP-inbound cluster firewall, key-only SSH, fail2ban, unattended security upgrades, weekly state audits.

  • Two-tier disclosure

    The handbook ships in two PDFs: a full lab-internal version, and a public-redacted version produced by an explicit redaction filter. Public-safe by construction, not by hope.

  • Change-conscious

    Conventional commits, squash-merge to main, ansible-lint in CI. New services land via Ansible roles, not artisanal SSH sessions.

  • Observability in

    Grafana + Prometheus on every node, Wazuh SIEM with MITRE rules, Uptime Kuma fronting the *.lan estate. Alerts go to email, with the contact point provisioned as code.

Handbook · auto-generated, every merge

The lab, as a PDF.

Every push to main on the infra repo regenerates two handbook PDFs: an internal one (full IPs, ports, DDNS) and a public-redacted one for external sharing. Both are attached to a tagged GitHub Release. The link below always resolves to the latest *-public.pdf.

Latest release

SJVIK Labs Handbook

Public-redacted variant · DDNS, ports, and Tailscale IPs replaced by tokens via handbook/redactions.txt.
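
A token-replacement filter of this kind can be quite small. The sketch below is an illustration only: the real format of handbook/redactions.txt is not shown here, so the "pattern => TOKEN" rule syntax, the sample rules, and the function names are all assumptions.

```python
import re

# Hypothetical rules file content: one "regex => TOKEN" rule per line.
SAMPLE_RULES = r"""
# Tailscale addresses fall inside 100.64.0.0/10
\b100\.(?:6[4-9]|[7-9]\d|1[01]\d|12[0-7])\.\d{1,3}\.\d{1,3}\b => <TAILSCALE_IP>
# Made-up DDNS domain, for illustration only
\b[\w-]+\.example-ddns\.net\b => <DDNS_HOST>
"""


def parse_rules(text: str) -> list[tuple[re.Pattern, str]]:
    """Parse rule lines into (compiled pattern, token) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        pattern, sep, token = line.partition(" => ")
        if not sep:
            raise ValueError(f"malformed rule: {line!r}")
        rules.append((re.compile(pattern.strip()), token.strip()))
    return rules


def redact(text: str, rules: list[tuple[re.Pattern, str]]) -> str:
    """Apply every rule in order, replacing each match with its token."""
    for pattern, token in rules:
        # A callable replacement keeps the token literal (no backreference parsing).
        text = pattern.sub(lambda _match, t=token: t, text)
    return text


if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    print(redact("ssh admin@100.101.102.103 via lab.example-ddns.net", rules))
```

Keeping the rules in a reviewed text file means every redaction is an explicit, diffable decision rather than something applied by hand to each release.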

What's in it (table of contents)
  1. Architecture — project charter, lab overview, road map
  2. Inventory & Network — hardware, IPs, ports, SSH mesh, topology diagrams
  3. Recovery Runbooks — Traefik, AdGuard, full node restore
  4. Setup & Provisioning — Linux base, storage, monitoring, projects
  5. Appendix — recent change log entries