Common Self-Hosted Mistakes and How to Avoid Them

Your Backup Strategy Is Missing (Until It Isn't)

The most common reason homelab projects fail isn't complexity. It's discovering you've lost everything when a drive dies at 2 AM. I've watched capable engineers lose years of configuration because they never ran a test restore.

Most self-hosted mistakes stem from treating backups as optional infrastructure theater. You need three copies of your data, on two different types of storage, with one offsite: in practice, one on the live system, one on another device in your home, one in the cloud. This is the 3-2-1 rule, and it's not negotiable.

Set Up Automated Backups Now

On my T5810 with 24GB RAM running Ubuntu 24.04.1 LTS, I use Restic for incremental backups to both a local NAS and Backblaze B2. Here's the exact setup:

apt-get install restic
restic init --repo /mnt/nas/restic-backups
restic -r /mnt/nas/restic-backups backup /var/lib/docker /home/username/containers
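One thing `restic init` will do is prompt for a repository password, and unattended runs need that password in a root-readable file. A minimal sketch of setting one up (the path is my own example, not something restic mandates):

```shell
#!/bin/sh
# Generate a repository password and store it where automation can read it.
# The location is an example; point RESTIC_PASSWORD_FILE anywhere you like.
PASS_FILE="${RESTIC_PASSWORD_FILE:-$HOME/.config/restic/password}"
mkdir -p "$(dirname "$PASS_FILE")"
umask 077                                  # new files readable by owner only
head -c 32 /dev/urandom | base64 > "$PASS_FILE"
chmod 600 "$PASS_FILE"
echo "password stored in $PASS_FILE"
```

Export RESTIC_PASSWORD_FILE before running restic init and every command after it, and keep a copy of that password somewhere off the box: losing it means losing the backups.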

For offsite, configure B2 credentials:

export B2_ACCOUNT_ID="your_id"
export B2_ACCOUNT_KEY="your_key"
restic -r b2://bucket-name init
restic -r b2://bucket-name backup /var/lib/docker
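Pushing the same paths to both repositories is easy to script. A sketch of the idea, reusing the repo names and paths from the examples above, that keeps going when one destination fails so the other copy still lands:

```shell
#!/bin/sh
# Back up the same paths to every repository passed as an argument.
# A failure on one repo doesn't stop the others; the exit status reports it.
backup_all() {
  status=0
  for repo in "$@"; do
    if restic -r "$repo" backup /var/lib/docker /home/username/containers; then
      echo "ok: $repo"
    else
      echo "FAILED: $repo" >&2
      status=1
    fi
  done
  return "$status"
}

# Example invocation:
# backup_all /mnt/nas/restic-backups b2://bucket-name
```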

Then automate it with systemd. Create /etc/systemd/system/restic-backup.service. Restic refuses to run unattended without a password source, so point RESTIC_PASSWORD_FILE at a root-readable file (path is an example):

[Unit]
Description=Restic Backup
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
Environment="RESTIC_REPOSITORY=/mnt/nas/restic-backups"
Environment="RESTIC_PASSWORD_FILE=/etc/restic/password"
ExecStart=/usr/bin/restic backup /var/lib/docker /home/username/containers
StandardOutput=journal
StandardError=journal

Because the timer starts this unit, it needs no [Install] section.

Add the matching timer at /etc/systemd/system/restic-backup.timer. Use a single OnCalendar line: stacking OnCalendar=daily on top of OnCalendar=04:00 schedules the job twice, at midnight and again at 04:00. The timer activates restic-backup.service by name, so no Requires= is needed either:

[Unit]
Description=Run Restic Backup Daily

[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true

[Install]
WantedBy=timers.target

Then reload and enable it:

systemctl daemon-reload
systemctl enable --now restic-backup.timer

Gotcha #1: Restic stores metadata efficiently but doesn't prune old snapshots automatically. Add a weekly forget-and-prune job or you'll waste terabytes. Note that forget alone only drops snapshot references; --prune is what actually frees the space:

restic -r /mnt/nas/restic-backups forget --keep-daily 30 --keep-monthly 12 --prune
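The same systemd pattern from the backup job automates this. A sketch of the two units (unit names and the password-file path are my own examples):

```ini
# /etc/systemd/system/restic-prune.service
[Unit]
Description=Restic Forget and Prune

[Service]
Type=oneshot
Environment="RESTIC_REPOSITORY=/mnt/nas/restic-backups"
Environment="RESTIC_PASSWORD_FILE=/etc/restic/password"
ExecStart=/usr/bin/restic forget --keep-daily 30 --keep-monthly 12 --prune

# /etc/systemd/system/restic-prune.timer
[Unit]
Description=Run Restic Prune Weekly

[Timer]
OnCalendar=Sun *-*-* 05:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Schedule it away from the backup window so prune never competes with a running backup for the repository lock.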

Test a restore monthly. Not annually—monthly. I use a separate LXC container to verify restores work. If you never restore, you're not backed up.
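That monthly check can be scripted too. A sketch of the idea, assuming restic is on PATH: restore the latest snapshot into a scratch directory, then confirm a file you know should exist actually came back (the probe path below is just an example):

```shell
#!/bin/sh
# Sketch of a restore check: pull the latest snapshot into a scratch
# directory and confirm an expected file actually came back.
verify_restore() {
  repo="$1"; probe="$2"
  scratch="$(mktemp -d)"
  # --target restores into the scratch dir instead of the live filesystem
  restic -r "$repo" restore latest --target "$scratch" >/dev/null || return 1
  if [ -e "$scratch$probe" ]; then
    echo "restore OK: $probe present"
  else
    echo "restore FAILED: $probe missing" >&2
    return 1
  fi
}

# Example:
# verify_restore /mnt/nas/restic-backups /home/username/containers/docker-compose.yml
```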

Permission Hell: Why Your Containers Can't Write Files

Your Docker container runs as UID 1000, but the mounted volume is owned by UID 1001. Files get created that you can't delete, and services fail with permission errors. This is the classic homelab beginner mistake, and it kills more setups than anything else.

The Right Way to Handle Permissions

Define explicit user/group ownership before mounting. On my setup, I create a service account for each container:

groupadd -g 10001 immich
useradd -u 10001 -g immich -s /sbin/nologin -d /nonexistent immich

Then mount volumes with proper ownership:

mkdir -p /mnt/immich-storage
chown 10001:10001 /mnt/immich-storage
chmod 750 /mnt/immich-storage

In your docker-compose.yaml, explicitly set the user:

services:
  immich:
    image: ghcr.io/immich-app/immich-server:v1.106.4
    user: "10001:10001"
    volumes:
      - /mnt/immich-storage:/usr/src/app/upload
    environment:
      - DB_HOSTNAME=postgres
      - DB_USERNAME=immich
      - DB_PASSWORD=${DB_PASSWORD}

Gotcha #2: Some images (looking at you, Plex) hardcode UID/GID internally. Check the Dockerfile. If it expects UID 99, you either match it or enable user namespace remapping ("userns-remap" in /etc/docker/daemon.json) and rebuild your volume ownership map. It's easier to just match. Don't reuse the name nobody, though: Ubuntu already ships a nobody user at UID 65534, so useradd will refuse. Pick a fresh name:

useradd -u 99 -s /sbin/nologin plex-svc
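Many community images sidestep the problem entirely by taking the UID as a runtime variable; the linuxserver.io family uses PUID/PGID for this, for example. A compose fragment in that style (the image tag and paths are illustrative):

```yaml
services:
  plex:
    image: lscr.io/linuxserver/plex:latest
    environment:
      - PUID=10001   # the entrypoint chowns /config and runs as this UID
      - PGID=10001
    volumes:
      - /mnt/plex-config:/config
```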

Use this shell function to audit your volumes:

audit_perms() {
  # For each top-level volume: print its ownership, then surface up to five
  # files whose owner lacks read permission (the usual symptom of a UID mismatch)
  for vol in /mnt/*/; do
    echo "=== $vol ==="
    ls -ld "$vol"
    find "$vol" -maxdepth 2 -type f ! -perm /u+r | head -5
  done
}

Over-Engineering: Why You Don't Need Kubernetes

Your homelab doesn't need high availability. You don't need service meshes, load balancers, or etcd clusters. Complexity kills more homelabs than failures do.

A single-node Docker Compose setup with proper restart policies handles 99% of use cases. I run Jellyfin, Immich, Frigate, and five other services on a Ryzen 5 5600X without orchestration. They restart on reboot and crash recovery is 10 seconds of console time.

Know When to Stop Adding Layers

Stop if you're adding infrastructure you don't understand. Your Traefik reverse proxy doesn't need Consul service discovery if you have 8 containers. Hard-code DNS entries.

services:
  jellyfin:
    image: jellyfin/jellyfin:10.8.13
    restart: unless-stopped
    ports:
      - "8096:8096"
    volumes:
      - /mnt/media/config:/config
      - /mnt/media/library:/media
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8096/health"]
      interval: 30s
      timeout: 10s
      retries: 3

That's enough. No Kubernetes, no load balancer. It restarts automatically if it dies. You can upgrade by pulling a new image and restarting.

Add complexity when you're genuinely operationalizing something—not when you're self-hosting.

No Monitoring Means Data Loss in the Dark

Your Fritzbox stops resolving DNS and you don't notice for three days. Your Frigate database fills your last 2GB of disk and crashes silently. Monitoring isn't optional.

Minimal Monitoring Stack

Prometheus + Alertmanager + ntfy.sh gets you 80% of the way:

docker pull prom/prometheus:v2.50.0
docker pull prom/alertmanager:v0.26.0
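Both run fine in the same Compose file as everything else. A minimal sketch (the mounted config paths are the images' defaults; the volume name is my own):

```yaml
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus
  alertmanager:
    image: prom/alertmanager:v0.26.0
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

volumes:
  prom-data:
```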

Create a Prometheus config that scrapes node_exporter, the Docker engine's metrics endpoint, and a few key services:

global:
  scrape_interval: 30s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
  - job_name: docker
    static_configs:
      - targets: ['localhost:9323']
  - job_name: frigate
    static_configs:
      - targets: ['localhost:5000']
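One catch: the Docker engine doesn't expose port 9323 by default. Enable it in /etc/docker/daemon.json and restart the daemon (older engine versions also required "experimental": true alongside it):

```json
{
  "metrics-addr": "127.0.0.1:9323"
}
```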

Set up alerts for the things that actually kill homelabs:

groups:
  - name: critical
    rules:
      - alert: DiskSpaceWarning
        # Filter to real mounts; tmpfs and loop devices false-alarm otherwise
        expr: node_filesystem_avail_bytes{mountpoint="/"} < 5368709120
        for: 5m
        annotations:
          summary: "Less than 5GB free on /"
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        annotations:
          summary: "{{ $labels.job }} is down"

Post alerts to your phone via ntfy.sh. Note that Go templates like {{ .GroupLabels.alertname }} only expand inside Alertmanager's own notification config, not in an ad-hoc curl, so define a webhook receiver in alertmanager.yml and point it at an ntfy bridge (ntfy-alertmanager is one; the URL is an example):

route:
  receiver: ntfy

receivers:
  - name: ntfy
    webhook_configs:
      - url: "http://localhost:8080"

Configuration Drift: Your Setup Isn't Reproducible

You SSH'd in, ran three commands, edited a config file manually, and six months later you have no idea how your media server actually works. When the drive dies, you're rebuilding from memory.

Version Control Everything

Create a git repo for your entire homelab setup:

mkdir -p ~/homelab/{docker,ansible,scripts}
git init ~/homelab
cd ~/homelab

# Structure it like this:
# docker-compose.yml
# ansible/
#   inventory.ini
#   playbooks/
#     setup-docker.yml
#     deploy-services.yml
# scripts/
#   backup.sh
#   health-check.sh

Every manual change gets documented in Ansible. I have a single playbook that sets up a fresh Ubuntu 24.04 instance and deploys all services:

ansible-playbook -i inventory.ini playbooks/deploy-services.yml
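The playbooks themselves stay short. A sketch of what setup-docker.yml might contain (module names are real Ansible builtins; the host group and package names are examples for Ubuntu 24.04):

```yaml
- name: Set up Docker on a fresh host
  hosts: homelab
  become: true
  tasks:
    - name: Install Docker and the compose plugin
      ansible.builtin.apt:
        name:
          - docker.io
          - docker-compose-v2
        state: present
        update_cache: true

    - name: Ensure Docker starts on boot
      ansible.builtin.systemd:
        name: docker
        enabled: true
        state: started
```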

Keep secrets out of git (use Ansible vault or separate env files). Your docker-compose references a .env file that's never committed:

# .gitignore
.env
.env.*.local
secrets/

Common Issues

Backups Work Fine Until You Format the Wrong Drive

Before any backup test, verify your Restic repository exists and is readable:

restic -r /mnt/nas/restic-backups cat config

This confirms you have a valid repository without restoring 500GB. When you have more time, restic check --read-data-subset=5% verifies a random 5% of the actual data blobs. If either command fails, your backup isn't real.

Containers Can't Write Despite Correct Ownership

Check the actual process permissions inside the container:

docker exec immich id

If it returns 0:0 (root) despite user: 10001, your image's entrypoint is overriding it. Rebuild the image or use a different one.

Prometheus Scrapes Fail Silently

Check the Targets page in the Prometheus UI (port 9090). Failed scrapes show there with the actual error. Usually it's networking: your monitoring container can't reach the host's localhost from the bridge network. Use the host's LAN IP, or host.docker.internal (on Linux, that name only exists if you add extra_hosts: ["host.docker.internal:host-gateway"] to the service).

What You Have Now

A homelab that doesn't evaporate when hardware fails. Reproducible configuration that survives your curiosity. Permission structures that don't spawn mystery file ownership issues at 3 AM.

Start with backups. Everything else compounds on that foundation. Test your restore. Then add monitoring. Then version control. The infrastructure that survives isn't the fanciest—it's the one you actually understand and can rebuild in an afternoon.

Next steps: Implement Restic backups today, test a restore this week, set up your Prometheus + ntfy stack, then migrate your manual setup to Ansible playbooks. In that order.
