A full Proxmox cluster rebuild from scratch takes somewhere between a weekend and a week, depending on how much of your config lives in Git versus your head. The VMs and LXCs themselves, the ones with actual state in them, those are the things you can't reconstruct from memory. Proxmox Backup Server (PBS) exists specifically for this problem: deduplicated, incremental backups of your entire virtualization layer, with verification built in.
If you're running a multi-node Proxmox cluster and your backup strategy is still "I'll just snapshot it manually before I do anything scary," this is the upgrade path. PBS slots into an existing cluster with surprisingly little friction, but the authentication model and a few operational quirks will trip you up if you don't know they're coming.
Why Not Just Use vzdump to NFS?
The built-in vzdump tool works. You can schedule backups to an NFS share and call it a day. I've seen plenty of homelabs run this way for years. The problem is what happens at scale.
With vzdump to a plain NFS target, every backup is a full copy. A 50 GB VM backed up daily for 30 days is 1.5 TB of storage, most of it identical data. PBS changes this fundamentally. It chunks the data, deduplicates across all backups (and across all VMs), and only transfers the changed chunks on subsequent runs. That 1.5 TB becomes something closer to 80-120 GB depending on churn rate.
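To see where a range like that comes from, here is a back-of-the-envelope sketch. The churn percentages are assumed inputs, not measurements, and the model ignores compression and dedup across VMs, so treat the result as an order of magnitude only.

```shell
# Rough storage estimate: full copies vs. a deduplicated store (integer GB).
full_copy_gb() {  # args: size_gb days
  echo $(( $1 * $2 ))
}

dedup_gb() {      # args: size_gb days churn_pct (assumed daily change rate)
  # The first backup stores everything; each later one adds only the changed share.
  echo $(( $1 + ($2 - 1) * $1 * $3 / 100 ))
}

echo "full copies:     $(full_copy_gb 50 30) GB"   # 1500
echo "dedup, 2% churn: $(dedup_gb 50 30 2) GB"     # 79
echo "dedup, 5% churn: $(dedup_gb 50 30 5) GB"     # 122
```

A daily churn of 2-5% lands in the 80-120 GB range quoted above. Plug in your own VM size and change rate to get a feel for your datastore.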
The other thing vzdump alone doesn't give you is backup verification. PBS can re-read every chunk of a backup after it completes and check it against its checksum, confirming that the stored data is intact. That matters more than most people think. A backup you've never tested is just a hope.
I initially tried running vzdump backups to a Synology NFS share. It worked, but retention management was manual, dedup was nonexistent, and I had zero confidence that any given backup was actually restorable until I tried. PBS replaced all of that with a single integration point.
The PBS Placement Decision
Before installing anything, you need to answer one question: where does PBS run?
There are two reasonable options for a homelab:
Option 1: PBS as a VM on the cluster itself. Quick to set up, uses existing hardware, but your backup server lives on the infrastructure it's backing up. If you lose the node hosting PBS, you lose your backup target at the exact moment you need it most.
Option 2: PBS on a dedicated machine, physically separate from the cluster. This is the correct answer for anything you actually care about. A small mini-PC with a large spinning disk, or even an old desktop with a few terabytes of storage, is enough. The key property is that it's not on the same failure domain as your cluster.
I'd go with option 2 every time. A used mini-PC with a 4 TB drive costs less than the time you'll spend rebuilding a cluster from scratch. PBS itself is lightweight. It doesn't need much CPU or RAM. What it needs is disk space and network connectivity to your Proxmox nodes.
If you're running PBS on a NAS via NFS (mounting the NAS storage into a PBS VM), be aware that deduplication performance degrades over NFS compared to local storage. PBS's chunked dedup store does a lot of random I/O, and NFS adds latency to every operation. Local disk or direct-attached storage is preferable.
Installing PBS
PBS installs like any other Debian-based system. Download the ISO from the Proxmox site, boot it, run through the installer. The whole process takes about 10 minutes.
After installation, you'll access the web UI on port 8007:
https://10.0.0.50:8007
First thing to configure is a datastore, which is just a directory path where PBS will store backup chunks:
# On the PBS host, create the datastore directory
mkdir -p /mnt/backups/pbs-store
# Add it via the CLI (or through the web UI under Storage > Datastore)
proxmox-backup-manager datastore create main-store /mnt/backups/pbs-store
The datastore is where all the deduplicated chunks live. PBS handles the internal structure. You don't need to think about the file layout.
Adding PBS as Storage in Proxmox VE
On each Proxmox VE node (or once in a cluster, since storage config is shared), you add the PBS instance as a storage target. This is where the first gotcha lives.
In the PVE web UI, go to Datacenter > Storage > Add > Proxmox Backup Server. You'll need:
- Server address (the IP of your PBS host)
- Username and password (or API token)
- Datastore name
- Fingerprint (PBS uses a self-signed cert by default)
The fingerprint is available on the PBS dashboard or via:
# On the PBS host
proxmox-backup-manager cert info | grep Fingerprint
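If you script the storage setup, you'll want just the fingerprint value rather than the whole grep line. A minimal sketch: `extract_fingerprint` is a hypothetical helper, and the sample line (label and hex value) is made up for illustration. The helper matches the 32 colon-separated hex pairs of a SHA-256 fingerprint, so it doesn't depend on the exact label text.

```shell
# Pull a SHA-256 fingerprint (32 colon-separated hex pairs) out of any text on stdin.
extract_fingerprint() {
  grep -oiE '([0-9a-f]{2}:){31}[0-9a-f]{2}' | head -n 1
}

# Sample input for illustration only. On the PBS host the real input would be:
#   proxmox-backup-manager cert info | extract_fingerprint
sample='Fingerprint (sha256): 64:d3:ff:3a:50:38:53:5a:9b:f7:50:d6:66:c9:80:3f:b7:4a:38:c8:e2:4b:5e:e1:7d:68:f4:7d:1c:5b:2a:98'
FP=$(printf '%s\n' "$sample" | extract_fingerprint)
echo "fingerprint: $FP"
```

The extracted value is what goes into the Fingerprint field of the storage dialog, or into the --fingerprint option if you add the storage with pvesm add pbs instead of the UI.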
For a basic setup with username/password, this works out of the box. But if you're automating backup jobs or integrating with scripts, you'll want API tokens. And that's where things get interesting.
The API Token Authentication Trap
If you've worked with Proxmox API tokens before, you know PVE uses the format user@realm!tokenname with the secret passed as a separate header or parameter. PBS uses the same format, which makes it easy to assume the tokens behave the same way too. That assumption will cost you hours if you don't catch it early.
The token format for PBS authentication:
# PVE token format (for reference)
user@realm!tokenname (secret passed separately)
# PBS token format in storage config
user@realm!tokenname (same structure, but permission model differs)
The real trap isn't the format. It's privilege separation.
PBS API tokens are privilege-separated. A token's permissions come from ACL entries that name the token itself, and the result is then intersected with the permissions of the user that owns it. PVE exposes this as a "Privilege Separation" checkbox that you can turn off when creating a token. PBS, going by its user management documentation, has no such switch. This means if your user backup@pbs has DatastoreBackup and DatastoreAudit roles on the datastore, but you never assigned a role to the token specifically, the token will authenticate successfully but return empty results or 403 errors on actual operations.
The fix:
# Create a user for backups
proxmox-backup-manager user create backup@pbs
# Create a token for that user (the secret is shown once, so save it)
proxmox-backup-manager user generate-token backup@pbs pve-integration
# Grant the role to the user AND to the token; quote the token ID so bash leaves the "!" alone
proxmox-backup-manager acl update /datastore/main-store DatastoreBackup --auth-id backup@pbs
proxmox-backup-manager acl update /datastore/main-store DatastoreBackup --auth-id 'backup@pbs!pve-integration'
Both entries matter: the token needs its own ACL, and it can never do more than its parent user. Scoping the ACL to /datastore/main-store rather than / keeps the token away from everything else on the host. Test the token before you walk away:
# Verify the token can actually list datastore contents
proxmox-backup-client list --repository 'backup@pbs!pve-integration@10.0.0.50:main-store'
If this returns an empty list (for a new datastore) or your existing backups, you're good. If it returns a 403 or permission error, the token's effective permissions are the problem: check which ACL entries apply to the token itself, not just to the user.
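The repository string is easy to get wrong by hand, so here is a small sketch that assembles it. `pbs_repo` is a hypothetical helper. PBS_REPOSITORY, PBS_PASSWORD and PBS_FINGERPRINT are environment variables the client reads, which is what makes unattended runs possible.

```shell
# Build a PBS repository string: auth-id@host:datastore
# (auth-id is user@realm, or user@realm!tokenname for an API token)
pbs_repo() {  # args: auth_id host datastore
  printf '%s@%s:%s\n' "$1" "$2" "$3"
}

# Single quotes keep an interactive bash from treating "!" as history expansion.
PBS_REPOSITORY=$(pbs_repo 'backup@pbs!pve-integration' 10.0.0.50 main-store)
export PBS_REPOSITORY
echo "$PBS_REPOSITORY"

# With the repository exported, the token secret in PBS_PASSWORD and the
# certificate fingerprint in PBS_FINGERPRINT, the client runs without prompts:
#   proxmox-backup-client list
```

Keep the token secret out of the script itself. Sourcing it from a root-only file is the usual approach.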
Scheduling Backup Jobs
With PBS added as a storage target in PVE, you schedule backups the same way you would any other vzdump job. Datacenter > Backup > Add. Select your PBS storage, pick the VMs and LXCs to include, set the schedule.
A reasonable starting configuration:
Schedule: daily at 02:00
Selection: all VMs and LXCs
Mode: snapshot (for running machines)
Retention: keep-last=7, keep-weekly=4, keep-monthly=3
Compression: zstd
This gives you a week of daily recovery points, a month of weekly snapshots, and three months of monthly archives. Because PBS deduplicates, the storage cost of this retention policy is a fraction of what you'd expect.
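If the interaction between the keep-* options feels opaque, the sketch below simulates it. This is a simplified model of my understanding of the documented behavior (keep-last first, then each period option only looks at snapshots in periods not already covered), not the actual PBS code. All function names are made up, and it needs GNU date. For the authoritative answer, run the real prune with --dry-run.

```shell
# period_id DATE FORMAT -> period label, e.g. 2024-09 for an ISO week (GNU date)
period_id() { date -u -d "$1" "+$2"; }

# contains "LIST" ITEM -> true if ITEM is one of the space-separated words in LIST
contains() { case " $1 " in *" $2 "*) return 0 ;; *) return 1 ;; esac; }

# keep_by_period "ALL" "KEPT" COUNT FORMAT -> prints the updated kept list
keep_by_period() (
  all=$1 kept=$2 keep=$3 fmt=$4
  covered="" chosen="" n=0
  # Periods that already contain a kept snapshot are not counted again.
  for d in $kept; do covered="$covered $(period_id "$d" "$fmt")"; done
  for d in $all; do
    contains "$kept" "$d" && continue
    id=$(period_id "$d" "$fmt")
    contains "$covered" "$id" && continue
    contains "$chosen" "$id" && continue   # newest snapshot of this period already kept
    [ "$n" -ge "$keep" ] && break
    chosen="$chosen $id"; kept="$kept $d"; n=$((n + 1))
  done
  echo "$kept"
)

# prune_sim KEEP_LAST KEEP_WEEKLY KEEP_MONTHLY; snapshot dates on stdin, newest first
prune_sim() (
  all=$(cat | tr '\n' ' ')
  kept=""; n=0
  for d in $all; do
    [ "$n" -ge "$1" ] && break
    kept="$kept $d"; n=$((n + 1))
  done
  kept=$(keep_by_period "$all" "$kept" "$2" '%G-%V')
  kept=$(keep_by_period "$all" "$kept" "$3" '%Y-%m')
  for d in $all; do
    if contains "$kept" "$d"; then echo "$d"; fi
  done
)

# Made-up snapshot dates, newest first, with a deliberately small policy
printf '%s\n' 2024-03-10 2024-03-09 2024-03-08 2024-03-01 \
              2024-02-20 2024-02-10 2024-01-15 | prune_sim 2 2 1
```

The thing to notice is that 2024-03-08 is dropped even though it's recent: its week is already covered by the keep-last snapshots, so the weekly slots go to older weeks instead.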
The snapshot mode is important. It captures a point-in-time view of the disks without stopping the VM. For most workloads this is fine. If you're running a database directly in a VM (not in Kubernetes), enable the QEMU guest agent so the filesystem is frozen while the snapshot is taken, add a pre-freeze hook that flushes the database, or fall back to stop mode to ensure consistency.
# You can also trigger a one-off backup via CLI
vzdump 100 --storage pbs-target --mode snapshot --compress zstd
The Stale Lock File Problem
Backup jobs will occasionally fail with an error like:
ERROR: backup of VM 101 failed - can't acquire lock '/var/lock/pve-manager/vzdump-101.lck'
This happens when a previous vzdump process was interrupted (killed, node rebooted during backup, OOM, etc.) and didn't clean up its lock file. The fix is straightforward:
# Check for stale lock files
ls -la /var/lock/pve-manager/vzdump-*.lck
# Check whether a vzdump process is still running (pgrep does not match itself, unlike ps | grep)
pgrep -af vzdump
# Only if no vzdump is running for that VMID, remove the stale lock
rm -f /var/lock/pve-manager/vzdump-101.lck
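When several locks are lying around, file age is a quick way to triage them. A small sketch, with a hypothetical function name and an arbitrary 6-hour threshold. It uses GNU find and only lists candidates, it never deletes anything.

```shell
# List vzdump lock files in LOCK_DIR that are older than MAX_AGE_MIN minutes.
# Age is only a hint: still confirm no vzdump process is running before deleting.
old_vzdump_locks() {  # args: lock_dir max_age_min
  [ -d "$1" ] || return 0
  find "$1" -maxdepth 1 -name 'vzdump-*.lck' -mmin "+$2" | sort
}

# Example: the directory from the error message, 6-hour threshold (assumed)
old_vzdump_locks /var/lock/pve-manager 360
```

A lock that is older than your longest backup run is almost certainly left over from an interrupted job.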
There's a less obvious variant of this problem. On some nodes, the /var/lock/pve-manager/ directory itself can disappear after a reboot. This directory lives on a tmpfs and should be recreated by systemd-tmpfiles on boot. If it's missing:
# Recreate the lock directory
mkdir -p /var/lock/pve-manager
To make this persistent, verify that the tmpfiles configuration includes it:
# Check if the config exists
cat /usr/lib/tmpfiles.d/pve-manager.conf
# Should contain a line like:
# d /var/lock/pve-manager 0755 root root -
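If that file is missing or doesn't include the lock directory, add your own tmpfiles entry. Give it a different file name: a file in /etc/tmpfiles.d with the same name as one in /usr/lib/tmpfiles.d replaces the packaged file entirely.
echo 'd /var/lock/pve-manager 0755 root root -' > /etc/tmpfiles.d/pve-manager-lock.conf
systemd-tmpfiles --create /etc/tmpfiles.d/pve-manager-lock.conf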
Backup Verification
PBS has a built-in verification system that reads back every chunk in a backup and checks its integrity. This is the feature that separates "I have backups" from "I have backups I can actually restore from."
Schedule verification jobs in the PBS web UI under Datastore > Verify Jobs. A good cadence is to verify the most recent backup daily and do a full verification of all backups weekly. Verification is I/O intensive but doesn't affect PVE operations since it runs on the PBS host.
# Manual verification of a whole datastore, run on the PBS host
proxmox-backup-manager verify main-store
If verification fails for a specific snapshot, PBS will flag it in the UI. Don't ignore these warnings. A failed verification means that backup may not be restorable.
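The principle behind verification fits in a few lines of shell. This is a toy model, not PBS's on-disk format (real chunks are compressed, optionally encrypted blobs). It only shows why a store that names chunks by their hash can detect silent corruption. Both function names are made up.

```shell
# Toy content-addressed store: each chunk file is named after the SHA-256 of its content.
store_chunk() {  # args: store_dir source_file -> prints the digest
  digest=$(sha256sum < "$2" | cut -d' ' -f1)
  cp "$2" "$1/$digest"
  echo "$digest"
}

# Re-hash every chunk and print the names of those whose content no longer matches.
verify_store() {  # args: store_dir
  for chunk in "$1"/*; do
    [ -f "$chunk" ] || continue
    actual=$(sha256sum < "$chunk" | cut -d' ' -f1)
    if [ "$actual" != "$(basename "$chunk")" ]; then basename "$chunk"; fi
  done
}

store=$(mktemp -d)
printf 'disk block A' > "$store.src"
good=$(store_chunk "$store" "$store.src")
verify_store "$store"                # prints nothing: the chunk is intact
printf 'bit rot' >> "$store/$good"   # simulate silent corruption on disk
verify_store "$store"                # prints the damaged chunk's name
```

Nothing in a normal backup run would touch that damaged chunk again, which is why verification has to be its own scheduled job.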
PBS in the Context of a Full 3-2-1 Strategy
PBS handles one layer of your backup stack: the hypervisor layer. VMs and LXCs, their disks, their configs. But if you're running Kubernetes on top of those VMs, there's application-level state that PBS backs up only indirectly.
Consider the layers:
| Layer | What It Contains | Backup Tool | Recovery Speed |
|---|---|---|---|
| Hypervisor | VM disks, LXC rootfs, configs | PBS | Full VM restore in minutes |
| Kubernetes | PV data, etcd, secrets | Velero + MinIO | Namespace-level restore |
| GitOps | Manifests, Helm values, configs | Git (ArgoCD) | Re-sync from repo |
PBS gives you the "bare metal to running VMs" recovery path. If a node dies, you restore the VMs to another node and they come up exactly as they were. But the Kubernetes workloads inside those VMs have their own state (persistent volumes, databases, application data) that benefits from Velero-level backups running in parallel.
The combination is what makes 3-2-1 actually work:
- Three copies: live data + PBS backup + offsite copy (Synology, cloud bucket, second PBS instance)
- Two media types: local SSD/NVMe (live) + HDD (PBS datastore)
- One offsite: PBS supports built-in sync to a remote PBS instance, or you can replicate the datastore to a NAS for geographic separation
For the GitOps layer, ArgoCD already handles the "config as code" part. You don't need to back up Kubernetes manifests the traditional way because they're already in Git. What you need to back up is the state that isn't in Git: persistent volumes, database contents, secrets.
Garbage Collection and Datastore Maintenance
PBS deduplicates by storing data as content-addressed chunks. When you prune old backups, the chunks aren't immediately deleted. They become unreferenced. Garbage collection (GC) is the process that identifies and removes unreferenced chunks to reclaim disk space.
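The relationship between pruning and GC is easier to see in a toy model. In the sketch below, a snapshot is just an index file listing chunk digests, and chunks are stored once no matter how many snapshots reference them. The real GC is more careful than this (it has to be safe while backups are running), and all names here are invented.

```shell
# Toy model of prune + garbage collection.
store=$(mktemp -d); mkdir "$store/chunks" "$store/index"

add_snapshot() {  # args: snapshot_name chunk_content...
  name=$1; shift
  : > "$store/index/$name"
  for content in "$@"; do
    digest=$(printf '%s' "$content" | sha256sum | cut -d' ' -f1)
    printf '%s' "$content" > "$store/chunks/$digest"   # same content, same file
    echo "$digest" >> "$store/index/$name"
  done
}

garbage_collect() {  # delete every chunk that no remaining index references
  for chunk in "$store"/chunks/*; do
    [ -f "$chunk" ] || continue
    if ! cat "$store"/index/* 2>/dev/null | grep -qx "$(basename "$chunk")"; then
      rm "$chunk"
    fi
  done
}

add_snapshot monday  base-os app-v1
add_snapshot tuesday base-os app-v2      # base-os is stored only once
ls "$store/chunks" | wc -l               # 3 chunks for 4 references
rm "$store/index/monday"                 # "prune": only the index goes away
ls "$store/chunks" | wc -l               # still 3, app-v1 is now unreferenced
garbage_collect
ls "$store/chunks" | wc -l               # 2
```

This is why disk usage doesn't drop right after a prune: the space only comes back once GC has run.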
GC runs on a schedule within PBS. The default is usually fine, but keep an eye on the "Deduplication Factor" metric in the PBS dashboard. For a homelab with similar VMs (same base OS, similar packages), you'll typically see dedup factors between 3x and 8x. That means your backups are using 3-8x less space than the raw data size.
# Show statistics from the last GC run (the dashboard's dedup factor is derived from these)
proxmox-backup-manager garbage-collection status main-store
# Manually trigger garbage collection
proxmox-backup-manager garbage-collection start main-store
If your dedup factor is close to 1x, something is off. Either your VMs have very little data in common (unlikely if they're running the same distro), or the chunk size configuration isn't optimal for your workload.
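The factor itself is a simple ratio: logical bytes referenced by all backups divided by bytes actually on disk. A minimal sketch, assuming you read those two numbers off the GC statistics (as far as I know they appear there as index-data-bytes and disk-bytes, but check the output on your version). The function name and the example numbers are made up.

```shell
# Dedup factor = logical bytes referenced by all indexes / bytes actually on disk.
dedup_factor() {  # args: index_data_bytes disk_bytes
  awk -v logical="$1" -v physical="$2" \
    'BEGIN { if (physical > 0) printf "%.2f\n", logical / physical; else print "n/a" }'
}

# Example numbers (made up): 1.5 TB of logical backup data in 300 GB on disk
dedup_factor 1500000000000 300000000000   # 5.00
```

A brand-new datastore with only a handful of snapshots will sit near 1x for a while, so give it a few backup cycles before reading too much into the number.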
Monitoring Backup Health
PBS exposes metrics that you can pull into Grafana or any monitoring stack. The key things to watch:
- Last backup timestamp per VM/LXC: if a backup hasn't run in 24+ hours, something is broken
- Backup duration trends: a backup that used to take 10 minutes and now takes 60 suggests disk issues or unexpected data growth
- Verification status: any failed verifications need immediate attention
- Datastore usage: track the growth rate to predict when you'll need more storage
A simple monitoring approach is a cron job that checks for recent backups:
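#!/bin/bash
# Warn about every backup group whose newest snapshot is older than 24 hours.
# For cron, put the password in PBS_PASSWORD and the cert fingerprint in PBS_FINGERPRINT.
CUTOFF=$(date -d '24 hours ago' +%s)
proxmox-backup-client list \
  --repository 'backup@pbs@10.0.0.50:main-store' \
  --output-format json | \
  jq -r --argjson cutoff "$CUTOFF" \
    '.[] | select(.["last-backup"] < $cutoff) | (.["backup-type"] + "/" + .["backup-id"])' | \
  while read -r group; do
    echo "WARNING: $group has no backup in the last 24 hours"
  done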
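For the datastore usage item, a linear projection is usually enough to know when to start shopping for disks. A rough sketch with made-up numbers and a hypothetical function name. It assumes growth stays constant, which it won't, so rerun it occasionally.

```shell
# Days until the datastore is full, assuming linear growth (all values in whole GB).
days_until_full() {  # args: total_gb used_gb growth_gb_per_day
  if [ "$3" -le 0 ]; then echo "never"; return; fi
  echo $(( ($1 - $2) / $3 ))
}

# Example (made-up numbers): 4 TB disk, 1.2 TB used, growing 15 GB per day
days_until_full 4000 1200 15   # 186
```

Feed it the used and total figures from the PBS dashboard and the growth you observe over a couple of weeks.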
Lessons Learned
Test restores, not just backups. At least once a quarter, pick a VM and restore it to a temporary location. Verify it boots, verify the data is intact. A backup system you've never restored from is a hypothesis, not a strategy.
Privilege separation on API tokens is the silent killer. If your automated backups authenticate fine but return empty data or permission errors on operations, check whether the token has ACL entries of its own. This one issue probably accounts for half the "PBS isn't working" posts on the Proxmox forums.
Separate your failure domains. PBS running as a VM on the cluster it's backing up is better than no backups, but only barely. The whole point of backups is surviving hardware failure. A dedicated, physically separate PBS host (even a cheap one) fundamentally changes your recovery posture.
PBS handles the hypervisor layer, not the application layer. If you're running Kubernetes, you still need something like Velero for PV snapshots and namespace-level restores. PBS gives you "get back to running VMs." Velero gives you "get back to running applications." Both are necessary. Building a production homelab is only half the work if you don't have a plan for when things go wrong.
Deduplication makes aggressive retention policies cheap. Don't be stingy with retention. The marginal cost of keeping an extra month of weekly snapshots is tiny after dedup. The value of having that three-month-old snapshot when you discover slow data corruption is enormous.
Lock file issues are operational, not architectural. They're annoying, but they're just stale state from interrupted processes. Know where the lock files live, know how to check if a vzdump is actually running, and clean up when needed. Don't let a stuck lock file make you think PBS itself is broken.



