Rebuild Debian Docker VM 100 from Proxmox Template 9000¶
Summary¶
VM 100 (debian-docker) was destroyed and recreated from template 9000 as a full clone on Ceph RBD storage. The goal was to rebuild the main Docker VM cleanly, attach a dedicated Docker data disk, and reapply custom cloud-init from Proxmox snippets.
Environment¶
- Proxmox VE host: mainframe
- VM: 100 (debian-docker)
- Template VM: 9000
- Storage:
  - VM disks: cephpool (Ceph RBD)
  - Cloud-init snippets: local at /var/lib/vz/snippets
- Cloud-init files: docker-userdata.yml, docker-net.yml
- Guest OS: Debian cloud image / Debian 12 cloud-init guest
- Data disk target: /var/lib/docker
- Bind mount targets: /opt/docker-apps, /opt/compose
- Intended static IP: 192.168.16.3
Problem¶
The VM needed a clean rebuild, but the recreation process hit storage syntax issues, cloud-init disk conflicts, and guest boot problems tied to the Docker data disk and cloud-init configuration.
Symptoms¶
- Proxmox RBD disk syntax error: unable to parse rbd volume name '200G'
- Cloud-init disk creation conflict: rbd create 'vm-100-cloudinit' error: rbd: create error: (17) File exists
- The added Ceph disk was initially created with the wrong effective size (0T) even though a size argument had been supplied.
- The guest console suggested emergency-mode boot problems and an inaccessible root shell.
- Cloud-init warnings appeared later during boot.
Actions Taken¶
- Removed or prepared to remove the existing VM 100.
- Cloned template 9000 into VM 100 as a full clone on cephpool.
- Set the VM SCSI controller to virtio-scsi-single.
- Added a second virtual disk intended for Docker data as scsi1.
- Attached a cloud-init drive on cephpool.
- Applied Proxmox cicustom to use:
  - local:snippets/docker-userdata.yml
  - local:snippets/docker-net.yml
- Regenerated the cloud-init ISO with qm cloudinit update 100.
- Booted and inspected the VM from the Proxmox serial console.
Key Findings¶
- Proxmox Ceph/RBD disk creation syntax was sensitive and early attempts did not create the intended 150–200 GiB secondary disk correctly.
- The cloud-init drive already existed after cloning, so trying to create it again failed.
- The VM could boot far enough for serial console troubleshooting, which made it possible to diagnose storage and cloud-init issues without relying on SSH.
- The root cause of the boot trouble was not a basic cloning failure; it was the state of the second disk and the guest provisioning logic.
Resolution¶
The rebuild proceeded with:
- VM disks on cephpool
- cloud-init snippets stored on local
- custom cloud-init attached after creation
- later fixes focused on correcting the second disk and guest-side mount behavior rather than rebuilding again immediately
Validation¶
Success at this stage was partial:
- VM 100 existed again on cephpool
- custom cloud-init was attached
- the VM booted far enough to provide serial console logs
- further storage and mount fixes were still required
Follow-Up Tasks¶
- Correct the second disk size and filesystem handling
- validate cloud-init YAML before future rebuilds
- confirm that the guest network config and Docker storage mount both apply cleanly on first boot
- document a stable VM rebuild sequence for future use
Lessons Learned¶
- Ceph-backed Proxmox disk creation syntax should be verified immediately after qm set.
- It is safer to create the VM first, then apply --cicustom, then regenerate cloud-init.
- Serial console access is essential during first-boot troubleshooting.
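The rebuild order described in this section can be sketched as a short host-side script. This is a minimal sketch rather than the exact commands from the session: the run wrapper, the DRY_RUN switch, and the 150 GiB size are assumptions. The data disk is allocated with the STORAGE_ID:SIZE_IN_GiB form that Proxmox documents for new volumes, which is likely why a value such as 200G was parsed as a volume name. With DRY_RUN=1 (the default here) the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the VM 100 rebuild order; intended for the Proxmox host.
# DRY_RUN=1 (default, an assumption of this sketch) prints commands instead of running them.
set -euo pipefail

VMID=100
TEMPLATE=9000
STORAGE=cephpool
DATA_GIB=150   # assumed size; the notes mention 150-200 GiB

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

rebuild_vm() {
  # 1. Full clone from the template onto Ceph RBD.
  run qm clone "$TEMPLATE" "$VMID" --name debian-docker --storage "$STORAGE" --full 1
  # 2. Controller first, then the data disk (STORAGE:SIZE_IN_GiB allocates a new volume).
  run qm set "$VMID" --scsihw virtio-scsi-single
  run qm set "$VMID" --scsi1 "${STORAGE}:${DATA_GIB},ssd=1,discard=on,cache=writeback"
  # 3. The clone already carried a cloud-init drive, so it is not created again here.
  # 4. Attach the custom snippets, then regenerate the seed ISO.
  run qm set "$VMID" --cicustom "user=local:snippets/docker-userdata.yml,network=local:snippets/docker-net.yml"
  run qm cloudinit update "$VMID"
  # 5. Inspect the resulting config (disk size included) before first boot.
  run qm config "$VMID"
}

rebuild_vm
```

Setting DRY_RUN=0 executes the same sequence for real; checking the printed scsi1 line first is the point of the dry run.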
Diagnose Boot Failure and Missing Filesystem on /dev/sdb¶
Summary¶
After the rebuild, VM 100 booted but failed to mount the dedicated Docker data disk. Troubleshooting focused on boot logs, cloud-init messages, and the state of /dev/sdb.
Environment¶
- VM: 100 (debian-docker)
- Guest disk layout:
  - sda: root disk
  - sdb: intended Docker data disk
  - sr0: cloud-init NoCloud seed
- Guest console: Proxmox serial console
- Cloud-init datasource: NoCloud from /dev/sr0
Problem¶
The guest expected /dev/sdb to host /var/lib/docker, but that disk did not have a usable ext4 filesystem when the guest first booted.
Symptoms¶
- Boot log showed: EXT4-fs (sdb): VFS: Can't find ext4 filesystem
- Mount activation failure: Activate mounts: FAIL:mount -a
- Docker dependency failure tied to /var/lib/docker
- CIFS mounts initially reported "CIFS: VFS: No username specified" before cifs-utils and its configuration were fully in place
- Cloud-init warning: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
Actions Taken¶
- Opened VM serial console.
- Reviewed boot output from kernel start through cloud-init final stage.
- Observed that /dev/sdb was detected as a 150 GiB disk.
- Confirmed the cloud-init datasource was present on /dev/sr0.
- Identified that /var/lib/docker was being mounted before /dev/sdb had a valid ext4 filesystem.
- Noted that Docker installation completed later in cloud-init, but the Docker service still failed due to the missing filesystem.
Key Findings¶
- /dev/sdb existed and was visible to the guest, so this was not a Proxmox device presentation issue.
- The failure was specifically that /dev/sdb lacked the expected ext4 filesystem.
- Cloud-init ran far enough to install packages, configure networking, and create the debian user, so the user-data was not fully ignored.
- The invalid cloud-config warning indicated that at least part of the YAML was malformed, which likely prevented some storage steps from behaving as intended.
Resolution¶
The issue was narrowed down to guest-side filesystem initialization and mount sequencing. The next step was to manually create the ext4 filesystem on /dev/sdb and then refine the cloud-init YAML so first-boot provisioning would handle it correctly in future rebuilds.
Validation¶
Validation came from the serial console:
- sdb was present
- root filesystem on sda1 mounted successfully
- cloud-init completed
- the exact mount failure for sdb was visible in logs
Follow-Up Tasks¶
- create ext4 on /dev/sdb
- mount /var/lib/docker
- update cloud-init to use valid schema and reliable disk setup
- retest boot after YAML changes
Lessons Learned¶
- Presence of a disk in lsblk or kernel logs does not mean it is formatted or mountable.
- Cloud-init can complete partially while still leaving critical storage steps broken.
- Boot logs are often enough to separate disk presentation problems from filesystem problems.
Manually Format /dev/sdb, Mount Docker Data, and Start Docker¶
Summary¶
Once the missing filesystem problem was identified, /dev/sdb was formatted manually, mounted at /var/lib/docker, and Docker was started successfully.
Environment¶
- VM: 100
- Docker root: /var/lib/docker
- Data disk: /dev/sdb
- Filesystem label used later in YAML: docker-data
Problem¶
Docker could not start because /var/lib/docker was not mounted on the dedicated data disk.
Symptoms¶
- /dev/sdb had no filesystem in lsblk -f
- Mount errors for /var/lib/docker
- Docker service could not start correctly until the data disk was mounted
Actions Taken¶
- Listed block devices and filesystems with lsblk -f.
- Created an ext4 filesystem on /dev/sdb and labeled it docker-data.
- Mounted /var/lib/docker.
- Started Docker and checked service health.
- Verified mounted storage with df -h.
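The manual recovery steps above can be sketched as two small guest-side functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, the commands run as root, and the blkid guard is an added safety check that was not part of the original session. Nothing runs until the functions are called.

```shell
#!/usr/bin/env bash
# Sketch of the manual recovery inside the guest (run as root).
# mkfs is destructive, so the guard refuses a device that already carries a signature.

format_docker_disk() {
  local dev="$1" label="${2:-docker-data}"
  [ -b "$dev" ] || { echo "not a block device: $dev" >&2; return 1; }
  if blkid "$dev" >/dev/null 2>&1; then
    echo "$dev already has a filesystem or partition signature; refusing to format" >&2
    return 1
  fi
  mkfs.ext4 -L "$label" "$dev"
}

mount_and_start_docker() {
  local label="${1:-docker-data}"
  mkdir -p /var/lib/docker
  mount "LABEL=$label" /var/lib/docker
  systemctl start docker
  # Same checks as listed under Validation.
  lsblk -f
  df -h /var/lib/docker
  systemctl is-active docker
}

# Usage (left commented so nothing runs by accident):
# format_docker_disk /dev/sdb docker-data && mount_and_start_docker docker-data
```

Mounting by label rather than by /dev/sdb keeps the mount stable if device names change between boots.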
Key Findings¶
- /dev/sdb was empty before manual formatting.
- After formatting, /var/lib/docker mounted successfully on the dedicated disk.
- Docker started successfully once its data-root path existed on the mounted ext4 filesystem.
- This confirmed the disk, mountpoint, and Docker daemon behavior were otherwise sound.
Resolution¶
Manual formatting of /dev/sdb restored the intended Docker storage design. /var/lib/docker was mounted from the dedicated data disk, and Docker was able to run.
Validation¶
Success was confirmed by:
- lsblk -f showing ext4 on /dev/sdb
- df -h /var/lib/docker showing /dev/sdb as the backing filesystem
- systemctl status docker showing Docker active and running
Follow-Up Tasks¶
- bake this storage logic into cloud-init so manual formatting is no longer required
- ensure /etc/fstab is clean and uses a stable identifier
- verify Docker daemon configuration still points to /var/lib/docker
Lessons Learned¶
- Manually fixing the disk is a good recovery method, but provisioning should be fixed so it is not needed on rebuild.
- A dedicated Docker data disk should be validated before restoring application data.
Restore Bind Mount Layout for /opt/docker-apps and /opt/compose¶
Summary¶
After Docker was running again, the guest still lacked the expected bind mount layout that mapped Docker app data and compose files into user-friendly paths under /opt.
Environment¶
- Source paths: /var/lib/docker/appdata, /var/lib/docker/compose
- Target paths: /opt/docker-apps, /opt/compose
- Mount method: bind mounts via /etc/fstab
Problem¶
The bind mount targets existed conceptually in the design, but the source directories either did not exist yet or the bind mounts had been added multiple times, causing confusion and layered mounts.
Symptoms¶
- Initial mount failures:
  - mount: /opt/docker-apps: special device /var/lib/docker/appdata does not exist.
  - mount: /opt/compose: special device /var/lib/docker/compose does not exist.
- findmnt later showed multiple stacked mount entries for the same targets, including references to both the old root disk and the new Docker data disk
Actions Taken¶
- Created source directories under /var/lib/docker.
- Created target directories under /opt.
- Added /etc/fstab bind entries for:
  - /var/lib/docker/appdata -> /opt/docker-apps
  - /var/lib/docker/compose -> /opt/compose
- Mounted the bind targets.
- Verified mount results with findmnt.
- Identified duplicate or layered bind mounts caused by earlier attempts.
- Cleaned /etc/fstab and described a recovery sequence to unmount duplicates and remount once.
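One way to avoid the duplicate fstab lines described above is to add each bind entry only when its target is not already listed. The following is a minimal sketch: the function name, the FSTAB override variable, and the nofail option are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: append a bind-mount line to an fstab file only if the target is not already present.
# FSTAB defaults to /etc/fstab; pointing it at a scratch file allows a safe trial run.

add_bind_entry() {
  local src="$1" dst="$2" fstab="${FSTAB:-/etc/fstab}"
  # Bind mounts fail if either directory is missing, so create both first.
  mkdir -p "$src" "$dst"
  # Field 2 of an fstab line is the mount point; comment lines are skipped.
  if awk -v t="$dst" '$1 !~ /^#/ && $2 == t { found = 1 } END { exit !found }' "$fstab"; then
    echo "entry for $dst already present"
  else
    printf '%s %s none bind,nofail 0 0\n' "$src" "$dst" >> "$fstab"
  fi
}

# Usage on the guest (as root), then mount once and inspect:
# add_bind_entry /var/lib/docker/appdata /opt/docker-apps
# add_bind_entry /var/lib/docker/compose /opt/compose
# mount -a && findmnt /opt/docker-apps && findmnt /opt/compose
```

If a target is already stacked from earlier attempts, repeating umount on it until findmnt shows nothing, then running mount -a once, returns it to a single layer.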
Key Findings¶
- The bind mounts failed initially because the source directories did not yet exist.
- Repeated mount calls plus duplicate fstab lines produced confusing layered output.
- The correct design is simple once the source directories exist and fstab contains only one clean entry for each target.
Resolution¶
The source directories and target directories were created, bind mounts were restored, and the configuration path layout under /opt became usable again.
Validation¶
Validation came from:
- findmnt showing /opt/docker-apps and /opt/compose
- /var/lib/docker mounted from /dev/sdb
- Docker running with its data directory on the dedicated disk
Follow-Up Tasks¶
- embed clean bind mount handling in cloud-init
- avoid duplicate mount commands during troubleshooting
- keep /etc/fstab deduplicated
Lessons Learned¶
- Bind mounts depend on source directories existing first.
- Repeated bind-mount attempts can produce misleading stacked output.
- findmnt is the best quick check for mount correctness.
Correct Cloud-Init YAML for the Rebuilt Docker VM¶
Summary¶
The cloud-init user-data for VM 100 was iteratively corrected to remove schema errors, properly handle /dev/sdb, and align with the new bind-mount layout under /opt/docker-apps.
Environment¶
- Cloud-init snippet: /var/lib/vz/snippets/docker-userdata.yml
- Network snippet: /var/lib/vz/snippets/docker-net.yml
- Snippet storage in Proxmox: local:snippets
- Intended app layout: /opt/docker-apps/<app>/config
Problem¶
The user-data YAML contained invalid or incomplete configuration for password handling, disk setup, and mount layout.
Symptoms¶
- Cloud-init warning: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
- fs_setup did not produce the intended filesystem behavior during first boot
- Docker and bind mount steps had to be repaired manually after boot
Actions Taken¶
- Compared the original user-data against the desired target design.
- Identified the chpasswd block as invalid in the way it was written.
- Removed the invalid chpasswd section from the proposed YAML.
- Switched the Docker data mount logic to rely on:
  - fs_setup
  - ext4 label docker-data
  - stable mount definitions
- Reworked bind mount handling to reflect the actual structure:
  - /opt/docker-apps/<app>/config
  - /opt/compose
- Discussed moving network-dependent Docker repo setup from bootcmd into runcmd.
- Built a cleaner version of the YAML that included:
  - early directory creation
  - fs_setup on /dev/sdb
  - mount definitions
  - Docker install steps
  - Docker service ordering
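Because the schema warning only surfaced at boot, a pre-flight check on the user-data snippet may catch the same class of error earlier. The sketch below is illustrative: the function names are hypothetical, and it assumes cloud-init is installed wherever the check runs (for example inside a guest), since a Proxmox host does not necessarily have it.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight checks for a user-data snippet before it is attached to a VM.

has_cloud_config_header() {
  # cloud-init treats user-data as cloud-config only when the first line is "#cloud-config".
  [ "$(head -n 1 "$1")" = "#cloud-config" ]
}

validate_snippet() {
  local file="$1"
  has_cloud_config_header "$file" || { echo "missing #cloud-config header: $file" >&2; return 1; }
  if command -v cloud-init >/dev/null 2>&1; then
    # Reports schema errors such as an invalid chpasswd block.
    cloud-init schema --config-file "$file"
  else
    echo "cloud-init not installed here; copy the file to a guest and validate there" >&2
    return 2
  fi
}

# Usage:
# validate_snippet /var/lib/vz/snippets/docker-userdata.yml
```

This validates the user-data file only; the network snippet follows a different schema and is not covered by the sketch.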
Key Findings¶
- The invalid chpasswd block was a likely cause of schema validation warnings.
- First-boot cloud-init storage behavior is sensitive to syntax and ordering; once it misses its opportunity, later YAML edits do not retroactively fix the guest without a clean reprovision or manual intervention.
- The YAML needed to match the actual restored directory layout, not the old /DockerAppData path.
Resolution¶
A corrected cloud-init direction was established:
- remove invalid schema elements
- use fs_setup to initialize /dev/sdb
- mount Docker data from a stable label
- mount or bind /opt/docker-apps and /opt/compose
- move network-dependent install actions to a later stage
Validation¶
Validation was indirect but strong:
- cloud-init completed enough to provision the guest
- later manual fixes confirmed that the intended layout was valid
- the revised YAML addressed the exact issues seen in boot logs and runtime behavior
Follow-Up Tasks¶
- validate the final YAML with cloud-init schema tools before the next rebuild
- keep a known-good cloud-init snippet under version control or archived in the homelab docs
- test a full destroy/recreate cycle once the final YAML is settled
Lessons Learned¶
- Cloud-init YAML should be treated like code: validate it before production use.
- Disk setup logic and bind mount design should be explicit and reproducible.
- When rebuilding important infrastructure VMs, preserve a working snippet history.
Restore Docker Application Data from Offen Backup to /opt/docker-apps¶
Summary¶
Application data previously backed up with Offen was prepared for restoration to the new host layout under /opt/docker-apps. The old archive structure still reflected the legacy backup source path, so the internal paths had to be inspected before extraction.
Environment¶
- Backup tool: offen/docker-volume-backup:v2
- Backup archive found: /srv/remotemount/NAS/Tools/Backups/Docker/offen/backup-[date removed].tar.gz
- NAS mount: /srv/remotemount/NAS
- New destination: /opt/docker-apps
Problem¶
The restored environment no longer uses /DockerAppData, but the backup archive was built from the old path. The archive had to be restored into the new layout without preserving the old leading path components.
Symptoms¶
- Initial archive lookup failed because the exact filename and extension were wrong.
- Once found, the archive structure showed: /backup/my-app-backup/<AppName>/config/...
- This meant a naive extract would recreate the wrapper path, not restore directly into /opt/docker-apps/<AppName>.
Actions Taken¶
- Confirmed the NAS CIFS mount was active.
- Located the correct archive file with find.
- Inspected the archive with tar -tzf.
- Determined that the leading components to remove were:
  - backup
  - my-app-backup
- Planned extraction into /opt/docker-apps using --strip-components=2.
- Planned to stop compose stacks before extraction.
- Planned permission normalization after restore.
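The inspect-then-extract plan above can be sketched as follows. This is a minimal sketch, assuming GNU tar; the function names and the ARCHIVE variable are hypothetical, and the archive path itself is whatever find located.

```shell
#!/usr/bin/env bash
# Sketch: inspect an archive, then extract it with the two wrapper directories removed.

preview_archive() {
  # Show the first entries so the number of leading components can be confirmed.
  tar -tzf "$1" | head -n 20
}

restore_archive() {
  local archive="$1" dest="$2"
  mkdir -p "$dest"
  # backup/my-app-backup/<AppName>/... becomes <dest>/<AppName>/...
  tar -xzf "$archive" -C "$dest" --strip-components=2
}

# Usage (stop the compose stacks first):
# preview_archive "$ARCHIVE"
# restore_archive "$ARCHIVE" /opt/docker-apps
```

Extracting into a scratch directory first and comparing the result against the expected <AppName>/config layout is a cheap way to confirm the component count before touching /opt/docker-apps.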
Key Findings¶
- The correct Offen archive was gzipped (.tar.gz), not plain .tar.
- The archive layout required exactly two components to be stripped to land correctly under /opt/docker-apps.
- The backup was usable without reintroducing /DockerAppData, provided extraction was handled carefully.
Resolution¶
The restore plan was established: stop containers, extract the archive into /opt/docker-apps with --strip-components=2, reapply ownership and permission policy, then bring the stacks back up.
Validation¶
Validation was achieved by:
- confirming the NAS mount
- locating the archive on disk
- reading the top archive entries
- matching those entries to the intended destination layout
Follow-Up Tasks¶
- perform the actual extract if not already done
- reapply ownership and secret file permissions after extraction
- restart and validate application stacks
- update backup definitions so future archives are sourced from /opt/docker-apps
Lessons Learned¶
- Always inspect backup archive structure before restoring into a live environment.
- Backup path migrations should be handled deliberately when host layout changes.
- CIFS mount validation is a necessary first step before restore operations.
Update Backup and Restore Strategy from /DockerAppData to /opt/docker-apps¶
Summary¶
The backup configuration for Offen and Restic still pointed at /DockerAppData, but the rebuilt host now uses /opt/docker-apps. Backup and restore strategy had to be updated to reflect the new canonical path.
Environment¶
- Offen config:
  - archive path: /srv/remotemount/NAS/Tools/Backups/Docker/offen
  - old source mapping: /DockerAppData:/backup/my-app-backup:ro
- Restic config:
  - repository: /srv/remotemount/NAS/Tools/Backups/Docker/restic
  - old source: /DockerAppData
- New live source: /opt/docker-apps
Problem¶
Future backups would be inconsistent or incomplete if the old /DockerAppData path remained in the compose definitions after migration.
Symptoms¶
- Existing backup configs still referenced /DockerAppData
- The rebuilt host used /opt/docker-apps instead
- Starting containers before updating these references could accidentally recreate stale path usage
Actions Taken¶
- Reviewed the old Offen and Restic service configuration.
- Identified all bind mounts and environment values still referencing /DockerAppData.
- Recommended changing those references to /opt/docker-apps before restarting the backup containers.
- Recommended updating compose files and scripts before bringing any stack back online.
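Finding the remaining references can be automated with a recursive search. The sketch below is illustrative: the function name is hypothetical, and the set of file names searched (*.yml, *.yaml, .env) is an assumption about where such references live.

```shell
#!/usr/bin/env bash
# Sketch: list every compose or env file under a root that still mentions the old path.

find_stale_refs() {
  local root="${1:-/opt/compose}" old="${2:-/DockerAppData}"
  # -r recurse, -l print file names only, -F treat the pattern as a fixed string.
  grep -rlF --include='*.yml' --include='*.yaml' --include='.env' -- "$old" "$root"
}

# Usage: no output (exit status 1) means no stale references remain.
# find_stale_refs /opt/compose /DockerAppData
```

Running it again after editing the Offen and Restic definitions gives a quick confirmation before any stack is restarted.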
Key Findings¶
- Offen backed up the live bind-mounted path mounted into /backup/my-app-backup.
- Restic backed up /DockerAppData directly.
- Both backup services needed explicit path migration to match the rebuilt host.
Resolution¶
The required path migration was identified:
- Offen source volume should change from /DockerAppData to /opt/docker-apps
- Restic RESTIC_BACKUP_SOURCES and bind mount should change from /DockerAppData to /opt/docker-apps
Validation¶
Validation at this stage was design-level:
- old configs were reviewed
- replacement path strategy was established
- risk of stale path recreation was identified before bringing services online
Follow-Up Tasks¶
- edit backup stack compose files
- validate backup jobs after restart
- document one known-good backup and restore workflow based on /opt/docker-apps
Lessons Learned¶
- Rebuilds and path migrations must include backup jobs, not just application stacks.
- Backup containers are easy to overlook during storage layout changes.
Use Resumable Rsync for Application Data Migration and Recovery¶
Summary¶
A resumable rsync-based migration was prepared to copy Docker app data from the old host to the new Docker VM with bandwidth limiting and automatic resume behavior.
Environment¶
- Source host: 192.168.16.100
- Destination host: 192.168.16.3
- Source path: /DockerAppData(old)/
- Destination path: /opt/docker-apps
Problem¶
A large data copy needed to survive connection drops and resume without restarting from scratch.
Symptoms¶
- Concern about connection drops interrupting long file copies
- Earlier shell formatting issues caused an rsync loop to misbehave
- At one point the operator realized the sync had been run on the wrong host
Actions Taken¶
- Built an rsync command with:
  - bandwidth limiting
  - SSH keepalive settings
  - --partial
  - --append-verify
- Wrapped it in a retry loop.
- Added a completion banner after the loop.
- Discussed how to cancel the loop cleanly.
- Used a no-change rsync dry run to confirm final completion.
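The retry loop and completion banner described above can be sketched as follows. This is not the exact snippet from the session: the function names, the RETRY_DELAY and BWLIMIT_KBPS variables, and the keepalive values are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds, then print a completion banner.

retry_until_ok() {
  local delay="${RETRY_DELAY:-10}"
  until "$@"; do
    echo "command failed (exit $?); retrying in ${delay}s" >&2
    sleep "$delay"
  done
  echo "=== completed at $(date) ==="
}

sync_appdata() {
  # src and dst are supplied by the caller, e.g. user@192.168.16.100:<source path>/ and /opt/docker-apps/
  local src="$1" dst="$2"
  rsync -a --partial --append-verify --bwlimit="${BWLIMIT_KBPS:-20000}" \
    -e "ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=6" \
    "$src" "$dst"
}

# Usage, after confirming the current host with `hostname`:
# retry_until_ok sync_appdata "$SRC" /opt/docker-apps/
# Final check: a dry run with --stats should report zero regular files transferred.
# rsync -a --dry-run --stats -e ssh "$SRC" /opt/docker-apps/
```

Keeping the retry wrapper separate from the rsync call makes it easy to test the loop with a harmless command before pointing it at real data.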
Key Findings¶
- The resumable rsync strategy was suitable for unstable or long-running transfers.
- A clean completion was indicated by:
- zero regular files transferred
- zero transferred file size
- return to shell prompt and completion banner
- The approach worked, but host context must be verified before running it.
Resolution¶
The rsync process and verification pattern were established as a reusable migration workflow.
Validation¶
Validation was provided by the successful dry-run output showing no files needed transfer and a completion timestamp.
Follow-Up Tasks¶
- clean any accidental sync artifacts from the wrong host
- keep the resumable rsync snippet in the homelab runbook
- use hostname or IP checks before future migrations
Lessons Learned¶
- --partial --append-verify is a strong default for resumable LAN copies.
- Explicit completion output improves operator confidence.
- Always verify the current host before starting a long-running migration.
Fix Compose File Discovery, Path Assumptions, and Dockge Startup¶
Summary¶
Compose stack management was briefly blocked by incorrect assumptions about file locations. Compose files were stored under per-stack subdirectories within /opt/compose, not directly under /opt/compose.
Environment¶
- Compose root: /opt/compose
- Example compose paths:
  - /opt/compose/dockge/compose.yml
  - /opt/compose/traefik/compose.yaml
  - /opt/compose/arr_stack/compose.yaml
  - others under stack subdirectories
Problem¶
Service commands failed because they referenced non-existent flat paths like /opt/compose/dockge.yml.
Symptoms¶
- Docker Compose reported: open /opt/compose/dockge.yml: no such file or directory
- find with -maxdepth 1 returned no compose files, which initially obscured the real directory structure.
Actions Taken¶
- Listed compose files recursively under /opt/compose.
- Identified the real Dockge compose file path: /opt/compose/dockge/compose.yml
- Corrected the startup command to use the real path.
- Confirmed that docker compose -f <path> can be run without changing into the directory.
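Recursive discovery of the kind described above can be sketched with find. The function name is hypothetical, and the docker-compose.* file names are an assumption added for completeness; only compose.yml and compose.yaml appear in these notes.

```shell
#!/usr/bin/env bash
# Sketch: find compose files at any depth under a root directory.

find_compose_files() {
  local root="${1:-/opt/compose}"
  find "$root" -type f \( -name 'compose.yml' -o -name 'compose.yaml' \
    -o -name 'docker-compose.yml' -o -name 'docker-compose.yaml' \) | sort
}

# Usage: list the stacks, then start one by its real path without changing directory.
# find_compose_files /opt/compose
# docker compose -f /opt/compose/dockge/compose.yml up -d
```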
Key Findings¶
- Compose files were organized by stack subdirectory, not flat naming.
- Recursive discovery is required for bulk maintenance tasks.
- The issue was path discovery, not a Docker or Dockge runtime failure.
Resolution¶
Compose operations were updated to target the actual file paths under /opt/compose/<stack>/compose.yml|yaml.
Validation¶
Validation came from recursive file discovery and the corrected startup command format.
Follow-Up Tasks¶
- standardize compose file naming if desired
- maintain a reusable compose discovery command
- validate all compose paths before mass automation
Lessons Learned¶
- Never assume a flat compose directory in a multi-stack homelab.
- File discovery should be recursive before bulk edits or startup automation.
Bulk Remove Deprecated Compose version Lines¶
Summary¶
A reusable regex-based bulk edit was developed to remove obsolete version: "3.x" lines from compose files stored under /opt/compose.
Environment¶
- Compose root: /opt/compose
- File types:
  - compose.yml
  - compose.yaml
  - other YAML compose files
Problem¶
Compose files still included deprecated version: declarations that were no longer required.
Symptoms¶
- Initial bulk edit attempt returned "sed: no input files" because the command assumed flat files at /opt/compose depth 1.
- Real compose files were nested in stack subdirectories.
Actions Taken¶
- Searched recursively under /opt/compose to locate all compose files.
- Reworked the bulk edit command to run against the discovered files.
- Included .bak backup creation.
- Added verification via grep.
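The recursive cleanup can be sketched as below. This is one possible form, not necessarily the regex used in the session: it assumes GNU sed and find, the function names are hypothetical, and it touches every .yml or .yaml file under the root, so the root should contain only compose files.

```shell
#!/usr/bin/env bash
# Sketch: delete top-level `version:` lines from YAML files under a root, keeping .bak copies.
# Assumes GNU sed and GNU find (as shipped on Debian).

strip_compose_version() {
  local root="${1:-/opt/compose}"
  # Matches lines such as: version: "3.8"   version: '3'   version: 3.7
  find "$root" -type f \( -name '*.yml' -o -name '*.yaml' \) -exec \
    sed -i.bak -E "/^version:[[:space:]]*[\"']?[0-9]+(\.[0-9]+)*[\"']?[[:space:]]*\$/d" {} +
}

remaining_version_lines() {
  # Prints any leftover matches; no output means the cleanup is complete.
  grep -rnE --include='*.yml' --include='*.yaml' '^version:' "${1:-/opt/compose}"
}

# Usage:
# strip_compose_version /opt/compose
# remaining_version_lines /opt/compose
```

Only unindented version: lines are removed, so a nested key with the same name inside a service definition is left alone.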
Key Findings¶
- The first failure was due to incorrect depth assumptions, not bad regex logic.
- Recursive compose file discovery solved the input problem.
- A regex removing only version: "3.x" style lines was sufficient and safe once the right files were targeted.
Resolution¶
A recursive regex-based cleanup approach was established for all compose files under /opt/compose.
Validation¶
Validation consisted of:
- successful recursive compose file discovery
- no remaining version: lines after the cleanup
- .bak files available for rollback
Follow-Up Tasks¶
- clean up .bak files when satisfied
- run docker compose config on edited stacks as a final sanity check
Lessons Learned¶
- Regex maintenance tasks are only as good as the file discovery feeding them.
- Always create backups for bulk in-place YAML edits.
Create Traefik Bridge Network and Restore Traefik ACME Permissions¶
Summary¶
Traefik-specific recovery work included creating the traefik-proxy Docker bridge network and troubleshooting write access to the Traefik ACME directory and acme.json file.
Environment¶
- Docker network: traefik-proxy
- Intended network settings:
  - driver: bridge
  - subnet: 172.35.0.0/16
  - gateway: 172.35.0.1
- Traefik config layout:
  - /opt/docker-apps/Traefik/config
  - expected ACME storage under .../letsencrypt/acme.json
Problem¶
Traefik required both a known bridge network and correct file permissions for its ACME storage, but directory layout assumptions and permissions caused confusion.
Symptoms¶
- Network needed to be recreated manually.
- acme.json creation attempts initially failed because the path assumption was wrong.
- WinSCP reported permission denied when writing to the Traefik letsencrypt directory.
- The letsencrypt directory was found to have been given file-like permissions (0600), which is invalid for a usable directory.
Actions Taken¶
- Created the traefik-proxy Docker bridge network with the intended subnet/gateway.
- Corrected the Traefik path assumption to use /opt/docker-apps/Traefik/config/letsencrypt
- Diagnosed permission issues with namei, ownership checks, and user/group reasoning.
- Corrected the guidance:
  - directories need execute bits
  - letsencrypt should be a directory
  - acme.json should be a file with tight permissions
- Established two valid permission models:
  - 1000:1000 with 700 directory and 600 file
  - or debian:docker with group-shared permissions if operationally needed
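The network and permission steps above can be sketched as two functions. This is a minimal sketch with hypothetical function names, using the 1000:1000 model with a 700 directory and a 600 file. The subnet and gateway are the ones recorded in this section; note that 172.35.0.0/16 lies outside the RFC 1918 private block 172.16.0.0/12 (172.16.0.0 to 172.31.255.255), which may be worth revisiting.

```shell
#!/usr/bin/env bash
# Sketch: recreate the Traefik network and repair ACME storage permissions.

create_traefik_network() {
  # No-op if the network already exists.
  docker network inspect traefik-proxy >/dev/null 2>&1 ||
    docker network create --driver bridge \
      --subnet 172.35.0.0/16 --gateway 172.35.0.1 traefik-proxy
}

fix_acme_perms() {
  local dir="${1:-/opt/docker-apps/Traefik/config/letsencrypt}"
  mkdir -p "$dir"
  chmod 700 "$dir"            # a directory needs the execute bit to be traversed
  touch "$dir/acme.json"
  chmod 600 "$dir/acme.json"  # the file, not the directory, gets 600
  # Ownership model 1000:1000 from this section; chown needs root.
  if [ "$(id -u)" -eq 0 ]; then chown -R 1000:1000 "$dir"; fi
}

# Usage:
# create_traefik_network
# fix_acme_perms && namei -l /opt/docker-apps/Traefik/config/letsencrypt/acme.json
```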
Key Findings¶
- The directory path was wrong at first because the real per-app layout is /opt/docker-apps/<app>/config.
- A directory set to 0600 cannot be traversed or written into normally because it lacks execute permission.
- acme.json should be the file restricted to 600, not the directory.
- Host-side group membership for the Docker socket is a separate concern from Traefik app data ownership.
Resolution¶
The correct Traefik ACME storage layout and permission model were re-established:
- use /opt/docker-apps/Traefik/config/letsencrypt
- ensure the directory has executable permission
- ensure acme.json is tightly permissioned
Validation¶
Validation included:
- checking path existence
- checking directory and file ownership/perms
- confirming the intended bridge network definition
- reasoning through SFTP/WinSCP write behavior
Follow-Up Tasks¶
- confirm Traefik compose volume mounts align with the corrected ACME path
- recreate or restore acme.json if needed
- restart Traefik and confirm certificate storage works
Lessons Learned¶
- Secret files and secret directories require different permission models.
- A directory without execute permission behaves like an inaccessible path even if ownership is otherwise correct.
- Reverse proxy recovery work should validate both network and storage assumptions.
Standardize App Permissions Under /opt/docker-apps/<App>/config¶
Summary¶
The restored environment required a consistent permissions policy for application directories stored under /opt/docker-apps/<App>/config. Special handling was discussed for Traefik, Authelia, Gluetun, DB-backed apps, TubeArchivist, Plex, and related services.
Environment¶
- App root: /opt/docker-apps
- Per-app layout: /opt/docker-apps/<App>/config
- Runtime UID/GID standard: 1000:1000
Problem¶
Restored files from migration or backup could preserve old ownership or overly open/overly restrictive modes. Some apps require special handling for secrets, DBs, or media/transcode/log access.
Symptoms¶
- Permission denied errors in Traefik-related paths
- Need to decide whether 1000:1000 or debian:docker was appropriate
- Concern about app-specific special cases
Actions Taken¶
- Established a baseline permissions policy:
  - ownership 1000:1000
  - directories 2775
  - files 0664
- Identified categories requiring tighter permissions:
  - .env
  - acme.json
  - keys
  - VPN credentials
  - Authelia config
- Identified DB file patterns that should be 0660.
- Documented app-specific exceptions for Traefik, Authelia, Gluetun, TubeArchivist, Plex, Tautulli, Arr stack, Syncthing, and selected web apps.
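The baseline policy above can be sketched as a single function applied per app directory. This is an illustration, not the checklist produced in the session: the function name is hypothetical, and the file-name patterns for secrets and databases are assumptions that would need adjusting per app.

```shell
#!/usr/bin/env bash
# Sketch: apply the baseline modes to one app directory, then tighten known secrets.

apply_baseline_perms() {
  local app_dir="$1"
  # Ownership 1000:1000; chown needs root, so it is skipped otherwise.
  if [ "$(id -u)" -eq 0 ]; then chown -R 1000:1000 "$app_dir"; fi
  find "$app_dir" -type d -exec chmod 2775 {} +
  find "$app_dir" -type f -exec chmod 0664 {} +
  # Secrets: these name patterns are illustrative assumptions, not a complete list.
  find "$app_dir" -type f \( -name '.env' -o -name 'acme.json' -o -name '*.key' \) \
    -exec chmod 0600 {} +
  # Database files: an assumed set of extensions.
  find "$app_dir" -type f \( -name '*.db' -o -name '*.sqlite' -o -name '*.sqlite3' \) \
    -exec chmod 0660 {} +
}

# Usage:
# apply_baseline_perms /opt/docker-apps/<App>/config
```

App-specific exceptions, such as a 700 letsencrypt directory for Traefik, would be applied after the baseline, since the baseline resets every directory to 2775.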
Key Findings¶
- Most apps can use one sane baseline.
- Secret files should be 0600.
- DB files should usually be 0660.
- Traefik acme.json is a must-tighten file.
- OpenSearch data for TubeArchivist should not be world-readable.
- Plex transcode benefits from a sticky temp-style directory.
Resolution¶
A reusable permissions checklist and shell snippets were developed for the /opt/docker-apps/<App>/config layout.
Validation¶
Validation was intended through:
- ls -l
- namei -l
- secret file checks
- app startup behavior after permission normalization
Follow-Up Tasks¶
- apply the baseline and special-case permissions to restored apps
- audit secret file permissions after full restore
- verify the host Docker socket access model for Traefik separately
Lessons Learned¶
- Baseline-plus-exceptions is more maintainable than one-off manual permissions.
- Numeric UID/GID alignment with container PUID/PGID is usually the cleanest host-side model.
- Secret directories and secret files must be treated differently.
Increase VM 100 Resources and Discuss HA Across Dissimilar Nodes¶
Summary¶
VM 100 started with 2 GiB of RAM and later required an increase. CPU sizing and the implications of running HA across a more powerful tower PC and a weaker Intel NUC were also discussed.
Environment¶
- VM: 100
- Proxmox HA context:
  - stronger tower PC
  - weaker Intel NUC
- Resource targets discussed:
  - RAM increase to 4 GiB
  - possible CPU increase
Problem¶
The rebuilt Docker VM needed more memory, and there was a broader design question about how to handle CPU sizing in a cluster with mixed host performance.
Symptoms¶
- VM memory was only 2 GiB
- Desire to increase to 4 GiB and possibly raise CPU allocation
- Concern that a VM sized for a tower might not fit or perform similarly on a weaker HA target
Actions Taken¶
- Proposed changing VM memory to 4 GiB.
- Discussed increasing CPU with Proxmox qm set.
- Explained that one VM has one config, even in HA.
- Discussed using a portable CPU model instead of host if live migration or HA portability across dissimilar nodes matters.
- Discussed using maximum topology plus boot-time online vCPU adjustment via hookscript as a more advanced option.
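The resource change discussed above can be sketched with qm set. The values are assumptions rather than decisions from the session: 4 cores is a placeholder, and x86-64-v2-AES is used as an example of a portable CPU model, which assumes a Proxmox VE release that offers it and hardware on both nodes that supports it. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: resize VM 100 from the Proxmox host. DRY_RUN=1 (default here) only prints.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

resize_vm() {
  local vmid="${1:-100}" mem_mib="${2:-4096}" cores="${3:-4}" cpu="${4:-x86-64-v2-AES}"
  # Memory is given in MiB; 4096 corresponds to the 4 GiB target.
  run qm set "$vmid" --memory "$mem_mib" --cores "$cores"
  # A named model instead of "host" presents the same CPU on the tower and the NUC.
  run qm set "$vmid" --cpu "$cpu"
  run qm config "$vmid"
}

resize_vm 100 4096 4 x86-64-v2-AES
```

Because one VM has one config, the chosen core count and CPU model need to be values the weaker node can actually provide.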
Key Findings¶
- RAM increase is straightforward.
- CPU sizing is more nuanced in mixed-node HA:
  - one VM config applies across nodes
  - cpu: host is best performance but worse portability
  - portable CPU models help with migration across different hardware
- The final resource decision depends on whether migration portability or performance is more important.
Resolution¶
The intended direction was to increase VM memory to 4 GiB and consider a moderate CPU increase, with awareness that HA across dissimilar nodes requires deliberate CPU model choices.
Validation¶
No final resource reconfiguration was confirmed in-guest in this session, but the correct Proxmox resource-setting approach was established.
Follow-Up Tasks¶
- set final RAM and CPU values on VM 100
- decide whether to prioritize cpu: host performance or a portable CPU model
- document HA behavior expectations across tower and NUC nodes
Lessons Learned¶
- Resource changes are easy; mixed-node portability is the hard part.
- HA design should account for the weakest node that may need to start the VM.
Command Reference¶
Command¶
qm clone 9000 100 --name debian-docker --storage cephpool --full 1
What it does
Creates VM 100 from template 9000 as a full clone on the Ceph RBD-backed storage pool cephpool.
Important flags
- 9000: source template VM ID
- 100: destination VM ID
- --name debian-docker: sets the VM name
- --storage cephpool: places clone storage on the Ceph pool
- --full 1: creates an independent full clone instead of a linked clone
Why it was used
To rebuild the main Docker VM cleanly.
Expected result
A new VM 100 exists on cephpool and can be configured with cloud-init and an extra data disk.
What failure indicates
Template, storage, or Proxmox clone errors.
Risk
Low to moderate. It creates a new VM but does not by itself destroy the old one.
Command¶
qm set 100 --scsihw virtio-scsi-single
What it does
Sets the guest’s SCSI controller to virtio-scsi-single.
Why it was used
To provide a modern, stable controller for multiple attached disks.
Expected result
The VM configuration shows the selected SCSI controller.
What failure indicates
VM config or Proxmox-side issue.
Proxmox relevance
Controller choice affects how additional disks are presented to the guest.
Command¶
qm set 100 --scsi1 cephpool:0,size=150G,ssd=1,discard=on,cache=writeback
What it does
Attempts to create a new Ceph-backed disk and attach it as scsi1.
Important flags
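- cephpool:0: intended to create a new disk on cephpool. In the STORAGE:SIZE form the number after the colon is the size in GiB, so 0 requests a zero-size volume, which is consistent with the 0T disk that was observed
- size=150G: intended disk size. In a disk spec this field is informational and does not set the allocation; a form such as cephpool:150 would request 150 GiB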
- ssd=1: mark as SSD-like
- discard=on: allow discard/TRIM semantics
- cache=writeback: use writeback caching
Why it was used
To create the dedicated Docker data disk that would later mount at /var/lib/docker.
Expected result
A new Ceph RBD image appears and the VM config shows scsi1.
What failure indicates
Incorrect Proxmox/Ceph disk syntax or storage-layer handling issues.
Risk
Moderate. Mis-specified storage arguments can create an unusable disk.
Safer alternative
Verify the resulting disk immediately with qm config 100 and storage inspection before proceeding.
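The verification step above can be scripted by pulling the size= field for the disk out of the qm config output. This is a minimal sketch: disk_size_from_config and check_disk_size are hypothetical helper names, and the parsing assumes the usual "key: storage:volume,opt=val,..." line layout.

```shell
# Hypothetical helper: print the size= value recorded for one disk key.
# Reads `qm config <vmid>` style output on stdin.
disk_size_from_config() {
  key="$1"
  sed -n "s/^${key}:.*[,[:space:]]size=\([0-9][0-9]*[KMGT]\).*/\1/p"
}

# Hypothetical helper: fail loudly if the disk is missing or not the expected size.
check_disk_size() {
  key="$1"; expected="$2"
  actual="$(disk_size_from_config "$key")"
  if [ "$actual" != "$expected" ]; then
    echo "disk ${key}: expected ${expected}, got '${actual:-none}'" >&2
    return 1
  fi
}

# Usage on the Proxmox host (not run here):
#   qm config 100 | check_disk_size scsi1 150G
```

A non-zero exit right after the qm set call would have flagged the 0T disk before the guest was booted.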
Command¶
qm set 100 --ide2 cephpool:cloudinit
What it does
Attaches a cloud-init disk to VM 100 on cephpool.
Why it was used
To provide NoCloud seed data built from Proxmox cloud-init settings.
Expected result
The VM has an ide2 cloud-init drive attached.
What failure indicates
If it reports File exists, the cloud-init volume is already present and should not be recreated.
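To avoid re-attaching a drive that is already configured, the VM config can be checked first. A minimal sketch, assuming cloud-init drives appear in qm config as a bus key followed by a volume name containing "cloudinit"; has_cloudinit_drive is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the VM config (on stdin) already lists
# a cloud-init drive on an ide, sata, or scsi slot.
has_cloudinit_drive() {
  grep -Eq '^(ide|sata|scsi)[0-9]+: [^,]*cloudinit'
}

# Usage on the Proxmox host (not run here):
#   if qm config 100 | has_cloudinit_drive; then
#     echo "cloud-init drive already attached"
#   else
#     qm set 100 --ide2 cephpool:cloudinit
#   fi
```

This only inspects the VM config. An orphaned vm-100-cloudinit image left on the pool by a previous VM would not show up here and would still need to be checked at the storage layer.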
Risk
Low.
Command¶
qm set 100 --cicustom "user=local:snippets/docker-userdata.yml,network=local:snippets/docker-net.yml"
What it does
Tells Proxmox to use custom cloud-init snippet files for user-data and network-config.
Important arguments
- user=local:snippets/docker-userdata.yml
- network=local:snippets/docker-net.yml
Why it was used
The default cloud-init template behavior was not sufficient for the Docker VM’s custom provisioning.
Expected result
qm config 100 reflects the custom cloud-init snippet references.
What failure indicates
Snippet path or storage content-type issues.
Proxmox relevance
This is how Proxmox consumes user-managed cloud-init YAML from snippet-capable storage.
Command¶
qm cloudinit update 100
What it does
Regenerates the cloud-init ISO for VM 100 after snippet or config changes.
Why it was used
To ensure updated user-data and network-config were baked into the next boot.
Expected result
A fresh cloud-init seed image is generated.
What failure indicates
Cloud-init disk problems or malformed VM config.
Command¶
qm terminal 100
What it does
Opens the serial console for VM 100 from the Proxmox host.
Why it was used
To inspect boot logs, cloud-init output, and early mount failures when SSH was not yet reliable.
Expected result
Live serial console output from the guest.
What failure indicates
Console misconfiguration or a VM state issue.
Proxmox relevance
Serial console access is often the fastest path to diagnose cloud-image first-boot issues.
Command¶
lsblk -f
What it does
Lists block devices, filesystems, labels, and mountpoints.
Why it was used
To confirm whether /dev/sdb existed and whether it had a filesystem.
Expected result
sdb appears with filesystem and label once formatted.
What failure indicates
If sdb lacks a filesystem, the Docker data disk has not been initialized.
Command¶
sudo mkfs.ext4 -F -L docker-data /dev/sdb
What it does
Creates an ext4 filesystem on /dev/sdb with label docker-data.
Important flags
- -F: force filesystem creation
- -L docker-data: assign a filesystem label
Why it was used
To recover from the first-boot failure where /dev/sdb existed but had no ext4 filesystem.
Expected result
lsblk -f and blkid show ext4 and label docker-data.
What failure indicates
Disk problems, permissions problems, or use of the wrong device.
Risk
High. This destroys existing contents on /dev/sdb.
Safer alternative
Double-check the target device with lsblk -f before running it.
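The double-check can be turned into a guard that refuses to proceed unless the disk looks empty. A minimal sketch, assuming the key="value" layout that lsblk prints with -P; is_blank_disk is a hypothetical helper name.

```shell
# Hypothetical guard: succeed only if the named disk shows no filesystem,
# no mountpoint, and no partitions.
# Reads `lsblk -Pno NAME,FSTYPE,MOUNTPOINT /dev/<disk>` output on stdin.
is_blank_disk() {
  disk="$1"
  awk -v want="NAME=\"${disk}\" FSTYPE=\"\" MOUNTPOINT=\"\"" '
    { lines++; if ($0 == want) blank = 1 }
    END { exit !(lines == 1 && blank) }   # exactly one line, and it is blank
  '
}

# Usage in the guest (not run here):
#   lsblk -Pno NAME,FSTYPE,MOUNTPOINT /dev/sdb | is_blank_disk sdb \
#     && sudo mkfs.ext4 -L docker-data /dev/sdb
```

With the guard in place, -F can be dropped from mkfs.ext4 so the tool's own safety prompts stay active.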
Command¶
sudo mount -a
What it does
Attempts to mount all entries from /etc/fstab.
Why it was used
To activate newly added filesystem and bind mount definitions.
Expected result
All valid fstab entries mount without error.
What failure indicates
Bad paths, missing source directories, missing filesystems, or invalid fstab syntax.
Risk
Moderate. A bad fstab can break later boots if not corrected.
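One common mitigation for the boot risk is the nofail mount option on non-root entries, so that a missing data disk does not stop the boot. Whether this VM's fstab used it is not recorded, so treat the following as an assumption. The sketch checks a given fstab entry for an option; fstab_entry_has_option is a hypothetical helper name.

```shell
# Hypothetical check: succeed if the fstab entry for a mountpoint carries
# the given option.
# Usage: fstab_entry_has_option <fstab-file> <mountpoint> <option>
fstab_entry_has_option() {
  awk -v mp="$2" -v opt="$3" '
    $1 ~ /^#/ || NF < 4 { next }          # skip comments and short lines
    $2 == mp {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
    }
    END { exit !found }
  ' "$1"
}

# Usage in the guest (not run here):
#   fstab_entry_has_option /etc/fstab /var/lib/docker nofail \
#     || echo "warning: /var/lib/docker entry lacks nofail"
```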
Command¶
sudo systemctl start docker
What it does
Starts the Docker daemon.
Why it was used
Docker could only start after /var/lib/docker was successfully mounted from /dev/sdb.
Expected result
Docker service becomes active.
What failure indicates
Dependency or storage-root problems.
Docker relevance
Docker will fail or behave incorrectly if its data-root is unavailable or mounted on the wrong filesystem.
Command¶
sudo systemctl status docker --no-pager
What it does
Shows Docker service status without invoking a pager.
Why it was used
To verify whether Docker was running after storage fixes.
Expected result
active (running).
What failure indicates
Mount dependency failure, daemon config issue, or container runtime issue.
Command¶
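for m in /var/lib/docker /opt/docker-apps /opt/compose; do findmnt "$m"; done
What it does
Shows the live mount source and filesystem type for each target. findmnt takes a single device or mountpoint as its positional argument (a second one is read as a target filter), so the targets are queried one at a time.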
Why it was used
To verify the Docker data mount and bind mounts, and to detect layered or duplicate mounts.
Expected result
/var/lib/docker points to /dev/sdb or the ext4 label; /opt/docker-apps and /opt/compose point to bind sources under /var/lib/docker.
What failure indicates
Incorrect bind mounts, stale mounts, or mount source problems.
Command¶
cloud-init status --long
What it does
Reports detailed cloud-init state in the guest.
Why it was used
To determine whether provisioning had completed or failed.
Expected result
Cloud-init modules complete without fatal errors.
What failure indicates
Provisioning or datasource issues.
Cloud-init relevance
Useful for separating provisioning failure from ordinary system boot behavior.
Command¶
Likely command used:
cloud-init devel schema --config-file /var/lib/vz/snippets/docker-userdata.yml
What it does
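Validates the cloud-init YAML file against the expected schema. Newer cloud-init releases expose the same check as cloud-init schema, and the command has to run on a machine with cloud-init installed, which a Proxmox host may not have.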
Why it was used
To diagnose the invalid cloud-config warning and catch malformed YAML before reuse.
Expected result
Validation passes without schema errors.
What failure indicates
The YAML contains unsupported or malformed keys.
Safer alternative
Validate before attaching the snippet to a production VM.
Command¶
find /srv/remotemount/NAS -maxdepth 5 -iname 'backup-[date removed].tar*' -printf '%p\n'
What it does
Searches the NAS-mounted backup tree for the expected Offen backup archive.
Why it was used
The original archive path assumption was wrong and needed to be corrected.
Expected result
Prints the full path to the matching archive.
What failure indicates
Wrong path assumption, missing NAS mount, or missing backup file.
Command¶
tar -tzf "/srv/remotemount/NAS/Tools/Backups/Docker/offen/backup-[date removed].tar.gz" | head -n 20
What it does
Lists the first 20 entries in the gzipped tar archive without extracting it.
Why it was used
To determine the internal path structure and the correct strip count for restoration.
Expected result
Archive entries showing the leading directories, such as /backup/my-app-backup/....
What failure indicates
Bad path, archive corruption, or wrong compression assumption.
Command¶
sudo tar --numeric-owner --same-owner --acls --xattrs -xpf "$ARCH" -C /opt/docker-apps --strip-components=2 -z
What it does
Extracts the Offen archive into /opt/docker-apps, removing the first two path components.
Important flags
- --numeric-owner: preserve numeric UID/GID
- --same-owner: restore ownership if possible
- --acls --xattrs: preserve ACLs and extended attributes
- -xpf: extract from file and preserve permissions
- -C /opt/docker-apps: destination directory
- --strip-components=2: remove backup/my-app-backup
- -z: treat input as gzip-compressed
Why it was used
The archive layout was /backup/my-app-backup/<AppName>/..., while the live destination should be /opt/docker-apps/<AppName>/....
Expected result
App directories appear directly under /opt/docker-apps.
What failure indicates
Wrong strip count, path problems, or archive issues.
Risk
Moderate to high. This can overwrite existing restored data.
Safer alternative
Extract into a staging directory first, then rsync into place.
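The staging approach can be sketched as a small function that unpacks into a fresh temporary directory and prints its path for review. extract_to_staging is a hypothetical helper name. The sketch leaves out the ownership, ACL, and xattr flags from the original command to stay portable; they would need to be added back for a real restore run as root.

```shell
# Hypothetical helper: unpack a gzipped archive into a fresh staging
# directory, dropping the leading path components, and print the staging path.
# Usage: extract_to_staging <archive.tar.gz> <strip-count>
extract_to_staging() {
  archive="$1"; strip="$2"
  staging="$(mktemp -d)" || return 1
  tar -xzpf "$archive" -C "$staging" --strip-components="$strip" || return 1
  echo "$staging"
}

# Usage (not run here):
#   STAGE="$(extract_to_staging "$ARCH" 2)"
#   ls "$STAGE"                                   # review the app directories
#   sudo rsync -aHAX --numeric-ids "$STAGE"/ /opt/docker-apps/
```

Reviewing the staging directory first also confirms the strip count before anything under /opt/docker-apps is touched.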
Command¶
sudo rsync -aHAX --numeric-ids --partial --append-verify --info=progress2 --stats --bwlimit=12M -e "ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=6" root@192.168.16.100:"/DockerAppData(old)/" /opt/docker-apps
What it does
Pulls app data from the old host into /opt/docker-apps with metadata preservation, throttling, and resumable transfer behavior.
Important flags
- -aHAX: archive, hardlinks, ACLs, xattrs
- --numeric-ids: preserve UID/GID numerically
- --partial: keep partial files
- --append-verify: resume a shorter destination file by appending to it, then checksum the whole file and resend it if the result does not match
- --info=progress2 --stats: detailed progress and summary
- --bwlimit=12M: rate limit
- SSH keepalive options: prevent idle disconnects
Why it was used
To move a large Docker application tree safely over the LAN even if connections dropped.
Expected result
Data copies into /opt/docker-apps and can be resumed if interrupted.
What failure indicates
SSH, path, permission, or network interruption issues.
Risk
Moderate. Running on the wrong host or wrong destination can sync into the wrong place.
Safer alternative
Echo hostname and IP before starting long transfers.
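The host check can be made a hard precondition instead of a visual one. A minimal sketch: require_host is a hypothetical helper name, and the expected node name in the usage line is an assumption, since the guest's actual hostname is not recorded here.

```shell
# Hypothetical guard: refuse to continue unless this machine's node name
# matches the expected one.
require_host() {
  expected="$1"
  actual="$(uname -n)"
  if [ "$actual" != "$expected" ]; then
    echo "refusing to run: on '${actual}', expected '${expected}'" >&2
    return 1
  fi
}

# Usage (not run here; "debian-docker" is an assumed hostname):
#   require_host debian-docker && sudo rsync -aHAX ... /opt/docker-apps
```

uname -n may return a fully qualified name on some systems, so the expected value should be copied from the target machine rather than typed from memory.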
Command¶
sudo rsync -aHAXn --delete --info=stats2,flist0,del0 -e "ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=6" root@192.168.16.100:"/DockerAppData(old)/" /opt/docker-apps
What it does
Runs a dry-run rsync comparison without changing files.
Why it was used
To confirm that the migration had completed and source and destination matched.
Expected result
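The stats report zero files transferred and zero deletions when the trees are in sync. Without --checksum the comparison relies on file size and modification time.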
What failure indicates
Remaining drift between source and destination.
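The dry-run result can be checked mechanically by reading the summary counters. A minimal sketch, assuming the "Number of ..." lines that rsync 3.1 and later print with --stats or --info=stats2; rsync_stats_in_sync is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the rsync stats on stdin report
# nothing created, deleted, or transferred.
rsync_stats_in_sync() {
  awk -F': ' '
    /^Number of (created|deleted) files:/ || /^Number of regular files transferred:/ {
      seen++
      if ($2 + 0 != 0) drift = 1          # any non-zero counter means drift
    }
    END { exit !(seen > 0 && !drift) }
  '
}

# Usage (not run here):
#   sudo rsync -aHAXn --delete --info=stats2 -e "ssh ..." SRC/ /opt/docker-apps \
#     | rsync_stats_in_sync && echo "trees match"
```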
Command¶
find /opt/compose -type f \( -iname 'docker-compose.yml' -o -iname 'compose.yml' -o -iname '*.yml' -o -iname '*.yaml' \)
What it does
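Recursively finds candidate compose files under /opt/compose. The '*.yml' and '*.yaml' patterns already cover the two named files, so the command lists every YAML file in the tree, not only compose files.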
Why it was used
Compose files were not stored flat under /opt/compose; they lived in stack subdirectories.
Expected result
A list of actual compose file paths.
What failure indicates
Wrong root path or missing compose files.
Command¶
docker compose -f /opt/compose/dockge/compose.yml up -d
What it does
Starts the Dockge stack using the correct nested compose file path.
Why it was used
The earlier command assumed a nonexistent path /opt/compose/dockge.yml.
Expected result
Dockge containers start in detached mode.
What failure indicates
Wrong path, compose syntax problem, or runtime issue.
Docker relevance
docker compose -f allows stack management without changing into the compose directory.
Command¶
find /opt/compose -type f \( -name '*.yml' -o -name '*.yaml' \) -exec sed -ri.bak -e 's/\r$//' -e "/^[[:space:]]*version:[[:space:]]*['\"]?3(\.[0-9]+)?['\"]?[[:space:]]*(#.*)?$/d" {} +
What it does
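Recursively edits compose files in place, removes Windows CRLF if present, and deletes deprecated version: "3.x" lines. The pattern allows leading whitespace, so an indented version: 3.x key would be removed as well, not only the top-level declaration.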
Why it was used
To clean up legacy Compose syntax across many stack files.
Expected result
No remaining top-level version: declarations matching 3.x, with .bak rollback files preserved.
What failure indicates
Wrong file targeting or shell quoting issues.
Risk
Moderate. Bulk in-place edits affect many files.
Safer alternative
Run a dry-run grep first and keep backups until validation is complete.
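The dry-run can reuse the same pattern as the sed expression, so the preview shows exactly the lines that would be deleted. A minimal sketch; list_compose_version_lines is a hypothetical helper name.

```shell
# Hypothetical helper: list file, line number, and text of every line the
# bulk sed edit would delete, without changing anything.
# [[:space:]] also matches a trailing carriage return, so CRLF files are covered.
list_compose_version_lines() {
  root="$1"
  find "$root" -type f \( -name '*.yml' -o -name '*.yaml' \) \
    -exec grep -EHn "^[[:space:]]*version:[[:space:]]*['\"]?3(\.[0-9]+)?['\"]?[[:space:]]*(#.*)?\$" {} +
}

# Usage (not run here):
#   list_compose_version_lines /opt/compose
```

Any hit on an indented line is worth a manual look before the in-place edit is run.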
Command¶
docker network create --driver bridge --subnet 172.35.0.0/16 --gateway 172.35.0.1 traefik-proxy
What it does
Creates the traefik-proxy Docker bridge network with the specified subnet and gateway.
Why it was used
Traefik and proxied services needed a known shared bridge network.
Expected result
traefik-proxy exists with the requested IPAM settings.
What failure indicates
Name conflict or subnet overlap.
Docker relevance
A shared bridge network is the standard pattern for Traefik-to-service communication on a single Docker host.
Command¶
namei -l /opt/docker-apps/Traefik/config/letsencrypt/acme.json
What it does
Displays permissions and ownership for every path component from / to the target file.
Why it was used
To diagnose WinSCP and shell permission errors in the Traefik letsencrypt path.
Expected result
Each directory in the chain is traversable and the file exists with the intended ownership and mode.
What failure indicates
Missing directories or insufficient execute/read/write permission somewhere in the path.
Command¶
sudo install -m 600 -o 1000 -g 1000 /dev/null /opt/docker-apps/Traefik/config/letsencrypt/acme.json
What it does
Creates an empty acme.json with strict permissions and explicit ownership. If the file already exists, it is overwritten with the empty source.
Important flags
- -m 600: set file mode to 600
- -o 1000 -g 1000: set owner/group
- /dev/null: source for creating an empty file
Why it was used
Traefik requires acme.json to exist with strict permissions.
Expected result
An empty but correctly permissioned acme.json at the target path.
What failure indicates
Missing parent directory or permission problem.
Risk
Moderate. If the file already exists, install copies /dev/null over it, which discards any certificates already stored in it.
Safer alternative
Use touch and then chmod/chown if preserving existing contents is critical.
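The non-destructive variant can be sketched as follows. ensure_private_file is a hypothetical helper name. Ownership is left out of the function because chown to another user needs root.

```shell
# Hypothetical helper: create the file only if it is missing, then tighten
# its mode. touch does not truncate, so existing contents are left untouched.
ensure_private_file() {
  path="$1"
  touch "$path" && chmod 600 "$path"
}

# Usage (not run here):
#   sudo sh -c 'touch "$1" && chmod 600 "$1" && chown 1000:1000 "$1"' _ \
#     /opt/docker-apps/Traefik/config/letsencrypt/acme.json
```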
Command¶
qm set 100 --memory 4096
What it does
Sets VM 100 memory allocation to 4096 MiB (4 GiB).
Why it was used
The rebuilt Docker VM initially had only 2 GiB and needed more RAM.
Expected result
VM config reflects 4 GiB of memory.
What failure indicates
VM config or Proxmox-side issue.
Proxmox relevance
RAM sizing directly affects guest workload stability and HA fit on target nodes.
Command¶
Likely command used:
qm set 100 --cpu host --sockets 1 --cores 2
What it does
Sets CPU model and topology for the VM.
Why it was discussed
The VM may need more CPU, but HA portability across mismatched nodes complicates the choice.
Expected result
VM config reflects the new CPU topology.
What failure indicates
Proxmox config issue or an architectural mismatch with migration requirements.
Risk
Low for the config change itself; higher operationally if cpu: host is used across mismatched HA nodes.
Safer alternative
Use a portable baseline CPU model when cross-node migration compatibility matters more than peak performance.
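As an illustration of the portable option, the sketch below prints the qm command for a chosen CPU model without running it. cpu_set_command is a hypothetical helper name. x86-64-v2-AES is one of the built-in baseline models in recent Proxmox VE releases; whether it suits the tower and NUC nodes here is an assumption that would need checking against their actual CPUs.

```shell
# Hypothetical dry-run helper: print (do not run) the qm command for a
# CPU model and core count, so it can be reviewed before it is applied.
cpu_set_command() {
  vmid="$1"; model="$2"; cores="$3"
  echo "qm set ${vmid} --cpu ${model} --sockets 1 --cores ${cores}"
}

# Usage (not run here):
#   cpu_set_command 100 x86-64-v2-AES 2    # portable baseline
#   cpu_set_command 100 host 2             # best performance, least portable
```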