Proxmox HA Failover vs Live Migration with Ceph-Backed VM Storage¶
Summary¶
This work session clarified how Proxmox VM migration behaves when a VM uses Ceph-backed shared storage. The goal was to understand whether migrating a VM from pve1 to pve2 should be effectively instantaneous, whether a 16 GiB transfer indicated disk movement, and how this differs from High Availability (HA) failover when a node goes down.
The discussion also reviewed an actual Proxmox migration log for VM 110, identified that the transfer shown was VM memory state rather than disk, and distinguished between:
- planned live migration while the source node remains healthy
- HA restart behavior after node failure
Environment¶
- Platform: Proxmox VE cluster
- Nodes involved:
pve1,pve2 - VM involved:
VM 110 - Shared storage context: Ceph-backed shared storage
- Ceph RBD is the most likely VM disk backend in this scenario
- CephFS was also discussed conceptually as shared storage
- Migration type discussed:
- live migration
- HA failover / restart on another node
- Cluster networking referenced:
192.168.16.x - Example remote node address in log:
192.168.16.13 - VM memory size observed in migration log:
16.0 GiB
Problem¶
There was uncertainty about what Proxmox transfers during a VM migration when the VM uses shared Ceph storage. Specifically:
- whether Ceph or CephFS makes migration effectively instantaneous
- whether the migration process should avoid transferring 16 GiB
- whether Proxmox HA would still "move" the VM automatically if the source node fails
Symptoms¶
Observed migration log from Proxmox for VM 110:
[date removed] starting migration of VM 110 to node 'pve2' (192.168.16.13)
[date removed] starting VM 110 on remote node 'pve2'
[date removed] start remote tunnel
[date removed] ssh tunnel ver 1
[date removed] starting online/live migration on unix:/run/qemu-server/110.migrate
[date removed] set migration capabilities
[date removed] migration downtime limit: 100 ms
[date removed] migration cachesize: 2.0 GiB
[date removed] set migration parameters
[date removed] start migrate command to unix:/run/qemu-server/110.migrate
[date removed] migration active, transferred 107.8 MiB of 16.0 GiB VM-state, 114.9 MiB/s
...
[date removed] migration active, transferred 3.1 GiB of 16.0 GiB VM-state, 112.7 MiB/s
[date removed] ERROR: online migrate failure - interrupted by signal
[date removed] aborting phase 2 - cleanup resources
[date removed] migrate_cancel
Additional clarification from the operator:
- the migration was manually cancelled
- concern remained about what HA would do if pve1 actually went down
Actions Taken¶
- Reviewed the expected behavior of Proxmox migration when VM disks are stored on shared Ceph storage.
- Distinguished between:
- shared disk access through Ceph
- RAM transfer required during live migration
- Examined the migration log line containing:
text 16.0 GiB VM-state - Interpreted the transfer rate values (
~110–140 MiB/s) as memory-state transfer consistent with a likely1 Gbpsmigration path. - Identified the failure message:
text ERROR: online migrate failure - interrupted by signal - Determined that this specific migration failure did not indicate a Ceph or storage problem because it was manually cancelled.
- Clarified the operator’s actual question: not live migration behavior, but HA failover behavior when a node goes down.
- Explained how Proxmox HA behaves with Ceph-backed shared storage after node failure:
- the VM is restarted on another eligible node
- RAM state is not preserved
- the VM boots from the existing shared Ceph disk
Key Findings¶
- In Proxmox live migration, the value shown as
VM-staterefers to RAM contents, not VM disk size. - A line such as:
text transferred 3.1 GiB of 16.0 GiB VM-stateindicates transfer of memory pages from source to destination during live migration. - With Ceph-backed shared storage, the VM disk does not need to be copied from
pve1topve2during migration. - Ceph reduces migration overhead by making VM storage available to both nodes, but it does not eliminate RAM transfer during live migration.
- Live migration is therefore fast but not instantaneous.
- If the source node fails unexpectedly, live migration cannot complete because the running in-memory VM state is lost with the source node.
- In an HA scenario, Proxmox does not perform a seamless migration after node failure. Instead, it:
- detects node loss
- selects another eligible node
- starts the VM there using the same Ceph-backed disks
- HA failover with Ceph is effectively a reboot on another node, not a continuation of the exact prior in-memory execution state.
- The observed migration error in this session was not an infrastructure failure; it was caused by a manual cancellation.
Resolution¶
The issue was resolved conceptually rather than through a configuration change.
Final understanding:
- the 16 GiB shown in the migration log referred to VM RAM
- Ceph-backed shared storage avoids a full disk transfer during migration
- HA after node failure will restart the VM on another node instead of preserving the running state
- planned live migration and HA failover are separate behaviors and should not be interpreted the same way
Current status: - no storage migration issue was identified - no Ceph fault was indicated by the provided log - the operator now has a correct model for how Proxmox migration and HA behave with Ceph-backed VM disks
Validation¶
Success was confirmed by matching the migration log behavior to Proxmox live migration mechanics:
16.0 GiB VM-statematched the VM’s memory allocation, not disk- transfer rates aligned with expected network-based RAM copy behavior
- the failure reason matched a manual abort rather than storage or cluster failure
- the HA explanation was reconciled against the operator’s intended question:
- no live handoff after node death
- restart from shared Ceph storage on another eligible node
A useful validation command noted during the discussion was:
qm config 110 | grep memory
Purpose: confirm the VM’s configured memory size and compare it to the VM-state shown in migration logs.
Follow-Up Tasks¶
- Confirm that VM
110is on Ceph RBD shared storage rather than local-only storage. - Review HA group membership and failover target eligibility for critical VMs.
- Test planned HA behavior with a controlled maintenance scenario rather than waiting for an actual node failure.
- Verify corosync quorum health and node fencing behavior so HA restart timing is understood ahead of an outage.
- Document which VMs are expected to tolerate reboot-style failover versus those requiring application-level clustering.
- If faster live migration is desired, evaluate migration network bandwidth between Proxmox nodes.
Lessons Learned¶
- In Proxmox migration logs,
VM-statemeans memory state, not disk. - Ceph shared storage removes the need to copy VM disks during migration, but not RAM during live migration.
- HA failover after node loss is a restart, not a transparent continuation of execution.
- A migration log must be interpreted carefully before assuming a storage bottleneck.
- Manual cancellation can resemble migration failure in logs; operator actions should be ruled out before deeper troubleshooting.
Command Reference¶
Command¶
qm config 110 | grep memory
What it does:
Displays the Proxmox VM configuration for VM 110 and filters for the memory entry.
Important arguments:
- qm config 110 — shows the configuration of VM 110
- grep memory — limits output to lines containing memory
Why it was used at that moment:
To verify whether the 16.0 GiB VM-state shown in the migration log matched the VM’s configured RAM.
Expected result:
A line similar to:
memory: 16384
What success or failure would indicate:
- If it shows 16384, that confirms the migration log’s 16.0 GiB VM-state refers to RAM.
- If it shows a different value, either the VM memory changed later or the log refers to a different runtime state.
Platform relevance:
For Proxmox, this is a direct way to correlate migration behavior with VM resource configuration.
Command¶
journalctl -r | head -n 40
What it does:
Shows the most recent journal entries in reverse chronological order, limited to the newest 40 lines.
Important arguments:
- -r — reverse order, newest first
- head -n 40 — show the first 40 lines of that output
Why it was suggested:
To quickly inspect recent system events around a migration failure or interruption.
Expected result:
Recent logs from Proxmox services, SSH, kernel messages, or system daemons that may explain an aborted migration.
What success or failure would indicate:
- Useful recent log entries can help identify whether a migration failure was due to service restart, SSH interruption, or other host-side events.
- If nothing relevant appears, more targeted journal queries are needed.
Risk level:
Low risk. Read-only.
Safer alternative:
A time-bounded journalctl query is usually cleaner for incident review.
Command¶
journalctl -S "[date removed]" -U "[date removed]"
What it does:
Queries system logs only for the specified time window.
Important arguments:
- -S — start time
- -U — end time
Why it was suggested:
To inspect logs specifically around the migration interruption time.
Expected result:
A filtered set of entries from the exact incident window.
What success or failure would indicate:
- Relevant service or network logs during the event window can identify the cause of interruption.
- If nothing appears, the event may not have been logged by the system journal or may require service-specific filtering.
Platform relevance:
Useful during Proxmox troubleshooting to isolate cluster, SSH, or daemon activity around a specific failure.
Risk level:
Low risk. Read-only.
Command¶
journalctl -u pvedaemon -u pveproxy -u ssh -S "[date removed]" -U "[date removed]"
What it does:
Shows logs for specific services in a defined time window.
Important arguments:
- -u pvedaemon — Proxmox management daemon
- -u pveproxy — Proxmox web/API proxy
- -u ssh — SSH service
- -S / -U — time window bounds
Why it was suggested:
To check whether the migration interruption came from Proxmox service behavior or an SSH tunnel problem.
Expected result:
Service logs that may correlate with migration initiation, interruption, cancellation, or transport problems.
What success or failure would indicate:
- Relevant service messages may confirm cancellation, daemon interruption, or transport loss.
- Lack of errors may suggest a manual abort or an event outside those services.
Platform relevance:
In Proxmox, live migration depends heavily on management daemons and transport setup, so these logs are high-value for incident review.
Risk level:
Low risk. Read-only.
Command¶
pvecm status
What it does:
Displays Proxmox cluster status, including quorum state and cluster membership.
Why it was suggested:
To verify that the cluster was healthy if there were concerns about why migration or HA behavior might fail.
Expected result:
Healthy quorum, visible nodes, and no obvious cluster membership issues.
What success or failure would indicate:
- Healthy quorum supports normal HA and migration decisions.
- Loss of quorum or cluster instability can prevent HA actions or interfere with node coordination.
Platform relevance:
This is one of the core cluster health commands in Proxmox.
Risk level:
Low risk. Read-only.
Command¶
ping -c 5 pve2
What it does:
Sends five ICMP echo requests to pve2 to confirm basic network reachability.
Important arguments:
- -c 5 — send 5 packets, then stop
Why it was suggested:
To verify whether the destination node was reachable during migration troubleshooting.
Expected result:
Five successful replies with stable latency.
What success or failure would indicate:
- Successful replies indicate basic Layer 3 connectivity.
- Packet loss or failure indicates a possible network or host availability problem.
Platform relevance:
Migration and HA both depend on reliable node-to-node communication, even when storage is shared through Ceph.
Risk level:
Low risk. Read-only.
Command¶
Likely command used: qm migrate 110 pve2 --online
What it does:
Initiates a live migration of VM 110 from its current node to pve2.
Important arguments:
- qm migrate — Proxmox VM migration command
- 110 — VM ID
- pve2 — destination node
- --online — perform live migration while the VM remains running
Why it was likely used:
The log clearly shows an online/live migration of VM 110 to pve2. Even if initiated from the GUI, this is the logical CLI equivalent.
Expected result:
Proxmox starts the VM on the destination node, transfers the VM memory state, then briefly pauses and completes cutover.
What success or failure would indicate:
- Success indicates healthy cluster communications, compatible VM configuration, and accessible shared storage.
- Failure can indicate transport interruption, node issues, manual cancellation, incompatible CPU/state conditions, or other migration blockers.
Risk level:
Moderate. This affects a running VM.
Safer alternative:
Use the Proxmox GUI for clearer visibility, or perform an offline migration during maintenance if seamless runtime continuity is not required.
Platform relevance:
For Ceph-backed Proxmox VMs, this command usually transfers RAM state rather than copying disk storage.
Command¶
Likely action used: Proxmox GUI migration cancel / abort
What it does:
Cancels an in-progress migration task from the Proxmox interface.
Why it was likely used:
The operator later confirmed the migration was manually cancelled.
Expected result:
The migration task stops, cleanup occurs, and the task log may show messages such as:
ERROR: online migrate failure - interrupted by signal
migrate_cancel
What success or failure would indicate:
- If cancellation succeeds cleanly, the VM should remain in a stable state on the source node or be cleaned up appropriately depending on migration phase.
- If cancellation occurs at an unsafe point or coincides with other issues, follow-up validation of VM placement and status is required.
Risk level:
Moderate. Cancelling migration interrupts a running administrative task and should be done deliberately.
Safer alternative:
Prefer allowing a healthy migration to finish unless there is a clear reason to stop it.
Command¶
Likely command used: ha-manager status
What it does:
Displays the state of Proxmox HA resources and HA-managed services.
Why it is relevant here:
Although not explicitly run in the chat, it is directly relevant to validating HA behavior after clarifying that the real question was about failover rather than live migration.
Expected result:
A list of HA-managed resources, their states, and current node placement.
What success or failure would indicate:
- If the VM appears as HA-managed and healthy, it is eligible for automated failover according to policy.
- If it is absent from HA management, the VM will not be automatically restarted elsewhere after node failure.
Platform relevance:
This is a key Proxmox HA operational check.
Risk level:
Low risk. Read-only.