Initial Investigation of Proxmox Host and VM Instability Under Container Load¶
Summary¶
Troubleshooting began for a Proxmox node that became unresponsive while running a Debian Docker VM with many containers active. The first goal was to determine whether the issue was caused by kernel faults, guest instability, host overcommit, memory pressure, or storage/network dependencies.
Environment¶
- Proxmox VE host: pve1
- Kernel observed in logs: 6.8.12-13-pve
- VM: debian-docker
- Guest OS: Debian
- Workload: Docker / Docker Compose containers
- Host network:
  - physical NIC: eno1
  - bridge: vmbr0
- Ceph present in the environment
- Hardware platform later identified as Intel NUC8i7BEH
- CPU later identified as Intel Core i7-8559U
Problem¶
The Proxmox host and the Docker VM became unresponsive after some runtime with most or all containers active.
Symptoms¶
- Host and VM both became unresponsive.
- Previous boots ended uncleanly.
- No immediate panic, MCE, or OOM signature was obvious at the start of troubleshooting.
- The issue did not match prior bare-metal behavior, where the same or similar container workload had been stable.
Actions Taken¶
- Listed previous boots and inspected the prior boot’s kernel logs.
- Filtered logs for warnings, errors, panic signatures, watchdog events, and hardware fault indicators.
- Checked host memory, swap, CPU, and IO at a healthy moment.
- Checked live kernel logs inside the Debian Docker VM.
- Compared the VM-based deployment to the previously stable bare-metal Docker deployment.
Important commands used:
journalctl --list-boots
Purpose: identify which prior boot should be inspected.
journalctl -k -b -1
Purpose: inspect kernel messages from the previous boot.
journalctl -k -b -1 -p warning..alert
Purpose: reduce noise and focus on prior-boot kernel warnings and errors.
journalctl -k -b -1 | egrep -i 'panic|BUG:|Oops|Call Trace|hardware error|MCE|watchdog|soft lockup|hard lockup|NMI|reset|blocked for more than'
Purpose: search for classic crash and hardware-fault signatures.
free -h
swapon --show
vmstat 1 5
Purpose: check host RAM pressure, swap, and basic CPU/IO health.
sudo journalctl -kf
Purpose: watch live kernel messages inside the Docker VM.
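The signature search above prints raw matching lines, so a clean log and a broken pipeline both look like "no output". The following is a minimal sketch, assuming the same signature list as the search; the function name tally_crash_signatures is hypothetical and was not part of the original session. It reads log text on stdin and prints an explicit count per signature, including zeros.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints one
# "count<TAB>signature" line per signature, including signatures with zero hits.
tally_crash_signatures() {
  log="$(cat)"
  for sig in 'panic' 'BUG:' 'Oops' 'Call Trace' 'hardware error' 'MCE' \
             'watchdog' 'soft lockup' 'hard lockup' 'NMI' 'reset' \
             'blocked for more than'; do
    # grep -c counts matching lines; -i ignores case; -F treats the
    # signature as a fixed string. "|| true" keeps a zero-hit result
    # from being treated as a failure.
    count="$(printf '%s\n' "$log" | grep -c -i -F -- "$sig" || true)"
    printf '%s\t%s\n' "$count" "$sig"
  done
}
```

On the host this could be fed from the previous boot's kernel log, for example: journalctl -k -b -1 | tally_crash_signatures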
Key Findings¶
- Prior-boot warnings initially appeared mostly non-fatal:
- SGX disabled in BIOS
- CPU vulnerability warnings
- ACPI thermal firmware warning
- ZFS taint messages
- journald corruption / unclean shutdown notice
- No early evidence of:
- kernel panic
- MCE / ECC-style hardware fault
- watchdog soft lockup
- guest OOM killer
- Live VM logs showed normal Docker bridge and veth lifecycle messages rather than guest kernel crashes.
- At this stage, the issue appeared more likely to be host-side or dependency-related than a Docker guest kernel fault.
Resolution¶
No final resolution was reached in this phase. Troubleshooting continued into memory, thermals, and hardware-specific causes.
Validation¶
- Prior boot logs were successfully collected and reviewed.
- VM live logs did not show guest kernel panic behavior.
- Early host checks did not reveal a simple OOM or swap-driven failure.
Follow-Up Tasks¶
- Capture host telemetry closer to the actual failure window.
- Investigate thermal conditions under sustained load.
- Inspect previous-boot logs near the end of failed runtime.
- Continue checking for hardware- or driver-specific faults.
Lessons Learned¶
- Unclean shutdown messages alone do not identify root cause.
- Docker bridge and veth events inside the guest are normal and should not be mistaken for guest kernel failure.
- Early triage should clearly separate:
- guest failure
- host failure
- storage/network dependency failure
Host Resource Review and Thermal Testing on pve1¶
Summary¶
The next work session focused on determining whether the node was simply overcommitted or memory-starved. Host telemetry showed healthy RAM and swap usage, which shifted attention away from basic overcommit and toward thermals or platform-specific faults.
Environment¶
- Proxmox host: pve1
- Hardware: Intel NUC8i7BEH
- CPU: Intel Core i7-8559U
- Host RAM: approximately 32 GiB
- Swap: 8 GiB configured
- VM: debian-docker
- VM sizing discussed:
  - 12 GiB RAM
  - 4 vCPUs
  - CPU type x86-64-v3
  - ballooning disabled
Problem¶
The node still became unresponsive after running for a while, even though the container workload was not believed to be especially resource-intensive.
Symptoms¶
- Host and VM became unresponsive after some runtime.
- Behavior was intermittent rather than a constant “load too high immediately” pattern.
- Prior theory that the node was simply RAM-starved became doubtful.
Actions Taken¶
- Checked host memory and swap usage while the node was healthy.
- Reviewed vmstat output for swap activity and IO wait.
- Installed lm-sensors.
- Ran sensors-detect.
- Checked baseline temperatures.
- Monitored temperatures during heavier sustained load.
Important commands used:
free -h
swapon --show
vmstat 1 5
Purpose: verify whether the host was exhausting RAM or swapping.
apt-get install -y lm-sensors
Purpose: install host sensor tooling.
sensors-detect
Purpose: identify supported sensor drivers.
sensors
Purpose: read CPU, chipset, and NVMe temperatures.
watch -n2 sensors
Purpose: monitor temperatures continuously during load.
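Watching the sensors output by eye makes short peaks easy to miss. The sketch below is one possible way to flag hot readings. It assumes coretemp-style lines labelled "Package id N:" and "Core N:" in the sensors output; the function name hot_readings and the default 90°C threshold are illustrative assumptions, not values from the original session.

```shell
# Hypothetical helper: reads `sensors` output on stdin and prints
# "label<TAB>temperature" for each CPU reading at or above the threshold.
hot_readings() {
  threshold="${1:-90}"   # assumed default; pick a value below the CPU's limit
  awk -F: -v limit="$threshold" '
    /^(Package id|Core) [0-9]+:/ {
      # $2 holds the reading followed by the high/crit limits;
      # the first number in it is the current temperature.
      if (match($2, /[+-]?[0-9]+(\.[0-9]+)?/)) {
        temp = substr($2, RSTART, RLENGTH) + 0
        if (temp >= limit) printf "%s\t%.1f\n", $1, temp
      }
    }'
}
```

A possible use is a timestamped loop during a load test, for example: sensors | hot_readings 95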
Key Findings¶
- Host memory state was healthy when sampled:
  - about 31 GiB total
  - about 11 GiB used
  - about 19 GiB free
  - swap unused
- vmstat showed:
  - no swap-in / swap-out
  - low IO wait
  - plenty of idle CPU at the sampled moment
- This ruled out simple host RAM exhaustion as the immediate cause.
- lm-sensors found the coretemp driver and provided usable CPU telemetry.
- Initial temperatures were normal.
- Under sustained load, temperatures later spiked dramatically:
  - CPU package reached 99°C
  - at least one core also reached 99°C
- This confirmed that thermal stress was a real issue at least part of the time.
Resolution¶
A real thermal problem was identified, but it was not yet proven to be the only root cause.
Validation¶
- Host RAM and swap telemetry disproved the simple “memory starvation” theory.
- Sensor telemetry showed temperatures in thermal-throttle territory under load.
Follow-Up Tasks¶
- Clean the NUC cooling path.
- Check fan profile and BIOS cooling settings.
- Consider re-pasting if necessary.
- Continue collecting post-crash logs because thermals did not fully explain all failures.
Lessons Learned¶
- Verify resource-pressure assumptions with actual telemetry before changing VM sizing.
- Thermal issues can be real without explaining every outage.
- Sample both healthy state and sustained-load state before narrowing the failure domain.
Thermal Instability Confirmed, Then Ruled Out as the Only Failure Mode¶
Summary¶
Thermal testing confirmed the NUC could reach near-critical CPU temperatures under load. However, a later outage occurred while temperatures were normal, which proved a second failure mechanism existed.
Environment¶
- Proxmox host: pve1
- Hardware: Intel NUC8i7BEH
- CPU: Intel Core i7-8559U
- Sensor source: lm-sensors
Problem¶
Even after overheating was confirmed, the node later went down while temperatures were only in the 62–68°C range.
Symptoms¶
- Under one sustained-load test:
- package temperature hit 99°C
- one core hit 99°C
- In a later failure:
- package temperature was around 66°C
- cores were around 62–68°C
- The node still went down.
Actions Taken¶
- Observed temperatures continuously during runtime.
- Compared a high-thermal event to a later outage with normal temperatures.
- Concluded that thermals were contributing but not the only issue.
Important command used:
watch -n2 sensors
Purpose: compare thermal behavior across different failure windows.
Key Findings¶
- The node definitely reached throttle territory in one run.
- Another outage occurred at safe temperatures, so overheating was not the only explanation.
- Troubleshooting needed to pivot back to log-based fault analysis.
Resolution¶
Thermals were kept as a real but partial issue. Post-crash log analysis became the next priority.
Validation¶
- Two separate sensor observations showed:
- one clearly thermal event
- one non-thermal outage
- That split prevented a false conclusion that “temperature alone” was the whole problem.
Follow-Up Tasks¶
- Keep thermal monitoring in place.
- Continue reviewing previous-boot logs after each crash.
- Improve cooling anyway, even if a second issue also exists.
Lessons Learned¶
- Multiple independent fault domains can coexist on the same node.
- Do not stop at the first confirmed problem if later evidence contradicts a single-cause explanation.
Previous-Boot Log Analysis Identified Intel e1000e NIC Hardware Hangs¶
Summary¶
Targeted inspection of the previous boot’s final log lines revealed repeated Intel e1000e hardware hangs on eno1, followed immediately by Ceph socket closures. This established a strong link between node “death” and host network failure.
Environment¶
- Proxmox host: pve1
- Hardware: Intel NUC8i7BEH
- NIC:
  - interface: eno1
  - driver: e1000e
- Bridge: vmbr0
- Ceph environment present:
  - monitor connectivity affected
  - MDS connectivity affected
- VM: debian-docker
- VM and cluster behavior dependent on host network stability
Problem¶
The node still failed when temperatures were normal. The goal became distinguishing between:
- power loss
- kernel panic
- reboot/reset
- NIC failure
- Ceph/storage dependency failure
Symptoms¶
Previous-boot logs ended with repeated messages such as:
- e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang
- libceph: mon3 ... socket closed
- libceph: mds0 ... socket closed
The node appeared offline externally and required a reboot.
Actions Taken¶
- Pulled the final lines from the previous boot’s kernel log.
- Pulled previous-boot error and alert messages.
- Searched the previous boot for panic, watchdog, MCE, reset, and reboot indicators.
- Checked last -x to confirm reboot timing and uptime windows.
- Interpreted the end-of-boot sequence.
Important commands used:
journalctl -k -b -1 | tail -n 120
Purpose: inspect final kernel messages before reboot.
journalctl -b -1 -p err..alert | tail -n 120
Purpose: inspect severe prior-boot errors.
journalctl -b -1 | egrep -i 'panic|BUG:|Call Trace|watchdog|soft lockup|hard lockup|mce|hardware error|fatal|reset|reboot' | tail -n 80
Purpose: check for classic crash signatures.
last -x | head
Purpose: confirm reboot sequence and uptime windows.
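The ordering argument (NIC hang first, Ceph socket closures afterwards) can be checked mechanically rather than by reading the tail. This is a rough sketch, assuming the message fragments quoted in the symptoms ("Detected Hardware Unit Hang" from e1000e, "socket closed" from libceph); the function name summarize_nic_hangs and the output field names are hypothetical.

```shell
# Hypothetical helper: reads kernel log text on stdin and reports how many
# e1000e hang messages were seen, and how many libceph socket closures
# appeared before versus after the first hang.
summarize_nic_hangs() {
  awk '
    /e1000e/ && /Detected Hardware Unit Hang/ {
      hangs++
      if (!first) first = NR      # remember the line of the first hang
    }
    /libceph/ && /socket closed/ {
      if (first) after++; else before++
    }
    END {
      printf "hangs=%d ceph_closed_before=%d ceph_closed_after=%d\n",
             hangs, before, after
    }'
}
```

For example: journalctl -k -b -1 | summarize_nic_hangs. A result with hangs above zero and closures only in the "after" count would be consistent with the network-first reading, though it does not prove causation on its own.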
Key Findings¶
- The previous boot ended with repeated e1000e NIC hangs on eno1.
- Ceph monitor and MDS socket closures followed immediately.
- No matching panic, MCE, or watchdog signature was found in the same time window.
- This strongly indicated:
- host networking failed first
- Ceph connectivity failed as a downstream effect
- the node appeared “dead” because network access and Ceph-backed or Ceph-dependent services collapsed
Resolution¶
The primary failure mode captured in logs was identified as Intel e1000e NIC hangs on the onboard NUC NIC.
Validation¶
- Previous-boot logs clearly captured repeated Detected Hardware Unit Hang messages.
- Ceph socket closures immediately followed.
- The absence of panic/MCE signatures made the NIC/driver path the strongest explanation.
Follow-Up Tasks¶
- Apply an offload workaround on the affected node.
- Standardize the workaround on similar NUC8i7BEH nodes.
- Continue checking for recurrence of e1000e hang messages.
- Consider BIOS updates and alternate NIC options if needed.
Lessons Learned¶
- On clustered Proxmox/Ceph nodes, a NIC failure can look like total host death.
- Ceph errors can be downstream effects of NIC instability rather than the root cause.
- Previous-boot tail inspection is often more useful than broad log sweeps once the likely failure window is known.
Applied e1000e Offload Workaround on pve1¶
Summary¶
A mitigation was applied to disable problematic offload features on the Intel NUC onboard NIC and on the Linux bridge. The change was tested live and then persisted in the network interfaces configuration.
Environment¶
- Proxmox host: pve1
- Hardware: Intel NUC8i7BEH
- Physical NIC: eno1
- Linux bridge: vmbr0
- Driver: e1000e
- Network config file: /etc/network/interfaces
Problem¶
The Intel onboard NIC on pve1 was hanging under load and causing host network loss and downstream Ceph disconnects.
Symptoms¶
- Repeated: e1000e ... Detected Hardware Unit Hang
- Followed by: libceph ... socket closed
- Host and VM appeared down or frozen from the network.
Actions Taken¶
- Installed ethtool.
- Disabled TSO/GSO/GRO offloads live on eno1.
- Disabled TSO/GSO/GRO offloads live on vmbr0.
- Verified offload state using ethtool -k.
- Edited /etc/network/interfaces to add persistent post-up commands.
- Reloaded networking with ifreload -a.
- Re-verified offload settings after reload.
Important commands used:
apt-get install -y ethtool
Purpose: install NIC feature tuning tool.
ethtool -K eno1 tso off gso off gro off
Purpose: disable problematic offloads on the physical NIC.
ethtool -K vmbr0 tso off gso off gro off
Purpose: apply the same mitigation at the bridge layer where supported.
ethtool -k eno1 | egrep 'tso|gso|gro'
Purpose: verify the physical NIC’s offload state.
ethtool -k vmbr0 | egrep 'tso|gso|gro'
Purpose: inspect bridge feature state after the change.
nano /etc/network/interfaces
Purpose: persist the workaround across reboot.
ifreload -a
Purpose: apply the network config without a full reboot.
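One caveat on the verification step: ethtool -k normally lists the three top-level features under their long names (tcp-segmentation-offload, generic-segmentation-offload, generic-receive-offload), so a filter on the short names tso, gso, and gro may only match sub-feature lines such as tx-gso-partial. The sketch below checks the long names directly. The function name offloads_disabled is hypothetical, and the exact feature list printed can vary by driver and ethtool version.

```shell
# Hypothetical helper: reads `ethtool -k IFACE` output on stdin and succeeds
# (exit 0) only if all three top-level offloads are present and reported off.
offloads_disabled() {
  awk -F': *' '
    $1 == "tcp-segmentation-offload" ||
    $1 == "generic-segmentation-offload" ||
    $1 == "generic-receive-offload" {
      seen++
      if ($2 !~ /^off/) bad++     # "off" and "off [fixed]" both count as off
    }
    END { exit !(seen == 3 && bad == 0) }'
}
```

For example: ethtool -k eno1 | offloads_disabled && echo "eno1: TSO/GSO/GRO are off"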
Persisted configuration:
auto lo
iface lo inet loopback
auto eno1
iface eno1 inet manual
    post-up /sbin/ethtool -K eno1 tso off gso off gro off
auto vmbr0
iface vmbr0 inet static
    address 192.168.16.12/24
    gateway 192.168.16.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    post-up /sbin/ethtool -K vmbr0 tso off gso off gro off
iface wlp0s20f3 inet manual
source /etc/network/interfaces.d/*
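When the same workaround is rolled out to other nodes, it helps to confirm that the post-up hook really sits inside the intended iface stanza. A minimal sketch, assuming the ifupdown-style layout shown above; the function name has_offload_hook is hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) if FILE contains, inside the
# "iface IFACE ..." stanza, a post-up ethtool -K line that turns off
# tso, gso, and gro.
has_offload_hook() {
  file="$1"
  iface="$2"
  awk -v want="$iface" '
    $1 == "iface" { current = $2 }          # track the stanza we are in
    current == want && $1 == "post-up" &&
      /ethtool -K/ && /tso off/ && /gso off/ && /gro off/ { found = 1 }
    END { exit !found }' "$file"
}
```

For example: has_offload_hook /etc/network/interfaces eno1 || echo "eno1 is missing the offload hook"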
Key Findings¶
- eno1 successfully showed the relevant offload features disabled.
- vmbr0 showed mixed bridge-specific behavior, which is normal; the key mitigation is on the physical NIC.
- The mitigation fits a known Intel NUC / e1000e problem pattern under Linux and bursty network load.
Resolution¶
The e1000e offload workaround was applied successfully on pve1 and made persistent.
Validation¶
- Live ethtool changes succeeded.
- Verification output showed the relevant offload features disabled on eno1.
- After about a day, the node appeared stable with most containers running.
Follow-Up Tasks¶
- Apply the same workaround to other Intel NUC8i7BEH nodes using the same NIC/driver path.
- Keep checking for:
  - e1000e hardware hangs
  - Ceph socket closures
- Continue monitoring temperatures because thermal spikes were also confirmed earlier.
- Consider BIOS updates and cooling cleanup on all NUC nodes.
Lessons Learned¶
- Intel NUC onboard e1000e NIC instability can destabilize an entire Proxmox/Ceph node.
- Disabling TSO/GSO/GRO is a practical and low-risk mitigation.
- Persisting the fix in /etc/network/interfaces is better than relying on manual reapplication after reboot.
Follow-Up Operational Notes After Stability Improved¶
Summary¶
After roughly a day of runtime, the node appeared stable with most containers running. The conversation then shifted into follow-up operational guidance for other NUC nodes, VM disk option tuning, backup option clarification, and Proxmox update workflow.
Environment¶
- Proxmox host: pve1
- Other nodes: Intel NUC8i7BEH
- VM: debian-docker
- Proxmox storage and VM disk options discussed:
  - IOThreads
  - discard
  - cloud-init disk
  - per-disk backup checkbox
Problem¶
With immediate instability reduced, the next goal was to standardize the workaround and document safe operational behavior.
Symptoms¶
- Node appeared stable after the NIC workaround.
- A Proxmox warning occurred when enabling IOThread: WARN: iothread is only valid with virtio disk or virtio-scsi-single controller, ignoring
- Clarification was needed on:
  - whether to apply the NIC fix to other NUCs
  - whether IOThreads/discard were appropriate
  - where backup storage is actually consumed
  - how to update Proxmox cleanly
Actions Taken¶
- Determined that the NIC workaround should likely be repeated on other NUC8i7BEH nodes using the same onboard NIC and driver.
- Reviewed whether iothread and discard were appropriate for the VM disk.
- Explained that the per-disk Proxmox “Backup” checkbox only controls inclusion in backups and does not allocate separate disk storage on its own.
- Interpreted the IOThread warning as a controller/disk compatibility issue.
- Documented concise Proxmox update commands and Proxmox wrapper behavior.
Important commands used or discussed:
pveversion
Purpose: check current Proxmox version before updating.
apt update
apt full-upgrade -y
reboot
Purpose: standard Proxmox update flow.
pveupdate
pveupgrade
reboot
Purpose: wrapper-based Proxmox update flow.
Key Findings¶
- The Intel NUC8i7BEH platform likely shares the same e1000e risk on other nodes, so the workaround should be standardized there as well.
- iothread is only valid for:
  - virtio disks
  - disks on a virtio-scsi-single controller
- Enabling iothread on unsupported disks such as cloud-init or unsupported controller types produces a warning and is ignored.
- discard=on is generally reasonable for thin-provisioned or Ceph-backed virtual disks when trim/unmap is supported.
- The Proxmox disk “Backup” option only controls whether a disk is included in a VM backup job.
- pveupdate and pveupgrade are Proxmox convenience wrappers around the normal apt flow.
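The iothread rule above can be checked against a VM's configuration before the option is enabled. The following is a sketch only: it assumes the "key: value" layout printed by qm config, treats a missing scsihw line as a non-virtio-scsi-single controller, and uses a hypothetical function name, iothread_ignored_disks.

```shell
# Hypothetical helper: reads `qm config VMID` output on stdin and prints the
# disk keys that set iothread=1 but are neither virtio disks nor SCSI disks
# on a virtio-scsi-single controller (where the option would be ignored).
iothread_ignored_disks() {
  awk -F': ' '
    $1 == "scsihw" { scsihw = $2 }
    $1 ~ /^(ide|sata|scsi|virtio)[0-9]+$/ && ("," $2 ",") ~ /,iothread=1,/ {
      disks[$1] = 1
    }
    END {
      # Evaluate at the end because scsihw may be listed after the disks.
      for (d in disks) {
        if (d ~ /^virtio/) continue
        if (d ~ /^scsi/ && scsihw == "virtio-scsi-single") continue
        print d
      }
    }' | sort
}
```

For example, with a placeholder VM ID: qm config 100 | iothread_ignored_disks. Empty output would suggest every iothread setting is on a supported disk/controller combination.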
Resolution¶
Current status:
- pve1 appeared stable after the NIC offload workaround.
- Guidance was captured for:
- repeating the NIC fix on similar NUC nodes
- using IOThread only on supported disk/controller combinations
- enabling discard where appropriate
- understanding disk-backup inclusion behavior
- updating Proxmox using either standard apt or Proxmox wrappers
Validation¶
- Roughly one day of improved stability was observed with most containers running.
- No new failure evidence was presented during this follow-up checkpoint.
Follow-Up Tasks¶
- Apply the NIC workaround across other NUC8i7BEH nodes.
- Review BIOS versions and cooling health on all NUC nodes.
- Verify VM controller types before enabling IOThreads.
- Continue observing pve1 before fully closing the incident.
Lessons Learned¶
- Once a platform-specific fault pattern is confirmed, standardizing the workaround across identical nodes is usually worthwhile.
- Not all Proxmox disk options apply to all controller types.
- Concise operational notes are useful once an incident moves from active troubleshooting to maintenance.
Command Reference¶
Command¶
journalctl --list-boots
What it does¶
Lists known boot sessions from the systemd journal, each with a relative boot index such as 0, -1, or -2.
Important flags or arguments¶
- none in this invocation
Why it was used at that moment¶
To identify which earlier boot corresponded to a failure window before inspecting previous-boot logs.
Expected result¶
A list of boots with IDs and time ranges.
What success or failure would indicate¶
- Success: prior boots are available for review.
- Failure: journald may not be persistent or older logs may be unavailable.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
journalctl -k -b -1
What it does¶
Shows kernel messages from the previous boot.
Important flags or arguments¶
- -k: kernel messages only
- -b -1: previous boot
Why it was used at that moment¶
To inspect host-side kernel and driver behavior leading up to the last crash or reboot.
Expected result¶
The previous boot’s kernel log.
What success or failure would indicate¶
- Success: prior-boot kernel history is available.
- Failure: logs from the prior boot are missing.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
journalctl -k -b -1 -p warning..alert
What it does¶
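Filters the previous boot’s kernel log to messages with severity from warning through alert. Because this is an explicit range, emerg (priority 0) messages are excluded; -p warning alone would include them.
Important flags or arguments¶
- -p warning..alert: severity range from warning through alert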
Why it was used at that moment¶
To reduce noise and focus only on significant prior-boot kernel warnings and errors.
Expected result¶
A smaller, higher-signal set of kernel messages.
What success or failure would indicate¶
- Success: serious kernel messages are easier to inspect.
- No output: there may have been no warning-level kernel events captured.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
journalctl -k -b -1 | egrep -i 'panic|BUG:|Oops|Call Trace|hardware error|MCE|watchdog|soft lockup|hard lockup|NMI|reset|blocked for more than'
What it does¶
Searches the previous boot’s kernel log for common panic, fault, and hardware-error signatures.
Important flags or arguments¶
- egrep -i: extended regex, case-insensitive
- Search terms include common kernel crash indicators
Why it was used at that moment¶
To quickly identify whether the failure resembled a classic panic, watchdog event, or MCE-style hardware fault.
Expected result¶
Any matching fault signatures.
What success or failure would indicate¶
- Matches found: there may be direct crash or hardware clues.
- No matches: the failure may be outside classic kernel panic patterns.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
free -h
What it does¶
Displays memory usage in human-readable units.
Important flags or arguments¶
- -h: human-readable output
Why it was used at that moment¶
To check whether host RAM exhaustion was contributing to instability.
Expected result¶
A summary showing total, used, free, shared, cache, and available memory.
What success or failure would indicate¶
- High available memory: RAM starvation is less likely.
- Very low available memory plus swap activity: memory pressure is more likely.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
swapon --show
What it does¶
Lists configured swap devices and current swap usage.
Important flags or arguments¶
- --show: tabular display of active swap devices
Why it was used at that moment¶
To verify whether the host had started swapping.
Expected result¶
A list of swap devices and used size.
What success or failure would indicate¶
- 0B used: no current swap pressure.
- Nonzero usage: swap pressure exists or existed.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
vmstat 1 5
What it does¶
Prints virtual memory, process, IO, swap, and CPU statistics every second for five samples.
Important flags or arguments¶
- 1: sample interval in seconds
- 5: number of samples
Why it was used at that moment¶
To check whether the host was swapping, blocked on IO, or under visible CPU pressure.
Expected result¶
Five rows of live system telemetry.
What success or failure would indicate¶
- si/so above zero: active swap activity.
- High wa: IO wait / storage bottleneck.
- High run queue or low idle: CPU pressure.
Risk¶
Low.
Safer alternative¶
None needed.
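The si, so, and wa checks described above can be applied to captured vmstat output automatically. This is a minimal sketch: it locates the columns from the header row rather than assuming fixed positions, and the function name vmstat_flags and the default 20% IO-wait limit are illustrative assumptions.

```shell
# Hypothetical helper: reads `vmstat INTERVAL COUNT` output on stdin and
# reports how many samples showed swap activity or IO wait above a limit.
vmstat_flags() {
  awk -v wa_limit="${1:-20}" '
    # The column header row starts with "r b"; map each name to its index.
    $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Skip anything before the header and any non-numeric row.
    !("si" in col) || $1 !~ /^[0-9]+$/ { next }
    {
      rows++
      if ($(col["si"]) + 0 > 0 || $(col["so"]) + 0 > 0) swap++
      if ($(col["wa"]) + 0 > wa_limit + 0) iowait++
    }
    END {
      printf "samples=%d swapping=%d high_iowait=%d\n", rows, swap, iowait
    }'
}
```

For example: vmstat 1 5 | vmstat_flags 20. Note that the first vmstat row reports averages since boot, so it is worth reading separately from the later samples.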
Command¶
sudo journalctl -kf
What it does¶
Follows live kernel messages continuously.
Important flags or arguments¶
- -k: kernel messages only
- -f: follow new entries as they arrive
Why it was used at that moment¶
To observe live guest or host kernel behavior during runtime and while reproducing the issue.
Expected result¶
New kernel messages appear as they are logged.
What success or failure would indicate¶
- New fault messages during a problem window can reveal the root cause.
- Quiet output may simply mean the kernel is not logging anything unusual.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
apt-get install -y lm-sensors
What it does¶
Installs Linux hardware sensor utilities and dependencies.
Important flags or arguments¶
- -y: automatically answer yes to prompts
Why it was used at that moment¶
To gather thermal telemetry from the NUC host.
Expected result¶
The lm-sensors package and dependencies install successfully.
What success or failure would indicate¶
- Success: thermal readings can be collected.
- Failure: package or repository issue must be resolved first.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
sensors-detect
What it does¶
Probes the system for supported hardware monitoring chips and recommended drivers.
Important flags or arguments¶
- interactive prompts during hardware probing
Why it was used at that moment¶
To identify which driver modules were needed for temperature reporting.
Expected result¶
A detection summary showing supported sensors and recommended modules.
What success or failure would indicate¶
- Success: usable drivers are identified.
- Failure: the platform may expose only limited monitoring.
Risk¶
Low to moderate. Some bus probing is more intrusive than simply reading existing sensors.
Safer alternative¶
Run only sensors first if sensor modules are already loaded, but sensors-detect is standard when telemetry is missing.
Command¶
sensors
What it does¶
Displays current temperature and sensor readings.
Important flags or arguments¶
- none in this invocation
Why it was used at that moment¶
To inspect CPU, chipset, and NVMe temperatures while evaluating thermal behavior.
Expected result¶
Temperature readings per detected device.
What success or failure would indicate¶
- High temperatures near critical thresholds indicate thermal stress.
- Normal temperatures during failure windows suggest another fault domain exists.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
watch -n2 sensors
What it does¶
Repeats the sensors command every two seconds.
Important flags or arguments¶
- -n2: update every two seconds
Why it was used at that moment¶
To catch peak temperatures under sustained load rather than relying on a single snapshot.
Expected result¶
A continuously refreshed thermal display.
What success or failure would indicate¶
- CPU temperatures near 99–100°C indicate thermal throttle territory on this NUC.
- Later normal readings during a crash showed thermal issues were not the only problem.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
journalctl -k -b -1 | tail -n 120
What it does¶
Shows the last 120 lines of the previous boot’s kernel log.
Important flags or arguments¶
- tail -n 120: limit to the end of the log where the failure likely occurred
Why it was used at that moment¶
To inspect the final kernel events before the reboot.
Expected result¶
The previous boot’s last kernel messages.
What success or failure would indicate¶
- This command exposed the repeated e1000e NIC hang messages.
- If the log ends abruptly with no clue, hard power loss or deeper lockup remains possible.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
journalctl -b -1 -p err..alert | tail -n 120
What it does¶
Shows the last 120 messages with severity from err through alert from the previous boot. As an explicit range, this excludes emerg (priority 0) messages.
Important flags or arguments¶
- -b -1: previous boot
- -p err..alert: error through alert severity
- tail -n 120: limit output to the end of the failure window
Why it was used at that moment¶
To isolate high-severity service and kernel errors from the failing boot.
Expected result¶
A compact list of severe prior-boot messages.
What success or failure would indicate¶
- This helped confirm the NIC hang as the main severe event before reboot.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
journalctl -b -1 | egrep -i 'panic|BUG:|Call Trace|watchdog|soft lockup|hard lockup|mce|hardware error|fatal|reset|reboot' | tail -n 80
What it does¶
Searches the previous boot’s full journal for panic, watchdog, and hardware-fault signatures.
Important flags or arguments¶
- egrep -i: case-insensitive regex search
- tail -n 80: focus on the end of the result set
Why it was used at that moment¶
To distinguish a NIC or network failure from a classic kernel panic or hardware fault.
Expected result¶
Any matching critical fault lines.
What success or failure would indicate¶
- Few or no relevant matches support a non-panic failure path.
- Strong matches would shift attention back to kernel or hardware-fault analysis.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
last -x | head
What it does¶
Shows recent login, reboot, shutdown, and runlevel events.
Important flags or arguments¶
- -x: include system events such as reboot and runlevel changes
- head: limit to the most recent entries
Why it was used at that moment¶
To confirm reboot timing and how long the failed boot lasted.
Expected result¶
Recent reboot and runlevel history.
What success or failure would indicate¶
- Frequent reboot entries confirm repeated outages.
- Helps align journal timestamps with observed downtime.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
apt-get install -y ethtool
What it does¶
Installs the NIC inspection and tuning utility.
Important flags or arguments¶
- -y: automatically answer yes
Why it was used at that moment¶
To disable problematic offload features on the Intel onboard NIC.
Expected result¶
ethtool installs successfully.
What success or failure would indicate¶
- Success: NIC features can be queried and changed.
- Failure: NIC mitigation steps cannot be applied yet.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
ethtool -K eno1 tso off gso off gro off
What it does¶
Disables TCP segmentation offload, generic segmentation offload, and generic receive offload on the physical NIC.
Important flags or arguments¶
- -K: change offload settings
- tso off: disable TCP segmentation offload
- gso off: disable generic segmentation offload
- gro off: disable generic receive offload
Why it was used at that moment¶
To avoid the offload paths associated with the observed e1000e hardware hangs on the Intel NUC NIC.
Expected result¶
The command succeeds and the requested offloads are disabled.
What success or failure would indicate¶
- Success: the NIC is less likely to hit the buggy offload path.
- Failure: the feature is unsupported or the change was rejected.
Risk¶
Low. Performance may decrease slightly because packet processing shifts more into software.
Safer alternative¶
Disabling only one feature at a time is sometimes used for narrower testing, but disabling all three was the chosen mitigation here.
Command¶
ethtool -K vmbr0 tso off gso off gro off
What it does¶
Requests the same offload-related changes on the Linux bridge device.
Important flags or arguments¶
- same offload flags as above
Why it was used at that moment¶
To align bridge-layer behavior with the NIC workaround where supported.
Expected result¶
Some bridge features may change; others may remain fixed or partially supported.
What success or failure would indicate¶
- Mixed bridge output is normal.
- The critical mitigation remains the change on the physical NIC.
Risk¶
Low.
Safer alternative¶
The physical NIC change alone is the essential step.
Command¶
ethtool -k eno1 | egrep 'tso|gso|gro'
What it does¶
Displays the relevant offload settings for the physical NIC.
Important flags or arguments¶
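- -k: show NIC feature state
- egrep 'tso|gso|gro': filter for offload entries; ethtool -k usually prints the top-level features under long names (tcp-segmentation-offload, generic-segmentation-offload, generic-receive-offload), so this short pattern may match only sub-feature lines such as tx-gso-partial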
Why it was used at that moment¶
To verify that the workaround was actually in effect on eno1.
Expected result¶
Relevant offloads show off or off [fixed].
What success or failure would indicate¶
- Correct output confirms the physical NIC mitigation is active.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
ethtool -k vmbr0 | egrep 'tso|gso|gro'
What it does¶
Displays the relevant offload-related settings for the bridge.
Important flags or arguments¶
- -k: show feature state
- egrep 'tso|gso|gro': filter relevant entries
Why it was used at that moment¶
To inspect bridge-level behavior after applying the workaround.
Expected result¶
A mix of bridge-specific feature states.
What success or failure would indicate¶
- Use this as supplemental verification only; eno1 is the important interface.
Risk¶
Low.
Safer alternative¶
None needed.
Command¶
nano /etc/network/interfaces
What it does¶
Opens the Proxmox host network configuration file for editing.
Important flags or arguments¶
- none
Why it was used at that moment¶
To persist the ethtool workaround with post-up lines.
Expected result¶
The network configuration file opens in the editor.
What success or failure would indicate¶
- Success: the workaround can survive reboot.
- Failure: another editor or permission check is needed.
Risk¶
Moderate. Incorrect edits can break management networking on the host.
Safer alternative¶
The Proxmox GUI is safer for common network changes, but direct file edits are often necessary for custom post-up directives.
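Because a bad edit here can cut off management access, taking a copy of the file first makes recovery from the console easier. The following is a minimal sketch under stated assumptions: `backup_config` and `show_postup_ethtool` are hypothetical helper names, and the exact post-up line shown in the comment is an illustration rather than the line recorded in this session.

```shell
# Copies a file to a timestamped backup beside it and prints the backup path.
backup_config() {
    src=$1
    dst="${src}.bak.$(date +%Y%m%d-%H%M%S)"
    cp -p -- "$src" "$dst" && printf '%s\n' "$dst"
}

# Prints any post-up lines in the given file that call ethtool.
# No output means the workaround is not persisted in that file.
show_postup_ethtool() {
    grep -E '^[[:space:]]*post-up[[:space:]]+.*ethtool' "$1"
}

# Usage on the host:
#   backup_config /etc/network/interfaces
#   nano /etc/network/interfaces
#   show_postup_ethtool /etc/network/interfaces
# Illustrative (assumed) line under the eno1 stanza:
#   post-up /sbin/ethtool -K eno1 tso off gso off gro off
```

If the reload or reboot goes wrong, the backup can be copied back over the edited file from the local console.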
Command¶
ifreload -a
What it does¶
Reloads all interface definitions using ifupdown2.
Important flags or arguments¶
-a: reload all interfaces
Why it was used at that moment¶
To apply the persistent NIC workaround without rebooting the node.
Expected result¶
Interfaces reload successfully and post-up hooks run.
What success or failure would indicate¶
- Success: the persistent change is live immediately.
- Failure: there may be a syntax issue in /etc/network/interfaces.
Risk¶
Moderate. Reloading networking on a remote Proxmox host can interrupt management access if the config is wrong.
Safer alternative¶
Reboot during a maintenance window instead of live reloading.
Command¶
pveversion
What it does¶
Shows installed Proxmox VE version information.
Important flags or arguments¶
- none
Why it was used at that moment¶
To check the current update baseline before upgrading the node.
Expected result¶
Proxmox version output.
What success or failure would indicate¶
- Success: confirms current installed version state.
Risk¶
Low.
Safer alternative¶
None needed.
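To record the baseline in a form that can be compared after the upgrade, the running kernel can be pulled out of the `pveversion` line. This is a small sketch that assumes the common one-line format `pve-manager/<version>/<hash> (running kernel: <kernel>)`; `running_kernel_from_pveversion` is a hypothetical helper name.

```shell
# Reads pveversion output on stdin and prints the running kernel string,
# i.e. the text between "(running kernel: " and the closing parenthesis.
running_kernel_from_pveversion() {
    sed -n 's/.*(running kernel: \([^)]*\)).*/\1/p'
}

# Usage on the host:
#   pveversion | running_kernel_from_pveversion
#   # Cross-check against the kernel actually booted:
#   [ "$(pveversion | running_kernel_from_pveversion)" = "$(uname -r)" ] && echo "consistent"
```

Saving this value before `apt full-upgrade` and again after the reboot shows whether the node really came back on a newer kernel.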
Command¶
apt update
What it does¶
Refreshes package metadata from configured Debian and Proxmox repositories.
Important flags or arguments¶
- none
Why it was used at that moment¶
As part of the standard Proxmox node update workflow.
Expected result¶
Updated package lists.
What success or failure would indicate¶
- Success: host is ready for package upgrade.
- Failure: repository, network, or configuration problem exists.
Risk¶
Low.
Safer alternative¶
pveupdate is the Proxmox wrapper for a similar step.
Command¶
apt full-upgrade -y
What it does¶
Performs a full package upgrade, allowing dependency changes and replacements.
Important flags or arguments¶
- full-upgrade: upgrade with dependency changes
- -y: automatically confirm prompts
Why it was used at that moment¶
To fully update the Proxmox node, including kernel and core packages.
Expected result¶
The node upgrades all eligible packages.
What success or failure would indicate¶
- Success: node is updated and ready for reboot.
- Failure: package dependency or repository issues must be addressed.
Risk¶
Moderate. This can update critical Proxmox, kernel, and storage components.
Safer alternative¶
Run on one node at a time during a maintenance window.
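Since `-y` skips the confirmation prompt, it can help to preview the transaction first. The sketch below parses the output of a simulated run (`apt-get -s dist-upgrade`, which changes nothing) and reports whether a kernel package is pending. The helper names are hypothetical, and the package-name prefixes `proxmox-kernel-` and `pve-kernel-` are an assumption about how the kernel packages are named on the node.

```shell
# Reads `apt-get -s dist-upgrade` output on stdin and prints the names of
# packages that would be installed or upgraded (lines beginning with "Inst").
pending_packages() {
    awk '$1 == "Inst" { print $2 }'
}

# Succeeds if any pending package looks like a Proxmox kernel package.
kernel_update_pending() {
    pending_packages | grep -Eq '^(proxmox-kernel|pve-kernel)-'
}

# Usage on the host:
#   apt-get -s dist-upgrade | pending_packages
#   apt-get -s dist-upgrade | kernel_update_pending && echo "plan a reboot"
```

A pending kernel package is a good signal that the upgrade should be scheduled together with a reboot window.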
Command¶
reboot
What it does¶
Restarts the node.
Important flags or arguments¶
- none
Why it was used at that moment¶
To load updated kernels or apply changes that require reboot.
Expected result¶
The node restarts and returns to service.
What success or failure would indicate¶
- Success: host comes back online with the updated runtime.
- Failure: console investigation may be required.
Risk¶
High in a clustered environment if workloads are not planned around the reboot.
Safer alternative¶
Migrate or stop critical workloads first.
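One way to see what a reboot would interrupt is to list the VMs that are currently running. This is a minimal sketch that assumes the usual `qm list` column order (VMID, NAME, STATUS, ...); `running_vmids` is a hypothetical helper name.

```shell
# Reads `qm list` output on stdin and prints the VMID of every VM whose
# STATUS column is "running". The header line is skipped.
running_vmids() {
    awk 'NR > 1 && $3 == "running" { print $1 }'
}

# Usage on the host:
#   qm list | running_vmids
#   # Refuse to continue while anything is still running:
#   [ -z "$(qm list | running_vmids)" ] || echo "VMs still running, not rebooting"
```

Containers are not covered by `qm list`, so this check addresses VMs only.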
Command¶
pveupdate
What it does¶
Runs the Proxmox convenience wrapper for refreshing package lists.
Important flags or arguments¶
- none
Why it was used at that moment¶
To describe the concise Proxmox-native update flow.
Expected result¶
Repository metadata refreshes.
What success or failure would indicate¶
- Success: package lists are current.
- Failure: same classes of issues as apt update.
Risk¶
Low.
Safer alternative¶
apt update is the standard Debian equivalent.
Command¶
pveupgrade
What it does¶
Runs the Proxmox convenience wrapper for a full node upgrade.
Important flags or arguments¶
- none in this invocation
Why it was used at that moment¶
To describe the concise Proxmox-native update flow.
Expected result¶
Available Proxmox and Debian package upgrades are applied.
What success or failure would indicate¶
- Success: host is updated.
- Failure: dependency or repository issues need review.
Risk¶
Moderate. This can affect critical virtualization and storage components.
Safer alternative¶
apt full-upgrade is the standard Debian equivalent.
Command¶
Likely command used: top
What it does¶
Displays live process, CPU, memory, and load information.
Important flags or arguments¶
- interactive usage
- in a VM, pressing 1 shows per-vCPU detail
Why it was used at that moment¶
To inspect runtime CPU pressure, IO wait, and possible VM steal time.
Expected result¶
An interactive process and CPU summary.
What success or failure would indicate¶
- High wa: storage bottleneck.
- High st inside a VM: host scheduling contention.
- High load with poor responsiveness: potential lockup or backend dependency issue.
Risk¶
Low.
Safer alternative¶
htop if installed, but top is standard and usually available.
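For logging over time rather than watching interactively, the wa and st values can be extracted from batch-mode output (`top -b -n 1`). The sketch below assumes the procps-style summary line beginning with `%Cpu(s):` and a locale that uses a dot as the decimal separator; `cpu_wait_and_steal` is a hypothetical helper name.

```shell
# Reads `top -b -n 1` output on stdin and prints the IO-wait and steal
# percentages from the first "%Cpu(s):" summary line as "wa=<n> st=<n>".
cpu_wait_and_steal() {
    awk '/^%Cpu\(s\)/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^(wa|st),?$/) {
                v = $(i - 1)
                sub(/^.*,/, "", v)   # handles fused fields such as "id,100.0"
                val[substr($i, 1, 2)] = v
            }
        }
        printf "wa=%s st=%s\n", val["wa"], val["st"]
        exit
    }'
}

# Usage (inside the VM or on the host):
#   LC_ALL=C top -b -n 1 | cpu_wait_and_steal
```

Appending this output to a file with a timestamp every few seconds leaves a trace of the last values seen before a freeze.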
Command¶
Likely command used: iostat -x 1
What it does¶
Displays extended disk IO statistics every second.
Important flags or arguments¶
- -x: extended device statistics
- 1: one-second interval
Why it was used at that moment¶
To determine whether storage latency or device saturation was contributing to freezes.
Expected result¶
Rolling per-device IO stats including utilization and wait times.
What success or failure would indicate¶
- High %util and await indicate a storage bottleneck.
- Low utilization suggests storage is not the main limiter.
Risk¶
Low.
Safer alternative¶
None needed.
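The rolling output is easier to scan when only saturated devices are shown. The following sketch filters `iostat -x` output by the `%util` column, locating that column from the header because its position differs between sysstat versions. `busy_devices` is a hypothetical helper name, and the default threshold of 80 is an arbitrary assumption, not a value from this investigation.

```shell
# Reads `iostat -x` output on stdin and prints "<device> <%util>" for every
# device whose %util exceeds the threshold (first argument, default 80).
busy_devices() {
    threshold=${1:-80}
    awk -v thr="$threshold" '
        $1 == "avg-cpu:" { col = 0; next }          # new report: reset column
        $1 ~ /^Device/ {                             # header row: find %util
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && ($col + 0) > (thr + 0) { print $1, $col }
    '
}

# Usage on the host:
#   LC_ALL=C iostat -x 1 5 | busy_devices 90
```

An empty result across several intervals supports the reading that storage is not the main limiter.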
Command¶
Likely command used: stress-ng --cpu 8 --vm 2 --vm-bytes 75% --io 4 --timeout 30m
What it does¶
Generates synthetic CPU, memory, and IO load on the host.
Important flags or arguments¶
- --cpu 8: stress 8 CPU workers
- --vm 2: memory stress workers
- --vm-bytes 75%: target 75% memory usage
- --io 4: IO stress workers
- --timeout 30m: run for 30 minutes
Why it was discussed at that moment¶
To separate generic hardware instability from workload-specific failures.
Expected result¶
Thirty minutes of sustained synthetic load.
What success or failure would indicate¶
- Stable run: hardware may be okay and workload pattern may matter more.
- Crash/hang: deeper hardware, firmware, or kernel issues become more likely.
Risk¶
Moderate to high. Can push the node into failure and should be run only in a maintenance window.
Safer alternative¶
Run a shorter-duration test first.
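A staged run is one way to apply the shorter-duration idea: start with a brief stage and escalate only if the node survives. The sketch below reuses the flags from the command above; the stage lengths (1m, 5m, 30m) and the helper names are assumptions for illustration. By default it only prints the commands, and it executes them only when `RUN=1` is set.

```shell
# Prints the stress-ng command for one stage, or executes it when RUN=1.
stress_stage() {
    duration=$1
    set -- stress-ng --cpu 8 --vm 2 --vm-bytes 75% --io 4 --timeout "$duration"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Runs the stages in order and stops at the first one that fails.
staged_stress() {
    for d in 1m 5m 30m; do
        stress_stage "$d" || { echo "stage $d failed" >&2; return 1; }
    done
}

# Usage on the host, in a maintenance window:
#   staged_stress          # dry run: prints the three commands
#   RUN=1 staged_stress    # actually applies the load
```

If the node hangs during an early stage, the point has been made with less exposure than a full 30-minute run.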
Command¶
Likely command used: memtest86+ / MemTest86 boot run
What it does¶
Tests system RAM outside the normal operating system.
Important flags or arguments¶
- boot-time utility rather than a shell command in this session
Why it was discussed at that moment¶
To rule out faulty RAM after repeated host instability.
Expected result¶
Multiple clean passes with zero errors.
What success or failure would indicate¶
- Zero errors: RAM is less likely to be the issue.
- Any errors: memory or memory path faults are strongly suspected.
Risk¶
Low runtime risk, but requires host downtime.
Safer alternative¶
None equivalent from inside the running OS.
Command¶
Likely command used: qm / Proxmox GUI disk option changes for iothread and discard
What it does¶
Applies VM disk options such as IOThreads and discard/TRIM handling.
Important flags or arguments¶
- iothread: valid on supported virtio/virtio-scsi-single disk paths
- discard: enables trim/unmap propagation when supported
Why it was discussed at that moment¶
To improve VM disk behavior and explain why Proxmox ignored IOThread on an unsupported disk/controller combination.
Expected result¶
Supported disks accept the setting; unsupported ones emit a warning and ignore it.
What success or failure would indicate¶
- Warning about unsupported controller or disk type means IOThread is ignored safely.
- Correct use requires a virtio disk or virtio-scsi-single controller.
Risk¶
Low to moderate depending on whether controller changes require downtime.
Safer alternative¶
Verify disk bus and SCSI controller type before enabling IOThread.
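That pre-check can be sketched against the VM configuration text. The example below is a rough check under stated assumptions: it expects `qm config <vmid>` style lines such as `scsihw: virtio-scsi-single` and `scsi0: ...`, the helper name `iothread_applicable` is hypothetical, and it does not distinguish real disks from CD-ROM entries on the same bus.

```shell
# Reads `qm config <vmid>` output on stdin. Succeeds if at least one disk is
# on a path where IOThread applies according to this document: a virtioN
# disk, or a scsiN disk together with scsihw: virtio-scsi-single.
iothread_applicable() {
    awk '
        /^scsihw: *virtio-scsi-single/ { single = 1 }
        /^virtio[0-9]+:/               { virtio = 1 }
        /^scsi[0-9]+:/                 { scsi = 1 }
        END { exit !(virtio || (single && scsi)) }
    '
}

# Usage on the host (replace <vmid> with the ID of the debian-docker VM):
#   qm config <vmid> | iothread_applicable || echo "IOThread would be ignored"
```

A failing check matches the situation described above, where Proxmox warns and ignores the option.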