Proxmox Host Kernel Crash Investigation on mainframe¶
Summary¶
Investigated a Proxmox host crash on mainframe after the host logged a kernel fault and scheduler-related panic behavior. The goal was to determine whether the issue was caused by the Proxmox kernel, an out-of-tree module, or unstable host hardware.
Environment¶
- Host: mainframe
- Platform: Proxmox VE host
- Motherboard: ASUS ROG Strix Z390-F Gaming
- CPU platform: Intel 8th/9th Gen LGA1151 platform
- Kernel observed in logs: 6.8.12-16-pve
- Memory:
  - 32 GB installed
  - 4 memory slots populated
  - Non-ECC platform
- Storage context:
  - Ceph in use elsewhere in the homelab
  - ZFS not intentionally used on this host
- Workloads affected:
  - Proxmox scheduling
  - Guest VM availability and stability
Problem¶
The Proxmox host experienced a kernel crash, causing host instability and likely affecting guest VMs running on the node.
Symptoms¶
- Kernel trace showed a crash in: unlink_anon_vmas+...
- Additional fatal messages included:
  - BUG: scheduling while atomic: pvescheduler/...
  - recursive fault but reboot is needed!
- Kernel was marked tainted.
- Module list in the log included out-of-tree modules such as zfs(PO) and spl(O).
- Host instability on the Proxmox node would also explain guest-side symptoms such as VM hangs or dropped guest agent connectivity.
Actions Taken¶
- Reviewed the OCR-recovered host kernel crash log from [date removed].
- Identified that the crash occurred on the Proxmox host rather than inside a guest VM.
- Noted that the crash path was in Linux memory-management code (unlink_anon_vmas).
- Noted that the scheduler process pvescheduler was involved in the panic sequence.
- Checked whether ZFS was actually part of the intended host design.
- Distinguished Ceph usage from ZFS usage to avoid conflating unrelated storage layers.
- Considered kernel taint and loaded third-party modules as potential contributors.
Key Findings¶
- The crash was a host-level kernel problem, not a guest-only issue.
- unlink_anon_vmas is part of the Linux VM/MM subsystem, so the crash pattern was consistent with:
  - memory corruption
  - unstable RAM
  - a buggy out-of-tree kernel module
- The presence of zfs(P) and spl in the module list meant the host was running with proprietary/out-of-tree kernel modules loaded, even though ZFS was not intentionally part of the design.
- This crash alone did not conclusively prove defective RAM, but it strongly justified hardware and module scrutiny.
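The taint state seen in the crash log can also be read from a running host via /proc/sys/kernel/tainted. The following is a minimal sketch, assuming only the proprietary (P), died-recently (D), out-of-tree (O), and unsigned-module (E) bits are of interest here; decode_taint is an illustrative helper name, not an existing tool, and it is not a complete decoder.

```shell
# Sketch: decode a few kernel taint bits from the numeric taint value.
# decode_taint is a hypothetical helper; only bits 0, 7, 12 and 13 are handled.
decode_taint() {
    value=$1
    flags=""
    # bit 0 (1): proprietary / non-GPL module loaded
    [ $(( value & 1 )) -ne 0 ] && flags="${flags}P"
    # bit 7 (128): kernel died recently (OOPS or BUG)
    [ $(( value & 128 )) -ne 0 ] && flags="${flags}D"
    # bit 12 (4096): externally built (out-of-tree) module loaded
    [ $(( value & 4096 )) -ne 0 ] && flags="${flags}O"
    # bit 13 (8192): unsigned module loaded
    [ $(( value & 8192 )) -ne 0 ] && flags="${flags}E"
    printf '%s\n' "${flags:-none}"
}

# Usage on a live host:
#   decode_taint "$(cat /proc/sys/kernel/tainted)"
```

A result containing P and O would be consistent with ZFS modules being loaded; it says nothing about whether those modules caused a fault.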
Resolution¶
No permanent resolution was completed in this part of the session. The working conclusion at this stage was:
- treat the issue as a host-level kernel instability
- remove suspicion from Ceph itself
- continue investigating both:
  - unused out-of-tree modules
  - host RAM stability
Validation¶
Validation was not yet complete at this stage. The main validation outcome was analytical: the issue was correctly reclassified from guest instability to host kernel instability.
Follow-Up Tasks¶
- Verify whether zfs and spl are installed and loaded unnecessarily.
- Remove unused out-of-tree modules if they are not required.
- Continue memory stability testing on the host.
- Keep copies of kernel traces for comparison across incidents.
- Consider pinning to a known-stable Proxmox kernel if crashes continue after hardware remediation.
Lessons Learned¶
- A guest appearing frozen can be the result of a host kernel fault.
- Ceph and ZFS are unrelated; loaded ZFS modules should not be assumed to be part of a Ceph design.
- OCR logs can still be useful if the stack frames and panic messages are recognizable.
Second Host Kernel Crash Linked to knem¶
Summary¶
Investigated another Proxmox host crash on mainframe. This time, the kernel trace pointed to knem_cache_alloc, which suggested a fault involving the knem kernel module rather than a generic guest VM issue.
Environment¶
- Host: mainframe
- Platform: Proxmox VE host
- Kernel family: Proxmox 6.8.x series
- Workloads impacted:
  - Proxmox-hosted VMs
  - QEMU guest agent visibility
  - General host stability
- Relevant modules mentioned:
  - knem
  - zfs
  - spl
  - virtualization/network-related modules such as vhost_net, tap
Problem¶
A later host crash occurred, and the user wanted to confirm whether the earlier kernel errors were consistent with the current instability.
Symptoms¶
- Kernel trace showed: RIP: 0010:knem_cache_alloc+...
- Fault markers included:
  - ---[ end trace ... ]---
  - kernel-space fault address in CR2
- Guest-visible effects included:
  - VM instability
  - dropped qemu-guest-agent
  - apparent VM freezes after some uptime
Actions Taken¶
- Reviewed the recovered host kernel trace.
- Confirmed that the trace was from the Proxmox host, not from a guest VM.
- Interpreted knem_cache_alloc as evidence pointing to the knem kernel module.
- Distinguished this fault from the earlier unlink_anon_vmas crash while noting that both were host memory-path failures.
- Considered removal/blacklisting of knem.
- Considered removal of unused ZFS packages to reduce host kernel complexity.
Key Findings¶
- The crash was again a host kernel fault.
- knem_cache_alloc strongly implicated the knem module.
- Since host kernel faults can stall scheduling and I/O, guest VM symptoms were consistent with a host-side root cause.
- The presence of multiple distinct MM-adjacent host crashes increased suspicion of:
  - unstable host memory
  - problematic out-of-tree modules
  - or both
Resolution¶
No final fix was completed in this portion of the session, but the likely remediation path identified was:
- blacklist or remove knem
- remove unused ZFS-related packages if not needed
- continue hardware validation
Validation¶
Validation was still pending at this stage. The important completed validation was logical: host kernel instability was confirmed as the correct investigative focus.
Follow-Up Tasks¶
- Confirm whether knem is installed and loaded.
- Blacklist/remove knem if not needed.
- Rebuild initramfs after module cleanup.
- Reboot the host and re-check loaded modules.
- Continue host memory testing.
Lessons Learned¶
- Repeated guest freezes can originate from a crashing Proxmox host kernel.
- A precise RIP location in the stack trace can identify a likely culprit module.
- Unused modules increase attack surface and troubleshooting complexity.
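To compare faulting locations across saved traces, the symbol on each RIP line can be pulled out mechanically. This is a rough sketch that assumes the usual "RIP: 0010:symbol+0x.../0x..." layout survived OCR; rip_symbol and the file name crash-trace.txt are illustrative names, not part of the original session.

```shell
# Sketch: print the symbol name from each kernel "RIP:" line.
# Reads the named files, or standard input when no file is given.
rip_symbol() {
    sed -n 's/.*RIP: [0-9a-fA-F]*:\([A-Za-z0-9_.]*\)+.*/\1/p' "$@"
}

# Usage (crash-trace.txt is a hypothetical saved trace):
#   rip_symbol crash-trace.txt | sort | uniq -c
```

Lines without a symbol and offset, such as user-space RIP values, are skipped, so the output only lists kernel symbols that were recognizable in the trace.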
Host Memory Stability Validation on mainframe¶
Summary¶
Performed targeted host memory diagnostics after multiple host kernel crashes. The aim was to determine whether RAM instability was the true underlying cause of the Proxmox host failures.
Environment¶
- Host: mainframe
- Platform: Proxmox VE host
- Motherboard: ASUS ROG Strix Z390-F Gaming
- Memory installed:
  - TEAMGROUP T-Force Delta RGB DDR4
  - 16 GB kit branding discussed as 4x8GB 3200MHz CL16
- Effective detected platform state:
  - 4/4 DIMM slots populated
  - Non-ECC memory
- Logs:
  - boot log from [date removed]
  - later journal entries from [date removed]
- Test tools used:
  - memtester
  - stress-ng
Problem¶
Needed to determine whether host RAM was actually the source of the Proxmox host crashes and VM instability.
Symptoms¶
- memtester reported extensive failures, including:
  - FAILURE: possible bad address line
  - many repeated mismatches across multiple test patterns
  - repeated bit-flip style corruption
- stress-ng reported direct memory corruption:
  - vm: detected 523 bit errors while stressing memory
  - vm: detected 1456 bit errors while stressing memory
- Kernel logs did not show Machine Check Exceptions or ECC correction events.
- Journal showed: EDAC ie31200: No ECC support
- System reported: 4/4 memory slots populated (from DMI)
Actions Taken¶
- Confirmed host board model and memory kit details.
- Discussed whether XMP was in use and clarified that non-XMP/JEDEC recommendations were more appropriate for a stability-first Proxmox host.
- Queried kernel logs for hardware error indicators:
sudo dmesg -T | egrep -i 'mce|machine check|hardware error|ecc|memory'
sudo journalctl -k | egrep -i 'mce|hardware error|ecc'
Purpose: check for machine check exceptions, ECC activity, or obvious hardware error records.
- Reviewed the kernel/journal output.
- Ran memtester and observed widespread failures across many test categories.
- Ran stress-ng with its vm (virtual memory) stressor using large memory allocations:
sudo stress-ng --vm 2 --vm-bytes 80% --timeout 30m --metrics-brief
Purpose: stress memory allocation and detect corruption under load.
- Interpreted the absence of ECC support together with repeated user-space memory corruption.
Key Findings¶
- The host platform has no ECC support, so memory corruption cannot be corrected or cleanly attributed by ECC reporting.
- memtester output showed severe and repeated memory corruption, including:
  - stuck address failures
  - random value mismatches
  - arithmetic and pattern test corruption
- stress-ng independently reproduced memory errors with large numbers of bit flips.
- The combination of the following is strong evidence that the memory path is unstable:
  - repeated host kernel crashes
  - repeated memory test failures
  - lack of ECC
- At this stage, host RAM instability became the leading root-cause candidate over Proxmox software alone.
Resolution¶
Current status:
- No hardware replacement had yet been completed in the chat.
- The operational conclusion was that RAM instability is real and must be treated as an active hardware issue.
- Recommended immediate remediation path:
  - run memory at JEDEC-safe settings
  - isolate DIMMs one at a time
  - identify bad stick or slot
  - replace unstable memory kit as needed
Validation¶
Validation was strong and multi-layered:
- memtester reproduced corruption repeatedly.
- stress-ng detected hundreds to thousands of bit errors.
- Log review confirmed the system is non-ECC and therefore unable to detect or correct these failures in hardware.
- The results were consistent with the earlier host kernel crashes.
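When several stress-ng runs are saved to a file, the reported bit errors can be totalled to compare runs before and after a hardware change. The sketch below assumes the "detected N bit errors" wording quoted in the symptoms; sum_bit_errors is an illustrative helper name.

```shell
# Sketch: total the bit-error counts reported in saved stress-ng output.
# Reads the named files, or standard input when no file is given.
sum_bit_errors() {
    sed -n 's/.*detected \([0-9][0-9]*\) bit errors.*/\1/p' "$@" |
        awk '{ total += $1 } END { print total + 0 }'
}

# Usage (stress-ng.log is a hypothetical capture of a run):
#   sum_bit_errors stress-ng.log
```

For the two messages recorded in this session (523 and 1456), the total is 1979. Any non-zero total on a healthy configuration would be unexpected.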
Follow-Up Tasks¶
- Enter BIOS and set memory to conservative JEDEC settings.
- Disable XMP if enabled now or in future tests.
- Test one DIMM at a time in the same slot.
- Test a known-good DIMM across slots to rule out a motherboard slot issue.
- Run bootable MemTest86 or Memtest86+ for deeper validation.
- Replace failing DIMM(s) or memory kit.
- After hardware correction, re-validate host stability under Proxmox load.
- Review storage and service integrity after running with unstable RAM.
Lessons Learned¶
- Widespread bit flips in both memtester and stress-ng are strong evidence of hardware-level memory instability.
- Absence of MCE logs does not clear RAM on a non-ECC platform.
- Host memory faults can masquerade as VM instability, guest agent drops, and kernel crashes.
- For a virtualization host, conservative JEDEC settings are often preferable to performance-oriented memory profiles.
Memory Tuning and Power-State Diagnostic Discussion¶
Summary¶
Discussed whether disabling S-/P-/C-states would be meaningful in this case and whether the memory test results could identify a particular DIMM.
Environment¶
- Host: mainframe
- Board: ASUS ROG Strix Z390-F Gaming
- Memory subsystem:
  - 4 DIMMs installed
  - non-ECC DDR4
- Operating role:
  - always-on Proxmox host
Problem¶
Needed to interpret advice seen elsewhere about disabling S-/P-/C-states and determine whether existing tests could identify a specific RAM stick.
Symptoms¶
- Host crashes and proven memory bit errors already existed.
- No explicit DIMM-level failure mapping was available from the performed tests.
Actions Taken¶
- Evaluated whether disabling sleep/power states would meaningfully address the confirmed memory corruption.
- Clarified that:
- S-states are generally irrelevant for an always-on Proxmox host
- P-/C-state changes can be used as a diagnostic aid, but not as a true fix for memory corruption
- Evaluated whether stress-ng --vm could identify the specific failing stick.
- Clarified that the performed tests only proved memory corruption, not DIMM identity.
- Recommended one-DIMM-at-a-time testing and slot isolation.
Key Findings¶
- Disabling power states may reduce transient conditions, but it does not explain away large-scale repeatable memory corruption.
- The current evidence still points to unstable RAM, slot, or IMC path rather than a pure CPU power-management issue.
- Existing Linux memory stress tools used in-session did not identify the failing stick.
Resolution¶
Current status:
- No permanent BIOS power-state changes were adopted as the fix.
- The recommended path remained:
  - JEDEC-safe memory settings
  - isolate sticks individually
  - replace bad hardware if identified
Validation¶
No new validation was completed in this section. This was a decision/interpretation step that refined the troubleshooting path.
Follow-Up Tasks¶
- Test each DIMM independently in slot A2.
- Test a known-good DIMM in other slots.
- Only revisit power-state tuning if instability remains after memory hardware is proven good.
Lessons Learned¶
- Power-state changes can be useful for narrowing edge-case instability, but they are not a substitute for fixing bad RAM.
- Tools like stress-ng can prove corruption without localizing the bad DIMM.
- DIMM isolation remains the most reliable low-cost method on a non-ECC desktop platform.
Replacement RAM Selection for a Stability-First Proxmox Host¶
Summary¶
Reviewed replacement RAM selection for the ASUS ROG Strix Z390-F and corrected earlier advice that assumed XMP usage. The goal shifted to selecting memory appropriate for a stable no-XMP Proxmox host.
Environment¶
- Host: mainframe
- Motherboard: ASUS ROG Strix Z390-F Gaming
- CPU family: Intel 8th/9th Gen
- Requirement:
  - no-XMP / JEDEC-oriented stability
  - homelab/Proxmox host usage
- Current RAM:
  - TEAMGROUP T-Force Delta RGB DDR4
  - unstable under testing
Problem¶
Needed replacement RAM recommendations that prioritize reliability over XMP-driven speed.
Symptoms¶
- Existing RAM showed clear corruption in memory stress tests.
- Earlier generic suggestions mentioning XMP were not aligned with the stated stability-first requirement.
Actions Taken¶
- Reviewed the board model and platform class.
- Corrected the recommendation path to no-XMP / JEDEC memory.
- Identified that the safe operating target for a stability-focused Z390 host is generally JEDEC DDR4 speeds.
- Recommended conservative, non-XMP-oriented kit choices such as:
- Crucial DDR4-2666 JEDEC
- Kingston ValueRAM DDR4-2666 JEDEC
- Discussed 2x16 GB and 4x16 GB stable capacity options rather than performance-tuned RGB kits.
Key Findings¶
- For this host role, JEDEC memory is more appropriate than relying on XMP profiles.
- Full population of 4 DIMM slots puts more stress on the memory controller than a 2-DIMM layout.
- Replacing RGB/performance-oriented memory with plain JEDEC-oriented DIMMs is operationally sensible for a Proxmox host.
Resolution¶
Current status:
- No new kit was purchased in the chat.
- The direction of travel was to replace the current unstable RAM with conservative JEDEC DDR4 suitable for the Z390 platform.
Validation¶
Not yet applicable. Validation will come after:
- installation
- MemTest86/Memtest86+
- Linux-side stress testing
- stable Proxmox uptime
Follow-Up Tasks¶
- Decide target capacity: 32 GB or 64 GB.
- Prefer matched kits rather than mixed modules.
- Validate replacement RAM with both bootable and Linux-based tests.
- Re-check host kernel stability after replacement.
Lessons Learned¶
- Homelab hosts benefit more from conservative memory configuration than peak memory frequency.
- Advice appropriate for gaming builds is not always appropriate for always-on virtualization hosts.
- Correcting assumptions about XMP matters when making hardware recommendations.
Command Reference¶
Command¶
sudo dmesg -T | egrep -i 'mce|machine check|hardware error|ecc|memory'
What it does¶
Searches the kernel ring buffer for machine check, ECC, hardware error, or memory-related messages.
Important flags and arguments¶
- dmesg -T shows kernel messages with human-readable timestamps.
- egrep -i performs a case-insensitive extended regex search.
Why it was used¶
To look for kernel-reported evidence of hardware memory failure, machine check exceptions, or ECC activity on the Proxmox host.
Expected result¶
- Matches such as MCE, hardware error, or ECC correction/failure would support hardware suspicion.
- No such messages would not fully clear RAM, especially on a non-ECC system.
Success or failure meaning¶
- Success: command runs and returns matching kernel lines if present.
- No output: no matching strings were found in the current ring buffer.
Risk¶
Low risk. Read-only diagnostic command.
Safer alternative¶
journalctl -k can provide a broader boot history if the ring buffer has rotated.
Command¶
sudo journalctl -k | egrep -i 'mce|hardware error|ecc'
What it does¶
Searches the systemd journal for kernel log entries related to machine checks, hardware errors, and ECC.
Important flags and arguments¶
- journalctl -k restricts output to kernel messages.
- egrep -i performs case-insensitive matching.
Why it was used¶
To search a longer-lived kernel log history than dmesg alone and check whether past boot sessions recorded hardware-level faults.
Expected result¶
- MCE/ECC/hardware error logs would strengthen the case for host hardware instability.
- In this case, the relevant finding was:
EDAC ie31200: No ECC support
Success or failure meaning¶
- Success: journal access and matching lines returned.
- No output: no matching kernel entries were found.
Risk¶
Low risk. Read-only diagnostic command.
Safer alternative¶
None needed; this is already a safe log query.
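To check the boot that actually crashed rather than the current one, the same search can be pointed at a previous boot. This sketch assumes the journal is persistent (otherwise earlier boots are not retained); count_hw_errors is an illustrative helper name and the pattern list is a starting point, not an exhaustive one.

```shell
# Sketch: count kernel log lines that mention hardware-error keywords.
# Reads the named files, or standard input when no file is given.
# Prints 0 (instead of failing) when nothing matches.
count_hw_errors() {
    grep -Eic 'mce|machine check|hardware error|edac' "$@" || true
}

# Usage on the host, previous boot only:
#   journalctl -k -b -1 --no-pager | count_hw_errors
```

A non-zero count only means matching lines exist; informational lines such as the EDAC "No ECC support" notice also match, so the lines still need to be read.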
Command¶
sudo stress-ng --vm 2 --vm-bytes 80% --timeout 30m --metrics-brief
What it does¶
Runs two memory (vm) stress workers, each allocating and exercising a large amount of memory, while reporting basic metrics.
Important flags and arguments¶
- --vm 2 launches 2 VM memory stress workers.
- --vm-bytes 80% tells each worker to use a large portion of available memory.
- --timeout 30m runs the stress for 30 minutes.
- --metrics-brief prints summary performance numbers.
Why it was used¶
To reproduce memory corruption under sustained load on the Proxmox host.
Expected result¶
- On a healthy system: no bit-error reports and a clean completion.
- On an unstable memory subsystem: stress-ng may report detected bit errors or terminate unsuccessfully.
Success or failure meaning¶
- Success: run completes with no bit-error failures.
- Failure: reported bit errors strongly indicate unstable memory hardware or memory settings.
Risk¶
Moderate.
- Heavy memory pressure can affect host responsiveness.
- Should not be run casually on a production virtualization host carrying critical workloads.
Safer alternative¶
Run a bootable offline memory test such as MemTest86/Memtest86+ during a maintenance window.
Command¶
memtester 24576M 2
What it does¶
Exercises a large block of memory with multiple test patterns for two loops.
Important flags and arguments¶
- 24576M requests testing of roughly 24 GiB.
- 2 runs two test loops.
Why it was used¶
To test most of the host’s available RAM from Linux and look for corruption patterns.
Expected result¶
- Healthy memory should complete pattern tests with no failures.
- Repeated mismatches, stuck address failures, or bit-flip patterns indicate unstable RAM, slot, or memory controller path.
Success or failure meaning¶
- Success: zero reported failures.
- Failure: strong evidence of memory corruption.
Risk¶
Moderate.
- High memory allocation on a live host can pressure other services.
- Best used during maintenance windows.
Safer alternative¶
Bootable MemTest86/Memtest86+ performs testing outside the running OS and avoids interference from live workloads.
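If memtester is run on a live host anyway, the test size can be derived from MemAvailable instead of being typed by hand, leaving some headroom for the host. This is a minimal sketch; memtester_size_mb and the default 2048 MB reserve are assumptions for illustration, not values from the session.

```shell
# Sketch: suggest a memtester size in MB from MemAvailable minus a reserve.
# $1 = meminfo file (default /proc/meminfo), $2 = MB to leave free (default 2048).
memtester_size_mb() {
    awk -v reserve="${2:-2048}" '
        /^MemAvailable:/ {
            mb = int($2 / 1024) - reserve   # value in the file is in kB
            if (mb < 0) mb = 0
            print mb
        }
    ' "${1:-/proc/meminfo}"
}

# Usage on the host (memtester normally needs root to lock memory):
#   sudo memtester "$(memtester_size_mb)M" 2
```

Stopping guests first gives a larger MemAvailable and therefore better coverage, but a bootable test still covers memory that Linux itself is using.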
Command¶
lsmod | egrep -i 'knem|zfs|spl'
What it does¶
Lists currently loaded kernel modules and filters for modules relevant to the investigation.
Important flags and arguments¶
- lsmod shows active modules.
- egrep -i filters case-insensitively.
Why it was used¶
To confirm whether suspect out-of-tree modules such as knem, zfs, or spl were loaded on the host.
Expected result¶
- Presence of these modules would support module cleanup and simplification.
- Absence would reduce suspicion for those specific components.
Success or failure meaning¶
- Success: matching modules, if any, are shown.
- No output: none of the searched modules are currently loaded.
Risk¶
Low risk. Read-only diagnostic command.
Safer alternative¶
None needed.
Command¶
modprobe -r knem
What it does¶
Attempts to unload the knem kernel module from the running kernel.
Important flags and arguments¶
- -r removes the specified module if it is not in active use.
Why it was discussed¶
Because a host kernel trace pointed at knem_cache_alloc, suggesting knem may have contributed to the crash.
Expected result¶
- Successful unload if the module is present and not busy.
- Failure if the module is in use or not loaded.
Success or failure meaning¶
- Success: the module is removed from the running kernel.
- Failure: either it is not loaded or something still depends on it.
Risk¶
Moderate to high. Removing a kernel module on a live Proxmox host can destabilize dependent workloads if the module is actually in use.
Safer alternative¶
Blacklist the module and remove it during a maintenance reboot window.
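Before attempting an unload, the "Used by" count in lsmod output gives a quick indication of whether anything still holds the module. The sketch below parses lsmod-style text; module_use_count is an illustrative helper name.

```shell
# Sketch: report the "Used by" count for one module from lsmod-style input.
# $1 = module name; input is read from standard input.
module_use_count() {
    awk -v m="$1" '
        $1 == m { print $3; found = 1 }
        END     { if (!found) print "not-loaded" }
    '
}

# Usage on the host:
#   lsmod | module_use_count knem
```

A count of 0 suggests the module is idle, but it is still safer to do the removal in a maintenance window as noted above.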
Command¶
echo 'blacklist knem' > /etc/modprobe.d/blacklist-knem.conf
What it does¶
Creates a modprobe blacklist entry to prevent the knem module from auto-loading.
Important flags and arguments¶
- Writes a blacklist directive into a persistent configuration file.
Why it was discussed¶
To keep knem from loading again after reboot if it was not required.
Expected result¶
- Future boots should not automatically load knem.
Success or failure meaning¶
- Success: file is created and used by modprobe/initramfs logic after rebuild/reboot.
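- Failure: module may still load if initramfs or another config path still includes it, or if something loads it explicitly, because a blacklist entry only blocks alias-based auto-loading.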
Risk¶
Moderate. Blacklisting a needed module can break dependent software.
Safer alternative¶
Confirm the module is unused before blacklisting; test during maintenance.
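A stricter variant pairs the blacklist line with an install override, which also makes explicit modprobe requests for the module fail. This is a sketch only, not what was run in the session; write_blacklist is an illustrative helper, and the optional directory argument exists so the output can be inspected somewhere other than /etc/modprobe.d first.

```shell
# Sketch: write a modprobe.d file that blacklists a module and also
# overrides its install command so explicit loads fail.
# $1 = module name, $2 = target directory (default /etc/modprobe.d).
write_blacklist() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    printf 'blacklist %s\ninstall %s /bin/false\n' "$module" "$module" \
        > "$dir/blacklist-$module.conf"
}

# Usage (as root, then rebuild the initramfs and reboot):
#   write_blacklist knem
```

The install override is more aggressive than a plain blacklist, so it should only be used once the module is confirmed to be unneeded.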
Command¶
apt-get purge -y zfs-dkms zfsutils-linux spl-dkms
What it does¶
Removes ZFS-related DKMS packages and utilities from the host.
Important flags and arguments¶
- purge removes packages and their configuration files.
- -y auto-confirms prompts.
Why it was discussed¶
Because ZFS was not part of the intended host design, yet ZFS/SPL modules appeared in crash logs.
Expected result¶
- Removes unused ZFS package set and reduces out-of-tree kernel surface area.
Success or failure meaning¶
- Success: packages are removed.
- Failure: package names may differ, or dependencies may block removal.
Risk¶
High if ZFS is actually in use. Removing ZFS packages from a host using ZFS storage can break storage access and boot behavior.
Safer alternative¶
Confirm with storage and module checks before removal.
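One simple pre-removal check is to count mounted ZFS filesystems. The sketch below reads a mounts table; count_zfs_mounts is an illustrative helper name, and a result of 0 is only one signal, to be combined with a pool listing such as zpool list -H.

```shell
# Sketch: count mounted filesystems of type zfs.
# $1 = mounts file (default /proc/mounts); the third field is the fs type.
count_zfs_mounts() {
    awk '$3 == "zfs" { n++ } END { print n + 0 }' "${1:-/proc/mounts}"
}

# Usage on the host:
#   count_zfs_mounts          # expect 0 if ZFS is genuinely unused
#   zpool list -H             # expect no pools listed
```

If either check shows ZFS in use, the purge should not proceed.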
Command¶
update-initramfs -u -k all
What it does¶
Rebuilds initramfs images for all installed kernels.
Important flags and arguments¶
- -u updates existing initramfs images.
- -k all applies the update to all installed kernels.
Why it was discussed¶
Needed after module blacklisting or package removal so boot images reflect the new module state.
Expected result¶
- Rebuilt initramfs images without the unwanted modules or with updated module configuration.
Success or failure meaning¶
- Success: initramfs images are regenerated cleanly.
- Failure: packaging or initramfs hooks may need further correction.
Risk¶
Moderate. A bad initramfs rebuild can affect bootability if the system depends on modules that are removed or misconfigured.
Safer alternative¶
Keep console/KVM access ready before rebooting after initramfs changes.
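After a rebuild, the image contents can be listed to see whether a module file is still packed inside. This sketch assumes the Debian lsinitramfs tool from initramfs-tools is available; listing_has_module is an illustrative helper that only inspects a file listing on standard input.

```shell
# Sketch: succeed if a file listing contains <module>.ko (optionally compressed).
# $1 = module name; the listing is read from standard input.
listing_has_module() {
    grep -Eq "/$1\.ko(\.(zst|xz|gz))?$"
}

# Usage on the host:
#   if lsinitramfs "/boot/initrd.img-$(uname -r)" | listing_has_module knem; then
#       echo "knem is still in the initramfs"
#   fi
```

A module missing from the initramfs can still be loaded later from the root filesystem, so this complements the post-reboot lsmod check rather than replacing it.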
Command¶
sudo dmidecode -t memory
What it does¶
Reads SMBIOS/DMI memory inventory data from firmware.
Important flags and arguments¶
- -t memory restricts output to memory device structures.
Why it was implied¶
To identify slot population, module size, and part details for DIMM isolation and replacement planning.
Expected result¶
- Shows slot locators, sizes, configured speed, and often part numbers.
Success or failure meaning¶
- Success: hardware inventory is displayed.
- Failure: uncommon unless firmware tables are inaccessible.
Risk¶
Low risk. Read-only hardware inventory command.
Safer alternative¶
None needed.
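For DIMM isolation planning, the inventory can be reduced to one line per populated slot. This is a rough sketch assuming the usual dmidecode layout where each memory device lists Size before Locator; populated_slots is an illustrative helper name.

```shell
# Sketch: print "locator<TAB>size" for each populated memory device.
# Reads dmidecode -t memory output from files or standard input.
populated_slots() {
    awk -F': ' '
        /^[[:space:]]*Size:/    { size = $2 }
        /^[[:space:]]*Locator:/ {
            if (size != "No Module Installed") print $2 "\t" size
        }
    ' "$@"
}

# Usage on the host:
#   sudo dmidecode -t memory | populated_slots
```

The "Bank Locator" lines are deliberately ignored so that each device is reported once under its slot name.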
Command¶
Likely command used: MemTest86 or Memtest86+ from bootable media
What it does¶
Runs memory diagnostics outside the installed operating system.
Important flags and arguments¶
- Tool-specific; not a shell command from the running OS.
Why it was recommended¶
Because offline memory testing avoids interference from the running kernel and is one of the best ways to validate DIMM stability.
Expected result¶
- Zero errors on healthy memory.
- Any error strongly indicates RAM/slot/IMC instability.
Success or failure meaning¶
- Success: multiple clean passes.
- Failure: errors indicate hardware instability.
Risk¶
Low operational risk, but requires downtime.
Safer alternative¶
Linux-based tests like memtester are easier to run live, but are less isolated than bootable tests.
Command¶
Likely command used: one-DIMM-at-a-time retest in slot A2, followed by known-good DIMM testing across slots
What it does¶
This is a test procedure rather than a single command:
- install one DIMM only
- boot
- run memory test
- repeat per DIMM and per slot
Important flags and arguments¶
Not applicable.
Why it was recommended¶
To isolate whether the instability follows:
- a specific DIMM
- a specific motherboard slot
- or only a fully populated configuration
Expected result¶
- Errors following one DIMM point to a bad stick.
- Errors following one slot point to board/slot/channel issues.
- Errors only with full population suggest margin issues with the IMC or timings.
Success or failure meaning¶
- Success: stable one-stick and slot mapping identifies the failing component.
- Failure: inconsistent results may require deeper platform-level investigation.
Risk¶
Low, aside from maintenance downtime and handling hardware.
Safer alternative¶
None better for a non-ECC desktop platform.
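Because the isolation procedure spans several reboots, it helps to record each result as it is obtained. The following is a minimal sketch of such a log; the helper names, the DIMM labels, and the default file name dimm-tests.tsv are all assumptions for illustration.

```shell
# Sketch: keep a tab-separated log of DIMM isolation results.
# DIMM_LOG can be overridden; the default path is a hypothetical choice.
DIMM_LOG=${DIMM_LOG:-./dimm-tests.tsv}

# $1 = DIMM label (e.g. a sticker you apply), $2 = slot, $3 = pass or fail
record_result() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3" >> "$DIMM_LOG"
}

# List every DIMM label that failed at least once.
failing_dimms() {
    awk -F '\t' '$3 == "fail" { print $1 }' "$DIMM_LOG" | sort -u
}

# List every slot in which at least one failure occurred.
failing_slots() {
    awk -F '\t' '$3 == "fail" { print $2 }' "$DIMM_LOG" | sort -u
}

# Usage after each single-DIMM test run:
#   record_result dimm1 A2 pass
#   record_result dimm2 A2 fail
#   failing_dimms
```

If failures cluster on one DIMM label across slots, the stick is suspect; if they cluster on one slot across labels, the board or channel is the more likely cause.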