Replace snap-01 failing memory #1109

Closed
Firefishy opened this issue Jul 9, 2024 · 11 comments

Comments

@Firefishy
Member

Firefishy commented Jul 9, 2024

snap-01 has a failing DIMM that is throwing corrected ECC errors.

CPU_SrcID#1_MC#1_Chan#0_DIMM#0

I think this is the DIMM marked with **** in the following hardware linkage table:

memory stick 'P1-DIMMA1' is located at 'P0_Node0_Channel0_Dimm0'
memory stick 'P1-DIMMA2' is located at 'P0_Node0_Channel0_Dimm1'
memory stick 'P1-DIMMB1' is located at 'P0_Node0_Channel1_Dimm0'
memory stick 'P1-DIMMB2' is located at 'P0_Node0_Channel1_Dimm1'
memory stick 'P1-DIMMC1' is located at 'P0_Node0_Channel2_Dimm0'
memory stick 'P1-DIMMC2' is located at 'P0_Node0_Channel2_Dimm1'

memory stick 'P1-DIMMD1' is located at 'P0_Node1_Channel0_Dimm0'
memory stick 'P1-DIMMD2' is located at 'P0_Node1_Channel0_Dimm1'
memory stick 'P1-DIMME1' is located at 'P0_Node1_Channel1_Dimm0'
memory stick 'P1-DIMME2' is located at 'P0_Node1_Channel1_Dimm1'
memory stick 'P1-DIMMF1' is located at 'P0_Node1_Channel2_Dimm0'
memory stick 'P1-DIMMF2' is located at 'P0_Node1_Channel2_Dimm1'

memory stick 'P2-DIMMA1' is located at 'P1_Node0_Channel0_Dimm0'
memory stick 'P2-DIMMA2' is located at 'P1_Node0_Channel0_Dimm1'
memory stick 'P2-DIMMB1' is located at 'P1_Node0_Channel1_Dimm0'
memory stick 'P2-DIMMB2' is located at 'P1_Node0_Channel1_Dimm1'
memory stick 'P2-DIMMC1' is located at 'P1_Node0_Channel2_Dimm0'
memory stick 'P2-DIMMC2' is located at 'P1_Node0_Channel2_Dimm1'

memory stick 'P2-DIMMD1' is located at 'P1_Node1_Channel0_Dimm0' ****
memory stick 'P2-DIMMD2' is located at 'P1_Node1_Channel0_Dimm1'
memory stick 'P2-DIMME1' is located at 'P1_Node1_Channel1_Dimm0'
memory stick 'P2-DIMME2' is located at 'P1_Node1_Channel1_Dimm1'
memory stick 'P2-DIMMF1' is located at 'P1_Node1_Channel2_Dimm0'
memory stick 'P2-DIMMF2' is located at 'P1_Node1_Channel2_Dimm1'
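
For reference, the mapping from the error label to the table can be sketched in Python. This is only an illustration of the guess made above: it assumes that `SrcID` corresponds to the `P<n>` socket index, `MC` to `Node`, `Chan` to `Channel`, and that the silkscreen letters A to F run across the two nodes of each socket with three channels per node, as the table suggests. The function name `edac_to_locators` is hypothetical.

```python
import re

# Pattern for the label format quoted in this issue.
EDAC_LABEL = re.compile(r"CPU_SrcID#(\d+)_MC#(\d+)_Chan#(\d+)_DIMM#(\d+)")

# Assumption read off the linkage table: Channel0..Channel2 per node.
CHANNELS_PER_NODE = 3


def edac_to_locators(label):
    """Return (bank_locator, silkscreen) for an EDAC-style DIMM label.

    Assumes SrcID -> P<n>, MC -> Node, Chan -> Channel, DIMM -> Dimm.
    """
    match = EDAC_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognised label: {label!r}")
    socket, node, channel, dimm = map(int, match.groups())

    bank_locator = f"P{socket}_Node{node}_Channel{channel}_Dimm{dimm}"

    # Silkscreen names are 1-based for socket and slot; the letter
    # counts channels across both nodes of the socket (A..F).
    letter = chr(ord("A") + node * CHANNELS_PER_NODE + channel)
    silkscreen = f"P{socket + 1}-DIMM{letter}{dimm + 1}"
    return bank_locator, silkscreen


print(edac_to_locators("CPU_SrcID#1_MC#1_Chan#0_DIMM#0"))
```

Under those assumptions the label from this issue resolves to the row marked with **** in the table.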

DMI lists the memory as:

Handle 0x0035, DMI type 17, 84 bytes
Memory Device
        Array Handle: 0x0033
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: P2-DIMMD1
        Bank Locator: P1_Node1_Channel0_Dimm0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MT/s
        Manufacturer: Micron Technology
        Serial Number: F0E34EF7
        Asset Tag: P2-DIMMD1_AssetTag (date:20/01)
        Part Number: 36ASF4G72PZ-2G6E1
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: 0000
        Module Manufacturer ID: Bank 1, Hex 0x2C
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 32 GB
        Cache Size: None
        Logical Size: None
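
A rough sketch of how the DMI record above could be looked up by its bank locator, in case it helps when checking other slots. It parses text in the layout shown above (one `Memory Device` block per handle, `Key: Value` lines) and is not a full DMI parser. The function names and the shortened `SAMPLE` text are illustrative only; the values in `SAMPLE` are copied from the record above.

```python
SAMPLE = """\
Handle 0x0035, DMI type 17, 84 bytes
Memory Device
        Size: 32 GB
        Locator: P2-DIMMD1
        Bank Locator: P1_Node1_Channel0_Dimm0
        Asset Tag: P2-DIMMD1_AssetTag (date:20/01)
        Part Number: 36ASF4G72PZ-2G6E1
"""


def parse_memory_devices(text):
    """Split DMI type 17 text into one dict per 'Memory Device' block."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif line.startswith("Handle "):
            # A new handle starts; wait for its record type line.
            current = None
        elif current is not None and ":" in stripped:
            # Split on the first colon only, values may contain colons.
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def find_by_bank_locator(devices, bank_locator):
    """Return the first device whose 'Bank Locator' matches, else None."""
    for device in devices:
        if device.get("Bank Locator") == bank_locator:
            return device
    return None


dimm = find_by_bank_locator(parse_memory_devices(SAMPLE), "P1_Node1_Channel0_Dimm0")
print(dimm["Locator"], dimm["Part Number"])
```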
@Firefishy
Member Author

I have ordered 2x replacement DIMMs. They should arrive in Catford shortly.

@Firefishy
Member Author

As soon as practical, I would like to reboot the system to check that ADDDC is enabled in the BIOS.

After a successful boot with ADDDC enabled, I would then like to upgrade the BIOS from revision 3.2 to the latest revision, 4.2. snap-02 has already been upgraded.

@Firefishy
Member Author

Firefishy commented Jul 10, 2024

I don't want to jinx it, but it looks like the memory errors have stopped for now.
Note to reader: these were corrected ECC errors, not uncorrected ECC errors.

We scheduled a 1-hour maintenance window today, during which I performed the following:

  • Rebooted into the BIOS and enabled "Enhanced PPR". PPR stands for "Post Package Repair"; the option enables an extended memory test during boot / POST, which allows internal DDR4 re-mapping to spares. "Enhanced PPR" appears to be a Supermicro proprietary option. It extended POST by 6 minutes while running against 512GB of RAM.
  • Updated the BIOS to the latest release. The IPMI/BMC/OOB firmware had been updated previously.
  • Enabled the RAS option "ADC Sparing". I initially could not find documentation for it and wondered whether it was ADDDC mislabelled. Regardless, Xeon Scalable Silver CPUs appear not to support ADDDC. I have since found the documentation: "The Silver/Bronze SKUs offer Adaptive Data Correction (ADC [SR]), at Bank granularity, and the Platinum/Gold SKUs offer Adaptive Double DRAM Device Correction (ADDDC [MR]), at Bank and Rank granularity, with additional hardware facilities for device map-out."
  • Ran another "Enhanced PPR" pass for good measure.

All options above were first tested on the twin snap-02.

@Firefishy Firefishy changed the title snap-01 has failing memory Replace snap-01 failing memory Jul 10, 2024
@Firefishy
Member Author

We discussed the RAM replacement at the 11 July 2024 Ops call. We will aim to replace the memory in the server within the next 3 months. The server is no longer throwing errors, so this is not an urgent priority.

@Firefishy
Member Author

If the RAM starts throwing errors again, we will treat the replacement as urgent.
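
One way to notice the errors returning would be to compare the Linux EDAC counters over time. The following is a minimal sketch, assuming the EDAC driver is loaded and exposes `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/mc*/`. The function names are hypothetical and this is not what we actually run for monitoring.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Read corrected/uncorrected error counters per memory controller."""
    counts = {}
    for mc_dir in sorted(Path(base).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


def errors_resumed(before, after):
    """True if any corrected or uncorrected counter has increased."""
    for mc, now in after.items():
        prev = before.get(mc, {})
        for name in ("ce_count", "ue_count"):
            if now.get(name, 0) > prev.get(name, 0):
                return True
    return False
```

Taking one snapshot with `read_edac_counts()` and comparing it against a later one with `errors_resumed(before, after)` would flag any new corrected or uncorrected errors. The counters normally reset on reboot, so a snapshot from before a reboot is not a useful baseline.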

@Firefishy
Member Author

2x DIMMs are in stock at Catford.

Unfortunately it is not possible to tell which revision is installed in snap-01. The stock is 2 different revisions.

@Firefishy
Member Author

I've been able to identify the full RAM model and revision from photos: 36ASF4G72PZ-2G6E1QG.
Unfortunately, neither of the DIMMs I ordered is an exact match.

Exact match: https://www.ebay.nl/itm/155164317853

@Firefishy
Member Author

I have ordered the exact memory module. It will arrive in Catford in a few days.

@Firefishy
Member Author

The matching memory module has arrived in Catford stock.

@Firefishy Firefishy added this to the 2024 AM6 Visit milestone Sep 15, 2024
@Firefishy
Member Author

The memory is ready and a maintenance window is scheduled for today: https://community.openstreetmap.org/t/openstreetmap-maintenance-26-september-2024/118989

@Firefishy
Member Author

Memory replaced.
