This article is intended for administrators who wish to understand and test the failure tolerance of StorMagic SvHCI in a 2-node configuration with a shared witness.
Resolution/Information
Introduction
This Evaluator’s Guide describes how to perform SvHCI failure testing and benchmarking using example hardware and an example SvHCI configuration.
The first section describes configuring and deploying the infrastructure, including:
- SvHCI hosts and clusters
- Configuring the hardware
- Deploying SvHCI
- Configuring the virtual networking
- Clustering the systems with a witness
- Creating a test VM
- Configuring SvHCI event notifications
The next sections describe various failure scenarios. Each scenario verifies that the systems are configured to provide the highest possible level of redundancy and explains the expected outcome of each failure. The scenarios are organized into two categories, 'lights on' and 'lights off':
- 'Lights on' – in single failure scenarios, VMs remain running due to the resiliency and redundancy designed into the system.
- 'Lights off' – in dual failure scenarios, such as both hosts losing power, VMs go offline.
There is a section outlining how to set up SvSAN data encryption with StorMagic SvKMS key management, followed by various data encryption failure scenarios.
There are performance benchmarking sections: one for a system that does not utilize SSD or memory caching, and one for testing performance with caching.
Finally, there is a section on troubleshooting and one on how to roll back the system if required.
StorMagic Support may be contacted at support@stormagic.com or at our support suite at http://support.stormagic.com.
Infrastructure configuration and deployment
The example below uses HPE DL325 Gen11 systems:
Model: ProLiant DL325 Gen11
- 1x AMD EPYC 9224 2.5GHz 24-core 200W Processor for HPE
-- Logical Processors: 48
- 8x32GB RAM = 256GB RAM (DDR5)
- Hypervisor + VSA
-- HPE NS204i-u Gen11 NVMe Hot Plug Boot Optimized Storage Device
- SAN Storage
-- HPE MR216i-o Gen11 x16 Lanes without Cache OCP SPDM Storage Controller
--- 5x HPE 1.92TB SATA 6G Read Intensive SFF BC Multi Vendor SSD
Install SvHCI - https://support.stormagic.com/hc/en-gb/articles/18858860181917-SvHCI-Installation
Complete the first boot wizard - https://support.stormagic.com/hc/en-gb/articles/18860442155549-SvHCI-First-Boot-Wizard
Configure the virtual networking - https://support.stormagic.com/hc/en-gb/articles/18860678011293-SvHCI-Configure-networking
Deploy a Witness - https://support.stormagic.com/hc/en-gb/articles/4418754965777-StorMagic-SvHCI-SvSAN-Witness-Deploy-Install-Upgrade-and-Migrate
Create the Cluster - https://support.stormagic.com/hc/en-gb/articles/18860768070557-SvHCI-Cluster-Creation
Create the Storage Pool - https://support.stormagic.com/hc/en-gb/articles/18860841823389-SvHCI-SvSAN-Configure-a-Storage-Pool
Create 5x benchmark Ubuntu VMs - https://support.stormagic.com/hc/en-gb/articles/18860989386781-SvHCI-Linux-Guest-VM-Creation
- Each with a boot disk and around 100GB of benchmark disk space
Video Resources
The webinar below demonstrates some of the following scenarios live:
Hardware failure (HA Failover)
WAN Drop (witness connectivity failure)
Storage Failure
Cluster down
‘Lights on’ failure scenarios
This section details different failure scenarios that may occur in a production environment. Each scenario verifies that the systems are configured to provide the highest possible level of redundancy and explains the expected outcome of each failure. In the case of single failures, resiliency and redundancy are designed into the system such that the guest VMs remain running or complete an HA/FT failover (‘lights on’).
Scenario # | Scenario | Description | Procedure | Expected Outcome |
---|---|---|---|---|
1 | SvHCI node offline | Due to maintenance (firmware updates, configuration changes, etc.) or failure of the underlying storage that the node runs on. | Power off one node. | Production continues to be served by the surviving SvHCI node. SvHCI initiates the restart of Protected guest VMs on the remaining SvHCI node. Guest VMs that were previously running on the surviving node continue undisrupted. |
2 | SvHCI node experiences storage failure | The underlying storage in the host has failed: multiple disks have been lost in a RAID set (hardware or software RAID), the RAID controller has failed, etc. | Fail or pull multiple disks in a RAID set. | This depends on the hardware storage configuration of the SvHCI node. If SvHCI is deployed to separate storage, i.e. a boot RAID 1 with a pool RAID 5, SvHCI marks the affected VM disks/targets as Storage Failed, with storage continuing to be served from the node via storage traffic proxying. If SvHCI is on the same RAID set that has failed, the SvHCI node goes offline and the cluster restarts the Protected guest VMs on the surviving SvHCI node. When the failed storage is operational again it can be recovered through the SvHCI web GUI and a full synchronization is initiated. |
3 | Mirror link failure | Failure of the mirror link interface. | Disconnect the physical cable from the host, or disconnect the node virtual network adapter from the mirror link interface. | Mirror traffic continues over another available link, but performance may be affected. |
4 | Witness connectivity lost | Loss of communication with the witness service, e.g. the service is stopped. | Stop the witness service, power down the witness host server, or disconnect the network link to the witness. | The mirror remains up with no disruption; however, the cluster is then vulnerable to a further failure. |
5 | SvHCI node failure followed by witness loss | The SvHCI node is powered off. The surviving SvHCI node then experiences connectivity issues to the witness. | Shut down one SvHCI node, then stop the witness service or disconnect the link to the witness. | Guest VM service remains online throughout on the surviving SvHCI node, even though it is isolated. The witness failure does not affect production due to the order of the failures: node, then witness. |
6 | SSD cache failure – single SSD or RAID 0 | In use is either a single SSD drive with no RAID protection or multiple SSDs in a RAID 0. Pulling a drive results in the storage cache pool failing/going offline. | Remove the physical cache drive from an SvHCI node. | The cache on the failed side is marked ‘storage failed’. Event: Error – SSD cache storage has failed. |
7 | SSD cache failure – RAID 1 and above | A disk in the cache RAID fails (hardware or software RAID). | Fail or pull a single disk in the RAID-protected disk set. | The hardware or software RAID ensures the volume is available throughout the failure and SvHCI is unaffected. |
8 | Cluster split | An SvHCI node fails and is offline, e.g. a server system board failure, and a new server is shipped to site. Split the cluster, non-disruptively, and re-enable HA/VM Protection. | Power off SvHCI2; any Protected guest VMs on this node fail over. From the Discovery page, split the cluster. Power SvHCI2 back on and clean down any old configuration if reusing the same host (this is necessary because the split was performed while this node was offline). Rejoin it to a cluster and Protect the guest VMs. | After unprotecting the guest VMs and breaking any other mirrored storage, e.g. ISOs, the cluster can be split non-disruptively. |
‘Lights off’ failure scenarios
This section details different failure scenarios that may occur in a production environment. Each scenario verifies that the systems are configured to provide the highest possible level of redundancy and explains the expected outcome of each failure. In the case of dual failures, such as both hosts losing power, guest VMs go offline ('lights off').
Scenario # | Scenario | Description | Procedure | Expected Outcome |
---|---|---|---|---|
1 | Witness lost followed by SvHCI node failure | The witness service is stopped, or communication to the witness is disrupted for both nodes; the environment continues to run uninterrupted. A node is then taken offline. | Stop the witness service, then power down one node. | The surviving node is isolated and unable to determine the mirror state, so it takes the storage offline to protect the data and fail safe. A ‘loss of quorum’ event is posted: Event: Error – Mirrored target 'guestvm01disk01' was taken offline due to loss of quorum. |
2 | Dual hypervisor host failure (cluster down) | The two hosts fail simultaneously or in sequence, e.g. an environment power failure. | Force reset or pull the power cables of both SvHCI hosts. | With the SvHCI hosts back online, the targets automatically establish leadership and start a resynchronization. This may be a quick resynchronization of just the changed I/O if possible, or a full resynchronization if write I/O was occurring at the time of failure; this is to fail safe and ensure data is consistent. The storage is available and online during either a quick or a full resynchronization, with a full resynchronization resuming in the event of a further failure. |
3 | Full network failure with redundancy lost | All network communications are lost simultaneously. Network resiliency is best practice; a real-world scenario might be running all networking through a single switch, in which case neither the hosts nor the VMs are accessible on the network. | Remove all networking from both hosts at the same time. | The SvHCI nodes are unable to establish quorum as they cannot communicate with each other or the witness. Storage is taken offline until network connectivity is restored. |
Data Encryption and StorMagic SvKMS
If you have SvHCI/SvSAN Data Encryption (separately licensed), you may also wish to evaluate the feature.
SvHCI/SvSAN Data Encryption can be used with SvKMS, the KMIP-compliant key management solution from StorMagic. However, various third-party key management solutions are also supported and may be used instead of SvKMS; integration guides for these are provided.
See the Data encryption topic in the manual for how to configure and use the feature.
Once configured, test that you can:
- encrypt your VM disks/storage volumes presented by SvHCI/SvSAN
- rekey your VM disks/storage volumes presented by SvHCI/SvSAN.
SvHCI/SvSAN VSAs request new encryption keys from the key server with which to encrypt the storage. You can do this using either the VSA web GUI or StorMagic PowerShell Toolkit.
The SvHCI/SvSAN VSAs will continue to present the storage, without disruption, while rekeying in the background.
Continue to the next two sections to test failure scenarios when using SvHCI/SvSAN data encryption with StorMagic SvKMS.
'Lights on' failure scenarios with encryption
Scenario # | Scenario | Description | Procedure | Expected Outcome |
---|---|---|---|---|
1 | SvKMS* server disconnected | Communication between SvKMS and the SvHCI cluster is disrupted. | Power down the SvKMS VM. | SvHCI continues to run guest VMs, presenting storage, without issue. Event: Error – Failed to connect to key server. SvHCI status: Warning – Key server is not connected. |
2 | SvKMS* server reconnected | Communication between SvKMS and the SvHCI cluster is restored. | Power on the SvKMS VM. | SvHCI continues to run guest VMs, presenting storage, without issue. Event: Informational – Connected to key server. SvHCI status: Normal. |
3 | SvKMS* server disconnected, then an SvHCI node reboots | Communication between SvKMS and the SvHCI cluster is disrupted, then an SvHCI node reboots. | Power down the SvKMS VM, then reboot a VSA. | SvHCI continues to run guest VMs, presenting storage, without issue. Event: Error – Failed to connect to key server. SvHCI status: Warning – Key server is not connected. After one of the SvHCI nodes is rebooted: Event: Warning – Connection to node 'hpe-dl-325-gen11-02' was lost. Event: Warning – Mirrored target 'guestvm01disk01': plex 'hpe-dl-325-gen11-svhci02' is unsynchronized. The rebooted SvHCI node shows the storage (target) state as offline, as it was unable to re-obtain the keys: Disk/Target state: Offline (Locked). The surviving SvHCI node shows the VM storage (target) state as degraded: Disk/Target state: Degraded (Remote Locked). High availability has been lost but the environment is still online. Note that although the storage is locked it is still synchronized in this instance. |
4 | SvKMS* server reconnected | Communication between SvKMS and the SvHCI nodes is restored. | Power on the SvKMS VM. | The SvHCI nodes automatically re-establish connectivity to the SvKMS server and request/get keys. Volumes become unlocked. Event: Informational – Encrypted target 'guestvm01disk01' is online. Event: Informational – Connected to key server. High availability is restored. |
* or other key management provider
'Lights off' failure scenarios with encryption
This section details different failure scenarios that may occur in a production environment when using SvSAN data encryption with an SvKMS key server (any other supported key server may be used). In the case of dual failures, such as both hosts losing power, guest VMs are offline ('lights off').
Scenario # | Scenario | Description | Procedure | Expected Outcome |
---|---|---|---|---|
1 | SvKMS* server disconnected, both SvHCI nodes restarted due to environment power failure | Communication between SvKMS and the SvHCI cluster is disrupted. An environment power failure occurs, causing both SvHCI nodes to reboot. | Power down the SvKMS VM. Reboot both SvHCI nodes in the cluster. | Both SvHCI nodes lose access to the SvKMS server. Once power is restored, the hosts boot as per their BIOS settings and SvHCI boots. The guest VMs/storage are held offline because the SvHCI nodes are unable to get the keys from the SvKMS server. Disk/Target state: Offline (Locked). |
2 | SvKMS* server reconnected | Communication between SvKMS and the SvHCI nodes is restored. | Power on the SvKMS VM. | The SvHCI nodes re-establish connectivity to the SvKMS server and request/get keys. Volumes become unlocked. Event: Informational – Encrypted target 'guestvm01disk01' is online. Event: Informational – Connected to key server. Storage is online and high availability is restored. SvHCI status: Normal. |
* or other key management provider
Performance Benchmarking
This task describes how to measure the performance of the evaluation environment. A test tool called FIO is used to measure IOPS, MB/s and response times of the system using simulated workloads. FIO is open-source software; documentation and downloads are available at https://fio.readthedocs.io/en/latest/fio_doc.html
1. Create Ubuntu guest VMs, each with an operating system (OS) disk and a benchmark disk
2. Install FIO
3. Run the desired benchmarks (a sample invocation is sketched after this list)
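The sketch below is a minimal example of steps 2 and 3 run inside one of the Ubuntu guest VMs. The device path /dev/sdb for the benchmark disk, the 300-second runtime and the libaio I/O engine are illustrative assumptions and should be adjusted to your environment; the block size, queue depth and job count match the results table further down.

```bash
# Assumptions: the dedicated benchmark disk appears as /dev/sdb inside the guest
# (check with lsblk first) and the VM has package repository access.
# WARNING: writing directly to a raw block device destroys any data on it, so
# only point these jobs at the empty benchmark disk, never at the OS disk.

sudo apt-get update && sudo apt-get install -y fio   # step 2: install FIO

BENCH_DEV=/dev/sdb          # hypothetical benchmark disk
RUNTIME=300                 # seconds per test; adjust as required

# Step 3a: 4k random reads, queue depth 32, 4 jobs (matches the read rows below)
sudo fio --name=vm1-4vcpus-reads-32qd-4jobs \
         --filename="$BENCH_DEV" --rw=randread --bs=4k \
         --iodepth=32 --numjobs=4 --ioengine=libaio --direct=1 \
         --time_based --runtime="$RUNTIME" --group_reporting \
         --output-format=json --output=reads.json

# Step 3b: 4k random writes with the same queue depth and job count
sudo fio --name=vm1-4vcpus-writes-32qd-4jobs \
         --filename="$BENCH_DEV" --rw=randwrite --bs=4k \
         --iodepth=32 --numjobs=4 --ioengine=libaio --direct=1 \
         --time_based --runtime="$RUNTIME" --group_reporting \
         --output-format=json --output=writes.json
```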
The example DL325 Gen11 system demonstrates 5x Ubuntu guest VMs concurrently performing 4k random reads and writes at a queue depth of 32, each running 4 jobs:
Job name | Bandwidth (KB/s) | IOPS | Mean completion latency (µs) |
---|---|---|---|
vm1-4vcpus-reads-32qd-4jobs | 10796916 | 89,970 | 47292 |
vm1-4vcpus-writes-32qd-4jobs | 6756420 | 56,298 | 18761 |
vm2-4vcpus-reads-32qd-4jobs | 11442576 | 95,354 | 14155 |
vm2-4vcpus-writes-32qd-4jobs | 6373564 | 53,109 | 14671 |
vm3-4vcpus-reads-32qd-4jobs | 11512968 | 95,940 | 13008 |
vm3-4vcpus-writes-32qd-4jobs | 6832208 | 56,932 | 13997 |
vm4-4vcpus-reads-32qd-4jobs | 11507000 | 95,890 | 15186 |
vm4-4vcpus-writes-32qd-4jobs | 7169252 | 59,743 | 14190 |
vm5-4vcpus-reads-32qd-4jobs | 12496424 | 104,135 | 10437 |
vm5-4vcpus-writes-32qd-4jobs | 8008460 | 66,736 | 13201 |
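The table columns correspond to the per-job bandwidth, IOPS and mean completion latency that FIO reports. As a convenience, the sketch below extracts those three figures from FIO's JSON output; it assumes jq is installed and uses the field names produced by recent fio 3.x releases (jobname, read.bw in KiB/s, read.iops, read.clat_ns.mean), so verify them against your fio version.

```bash
# Hypothetical helper: summarise a fio JSON results file (e.g. reads.json from
# the earlier sketch) as "job name, bandwidth, IOPS, mean completion latency in µs".
# Swap .read for .write when summarising the write tests.
sudo apt-get install -y jq

jq -r '.jobs[] |
       [.jobname, .read.bw, .read.iops, (.read.clat_ns.mean / 1000)] |
       @tsv' reads.json
```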
See Also
https://stormagic.com/svsan/features/data-encryption/
https://support.stormagic.com/hc/en-gb/articles/5978263848861-SvHCI-SvSAN-Encryption