This article is intended for administrators who wish to understand and test the failure tolerance of StorMagic SvHCI in a 2-node configuration with a shared witness.
Resolution/Information
Introduction
This Evaluator’s Guide describes how to perform SvHCI failure testing and benchmarking using example hardware and an example SvHCI configuration.
The first section describes configuring and deploying the infrastructure, including:
- SvHCI hosts and clusters
- Configuring the hardware
- Deploying SvHCI
- Configuring the virtual networking
- Clustering the systems with a witness
- Creating a test VM
- Configuring SvHCI event notifications
The next sections describe various failure scenarios. Each scenario verifies that the systems are configured to provide the highest possible level of redundancy and explains the expected outcome of each failure. The scenarios are organized into two categories, 'lights on' and 'lights off':
- 'Lights on' – in single failure scenarios, VMs remain running due to the resiliency and redundancy designed into the system.
- 'Lights off' – in dual failure scenarios, such as both hosts losing power, VMs go offline.
There is a section outlining how to set up SvSAN data encryption with StorMagic SvKMS key management, followed by various data encryption failure scenarios.
There are performance benchmarking sections: one for a system that does not utilize SSD or memory caching, and one for testing performance with caching.
Finally, there is a section on troubleshooting and one on how to roll back the system if required.
StorMagic Support may be contacted at support@stormagic.com or at our support suite at http://support.stormagic.com.
Infrastructure configuration and deployment
The example below uses HPE DL325 Gen11 systems:
Model: ProLiant DL325 Gen11
- 1x AMD EPYC 9224 2.5GHz 24-core 200W Processor for HPE
-- Logical Processors: 48
- 8x32GB RAM = 256GB RAM (DDR5)
- Hypervisor + VSA
-- HPE NS204i-u Gen11 NVMe Hot Plug Boot Optimized Storage Device
- SAN Storage
-- HPE MR216i-o Gen11 x16 Lanes without Cache OCP SPDM Storage Controller
--- 5x HPE 1.92TB SATA 6G Read Intensive SFF BC Multi Vendor SSD
Install SvHCI - https://support.stormagic.com/hc/en-gb/articles/18858860181917-SvHCI-Installation
Complete the first boot wizard - https://support.stormagic.com/hc/en-gb/articles/18860442155549-SvHCI-First-Boot-Wizard
Configure the virtual networking - https://support.stormagic.com/hc/en-gb/articles/18860678011293-SvHCI-Configure-networking
Deploy a Witness - https://support.stormagic.com/hc/en-gb/articles/4418754965777-StorMagic-SvHCI-SvSAN-Witness-Deploy-Install-Upgrade-and-Migrate
Create the Cluster - https://support.stormagic.com/hc/en-gb/articles/18860768070557-SvHCI-Cluster-Creation
Create the Storage Pool - https://support.stormagic.com/hc/en-gb/articles/18860841823389-SvHCI-SvSAN-Configure-a-Storage-Pool
Create 5x benchmark Ubuntu VMs - https://support.stormagic.com/hc/en-gb/articles/18860989386781-SvHCI-Linux-Guest-VM-Creation
- Each with a boot disk and around 100GB of benchmark disk space
Video Resources
The webinar below demonstrates some of the following scenarios live:
Hardware failure (HA Failover)
WAN Drop (witness connectivity failure)
Storage Failure
Cluster down
‘Lights on’ failure scenarios
This section details different failure scenarios that may occur in a production environment. Each scenario verifies that the systems are configured to provide the highest possible level of redundancy and explains the expected outcome of each failure. In the case of single failures, resiliency and redundancy are designed into the system such that the guest VMs remain running or complete an HA/FT failover (‘lights on’).
Scenario # | Scenario | Description | Procedure | Expected Outcome |
---|---|---|---|---|
1 | SvHCI node offline | Due to maintenance (firmware updates, configuration changes, etc.) or failure of the underlying storage that the node runs on. | Power off one node. | Production continues to be served by the surviving SvHCI node. SvHCI initiates the restart of Protected guest VMs on the remaining SvHCI node. Guest VMs that were previously running on the surviving node continue undisrupted. |
2 | SvHCI node experiences storage failure | The underlying storage in the host has failed: multiple disks have been lost in a RAID set (hardware or software RAID), the RAID controller has failed, etc. | Fail or pull multiple disks in a RAID set. | This depends on the hardware storage configuration of the SvHCI node. If SvHCI is deployed to separate storage, i.e. a boot RAID 1 with a pool RAID 5, SvHCI marks the affected VM disks/targets as Storage Failed, with storage continuing to be served from the node via storage traffic proxying. If SvHCI is on the same RAID set that has failed, the SvHCI node goes offline and the cluster restarts the Protected guest VMs on the surviving SvHCI node. When the failed storage is operational again it can be recovered through the SvHCI web GUI and a full synchronization is initiated. |
3 | Mirror link failure | Failure of the mirror link interface. | Disconnect the physical cable from the host, or disconnect the node virtual network adapter from the mirror link interface. | Mirror traffic continues over another available link, but performance may be affected. |
4 | Witness connectivity lost | Loss of communication with the witness service, e.g. the service is stopped. | Stop the witness service, power down the witness host server, or disconnect the network link to the witness. | The mirror remains up with no disruption; however, the cluster is then vulnerable to a further failure. |
5 | SvHCI node failure followed by witness loss | The SvHCI node is powered off. The surviving SvHCI node then experiences connectivity issues to the witness. | Shut down one SvHCI node, then stop the witness service or disconnect the link to the witness. | Guest VM service remains online throughout on the surviving SvHCI node, even though it is isolated. The witness failure does not affect production due to the order of the failures: node, then witness. |
6 | SSD cache failure – single SSD or RAID 0 | In use is either a single SSD drive with no RAID protection or multiple SSDs in a RAID 0. Pulling a drive results in the storage cache pool failing/going offline. | Remove the physical cache drive from an SvHCI node. | The cache on the failed side is marked ‘storage failed’. Event: Error – SSD cache storage has failed. |
7 | SSD cache failure – RAID 1 and above | A disk in the cache RAID fails (hardware or software RAID). | Fail or pull a single disk in the RAID-protected disk set. | The hardware or software RAID ensures the volume is available throughout the failure and SvHCI is unaffected. |
8 | Cluster split | An SvHCI node fails and is offline, e.g. a server system board failure, and a new server is shipped to site. Split the cluster, non-disruptively, and re-enable HA/VM Protection. | Power off SvHCI2; any Protected guest VMs on this node fail over. From the Discovery page, split the cluster. Power SvHCI2 back on and clean down any old configuration if reusing the same host (this is necessary because the split was performed while this node was offline). Rejoin it to a cluster and Protect the guest VMs. | After unprotecting the guest VMs and breaking any other mirrored storage, e.g. ISOs, the cluster can be split non-disruptively. |
‘Lights off’ failure scenarios
This section details different failure scenarios that may occur in a production environment. Each scenario verifies that the systems are configured to provide the highest possible level of redundancy and explains the expected outcome of each failure. In the case of dual failures, such as both hosts losing power, guest VMs go offline ('lights off').
Scenario # | Scenario | Description | Procedure | Expected Outcome |
---|---|---|---|---|
1 | Witness lost followed by SvHCI node failure | The witness service is stopped, or communication to the witness is disrupted for both nodes; the environment continues to run uninterrupted. A node is then taken offline. | Stop the witness service, then power down one node. | The surviving node is isolated and unable to determine the mirror state, so it takes the storage offline to protect the data and fail safe. A ‘loss of quorum’ event is posted: Event: Error – Mirrored target 'guestvm01disk01' was taken offline due to loss of quorum. |
2 | Dual hypervisor host failure (cluster down) | The two hosts fail simultaneously or in sequence, e.g. an environment power failure. | Force reset or pull the power cables of both SvHCI hosts. | With the SvHCI hosts back online, the targets automatically establish leadership and start a resynchronization. This may be a quick resynchronization of just the changed I/O if possible, or a full resynchronization if write I/O was occurring at the time of failure; this is to fail safe and ensure data is consistent. The storage is available and online during either a quick or a full resynchronization, with a full resynchronization resuming in the event of a further failure. |
3 | Full network failure with redundancy lost | All network communications are lost simultaneously. Network resiliency is best practice; a real-world scenario might be running all networking through a single switch, in which case neither the hosts nor the VMs are accessible on the network. | Remove all networking from both hosts at the same time. | The SvHCI nodes are unable to establish quorum as they cannot communicate with each other or the witness. Storage is taken offline until network connectivity is restored. |
Data Encryption and StorMagic SvKMS
If you have SvHCI/SvSAN Data Encryption (separately licensed), you may also wish to evaluate the feature.
SvHCI/SvSAN Data Encryption can be used with SvKMS, the KMIP-compliant key management solution from StorMagic. However, various third-party key management solutions are also supported and may be used instead of SvKMS; integration guides for these are provided.
See the Data encryption topic in the manual for how to configure and use the feature.
Once configured, test that you can:
- encrypt your VM disks/storage volumes presented by SvHCI/SvSAN
- rekey your VM disks/storage volumes presented by SvHCI/SvSAN.
SvHCI/SvSAN VSAs request new encryption keys from the key server with which to encrypt the storage. You can do this using either the VSA web GUI or StorMagic PowerShell Toolkit.
The SvHCI/SvSAN VSAs will continue to present the storage, without disruption, while rekeying in the background.
Continue to the next two sections to test failure scenarios when using SvHCI/SvSAN data encryption with StorMagic SvKMS.
'Lights on' failure scenarios with encryption
Scenario # | Scenario | Description | Procedure | Expected Outcome |
---|---|---|---|---|
1 | SvKMS* server disconnected | Communication between SvKMS and the SvHCI cluster is disrupted. | Power down the SvKMS VM. | SvHCI continues to run guest VMs, presenting storage, without issue. Event: Error – Failed to connect to key server. SvHCI status: Warning – Key server is not connected. |
2 | SvKMS* server reconnected | Communication between SvKMS and the SvHCI cluster is restored. | Power on the SvKMS VM. | SvHCI continues to run guest VMs, presenting storage, without issue. Event: Informational – Connected to key server. SvHCI status: Normal. |
3 | SvKMS* server disconnected, then an SvHCI node reboots | Communication between SvKMS and the SvHCI cluster is disrupted, then an SvHCI node reboots. | Power down the SvKMS VM, then reboot a VSA. | SvHCI continues to run guest VMs, presenting storage, without issue. Event: Error – Failed to connect to key server. SvHCI status: Warning – Key server is not connected. After one of the SvHCI nodes is rebooted: Event: Warning – Connection to node 'hpe-dl-325-gen11-02' was lost. Event: Warning – Mirrored target 'guestvm01disk01': plex 'hpe-dl-325-gen11-svhci02' is unsynchronized. The rebooted SvHCI node shows the storage (target) state as offline, as it was unable to re-obtain the keys: Disk/Target state: Offline (Locked). The surviving SvHCI node shows the VM storage (target) state as degraded: Disk/Target state: Degraded (Remote Locked). High availability has been lost but the environment is still online. Note that although the storage is locked it is still synchronized in this instance. |
4 | SvKMS* server reconnected | Communication between SvKMS and the SvHCI nodes is restored. | Power on the SvKMS VM. | The SvHCI nodes automatically re-establish connectivity to the SvKMS server and request/get keys. Volumes become unlocked. Event: Informational – Encrypted target 'guestvm01disk01' is online. Event: Informational – Connected to key server. High availability is restored. |
* or other key management provider
'Lights off' failure scenarios with encryption
This section details different failure scenarios that may occur in a production environment when using SvSAN data encryption with an SvKMS key server (any other supported key server may be used). In the case of dual failures, such as both hosts losing power, guest VMs are offline ('lights off').
Scenario # | Scenario | Description | Procedure | Expected Outcome |
---|---|---|---|---|
1 | SvKMS* server disconnected, both SvHCI nodes restarted due to environment power failure | Communication between SvKMS and the SvHCI cluster is disrupted. An environment power failure occurs, causing both SvHCI nodes to reboot. | Power down the SvKMS VM. Reboot both SvHCI nodes in the cluster. | Both SvHCI nodes lose access to the SvKMS server. Once power is restored, the hosts boot as per their BIOS settings and SvHCI boots. The guest VMs/storage are held offline because the SvHCI nodes are unable to get the keys from the SvKMS server. Disk/Target state: Offline (Locked). |
2 | SvKMS* server reconnected | Communication between SvKMS and the SvHCI nodes is restored. | Power on the SvKMS VM. | The SvHCI nodes re-establish connectivity to the SvKMS server and request/get keys. Volumes become unlocked. Event: Informational – Encrypted target 'guestvm01disk01' is online. Event: Informational – Connected to key server. Storage is online and high availability is restored. SvHCI status: Normal. |
* or other key management provider
Performance Benchmarking
This task describes how to measure the performance of the evaluation environment. A test tool called FIO is used to measure IOPS, MB/s and response times of the system using simulated workloads. FIO is open-source software; documentation and downloads are available at https://fio.readthedocs.io/en/latest/fio_doc.html
1. Create Ubuntu guest VMs, each with an operating system (OS) disk and a benchmark disk
2. Install FIO
3. Run the desired benchmarks (a sample invocation is sketched after this list)
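The sketch below is a minimal example of steps 2 and 3 run inside one of the Ubuntu guest VMs. The device path /dev/sdb for the benchmark disk, the 300-second runtime and the libaio I/O engine are illustrative assumptions and should be adjusted to your environment; the block size, queue depth and job count match the results table further down.

```bash
# Assumptions: the dedicated benchmark disk appears as /dev/sdb inside the guest
# (check with lsblk first) and the VM has package repository access.
# WARNING: writing directly to a raw block device destroys any data on it, so
# only point these jobs at the empty benchmark disk, never at the OS disk.

sudo apt-get update && sudo apt-get install -y fio   # step 2: install FIO

BENCH_DEV=/dev/sdb          # hypothetical benchmark disk
RUNTIME=300                 # seconds per test; adjust as required

# Step 3a: 4k random reads, queue depth 32, 4 jobs (matches the read rows below)
sudo fio --name=vm1-4vcpus-reads-32qd-4jobs \
         --filename="$BENCH_DEV" --rw=randread --bs=4k \
         --iodepth=32 --numjobs=4 --ioengine=libaio --direct=1 \
         --time_based --runtime="$RUNTIME" --group_reporting \
         --output-format=json --output=reads.json

# Step 3b: 4k random writes with the same queue depth and job count
sudo fio --name=vm1-4vcpus-writes-32qd-4jobs \
         --filename="$BENCH_DEV" --rw=randwrite --bs=4k \
         --iodepth=32 --numjobs=4 --ioengine=libaio --direct=1 \
         --time_based --runtime="$RUNTIME" --group_reporting \
         --output-format=json --output=writes.json
```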
The example DL325 Gen11 system demonstrates 5x Ubuntu guest VMs concurrently performing 4k random reads and writes at a queue depth of 32, each running 4 jobs:
Job name | Bandwidth (KB/s) | IOPS | Mean completion latency (µs) |
---|---|---|---|
vm1-4vcpus-reads-32qd-4jobs | 10796916 | 89,970 | 47292 |
vm1-4vcpus-writes-32qd-4jobs | 6756420 | 56,298 | 18761 |
vm2-4vcpus-reads-32qd-4jobs | 11442576 | 95,354 | 14155 |
vm2-4vcpus-writes-32qd-4jobs | 6373564 | 53,109 | 14671 |
vm3-4vcpus-reads-32qd-4jobs | 11512968 | 95,940 | 13008 |
vm3-4vcpus-writes-32qd-4jobs | 6832208 | 56,932 | 13997 |
vm4-4vcpus-reads-32qd-4jobs | 11507000 | 95,890 | 15186 |
vm4-4vcpus-writes-32qd-4jobs | 7169252 | 59,743 | 14190 |
vm5-4vcpus-reads-32qd-4jobs | 12496424 | 104,135 | 10437 |
vm5-4vcpus-writes-32qd-4jobs | 8008460 | 66,736 | 13201 |
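The table columns correspond to the per-job bandwidth, IOPS and mean completion latency that FIO reports. As a convenience, the sketch below extracts those three figures from FIO's JSON output; it assumes jq is installed and uses the field names produced by recent fio 3.x releases (jobname, read.bw in KiB/s, read.iops, read.clat_ns.mean), so verify them against your fio version.

```bash
# Hypothetical helper: summarise a fio JSON results file (e.g. reads.json from
# the earlier sketch) as "job name, bandwidth, IOPS, mean completion latency in µs".
# Swap .read for .write when summarising the write tests.
sudo apt-get install -y jq

jq -r '.jobs[] |
       [.jobname, .read.bw, .read.iops, (.read.clat_ns.mean / 1000)] |
       @tsv' reads.json
```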
See Also
https://stormagic.com/svsan/features/data-encryption/
https://support.stormagic.com/hc/en-gb/articles/5978263848861-SvHCI-SvSAN-Encryption