
To SERT or not to SERT? Interrogating SPEC benchmarking in the lab

Rich Kenny | Sept. 16, 2022


A key to the success of Interact is its ability to integrate disparate sources of data. We are the only company with access to a hardware database that covers thousands of makes, models, types, generations and configurations of server. We marry this with benchmark data from SERT (which measures performance against energy draw) and our own component-level analysis in the lab, which refines this picture by adding more detail on systems that are memory heavy, for example.

Of these three things, only the SERT data comes from a third party. We wanted to interrogate this choice with lab work measuring the draw directly from the PDU rather than the power analyser, and comparing this to tests run using other benchmarks and benchmark suites. To put it in mathematical terms, we wanted to prove the concept from first principles rather than just rely on the equation. We are particularly interested in the CPU because, at a 65% weighting, it accounts for the majority of the server's energy draw.

What is SERT and how does it work?

SERT by SPEC Power is the de facto industry standard when it comes to benchmarking servers. It is used by all the major OEMs (Dell, Cisco, HPE, IBM, Fujitsu) to provide ENERGY STAR ratings for servers and is the metric we use for the data behind Interact's machine learning model. It was created by the Standard Performance Evaluation Corporation (SPEC) and works by measuring transactional workloads (single computational operations) per watt and giving a total score. It makes no distinction between types of operation, just the number of them.

However, it does make a distinction about which components of the server are used, and it introduces the idea of sample workloads, which it generates in software. Examples are the Compress workload (mimicking GIF generation and stressing the CPU) and the Sequential workload (mimicking file storage and retrieval and stressing the storage components). After it runs all these workloads, it aggregates the scores: CPU workloads are given a 65% weighting on the energy draw, memory 30%, and storage 5%. The aggregated scores are used to compare one server to another.
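
To make the weighting concrete, here is a minimal Python sketch of how a SERT-style score could be aggregated. It assumes each worklet reduces to a single performance-per-watt number and that component scores combine as a weighted sum of geometric means; the real SERT aggregation is more involved, so treat the function and the figures below as illustrative only.

    from math import prod

    # SERT component weightings as described above
    WEIGHTS = {"cpu": 0.65, "memory": 0.30, "storage": 0.05}

    def geomean(scores):
        return prod(scores) ** (1 / len(scores))

    def sert_style_score(perf_per_watt):
        # perf_per_watt maps component -> list of worklet perf/watt scores
        return sum(WEIGHTS[c] * geomean(s) for c, s in perf_per_watt.items())

    # Hypothetical worklet scores (transactions per watt), for illustration
    print(sert_style_score({
        "cpu": [42.0, 39.5, 44.1],   # e.g. Compress, CryptoAES, LU
        "memory": [18.2, 20.7],      # e.g. Flood, Capacity
        "storage": [6.3, 5.9],       # e.g. Sequential, Random
    }))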


Alternative benchmarks available

As mentioned above, SERT is the industry standard. However, there are other pieces of software that stress test different components in different ways. The Phoronix Test Suite, with its OpenBenchmarking companion platform, is a Linux testing and benchmarking software platform that caters to a much wider array of IT hardware. It is a collaborative platform that allows users to share their hardware and software benchmarks through an organised online interface. Specific tests relating to servers include the following (a sketch of how we can script these runs appears after the list):

    • Ebizzy (designed to replicate web server workloads)

    • Perl (designed to test CPU workloads and performance)

    • OpenSSL (testing the performance of cryptographic algorithms)

    • PHP benchmark (exercising code paths such as XML parsing, JSON generation and common real-world operations)

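The Phoronix Test Suite exposes a command-line interface, so a short Python wrapper can run the server profiles back to back and record each test window for lining up against PDU readings. The profile names below are the ones published on OpenBenchmarking.org; check availability on your installation before relying on them.

    import subprocess
    import time

    # OpenBenchmarking.org profiles matching the tests listed above
    PROFILES = ["pts/ebizzy", "pts/perl-benchmark", "pts/openssl", "pts/phpbench"]

    for profile in PROFILES:
        start = time.time()
        # batch-benchmark runs non-interactively using pre-set batch options
        subprocess.run(["phoronix-test-suite", "batch-benchmark", profile], check=True)
        print(f"{profile} finished after {time.time() - start:.0f}s")
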
In addition to the Phoronix Test Suite, the Intel Processor Diagnostic Tool enables researchers to run test workloads on a server whilst measuring energy draw. There are also several software options available to test CPU performance to the maximum point it can cope with. The Intel Burn Test is a piece of stress-testing software, developed by the chip manufacturer, that pushes the CPU past its rated load. Prime95 runs the CPU to breaking point with a transactional workload that searches for prime numbers.
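
To illustrate what a ‘torture test’ does, here is a minimal Python sketch of the idea behind Prime95-style tools: saturate every core with arithmetic, in this case a naive prime count. The real tools use much heavier number-crunching (large FFT multiplications, Linpack kernels), so this is a toy stand-in, not a substitute.

    import time
    from multiprocessing import Pool, cpu_count

    def prime_hunt(seconds=60):
        # Naive trial-division prime count: pure CPU load for `seconds`
        n, found, end = 2, 0, time.time() + seconds
        while time.time() < end:
            if all(n % d for d in range(2, int(n ** 0.5) + 1)):
                found += 1
            n += 1
        return found

    if __name__ == "__main__":
        # One worker per logical core keeps the whole package busy
        with Pool(cpu_count()) as pool:
            print(pool.map(prime_hunt, [60] * cpu_count()))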

Testing benchmarks in the lab

Running the SERT benchmark tests in the lab requires the use of a prescribed power analyser that has been approved by SPEC. Whilst this is the approach we usually take with in-house benchmarking, we chose to monitor the PDU instead for this set of experiments, as it provided a level playing field for all the pieces of software.
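
For reference, here is a sketch of what PDU-side monitoring can look like. Metered PDUs typically expose readings over SNMP or a vendor REST API; the URL and JSON field below are hypothetical placeholders, not the actual interface of the PDU we used.

    import json
    import time
    import urllib.request

    # Hypothetical endpoint; substitute your PDU's actual REST API or SNMP OID
    PDU_URL = "http://pdu.lab.local/api/outlet/4/power"

    def read_watts():
        with urllib.request.urlopen(PDU_URL) as resp:
            return json.load(resp)["watts"]  # hypothetical response field

    # One reading per second for an hour, written out as CSV
    with open("pdu_log.csv", "w") as log:
        for _ in range(3600):
            log.write(f"{time.time():.0f},{read_watts()}\n")
            time.sleep(1)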

We used a PowerEdge R630 with a Xeon E5-2697 v3 processor and 4 x 16GB of DDR4 2133MHz memory. This configuration was chosen because it is a popular one in the market. A consistent temperature of 25°C and a pressure of 5Pa were achieved and maintained with the use of a wind tunnel, well within ASHRAE guidance. The server was in efficiency mode and the tests were run up to maximum utilisation.

Results  

The graph shows each of the SERT worklets and how they performed in this environment, alongside the maximum energy draw generated by the alternative tools. The two ‘torture tests’, Prime95 and the Intel Burn Test, exceeded the power consumption of SSJ (SERT’s server-side Java CPU worklet) by 8% and 14% respectively. However, this is to be expected from benchmarks designed to take the CPU past its maximum capacity. Capacity3 also exceeded the SERT tests, but only by 1-2%. Compress, CryptoAES and LU were also within 1% of SSJ power consumption. The greatest power draw came from the Intel Processor Diagnostic Tool at 462W, or 14.5% above SSJ. In short, the alternatives to SERT drew either slightly higher or equivalent energy at the PDU.
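
Working backwards from the figures above: only the Intel Processor Diagnostic Tool result is given in absolute terms (462W, 14.5% above SSJ), which implies an SSJ baseline of roughly 462 / 1.145 ≈ 404W. A few lines of Python make the percentage arithmetic explicit; note that the baseline is inferred rather than measured.

    # SSJ baseline implied by the 462 W / +14.5% figure (approximately 404 W)
    ssj_watts = 462 / 1.145

    def pct_above_ssj(watts):
        return 100 * (watts - ssj_watts) / ssj_watts

    print(f"{pct_above_ssj(462):.1f}% above SSJ")  # ~14.5%, as reported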

Having conducted the tests, we can say that the SERT benchmarks are directly comparable to a large sample of other benchmarks on the market. We also see that readings taken with a power analyser and readings taken at PDU level are comparable, which validates the SERT methodology. It was a useful piece of work because looking at different benchmarks provided an increased range of scenarios for server performance and energy draw; even so, extending the benchmarks demonstrated similar power draw across the operations. The testing has reassured us that the SERT benchmarks are comparable and realistic indicators of both energy use and capability. The lower outliers can be explained by the fact that they test less CPU-intensive processes.

As a “first principles” kind of company, we found this a great piece of work to complete, testing the tools we have chosen to use.