Unleashing performance benefits of the 4th generation Intel Xeon Scalable Processor family (Intel 4th Gen Xeon Scalable Processors)

Syed Zaidi
The SADA Engineering Blog
11 min read · Oct 18, 2023


Introduction of C3 VMs powered by Intel 4th Gen Xeon Scalable Processors

  • C3 VMs bring the enterprise-grade performance and reliability of the 4th Gen Intel Xeon Scalable Processor (code named Sapphire Rapids) to Google Compute Engine (GCE) and Google Kubernetes Engine (GKE) customers — all with industry-leading price-performance.
  • C3 VMs are GA with multiple flavors including c3-standard, c3-highcpu, c3-highmem, and c3-standard-lssd. We used different instance types based on the workload characteristics in our testing.
  • In our benchmarking, we discovered that C3 VMs deliver the following:

MongoDB:

  • Up to 84% more throughput vs C2 VMs
  • Up to 44% more throughput vs N2-Ice Lake VMs

Kafka:

  • Up to 45% more throughput vs C2 VMs
  • Up to 20% more throughput vs N2-Ice Lake VMs

Spark:

  • Up to 10% speed-up executing TPC-DS queries when comparing smaller C3 VMs against N2-Ice Lake VMs, using 8% fewer vCPUs

Cassandra:

  • Up to 13% more performance vs N2-Ice Lake VMs

Tool used for benchmarking

The Workload Services Framework (WSF) is a benchmarking framework developed by Intel to help define, deploy, and manage application workloads on Kubernetes.

Note: We tested 3 generations of CPUs: Cascade Lake, which is used for c2 machine types; Ice Lake, which is used for n2 machine types; and Intel 4th Gen Xeon Scalable Processors, which are used for c3 machine types.

This tool is used for proxy workload setup to test various technologies. More details can be found here.

MongoDB

MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas.

Description for testing benchmark

For our exploration, we ran the standard YCSB benchmark. The test uses 1 mongo worker (1 server container instance) and 1 client container instance. The worker and client instances are hosted on two different VMs of the same VM type. The test cases ran for 10 mins and used zstd compression with the MongoDB journal enabled.

Test cases

  • 100% Read
  • 100% Write
  • 70% Read + 30% Write
  • 90% Read + 10% Update
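As a rough illustration of how the four test cases above could be expressed, the sketch below maps them onto YCSB CoreWorkload proportion properties. It assumes that "Write" is modeled as inserts and that the 10-minute run length is enforced with `maxexecutiontime`; the actual workload files used in our testing may have differed.

```python
# Sketch: map the four test cases onto YCSB CoreWorkload proportions.
# Assumption: "Write" is modeled as inserts; the original setup may differ.
TEST_CASES = {
    "100% Read": {"readproportion": 1.0, "updateproportion": 0.0, "insertproportion": 0.0},
    "100% Write": {"readproportion": 0.0, "updateproportion": 0.0, "insertproportion": 1.0},
    "70% Read + 30% Write": {"readproportion": 0.7, "updateproportion": 0.0, "insertproportion": 0.3},
    "90% Read + 10% Update": {"readproportion": 0.9, "updateproportion": 0.1, "insertproportion": 0.0},
}


def to_properties(name, max_seconds=600):
    """Render one test case as YCSB workload-file text (10-minute cap)."""
    lines = [f"# {name}", f"maxexecutiontime={max_seconds}"]
    lines += [f"{key}={value}" for key, value in TEST_CASES[name].items()]
    return "\n".join(lines)


print(to_properties("90% Read + 10% Update"))
```

Keeping the proportions in one table makes it easy to confirm that each mix sums to 100% before a run is launched.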

Mongo benchmarks

Data visualization and trends

Throughput comparison

The speed of a database system is measured by the transaction throughput, expressed as a number of transactions per second. A transaction in a database refers to a sequence of one or more operations that are treated as a single, indivisible unit of work.

With our benchmarking tool, throughput was measured in operations per second (ops/sec). Throughput for each run was recorded by taking the median throughput for all executions of a given instance type.

Visualization trends

In our graphs, throughput is measured in operations per second (ops/sec).

Note: All measurements are relative to the baseline benchmark which is noted by the number 1.
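The median-then-normalize step described above can be sketched in a few lines. The sample values and the `relative_throughput` helper are hypothetical and only illustrate the arithmetic; they are not measured results.

```python
from statistics import median

# Hypothetical ops/sec samples per instance type; not measured results.
runs = {
    "c2-standard-4": [1000.0, 1040.0, 980.0],
    "c3-standard-4": [1800.0, 1850.0, 1790.0],
}


def relative_throughput(samples, baseline):
    """Median of each instance's runs, divided by the baseline's median."""
    medians = {name: median(values) for name, values in samples.items()}
    return {name: value / medians[baseline] for name, value in medians.items()}


relative = relative_throughput(runs, baseline="c2-standard-4")
for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x")
```

The baseline instance always comes out at 1, which is why the graphs mark the baseline with the number 1.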

Use case 1: Throughput comparison Gen to Gen for 4vCPUs

c2/c3 comparison for 4vCPUs

Intel 4th Gen Xeon Scalable Processors (c3) delivered higher throughput than Cascade Lake (c2) for an equivalent number of vCPUs across the MongoDB workloads we ran.

For read operations, Intel 4th Gen Xeon Scalable Processors had a throughput 1.84x better than the equivalent Cascade Lake instance. Intel 4th Gen Xeon Scalable Processors performed 1.48x better for write operations.

For our benchmark test with a write-to-read ratio of 3:7, Intel 4th Gen Xeon Scalable Processors had a throughput 1.47x better than Cascade Lake, and for the test with a read-to-update ratio of 90:10, it was 1.78x better than Cascade Lake.

Use case 2: Gen to Gen comparison of throughput for 8 vCPUs

Instance type comparison throughput

Intel 4th Gen Xeon Scalable Processors’ performance can also be seen when scaling from 4 vCPUs to 8 vCPUs. As observed in the graph below, Intel 4th Gen Xeon Scalable Processors (c3) outperformed both Ice Lake (n2) and Cascade Lake (c2) for each of our benchmarks while Ice Lake had a higher throughput than Cascade Lake.

Use case 3: Scaling up vCPUs comparison for throughput

4vCPU to 8vCPU throughput comparison

Scaling from 4vCPUs to 8vCPUs, we saw throughput increase approximately 2x across all benchmarks for our c3 instances. Our benchmark results indicate a strong linear scalability in throughput. Given the equivalent workload, throughput rose in proportion to increases in vCPUs in our configurations.
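One way to quantify "linear scalability" is to divide the observed speed-up by the ideal speed-up implied by the vCPU increase. The helper below is an illustrative sketch, with `scaling_efficiency` a hypothetical name; a value of 1.0 means perfectly linear scaling.

```python
def scaling_efficiency(throughput_small, throughput_large, vcpus_small, vcpus_large):
    """Observed speed-up divided by the ideal (linear) speed-up."""
    observed = throughput_large / throughput_small
    ideal = vcpus_large / vcpus_small
    return observed / ideal


# Relative throughput of 1.0 at 4 vCPUs and 2.0 at 8 vCPUs (the ~2x reported above).
print(scaling_efficiency(1.0, 2.0, 4, 8))
```

An efficiency noticeably below 1.0 would suggest a bottleneck other than CPU, such as disk or network.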

Kafka

Apache Kafka is a powerful framework that serves as a software bus for stream-processing. Developed as an open-source project by the Apache Software Foundation, Kafka is primarily written in Scala and Java. Its mission is clear: to offer a unified platform that excels in both high throughput and low-latency data feed management.

One of the most compelling use cases for Kafka is its ability to tackle operational tasks efficiently. Whether it’s collecting application logs or facilitating event streaming across diverse systems, frameworks, or platforms, Kafka proves its mettle.

For our exploration, we set up a three-node Kafka cluster running Kafka 3.2.0 with a replication factor of three. We used the kafka-producer-perf-test tool with the throughput setting at -1 (no throttling) to measure throughput and latency. Throughput is the rate at which the Kafka cluster can process and transfer messages or data records; latency is the time delay between an event being produced by a producer and that event being consumed by a consumer.
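For readers who want to try a similar run, the sketch below assembles an argument list for Kafka's bundled producer performance script. The broker address, topic name, record count, and record size are placeholder assumptions, not the values used in our tests, and the topic (with replication factor three) is assumed to exist already.

```python
import shlex


def producer_perf_command(bootstrap, topic, num_records, record_size):
    """Build an argument list for Kafka's bundled producer perf test."""
    return [
        "kafka-producer-perf-test.sh",
        "--topic", topic,
        "--num-records", str(num_records),
        "--record-size", str(record_size),
        "--throughput", "-1",  # -1 disables throttling
        "--producer-props", f"bootstrap.servers={bootstrap}",
    ]


# Placeholder values for illustration only.
cmd = producer_perf_command("broker-0:9092", "perf-test", 1_000_000, 1024)
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.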

Kafka benchmarks

Data visualization and trends

Throughput and Latency for each run were recorded by taking the median for all executions of a given instance type.

Visualizing trends

Use case 1: Gen to Gen comparison of throughput and latency

In the following graph, we’ve evaluated the performance of three distinct machines: c2 (Cascade Lake), n2 (Ice Lake), and c3 (Intel 4th Gen Xeon Scalable Processors), all of which featured 4 vCPUs. A noteworthy trend is the consistent improvement in throughput and reduction in latency across successive generations. It’s worth noting that these tests were conducted using JDK 17.

Intel 4th Gen Xeon Scalable Processors (c3) emerges as the clear winner, outperforming both Ice Lake (n2) and Cascade Lake (c2). When running workloads on 4 vCPUs with the c3-standard-4 configuration, we observed a median throughput 1.45 times greater than that of c2-standard-4.

Notably, Intel 4th Gen Xeon Scalable Processors c3 exhibited an average latency 22% lower than Cascade Lake, while Ice Lake boasted a latency 5% lower than our base Cascade Lake.

In our graphs, throughput is measured in megabytes per second (MB/s), while latency is measured in microseconds.

Note: All measurements are relative to the baseline benchmark which is noted by the number 1.

Instance type comparison (throughput & latency) — Throughput higher is better and latency lower is better

Latency and throughput comparison between Intel 4th Gen Xeon Scalable Processors c3, Cascade Lake c2, and Ice Lake n2 for 4vCPUs

Use case 2: Scaling up vCPUs comparison (throughput & latency) — tests are conducted using JDK 17

As with the MongoDB results, Kafka throughput scaled linearly with the number of vCPUs.

4vCPU to 8vCPU throughput and latency comparison (throughput higher is better and latency lower is better)

Intel 4th Gen Xeon Scalable Processors c3 comparison between 4vCPUs and 8vCPUs

The outcomes reveal a robust linear scalability pattern in both throughput and latency. As we increased the number of vCPUs within our configurations while maintaining the same workload, we observed a corresponding rise in throughput and a decline in latency. For instance, when scaling from 4vCPUs to 8vCPUs in our c3 instances, the median throughput improved by a factor of 2.05x. Similarly, scaling up to 16vCPUs yielded enhanced throughput across all instance types involved. Additionally, the latency benefits were notable, with our c3 instances experiencing a latency reduction of approximately 47% when transitioning from 4vCPUs to 8vCPUs.

Use case 3: JDK 11 vs JDK 17 throughput and latency comparison

Intel’s contributions to OpenJDK, including crypto accelerations, have significantly boosted the throughput of Kafka workloads. Consequently, we’ve observed substantial performance enhancements from JDK 11 to JDK 17.

JDK 11 vs JDK17 latency comparison

JDK 17 delivered lower latency than JDK 11. Benchmarks run against c3-standard-4 saw an average latency of roughly 0.90x the JDK 11 value (about 10% lower) when run with JDK 17.
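Because results in this post are reported both as relative values and as percentages, a small conversion helper can make the two easier to compare. This sketch assumes the 0.90 figure is a ratio against a JDK 11 baseline of 1; `percent_change` is a hypothetical helper.

```python
def percent_change(ratio):
    """Convert a relative value (baseline = 1) into a signed percent change."""
    return (ratio - 1.0) * 100.0


# 0.90x latency relative to the baseline, read as a percent change.
print(f"{percent_change(0.90):+.0f}%")
# 1.45x throughput relative to the baseline, read as a percent change.
print(f"{percent_change(1.45):+.0f}%")
```

For latency a negative change is an improvement, while for throughput a positive change is.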

Spark

Description of the benchmark

TPC-DS is a decision support benchmark that models several applicable aspects of a decision support system. This benchmark study using TPC-DS includes 103 queries. TPC-DS is widely used for big data analysis, answering real-world business questions via ad-hoc reporting, online analytical processing, data mining, and database maintenance functions. The key performance metric is latency (query execution time) in the Power test. TPC-DS queries measure the ability of the system to process the most queries in the least amount of time.

Test case

The test was performed with Data scale Factor/Storage (500 GB) across GCP Intel Cascade Lake (2nd gen), Ice Lake (3rd gen) and Intel 4th Gen Xeon Scalable Processors (4th Gen) instance types.

A workload is launched to provision a cluster with 1 Driver and 3 Worker Nodes. These nodes run Spark processes inside containers. The TPC-DS data generator is used to create Parquet files, which are then loaded into external tables on HDFS. Subsequently, queries are executed from these external tables. This experiment was done with one batch of queries and NOT stream (concurrent TPCDS).

The test case uses pd-balanced disk type for OS disk and external disk with minimum disk IOPS to be 10K. SPARK_EXECUTOR_CORES used is 4 across all use cases below with minimum SPARK_AVAILABLE_MEMORY = 70% of the total cluster memory.
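To give a feel for what these settings imply, the sketch below estimates how many 4-core executors fit on a 3-worker cluster and how much memory each would receive if 70% of cluster memory were split evenly. The even split, the 176 GB per-worker memory figure, and the `executor_layout` helper are illustrative assumptions; the framework's actual allocation logic may differ.

```python
def executor_layout(worker_vcpus, worker_memory_gb, workers=3,
                    executor_cores=4, memory_fraction=0.70):
    """Estimate executor count and per-executor memory for the cluster."""
    executors_per_worker = worker_vcpus // executor_cores
    total_executors = executors_per_worker * workers
    spark_memory_gb = worker_memory_gb * workers * memory_fraction
    return total_executors, spark_memory_gb / total_executors


# Assumed worker shape: 44 vCPUs and 176 GB of memory per node.
count, memory_each = executor_layout(worker_vcpus=44, worker_memory_gb=176)
print(count, round(memory_each, 1))
```

Holding executor cores constant across instance types means the executor count, rather than the executor size, changes with the vCPU count.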

Spark benchmarks

Data visualization and trends

Test execution time

In Spark, we measure the average test execution time by running 2 iterations of each test. Execution time is measured in minutes. The lower the execution time, the better the result.

Visualizing trends

In our graphs, the average execution time is measured in mins.

Note: All measurements are relative to the baseline benchmark which is noted by the number 1.

Use case 1: Gen to Gen (GCP Standard Intel 2nd/3rd/4th gen instances) comparison

Instance type comparison (execution time)

Intel 4th Gen Xeon Scalable Processors (c3) instances outperformed Cascade Lake (c2) and Ice Lake (n2) instances, completing the queries in less time.

As shown in the diagram, Intel 4th Gen Xeon Scalable Processors (4th gen) delivered a 25% speed-up executing TPC-DS queries compared with CLX (2nd gen) and a 12% better execution time compared with ICX (3rd gen).

Use case 2: Gen to Gen (GCP Standard Intel Ice Lake-3rd gen vs Intel 4th Gen Xeon Scalable Processors-4th gen) comparison

n2-standard-48 to c3-standard-44 comparison

As shown in the diagram, Intel 4th Gen Xeon Scalable Processors (4th gen) delivered a 10% speed-up executing TPC-DS queries compared with ICX (3rd gen), while using about 8% fewer vCPUs (44 vs 48).
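The vCPU difference between the two shapes compared here (48 vCPUs on n2 and 44 vCPUs on c3) can be checked with simple arithmetic. The `percent_fewer` helper is a hypothetical name used only for this sketch.

```python
def percent_fewer(baseline, candidate):
    """Percentage by which candidate is smaller than baseline."""
    return (baseline - candidate) / baseline * 100.0


print(f"{percent_fewer(48, 44):.1f}% fewer vCPUs")
```

This works out to roughly 8%, in line with the figure quoted in the introduction.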

Use case 3: Gen to Gen (GCP Highmem Intel Ice Lake-3rd gen vs Intel 4th Gen Xeon Scalable Processors-4th gen) comparison

n2-highmem-48 to c3-highmem-44 comparison

As shown in the diagram, Intel 4th Gen Xeon Scalable Processors (4th gen) delivered a ~10% speed-up executing TPC-DS queries compared with ICX (3rd gen), while using ~8% fewer vCPUs (44 vs 48).

Cassandra

Cassandra is a distributed and highly scalable NoSQL database that excels in handling large volumes of data with high availability and fault tolerance. It is widely used in applications where real-time data management and seamless scalability are critical, such as IoT (Internet of Things) platforms, time-series data analysis, and mission-critical systems.

Cassandra benchmarks

In this performance comparison, we pit Intel Ice Lake (ICX, n2) against Intel 4th Gen Xeon Scalable Processors (c3) on a 4-node Cassandra cluster. Our benchmarking uses the cassandra-stress tool, with all runs on JDK 11.

During our evaluation, we’ve rigorously tested two key usage scenarios: one focusing on read-intensive operations at 100%, and the other simulating a workload of 80% reads and 20% updates. These tests provide valuable insights into how ICX and Intel 4th Gen Xeon Scalable Processors handle Cassandra workloads differently.

Visualization trends

Use case:

As depicted in the chart, we’ve noted a noticeable performance uptick with each new generation. For the Read100% use case, the c3 performance exhibited a significant 13% increase, while for the Read80% Write20% use case, the c3 showed an 8% improvement in performance. These results highlight the incremental enhancements brought by each successive generation, contributing to improved overall performance.

Summary and key takeaways

As observed, Intel 4th Gen Xeon Scalable Processors C3 instances offered significant performance improvements versus both N2 Ice Lake instances, and C2 Cascade Lake instances, for all the applications tested. In addition to the significant performance improvements mentioned at the beginning of this blog post across the 4 tested workloads, here are some other noteworthy highlights:

  • For Kafka, we observed an increase in throughput and a decrease in latency when moving the same Intel 4th Gen Xeon Scalable Processors C3 instance configurations from JDK 11 to JDK 17. Throughput also increased, and average latency dropped by 22%, when using Intel 4th Gen Xeon Scalable Processors C3 instances instead of Cascade Lake C2 machines.
  • For Kafka, the largest gain over C2 we observed was with Intel 4th Gen Xeon Scalable Processors C3 instances at 4vCPUs: a median throughput 1.45x that of c2-standard-4 using Cascade Lake.
  • For MongoDB, we saw the performance of Intel 4th Gen Xeon Scalable Processors C3 instances outperform both Ice Lake N2 and Cascade Lake C2 instances for 8 vCPUs.
  • For MongoDB, Intel 4th Gen Xeon Scalable Processors C3 instances were significantly more performant than C2 instances for 4vCPUs.
  • For MongoDB, when increasing vCPU from 4 to 8, we observed throughput significantly increase (almost doubled) for Intel 4th Gen Xeon Scalable Processors C3 instances.
  • For Spark, we observed C3 Intel 4th Gen Xeon Scalable Processors outperformed both n2 (Ice Lake) and C2 (Cascade Lake) instances.
  • For Cassandra, C3 Intel 4th Gen Xeon Scalable Processors instances delivered 13% better performance than N2 instances in the Read100% use case and 8% better performance in the Read80% use case.

In conclusion, Intel 4th Gen Xeon Scalable Processors C3 instances offer a clear performance improvement against N2 Ice Lake and C2 Cascade Lake instances across the board and, in some cases, quite a significant performance boost. Also, the C3 instances are now a generally available product and highly available across GCP in 7 regions and 20 zones that will continue to grow for even broader availability. With these instances being broadly available across GCP, SADA customers can take advantage of the improved performance benefits today.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
