HPC processor design extends emulation


Tools to automate system-level content generation and verify compliance with system-level objectives will have broad applicability in many markets.

As semiconductor design teams strive to capitalize on “More than Moore”, new architecture options and new challenges are flourishing. Take hyperscaler hardware, where an array of workloads—database analytics, AI, microservices, video coding, and high-complexity computing algorithms—demands an array of processor solutions. Performance, power, and cost are still critical, but architects now deliver them in workload-specific ways. There is no “best” architecture; processors should be designed to best serve specific classes of workloads and price/performance profiles.

Multicore Architecture Challenges

The AWS Graviton2, for example, features 64 Arm Neoverse N1 cores tiled in a coherent mesh network on a single die. Other designs have already extended to multiple dies with cache-coherent connections between them. Multi-die implementations pave the way for further scaling, and for cost reduction by using less advanced processes. While these new architecture options expand the possibilities, they also create new design challenges. Among the many choices, which architectures will really deliver higher throughput for the right workloads at the right price?

One question here is how the distributed system cache in a coherent mesh network should be partitioned relative to physical memory for a target class of applications. Optimizing these choices, and even deciding which processor cores best fit the need, requires running real-world workloads with cycle-level precision. High-level models are simply not accurate enough for this purpose.

Figure: Different I/O latencies in a multi-die implementation. (Source: Cadence)

Communication latencies across a set of processors in a coherent mesh will be relatively uniform within a single die, but they can vary significantly from die to die in a multi-die implementation (see figure). As a result, designers are exploring a variety of architectures for future use: fully connected meshes, hub-and-spoke memory systems, and other 2D and 3D structures in which one chiplet provides a large system cache and access to main memory, while the other chiplets in the stack communicate with each other and with main memory through that hub.

Exploring all of these options depends on accurately modeling performance against realistic workloads. That modeling and analysis can only be done in the RTL domain, using emulation and prototyping.

SystemReady Compliance

Another type of issue for a server architect is operating system compatibility. On most laptops, you can boot any Linux distribution, hypervisor, or Windows right out of the box. Delivering the same out-of-the-box experience on Arm-based servers is a responsibility split between the server builders and Arm.

To overcome this and other compliance issues, Arm developed a compliance suite called SystemReady that standardizes a set of minimum requirements. PCIe compliance is a particularly important component because it directly provides or underpins core I/O for many server interface protocols: fast storage, fast networking, and consistent off-chip interfaces. Booting the server from a PCIe-attached device is of particular importance here. Arm provides this compliance suite as software running on the UEFI (BIOS) layer. Cadence has been working with Arm for several years to distill testing into a minimal bare-metal test suite built on a PCIe traffic-generation library, which emulates faster than the UEFI test suite and enables fast hardware debugging.

Another challenge for server developers is that PCIe uses a strongly ordered memory model, while Arm implementations use the more weakly ordered model that the standard also allows. Only strong ordering guarantees the absence of deadlocks by construction; with weak ordering, the hardware/firmware developer must provide that guarantee. Unfortunately, this cannot be verified by compliance testing. The integrator must prove that the design is deadlock-free through extensive use-case testing, again on an emulator or prototyping system.

An approach using Cadence System Verification IP allows engineers to have, within half a day, a working system-level test suite that can validate PCIe integration against SystemReady requirements. This methodology can also be used to demonstrate booting SUSE Linux and Windows from a PCIe-attached flash memory device model, which is generating a lot of interest in the advanced server community.

Not just for servers

Arm Neoverse platforms aren’t just designed for high-end servers. The family is already moving to other cloud applications and communications infrastructure, all the way to the edge. In some of these applications, multi-core architectures are already important. In most of these applications (automotive for example), out-of-the-box support for a range of open and commercial operating systems is essential.

I believe that tools to automate system-level content generation and verify compliance with system-level goals will have wide applicability in many markets. The EDA industry must think beyond the traditional scope of single-interface, single-protocol Verification IP (VIP) to a new era of system-level, multi-interface, multi-protocol VIP.

Paul Cunningham is senior vice president and general manager of the system verification group at Cadence Design Systems. His product responsibilities include logic simulation, emulation, prototyping, formal, VIP and debugging. Previously, he was responsible for Cadence’s front-end digital design tools, including logic synthesis and design for test. Paul joined Cadence in 2011 following the acquisition of Azuro, a startup developing technologies for simultaneous physical optimization and useful clock tree synthesis, of which he was co-founder and CEO. Paul holds a master’s degree and a doctorate in Computer Science from the University of Cambridge, UK.
