
AI/ML was Ready for its Closeup at OFC 2024

By: Asim Rasheed


Attendees focused on the impact of AI/ML on high-speed networking as new feats in scale, bandwidth, and low latency and jitter drove excitement. Now, all eyes are on real-world traffic emulation, interoperability and performance testing to drive success.

The four busiest letters at this year’s Optical Fiber Communications Conference (OFC) were by far AI/ML. Booths touted new marketing and fresh technical capabilities at every turn, with chipset, transceiver, cable, equipment and testing vendors all getting in on the action.

While it’s always healthy to be wary of hype, the general sentiment was that AI/ML can truly generate growth and demand for bandwidth. And no one wants to miss out on the action.

It’s evident that the ChatGPTs of the world are completely reliant on the building blocks being championed by the OFC ecosystem to meet massive scale targets. Existing network technologies simply don’t have what it takes to make the envisioned AI/ML future possible.

Let’s drill down a bit on why.

AI models are growing in complexity by roughly 1,000 times every three years, and network traffic is growing with them. There is also an unprecedented need for low-latency, high-bandwidth connectivity between servers, storage, and GPUs for AI training and inferencing. Projected AI demand for bandwidth and processing is pushing networks past the terabit threshold.
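To put that growth rate in perspective, 1,000x every three years works out to roughly 10x per year. A quick back-of-the-envelope sketch (the 100 Gbps starting load is a hypothetical, chosen purely for illustration):

```python
# 1,000x growth over three years implies ~10x per year,
# since 1000 ** (1/3) = 10.
years = 3
total_growth = 1_000
annual_growth = total_growth ** (1 / years)
print(f"Implied annual growth: ~{annual_growth:.1f}x per year")

# If traffic scales the same way, a link loaded at 100 Gbps today
# (a hypothetical starting point) crosses the terabit threshold
# in about a year at this rate.
rate_gbps = 100
for year in range(1, years + 1):
    rate_gbps *= annual_growth
    print(f"Year {year}: ~{rate_gbps:,.0f} Gbps")
```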

To keep pace with exploding workloads and clusters of thousands of GPUs, data centers need to be rearchitected. Simply adding more server racks or fiber runs won't do the job. Handling large AI workloads requires a scalable, routable, high-speed back-end network for GPU-to-GPU connectivity.

Progress was on display at OFC, though challenges with delays were lurking too. Read on for our take on key themes emerging from this year’s event and what comes next.

Road to 1.6T Ethernet

IEEE 802.3dj will usher in the 1.6T Ethernet era using 200G-per-lane technology, doubling per-lane bitrates over today's 100G electrical lanes. However, initial adoption of 1.6T and 200G lanes may slip. Reasons include delays in the IEEE 802.3dj schedule itself, a near-term focus on deploying 800G for AI/ML, and business and technical risk management by chip suppliers.

A hot topic among OFC attendees was alternative paths to 1.6T. Instead of doubling the per-lane rate, the number of electrical lanes can be doubled from 8 to 16 using the OSFP-XD form factor. OSFP-XD offers 1.6T density with 16 lanes of 100G and the potential for 3.2T with 16 lanes of 200G. Sixteen lanes of 100G could be an option for hyperscalers; alternatively, delays in 1.6T may simply stimulate 800G adoption. The market has not yet spoken in favor of OSFP-XD.
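All of the rate options in play reduce to the same lane arithmetic: aggregate rate equals electrical lanes times per-lane rate. A minimal sketch enumerating the configurations above (treat it as illustration, not a spec reference):

```python
# Aggregate rate = electrical lanes x per-lane rate.
configs = [
    ("802.3df 800G", 8, 100),
    ("802.3dj 1.6T", 8, 200),
    ("OSFP-XD 1.6T", 16, 100),
    ("OSFP-XD 3.2T", 16, 200),
]

for name, lanes, per_lane in configs:
    print(f"{name}: {lanes:2d} lanes x {per_lane}G = {lanes * per_lane}G aggregate")
```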

Another factor is power consumption. A major advantage of 1.6T with 200G lanes is reduced power and cost per bit. That is critical for hyperscalers as they invest billions of dollars in data center supercomputers housing thousands of AI/ML GPUs. One hyperscaler is even locating a new data center next to a nuclear power plant to minimize power costs.
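To see why per-bit power matters, consider the arithmetic. The sketch below uses hypothetical module wattages (placeholders, not vendor figures) to show how doubling the rate without doubling the power cuts picojoules per bit:

```python
# Illustrative energy-per-bit arithmetic; wattages are hypothetical.
def pj_per_bit(power_watts: float, rate_gbps: float) -> float:
    # W / (Gbps * 1e9 b/s) = J/bit; * 1e12 converts to pJ/bit,
    # which simplifies to watts / rate_gbps * 1000.
    return power_watts / rate_gbps * 1_000

for name, watts, gbps in [("800G module (assume 16 W)", 16.0, 800),
                          ("1.6T module (assume 25 W)", 25.0, 1600)]:
    print(f"{name}: {pj_per_bit(watts, gbps):.1f} pJ/bit")
```

Multiplied across the tens of thousands of optical modules in a GPU fabric, even a few picojoules per bit translates into megawatts.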

The path forward on 200G lanes to achieve 1.6T is not clear but will be driven by hyperscaler decisions. If they adopt 200G lanes, will the ecosystem deliver them fast enough?

Impact of AI/ML on testing

Testing speeds and feeds is no longer sufficient. To validate and quantify the performance of GPU workloads, test vendors need to emulate the real-world traffic patterns created by AI/ML networks, not just blast high volumes of generic traffic as in the past.
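To make the distinction concrete, the sketch below contrasts a flat, constant-rate load with the synchronized on/off bursts typical of data-parallel training. The phase durations and rates are illustrative assumptions, not captured traffic:

```python
# Contrast "blasting generic traffic" with an AI/ML-style pattern.
# Real test equipment and collective libraries are far richer.

def generic_blast(duration_s: float, rate_gbps: float):
    """Constant-rate load: a flat (time, rate) demand curve."""
    return [(round(t * 0.1, 1), rate_gbps) for t in range(int(duration_s * 10))]

def allreduce_pattern(steps: int, burst_gbps: float,
                      compute_s: float, comm_s: float):
    """Synchronized on/off bursts, like the gradient exchange that
    alternates with compute phases in data-parallel training."""
    pattern, t = [], 0.0
    for _ in range(steps):
        pattern.append((round(t, 3), 0.0))         # compute: link idle
        t += compute_s
        pattern.append((round(t, 3), burst_gbps))  # all-reduce: line rate
        t += comm_s
    return pattern

# Average rates can match, but the AI pattern swings between idle and
# line rate -- that is what stresses buffers, latency, and jitter.
print(generic_blast(0.2, 229)[:4])
print(allreduce_pattern(steps=2, burst_gbps=800, compute_s=0.05, comm_s=0.02))
```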

As AI/ML transitions from 800G to 1.6T, testing needs to follow the same path. If OSFP-XD is adopted, test companies need to support it quickly. Whatever the path to 1.6T, they need to work closely with multiple vendors to be ready to test devices as soon as they are available.

Of course, testing needs to continue for 800G and earlier architectures as adoption grows and factors such as energy management are incorporated.

Linear-drive pluggable optics

Lower power and lower latency are important for AI/ML data centers, so we expected more buzz around linear-drive pluggable optics (LPO).

LPO is a new Ethernet interconnect: a pure analog approach that drives the laser directly with the host's electrical signal, with no retiming DSP in the module. Because it's analog, it is a relatively low-power solution with fast propagation, though it depends on clean host-side signal integrity. As line rates go up, there's an opportunity for a diversity of interconnect solutions such as LPO that deliver low latency at lower power.
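As a rough illustration of the latency argument, the sketch below compares a per-module delay budget with and without a retiming DSP. The nanosecond figures are assumed order-of-magnitude placeholders, not measurements:

```python
# Removing the retiming DSP removes a fixed per-module delay that
# compounds across the hops of a GPU fabric. Numbers are hypothetical.
DSP_MODULE_NS = 60   # assumed DSP retimer latency per module (hypothetical)
LPO_MODULE_NS = 5    # assumed analog path latency per module (hypothetical)
HOPS = 3             # e.g., a leaf-spine-leaf back-end path

for name, per_module_ns in (("DSP-based", DSP_MODULE_NS), ("LPO", LPO_MODULE_NS)):
    total_ns = per_module_ns * 2 * HOPS  # two modules per link hop
    print(f"{name}: ~{total_ns} ns of module latency across {HOPS} hops")
```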

Ultra Ethernet

AI workloads are driving an unprecedented need for next-generation back-end (GPU-to-GPU) networks for AI training and inferencing. Low latency and high bandwidth connectivity between servers, storage, and the GPUs are essential. Thousands of $40K GPUs need to be used very efficiently to avoid large power and cost consequences.

Ethernet has the advantage of being a widely deployed standard, but it is lossy by design and subject to delay and variable latency (jitter), both of which can degrade AI/ML training performance.
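A toy simulation makes the point: a bulk-synchronous training step completes only when the slowest of N workers finishes, so the tail of the latency distribution, not the mean, sets the step time. All numbers below are illustrative:

```python
import random

# Step time = max over all workers, so tail latency dominates.
random.seed(0)

def step_time(workers: int, base_ms: float, jitter_ms: float) -> float:
    return max(base_ms + random.uniform(0, jitter_ms) for _ in range(workers))

for jitter in (0.0, 1.0, 5.0):
    avg = sum(step_time(1024, 10.0, jitter) for _ in range(100)) / 100
    print(f"jitter up to {jitter} ms -> mean step time ~{avg:.2f} ms")
```

With 1,024 workers, even modest per-link jitter pushes nearly every step toward the worst case, leaving expensive GPUs idle while they wait.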

To make Ethernet useful for the hyperscaler GPU use case and AI/ML, the industry created the Ultra Ethernet Consortium. Ultra Ethernet builds on Ethernet to reduce downtime, latency, and jitter, and to speed connectivity. It is intended as an open solution, expected to be available in one to two years.

The alternative to Ultra Ethernet, InfiniBand, is available now and provides deterministic flow control, but is a proprietary solution provided by one vendor.

800G standardized for 100G lanes

Last month, the IEEE ratified 802.3df, the standard defining the 800G Ethernet rate over a 100G electrical lane architecture. Just as with 400G, the industry is rallying and converging quickly on the technology now that the standard is complete. Vendors are rushing to demonstrate compliance with the 100G lane architecture.

AI/ML testing solutions

AI/ML is disruptive technology that is having major impacts on hyperscaler data centers. The high-speed technology innovations required must be validated and tested to ensure they meet the needs of growing AI workloads.

Spirent provides leading-edge test solutions for the high-speed networking technologies essential to validating next-gen AI/ML networks.

Whether 800G or 1.6T, Spirent is there to support the industry. Spirent test solutions and services ensure the conformance, performance, interoperability, flexibility, and scalability of 800G deployments and AI workloads. Automated testing reduces network test complexity and simplifies and accelerates the path to 1.6T.

Our new white paper, Turning Ideas into Action for 1.6T Ethernet, provides insights on 1.6T.

Learn how Spirent is enabling the future of high-speed networking with high-speed Ethernet testing solutions.

Like our content?

Subscribe to our blogs here.


Asim Rasheed

Senior Product Manager, HSE

Asim Rasheed is the Senior Product Manager for Spirent’s High-Speed Ethernet products. In his current role, he is responsible for managing the next-generation network and infrastructure testing product lines and building partnerships within the Ethernet ecosystem to support its continued expansion by providing vendor-neutral test solutions. Prior to Spirent, Asim worked at multiple network equipment manufacturing and test & measurement companies, managing software and hardware product lines across Routing/Switching, Security, Broadband Access, and hardware products. To connect with Asim, please go to LinkedIn at https://www.linkedin.com/in/masimrasheed/.