
Understanding Lustre Performance: Throughput in High-Demand Scenarios

When it comes to AI and machine learning applications, data throughput can become the deciding factor between success and failure. Lustre’s architecture is designed to offer unmatched performance across large-scale clusters, and the achievable throughput can be optimized based on the hardware configuration—specifically, the type of storage drives used and the network infrastructure in place.


Throughput on Different Storage Drives

1. HDDs (Hard Disk Drives)

   For installations that use traditional HDDs, throughput generally peaks between 100 and 200 MB/s per drive. While HDDs are cost-effective and offer ample storage capacity, their speed can limit Lustre’s overall performance, especially in data-intensive applications. Clusters with multiple HDDs configured in RAID can increase throughput, but the gains remain modest compared to solid-state options. In high-throughput applications, relying on HDDs alone can create bottlenecks, particularly in AI tasks that require rapid sequential read/write performance; a rough sizing sketch follows this list of drive types.

2. SSDs (Solid-State Drives)

   SSDs significantly increase Lustre’s throughput, delivering between 400 and 500 MB/s per drive. With SSDs, Lustre’s I/O operations become much faster, enhancing performance for AI and ML tasks that need high-speed data access, such as model training and large-scale simulations. SSDs’ low latency also reduces wait times between data requests, providing a smoother data pipeline. SSDs that use the NVMe (Non-Volatile Memory Express) interface can reach 3 GB/s or more per drive, making them ideal for demanding applications in deep learning or real-time analytics.

3. NVMe Drives

   NVMe drives further elevate Lustre’s throughput, achieving speeds of up to 6 GB/s per drive in optimal configurations. For AI workloads involving huge datasets, NVMe drives enable the highest level of performance, minimizing delays and accelerating data transfer. In addition, NVMe’s high internal parallelism complements Lustre’s design, which supports high concurrency across distributed nodes. For example, in a deep learning environment with large image datasets, NVMe-backed Lustre can serve many simultaneous data accesses without lag, keeping model training continuously fed with data.
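
As a rough illustration of how these per-drive figures add up at the server level, here is a minimal Python sketch. The constants simply restate the ranges quoted above; the drive counts and the 5 GB/s target in the example are hypothetical.

```python
import math

# Back-of-envelope estimate of raw storage throughput for a single server,
# using the per-drive ranges quoted above. Real results also depend on RAID
# layout, controllers, and how the Lustre OSS/OSTs are configured.
PER_DRIVE_MBPS = {
    "hdd": 150,    # ~100-200 MB/s per HDD
    "ssd": 450,    # ~400-500 MB/s per SATA SSD
    "nvme": 6000,  # up to ~6 GB/s per NVMe drive in optimal configurations
}

def raw_throughput_gbps(drive_type: str, drive_count: int) -> float:
    """Sum of per-drive sequential throughput in GB/s (ignores RAID overhead)."""
    return PER_DRIVE_MBPS[drive_type] * drive_count / 1000.0

def drives_needed(drive_type: str, target_gbps: float) -> int:
    """Minimum number of drives of this type to reach a target throughput."""
    return math.ceil(target_gbps * 1000.0 / PER_DRIVE_MBPS[drive_type])

# Hypothetical question: how many drives for a 5 GB/s storage server?
print(drives_needed("hdd", 5.0))   # ~34 HDDs
print(drives_needed("nvme", 5.0))  # 1 NVMe drive
```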

Network Infrastructure and Lustre Throughput

Beyond storage drives, network infrastructure significantly impacts Lustre’s performance. Lustre relies on a fast and stable network to maintain its high throughput across nodes:

1. 1 Gigabit Ethernet (1GbE)

   Although 1GbE is economical, it quickly becomes a limiting factor in Lustre deployments. With a maximum theoretical throughput of 125 MB/s, a 1GbE network can bottleneck performance, even with high-speed SSDs or NVMe drives. This setup is typically insufficient for AI workloads, where the data demands outpace what 1GbE can offer.

2. 10 Gigabit Ethernet (10GbE)

   With 10GbE, Lustre can achieve up to 1.25 GB/s in throughput. This bandwidth allows for more effective use of SSDs in Lustre nodes, providing sufficient throughput for mid-sized AI workloads and complex simulations. AI teams working on medium-sized datasets can benefit from 10GbE, as it supports rapid data ingestion and processing at a much higher rate than 1GbE.

3. RDMA

   RDMA (Remote Direct Memory Access) is crucial for achieving high performance and low latency in data transfers between nodes. RDMA allows data to move directly between the memory of one server and another without involving the CPU, bypassing traditional network stacks and reducing the number of memory copies needed. This capability is particularly valuable in HPC (High-Performance Computing) and large-scale storage environments, where large amounts of data must be transferred quickly and efficiently.

   For Lustre specifically, RDMA matters because it significantly improves throughput and reduces latency between the storage and compute nodes. Lustre is designed to handle high-bandwidth I/O workloads, often involving terabytes or petabytes of data. Without RDMA, traditional networking can become a bottleneck, since each transfer requires multiple steps of CPU intervention, adding delays. With RDMA, data moves directly over the network at much higher speeds, often tens or even hundreds of Gbps, which is essential for keeping up with the demands of a distributed file system like Lustre.
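
A quick way to verify whether a Lustre node is actually using an RDMA-capable LNet network rather than plain TCP is to look at its LNet NIDs. The minimal sketch below assumes the Lustre utilities (lctl) are installed on the node and that RDMA traffic uses the o2ib LNet driver commonly used for InfiniBand-class fabrics; it simply parses the output of lctl list_nids.

```python
import subprocess

def lustre_uses_rdma() -> bool:
    """Return True if any local LNet NID uses the o2ib (RDMA) driver.

    `lctl list_nids` prints one NID per line, e.g. '10.0.0.12@o2ib'
    for an InfiniBand network or '10.0.0.12@tcp' for plain Ethernet/TCP.
    """
    out = subprocess.run(
        ["lctl", "list_nids"], capture_output=True, text=True, check=True
    ).stdout
    return any("@o2ib" in line for line in out.splitlines())

if __name__ == "__main__":
    print("RDMA-capable LNet (o2ib) configured:", lustre_uses_rdma())
```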

Lustre’s Aggregate Throughput: Cluster-Level Performance for HPC Workloads

One of the standout benefits of Lustre is its ability to scale throughput across an entire cluster, allowing performance gains far beyond what single-server storage solutions like NFS can provide. In high-performance computing (HPC) environments, Lustre’s architecture enables the aggregation of throughput from multiple Object Storage Targets (OSTs), forming a single, high-speed parallel file system. This approach not only enhances data access but also effectively minimizes the bottlenecks that often arise when relying on single-endpoint storage like NFS.

Aggregate Throughput in Lustre Clusters

In a Lustre deployment, each node contributes to the cluster’s overall throughput, allowing cumulative data access speeds that can reach hundreds of gigabytes per second. For example:

– HDD-Based Clusters

  A Lustre cluster using traditional HDDs across multiple OSTs might achieve an aggregate throughput of 5-10 GB/s, depending on the number of drives and network configuration. While slower than SSD or NVMe-based clusters, this setup still provides substantially more throughput than a single-node NFS configuration, as it combines the read/write speeds of all nodes.

– SSD-Based Clusters

  With SSDs, Lustre’s aggregate throughput can reach dozens of GB/s across multiple OSTs. For AI workflows, this increased throughput translates into faster data loading and processing, crucial for training models that require continuous data access.

– NVMe-Based Clusters

  In top-tier HPC configurations, NVMe-backed Lustre clusters can achieve aggregate throughput of 100 GB/s or more, with some setups reaching several hundred GB/s in cutting-edge research facilities. This ultra-high throughput enables rapid processing of massive datasets, ensuring that even the most data-intensive AI applications run without delay. For instance, organizations running simulations on genomic data or climate models can process enormous volumes of data in near real time, making NVMe and Lustre a formidable combination for HPC environments.
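
To put these ballpark figures in one place, here is a minimal sketch that scales the per-drive numbers out to a whole cluster. The OST counts and drives-per-OST values below are hypothetical, and real aggregate throughput also depends on striping, the number of clients, and the network fabric.

```python
# Rough aggregate-throughput model for a Lustre cluster: each OST contributes
# the raw throughput of its drives, and the cluster total is the sum across
# OSTs. Per-drive figures restate the ranges quoted earlier in this post.
PER_DRIVE_GBPS = {"hdd": 0.15, "ssd": 0.45, "nvme": 6.0}

def aggregate_gbps(drive_type: str, num_osts: int, drives_per_ost: int) -> float:
    """Sum of per-drive throughput across all OSTs, in GB/s."""
    return PER_DRIVE_GBPS[drive_type] * drives_per_ost * num_osts

# Hypothetical cluster shapes and the ballparks they land in:
print(aggregate_gbps("hdd", 8, 6))    # ~7.2 GB/s  (the 5-10 GB/s range)
print(aggregate_gbps("ssd", 8, 8))    # ~28.8 GB/s (dozens of GB/s)
print(aggregate_gbps("nvme", 20, 1))  # ~120 GB/s  (100+ GB/s territory)
```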

Single-Node Throughput vs. Cluster-Level Aggregation

In a traditional storage setup using NFS, throughput is limited to the capabilities of a single server. This creates a bottleneck, as all clients accessing data are constrained by the maximum speed of that endpoint. In contrast, each Lustre node contributes to the overall data pipeline, allowing clients to access data in parallel from multiple OSTs. As a result:

– Single Node Throughput

  While an individual node in a Lustre setup (e.g., an OSS backed by NVMe) might deliver around 6 GB/s per drive, multiple OSS nodes can be combined to scale throughput roughly linearly with node count. In an AI workload scenario, each OSS provides independent access paths, meaning clients don’t have to compete for data access the way they would with a single NFS endpoint.

– Cluster-Level Throughput

  By leveraging the combined throughput of numerous OSS nodes, a Lustre cluster can handle tens to hundreds of gigabytes per second. For example, if each OSS node in a Lustre setup can independently provide 5 GB/s, a 20-node cluster could theoretically deliver an aggregate throughput of 100 GB/s. This kind of scalability is essential for HPC and AI tasks that need to read and write large files quickly.
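
Here is a minimal sketch of that arithmetic. A network-side cap is included because, as discussed in the network section above, aggregate throughput is ultimately bounded by whichever side is slower: the combined OSS storage bandwidth or the combined client network bandwidth. The client count and link speed in the example are hypothetical.

```python
def cluster_throughput_gbps(num_oss: int, per_oss_gbps: float,
                            num_clients: int, client_link_gbit: float) -> float:
    """Aggregate throughput is capped by the slower of the two sides:
    combined OSS storage bandwidth vs. combined client network bandwidth."""
    storage_side = num_oss * per_oss_gbps
    network_side = num_clients * client_link_gbit / 8.0  # Gbit/s -> GB/s
    return min(storage_side, network_side)

# The 20-node example from the text: 20 OSS x 5 GB/s = 100 GB/s on the
# storage side. With, say, 64 clients on 100 Gb/s links (hypothetical),
# the clients could absorb 800 GB/s, so storage remains the limit.
print(cluster_throughput_gbps(20, 5.0, 64, 100.0))  # 100.0
```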

How Lustre Avoids Bottlenecks Compared to NFS

1. Parallel Data Access

   Unlike NFS, where data flows through a single endpoint, Lustre distributes (stripes) files across multiple OSTs. Clients can access data from these OSTs in parallel, avoiding the central bottleneck inherent in NFS architectures; a small sketch of this striping behavior follows this list. This parallelism is critical in AI and machine learning environments, where multiple nodes may need to access large datasets simultaneously.

2. High Concurrency and Scalability

   Lustre’s scalability allows it to support thousands of nodes, each capable of contributing to the data flow. This high level of concurrency is especially beneficial in HPC, where simulations or deep learning models may need to access large amounts of data at the same time. An NFS server, limited by single-point throughput, would struggle to keep up with these demands, leading to latency and throttled performance.

3. Resiliency and Redundancy

   Lustre offers data striping across multiple storage nodes, which facilitates high-speed access. Striping by itself does not provide redundancy, however: the system’s resilience depends heavily on proper configuration, such as failover mechanisms and redundancy for the metadata and object storage targets (MDTs and OSTs). When correctly configured, Lustre can provide continuous, high-speed data access with minimal performance impact, even during node failures or peak loads.
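
Returning to the parallel access described in point 1, the sketch below illustrates how Lustre’s default RAID-0 style striping spreads a file’s byte ranges round-robin across OSTs, which is what lets clients pull different parts of the same file from several servers at once. The stripe count and stripe size here are hypothetical values; on a real file system they would be set with a command along the lines of lfs setstripe -c 4 -S 4M on the target directory.

```python
def ost_index_for_offset(offset: int, stripe_size: int, stripe_count: int) -> int:
    """Which stripe (and therefore which OST object) a byte offset lands on,
    assuming a simple round-robin (RAID-0 style) layout."""
    return (offset // stripe_size) % stripe_count

# Hypothetical layout: 4 OSTs, 4 MiB stripes (e.g. lfs setstripe -c 4 -S 4M).
STRIPE_SIZE = 4 * 1024 * 1024
STRIPE_COUNT = 4

# A 64 MiB read touches every stripe object, so four OSTs serve it in parallel.
offsets = range(0, 64 * 1024 * 1024, STRIPE_SIZE)
print({ost_index_for_offset(o, STRIPE_SIZE, STRIPE_COUNT) for o in offsets})
# -> {0, 1, 2, 3}
```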


Benefits of Using Lustre for AI and HPC Workloads

In AI environments, a high-throughput Lustre setup ensures that data is readily available, reducing time spent waiting for data loads and increasing time spent processing and analyzing data. With NVMe and InfiniBand, Lustre can provide the speed and efficiency that drive faster model training, streamline data processing, and enhance the performance of even the most complex AI workflows.

For AI training, data analytics, and HPC simulations, Lustre’s performance advantages over NFS are considerable. By removing single-server limitations, Lustre ensures a consistent data throughput that scales with the demands of the workload. AI models that require vast datasets for training, or HPC simulations running complex computations, benefit from Lustre’s ability to streamline data access, reduce latency, and deliver aggregate throughput that far surpasses traditional storage solutions.

In essence, Lustre provides the high throughput and scalability needed to support cutting-edge AI and HPC environments, minimizing bottlenecks and ensuring that data access can keep pace with even the most demanding computational tasks.

Continue reading the series of posts on Lustre: Lustre Design Strategies on Oracle OCI

At the time of publishing this post, I am an Oracle employee, but the views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
