In today’s high-performance computing (HPC) landscape, storage solutions that offer speed, scalability, and reliability are essential. Lustre, an open-source parallel file system, is a popular choice for HPC environments due to its capacity to handle large datasets and intensive workloads. When deploying Lustre on Oracle Cloud Infrastructure (OCI), achieving an optimized design that balances performance with cost-efficiency requires careful planning.
Let’s explore key Lustre infrastructure design patterns tailored for OCI, focusing on strategies to meet diverse storage needs and performance benchmarks. From selecting the right storage configurations to optimizing network setups and data access protocols, we’ll walk through best practices and design approaches that can help you build a robust Lustre environment on Oracle’s cloud platform.
In a Lustre file system, the key components are:
- Metadata Server (MDS) and Metadata Target (MDT): the MDS handles metadata operations, such as file creation and directory structures, and stores this metadata on the MDT.
- Object Storage Server (OSS) and Object Storage Target (OST): the OSS manages the actual file data, which is stored on one or more OSTs to enable efficient parallel access.
- Management Server (MGS) and Management Target (MGT): the MGS provides centralized management, coordinating the configuration and status of the entire Lustre system.
Components ending in “T” (MDT, MGT, and OST) represent the physical disks or volumes where metadata and file data are written, making up the storage backbone of the system.
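To make these roles concrete, here is a minimal sketch of the commands that format each target type and mount the file system from a client, rendered as a short Python script that prints the command sequence. The file-system name, device paths, and MGS address are placeholder assumptions, not values from a real deployment.

```python
# Minimal sketch of the commands that create each Lustre target type.
# All names below (fsname, devices, MGS address) are placeholder assumptions.

FSNAME = "lustrefs"        # hypothetical file-system name (max 8 characters)
MGS_NID = "10.0.3.10@tcp"  # hypothetical LNet identifier (NID) of the MGS node

commands = [
    # MGT: formatted on the MGS node; holds the cluster-wide configuration.
    "mkfs.lustre --mgs /dev/oracleoci/oraclevdb",
    # MDT: formatted on the MDS node; holds names, directories, permissions.
    f"mkfs.lustre --mdt --fsname={FSNAME} --mgsnode={MGS_NID} --index=0 "
    "/dev/oracleoci/oraclevdc",
    # OST: formatted on each OSS node; holds the actual file data objects.
    f"mkfs.lustre --ost --fsname={FSNAME} --mgsnode={MGS_NID} --index=0 "
    "/dev/oracleoci/oraclevdd",
    # Client mount: the client contacts the MGS, then reaches MDS/OSS directly.
    f"mount -t lustre {MGS_NID}:/{FSNAME} /mnt/{FSNAME}",
]

for cmd in commands:
    print(cmd)
```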
In the following images from the Lustre manual, we see diagrams of typical Lustre infrastructures.
Physical resources can be managed in Active/Active or Active/Passive clusters using failover technologies, such as Pacemaker and Corosync, that are not part of the Lustre application but can be implemented in the underlying Linux operating system.
In the following image, we see a more structured and complex Lustre installation, which introduces failover clusters across the various components of the infrastructure, namely the MGS, MDS, and OSS servers.
Of course, the network plays a fundamental role in this type of infrastructure. The machines must be able to communicate with each other over a dedicated network, separate from the one used by the clients accessing the file system. While this segmentation is not mandatory for operation, it is highly recommended to avoid bottlenecks and network data saturation.
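At the node level, this separation is expressed at Lustre's LNet layer, where each machine declares which interface carries Lustre traffic. Below is a minimal sketch, with a hypothetical interface name, of the classic modprobe-style configuration that pins Lustre traffic to the storage-network NIC.

```python
# Render the LNet configuration that binds Lustre traffic to a dedicated
# interface, keeping it off the network used by the file-system clients.
# The interface name is an assumption for illustration.

storage_iface = "ens5"  # hypothetical NIC attached to the storage subnet

# Classic modprobe-style option: Lustre's tcp0 network runs on ens5 only.
lnet_conf = f'options lnet networks="tcp0({storage_iface})"\n'

# On a real node this line would go into /etc/modprobe.d/lustre.conf
# (the same mapping can also be applied at runtime with
# `lnetctl net add --net tcp0 --if ens5`).
print(lnet_conf, end="")
```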
We can therefore translate the infrastructure suggested by the Lustre manual into an equivalent optimized for Oracle OCI, as outlined below. The failover clusters shown in the diagram are not mandatory for the infrastructure to operate, but they are recommended: if a volume within the Lustre file system becomes unreachable for any reason, the entire file system is compromised, with potential data loss. It is therefore important to evaluate whether the data processed and served to the HPC cluster needs to be persistent or whether it serves no critical long-term function.
It is important to remember that block volumes at VPU 120 (the highest performance level) can be attached only to compute shapes with at least 16 OCPUs, and they must be at least 1.5 TB each to reach their maximum performance. You therefore need to weigh the network bandwidth provided by the compute shape against the aggregate throughput of the attached volumes when deciding how many block volumes to connect to each machine, so that neither side becomes a bottleneck and each node's available throughput is fully used.
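As a back-of-the-envelope aid, the following sketch estimates how many volumes it takes to saturate a node's NIC and which side becomes the bottleneck. The bandwidth and per-volume throughput figures are illustrative assumptions, not OCI published limits; substitute the real numbers for your chosen shape and volume performance level.

```python
# Back-of-the-envelope sizing: how many block volumes does it take to
# saturate an OSS node's network bandwidth, and which side bottlenecks?
# The figures below are illustrative assumptions, not OCI published limits.

import math

nic_gbps = 16.0   # assumed network bandwidth of the compute shape
vol_mbps = 480.0  # assumed max throughput of one VPU 120 block volume

nic_mbps = nic_gbps * 1000 / 8  # Gbit/s -> MB/s (decimal units)

# Volumes needed so their combined throughput can fill the NIC.
volumes_to_saturate = math.ceil(nic_mbps / vol_mbps)

aggregate_vol_mbps = volumes_to_saturate * vol_mbps
bottleneck = "network" if aggregate_vol_mbps >= nic_mbps else "volumes"

print(f"NIC budget:           {nic_mbps:,.0f} MB/s")
print(f"Volumes to saturate:  {volumes_to_saturate}")
print(f"Aggregate volume I/O: {aggregate_vol_mbps:,.0f} MB/s")
print(f"Bottleneck:           {bottleneck}")
```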
In another image, taken from the Oracle OCI Marketplace, we can see the subnet-level distribution of the Lustre service. The infrastructure can in fact be deployed directly using a preconfigured Terraform script from the Marketplace itself.
As can be seen, Lustre has a dedicated subnet, which makes it easier to control access from the other subnets where the clients of the file system reside.
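Lustre's LNet traffic over TCP uses port 988, so the dedicated subnet only needs to admit that port from the client subnets. The sketch below shows one way to express such a rule with the OCI Python SDK; the security-list OCID and client CIDR are placeholders, and in practice you would merge the rule with the list's existing rules rather than replace them.

```python
# Sketch: admit only LNet's TCP port (988) into the Lustre subnet, and only
# from the client subnet, using the OCI Python SDK. The security-list OCID
# and CIDR are placeholders; update_security_list REPLACES the rule list,
# so in practice merge this rule with the rules already present.

import oci

config = oci.config.from_file()  # reads credentials from ~/.oci/config
network = oci.core.VirtualNetworkClient(config)

SECURITY_LIST_ID = "ocid1.securitylist.oc1..example"  # placeholder OCID
CLIENT_SUBNET_CIDR = "10.0.1.0/24"                    # placeholder CIDR

lnet_rule = oci.core.models.IngressSecurityRule(
    protocol="6",  # TCP
    source=CLIENT_SUBNET_CIDR,
    tcp_options=oci.core.models.TcpOptions(
        destination_port_range=oci.core.models.PortRange(min=988, max=988)
    ),
    description="Lustre LNet traffic from the HPC client subnet",
)

network.update_security_list(
    SECURITY_LIST_ID,
    oci.core.models.UpdateSecurityListDetails(
        ingress_security_rules=[lnet_rule]
    ),
)
```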
Another approach: cost savings on GPU HPC clusters
These diagrams are not the only possible Lustre infrastructures. When trying to optimize the cost of deploying HPC clusters together with their storage, one can have the HPC hosts themselves also provide storage services with Lustre, using the locally attached NVMe volumes that come with the Bare Metal shapes. This can lead to significant savings on the infrastructure. However, Lustre can consume more than half of the CPU under stress to sustain throughput, so this type of deployment is advisable only if the HPC cluster predominantly processes data on GPUs rather than CPUs. Additionally, if the data must be persistent and is of fundamental importance, it is essential to implement data replication between hosts to avoid system downtime, since the disks are locally attached and cannot be shared between hosts for failover.
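To gauge whether the savings survive the replication requirement, the following sketch (all figures are illustrative assumptions) compares raw and usable capacity, and the CPU headroom left for jobs, when each HPC host also acts as an OSS over its local NVMe.

```python
# Sketch: usable capacity and CPU headroom when HPC hosts double as OSS
# nodes over their local NVMe. All figures are illustrative assumptions.

hosts = 8                 # GPU bare-metal nodes in the cluster
nvme_tb_per_host = 54.0   # assumed local NVMe capacity per host
replication_factor = 2    # copies kept to survive the loss of a host
cores_per_host = 64
lustre_cpu_fraction = 0.5 # Lustre can consume ~half the CPU under load

raw_tb = hosts * nvme_tb_per_host
usable_tb = raw_tb / replication_factor
compute_cores = hosts * cores_per_host * (1 - lustre_cpu_fraction)

print(f"Raw NVMe capacity:   {raw_tb:.0f} TB")
print(f"Usable (replicated): {usable_tb:.0f} TB")
print(f"Cores left for jobs: {compute_cores:.0f} of {hosts * cores_per_host}")
```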
Conclusion
Designing an effective Lustre infrastructure on Oracle OCI involves careful consideration of various components, alongside the critical role of networking. The use of dedicated subnets enhances access security, while failover clusters provide additional resilience to the system, protecting against potential data loss. By optimizing the deployment to leverage existing HPC hosts for storage services, significant cost savings can be achieved. However, it is essential to assess the workload characteristics and ensure that data persistence is maintained through effective replication strategies. Ultimately, the right balance between performance, cost, and reliability will enable organizations to harness the full potential of Lustre in high-performance computing environments.