MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters (2024)

Yihao Zhao (Peking University)*, Xin Liu (ByteDance Inc.)*, Shufan Liu (ByteDance Inc.), Xiang Li (ByteDance Inc.), Yibo Zhu (ByteDance Inc.), Gang Huang (Peking University), Xuanzhe Liu (Peking University), and Xin Jin (Peking University)

*Yihao Zhao and Xin Liu contributed equally.

Abstract.

Large-scale GPU clusters are widely used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, most DL clusters either dedicate each GPU to one workload or share workloads in time, leading to very low GPU resource utilization.

We present MuxFlow, the first production cluster system that supports efficient and safe space-sharing for DL workloads. NVIDIA MPS provides an opportunity to share multiple workloads in space on widely-deployed NVIDIA GPUs, but it cannot guarantee the performance and safety of online workloads. MuxFlow introduces a two-level protection mechanism for memory and computation to guarantee the performance of online workloads. Based on our practical error analysis, we design a mixed error-handling mechanism to guarantee the safety of online workloads. MuxFlow further proposes dynamic streaming multiprocessor (SM) allocation and matching-based scheduling to improve the efficiency of offline workloads. MuxFlow has been deployed at CompanyX's clusters with more than 20,000 GPUs. The deployment results indicate that MuxFlow substantially improves the GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory from 42% to 48%.


1. Introduction

Deep learning (DL) has been widely integrated into intelligent applications and services, such as intelligent recommendation (Covington et al., 2016; Gao et al., 2021), autonomous driving (Alcon et al., 2020; Jang et al., 2020), image recognition (Simonyan and Zisserman, 2015; He et al., 2016), and machine translation (Vaswani et al., 2017; Gehring et al., 2017). Some of them provide real-time inference and have critical latency demands (called online workloads). Meanwhile, other workloads do not have hard latency demands (called offline workloads). Enterprises usually build large-scale GPU clusters for DL workloads and reserve specific GPUs for online workloads.

Existing efforts in online workload management have significantly improved the serving efficiency (Crankshaw et al., 2017; Gujarati et al., 2020; Shen et al., 2019). However, a major limitation is that most methods dedicate the whole GPU to a single workload. It is reported that an online workload usually cannot fully utilize the expensive GPU resource (Ma et al., 2020; Han et al., 2022), mainly due to two reasons. First, the frequency of online requests fluctuates from time to time. When the request frequency is low, more GPU computing units are idle, leading to a great waste of GPU. Second, even if the request frequency is high, the batch size of online workloads is usually limited to a small value for the latency demand, and most kernels need few computing resources. Thus, the computing units in the GPU are still underutilized.

A common idea is to share GPUs among multiple workloads with different latency demands (Xiang and Kim, 2019; Xiao et al., 2020; Han et al., 2022), i.e., sharing GPUs between online and offline workloads. Time-sharing and space-sharing are two paradigms for GPU sharing. Time-sharing (Xiao et al., 2020; cgp, 2022) assigns time slices to different workloads, but it may degrade the performance of online workloads and cannot improve GPU resource utilization in space. Space-sharing (Han et al., 2022; mps, 2022) is a better way to improve GPU resource utilization. For widely-deployed NVIDIA GPUs, multi-process service (MPS) (mps, 2022) is a feasible choice due to its efficacy, flexibility, and compatibility with NVIDIA GPUs.

However, MPS brings new challenges to production clusters. First, the primary goal for production clusters is to guarantee the performance of online workloads, such as real-time recommendation and machine translation. These workloads have hard latency demands because longer latency may influence the user's experience. However, MPS cannot guarantee the performance of online workloads. Second, MPS has a serious error propagation problem, i.e., when one workload encounters an error, the shared workload may also be influenced. It is critical to guarantee the safety of shared workloads, especially the online workloads in production clusters.

This paper presents MuxFlow, a system that supports efficient and safe space-sharing of large-scale GPU clusters for DL workloads in the production environment. MuxFlow addresses the above challenges to guarantee the performance and safety of online workloads. MuxFlow exploits a two-level protection mechanism to guarantee the performance of online workloads from both the workload level and the GPU level. At the workload level, we propose xCUDA to constrain the GPU memory and computing power used by offline workloads. xCUDA monitors GPU memory allocation to limit the memory usage of offline workloads, and controls kernel launches to limit the computing power used by offline workloads. Besides, xCUDA provides adjustable parameters to control how much the online workloads are influenced. At the GPU level, MuxFlow employs the SysMonitor to monitor the GPU device status. The SysMonitor maintains a state machine according to multi-dimensional GPU metrics, and will evict offline workloads if the GPU status can potentially compromise the performance of online workloads.

To guarantee the safety of online workloads, we investigate all propagated errors in our production clusters and propose a mixed error-handling mechanism. We find that 99% of propagated errors are caused by SIGINT and SIGTERM signals, which are usually used to stop containers in Kubernetes (k8s, 2022). MuxFlow employs a graceful exit mechanism that intercepts related signals and releases the CUDA context actively to avoid the propagated error. For other corner cases, MuxFlow resets the CUDA context and restarts the workloads.

Furthermore, MuxFlow improves the efficiency of offline workloads. MuxFlow dynamically allocates the computing unit of NVIDIA GPUs, i.e., the streaming multiprocessor (SM), used by offline workloads. Our key intuition is that we can set the SM percentage of offline workloads complementary to the SM percentage used by online workloads, with an acceptable slowdown of online workloads. Besides, we observe that different sharing pairs vary dramatically in the efficiency of offline workloads. To maximize the efficiency, we formulate the problem as a maximum weighted bipartite matching problem. MuxFlow exploits a DL approach to build the bipartite graph and the KM algorithm (Kuhn, 1955; Munkres, 1957) to solve this problem.

In summary, we make the following contributions.

  • We investigate the characteristics of production inference clusters and identify the opportunity in space sharing to better utilize GPU.

  • We provide efficient and safe space-sharing with three mechanisms: a two-level protection mechanism to guarantee the performance of online workloads, a mixed error-handling mechanism to ensure the safety of online workloads, and a dynamic SM percentage mechanism to improve the efficiency of offline workloads.

  • We design a matching-based scheduling algorithm to improve the sharing efficiency at the cluster level. The scheduling algorithm can improve the overall normalized throughput of offline workloads while maintaining the performance of online workloads.

  • We introduce MuxFlow, the first production cluster system that enables efficient and safe space sharing. We have deployed MuxFlow in a production cluster with more than 20,000 GPUs at CompanyX to serve tens of thousands of daily workloads. Deployment results show that MuxFlow improves the GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory from 42% to 48%.

2. Motivation

In this section, we begin with introducing DL workloads and critical terminologies. Then we describe the observations from the production cluster for online workloads to motivate the design of MuxFlow. We end by discussing opportunities to share the GPUs between different DL workloads.

2.1. DL workloads

DL workloads use the deep neural network (DNN) to perform inference or training. DL workloads are usually classified into two categories, i.e., online workload and offline workload, according to the latency demand. Online workload refers to latency-critical inference, such as real-time recommendation (Covington et al., 2016; Gao et al., 2021) and machine translation (Vaswani et al., 2017; Gehring et al., 2017). Online workloads have strict latency demands because longer end-to-end latency may hurt users' experience. Additionally, the requests for online workloads are usually submitted periodically at different frequencies. Offline workload does not have strong latency demand, such as DL training, batch inference, scientific computing (Senior et al., 2020), and automatic neural architecture search (Liu et al., 2018; Tan et al., 2019). These workloads usually take hours or even days to finish. The offline workloads do not have hard time requirements and can usually highly utilize the computing units of the GPU, making them suitable to fill the idle GPU resource.

2.2. Production cluster for online workloads

Production clusters exploit GPUs to accelerate DL workloads (Crankshaw et al., 2017; Gujarati et al., 2020; Han et al., 2022). GPUs are usually assigned to online workloads exclusively to guarantee the latency demand. We study GPU resource utilization in production clusters from two aspects: memory and computing power.


Low GPU resource utilization. We collect one week's statistics of GPU computation utilization and memory usage in the inference cluster of CompanyX, as shown in Figure 1. The inference workloads include various popular DL models, such as CNN, GNN, LLM, and recommendation models. As for the GPU computing utilization, we use two metrics: GPU utilization and SM activity (dcg, 2022). GPU utilization and SM activity represent how busy the GPU is in time and in space, respectively. GPU memory usage is the ratio of used memory to memory capacity. Figure 1 illustrates that both GPU utilization and SM activity are lower than 60% for more than 99% of GPUs. In addition, GPU memory usage is less than 60% for about 90% of GPUs. These numbers show that GPUs are underutilized in both memory and computing power, indicating a great waste of valuable GPUs.

Fluctuating and predictable GPU utilization.


We take one typical online workload in the production cluster of CompanyX as an example and show its GPU computing utilization and memory usage in Figure 2. Both the GPU utilization and SM activity fluctuate greatly in one day, because the number of online requests varies from time to time. For example, more users use entertainment applications in the evening and send more online requests to related services, while during the day, fewer requests are sent. The GPU memory usage is stable because the DL framework, e.g., PyTorch (Paszke et al., 2019), caches the intermediate GPU memory for efficiency. Besides, we observe that the curves of the GPU usage metrics are smooth in minutes and periodical in days. Thus, we can predict the GPU usage metrics by the past values.
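This periodicity suggests a simple prediction scheme. The following is a minimal sketch, not the paper's actual predictor, that blends the value observed roughly one day earlier with a short moving average of recent samples; the class name, window size, and weighting are illustrative assumptions.

```python
from collections import deque

class UsagePredictor:
    """Toy predictor for a smooth, daily-periodic GPU metric (illustrative,
    not MuxFlow's method): blend the sample observed ~24 hours ago with a
    short-term moving average of the most recent samples."""

    def __init__(self, samples_per_day, window=5, alpha=0.7):
        self.daily = deque(maxlen=samples_per_day)  # one day of history
        self.recent = deque(maxlen=window)          # last few samples
        self.alpha = alpha                          # weight on the daily pattern

    def observe(self, value):
        self.daily.append(value)
        self.recent.append(value)

    def predict_next(self):
        if not self.recent:
            return 0.0
        local = sum(self.recent) / len(self.recent)
        # daily[0] is the oldest retained sample, i.e., ~24h before the next one.
        day_ago = self.daily[0] if len(self.daily) == self.daily.maxlen else local
        return self.alpha * day_ago + (1 - self.alpha) * local
```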


2.3. Opportunities in GPU sharing

Some recent work (cgp, 2022; mig, 2022; mul, [n. d.]; mps, 2022) has exploited GPU sharing approaches to improve GPU resource utilization. There are two aspects of GPU sharing, i.e., time-sharing and space-sharing. We compare GPU sharing approaches for widely-deployed NVIDIA GPUs with an example as shown in Figure 3.

Time-sharing is not efficient to improve GPU resource utilization. In time-sharing, shared workloads use different time slices. To protect the performance of online workloads, priority-based time-sharing (Xiao et al., 2020; cgp, 2022) (Figure 3(a)) assigns more time slices to high-priority workloads. However, a single online workload usually cannot fill all SMs of one GPU completely (Han et al., 2022; Ma et al., 2020), leading to a waste of GPU computing power.

Opportunity: space-sharing of GPU. When a workload cannot fully utilize the GPU computing units, i.e., SMs, it can share the idle SMs with other workloads. The SMs of one GPU can be divided into multiple parts and used by different workloads simultaneously, i.e., space-sharing. We summarize three space-sharing approaches to share widely-deployed NVIDIA GPUs in Figure 3. NVIDIA proposes multi-instance GPU (MIG) (mig, 2022), which can partition one GPU into multiple instances, as shown in Figure 3(b). However, the partition cannot be dynamically adjusted during workload execution, and thus, we have to allocate maximum resources for online workloads, which leads to a waste of GPU. Additionally, MIG only works for specific new-generation GPU types, e.g., A100 and H100, which are not widely used in production clusters. CUDA provides multiple streams (mul, [n. d.]) (Figure 3(c)) to execute kernels from multiple workloads, whereas the concurrent workloads can significantly degrade the performance of online workloads. Besides, an NVIDIA stream can only share with other streams in one process, which requires merging multiple workloads and is hard to manage in production clusters. We find that NVIDIA MPS (mps, 2022) (Figure 3(d)) is the best trade-off between GPU resource utilization and online performance. MPS is supported by Kepler and newer NVIDIA GPUs, which are the majority of the GPUs used in production clusters. MPS enables an NVIDIA GPU to execute multiple workloads at the same time by assigning different sets of SMs to the shared workloads. Besides, MPS provides environment variables to roughly control the SM percentage used by each workload, which enables performance protection of online workloads.
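The per-client SM cap that MPS exposes is set through the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. Below is a minimal sketch of launching an offline workload under an already-running MPS daemon with its SM share roughly capped; the launch command and the 40% value are illustrative.

```python
import os
import subprocess

def launch_offline(cmd, sm_percentage):
    """Start an offline workload as an MPS client whose SM share is roughly
    capped via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE (MPS daemon assumed running)."""
    env = dict(os.environ)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(cmd, env=env)

# Hypothetical usage: the online server runs unconstrained on the same GPU,
# while the offline trainer is limited to roughly 40% of the SMs.
offline_proc = launch_offline(["python", "train.py"], sm_percentage=40)
```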


To show the effect of MPS, we choose two DL models, VGG16 (Simonyan and Zisserman, 2015) and DenseNet201 (Huang et al., 2017), as workloads. We use the inference of these DL models as online workloads and the training as offline workloads. These workloads are tested on an NVIDIA T4 GPU. Figure 4 reports the normalized performance when we share one online workload with one offline workload. The normalized performance is the average iteration duration when running alone divided by the average iteration duration when shared with other workloads. To protect the performance of online workloads, we adjust the SM percentage of offline workloads. Figure 4 demonstrates that one GPU can provide up to 62% more computing power while slowing online workloads down by less than 20%. The results indicate the potential of sharing GPU in space with MPS.

Challenges for space-sharing. There are some technical challenges to deploy space-sharing in large-scale DL clusters. First, the primary goal of production clusters is to guarantee the performance of online workloads. MPS enables us to roughly change the SM percentage used by each workload. However, it cannot guarantee the performance of online workloads. For example, the online requests may suddenly burst due to a special activity, but the SM percentage for offline workloads cannot be reduced timely. Thus, we need to control the execution process of the shared workloads to protect online workloads. Second, MPS is notorious for its serious error propagation problems. Specifically, when one workload encounters an error, the shared workload will also be influenced. These safety problems are intractable and critical in production clusters. Third, different SM percentages assigned to offline workloads can greatly influence the efficiency of both shared workloads. We change the SM percentage assigned to offline workloads from 10% to 100% as shown in Figure 4. The normalized performance of both the online workload and the offline workload varies by more than 5×. Thus, choosing a proper SM percentage is important to provide efficient GPU sharing. Fourth, different shared pairs of online and offline workloads show different impacts on the shared workloads in Figure 4. The normalized performance of offline workloads varies by up to 50% in Figure 4. Additionally, the number of possible sharing plans is factorial in the number of workloads, which is enormous for a production cluster. We need to efficiently decide how to share workloads to maximize offline efficiency, while maintaining the performance of online workloads.

3. MuxFlow architecture


MuxFlow is a cluster system that enables efficient and safe space-sharing in production clusters with tens of thousands of heterogeneous GPUs at CompanyX. The architecture of MuxFlow is shown in Figure 5. MuxFlow consists of a service manager for online workloads, a global manager for offline workloads, and a local executor for each GPU.

Service manager. In this paper, we only describe the functionality of the service manager briefly, as its details are beyond the scope of this paper. The service manager is responsible for online workload deployment, online request discovery, and horizontal pod autoscaling.

Global manager. When an offline workload is submitted to MuxFlow, the global manager buffers the offline workload in a pending queue and makes scheduling decisions periodically. The global manager includes three components: workload profiler, speed predictor, and scheduler.

Workload profiler. The workload profiler gets the GPU resource utilization and execution time for each offline workload. When an offline workload is first submitted, the workload profiler runs the job for a few iterations and measures the execution information. The measured information is stored in a database and can be used by the speed predictor when making scheduling decisions.

Speed predictor. When the speed predictor gets an online workload and an offline workload, it can predict the sharing speed of the given workloads. The speed predictor employs a DL model to perform prediction. The DL model leverages the execution information when the workloads are executed separately. The execution information is reported by the GPU monitor for online workloads and is profiled by the workload profiler in advance for offline workloads. The predicted speed is passed to the scheduler to make scheduling decisions.

Scheduler. The scheduler schedules the offline workloads from the pending queue. By utilizing the predicted speed from the speed predictor, the scheduler exploits a matching-based scheduling algorithm to decide which offline workload and online workload should share the same GPU. The scheduling algorithm can find the optimal sharing strategy. The scheduler performs global rescheduling periodically at a fixed interval.

Local executor. Each local executor manages the workloads on one GPU. The local executor executes workloads according to the scheduling decision of the scheduler and monitors the running workloads. Besides, it can evict the offline workload if the online workload is under threat. The workloads share the same GPU in space with MPS. There are four components in the local executor: online container, offline container with xCUDA, GPU monitor, and SysMonitor.

Online container and offline container. The online container runs the server for the online workload and serves the online requests from upstream callers registered in the service manager. The offline container runs the offline workload. With xCUDA built into the offline container, the execution of the offline workload is controlled to guarantee the performance of the online workload. To do so, xCUDA limits the GPU memory and computing power used by the offline workloads.

GPU monitor. The GPU monitor periodically collects GPU metrics, such as GPU utilization, memory usage, and SM clock. These data reflect the workload pressure of each GPU, and they are leveraged by the SysMonitor and xCUDA to manage offline workloads.

SysMonitor. The SysMonitor maintains a state machine reflecting the device status and ensures that the device is not in an unhealthy status. The state machine transits according to the GPU metrics collected by the GPU monitor. When the state machine indicates that one online workload is highly influenced, the SysMonitor will evict the shared offline workload.

4. Efficient and safe space-sharing

The goal of MuxFlow is to guarantee the performance and safety of online workloads, and improve the efficiency of offline workloads. In this section, we introduce how to provide efficient and safe space-sharing in each local executor. We first describe how we protect the performance of online workloads with a two-level protection mechanism. Then we introduce how we protect the safety of online workloads with a mixed error-handling mechanism. We end with the dynamic SM allocation mechanism for offline efficiency improvement.

4.1. Two-level performance protection


MPS provides an environment variable, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, to control the SM percentage used by a workload. It configures the active thread percentage at the client process level and limits the SM percentage used by the client process. We can roughly limit the offline workload with this environment variable and reduce the slowdown of online workloads indirectly. However, in production clusters, we need rigorous protection mechanisms for online workloads. MuxFlow employs a two-level protection mechanism as shown in Figure 6. Specifically, MuxFlow controls offline workloads to protect online workloads at the workload level by xCUDA and at the GPU level by SysMonitor.

First of all, we need GPU metrics to make decisions on how to control the offline workloads. In the local executor, the GPU monitor collects real-time GPU metrics periodically. The collection interval is at the millisecond level for timely control. The metrics include GPU resource utilization (e.g., GPU utilization, SM activity, and GPU memory usage) and GPU device status (e.g., SM clocks, power, and temperature). The GPU monitor stores the metrics for only several minutes because old data not only consume storage but also are useless for timely workload management.
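As a rough illustration, most of the metrics above can be sampled through NVML; the sketch below uses the pynvml bindings and only approximates the production monitor, which also relies on DCGM (SM activity in particular comes from DCGM profiling metrics rather than NVML).

```python
import pynvml

def sample_gpu_metrics(index=0):
    """Collect a snapshot of the metrics the GPU monitor tracks (sketch using
    NVML via pynvml; the production system additionally uses DCGM)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)        # GPU / memory utilization (%)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)                # used / total bytes
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)  # MHz
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # milliwatts -> W
    temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    pynvml.nvmlShutdown()
    return {
        "gpu_util": util.gpu,
        "mem_used_ratio": mem.used / mem.total,
        "sm_clock_mhz": sm_clock,
        "power_w": power_w,
        "temperature_c": temp_c,
    }
```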

Workload level. xCUDA is built into the offline container to control the GPU memory and computing power used by offline workloads, as shown in Figure 6(a). In terms of memory, xCUDA keeps track of the GPU memory usage and makes sure that the memory used by the offline workload does not exceed the GPU memory quota. Specifically, whenever the DL framework, e.g., TensorFlow (Abadi et al., 2016) or PyTorch (Paszke et al., 2019), applies for GPU memory by calling the related CUDA API, the call will be checked by xCUDA first.

In terms of computing power, we aim to guarantee the performance of online workloads and improve GPU computing utilization. For NVIDIA GPUs, the SM clock represents how fast the SMs execute instructions. The performance of online workloads is greatly affected by the SM clock, and the SM clock will decrease when the GPU load is high. The decrease in SM clock is especially noteworthy for NVIDIA GPUs for inference, e.g., T4. Thus, our goal is equivalent to attaining both a high SM clock and high GPU utilization. When the SM clock is low, we can delay the kernel launches of the offline workloads to reduce the GPU load and improve the SM clock. When the GPU utilization is low, we can launch more kernels to improve it. Formally, we define $U_{SM}$ as the SM activity and $a_C$ as a clock factor that is negatively correlated with the SM clock. We use the GPU load $U_{GPU}$ to quantify our goal, which can be calculated by,

(1)   $U_{GPU} = U_{SM} \times a_C.$

However, the SM clock and GPU utilization are conflicting in practice because the SM clock is negatively correlated to the GPU load, while the GPU utilization is positively correlated to the GPU load. Note that when sharing online and offline workloads, it is enough to get an SM clock which is similar to the SM clock when the online workload runs separately. Consequently, xCUDA sets an SM clock threshold for these two goals. When the SM clock is below the threshold, xCUDA tends to improve the SM clock. When the SM clock is over the threshold, xCUDA tends to improve the GPU utilization. The factor $a_C$ can be calculated by,

(2)   $a_C = \begin{cases} 1 + a_L \cdot \frac{T_{SM} - C_{SM}}{T_{SM}} & C_{SM} < T_{SM} \\ 1 - a_H \cdot \frac{C_{SM} - T_{SM}}{C_H - T_{SM}} & C_{SM} \geq T_{SM}, \end{cases}$

where $a_L$ is a parameter for low SM clock, $a_H$ is a parameter for high SM clock, $C_{SM}$ is the SM clock, $T_{SM}$ is an SM clock threshold, and $C_H$ is the highest SM clock. $a_L$ is much larger than $a_H$ to show the preference for increasing the SM clock when it is below the threshold. With the GPU load $U_{GPU}$, xCUDA will delay the kernel when $U_{GPU}$ is high and launch the kernel when $U_{GPU}$ is low. Additionally, as the GPU load $U_{GPU}$ may change rapidly, xCUDA leverages the PID algorithm (Johnson and Moradi, 2005) to provide more stable and robust control.
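The control loop implied by Equations 1 and 2 can be sketched as follows. This is only an illustration of the logic: the real xCUDA intercepts CUDA driver calls and additionally smooths the launch/delay decision with a PID controller, and all parameter values here are placeholders.

```python
def clock_factor(c_sm, t_sm, c_high, a_low=2.0, a_high=0.5):
    """Clock factor a_C from Equation 2 (a_low >> a_high favors raising the
    SM clock when it is below the threshold; values are placeholders)."""
    if c_sm < t_sm:
        return 1 + a_low * (t_sm - c_sm) / t_sm
    return 1 - a_high * (c_sm - t_sm) / (c_high - t_sm)

def gpu_load(u_sm, c_sm, t_sm, c_high):
    """GPU load U_GPU = U_SM * a_C from Equation 1."""
    return u_sm * clock_factor(c_sm, t_sm, c_high)

def should_launch_offline_kernel(u_sm, c_sm, t_sm, c_high, target_load=60.0):
    """Launch the next offline kernel only when the measured GPU load is below
    a target; otherwise delay it. (The PID smoothing used by xCUDA is omitted.)"""
    return gpu_load(u_sm, c_sm, t_sm, c_high) < target_load
```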

GPU level. xCUDA constrains offline workloads and provides indirect performance protection for online workloads. However, it cannot react to changes caused by online workloads in time. For example, when the GPU memory usage of the online workload bursts, xCUDA cannot dynamically adjust the GPU memory quota for offline workloads. Thus, at the GPU level, MuxFlow uses SysMonitor to monitor the GPU device status with a state machine. Figure 6(b) shows the state machine of SysMonitor. The state machine has five states, and each state has a set of metric thresholds for GPU utilization, SM activity, SM clock, and GPU memory usage. The threshold values are empirically selected. Note that offline workloads can only be scheduled to Healthy GPUs.

The five states of SysMonitor are as follows: (1) Init state represents that the GPU is being initialized. When the initialization finishes, the Init state will transit to the Healthy state. (2) Healthy state represents that the GPU is healthy and is able to execute offline workloads. The metric thresholds of this state guarantee that the online workload is not influenced by the offline workloads. Once one metric reaches the Unhealthy threshold, the state will transit to the Unhealthy state. Furthermore, once one metric exceeds the Overlimit threshold, the state will directly transit to the Overlimit state. (3) Unhealthy state represents that one GPU metric is in the Unhealthy range and none is in the Overlimit range. Intuitively, this state means that the online workloads may be influenced. Offline workloads are forbidden to be scheduled to the GPU in this state. Once one metric exceeds the Overlimit threshold, the state will transit to the Overlimit state. Conversely, if all metrics are below the Healthy threshold, the state will transit to the Healthy state. (4) Overlimit state represents that the GPU device is overloaded. In this state, the offline workloads are evicted. When all metrics are below the Overlimit threshold after a period, the state will transit to the Unhealthy state. To avoid frequent eviction, the period is set to be exponential in the number of times the GPU has entered the Overlimit state during the last two hours. (5) Disabled state represents that the GPU device is unavailable and no workload runs on it.
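A simplified sketch of this state machine is shown below; the metric set and threshold values are hypothetical placeholders (the production thresholds are empirically tuned), and the exponential back-off before leaving the Overlimit state is omitted.

```python
from enum import Enum

class State(Enum):
    INIT = "init"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    OVERLIMIT = "overlimit"
    DISABLED = "disabled"

# Hypothetical per-metric thresholds; the production values are empirical.
UNHEALTHY_T = {"gpu_util": 85, "sm_activity": 70, "mem_used_ratio": 0.85}
OVERLIMIT_T = {"gpu_util": 95, "sm_activity": 90, "mem_used_ratio": 0.95}

def exceeds(metrics, thresholds):
    return any(metrics[k] >= v for k, v in thresholds.items())

def next_state(state, metrics):
    """One transition step of the (simplified) SysMonitor state machine."""
    if state == State.INIT:
        return State.HEALTHY
    if state in (State.HEALTHY, State.UNHEALTHY):
        if exceeds(metrics, OVERLIMIT_T):
            return State.OVERLIMIT      # evict offline workloads
        if exceeds(metrics, UNHEALTHY_T):
            return State.UNHEALTHY      # stop scheduling offline workloads here
        return State.HEALTHY
    if state == State.OVERLIMIT:
        # Exponential back-off before leaving Overlimit is omitted for brevity.
        return state if exceeds(metrics, OVERLIMIT_T) else State.UNHEALTHY
    return state                        # DISABLED stays disabled
```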


4.2. Safety protection

MPS has a serious error propagation problem, i.e., one workload's error may impact other workloads sharing the same GPU. The error propagation problem is dangerous in large-scale clusters, especially for online workloads. For example, if one offline workload is canceled by a SIGINT signal, the MPS context may hang and the shared online workload cannot serve requests. We summarize propagated errors in one production cluster with MPS enabled as shown in Figure 7, and propose a mixed error-handling mechanism.

We find that 99% of such propagated errors are caused by SIGINT/SIGTERM. To handle the dominant error type, we use xCUDA to intercept SIGINT and SIGTERM signals and exit gracefully. Specifically, when xCUDA gets these signals, it will freeze all kernel launches and release the CUDA context actively. Other errors account for only 1%, e.g., MPS server crashes (caused by program bugs), XID31 events (GPU memory page faults), and MPS hangs caused by other reasons. For these errors, we summarize their error patterns. An automated detector monitors GPUs and alerts when the error patterns are satisfied. Once xCUDA gets the alert, it will reset the CUDA context and MPS server.
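The graceful-exit idea can be sketched at the process level as follows; the release_cuda_context helper is a hypothetical stand-in for what xCUDA actually does inside its CUDA interception layer (freezing kernel launches and destroying the process's CUDA context before exiting).

```python
import signal
import sys

def release_cuda_context():
    """Hypothetical placeholder for xCUDA's cleanup: stop issuing new kernel
    launches and destroy this process's CUDA context so that the MPS server
    is left in a clean state."""
    pass

def graceful_exit(signum, frame):
    release_cuda_context()
    sys.exit(0)

# Intercept the signals Kubernetes uses to stop containers, so that killing an
# offline container does not leave a broken MPS context behind.
signal.signal(signal.SIGINT, graceful_exit)
signal.signal(signal.SIGTERM, graceful_exit)
```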


4.3. Dynamic SM allocation

In Figure 4, we illustrate that the SM percentage assigned to offline workloads can influence the speed of the shared workloads dramatically. In other words, we can balance the speed of the online workload and that of the offline workload by changing the SM allocation. Our goal is to maximize the efficiency of offline workloads with an acceptable slowdown of the online workloads. As SMs are the computing units of the GPU, maximizing the efficiency of offline workloads is approximately equal to maximizing the percentage of SMs assigned to offline workloads. Apparently, fixed SM allocation is not a panacea for all sharing cases. We use the example in Figure 8 to illustrate the drawbacks of fixed SM allocation. A and B are online workloads, and C and D are offline workloads. Assume that the SM percentage is set to 40% for offline workloads. The online workload A only uses 20% of the SMs and leaves more than 40% of the SMs idle. If we fix the SM percentage for offline workloads to 40%, there will be 40% idle SMs and the computing power is wasted. The online workload B uses 80% of the SMs when running alone, and the remaining SMs are less than 40%. With the fixed SM allocation, the offline workload D will occupy B's SMs and slow the online workload B down.

To provide efficient space-sharing, we propose the dynamic SM allocation mechanism, which selects a proper SM percentage for offline workloads. A natural idea is to assign the SM percentage according to the SM activity of online workloads, as shown in Figure 8(c). For example, we can set the SM percentage for the offline workload C to 80% and for D to 20%. In this way, the computing units, i.e., SMs, are fully used and the shared workloads do not contend for SMs.
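A minimal sketch of this complementary allocation is given below; the clamping bounds are illustrative assumptions rather than production values.

```python
def dynamic_sm_percentage(online_sm_activity, floor=10, ceiling=90):
    """Give the offline workload roughly the SMs the online workload leaves idle.
    online_sm_activity is the online workload's SM activity (in percent) when
    running alone; floor/ceiling are illustrative safety bounds."""
    free = 100 - online_sm_activity
    return max(floor, min(ceiling, free))

# Examples matching Figure 8: online A uses 20% of SMs -> offline C gets 80%;
# online B uses 80% of SMs -> offline D gets 20%.
assert dynamic_sm_percentage(20) == 80
assert dynamic_sm_percentage(80) == 20
```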


5. Matching-based scheduling

Note that one offline workload has diverse throughput when sharing with different online workloads. We next consider scheduling submitted workloads to achieve high overall throughput for offline workloads. The method of scheduling and deploying online workloads is orthogonal to the scheduling algorithm of offline workloads. For online workloads, we reuse the scheduling and deployment strategy of the exclusive inference cluster at CompanyX. The details of the strategy are beyond the scope of this paper. For offline workloads, the scheduling algorithm needs to solve two problems.

The first problem is to capture the overall throughput for all offline workloads. It is unfair and meaningless to simply sum up the throughput of every offline workload because different kinds of workloads vary in throughput when running separately. Thus, we use the normalized throughput, which is defined as the throughput when sharing divided by the throughput when running separately. We can get the throughput of separate execution by profiling the offline workloads when they are submitted. However, the shared throughput is difficult to get because a production cluster usually has thousands of online workloads and it is impossible to profile all sharing pairs. Fortunately, getting the shared throughput can be seen as a regression problem, which can be solved by DL. We can use the profiled separate execution features of online and offline workloads as input, and employ a DL model to get the shared throughput. Specifically, we choose highly related execution features, e.g., GPU utilization, SM activity, SM occupancy, separate execution time, and assigned SM percentage, as input. We employ a multi-layer perceptron (MLP) as the speed predictor to get the shared throughput. Because different GPU types perform diversely, we train one MLP per GPU type.
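As an illustration, such a per-GPU-type regressor could look like the PyTorch sketch below; the 4-layer, 64-unit configuration matches what Section 6 reports, while the exact feature vector and training setup are simplified assumptions.

```python
import torch
import torch.nn as nn

class SpeedPredictor(nn.Module):
    """4-layer MLP mapping separate-execution features of an (online, offline)
    pair plus the assigned SM percentage to the offline workload's normalized
    throughput. One such model would be trained per GPU type."""

    def __init__(self, num_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Illustrative feature vector: GPU utilization, SM activity, SM occupancy, and
# separate execution time for both workloads, plus the assigned SM percentage.
model = SpeedPredictor(num_features=9)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.MSELoss()  # regression on the normalized throughput
```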

The second problem is to select the best sharing plan from all potential sharing plans. For a production cluster with thousands of online and offline workloads, there is an enormous number of combinations to share workloads. Formally, given $n$ online workloads and $m$ offline workloads, we pair the workloads to maximize the normalized throughput of the entire sharing plan. This problem can be transformed to a maximum weighted bipartite matching problem. Figure 9 shows how we model this problem. We build a bipartite graph $G(V, E)$, where each node $v \in V$ represents a workload. Every node belongs to one of the two node sets, i.e., the online workload set and the offline workload set. The edge $(u, v) \in E$ represents sharing online workload $u$ and offline workload $v$, with the normalized throughput of $v$ as the edge weight. A matching of $G$ is a set of disjoint edges, and it corresponds to a feasible sharing plan. Finding the sharing plan with maximum overall throughput is converted to finding the maximum weighted matching of the corresponding bipartite graph. We leverage the Kuhn-Munkres (KM) algorithm (Kuhn, 1955; Munkres, 1957) for this well-studied problem. The KM algorithm can find the optimal maximum weighted bipartite matching in $O(|V|^3)$.

We use an example in Figure 9 to illustrate the matching problem. There are two online workloads (A and B) and three offline workloads (C, D, and E). The number on the edge represents the normalized throughput of the offline workload when sharing with the online workload. For example, when sharing C with A, the normalized throughput of C is 0.3. We compare two matching plans. Plan 1 shares A with D and B with C. The overall throughput of offline workloads is 0.8 + 0.8 = 1.6. Plan 2 shares A with C and B with E. The overall throughput of offline workloads is 0.3 + 0.4 = 0.7. Apparently, plan 1 has higher overall throughput and is more efficient for offline workloads.
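The maximum weighted matching itself can be computed with any Hungarian/KM implementation. The sketch below reproduces this example with SciPy's linear_sum_assignment; edges whose weights are not given in the example are assumed to be 0 here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: online workloads A, B. Columns: offline workloads C, D, E.
# Known normalized throughputs from the Figure 9 example; unknown edges are
# assumed to be 0 for this illustration.
weights = np.array([
    [0.3, 0.8, 0.0],   # A shared with C, D, E
    [0.8, 0.0, 0.4],   # B shared with C, D, E
])

# linear_sum_assignment minimizes cost, so negate the weights to maximize
# the overall normalized throughput.
rows, cols = linear_sum_assignment(-weights)
plan = {"AB"[r]: "CDE"[c] for r, c in zip(rows, cols)}
total = weights[rows, cols].sum()
print(plan, total)   # {'A': 'D', 'B': 'C'} with total 1.6, i.e., plan 1
```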

Algorithm 1 shows the pseudocode of the matching-based scheduling. Given online workloads $W_{on}$ and offline workloads $W_{off}$, we initialize the bipartite graph $G$ where online workloads $W_{on}$ and offline workloads $W_{off}$ are two disjoint sets of $G$ (Lines 1-4). For each pair of nodes, we get the SM percentage for the offline workload by the dynamic SM allocation mechanism (Lines 5-6). The edge weights are calculated by the speed predictor $P$ (Lines 7-8). Then we get the maximum weighted bipartite matching $M$ by the KM algorithm (Lines 9-10). Edge $(u, v)$ in matching $M$ represents that the offline workload $v$ should be shared with the online workload $u$.

Algorithm 1: Matching-based scheduling.

Input: Online workloads $W_{on}$; offline workloads $W_{off}$; speed predictor $P$.

Main routine:
1: // Initialize
2: $G.Init()$
3: $G.AddNodes(W_{on})$
4: $G.AddNodes(W_{off})$
5: for each pair $(u, v)$, where $u \in W_{on}$, $v \in W_{off}$ do
6:    $sm \leftarrow DynamicSM(u, v)$
7:    $weight \leftarrow P.CalcNormTput(u, v, sm)$
8:    $G.AddEdge(u, v, weight)$
9: // Find the optimal matching with the KM algorithm
10: $M \leftarrow G.GetMatching()$

Subroutines:
$G.Init()$: Initialize an empty bipartite graph.
$G.AddNodes(W)$: Add every workload $w \in W$ as a node to graph $G$.
$G.AddEdge(u, v, c)$: Add an edge $(u, v)$ with edge weight $c$ to graph $G$.
$G.GetMatching()$: Calculate the maximum weighted bipartite matching of graph $G$ by the KM algorithm.
$DynamicSM(u, v)$: Use the dynamic SM allocation mechanism to get the proper SM percentage for offline workload $v$.
$P.CalcNormTput(u, v, sm)$: Calculate the normalized throughput of $v$ by the speed predictor $P$.

6. Implementation

At CompanyX, we have deployed MuxFlow in our internal clusters to serve daily DL workloads. The internal clusters consist of heterogeneous GPUs, including NVIDIA T4 GPUs and NVIDIA A10 GPUs. Integrated with Kubernetes (k8s, 2022), MuxFlow manages thousands of GPUs in each cluster and more than 20,000 GPUs in all.

Service manager. For online workloads, we use the existing service manager at CompanyX, which deploys containers, discovers services, and autoscales horizontal pods.

Global manager. We modify the Kubernetes scheduler to schedule offline workloads. The workload profiler takes several dedicated GPUs, whose number is negligible compared to the total number of GPUs. When a new offline workload comes, the workload profiler performs a few dry runs of the workload and utilizes the NVIDIA Data Center GPU Manager (DCGM) tools (dcg, 2022) and NVIDIA Management Library (NVML) (nvm, 2022) libraries to collect GPU metrics. We collect about 2,000 data points for each GPU type to train the speed predictor. The MLPs of the speed predictor have four layers with hidden size 64×64. The MLPs are trained with the momentum SGD optimizer (Ruder, 2016) in PyTorch v1.8.0 (Paszke et al., 2019) until they converge. MuxFlow invokes the scheduler periodically to schedule all offline workloads. When moving workloads, we record checkpoints of offline workloads and restart the workloads after transmitting the models and checkpoints. As the datasets are usually colossal, we store the datasets in a remote file system and fetch data during the execution. We implement the scheduler as a third-party plugin to the Kubernetes scheduler.

Local executor. Each local executor executes online workloads according to the service manager and offline workloads according to the global manager. DL workloads are executed in Docker containers with our customized components. We add a Best-Effort GPU DevicePlugin in Kubernetes and relevant control paths with Kubelet and SysMonitor for offline workloads. To control the SM percentage, we leverage the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE provided by MPS. The GPU monitor collects resource metrics through DCGM (dcg, 2022) and NVML (nvm, 2022) for NVIDIA GPUs. The SysMonitor updates the state machine with the collected resource metrics and empirically-set thresholds. When the state is unhealthy, the SysMonitor will ask the NodeManager in Kubernetes to evict offline workloads. xCUDA intercepts nearly 800 CUDA driver APIs for GPU memory allocation and kernel launch. The GPU memory quota of offline workloads is fixed to 40%, as Figure 1 reports that most online workloads use less than 60% GPU memory. We adopt the cpuset of Cgroup for CPU isolation. For memory, MuxFlow will evict offline workloads if memory usage is higher than a threshold or the kernel swap daemon is busy for a long time. The parameters to calculate the GPU load in Equations 1 and 2 are empirically selected through trial-and-error.

7. Evaluation

Our evaluation consists of testbed experiments, trace-driven simulations, and results from the production deployment. We mainly focus on the efficiency and safety of MuxFlow. The results show that MuxFlow oversells up to 90.0% GPU resources to offline workloads, and improves the GPU utilization by 4.0×, SM activity by 4.7×, and GPU memory usage by 1.5×. The error rate of MuxFlow is similar to that of the dedicated inference clusters in production deployment at CompanyX.

7.1. Experimental setup

Testbed. We conduct the testbed experiments with 125 machines and 1,000 GPUs. Each machine is equipped with 8 NVIDIA Tesla T4 GPUs, 2 Intel Xeon Platinum CPUs, 128G memory, and 100+25G NICs. We use PyTorch v1.8.0 with CUDA 11.1 for offline workloads. We use real online workloads in our production clusters, and these workloads use different DL frameworks and CUDA drivers.

Simulator. Inspired by (Gu et al., 2019; Zhao et al., 2022), we build a simulator to evaluate a broader set of configurations, traces, and baselines. We profile the iteration duration, GPU resource utilization, and device metrics of the selected DL workloads when they are executed separately and shared with others. The profiling results include more than 200 executions. The difference between the simulation and testbed experiment is under 5%, showing the high fidelity of the simulator.

Workloads. We use the actual online workloads and real-time requests at CompanyX for the testbed experiment. The online workloads use a wide spectrum of DL models, including CNN, GNN, LLM, and recommendation models. For trace-driven simulation, we select three online workloads deployed over hundreds of GPUs at CompanyX, and generate requests according to their actual queries per second (QPS), varying from 20 to 190. For offline workloads, we leverage the public trace from Microsoft (Jeon et al., 2019) and split the trace according to the virtual cluster ID. We use submission time and duration from the traces and randomly choose DL models from four popular DL models including ResNet50 (He et al., 2016), VGG16 (Simonyan and Zisserman, 2015), DenseNet201 (Huang et al., 2017), and Inception-V3 (Szegedy et al., 2016), in accordance with the common practice (Gu et al., 2019; Zhao et al., 2022; Han et al., 2022). We repeat the workloads to fit in 1,000 GPUs and guarantee that the generated traces can be finished in 12 hours for the testbed experiment and 24 hours for simulations. The numbers of offline workloads in these traces vary from 1,410 to 7,287. We set the batch size according to the memory quota limited by xCUDA.

Baselines. We compare MuxFlow with three related systems: Online-only, Time-sharing, and Priority-based time-sharing (PB-time-sharing). Online-only executes only the online workloads and shows the optimal latency for online workloads. Time-sharing shares workloads in time and assigns the time slices of GPUs to the shared workloads by the GPU driver strategy, which is adopted by Gandiva (Xiao et al., 2018). PB-time-sharing sets online workloads with high priority and assigns more time slices of GPUs to high-priority workloads to protect the high-priority workloads' performance, which is adopted by AntMan (Xiao et al., 2020) and PAI (Weng et al., 2022).

Metrics. Average latency and 99%-th latency are two common metrics to evaluate the performance of online workloads (Gujarati et al., 2020; Han et al., 2022). Average job completion time (JCT) and makespan are used to reflect the workload efficiency of schedulers (Xiao et al., 2020; Zhao et al., 2022). Offline normalized throughput shows the sharing efficiency. We define how much GPU the offline workloads get in the aspect of computation speed as the oversold GPU. This metric is in the range of $[0, 1]$, where $0$ represents that the offline workloads get no GPU computation resource, and $1$ represents that the offline workloads get equivalent GPU computation resources as they are executed exclusively. This metric can be calculated by Equation 3.

(3)   $Oversold\ GPU = \frac{\sum_{w \in W_{off}} T^{real}_{w}}{\sum_{w \in W_{off}} T^{sep}_{w}},$

where $W_{off}$ represents the offline workloads, $T^{real}_{w}$ represents the real execution time of $w$, and $T^{sep}_{w}$ represents the execution time of $w$ when running exclusively. We report GPU resource utilization with three metrics: GPU utilization, SM activity, and GPU memory utilization.

7.2. Testbed experiments


We first evaluate MuxFlow on a testbed with 1,000 GPUs. Figure 10 shows the detailed metrics for online workloads, offline workloads, and GPU resource utilization. To get the metrics of Online-only, we stop the offline workloads for one minute in every scheduling interval and collect the metrics. The scheduling interval is set to 15 minutes considering the overhead of pulling images and initialization. Though the trace of offline workloads lasts for 12 hours, most workloads finish before 8 hours. Thus, there is an obvious shift in MuxFlow's curves of offline normalized throughput and GPU resource utilization between 7 and 8 hours.


Performance of online workloads. Figure 10 shows that MuxFlow increases the average latency by 16.0% and the 99%-th latency by 15.3%. These results indicate that MuxFlow slows down online workloads by less than 20%, i.e., 10ms. It is worth mentioning that such a slowdown is almost imperceptible for most online workloads, e.g., recommendation services and machine translation. In the wide spectrum of industrial online workloads deployed at CompanyX, the latency demand of most online workloads is more than 100ms, hence the 10ms slowdown of online workloads is acceptable in practice. Additionally, we can adjust xCUDA and the dynamic SM allocation mechanism to reduce the slowdown threshold or improve the oversold resource for offline workloads. We observe that 1.5% of executions of offline workloads are evicted, indicating the functionality of the performance protection mechanisms.

Efficiency of offline workloads. We find that MuxFlow provides up to 86.42% GPU resource to offline workloads, which is a substantial number considering the large number of GPUs in production.

GPU resource utilization. Figure 10 compares the GPU computing utilization and memory usage between Online-only and MuxFlow. The utilization numbers are the average of all GPUs. MuxFlow improves the GPU utilization by 4.0×, SM activity by 4.7×, and GPU memory usage by 1.5×.

Safety. We observe that during the 12-hour testbed experiments, no error propagation happens. That is, no online workload is influenced by offline workload errors, verifying the safety of MuxFlow.

7.3. Comparison with related work

We compare MuxFlow with three related systems: Online-only, Time-sharing, and PB-time-sharing. The related systems are implemented in our simulator, and evaluated with production online workloads and popular offline workloads. Figure 11 demonstrates the average latency of online workloads, average JCT of offline workloads, and oversold GPU to offline workloads. We normalize the latency by that of Online-only and other metrics by that of MuxFlow. MuxFlow improves the average JCT by 1.10-2.24×, and the oversold GPU by 1.08-1.97×, while slowing down the online workloads by less than 20%. Time-sharing slows down online workloads by up to 50%, indicating a great impact on online workloads. PB-time-sharing utilizes priority to protect the performance of online workloads, but its metrics for offline workloads are worse than MuxFlow due to two reasons. First, MuxFlow can utilize the GPU resource wasted by online workloads in space. Second, MuxFlow employs the scheduling algorithm to improve the efficiency of offline workloads.

7.4. Analysis of MuxFlow


Accuracy of the speed predictor. To better understand the impact of MLP architecture on prediction accuracy, we evaluate the speed predictor with various hidden sizes and numbers of network layers in the MLP. We vary the hidden size from 64×64 to 1024×1024 and fix the number of layers to 4. The test error curves in Figure 12(a) show that the MLPs with different hidden sizes have similar accuracy and convergence speed. For the number of layers, we evaluate the MLP with 2 to 8 layers as shown in Figure 12(b), with the hidden size fixed to 64×64. We find that the error is lowest for 4 layers due to its proper relationship between dataset size and parameters. Thus, we select 64×64 as the hidden size and 4 as the layer number for better accuracy and faster inference time.


Impact of the mechanisms for offline efficiency. We also study the impact of the dynamic SM allocation mechanism and the matching-based scheduling. We leverage the simulator to compare MuxFlow with three variants: MuxFlow without the dynamic SM allocation mechanism (MuxFlow-S), MuxFlow without matching-based scheduling (MuxFlow-M), and MuxFlow without both the dynamic SM allocation mechanism and matching-based scheduling but only the protection mechanisms for online workloads (MuxFlow-S-M). Figure 13 reports the metrics for online and offline workloads over traces A to D. We normalize the latency by that of Online-only and the average JCT by that of MuxFlow. Note that compared with MuxFlow-S-M, both MuxFlow-S and MuxFlow-M improve the average JCT and oversold GPU. These improvements confirm that the online protection mechanisms alone are not efficient, and the dynamic SM allocation mechanism and matching-based scheduling are important for offline efficiency. Additionally, combining the two mechanisms, i.e., MuxFlow, brings more benefits.

System overhead. MuxFlow mainly introduces two kinds of overhead, i.e., the profiling overhead and the scheduling overhead. The profiling takes less than 10 minutes for each offline workload. The profiling overhead is marginal, as the offline workloads usually take hours or even days. The overhead of the scheduling algorithm consists of two periods. The first period is to predict the sharing performance and build the bipartite graph. Each prediction takes less than one millisecond, and each internal cluster at CompanyX consists of thousands of GPUs and thousands of workloads. Thus, the first period only takes several seconds for each cluster using batched prediction. The second period is to execute the KM algorithm, which takes several minutes for thousands of workloads. Note that the scheduling algorithm can be executed in parallel with the workload execution. Thus, we can hide the scheduling overhead within each scheduling interval.

7.5. Production deployment


We have deployed MuxFlow on the production clusters with more than 20,000 GPUs at CompanyX. As the results of the whole system are not ready when the paper is written, we concentrate on the results of MuxFlow without the dynamic SM allocation mechanism and the matching-based scheduling. To verify the performance protection for online workloads provided by MuxFlow, we collect the latency and throughput of online workloads that are deployed with both MuxFlow and dedicated inference clusters (Online-only), as shown in Figure 14. The average latency and 99%-th latency of online workloads increase by less than 10ms. The slowdown of online workloads is acceptable compared with the latency demand of our online workloads. Besides, we collect the average resource utilization of all GPUs used by MuxFlow and Online-only for four weeks in Figure 15. We observe that MuxFlow improves the GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory from 42% to 48%, indicating the efficiency of MuxFlow. The GPU memory utilization increases less than the other utilization metrics because of our conservative memory limitation for offline workloads.

The percentage of daily error devices of MuxFlow is below 0.9%, which is slightly higher than the error rate of Online-only at CompanyX, 0.7%. However, the increase of the error rate is much less than the increase of the executed containers, i.e., 2×. Compared with Online-only, the extra device errors of MuxFlow come from MPS server crashes and other MPS hangs, which cannot be handled with existing mechanisms.

8. Lessons and future direction

Safety protection. How to guarantee the safety of shared workloads is one of the most important problems of deploying GPU sharing. Thus, most GPU sharing solutions (cgp, 2022; Gu et al., 2018) in production do not consider MPS due to its poor isolation ability. In contrast, to the best of our knowledge, we are the first to thoroughly analyze all unsafe cases encountered in our production cluster and propose corresponding solutions for these cases. Almost all unsafe cases in our deployment can be handled by the graceful exit mechanism of MuxFlow. Besides, we have worked with NVIDIA to improve MPS. Some bugs and features we reported have been verified and fixed by NVIDIA. For example, we observed that sharing workloads compiled by different GCC versions with MPS can cause the MPS server to hang, and this problem has been verified by NVIDIA and fixed in NVIDIA GPU Driver 470.

However, things become complicated when considering malicious behaviors, e.g., triggering a sticky CUDA error by dividing by zero to influence the shared workload. To avoid malicious behaviors, MuxFlow only accepts trustworthy offline workloads, and we employ a fault detector with manually-summarized rules to monitor collected device metrics and alert when abnormal situations are found. For now, MuxFlow is only used in our internal clusters and for internal users. These protection approaches seem safe enough according to our deployment experience. Yet if considering external users in a cloud setting, we need more general and automated approaches for safety protection and malicious behavior detection. One opportunity is to leverage DL to discover malicious behaviors automatically (Saufi et al., 2019). Besides, enabling the scheduler to identify fail-prone workloads and avoid sharing them with other workloads is another possible approach.

Slowdown of online workloads. In this paper, the latency of online workloads increases by less than 20%, i.e., 10 ms. This degradation is affordable and acceptable in our internal cluster because most latency demands of production online workloads exceed 100 ms. Note that the degradation threshold is a tradeoff between online service quality and resource utilization, and it can be changed in MuxFlow through two mechanisms. First, the parameters of GPU load1&2 in xCUDA affect how the offline workload is executed and, in turn, how the online workload is influenced. Second, we can adjust the SM percentage assigned to offline workloads to change the slowdown of online workloads. How to select a proper degradation threshold for each cluster, or even each online workload, is left as an open problem.
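For the second mechanism, MPS exposes a per-client environment variable, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, that caps the fraction of SMs a client may use. The sketch below shows one way an operator could launch an offline workload under such a cap; the helper function, command, and chosen percentage are illustrative and do not reproduce MuxFlow's dynamic SM allocation policy.

```python
import os
import subprocess

def launch_offline_with_sm_cap(cmd, sm_percentage):
    """Launch an offline workload as an MPS client with a capped SM share.

    CUDA_MPS_ACTIVE_THREAD_PERCENTAGE limits the active thread (SM) percentage
    available to this MPS client. Picking the cap value is the policy decision
    that MuxFlow's dynamic SM allocation automates.
    """
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
    return subprocess.Popen(cmd, env=env)

# Example: give the best-effort job at most 40% of SMs. A smaller cap slows
# the offline workload but reduces interference with the online workload.
proc = launch_offline_with_sm_cap(["python", "offline_train.py"], sm_percentage=40)
```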

The number of shared workloads. In MuxFlow, we share at most one offline workload with each online workload because one offline workload is usually enough to fill the SMs. Sharing multiple offline workloads with multiple online workloads may bring more benefits, especially for lightweight offline workloads. However, there are four challenges to sharing more workloads. First, we need to guarantee the performance of all online workloads. Second, we need to limit the total SM percentage used by multiple offline workloads, which cannot simply be limited by MPS parameters. Third, xCUDA needs to monitor the kernel launches of all offline workloads and decide which kernel to delay or launch according to their priority. Fourth, the scheduling problem of choosing sharing groups with three or more workloads becomes NP-hard (Zhao et al., 2022). How to solve these challenges to better utilize GPUs is an interesting future direction.
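For the two-workload case that MuxFlow targets, pairing online and offline workloads can be cast as a bipartite assignment problem and solved with a Hungarian-style algorithm (Kuhn, 1955; Munkres, 1957). The sketch below illustrates this formulation with scipy; the speed matrix values are made up for illustration, and a real system would obtain them from a speed predictor.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian-style solver

def pair_workloads(speed_matrix):
    """Pick one-to-one pairs of online (rows) and offline (columns) workloads.

    speed_matrix[i][j] is the predicted normalized speed of offline workload j
    when co-located with online workload i. linear_sum_assignment minimizes
    total cost, so we negate the matrix to maximize total predicted speed.
    """
    row_idx, col_idx = linear_sum_assignment(-np.asarray(speed_matrix))
    return list(zip(row_idx, col_idx))

# Three online workloads and three offline candidates (values are made up).
predicted_speed = [
    [0.9, 0.4, 0.7],
    [0.5, 0.8, 0.6],
    [0.6, 0.3, 0.9],
]
print(pair_workloads(predicted_speed))  # e.g., [(0, 0), (1, 1), (2, 2)]
```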

9. Related work

DL workload scheduling. Existing DL schedulers are mainly designed for either online or offline workloads, but not both, to ensure the performance of online workloads and avoid interference. The primary goals of online workload schedulers (Crankshaw et al., 2017; Shen et al., 2019; Gujarati et al., 2020; Romero et al., 2021) are meeting the latency demand and improving overall throughput. However, these schedulers let one workload monopolize GPUs and thus cannot fully exploit the GPU resource. Most prior offline workload schedulers (Gu et al., 2019; Hwang et al., 2021; Qiao et al., 2021; Mohan et al., 2022) also allocate GPUs exclusively, and they cannot be directly applied to GPU-sharing clusters because they cannot ensure the performance of high-priority online workloads. In contrast, MuxFlow leverages space-sharing to improve GPU resource utilization and applies a two-level protection mechanism to guarantee the performance of online workloads.

Resource sharing for big data workloads. Prior work has studied resource sharing for big data workloads in CPU clusters. DRF (Ghodsi et al., 2011) extends max-min fairness to achieve fair resource sharing. Tetris (Grandl et al., 2014), Graphene (Grandl et al., 2016b), and Carbyne (Grandl et al., 2016a) propose heuristic algorithms to solve multi-resource scheduling problems and improve resource utilization. MonoSpark (Ousterhout et al., 2017) improves performance clarity by splitting data analytics tasks into monotasks. Many large enterprises deploy resource-sharing clusters. Microsoft's Apollo system (Boutin et al., 2014) improves resource utilization with opportunistic tasks, and Google's Borg (Verma et al., 2015; Tirmazi et al., 2020) adopts machine sharing with performance isolation to achieve high utilization. In comparison, DL workloads rely on GPUs for acceleration, and GPUs lack efficient and safe sharing mechanisms, which makes deploying GPU sharing in production clusters more challenging.

GPU sharing for DL workloads. Recently, GPU sharing has been studied for DL workloads, and the techniques mainly fall into two categories: time-sharing and space-sharing. Prior time-sharing approaches (Xiao et al., 2018; Wang et al., 2021; Lim et al., 2021; Zhao et al., 2022) may impact the efficiency of shared workloads and cannot fully utilize GPU computing power within each time slice. Salus (Yu and Chowdhury, 2019) and PipeSwitch (Bai et al., 2020) propose fast job switching and memory management techniques to speed up time-sharing, but they cannot avoid its intrinsic drawbacks. Some work (Gu et al., 2018; Xiao et al., 2020; Weng et al., 2022; cgp, 2022) assigns time slices according to priority to guarantee the performance of online workloads; however, these approaches still cannot improve resource utilization within each time slice. MuxFlow employs space-sharing and performance protection mechanisms for efficient and safe GPU sharing.

Space-sharing is the other direction to share GPUs. NVIDIA's MPS (mps, 2022) is a general method to multiplex jobs on NVIDIA GPUs. Gavel (Narayanan et al., 2020) leverages MPS directly but cannot guarantee the performance of online workloads. GSLICE (Dhakal et al., 2020) advances MPS to support dynamic and fair resource allocation but does not consider cluster-level scheduling. Retiarii (Zhang et al., 2020) merges multiple similar models to improve GPU utilization, which is infeasible for production clusters running diverse workloads. DeepPool (Park et al., 2022) and Reef (Han et al., 2022) leverage priority-based multi-stream approaches. However, multi-stream approaches are unfit for production deployment mainly for two reasons. First, they require manually merging different workloads into one process to leverage NVIDIA GPU streams, which is not friendly to existing infrastructure and users. Second, multiple streams may introduce the overhead of locks and CPU-side kernel launches. In contrast, MuxFlow is a practical space-sharing cluster system that has been deployed at CompanyX.

Some work predicts performance interference among shared workloads for time-sharing (Chen et al., 2016, 2017) and space-sharing (Zhao et al., 2019). These approaches play a similar role to the DL-based speed predictor in MuxFlow and are orthogonal to our other system designs.

10. Conclusion

In this paper, we have presented MuxFlow, the first production DL cluster system for efficient and safe space-sharing. MuxFlow ensures the performance of online workloads at both the workload level and the GPU level. To guarantee the safety of online workloads, MuxFlow leverages a mixed error-handling mechanism based on an analysis of production errors. Furthermore, MuxFlow exploits dynamic SM allocation and matching-based scheduling to improve the efficiency of offline workloads. The evaluation results demonstrate the efficiency and efficacy of MuxFlow. In particular, MuxFlow has already been deployed in the production DL clusters at CompanyX with more than 20,000 GPUs.

References

  • mul ([n. d.]) NVIDIA Multiple Streams. https://developer.download.nvidia.cn/CUDA/training/StreamsAndConcurrencyWebinar.pdf.
  • cgp (2022) cGPU. https://www.alibabacloud.com/help/en/elastic-gpu-service/latest/cgpu.
  • dcg (2022) DCGM. https://developer.nvidia.com/dcgm.
  • k8s (2022) Kubernetes. http://kubernetes.io.
  • mps (2022) NVIDIA MPS. https://docs.nvidia.com/deploy/mps/index.html.
  • mig (2022) NVIDIA MIG. https://www.nvidia.com/en-us/technologies/multi-instance-gpu/.
  • nvm (2022) NVML. https://developer.nvidia.com/nvidia-management-library-nvml.
  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: a system for large-scale machine learning. In USENIX OSDI.
  • Alcon et al. (2020) Miguel Alcon, Hamid Tabani, Leonidas Kosmidis, Enrico Mezzetti, Jaume Abella, and Francisco J. Cazorla. 2020. Timing of autonomous driving software: Problem analysis and prospects for future solutions. In IEEE RTAS.
  • Bai et al. (2020) Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications. In USENIX OSDI.
  • Boutin et al. (2014) Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. 2014. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. In USENIX OSDI.
  • Chen et al. (2017) Quan Chen, Hailong Yang, Minyi Guo, Ram Srivatsa Kannan, Jason Mars, and Lingjia Tang. 2017. Prophet: Precise QoS prediction on non-preemptive accelerators to improve utilization in warehouse-scale computers. In ACM ASPLOS.
  • Chen et al. (2016) Quan Chen, Hailong Yang, Jason Mars, and Lingjia Tang. 2016. Baymax: QoS Awareness and Increased Utilization for Non-preemptive Accelerators in Warehouse Scale Computers. ACM ASPLOS (2016).
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In ACM RecSys.
  • Crankshaw et al. (2017) Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In USENIX NSDI.
  • Dhakal et al. (2020) Aditya Dhakal, Sameer G. Kulkarni, and K. K. Ramakrishnan. 2020. GSLICE: Controlled Spatial Sharing of GPUs for a Scalable Inference Platform. In ACM SoCC.
  • Gao et al. (2021) Weihao Gao, Xiangjun Fan, Chong Wang, Jiankai Sun, Kai Jia, Wenzi Xiao, Ruofan Ding, Xingyan Bin, Hui Yang, and Xiaobing Liu. 2021. Learning An End-to-End Structure for Retrieval in Large-Scale Recommendations. In ACM CIKM.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In PMLR.
  • Ghodsi et al. (2011) Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. 2011. Dominant resource fairness: Fair allocation of multiple resource types. In USENIX NSDI.
  • Grandl et al. (2014) Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, and Aditya Akella. 2014. Multi-resource packing for cluster schedulers. In ACM SIGCOMM.
  • Grandl et al. (2016a) Robert Grandl, Mosharaf Chowdhury, Aditya Akella, and Ganesh Ananthanarayanan. 2016a. Altruistic scheduling in multi-resource clusters. In USENIX OSDI.
  • Grandl et al. (2016b) Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. 2016b. Graphene: Packing and dependency-aware scheduling for data-parallel clusters. In USENIX OSDI.
  • Gu et al. (2019) Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU cluster manager for distributed deep learning. In USENIX NSDI.
  • Gu et al. (2018) Jing Gu, Shengbo Song, Ying Li, and Hanmei Luo. 2018. GaiaGPU: sharing GPUs in container clouds. In IEEE ISPA/IUCC/BDCloud/SocialCom/SustainCom.
  • Gujarati et al. (2020) Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In USENIX OSDI.
  • Han et al. (2022) Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences. In USENIX OSDI.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE CVPR.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In IEEE CVPR.
  • Hwang et al. (2021) Changho Hwang, Taehyun Kim, Sunghyun Kim, Jinwoo Shin, and KyoungSoo Park. 2021. Elastic resource sharing for distributed deep learning. In USENIX NSDI.
  • Jang et al. (2020) Wonseok Jang, Hansaem Jeong, Kyungtae Kang, Nikil Dutt, and Jong-Chan Kim. 2020. R-TOD: Real-time object detector with minimized end-to-end delay for autonomous driving. In IEEE RTSS.
  • Jeon et al. (2019) Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In USENIX ATC.
  • Johnson and Moradi (2005) Michael A. Johnson and Mohammad H. Moradi. 2005. PID control. Springer.
  • Kuhn (1955) Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly (1955).
  • Lim et al. (2021) Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Youngjin Kwon, and Myeongjae Jeon. 2021. Zico: Efficient GPU memory sharing for concurrent DNN training. In USENIX ATC.
  • Liu et al. (2018) Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2018. Progressive neural architecture search. In ECCV.
  • Ma et al. (2020) Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks. In USENIX OSDI.
  • Mohan et al. (2022) Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, and Vijay Chidambaram. 2022. Looking beyond GPUs for DNN scheduling on multi-tenant clusters. In USENIX OSDI.
  • Munkres (1957) James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics (1957).
  • Narayanan et al. (2020) Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia. 2020. Heterogeneity-aware cluster scheduling policies for deep learning workloads. In USENIX OSDI.
  • Ousterhout et al. (2017) Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, and Scott Shenker. 2017. Monotasks: Architecting for performance clarity in data analytics frameworks. In ACM SOSP.
  • Park et al. (2022) Seo Jin Park, Joshua Fried, Sunghyun Kim, Mohammad Alizadeh, and Adam Belay. 2022. Efficient Strong Scaling Through Burst Parallel Training.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In NIPS.
  • Qiao et al. (2021) Aurick Qiao, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. 2021. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. In USENIX OSDI.
  • Romero et al. (2021) Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. 2021. INFaaS: Automated Model-less Inference Serving. In USENIX ATC.
  • Ruder (2016) Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016).
  • Saufi et al. (2019) Syahril Ramadhan Saufi, Zair Asrar Bin Ahmad, Mohd Salman Leong, and Meng Hee Lim. 2019. Challenges and opportunities of deep learning models for machinery fault detection and diagnosis: A review. IEEE Access (2019).
  • Senior et al. (2020) Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, et al. 2020. Improved protein structure prediction using potentials from deep learning. Nature (2020).
  • Shen et al. (2019) Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. 2019. Nexus: A GPU cluster engine for accelerating DNN-based video analysis. In ACM SOSP.
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In IEEE CVPR.
  • Tan et al. (2019) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In CVPR.
  • Tirmazi et al. (2020) Muhammad Tirmazi, Adam Barker, Nan Deng, Md E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. 2020. Borg: the next generation. In EuroSys.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
  • Verma et al. (2015) Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. 2015. Large-scale cluster management at Google with Borg. In EuroSys.
  • Wang et al. (2021) Guanhua Wang, Kehan Wang, Kenan Jiang, Xiangjun Li, and Ion Stoica. 2021. Wavelet: Efficient DNN training with tick-tock scheduling. In MLSys.
  • Weng et al. (2022) Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters. In USENIX NSDI.
  • Xiang and Kim (2019) Yecheng Xiang and Hyoseung Kim. 2019. Pipelined data-parallel CPU/GPU scheduling for multi-DNN real-time inference. In IEEE RTSS.
  • Xiao et al. (2018) Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. 2018. Gandiva: Introspective cluster scheduling for deep learning. In USENIX OSDI.
  • Xiao et al. (2020) Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic scaling on GPU clusters for deep learning. In USENIX OSDI.
  • Yu and Chowdhury (2019) Peifeng Yu and Mosharaf Chowdhury. 2019. Salus: Fine-grained GPU sharing primitives for deep learning applications. arXiv:1902.04610.
  • Zhang et al. (2020) Quanlu Zhang, Zhenhua Han, Fan Yang, Yuge Zhang, Zhe Liu, Mao Yang, and Lidong Zhou. 2020. Retiarii: A deep learning exploratory-training framework. In USENIX OSDI.
  • Zhao et al. (2019) Wenyi Zhao, Quan Chen, Hao Lin, Jianfeng Zhang, Jingwen Leng, Chao Li, Wenli Zheng, Li Li, and Minyi Guo. 2019. Themis: Predicting and Reining in Application-level Slowdown on Spatial Multitasking GPUs. In IEEE IPDPS.
  • Zhao et al. (2022) Yihao Zhao, Yuanqiang Liu, Yanghua Peng, Yibo Zhu, Xuanzhe Liu, and Xin Jin. 2022. Multi-resource interleaving for deep learning training. In ACM SIGCOMM.