Enhancing Inference Efficiency in High-Concurrency Recommendation Models through Operator Scheduling
First published: 2025-02-28
Abstract: Recommendation systems play an important role in fields such as e-commerce and digital media. As user bases grow, however, these systems must handle high-concurrency requests under strict latency constraints. The scheduling strategies of existing serving frameworks rely mainly on GPU compute power to accelerate individual requests; they do not optimize the concurrent execution of operators and struggle to distinguish and handle requests with different commercial value. The result is interference between operators, low resource utilization, and higher inference latency. To address these problems, this paper proposes OpStream, an operator scheduling system that improves the inference performance of recommendation models in high-concurrency scenarios. OpStream combines operator analysis with real-time GPU and stream load monitoring and operator priority evaluation to dynamically assign operators to multiple CUDA streams, which improves GPU resource utilization and reduces operator interference. Experiments show that, in complex request scenarios, OpStream significantly reduces inference latency, achieving up to a 62.9% speedup, and lowers inference cost by up to 49.4%.
Keywords: Recommendation System; Operator Scheduling; Parallel Computing
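
The abstract describes assigning operators to multiple CUDA streams according to stream load and operator priority, but does not give the algorithm. As a rough illustration of that general idea only, the following toy sketch places operators in descending priority order onto whichever stream currently has the least estimated work. It is not OpStream's actual method: the `Operator` class, the priority scale, the cost estimates, and `assign_to_streams` are all hypothetical, and streams are modeled as plain integers rather than real CUDA streams.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    priority: int        # hypothetical scale: larger means more urgent
    est_cost_ms: float   # hypothetical estimate of kernel run time


def assign_to_streams(ops, num_streams):
    """Greedy placement: highest priority first, each onto the least-loaded stream.

    Returns a dict mapping stream id -> list of operator names in issue order.
    """
    # Min-heap of (accumulated estimated load, stream id).
    loads = [(0.0, sid) for sid in range(num_streams)]
    heapq.heapify(loads)
    plan = {sid: [] for sid in range(num_streams)}

    # sorted() is stable, so equal-priority operators keep their input order.
    for op in sorted(ops, key=lambda o: -o.priority):
        load, sid = heapq.heappop(loads)
        plan[sid].append(op.name)
        heapq.heappush(loads, (load + op.est_cost_ms, sid))
    return plan


# Illustrative operators; names and numbers are made up for the example.
ops = [
    Operator("embedding_lookup", priority=3, est_cost_ms=4.0),
    Operator("feature_cross", priority=2, est_cost_ms=2.0),
    Operator("mlp_tower", priority=3, est_cost_ms=3.0),
    Operator("logging_side_task", priority=1, est_cost_ms=1.0),
]
plan = assign_to_streams(ops, num_streams=2)
print(plan)
```

In a real serving system the static cost estimates would presumably be replaced by measured stream load, and each stream id would correspond to an actual CUDA stream; both of those details are outside what the abstract specifies.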