The Road to AI Deployment | Model Selection, Local Deployment, Server Deployment, and Model Conversion, the Full Stack! (Part 3)


A personal suggestion: if you want a model you trained to be easier for others to use, or if you plan to open-source it as a contribution, consider writing a similar [model card]() to introduce your model.
Back to the main topic. The first thing to do once you get a model is, of course, to look at its structure. Inspecting the structure mainly tells you which architecture the model is based on and which framework it was trained with, so you have a rough picture of it and know what to expect.
In general, you want to look at things like the model's structure and the types of ops it uses; an experienced algorithm engineer can usually tell at a glance, from the structure and op types alone, whether a model is going to be a pain to deploy (a quick inspection sketch follows below).
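For example, with a model exported to ONNX, a few lines of Python are enough for a first look. This is a minimal sketch assuming a hypothetical local file `model.onnx`; it prints the input/output shapes and an op-type histogram (rare or custom ops are the usual deployment red flags):

```python
# A minimal sketch, assuming the model has already been exported to ONNX
# and saved as "model.onnx" (a hypothetical path).
from collections import Counter

import onnx

model = onnx.load("model.onnx")

# Inputs/outputs: names and (possibly symbolic) shapes.
for tensor in list(model.graph.input) + list(model.graph.output):
    dims = [d.dim_value or d.dim_param or "?"
            for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)

# Op-type histogram: exotic or custom ops are where deployment usually hurts.
op_counts = Counter(node.op_type for node in model.graph.node)
for op_type, count in op_counts.most_common():
    print(f"{op_type}: {count}")
```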
Evaluation falls into two parts. The first is the model's accuracy metrics. There are plenty of common ones, but I won't list them all here: different tasks are evaluated with different metrics, and you get the idea (a toy example of one such metric follows below).
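Just to ground the idea, here is a purely illustrative example of one such metric, top-1 classification accuracy; detection would use mAP, segmentation mIoU, and so on. The helper below is hypothetical, not from any particular library:

```python
# A toy illustration of one accuracy metric (top-1 classification accuracy).
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """logits: (N, num_classes) scores; labels: (N,) ground-truth class ids."""
    return float((logits.argmax(axis=1) == labels).mean())

logits = np.array([[0.1, 0.9, 0.0, 0.0],   # predicts class 1, label 1: hit
                   [0.8, 0.1, 0.1, 0.0],   # predicts class 0, label 0: hit
                   [0.2, 0.2, 0.5, 0.1]])  # predicts class 2, label 3: miss
labels = np.array([1, 0, 3])
print(top1_accuracy(logits, labels))  # 0.666...
```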
The second part is the model's performance. Once you've confirmed that accuracy meets the requirements, the next step is to look at a few performance metrics. The relevant concepts:
Latency refers to the amount of time it takes for the model to make a prediction after receiving an input. In other words, it's the time between the input arriving and the prediction coming out. Latency is an important metric for real-time applications where fast responses are required, such as in self-driving cars or robotics.

Throughput refers to the number of predictions that a model can make in a given amount of time. It's measured in predictions per second (PPS). Throughput is an important metric for large-scale applications where the model needs to process a large amount of data, such as in image classification or recommendation systems.
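To make these two numbers concrete, here is a minimal, framework-agnostic sketch of how latency percentiles and throughput are usually measured. `infer` and `sample` are hypothetical stand-ins for your model's forward call and a prepared input:

```python
# A minimal sketch of measuring latency and throughput for any Python
# inference callable (ONNX Runtime session, TorchScript module, etc.).
import time
import statistics

def benchmark(infer, sample, warmup: int = 10, runs: int = 100) -> None:
    for _ in range(warmup):          # warm up caches, JIT, GPU clocks
        infer(sample)
    latencies_ms = []
    t_start = time.perf_counter()
    for _ in range(runs):
        t0 = time.perf_counter()
        infer(sample)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    wall = time.perf_counter() - t_start
    latencies_ms.sort()
    print(f"mean latency  : {statistics.mean(latencies_ms):.3f} ms")
    print(f"median latency: {latencies_ms[len(latencies_ms) // 2]:.3f} ms")
    print(f"p99 latency   : {latencies_ms[min(runs - 1, int(runs * 0.99))]:.3f} ms")
    print(f"throughput    : {runs / wall:.1f} inferences/sec")
```

One caveat: with asynchronous GPU frameworks the callable must synchronize (e.g. `torch.cuda.synchronize()` in PyTorch) before each timestamp is taken, otherwise you are timing kernel launches rather than the actual compute.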
Here's a concrete example: the benchmarking output of TensorRT's trtexec tool (produced by a command along the lines of `trtexec --loadEngine=model.plan`; exact flags vary by version):
```
[01/08/2023-10:47:32] [I] Average on 10 runs - GPU latency: 4.45354 ms - Host latency: 5.05334 ms (enqueue 1.61294 ms)
[01/08/2023-10:47:32] [I] Average on 10 runs - GPU latency: 4.46018 ms - Host latency: 5.06121 ms (enqueue 1.61682 ms)
[01/08/2023-10:47:32] [I] Average on 10 runs - GPU latency: 4.47092 ms - Host latency: 5.07136 ms (enqueue 1.61714 ms)
[01/08/2023-10:47:32] [I] Average on 10 runs - GPU latency: 4.48318 ms - Host latency: 5.08337 ms (enqueue 1.61753 ms)
[01/08/2023-10:47:32] [I] Average on 10 runs - GPU latency: 4.49258 ms - Host latency: 5.09268 ms (enqueue 1.61719 ms)
[01/08/2023-10:47:32] [I] Average on 10 runs - GPU latency: 4.50391 ms - Host latency: 5.10193 ms (enqueue 1.43665 ms)
[01/08/2023-10:47:32] [I]
[01/08/2023-10:47:32] [I] === Performance summary ===
[01/08/2023-10:47:32] [I] Throughput: 223.62 qps
[01/08/2023-10:47:32] [I] Latency: min = 5.00714 ms, max = 5.33484 ms, mean = 5.06326 ms, median = 5.05469 ms, percentile(90%) = 5.10596 ms, percentile(95%) = 5.12622 ms, percentile(99%) = 5.32755 ms
[01/08/2023-10:47:32] [I] Enqueue Time: min = 0.324463 ms, max = 1.77274 ms, mean = 1.61379 ms, median = 1.61826 ms, percentile(90%) = 1.64294 ms, percentile(95%) = 1.65076 ms, percentile(99%) = 1.66541 ms
[01/08/2023-10:47:32] [I] H2D Latency: min = 0.569824 ms, max = 0.60498 ms, mean = 0.58749 ms, median = 0.587158 ms, percentile(90%) = 0.591064 ms, percentile(95%) = 0.592346 ms, percentile(99%) = 0.599182 ms
[01/08/2023-10:47:32] [I] GPU Compute Time: min = 4.40759 ms, max = 4.73703 ms, mean = 4.46331 ms, median = 4.45447 ms, percentile(90%) = 4.50464 ms, percentile(95%) = 4.5282 ms, percentile(99%) = 4.72678 ms
[01/08/2023-10:47:32] [I] D2H Latency: min = 0.00585938 ms, max = 0.0175781 ms, mean = 0.0124573 ms, median = 0.0124512 ms, percentile(90%) = 0.013855 ms, percentile(95%) = 0.0141602 ms, percentile(99%) = 0.0152588 ms
[01/08/2023-10:47:32] [I] Total Host Walltime: 3.01404 s
[01/08/2023-10:47:32] [I] Total GPU Compute Time: 3.00827 s
[01/08/2023-10:47:32] [W] * GPU compute time is unstable, with coefficient of variance = 1.11717%.
```
There are also many different scenarios and ways of doing this kind of testing. Besides local offline benchmarking, load testing the cloud-deployed service is needed too. Here's a load-test example against Triton Inference Server (the output below is from Triton's perf_analyzer tool):
```
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 45000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 32518
    Throughput: 200.723 infer/sec
    Avg latency: 4981 usec (standard deviation 204 usec)
    p50 latency: 4952 usec
    p90 latency: 5044 usec
    p95 latency: 5236 usec
    p99 latency: 5441 usec
    Avg HTTP time: 4978 usec (send/recv 193 usec + response wait 4785 usec)
  Server:
    Inference count: 32518
    Execution count: 32518
    Successful request count: 32518
    Avg request latency: 3951 usec (overhead 46 usec + queue 31 usec + compute 3874 usec)
  Composing models:
  centernet, version:
      Inference count: 32518
      Execution count: 32518
      Successful request count: 32518
      Avg request latency: 3512 usec (overhead 14 usec + queue 11 usec + compute input 93 usec + compute infer 3370 usec + compute output 23 usec)
  centernet-postprocess, version:
      Inference count: 32518
      Execution count: 32518
      Successful request count: 32518
      Avg request latency: 96 usec (overhead 15 usec + queue 8 usec + compute input 7 usec + compute infer 63 usec + compute output 2 usec)
  image-preprocess, version:
      Inference count: 32518
      Execution count: 32518
      Successful request count: 32518
      Avg request latency: 340 usec (overhead 14 usec + queue 12 usec + compute input 234 usec + compute infer 79 usec + compute output 0 usec)
  Server Prometheus Metrics:
    Avg GPU Utilization:
      GPU-6d31bfa8-5c82-a4ec-9598-ce41ea72b7d2 : 70.2407%
    Avg GPU Power Usage:
      GPU-6d31bfa8-5c82-a4ec-9598-ce41ea72b7d2 : 264.458 watts
    Max GPU Memory Usage:
      GPU-6d31bfa8-5c82-a4ec-9598-ce41ea72b7d2 : 1945108480 bytes
    Total GPU Memory:
      GPU-6d31bfa8-5c82-a4ec-9598-ce41ea72b7d2 : 10504634368 bytes
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 200.723 infer/sec, latency 4981 usec
```
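For completeness, here's a minimal sketch of what a single client request to such a Triton deployment looks like in Python, using the official `tritonclient` package. The server address, tensor names, and input shape below are hypothetical and must match your model's config.pbtxt; only the model name `centernet` is taken from the log above:

```python
# A minimal sketch of one HTTP inference request to a running Triton server,
# assuming it listens on localhost:8000. Tensor names "input"/"output" and
# the 1x3x512x512 shape are hypothetical placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

image = np.random.rand(1, 3, 512, 512).astype(np.float32)  # dummy NCHW input
inp = httpclient.InferInput("input", list(image.shape), "FP32")
inp.set_data_from_numpy(image)

result = client.infer(model_name="centernet", inputs=[inp])
print(result.as_numpy("output").shape)
```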