```tex
%\documentclass{article}
\documentclass{dianzixuebao}
\newcommand{\MDIyear}{xxxx}%年
\newcommand{\MDImonth}{xx}%月
\newcommand{\MDIissuevolume}{xx}%卷
\newcommand{\MDIissuenumber}{xx}%期
\newcommand{\MDIshorttitle}{Fers}
\usepackage{newfloat,caption}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage[svgnames]{xcolor}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{booktabs} % formal table rules
\usepackage{tabularx}
\usepackage{array}
\usepackage{amssymb}
\usepackage{textcomp}
\usepackage[misc,geometry]{ifsym} % small envelope superscript after a name to mark the corresponding author
\begin{document}
\begin{multicols}{2}
To compare the image throughput of data parallel training and inference of the DNN models on the experimental cluster with that obtained on the CUDA-enabled GPU workstation, CUDA-accelerated and cuDNN-accelerated implementations of the same models are also evaluated.
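For reference, the two GPU baselines correspond to the standard build switches of the open-source Darknet framework; a minimal sketch of the assumed build configuration is shown below (compiler and library paths are installation-specific):
\begin{verbatim}
# Darknet Makefile switches assumed for the two GPU baselines (sketch)
GPU=1      # enable the CUDA code path
CUDNN=1    # additionally route convolutions through cuDNN
\end{verbatim}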
For comparison, Figure \ref{fig:clusterthr} shows the image throughput of data parallel training and inference of YOLOv3, ResNet-152 and DenseNet-201 on the experimental ARMv8 CPU cluster and on the GPU workstation. \texttt{Train\_FTCL} and \texttt{Inference\_FTCL} denote the image throughput achieved with the proposed FTCL-Darknet framework on the experimental many-core CPU cluster for the training and inference of the DNN models, respectively. \texttt{Train\_CUDA\_1080Ti} and \texttt{Inference\_CUDA\_1080Ti} denote the image throughput obtained with the CUDA-accelerated Darknet on the GPU workstation without cuDNN, while \texttt{Train\_CUDNN\_1080Ti} and \texttt{Inference\_CUDNN\_1080Ti} denote the image throughput achieved with the cuDNN-accelerated Darknet on the GPU workstation.
The data parallel training throughput of YOLOv3, ResNet-152 and DenseNet-201 on the experimental ARMv8 many-core CPU cluster reaches 1.3, 2.5 and 2.8 images/s, respectively. On average, this amounts to about 16.1\% of the training throughput obtained with the CUDA-accelerated Darknet on the GPU workstation, and to approximately 3.8\%, 7.9\% and 7.4\% of the training throughput achieved with the cuDNN-accelerated Darknet, respectively.
Correspondingly, the data parallel inference throughput reaches 7.1, 6.2 and 5.9 images/s, respectively. On average, this amounts to about 17.6\% of the inference throughput obtained with the CUDA-accelerated Darknet, and to approximately 14.3\%, 16.1\% and 15.3\% of the inference throughput achieved with the cuDNN-accelerated Darknet, respectively.
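For clarity, the averaged relative figures quoted above correspond to the arithmetic mean of the per-model throughput ratios; a minimal sketch of the computation, with the symbols $T$ and $\bar{R}$ introduced here purely for illustration, is
\[
\bar{R} = \frac{1}{3}\sum_{m}\frac{T_{m}^{\mathrm{FTCL}}}{T_{m}^{\mathrm{GPU}}},
\]
where $T_{m}^{\mathrm{FTCL}}$ and $T_{m}^{\mathrm{GPU}}$ denote the throughput of model $m$ (one of the three DNNs) on the cluster and on the CUDA-enabled GPU workstation, respectively; this yields $\bar{R}\approx 16.1\%$ for training and $\bar{R}\approx 17.6\%$ for inference.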
\end{multicols}
\astable{
\astabletitle{\bf Table 1.\ DNN models and datasets}
\astableobj{
\setlength{\tabcolsep}{3mm}%
\begin{tabular}{l c c c l}
\toprule
DNN model &Input size &Batch size &Conv.\ layers/total layers &Dataset \\
\midrule
YOLOv3 \cite{Redmon2018p} &$416\times416$ &64 &75/107 &MS COCO2014\\
ResNet-152 \cite{He20162ICoCVaPRC770} &$256\times256$ &256 &152/206 &ImageNet2012\\
DenseNet-201 \cite{Huang2017ICoCVaPRC2261} &$256\times256$ &256 &201/305 &ImageNet2012\\
\bottomrule
\end{tabular}%
}
}
\end{document}
```