Data set and data acquisition
The dataset consisted of 5480 samples in two classes: 2740 chest CT images of patients with confirmed COVID-19 and 2740 images of suspected cases. In the experimental analysis, 4400 images of the dataset were used as training data and 1080 images as test data. It is necessary to mention that slices from the same patient were never split between the training and test sets. The current study was carried out between 28 April 2020 and 3 September 2020. To manage COVID-19, all patients with a respiratory rate over 30 breaths per minute, fever over 37.8 °C, hypoxemia, dyspnea, cardiovascular disease, hypertension, diabetes mellitus, underlying pulmonary disease, or immunodeficiency underwent non-contrast chest CT examinations. In our center, all patients must undergo both a PCR test and CT imaging to clarify COVID-19 status. A physician reviewed the medical records and imaging for screening and diagnosis of COVID-19. All patients whose clinical findings and chest CT findings were compatible with COVID-19 pneumonia were placed in the confirmed COVID-19 group. CT scans and laboratory tests confirmed that some patients had other lung infections. These patients shared some common symptoms with confirmed COVID-19 patients; because an initial diagnosis from CT imaging alone was difficult in these patients, additional laboratory tests were performed. For this reason, we refer to them as suspected COVID-19 cases. Non-contrast chest CT examinations were performed with a 16-slice CT scanner (Somatom Emotion; Siemens Medical Solutions, Forchheim, Germany) with the following protocol: kVp = 110, mAs = 90, slice thickness = 2 mm, matrix size = 512 × 512, voxel size = 0.714 × 0.714 × 2 mm. Chest CT images of patients with suspected and confirmed COVID-19 are shown in Fig. 1. The graphical abstract of the study is displayed in Fig. 2.
CNNs and proposed deep transfer learning models
CNN is a class of deep learning models for data processing and analysis whose design is inspired by the structure of the human visual cortex [33]. A CNN is designed to learn spatial hierarchies of features, from low- to high-level patterns, through a backpropagation algorithm. The typical CNN architecture consists of repeated stacks of convolution and pooling layers followed by one or more fully connected layers [34]. The convolution layer is the essential layer of the CNN model; it is composed of several convolution kernels that slide over the input image with the selected filter to extract different feature maps. The size and number of kernels are two key hyperparameters that define the convolution operation. The kernel size is typically 3 × 3, but sometimes 5 × 5 or 7 × 7. The number of kernels is arbitrary and specifies the depth of the output feature maps. In general, in the convolution layer, each output feature map can be combined with more than one input feature map as follows:
$$ {x}_j^l=f\left(\sum_{i\in {M}_j}{x}_i^{l-1}\ast {k}_{ij}^l+{b}_j^l\right) $$
(1)
Where \( {x}_j^l \) is the output of the current layer, \( {x}_i^{l-1} \) is the output of the previous layer, \( {k}_{ij}^l \) is the kernel for the current layer, and \( {b}_j^l \) are the biases of the current layer. \( M_j \) represents a selection of input maps. The outputs of the convolution are then passed through a nonlinear activation function. The rectified linear unit (ReLU) is the most commonly used nonlinear activation function [35]. It can be defined as:
$$ f\left(x\right)=\max \left(0,x\right) $$
(2)
ReLU works by thresholding values at 0: when x < 0 it outputs 0, and when x ≥ 0 it outputs x, i.e., a linear function of its input.
A pooling layer performs a down-sampling operation, which reduces the dimension of the feature maps, the number of subsequent learnable parameters, and the computational cost. It is necessary to mention that pooling layers contain no learnable parameters, whereas filter size and stride are hyperparameters of pooling operations, similar to convolution operations. The most common type of pooling operation is max pooling, which extracts the maximum value in the input maps and discards all other values. Global average pooling is another pooling operation. It performs downsampling while retaining the depth of the feature maps: each feature map is downsampled into a 1 × 1 array using the average of all its elements. Global average pooling is applied before the fully connected layers [36]. The pooling operation can be formulated as:
$$ {x}_j^l= down\left({x}_j^{l-1}\right) $$
(3)
Where down (.) represents a sub-sampling function.
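To make the convolution, ReLU, and pooling operations above concrete, the following minimal Keras sketch stacks the corresponding layers; the input shape, kernel counts, and pool sizes are illustrative assumptions, not the configuration used in our models.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative stack: convolution (Eq. 1) with ReLU activation (Eq. 2),
# max pooling and global average pooling (Eq. 3).
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),  # 32 kernels of size 3x3
    layers.MaxPooling2D(pool_size=(2, 2)),                     # keeps the maximum in each 2x2 window
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
    layers.GlobalAveragePooling2D(),                           # averages each feature map to a single value
])
model.summary()
```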
The output feature maps of the final convolution layer are typically transformed into a single vector, and each neuron is connected to all the activations of the previous layer. Each convolutional layer l has \( m_1^l \) filters. The output \( {Y}_i^l \) of layer l consists of \( m_1^l \) feature maps of size \( m_2^l \times m_3^l \). The ith feature map, \( {Y}_i^l \), is calculated on the basis of Eq. 4:
$$ {Y}_i^{(l)}=f\left({B}_i^{(l)}+\sum_{j=1}^{m_1^{\left(l-1\right)}}{K}_{i,j}^{(l)}\times {Y}_j^{\left(l-1\right)}\right) $$
(4)
Where \( {B}_i^{(l)} \) denotes the bias matrix and \( {K}_{i,j}^{(l)} \) the filter connecting the jth feature map of layer (l − 1) to the ith feature map of layer l.
If layer (l − 1) is a fully connected layer, the processing of the fully connected layer is given by Eq. 5:
$$ {Y}_i^{(l)}=f\left({Z}_i^{(l)}\right)\ \mathrm{with}\ {Z}_i^{(l)}=\sum_{j=1}^{m_1^{\left(l-1\right)}}{w}_{i,j}^{(l)}\times {Y}_j^{\left(l-1\right)} $$
(5)
An appropriate activation function needs to be selected for each task. The softmax function is an activation function applied to multiclass classification; its outputs can be interpreted as probabilities over the classes, here the two classes “0” and “1” [37].
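As a tiny illustration (the logits below are made-up numbers), the softmax function maps a vector of class scores to probabilities that sum to 1:

```python
import tensorflow as tf

# Toy logits for a two-class problem; softmax converts them to class probabilities.
logits = tf.constant([[2.0, 0.5]])
probs = tf.nn.softmax(logits)
print(probs.numpy())  # two values that sum to 1, interpretable as class probabilities
```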
The DenseNet201, ResNet50, VGG16, and Xception models are considered and described briefly in this section [30]. DenseNet201 includes densely connected CNN layers. In a dense block, the outputs of each layer are connected to all successor layers. Put simply, DenseNet201 is organized with dense connectivity between the layers. The features extracted from the DenseNet201 model form a 1920-dimensional space. ResNet50 is a feedforward network with residual connections containing 50 layers: 49 convolution layers and one fully connected layer. The features extracted from the ResNet50 model form a 2048-dimensional space. The input image size is usually set to 224 × 224 pixels, and the filter size can be selected as 3 × 3 or 5 × 5 pixels. The VGG16 architecture includes two convolutional layers, both using the ReLU activation function, followed by a single max-pooling layer and several fully connected layers that also use a ReLU activation function. In this model, the convolution filter size is 3 × 3 with a stride of 2. The features extracted from the VGG16 model form a 512-dimensional space. Xception, or Extreme Inception, is a linear stack of depthwise separable convolution layers with residual connections. In this model, the 36 convolutional layers are structured into 14 modules, all of which have residual connections around them except for the first and last modules. This architecture does not evaluate spatial and depthwise correlations simultaneously but deals with them independently. The features extracted from the Xception model form a 2048-dimensional space.
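As a sketch of how these backbones can be used as feature extractors with the Keras applications API (the `include_top=False` and input-shape choices here are illustrative assumptions, not necessarily our exact pipeline), each model is loaded without its classification head, and the depth of its final feature maps matches the dimensionalities quoted above:

```python
from tensorflow.keras.applications import DenseNet201, ResNet50, VGG16, Xception

# Load each convolutional base without the ImageNet classification head.
backbones = {
    "DenseNet201": DenseNet201(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
    "ResNet50":    ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
    "VGG16":       VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
    "Xception":    Xception(weights="imagenet", include_top=False, input_shape=(224, 224, 3)),
}

# The channel dimension of the last feature maps (1920, 2048, 512, 2048)
# is the dimensionality of the pooled feature space for each model.
for name, base in backbones.items():
    print(name, base.output_shape[-1])
```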
Machine learning methods
RF is a meta-learner that works by building a large number of decision trees during the training process. The RF method requires only two parameters to create a prediction model: the number of classification trees desired and the number of prediction variables. Simply put, to classify a dataset, a fixed number of randomly selected predictive variables is used, and each sample of the dataset is classified by the defined number of trees [38]. SVM is a method that constructs a decision boundary between two classes and predicts labels using one or more feature vectors. This decision boundary is known as the hyperplane and has a maximum margin separating negative and positive data [39]. The output of an SVM classifier is given in Eq. 6, wherein w and x are the normal vector to the hyperplane and the input vector, respectively.
$$ u=\overrightarrow{w}\cdot \overrightarrow{x}-b $$
(6)
Maximizing the margin can be formulated as an optimization problem: minimize Eq. 7 subject to Eq. 8, where xi is the ith training sample and yi is the correct output of the SVM model for the ith training sample.
$$ \frac{1}{2}{\left\Vert \overrightarrow{w}\right\Vert}^2 $$
(7)
$$ {y}_i\left(\overrightarrow{w}\cdot {\overrightarrow{x}}_i-b\right)\ge 1,\ \forall i $$
(8)
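As a brief illustration of these two classifiers, the following scikit-learn sketch fits an RF and a linear SVM on synthetic two-class data; the data, number of trees, and kernel choice are assumptions for demonstration, not the settings of our experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the extracted deep features.
X, y = make_classification(n_samples=500, n_features=64, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Random forest: its two key parameters are the number of trees and the
# number of candidate prediction variables considered at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

# Linear SVM: finds the maximum-margin hyperplane of Eqs. 6-8.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

print("RF accuracy:", rf.score(X_test, y_test))
print("SVM accuracy:", svm.score(X_test, y_test))
```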
The DT algorithm is a data mining induction method that recursively partitions a dataset of records using a greedy method until all the data items belong to a specific class. The structure of this model consists of a root node, internal nodes, and leaf nodes. The tree structure is used to classify new data records; at each internal node of the tree, the decision about the best split is made using impurity measures [40]. The KNN classifier is a nonparametric classifier that provides good performance for optimal values of k. In the KNN rule, a test sample is assigned to the class most represented among its k nearest training samples, and classification is performed by calculating the distance between the selected features and the k nearest neighbors [29]. The Euclidean distance used to measure the distance between features can be calculated as follows: given two vectors xi and xj, the distance between them is:
$$ D\left({x}_i,{x}_j\right)=\sqrt{\sum_{k=1}^{n}{\left({x}_{ik}-{x}_{jk}\right)}^2} $$
(9)
The LGR model is used when the value of the target variable is categorical, i.e., either 0 or 1. A threshold is usually determined that indicates into which of the two classes a value will be placed [28]. The logistic regression model is as follows:
$$ p=\frac{1}{1+{e}^{-\left({\beta}_0+{\beta}_1{x}_1+\dots +{\beta}_n{x}_n\right)}} $$
(10)
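The remaining three classifiers can be instantiated in scikit-learn in the same way; in the sketch below the data are again synthetic, and k = 5, the Euclidean metric, and the Gini criterion are example settings rather than the values used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=64, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Decision tree: splits are chosen with an impurity measure (Gini by default).
dt = DecisionTreeClassifier(criterion="gini", random_state=0)
# KNN: majority vote among the k nearest neighbors under the Euclidean distance of Eq. 9.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
# Logistic regression: models the class probability with the logistic function of Eq. 10.
lgr = LogisticRegression(max_iter=1000)

for name, clf in [("DT", dt), ("KNN", knn), ("LGR", lgr)]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))
```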
Experimental setup
Inductive transfer learning with the pre-trained CNN models DenseNet201, ResNet50, VGG16, and Xception was used to differentiate COVID-19 patients from suspected cases. In the inductive transfer learning method, the target task is different from the source task, regardless of whether the target and source domains are the same. Therefore, to induce an objective predictive model fT(·) for use in the target domain, some labeled data in the target domain are needed. Based on “what to transfer,” there are different approaches to transfer learning; we used parameter transfer. Parameter transfer assumes that the source and target tasks share some parameters or prior distributions of the models’ hyperparameters. Therefore, by finding the shared parameters or priors, knowledge can be transferred across tasks. This study was conducted in two sections. In the first section, the output of the pre-trained models was used to differentiate patients with confirmed COVID-19 from suspected cases. Before training, we resized all images to a width and height of 224 pixels with 3 channels for faster processing. The structure used for the four models was the same: the last convolutional block + model.output + GlobalAveragePooling2D + Dropout(0.1) + Dense(256, activation="relu") + Dense(2, activation="softmax"). It should be noted that only the last four layers were trained, and the rest of the pre-trained model layers were frozen. Finally, the performance of these models was obtained using the following four criteria:
$$ \mathrm{Accuracy}=\left(\mathrm{TN}+\mathrm{TP}\right)/\left(\mathrm{TN}+\mathrm{TP}+\mathrm{FN}+\mathrm{FP}\right) $$
(11)
$$ \mathrm{Recall}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right) $$
(12)
$$ \mathrm{Precision}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FP}\right) $$
(13)
$$ \mathrm{F}1-\mathrm{Score}=2\times \left(\mathrm{Precision}\times \mathrm{Recall}\right)/\left(\mathrm{Precision}+\mathrm{Recall}\right) $$
(14)
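For reference, these four criteria can be computed directly from confusion-matrix counts; the short sketch below (using scikit-learn's confusion_matrix on made-up label vectors) mirrors Eqs. 11-14.

```python
from sklearn.metrics import confusion_matrix

# Example ground-truth and predicted labels (illustrative only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tn + tp) / (tn + tp + fn + fp)                    # Eq. 11
recall = tp / (tp + fn)                                       # Eq. 12
precision = tp / (tp + fp)                                    # Eq. 13
f1_score = 2 * (precision * recall) / (precision + recall)    # Eq. 14

print(accuracy, recall, precision, f1_score)
```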
TP, FP, TN, and FN represent the number of true positives, false positives, true negatives, and false negatives, respectively. We used the dimensionality reduction method “t-distributed stochastic neighbor embedding (t-SNE)” to visualize the high-dimensional data by giving each data point a location in a two-dimensional map [41]. t-SNE aims to preserve the significant structure of the high-dimensional data so that, put simply, it can be displayed in a scatterplot. Using a gradient descent method, t-SNE minimizes the Kullback-Leibler divergence between a joint probability distribution in the high-dimensional space and a joint probability distribution in the low-dimensional space. The pairwise similarities in the high-dimensional original data are defined as follows:
$$ {p}_{ij}=\frac{p_{j\mid i}+{p}_{i\mid j}}{2n} $$
With conditional probabilities:
$$ {p}_{j\mid i}=\frac{\exp \left(-{\left\Vert {x}_i-{x}_j\right\Vert}^2/2{\sigma}_i^2\right)}{\sum_{k\ne i}\exp \left(-{\left\Vert {x}_i-{x}_k\right\Vert}^2/2{\sigma}_i^2\right)} $$
t-SNE has a tunable parameter, “perplexity,” which determines how to balance attention between local and global aspects of the data. The perplexity is, roughly, a guess of the number of close neighbors each point has. The perplexity value has a complex effect on the resulting picture; it was tuned to 200 for the t-SNE plots presented in our study. We drew t-SNE plots for six different situations, including the original CT images, Conv2-layer10, Conv15-layer56, the GlobalAveragePooling layer, FC layer-layer 1, and FC layer-layer 2.
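A t-SNE embedding of this kind can be produced, for example, with scikit-learn; in the sketch below the feature matrix and labels are synthetic stand-ins, and only the perplexity of 200 reflects the setting reported above.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Synthetic stand-in for the extracted deep features (rows = images).
rng = np.random.default_rng(0)
features = rng.normal(size=(1080, 512))
labels = rng.integers(0, 2, size=1080)  # 0 = suspected, 1 = confirmed (illustrative)

# Two-dimensional embedding; the KL divergence is minimized by gradient descent.
embedding = TSNE(n_components=2, perplexity=200, random_state=0).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5)
plt.title("t-SNE of extracted features (illustrative)")
plt.show()
```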
In the second section, we used ML methods, including RF, SVM, DT, KNN, and LGR, to classify patients. In this approach, we fed the output of the pre-trained models into the ML algorithms and performed classification with them. The structure used to do this is as follows: the last convolutional block + model.output + GlobalAveragePooling2D + predict datasets + ML algorithms. The performance metrics of the ML models were obtained in the same way as for the pre-trained models.
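A minimal sketch of this second pipeline, assuming a Keras applications backbone and a scikit-learn classifier (the dummy image batches, ResNet50 choice, and RF settings are illustrative assumptions):

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.layers import GlobalAveragePooling2D
from tensorflow.keras.models import Model
from sklearn.ensemble import RandomForestClassifier

# Frozen convolutional base followed by global average pooling, as described above.
base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
extractor = Model(inputs=base.input, outputs=GlobalAveragePooling2D()(base.output))

# Dummy image batches standing in for the training and test CT images.
X_train_img = preprocess_input(np.random.rand(32, 224, 224, 3) * 255.0)
X_test_img = preprocess_input(np.random.rand(8, 224, 224, 3) * 255.0)
y_train = np.random.randint(0, 2, size=32)

# "Predict datasets": extract the pooled features, then fit the ML classifier on them.
train_features = extractor.predict(X_train_img, verbose=0)
test_features = extractor.predict(X_test_img, verbose=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_features, y_train)
test_predictions = clf.predict(test_features)
```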
All experiments, including data preprocessing and analysis, were performed on the Google Cloud computing service “Google Colab” (colab.research.google.com) using the Python programming language and the TensorFlow framework. We used the following parameters to compile the pre-trained models: optimizer = "Adam", loss = "categorical crossentropy". For all experiments, the batch size, learning rate, and number of epochs were experimentally set to 64, 0.001, and 100, respectively.
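Putting the pieces together, the training configuration described above (frozen pre-trained base, GlobalAveragePooling2D, Dropout(0.1), Dense(256, "relu"), Dense(2, "softmax"), Adam with learning rate 0.001, categorical cross-entropy, batch size 64, 100 epochs) can be sketched roughly as follows; the placeholder arrays and the DenseNet201 choice are illustrative, and the exact data splits and callbacks of our experiments are not reproduced.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras import layers, models

# Frozen pre-trained base (parameter transfer): only the newly added head is trainable.
base = DenseNet201(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.1),
    layers.Dense(256, activation="relu"),
    layers.Dense(2, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Placeholder arrays; in the study, 4400 training and 1080 test CT images were used.
X_train = np.random.rand(64, 224, 224, 3)
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 2, size=64), num_classes=2)

# 100 epochs as reported in the text; reduce for a quick smoke test.
model.fit(X_train, y_train, batch_size=64, epochs=100, verbose=0)
```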