Classification of chest radiographs using general purpose cloud-based automated machine learning: pilot study

Widespread implementation of machine learning models in diagnostic imaging is restricted by a dearth of expertise and resources. General purpose automated machine learning offers a possible solution. This study aims to provide a proof of concept that a general purpose automated machine learning platform can be used to train a CNN to classify chest radiographs. In a retrospective study, more than 2000 postero-anterior chest radiographs were assessed for quality, contrast, position, and pathology. A selected dataset of 637 radiographs was used to train a CNN on a reinforcement learning-based automated machine learning platform. Accuracy metrics for each label were calculated, and model performance was compared with previous studies. The auPRC (area under the precision-recall curve) was 0.616. The model achieved a precision of 70.8% and recall of 60.7% (P > 0.05) for detection of “Normal” radiographs. Detection of “Pathology” had a precision of 75.6% and recall of 75.6% (P > 0.05). The F1 scores were 0.65 and 0.75, respectively. Automated machine learning platforms may provide a viable alternative to developing custom CNN models for classification of chest radiographs, although the accuracy achieved was lower than that of a comparable traditionally developed neural network model.


Current scenario
Until recently, the approach to developing a CAD system to extract meaningful features and infer a diagnosis relied heavily on rule-based algorithms [1]. These were hand-crafted sets of definitions used by the computing system to detect abnormalities. The accuracy achieved by these systems was poor, and they remained enhanced visualization tools rather than independent diagnostic aids [1].
Advances in deep learning algorithms have since surpassed traditional rule-based algorithms in accuracy [2]. Multiple deep learning algorithms have even surpassed human performance in sorting natural images [3,4]. This has led to interest in applying deep learning to diagnostic imaging, and multiple studies have examined its application to the interpretation of chest radiographs. Litjens et al. produced an extensive survey of deep learning studies on medical image datasets; it recorded 12 studies undertaken until then on applying deep learning techniques to chest radiographs to aid diagnosis [5].

Related works
The first published attempt at applying machine learning to this problem was by Lo et al., who designed a two-layer convolutional neural network (CNN) to identify true pulmonary nodules on chest radiographs. The model was trained to differentiate true nodules from end-on vessels and rib overlap artifacts mimicking pulmonary nodules [6].
Anavi et al. created an image retrieval system that would rank the dataset images according to similarity with the query image. The model combined a pre-trained CNN with a support vector machine (SVM). The classification-based model achieved a recall (reported only for the top 30 retrieved images) of 0.310 for left-sided pleural effusion, 0.182 for left-sided consolidation, and 0.103 for identification of a healthy chest [7,8].
Multiple studies have focused on taking CNN models previously trained on natural images and applying them to the classification of chest x-rays, in an attempt to reduce complexity, cost, and time. Bar et al., in a novel experiment, used a pre-trained CNN together with low-level features to detect lung pathologies; the sensitivity of the model ranged from 0.80 to 0.89 and the specificity from 0.79 to 0.87 [9,10]. Cicero et al. trained and validated a GoogLeNet CNN model on a large dataset of over 30,000 frontal chest radiographs to detect common lung pathologies. They achieved a ROC AUC of 0.964 with a sensitivity and specificity of 91% for identification of a normal chest [11]. They also achieved a remarkable level of accuracy for detection of common lung pathologies, with AUC ranging from 0.850 to 0.962, sensitivity of 74 to 91%, and specificity of 75 to 91% [11]. Hwang et al. designed a pre-trained, fine-tuned six-layer CNN that processed complete chest radiographs and was trained to detect pulmonary tuberculosis; the average ROC AUC of the model was 0.816 without transfer learning and improved to 0.964 with it [12]. Wang et al. used an ImageNet pre-trained CNN along with hand-crafted feature selection to identify pulmonary nodules on chest radiographs, achieving a sensitivity of 69.27% and specificity of 97.02% [13].
Some studies have attempted to overcome the exponential rise in algorithm complexity that accompanies the quest for greater accuracy by deploying multiple trained machine learning models working synergistically. Shin et al. produced a highly sophisticated model in which a CNN predicted the lung pathology and a recurrent neural tensor network provided short captions for annotation; the model was trained on a large dataset of 7000 images [14]. Islam et al. trained and tested multiple deep learning models to detect lung pathologies on frontal chest radiographs. Their experiments showed that different models excelled at detecting single specific pathologies and that using ensembles of models improved overall performance [1]. Wang et al. created the first standardized public dataset of chest radiographs to establish a benchmark against which deep network performance could be assessed, and tested several standard CNN models against it; the average ROC AUC of the best performing model was 0.738 [15]. Yao et al. subsequently reported improved results on the same dataset [18].
There has been simultaneous development of deep neural network models that aid in other challenging aspects of chest radiograph interpretation. Kim and Hwang developed an ML framework to detect tuberculosis by projecting heat maps on the suspicious areas of chest radiographs [19]. Rajkomar et al. created a pretrained DCNN model to sort chest radiographs into anteroposterior and lateral views [20]. Yang et al. used a cascading set of CNNs to detect and suppress bone in standard chest radiographs, rendering a clear view of the pulmonary and cardiac soft tissue shadows [21].

Problem statement
While the potential of neural networks to transform diagnostic imaging is clear, real-world application of such research remains greatly hindered by the prohibitive cost of running the multiple graphics processing units (GPUs) needed to train such networks. While the cost of this technology is falling, it still remains in the range of $100,000 [2]. An additional obstacle to accessing and applying machine learning techniques in radiology is the expertise required for hyperparameter tuning, data augmentation, and related tasks. Development also remains time-consuming, as the complexity of a machine learning model increases combinatorially and demands considerable experimentation even from those with machine learning expertise. Recent advances have attempted to automate the design of machine learning models using evolutionary and reinforcement learning algorithms [22,23]. Multiple proprietary Application Programming Interfaces (APIs) for automated machine learning based on reinforcement learning are now available. These offer the ability to train a neural network at a fraction of the cost and time of traditional approaches, which is critical for democratizing access to machine learning and universalizing the use of this technology. It is especially important because the potential benefits of such research (reducing the cost of diagnostic imaging, streamlining workflows, and extending diagnostic imaging to the community level) are needed most in resource-limited developing communities.
This pilot study aimed to provide a proof of concept that general purpose automated machine learning platforms such as Google AutoML Vision can be used to train a neural network to diagnose and categorize chest radiographs in a real-world setting.

Dataset creation
A pool of over 2000 postero-anterior view chest radiographs from the out-patient and in-patient departments, acquired on different computed radiography and digital radiography systems, was assessed for quality, level of penetration, positioning, and contrast. Those with very poor quality, low contrast, or unsatisfactory positioning were rejected. However, chest radiographs with minor imperfections in breath-holding, positioning, or contrast, deemed reportable by the radiologist, were included. Chest radiographs with artifacts from clothing, jewelry, and implantable medical devices were included to mirror real-world variation.

Image processing
The resultant dataset of 637 images was then converted from proprietary file types into the Joint Photographic Experts Group (JPEG) format with a 1024 × 1024 matrix size and 96 dpi vertical and horizontal resolution, encoded using baseline DCT Huffman coding. The bit depth (bits per sample) was set at 8 bits with chroma subsampling of Y'CbCr = 4:2:0. The dataset was de-identified and compliant with the Health Insurance Portability and Accountability Act. No data augmentation procedures were performed.
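The bit-depth reduction step can be sketched as a simple linear rescaling (an illustrative stdlib-only sketch; the source bit depth of 12 is an assumption, and the commented Pillow call showing the remaining JPEG parameters is hypothetical, not the study's actual pipeline):

```python
def to_8bit(pixels, src_bits=12):
    """Linearly rescale raw detector values (here assumed 12-bit, a
    common CR/DR output depth) to the 8-bit range used for the JPEG
    training images."""
    max_in = (1 << src_bits) - 1
    return [round(p * 255 / max_in) for p in pixels]

# With Pillow, the remaining JPEG parameters described above would map
# roughly to (hypothetical call, not from the study):
# img.resize((1024, 1024)).save("out.jpg", "JPEG",
#                               subsampling=2,   # Y'CbCr 4:2:0
#                               dpi=(96, 96))
```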

Model implementation
The dataset was uploaded onto the Google Cloud Platform (Google LLC, Menlo Park, CA, USA) and processed using Cloud AutoML Vision Beta (release date: July 24, 2018). Multiple labels were created for classifying different pathologies and image characteristics (Fig. 1).
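Cloud AutoML Vision ingests labeled images through a CSV index of Cloud Storage URIs, one `uri,label` row per image. A minimal sketch of building such an index (the bucket path and filenames below are hypothetical, not the study's actual storage layout):

```python
import csv
import io

def build_import_index(rows):
    """Build the CSV index text that AutoML Vision ingests:
    one 'gs://...,label' row per training image."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for uri, label in rows:
        writer.writerow([uri, label])
    return buf.getvalue()
```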

Dataset characteristics
The dataset contained 637 postero-anterior view chest radiographs, of which 332 (52.1%) showed some pathology. The dataset had a mild male predominance (57.8%), with an average age of 26.5 years. Each image was assessed subjectively for quality and marked either satisfactory or poor; 82.1% of the images were of satisfactory quality, but the dataset also contained 17.9% radiographs of poor quality still deemed reportable by the radiologist. 47.6% of the dataset contained some form of artifact from clothing, jewelry, or implantable devices such as pacemakers. The images were also assessed for positioning of the subject, revealing 25.9% to have some degree of rotation, which could produce artifactual findings such as apparent cardiomegaly and prominence of the hila. Forty-three of the 637 radiographs were found to have been acquired in mid-inspiration. These imperfect images were introduced into the dataset to reduce overfitting of the model to the training set and improve its real-world applicability. The images with pathology were sub-classified and labeled into 9 different categories (Fig. 2). The pathologies were also assessed for subjective conspicuity. Each lung field was divided into three lung zones: upper, middle, and lower. A pathology occupying half or more of a zone was deemed "Apparent." If the pathology occupied less than half but more than 25% of the lung zone, it was marked "Conspicuous." Lesions occupying less than 25% of a lung zone were termed "Subtle." The distributions of the lesions are shown in Fig. 2.
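The conspicuity rubric above can be expressed as a simple grading function (an illustrative sketch, not the annotation tooling used in the study; the behavior at exactly 25% is an assumption, as the rubric leaves that boundary unspecified):

```python
def grade_conspicuity(fraction_of_zone):
    """Grade a lesion by the fraction of its lung zone it occupies.
    >= 50%: Apparent; >25% and <50%: Conspicuous; otherwise Subtle
    (exactly 25% is mapped to Subtle here by assumption)."""
    if fraction_of_zone >= 0.5:
        return "Apparent"
    if fraction_of_zone > 0.25:
        return "Conspicuous"
    return "Subtle"
```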

Accuracy metrics
The precision (positive predictive value) across all labels was 65.7%, with a recall (sensitivity) of 40.1%. The auPRC (area under the precision-recall curve, or average precision) of the model was 0.616 (Fig. 3). The precision and recall for each category are summarized in Table 1. The F1 score for classification was 0.65 for the "Normal" category and 0.75 for the "Pathology" category. Further evaluation statistics for both categories are summarized in Tables 2 and 3, respectively.
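The reported F1 scores follow directly from the precision/recall pairs as their harmonic mean:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# "Normal":    precision 0.708, recall 0.607 -> F1 ~ 0.65
# "Pathology": precision 0.756, recall 0.756 -> F1 ~ 0.75
```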

Unmet needs
While there has been considerable interest in applying convolutional neural networks and other forms of machine learning to classify chest radiographs into various pathologies, the underlying technology used in all these studies remains exclusionary [5, 24–26]. These studies either constructed and trained machine learning models de novo or worked with pretrained CNNs such as AlexNet and GoogLeNet [3,27]. Though these methods yielded high-accuracy models that could classify chest pathologies, they were built on systems requiring a high level of expertise as well as prohibitively costly infrastructure. This has led to a data-algorithm divide: the predictive accuracy of an algorithm is strictly contingent on the dataset it is trained on (Fig. 4), but a large number of institutions in resource-limited settings may not have access to such expertise or infrastructure.

Proposed solution
In this study, we explored the possibility of repurposing general purpose automated machine learning models to classify diagnostic images, in particular chest radiographs. The platform used was Cloud AutoML Vision, which circumvents the large investment of time and expertise needed to craft a neural network by using reinforcement learning [23]. A "controller" recurrent network creates variable-length strings, which act as templates for the construction of "child" convolutional neural networks. These "child" networks are trained on the dataset and then evaluated for accuracy. The accuracy metric serves as positive reinforcement for the "controller" network, so in subsequent iterations "child" networks with higher accuracy are favored. This is repeated until the single most accurate "child" network is obtained.
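The controller/child loop described above can be caricatured in a few lines (a toy sketch only: the real AutoML controller is a recurrent network trained with policy gradients over full architecture strings, whereas this version merely reweights per-slot choices by reward):

```python
import random

def toy_architecture_search(evaluate, search_space, rounds=20, pop=8, seed=0):
    """Toy reinforcement-style search. Each 'child' architecture is a
    tuple of choices, one per slot in search_space; its accuracy is fed
    back as a reward that biases future sampling toward the components
    of high-scoring children."""
    rng = random.Random(seed)
    # Per-slot preference weights over the available choices.
    weights = [{choice: 1.0 for choice in slot} for slot in search_space]
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        for _ in range(pop):
            # The "controller" samples a child architecture slot by slot.
            child = tuple(
                rng.choices(list(w), weights=list(w.values()))[0]
                for w in weights
            )
            score = evaluate(child)  # child accuracy acts as the reward
            for slot, choice in enumerate(child):
                weights[slot][choice] += score  # positive reinforcement
            if score > best_score:
                best, best_score = child, score
    return best, best_score
```

With a mock evaluate function, the loop comes to favor the highest-reward combination of choices, mirroring how the controller learns to favor accurate child networks.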

Model accuracy
The accuracy metrics of our trained model were, as expected, lower than those of dedicated CNNs. The model had very poor sensitivity for sub-classification of pathology. However, the overall accuracy achieved for detection of pathology in chest radiographs was 74.57%. The accuracy parameters of the model are compared with two studies using comparable machine learning models in Table 4. Our model, DeepDx, achieved accuracy comparable to the model of Bar et al., even surpassing their precision by almost 25%. This is substantial progress, especially when viewed against the highly specialized fusion model (two separate deep learning baseline descriptors combined with a GIST descriptor) created by Bar et al. [10]. The comparison also shows that Cicero et al. achieved a much higher overall accuracy, but that success can be attributed at least in part to the large dataset on which their model was trained [11].

Justification
As per the documentation released with Cloud AutoML Vision (Google LLC, Menlo Park, CA, USA), which we utilized in this study, the minimum recommended number of examples per label is 100, and approximately 1000 examples are advised for accurate prediction. In our model, the three categories with examples above the minimum recommended number did provide good accuracy, and with a targeted increase in the dataset in subsequent iterations, the overall model accuracy is likely to improve further. The recommended dataset size may not always be feasible in medical imaging, as rarity is often a feature of diseases with serious implications, and the time required to accrue enough examples may impede progress. This problem is usually circumvented by data augmentation. However, the application of techniques such as horizontal flipping, cropping, rotation, and padding to chest radiographs, and their effect on diagnostic accuracy, has not been validated; a horizontally flipped radiograph, for instance, mimics situs inversus. Similarly, training machine learning models on rotated radiographs may lead the algorithm to assign undue importance to irrelevant components of the image. Moreover, many disease processes are defined by their orientation, such as cephalization of vessels in congestive cardiac failure (CCF), which may be lost during augmentation. Accuracy of the model is also likely to gain from changing the labeling structure of the dataset. In our study, we trained the algorithm to diagnose "Normal" and "Pathology" not as binary alternatives but as distinct classification categories. This was done with real-world application in mind, as many radiographs do not fit neatly into either an apparently normal or a diseased category. Many radiographs have suspicious features which should not be classified as disease and may require consensus reporting by radiologists. Another advantage of detecting the two categories separately was that it yielded statistics comparable with a larger number of studies, as most have been trained to classify one of the two categories.
The downside of this labeling structure was that it added to the complexity of the model and thus probably reduced its accuracy. In further studies, the model could be trained to detect only "Pathology," with "Normal" processed as a default class. The sensitivity of the "Pathology" label should be increased so that the model commits false positives and catches indeterminate cases rather than labeling them "Normal" (Fig. 5). This will entail human intervention to sort through and weed out the false positives, but will improve overall accuracy.
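The suggested operating-point shift can be illustrated with a hand-made set of prediction scores (the scores and labels below are invented for illustration, not model outputs): lowering the decision threshold for "Pathology" raises recall at the cost of precision.

```python
def recall_precision_at(scores, labels, threshold):
    """Recall and precision for the positive class when every case
    scoring >= threshold is called 'Pathology'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Invented example: lowering the threshold catches the indeterminate
# positive scored 0.4 (recall rises) while admitting the negative
# scored 0.6 as a false positive (precision falls).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
```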

Reflections
The study has highlighted certain definite advantages of using automated machine learning to develop diagnostic classification models. The method reduces infrastructure requirements and cost to a fraction. The ease of use, through graphical user interfaces, enables implementation and fine-tuning without cumbersome coding. The reinforcement learning-based approach greatly reduces the time required to develop complex CNN architectures. Importantly, such platforms provide the scalability to improve upon a model and add further complexity to the classifier.

Future implications
Further work needs to be done with larger datasets of diagnostic images to ascertain the maximal overall accuracy achievable. Multiple platforms now provide similar tools, and they should be evaluated in a controlled trial for unbiased comparison. Data augmentation procedures should also be validated for use with medical imaging, particularly radiological images. Lastly, most studies attempting to classify chest radiographs have dealt with post-processed, compressed images converted to non-native file types such as JPEG and PNG [11,12,17,18]. This conversion may lead to loss of important image characteristics, and attempts should be made to use DICOM files for future training of algorithms.

Conclusion
Computer vision is revolutionizing the field of diagnostic imaging, but its resource-intensive nature may preclude wider implementation and acceptance. This study presented an alternative to the traditional machine learning infrastructure and investigated the use of commercially available, general purpose, cloud-based automated machine learning for detection of pathologies on standard postero-anterior chest radiographs. The study found automated machine learning to be a viable alternative to human-designed diagnostic convolutional neural networks. The accuracy of the model developed was modest in comparison with standard deep learning models; however, restructuring the classifiers and enlarging the training dataset hold promise of greater accuracy. Further multi-platform studies with larger datasets are required to fully explore its potential.
While machine learning promises vast improvements in the speed and accuracy of pathology detection across imaging modalities, greater research focus needs to be directed toward ensuring that this novel technology is used to bridge the health-wealth gap, not widen it.