Circulating microRNAs (miRNAs) have potential as early diagnostic biomarkers for invasive epithelial ovarian cancer. By combining two independent patient cohorts using a randomized stratification method, we were able to create training and testing sets with sufficient power to construct discriminatory models for the diagnosis of invasive epithelial ovarian cancer. Using this strategy, we found that machine learning using a neural network appears to be the best approach for analyzing this type of data because it is not biased by batch effects, data overfitting, or patient heterogeneity. In addition, this method elucidated relationships between analytes that would be missed by pairwise or linear correlation methods. The extent to which it offers a lead time advantage over other putative biomarkers remains to be proven; however, the strategy appears reliable, cost-effective, and potentially scalable to the general population.
This supplementary software allows you to use our neural networks to predict a diagnosis of invasive epithelial ovarian cancer (vs. controls and borderline) from the levels of selected miRNAs. The model scoring system has 4 modules, one for each type of data you may want to provide as input. This software is part of the paper entitled 'Diagnostic potential for a serum miRNA neural network for detection of ovarian cancer', published in eLife 2017;6:e28932.
Module: miRNA sequencing data
Using next generation sequencing technology to profile all serum miRNA transcripts in 179 patient samples, we identified 192 miRNA species that could be reliably detected in human sera. We used these data to evaluate 11 independent machine learning algorithms and to select the modeling approach that could best discriminate between women with and without invasive epithelial ovarian cancer. The neural network approach presented here, based on a subset of 14 miRNAs, had the best performance characteristics, with an area under the curve (AUC) of 0.90 (95% CI: 0.81-0.99). This model outperformed CA-125 and, unlike other modeling approaches, was insensitive to clinical heterogeneity or batch effects.
This network requires miRNA expression levels in transcripts per million (TPM), log-transformed before input: input = log10(tpm), where tpm is transcripts per million. Please refer to the full text of the article for the interpretation of results.
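A minimal sketch of the input transformation described above, in Python. The function name `log10_tpm` and the example miRNA names and values are hypothetical. How undetected miRNAs (a TPM of zero) should be handled is not stated here, so this sketch rejects non-positive values rather than guessing.

```python
import math


def log10_tpm(tpm_by_mirna):
    """Return log10(TPM) for each miRNA in a {name: tpm} mapping."""
    transformed = {}
    for name, tpm in tpm_by_mirna.items():
        if tpm <= 0:
            # log10 is undefined for zero or negative values. How to treat
            # undetected miRNAs is left to the user (assumption).
            raise ValueError(f"TPM for {name} must be positive, got {tpm}")
        transformed[name] = math.log10(tpm)
    return transformed


# Hypothetical miRNA names and TPM values, for illustration only.
example = {"miR-A": 1000.0, "miR-B": 10.0}
inputs = log10_tpm(example)
```

The resulting values, not the raw TPM counts, are what should be entered into the sequencing module.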
Module: microarray data (GEO GSE94533)
After demonstrating that the selected set of 14 microRNAs is sufficient to produce effective classifiers, we tested both the miRNA set and the neural network framework on an independent dataset, GSE31568 (Keller A et al.). The best neural network in terms of highest performance and lowest complexity had 4 neurons in the hidden layer. This neural network perfectly classified patients in the training set (AUC 1.00, 95% CI 1.00-1.00) and provided very good discriminatory power on the testing set (AUC 0.93, 95% CI 0.81-1.00). The neural network incorrectly classified one cancer sample as a control in the testing set, which resulted in a sensitivity of 75% and a specificity of 100%. This step assured us that the selected 14 miRNAs do in fact contain all the information needed for efficient diagnosis of ovarian cancer cases and that a properly trained neural network may use this information for near-perfect performance regardless of input data range or expression quantification algorithms.
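The sensitivity and specificity quoted above follow from confusion-matrix counts, and the Python sketch below shows the arithmetic. If the 75% figure is exact, one missed cancer implies four cancer samples in the testing set (3 of 4 detected). The number of controls used here (10) is a placeholder assumption, since a specificity of 100% only requires that no control was misclassified.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of cancer samples correctly called cancer."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of control samples correctly called control."""
    return true_neg / (true_neg + false_pos)


# 3 of 4 cancers detected; the control count of 10 is a placeholder.
sens = sensitivity(true_pos=3, false_neg=1)
spec = specificity(true_neg=10, false_pos=0)
print(f"sensitivity={sens:.0%}, specificity={spec:.0%}")
```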
Module: quantitative PCR
Having established that the 14-miRNA signature was sufficient to discriminate ovarian cancers, we attempted to calibrate a qPCR-based classifier using a neural network tailored to this quantification method. This produced a ROC AUC of 1.00 (95% CI 1.00-1.00) on the training set and 0.85 (95% CI 0.71-0.99) on the testing set.
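A ROC AUC such as those quoted above can be read as the probability that a randomly chosen cancer sample receives a higher network score than a randomly chosen control. The Python sketch below computes it by direct pairwise comparison. The scores and labels are made up for illustration, and the sketch does not reproduce the confidence intervals.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (cancer, control) pairs ranked correctly.

    labels: 1 for cancer, 0 for control. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up network outputs: three cancers followed by three controls.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
```

In this toy example one cancer sample scores below one control, so 8 of the 9 cancer-control pairs are ordered correctly.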
Module: quantitative PCR (reduced set)
As qPCR has a lower sensitivity than sequencing, the values of some miRNAs were undetectable in some samples using this technique. We considered whether this might account for the lower AUC seen on the testing set using qPCR inputs rather than sequencing data. To minimize missing datapoints, we performed a global sensitivity analysis on the best neural network for qPCR data and iteratively removed the variables that contributed least to the classifier’s performance. Ultimately, we were able to reduce the neural network inputs to 7 miRNAs (miR-92a, miR-450b, miR-335, miR-29a, miR-1307, miR-320c and miR-200c) and 4 normalizers.
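The iterative removal described above resembles backward elimination. The Python sketch below illustrates that general idea only; it is not the sensitivity analysis used for the published network. The function `backward_eliminate`, the feature names, and the toy scoring function (a stand-in for "retrain the network and measure its performance") are all hypothetical.

```python
def backward_eliminate(features, score, target_size):
    """Repeatedly drop the feature whose removal hurts the score least.

    score: callable taking a list of feature names and returning a number
    where higher is better (hypothetical stand-in for model performance).
    """
    kept = list(features)
    while len(kept) > target_size:
        # Score every subset that omits exactly one of the kept features.
        candidates = [
            (score([f for f in kept if f != drop]), drop) for drop in kept
        ]
        # The feature whose absence leaves the highest score is least useful.
        _, least_useful = max(candidates, key=lambda c: c[0])
        kept.remove(least_useful)
    return kept


# Toy contributions per feature; real scores would come from a trained model.
toy_importance = {"a": 0.30, "b": 0.05, "c": 0.20, "d": 0.01}


def toy_score(subset):
    return sum(toy_importance[f] for f in subset)


selected = backward_eliminate(list(toy_importance), toy_score, target_size=2)
```

With these toy contributions the two weakest features are removed first, leaving the two strongest.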