Can we determine clinical applicability?
This year’s Data Science Bowl (DSB) was intended to catalyze a change in cardiac diagnostics, so connecting the competition participants and the medical community is an essential part of the DSB. I have done some preliminary analysis of the top 4 teams’ submissions. The goal is to present the results in terms that are meaningful to the medical research community, and in doing so I hope to spark a dialog between the communities.
DSB competitors produced diastolic and systolic volume cumulative probability distribution functions (CDFs) for each heart. Researchers, clinicians, and technicians, however, will prefer an algorithm that outputs volumes, an ejection fraction, and a precision estimate for those predictions. Obtaining those quantities from the submission files requires some level of approximation. I will describe what was done for this post, noting where I made simplifying assumptions. The plots and visualizations presented here are valuable for understanding the significance of the results. Many of the plots were chosen because they appear in prior medical research on the reliability of cardiac MRI.
Please feel free to comment on, add to, and critique the approach outlined in this post. You can find the solutions used to score the Private Leaderboard here. Dr. Andrew Arai of NIH wrote a forum post, giving a medical perspective on these results. For the script that generated these plots, you can check out the Kaggle forums here.
Calculating a Prediction
The DSB challenge was to use the images from a heart study to generate two cumulative probability distribution functions (CDFs): one for the end-systolic volume (SV) and one for the end-diastolic volume (DV). Those volumes are the left ventricle sizes at the end of contraction and the end of filling, respectively. There are 600 values in each row, giving the predicted probability that the true volume of the chamber in mL is less than the value index (i.e., if the 34th value in a row is 0.55, then the model is predicting that there is a 55% chance that the volume is less than 34 mL). So, subtracting adjacent values gives the predicted probability that the volume falls between the value indices (i.e., if the 35th value is 60% and the 34th value is 55%, then the predicted probability that the volume is between 34 mL and 35 mL is 5%). Doing this subtraction for each consecutive value in a row generates a discretized probability density function (PDF):

P(v) = CDF(v + 1) − CDF(v)
where the value P(v) can be interpreted as the predicted approximate probability per mL that the true volume is v. For me, the PDF is a more intuitive way to look at the model submissions. A volume that the model is predicting as likely for a heart will show up as a peak in the PDF. I am using the expectation value and the standard deviation of the PDF as the single-value volume prediction and the uncertainty estimate for that prediction, respectively. Summing over the bins in the PDF, the prediction and uncertainty estimate are:

V_pred = Σ_v v P(v)    and    σ = sqrt( Σ_v (v − V_pred)² P(v) )
The entries are already normalized, and the width of each bin is 1 mL, so I am not explicitly writing out the bin width or the normalization for these averages. Note that the PDF necessarily has one fewer bin than the CDF. This may look confusing depending on your background, but all I’m doing is finding the average and standard deviation of the PDF.
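The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of my approach, not code from any team's solution; the function names are my own.

```python
import numpy as np

def cdf_to_pdf(cdf):
    """Convert a 600-value volume CDF into a discretized PDF.

    Bin v holds the predicted probability that the true volume lies
    between v mL and (v + 1) mL; the PDF has one fewer bin than the CDF.
    """
    return np.diff(cdf)

def prediction_and_uncertainty(pdf):
    """Expectation value and standard deviation of a discretized PDF.

    Bin widths are 1 mL and the CDF runs from 0 to 1, so no extra
    normalization is applied, matching the simplification in the post.
    """
    volumes = np.arange(len(pdf))                 # bin index, in mL
    mean = np.sum(volumes * pdf)                  # single-value prediction
    std = np.sqrt(np.sum((volumes - mean) ** 2 * pdf))  # uncertainty
    return mean, std
```

Applied to each row of a submission file, this yields one (prediction, uncertainty) pair per heart per phase.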
In order to give you an idea of what these CDFs look like after I turn them into PDFs, I have included a set of 100 random PDFs along with the true values, for the top four teams’ submissions. Two submissions have nearly normal distributions and the other two have some distinct non-normal features. I invite feedback and discussion on better ways to translate the CDFs into single number predictions, or any comments on why there is a better way to look at the problem of digesting model output for consumption by the medical research community.
For each heart, I simulated EF distributions by collecting EFs calculated from 10,000 SV-DV pairs sampled from the SV and DV PDFs. I then took the expectation value of the simulated EF distribution as the model’s EF prediction. I verified over a subset of heart studies that the sample size I chose produces a stable EF expectation value.
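The EF simulation can be sketched as follows. Note the independent sampling of SV and DV is a simplifying assumption (discussed in the caveats below), and the function name and seed handling are my own choices for illustration.

```python
import numpy as np

def simulate_ef(sv_pdf, dv_pdf, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the ejection fraction distribution.

    SV and DV are sampled independently from their discretized PDFs
    (a simplifying assumption -- in reality they are correlated).
    EF = (DV - SV) / DV, returned as a percentage.
    """
    rng = np.random.default_rng(seed)
    sv_p = np.clip(np.asarray(sv_pdf, dtype=float), 0, None)
    dv_p = np.clip(np.asarray(dv_pdf, dtype=float), 0, None)
    sv_p /= sv_p.sum()                      # renormalize after clipping
    dv_p /= dv_p.sum()
    sv = rng.choice(len(sv_p), size=n_samples, p=sv_p)   # sampled volumes (mL)
    dv = rng.choice(len(dv_p), size=n_samples, p=dv_p)
    ef = 100.0 * (dv - sv) / np.maximum(dv, 1)           # guard against DV = 0
    return ef.mean(), ef.std()
```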
Limits of My Approach
This overall approach requires some caveats. Using the expectation value as the “prediction” is only reasonable for PDFs that are well behaved and unimodal. In reality, a model could output a prediction that gives high probability to disparate systolic (or diastolic) volumes, in which case the expectation value of SV (or DV) may be inappropriate. Estimating the EF distribution from the SV and DV distributions may also be problematic because the SV and DV are clearly correlated. The simplest solution is to seek a model that directly produces the EF value.
However, my overall approach provides an approximation of the results that can be used to start a conversation with the medical community. We will eventually need to work with the medical community to develop a more rigorous analysis that takes into account some knowledge of the models. The medical and data science communities must combine their efforts to develop a method to rigorously translate model output into metrics that can be used for clinical evaluation and, ultimately, clinical use.
The table below shows root-mean-square (RMS) errors for the diastolic and systolic volumes and the ejection fraction (EF) from each of the top 4 teams. Remember, this is the RMS error for the difference between the test value and the single-value prediction calculation. For some clinical perspective on these errors, the precision of human clinical determination of diastolic volume, systolic volume, and ejection fraction is about 13 mL, 14 mL, and 6%, respectively. Consider those benchmarks hearsay for the moment; I will edit the post with numbers from the literature and some references ASAP. With those benchmarks in mind, the RMS errors look promising.
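For completeness, the RMS error reported in the table is just the root of the mean squared difference between each single-value prediction and the corresponding test value:

```python
import numpy as np

def rms_error(predicted, true):
    """Root-mean-square error between single-value predictions and
    the test-set ('true') values, in the same units as the inputs."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.sqrt(np.mean((predicted - true) ** 2))
```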
We can glean more about the potential clinical applicability of the solutions than an RMS error alone provides. Rather than approaching this only from a model evaluation standpoint, let’s return to the original motivation of the challenge—diagnosing heart disease.
As noted above, the quantity of interest for many diagnoses is the ejection fraction (EF). Roughly speaking, an EF under 35% is a dire emergency, around 60% is normal, and above 73% is considered hyperdynamic. I calculated the confusion (a.k.a. contingency) matrices for the top 4 competitors using diagnostic EF bins provided by Dr. Andrew Arai. If you take a close look at the matrices, you can get a picture of what might happen in a medical setting if these algorithms were put to use.
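Constructing such a matrix is straightforward once the EF bins are fixed. Note the bin edges and category labels below are my own illustrative assumptions based on the rough ranges mentioned above; they are not Dr. Arai's exact cutoffs.

```python
import numpy as np

# Illustrative diagnostic EF bins (percent). These edges are assumptions
# drawn from the rough ranges in the text, not Dr. Arai's exact values.
EF_EDGES = [35, 45, 55, 73]
EF_LABELS = ["severe", "moderate", "mild", "normal", "hyperdynamic"]

def ef_confusion_matrix(ef_true, ef_pred):
    """Counts of (true category, predicted category) pairs.
    Rows index the true EF bin, columns the predicted EF bin."""
    t = np.digitize(ef_true, EF_EDGES)   # bin index for each true EF
    p = np.digitize(ef_pred, EF_EDGES)   # bin index for each predicted EF
    n = len(EF_LABELS)
    cm = np.zeros((n, n), dtype=int)
    for i, j in zip(t, p):
        cm[i, j] += 1
    return cm
```

A well-behaved model concentrates counts on or near the diagonal; off-diagonal mass far from the diagonal corresponds to the dangerous misclassifications discussed below.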
The models keep the diagnosis categories pretty tightly grouped together. While the models are not right 100% of the time, there is a very low probability of a severely abnormal EF being incorrectly categorized in the mild-to-hyperdynamic range. The normal-to-mild diagnoses are very likely to stay within their domain of the matrix. This is a pretty good sign pointing towards suitability for clinical applications.
Other metrics that are used to evaluate clinical measurement techniques are correlation plots and Bland-Altman (BA) plots. The correlation plots are straightforward linear fits to the scatter plot of the true values versus the predicted values. For the BA plots, V_pred − V_true is plotted versus AVE(V_pred, V_true) for each prediction. A non-zero mean on a BA plot would indicate a relative bias between the two measurement techniques being compared (or in our case, a bias in the model, since we are assuming the test values are “true”). The dashed lines in the plot give the 95% confidence interval for the difference between the measurement types. What we would like is to compare these BA plots to other BA plots comparing two human measurements of the same data set. Is the variance in the plots comparable to human-to-human comparisons? Scroll to the bottom of the post for the correlation and BA plots.
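The quantities behind a BA plot are simple to compute. A sketch, using the conventional 1.96-standard-deviation limits of agreement for the dashed lines (the function name is my own):

```python
import numpy as np

def bland_altman_stats(v_pred, v_true):
    """Per-point coordinates and summary statistics for a BA plot.

    Each point plots (pred - true) against the mean of the pair.
    Returns the bias (mean difference) and the 95% limits of
    agreement, bias +/- 1.96 * std of the differences.
    """
    v_pred = np.asarray(v_pred, dtype=float)
    v_true = np.asarray(v_true, dtype=float)
    diff = v_pred - v_true              # y-axis of the BA plot
    avg = (v_pred + v_true) / 2.0       # x-axis of the BA plot
    bias = diff.mean()
    half_width = 1.96 * diff.std()
    return avg, diff, bias, (bias - half_width, bias + half_width)
```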
Does it know when it’s failing?
We are especially interested in producing reasonable confidence estimates around volume and ejection fraction predictions. Michael Hansen, Ph.D., co-PI for this DSB, noted that the best models can fail, but they should “fail loudly.” We’d like a good measure of confidence for each individual prediction. In other words, “How well does the model estimate its own accuracy?” Ideally, if an individual prediction has a high probability of being wrong, a model will flag that prediction as uncertain. As mentioned before, I’m using the PDF standard deviation as an estimate of the model’s prediction certainty. A sharp peak indicates higher confidence, whereas a wide peak indicates greater uncertainty.
The plot below is an illustration of a way to visualize a model’s ability to estimate its own accuracy. For each prediction, the standard deviation of the PDF is plotted versus the error of that prediction. The structure of the plot can be conceptualized as shown in the graphic below. Predictions that fall in the bottom-left quadrant are very confident (narrow PDF) and accurate (small absolute error). Points in the top right are low confidence (wide PDF) and inaccurate (large absolute error). Neither class of points is necessarily bad. If all the predictions for a model fell within those two quadrants, one would know when to trust the algorithm’s output and when to consider retaking a measurement. However, points in the other quadrants (particularly large error paired with a narrow, confident-looking PDF) could undermine one’s ability to rely on the algorithm’s uncertainty estimate.
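The quadrant structure can be made concrete with a small labeling function. Splitting each axis at its median is an arbitrary choice for illustration; other thresholds (e.g., clinically motivated error tolerances) would be more defensible.

```python
import numpy as np

def confidence_quadrants(pdf_widths, abs_errors):
    """Label each prediction by quadrant of the width-vs-error plot,
    splitting at the median of each axis (arbitrary, for illustration).

    'confident/accurate' and 'uncertain/inaccurate' are the benign
    quadrants; 'confident/inaccurate' is the one that undermines
    trust in the model's uncertainty estimate.
    """
    w = np.asarray(pdf_widths, dtype=float)
    e = np.asarray(abs_errors, dtype=float)
    wide = w > np.median(w)
    bad = e > np.median(e)
    return np.where(~wide & ~bad, "confident/accurate",
           np.where(wide & bad, "uncertain/inaccurate",
           np.where(~wide & bad, "confident/inaccurate",
                    "uncertain/accurate")))
```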
In the plots that follow, the majority of the data points are shown in grey. The 5% best and worst errors are highlighted, as are the 5% widest and narrowest PDFs. I have looked at this a couple of ways, and the model error and “confidence” distributions are not obviously correlated. For that reason, I suspect that there is a better way to evaluate the models’ estimates of their own accuracy.
Hopefully, there will be an ongoing conversation between the medical and data science communities regarding the evaluation of these models for clinical use. Are there other metrics that would be useful? What else do you want to know about these models? What information from the model developers would help inform decisions that would be made in a clinical setting? From the data science community, is there more useful information that can be derived from the submission data? Are there better ways to estimate model prediction confidence? Are there better ways to calculate the prediction value?
—Written by Jonathan Mulholland