We just made a contest submission scoring 0.022886, placing us 14th on the leaderboard as of February 2nd.  In this post we’ll outline our current approach and some of the challenges we think we’ll face as we try to further improve our performance.

Our approach for this submission was to use a Convolutional Neural Network (CNN) [1] to predict the systolic and diastolic CDFs directly from the SAX images.  It is not obvious how best to structure the inputs to a CNN for this problem.  The patient studies have varying numbers of slices spanning a spatial dimension, and within each slice there is a time series of images.  We can therefore naturally think of each patient study as a 4-dimensional tensor of shape (# slices, # timesteps, image width, image height).
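Concretely, a single study can be held in a NumPy array of that shape.  The dimensions below (12 slices, 30 timesteps, 128×128 images) are illustrative values for one hypothetical study, not actual dataset statistics:

```python
import numpy as np

# Hypothetical dimensions for one study: 12 SAX slices, each with a
# 30-frame cardiac cycle of 128x128 images.
n_slices, n_timesteps, height, width = 12, 30, 128, 128

study = np.zeros((n_slices, n_timesteps, height, width), dtype=np.float32)
print(study.shape)  # (12, 30, 128, 128)
```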

If all patient studies had the same number of slices and all slices had the same number of timesteps, we could easily ingest each study in its entirety into the CNN.  Sadly this is not the case.  We therefore currently standardize the dimensions of our input by randomly sampling a fixed number of slices and a fixed number of timesteps from each patient study before ingesting it into the CNN.  The sampling is re-randomized on each training iteration, which has actually proven to be a useful form of regularization.  We talked last week about the methods we are using to standardize the individual image input sizes.
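A minimal sketch of this sampling scheme in NumPy.  The target sizes here (8 slices, 16 timesteps) are hypothetical, not our actual hyperparameters; sampling with replacement is one way to handle studies smaller than the target:

```python
import numpy as np

def sample_study(study, n_slices=8, n_timesteps=16, rng=None):
    """Randomly sample a fixed number of slices and timesteps from a
    variable-sized study tensor of shape (slices, timesteps, H, W).
    Falls back to sampling with replacement when the study is smaller
    than the requested size."""
    rng = rng if rng is not None else np.random
    s_idx = np.sort(rng.choice(study.shape[0], n_slices,
                               replace=study.shape[0] < n_slices))
    t_idx = np.sort(rng.choice(study.shape[1], n_timesteps,
                               replace=study.shape[1] < n_timesteps))
    return study[s_idx][:, t_idx]

# Real studies vary in size; this one has 11 slices and 25 timesteps.
study = np.random.rand(11, 25, 64, 64).astype(np.float32)
batch = sample_study(study)
print(batch.shape)  # (8, 16, 64, 64)
```

Because a fresh sample is drawn each iteration, the network effectively sees a different "view" of each study every epoch, which is where the regularization effect comes from.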

The convolutional layers of our network have a fairly standard structure inspired by the well-known VGG network [2], i.e. many stacked layers of 3×3 convolutional filters.
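The appeal of stacking small filters is that the receptive field grows with depth: two stride-1 3×3 layers see a 5×5 input patch, three see 7×7, with fewer parameters than a single large filter.  A quick check of that arithmetic:

```python
def receptive_field(n_layers, kernel=3):
    """Receptive field (in input pixels) of n stacked stride-1 conv
    layers: each additional kernel x kernel layer grows it by kernel - 1."""
    rf = 1
    for _ in range(n_layers):
        rf += kernel - 1
    return rf

print(receptive_field(2))  # 5 -> two 3x3 convs cover a 5x5 patch
print(receptive_field(3))  # 7 -> three 3x3 convs match a single 7x7 filter
```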
Our network directly predicts 600-dimensional systolic and diastolic CDFs independently for each timestep and each slice in the input tensor.  So far we have found that this yields lower CRPS scores than predicting the volume directly and then generating a CDF from it.  In order to yield a single predicted systolic CDF and a single predicted diastolic CDF, we pool across the spatial and temporal dimensions.  At this stage we’ll leave it as an exercise for the reader to figure out the best way to aggregate CDFs…
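As a reminder, the CRPS for a single prediction compares the predicted CDF against the step-function CDF of the true volume over the 600-point (0–599 mL) grid.  Below is a sketch of that per-prediction score, together with one naive pooling strategy (simple averaging over the slice and time axes, which is not necessarily the aggregation our network uses):

```python
import numpy as np

def crps(pred_cdf, true_volume):
    """Per-prediction CRPS: mean squared difference between the
    predicted CDF and the true volume's step-function CDF over the
    0..599 mL threshold grid."""
    grid = np.arange(600)
    true_cdf = (grid >= true_volume).astype(float)
    return np.mean((pred_cdf - true_cdf) ** 2)

# Crude stand-in for per-slice/per-timestep CDFs (8 slices, 16 timesteps):
# sorting each 600-vector of uniform randoms yields a monotone curve in [0, 1).
per_cell_cdfs = np.random.rand(8, 16, 600)
per_cell_cdfs.sort(axis=-1)

# One naive aggregation: average the CDFs across slices and timesteps.
pooled = per_cell_cdfs.mean(axis=(0, 1))
print(crps(pooled, true_volume=150.0))
```

Averaging preserves monotonicity, but it is only one of several plausible aggregations (medians or learned weightings are others), hence the exercise for the reader.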

Partly driven by the need to reshape and pool across all dimensions of our 4D tensor, we have chosen to build and train our model using Lasagne [3], a neural network framework built on Theano.
This approach has yielded a network that is fast to train (approximately 1 minute per epoch on an NVIDIA Titan X GPU with cuDNN 4 [4]) and quickly converges to a low training error.  It does, unfortunately, tend to significantly overfit.  To combat this we are using a number of techniques including dropout, L2 regularization, leaky rectified linear units, and aggressive data augmentation.  We’ll talk more about these in future blog posts.
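For reference, here are minimal NumPy versions of two of those techniques: a leaky rectifier and inverted dropout.  The slope and drop rate shown are illustrative defaults, not our tuned values:

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky rectifier: a small non-zero slope for negative inputs keeps
    gradients flowing where a standard ReLU would be zero."""
    return np.where(x > 0, x, alpha * x)

def dropout(x, p=0.5, train=True, rng=np.random):
    """Inverted dropout: zero each activation with probability p during
    training and rescale by 1/(1-p) so expected activations match at
    test time, when the layer is the identity."""
    if not train:
        return x
    mask = rng.binomial(1, 1.0 - p, size=x.shape)
    return x * mask / (1.0 - p)

print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.2  3. ]
```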

This purely CNN-based approach has some limitations that we hope to address soon.  We are currently discarding both the spatial ordering of slices and the temporal ordering of images within a slice, and both are potential sources of information.  We are exploring the use of 3D convolutions for better aggregation across slices.

Whilst we don’t want to give away all our secrets at this stage, we’d be happy to answer questions about our methods through the competition forum.  Please reach out to us!

[1] https://cs231n.stanford.edu/
[2] https://www.robots.ox.ac.uk/~vgg/research/very_deep/
[3] https://github.com/Lasagne/Lasagne
[4] https://developer.nvidia.com/cudnn

Study 9, Slice 10. From left to right: raw imagery, scaled, and segmented.


Study 10, Slice 10. From left to right: raw imagery, scaled, and segmented.


Study 125, Slice 17. From left to right: raw imagery, scaled, and segmented.


—Written by Jon Barker