We had lofty ambitions when we created the Data Science Bowl. We wanted to create an event that would give the Data Science community the opportunity to effect change on a global scale. We wanted to solve the previously impossible. We wanted to give a voice to those who had none. If I’m being honest, I wasn’t sure we would make it. There were moments of discouragement and frustration along the way, but we never lost sight of our ambitions. Now we are entering our fourth year and I have a very different perspective. We have accomplished more than I ever thought possible. We are truly changing the world with Data Science.
The technical results are amazing. In three short years we’ve tackled ocean health, heart disease, and lung cancer. Participants enabled the measurement of ocean health at a speed and scale never before possible, reducing analysis time by 50% while increasing accuracy by 10%. Algorithms created in our second year outperformed the accuracy achieved by human cardiologists. This past year, the winning algorithms improved lung cancer detection accuracy by 10% while simultaneously reducing false positives by 10%. Accomplishing any one of these would have been amazing. All three is astonishing.
It goes beyond the technical results though. The competition is and always has been about the data science community. It’s about the people who give countless hours of their own time to make the world a better place. The human curiosity and passion for social change that we see are an undeniable force of nature.
I know what you’re thinking – they really do it for the money. But they don’t. We’ve spoken with many of them. It’s never the money that pulls them in – in fact, they rarely even mention it. It’s the chance to effect real change that they are after. This past year we had 10,000 people take part. 10,000. You don’t get that kind of participation just for the money.
The event is bigger than any one of us. We are fueling the kind of change we want to see in the world. People like me, and people very different from me, are giving back in a truly meaningful way. There will always be other competitions, but there will only be one Data Science Bowl.
I love what we do. I love what happens when people come together and focus everything they have on finding a new way forward. The Data Science Bowl provides a venue to unleash that passion.
How far can we go? Who can we help? What impossible challenges can we solve? As you’ve already shown, only your passion and curiosity can answer these questions.
Want to see more of what I’m talking about? Check out the presentations from the 2017 GPU Technology Conference, hosted by our Data Science Bowl partner NVIDIA, where we discussed the Data Science Bowl and heard from some of our top-placing teams. You can also check out the summary below of the solutions, hardware, languages, and libraries used by our top-placing teams from the 2017 challenge.
~Steve Mills, Booz Allen Hamilton
| Place | Team | Size | Training Method | Key Features | Solution | Hardware | Language | OS / Libraries |
|---|---|---|---|---|---|---|---|---|
| 1 | grt123 | 3 | 3D convolutional neural networks | | Preprocessing; nodule detection; cancer classification | 8 NVIDIA Titan X GPUs | Python | PyTorch |
| 2 | Hammack & Wit | 2 | 3D convolutional neural networks | | Normalize CT scan; find regions likely to have nodules; predict nodule attributes; aggregate nodule attribute predictions into a global patient-level diagnosis forecast | NVIDIA? | Python | Keras, Theano, numpy, scipy, scikit-learn, pandas |
| 3 | Aidence | 3 | Fully convolutional ResNet & linear augmentations | Nodule radius, height of nodule in lung, malignancy, texture, calcification, and spiculation | Normalize; fully convolutional ResNet for nodule detection; predict nodule attributes; aggregate nodule attribute predictions | 8 NVIDIA K80 GPUs | Python | TensorFlow 1.1, OpenCV 3.1, scipy, numpy, yaml, scikit-learn, pydicom, SimpleITK, pandas, pycuda |
| 4 | gfpxfd | 8 | Convolutional neural networks | | Preprocess data, then two-stage nodule detection: a 2D Faster R-CNN detects nodule candidates with high recall, followed by a 3D CNN to reduce false positives | ? | C++/C# | |
| 5 | Pierre Fillard (Therapixel) | 1 | Deep neural networks | | Lung segmentation; nodule extraction & characterization; mass extraction & characterization; emphysema estimation; degree-of-calcification estimation | 128GB of RAM; 2TB of free disk space; 2x Intel i7 CPUs; 4x NVIDIA Titan X (Pascal) or NVIDIA P6000 | Python | CUDA 8, cuDNN 5, TensorFlow 1.0, xgboost, numpy, scipy, skimage, SimpleITK |
| 6 | MDai | 2 | Deep neural networks | | Preprocess DICOM studies; determine patient sex; create ROI probability maps; create cancer predictions and generate other features; final cancer predictions with a stacked meta-classifier ensemble | AWS p2.16xlarge (16 NVIDIA K80 GPUs, 64 vCPUs, 732 GiB RAM) | Python | CUDA 8, cuDNN 5.1, numpy, scipy, pandas, scikit-image, scikit-learn, joblib, pillow, xgboost, keras, tensorflow, pydicom, h5py, redis-py |
| 7 | DL Munich | 4 | Convolutional neural networks | | Preprocess; nodule segmentation; candidate proposal; cancer classification | GPU: NVIDIA GTX 1080; CPU: Intel Core i7-4930K; 32GB of RAM; around 200GB of free disk space | Python | opencv-python, dicom, joblib, tensorflow 1.0.1, SimpleITK, numpy, pandas, scipy, scikit-image, scikit-learn |
| 8 | Alex \| Andre \| Gilberto \| Shize | 4 | Convolutional neural networks and tree-based classification (XGBoost and ExtraTrees) | The most important feature is the existence of nodule(s), followed by their size, location, and other characteristics | Train nodule identifier; resample, convert & segment images; identify nodules; extract 3 feature sets for prediction | i7-based Linux systems with 8GB GPUs and AWS P2 instances (12GB GPUs) | Python | Keras 1.2.2, Theano, Conda, pydicom, cv2, scipy, SimpleITK, numpy, pandas, xgboost, sklearn |
| 9 | Deep Breath | 10 | Convolutional neural networks and XGBoost | | Nodule segmentation; ROI extraction; false-positive reduction; final cancer prediction | 7 machines (GTX 1080, GTX 980, Titan X, and Tesla K40) with 32-64GB of memory per machine | Python | Theano, Lasagne, scikit-image, scipy, numpy, scikit-learn, CUDA 8.0, cuDNN 5.1 |
| 10 | Owkin Team | 2 | 3D convolutional neural networks and boosted trees (blended prediction of two separate models) | 45 features (location, area, volume, HU statistics, etc.) | Preprocess the images to a fixed 1mm x 1mm x 1mm resolution; segment the lungs using thresholding, morphological operations, and connected-component selection; feed to the two separate models (3D CNN and boosted tree) | ? | Python | tensorflow-gpu 1.0, xgboost, Keras 2.0, OpenCV, pyradiomics |
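Several of the solutions above share the same first step: segmenting the lungs out of the CT volume, typically by thresholding Hounsfield units, discarding the background air, and cleaning up with morphological operations (the Owkin row spells this recipe out explicitly). The sketch below is not any team's actual code — the function name, the -320 HU threshold, and the structuring element are illustrative assumptions — but it shows the shape of that step using scipy:

```python
import numpy as np
from scipy import ndimage

def segment_lungs(volume_hu, threshold=-320):
    """Rough lung mask from a CT volume in Hounsfield units (HU).

    Illustrative pipeline: threshold air-like voxels, drop the
    background air surrounding the body, keep the two largest
    remaining components (the lungs), and close small holes.
    """
    # 1. Threshold: lung interior (~ -700 HU) is far below soft tissue (~ 40 HU).
    binary = volume_hu < threshold

    # 2. Remove background air: the corner voxel lies outside the body,
    #    so its connected component is the background.
    labels, _ = ndimage.label(binary)
    binary[labels == labels[0, 0, 0]] = False

    # 3. Keep the two largest remaining components (left + right lung).
    labels, n = ndimage.label(binary)
    if n == 0:
        return binary
    sizes = ndimage.sum(binary, labels, range(1, n + 1))
    keep = np.argsort(sizes)[::-1][:2] + 1  # label ids of the two largest
    mask = np.isin(labels, keep)

    # 4. Morphological closing fills small holes (e.g. vessels) in the mask.
    return ndimage.binary_closing(mask, structure=np.ones((3, 3, 3)))
```

Real pipelines add more care (per-slice body masking, dilation to keep juxtapleural nodules), but the threshold / connected-components / morphology skeleton is the common core.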
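Rows 2 and 3 both end by aggregating per-nodule predictions into a single patient-level diagnosis. The teams' actual aggregation methods are not detailed here; one simple, commonly cited way to combine per-nodule malignancy probabilities is a noisy-OR — the patient is predicted cancerous unless every nodule is benign. This is an illustrative sketch, not either team's code:

```python
import numpy as np

def patient_probability(nodule_probs):
    """Noisy-OR aggregation of per-nodule malignancy probabilities.

    P(cancer) = 1 - prod_i (1 - p_i): the patient is cancer-free only
    if every detected nodule is independently benign.
    """
    probs = np.asarray(nodule_probs, dtype=float)
    if probs.size == 0:
        return 0.0  # no nodules detected -> no evidence of cancer
    return float(1.0 - np.prod(1.0 - probs))
```

The independence assumption is crude, which is why entries like MDai's instead feed per-nodule features into a stacked meta-classifier for the final patient-level prediction.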