Informatics: The End of Demographics with Deep, Wide, Fast Data

January 26, 2016 | Booz Allen, Data Science

Years ago, when I was working as a manager in NASA’s Astrophysics Data Facility, we curated data sets from thousands of NASA space science experiments. Each of those data sets was relatively small (by today’s “big data” standards), and each was usually focused on some limited science problem, with a limited number of observed features, for a limited sample size, within a limited domain of study. The data were useful to address specific questions and specific problems.

As the sizes of the data collections began to grow in many ways, we saw a shift in the way that science was conducted. The data not only grew in bytes (data volume), but also grew in the variety of features being observed, the size of the samples being studied, and in the breadth of science questions that could be answered with the data collections. Those “new and improved” data sets could be used by a variety of different scientists (not only by the original principal investigator team), and could be used to address an impressive variety of new questions (not previously considered by the original team). These transformations corresponded to a data-driven revolution in science (and now in all other domains and industries).

Some scientists refer to this data revolution as the emergence of the 4th Paradigm of science. The first and second paradigms have been with us for centuries (even millennia): hypothetico-deductive reasoning (theory) and observational-inductive science (experiment). The third paradigm of modeling and simulation (computational science) arrived a few decades ago with the dawn of digital computing and programming languages. The fourth paradigm of data-oriented inference and discovery (data science) emerged at the start of the 21st century and is here to stay.

The data-oriented approach to discovery makes use of the tools, talents, and techniques of data science. The application of those data science components to any specific discipline X can be labeled as X-informatics. Less than a decade ago, in response to the growing data collections in space science, I was strongly motivated (with a few others) to create the field of astroinformatics (which is data science for data-oriented astronomy research and education). In creating the field, we appropriated existing concepts from the fields of bioinformatics and geoinformatics.

Since then, we have seen many other domains follow this path (creating new disciplines focused on data-oriented discovery): health informatics, cybersecurity informatics, urban informatics, climate informatics, social informatics, ecological informatics, business informatics, customer informatics, and many more.

What are some common characteristics that these informatics disciplines share? First, they all apply advanced analytics and data science methods to the most fundamental problems in their respective domains. Second, their goals are similar: more efficient and effective discovery, decision support, and innovation. Third, they collect deep data (large data collections across full populations of objects, not just demographic subsamples). Fourth, they collect wide (high-variety) data (with large numbers of features, context, and attributes for the observed objects). Fifth, they collect fast data (dynamic, time-tagged data on the objects).
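To make the last three characteristics concrete, here is a minimal Python sketch of how one might profile a data collection along the deep, wide, and fast dimensions. Everything in it is an illustrative assumption rather than anything from a real informatics system: the `Observation` record, the `profile` function, the choice of measures (population coverage for depth, distinct feature count for width, observations per hour for speed), and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Observation:
    """One time-tagged record about one object (hypothetical schema)."""
    object_id: str
    timestamp: datetime
    features: dict


def profile(observations, population_size):
    """Summarize a collection along the deep, wide, and fast dimensions.

    depth: fraction of the known population that has at least one record
    width: number of distinct feature names seen across all records
    rate_per_hour: records per hour over the observed time span
    """
    object_ids = {obs.object_id for obs in observations}

    feature_names = set()
    for obs in observations:
        feature_names.update(obs.features)

    times = [obs.timestamp for obs in observations]
    span_hours = (max(times) - min(times)).total_seconds() / 3600 if times else 0.0

    return {
        "depth": len(object_ids) / population_size,
        "width": len(feature_names),
        "rate_per_hour": len(observations) / span_hours if span_hours > 0 else None,
    }


# Made-up sample data: four records covering three of four known objects.
observations = [
    Observation("star-1", datetime(2016, 1, 26, 0, 0), {"brightness": 11.2, "color": 0.4}),
    Observation("star-2", datetime(2016, 1, 26, 1, 0), {"brightness": 9.8, "period": 2.5}),
    Observation("star-1", datetime(2016, 1, 26, 2, 0), {"brightness": 11.0}),
    Observation("star-3", datetime(2016, 1, 26, 4, 0), {"brightness": 13.1, "color": 0.9}),
]

summary = profile(observations, population_size=4)
print(summary)
```

In this toy case the collection covers three quarters of the population, records three distinct features, and arrives at one record per hour. A collection in the sense described above would push depth toward full coverage while also growing in width and rate.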

Deep, wide, fast data represent the “end of demographics” in informatics disciplines. We can generate and pose a wide variety of new questions against the data collections, and thereby we can infer descriptive, predictive, and prescriptive analytics models that describe the characteristics and behaviors of nearly all categories of objects in that domain. We are no longer limited by small sample sizes or demographically limited samples. The large data collections become “the model” of that domain, explaining and explicating the domain through the exploration and exploitation of all of its data.

The job of the data scientist then is to decode the data in order to understand the characteristics and behaviors of objects in a domain (e.g., smart cities, smart health, precision medicine, precision supply chain, personalized learning, personalized marketing, etc.). One could say that the knowledge of the domain is already encoded in the data bytes – the data scientist applies the tools and techniques of data science to decode the data in order to transform that embedded knowledge from a byte encoding into human understanding. That is the essence of informatics and the fourth paradigm of discovery: the data show the way to deeper insights, better decisions, and innovative outcomes.

—Written by Kirk Borne