Is big data a big deal? Not without correct software!

05 July 2015

flickr photo shared by danmachold under a Creative Commons (BY-NC-SA) license

This statement was written to support my participation in a panel at the 27th International Conference on Software Engineering and Knowledge Engineering. To view the accompanying slides for this presentation, please refer to (Kapfhammer, 2015). If you want to learn more about the new work that my colleagues, my students, and I are conducting on efficiently testing data-centric applications, please read (Kinneer, Kapfhammer, Wright, & McMinn, 2015) and (Kinneer, Kapfhammer, Wright, & McMinn, 2015), two papers that were also presented at the same conference.

Big data analytics software allows researchers and practitioners to create descriptive models and make predictions. Often characterized by the "three Vs" of volume, velocity, and variety, big data systems must respectively handle large amounts of data that arrive rapidly and take many different forms. In fields such as evidence-based medicine and the detection of financial fraud, big data software is poised to make, and indeed is already making, important contributions.

However, there is an additional "V" that is often overlooked by both researchers and practitioners: veracity. That is, if the software and data that make up a big data analytics system are not correct, then the data models and the resulting predictions may be compromised, with serious consequences. For instance, The Data Warehousing Institute reports that North American organizations experience a $611 billion annual loss due to poor data quality. Scott W. Ambler argues that the "virtual absence" of software and data testing is the primary cause of this loss. Although this example is not specifically tied to big data systems, it clearly illustrates the risks associated with a lack of veracity in any data-rich field.

The challenge for software testing researchers is to develop and empirically evaluate new methods that can accommodate the volume, velocity, and variety that are characteristic of big data systems. While some preliminary work (e.g., the testing of both data mining systems and database applications) has recently been published, few software engineering researchers have focused on big data testing. Since veracity is not always considered by big data researchers, the challenge for these individuals is to create and assess new techniques that, whenever possible, holistically consider all of the "four Vs". If they are not already doing so, practitioners in both of these fields should start to establish confidence in the correctness of both their software and their data through the disciplined use of testing.

The title of my position statement poses the question "is big data a big deal?" Of course, the answer to this question is "yes". With that said, the growth of big data's importance and impact will be accelerated, and also sustained, if researchers and practitioners in fields such as software engineering, software testing, and big data collaborate with each other to develop efficient and effective data analytics systems that construct high-quality models and make accurate predictions. Let's collaborate across the fields of software engineering and big data to ensure that we have a positive influence on society, thus proving to be a "bigger deal" together than we would have been on our own.