Editor’s note: This story pairs with “5 data mining methods.”
The emerging field of data science has exploded over the past few years, with data scientist ranking number one on Glassdoor’s Best Jobs in America list for the past three years.
Within the large field of data science, data mining is used to gather information from databases in order to make relevant decisions, according to Data Science Central.
What is data mining?
Generally, data mining is about extracting knowledge from data. BYU Data Mining Lab Director Christophe Giraud-Carrier researches a number of areas in the data mining realm.
“For a long time people have been collecting data. In fact, for years since the 1960s right when we invented databases the whole idea was to collect data,” Giraud-Carrier said. “So companies have storage of data galore — they store just about anything and everything.”
He said people started to realize in the ’80s or ’90s that having all that data was a liability because it costs money to run computers and to keep people maintaining the data. Databases become an asset only when the data is leveraged, according to Giraud-Carrier.
“Having all that information and doing nothing about it is not very helpful, so people decided to start doing things with it,” he said.
Data comes in all shapes and sizes and is collected in a variety of ways. Data can be collected from transaction systems like ATMs, from computer searches, smartphone activity, social media, formal surveys and school databases. It is collected all day, every day.
Giraud-Carrier contends there are many benefits of data mining. In some cases, the benefit is cost savings or revenue generation. Other times, data mining can be used to help improve particular processes, such as health care processes.
Former U.S. Chief Data Scientist Dhanurjay “DJ” Patil gave a BYU forum address on the power of data science. He shared the U.S. chief data scientist’s mission statement, which is “to responsibly unleash the power of data to benefit all Americans.”
“There’s two words in this mission statement that are very carefully chosen,” Patil said. “The word ‘responsibly’ and ‘all Americans.’ The first word, ‘responsibly,’ is there because, just because we can with data doesn’t mean we always should.”
Giraud-Carrier said privacy is a big concern in data mining. Tools for extracting information about individuals are made available through data mining.
“It turns out we have tools where we can predict, based on your post on Facebook, your gender preferences, your political affiliations, your religious leanings and things like this just by reading what you say and who you tend to hang out with,” Giraud-Carrier said. “The question is should we — we can — but should we, and what are the implications?”
There are laws in place to protect against discrimination based on information extracted through data mining. Giraud-Carrier said, for example, there is a law stating that gender and ethnicity cannot be used to make decisions on loans.
In the medical domain there are HIPAA laws, and in education there’s FERPA. These laws dictate what information may be extracted and considered and what may not. Additionally, most universities and organizations have an institutional review board. The role of these boards is to review studies involving human subjects.
“From the perspective of data mining, this means data about a human subject,” Giraud-Carrier said. “There are mechanisms in place to make sure things are done ethically.”
He said there is a push in the data science community to teach ethics, and many data science programs across the country address ethical issues. There is also a field of “privacy preserving data mining” in which algorithms are designed to protect the privacy of individuals through anonymization and randomization mechanisms.
“One of the things that a lot of people don’t realize with data mining is that most of what we do in data mining is try to figure out things that are true at the aggregate level,” Giraud-Carrier said. “We don’t really care about you as an individual, we just care about trends across all of us.”
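To illustrate how a randomization mechanism can protect individuals while still revealing aggregate trends, here is a minimal sketch of randomized response, a classic survey technique. It is not taken from the BYU lab’s work; the function names, the 50 percent noise level and the simulated 30 percent population rate are all assumptions chosen for illustration.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the true answer half the time; otherwise report a coin flip.

    Any single report is deniable, because it may be pure noise.
    """
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports: list[bool]) -> float:
    """Recover the aggregate rate from noisy reports.

    With the scheme above, P(report is yes) = 0.5 * p + 0.25,
    so p = (observed - 0.25) / 0.5.
    """
    observed = sum(reports) / len(reports)
    return (observed - 0.25) / 0.5


# Simulated population (hypothetical): about 30% would truthfully answer "yes".
rng = random.Random(0)
true_rate = 0.30
population = [rng.random() < true_rate for _ in range(100_000)]

# Each person submits only a randomized report, never the raw answer.
reports = [randomized_response(answer, rng) for answer in population]

estimate = estimate_true_rate(reports)
print(f"estimated rate: {estimate:.3f}")
```

No single report reveals what a given person actually answered, yet the estimate across the whole group lands close to the underlying rate. More noise gives stronger privacy but requires a larger sample for the same accuracy.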
BYU Data Mining Lab
The BYU Data Mining Lab is engaged in a number of research projects, both theoretical and practical.
BYU student Brandon Schoenfeld is pursuing a doctoral degree in computer science and has worked on projects in the BYU Data Mining Lab. Schoenfeld’s most recent project is on metalearning.
“Essentially it’s the use of machine learning or data mining to itself — using machine learning on machine learning,” Schoenfeld said. “We’re trying to see what insights we can get from that to improve data mining.”
BYU computer science major Christie Partington started working in the lab last semester.
“I didn’t really know anything going in,” Partington said. “I really had a passion for data and computer science but I’m still learning a lot.”
Partington has been involved with the lab’s research on computational health science. She previously worked on machine learning for research on eating disorders and is currently working on a dataset from BYU CAPS to analyze how effective therapist meetings are.
The BYU Data Mining Lab collaborates with other campus entities in health science on these projects to get perspectives from both sides.
“We are really interested in using those computational techniques and data mining to help answer complex public health issues,” Giraud-Carrier said.
The team has done a great deal of work on suicide prevention. The data they’ve mined has been primarily from social media.
“The reason why we’ve latched onto this is because there’s tons of it and it turns out, in spite of what people think, a lot of people tend to spill their guts on social media and are very, very open and explicit about their feelings, their emotions and their behavior,” Giraud-Carrier said.
The lab focuses on issues relevant to Utah. Giraud-Carrier said one of the things that’s really clear about data mining is it’s interdisciplinary.
“We can design algorithms to our heart’s content, but they don’t do us any good. We need ways to test them,” he said. “Having a real, true synergy and working together helps.”
The team meets once a week with members from each field. They look for projects that allow them to push their limits and improve their algorithms.
“I like that there’s a lot of opportunity to learn (in this field),” Schoenfeld said. “There’s always a challenge and it’s never simple, but we’re hoping for good results in the end.”
AUDIO: Christophe Giraud-Carrier, director of the BYU Data Mining Lab, describes how they use data mining in studying suicide prevention.