Editor’s note: This story pairs with “Data mining: how it works, why it’s important”
There are many methods of data collection and data mining. Here are some of the most common forms of data mining and how they work:
1. Anomaly detection
Anomaly detection can be used to determine when something is noticeably different from the regular pattern.
BYU professor Christophe Giraud-Carrier, director of the BYU Data Mining Lab, gave the example of monitoring gas turbines and how anomaly detection is used to make sure the turbines function properly.
“These turbines have physics associated with them that predicts how they are going to function and their speed and all that jazz,” Giraud-Carrier said.
Sensors monitoring temperature and pressure, among other things, are set up to see if anything anomalous is observed over time.
“In other words, is this thing about to blow up?” Giraud-Carrier said. “Because if it’s about to blow up, you want to turn it off. But if it’s not about to blow up, you don’t want to turn it off.”
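The idea can be sketched with a simple statistical test: flag any sensor reading that falls far outside the historical pattern. The readings below are invented for illustration; real turbine monitoring relies on the physics-based models Giraud-Carrier describes, not a bare z-score.

```python
from statistics import mean, stdev

def is_anomalous(history, reading, threshold=3.0):
    """Flag a reading whose distance from the historical mean,
    measured in standard deviations, exceeds the threshold."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return reading != mu
    return abs(reading - mu) / sigma > threshold

# Hypothetical turbine temperatures that normally hover around 600 degrees.
temps = [598.2, 601.5, 599.8, 600.3, 602.1, 597.9, 600.7, 599.4]
print(is_anomalous(temps, 601.0))  # a typical reading -> False
print(is_anomalous(temps, 655.0))  # far outside the pattern -> True
```

In practice the threshold, and the statistics behind it, would be tuned to the sensor and the cost of a false alarm.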
2. Association learning
Association learning, or market-basket analysis, is used to analyze which things tend to occur together either in pairs or larger groups.
Giraud-Carrier said a basic example is how Walmart will review the purchases people make at checkout and see that people who buy milk also buy bread, or people who buy diapers also buy baby formula.
“It’s all about trying to find this association among products and the idea being that, again, you can leverage this information,” Giraud-Carrier said. “Either you can bundle these things together … or you can do crazy stuff like you put apples on one end of the store and oranges at the other end of the store, so as people travel through the store they buy all kinds of other stuff that they didn’t plan on buying.”
Association learning goes beyond simple correlation, according to Giraud-Carrier, because it extends beyond pairs and can account for larger groupings of items.
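A minimal sketch of that idea: count how often groups of items, pairs or larger, appear together across shopping baskets, and keep the groups that show up often enough. The baskets and support threshold below are made up for illustration.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(baskets, size, min_support):
    """Count how often each group of `size` items appears together
    across baskets, keeping groups seen at least `min_support` times."""
    counts = Counter()
    for basket in baskets:
        for group in combinations(sorted(basket), size):
            counts[group] += 1
    return {group: n for group, n in counts.items() if n >= min_support}

baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"diapers", "formula", "milk"},
    {"milk", "bread", "butter"},
]
print(frequent_itemsets(baskets, size=2, min_support=3))
# {('bread', 'milk'): 3}
```

Raising `size` to 3 is what lets the method capture the larger groupings Giraud-Carrier mentions, beyond simple pairwise correlation.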
3. Cluster detection
Recognizing distinct groups or sub-categories within data is called cluster detection. Machine learning algorithms detect significantly differing subgroups within a dataset.
Giraud-Carrier said something humans naturally do is separate things into groups.
“We put people in buckets: people we like, people we don’t like; people we find attractive, people we don’t find attractive,” he said. “In your head you have these notions of what makes people attractive or not, and it’s based on who knows what.”
When people match those criteria, a person sorts them into the corresponding buckets. Giraud-Carrier said the technique is called clustering because people don’t have labels on their heads identifying which bucket they belong in. Cluster detection uses data to sort things into buckets based on similarities.
Another example of cluster detection would be to analyze the purchasing behavior of hobbyists — fishermen and gardeners would naturally have different purchasing habits based on their hobbies.
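One common clustering algorithm, k-means, maps directly onto the bucket metaphor: each point is repeatedly sorted into the bucket whose center it sits closest to, and the centers are then recomputed. The customer spending numbers below are invented, and the first-point initialization is a simplification of what real implementations do.

```python
def kmeans(points, k, iterations=20):
    """Group 2-D points into k clusters by repeatedly assigning each
    point to its nearest center and recomputing the centers."""
    centers = list(points[:k])  # naive init: start from the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Hypothetical spending on (fishing gear, gardening supplies) per customer.
customers = [(90, 5), (85, 10), (95, 8), (5, 88), (10, 92), (8, 85)]
groups = kmeans(customers, k=2)
print(groups)  # the fishermen and the gardeners land in separate groups
```

No one told the algorithm which customers are fishermen and which are gardeners; the groups emerge from the data alone, which is the point of clustering.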
4. Classification
Unlike cluster detection, classification deals with things that already have labels. This labeled information is called training data: existing examples an algorithm can learn from and then use to classify new data.
For example, spam filters learn the differences between the content of legitimate messages and spam messages. This is made possible by training on large sets of emails that have already been identified as spam or legitimate.
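A toy version of such a filter can be built with a naive Bayes classifier, one standard approach to spam detection, trained on a handful of hand-labeled messages (the training emails below are made up):

```python
from collections import Counter
import math

def train(messages):
    """Count word frequencies per label from labeled training data."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        for word in text.lower().split():
            counts[label][word] += 1
            totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the label whose words best explain the message
    (log-probabilities with add-one smoothing)."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in counts:
        score = 0.0
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (totals[label] + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
counts, totals = train(training)
print(classify("claim your free money", counts, totals))  # -> spam
```

The labeled examples are the training data: the algorithm never sees a rule for what spam looks like, only messages already sorted into the two categories.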
5. Regression
The regression method is used to make predictions based on relationships within the data set. For example, future engagement on Facebook can be predicted based on everything in a user’s history — likes, photo tags, comments and interactions with other users, friend requests and all other activity on the site.
Another example would be using the relationship between income and education level to predict choice of neighborhood. Regression allows all relationships within the data to be analyzed and then used to predict future behavior.
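The simplest form of regression is a least-squares line fit: find the line relating two variables, then read predictions off that line. The education-and-income numbers below are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of education vs. income in thousands of dollars.
years = [12, 14, 16, 18, 20]
income = [35, 45, 55, 65, 75]
slope, intercept = fit_line(years, income)
predicted = slope * 17 + intercept  # predicted income for 17 years of education
print(predicted)  # -> 60.0
```

Real regression models extend this to many variables at once, which is how all the relationships in a data set can feed a single prediction.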