reinforcement learning supervised learning
X
Definition

unsupervised learning

What is unsupervised learning?

Unsupervised learning is a type of machine learning (ML) technique that uses artificial intelligence (AI) algorithms to identify patterns in data sets that are neither classified nor labeled.

Unsupervised learning models don't need supervision while training data sets, making it an ideal ML technique for discovering patterns, groupings and differences in unstructured data. It's well-suited for processes such as customer segmentation, exploratory data analysis or image recognition.

Unsupervised learning algorithms can classify, label and group the data points contained within data sets without requiring any external guidance in performing that task. In other words, unsupervised learning allows a system to identify patterns within data sets on its own.

In unsupervised learning, an AI system groups unsorted information according to similarities and differences even though no categories are provided.

AI systems capable of unsupervised learning are often associated with generative learning models, although they might also use a retrieval-based approach, which is most often associated with supervised learning. Chatbots, self-driving cars, facial recognition programs, expert systems and robots are among the systems that use supervised or unsupervised learning approaches. Unsupervised learning is also known as unsupervised machine learning.

How unsupervised learning works

Unsupervised learning starts when machine learning engineers or data scientists pass data sets through algorithms to train them. There are no labels or categories contained within the data sets being used to train such systems; each piece of data that's being passed through the algorithms during training is an unlabeled input object or sample.

The objective of unsupervised learning is to have the algorithms identify patterns within the training data sets and categorize the input objects based on the patterns the system itself identifies. The algorithms analyze the underlying structure of the data sets by extracting useful information or features from them. Thus, these algorithms are expected to develop specific outputs by looking for relationships between each sample or input object.

For example, unsupervised learning algorithms might be given data sets containing images of animals. The algorithms can classify the animals into categories such as those with fur, those with scales and those with feathers. The algorithms then group the images into increasingly more specific subgroups as they learn to identify distinctions within each category.

The algorithms do this by uncovering and identifying patterns. In unsupervised learning, pattern recognition happens without the system having been fed data that teaches it to distinguish specific categories.

Unsupervised vs. supervised learning vs. semi-supervised learning

Supervised learning is an ML technique like unsupervised learning, but in supervised learning, data scientists feed algorithms with labeled training data and define the variables they want the algorithm to assess.

Unlike in unsupervised learning, both the input data and output variables of the algorithm are specified in the training data. Using the animal example, data scientists would feed the algorithm photos of each animal and create a label for each photo used in the training data to indicate if an image contains an animal and what category it belongs to.

Supervised learning models are trained until they can detect patterns and relationships between the input data and the output labels. Classification, decision trees, regression and predictive modeling are common types of supervised algorithms.

Comparing supervised versus unsupervised learning, supervised learning uses labeled data sets to train algorithms to identify and sort based on provided labels. Unsupervised learning is more unpredictable than a supervised learning model. While an unsupervised learning AI system might, for example, figure out on its own how to sort cats from dogs, it could also add unforeseen and undesired categories to deal with unusual breeds, creating clutter instead of order.

ML engineers or data scientists can opt to use a combination of labeled and unlabeled data to train their algorithms. This in-between option is appropriately called semi-supervised learning.

In semi-supervised machine learning, an algorithm is taught through a hybrid of labeled and unlabeled data. This process begins from a set of human suggestions and categories and then uses unsupervised learning to help inform the supervised learning process. Semi-supervised learning provides the freedom of defining labels for data while still being directed by a human perspective.

Algorithms used in supervised, unsupervised, semi-supervised and reinforcement learning.
A machine learning model can be trained with supervised, unsupervised, semi-supervised and reinforcement learning methods.

Another machine learning technique is reinforcement learning, which is based on rewarding desired behaviors and punishing undesired ones. In this process, developers create a method of assigning positive values to the desired actions and negative values to undesired behaviors.

Clustering and other types of unsupervised learning

Unsupervised learning is often focused on clustering. Clustering is the grouping of similar objects or data points while placing dissimilar objects in other clusters.

Machine learning engineers and data scientists can use different algorithms for clustering, with the algorithms themselves falling into different categories based on how they work. Clustering algorithms can be placed in the following categories:

  • Exclusive clustering. This form of grouping data specifies a data point can exist only in one cluster.
  • Overlapping clustering. This form of grouping data enables data points to belong to multiple clusters with different levels of membership.
  • Hierarchical clustering. This form of grouping data is classified as either agglomerative or divisive. Agglomerative clustering is where data points are initially set as separate groupings and are later merged, and divisive clustering takes a single data cluster and divides it based on data points.
  • Probabilistic clustering. This form of grouping data points is based on the potential for them to belong to a specific distribution. The Gaussian Mixture Model is commonly used here to represent subpopulations within an overall population.

Some of the more widely used algorithms include the k-means clustering algorithm, the fuzzy k-means algorithm, the hierarchical clustering and the density-based clustering algorithms.

Benefits of unsupervised learning

Benefits of unsupervised learning include the following:

  • Handles complex tasks. Unsupervised learning is more useful than supervised learning where the initial input data is more complex and unstructured.
  • No need to interpret labels. ML engineers and data scientists are in charge of passing data sets through algorithms to train them, but they aren't needed to interpret labels for each data point.
  • Derives meaning from raw data sets. AI tools can more quickly evaluate raw data when compared to a person.
  • Identifies underlying patterns in unstructured data sets. Unsupervised learning can be used to identify common factors between large amounts of different data points.
  • Works in real time. Unsupervised learning can work with real-time data to identify patterns.
  • Less costly than supervised learning. Unsupervised learning doesn't require the manual work associated with labeling data that supervised learning requires.

Challenges of unsupervised learning

Although organizations value the beneficial features of unsupervised learning, there are some disadvantages, including the following:

  • Results can be unpredictable. It can be difficult to check the accuracy of the unsupervised learning outputs, as there are no labeled data sets to verify the results.
  • Longer overall training times. Unsupervised learning models need a large training set to produce outcomes, and learning from raw data can be time-consuming.
  • Lack of insight. Identifying the hidden patterns in large unclassified data sets can make the training process more difficult.

There's an additional disadvantage with clustering as well, in that cluster analysis could overestimate the similarities in the input objects. This can obscure individual data points important for some use cases, such as customer segmentation, where the objective is to understand individual customers and their unique buying habits.

Examples and use cases

Exploratory analysis and dimensionality reduction are two of the most common uses for unsupervised learning.

Exploratory analysis, which uses algorithms to detect patterns that were previously unknown, has a range of enterprise applications. For example, businesses can use exploratory analysis as a starting point for their customer segmentation efforts.

In dimensionality reduction, algorithms reduce the number of variables or features -- dimensions -- within the data sets so that the focus can be given to the relevant features for various objectives. Some experts explain this by saying that dimensionality reduction removes noisy data. Machine learning engineers often use latent variable model-based algorithms to do this work. For example, an organization can use dimensionality reduction to read images that are blurry by reducing the background.

Additionally, organizations can use unsupervised learning for the following applications:

  • Clustering anomaly detection. This technique uses unsupervised learning to detect the performance of outliers in a data set grouping without labeling the data.
  • Association rule mining. Unsupervised learning identifies occurrence patterns in large data sets and how they affect each other. This application is often used to detect suspicious activity, disease symptoms and customer shopping habits.
  • Cybersecurity. Cybersecurity software trained in unsupervised learning can help detect when a cyber attack might occur as well as where and how.
  • Customer segmentation. Marketing groups personalize their advertising strategies based on which categories its customers fit into.
  • Medical imaging. Healthcare organizations use the unsupervised machine learning features in radiology and pathology devices to help detect and diagnose patients.
  • Prognostic validity. Often used in healthcare, this application groups patients with similar health issues and predicts how these patients will do over time.
  • Recommendation engines. Organizations gather data about people's browsing, shopping and viewing habits to provide them with personalized content.

Learn more about unsupervised learning and clustering techniques to help categorize data.

This was last updated in May 2023

Continue Reading About unsupervised learning

Dig Deeper on Machine learning platforms

Business Analytics
CIO
Data Management
ERP
Close