Browse Definitions :
Definition

over sampling and under sampling

Over sampling and under sampling are techniques used in data mining and data analytics to modify unequal data classes to create balanced data sets. Over sampling and under sampling are also known as resampling.

These data analysis techniques are often used to be more representative of real world data. For example, data adjustments can be made in order to provide balanced training materials for AI and machine learning algorithms.

One area where over sampling and under sampling techniques are used is for survey research. A survey sample population may be unbalanced in terms of types of participants, which can deter the larger population that the survey is meant to study. By using over or under sampling, the ratios of surveyed characteristics, such as gender, age group and ethnicity, can used to make the weight of the data better representative of the group’s ratios within the greater populations.

Over sampling vs. under sampling

When one class of data is the underrepresented minority class in the data sample, over sampling techniques maybe used to duplicate these results for a more balanced amount of positive results in training. Over sampling is used when the amount of data collected is insufficient. A popular over sampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by randomly sampling the characteristics from occurrences in the minority class.

Conversely, if a class of data is the overrepresented majority class, under sampling may be used to balance it with the minority class. Under sampling is used when the amount of collected data is sufficient. Common methods of under sampling include cluster centroids and Tomek links, both of which target potential overlapping characteristics within the collected data sets to reduce the amount of majority data.

In both over sampling and under sampling, simple data duplication is rarely suggested. Generally, over sampling is preferable as under sampling can result in the loss of important data. Under sampling is suggested when the amount of data collected is larger than ideal and can help data mining tools to stay within the limits of what they can effectively process.

This was last updated in November 2018

Continue Reading About over sampling and under sampling

SearchCompliance
  • compliance risk

    Compliance risk is an organization's potential exposure to legal penalties, financial forfeiture and material loss, resulting ...

  • information governance

    Information governance is a holistic approach to managing corporate information by implementing processes, roles, controls and ...

  • enterprise document management (EDM)

    Enterprise document management (EDM) is a strategy for overseeing an organization's paper and electronic documents so they can be...

SearchSecurity
  • IPsec (Internet Protocol Security)

    IPsec (Internet Protocol Security) is a suite of protocols and algorithms for securing data transmitted over the internet or any ...

  • principle of least privilege (POLP)

    The principle of least privilege (POLP) is a concept in computer security that limits users' access rights to only what are ...

  • biometric authentication

    Biometric authentication is a security process that relies on the unique biological characteristics of individuals to verify they...

SearchHealthIT
SearchDisasterRecovery
  • risk mitigation

    Risk mitigation is a strategy to prepare for and lessen the effects of threats faced by a business.

  • call tree

    A call tree is a layered hierarchical communication model that is used to notify specific individuals of an event and coordinate ...

  • Disaster Recovery as a Service (DRaaS)

    Disaster recovery as a service (DRaaS) is the replication and hosting of physical or virtual servers by a third party to provide ...

SearchStorage
  • cloud storage

    Cloud storage is a service model in which data is transmitted and stored on remote storage systems, where it is maintained, ...

  • cloud testing

    Cloud testing is the process of using the cloud computing resources of a third-party service provider to test software ...

  • storage virtualization

    Storage virtualization is the pooling of physical storage from multiple storage devices into what appears to be a single storage ...

Close