
over sampling and under sampling

Contributor(s): Matthew Haughn

Over sampling and under sampling are techniques used in data mining and data analytics to adjust unequal data classes and create balanced data sets. Together, the two techniques are also known as resampling.

These data analysis techniques are often used to make a data sample more representative of real-world data. For example, the class distribution can be adjusted in order to provide balanced training data for AI and machine learning algorithms.

One area where over sampling and under sampling techniques are used is survey research. A survey sample may be unbalanced in terms of types of participants, which can distort conclusions about the larger population that the survey is meant to study. By using over or under sampling, the ratios of surveyed characteristics, such as gender, age group and ethnicity, can be used to weight the data so that it is more representative of each group's ratio within the greater population.
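The survey adjustment described above can be sketched as a simple weighting step. This is a minimal illustration, not a full survey methodology: the function name `poststratification_weights`, the respondent counts and the 50/50 population split are all assumptions made up for the example.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per group: population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}


# Hypothetical survey: 70 of 100 respondents are in group "F",
# but the population is assumed to be split 50/50.
sample = ["F"] * 70 + ["M"] * 30
weights = poststratification_weights(sample, {"F": 0.5, "M": 0.5})

# Weighted respondent totals per group after the adjustment.
weighted_total = {g: weights[g] * c for g, c in Counter(sample).items()}
```

Groups that are overrepresented in the sample receive a weight below 1 (the equivalent of under sampling), and underrepresented groups receive a weight above 1 (the equivalent of over sampling).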

Over sampling vs. under sampling

When one class of data is the underrepresented minority class in the data sample, over sampling techniques may be used to duplicate these results for a more balanced amount of positive results in training. Over sampling is used when the amount of data collected is insufficient. A popular over sampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by interpolating between an occurrence in the minority class and one of its nearest neighbors in the same class.
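The core idea behind SMOTE can be sketched in a few lines. This is a simplified illustration using only the Python standard library, not the reference implementation: the function name `smote_sketch`, the parameter `k` and the sample points are assumptions chosen for the example, and production work would more likely rely on an established library.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbors of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class records with two numeric characteristics.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
```

Because each synthetic point lies on a line between two real minority records, the new data stays inside the region the minority class already occupies rather than repeating existing records exactly.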

Conversely, if a class of data is the overrepresented majority class, under sampling may be used to balance it with the minority class. Under sampling is used when the amount of collected data is sufficient. Common methods of under sampling include cluster centroids, which replaces groups of similar majority records with the centroid of each group, and Tomek links, which removes majority records that are the closest neighbors of minority records. Both target redundant or overlapping characteristics within the collected data sets to reduce the amount of majority data.
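As an illustration of the Tomek links approach, the sketch below finds pairs of records from different classes that are each other's nearest neighbor and drops the majority member of each pair. It is a brute-force version written for clarity; the function name `tomek_link_undersample`, the labels and the data points are assumptions made up for the example.

```python
import math


def tomek_link_undersample(points, labels, majority_label):
    """Drop majority points that form a Tomek link with a minority point."""

    def nearest(i):
        # Index of the closest other point (brute force).
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    drop = set()
    for i, j in enumerate(nn):
        # A Tomek link: mutual nearest neighbors with different labels.
        if nn[j] == i and labels[i] != labels[j]:
            drop.add(i if labels[i] == majority_label else j)
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical data: one "neg" record sits right next to a "pos" record.
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.3), (1.0, 1.0), (1.1, 1.0), (2.0, 2.0)]
labels = ["neg", "neg", "neg", "neg", "pos", "pos"]
kept_points, kept_labels = tomek_link_undersample(points, labels, "neg")
```

In this example only the majority record at (1.0, 1.0) is removed, because it is the one that overlaps the minority class. Tomek links typically remove few records, so the method cleans the class boundary more than it equalizes class sizes.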

In both over sampling and under sampling, simply duplicating or discarding records at random is rarely suggested. Generally, over sampling is preferable, as under sampling can result in the loss of important data. Under sampling is suggested when the amount of data collected is larger than ideal, in which case it can help data mining tools stay within the limits of what they can effectively process.
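The trade-off can be made concrete with the simplest form of under sampling, which randomly trims every class down to the size of the smallest one. This is a minimal sketch under assumed data: the function name `random_undersample` and the 950/50 class split are hypothetical.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Randomly trim every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical data set: 950 majority records and 50 minority records.
rows = list(range(1000))
labels = ["majority"] * 950 + ["minority"] * 50
small_rows, small_labels = random_undersample(rows, labels)
discarded = len(rows) - len(small_rows)
```

Balancing this data set discards 900 of the 1,000 records. That is acceptable when the collected data is larger than the tools can process, but it is a significant loss when data is scarce.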

This was last updated in November 2018
