Browse Definitions :
Definition

over sampling and under sampling

Over sampling and under sampling are techniques used in data mining and data analytics to modify unequal data classes to create balanced data sets. Over sampling and under sampling are also known as resampling.

These data analysis techniques are often used to be more representative of real world data. For example, data adjustments can be made in order to provide balanced training materials for AI and machine learning algorithms.

One area where over sampling and under sampling techniques are used is for survey research. A survey sample population may be unbalanced in terms of types of participants, which can deter the larger population that the survey is meant to study. By using over or under sampling, the ratios of surveyed characteristics, such as gender, age group and ethnicity, can used to make the weight of the data better representative of the group’s ratios within the greater populations.

Over sampling vs. under sampling

When one class of data is the underrepresented minority class in the data sample, over sampling techniques maybe used to duplicate these results for a more balanced amount of positive results in training. Over sampling is used when the amount of data collected is insufficient. A popular over sampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by randomly sampling the characteristics from occurrences in the minority class.

Conversely, if a class of data is the overrepresented majority class, under sampling may be used to balance it with the minority class. Under sampling is used when the amount of collected data is sufficient. Common methods of under sampling include cluster centroids and Tomek links, both of which target potential overlapping characteristics within the collected data sets to reduce the amount of majority data.

In both over sampling and under sampling, simple data duplication is rarely suggested. Generally, over sampling is preferable as under sampling can result in the loss of important data. Under sampling is suggested when the amount of data collected is larger than ideal and can help data mining tools to stay within the limits of what they can effectively process.

This was last updated in November 2018

Continue Reading About over sampling and under sampling

Networking
  • firewall as a service (FWaaS)

    Firewall as a service (FWaaS), also known as a cloud firewall, is a service that provides cloud-based network traffic analysis ...

  • private 5G

    Private 5G is a wireless network technology that delivers 5G cellular connectivity for private network use cases.

  • NFVi (network functions virtualization infrastructure)

    NFVi (network functions virtualization infrastructure) encompasses all of the networking hardware and software needed to support ...

Security
  • virus (computer virus)

    A computer virus is a type of malware that attaches itself to a program or file. A virus can replicate and spread across an ...

  • Certified Information Security Manager (CISM)

    Certified Information Security Manager (CISM) is an advanced certification that indicates that an individual possesses the ...

  • cryptography

    Cryptography is a method of protecting information and communications using codes, so that only those for whom the information is...

CIO
  • IT project management

    IT project management is the process of planning, organizing and delineating responsibility for the completion of an ...

  • chief financial officer (CFO)

    A chief financial officer (CFO) is the corporate title for the person responsible for managing a company's financial operations ...

  • chief strategy officer (CSO)

    A chief strategy officer (CSO) is a C-level executive charged with helping formulate, facilitate and communicate an ...

HRSoftware
  • HR automation

    Human resources automation (HR automation) is a method of using software to automate and streamline repetitive and laborious HR ...

  • compensation management

    Compensation management is the discipline and process for determining employees' appropriate pay and benefits.

  • HR technology (human resources tech)

    HR technology (human resources technology) is an umbrella term for hardware and software used to automate the human resource ...

Customer Experience
  • martech (marketing technology)

    Martech (marketing technology) refers to the integration of software tools, platforms, and applications designed to streamline ...

  • transactional marketing

    Transactional marketing is a business strategy that focuses on single, point-of-sale transactions.

  • customer profiling

    Customer profiling is the detailed and systematic process of constructing a clear portrait of a company's ideal customer by ...

Close