data janitor (data wrangler)

Contributor(s): Matthew Haughn

A data janitor is an IT employee who cleans up big data sources to prepare them for data analysts and data scientists. The role exists so that people with higher-level skills can spend their time on analysis rather than on preparation work that others can do.

It's estimated that data preparation can make up more than 80 percent of the time involved in data analysis. Data janitors, also known as data wranglers, perform the prep work that must be completed before more sophisticated processing and analysis are possible. A data janitor acquires, inspects, consolidates, cleans up and organizes disparate, disorganized data. Offloading this work lets data analysts and data scientists finish their analysis in much less time, because more skilled IT staff would otherwise have to do the preparation themselves before actually working with the data.
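The acquire, inspect, consolidate, clean and organize steps described above can be illustrated with a minimal Python sketch. It uses only the standard library; the two sample sources, the column names and the helper functions (acquire, clean, prepare) are hypothetical and invented for illustration, not taken from any particular tool or workflow.

```python
import csv
import io

# Two hypothetical raw sources with inconsistent formatting (invented sample data).
SOURCE_A = """customer,amount
 Alice ,10.50
BOB,7
alice,10.50
"""
SOURCE_B = """customer,amount
Carol,n/a
bob,3.25
"""


def acquire(text):
    """Acquire: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(row):
    """Clean one row: trim and normalize the name, parse the amount.

    Returns None when the amount cannot be parsed, so the caller can drop it.
    """
    name = row["customer"].strip().title()
    try:
        amount = float(row["amount"])
    except (TypeError, ValueError):
        return None
    return {"customer": name, "amount": amount}


def prepare(*sources):
    """Consolidate all sources, clean rows, drop duplicates, then sort."""
    # Consolidate: merge rows from every source into one list.
    rows = [row for text in sources for row in acquire(text)]
    # Inspect and clean: keep only rows that could be repaired.
    cleaned = [c for c in map(clean, rows) if c is not None]
    # Remove exact duplicates while preserving first-seen order.
    seen, unique = set(), []
    for r in cleaned:
        key = (r["customer"], r["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Organize: sort by customer, then by amount.
    return sorted(unique, key=lambda r: (r["customer"], r["amount"]))


tidy = prepare(SOURCE_A, SOURCE_B)
for record in tidy:
    print(record)
```

In this sketch the unparseable "n/a" row is simply dropped and the duplicate record collapses into one; real preparation work would typically log or review such rows rather than discard them silently.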

Before data janitors do their work, big data is not ready for complex analysis. Their preparation also readies data for use with tools such as Hadoop, Pig, Hive, Spark and MapReduce; programming languages that include Structured Query Language (SQL), Python, Scala and Perl; and statistical computing languages such as R.

As IT firms acquire and process more and more data, dividing the workload is increasingly important for delivering quality analysis on time. Often, it is junior employees in the field of data analysis who perform this painstaking preparation work. Almost a third of business intelligence workers can be considered data janitors, at least as part of their jobs. The term data janitor is typically not a job title but a description of the task. An employee whose primary role is data preparation may be referred to as a data engineer.

This was last updated in December 2017
