
data janitor (data wrangler)

Contributor(s): Matthew Haughn

A data janitor is an IT employee who cleans up big data sources to prepare them for data analysts and data scientists. The role exists so that staff with high-level skills can focus on the work that requires them, rather than on tasks that others can do.

It's estimated that data preparation can make up more than 80 percent of the time involved in data analysis. Data janitors, also known as data wranglers, perform the prep work that must be completed before more sophisticated processing and analysis are possible. A data janitor acquires, inspects, consolidates, cleans up and organizes disparate, disorganized data. Offloading this work lets data analysts and data scientists, who would otherwise have to do it themselves before working with the data, complete their analysis in much less time.
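As a rough illustration of the acquire, inspect, consolidate, clean and organize steps described above, here is a minimal Python sketch that uses only the standard library. The two CSV sources, the column names and the cleaning rules are hypothetical, chosen for the example rather than taken from any particular workflow.

```python
import csv
import io

# Hypothetical raw exports from two sources with inconsistent formatting.
CRM_CSV = """customer_id,name,email
1, Alice Smith ,ALICE@EXAMPLE.COM
2,Bob Jones,bob@example.com
2,Bob Jones,bob@example.com
"""

WEB_CSV = """customer_id,name,email
3,carol white,
4,Dan Brown,dan@example.com
"""


def acquire(*sources):
    """Acquire and consolidate: read every CSV source into one list of rows."""
    rows = []
    for text in sources:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


def inspect(rows):
    """Inspect: count empty values per column."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            if not (value or "").strip():
                missing[column] = missing.get(column, 0) + 1
    return missing


def clean(rows):
    """Clean and organize: normalize fields, drop duplicate IDs, sort by ID."""
    seen = set()
    cleaned = []
    for row in rows:
        record = {
            "customer_id": int(row["customer_id"]),
            "name": row["name"].strip().title(),
            # An empty email becomes None so later tools can treat it as missing.
            "email": row["email"].strip().lower() or None,
        }
        if record["customer_id"] in seen:
            continue  # keep only the first row seen for each customer
        seen.add(record["customer_id"])
        cleaned.append(record)
    return sorted(cleaned, key=lambda r: r["customer_id"])


raw = acquire(CRM_CSV, WEB_CSV)
report = inspect(raw)
tidy = clean(raw)
```

In practice each step is usually far larger and depends on the sources involved, but the shape is the same: pull the data together, find out what is wrong with it, then fix it in a repeatable way.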

Before data janitors do their work, big data is not ready for complex analysis. Their preparation also readies data for use with tools such as Hadoop, Pig, Hive, Spark and MapReduce, and programming languages that include Structured Query Language (SQL), Python, Scala and Perl, as well as statistical computing languages such as R.
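To show what "ready for use" with a language such as SQL can look like, the following sketch loads a few already-cleaned records into an in-memory SQLite database using Python's built-in sqlite3 module. The table name, columns and sample records are assumptions made for this example.

```python
import sqlite3

# Hypothetical records that have already been cleaned and deduplicated.
records = [
    (1, "Alice Smith", "alice@example.com"),
    (2, "Bob Jones", "bob@example.com"),
    (3, "Carol White", None),  # missing email stored as NULL
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
conn.commit()

# Consistent types and NULL handling let an analyst query the data directly.
with_email = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE email IS NOT NULL"
).fetchone()[0]
names = [
    row[0]
    for row in conn.execute("SELECT name FROM customers ORDER BY customer_id")
]
```

Because the identifiers are unique and missing values are stored as NULL rather than empty strings, the queries need no defensive clean-up logic of their own.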

As IT firms acquire and process more and more data, dividing the workload is increasingly important for delivering quality analysis on time. Often, it is junior employees in the field of data analysis who perform this painstaking preparation work. Almost a third of business intelligence workers can be considered data janitors, at least as part of their jobs. The term data janitor is typically not a job title but a description of the task. An employee whose primary role is data preparation may be referred to as a data engineer.

This was last updated in December 2017
