Apache Parquet

Contributor(s): Matthew Haughn

Apache Parquet is a column-oriented storage format for Hadoop. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Parquet is optimized to work with complex data in bulk and supports efficient data compression and encoding schemes.

Data is typically stored in a row-oriented fashion. Most databases conventionally store data this way, which is optimized for working with one record at a time. Parquet uses a record-shredding and assembly algorithm to break down data and reassemble it so that the values in each column are physically stored next to one another on disk. Storing serialized data by column allows for efficient scans across massive data sets. Since Hadoop is made for big data, columnar storage is a complementary technology.
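The difference between the two layouts can be sketched in a few lines of Python. This is a toy illustration, not Parquet's actual algorithm: it assumes flat records with identical fields, whereas real record shredding also handles nested and repeated fields. The sample records and the helper names to_columns and to_rows are hypothetical.

```python
# Toy illustration only: flat records, no nesting, no on-disk format.
records = [
    {"user_id": 1, "country": "US", "clicks": 10},
    {"user_id": 2, "country": "US", "clicks": 7},
    {"user_id": 3, "country": "DE", "clicks": 3},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def to_rows(columns):
    """Reassemble row dicts from equally long column lists."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


columns = to_columns(records)

# An aggregate over one column touches only that column's values.
total_clicks = sum(columns["clicks"])
```

In the columnar form, summing clicks reads a single list and never touches user_id or country, which is the access pattern that analytical queries tend to follow.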

Storing data in a columnar format provides benefits such as:

  • More efficient compression, because the values within a column share a type and tend to be similar.
  • Compression can be tailored to the specific type of data in each column.
  • Queries that need only specific columns do not have to read each row's entire data, making them faster.
  • A different encoding can be used for each column, so better encodings can be adopted as they are developed.
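As a rough sketch of why per-column encoding helps, the following Python applies two simple encodings to two hypothetical columns. Run-length and dictionary encoding are both among the techniques Parquet uses, but the functions below are simplified stand-ins written for illustration and do not reflect Parquet's actual binary layout.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Hypothetical column data, chosen to show each encoding at its best.
country = ["US", "US", "US", "DE", "DE", "US"]
clicks = [10, 7, 3, 3, 3, 8]

# Few distinct strings: store each string once, then small integers.
country_dict, country_ids = dictionary_encode(country)

# Repeated neighbouring values: store each run once with its length.
clicks_runs = run_length_encode(clicks)
```

Because each column is encoded independently, a low-cardinality string column and a numeric column with long runs can each use whichever scheme suits them, which would not be possible if the values were interleaved row by row.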

Parquet uses the Apache Thrift framework to define its file metadata, which increases flexibility and allows working with C++, Java and Python.
Parquet is compatible with the majority of data processing frameworks in Hadoop. Other columnar storage file formats include RCFile and its successor, ORC (Optimized Row Columnar).

Parquet is a top-level project sponsored by the Apache Software Foundation (ASF). The project originated as a joint effort of Twitter and Cloudera.

This was last updated in December 2017
