Apache Parquet

Contributor(s): Matthew Haughn

Apache Parquet is a column-oriented storage format for the Apache Hadoop ecosystem. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Parquet is optimized for working with complex data in bulk and supports efficient compression and encoding schemes.

Typically, data is stored in a row-oriented fashion. Databases conventionally store data this way as well, which is optimal for working with one record at a time. Parquet uses a record-shredding and assembly algorithm to break records down into columns and reassemble them later, so that the values of each column are physically stored next to one another. Storing serialized data by column in this way allows for efficient searches across massive data sets. Since Hadoop is made for big data, columnar storage is a complementary technology.
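
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration for flat records, not Parquet's actual record-shredding algorithm, which also handles nested and repeated fields. The sample records and the helper names to_columns and to_rows are assumptions made for this example.

```python
# Row-oriented layout: all fields of one record are kept together.
rows = [
    {"id": 1, "country": "US", "amount": 9.5},
    {"id": 2, "country": "US", "amount": 3.0},
    {"id": 3, "country": "DE", "amount": 7.25},
]


def to_columns(records):
    """Pivot flat records into one list of values per column (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_rows(columns):
    """Reassemble the original records from the per-column lists (hypothetical helper)."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Column-oriented layout: all values of one column are kept together.
columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total = sum(columns["amount"])
```

Summing the amount column here reads a single list and never touches id or country, which is the access pattern a columnar file format is designed to make cheap.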

Storing data in a columnar format provides benefits such as:

  • Columnar data compresses more efficiently, which saves storage space.
  • Because the values in a column share a type and tend to be similar, compression can be tailored to that specific type of data.
  • Queries that look for specific column values need not read the entire row's data, which makes searches faster.
  • A different encoding can be used for each column, so better compression schemes can be adopted as they are developed.
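
As a rough illustration of why similar values stored together compress well, the sketch below applies a simple run-length encoding to one column. It is a simplified stand-in under stated assumptions, not the encoding Parquet writes to disk; the sample column and the function names run_length_encode and run_length_decode are made up for this example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list of values."""
    return [value for value, count in pairs for _ in range(count)]


# A single column holds values of one type, often with long repeated runs.
country = ["US", "US", "US", "DE", "DE", "US"]

encoded = run_length_encode(country)
decoded = run_length_decode(encoded)
```

Six stored values shrink to three pairs here. In a row-oriented layout the same country values would be interleaved with other fields, so runs like these would not be adjacent and this kind of encoding would gain little.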

Parquet uses the Apache Thrift framework to define its file metadata, which increases flexibility by making the format usable from several languages, including C++, Java and Python.
Parquet is compatible with the majority of data processing frameworks in Hadoop. Other columnar storage file formats include RCFile and its optimized successor, ORC (Optimized Row Columnar).

Parquet is a top-level project sponsored by the Apache Software Foundation (ASF). The project originated as a joint effort of Twitter and Cloudera.

This was last updated in December 2017
