Browse Definitions :
Definition

Apache Parquet

Apache Parquet is a column-oriented storage format for Hadoop. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Parquet is optimized to work with complex data in bulk and includes methods for efficient data compression and encoding types.

Typically, data is stored in a row-oriented fashion. Even in databases, data is conventionally stored in this way and is optimized for working with one record at a time. Parquet uses a record-shredding and assembly algorithm to break down data and reassemble it so values in each column are physically stored in contiguous memory locations. Data stored by column in this serialized method allows for efficient searches across massive data sets. Since Hadoop is made for big data, columnar storage is a complementary technology.  

Storing data in a columnar format provides benefits such as:

  • More efficient compression due to space saved by the columnar format.
  • Likeness of the data of columns enables data compression for the specific type of data.
  • Queries searching specific column values need not read the entire row's data, making searches faster.
  • Different encoding can be used per column, allowing for better compression to be used as it is developed.

Parquet’s Apache Thrift framework increases flexibility, to allow working with C++, Java and Python.
Parquet is compatible with the majority of data processing frameworks in Hadoop. Other columnar storage file formats include ORC, RCFile and optimized RCFile.

Parquet is a top-level project sponsored by the Apache Software Foundation (ASF). The project originated as a joint effort of Twitter and Cloudera.

This was last updated in December 2017

Continue Reading About Apache Parquet

SearchCompliance

  • compliance risk

    Compliance risk is an organization's potential exposure to legal penalties, financial forfeiture and material loss, resulting ...

  • information governance

    Information governance is a holistic approach to managing corporate information by implementing processes, roles, controls and ...

  • enterprise document management (EDM)

    Enterprise document management (EDM) is a strategy for overseeing an organization's paper and electronic documents so they can be...

SearchSecurity

  • information security (infosec)

    Information security, often shortened to infosec, is the practice, policies and principles to protect data and other kinds of ...

  • denial-of-service attack

    A denial-of-service (DoS) attack is a security event that occurs when an attacker makes it impossible for legitimate users to ...

  • user authentication

    User authentication verifies the identity of a user attempting to gain access to a network or computing resource by authorizing a...

SearchHealthIT

SearchDisasterRecovery

  • risk mitigation

    Risk mitigation is a strategy to prepare for and lessen the effects of threats faced by a business.

  • call tree

    A call tree is a layered hierarchical communication model that is used to notify specific individuals of an event and coordinate ...

  • Disaster Recovery as a Service (DRaaS)

    Disaster recovery as a service (DRaaS) is the replication and hosting of physical or virtual servers by a third party to provide ...

SearchStorage

  • cloud storage

    Cloud storage is a service model in which data is transmitted and stored on remote storage systems, where it is maintained, ...

  • cloud testing

    Cloud testing is the process of using the cloud computing resources of a third-party service provider to test software ...

  • storage virtualization

    Storage virtualization is the pooling of physical storage from multiple storage devices into what appears to be a single storage ...

Close