Pivotal HAWQ – MPP database on HDFS

In this post I will go through the architecture of Pivotal HAWQ and how it works.

I strongly suggest going through Introduction to Massively Parallel Processing (MPP) database before reading this, as it covers MPP concepts you will need here.

Pivotal HAWQ is a Massively Parallel Processing (MPP) database built from several Postgres database instances with HDFS as the storage layer. Think of a regular MPP database like Teradata/Greenplum/Netezza, but instead of using local storage it uses HDFS to store its datafiles. Each processing node still has its own CPU, memory and storage.

Since it is built on Postgres, the SQL dialect is essentially PostgreSQL. Users connect to the HAWQ master, issue SQL statements and interact with it just like a regular Postgres database.

You can also download a single-node VM here, which comes with Pivotal Hadoop and HAWQ pre-installed.

Let's look at the components of HAWQ and how they work together. Below is a sample physical architecture for Pivotal HAWQ.

[Figure: HAWQ architecture]

1 HAWQ Master Server – a separate physical server with its own OS/CPU/storage/memory that hosts the HAWQ master database. There is no user data in the master database; it stores metadata about database objects and segments – think of it in terms of system tables (a sample catalog query is sketched after this list). The master is also the entry point to HAWQ: it manages user requests, generates query plans and submits them to the HAWQ segments.

2 HDFS Namenode – a separate physical server with its own OS/CPU/storage/memory; it is the centerpiece of the HDFS system. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.

3,4,5 Segment Servers – individual physical servers with their own OS/CPU/storage/memory. Each hosts a HAWQ segment process and an HDFS DataNode. HAWQ segments do not store any data on the local file system; table data lives in HDFS.

6 Interconnect switch – the segment servers and DataNodes communicate over an interconnect switch.
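
Because the master keeps its metadata in system tables, you can usually inspect the cluster layout with a plain catalog query. The sketch below assumes HAWQ exposes the Greenplum-style gp_segment_configuration catalog table:

-- List the segments registered with the master
-- (Greenplum-style catalog; column names may differ by HAWQ version).
SELECT *
FROM   gp_segment_configuration ;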

Let's start with a sample data set that I will be using throughout this article:

product_id    product_name
1             Apple MacBook Pro
2             Apple Iphone 4s
3             Apple Ipad 3
4             Samsung Galaxy Tab 10.1
5             Blackberry Bold
6             Amazon Kindle Fire
7             Google Android Tablet 9"
8             Nook Color
9             Lenovo IdeaPad

Since HAWQ is an MPP database, data is distributed across the HAWQ segments to achieve data and processing parallelism. This is done by creating the table with a DISTRIBUTED BY clause, which spreads the rows automatically across the segment servers. In HAWQ you can use either hash or round-robin distribution.
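
Before moving on to the hash example, here is a minimal sketch of the round-robin alternative. It assumes the DISTRIBUTED RANDOMLY syntax that Greenplum-derived databases use; the table name is hypothetical:

-- Round-robin (random) distribution: rows are spread across segments
-- without a hash key. Useful when there is no good distribution column.
CREATE TABLE product_dim_rr
(product_id    integer,
 product_name  varchar(200))
DISTRIBUTED RANDOMLY ;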

In this example we will use hash distribution to spread the data across the HAWQ segments. To get an even distribution of data across segments, it makes sense to use the primary key as the distribution key.

CREATE TABLE product_dim
(product_id     integer,
product_name     varchar(200))
DISTRIBUTED BY (product_id) ;

Once the table is created, let's insert 9 records and see how the data is distributed and stored based on the hash of product_id.

insert into product_dim values (1,'Apple MacBook Pro');
insert into product_dim values (2,'Apple Iphone 4s');
insert into product_dim values (3,'Apple Ipad 3');
insert into product_dim values (4,'Samsung Galaxy Tab 10.1');
insert into product_dim values (5,'Blackberry Bold');
insert into product_dim values (6,'Amazon Kindle Fire');
insert into product_dim values (7,'Google Android Tablet 9"');
insert into product_dim values (8,'Nook Color');
insert into product_dim values (9,'Lenovo IdeaPad');

[Figure: Data distribution across HAWQ segments]

As the distribution key is product_id, each row is hashed on that key and the rows are spread evenly across the HAWQ segment servers. Now let's look at how the data is stored, since the underlying storage is HDFS.
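
If you want to verify the distribution yourself, one way is to group by the hidden gp_segment_id system column, assuming HAWQ exposes it the same way Greenplum does; this is only a sketch:

-- Count how many rows landed on each segment.
SELECT gp_segment_id, count(*) AS row_count
FROM   product_dim
GROUP  BY gp_segment_id
ORDER  BY gp_segment_id ;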

Case I: Insert data into HAWQ

When a record needs to be inserted into a table, the HAWQ master decides, based on the distribution key, which HAWQ segment server the record should go to. It also communicates with the HDFS NameNode to get an HDFS block address to store the data. The NameNode assigns an HDFS data block – preferably on the same DataNode as the target HAWQ segment – and returns the HDFS URL to the HAWQ master.

The HAWQ master passes this information to the HAWQ segment, which in turn stores the data in the HDFS block.

Case II: Select data from HAWQ

select * from product_dim where product_id = 5 ;

When this query is issued, the HAWQ master parses it and builds a query plan. Since the query filters on the distribution key column, the plan is sent only to HAWQ segment 2 for execution – remember, the master stores metadata about the data distribution of each table. The master also sends the HDFS URL of the required HDFS data block. HAWQ segment 2 then uses that HDFS URL to read the data and sends the result set back to the HAWQ master.
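
To see the plan the master builds, you can prefix the query with EXPLAIN, just as in Postgres. The exact plan output will vary with the HAWQ version and table statistics:

-- Ask the master for the query plan without running the query.
EXPLAIN
SELECT * FROM product_dim WHERE product_id = 5 ;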

Key learning points

  • HAWQ is an MPP database with HDFS as its storage
  • The SQL dialect is the same as PostgreSQL

Frequently Asked Questions

These are some questions I get asked frequently:

•    Does HAWQ run MapReduce jobs like Hive when you run a SQL query?
HAWQ does not run MapReduce jobs when users run a SQL query against it. It uses regular database calls and low-level APIs to access the HDFS data.

•    What kind of language does HAWQ support?
You can run any SQL query that you can run on a Postgres database, except UPDATE and DELETE on tables. This limitation comes from the HDFS storage layer, which does not support in-place updates.
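
A common workaround, sketched below under the assumption that CREATE TABLE AS, DROP TABLE and ALTER TABLE ... RENAME behave as in Postgres/Greenplum, is to rebuild the table without the unwanted rows instead of deleting in place (the new table name is hypothetical):

-- Simulate a DELETE: rebuild the table without the unwanted row(s).
CREATE TABLE product_dim_v2 AS
SELECT product_id, product_name
FROM   product_dim
WHERE  product_id <> 9          -- "delete" row 9
DISTRIBUTED BY (product_id);

-- Then swap the tables: drop the old one and rename the new one.
DROP TABLE product_dim;
ALTER TABLE product_dim_v2 RENAME TO product_dim;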

•    Can I create stored procedures on HAWQ?
Yes, you can create stored procedures (functions) on HAWQ.
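
For example, here is a minimal PL/pgSQL function sketch, assuming the plpgsql language is enabled on your HAWQ installation (the function name is hypothetical):

-- Return the product name for a given product_id.
CREATE OR REPLACE FUNCTION get_product_name(p_id integer)
RETURNS varchar AS $$
DECLARE
    v_name varchar(200);
BEGIN
    SELECT product_name INTO v_name
    FROM   product_dim
    WHERE  product_id = p_id;
    RETURN v_name;
END;
$$ LANGUAGE plpgsql;

-- Usage:
SELECT get_product_name(5);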

•    Are there any similar products out there in the market?
Cloudera's Impala. Check here for more details.

About Diwakar Kasibhotla

Oracle Database Tuning, VLDB designing, ETL/BI architect, Data Modelling, Exadata Architect.

7 Responses to Pivotal HAWQ – MPP database on HDFS

  1. ravi says:

    can i use hawq to query data in greenplum database

  2. ravi says:

    suppose if i have table named sample in gpdb and i want to access that table in hawq so how to
    do that

  3. ravi says:

    how to load data into hawq from a sql source

  4. Anonymous says:

    where hawk better than simple greenplum??

    • HAWQ and Greenplum should be used for different purposes. Even though HAWQ is essentially Greenplum on HDFS, it has limitations such as no ability to perform deletes and updates.
      Scalability of HAWQ can also be challenging, as the segments now have to communicate with both the NameNode and the HAWQ master.
