In this post I will go through the architecture of Pivotal HAWQ and how it works.
I strongly suggest reading Introduction to Massively Parallel Processing (MPP) databases before this post, as it relies on some MPP database concepts.
Pivotal HAWQ is a Massively Parallel Processing (MPP) database that combines several Postgres database instances with HDFS storage. Think of a regular MPP database like Teradata/Greenplum/Netezza, but one that stores its data files on HDFS instead of local storage. Each processing node still has its own CPU/memory and storage.
Because it is built on Postgres, its SQL dialect is essentially PostgreSQL. Users connect to the HAWQ master, issue SQL statements, and interact with it just like a Postgres database.
You can also download a single-node VM here. This VM has Pivotal Hadoop and HAWQ installed.
Let's look at the components of HAWQ and how they work together. Below is a sample physical architecture for Pivotal HAWQ.
1 HAWQ Master Server – A separate physical server with its own OS/CPU/storage/memory. Hosts the HAWQ master database. The master database stores no user data; it stores only metadata about database objects and segments – think system tables. The master is also the entry point to HAWQ: it accepts user requests, generates query plans, and submits them to the HAWQ segments.
2 HDFS Namenode – A separate physical server with its own OS/CPU/storage/memory; it is the centerpiece of the HDFS system. Client applications talk to the NameNode whenever they want to locate a file, or to add/copy/move/delete a file. The NameNode responds to successful requests with a list of the relevant Datanode servers where the data lives.
3,4,5 Segment Servers – Individual physical servers with their own OS/CPU/storage/memory. Each hosts a HAWQ segment process and an HDFS datanode. HAWQ segment servers do not store any data on the local file system.
6 Interconnect switch – Segment servers and datanodes communicate through an interconnect switch.
Let's start with the sample data set I will use throughout this article:
1 Apple MacBook Pro
2 Apple Iphone 4s
3 Apple Ipad 3
4 Samsung Galaxy Tab 10.1
5 Blackberry Bold
6 Amazon Kindle Fire
7 Google Android Tablet 9″
8 Nook Color
9 Lenovo IdeaPad
Since HAWQ is an MPP database, data is distributed across the HAWQ segments to achieve data and processing parallelism. This is done by creating the table with a DISTRIBUTED BY clause, which spreads the data automatically across the segment servers. HAWQ supports either hash or round-robin distribution.
In this example we will use hash distribution to spread the data across the HAWQ segments. To get an even distribution across segments, it usually makes sense to use the primary key as the distribution key.
CREATE TABLE product_dim
(product_id integer,
 product_name varchar(200))
DISTRIBUTED BY (product_id);
Once the table is created, let's insert 9 records and see how the data is distributed and stored based on hashing of product_id.
insert into product_dim values (1, 'Apple MacBook Pro');
insert into product_dim values (2, 'Apple Iphone 4s');
insert into product_dim values (3, 'Apple Ipad 3');
insert into product_dim values (4, 'Samsung Galaxy Tab 10.1');
insert into product_dim values (5, 'Blackberry Bold');
insert into product_dim values (6, 'Amazon Kindle Fire');
insert into product_dim values (7, 'Google Android Tablet 9"');
insert into product_dim values (8, 'Nook Color');
insert into product_dim values (9, 'Lenovo IdeaPad');
Because product_id is the distribution key, each row is hashed on that column and the rows are spread evenly across the HAWQ segment servers. Now let's look at how the data is stored, since the underlying storage is HDFS.
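The distribution step can be sketched in Python. This is a toy model: the modulo hash and three-segment cluster are assumptions for illustration, not HAWQ's actual hashing internals.

```python
# Toy model of hash distribution across HAWQ segments.
# The modulo-hash scheme below is illustrative, not HAWQ's exact algorithm.
NUM_SEGMENTS = 3  # segments in our sample cluster

products = {
    1: "Apple MacBook Pro",
    2: "Apple Iphone 4s",
    3: "Apple Ipad 3",
    4: "Samsung Galaxy Tab 10.1",
    5: "Blackberry Bold",
    6: "Amazon Kindle Fire",
    7: 'Google Android Tablet 9"',
    8: "Nook Color",
    9: "Lenovo IdeaPad",
}

def segment_for(product_id: int) -> int:
    """Map a distribution-key value to a segment."""
    return product_id % NUM_SEGMENTS

segments = {s: [] for s in range(NUM_SEGMENTS)}
for pid, name in products.items():
    segments[segment_for(pid)].append(name)

for s, rows in segments.items():
    print(f"segment {s}: {rows}")
```

With nine rows and three segments, each segment ends up holding three rows, which is the even spread the distribution key is meant to achieve.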
Case I: Insert data into HAWQ
When a record needs to be inserted into a table, the HAWQ master decides, based on the distribution key, which HAWQ segment server the record goes to. It also asks the HDFS Namenode for an HDFS block to store the data. The Namenode assigns an HDFS data block – preferably on the same Datanode that hosts the target HAWQ segment – and returns the HDFS URL to the HAWQ master.
The HAWQ master passes this information to the HAWQ segment, which in turn stores the data in the HDFS block.
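The insert path above can be sketched as a short Python simulation. Every name here (segment_for, namenode_allocate_block, the block URL format) is invented for illustration; real HAWQ and HDFS interfaces differ.

```python
# Toy sketch of the insert path: the master routes the row to a segment,
# then asks the namenode for an HDFS block located on that segment's node.
# All function names and URL formats are illustrative, not real APIs.
NUM_SEGMENTS = 3
next_block = 0

def segment_for(product_id: int) -> int:
    return product_id % NUM_SEGMENTS  # stand-in for HAWQ's hash

def namenode_allocate_block(preferred_node: int) -> str:
    """Pretend-namenode: hand out a block URL, preferring the segment's own node."""
    global next_block
    next_block += 1
    return f"hdfs://datanode{preferred_node}/blk_{next_block}"

def master_insert(product_id: int, name: str) -> tuple[int, str]:
    seg = segment_for(product_id)       # 1. master picks the target segment
    url = namenode_allocate_block(seg)  # 2. namenode assigns a co-located block
    return seg, url                     # 3. segment writes the row to that block

print(master_insert(5, "Blackberry Bold"))
```

The key design point the sketch captures is data locality: the block is allocated on the same node as the segment, so the segment writes (and later reads) without crossing the network.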
Case II: Select data from HAWQ
select * from product_dim where product_id = 5;
When this query is issued, the HAWQ master parses it and builds a query plan. Because the query filters on the distribution key column, the plan is sent only to HAWQ segment 2 for execution – remember, the master stores metadata about each table's data distribution. The master also sends the HDFS URL of the required HDFS data block. HAWQ segment 2 then uses that URL to read the data and sends the result set back to the HAWQ master.
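The routing decision can be sketched as follows. This is a simplified model under the same assumed modulo hash as before: a filter on the distribution key lets the master target a single segment, while any other filter must fan out to every segment.

```python
# Toy sketch of query routing: a filter on the distribution key targets one
# segment; any other filter scans all segments. Illustrative only.
NUM_SEGMENTS = 3

def segment_for(product_id: int) -> int:
    return product_id % NUM_SEGMENTS  # stand-in for HAWQ's hash

def segments_to_scan(filter_column: str, value=None) -> list[int]:
    if filter_column == "product_id":    # distribution key: one segment suffices
        return [segment_for(value)]
    return list(range(NUM_SEGMENTS))     # any other column: full fan-out

print(segments_to_scan("product_id", 5))  # targeted scan: [2]
print(segments_to_scan("product_name"))   # full fan-out: [0, 1, 2]
```

This is why choosing a distribution key that matches common filter and join columns matters: it lets the planner prune work down to a single segment instead of touching the whole cluster.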
Key learning points
- HAWQ is an MPP database with HDFS as its storage layer
- The SQL dialect is the same as PostgreSQL
Frequently Asked Questions
These are some questions I frequently get asked
• Does HAWQ run MapReduce jobs like Hive when you run a SQL query?
HAWQ does not run MapReduce jobs when users run a SQL query against it. It uses regular database calls and low-level APIs to access the HDFS data.
• What kind of language does HAWQ support?
You can run any SQL query that you can run on a Postgres database – except UPDATE/DELETE on tables. This limitation comes from the append-only nature of HDFS storage.
• Can I create stored procedures on HAWQ?
Yes, you can create stored procedures on HAWQ.
• Are there any similar products out there in the market?
Cloudera's Impala. Check here for more details.