Data Warehouse

Introduction to Massively Parallel Processing (MPP) Databases


In a Massively Parallel Processing (MPP) database, data is partitioned across multiple servers or nodes, with each server/node having its own memory and processors to process data locally. All communication happens via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e. it is a ‘shared-nothing’ architecture).

I will try to explain how an MPP database works by using the Greenplum database as an example.

Note: The terminology may differ from vendor to vendor, but the concept is almost the same.

Let's start with a sample data set that I will use to demonstrate data partitioning.

product_id    product_name
1             Apple MacBook Pro
2             Apple Iphone 4s
3             Apple Ipad 3
4             Samsung Galaxy Tab 10.1
5             Blackberry Bold
6             Amazon Kindle Fire
7             Google Android Tablet 9″
8             Nook Color
9             Lenovo IdeaPad

This is an example of a physical architecture for a Greenplum database. Note: This configuration is only used to demonstrate how the Greenplum database works.

1. Master host – A separate physical server with its own OS/CPU/storage/memory. It hosts the master database. The master database contains no user data; it stores metadata about the segments – think in terms of system tables.
2, 3, 4. Segment hosts – Individual physical servers with their own OS/CPU/storage/memory. Each hosts a segment database, and each segment database stores a portion of the user data.
5. Interconnect switch – The segment databases communicate through an interconnect switch.

As mentioned earlier, the main characteristic of an MPP database is data distribution. Data is distributed across the segment databases to achieve data and processing parallelism. This is done by creating a database table with a DISTRIBUTED BY clause, which automatically distributes the data across the segment databases. In Greenplum you can use either hash or round-robin distribution.
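To make the two options concrete, here is a sketch of what each policy looks like in Greenplum's CREATE TABLE syntax. The table and column names below are illustrative, not from the example data set:

```sql
-- Hash distribution: each row is assigned to a segment
-- by hashing the distribution key.
CREATE TABLE sales_hash (
    sale_id  integer,
    amount   numeric
) DISTRIBUTED BY (sale_id);

-- Round-robin distribution: rows are spread evenly across
-- segments without regard to any column value.
CREATE TABLE sales_rr (
    sale_id  integer,
    amount   numeric
) DISTRIBUTED RANDOMLY;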

In this example we will use hash distribution to spread the data across the segment databases. To get an even distribution of data across segments, it makes sense to use the primary key as the distribution key.

CREATE TABLE product_dim (
    product_id    integer,
    product_name  varchar(200)
) DISTRIBUTED BY (product_id);

Once the table is created, let's insert 9 records and see how the data is distributed based on the hash of product_id.
insert into product_dim values (1, 'Apple MacBook Pro');
insert into product_dim values (2, 'Apple Iphone 4s');
insert into product_dim values (3, 'Apple Ipad 3');
insert into product_dim values (4, 'Samsung Galaxy Tab 10.1');
insert into product_dim values (5, 'Blackberry Bold');
insert into product_dim values (6, 'Amazon Kindle Fire');
insert into product_dim values (7, 'Google Android Tablet 9"');
insert into product_dim values (8, 'Nook Color');
insert into product_dim values (9, 'Lenovo IdeaPad');

Because the distribution key is product_id, each row is hashed on that key and spread evenly across the segment databases. Now let's query the database and see how distributed data helps query performance.
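If you want to verify where the rows actually landed, Greenplum exposes a system column called gp_segment_id on every table. A query along these lines shows the row count per segment (the exact counts depend on the hash function and the number of segments in your cluster):

```sql
-- Show how many rows each segment database holds.
SELECT gp_segment_id, count(*) AS row_count
FROM product_dim
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

With three segments and nine evenly hashed rows, you would expect roughly three rows per segment.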
Case I: Select data for one product

select * from product_dim where product_id = 5 ;

When this query is issued, the master parses it and builds a query plan. Since the query filters on the distribution key column, the plan is sent only to segment database 2 for execution – remember, the master stores metadata about the data distribution of each table. Segment database 2 then has to scan only 3 records to find the result, compared to 9 records on a shared-everything database, making the query run faster.
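You can see this routing for yourself in the query plan. EXPLAIN works in Greenplum much as it does in PostgreSQL; the exact plan output varies by version, but the plan for a distribution-key filter will show a scan on a single segment followed by a motion of the results back to the master:

```sql
-- Inspect the plan for the single-product query.
EXPLAIN SELECT * FROM product_dim WHERE product_id = 5;
```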

Case II: Select data for 2 products

select * from product_dim where product_id in (1,6) ;

When this query is issued, the master parses it and builds a query plan. Since the query filters on the distribution key column, the plan is sent to segment databases 1 and 3 for execution – again, the master stores metadata about the data distribution of each table. Each parallel process then has to scan only 3 records to find its result. The data sets from both scans are returned to the master, which stitches them together and presents the result to the user.
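For contrast, consider a filter on a column that is not the distribution key. The master cannot tell from its metadata which segments hold the matching rows, so the plan is dispatched to every segment and each one scans its local portion of the table in parallel. The query below is illustrative of this case:

```sql
-- product_name is not the distribution key,
-- so all segment databases must scan their local rows.
SELECT * FROM product_dim WHERE product_name = 'Nook Color';
```

Even here the work is parallelized – each segment scans only its own 3 rows – but no segment can be skipped.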

Key learning points

- MPP databases use a shared-nothing architecture: each segment has its own CPU, memory and storage, and all communication happens over the interconnect.
- Data is distributed across segment databases with a DISTRIBUTED BY clause, using either hash or round-robin distribution.
- Queries that filter on the distribution key can be routed to just the relevant segments, so each scan touches only a fraction of the data.
- The master holds only metadata; it builds the query plan, dispatches it to the segments, and stitches the results together for the user.

Hope this article helps you understand the high-level working concepts of an MPP database. As always, comments are welcome.
