In Massively Parallel Processing (MPP) databases, data is partitioned across multiple servers, or nodes, with each node having its own memory and processors to process data locally. All communication happens over a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture).
I will try to explain how an MPP database works by using the Greenplum database as an example.
Note: The terminology differs from vendor to vendor, but the concept is largely the same.
Let's start with a sample data set that I will use to demonstrate data partitioning.

product_id  product_name
1 Apple MacBook Pro
2 Apple Iphone 4s
3 Apple Ipad 3
4 Samsung Galaxy Tab 10.1
5 Blackberry Bold
6 Amazon Kindle Fire
7 Google Android Tablet 9″
8 Nook Color
9 Lenovo IdeaPad
1. Master host – a separate physical server with its own OS, CPU, storage, and memory. It hosts the master database, which stores no user data but holds metadata about the segments – think of it in terms of system tables.
2, 3, 4. Segment hosts – individual physical servers, each with its own OS, CPU, storage, and memory. Each hosts a segment database that stores a portion of the user data.
5. Interconnect switch – the segment databases communicate with one another through an interconnect switch.
As mentioned earlier, the main characteristic of an MPP database is data distribution. Data is distributed across the segment databases to achieve data and processing parallelism. This is done by creating a database table with a DISTRIBUTED BY clause; with this clause, data is automatically distributed across the segment databases. In Greenplum you can use either hash or round-robin distribution.
In this example we will use hash distribution to spread the data across the segment databases. To get an even distribution of data across segments, it makes sense to use the primary key as the distribution key.
CREATE TABLE product_dim
(
  product_id   integer,
  product_name varchar(200)
)
DISTRIBUTED BY (product_id);
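The two distribution policies can be sketched in a few lines of Python. This is an illustrative model of the idea, not Greenplum's internal implementation; NUM_SEGMENTS and the function names are assumptions for the sketch:

```python
from itertools import count

NUM_SEGMENTS = 3  # one per segment database in this article's example

def hash_distribute(distribution_key):
    # Hash distribution: the same key value always maps to the same
    # segment, so the master can later locate a row from its key alone.
    return hash(distribution_key) % NUM_SEGMENTS

def make_round_robin():
    # Round-robin distribution: rows are dealt out in turn, regardless of
    # their values; row counts stay even but key locality is lost.
    counter = count()
    return lambda row: next(counter) % NUM_SEGMENTS
```

With hash distribution, `hash_distribute(5)` returns the same segment every time it is called; a round-robin assigner simply cycles through segments 0, 1, 2, 0, 1, 2, and so on.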
Once the table is created, let's insert nine records and see how the data is distributed based on the hash of product_id.
insert into product_dim values (1, 'Apple MacBook Pro');
insert into product_dim values (2, 'Apple Iphone 4s');
insert into product_dim values (3, 'Apple Ipad 3');
insert into product_dim values (4, 'Samsung Galaxy Tab 10.1');
insert into product_dim values (5, 'Blackberry Bold');
insert into product_dim values (6, 'Amazon Kindle Fire');
insert into product_dim values (7, 'Google Android Tablet 9"');
insert into product_dim values (8, 'Nook Color');
insert into product_dim values (9, 'Lenovo IdeaPad');
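To make the placement concrete, here is a toy Python model of the resulting layout. Greenplum's actual hash function is internal to the database; the simple modulo rule below is an assumption chosen only so that the layout matches the one described in this article:

```python
NUM_SEGMENTS = 3

# The nine rows inserted above: product_id -> product_name.
products = {
    1: "Apple MacBook Pro", 2: "Apple Iphone 4s", 3: "Apple Ipad 3",
    4: "Samsung Galaxy Tab 10.1", 5: "Blackberry Bold",
    6: "Amazon Kindle Fire", 7: 'Google Android Tablet 9"',
    8: "Nook Color", 9: "Lenovo IdeaPad",
}

def segment_for(product_id):
    # Stand-in for Greenplum's hash: maps keys 1..9 evenly onto
    # segments 1..3 for illustration.
    return (product_id - 1) % NUM_SEGMENTS + 1

# Build the per-segment layout: segment number -> its local rows.
segments = {s: {} for s in range(1, NUM_SEGMENTS + 1)}
for pid, name in products.items():
    segments[segment_for(pid)][pid] = name
```

Under this toy rule each segment database ends up holding three of the nine rows, e.g. product_ids 2, 5, and 8 on segment 2.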
Case I: Select data for one product
select * from product_dim where product_id = 5 ;
When this query is issued, the master parses it and builds a query plan. Since the query filters on the distribution key column, the plan is sent only to the segment 2 database for execution – remember, the master stores metadata about the data distribution of each table. Segment database 2 then has to scan only 3 records to get the result, compared to 9 records on a shared-everything database, which makes the query run faster.
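This routing step can be sketched in Python. The layout and the modulo placement rule below are illustrative assumptions for the sketch, not Greenplum internals:

```python
NUM_SEGMENTS = 3

# Toy layout from the example: segment number -> {product_id: product_name}.
segments = {
    1: {1: "Apple MacBook Pro", 4: "Samsung Galaxy Tab 10.1",
        7: 'Google Android Tablet 9"'},
    2: {2: "Apple Iphone 4s", 5: "Blackberry Bold", 8: "Nook Color"},
    3: {3: "Apple Ipad 3", 6: "Amazon Kindle Fire", 9: "Lenovo IdeaPad"},
}

def segment_for(product_id):
    # Stand-in for the hash the master uses to place (and later find) a row.
    return (product_id - 1) % NUM_SEGMENTS + 1

def select_one(product_id):
    # The master routes the plan to exactly one segment, which scans
    # only its own 3 local rows instead of all 9.
    rows = segments[segment_for(product_id)]
    return [(product_id, rows[product_id])] if product_id in rows else []
```

In this model `select_one(5)` touches only segment 2's three rows.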
Case II: Select data for 2 products
select * from product_dim where product_id in (1,6) ;
When this query is issued, the master parses it and builds a query plan. Since the query filters on the distribution key column, the plan is sent to the segment 1 and segment 3 databases for execution – again, the master stores metadata about the data distribution of each table. Each parallel process then has to scan only 3 records to get its part of the result. The data sets from both scans are returned to the master, which stitches them together to present to the user.
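The multi-key case can be modeled the same way in Python: the master groups the requested keys by owning segment, fans the scans out (in parallel in a real system), and stitches the partial results together. The layout and placement rule here are illustrative assumptions, not Greenplum internals:

```python
NUM_SEGMENTS = 3

# Toy layout from the example: segment number -> {product_id: product_name}.
segments = {
    1: {1: "Apple MacBook Pro", 4: "Samsung Galaxy Tab 10.1",
        7: 'Google Android Tablet 9"'},
    2: {2: "Apple Iphone 4s", 5: "Blackberry Bold", 8: "Nook Color"},
    3: {3: "Apple Ipad 3", 6: "Amazon Kindle Fire", 9: "Lenovo IdeaPad"},
}

def segment_for(product_id):
    # Stand-in for the hash used to place (and find) rows.
    return (product_id - 1) % NUM_SEGMENTS + 1

def select_in(product_ids):
    # Master groups the requested keys by the segment that owns them.
    by_segment = {}
    for pid in product_ids:
        by_segment.setdefault(segment_for(pid), []).append(pid)
    # Each owning segment scans only its own rows; a real MPP database
    # runs these scans in parallel.
    results = []
    for seg, pids in by_segment.items():
        rows = segments[seg]
        results.extend((pid, rows[pid]) for pid in pids if pid in rows)
    # Master stitches the partial result sets together for the user.
    return sorted(results)
```

For `select_in([1, 6])` only segments 1 and 3 are consulted, each scanning its own three rows, and the master merges the two partial results.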
Key learning points
- Separate, self-sufficient physical servers store and process the data
- Data is physically distributed across separate individual databases
- Shared-nothing architecture
Hope this article helps you understand the high-level working concepts of MPP databases. As always, comments are welcome.