Introduction to Massively Parallel Processing (MPP) Database

In Massively Parallel Processing (MPP) databases, data is partitioned across multiple servers or nodes, with each server/node having its own memory and processors to process data locally. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e. it is a ‘shared-nothing’ architecture).

I will try to explain how an MPP database works by using the Greenplum database as an example.

Note: The terminology might be different from vendor to vendor but the concept is almost the same.
Let's start with a sample data set that I will use to demonstrate data partitioning.

product_id    product_name
1             Apple MacBook Pro
2             Apple iPhone 4s
3             Apple iPad 3
4             Samsung Galaxy Tab 10.1
5             Blackberry Bold
6             Amazon Kindle Fire
7             Google Android Tablet 9″
8             Nook Color
9             Lenovo IdeaPad

Below is an example of a sample physical architecture for a Greenplum database. Note: this configuration is only used to demonstrate how the Greenplum database works.

1. Master host – A separate physical server with its own OS/CPU/storage/memory. It hosts the master database. The master database contains no user data; it stores metadata about the segments – think in terms of system tables.
2, 3, 4. Segment hosts – Individual physical servers with their own OS/CPU/storage/memory. Each hosts a segment database, and each segment database stores a portion of the user data.
5. Interconnect switch – The segment databases communicate with each other and with the master through the interconnect switch.
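
If you have access to a running Greenplum system, this layout can be inspected directly: the gp_segment_configuration catalog table lists the master and every segment database. A minimal sketch (column availability may vary slightly between Greenplum versions):

-- List the master (content = -1) and each segment database, together
-- with the host it runs on and whether it is currently up.
SELECT content, role, hostname, port, status
FROM gp_segment_configuration
ORDER BY content, role;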

As mentioned earlier, the main characteristic of an MPP database is data distribution. Data is distributed across the segment databases to achieve data and processing parallelism. This is achieved by creating a database table with the DISTRIBUTED BY clause. By using this clause, data is automatically distributed across the segment databases. In Greenplum you can use either hash or round-robin distribution.

In this example we will use hash distribution to distribute the data across the segment databases. To get an even distribution of data across segments, it makes sense to use the primary key as the distribution key.

CREATE TABLE product_dim
(product_id     integer,
product_name     varchar(200))
DISTRIBUTED BY (product_id) ;
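
For completeness: if a table has no suitable key to hash on, Greenplum also supports round-robin distribution through the DISTRIBUTED RANDOMLY clause. A minimal sketch, using a hypothetical table name product_dim_rr:

-- Same columns, but rows are handed out to segments in round-robin
-- fashion. Distribution stays even, but a lookup on product_id can
-- no longer be routed to a single segment.
CREATE TABLE product_dim_rr
(product_id     integer,
product_name     varchar(200))
DISTRIBUTED RANDOMLY;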

Once the product_dim table is created, let's insert 9 records and see how the data is distributed based on hashing of the product_id.
insert into product_dim values (1,'Apple MacBook Pro');
insert into product_dim values (2,'Apple iPhone 4s');
insert into product_dim values (3,'Apple iPad 3');
insert into product_dim values (4,'Samsung Galaxy Tab 10.1');
insert into product_dim values (5,'Blackberry Bold');
insert into product_dim values (6,'Amazon Kindle Fire');
insert into product_dim values (7,'Google Android Tablet 9″');
insert into product_dim values (8,'Nook Color');
insert into product_dim values (9,'Lenovo IdeaPad');

As the distribution key is product_id, the data is hashed on the distribution key and spread out evenly across the segment databases. Now let's query the database and see how distributed data helps query performance.
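
Before doing that, you can check how the rows actually landed: Greenplum adds a hidden gp_segment_id system column to every table, so a quick count per segment (a sketch; the exact spread depends on the hash function and the number of segments in your cluster) looks like this:

-- Count how many of the 9 rows each segment database received.
SELECT gp_segment_id, count(*) AS row_count
FROM product_dim
GROUP BY gp_segment_id
ORDER BY gp_segment_id;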
Case I: Select data for one product

select * from product_dim where product_id = 5 ;

When this query is issued, the master will parse it and build a query plan. Since the query is filtered on a distribution key column, the plan will be sent only to the segment 2 database for execution – remember, the master stores metadata about data distribution for each table. After the query is dispatched to segment database 2, it has to scan only 3 records to get to the result, compared to 9 records on a shared-everything database, making the query run faster.
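
One way to confirm this behaviour on your own cluster is to look at the query plan. The exact plan text differs between Greenplum versions, but with the filter on the distribution key you should see the work confined to a single segment (for example a "Gather Motion 1:1" node):

-- Show the plan for the single-product lookup; expect the motion node
-- to gather from one segment only when filtering on the distribution key.
EXPLAIN SELECT * FROM product_dim WHERE product_id = 5;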

Case II: Select data for 2 products

select * from product_dim where product_id in (1,6) ;

When this query is issued, the master will parse it and build a query plan. Since the query is filtered on a distribution key column, the plan will be sent to the segment 1 and segment 3 databases for execution – remember, the master stores metadata about data distribution for each table. After the query is dispatched to segment databases 1 and 3, each parallel process has to scan only 3 records to get to the result. The data sets from both scans are returned to the master, and the master stitches the data together to present to the user.
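
As a contrasting sketch (not part of the original walkthrough): if the filter is on a column other than the distribution key, the master cannot narrow the plan down to particular segments, so every segment scans its own portion of the table in parallel and returns any matches to the master:

-- product_name is not the distribution key, so the plan is dispatched
-- to all segments; each scans its own ~3 rows in parallel.
SELECT * FROM product_dim WHERE product_name = 'Nook Color';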

Key learning points

  • Separate, self-sufficient physical servers store and process the data
  • Data is physically distributed across separate, individual databases
  • Shared-nothing architecture

Hope this article helps you understand the high-level working concepts of an MPP database. As always, comments are welcome.


About Diwakar Kasibhotla

Oracle Database Tuning, VLDB designing, ETL/BI architect, Data Modelling, Exadata Architect.

27 Responses to Introduction to Massively Parallel Processing (MPP) Database

  1. farshidce says:

    Thanks for the article. It was helpful to me in understanding the MPP methodology used by Greenplum, but I had a comment.
    Distribution of data by keys looks like a good start for sharding the data in a distributed system; however, what happens if one segment becomes so big that it needs to be split into two segments? E.g. one segment for the 2011 MacBook Pro, one for the 2012 15-inch MacBook Pro, and another for the 15-inch Retina display MacBook Pro.

    If the database automatically splits a segment into multiple chunks, the user doesn't have to manage or maintain them or worry about downtime, but if it is up to the administrator to do so, this might require downtime for the application during the manual redistribution of data.

  2. A distributed database will give optimal performance when all the data is distributed evenly, i.e. one single segment is not doing the majority of your data processing.
    In your example we don't want to split segment 1 into 3. That would leave segments 1-1, 1-2 and 1-3 with one record each and the rest of the segments with 3 records each.
    When we want to add more processing power to the system, we add more segment hosts. This process is called system expansion. Adding more segment hosts doesn't need downtime.
    But after new segment hosts are added, the data needs to be re-distributed across all the segment databases, including the new ones, which is done through a scripted process and needs DBA intervention.
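    For reference, the table-level part of that redistribution can be expressed in SQL; a minimal sketch of the command Greenplum provides for re-spreading an existing table's rows (the cluster expansion itself is handled by Greenplum's own utilities and is outside this sketch):

    -- Rewrite the table and re-hash its rows across the segments that
    -- exist at that moment, without changing the distribution key.
    ALTER TABLE product_dim SET WITH (REORGANIZE=TRUE);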

  3. MVS says:

    Nice example. What are the various vendors providing MPP database architectures? I know Teradata is one.

  4. Sudhakar says:

    Good and lucid Article!

  5. karim says:

    Excellent article!! It helps a lot! But I have a couple of questions. Do I have to buy the DB licence for each node? And the other is: can the environment be set up to balance the load?

    • Karim: Glad this helped you.
      I don't have much information on how they sell the license, but typically the pricing starts from a quarter rack – 4 segment servers with 6-8 segment databases each. It goes up to a half rack (8 segment servers) and a full rack. It could be on your own hardware or on the EMC appliance called DCA.
      For load balancing, apart from picking a distribution key that spreads the data across the servers, there is no other mechanism to do it.

  6. Brad says:

    Excellent explanation…thank you.

  7. ben says:

    Can I delete/update data in an MPP database? I know that HAWQ supports only insert/truncate…

  8. Pingback: Pivotal HAWQ – MPP database on HDFS | Data Warehouse

  9. RAMESH BALAJI says:

    Excellent explanation, thank you. Curious how this works in the HDFS context, as HDFS is a file system, and today I am seeing that Impala does MPP-level queries.

  10. Thanks. Please read this Pivotal HAWQ – MPP database on HDFS to see how MPP works on HDFS.

  11. Rohit Singh says:

    Hi,
    I have one query.
    Knowing that the segment nodes hold distributed data, what will happen when one of the nodes goes down for any reason and a query for accessing the data is then fired? The metadata will still have the address of the down segment host, which has now become inaccessible.

    • If the node goes down at the same instant the query is fired, then the query will fail and the issuing system has to resubmit it. When the query is resubmitted, it goes to a mirror segment server, which runs the query.
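      For anyone who wants to check this on their own cluster: the gp_segment_configuration catalog mentioned earlier also records each segment's primary/mirror role and up/down status. A small sketch:

      -- role/preferred_role: 'p' = primary, 'm' = mirror; status: 'u'/'d'.
      SELECT content, preferred_role, role, status, hostname
      FROM gp_segment_configuration
      WHERE content >= 0
      ORDER BY content, role;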

  12. John says:

    Thanks for the explanation. We have been using the Bityota cloud service as an MPP DB, and we are happy so far.

  13. Amlan says:

    Nice read.

  14. Anonymous says:

    Very Good Explanation

  15. Sarav says:

    Very Good Explanation!!

  16. Aryan says:

    Nice article indeed, well explained.
    But how will it work if I have to join 2 tables and those 2 tables reside on different segment hosts, or different parts of both tables are stored on different segment hosts? In this case the data may become dependent. Kindly explain how this kind of scenario can be handled in this type of model.

  17. Sam says:

    Thank you, this is good tutorial for a newbie.

  18. Any idea of what kind of hash function / algorithm they use for data distribution?

    • Anonymous says:

      I'm no expert, but looking at the diagram, specifically the segment nodes, it looks to me like it's round robin. I wonder if this is a coincidence or not, and it would be nice to know if the algorithm is configurable.

  19. Thanks man, that's a really nice and friendly explanation 🙂

  20. Jitendra Lovekar says:

    Very Nice article about data distribution. Well explained.

  21. Hari Redroutu says:

    Good one. Thank you.
