I was getting a lot of requests on this, so I decided to write a separate post. The question was: how do I query a Greenplum Database (GPDB) from Pivotal HAWQ?
In this post I will go through the architecture of Pivotal HAWQ and how it works.
I strongly suggest going through Introduction to Massively Parallel Processing (MPP) database before reading this, as you will need some MPP database concepts.
Pivotal HAWQ is a Massively Parallel Processing (MPP) database built from several Postgres database instances and HDFS storage. Think of a regular MPP database like Teradata, Greenplum, or Netezza, but instead of local storage it uses HDFS to store its data files. Each processing node still has its own CPU, memory, and storage.
In Massively Parallel Processing (MPP) databases, data is partitioned across multiple servers or nodes, with each server/node having the memory and processors to process data locally. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture).
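The shared-nothing flow can be sketched in a few lines of Python: rows are hash-distributed across segments on a distribution key, each segment aggregates only its local slice, and the master merges the partial results. The segment count, table, and column names here are invented for illustration:

```python
# Toy model of shared-nothing MPP query processing (illustrative only).
# Rows are hash-distributed across segments; each segment aggregates
# its own slice, and the master merges the partial results.

NUM_SEGMENTS = 4  # hypothetical segment count


def distribute(rows, key):
    """Hash-distribute rows across segments on a distribution key."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[hash(row[key]) % NUM_SEGMENTS].append(row)
    return segments


def local_sum(segment_rows, key, value):
    """Each segment aggregates only its local data (no shared disk)."""
    partial = {}
    for row in segment_rows:
        partial[row[key]] = partial.get(row[key], 0) + row[value]
    return partial


def mpp_group_sum(rows, key, value):
    """Master: scatter rows, run local aggregates, merge the partials."""
    merged = {}
    for seg in distribute(rows, key):
        for k, v in local_sum(seg, key, value).items():
            merged[k] = merged.get(k, 0) + v
    return merged


sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
]
print(sorted(mpp_group_sum(sales, "region", "amount").items()))
# [('east', 150), ('west', 250)]
```

A real MPP planner does far more (motion nodes, redistribution, spill handling), but the scatter/local-aggregate/merge shape is the core idea.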
I will try to explain how an MPP database works by using the Greenplum database as an example.
One of the features of the Greenplum 4.2 release is support for the Hadoop HDFS file system when creating external tables.
This is extremely useful when you want to avoid moving files from HDFS to a local folder for data loading.
In this post I will go through the configuration of a single-node (CentOS) Greenplum database to access and create external tables using HDFS.
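To make the idea concrete, here is a small Python helper that builds the kind of CREATE EXTERNAL TABLE statement Greenplum uses with its gphdfs protocol for HDFS-backed external tables. The table name, columns, NameNode host, port, and path are placeholder assumptions; adjust them to your cluster:

```python
def hdfs_external_table_ddl(table, columns, hdfs_uri, delimiter="|"):
    """Build a Greenplum CREATE EXTERNAL TABLE statement for an HDFS file.

    `hdfs_uri` should be a gphdfs:// location such as
    gphdfs://namenode:8020/path/to/file.txt (placeholder values).
    """
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"    {cols}\n"
        f")\n"
        f"LOCATION ('{hdfs_uri}')\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


ddl = hdfs_external_table_ddl(
    "ext_sales",  # hypothetical table name
    [("sale_id", "int"), ("amount", "numeric")],
    "gphdfs://namenode:8020/data/sales.txt",  # hypothetical HDFS location
)
print(ddl)
```

Once created, such a table can be queried or used as the source of an INSERT ... SELECT load, with the segments reading from HDFS in parallel.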
Some of the key features of Greenplum Database are:
- Massively Parallel Processing (MPP) Architecture for Loading and Query Processing
- Polymorphic Data Storage with multi-storage/SSD support
- Multi-level Partitioning with Dynamic Partitioning Elimination
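The multi-level partitioning point can be illustrated with a Python sketch: partitions are keyed on (year, region), and a predicate on those keys lets whole partitions be skipped before any rows are scanned. The data and partition layout are invented for the example:

```python
# Toy model of multi-level partitioning with partition elimination.
# Partitions are keyed on (year, region); a predicate on those keys
# lets the planner skip whole partitions without scanning them.

partitions = {
    (2011, "east"): [10, 20],
    (2011, "west"): [30],
    (2012, "east"): [40, 50],
    (2012, "west"): [60],
}


def scan_with_elimination(partitions, year=None, region=None):
    """Scan only the partitions whose keys satisfy the predicate."""
    scanned, rows = 0, []
    for (p_year, p_region), data in partitions.items():
        if year is not None and p_year != year:
            continue  # partition eliminated: wrong year
        if region is not None and p_region != region:
            continue  # partition eliminated: wrong region
        scanned += 1
        rows.extend(data)
    return scanned, rows


scanned, rows = scan_with_elimination(partitions, year=2012, region="east")
print(scanned, rows)  # 1 [40, 50] -- only 1 of 4 partitions scanned
```

In Greenplum the same pruning happens at plan or execution time ("dynamic" elimination covers predicates only known at run time, such as values from a joined table).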
If you want to test this database on your Mac, you can get a community edition that works on a single node.
Here are some installation steps that worked for me. The installation gives you an idea of all the components of an MPP system.
You can download the software here.
Last week I came across an interesting problem.
Problem: I want to centralize my average assets calculation in one place, and different downstream systems should be able to consume it. For example, Cognos reports should be able to use it, and an Informatica mapping can use it as a source. Very similar to an enterprise service.
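One way to picture the goal is a single shared calculation that every consumer calls, sketched here in Python. The daily-balance inputs and the simple-average definition are assumptions for illustration, not the actual business rule:

```python
def average_assets(daily_balances):
    """Single, centralized average-assets calculation.

    `daily_balances` is a list of end-of-day asset balances; the
    simple-average definition here is an assumption for illustration.
    """
    if not daily_balances:
        raise ValueError("no balances supplied")
    return sum(daily_balances) / len(daily_balances)


# Any downstream consumer (a report, an ETL mapping) calls the same
# calculation, so the business rule lives in exactly one place.
print(average_assets([100.0, 110.0, 120.0]))  # 110.0
```

In a database-centric shop the same effect is usually achieved with a shared view or function that both the reporting tool and the ETL tool query.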
Here is a presentation by Robert Dawson that he gave at OOW 2011. It is interesting how he correlates Exadata adoption with the grief cycle. These are roadblocks that nobody wants to talk about, but every organization implementing Exadata will face them.