Data Warehouse

Greenplum and Hadoop HDFS integration


One of the features introduced in Greenplum 4.2 is the ability to create external tables directly on the Hadoop HDFS file system.
This is extremely useful when you want to avoid moving files from HDFS to a local folder just to load the data.

In this post I will walk through the configuration of a single-node (CentOS) Greenplum database to access HDFS and create external tables on it.

Architecture Overview:


Prerequisites:
1. Greenplum database is up and running under the gpadmin user
2. Hadoop (Cloudera CDH3) is up and running under the hadoop user

Assumption:
1. Greenplum and Hadoop are running on the same host

Note: If anyone is interested (send a comment or email), I can also go through a multi-node setup where Greenplum and Hadoop are on different hosts.

Install directories:
Greenplum: /home/gpadmin/greenplum-db
Hadoop:    /home/hadoop/
Hostname:  gphost

Step 1: Install Java on the Greenplum host
Download the JDK from here and install it.
I have it installed under /usr/java/jdk1.6.0_26
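
To confirm the JDK is usable, you can run the java binary from that install path (adjust the path if your JDK version differs):

gpadmin$ /usr/java/jdk1.6.0_26/bin/java -version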

Step 2: Set JAVA_HOME and HADOOP_HOME for gpadmin
Open ~/.bashrc for the gpadmin user and add the following lines:

export JAVA_HOME=/usr/java/jdk1.6.0_26
export HADOOP_HOME=/home/hadoop

Save and exit the file, then source .bashrc:

gpadmin$ source ./.bashrc
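
As a quick sanity check, confirm that the variables resolve for the gpadmin user:

gpadmin$ echo $JAVA_HOME
gpadmin$ echo $HADOOP_HOME
gpadmin$ $JAVA_HOME/bin/java -version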

Step 3: Modify the postgresql.conf file
As the gpadmin user:

gpadmin$ cd $MASTER_DATA_DIRECTORY
gpadmin$ vi postgresql.conf

Add the following lines in the end

# Hadoop related settings
gp_external_enable_exec=on
gp_hadoop_target_version = cdh3u2
gp_hadoop_home = '/home/hadoop'

Save and exit the file.

Note: Possible values for gp_hadoop_target_version are

gphd-1.1
gpmr-1.0
gpmr-1.2
cdh3u2
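
If you want to double-check that the settings were appended before restarting, a simple grep works:

gpadmin$ grep -i hadoop $MASTER_DATA_DIRECTORY/postgresql.conf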

Step 4: Restart the Greenplum database

gpadmin$ gpstop -a
gpadmin$ gpstart -a
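
Once the database is back up, you can optionally confirm the cluster state and that the new settings took effect (assuming these parameters are visible through SHOW on your build):

gpadmin$ gpstate
gpadmin$ psql -c "show gp_hadoop_target_version;"
gpadmin$ psql -c "show gp_hadoop_home;"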

Step 5: Check access to Hadoop binaries

As the gpadmin user:

gpadmin$ cd $HADOOP_HOME/bin
gpadmin$ ./hadoop fs -ls /

You should be able to view the HDFS folders and files. If you run into permission issues, then as the
hadoop user grant read privileges on the HDFS files to other users, as shown below.
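
For example, a recursive chmod as the hadoop user opens up read access to everyone (the path below is just a placeholder, and a blanket grant like this is fine for a sandbox but too loose for production):

hadoop$ hadoop fs -chmod -R o+rx /path/to/hdfs/dir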

Step 6: Copy a test file to HDFS

This earthquake file (copy its contents to earthquake.txt) was generated by a MapReduce job that analyzed raw log files from different earthquake monitoring centers. The output columns are:

Source – Location of monitoring center
period
latitude
longitude
magnitude
depth
NST
region

As the hadoop user, create an HDFS folder to hold this file:

hadoop$ hadoop fs -mkdir /data

Copy earthquake.txt to /home/hadoop on the Hadoop server and then copy it into HDFS:

hadoop$ hadoop fs -copyFromLocal /home/hadoop/earthquake.txt /data

You can check the file with:

hadoop$ hadoop fs -ls /data/earthquake.txt

Found 1 items
-rw-r--r--   1 hadoop hadoop       4417 2012-10-10 13:42 /data/earthquake.txt
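
Since the file is small, you can also peek at the first few lines straight from HDFS to confirm it is pipe-delimited, which is the delimiter the external table definition below expects:

hadoop$ hadoop fs -cat /data/earthquake.txt | head -5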

Step 7: Create an external table in Greenplum

Use this definition to create an external table in the Greenplum database:

CREATE EXTERNAL TABLE earthquake_raw_ext
(
source text,
period   text,
latitude double precision,
longitude double precision,
magnitude double precision,
depth     double precision,
NST       double precision,
region    text
)
LOCATION ('gphdfs://gphost:8020/data/earthquake.txt')
FORMAT 'text' (delimiter '|')
ENCODING 'UTF8';

Select from the table using psql:

gpadmin=# select * from earthquake_raw_ext ;

source |                  period                  | latitude | longitude | magnitude | depth | nst |                region
--------+------------------------------------------+----------+-----------+-----------+-------+-----+---------------------------------------
nc     | Wednesday, October 10, 2012 03:45:36 UTC |  39.5662 | -123.3917 |       1.8 |   8.9 |   9 | Northern California
hv     | Wednesday, October 10, 2012 03:32:29 UTC |  19.4028 | -155.2697 |       2.9 |   1.9 |  24 | Island of Hawaii, Hawaii
hv     | Wednesday, October 10, 2012 03:24:59 UTC |  19.4048 | -155.2673 |       2.6 |   2.1 |  17 | Island of Hawaii, Hawaii
nn     | Wednesday, October 10, 2012 03:21:16 UTC |  36.7553 | -115.5388 |       1.2 |     7 |  20 | Nevada
nn     | Wednesday, October 10, 2012 03:09:13 UTC |   38.583 | -119.4507 |       1.3 |     7 |   7 | Central California
uw     | Wednesday, October 10, 2012 03:07:14 UTC |  47.7083 |  -122.325 |         2 |  29.7 |  36 | Seattle-Tacoma urban area, Washington
ci     | Wednesday, October 10, 2012 02:52:38 UTC |  32.8157 | -116.1407 |       1.3 |   7.4 |  22 | Southern California
ci     | Wednesday, October 10, 2012 02:46:21 UTC |   33.932 | -116.8478 |       1.8 |   7.6 |  87 | Southern California
hv     | Wednesday, October 10, 2012 02:17:29 UTC |  19.4042 | -155.2688 |       1.9 |   1.7 |  17 | Island of Hawaii, Hawaii

… sample output
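
A quick row count is a cheap end-to-end check that the gphdfs location and delimiter are working (keep in mind this still scans the file in HDFS):

gpadmin=# select count(*) from earthquake_raw_ext ;

The count should match the number of rows loaded into the physical table in the next step (47 in this example).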

Step 8: Load data into a physical table

An external table is just a pointer to the HDFS file location. Every time you select from this external table, it reads from the HDFS file system.

For data analysis you will want to store this data in a physical table.

Create a table in Greenplum:

gpadmin=# CREATE TABLE earthquake_raw as select * from earthquake_raw_ext ;

NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'source' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 47

gpadmin=#

When using gphdfs, Greenplum parallelizes the data load across the segment servers.
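
If you want to avoid the NOTICE above and pick the distribution key yourself, you can add an explicit DISTRIBUTED BY clause to the CREATE TABLE AS statement (drop the earlier table first or use a different name; region below is only an example, so choose a column that spreads your data evenly):

gpadmin=# CREATE TABLE earthquake_raw as select * from earthquake_raw_ext DISTRIBUTED BY (region);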

Now you can use the earthquake_raw table for analysis and have fun; a sample query is shown below.
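
As a simple example of the kind of analysis you can now run entirely inside Greenplum (the columns come from the table definition above):

gpadmin=# select region, count(*) as quakes, max(magnitude) as max_mag from earthquake_raw group by region order by quakes desc;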

As always, comments are welcome.
