One of the new features in Greenplum 4.2 is the ability to create external tables on the Hadoop HDFS file system.
This is extremely useful when you want to avoid moving files from HDFS to a local folder just to load data.
In this post I will go through the configuration of a single-node (CentOS) Greenplum database to access HDFS and create external tables on it.
Assumption: Greenplum and Hadoop are running on the same host.
Note: If anyone is interested (send a comment or email), I can go through a multi-node setup
too, where Greenplum and Hadoop are on different hosts.
Step 1: Install Java on the Greenplum host
Download the JDK from here and install it.
I have it installed under /usr/java/jdk1.6.0_26
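You can quickly verify the installation works:
gpadmin$ /usr/java/jdk1.6.0_26/bin/java -version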
Step 2: Set JAVA_HOME and HADOOP_HOME for gpadmin
Open .bashrc and add the following lines
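For a setup like mine, with the JDK under /usr/java/jdk1.6.0_26 and Hadoop under /home/hadoop (matching gp_hadoop_home below), the entries should look roughly like this:
export JAVA_HOME=/usr/java/jdk1.6.0_26
export HADOOP_HOME=/home/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH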
Save and exit the file and source the .bashrc
gpadmin$ source ./.bashrc
Step 3: Modify the postgresql.conf file
As gpadmin user
gpadmin$ cd $MASTER_DATA_DIRECTORY
gpadmin$ vi postgresql.conf
Add the following lines in the end
# Hadoop related settings
gp_hadoop_target_version = cdh3u2
gp_hadoop_home = '/home/hadoop'
Save and exit the file.
Note: The possible values for gp_hadoop_target_version depend on your Hadoop distribution (cdh3u2 above matches CDH3u2); check the Greenplum documentation for the full list.
Step 4: Restart Greenplum database
gpadmin$ gpstop -a
gpadmin$ gpstart -a
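To confirm the settings from Step 3 are active after the restart, a quick check (assuming psql connects with your defaults):
gpadmin$ psql -c 'SHOW gp_hadoop_target_version;'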
Step 5: Check access to Hadoop binaries
As gpadmin user
gpadmin$ cd $HADOOP_HOME/bin
gpadmin$ ./hadoop fs -ls /
You should be able to view the HDFS folders/files. If you run into permission issues, then as the
hadoop user grant read privileges on the HDFS files to other users.
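For example, a simple way to open up read access on a directory and everything under it (/path/to/dir is a placeholder; point it at whatever Greenplum needs to read):
hadoop$ hadoop fs -chmod -R o+rx /path/to/dir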
Step 6: Copy a test file to HDFS
This earthquake file (copy the contents to earthquake.txt) is generated by a MapReduce function that analyzed raw log files from different earthquake monitoring centers. The output columns are:
Source – location of the monitoring center
Period – event timestamp
Latitude – event latitude
Longitude – event longitude
Magnitude – earthquake magnitude
Depth – event depth
NST – number of reporting stations
Region – region description
As the hadoop user, create an HDFS folder to hold this file
hadoop$ hadoop fs -mkdir /data
Copy earthquake.txt to /home/hadoop on the Hadoop server and then copy it to HDFS
hadoop$ hadoop fs -copyFromLocal /home/hadoop/earthquake.txt /data
You can check the file with
hadoop$ hadoop fs -ls /data/earthquake.txt
Found 1 items
-rw-r--r-- 1 hadoop hadoop 4417 2012-10-10 13:42 /data/earthquake.txt
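You can also peek at the contents; the rows are pipe-delimited, so they should look something like this:
hadoop$ hadoop fs -cat /data/earthquake.txt | head -1
nc|Wednesday, October 10, 2012 03:45:36 UTC|39.5662|-123.3917|1.8|8.9|9|Northern California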
Step 7: Create external table on Greenplum
Use this definition to create the external table in the Greenplum database
CREATE EXTERNAL TABLE earthquake_raw_ext (
source text,       -- text assumed for source/period/region, based on the sample output below
period text,
latitude double precision,
longitude double precision,
magnitude double precision,
depth double precision,
NST double precision,
region text
)
LOCATION ('gphdfs://gphost:8020/data/earthquake.txt')
FORMAT 'text' (delimiter '|');
Select from the table using psql.
gpadmin=# select * from earthquake_raw_ext ;
source | period | latitude | longitude | magnitude | depth | nst | region
nc | Wednesday, October 10, 2012 03:45:36 UTC | 39.5662 | -123.3917 | 1.8 | 8.9 | 9 | Northern California
hv | Wednesday, October 10, 2012 03:32:29 UTC | 19.4028 | -155.2697 | 2.9 | 1.9 | 24 | Island of Hawaii, Hawaii
hv | Wednesday, October 10, 2012 03:24:59 UTC | 19.4048 | -155.2673 | 2.6 | 2.1 | 17 | Island of Hawaii, Hawaii
nn | Wednesday, October 10, 2012 03:21:16 UTC | 36.7553 | -115.5388 | 1.2 | 7 | 20 | Nevada
nn | Wednesday, October 10, 2012 03:09:13 UTC | 38.583 | -119.4507 | 1.3 | 7 | 7 | Central California
uw | Wednesday, October 10, 2012 03:07:14 UTC | 47.7083 | -122.325 | 2 | 29.7 | 36 | Seattle-Tacoma urban area, Washington
ci | Wednesday, October 10, 2012 02:52:38 UTC | 32.8157 | -116.1407 | 1.3 | 7.4 | 22 | Southern California
ci | Wednesday, October 10, 2012 02:46:21 UTC | 33.932 | -116.8478 | 1.8 | 7.6 | 87 | Southern California
hv | Wednesday, October 10, 2012 02:17:29 UTC | 19.4042 | -155.2688 | 1.9 | 1.7 | 17 | Island of Hawaii, Hawaii
… (sample output truncated)
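As a quick sanity check that Greenplum can read every row from HDFS:
gpadmin=# select count(*) from earthquake_raw_ext ;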
Step 8: Load data into physical table
The external table is just a pointer to the HDFS file location; every time you select from this external table, Greenplum reads the HDFS file system.
For data analysis you want to store this data in a physical table.
Create a table in Greenplum
gpadmin=# CREATE TABLE earthquake_raw as select * from earthquake_raw_ext ;
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'source' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
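If you would rather choose the distribution yourself, note that source is low-cardinality and could skew the data; one option is to distribute randomly instead:
gpadmin=# CREATE TABLE earthquake_raw AS SELECT * FROM earthquake_raw_ext DISTRIBUTED RANDOMLY;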
When using gphdfs, Greenplum parallelizes the data load across the segment servers.
Now you can use the earthquake_raw table for analysis and have fun.
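For example, a first look at activity by region:
gpadmin=# select region, count(*) as quakes, max(magnitude) as max_mag from earthquake_raw group by region order by quakes desc;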
As always, comments are welcome.