Greenplum and Hadoop HDFS integration

One of the features introduced in Greenplum 4.2 is the ability to create external tables on the Hadoop HDFS file system.
This is extremely useful when you want to load data without first moving files from HDFS to a local folder.

In this post I will go through the configuration of a single-node (CentOS) Greenplum database to access HDFS and create external tables on it.

Architecture Overview:

[Architecture diagram: gp_hd_arch]
Pre-requisites:
1. Greenplum database is up and running under gpadmin user
2. Hadoop (Cloudera CDH3) is up and running under hadoop user

Assumption:
1. Greenplum and Hadoop are running on the same host

Note: If anyone is interested (send a comment or email), I can also go through a multi-node setup
where Greenplum and Hadoop are on different hosts.

Install directories:
Greenplum: /home/gpadmin/greenplum-db
Hadoop:    /home/hadoop/
Hostname:  gphost

Step 1: Install Java on the Greenplum host
Download the JDK from here and install it.
I have it installed under /usr/java/jdk1.6.0_26

Step 2: Set JAVA_HOME and HADOOP_HOME for gpadmin
Open .bashrc and add the following lines:

export JAVA_HOME=/usr/java/jdk1.6.0_26
export HADOOP_HOME=/home/hadoop

Save and exit the file, then source .bashrc:

gpadmin$ source ./.bashrc

Step 3: Modify the postgresql.conf file
As gpadmin user

gpadmin$ cd $MASTER_DATA_DIRECTORY
gpadmin$ vi postgresql.conf

Add the following lines at the end:

# Hadoop related settings
gp_external_enable_exec=on
gp_hadoop_target_version = cdh3u2
gp_hadoop_home = '/home/hadoop'

Save and exit the file.

Note: Possible values for gp_hadoop_target_version are

gphd-1.1
gpmr-1.0
gpmr-1.2
cdh3u2
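
If you prefer not to edit the file by hand, these parameters can also be set with the gpconfig utility. This is only a minimal sketch; the extra quoting around string values follows the usual gpconfig convention, so verify it against your Greenplum version:

gpadmin$ gpconfig -c gp_hadoop_target_version -v "'cdh3u2'"
gpadmin$ gpconfig -c gp_hadoop_home -v "'/home/hadoop'"

Either way, restart the database as in Step 4 so the settings take effect.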

Step 4: Restart Greenplum database

gpadmin$ gpstop -a
gpadmin$ gpstart -a

Step 5: Check access to the Hadoop binaries

As gpadmin user

gpadmin$ cd $HADOOP_HOME/bin
gpadmin$ ./hadoop fs -ls /

You should be able to view the HDFS folders and files. If you run into permission issues, then as the
hadoop user grant read privileges on the HDFS files to other users, as shown below.
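
For example, the hadoop user can open up read access on the directory gpadmin needs (a sketch; /data is the directory created in the next step, so substitute whatever path you are using):

hadoop$ hadoop fs -chmod -R o+rx /data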

Step 6: Copy test file to hdfs

This earthquake file (copy the contents to earthquake.txt) was generated by a MapReduce function that analyzed raw log files from different earthquake monitoring centers. The output columns are listed below, followed by a sample line.

Source – Location of monitoring center
period
latitude
longitude
magnitude
depth
NST
region
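
Based on the query output later in this post, each line of the file is pipe-delimited in the column order above, for example:

nc|Wednesday, October 10, 2012 03:45:36 UTC|39.5662|-123.3917|1.8|8.9|9|Northern California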

As the hadoop user, create an HDFS folder to hold this file

hadoop$ hadoop fs -mkdir /data

Copy earthquake.txt to /home/hadoop on the Hadoop server, then copy it to HDFS

hadoop$ hadoop fs -copyFromLocal /home/hadoop/earthquake.txt /data

You can check the file by

hadoop$ hadoop fs -ls /data/earthquake.txt

Found 1 items
-rw-r--r--   1 hadoop hadoop       4417 2012-10-10 13:42 /data/earthquake.txt

Step 7: Create an external table in Greenplum

Use this definition to create the external table in the Greenplum database

CREATE EXTERNAL TABLE earthquake_raw_ext
(
source    text,
period    text,
latitude  double precision,
longitude double precision,
magnitude double precision,
depth     double precision,
NST       double precision,
region    text
)
LOCATION ('gphdfs://gphost:8020/data/earthquake.txt')
FORMAT 'text' (delimiter '|')
ENCODING 'UTF8';
Select from the table using psql.

gpadmin=# select * from earthquake_raw_ext ;

source |                  period                  | latitude | longitude | magnitude | depth | nst |                region
--------+------------------------------------------+----------+-----------+-----------+-------+-----+---------------------------------------
nc     | Wednesday, October 10, 2012 03:45:36 UTC |  39.5662 | -123.3917 |       1.8 |   8.9 |   9 | Northern California
hv     | Wednesday, October 10, 2012 03:32:29 UTC |  19.4028 | -155.2697 |       2.9 |   1.9 |  24 | Island of Hawaii, Hawaii
hv     | Wednesday, October 10, 2012 03:24:59 UTC |  19.4048 | -155.2673 |       2.6 |   2.1 |  17 | Island of Hawaii, Hawaii
nn     | Wednesday, October 10, 2012 03:21:16 UTC |  36.7553 | -115.5388 |       1.2 |     7 |  20 | Nevada
nn     | Wednesday, October 10, 2012 03:09:13 UTC |   38.583 | -119.4507 |       1.3 |     7 |   7 | Central California
uw     | Wednesday, October 10, 2012 03:07:14 UTC |  47.7083 |  -122.325 |         2 |  29.7 |  36 | Seattle-Tacoma urban area, Washington
ci     | Wednesday, October 10, 2012 02:52:38 UTC |  32.8157 | -116.1407 |       1.3 |   7.4 |  22 | Southern California
ci     | Wednesday, October 10, 2012 02:46:21 UTC |   33.932 | -116.8478 |       1.8 |   7.6 |  87 | Southern California
hv     | Wednesday, October 10, 2012 02:17:29 UTC |  19.4042 | -155.2688 |       1.9 |   1.7 |  17 | Island of Hawaii, Hawaii

… sample output

Step 8: Load data into physical table

An external table is just a pointer to the HDFS file location; every time you select from it, Greenplum reads the HDFS file system.

For data analysis you will usually want to store this data in a physical table.

Create a table in Greenplum

gpadmin=# CREATE TABLE earthquake_raw as select * from earthquake_raw_ext ;

NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'source' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 47

gpadmin=#
When using gphdfs, Greenplum parallelizes the data load across the segment servers.
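
To avoid the NOTICE above and choose the distribution key yourself, you can add an explicit DISTRIBUTED BY clause to the CREATE TABLE AS statement. A sketch; distributing on period rather than the low-cardinality source column is just one reasonable choice to reduce skew:

CREATE TABLE earthquake_raw AS
SELECT * FROM earthquake_raw_ext
DISTRIBUTED BY (period);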

Now you can use the earthquake_raw table for analysis and have fun.

As always comments are welcome.

About Diwakar Kasibhotla

Oracle Database Tuning, VLDB designing, ETL/BI architect, Data Modelling, Exadata Architect.

43 Responses to Greenplum and Hadoop HDFS integration

  1. tyrone griffin says:

    cool, you cleared up a lot of questions I had about hadoop.

  2. Bala says:

    I have a question,
Can I use (pg or gp) dump utilities on Greenplum external tables stored in Hadoop?
    -Bala

  3. Bala: You can't use the dump utilities to dump external table contents. The idea is that you can move the files and recreate the external tables.
    But you can use WRITABLE EXTERNAL tables to write Hadoop HDFS file content to the local file system on an ETL server.

    In this example, to write out the data I created a WRITABLE EXTERNAL table using gpfdist.

    Step 1: Start gpfdist on host where you want to copy the contents or dump the contents

    etl-host1:data $ gpfdist -d /Users/gpadmin/data/ -p 8081 -l /Users/gpadmin/data/log1 &

    Step 2: Create writable external table

    CREATE WRITABLE EXTERNAL TABLE earthquake_raw_ext_w
    ( LIKE earthquake_raw_ext )
    LOCATION ('gpfdist://etl-host1:8081/earthquake.txt')
    FORMAT 'TEXT' (DELIMITER '|')
    DISTRIBUTED BY (source);

    Step 3: Unload data from the gphdfs external table into the writable external table.

    insert into earthquake_raw_ext_w select * from earthquake_raw_ext ;

    Step 4: Check dump file
    Under /Users/gpadmin/data you should be able to see earthquake.txt with all its contents.

    etl-host1: data $ ls -ltr

    -rw------- 1 gpadmin gpadmin 4305 Dec 11 11:36 earthquake.txt

    Let me know if this helps.

  4. Bala says:

    Hi
    Can we use the 'distcp' utility (to perform a copy) on a single file (a large file, 1 TB)?
    -Bala

  5. Bala says:

    Just now I clarified that the distcp utility can be used for a single file too, but I'm wondering how the copy is implemented for a single file (are all file blocks copied in parallel with offset information to the target device, or is it a sequential copy like a normal file system copy?)
    -Bala

  6. Praveen Shettigar says:

    Nice Post! Your article is based on the assumption that gpdb and hadoop are running on the same host. I have Greenplum database running on DCA and GPHD-1.1 cluster running on 4 commodity servers. What additional steps do I have to take for greenplum database to see the HDFS file system running on remote set of servers. What should be the value for gp_hadoop_home in my scenario? Thanks!

    • @Praveen
      For multinode setup do this
      1. tar your gphd-1.1 directory on your hadoop Namenode and move it to DCA on Master (you can pick any directory to move this)
      2. As gpadmin user on DCA gpscp gphd-1.1 tar file to all segment servers (use the same directory location as Master)
      3. gpssh to all segment servers and untar the file (also untar on master)
      4. For example if untar creates directory /home/gpadmin/gphd-1.1 then this becomes your gp_hadoop_home

      Hope this helps
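
      A rough sketch of steps 2-4, assuming a host file named seg_hosts that lists the segment servers and a tarball named gphd-1.1.tar.gz (both names are illustrative):

      gpadmin$ gpscp -f seg_hosts /home/gpadmin/gphd-1.1.tar.gz =:/home/gpadmin/
      gpadmin$ gpssh -f seg_hosts 'tar -xzf /home/gpadmin/gphd-1.1.tar.gz -C /home/gpadmin'
      gpadmin$ tar -xzf /home/gpadmin/gphd-1.1.tar.gz -C /home/gpadmin    # untar on the master too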

  7. hasan dange says:

    Hi Diwakar,

    I have Greenplum running on 3 commodity servers and Hadoop running on 5 commodity servers.
    I am trying to pull the data from Hadoop into Greenplum using an external table.
    Greenplum Version is —> Greenplum Database 4.2.2.4 build 1 Community Edition
    Hadoop Version is —> Hadoop 1.0.3-gphd-1.2.0.0 (Greenplum Hadoop)

    The external table definition is:

    External table "public.part00001"
      Column  | Type | Modifiers
    ----------+------+-----------
     column_1 | text |
     column_2 | text |
     column_3 | text |
    Type: readable
    Encoding: UTF8
    Format type: text
    Format options: delimiter ',' null '\N' escape '\'
    External location: gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001

    And I am firing the below query to fetch data.

    select * from part00001;
    ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/input/FileInputFormat (seg0 slice1 seg-1:40000 pid=29480)
    DETAIL:

    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
    at sun.misc.Launcher$AppClassLoader.loadClass(La
    Command: 'gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001'
    External table part00001, file gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001

    I am getting the above error. I copied the Hadoop jar files into the lib folder of the Greenplum database on the Greenplum server, but I am still getting the same error.

    Please suggest a solution, it will really be a great help!!

    Thanks,
    Hasan Dange

  8. Seems like it's unable to find the core jar file. Send your HADOOP_HOME and gp_hadoop_home variable values.
    Also list out the files in your Hadoop home and send the output.
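
    A few quick checks that may help narrow this down (a sketch; the exact jar names vary by Hadoop distribution):

    gpadmin$ echo $HADOOP_HOME
    gpadmin$ gpconfig -s gp_hadoop_home
    gpadmin$ gpconfig -s gp_hadoop_target_version
    gpadmin$ ls $HADOOP_HOME/*.jar $HADOOP_HOME/lib/*.jar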

  9. Sai says:

    Please upload GPDB and CDH4 integration for a multi-node setup

    • For multinode setup do this
      1. tar your gphd-1.1 directory on your hadoop Namenode and move it to DCA on Master (you can pick any directory to move this)
      2. As gpadmin user on DCA gpscp gphd-1.1 tar file to all segment servers (use the same directory location as Master)
      3. gpssh to all segment servers and untar the file (also untar on master)
      4. For example if untar creates directory /home/gpadmin/gphd-1.1 then this becomes your gp_hadoop_home

      Hope this helps

  10. Sai says:

    Thank you. Is "cdh3u2" the only supported version? How about CDH4?
    gp_hadoop_target_version = cdh3u2

  11. Sai says:

    Got it “cdh4.1” is supported.

  12. harsha says:

    It's very nice.
    I want to access data in Greenplum DB from HAWQ in Pivotal HD (external table). Please show me the way.

  13. @Harsha @Satya: Do you want to build an external table in HAWQ pointing to GPDB tables and load into HAWQ, or is it just for query purposes? If you can explain what you are planning to do, that would help.

  14. azzam says:

    Hi, when I issue "select * from earthquake_raw_ext;" an error occurs as follows:
    gpadmin=# select * from earthquake_raw_ext;
    ERROR: missing data for column "period"
    Any idea? Thx

  15. Frank says:

    Hi Kasibhotla,

    I am interested in “go through multi-node setup
    too where Greenplum and hadoop are on different hosts”,
    Could you give me some documents ? superdatashare@gmail.com. Thanks.

    • @Frank I have them in my previous comments
      Here it is again
      For multinode setup do this
      1. tar your gphd-1.1 directory on your hadoop Namenode and move it to DCA on Master (you can pick any directory to move this)
      2. As gpadmin user on DCA gpscp gphd-1.1 tar file to all segment servers (use the same directory location as Master)
      3. gpssh to all segment servers and untar the file (also untar on master)
      4. For example if untar creates directory /home/gpadmin/gphd-1.1 then this becomes your gp_hadoop_home

      Hope this helps

  16. SKB says:

    What is the best practice to pull data from Hadoop into Greenplum?
    Option 1: Create a staging table in Greenplum and pull data from Hadoop using an external table.
    Option 2: Load data into Greenplum using Sqoop.
    Option 3: Create a table in Hive, write MapReduce to pull data from HDFS and load it into Hive, and from Hive pull the data into Greenplum.
    Option 4: Any other?

  17. Ashok Sharma says:

    These are really nice steps. I have a three-node Greenplum cluster and single-node Hadoop installed. Hadoop is installed on the master node of GPDB. I am using hadoop-2.0.0-cdh4.7.1 with GPDB 4.3.4.0.
    When I select from the external table in GPDB I get the errors below.
    ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.net.UnknownHostException: SGPDB3.localdomain: SGPDB3.localdomain: unknown error (seg3 slice1 SGPDB3.localdomain:40001 pid=18425)
    DETAIL:

    at java.net.InetAddress.getLocalHost(InetAddress.java:1484)
    at org.apache.hadoop.security.SecurityUtil.getLocalHostName(SecurityUtil.java:223)
    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
    at com.emc.greenplum.gpdb.hdfsconnector.ConnectorUtil.loginSecureHadoop(ConnectorUtil.java:84)
    at com.emc.greenplum.gpdb.hdfsconnector.HDFSReader.doRead(HDFSReader.java:85)
    Command: 'gphdfs://gpdb1.localdomain:9000/user/hadoop/a.text'
    External table ak_ext, file gphdfs://gpdb1.localdomain:9000/user/hadoop/a.text

  18. Igor Rudyak says:

    How about a Hadoop cluster with Kerberos authentication turned on? Will this approach work in that case? Is it necessary to add some extra settings when working with a kerberized Hadoop cluster?

    What account will be used to access external table files on HDFS, when Greenplum user tries to make select against such a table?

    • If you are using Pivotal HD then it uses “gpadmin” account to authenticate using Kerberos. For other HD distribution you need to add “gpadmin” user as a principal.
      Greenplum uses gpadmin user to access files on HDFS.
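
      For a non-Pivotal distribution, adding the principal and a keytab would look roughly like this (a sketch assuming MIT Kerberos; the realm and keytab path are illustrative):

      kadmin.local -q "addprinc -randkey gpadmin@EXAMPLE.COM"
      kadmin.local -q "xst -k /home/gpadmin/gpadmin.keytab gpadmin@EXAMPLE.COM"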

  19. Check your host entries. The protocol is not able to find gpdb1.localdomain host.
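
    Every Greenplum and Hadoop host name needs to resolve on every node, for example via /etc/hosts entries like these (a sketch; the IP addresses are illustrative):

    192.168.1.11   gpdb1.localdomain    gpdb1
    192.168.1.12   SGPDB2.localdomain   SGPDB2
    192.168.1.13   SGPDB3.localdomain   SGPDB3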

  20. Anonymous says:

    Hi Diwakar,
    Greenplum DB is already running on the same three-node cluster without any issue.
    I can insert, update and select Greenplum data successfully, so I don't understand why Greenplum DB is not able to select data from the single-node Hadoop on the same cluster.

  21. Anonymous says:

    Hi Diwakar,
    GPDB is installed on a three-node cluster: GPDB1 (master node) and two segment hosts,
    SGPDB2 and SGPDB3. GPDB is running fine, so there should not be any issue with host name resolution. I installed single-node Cloudera Hadoop 4.7 on GPDB1 and created an external table in GPDB; while selecting data I am getting the above errors.

    Hadoop binaries installed on all above three nodes
    gp_hadoop_target_version is set to cdh4.1

  22. Anonymous says:

    Hi Diwakar,

    Do you know if Greenplum works with Cloudera 5.3.2?

  23. skacanski says:

    How about bulk loading 20 TB of data per hour from Greenplum to HDFS?
    Is there any way to parallelize this process?

  24. sree says:

    What is the best approach to load data from daily HDFS files with CDC into a Greenplum table?
    1. Create an external table every day on the daily files and do CDC on the GP table and upsert.
    2. Create a master file in HDFS with all the CDC daily files, create an external table, and load the table in GP.

    Any other best approach to load updates into Greenplum from HDFS?

    Thanks.

  25. Gurjeet Maini says:

    Thanks for the nice document.
    In my case Hadoop and Greenplum are on different sets of servers.
    I have set the database parameters and also set $JAVA_HOME on all of the Greenplum servers, but I am still getting an error while inserting into or retrieving data from the external table.
    Below is the error from the database segment log. Can you please help?
    $ echo $JAVA_HOME
    /usr/bin/java
    java -version
    java version "1.6.0_39"
    OpenJDK Runtime Environment (IcedTea6 1.13.11) (rhel-1.13.11.1.el6_8-x86_64)
    OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Exception in thread "main" java.net.ConnectException: Call From /10.40.217.154 to hdpqa:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at s

  26. Thejo Kumar says:

    We are planning to move from Greenplum to HAWQ, what is your opinion on this?
