Greenplum and Hadoop HDFS integration

One of the features introduced in Greenplum 4.2 is the ability to create external tables on the Hadoop HDFS file system.
This is extremely useful when you want to load data without first moving files from HDFS to a local folder.

In this post I will go through the configuration of a single-node (CentOS) Greenplum database to access HDFS and create external tables on it.

Architecture Overview:

[Architecture diagram: gp_hd_arch]
Pre-requisites:
1. Greenplum database is up and running under gpadmin user
2. Hadoop (Cloudera CDH3) is up and running under hadoop user

Assumption:
1. Greenplum and Hadoop are running on the same host

Note: If anyone is interested (send a comment or email), I can also go through a multi-node setup
where Greenplum and Hadoop are on different hosts.

Install directories:
Greenplum: /home/gpadmin/greenplum-db
Hadoop:    /home/hadoop/
Hostname:  gphost

Step 1: Install Java on the Greenplum host
Download the JDK from here and install it.
I have it installed under /usr/java/jdk1.6.0_26

Step 2: Set JAVA_HOME and HADOOP_HOME for gpadmin
Open .bashrc and add the following lines:

export JAVA_HOME=/usr/java/jdk1.6.0_26
export HADOOP_HOME=/home/hadoop

Save and exit the file, then source .bashrc:

gpadmin$ source ./.bashrc

Step 3: Modify the postgresql.conf file
As gpadmin user

gpadmin$ cd $MASTER_DATA_DIRECTORY
gpadmin$ vi postgresql.conf

Add the following lines at the end:

# Hadoop related settings
gp_external_enable_exec=on
gp_hadoop_target_version = cdh3u2
gp_hadoop_home = '/home/hadoop'

Save and exit the file.

Note: Possible values for gp_hadoop_target_version are

gphd-1.1
gpmr-1.0
gpmr-1.2
cdh3u2
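
If you prefer not to edit the file by hand, these parameters can also be set with the gpconfig utility. This is only a minimal sketch; the extra quoting around string values follows the usual gpconfig convention, so verify it against your Greenplum version:

gpadmin$ gpconfig -c gp_hadoop_target_version -v "'cdh3u2'"
gpadmin$ gpconfig -c gp_hadoop_home -v "'/home/hadoop'"

Either way, restart the database as in Step 4 so the settings take effect.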

Step 4: Restart Greenplum database

gpadmin$ gpstop -a
gpadmin$ gpstart -a

Step 5: Check access to the Hadoop binaries

As gpadmin user

gpadmin$ cd $HADOOP_HOME/bin
gpadmin$ ./hadoop fs -ls /

You should be able to view the HDFS folders and files. If you run into permission issues, then as the
hadoop user grant read privileges on the HDFS files to other users, as shown below.
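
For example, the hadoop user can open up read access on the directory gpadmin needs (a sketch; /data is the directory created in the next step, so substitute whatever path you are using):

hadoop$ hadoop fs -chmod -R o+rx /data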

Step 6: Copy test file to hdfs

This earthquake file (copy the contents to earthquake.txt) was generated by a MapReduce function that analyzed raw log files from different earthquake monitoring centers. The output columns are listed below, followed by a sample line.

Source – Location of monitoring center
period
latitude
longitude
magnitude
depth
NST
region
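
Based on the query output later in this post, each line of the file is pipe-delimited in the column order above, for example:

nc|Wednesday, October 10, 2012 03:45:36 UTC|39.5662|-123.3917|1.8|8.9|9|Northern California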

As the hadoop user, create an HDFS folder to hold this file

hadoop$ hadoop fs -mkdir /data

Copy earthquake.txt to /home/hadoop on the Hadoop server, then copy it to HDFS

hadoop$ hadoop fs -copyFromLocal /home/hadoop/earthquake.txt /data

You can check the file by

hadoop$ hadoop fs -ls /data/earthquake.txt

Found 1 items
-rw-r--r--   1 hadoop hadoop       4417 2012-10-10 13:42 /data/earthquake.txt

Step 7: Create an external table in Greenplum

Use this definition to create the external table in the Greenplum database

CREATE EXTERNAL TABLE earthquake_raw_ext
(
source    text,
period    text,
latitude  double precision,
longitude double precision,
magnitude double precision,
depth     double precision,
NST       double precision,
region    text
)
LOCATION ('gphdfs://gphost:8020/data/earthquake.txt')
FORMAT 'text' (delimiter '|')
ENCODING 'UTF8';
Select from the table using psql.

gpadmin=# select * from earthquake_raw_ext ;

source |                  period                  | latitude | longitude | magnitude | depth | nst |                region
--------+------------------------------------------+----------+-----------+-----------+-------+-----+---------------------------------------
nc     | Wednesday, October 10, 2012 03:45:36 UTC |  39.5662 | -123.3917 |       1.8 |   8.9 |   9 | Northern California
hv     | Wednesday, October 10, 2012 03:32:29 UTC |  19.4028 | -155.2697 |       2.9 |   1.9 |  24 | Island of Hawaii, Hawaii
hv     | Wednesday, October 10, 2012 03:24:59 UTC |  19.4048 | -155.2673 |       2.6 |   2.1 |  17 | Island of Hawaii, Hawaii
nn     | Wednesday, October 10, 2012 03:21:16 UTC |  36.7553 | -115.5388 |       1.2 |     7 |  20 | Nevada
nn     | Wednesday, October 10, 2012 03:09:13 UTC |   38.583 | -119.4507 |       1.3 |     7 |   7 | Central California
uw     | Wednesday, October 10, 2012 03:07:14 UTC |  47.7083 |  -122.325 |         2 |  29.7 |  36 | Seattle-Tacoma urban area, Washington
ci     | Wednesday, October 10, 2012 02:52:38 UTC |  32.8157 | -116.1407 |       1.3 |   7.4 |  22 | Southern California
ci     | Wednesday, October 10, 2012 02:46:21 UTC |   33.932 | -116.8478 |       1.8 |   7.6 |  87 | Southern California
hv     | Wednesday, October 10, 2012 02:17:29 UTC |  19.4042 | -155.2688 |       1.9 |   1.7 |  17 | Island of Hawaii, Hawaii

… sample output

Step 8: Load data into physical table

An external table is just a pointer to the HDFS file location; every time you select from it, Greenplum reads the HDFS file system.

For data analysis you will usually want to store this data in a physical table.

Create a table in Greenplum

gpadmin=# CREATE TABLE earthquake_raw as select * from earthquake_raw_ext ;

NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'source' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 47

gpadmin=#
When using gphdfs, Greenplum parallelizes the data load across the segment servers.
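
To avoid the NOTICE above and choose the distribution key yourself, you can add an explicit DISTRIBUTED BY clause to the CREATE TABLE AS statement. A sketch; distributing on period rather than the low-cardinality source column is just one reasonable choice to reduce skew:

CREATE TABLE earthquake_raw AS
SELECT * FROM earthquake_raw_ext
DISTRIBUTED BY (period);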

Now you can use the earthquake_raw table for analysis and have fun.

As always comments are welcome.

About Diwakar Kasibhotla

Oracle Database Tuning, VLDB designing, ETL/BI architect, Data Modelling, Exadata Architect.

43 Responses to Greenplum and Hadoop HDFS integration

  1. tyrone griffin says:

    cool, you cleared up a lot of questions I had about hadoop.

  2. Bala says:

    I have a question,
Can I use (pg or gp) dump utilities on Greenplum external tables stored in Hadoop?
    -Bala

  3. Bala: You can't use the dump utilities to dump external table contents. The idea is that you can move the files and recreate the external tables.
    But you can use WRITABLE EXTERNAL tables to write Hadoop HDFS file content to the local file system on an ETL server.

    In this example, to write out the data I created a WRITABLE EXTERNAL table using gpfdist.

    Step 1: Start gpfdist on host where you want to copy the contents or dump the contents

    etl-host1:data $ gpfdist -d /Users/gpadmin/data/ -p 8081 -l /Users/gpadmin/data/log1 &

    Step 2: Create writable external table

    CREATE WRITABLE EXTERNAL TABLE earthquake_raw_ext_w
    ( LIKE earthquake_raw_ext )
    LOCATION ('gpfdist://etl-host1:8081/earthquake.txt')
    FORMAT 'TEXT' (DELIMITER '|')
    DISTRIBUTED BY (source);

    Step 3: Unload data from the gphdfs external table into the writable external table.

    insert into earthquake_raw_ext_w select * from earthquake_raw_ext ;

    Step 4: Check dump file
    Under /Users/gpadmin/data you should be able to see earthquake.txt with all its contents.

    etl-host1: data $ ls -ltr

    -rw------- 1 gpadmin gpadmin 4305 Dec 11 11:36 earthquake.txt

    Let me know if this helps.

  4. Bala says:

    Hi
    Can we use the 'distcp' utility (to perform a copy) on a single file (a large file, 1 TB)?
    -Bala

  5. Bala says:

    Just now I clarified that the distcp utility can be used for a single file too, but I'm wondering how the copy is implemented for a single file (are all file blocks copied in parallel with offset information to the target device, or is it a sequential copy like a normal file system copy?)
    -Bala

  6. Praveen Shettigar says:

    Nice Post! Your article is based on the assumption that gpdb and hadoop are running on the same host. I have Greenplum database running on DCA and GPHD-1.1 cluster running on 4 commodity servers. What additional steps do I have to take for greenplum database to see the HDFS file system running on remote set of servers. What should be the value for gp_hadoop_home in my scenario? Thanks!

    • @Praveen
      For multinode setup do this
      1. tar your gphd-1.1 directory on your hadoop Namenode and move it to DCA on Master (you can pick any directory to move this)
      2. As gpadmin user on DCA gpscp gphd-1.1 tar file to all segment servers (use the same directory location as Master)
      3. gpssh to all segment servers and untar the file (also untar on master)
      4. For example if untar creates directory /home/gpadmin/gphd-1.1 then this becomes your gp_hadoop_home

      Hope this helps
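
      A rough sketch of steps 2-4, assuming a host file named seg_hosts that lists the segment servers and a tarball named gphd-1.1.tar.gz (both names are illustrative):

      gpadmin$ gpscp -f seg_hosts /home/gpadmin/gphd-1.1.tar.gz =:/home/gpadmin/
      gpadmin$ gpssh -f seg_hosts 'tar -xzf /home/gpadmin/gphd-1.1.tar.gz -C /home/gpadmin'
      gpadmin$ tar -xzf /home/gpadmin/gphd-1.1.tar.gz -C /home/gpadmin    # untar on the master too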

  7. hasan dange says:

    Hi Diwakar,

    I have Greenplum running on 3 commodity servers and Hadoop running on 5 commodity servers.
    I am trying to pull the data from Hadoop into Greenplum using an external table.
    Greenplum Version is —> Greenplum Database 4.2.2.4 build 1 Community Edition
    Hadoop Version is —> Hadoop 1.0.3-gphd-1.2.0.0 (Greenplum Hadoop)

    The external table definition is:

    External table "public.part00001"
      Column  | Type | Modifiers
    ----------+------+-----------
     column_1 | text |
     column_2 | text |
     column_3 | text |
    Type: readable
    Encoding: UTF8
    Format type: text
    Format options: delimiter ',' null '\N' escape '\'
    External location: gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001

    And I am firing the below query to fetch data.

    select * from part00001;
    ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/input/FileInputFormat (seg0 slice1 seg-1:40000 pid=29480)
    DETAIL:

    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
    at sun.misc.Launcher$AppClassLoader.loadClass(La
    Command: 'gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001'
    External table part00001, file gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001

    I am getting the above error. I copied the Hadoop jar files into the lib folder of the Greenplum database on the Greenplum server, but I am still getting the same error.

    Please suggest a solution, it will really be a great help!!

    Thanks,
    Hasan Dange

  8. Seems like it's unable to find the core jar file. Send your HADOOP_HOME and gp_hadoop_home variable values.
    Also list out the files in your Hadoop home and send the output.
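
    A few quick checks that may help narrow this down (a sketch; the exact jar names vary by Hadoop distribution):

    gpadmin$ echo $HADOOP_HOME
    gpadmin$ gpconfig -s gp_hadoop_home
    gpadmin$ gpconfig -s gp_hadoop_target_version
    gpadmin$ ls $HADOOP_HOME/*.jar $HADOOP_HOME/lib/*.jar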

  9. Sai says:

    Please upload GPDB and CDH4 integration for a multi-node setup

    • For multinode setup do this
      1. tar your gphd-1.1 directory on your hadoop Namenode and move it to DCA on Master (you can pick any directory to move this)
      2. As gpadmin user on DCA gpscp gphd-1.1 tar file to all segment servers (use the same directory location as Master)
      3. gpssh to all segment servers and untar the file (also untar on master)
      4. For example if untar creates directory /home/gpadmin/gphd-1.1 then this becomes your gp_hadoop_home

      Hope this helps

  10. Sai says:

    Thank you. Is "cdh3u2" the only supported version? How about CDH4?
    gp_hadoop_target_version = cdh3u2

  11. Sai says:

    Got it “cdh4.1” is supported.

  12. harsha says:

    It's very nice.
    I want to access data in Greenplum DB from HAWQ in Pivotal HD (external table). Please show me the way.

  13. @Harsha @Satya: Do you want to build an external table in HAWQ pointing to GPDB tables and load into HAWQ, or is it just for query purposes? If you can explain what you are planning to do, that would help.

  14. azzam says:

    Hi, when I issue "select * from earthquake_raw_ext;" an error occurs as follows:
    gpadmin=# select * from earthquake_raw_ext;
    ERROR: missing data for column "period"
    Any idea? Thx

  15. Frank says:

    Hi Kasibhotla,

    I am interested in “go through multi-node setup
    too where Greenplum and hadoop are on different hosts”,
    Could you give me some documents ? superdatashare@gmail.com. Thanks.

    • @Frank I have them in my previous comments
      Here it is again
      For multinode setup do this
      1. tar your gphd-1.1 directory on your hadoop Namenode and move it to DCA on Master (you can pick any directory to move this)
      2. As gpadmin user on DCA gpscp gphd-1.1 tar file to all segment servers (use the same directory location as Master)
      3. gpssh to all segment servers and untar the file (also untar on master)
      4. For example if untar creates directory /home/gpadmin/gphd-1.1 then this becomes your gp_hadoop_home

      Hope this helps

  16. SKB says:

    What is the best practice to pull data from Hadoop into Greenplum?
    Option 1: Create a staging table in Greenplum and pull data from Hadoop using an external table.
    Option 2: Load data into Greenplum using Sqoop.
    Option 3: Create a table in Hive, write MapReduce to pull data from HDFS and load it into Hive, and from Hive pull the data into Greenplum.
    Option 4: Any other?

  17. Ashok Sharma says:

    These are really nice steps. I have a three-node Greenplum cluster and single-node Hadoop installed. Hadoop is installed on the master node of GPDB. I am using hadoop-2.0.0-cdh4.7.1 with GPDB 4.3.4.0.
    When I select from the external table in GPDB I get the errors below.
    ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.net.UnknownHostException: SGPDB3.localdomain: SGPDB3.localdomain: unknown error (seg3 slice1 SGPDB3.localdomain:40001 pid=18425)
    DETAIL:

    at java.net.InetAddress.getLocalHost(InetAddress.java:1484)
    at org.apache.hadoop.security.SecurityUtil.getLocalHostName(SecurityUtil.java:223)
    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
    at com.emc.greenplum.gpdb.hdfsconnector.ConnectorUtil.loginSecureHadoop(ConnectorUtil.java:84)
    at com.emc.greenplum.gpdb.hdfsconnector.HDFSReader.doRead(HDFSReader.java:85)
    Command: 'gphdfs://gpdb1.localdomain:9000/user/hadoop/a.text'
    External table ak_ext, file gphdfs://gpdb1.localdomain:9000/user/hadoop/a.text

  18. Igor Rudyak says:

    How about a Hadoop cluster with Kerberos authentication turned on? Will this approach work in that case? Is it necessary to add some extra settings when working with a kerberized Hadoop cluster?

    What account will be used to access external table files on HDFS, when Greenplum user tries to make select against such a table?

    • If you are using Pivotal HD then it uses “gpadmin” account to authenticate using Kerberos. For other HD distribution you need to add “gpadmin” user as a principal.
      Greenplum uses gpadmin user to access files on HDFS.
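
      For a non-Pivotal distribution, adding the principal and a keytab would look roughly like this (a sketch assuming MIT Kerberos; the realm and keytab path are illustrative):

      kadmin.local -q "addprinc -randkey gpadmin@EXAMPLE.COM"
      kadmin.local -q "xst -k /home/gpadmin/gpadmin.keytab gpadmin@EXAMPLE.COM"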

  19. Check your host entries. The protocol is not able to find gpdb1.localdomain host.
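
    Every Greenplum and Hadoop host name needs to resolve on every node, for example via /etc/hosts entries like these (a sketch; the IP addresses are illustrative):

    192.168.1.11   gpdb1.localdomain    gpdb1
    192.168.1.12   SGPDB2.localdomain   SGPDB2
    192.168.1.13   SGPDB3.localdomain   SGPDB3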

  20. Anonymous says:

    Hi Diwakar,
    Greenplum DB is already running on the same three-node cluster without any issue.
    I can insert, update and select Greenplum data successfully, so I don't understand why Greenplum DB is not able to select data from the single-node Hadoop on the same cluster.

  21. Anonymous says:

    Hi Diwakar,
    GPDB is installed on a three-node cluster: GPDB1 (master node) and two segment hosts,
    SGPDB2 and SGPDB3. GPDB is running fine, so there should not be any issue with host name resolution. I installed single-node Cloudera Hadoop 4.7 on GPDB1 and created an external table in GPDB; while selecting data I am getting the above errors.

    Hadoop binaries installed on all above three nodes
    gp_hadoop_target_version is set to cdh4.1

  22. Anonymous says:

    Hi Diwakar,

    Do you know if Greenplum works with Cloudera 5.3.2?

  23. skacanski says:

    How about bulk loading 20 TB of data per hour from Greenplum to HDFS?
    Is there any way to parallelize this process?

  24. sree says:

    What is the best approach to load data from daily HDFS files with CDC into a Greenplum table?
    1. Create an external table every day on the daily files and do CDC on the GP table and upsert.
    2. Create a master file in HDFS with all the CDC daily files, create an external table, and load the table in GP.

    Any other best approach to load updates into Greenplum from HDFS?

    Thanks.

  25. Gurjeet Maini says:

    Thanks for the nice document.
    In my case Hadoop and Greenplum are on different sets of servers.
    I have set the database parameters and also set $JAVA_HOME on all of the Greenplum servers, but I am still getting an error while inserting into or retrieving data from the external table.
    Below is the error from the database segment log. Can you please help?
    $ echo $JAVA_HOME
    /usr/bin/java
    java -version
    java version "1.6.0_39"
    OpenJDK Runtime Environment (IcedTea6 1.13.11) (rhel-1.13.11.1.el6_8-x86_64)
    OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Exception in thread "main" java.net.ConnectException: Call From /10.40.217.154 to hdpqa:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at s

  26. Thejo Kumar says:

    We are planning to move from Greenplum to HAWQ, what is your opinion on this?
