One of the new features in Greenplum 4.2 is the ability to create external tables over files in the Hadoop HDFS file system.
This is extremely useful when you want to avoid moving files from HDFS to a local folder for data loading.
In this post I will go through the configuration of a single-node (CentOS) Greenplum database to access HDFS and create external tables over it.
Architecture Overview:
Pre-requisites:
1. Greenplum database is up and running under gpadmin user
2. Hadoop (Cloudera CDH3) is up and running under hadoop user
Assumption:
1. Greenplum and Hadoop are running on the same host
Note: If anyone is interested (send a comment or email) I can go through a multi-node setup too, where Greenplum and Hadoop are on different hosts.
Install directories:
Greenplum: /home/gpadmin/greenplum-db
Hadoop: /home/hadoop/
Hostname: gphost
Step 1: Install Java on the Greenplum host
Download the JDK from here and install it.
I have it installed under /usr/java/jdk1.6.0_26
Step 2: Set JAVA and HADOOP home for gpadmin
Open .bashrc and add the following lines
export JAVA_HOME=/usr/java/jdk1.6.0_26
export HADOOP_HOME=/home/hadoop
Save and exit the file and source the .bashrc
gpadmin$ source ./.bashrc
Step 3: Modify the postgresql.conf file
As the gpadmin user:
gpadmin$ cd $MASTER_DATA_DIRECTORY
gpadmin$ vi postgresql.conf
Add the following lines at the end:
# Hadoop related settings
gp_external_enable_exec=on
gp_hadoop_target_version = cdh3u2
gp_hadoop_home = '/home/hadoop'
Save and exit the file.
Note: Possible values for gp_hadoop_target_version are
gphd-1.1
gpmr-1.0
gpmr-1.2
cdh3u2
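The Step 3 edit can also be scripted. A minimal sketch, writing to a scratch file rather than the live configuration (on a real system you would append to the file under $MASTER_DATA_DIRECTORY and then restart, as in Step 4):

```shell
# Scratch copy standing in for the master's configuration file
CONF=/tmp/postgresql.conf.demo
: > "$CONF"

# Append the gphdfs-related settings at the end of the file
cat >> "$CONF" <<'EOF'
# Hadoop related settings
gp_external_enable_exec=on
gp_hadoop_target_version = cdh3u2
gp_hadoop_home = '/home/hadoop'
EOF

# Both gp_hadoop_* settings should now be present
grep -c '^gp_hadoop' "$CONF"
```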
Step 4: Restart Greenplum database
gpadmin$ gpstop -a
gpadmin$ gpstart -a
Step 5: Check access to hadoop binaries
As gpadmin user
gpadmin$ cd $HADOOP_HOME/bin
gpadmin$ ./hadoop fs -ls /
You should be able to view the HDFS folders and files. If you run into permission issues, then as the hadoop user grant read privileges on the HDFS files to other users.
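HDFS permissions follow the POSIX owner/group/other model, so the grant uses the same mode string as local chmod; on HDFS (as the hadoop user) the command would be `hadoop fs -chmod -R o+rx /data`. A local sketch of the mode change itself (the scratch directory is just a stand-in, and `stat -c` assumes GNU/Linux stat):

```shell
# On HDFS the equivalent command (as the hadoop user) would be:
#   hadoop fs -chmod -R o+rx /data
# The mode string behaves exactly like local chmod:
D=/tmp/hdfsperm_demo
mkdir -p "$D"
chmod 750 "$D"     # others currently have no access
chmod o+rx "$D"    # grant others read + traverse
stat -c '%a' "$D"  # prints 755
```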
Step 6: Copy test file to hdfs
This earthquake file (copy the contents to earthquake.txt) was generated by a MapReduce job that analyzed raw log files from different earthquake monitoring centers. The output columns are:
Source – Location of monitoring center
period
latitude
longitude
magnitude
depth
NST
region
As the hadoop user, create an HDFS folder to hold this file:
hadoop$ hadoop fs -mkdir /data
Copy earthquake.txt to /home/hadoop on the Hadoop server, then copy it into HDFS:
hadoop$ hadoop fs -copyFromLocal /home/hadoop/earthquake.txt /data
You can check the file by
hadoop$ hadoop fs -ls /data/earthquake.txt
Found 1 items
-rw-r--r-- 1 hadoop hadoop 4417 2012-10-10 13:42 /data/earthquake.txt
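Before pointing an external table at the file, it is worth checking that every record really has the 8 pipe-delimited fields the table definition expects; malformed lines surface later as "missing data for column" errors on select. A sketch using one sample record from the file (run the same awk against the full earthquake.txt):

```shell
# One sample record from earthquake.txt (8 fields, '|' delimited)
F=/tmp/earthquake_sample.txt
printf '%s\n' 'nc|Wednesday, October 10, 2012 03:45:36 UTC|39.5662|-123.3917|1.8|8.9|9|Northern California' > "$F"

# Report any line whose field count is not 8; no output means the file is clean
awk -F'|' 'NF != 8 {print NR": "NF" fields"}' "$F"
```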
Step 7: Create an external table in Greenplum
Use this definition to create the external table in the Greenplum database:
CREATE EXTERNAL TABLE earthquake_raw_ext
(
source text,
period text,
latitude double precision,
longitude double precision,
magnitude double precision,
depth double precision,
NST double precision,
region text
)
LOCATION ('gphdfs://gphost:8020/data/earthquake.txt')
FORMAT 'text' (delimiter '|')
ENCODING 'UTF8';
Select from the table using psql:
gpadmin=# select * from earthquake_raw_ext ;
source | period | latitude | longitude | magnitude | depth | nst | region
--------+------------------------------------------+----------+-----------+-----------+-------+-----+---------------------------------------
nc | Wednesday, October 10, 2012 03:45:36 UTC | 39.5662 | -123.3917 | 1.8 | 8.9 | 9 | Northern California
hv | Wednesday, October 10, 2012 03:32:29 UTC | 19.4028 | -155.2697 | 2.9 | 1.9 | 24 | Island of Hawaii, Hawaii
hv | Wednesday, October 10, 2012 03:24:59 UTC | 19.4048 | -155.2673 | 2.6 | 2.1 | 17 | Island of Hawaii, Hawaii
nn | Wednesday, October 10, 2012 03:21:16 UTC | 36.7553 | -115.5388 | 1.2 | 7 | 20 | Nevada
nn | Wednesday, October 10, 2012 03:09:13 UTC | 38.583 | -119.4507 | 1.3 | 7 | 7 | Central California
uw | Wednesday, October 10, 2012 03:07:14 UTC | 47.7083 | -122.325 | 2 | 29.7 | 36 | Seattle-Tacoma urban area, Washington
ci | Wednesday, October 10, 2012 02:52:38 UTC | 32.8157 | -116.1407 | 1.3 | 7.4 | 22 | Southern California
ci | Wednesday, October 10, 2012 02:46:21 UTC | 33.932 | -116.8478 | 1.8 | 7.6 | 87 | Southern California
hv | Wednesday, October 10, 2012 02:17:29 UTC | 19.4042 | -155.2688 | 1.9 | 1.7 | 17 | Island of Hawaii, Hawaii
… sample output
Step 8: Load data into physical table
An external table is just a pointer to the HDFS file location; every time you select from it, Greenplum reads from the HDFS file system.
For data analysis you want to store this data in a physical table.
Create a table in Greenplum:
gpadmin=# CREATE TABLE earthquake_raw as select * from earthquake_raw_ext ;
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'source' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 47
gpadmin=#
When using gphdfs, Greenplum parallelizes the data load across the segment servers.
Now you can use earthquake_raw table for analysis and have fun.
As always comments are welcome.
cool, you cleared up a lot of questions I had about hadoop.
I have a question,
Can I use the (pg or gp) dump utilities on Greenplum external tables stored in Hadoop?
-Bala
pg_dump only, as it was developed for PostgreSQL (PG).
Bala: You can't use the dump utilities to dump external table contents. The idea is that you can move the files to recreate the external tables.
But you can use WRITABLE EXTERNAL tables to write Hadoop HDFS file content to the local file system on an ETL server.
In this example, to write out the data I created a WRITABLE EXTERNAL table using gpfdist.
Step 1: Start gpfdist on the host where you want to copy or dump the contents
etl-host1:data $ gpfdist -d /Users/gpadmin/data/ -p 8081 -l /Users/gpadmin/data/log1 &
Step 2: Create writable external table
CREATE WRITABLE EXTERNAL TABLE earthquake_raw_ext_w
( LIKE earthquake_raw_ext )
LOCATION ('gpfdist://etl-host1:8081/earthquake.txt')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (source);
Step 3: Unload data from the gphdfs external table into the writable external table.
insert into earthquake_raw_ext_w select * from earthquake_raw_ext ;
Step 4: Check dump file
Under /Users/gpadmin/data you should be able to see earthquake.txt with all its contents.
etl-host1: data $ ls -ltr
-rw------- 1 gpadmin gpadmin 4305 Dec 11 11:36 earthquake.txt
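A simple completeness check on the dump is to compare the file's line count with the row count reported earlier (SELECT 47 in this walkthrough). Sketched here against a small stand-in file; on the ETL host you would run the same check on /Users/gpadmin/data/earthquake.txt:

```shell
# Stand-in for the unloaded file (3 rows instead of the post's 47)
F=/tmp/unload_check.txt
printf 'row1\nrow2\nrow3\n' > "$F"

EXPECTED=3                 # would be 47 for earthquake_raw_ext
ACTUAL=$(wc -l < "$F")
[ "$ACTUAL" -eq "$EXPECTED" ] && echo "unload OK: $ACTUAL rows"
```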
Let me know if this helps.
Hi Diwakar, thanks for your quick response.
Hi
Can we use the 'distcp' utility (to perform a copy) on a single file (a large file, 1 TB)?
-Bala
Just now I clarified that the distcp utility can be used for a single file too, but I'm wondering how the copy is implemented for a single file (are all file blocks copied in parallel, with offset information, to the target device, or is it a sequential copy like a normal file system copy?)
-Bala
I haven't used distcp myself, but it's explained here: http://hadoop.apache.org/docs/r0.19.2/distcp.html
Nice Post! Your article is based on the assumption that gpdb and hadoop are running on the same host. I have Greenplum database running on DCA and GPHD-1.1 cluster running on 4 commodity servers. What additional steps do I have to take for greenplum database to see the HDFS file system running on remote set of servers. What should be the value for gp_hadoop_home in my scenario? Thanks!
@Praveen
For multinode setup do this
1. Tar your gphd-1.1 directory on your Hadoop namenode and move it to the DCA master (you can pick any directory for this)
2. As the gpadmin user on the DCA, gpscp the gphd-1.1 tar file to all segment servers (use the same directory location as on the master)
3. gpssh to all segment servers and untar the file (also untar on the master)
4. For example, if untarring creates the directory /home/gpadmin/gphd-1.1, then that becomes your gp_hadoop_home
Hope this helps
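The four steps above can be sketched as follows. The gpscp/gpssh invocations need a live cluster, so they are shown as comments (the hostfile path is hypothetical); the tar round-trip itself is demonstrated locally:

```shell
# 1. On the Hadoop namenode: package the client directory
#      tar -czf gphd-1.1.tar.gz gphd-1.1
# 2. On the DCA master: push the tarball to every segment host
#      gpscp -f /home/gpadmin/seg_hosts gphd-1.1.tar.gz =:/home/gpadmin/
# 3. Untar on every host (and on the master):
#      gpssh -f /home/gpadmin/seg_hosts 'cd /home/gpadmin && tar -xzf gphd-1.1.tar.gz'
# Local demonstration of the tar round-trip:
SRC=/tmp/gphd-1.1
DST=/tmp/segment_home
mkdir -p "$SRC/bin" "$DST"
echo 'demo' > "$SRC/bin/hadoop"   # stand-in for the Hadoop client files
tar -C /tmp -czf /tmp/gphd-1.1.tar.gz gphd-1.1
tar -C "$DST" -xzf /tmp/gphd-1.1.tar.gz
ls "$DST/gphd-1.1/bin"            # prints: hadoop
# gp_hadoop_home then points at the untarred directory, e.g. /home/gpadmin/gphd-1.1
```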
Hi Diwakar,
I have Greenplum running on 3 commodity servers and Hadoop running on 5.
I am trying to pull data from Hadoop into Greenplum using an external table.
Greenplum version: Greenplum Database 4.2.2.4 build 1 Community Edition
Hadoop version: Hadoop 1.0.3-gphd-1.2.0.0 (Greenplum Hadoop)
The external table definition is: External table "public.part00001"
Column | Type | Modifiers
----------+------+-----------
column_1 | text |
column_2 | text |
column_3 | text |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter ‘,’ null ‘\N’ escape ‘\’
External location: gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001
And I am firing the below query to fetch data:
select * from part00001;
ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/input/FileInputFormat (seg0 slice1 seg-1:40000 pid=29480)
DETAIL:
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.input.FileInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(La
Command: 'gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001'
External table part00001, file gphdfs://10.26.121.239:9000/hive/gphd/warehouse/mri_out/part-00001
I am getting the above error. I copied the Hadoop jar files into the lib folder of the Greenplum database on the Greenplum server, but I am still getting the same error.
Please suggest a solution; it would be a really great help!!
Thanks,
Hasan Dange
Seems like it's unable to find the core jar file. Send your HADOOP_HOME and gp_hadoop_home variable values.
Also list out the files in your hadoop home and send the output.
Please upload GPDB and CDH4 integration for a multi-node setup
For multinode setup do this
1. Tar your gphd-1.1 directory on your Hadoop namenode and move it to the DCA master (you can pick any directory for this)
2. As the gpadmin user on the DCA, gpscp the gphd-1.1 tar file to all segment servers (use the same directory location as on the master)
3. gpssh to all segment servers and untar the file (also untar on the master)
4. For example, if untarring creates the directory /home/gpadmin/gphd-1.1, then that becomes your gp_hadoop_home
Hope this helps
Thank you. Is "cdh3u2" the only supported version? How about CDH4?
gp_hadoop_target_version = cdh3u2
Got it, "cdh4.1" is supported.
It's very nice.
I want to access data in Greenplum DB from HAWQ in Pivotal HD (external table). Please show the way.
You can, but it's very convoluted. You have to create a readable web external table and execute a psql command that reads data from GPDB.
Hey, nice. I tried but it won't work. Can you give a full explanation? I mean, should we change any conf properties in HAWQ?
@Harsha @Satya: Do you want to build an external table in HAWQ pointing to GPDB tables and load into HAWQ, or is it just for query purposes? If you can explain what you are planning to do, that would help.
I want to build an external table in HAWQ pointing to GPDB tables and load into HAWQ.
Read this
Hi, I want to build an external table in HAWQ pointing to GPDB tables just for query purposes. Can you please explain? Thanks in advance!
Read this
Hi, when I issue "select * from earthquake_raw_ext;" an error occurs, as follows:
gpadmin=# select * from earthquake_raw_ext;
ERROR: missing data for column “period”
any idea ? Thx
Check if you have the file loaded correctly on the HDFS.
Hi Kasibhotla,
I am interested in the multi-node setup where Greenplum and Hadoop are on different hosts.
Could you send me some documents? superdatashare@gmail.com. Thanks.
@Frank I have them in my previous comments
Here it is again
For multinode setup do this
1. Tar your gphd-1.1 directory on your Hadoop namenode and move it to the DCA master (you can pick any directory for this)
2. As the gpadmin user on the DCA, gpscp the gphd-1.1 tar file to all segment servers (use the same directory location as on the master)
3. gpssh to all segment servers and untar the file (also untar on the master)
4. For example, if untarring creates the directory /home/gpadmin/gphd-1.1, then that becomes your gp_hadoop_home
Hope this helps
What is the best practice to pull data from Hadoop into Greenplum?
Option 1: Create a staging table in Greenplum and pull data from Hadoop using an external table.
Option 2: Load data into Greenplum using Sqoop.
Option 3: Create a table in Hive, write MapReduce to pull data from HDFS and load it into Hive, and from Hive pull data into Greenplum.
Option 4: Any other?
If GPDB and the Hadoop cluster are on the same network, then Option 1 is not bad. If that's not the case, the other options are better.
These are really nice steps. I have a three-node Greenplum cluster and single-node Hadoop installed; Hadoop is installed on the master node of GPDB. I am using hadoop-2.0.0-cdh4.7.1 with GPDB 4.3.4.0.
When I select from the external table in GPDB I get the errors below.
ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.net.UnknownHostException: SGPDB3.localdomain: SGPDB3.localdomain: unknown error (seg3 slice1 SGPDB3.localdomain:40001 pid=18425)
DETAIL:
at java.net.InetAddress.getLocalHost(InetAddress.java:1484)
at org.apache.hadoop.security.SecurityUtil.getLocalHostName(SecurityUtil.java:223)
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:243)
at com.emc.greenplum.gpdb.hdfsconnector.ConnectorUtil.loginSecureHadoop(ConnectorUtil.java:84)
at com.emc.greenplum.gpdb.hdfsconnector.HDFSReader.doRead(HDFSReader.java:85)
Command: 'gphdfs://gpdb1.localdomain:9000/user/hadoop/a.text'
External table ak_ext, file gphdfs://gpdb1.localdomain:9000/user/hadoop/a.text
Check your host entries. The protocol is not able to find gpdb1.localdomain host.
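The UnknownHostException above means a segment host cannot resolve a hostname (its own, or the one in the gphdfs URL). A quick check, run on every host, is a name lookup with getent; the cluster-specific lookups are shown as comments since they only resolve on that cluster, and localhost is used for the runnable check:

```shell
# On each Greenplum host, verify the names from the error resolve:
#   getent hosts gpdb1.localdomain
#   getent hosts SGPDB3.localdomain
# If a lookup fails, add the entry to /etc/hosts on every host.
# Runnable check with a name that should always resolve:
getent hosts localhost && echo "lookup OK"
```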
How about Hadoop cluster with Kerberos authentication turned on? Will such approach work in a such case? Is it necessary to add some extra settings when working with kerberized Hadoop cluster?
What account will be used to access external table files on HDFS, when Greenplum user tries to make select against such a table?
If you are using Pivotal HD then it uses “gpadmin” account to authenticate using Kerberos. For other HD distribution you need to add “gpadmin” user as a principal.
Greenplum uses gpadmin user to access files on HDFS.
Hi Diwakar,
Greenplum DB is already running on the same three-node cluster without any issue.
I can insert, update, and select Greenplum data successfully, so I don't understand why Greenplum is not able to select data from the single-node Hadoop on the same cluster.
Hi diwakar
GPDB is installed on a three-node cluster: GPDB1 (the master node) and two segment hosts, SGPDB2 and SGPDB3. GPDB is running fine, so there should not be any issue with hostname resolution. I installed single-node Cloudera Hadoop 4.7 on GPDB1 and created an external table in GPDB; while selecting data I am getting the above errors.
The Hadoop binaries are installed on all three nodes.
gp_hadoop_target_version is set to cdh4.1
Hi Diwakar,
Do you know if Greenplum works with Cloudera 5.3.2?
How about bulk loading 20 TB of data per hour from Greenplum to HDFS?
Is there any way to parallelize this process?
What is the best approach to load data from daily HDFS files with CDC into a Greenplum table?
1. Create an external table every day on the daily files, do CDC against the GP table, and upsert.
2. Create a master file in HDFS with all the CDC daily files, create an external table, and load the table in GP.
Any other best approach to load updates into Greenplum from HDFS?
Thanks.
Thanks for the nice document.
In my case Hadoop and Greenplum are on different sets of servers.
I have set the database parameters and also set $JAVA_HOME on all of the Greenplum servers, but I still get an error while inserting into or retrieving data from the external table.
Below is the error from the database segment log. Can you please help?
$ echo $JAVA_HOME
/usr/bin/java
java -version
java version "1.6.0_39"
OpenJDK Runtime Environment (IcedTea6 1.13.11) (rhel-1.13.11.1.el6_8-x86_64)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.net.ConnectException: Call From /10.40.217.154 to hdpqa:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at s
We are planning to move from Greenplum to HAWQ; what is your opinion on this?
Make sure that you have functional parity when you move from Greenplum to HAWQ. Also, I have seen degraded performance on HAWQ as query concurrency increases.