
Connecting SAP DataServices to Hadoop: HDFS vs. Hive

SAP DataServices (DS) supports two ways of accessing data on a Hadoop cluster:

  1. HDFS:
    DS reads HDFS files directly from Hadoop. In DataServices you need to create HDFS file formats in order to use this setup. Depending on your dataflow, DataServices might read the HDFS file directly into the DS engine and handle all further processing there. If your dataflow contains more logic that could be pushed down to Hadoop, DS may instead generate a Pig script. The Pig script will then not just read the HDFS file but also handle further transformations, aggregations etc. from your dataflow. The latter scenario is usually the preferred setup for large amounts of data, because this way the Hadoop cluster can provide the processing power of many Hadoop nodes on inexpensive commodity hardware. The pushdown of dataflow logic to Pig/Hadoop is similar to the pushdown to relational database systems: it depends on whether the dataflow uses functions that can be processed in the underlying system (and, of course, on whether DS knows about all the capabilities of that system). In general, the most common functions, joins and aggregations in DS should be eligible for pushdown to Hadoop.
  2. Hive / HCatalog:
    DS reads data from Hive. Hive accesses data that is defined in HCatalog tables. In DataServices you need to set up a datastore that connects to a Hive adapter. The Hive adapter will in turn read tables from Hadoop by accessing a Hive server. DS will generate HiveQL commands. Since HiveQL is similar to SQL, the pushdown of dataflow logic to Hive works much like the pushdown to relational database systems. (A rough command-line sketch of both access routes follows after this list.)
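
To make the two routes more tangible, here is a rough command-line sketch of what the generated work conceptually looks like. All paths, file, table and column names are invented for illustration; the scripts that DS actually generates will look different.

# Route 1: HDFS file access, optionally with logic pushed down as a Pig script.
hdfs dfs -cat /data/sales/sales_2014.csv | head      # plain file read

cat > sales_totals.pig <<'EOF'
-- aggregate sales per customer on the cluster
sales   = LOAD '/data/sales/sales_2014.csv' USING PigStorage(',')
          AS (customer_id:chararray, amount:double);
grouped = GROUP sales BY customer_id;
totals  = FOREACH grouped GENERATE group AS customer_id, SUM(sales.amount) AS total;
STORE totals INTO '/data/sales_totals' USING PigStorage(',');
EOF
pig -f sales_totals.pig                              # runs as MapReduce jobs

# Route 2: Hive access against an HCatalog-defined table; DS generates HiveQL.
hive -e "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id;"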

It is difficult to say which approach is better to use in DataServices: HDFS files/Pig or Hive? Both Pig and Hive generate MapReduce jobs in Hadoop, so performance should be similar. Nevertheless, some aspects can be considered before deciding which way to go. I have tried to describe these aspects in the comparison below. They are probably not complete and they overlap, so in many cases they will not identify a clear favorite. Still, they may give some guidance in the decision process.

Setup connectivity

  • HDFS / Pig: Simple, via HDFS file formats.
  • Hive: Not simple. A Hive adapter and a Hive datastore need to be set up. The CLASSPATH setting for the Hive adapter is not trivial to determine, and there are different setup scenarios for DS 4.1 and for DS 4.2.2 or later releases (see also Connecting SAP DataServices to Hadoop Hive).

Overall purpose

  • HDFS / Pig: Native HDFS access is only advisable if all the data in Hadoop really needs to be processed within DataServices or if the data volume is not too large. Pig covers more of a data flow (in a general sense, not to be confused with DataServices dataflows): a Pig script processes various transformation steps and writes the results into a target table or file. A Pig script therefore suits DataServices jobs that read, transform and write back data from/to Hadoop.
  • Hive: Hive queries are mainly intended for DWH-like workloads. They suit DataServices jobs that need to join and aggregate large data volumes in Hadoop and write the results into pre-aggregated tables in a DWH or some other datastore accessible to BI tools.

Runtime performance

  • HDFS / Pig:
    1. Loading flat files from HDFS without any further processing: faster than Hive.
    2. Mapping, joining or aggregating large amounts of data (provided the logic gets pushed down to Pig): performance is determined by the MapReduce jobs in Hadoop and is therefore similar to Hive.
    3. Mapping, joining or aggregating small amounts of data: the processing might even run faster in the DS engine. It might therefore be an option to force DS not to push down the logic to Pig and just read the HDFS file natively.
  • Hive:
    1. Loading all data of a table without processing or aggregation: slower than native HDFS access, because of the unnecessary setup of MapReduce jobs in Hadoop.
    2. Mapping, joining or aggregating large amounts of data (provided the logic gets pushed down to Hive): performance is determined by the MapReduce jobs in Hadoop and is therefore similar to Pig.
    3. Mapping, joining or aggregating small amounts of data: there is no way to bypass Hive/HiveQL, so some MapReduce jobs will always be initiated by Hadoop, even if the data volume is small. The overhead of initiating these MapReduce jobs takes some time and may outweigh the data processing itself when the data volume is small.

Development aspects

  • HDFS / Pig:
    • HDFS file formats need to be defined manually in order to describe the data structure in the file.
    • On the other hand, the HDFS file format can easily be generated from the Schema Out view in a DataServices transform (in the same way as for local file formats).
    • No data preview is available.
  • Hive:
    • HCatalog tables can be imported like database tables; the table structure is already predefined by HCatalog. The HCatalog table might already exist, otherwise it still needs to be specified in Hadoop.
    • Template tables do not work with Hive datastores.
    • From DS version 4.2.3 onwards, data can be previewed in DS Designer.

Data structure

  • In general, HDFS file formats suit unstructured or semi-structured data better. There is little benefit in accessing unstructured data via HiveQL, because Hive will first save the results of a HiveQL query into a directory on the local file system of the DS engine, and DS will then read the results from that file for further processing. Reading the data directly via an HDFS file format might be quicker.

Future technologies (Stinger, Impala, Tez etc.)

  • Some data access technologies will improve the performance of HiveQL or Pig Latin significantly (some already do so today). Most of them will support HiveQL, whereas some will support both HiveQL and Pig. The decision between Hive and Pig might therefore also depend on the (future) data access engines available in the given Hadoop environment.

Connecting SAP DataServices to Hadoop Hive

Connecting SAP DataServices to Hadoop Hive is not as simple as, say, connecting to a relational database. In this post I want to share my experiences of connecting DataServices (DS) to Hive.

The DS engine cannot connect to Hive directly. Instead, you need to configure a Hive adapter from the DS management console, which will actually manage the connection to Hive.

In the rest of this post I will assume the following setup:

  • DS is not installed on a node in the Hadoop cluster, but has access to the Hadoop cluster. The DS server runs on a Unix server. I think such a setup is typical in most cases.
  • The Hadoop cluster consists of a Hortonworks Data Platform (HDP) 2.x distribution. Other distributions should work similarly, though.

1. Configure the DS server

The DS server will not be installed within the Hadoop cluster, but it will have access to it. Therefore Hadoop needs to be installed and configured appropriately on the DS server. I won't go into too much detail for this step, because the given environment may be quite different from my test environment.

Roughly, there are two approaches for installing Hadoop on the DS server:

  1. Manual installation:
    you may follow the instructions on Configuring Data Services on a machine not in your Hadoop cluster on SCN, sections Hadoop and Hive. I have never tried this approach, though.
  2. Installation via Ambari (preferred):
    this approach will initiate a Hadoop installation on the DS server from the Ambari system of the existing Hadoop cluster. The installation process is quite automated and will integrate the DS server as a kind of Hadoop client into the Ambari monitoring system of the cluster.
    In Ambari you need to select the Add Host option. This will start the Add Host Wizard:

    Ambari - Add host wizard

    Ambari – Add host wizard

    Before you can start the wizard you need to enable passwordless login from the Ambari server to the DS server using an SSH private key (a minimal key setup is sketched after this list).

    Later in the wizard process, you need to configure the DS server as a client host; this way all required Hadoop libraries, Hadoop client tools and the Ambari agent will be installed and configured on the DS server. Ambari will also monitor the availability of the DS server.
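
For reference, a minimal sketch of the key setup, assuming Ambari registers hosts as root (its default); user names, the host name and the exact wizard labels may differ in your environment.

# On the Ambari server: create a key pair (if none exists), copy the public
# key to the DS server, then verify that passwordless login works.
ssh-keygen -t rsa
ssh-copy-id root@dsserver
ssh root@dsserver hostname

# The private key (~/.ssh/id_rsa) is the one you provide in the host
# registration step of the Add Host wizard.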

2. Configure the DS environment and test Hive

The DS jobserver needs some Hadoop environment settings. These settings mainly specify the Hadoop and Hive home directories on the DS server and some required Hadoop JAR files through CLASSPATH settings.

DS provides a script that sets up this environment; please check the DS Reference Guide 4.2, section 12.1, Prerequisites.

Important: for Hortonworks HDP distributions, DS provides a different script from the documented hadoop_env.sh. For HDP 2.x distributions you should use the script hadoop_HDP2_env.sh instead (this script is only available from DS version 4.2.2 onwards).
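
For example, sourcing the script and checking the resulting environment could look like this. The script location is an assumption; check where your DS installation ($LINK_DIR) actually placed it.

# Adjust LINK_DIR to your DS installation; the script location is an assumption.
export LINK_DIR=/opt/sap/dataservices
cd $LINK_DIR/hadoop/bin

# HDP 2.x with DS 4.2.2+: use hadoop_HDP2_env.sh; otherwise use hadoop_env.sh
. ./hadoop_HDP2_env.sh

# Sanity check of the environment the jobserver will rely on
echo $HADOOP_HOME
echo $HIVE_HOME
echo $CLASSPATH | tr ':' '\n' | head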

On the DS server you should now be able to start hive and test the connection, for instance by running the HiveQL command show databases:

Test hive connection
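
The same test can be run non-interactively, which is handy for scripting the check:

# List the databases Hive knows about; any result proves the local Hive client
# and its configuration on the DS server are working.
hive -e 'show databases;'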

Finally, restart the DS jobserver so that it picks up the same environment settings as your current session. Also make sure that the hadoop_env.sh (or hadoop_HDP2_env.sh) script is sourced during the DS server startup process.

3. Configure the DS Hive adapter and datastore

You may check the official DS documentation and the SCN articles referenced further below to set up the DS Hive adapter.

In my experience the adapter and datastore setup usually does not work without problems right away. The tricky part is to set the CLASSPATH correctly for the adapter. This task is not well documented and depends on the Hadoop distribution and version, so you might end up in a series of trial-and-error configurations:

  1. Configure the Hive adapter. Modify the CLASSPATH setting so that the adapter knows the location of all required Java objects. (Re-)start the adapter.
  2. Set up the Hive datastore in the DS designer. This will also test the Hive adapter. Check the error message for missing Java objects, then return to step 1.

These steps are described in more detail in the following sections. The Troubleshooting section in the blog Configuring Data Services and Hadoop on SCN may also help.

3.1 Setup of the Hive adapter

For the initial setup of the Hive adapter I used the CLASSPATH setting as described in Configuring the Data Services Hive Adapter on SCN. For instance, the initial configuration of the Hive adapter looked like this in my test environment:

Initial setup of the Hive adapter
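
Since the screenshot is not reproduced here: the CLASSPATH field of the adapter is simply a colon-separated list of jar files. A hedged helper like the following can assemble a first candidate list; which jars are really required depends on your DS and HDP versions and is refined in the trial-and-error loop described in the next section.

# Build a colon-separated candidate CLASSPATH from the Hive and Hadoop client
# libraries (illustrative only -- prune or extend it as the adapter demands).
hive_jars=`ls $HIVE_HOME/lib/*.jar | tr '\n' ':'`
hadoop_jars=`ls $HADOOP_HOME/hadoop-common*.jar $HADOOP_HOME/lib/*.jar 2>/dev/null | tr '\n' ':'`
echo "${hive_jars}${hadoop_jars}" | sed 's/:$//'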

3.2 Setup of DS datastore

In the DS designer I created a Hive datastore. The first error message I got when saving the datastore was:

Create Hive datastore – error message

The same error message should be visible in the Hive adapter error log:

Hive adapter – error log

The error message means that the Hive adapter tries to use the Java object org.apache.hadoop.hive.jdbc.HiveDriver but cannot find it. You will need to find the JAR file that contains this Java object and add the full path of this JAR file to the CLASSPATH of the Hive adapter; then return to step 3.1.

There will probably be more than one Java object missing in the initial CLASSPATH setting. Therefore you may very well end up in an iterative process of configuring and re-starting the adapter and testing the adapter by saving the datastore in DS designer.

How do I find the JAR file that contains the missing Java object?

In most cases all required Java objects are in JAR files that are already provided by either the DS installation or the Hadoop installation. They are usually located in one of these directories:

  • $HADOOP_HOME
  • $HADOOP_HOME/lib
  • $HIVE_HOME/lib

I have developed a small shell script that will search for a Java object in all JAR files in a given directory:

[ds@dsserver ~]$ cat find_object_in_jars.sh
#!/bin/sh
# Search all JAR files in a given directory for a given Java class.
# Usage: ./find_object_in_jars.sh <jar_directory> <java_class_name>

jar_dir=$1
object=$2

echo "Object $object found in these jars:"
for jar_file in "$jar_dir"/*.jar
do
  # "jar tf" lists the entries of the archive; the dots in the class name
  # conveniently act as regex wildcards that also match the '/' path separators.
  object_found=`jar tf "$jar_file" | grep "$object"`
  if [ -n "$object_found" ]
  then
    echo "$jar_file"
  fi
done

exit 0

For example, the script above found the object org.apache.hadoop.hive.jdbc.HiveDriver in the file $HIVE_HOME/lib/hive-jdbc-0.12.0.2.0.10.0-1.jar. The full path of this file needs to be added to the CLASSPATH setting of the Hive adapter:

Finding objects in JAR files
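
For completeness, this is how the script is invoked for the driver class from the error message above:

chmod +x find_object_in_jars.sh
./find_object_in_jars.sh $HIVE_HOME/lib org.apache.hadoop.hive.jdbc.HiveDriver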

Note: the script above uses the jar utility. You need to install a Java development package (such as OpenJDK's development package) if the jar utility is not available on your DS server.

4. DS 4.2.1 and the Hive server

DS needs to connect to a Hive server. The Hive server splits the HiveQL commands into MapReduce jobs, accesses the data on the cluster and returns the results to the caller.

DS 4.2.2 and later versions use the current Hive server version, called HiveServer2, to connect to Hive. This version of the Hive server is the default on most Hadoop clusters.

The older version of the Hive server, simply called HiveServer, is usually not started by default on current Hadoop clusters. But DS version 4.2.1 only works with the old HiveServer version.

4.1 Starting the old Hive server version

The old version of the Hive server can easily be started from the hive client tool; see the HiveServer documentation.

The default port number for HiveServer is 10000. But because HiveServer2 might already be running and listening on this port, you should define another port for HiveServer, let's say 10001:

Starting HiveServer
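
A minimal sketch of the start command, assuming the hive service script honours the HIVE_PORT variable as documented for HiveServer1 (some versions also accept the port as a command-line argument):

# Start the old HiveServer (HiveServer1) on port 10001 so it does not clash
# with HiveServer2, which typically listens on 10000.
export HIVE_PORT=10001
nohup hive --service hiveserver > hiveserver1.log 2>&1 &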

You can start HiveServer on a node in the Hadoop cluster or on the DS server. I recommend starting it on the same node where HiveServer2 is already running – this is usually a management node within the Hadoop cluster.

It is also worth testing the HiveServer connection from the hive command line tool on the DS server. When called without parameters, the hive CLI does not act as a client and does not connect to a Hive server (it then acts as a kind of rich client instead). When called with the host and port number parameters, it acts as a client and connects to the Hive server. For instance:

Testing the HiveServer connection
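
A minimal sketch of such a test; the host name is made up, use the node on which the old HiveServer was started:

# -h/-p switch the hive CLI into client mode; at the hive> prompt run a simple
# statement such as: show databases;
hive -h mgmtnode.example.com -p 10001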

5. Upgrading DS or Hadoop

After upgrading either DS or Hadoop you might need to reconfigure the Hive connection.

5.1 Upgrading DS

Upgrading from DS 4.2.1 to 4.2.2 requires a switch from HiveServer to HiveServer2, see section 4 above.

Furthermore, newer releases of DS might use other Java objects. Then the CLASSPATH of the Hive adapter needs to be adapted as described in section 3.

5.2 Upgrading Hadoop

After upgrading Hadoop, the names of the jar files have most often changed, because the names contain the version number. For example:

hive*.jar files in $HIVE_HOME/lib

For instance, when upgrading HDP from version 2.0 to 2.1, the HDP upgrade process replaced some jar files with newer versions. The CLASSPATH setting of the Hive adapter needs to be modified accordingly. I found the following approach quite easy:

  • Open the Hive adapter configuration page in the DS management console
  • Save the configuration again. You will get an error message if the CLASSPATH contains a file reference that DS cannot resolve. For instance:
    Hive adapter – saving an invalid CLASSPATH

    Search for the new file name in the relevant file system directory and modify the CLASSPATH of the Hive adapter accordingly. Save again…

Note: the HDP installation and upgrade process maintains symbolic links without version numbers in their names. The symbolic links point to the latest version of the respective jar file. It is of course a good idea to reference these symbolic links in the CLASSPATH setting of the Hive adapter, but unfortunately not all of the required jar files have symbolic links.
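
A quick way to check which versioned jars and versionless symlinks the upgrade left behind (a sketch; the directory contents will differ per HDP version):

# List the hive jars together with any versionless symlinks HDP maintains
ls -l $HIVE_HOME/lib/hive-*.jar

# Where a symlink such as hive-jdbc.jar exists, reference it in the adapter
# CLASSPATH so that the next upgrade does not invalidate the setting; otherwise
# reference the versioned file and revisit it after each upgrade.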