Archive for the ‘Hadoop’ Tag

Connecting SAP DataServices to Hadoop: HDFS vs. Hive

SAP DataServices (DS) supports two ways of accessing data on a Hadoop cluster:

  1. HDFS:
    DS reads HDFS files directly from Hadoop. In DataServices you need to create HDFS file formats in order to use this setup. Depending on your dataflow, DataServices might read the HDFS file directly into the DS engine and then handle the further processing in its own engine. If your dataflow contains more logic that could be pushed down to Hadoop, DS may instead generate a Pig script. The Pig script will then not just read the HDFS file but also handle other transformations, aggregations etc. from your dataflow. The latter scenario is usually the preferred setup for large amounts of data, because this way the Hadoop cluster can provide the processing power of many Hadoop nodes on inexpensive commodity hardware. The pushdown of dataflow logic to Pig/Hadoop is similar to the pushdown to relational database systems: it depends on whether the dataflow uses functions that can be processed in the underlying system (and of course on whether DS knows about all the capabilities of the underlying system). In general, the most common functions, joins and aggregations in DS should be eligible to be pushed down to Hadoop.
  2. Hive / HCatalog:
    DS reads data from Hive. Hive accesses data that is defined in HCatalog tables. In DataServices you need to set up a datastore that connects to a Hive adapter. The Hive adapter will in turn read tables from Hadoop by accessing a Hive server. DS will generate HiveQL commands. HiveQL is similar to SQL, therefore the pushdown of dataflow logic to Hive works similarly to the pushdown to relational database systems.

It is difficult to say which approach is better to use in DataServices: HDFS files/Pig or Hive? Both Pig and Hive generate MapReduce jobs in Hadoop, therefore performance should be similar. Nevertheless, some aspects can be considered before deciding which way to go. I have tried to describe these aspects in the comparison below. They are probably not complete and they overlap, so in many cases they will not identify a clear favorite. But the comparison below may still give some guidance in the decision process.

Setup connectivity
  • HDFS / Pig: Simple, via HDFS file formats.
  • Hive: Not simple. A Hive adapter and a Hive datastore need to be set up. The CLASSPATH setting for the Hive adapter is not trivial to determine, and there are different setup scenarios for DS 4.1 and for DS 4.2.2 or later releases (see also Connecting SAP DataServices to Hadoop Hive).
Overall purpose
  • HDFS / Pig:
    Native HDFS access is only advisable if all the data in Hadoop necessarily needs to be processed within DataServices, or if the data volume is not too large.
    Pig covers more of a data flow (in a general sense, not to be confused with DataServices dataflows): a Pig script processes various transformation steps and writes the results into a target table/file. A Pig script might therefore suit DataServices jobs that read, transform and write back data from/to Hadoop.
  • Hive:
    Hive queries are mainly intended for DWH-like queries. They might suit DataServices jobs that need to join and aggregate large data volumes in Hadoop and write the results into pre-aggregated tables in a DWH or some other datastore accessible to BI tools.
Runtime performance
  • HDFS / Pig:
    1. Loading flat files from HDFS without any further processing: faster than Hive.
    2. Mapping, joining or aggregating large amounts of data (provided the logic gets pushed down to Pig): performance is determined by the MapReduce jobs in Hadoop – therefore similar to Hive.
    3. Mapping, joining or aggregating small amounts of data: the processing might even run faster in the DS engine. It might therefore be an option to force DS not to push down the logic to Pig and just read the HDFS file natively.
  • Hive:
    1. Loading all data of a table without processing / aggregation: slower than native HDFS, because of the unnecessary setup of MapReduce jobs in Hadoop.
    2. Mapping, joining or aggregating large amounts of data (provided the logic gets pushed down to Hive): performance is determined by the MapReduce jobs in Hadoop – therefore similar to Pig.
    3. Mapping, joining or aggregating small amounts of data: there is no way to bypass Hive/HiveQL, so there will always be some MapReduce jobs initiated by Hadoop, even if the data volume is small. The overhead of initiating these MapReduce jobs usually takes some time and may outweigh the data processing itself if the data volume is small.
Development aspects
  • HDFS / Pig:
    HDFS file formats need to be defined manually in order to define the data structure in the file.
    On the other hand, the HDFS file format can easily be generated from the Schema Out view in a DataServices transform (in the same way as for local file formats).
    No data preview is available.
  • Hive:
    HCatalog tables can be imported like database tables. The table structure is already predefined by HCatalog; the HCatalog table might already exist, otherwise it still needs to be specified in Hadoop.
    Template tables do not work with Hive datastores.
    From DS version 4.2.3 onwards, data can be previewed in the DS Designer.
Data structure
In general, HDFS file formats will suit unstructured or semi-structured data better. There is little benefit in accessing unstructured data via HiveQL, because Hive will first save the results of a HiveQL query into a directory on the local file system of the DS engine, and DS will then read the results from that file for further processing. Reading the data directly via an HDFS file might be quicker.
Future technologies: Stinger, Impala, Tez etc.
Some data access technologies will improve the performance of HiveQL or Pig Latin significantly (some already do so today). Most of them will support HiveQL, whereas some will support both HiveQL and Pig. The decision between Hive and Pig might therefore also depend on the (future) data access engines in the given Hadoop environment.

Connecting SAP DataServices to Hadoop Hive

Connecting SAP DataServices to Hadoop Hive is not as simple as connecting to a relational database for example. In this post I want to share my experiences on how to connect DataServices (DS) to Hive.

The DS engine cannot connect to Hive directly. Instead you need to configure a Hive adapter from the DS management console which will actually manage the connection to Hive.

In the rest of this post I will assume the following setup:

  • DS is not installed on a node in the Hadoop cluster, but has access to the Hadoop cluster. The DS server should run on a Unix server. I think that such a setup is typical in most cases.
  • The Hadoop cluster consists of a Hortonworks Data Platform (HDP) 2.x distribution. Other distributions should work similarly, though.

1. Configure the DS server

The DS server will not be installed within the Hadoop cluster, but it will have access to it. Therefore Hadoop needs to be installed and configured appropriately on the DS server. I won't go into too much detail for this step, because the given environment may be quite different from my test environment.

Roughly, there are two approaches for installing Hadoop on the DS server:

  1. Manual installation:
    you may follow the instructions on Configuring Data Services on a machine not in your Hadoop cluster on SCN, sections Hadoop and Hive. I have never tried this approach, though.
  2. Installation via Ambari (preferred):
    this approach will initiate a Hadoop installation on the DS server from the Ambari system of the existing Hadoop cluster. The installation process is quite automated and will integrate the DS server as a kind of Hadoop client into the Ambari monitoring system of the cluster.
    In Ambari you need to select the Add Host option. This will start the Add Host Wizard:

    Ambari – Add host wizard

    Before you can start the wizard you need to enable a passwordless login from the Ambari server to the DS server using an SSH private key (see the sketch after this list).

    Later in the wizard process, you need to configure the DS server as a client host; this way all required Hadoop libraries, Hadoop client tools and the Ambari agent will be installed and configured on the DS server. Ambari will also monitor the availability of the DS server.
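
    A minimal sketch of how the passwordless login can be prepared on the Ambari server (the user root and the host name dsserver are placeholders for your environment):

    # on the Ambari server: create a key pair if none exists yet
    ssh-keygen -t rsa
    # copy the public key to the DS server (root or another suitable user)
    ssh-copy-id root@dsserver
    # quick test: this should log in without a password prompt
    ssh root@dsserver hostname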

2.  Configure DS environment and test Hive

The DS jobserver needs some Hadoop environment settings. These settings mainly specify the Hadoop and Hive home directories on the DS server and some required Hadoop JAR files through CLASSPATH settings.

DS provides a script that sources these environment settings; please check the DS Reference Guide 4.2, section 12.1, Prerequisites.

Important: for Hortonworks HDP distributions, DS provides a different script from the documented hadoop_env.sh. For HDP 2.x distributions you should use the script hadoop_HDP2_env.sh instead (this script is only available from DS version 4.2.2 onwards).
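
A sketch of how I source the environment on an HDP 2.x jobserver host before testing Hive (the -e option only sets the environment variables, as mentioned in the text analysis post further down; treat the exact invocation as an assumption and check the Reference Guide for your DS release):

cd $LINK_DIR/hadoop/bin
# source the HDP 2.x variant of the script into the current shell
source ./hadoop_HDP2_env.sh -e
# verify that the most important variables are set
echo $HADOOP_HOME $HIVE_HOME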

On the DS server you should be able to start hive and test the connection, for instance by running the HiveQL command show databases:

Test hive connection
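
The same test can be run non-interactively from the shell, for example (assuming the hive client is on the PATH after sourcing the environment script):

# run a single HiveQL statement; a listing that contains at least the
# "default" database indicates a working Hive client setup
hive -e 'show databases;'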

Finally, restart the DS jobserver so that it has the same environment settings as your current session. Also make sure that the hadoop_env.sh script will be started during the DS server startup process.

3. Configure the DS Hive adapter and datastore

You may check the following documentation to set up the DS Hive adapter:

In my experience these two setup steps usually will not work without problems right away. The tricky part is to set the CLASSPATH correctly for the adapter. This task is not well documented and depends on the Hadoop distribution and version. Therefore you might end up in a series of trial-and-error configurations:

  1. Configure the Hive adapter. Modify the CLASSPATH setting so that the adapter knows the location of all required Java objects. (Re-)start the adapter.
  2. Set up the Hive datastore in the DS designer. This will also test the Hive adapter. Check the error message for missing Java objects. Return to step 1.

These steps are described in more detail in the following sections. The Troubleshooting section in the blog Configuring Data Services and Hadoop on SCN may also help.

3.1 Setup of the Hive adapter

For the initial setup of the Hive adapter I used the CLASSPATH setting as described in Configuring the Data Services Hive Adapter on SCN. For instance, the initial configuration of the Hive adapter looked like this in my test environment:

Initial setup of the Hive adapter

3.2 Setup of DS datastore

In the DS designer I created a Hive datastore. The first error message I got when saving the datastore was:

Create Hive datastore – error message

The same error message should be visible in the Hive adapter error log:

Hive adapter – error log

The error message means that the Hive adapter tries to use the Java object org.apache.hadoop.hive.jdbc.HiveDriver but cannot find it. You will need to find the corresponding JAR file that contains this Java object and add the full path of this JAR file to the Hive adapter's CLASSPATH; then return to step 3.1.

There will probably be more than one Java object missing in the initial CLASSPATH setting. Therefore you may very well end up in an iterative process of configuring and re-starting the adapter and testing the adapter by saving the datastore in DS designer.

How do I find the JAR file that contains the missing Java object?

In most cases all required Java objects are in JAR files that are already provided by either the DS installation or the Hadoop installation. They are usually located in one of these directories:

  • $HADOOP_HOME
  • $HADOOP_HOME/lib
  • $HIVE_HOME/lib

I have developed a small shell script that will search for a Java object in all JAR files in a given directory:

[ds@dsserver ~]$ cat find_object_in_jars.sh
#!/bin/sh
# Search all JAR files in a given directory for a given Java object/class name.
# Usage: find_object_in_jars.sh <jar_directory> <java_object_name>

jar_dir=$1
object=$2

echo "Object $object found in these jars:"
for jar_file in "$jar_dir"/*.jar
do
  # list the JAR contents and check whether the object name appears in it
  object_found=`jar tf "$jar_file" | grep "$object"`
  if [ -n "$object_found" ]
  then
    echo "$jar_file"
  fi
done

exit 0

For example, the script above found the object org.apache.hadoop.hive.jdbc.HiveDriver in the file $HIVE_HOME/lib/hive-jdbc-0.12.0.2.0.10.0-1.jar. The full path of this file needs to be added to the CLASSPATH setting of the Hive adapter:

Finding objects in JAR files

Note: the script above uses the jar utility. You need to install a Java development package (such as the OpenJDK development package) if the jar utility is not available on your DS server.
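
For the HiveDriver example from above, a call of the script looks like this (the JAR file name will of course differ per Hive version):

chmod +x find_object_in_jars.sh
# search the Hive library directory for the JDBC driver class
./find_object_in_jars.sh $HIVE_HOME/lib org.apache.hadoop.hive.jdbc.HiveDriver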

4. DS 4.2.1 and hive server

DS needs to connect to a Hive server. The Hive server actually splits the HiveQL commands into MapReduce jobs, accesses the data on the cluster and returns the results to the caller.

DS 4.2.2 and later versions use the current Hive server, called HiveServer2, to connect to Hive. This version of the Hive server is the default on most Hadoop clusters.

The older version of the Hive server, simply called HiveServer, is usually not started by default on current versions of Hadoop clusters. But DS version 4.2.1 only works with the old Hive server version.

4.1 Starting the old hive server version

The old version of the Hive server can easily be started from the hive client tool; see the HiveServer documentation.

The default port number for HiveServer is 10000. But because HiveServer2 might already be running and listening on this port, you should define another port for HiveServer, let's say 10001:

Starting HiveServer
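
As a sketch, the start of the old HiveServer on port 10001 from the hive client tool looks roughly like this (the exact port flag may differ per Hive release; the HIVE_PORT environment variable may be an alternative):

# start the old Thrift-based HiveServer on a non-default port, in the background
hive --service hiveserver -p 10001 &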

You can start the HiveServer on a node in the Hadoop cluster or on the DS server. I recommend starting it on the same node where HiveServer2 is already running – this is usually a management node within the Hadoop cluster.

It is also worth testing the HiveServer connection from the hive command line tool on the DS server. When called without parameters, the hive CLI tool does not act as a client tool and does not connect to a Hive server (it then acts as a kind of rich client instead). When called with the host and port number parameters, it will act as a client tool and connect to the Hive server. For instance:

Testing a HiveServer connection
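
A corresponding client-mode test from the DS server could look like this sketch (host name and port are placeholders; the -h/-p options of the old hive CLI are assumptions for your Hive release):

# connect the hive CLI as a client to the HiveServer started above,
# then run "show databases;" at the hive prompt
hive -h mgmtnode -p 10001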

5. Upgrading DS or Hadoop

After upgrading either DS or Hadoop you might need to reconfigure the Hive connection.

5.1 Upgrading DS

Upgrading from DS 4.2.1 to 4.2.2 requires a switch from HiveServer to HiveServer2, see section 4 above.

Furthermore, newer releases of DS might use other Java objects. The CLASSPATH of the Hive adapter then needs to be adapted as described in section 3.

5.2 Upgrading Hadoop

After upgrading Hadoop, the names of the JAR files will most often have changed because their names contain the version number. For example:

hive*.jar files in $HIVE_HOME/lib

For instance, when upgrading HDP from version 2.0 to 2.1, the HDP upgrade process replaced some JAR files with newer versions. The CLASSPATH setting of the Hive adapter needs to be modified accordingly. I found this approach quite easy:

  • Open the Hive adapter configuration page in the DS management console
  • Save the configuration again. You will get an error message if the CLASSPATH contains a file reference that DS cannot resolve. For instance:
    Hive adapter: saving an invalid CLASSPATH

    Search for the new file name in the relevant file system directory and modify the CLASSPATH of the Hive adapter accordingly. Save again…

Note: the HDP installation and upgrade process maintains symbolic links without a version number in their names. The symbolic links point to the latest version of the JAR file. It is of course a good idea to reference these symbolic links in the CLASSPATH setting of the Hive adapter. But unfortunately not all of the required JAR files have such symbolic links.
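
A simple directory listing shows which of the referenced JAR files have such a version-independent symbolic link, for example:

# entries displayed as "name.jar -> name-<version>.jar" are symbolic links
# that survive an HDP upgrade without a CLASSPATH change
ls -l $HIVE_HOME/lib/hive-*.jar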

SAP DataServices Text Analysis and Hadoop – the Details

I have already used the text analysis feature of SAP DataServices in various projects (the transform in DataServices is called Text Data Processing, or TDP in short). Usually, the TDP transform runs in the DataServices engine, meaning that DataServices first loads the source text into its own memory and then runs the text analysis on its own server / engines.

The text sources are usually unstructured text or binary files such as Word, Excel, PDF files etc. If these files reside on a Hadoop cluster as HDFS files, DataServices can also push down the TDP transform as a MapReduce job to the Hadoop cluster.

Why run DataServices Text Analysis within Hadoop?

Running the text analysis within Hadoop (that is, as MapReduce jobs) can be an appealing approach if the total volume of source files is large and at the same time the Hadoop cluster has enough resources. In that case the text analysis might run much quicker inside Hadoop than within DataServices.

The text analysis extracts entities from unstructured text and as such it transforms unstructured data into structured data. This is a fundamental prerequisite for running any kind of analysis on the data in the text sources. You only need the text sources as input for the text analysis. Afterwards you could theoretically delete the text sources, because they are no longer needed for the analysis process. But in reality you will still want to keep the text sources, for various reasons:

  • Quality assurance of the text analysis process: doing manual spot checks by reviewing the text source.
    Is the generated sentiment entity correct? Is the generated organisation entity Apple correct, or does the text rather refer to the fruit apple? …
  • As an outcome of the QA you might want to improve the text analysis process and therefore rerun the text analysis on documents that have already been analyzed.
  • You might want to run other kinds of text analysis on the same documents. Let's say you have done sentiment analysis on text documents until now. In the future you might have a different use case and want to analyze customer requests in the same documents.

Anyway, in all these scenarios the text documents are only used as a source. Any BI solution will rely on the structured results of the text analysis, but it will not need to look up any more information from these documents. Keeping large volumes of these documents in a Hadoop cluster can therefore be a cost-effective solution. If at the same time the text analysis process can run in parallel on multiple nodes of that cluster, the overall performance might improve – again on comparatively cheap hardware.

I was therefore curious how the push-down of the DataServices TDP transform to Hadoop works technically. Unfortunately my test cluster is too small to test the performance. Instead my test results focus on technical functionality only. Below are my experiences and test results:

My Infrastructure

For my tests I am using a Hortonworks Data Platform (HDP) 2.1 installation on a cluster with just one node. All HDP services are running on this node. The cluster is installed on a virtual machine with CentOS 6.5, 8 GB memory and 4 cores.

DataServices 4.2.3 is installed on another virtual machine based on CentOS 6.5, 8 GB memory and 4 cores.

My Test Cases

I have been running 3 test cases:

  1. Text analysis on 1000 Twitter tweets in one CSV file:
    The tweets are a small sample from a bigger file extracted from Twitter on September 9 using a DataSift stream. The tweets in the stream were filtered; they all mention Apple products like iPhone, iPad, AppleWatch and so on.

    Dataflow TA_HDP_APPLE_TWWET

  2. 3126 binary MSG files. These files contain emails that I had sent out in 2013:

    Dataflow TA_HDP_EMAILS

  3. 3120 raw text files. These files contain emails that I had sent out in 2014:

    Dataflow DF_TA_HDP_EMIALS_TEXT

I have been running each of these test cases in two variants:

  • Running the TDP transform within the DS engine. In order to achieve this, the source files were saved on a local filesystem
  • Running the TDP transform within Hadoop. In order to achieve this, the source files were saved on the Hadoop cluster.

The TDP transform is basically the same in both variants of the same test case, i.e. the transforms are copies of each other with exactly the same configuration options. For all the test cases I have configured standard entity extraction in addition to a custom dictionary which defines the names of Apple products. Furthermore, I specified sentiment and request analysis for the English and German languages:

TDP configuration options

Whether a TDP transform runs in the DS engine or is pushed down to Hadoop depends solely on the type of source documents: on a local file system or in HDFS.

Finally, I wanted to compare the results between both variants. Ideally, there should be no differences.

DataServices vs. Hadoop: comparing the results

Surprisingly, there are significant differences between the two variants. When the TDP transform runs in the DS engine it produces much more output than when the same transform runs inside Hadoop as a MapReduce job. More precisely, for some source files the MapReduce job did not generate any output at all, while the DS engine generated output for the same documents:

Comparing number of documents and entities between the two variants

For example, the DS engine produced 4647 entities from 991 documents. This makes sense, because 9 tweets do not contain any meaningful content at all, so the text analysis does not generate any entities for these documents. But the MapReduce job generated 2288 entities from only 511 documents. This no longer makes sense. The question is: what happened to the other 502 documents?

The picture is similar for the other two test cases analyzing email files. So, the problem is not related to the format of the source files. I tried to narrow down this problem within Hadoop – please see the section Problem Tracking –> Missing entities at the end of this blog for more details.

On the other hand, for those documents for which both the DS engine and the MapReduce job generated entities, the results of the two variants are nearly identical. For example, these are the results of both variants for one example tweet:

TDP output for one tweet – TDP running in DS

TDP output for one tweet – TDP running in Hadoop

The minor differences are highlighted in the Hadoop version. They can actually be ignored, because they won’t have any impact on a BI solution processing this data. Please note that the entity type APPLE_PRODUCT has been derived from the custom dictionary that I provided, whereas the entity type PRODUCT has been derived from the system provided dictionary.

We can also see that the distribution of entity types is nearly identical between the two variants, just the number of entities is different. The chart below shows the distribution of entity types across all three test cases:

Distribution of entity types

Preparing Hadoop for TDP

It is necessary to read the relevant section in the SAP DataServices reference manual. Section 12.1.2 Configuring Hadoop for text data processing describes all the necessary prerequisites. The most important topics here are:

Transferring TDP libraries to the Hadoop cluster

The $LINK_DIR/hadoop/bin/hadoop_env.sh -c script will simply transfer some required libraries and the complete $LINK_DIR/TextAnalysis/languages directory from the DS server to the Hadoop cluster. The target location in the Hadoop cluster is set in the environment variable $HDFS_LIB_DIR (this variable will be set by the same script when it is called with the -e option). You need to provide this directory in the Hadoop cluster with appropriate permissions. I recommend that the directory in the Hadoop cluster is owned by the same login under which the Data Services engines are running (in my environment this is the user ds).
After $LINK_DIR/hadoop/bin/hadoop_env.sh -c has been executed successfully the directory in HDFS will look something like this:

$HDFS_LIB_DIR in HDFS

Note that the $LINK_DIR/TextAnalysis/languages directory will be compressed in HDFS.
Important: If you are using the Hortonworks Data Platform (HDP) distribution version 2.x you have to use another script instead: hadoop_HDP2_env.sh. hadoop_env.sh does not transfer all the required libraries for HDP 2.x installations.
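
The copy step and a quick verification can be done from the DS server; a minimal sketch (assuming $HDFS_LIB_DIR has been set, e.g. by sourcing the environment script with the -e option first):

# transfer the TDP libraries and the languages directory to the Hadoop cluster
$LINK_DIR/hadoop/bin/hadoop_HDP2_env.sh -c
# verify that the files arrived in the target HDFS directory
hdfs dfs -ls $HDFS_LIB_DIR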

Optimizing text data processing for use in the Hadoop framework

The description in the SAP DataServices reference manual, section 12.1.2 Configuring Hadoop, only works for older Hadoop distributions because the Hadoop parameters it refers to are deprecated. For HDP 2.x I had to use different parameters. See the section Memory Configuration in YARN and Hadoop later in this blog for more details.

HDFS source file formats

Reading the SAP DataServices reference manual word-for-word, it says that only unstructured text files work for the push-down to Hadoop and furthermore that the source file must be connected (directly?) to the TDP transform:

Extract from DataServices Reference Guide: sources for TDP

My tests – fortunately – showed that the TDP MapReduce job can handle more formats:

  • HDFS Unstructured Text
  • HDFS Unstructured Binary
    I just tested MSG and PDF files, but I assume that the same binary formats as documented in the SAP DataServices 4.2 Designer Guide (see section 6.10 Unstructured file formats) will work. The TDP transform is able to generate some metadata for unstructured binary files (the TDP option Document Properties): this feature does not work when the TDP transform is pushed down to Hadoop. The job still analyzes the text in the binary files, but the MapReduce job will temporarily stage the entity records for the document metadata. When this staging file is read and the records are passed back to the DataServices file format, warnings will be thrown. Thus you will still receive all generated text entities, but not those for the document metadata.
  • HDFS CSV files:
    I was also able to place a query transform between the HDFS source format and the TDP transform and specify some filters, mappings and so on in the query transform. All these settings still got pushed down to Hadoop via a Pig script. This is a useful feature; the push-down seems to work very similarly to the push-down to relational databases: as long as DataServices knows about corresponding functions in the source system (here HDFS and the Pig Latin language) it pushes as much as possible down to the underlying system instead of processing the functions in its own engine.

TDP pushed down to Hadoop

It is very useful to roughly understand how the TDP push-down to Hadoop works, because it will ease performance and problem tracking. Once all prerequisites for the push-down are met (see the previous section Preparing Hadoop for TDP –> HDFS source file formats), DataServices will generate a Pig script which in turn starts the TDP MapReduce job on the Hadoop cluster. The generated Pig script will be logged in the DataServices trace log file:

DS trace log with pig script

The Pig script runs on the DataServices server. But the commands in the Pig script will be executed against the Hadoop cluster. This means – obviously – that the Hadoop client libraries and tools must be installed on the DataServices server. The Hadoop client does not necessarily need to be part of the Hadoop cluster, but it must have access to the Hadoop cluster. More details on the various installation options are provided in the SAP DataServices reference manual or on SCN – Configuring Data Services and Hadoop.
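
A quick sanity check on the DataServices server is to verify that the Pig client is installed and that the Hadoop client can reach the cluster, for example:

# print the version of the locally installed Pig client
pig -version
# list the HDFS root directory of the remote cluster
hdfs dfs -ls /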

In my tests I have noticed two different sorts of Pig scripts that can be generated, depending on the source file formats.

  1. CSV files
    The Pig script for analyzing text fields in CSV files usually looks similar to the one shown below. Obviously, DataServices provides its own Pig functions for loading and storing data. They are also used for staging temporary results within the Pig script. During problem tracking it might be useful to check these files. All required files and information are located in one local directory on the DataServices server:

    Pig script for CSV files

    Given all this information, you might also run performance tests or problem analysis by executing or modifying the Pig script directly on the DataServices server.

    Pig directory

  2. Unstructured binary or text files
    The Pig script for unstructured binary or text files looks a bit different. This is obviously because the MapReduce jar file provided by DataServices can read these files directly, whereas in the case of CSV files some additional pre-processing needs to happen in order to extract the text fields from the CSV file. In the case of unstructured binary or text files the Pig script calls another Pig script which in turn runs a Unix shell command (I don't know what the purpose of these two wrapper Pig scripts is, though). The Unix shell command simply runs the MapReduce job within the Hadoop framework:

    Pig script for unstructured files

What if DS does not push down the TDP transform to Hadoop?

Unstructured binary or text HDFS files:
If unstructured binary or text HDFS files are directly connected to the TDP transform, the transform should get pushed down to Hadoop. In most cases it does not make sense to place other transforms between the HDFS source file format and the TDP transform. If you do have one or more query transforms after the HDFS source file format, you should only use functions in the query transforms that DataServices is able to push down to Pig. As long as you are using common standard functions such as substring() and so on, the chances are high that DataServices will push them down to Pig. In most cases you will have to find out on your own which functions are candidates for a push-down, because they are not documented in the SAP manuals.

CSV files:
Basically the same rules apply as for unstructured binary or text files. In addition there are potential pitfalls with the settings in the CSV file format:
keep in mind that DataServices provides its own functions for reading HDFS CSV files (see the previous section TDP pushed down to Hadoop). We don't know about the implemented features of the DSLoad Pig function and they are not documented in the SAP manuals. It may be that some of the settings in the CSV file format are not supported by the DSLoad function. In this case DataServices cannot push down the TDP transform to Hadoop, because it first needs to read the complete CSV file from HDFS into its own engine. It will then process the specific CSV settings locally and also process the TDP transform locally in its own engine.

A typical example of this behaviour is the row skipping options in the CSV file format. They are apparently not supported by the DSLoad function. If you set these options, DataServices will not push down the TDP transform to Hadoop:

HDFS file format options for CSV

Memory Configuration in YARN and Hadoop

I managed to run a MapReduce job using standard entity extraction and English sentiment analysis with the default memory settings for YARN and MapReduce (that is, the settings initially provided when I set up the cluster). But when using the German voice-of-customer libraries (for sentiment or request analysis) I had to increase some memory settings. According to SAP the German language is more complex, so these libraries require much more memory. Please note that the memory required for the MapReduce jobs depends on these libraries and not on the size of the source text files that will be analyzed.

If you do not configure enough memory the TDP MapReduce job will fail. Such problems cannot be tracked using the DataServices errorlogs alone. See the section Problem Tracking later in this blog.

I followed the descriptions in the Hortonworks manuals about YARN and MapReduce memory configurations. This way I had overridden the following default settings in my cluster (just as an example!):

Overridden YARN configurations

Overridden MR2 configurations
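
If you prefer the command line over the Ambari UI, the effective values can also be checked in the Hadoop configuration files on the cluster node; a sketch (the parameter names are the standard YARN/MR2 memory settings of HDP 2.x, the file locations are the HDP defaults):

# memory available to YARN containers per NodeManager
grep -A 1 'yarn.nodemanager.resource.memory-mb' /etc/hadoop/conf/yarn-site.xml
# memory per map/reduce task and the corresponding JVM heap settings
grep -A 1 -E 'mapreduce\.(map|reduce)\.(memory\.mb|java\.opts)' /etc/hadoop/conf/mapred-site.xml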

Using these configurations I managed to run the TDP transform with German VOC libraries as MapReduce jobs.

Problem Tracking

Problems while running a TDP transform as a MapReduce job are usually more difficult to track. The errorlog of the DataServices job will simply contain the errorlog provided by the generated Pig script. It might already point you to the source of the problem. Other kinds of problems, however, might not be listed here. Instead you need to review the relevant Hadoop or MapReduce log files within Hadoop.

For example, in my case I identified a memory issue within MapReduce as described below.

DataServices errorlog

If a TDP MapReduce job fails due to insufficient memory, the DataServices errorlog might look similar to this:

DS errorlog with failed TDP MapReduce job

In this DS errorlog excerpt you just see that one MapReduce job from the generated Pig script failed, but you don't see any details about the failure. Also, the other referenced log files pig_*.err and hdfsRead.err do not provide any more details. In such cases you need to review the errorlog of the MapReduce job in Hadoop to get more details.

MapReduce errorlogs

In general it is helpful to roughly understand how logging works inside Hadoop. In this specific case of YARN and/or MapReduce I found the chapter Log files in Hortonworks’ YARN and MR2 documentation useful (especially section MapReduce V2 Container Log Files).

It is also good to know about YARN log aggregation; see HDP configuration files for more information. Depending on the setting of the yarn.log-aggregation-enable parameter, the most recent log files are either still on the local file system of the cluster nodes or they are compressed within the HDFS cluster.
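
With log aggregation enabled, the complete logs of a finished TDP MapReduce job can also be pulled on the command line; a sketch (the application ID is a placeholder and can be taken from the Pig output or from the ResourceManager UI):

# list recent YARN applications and note the ID of the failed TDP job
yarn application -list -appStates FINISHED,FAILED,KILLED
# dump the aggregated container logs of that application
yarn logs -applicationId application_1400000000000_0001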

In any case, you can start the error log review with tools such as Hue or the JobHistory server provided with Hadoop. In my example I used the Job browser view within Hue. Clicking on the TDP MapReduce job lists various failed attempts of the job to obtain a so-called memory container from YARN:

Failed task attempts

The log file of one of these attempts will show the reason for failure, for instance low memory:

Error log for low memory

In this case the MapReduce job couldn’t start at all and an error will be returned to the DataServices engine which started the Pig script. So in this case the DataServices job will also get a failed status.

Other types of errors to watch out for

Once a task attempt succeeds it does not necessarily mean that the task itself will succeed; it only means that the task has successfully started. Any errors during task execution will then be logged in the task log file. In the case of TDP, a task on one node will sequentially analyze all the text source files. If the task fails when analyzing one of the source documents, the error will be logged but the task will go ahead with the next source document. This is typical fault-tolerant behaviour of MapReduce jobs in Hadoop.

Important: although errors with individual documents or records get logged in the log file of the MapReduce TDP task, the overall result of the MapReduce job will be SUCCEEDED and no errors or even warnings get returned to DataServices. The DataServices job will get a successful status in such situations.

It is therefore important to monitor and analyze the MapReduce log files in addition to the DataServices error logs!

During my tests I found some other typical errors in the MapReduce log files that hadn’t been visible in the DataServices errorlog. For example:

  • Missing language libraries:
    MapReduce error log: missing language libraries

    I configured automatic language detection in the TDP transform, but forgot to install all the available TDP languages during DS installation. After installing all the languages and re-executing $LINK_DIR/hadoop/bin/hadoop_env.sh -c these errors disappeared.
    (BTW: when running the same job with the TDP transform in the DS engine instead of having it pushed down to Hadoop, DataServices did not print any warnings in the errorlog, so I probably would never have caught this issue.)

  • Non-ASCII characters in text sources:
    In the Twitter test case some tweets may contain special characters such as the © sign or similar symbols. The TDP transform (or MapReduce job) cannot interpret such characters as raw text and then aborts the analysis of the document.
    If the TDP runs within the DS engine these errors will be printed as warnings in the DS errorlog.
    If the TDP runs as a MapReduce job there will be no warnings in the DS errorlog, but the errorlog of the MapReduce job will contain these errors.
  • Missing entities:
    As described above I found that the TDP transform running as a MapReduce job generates much less entities than when running the same TDP transform (with the same input files) within the DS engine.
    To narrow down the problem I checked the stdout log file of the MapReduce job. For each text source file (or for each record in a CSV file) it prints a message like this if the analysis succeeds:

    Map Reduce stdout: analyzing text documents

    In order to understand how many text sources had been successfully analyzed I grepped the stdout file:

    MapReduce stdout: check number of successfully analyzed documents

    Well, in this particular case it actually didn't help me to solve the problem: there had been 3126 text files as sources for the TDP transform. According to stdout all of them had been successfully analyzed by the TDP job. Nevertheless, text entities had been generated for only 2159 documents. This would actually mean that 967 documents are empty or have no meaningful content, so that the text analysis does not generate text entities for them. But because the same TDP transform generates entities for 2662 documents when running in the DS engine, there seems to be something wrong in the MapReduce job of the TDP transform!

    Having a closer look at how the generated entities are distributed across various metadata, I recognized that entities were missing only for documents for which the auto language detection mechanism had determined English as the language:

    Entities per language

    With these findings I managed to narrow down the problem to the language settings in the TDP transform. If I change the setting from auto detection (with English as the default language) to English as a fixed language, both jobs (DS engine and Hadoop MapReduce) produce the exact same number of entities.

    I have passed this issue on to SAP – hopefully there will be a fix available soon.

Upgrading Hortonworks HDP from 2.0 to 2.1

Today I have upgraded my personal HDP cluster from version 2.0 to version 2.1. The cluster runs completely on a CentOS 6 VM on my notebook, so it just consists of one single node hosting the namenode, datanode and all other services. Besides this I have a second Linux VM hosting SAP Data Services 4.2 with connectivity to the HDP cluster. The HDP installation on that machine can be considered a kind of Hadoop client. The HDP software on that VM needed to be upgraded as well.

Given these resources I am using these VMs just for evaluating Hadoop functionality. I cannot run any useful performance tests on these machines. I am also not a Hadoop administrator, but I use Hadoop occasionally as part of some analytic data processing.

For those people who are in a similar situation as me and who want to upgrade a test HDP 2.0 cluster I’d like to share my experiences with this upgrade:


Hortonworks has documented two upgrade approaches:

  1. Upgrade from HDP 2.0 to HDP 2.1 Manually
  2. Upgrading the HDP Stack from 2.0 to 2.1 (Ambari Upgrade)

I was using the second option with Ambari. Because of the term Ambari Upgrade you might presume a more or less automated procedure from within Ambari. Unfortunately this is not the case. There are many manual steps involved. You should plan roughly one day for the complete process (for small test clusters such as my installation).

If you thoroughly follow all the documented steps everything should work fine. Nevertheless, there were a few steps in the upgrade instructions that were not absolutely clear to me and that could be potential pitfalls. Here are my notes on these steps:

Section 1.10: Upgrading Ambari Server to 1.6.1 –> Upgrade the Nagios add-ons package

The previously downloaded package didn't contain any upgrade for the Nagios add-ons. It is still the same version as for Ambari 1.5. I am not sure whether I missed something here or whether there is really no add-on upgrade with Ambari 1.6. Anyway, at the end of the complete HDP upgrade to 2.1, Nagios monitoring still worked fine in my cluster.

Section 2.5: Upgrade the Stack –> Upgrade the Stack on all Ambari Agent hosts

In my understanding the documentation is not very helpful here:

Excerpt from the HDP upgrade documentation

If you follow these instructions you might think that there are upgrade packages with names such as HDFS, YARN, Ganglia etc. But this is not true. I found the list of packages to be upgraded using this approach:

  1. List all installed packages on your machine and redirect the output into a file, for instance:
    yum list installed >installed_pre_upgrade
  2. View the file and search for the repository name HDP-2.0. The name in the first line of a package section gives the package name that is to be upgraded. For instance:
    Example: a package entry in the output of yum list installed

    Note:

    Unfortunately the yum command list installed produces an output format that is difficult to handle, because the various repository names of a package are listed on separate lines. (You could write a script to deal with this – see the sketch after this list – but I think you are still quicker by manually reviewing the output of the list installed command.)
  3. Upgrade the components. For instance, in my case I ended up with these components to be upgraded:
     yum upgrade "hadoop*" "hbase*" "hive*" "pig*" "sqoop*" "zookeeper*"

Important:
Do not upgrade any hue* components at this stage. Hue will be upgraded later and requires a database backup prior to the upgrade; see below.
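
As mentioned in the note of step 2, a small one-liner can extract the HDP-2.0 package names despite the wrapped yum output; a sketch (assuming the repository column contains the string HDP-2.0):

# remember the package name of each entry and print it whenever the
# (possibly wrapped) repository column mentions HDP-2.0
yum list installed | awk '/^[[:alnum:]]/ {pkg=$1} /HDP-2\.0/ {print pkg}' | sort -u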

Section 3.5: Complete the Stack Upgrade –> Start NameNode

su -l <HDFS_USER> -c "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop/sbin/hadoop-daemon.sh start namenode -upgrade"

This command starts the NameNode in upgrade mode. It does not actually start the HDFS upgrade itself. There is no need to wait for anything after the NameNode has been started with this command. Instead you can go ahead and start the DataNodes. The NameNode will then instruct the DataNodes to upgrade HDFS.

Section 3.20: Complete the Stack Upgrade –> Tez

If you are not sure which user is running the HiveServer2 service, you can start the service from Ambari and then check the process, for instance:

[root@hdp20 upgrade_21]# ps -ef | grep HiveServer2
hive 15691 1 4 05:57 ? 00:00:06 /usr/jdk64/jdk1.7.0_45/bin/java -Xmx1024m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop/hive -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hive -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx1024m -Xmx1024m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-service-0.13.0.2.1.5.0-695.jar org.apache.hive.service.server.HiveServer2 -hiveconf hive.metastore.uris=" " -hiveconf hive.log.file=hiveserver2.log -hiveconf hive.log.dir=/var/log/hive

The documented commands to create the scratch directory did not work for me. Instead I used:

hdfs dfs -mkdir /tmp/hive-hive

(the directory already existed anyway)

hdfs dfs -chmod 777 /tmp/hive-hive

Section 3.21: Complete the Stack Upgrade –> Upgrade Hue

The documentation states: If you are using the embedded SQLite database, you must perform a backup of the database before you upgrade Hue to prevent data loss

The backup is not only required in case the upgrade fails. The upgrade will definitely re-initialize the Hue desktop db: all your scripts and references in the Hue interface will disappear after the upgrade. So you had better back up the desktop database before the Hue upgrade and restore it afterwards.
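
A possible backup before the Hue upgrade, assuming the default HDP location of the Hue database under /var/lib/hue:

cd /var/lib/hue
# dump the Hue desktop database into a plain SQL file before upgrading Hue
sqlite3 desktop.db .dump > ~/desktop.db.backup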

The restore command in the documentation has a small error. The correct command is:

sqlite3 desktop.db < <backup_file>

Post Upgrade Issues

After the upgrade had completed and after I had shut down the HDP cluster, I rebooted the VM and the cluster. Hive did not work anymore after the reboot. I noticed that /etc/rc3.d still contained some Hive and Hadoop startup scripts. For Hive, the services HiveServer, HiveServer2 and Hive Metastore had been started by the CentOS init process, but Ambari only showed HiveServer2 as running.

I guess that the startup scripts had been left over from the initial HDP 2.0 installation and did not get cleaned up with the HDP upgrade to 2.1. So I removed all Hadoop services from the CentOS startup procedure (except Ambari and Hue). This solved the issue.
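
On CentOS 6 this can be done with chkconfig; a sketch (the actual service names depend on what the HDP 2.0 installation registered, so list them first):

# show which Hadoop-related services are still registered for automatic startup
chkconfig --list | grep -iE 'hive|hadoop|hdfs|yarn'
# example: remove one of them from the startup procedure (service name is a placeholder)
chkconfig hive-server2 off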