Upgrading Hortonworks HDP from 2.0 to 2.1

Today I upgraded my personal HDP cluster from version 2.0 to version 2.1. The cluster runs entirely on a CentOS 6 VM on my notebook, so it consists of a single node hosting the NameNode, DataNode and all other services. Besides this I have a second Linux VM hosting SAP Data Services 4.2 with connectivity to the HDP cluster. The HDP installation on that machine can be considered a kind of Hadoop client, and its HDP software needed to be upgraded as well.

Given these resources, I use these VMs only for evaluating Hadoop functionality; I cannot run any useful performance tests on them. I am also not a Hadoop administrator, but I use Hadoop occasionally as part of some analytic data processing.

For those who are in a similar situation and want to upgrade a test HDP 2.0 cluster, I’d like to share my experiences with this upgrade:


Hortonworks has documented two upgrade approaches:

  1. Upgrade from HDP 2.0 to HDP 2.1 Manually
  2. Upgrading the HDP Stack from 2.0 to 2.1 (Ambari Upgrade)

I used the second option with Ambari. The term Ambari Upgrade might suggest a more or less automated procedure from within Ambari, but unfortunately this is not the case: there are many manual steps involved. You should plan roughly one day for the complete process (for small test clusters such as my installation).

If you thoroughly follow all the documented steps, everything should work fine. Nevertheless, there were a few steps in the upgrade instructions that were not absolutely clear to me and that could be potential pitfalls. Here are my notes on these steps:

Section 1.10: Upgrading Ambari Server to 1.6.1 –> Upgrade the Nagios add-ons package

The previously downloaded package didn’t contain any upgrade for the Nagios add-ons; it is still the same version as for Ambari 1.5. I am not sure whether I missed something here or whether there really is no add-on upgrade with Ambari 1.6. Anyway, at the end of the complete HDP upgrade to 2.1, Nagios monitoring still worked fine in my cluster.
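To double-check which add-on version is actually installed, a quick query like the following should do (hdp_mon_nagios_addons is the package name on my installation; consider that an assumption for yours):

rpm -qa | grep -i nagios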

Section 2.5: Upgrade the Stack –> Upgrade the Stack on all Ambari Agent hosts

In my view, the documentation is not very helpful here:

(screenshot of the documented upgrade instructions not reproduced here)

If you follow these instructions you might think there are upgrade packages with names such as HDFS, YARN, Ganglia etc., but this is not the case. I determined the list of packages to be upgraded using this approach:

  1. List all installed packages on your machine and redirect the output into a file, for instance:
    yum list installed >installed_pre_upgrade
  2. View the file and search for the repository name HDP-2.0. The name in the first line of a package section gives the package name that is to be upgraded. For instance:
    (screenshot of the yum output not reproduced here)

    Note:

    Unfortunately, yum list installed produces an output format that is awkward to process, because the repository name of a package may be printed on a separate line from the package name. You could write a small script to deal with this (a rough sketch follows below), but for a single-node cluster you are probably still quicker manually reviewing the output of the list installed command.
  3. Upgrade the components. For instance, in my case I ended up with these components to be upgraded:
     yum upgrade "hadoop*" "hbase*" "hive*" "pig*" "sqoop*" "zookeeper*"

Important:
Do not upgrade any hue* components at this stage. Hue will be upgraded later and requires a database backup prior to the upgrade, see below.
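For reference, here is a rough script sketch for step 2 above. It lists the names of all packages that were installed from an HDP-2.0 repository, assuming the repository column shows up as @HDP-2.0.x and that a line starting without whitespace begins a new package entry:

yum list installed | awk '
  /^[^ ]/     { name = $1 }     # line beginning a new package entry (name.arch)
  /@HDP-2\.0/ { print name }    # entry installed from an HDP-2.0 repository
' | sed 's/\.[^.]*$//' | sort -u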

Section 3.5: Complete the Stack Upgrade –> Start NameNode

su -l <HDFS_USER> -c "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop/sbin/hadoop-daemon.sh start namenode -upgrade"

This command starts the NameNode in upgrade mode; it does not start the HDFS upgrade itself. There is no need to wait for anything after the NameNode has been started with this command. Instead, you can go ahead and start the DataNodes, and the NameNode will then instruct them to upgrade HDFS.
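For completeness: starting a DataNode follows the same pattern as the NameNode command above (this is a sketch assuming the same paths as on my node), and the overall upgrade progress can be checked with dfsadmin afterwards:

su -l <HDFS_USER> -c "export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec && /usr/lib/hadoop/sbin/hadoop-daemon.sh start datanode"

su -l <HDFS_USER> -c "hdfs dfsadmin -report"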

Section 3.20: Complete the Stack Upgrade –> Tez

If you are not sure which user is running the HiveServer2 service, you can start the service from Ambari and then check the process, for instance:

[root@hdp20 upgrade_21]# ps -ef | grep HiveServer2
hive 15691 1 4 05:57 ? 00:00:06 /usr/jdk64/jdk1.7.0_45/bin/java -Xmx1024m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop/hive -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hive -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/lib/hadoop/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx1024m -Xmx1024m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-service-0.13.0.2.1.5.0-695.jar org.apache.hive.service.server.HiveServer2 -hiveconf hive.metastore.uris=" " -hiveconf hive.log.file=hiveserver2.log -hiveconf hive.log.dir=/var/log/hive
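If you only want to see the user name, something like this works as well (the quoted [H] simply keeps grep from matching its own process):

ps -eo user,cmd | grep '[H]iveServer2' | awk '{print $1}'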

The documented commands to create the Hive scratch directory did not work for me. Instead, I used:

hdfs dfs -mkdir /tmp/hive-hive

(the directory already existed anyway)

hdfs dfs -chmod 777 /tmp/hive-hive
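A simple listing should then show the directory with wide-open permissions (drwxrwxrwx):

hdfs dfs -ls /tmp | grep hive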

Section 3.21: Complete the Stack Upgrade –> Upgrade Hue

The documentation states: “If you are using the embedded SQLite database, you must perform a backup of the database before you upgrade Hue to prevent data loss.”

The backup is not only required in case the upgrade fails: the upgrade will definitely re-initialize the Hue desktop database, and all your scripts and references in the Hue interface will disappear afterwards. So you had better back up the desktop database before the Hue upgrade and restore it after the upgrade.
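A plain SQLite dump is sufficient as a backup; something like the following should do (the path /var/lib/hue/desktop.db is where the database lived on my installation, so treat it as an assumption):

cd /var/lib/hue
sqlite3 desktop.db .dump > <backup_file>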

The restore command in the documentation has a small error. The correct command is:

sqlite3 desktop.db < <backup_file>

Post Upgrade Issues

After the upgrade had completed and I had shut down the HDP cluster, I rebooted the VM and restarted the cluster. Hive no longer worked after the reboot. I noticed that /etc/rc3.d still contained some Hive and Hadoop startup scripts. For Hive, the services HiveServer, HiveServer2 and Hive Metastore had been started by the CentOS init process, but Ambari only showed HiveServer2 as running.

I guess the startup scripts had been left over from the initial HDP 2.0 installation and were not cleaned up by the HDP upgrade to 2.1. So I removed all Hadoop services from the CentOS startup procedure (except Ambari and Hue), which solved the issue.
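On CentOS 6 this can be done with chkconfig; the sketch below shows the idea (the exact service names are assumptions and depend on what is actually registered under /etc/rc3.d on your node):

chkconfig --list | egrep -i 'hadoop|hive|hbase|zookeeper|oozie'
chkconfig hive-server2 off   # repeat for each Hadoop-related service found, except ambari-* and hue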
