Building Apache Hadoop from Source on Windows 10 with Visual Studio 2015
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Introduction
The easiest way to set up Hadoop is to download the binaries from any of the Apache Download Mirrors. However, this only works for GNU/Linux.
NOTE
Hadoop version 2.2 onwards includes native support for Windows.
The official Apache Hadoop releases do not include Windows binaries (yet, as of this writing). According to the official Hadoop Wiki, building a Windows package from the sources is fairly straightforward.
This information, however, must be taken with a grain of salt, as it wasn't straightforward when I built my Windows distribution from source. In this post we are going to build Hadoop from source on Windows 10 with Visual Studio Community Edition 2015.
Prerequisites
- Windows 10 64 bit
- Oracle JDK 1.8.0_111+
- Maven 3.3.9+
- Protocol Buffers 2.5.0
- CMake 3.7.1 +
- Visual Studio Community Edition 2015 Update 3+
- Windows SDK 8.1
- Cygwin 2.6.0
- zlib 1.2.8
- Internet connection for first build (to fetch all Maven and Hadoop dependencies)
Setup Environment
The first step to building Hadoop on Windows is to set up our Windows build environment. To do so, we will define the following environment variables and PATH entries:
- JAVA_HOME
- MAVEN_HOME
- C:\protobuf
- C:\Program Files\CMake\bin
- C:\cygwin64\bin
- Platform
- ZLIB_HOME
- temp
- tmp
NOTE
The following assumes JDK 1.8.0_111 is downloaded and installed.
The Java installation process on Windows will likely go ahead and install both the JDK and JRE directories (even if the JRE wasn't selected during the installation process). The default installation will create the following directories:
C:\Program Files\Java\jdk1.8.0_111
C:\Program Files\Java\jre1.8.0_111
Some Java programs do not work well with a JAVA_HOME environment variable that contains embedded spaces (such as C:\Program Files\Java\jdk1.8.0_111). To get around this, Oracle has created a subdirectory at C:\ProgramData\Oracle\Java\javapath\ to contain links to various Java executables without any embedded spaces; however, for some reason it has omitted the JDK compiler from this list. To correct this, we need to create an additional directory symbolic link to the JDK installation.
Open an Administrative Command Prompt: press the Win key, type cmd.exe, and press Ctrl + Shift + Enter to open the Command Prompt in elevated mode. This provides the privileges needed to create the symbolic link, using the following commands:
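The commands themselves were lost from the text; a likely form, assuming the default JDK install path shown above, is:

```shell
:: create a space-free directory symlink pointing at the JDK install
mklink /D C:\ProgramData\Oracle\Java\javapath\JDK "C:\Program Files\Java\jdk1.8.0_111"
```

Adjust the target path if your JDK is installed elsewhere.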
The JAVA_HOME environment System variable can then be set to the following (with no embedded spaces):
JAVA_HOME ==> C:\ProgramData\Oracle\Java\javapath\JDK
The Environment Variable editor can be accessed from the "Start" menu by clicking on "Control Panel", then "System and Security", then "System", then "Advanced System Settings", then "Environment Variables".
The PATH environment System variable should then be prefixed with the following:
%JAVA_HOME%\bin
Test your installation with `java -version`; the output should report the JDK version installed above (1.8.0_111).
NOTE
The following assumes Maven 3.3.9 is downloaded.
Extract the Maven binaries to C:\apache-maven-3.3.9. Set the MAVEN_HOME environment System variable to the following:
MAVEN_HOME ==> C:\apache-maven-3.3.9
The PATH environment System variable should then be prefixed with the following:
%MAVEN_HOME%\bin
Test your installation with `mvn -version`; the output should report Maven 3.3.9.
Next, extract Protocol Buffers to C:\protobuf. The PATH environment System variable should then be prefixed with the following:
C:\protobuf
Test your installation with `protoc --version`; the output should report libprotoc 2.5.0.
NOTE
The following assumes CMake is installed to C:\Program Files\CMake.
Next, the PATH environment System variable should be prefixed with the following:
C:\Program Files\CMake\bin
Test your installation with `cmake --version`.
NOTE
The following assumes Cygwin is installed to C:\cygwin64.
Next, the PATH environment System variable should be prefixed with the following:
C:\cygwin64\bin
Extract the contents of zlib128-dll to C:\zlib
. This will be needed later.
And that’s it for setting up the System Environment variables. Whew! That was a lot of setting up to do.
Get and Tweak Hadoop Sources
Download the Hadoop source files from the Apache Download Mirrors. At the time of this writing, the latest Hadoop version is 2.7.3. Extract the contents of hadoop-2.7.3-src.tar.gz to C:\hdfs.
The source files of Hadoop are written for the Windows SDK or Visual Studio 2010 Professional. This makes them incompatible with Visual Studio 2015 Community Edition. To work around this, open the following files in Visual Studio 2015:
C:\hdfs\hadoop-common-project\hadoop-common\src\main\winutils\winutils.vcxproj
C:\hdfs\hadoop-common-project\hadoop-common\src\main\winutils\libwinutils.vcxproj
C:\hdfs\hadoop-common-project\hadoop-common\src\main\native\native.vcxproj
Visual Studio will complain about them being of an old version. All you have to do is save all and close.
Next, enable CMake Visual Studio 2015 project generation for hdfs. On line 449 of C:\hdfs\hadoop-hdfs-project\hadoop-hdfs\pom.xml, edit the else value as follows:
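The edited snippet is missing from the text. Assuming line 449 holds the Ant condition that selects the CMake generator (as it does in the 2.7.x sources), the change likely swaps the Visual Studio 10 generator for the Visual Studio 14 (2015) one, roughly:

```xml
<!-- hadoop-hdfs pom.xml: choose the CMake generator for native builds -->
<!-- the else value below is the edited part (hypothetical reconstruction) -->
<condition property="generator" value="Visual Studio 10"
           else="Visual Studio 14 2015 Win64">
  <equals arg1="Win32" arg2="${env.PLATFORM}" />
</condition>
```

Check the actual attribute names against your copy of the pom before editing.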
Build Hadoop
To build Hadoop you need the Developer Command Prompt. To launch the prompt on Windows 10, follow the steps below:
- Open the Start menu, for example by pressing the Windows logo key on your keyboard.
- On the Start menu, enter dev. This will bring up a list of installed apps that match your search pattern. If you're looking for a different command prompt, try entering a different search term such as prompt.
- Choose the Developer Command Prompt (or the command prompt you want to use).
From this prompt, add a few more environment variables:
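The exact variables were lost from the text; based on the list in the Setup Environment section, they are likely along these lines:

```shell
:: note: "Platform" is case sensitive
set Platform=x64
set ZLIB_HOME=C:\zlib
set temp=C:\temp
set tmp=C:\tmp
```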
The Platform variable is case sensitive. Finally, run the build:
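The build command itself is missing above; the standard invocation documented in Hadoop's BUILDING.txt for a Windows distribution is:

```shell
:: build the full distribution with native Windows binaries, skipping tests
mvn package -Pdist,native-win -DskipTests -Dtar
```

The first run will take a while, since Maven downloads all dependencies.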
When everything is successful, we will get an output similar to this:
This will build the binaries to C:\hdfs\hadoop-dist\target\hadoop-2.7.3.tar.gz.
Install Hadoop
With our build successful, we can now install Hadoop on Windows 10. Pick a target directory for installing the package. We use C:\hadoop as an example. Extract the tar.gz file (e.g. hadoop-2.7.3.tar.gz) under C:\hadoop. This will yield a directory structure like the following. If installing a multi-node cluster (we will cover this in a different post), then repeat this step on every node:
This section describes the absolute minimum configuration required to start a Single Node (pseudo-distributed) cluster.
Add the environment variable HADOOP_HOME and edit the Path variable to add the bin directory of HADOOP_HOME (say %HADOOP_HOME%\bin).
Before you can start the Hadoop daemons you will need to make a few edits to configuration files. The configuration file templates will all be found in C:\hadoop\etc\hadoop, assuming your installation directory is C:\hadoop.
Follow these steps to edit the HDFS configuration.
First edit the file hadoop-env.cmd to add the following lines near the end of the file:
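The lines themselves are missing above; the ones given in the official Hadoop wiki's Windows guide are typically:

```shell
set HADOOP_PREFIX=c:\hadoop
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
```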
Edit the file core-site.xml and make sure it has the following configuration key:
fs.default.name:
The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri’s scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri’s authority is used to determine the host, port, etc. for a filesystem.
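A minimal core-site.xml carrying that key might look like this; the hdfs://0.0.0.0:19000 endpoint is the one used in the official wiki example, so adjust host and port to taste:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:19000</value>
  </property>
</configuration>
```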
Edit the file hdfs-site.xml and add the following configuration key:
dfs.replication:
Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.
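For a single-node setup, a replication factor of 1 is the usual choice; a sketch of the entry:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```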
Finally, edit the file slaves and make sure it has the following entry:
localhost
The default configuration puts the HDFS metadata and data files under \tmp on the current drive. In the above example this would be C:\tmp. For your first test setup you can just leave it at the default.
Follow these steps to edit the YARN configuration.
Edit mapred-site.xml under %HADOOP_PREFIX%\etc\hadoop
:
Add the following entries:
mapreduce.framework.name:
The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.
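Given that description, the entry would presumably set the framework to yarn:

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```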
Finally, edit yarn-site.xml and add the following entries:
yarn.nodemanager.aux-services:
The auxiliary service name. Default value is mapreduce_shuffle.
yarn.nodemanager.aux-services.mapreduce.shuffle.class:
The auxiliary service class to use. Default value is org.apache.hadoop.mapred.ShuffleHandler
yarn.application.classpath:
CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries.
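Assuming the defaults quoted above, the yarn-site.xml entries might look like this; the classpath value is the one from the official wiki's Windows guide and may need adjusting for your layout:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.application.classpath</name>
  <value>%HADOOP_CONF_DIR%,%HADOOP_COMMON_HOME%/share/hadoop/common/*,%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*</value>
</property>
```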
Run %HADOOP_PREFIX%\etc\hadoop\hadoop-env.cmd to set up the environment variables that will be used by the startup scripts and the daemons.
Format the filesystem with the following command:
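The command was lost from the text; it is presumably the standard one:

```shell
%HADOOP_PREFIX%\bin\hdfs namenode -format
```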
This command will print a number of filesystem parameters. Just look for the following two strings to ensure that the format command succeeded:
Run the following command to start the NameNode and DataNode on localhost:
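The command is not shown above; it is presumably the stock startup script:

```shell
%HADOOP_PREFIX%\sbin\start-dfs.cmd
```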
Two separate Command Prompt windows will be opened automatically to run Namenode and Datanode.
To verify that the HDFS daemons are running, we would create a file:
Try copying the file to HDFS:
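The commands themselves are missing; a sketch of the two steps, where myfile.txt is a hypothetical file name:

```shell
:: create a small local test file
echo Hello Hadoop > myfile.txt
:: copy it into HDFS, then list the root to confirm it arrived
%HADOOP_PREFIX%\bin\hdfs dfs -put myfile.txt /
%HADOOP_PREFIX%\bin\hdfs dfs -ls /
```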
Finally, start the YARN daemons:
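As with HDFS, the command is presumably the stock startup script:

```shell
%HADOOP_PREFIX%\sbin\start-yarn.cmd
```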
Similarly, two separate Command Prompt windows will be opened automatically to run Resource Manager and Node Manager. The cluster should be up and running! To verify, we can run a simple wordcount job on the text file we just copied to HDFS:
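The job command is missing above; assuming the hypothetical myfile.txt copied to HDFS earlier, it would look something like this (check the examples jar name against your build):

```shell
%HADOOP_PREFIX%\bin\yarn jar %HADOOP_PREFIX%\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.3.jar wordcount /myfile.txt /out
```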
If everything goes well then you will be able to open the Resource Manager and Node Manager at http://localhost:8042 and Namenode at http://localhost:50070.
Stop HDFS & MapReduce with the following commands:
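These were also lost from the text; presumably the matching stop scripts:

```shell
%HADOOP_PREFIX%\sbin\stop-yarn.cmd
%HADOOP_PREFIX%\sbin\stop-dfs.cmd
```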
Conclusion
In this post, we built Hadoop from source on Windows 10 64-bit with Visual Studio 2015 Community Edition, albeit through a tedious process. We also set up a single-node Hadoop cluster on Windows. Until the next post, keep doing cool things.