A Comprehensive Guide to Installing and Configuring Apache Hive

Introduction:

Apache Hive is a powerful data warehousing solution built on top of the Hadoop ecosystem, enabling data analysts and engineers to run SQL-like queries on large datasets stored in Hadoop. To fully harness the capabilities of Hive, it is essential to properly install and configure it. In this detailed blog, we will walk you through the step-by-step process of installing and configuring Hive and its prerequisites, ensuring a smooth and efficient setup for your big data processing needs.

Prerequisites:

Before installing Hive, ensure that you have the following prerequisites installed and configured on your system:

  1. Java Development Kit (JDK) 1.8 or later
  2. Hadoop 2.x or 3.x
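
You can quickly confirm that both prerequisites are available on your PATH before proceeding. A minimal check (the version output will vary with your installation):

java -version 
hadoop version 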

Step 1: Download and Extract Apache Hive

1.1. Visit the official Apache Hive website ( https://hive.apache.org/ ) and open the Downloads page, which lists the available releases and their download links.

1.2. Choose the desired version of Hive and download the binary tarball.

wget https://dlcdn.apache.org/hive/stable-2/apache-hive-2.3.9-bin.tar.gz

1.3. Extract the downloaded tarball using the following command:

tar -xzvf apache-hive-x.y.z-bin.tar.gz 
# For example: 
tar -xzvf apache-hive-2.3.9-bin.tar.gz 

1.4. Move the extracted folder to your desired installation directory (e.g., /opt/hive) and set the HIVE_HOME environment variable:

mv apache-hive-2.3.9-bin /opt/hive 
export HIVE_HOME=/opt/hive 
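
To make HIVE_HOME and Hive's binaries available in every new session, you can append the exports to your shell startup file. A minimal sketch, assuming a Bash shell and the /opt/hive path used above:

echo 'export HIVE_HOME=/opt/hive' >> ~/.bashrc 
echo 'export PATH=$PATH:$HIVE_HOME/bin' >> ~/.bashrc 
source ~/.bashrc 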

Step 2: Configure Hadoop and Hive Environment Variables

2.1. Open Hadoop's 'hadoop-env.sh' file, typically located in the '$HADOOP_HOME/etc/hadoop' directory, and set the JAVA_HOME variable to your JDK installation directory:

export JAVA_HOME=<path_to_your_JDK_directory> 

2.2. Create a 'hive-env.sh' file in the '$HIVE_HOME/conf' directory and set the HADOOP_HOME and JAVA_HOME variables:

export HADOOP_HOME=<path_to_your_Hadoop_directory> 
export JAVA_HOME=<path_to_your_JDK_directory> 
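
The Hive distribution ships a template for this file, so instead of creating it from scratch you can copy the template and append the two exports above:

cp $HIVE_HOME/conf/hive-env.sh.template $HIVE_HOME/conf/hive-env.sh 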

Step 3: Configure Hive Metastore

Hive stores metadata about the tables, columns, and partitions in a relational database called the metastore. By default, Hive uses an embedded Derby database for the metastore. However, it is recommended to use an external database like MySQL or PostgreSQL for production environments. In this example, we will configure Hive to use MySQL as the metastore.

3.1. Install the MySQL server, then create a database and user for Hive.
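
On Ubuntu, for example, the server can be installed and a root shell opened roughly as follows (package and service names vary by distribution):

sudo apt-get update 
sudo apt-get install -y mysql-server 
sudo systemctl start mysql 
sudo mysql -u root 

From the MySQL prompt, run the following statements: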

CREATE DATABASE metastore; 
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepassword'; 
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost'; 
FLUSH PRIVILEGES; 

3.2. Download the MySQL Connector/J JDBC driver JAR from the MySQL website ( https://dev.mysql.com/downloads/connector/j/ ) and copy it to the '$HIVE_HOME/lib' directory.
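
Alternatively, the driver is also published to Maven Central and can be fetched from the command line; the version below is only an example, so choose one compatible with your MySQL server:

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar 
cp mysql-connector-java-8.0.28.jar $HIVE_HOME/lib/ 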

3.3. Create a 'hive-site.xml' file in the '$HIVE_HOME/conf' directory with the following configuration:

<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
<configuration> 
    <property> 
        <name>javax.jdo.option.ConnectionURL</name> 
        <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value> 
    </property> 
    <property> 
        <name>javax.jdo.option.ConnectionDriverName</name> 
        <value>com.mysql.cj.jdbc.Driver</value> 
    </property> 
    <property> 
        <name>javax.jdo.option.ConnectionUserName</name> 
        <value>hive</value> 
    </property> 
    <property> 
        <name>javax.jdo.option.ConnectionPassword</name> 
        <value>hivepassword</value> 
    </property> 
    <property> 
        <name>hive.metastore.warehouse.dir</name> 
        <value>/user/hive/warehouse</value> 
    </property> 
    <property> 
        <name>hive.metastore.schema.verification</name> 
        <value>true</value> 
    </property> 
</configuration> 
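
Note that the 'hive.metastore.warehouse.dir' path above refers to HDFS, not the local filesystem, so it must exist and be group-writable before Hive can create tables. Following the Hive documentation, the directories can be prepared along these lines:

hdfs dfs -mkdir -p /tmp 
hdfs dfs -mkdir -p /user/hive/warehouse 
hdfs dfs -chmod g+w /tmp 
hdfs dfs -chmod g+w /user/hive/warehouse 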

Step 4: Initialize the Metastore Schema

4.1. Run the 'schematool' command to initialize the MySQL metastore schema:

$HIVE_HOME/bin/schematool -dbType mysql -initSchema 

4.2. Verify that the schema has been successfully created by checking the MySQL metastore database.
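
For example, listing the tables in the metastore database should show the objects that schematool created; a correctly initialized schema contains tables such as DBS, TBLS, and VERSION:

mysql -u hive -p -e 'USE metastore; SHOW TABLES;' 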

Step 5: Start the Hive CLI

5.1. Start the Hive command-line interface (CLI) by running the following command:

$HIVE_HOME/bin/hive 

5.2. If the Hive CLI starts successfully, you should see a 'hive>' prompt, indicating that Hive is now ready for use.
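
As a quick smoke test, you can also run a trivial query non-interactively; it should list at least the built-in 'default' database:

$HIVE_HOME/bin/hive -e "SHOW DATABASES;" 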

Conclusion:

You have now successfully installed and configured Apache Hive on your system. With Hive set up, you can begin exploring its powerful features and capabilities for processing and analyzing large datasets in the Hadoop ecosystem. As you progress through your big data journey with Hive, be sure to explore more advanced topics, such as optimizations, integrations with other tools, and best practices for managing your data warehouse.