Installing Apache Hive on Linux: A Comprehensive Guide to Setup and Configuration

Apache Hive is a robust data warehousing tool built on top of Hadoop, designed to enable SQL-like querying of large-scale datasets stored in distributed systems. Linux is the preferred platform for Hive deployments thanks to its compatibility with Hadoop and mature support for big data tooling. This blog provides a detailed, step-by-step guide to installing Hive on a Linux system, covering prerequisites, installation, configuration, and verification. By following it, you can set up a fully functional Hive environment for data warehousing, ETL, and big data analytics.

Overview of Hive on Linux

Hive integrates seamlessly with the Hadoop Distributed File System (HDFS) for storage and YARN for resource management, making it a powerful tool for processing massive datasets. Installing Hive on Linux involves setting up Hadoop, configuring a metastore for metadata management, and ensuring compatibility between components. This guide focuses on Ubuntu, a popular Linux distribution, but the steps are adaptable to other distributions like CentOS or Red Hat. For foundational context, refer to the internal resource on What is Hive.

Prerequisites for Hive on Linux

Before installing Hive, ensure the following prerequisites are met:

  • Operating System: Ubuntu 20.04 or later (or equivalent Linux distribution).
  • Java: OpenJDK or Oracle JDK 8 or later with JAVA_HOME set.
  • Hadoop: A running Hadoop cluster (version 3.x recommended) with HDFS and YARN configured.
  • Relational Database: MySQL or PostgreSQL for the metastore. Derby is suitable for testing but not recommended for production.
  • System Requirements: At least 4GB RAM, 20GB free disk space, and sufficient CPU for Hadoop and Hive services.
  • Network: Open ports for Hadoop (e.g., 9000 for HDFS) and Hive services (e.g., 9083 for metastore, 10000 for HiveServer2).
  • SSH: Passwordless SSH configured for Hadoop services in a multi-node cluster.

Verify Java and SSH:

java -version
ssh localhost
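
If you also want a quick pre-flight check that JAVA_HOME is set and the Hive/Hadoop ports are free, something like the following works on most distributions (ss ships with iproute2; the port numbers match the prerequisites above):

echo $JAVA_HOME                        # should print your JDK path
ss -tuln | grep -E '9000|9083|10000'   # no output means the ports are free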

For Hadoop setup, refer to the Apache Hadoop documentation (https://hadoop.apache.org/docs/stable/).

Installing Hadoop on Linux

Hadoop is a prerequisite for Hive. Install it as follows:

  1. Download Hadoop:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
  2. Set Environment Variables: Edit ~/.bashrc:
echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
  3. Configure Hadoop: Edit files in /usr/local/hadoop/etc/hadoop:
  • core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
  • hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/datanode</value>
  </property>
</configuration>
  • yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
  • mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
  4. Set JAVA_HOME: Install Java:
sudo apt update
sudo apt install openjdk-8-jdk

Find the Java path:

readlink -f /usr/bin/java

Edit hadoop-env.sh:

echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
  5. Format HDFS:
hdfs namenode -format
  6. Start Hadoop:
start-dfs.sh
start-yarn.sh

Verify with:

jps

Expect NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. For Hadoop integration, see Hive on Hadoop.
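
To go beyond jps, two optional checks confirm HDFS is actually serving requests (port 9870 is the Hadoop 3.x NameNode web UI default):

hdfs dfsadmin -report                        # lists live DataNodes and cluster capacity
curl -s http://localhost:9870 | head -n 5    # NameNode web UI should respond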

Installing Apache Hive

  1. Download Hive:
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xvzf apache-hive-3.1.3-bin.tar.gz
sudo mv apache-hive-3.1.3-bin /usr/local/hive
  2. Set Environment Variables: Edit ~/.bashrc:
echo 'export HIVE_HOME=/usr/local/hive' >> ~/.bashrc
echo 'export PATH=$PATH:$HIVE_HOME/bin' >> ~/.bashrc
source ~/.bashrc
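
To confirm the variables took effect and the Hive binaries are on your PATH:

hive --version

This should print a Hive 3.1.3 version banner.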

For more, see Environment Variables.

Configuring Hive Metastore

Use MySQL for the metastore to ensure reliability.

  1. Install MySQL:
sudo apt install mysql-server
sudo systemctl start mysql
sudo mysql_secure_installation
  2. Create Metastore Database:
mysql -u root -p
CREATE DATABASE hive_metastore;
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepassword';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
EXIT;
  3. Download MySQL JDBC Driver:
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.28.tar.gz
tar -xvzf mysql-connector-java-8.0.28.tar.gz
sudo cp mysql-connector-java-8.0.28/mysql-connector-java-8.0.28.jar /usr/local/hive/lib/
  4. Configure hive-site.xml: Create /usr/local/hive/conf/hive-site.xml:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
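
Before initializing anything, it's worth a quick sanity check that the hive user can reach the database and that the XML is well formed (xmllint comes from the libxml2-utils package on Ubuntu):

mysql -u hive -p -e "SHOW DATABASES LIKE 'hive_metastore';"
xmllint --noout /usr/local/hive/conf/hive-site.xml   # silence means valid XML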

For details, see Hive Metastore Setup.

Configuring Hive for Hadoop

  1. Create HDFS Directories:
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /tmp
hdfs dfs -chmod -R 777 /user/hive/warehouse /tmp   # 777 is fine for a sandbox; tighten in production
  2. Set Up hive-env.sh: Copy the template:
cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh

Edit:

export HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf
  3. Set Execution Engine: Use Tez for improved performance. Add to hive-site.xml:
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>

Download Tez:

wget https://downloads.apache.org/tez/0.10.2/apache-tez-0.10.2-bin.tar.gz
tar -xvzf apache-tez-0.10.2-bin.tar.gz
sudo mv apache-tez-0.10.2-bin /usr/local/tez

Configure Tez and upload to HDFS as per Hive on Tez. For configuration files, see Hive Config Files.
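
As a rough sketch of that step (assuming the binary release's bundled share/tez.tar.gz; the HDFS path /apps/tez is an arbitrary choice):

hdfs dfs -mkdir -p /apps/tez
hdfs dfs -put /usr/local/tez/share/tez.tar.gz /apps/tez/

Then point tez.lib.uris at it in tez-site.xml on Hive's classpath:

<property>
  <name>tez.lib.uris</name>
  <value>hdfs://localhost:9000/apps/tez/tez.tar.gz</value>
</property>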

Initializing the Metastore

Initialize the schema:

schematool -dbType mysql -initSchema

Verify:

schematool -dbType mysql -info
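
You can also confirm the schema landed in MySQL; the metastore's VERSION table records the schema version (3.1.0 for Hive 3.1.3):

mysql -u hive -p hive_metastore -e "SELECT SCHEMA_VERSION FROM VERSION;"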

Starting Hive Services

  1. Start Metastore:
hive --service metastore &
  2. Start HiveServer2:
hive --service hiveserver2 &

Verify that the metastore (9083) and HiveServer2 (10000) ports are listening:

netstat -tuln | grep -E '9083|10000'

(Use ss -tuln instead if net-tools, which provides netstat, is not installed.)
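
A bare & ties each service to your terminal session. For anything longer-lived, a slightly more durable pattern is nohup with explicit log files (the log directory here is an arbitrary choice):

mkdir -p /usr/local/hive/logs
nohup hive --service metastore > /usr/local/hive/logs/metastore.log 2>&1 &
nohup hive --service hiveserver2 > /usr/local/hive/logs/hiveserver2.log 2>&1 &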

Verifying the Installation

Test using the Hive CLI or Beeline. Note that the Hive CLI is deprecated in Hive 3 in favor of Beeline, but it remains convenient for quick local tests.

  1. Hive CLI:
hive

Run a test query:

CREATE TABLE test (id INT, name STRING) STORED AS ORC;
INSERT INTO test VALUES (1, 'TestUser');
SELECT * FROM test;

For CLI usage, see Using Hive CLI.

  2. Beeline:
beeline -u jdbc:hive2://localhost:10000 -n hive

Run the same query:

SELECT * FROM test;
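
Beeline can also run statements non-interactively, which is handy for scripted smoke tests:

beeline -u jdbc:hive2://localhost:10000 -n hive -e "SELECT * FROM test;"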

Check HDFS:

hdfs dfs -ls /user/hive/warehouse/test

For Beeline details, see Using Beeline.

Troubleshooting Common Issues

  • MySQL Errors: Ensure MySQL is running (sudo systemctl start mysql) and hive-site.xml credentials are correct.
  • Hadoop Version Mismatch: Use Hive 3.1.3 with Hadoop 3.x.
  • Permission Issues: Adjust HDFS permissions:
hdfs dfs -chown -R hive:hive /user/hive/warehouse
  • Tez Errors: Verify Tez libraries in HDFS and configuration in tez-site.xml.
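
When the cause isn't obvious from the console, the Hive logs usually are; by default they land under the system temp directory (overridable in hive-log4j2.properties):

tail -n 50 /tmp/$USER/hive.log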

For more, see Common Errors.

Practical Example: Analyzing Sales Data

Create a sales table to test the setup:

CREATE TABLE sales (
  sale_id INT,
  product STRING,
  amount DOUBLE
)
STORED AS ORC;

INSERT INTO sales VALUES (1, 'Laptop', 999.99);
SELECT product, SUM(amount) AS total FROM sales GROUP BY product;

This query uses HDFS for storage, YARN for resources, and Tez for execution, demonstrating Hive’s integration with Hadoop. For table creation, see Creating Tables.
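
As with the test table, you can confirm the data files landed in the warehouse directory:

hdfs dfs -ls /user/hive/warehouse/sales

The GROUP BY query should return a single row (Laptop, 999.99), computed as a Tez job on YARN.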

External Insights

The Apache Hive documentation (https://hive.apache.org/) provides detailed setup instructions and version compatibility notes. Cloudera's Hive page (https://www.cloudera.com/products/hive.html) offers practical context on deploying Hive on Linux-based Hadoop clusters.

Conclusion

Installing Apache Hive on Linux is straightforward once a properly configured Hadoop cluster is in place. By setting up Java, Hadoop, and MySQL, configuring the metastore, and enabling an execution engine such as Tez, you create a powerful environment for big data analytics. This setup leverages Linux's strong support for Hadoop, enabling efficient data warehousing, ETL, and analytical querying. With this guide, you can confidently deploy Hive to unlock insights from large datasets.