Setting Up the Apache Hive Metastore: A Comprehensive Guide to Configuration and Management
The Apache Hive metastore is a critical component of Hive, responsible for storing metadata about tables, partitions, schemas, and columns, enabling Hive to map data in Hadoop Distributed File System (HDFS) to structured tables. Properly configuring the metastore is essential for reliable and scalable Hive operations. This blog provides a detailed, step-by-step guide to setting up the Hive metastore, covering prerequisites, database selection, configuration, initialization, and verification. By following this guide, you can establish a robust metastore for big data analytics with Hive on a Linux system.
Overview of the Hive Metastore
The metastore acts as a catalog for Hive, storing metadata that defines the structure and location of data in HDFS or other storage systems. It supports multiple configurations—embedded, local, and remote—each suited to different use cases, from testing to production. This guide focuses on setting up a remote metastore with MySQL, the recommended choice for production environments due to its scalability and reliability. For foundational context, refer to the internal resource on What is Hive.
Understanding Metastore Configurations
Hive supports three metastore modes:
- Embedded Metastore: Uses an embedded Derby database, storing metadata in the local file system. Suitable for testing but not for multi-user or production environments due to single-connection limitations.
- Local Metastore: Uses an external database (e.g., MySQL) running on the same machine as Hive, with Hive managing database connections. Better than embedded but less scalable.
- Remote Metastore: Uses an external database accessed via a Thrift service, allowing multiple Hive instances to connect. Ideal for production due to concurrency and scalability.
This guide focuses on the remote metastore for enterprise-grade reliability. For Hive architecture details, see Hive Architecture.
Prerequisites for Metastore Setup
Before setting up the metastore, ensure the following:
- Operating System: Linux (e.g., Ubuntu 20.04 or later), the standard platform for Hive.
- Java: OpenJDK or Oracle JDK 8 or later with JAVA_HOME set.
- Hadoop: A running Hadoop cluster (version 3.x) with HDFS and YARN configured. Verify with:
hadoop version
hdfs dfs -ls /
- Hive: Installed Hive (e.g., version 3.1.3). Download from https://hive.apache.org/downloads.html.
- Relational Database: MySQL or PostgreSQL installed and running. MySQL is used here.
- System Requirements: At least 4GB RAM, 10GB free disk space for the database, and network connectivity for Thrift (port 9083).
- HDFS Directories: Hive warehouse directory created:
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/warehouse
Note: world-writable (777) permissions are convenient for a single-user test setup; in production, restrict the warehouse directory to the hive user and group instead.
For Hadoop setup, see Hive on Hadoop.
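The checklist above can be automated with a short script. The following is a minimal sketch (the command names and environment variables are taken from the prerequisite list; `check_prerequisites` is a hypothetical helper, not part of any Hive tooling) that reports which prerequisites are visible on the current machine:

```python
import os
import shutil

def check_prerequisites(commands, env_vars):
    """Return a dict mapping each prerequisite to True (found) or False."""
    results = {}
    for cmd in commands:
        # shutil.which looks the command up on PATH, like `command -v`
        results[cmd] = shutil.which(cmd) is not None
    for var in env_vars:
        # An env var counts as satisfied only if it is set and non-empty
        results[var] = bool(os.environ.get(var))
    return results

if __name__ == "__main__":
    status = check_prerequisites(
        commands=["java", "hadoop", "hive", "mysql"],
        env_vars=["JAVA_HOME", "HADOOP_HOME"],
    )
    for name, ok in status.items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Anything reported MISSING should be installed or exported before continuing; the sections below assume all of these are in place.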
Installing MySQL for the Metastore
MySQL is a popular choice for the metastore due to its performance and widespread use.
- Install MySQL:
sudo apt update
sudo apt install mysql-server
sudo systemctl start mysql
sudo mysql_secure_installation
- Verify MySQL:
mysql -u root -p
- Create Metastore Database and User:
CREATE DATABASE hive_metastore;
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepassword';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
EXIT;
- Download MySQL JDBC Driver: The driver enables Hive to connect to MySQL. Download from https://dev.mysql.com/downloads/connector/j/:
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.28.tar.gz
tar -xvzf mysql-connector-java-8.0.28.tar.gz
sudo cp mysql-connector-java-8.0.28/mysql-connector-java-8.0.28.jar /usr/local/hive/lib/
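A common failure later, during schema initialization, is a driver jar that is present in the lib directory but does not actually contain the driver class. Since a jar is just a zip archive, this can be checked without running Hive. The sketch below demonstrates the check against a synthetic jar built in a temp directory (an assumption for self-containment); in practice, point jar_path at the connector jar you copied into /usr/local/hive/lib/:

```python
import os
import tempfile
import zipfile

# The class configured later as javax.jdo.option.ConnectionDriverName,
# expressed as a jar entry path
DRIVER_CLASS = "com/mysql/cj/jdbc/Driver.class"

def jar_contains_driver(jar_path, entry=DRIVER_CLASS):
    """Check whether a jar (zip) archive contains the JDBC driver class."""
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

if __name__ == "__main__":
    # Demo on a synthetic jar; replace with the real connector jar path.
    with tempfile.TemporaryDirectory() as tmp:
        jar_path = os.path.join(tmp, "mysql-connector-demo.jar")
        with zipfile.ZipFile(jar_path, "w") as jar:
            jar.writestr(DRIVER_CLASS, b"")  # stand-in for the real class file
        print(jar_contains_driver(jar_path))  # True for this synthetic jar
```

If the check returns False against the real jar, the download was likely the wrong package (for example, the platform-independent tarball contains the jar one directory level down).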
Configuring the Hive Metastore
Configure Hive to use the MySQL metastore in remote mode by editing hive-site.xml.
- Create hive-site.xml: If not already present, create /usr/local/hive/conf/hive-site.xml:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
    <description>Password to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>Location of default database for the warehouse</description>
  </property>
</configuration>
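Before starting any services, it is worth confirming that hive-site.xml is well-formed and that the key properties resolve to the values you expect. This sketch parses a Hadoop-style configuration file with Python's standard library; the demo runs against a minimal sample written to a temp file (an assumption for self-containment), but the same read_hive_properties function works on /usr/local/hive/conf/hive-site.xml:

```python
import tempfile
import xml.etree.ElementTree as ET

def read_hive_properties(path):
    """Parse a Hadoop-style <configuration> file into a {name: value} dict."""
    tree = ET.parse(path)  # raises ParseError if the XML is malformed
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in tree.getroot().findall("property")
    }

if __name__ == "__main__":
    # Minimal sample standing in for the real hive-site.xml
    sample = """<configuration>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://localhost:9083</value>
      </property>
    </configuration>"""
    with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
        f.write(sample)
        path = f.name
    props = read_hive_properties(path)
    print(props["hive.metastore.uris"])  # thrift://localhost:9083
```

A ParseError here usually means an unclosed tag introduced while hand-editing the file, which is far easier to diagnose now than from a schematool stack trace later.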
- Configure hive-env.sh: Copy the template:
cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh
Edit:
export HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf
For configuration details, see Hive Config Files.
Initializing the Metastore Schema
Initialize the metastore schema to create the necessary tables in MySQL:
schematool -dbType mysql -initSchema
Verify the schema:
schematool -dbType mysql -info
If errors occur, check MySQL connectivity and hive-site.xml credentials.
Starting the Metastore Service
The remote metastore requires a Thrift service to handle client connections.
- Start Metastore Service:
hive --service metastore &
- Verify Port:
netstat -tuln | grep 9083
If netstat is not installed (it is absent by default on recent Ubuntu releases), use ss -tuln | grep 9083 instead.
Ensure port 9083 is open and listening.
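netstat shows local listeners, but from a client's perspective the more direct test is whether a TCP connection to the Thrift port actually succeeds. A minimal sketch (the host and ports mirror the configuration above; note this only tests TCP reachability, not the Thrift protocol itself):

```python
import socket

def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False

if __name__ == "__main__":
    # 9083 = metastore Thrift service, 10000 = HiveServer2 (as configured above)
    for port in (9083, 10000):
        state = "reachable" if port_reachable("localhost", port) else "NOT reachable"
        print(f"localhost:{port} is {state}")
```

Running this from another machine (with the real hostname in place of localhost) also verifies that no firewall is blocking the ports.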
Configuring HiveServer2
HiveServer2 provides the JDBC/ODBC endpoint through which multiple clients can run Hive queries concurrently; it connects to the metastore service configured above.
- Add HiveServer2 Properties: Update hive-site.xml with:
<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>localhost</value>
</property>
- Start HiveServer2:
hive --service hiveserver2 &
- Verify Port:
netstat -tuln | grep 10000
(or: ss -tuln | grep 10000)
For HiveServer2 details, see Hive Server vs. HiveServer2.
Verifying the Metastore Setup
Test the metastore using Beeline or Hive CLI.
- Using Beeline: Connect to HiveServer2:
beeline -u jdbc:hive2://localhost:10000 -n hive
Create and query a table:
CREATE TABLE test (id INT, name STRING) STORED AS ORC;
INSERT INTO test VALUES (1, 'TestUser');
SELECT * FROM test;
For Beeline usage, see Using Beeline.
- Using Hive CLI (deprecated in Hive 3 in favor of Beeline):
hive
Run the same query:
SELECT * FROM test;
For CLI usage, see Using Hive CLI.
- Check HDFS: Verify data storage:
hdfs dfs -ls /user/hive/warehouse/test
- Inspect MySQL: Confirm metadata in the metastore:
mysql -u hive -p
USE hive_metastore;
SHOW TABLES;
SELECT * FROM TBLS;
For table creation details, see Creating Tables.
Troubleshooting Common Issues
- MySQL Connection Errors: Ensure MySQL is running (sudo systemctl start mysql) and hive-site.xml credentials match the database user.
- Schema Initialization Failure: Verify the JDBC driver is in /usr/local/hive/lib/ and MySQL user has sufficient privileges.
- Thrift Service Errors: Check if port 9083 is blocked or another metastore service is running.
- HDFS Permissions: Ensure the Hive user has write access:
hdfs dfs -chown -R hive:hive /user/hive/warehouse
For more, see Common Errors.
Practical Example: Setting Up a Sales Table
Create a sales table to test the metastore:
CREATE TABLE sales (
sale_id INT,
product STRING,
amount DOUBLE
)
STORED AS ORC;
INSERT INTO sales VALUES (1, 'Laptop', 999.99);
SELECT product, SUM(amount) AS total FROM sales GROUP BY product;
Verify metadata in MySQL:
SELECT * FROM TBLS WHERE TBL_NAME = 'sales';
Check HDFS:
hdfs dfs -ls /user/hive/warehouse/sales
This demonstrates the metastore’s role in mapping table schemas to HDFS data. For query examples, see Select Queries.
External Insights
The Apache Hive documentation (https://hive.apache.org/) provides detailed guidance on metastore configurations and best practices. A blog by Cloudera (https://www.cloudera.com/products/hive.html) discusses metastore setup in enterprise environments, offering practical context.
Conclusion
Setting up the Apache Hive metastore is a critical step in deploying a scalable and reliable Hive environment. By configuring a remote metastore with MySQL, initializing the schema, and starting the metastore and HiveServer2 services, you enable Hive to manage metadata effectively for big data analytics. This guide equips you to establish a robust metastore, supporting data warehousing, ETL, and analytical querying on Hadoop, ensuring seamless integration with Hive’s distributed architecture.