post

Getting Started with Hive

· 4 min read · 707 words

References: Programming Hive (1st edition), Hive Official Wiki, https://bbs.huaweicloud.com/blogs/197920

Programming Hive (1st edition) was published in 2013, before Hive 2 was released. Hive 3 is now widely used, and much of that material is outdated. This post is for learning purposes only.


Hive is an open-source data warehouse framework built on the Hadoop ecosystem, designed for processing and analyzing large volumes of structured and semi-structured data.

Hive provides a SQL-like query language called HiveQL (Hive Query Language), which allows users to write queries against data stored in HDFS or other compatible storage systems (such as Amazon S3). HiveQL queries are translated into a series of MapReduce jobs that execute on a Hadoop cluster. Hive also supports other execution engines such as Apache Tez and Apache Spark.

图片

The Hive distribution includes a CLI module, a simple web interface called the Hive Web Interface (HWI), and several modules accessible programmatically via JDBC, ODBC, and a Thrift server (see Chapter 16).

All commands and queries pass through the Driver module, which parses and compiles input, optimizes the required computations, and executes them according to the specified plan. When a MapReduce job needs to be generated, Hive drives the execution of built-in native Mapper and Reducer modules via an XML file representing the job execution plan.

Hive communicates with JobTracker to initialize MapReduce jobs and does not need to be deployed on the JobTracker node.

The Metastore is a standalone relational database in which Hive stores table schemas and other system metadata.


Single-Node Hive Installation and Usage

Prerequisites

wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xzvf apache-hive-3.1.3-bin.tar.gz

MySQL Installation and Configuration

Hive uses a metastore to store metadata. MySQL is used here.

Remove the pre-installed MySQL:

yum remove mariadb mariadb-libs

Install:

yum install TXSQL-client.x86_64

Start MySQL:

# Start MySQL
sudo service mysql start

# Test login; MySQL 5.7's default root password is empty
mysql -u root

If you encounter permission issues preventing login, follow these steps:

# Stop MySQL service
sudo service mysql stop
# Start MySQL in safe mode
sudo mysqld_safe --skip-grant-tables &
# Log in to MySQL
mysql -u root

Change the root user password:

update mysql.user set authentication_string=PASSWORD('newPwd'), plugin='mysql_native_password' where user='root';
flush privileges;

Restart the service:

sudo service mysql stop
sudo service mysql start
# Now log in with the password
mysql -u root -p

After configuring Hive, you can inspect the metastore:

mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-------------------------------+
| Tables_in_hive                |
+-------------------------------+
| AUX_TABLE                     |
| BUCKETING_COLS                |
| CDS                           |
| COLUMNS_V2                    |
| COMPACTION_QUEUE              |
| COMPLETED_COMPACTIONS         |
| COMPLETED_TXN_COMPONENTS      |
| CTLGS                         |
| DATABASE_PARAMS               |
| DBS                           |
| DB_PRIVS                      |
| DELEGATION_TOKENS             |
| FUNCS                         |
...

Hive Installation and Configuration

Configure Hive environment variables by appending to ~/.bashrc:

export HIVE_HOME=/path/to/hive                                                                     
export PATH=$HIVE_HOME/bin:$PATH

Apply the configuration in the current session:

source ~/.bashrc

Navigate to the Hive conf directory and generate hive-site.xml:

cp hive-default.xml.template hive-site.xml

Edit hive-site.xml:

# This must be configured; otherwise the Hive CLI will throw an error on startup
<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/hive/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>

# MySQL-related configuration
<property>
  <name>hive.metastore.db.type</name>
  <value>MYSQL</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>

MySQL Connector

Download from Maven Repository. Using mysql-connector-java-5.1.38.jar here — download and move to Hive’s lib directory:

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar

Initialize the Schema

schematool -dbType mysql -initSchema

# Error 1: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument
# Cause: Version conflict between the guava jar under the Hadoop path and Hive's lib directory
# Fix: Remove the older guava jar from Hive's lib directory

# Error 2: Illegal character entity: expansion character (code 0x8 at [row,col,system-id]: [3215,96,"file:/……/hive-site.xml"]
# Cause: Some characters in hive-site.xml do not conform to XML format (usually in description fields)
# Fix: Delete the offending characters as indicated

Test Hive (start Hadoop first)

# Enter the Hive CLI
hive
# Create a table
create table test(id int);

# Exit the Hive CLI and verify that Hadoop created the corresponding directory
# If /user/hive/warehouse/test exists, the table was created successfully
hdfs dfs -ls /user/hive/warehouse

Metastore

Managed tables / External tables

Partition directories