Migrate data from a relational database to a big data application with Sqoop – Hadoop

This post is shared by big data analytics services vendors to explain the basics of Sqoop. You will learn how to use Sqoop with Hadoop to migrate data from a relational database to a big data application like a professional.

Introduction to Sqoop

Sqoop is a tool that migrates and transfers data in bulk between an RDBMS and a Hadoop system. In the past, moving data between Hadoop and an RDBMS was difficult because we had to ensure the data stayed consistent on both sides. Sqoop was released to solve these problems.

Data ingestion is an important layer in a big data application because it collects and prepares data from many sources into Hadoop for analytics. In this blog, we use Sqoop 2, which adds enhancements in security, ease of extension, and ease of use.

Environment

Java: JDK 1.7

Cloudera version: CDH 4.6.0

Initial steps

  1. We need to prepare a MySQL server.
  2. We need to prepare some input data in the MySQL server to migrate to the Hadoop cluster (a minimal sketch follows this list).
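
For reference, the MySQL preparation could look roughly like this. This is only a sketch: the database name mydb, the user hn, and the products table schema are assumptions for illustration, and the GRANT syntax shown is for older MySQL versions; adjust everything to your environment.

mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS mydb;"
mysql -u root -p -e "GRANT ALL PRIVILEGES ON mydb.* TO 'hn'@'%' IDENTIFIED BY 'hn';"
mysql -u hn -phn mydb -e "CREATE TABLE products (id INT PRIMARY KEY, productName VARCHAR(100), productType VARCHAR(50));"
mysql -u hn -phn mydb -e "INSERT INTO products VALUES (1, 'Whole Milk', 'Milk'), (2, 'Cheddar', 'Cheese'), (3, 'Yogurt', 'Milk');"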

Code walk-through and inline data verification

This is the basic Sqoop command. It migrates the data from the MySQL server (the products table) to the default location on the Hadoop cluster, using a username and password for the MySQL server:

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products 

Use this command to verify the data on the Hadoop cluster: hadoop fs -cat products/part-m-*
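
A quick note: by default Sqoop runs the import with several parallel mappers (which is why the output appears as multiple part-m-* files) and splits the work on the table's primary key. If the table has no primary key, you can point Sqoop at a split column or force a single mapper. The id column below is only an assumed example:

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --split-by id

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products -m 1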

This is the command to migrate the products table from the MySQL server to a specific location on HDFS in the Hadoop cluster:

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --target-dir /my/input/products 

Use this command to verify the data on the Hadoop cluster: hadoop fs -cat /my/input/products/*
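
If you re-run this import, it will fail when the target directory already exists. One simple way to handle that, assuming your Sqoop version supports the flag, is --delete-target-dir, which removes the directory before importing:

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --target-dir /my/input/products --delete-target-dir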

This is the command to migrate only a subset of the data (rows matching a condition) from the MySQL server to the Hadoop cluster:

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --where "productType = 'Milk'" 

Use this command to verify the data on the Hadoop cluster: hadoop fs -cat products/part-m-*
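
When the filter is more complex than a single --where clause, Sqoop also accepts a free-form query via --query. In that case you must keep the literal $CONDITIONS token in the query (Sqoop replaces it with its split predicates) and supply --target-dir plus a split column or -m 1. The query, column, and path below are only an illustration:

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --query 'SELECT * FROM products WHERE productType = "Milk" AND $CONDITIONS' --split-by id --target-dir /my/input/milk_products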

In some previous posts, I introduced the Avro and SequenceFile formats in Hadoop, which improve serialization performance and compress the data. These commands migrate the data in SequenceFile and Avro format:

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --as-sequencefile

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --as-avrodatafile
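
Both file formats can also be combined with compression to save HDFS space. As a rough sketch (Snappy is just one common codec choice and assumes it is available on your cluster):

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --as-avrodatafile --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec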

When we ingest data into a big data application, we need to keep track of the latest ingested data so the next run can continue from that point without duplicating records. This command performs incremental ingestion in Sqoop:

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table visits --incremental lastmodified --check-column last_update_date --last-value "2016-05-22 11:11:11"
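
At the end of an incremental run, Sqoop prints the value to pass as --last-value next time. Rather than tracking it by hand, you can let a saved Sqoop job store and update it for you; the job name below is just an example:

sqoop job --create visits_incremental -- import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table visits --incremental lastmodified --check-column last_update_date --last-value "2016-05-22 11:11:11"

sqoop job --exec visits_incremental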

Sqoop commands can run in shell mode, so we can combine all of the above commands into one .sh file and run it on a schedule with a cron job so everything happens automatically.

vi auto.sh
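
#!/bin/bash
# Sketch of auto.sh: run all of the imports above in sequence (assumes sqoop is on the PATH)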

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --target-dir /my/input/products 

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --where "productType = 'Milk'" 

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --as-sequencefile

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table products --as-avrodatafile

sqoop import --connect jdbc:mysql://rdbms.com/mydb --username hn --password hn --table visits --incremental lastmodified --check-column last_update_date --last-value "2016-05-22 11:11:11"

After running the sh file above, we can use hadoop fs -ls and hadoop fs -text to check the data on our Hadoop cluster.
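
To schedule the script with cron as mentioned above, one possible setup (the path, schedule, and log file are only examples) is to make the script executable and add a crontab entry:

chmod +x auto.sh

# crontab -e entry: run the ingestion script every night at 2 AM
0 2 * * * /home/hn/auto.sh >> /home/hn/sqoop_ingest.log 2>&1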

Hope this blog helps you use Sqoop to migrate data between an RDBMS and a Hadoop cluster.

Big data analytics services vendors have shared this article with the entire Hadoop development community. The purpose of sharing this post is to help people learn how to migrate data from a relational database to a big data application using Sqoop – Hadoop.