What is the difference between get_json_object and json_tuple functions in Hive? - Big Data In Real World

What is the difference between get_json_object and json_tuple functions in Hive?

How to insert into a Hive table when the table has columns with complex types like Arrays?
January 26, 2022
How to find out if a directory in HDFS is empty or not?
February 9, 2022
How to insert into a Hive table when the table has columns with complex types like Arrays?
January 26, 2022
How to find out if a directory in HDFS is empty or not?
February 9, 2022

Both get_json_object and json_tuple functions in Hive are meant to work with JSON data in Hive. 

Let’s create a table with some JSON data to work with. We have created a table named hirw_courses and loaded the below JSON text into the table.

 CREATE TABLE IF NOT EXISTS hirw_courses (course string) STORED AS TEXTFILE;

 {"course": {"course1": {"name": "Hadoop Developer In Real World","type": "online"}, "course2": {"name": "Spark Developer In Real World", "type": "online"}},"email": "info@hadoopinrealworld.com","owner": "Big Data"} 

When the above JSON string is formatted it looks like below.

 {
 "course": {
 "course1": {
 "name": "Hadoop Developer In Real World",
 "type": "online"
 },
 "course2": {
 "name": "Spark Developer In Real World",
 "type": "online"
 }
 },
 "email": "info@hadoopinrealworld.com",
 "owner": "Big Data"
 } 

get_json_obect

get_json_object() is a UDF and can take a JSON string as input and you can pass in the key that you want to be extracted from the JSON.

Here we are selecting 2 values from the JSON – owner and course. And we ended up calling the get_json_object twice.

 SELECT get_json_object(course, '$.owner'), get_json_object(course, '$.course')  FROM hirw_courses;
 +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+
 |    _c0    |                                                                    _c1                                                                    |
 +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+
 | Big Data  | {"course1":{"name":"Hadoop Developer In Real World","type":"online"},"course2":{"name":"Spark Developer In Real World","type":"online"}}  |
 +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+
 1 row selected (0.125 seconds) 

json_tuple

json_tuple is a User Defined Table generating Function (UDTF). <<Check out the post, if you want know the difference between UDF, UDAF and UDTFs)

With json_tuple you can pass in the json string followed by a list of keys for which you want to extract the values from JSON.

Since json_tuple is an UDTF, you can also use it with LATERAL VIEW which can be quite powerful.

 SELECT json_tuple(course, 'owner', 'course')  FROM hirw_courses;
 +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+
 |    c0     |                                                                    c1                                                                     |
 +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+
 | Big Data  | {"course1":{"name":"Hadoop Developer In Real World","type":"online"},"course2":{"name":"Spark Developer In Real World","type":"online"}}  |
 +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+
 1 row selected (0.177 seconds) 

get_json_object vs. json_tuple – which is better?

With get_json_object, if we are trying to extract 5 values from a JSON string we would have to call the function 5 times and this means the same JSON data is parsed through 5 times.

With json_tuple, we parse the JSON data only once even when we need to parse or extract 5 values from JSON.

So always use json_tuple in place of get_json_object for better performance.

Big Data In Real World
Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

Comments are closed.

What is the difference between get_json_object and json_tuple functions in Hive?
This website uses cookies to improve your experience. By using this website you agree to our Data Protection Policy.

Hadoop In Real World is now Big Data In Real World!

X