HIVE Json
vi json.txt
{"Foo": "ABC", "Bar": "20","Quux":{ "QuuxId": 1234,"QuuxName": "Sam"} }
hive> create table json1(line string);
hive> load data local inpath 'json.txt' into table json1;
hive> select * from json1;
{"Foo": "ABC", "Bar": "20","Quux":{ "QuuxId": 1234,"QuuxName": "Sam"} }
hive> select get_json_object(line,'$.Foo') j from json1;
ABC
hive> select get_json_object(line,'$.Quux') from json1;
{"QuuxId":1234,"QuuxName":"Sam"}
hive> select get_json_object(line,'$.Quux.QuuxId') j from json1;
1234
hive> select get_json_object(line,'$.Quux.QuuxName') j from json1;
Sam
-----------------------------------------------------
Here it is on one line for easy copy and pasting:
Let's create a Hive table to reference this. I've put the above document in a file called simple.json:
Since there are no delimiters, we leave off the ROW FORMAT section of the table DDL
Returns the full JSON document.
So do this to query all the fields:
You should get the output:
(Note: to get the header fields, enter
This works and has a nice JavaScript like "dotted" notation, but notice that you have to parse the same document once for every field you want to pull out of your JSON document, so it is rather inefficient.
The Hive wiki recommends using
This returns:
It doesn't know how to look inside the Quux subdocument. And this is where
This gives us the output we want:
With a complicated highly nested JSON doc, json_tuple is also quite
inefficient and clunky as hell. So let's turn to a custom SerDe to
solve this problem.
To get that SerDe, clone the project from GitHub and run
Then tell Hive about it with:
You can do this either at the hive prompt or put it in your
Now let's define the Hive schema that this SerDe expects and load the simple.json doc:
With the openx JsonSerDe, you can define subdocuments as maps or
structs. I prefer structs, as it allows you to use the convenient
dotted-path notation (e.g., Quux.QuuxId) and you can match the case of
the fields. With maps, all the keys you pass in have to be lowercase,
even if you defined them as upper or mixed case in your JSON.
The query to match the above examples is beautifully simple:
Result:
And now let's do a more complex JSON document:
Collapsed version:
Hive Schema:
Load the data:
First let's query something from each document section. Since we know
there are two orders in the orders array we can reference them both
directly:
Result:
But what if we don't know how many orders there are and we want a list of all a user's order Ids? This will work:
Result:
Oooh, it returns an array of ItemIds. Pretty cool. One of Hive's nice features.
Finally, does the openx JsonSerDe require me to define the whole schema? Or what if I have two JSON docs (say version 1 and version 2) where they differ in some fields? How constraining is this Hive schema definition?
Let's add two more JSON entries to our JSON document - the first has no orders; the second has a new "PostalCode" field in Shipping Address.
Collapsed version:
Add those records to complex.json and reload the data into the complex_json table.
Now try the query:
It works just fine and gives the result:
Any field not present will just return null, as Hive normally does even for non-JSON formats.
Note that we cannot query for User.ShippingAddress.PostalCode because we haven't put it on our Hive schema. You would have to revise the schema and then reissue the query.
vi json.txt
{"Foo": "ABC", "Bar": "20","Quux":{ "QuuxId": 1234,"QuuxName": "Sam"} }
hive> create table json1(line string);
hive> load data local inpath 'json.txt' into table json1;
hive> select * from json1;
{"Foo": "ABC", "Bar": "20","Quux":{ "QuuxId": 1234,"QuuxName": "Sam"} }
hive> select get_json_object(line,'$.Foo') j from json1;
ABC
hive> select get_json_object(line,'$.Quux') from json1;
{"QuuxId":1234,"QuuxName":"Sam"}
hive> select get_json_object(line,'$.Quux.QuuxId') j from json1;
1234
hive> select get_json_object(line,'$.Quux.QuuxName') j from json1;
Sam
-----------------------------------------------------
Here it is on one line for easy copy and pasting:
1
| { "Foo" : "ABC" , "Bar" : "20090101100000" , "Quux" :{ "QuuxId" :1234, "QuuxName" : "Sam" }} |
1
2
3
| CREATE TABLE json_table ( json string ); LOAD DATA LOCAL INPATH '/tmp/simple.json' INTO TABLE json_table; |
Built in function #1: get_json_object
Theget_json_object
takes two arguments: tablename.fieldname and the JSON field to parse, where '$' represents the root of the document.
1
| select get_json_object(json_table.json, '$' ) from json_table; |
So do this to query all the fields:
1
2
3
4
5
| select get_json_object(json_table.json, '$.Foo' ) as foo, get_json_object(json_table.json, '$.Bar' ) as bar, get_json_object(json_table.json, '$.Quux.QuuxId' ) as qid, get_json_object(json_table.json, '$.Quux.QuuxName' ) as qname from json_table; |
foo bar qid qname
ABC 20090101100000 1234 Sam
set hive.cli.print.header=true
at the hive prompt or in your $HOME/.hiverc
file.)This works and has a nice JavaScript like "dotted" notation, but notice that you have to parse the same document once for every field you want to pull out of your JSON document, so it is rather inefficient.
The Hive wiki recommends using
json_tuple
for this reason.Built in function #2: json_tuple
So let's see whatjson_tuple
looks like. It
has the benefit of being able to pass in multiple fields, but it only
works to a single level deep. You also need to use Hive's slightly odd LATERAL VIEW
notation:
1
2
3
4
| select v.foo, v.bar, v.quux, v.qid from json_table jt LATERAL VIEW json_tuple(jt.json, 'Foo' , 'Bar' , 'Quux' , 'Quux.QuuxId' ) v as foo, bar, quux, qid; |
foo bar quux qid
ABC 20090101100000 {"QuuxId":1234,"QuuxName":"Sam"} NULL
json_tuple
gets clunky fast - you have to create another lateral view for each subdocument you want to descend into:
1
2
3
4
5
6
| select v1.foo, v1.bar, v2.qid, v2.qname from json_table jt LATERAL VIEW json_tuple(jt.json, 'Foo' , 'Bar' , 'Quux' ) v1 as foo, bar, quux LATERAL VIEW json_tuple(v1.quux, 'QuuxId' , 'QuuxName' ) v2 as qid, qname; |
foo bar qid qname
ABC 20090101100000 1234 Sam
The best option: rcongiu's Hive-JSON SerDe
A SerDe is a better choice than a json function (UDF) for at least two reasons:- it only has to parse each JSON record once
- you can define the JSON schema in the Hive table schema, making it much easier to issue queries against.
To get that SerDe, clone the project from GitHub and run
mvn package
. It creates a json-serde-1.1.6.jar
in the target directory. If you have a place you like to put your jars for runtime referencing move it there. Then tell Hive about it with:
ADD JAR /path/to/json-serde-1.1.6.jar;
$HOME/.hiverc
file.Now let's define the Hive schema that this SerDe expects and load the simple.json doc:
1
2
3
4
5
6
7
8
| CREATE TABLE json_serde ( Foo string, Bar string, Quux struct<QuuxId: int , QuuxName:string> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' ; LOAD DATA LOCAL INPATH '/tmp/simple.json' INTO TABLE json_serde; |
The query to match the above examples is beautifully simple:
1
2
| SELECT Foo, Bar, Quux.QuuxId, Quux.QuuxName FROM json_serde; |
foo bar quuxid quuxname
ABC 20090101100000 1234 Sam
And now let's do a more complex JSON document:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
| { "DocId" : "ABC" , "User" : { "Id" : 1234, "Username" : "sam1234" , "Name" : "Sam" , "ShippingAddress" : { "Address1" : "123 Main St." , "Address2" : null , "City" : "Durham" , "State" : "NC" }, "Orders" : [ { "ItemId" : 6789, "OrderDate" : "11/11/2012" }, { "ItemId" : 4352, "OrderDate" : "12/12/2012" } ] } } |
1
| { "DocId" : "ABC" , "User" :{ "Id" :1234, "Username" : "sam1234" , "Name" : "Sam" , "ShippingAddress" :{ "Address1" : "123 Main St." , "Address2" : "" , "City" : "Durham" , "State" : "NC" }, "Orders" :[{ "ItemId" :6789, "OrderDate" : "11/11/2012" },{ "ItemId" :4352, "OrderDate" : "12/12/2012" }]}} |
1
2
3
4
5
6
7
8
9
10
11
12
13
| CREATE TABLE complex_json ( DocId string, User struct<Id: int , Username:string, Name : string, ShippingAddress:struct<Address1:string, Address2:string, City:string, State:string>, Orders:array<struct<ItemId: int , OrderDate:string>>> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' ; |
1
2
| LOAD DATA LOCAL INPATH '/tmp/complex.json' OVERWRITE INTO TABLE complex_json; |
1
2
3
4
| SELECT DocId, User .Id, User .ShippingAddress.City as city, User .Orders[0].ItemId as order0id, User .Orders[1].ItemId as order1id FROM complex_json; |
docid id city order0id order1id
ABC 1234 Durham 6789 4352
1
2
| SELECT DocId, User .Id, User .Orders.ItemId FROM complex_json; |
docid id itemid
ABC 1234 [6789,4352]
Finally, does the openx JsonSerDe require me to define the whole schema? Or what if I have two JSON docs (say version 1 and version 2) where they differ in some fields? How constraining is this Hive schema definition?
Let's add two more JSON entries to our JSON document - the first has no orders; the second has a new "PostalCode" field in Shipping Address.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
| { "DocId" : "ABC" , "User" : { "Id" : 1235, "Username" : "fred1235" , "Name" : "Fred" , "ShippingAddress" : { "Address1" : "456 Main St." , "Address2" : "" , "City" : "Durham" , "State" : "NC" } } } { "DocId" : "ABC" , "User" : { "Id" : 1236, "Username" : "larry1234" , "Name" : "Larry" , "ShippingAddress" : { "Address1" : "789 Main St." , "Address2" : "" , "City" : "Durham" , "State" : "NC" , "PostalCode" : "27713" }, "Orders" : [ { "ItemId" : 1111, "OrderDate" : "11/11/2012" }, { "ItemId" : 2222, "OrderDate" : "12/12/2012" } ] } } |
1
2
| { "DocId" : "ABC" , "User" :{ "Id" :1235, "Username" : "fred1235" , "Name" : "Fred" , "ShippingAddress" :{ "Address1" : "456 Main St." , "Address2" : "" , "City" : "Durham" , "State" : "NC" }}} { "DocId" : "ABC" , "User" :{ "Id" :1236, "Username" : "larry1234" , "Name" : "Larry" , "ShippingAddress" :{ "Address1" : "789 Main St." , "Address2" : "" , "City" : "Durham" , "State" : "NC" , "PostalCode" : "27713" }, "Orders" :[{ "ItemId" :1111, "OrderDate" : "11/11/2012" },{ "ItemId" :2222, "OrderDate" : "12/12/2012" }]}} |
Now try the query:
1
2
| SELECT DocId, User .Id, User .Orders.ItemId FROM complex_json; |
docid id itemid
ABC 1234 [6789,4352]
ABC 1235 null
ABC 1236 [1111,2222]
Note that we cannot query for User.ShippingAddress.PostalCode because we haven't put it on our Hive schema. You would have to revise the schema and then reissue the query.
No comments:
Post a Comment