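Whenever JSON shows up in data processing, my head starts to hurt. For a long time I knew lateral view was a great tool, but I resisted learning it. I always asked my data warehouse colleagues to sort out the fields first and then used them directly. As a rookie, I also took the chance to sympathize with, and bow to, the warehouse folks buried in dirty data. Thanks for the hard work, you legends.
A while ago I joined a data-building project as the vanguard. With no way around the JSON, I gritted my teeth and finally learned how to use lateral view. My takeaway: the difficulty was only a mountain in my own head, and it turned out to be no big deal!
I'm sure I'll forget it within a few days (actually I've already forgotten a bit...), and since the palest ink beats the best memory, I'm writing it down here.
Basic syntax: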
select
*
from T t
lateral view json_tuple(t.json_txt, 'key1', 'key2', ...) q as item1, item2, ...
Suppose table T has a json_txt column whose values look like this:
{
"student_no":"0001",
"student_name":"zhangxiaoxiao",
"class":"高三(1)班",
"score_detail":{
"scoreList":[{"scores":[
{"course":"语文","score":100,"rank":2}
,{"course":"数学","score":120,"rank":9}
,{"course":"英语","score":110,"rank":6}
,{"course":"化学","score":90,"rank":4}
,{"course":"物理","score":90,"rank":3}
,{"course":"生物","score":90,"rank":2}
]
}]
},
"total_score":"600",
"overal_rank":"3"
}
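A side note on the sample: JSON parsers are generally strict, so strings need double quotes and there must be no trailing comma after the last field. Here is a minimal Python sketch of that check, using a trimmed-down copy of the record. Python is used only because it runs anywhere, and the helper name is_strict_json is made up for illustration. How json_tuple itself reacts to malformed input is not shown here and is worth checking against your engine.

```python
import json

# Strict form of a trimmed-down sample record.
strict = '{"student_no": "0001", "student_name": "zhangxiaoxiao", "overal_rank": "3"}'

# Same record with a single-quoted string and a trailing comma.
sloppy = "{\"student_no\": \"0001\", \"student_name\": 'zhangxiaoxiao', \"overal_rank\": \"3\",}"


def is_strict_json(text):
    """Return True if text parses as strict JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_strict_json(strict))  # True
print(is_strict_json(sloppy))  # False
```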
To get every field for each student, the information inside the JSON has to be parsed out.
select
t.* -- keep the table's other original columns
,q.student_no
,q.student_name
,q.class
,q.total_score
,q.overal_rank
,q.course,q.score,q.rank
from T t
lateral view json_tuple(t.json_txt,
"student_no",
"student_name",
"class",
"total_score",
"overal_rank",
"score_detail.scoreList.[*].scores.[*].course",
"score_detail.scoreList.[*].scores.[*].score",
"score_detail.scoreList.[*].scores.[*].rank"
) q as student_no,student_name,class,total_score,overal_rank,course,score,rank
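To make the path strings in the query above less mysterious, here is a rough Python sketch of how a dotted path with [*] steps can be walked. It is only an emulation under assumptions: the function name extract is invented, and the bracketed, quoted output format is inferred from the characters this post strips out later, so the exact string the engine returns may differ.

```python
import json


def extract(doc, path):
    """Follow a dotted path; a "[*]" step fans out over every list element."""
    nodes = [doc]
    wildcard = False
    for step in path.split("."):
        if step == "[*]":
            wildcard = True
            # Flatten one level: every element of every current list.
            nodes = [item for node in nodes for item in node]
        else:
            nodes = [node[step] for node in nodes]
    if wildcard:
        # Assumed output shape: a compact JSON array string.
        return json.dumps(nodes, ensure_ascii=False, separators=(",", ":"))
    value = nodes[0]
    return value if isinstance(value, str) else json.dumps(value, ensure_ascii=False)


# Trimmed-down copy of the sample record (two courses only).
record = json.loads("""
{"student_no": "0001",
 "score_detail": {"scoreList": [{"scores": [
     {"course": "语文", "score": 100, "rank": 2},
     {"course": "数学", "score": 120, "rank": 9}]}]},
 "total_score": "600"}
""")

print(extract(record, "student_no"))  # 0001
print(extract(record, "score_detail.scoreList.[*].scores.[*].course"))  # ["语文","数学"]
print(extract(record, "score_detail.scoreList.[*].scores.[*].score"))  # [100,120]
```

The point to notice is that a top-level key yields one plain value, while a path with [*] yields one string holding all matching values for that student.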
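The result looks like this:
However, course, score, and rank each come back as an array packed into a single row, which is awkward for calculations. The trans_array() function takes care of that: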
select
trans_array(5,',',student_no,student_name,class,total_score,overal_rank,course,score,rank) as (student_no,student_name,class,total_score,overal_rank,course,score,rank)
from (
select
student_no,student_name,class,total_score,overal_rank
,regexp_replace(course,'(\\[)|(\\])|(")','') as course -- strip the [ ] " characters
,regexp_replace(score,'(\\[)|(\\])|(")','') as score -- strip the [ ] " characters
,regexp_replace(rank,'(\\[)|(\\])|(")','') as rank -- strip the [ ] " characters
from result -- assumed to hold the output of the previous query
) t
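For intuition about what the query above does, here is a small Python emulation of the two steps: strip the brackets and quotes, then split the last three columns on commas and emit one row per position while repeating the first five key columns. The Python functions trans_array and strip_brackets are stand-ins written for this sketch, not the engine's implementation, and the sketch simply stops at the shortest list, so check the documentation for how uneven lengths are handled.

```python
import re


def strip_brackets(text):
    """Remove [, ] and " (the characters the regexp_replace step targets)."""
    return re.sub(r'[\[\]"]', "", text)


def trans_array(num_keys, sep, *cols):
    """Repeat the first num_keys values; split the rest on sep and zip them into rows."""
    keys = cols[:num_keys]
    parts = [col.split(sep) for col in cols[num_keys:]]
    return [keys + values for values in zip(*parts)]


# One row as it might look after json_tuple (three courses only, for brevity).
row = ("0001", "zhangxiaoxiao", "高三(1)班", "600", "3",
       '["语文","数学","英语"]', "[100,120,110]", "[2,9,6]")

cleaned = row[:5] + tuple(strip_brackets(col) for col in row[5:])
rows = trans_array(5, ",", *cleaned)
for out in rows:
    print(out)
```

Each printed tuple carries the five student-level values followed by one course, one score, and one rank.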
The result is the score detail laid out vertically, one row per course: