hive sql面试题及答案
本文所有展示了输出结果的sql都是我自己编写的,很可能不是最优解请谨慎参考。
所有的理论题和没有结果的SQL题都是我搜索来的也要谨慎参考
1.1窗口函数
1.1.1了解哪些窗口函数,都是什么功能?找一个在某个业务中的应用?
over (分组 排序 窗口):函数+over() :对每行都是窗口 max(col) over() :分组对col求最大值 min(col) over() :分组对col求最小值
sum(col) over() :分组对col累计求和
avg(col) over() :分组求col列的平均值
first_value(col) over() : 某分区排序后的第一个col值
序列函数
last_value(col) over() : 某分区排序后的最后一个col值
lag(col,n,default) :取往前n行的col的值,n可选,默认1,default当往上n行时是null,取默认值,不指定null
lead(col,n,default) :取往后n行的col的值,n可选,默认1,default当往下n行时是null,取默认值,不指定null
ntile(n):⽤于将分组数据按照顺序切分成n⽚,返回当前切⽚值
排名函数
row_number() over():没有并列,相同名次依顺序排
rank() over():有并列,相同名次空位
dense_rank() over():有并列,相同名次不空位
1.1.2 编写sql实现每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数
数据:
userid,month,visits
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,1
每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数,结果数据格式如下:
答案:
CREATE TABLE IF NOT EXISTS user_visit
(
userid string,
MONTH string,
visits int
) ROW format delimited fields terminated BY ',';
load data local inpath '/root/downfile/userVisit.txt' into table user_visit;
select
temp.userid userid,
temp.month month,
temp.sumv month_visit,
max(temp.sumv) over(partition by temp.userid order by temp.month) month_max,
sum(temp.sumv) over(partition by temp.userid order by temp.month) month_total
from
(select
uv.userid userid,
uv.month month,
sum(visits) sumv
from user_visit uv
group by uv.userid,uv.month) temp;
1.1.3 求出每个栏目的被观看次数及累计观看时长
数据:
vedio表
Uid channl min
1 1 23
2 1 12
3 1 12
4 1 32
5 1 342
6 2 13
7 2 34
8 2 13
9 2 134
答案:
CREATE TABLE if not exists vedio
(
uid int,
channl int,
min int
)row format delimited fields terminated by '\t';
load data local inpath '/root/downfile/vedio.txt' into table vedio;
select
channl,
sum(min) min_total,
count(channl) channl_total
from vedio
group by channl;
1.1.4 编写连续7天登录的总人数
数据:
t1表
Uid dt login_status(1登录成功,0异常)
1 2019-07-11 1
1 2019-07-12 1
1 2019-07-13 1
1 2019-07-14 1
1 2019-07-15 1
1 2019-07-16 1
1 2019-07-17 1
1 2019-07-18 1
2 2019-07-11 1
2 2019-07-12 1
2 2019-07-13 0
2 2019-07-14 1
2 2019-07-15 1
2 2019-07-16 0
2 2019-07-17 1
2 2019-07-18 0
3 2019-07-11 1
3 2019-07-12 1
3 2019-07-13 1
3 2019-07-14 1
3 2019-07-15 1
3 2019-07-16 1
3 2019-07-17 1
3 2019-07-18 1
答案:
create table if not exists t1
(
uid int,
dt string,
login_status int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/t1.txt' into table t1;
select
count(tmp3.uid) total_per
from
(select
tmp2.uid uid,
max(tmp2.cnt) cnt_max
from
(select
tmp1.uid uid,
count(1) cnt
from
(select
tmp.uid uid,
date_sub(tmp.dt,row_num) datasub
from
(select
t.uid uid,
t.dt dt,
row_number() over(partition by t.uid order by dt) row_num
from t1 t
where login_status!=0
) tmp) tmp1
group by tmp1.uid,tmp1.datasub) tmp2
group by tmp2.uid) tmp3
where tmp3.cnt_max>=7;
1.1.5 你知道的排名函数有哪些?说一说它们之间的区别?
排名函数
row_number() over():没有并列,相同名次依顺序排
rank() over():有并列,相同名次空位
dense_rank() over():有并列,相同名次不空位
1.1.6 编写sql语句实现每班前三名,分数一样并列,同时求出前三名按名次排序的一次的分差:
数据:
stu表
Stu_no class score
1 1901 90
2 1901 90
3 1901 83
4 1901 60
5 1902 66
6 1902 23
7 1902 99
8 1902 67
9 1902 87
编写sql实现,结果如下:
结果数据:
班级 stu_no score rn rn1 rn_diff
1901 1 90 1 1 90
1901 2 90 2 1 0
1901 3 83 3 1 -7
1902 7 99 1 1 99
1902 9 87 2 2 -12
1902 8 67 3 3 -20
答案:
create table if not exists stu
(
Stu_no int,
class int,
score int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/stu.txt' into table stu;
select
tmp.stu_no stu_no,
tmp.class class,
tmp.score score,
RANK() over(distribute by tmp.class sort by tmp.score desc) rn,
DENSE_RANK() over(distribute by tmp.class sort by tmp.score desc) rn1,
tmp.score-nvl(lag(tmp.score) over (distribute by tmp.class sort by tmp.score desc),0) rn_diff
from
(select
st.Stu_no stu_no,
st.class class,
st.score score,
row_number() over(distribute by st.class sort by st.score desc) row_num
from stu st) tmp
where tmp.row_num<4;
1.1.7 每个店铺的当月销售额和累计到当月的总销售额
数据:
店铺,月份,金额
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250
编写Hive的HQL语句求出每个店铺的当月销售额和累计到当月的总销售额?
答案:
create table if not exists store
(
store string,
month string,
money string
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/store.txt' into table store;
select
tmp.store,
tmp.month,
tmp.sum_money,
sum(tmp.sum_money) over(partition by tmp.store order by tmp.month)
from
(select
st.store store,
st.month month,
sum(money) sum_money
from store st
group by st.store,st.month) tmp;
1.1.8 使用hive的hql如下
答案:
create table if not exists user_action_log
(
U_id int,
Time string,
Action string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/user_action_log.txt' into table user_action_log;
select
tmp.*
from
(select
first_value(ual.U_id) over(distribute by ual.U_id sort by ual.Time) uid,
first_value(ual.Time) over(distribute by ual.U_id sort by ual.Time) time,
first_value(ual.Action) over(distribute by ual.U_id sort by ual.Time) action
from user_action_log ual) tmp
group by tmp.uid,tmp.time,tmp.action
1.1.9 订单及订单类型行列互换
t1表:
order_id order_type order_time
111 N 10:00
111 A 10:05
111 B 10:10
是用hql获取结果如下:
order_id order_type_1 order_type_2 order_time_1 order_time_2
111 N A 10:00 10:05
111 A B 10:05 10:10
答案:
create table if not exists tmp1
(
order_id int,
order_type string,
order_time string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/tmp1.txt' into table tmp1;
select
tmp.*
from
(select
order_id,
order_type order_type_1,
lead(order_type) over(partition by order_id) as order_type_2,
order_time order_time_1,
lead(order_time) over(partition by order_id) as order_time_2
from
tmp1) tmp
where tmp.order_type_2 is not null;
1.1.10 某APP每天访问数据存放在表access_log里面,包含日期字段 ds,用户类型字段user_type,用户账号user_id,用户访问时间 log_time,请使用hive的hql语句实现如下需求:
(1)、每天整体的访问UV、PV?
(2)、每天每个类型的访问UV、PV?
(3)、每天每个类型中最早访问时间和最晚访问时间?
(4)、每天每个类型中访问次数最高的10个用户?
注意:这道题没有数据我随便写的,只写了思路
create table if not exists `access_log`
(
date1 sstring,
user_type string,
user_id int,
log_time string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/access_log.txt' into table access_log;
select
count(user_id) over(distribute by user_id) uv,
sum(user_id) over(distribute by log_time) pv
from access_log
select
count(user_id) uv
sum(user_id) over(distribute by log_time) pv
from access_log al1
inner join
access_log al2
group by
user_type
select
first_value(log_time) over(distribute by user_type order by log_time) first_time,
last_value(log_time) over(distribute by user_type order by log_time)
from access_log
select
user_id
from
(select
count(user_id) cnt
row_number() over(distribute by user_type order by count(user_id)) rows
from access_log) tmp
where tmp.rows<=10
1.1.11 每个用户连续登陆的最大天数?
数据:
login表
uid,date
1,2019-08-01
1,2019-08-02
1,2019-08-03
2,2019-08-01
2,2019-08-02
3,2019-08-01
3,2019-08-03
4,2019-07-28
4,2019-07-29
4,2019-08-01
4,2019-08-02
4,2019-08-03
结果如下:
uid cnt_days
1 3
2 2
3 1
4 3
答案:
create table if not exists login
(
uid int,
date string
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/login.txt' into table login;
select
tmp2.uid,
max(tmp2.cnt)
from
(select
tmp1.uid uid,
count(1) cnt
from
(select
tmp.uid uid,
tmp.date1 date1,
date_sub(tmp.date1,row_num) datesub
from
(select
uid uid,
date date1,
row_number() over(distribute by uid sort by date) row_num
from login lg) tmp
) tmp1
group by tmp1.uid,tmp1.datesub) tmp2
group by tmp2.uid;
1.1.12 使用hive的hql实现男女各自第一名及其它
id sex chinese_s math_s
0 0 70 50
1 0 90 70
2 1 80 90
1、男女各自语文第一名(0:男,1:女)
2、男生成绩语文大于80,女生数学成绩大于70
答案:
create table if not exists `child`
(
id int,
sex int,
chinese_s int,
math_s int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/child.txt' into table child;
select
ch.sex,
max(ch.chinese_s)
from child ch
group by ch.sex;
select
id,
sex,
chinese_s,
math_s
from child
where (sex=0 and chinese_s>80 ) or (sex=1 and math_s>70);
1.1.13 使用hive的hql实现最大连续访问天数
log_time uid
2018-10-01 18:00:00,123
2018-10-02 18:00:00,123
2018-10-02 19:00:00,456
2018-10-04 18:00:00,123
2018-10-04 18:00:00,456
2018-10-05 18:00:00,123
2018-10-06 18:00:00,123
答案:
create table if not exists `a`
(
log_time string,
uid int
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/a.txt' into table a;
select
tmp2.uid,
max(tmp2.cnt)
from
(select
tmp1.uid uid,
count(1) cnt
from
(select
tmp.uid uid,
tmp.logtime logtime,
date_sub(logtime,row_num) datesub
from
(select
a.uid uid,
a.log_time logtime,
row_number() over(distribute by a.uid sort by a.log_time) row_num
from a) tmp) tmp1
group by tmp1.uid,tmp1.datesub) tmp2
group by tmp2.uid;
1.2行列互换
1.2.1 对于行列互换,你有哪些解决方式,详细说明每一种方式
行转列:
1、使用case when 查询出多列即可,即可增加列。
列转行:
1、lateral view explode(),使用炸裂函数可以将1列转成多行,被转换列适用于array、map等类型。 lateral view posexplode(数组),如有排序需求,则需要索引。将数组炸开成两行(索引 , 值),需要 as 两个别名。
2、case when 结合concat_ws与collect_set/collect_list实现。内层用case when,外层用 collect_set/list收集,对搜集完后用concat_ws分割连接形成列。
1.2.2 编写sql实现行列互换:
数据如下:
编写sql实现,得到结果如下:
答案:
CREATE TABLE IF NOT EXISTS `scores`
(
id int,
userid string,
subject string,
score double
)row format delimited fields terminated by '\t';
load data local inpath '/root/downfile/scores.txt' into table scores;
select
tmp1.userid userid,
tmp1.chinese chinese,
tmp1.math math,
tmp1.english english,
tmp1.political political,
tmp1.chinese+tmp1.math+tmp1.english+tmp1.political total
from
(select
userid,
sum(case subject when '语文' then score else 0 end) chinese,
sum(case subject when '数学' then score else 0 end) math,
sum(case subject when '英语' then score else 0 end) english,
sum(case subject when '政治' then score else 0 end) political
from scores
group by userid
union
select
"total",
sum(tmp.chinese) chinese,
sum(tmp.math) math,
sum(tmp.english) english,
sum(tmp.political) political
from
(select
userid,
sum(case subject when '语文' then score else 0 end) chinese,
sum(case subject when '数学' then score else 0 end) math,
sum(case subject when '英语' then score else 0 end) english,
sum(case subject when '政治' then score else 0 end) political
from scores group by userid) tmp) tmp1
1.2.3 编写sql实现如下:
数据:
t1表
uid tags
1 1,2,3
2 2,3
3 1,2
编写sql实现如下结果:
uid tag
1 1
1 2
1 3
2 2
2 3
3 1
3 2
答案:
create table if not exists `table1`
(
uid int,
tags array<int>
)row format delimited fields terminated by ' '
collection items terminated by ',';
load data local inpath '/root/downfile/table1.txt' into table table1;
select
uid,
tag
from table1
lateral view explode(tags) tag1 as tag;
1.2.4 用户标签连接查询
数据:
T1表:
Tags
1,2,3
1,2
2,3
T2表:
Id lab
1 A
2 B
3 C
根据T1和T2表的数据,编写sql实现如下结果:
ids tags
1,2,3 A,B,C
1,2 A,B
2,3 B,C
答案:
select
tmp.tags tags,
concat_ws(',',collect_list(tb2.lab)) labs
from
(select
tags,
id
from tab1
lateral view explode(tags) tag as id) tmp
left join tab2 tb2
on tmp.id=tb2.id
group by tmp.tags;
1.2.5 用户标签组合
数据:
t1表:
id tag flag
a b 2
a b 1
a b 3
c d 6
c d 8
c d 8
编写sql实现如下结果:
id tag flag
a b 1|2|3
c d 6|8
答案:
create table if not exists `table2`
(
id string,
tag string,
flag string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/table2.txt' into table table2;
select
id,
tag,
concat_ws('|',collect_list(table2.flag)) flag
from table2
group by id,tag;
1.2.6 用户标签行列互换
数据:
t1表
uid name tags
1 goudan chihuo,huaci
2 mazi sleep
3 laotie paly
编写sql实现如下结果:
uid name tag
1 goudan chihuo
1 goudan huaci
2 mazi sleep
3 laotie paly
答案:
create table if not exists `table3`
(
uid string,
name string,
tags array<string>
)row format delimited fields terminated by ' '
collection items terminated by ',';
load data local inpath '/root/downfile/table3.txt' into table table3;
select
uid ,
name ,
tag
from table3
lateral view explode(tags) tag1 as tag;
1.2.7 hive实现词频统计
数据:
t1表:
uid contents
1 i|love|china
2 china|is|good|i|i|like
统计结果如下,如果出现次数一样,则按照content名称排序:
content cnt
i 3
china 2
good 1
like 1
love 1
is 1
答案:
create table if not exists `table4`
(
uid int,
contents string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/table4.txt' into table table4;
select
tmp2.content content,
tmp2.cnt cnt
from
(select
tmp.content content,
count(tmp.uid) cnt
from
(select
table4.uid,
content
from table4
lateral view explode(split(contents,'\\|')) contents1 as content) tmp
group by tmp.content ) tmp2
order by cnt desc,content;
1.2.8 课程行转列
数据:
t1表
id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e
根据编写sql,得到结果如下(表中的1表示选修,表中的0表示未选修):
id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
答案:
CREATE TABLE IF NOT EXISTS `table5`
(
id int,
course string
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/table5.txt' into table table5;
select
id,
sum(case course when 'a' then 1 else 0 end) a,
sum(case course when 'b' then 1 else 0 end) b,
sum(case course when 'c' then 1 else 0 end) c,
sum(case course when 'd' then 1 else 0 end) d,
sum(case course when 'e' then 1 else 0 end) e,
sum(case course when 'f' then 1 else 0 end) f
from table5
group by id;
1.2.9 兴趣行转列
t1表
name sex hobby
janson 男 打乒乓球、游泳、看电影
tom 男 打乒乓球、看电影
hobby最多3个值,使用hql实现结果如下:
name sex hobby1 hobby2 hobby3
janson 男 打乒乓球 游泳 看电影
tom 男 打乒乓球 看电影
答案:
create table if not exists `table6`
(
name string,
sex string,
hobby string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/table6.txt' into table table6;
select
name,
sex,
if(split(hobby,'、')[0] is not null,split(hobby,'、')[0],"") hobby1,
if(split(hobby,'、')[1] is not null,split(hobby,'、')[1],"") hobby2,
if(split(hobby,'、')[2] is not null,split(hobby,'、')[2],"") hobby3
from table6;
1.2.10 用户商品行列互换
t1表:
用户 商品
A P1
B P1
A P2
B P3
请你使用hql变成如下结果:
用户 P1 P2 P3
A 1 1 0
B 1 0 1
答案:
create table if not exists `table7`
(
user_id string,
goods string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/table7.txt' into table table7;
select
user_id,
sum(case goods when 'P1' then 1 else 0 end) p1,
sum(case goods when 'P2' then 1 else 0 end) p2,
sum(case goods when 'P3' then 1 else 0 end) p3
from table7
group by user_id;
1.2.11 成绩课程行列互换
t1表:
name course score
aa English 75
bb math 85
aa math 90
使用hql输出以下结果
name English math
aa 75 90
bb 0 85
答案:
create table if not exists `table8`
(
name string,
course string,
score int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/table8.txt' into table table8;
select
name,
sum(case course when 'English' then score else 0 end) English,
sum(case course when 'math' then score else 0 end) math
from table8
group by name
;
1.2.12 求top3英雄及其pick率
答案:
create table if not exists `table9`
(
id int,
names array<string>
)row format delimited fields terminated by ' '
collection items terminated by ',';
load data local inpath '/root/downfile/table9.txt' into table table9;
select
tmp.name,
count(name),
count(name)/max(id) pick
from
(select
id,
name
from table9
lateral view explode(names) name1 as name) tmp
group by tmp.name
1.3时间函数
1.3.1 常见的时间函数
1. UNIX时间戳转日期函数: from_unixtime
语法: from_unixtime(bigint unixtime[, string format])
返回值: string
说明: 转化UNIX时间戳(从1970-01-01 00:00:00 UTC到指定时间的秒数)到当前时区的时间格式
举例:
hive> select from_unixtime(1323308943,'yyyyMMdd') from lxw_dual;
20111208
2. 获取当前UNIX时间戳函数: unix_timestamp
语法: unix_timestamp()
返回值: bigint
说明: 获得当前时区的UNIX时间戳
举例:
hive> select unix_timestamp() from lxw_dual;
1323309615
3. 日期转UNIX时间戳函数: unix_timestamp
语法: unix_timestamp(string date)
返回值: bigint
说明: 转换格式为"yyyy-MM-dd HH:mm:ss"的日期到UNIX时间戳。如果转化失败,则返回0。
举例:
hive> select unix_timestamp('2011-12-07 13:01:03') from lxw_dual;
1323234063
4. 指定格式日期转UNIX时间戳函数: unix_timestamp
语法: unix_timestamp(string date, string pattern)
返回值: bigint
说明: 转换pattern格式的日期到UNIX时间戳。如果转化失败,则返回0。
举例:
hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from lxw_dual;
1323234063
5. 日期时间转日期函数: to_date
语法: to_date(string timestamp)
返回值: string
说明: 返回日期时间字段中的日期部分。
举例:
hive> select to_date('2011-12-08 10:03:01') from lxw_dual;
2011-12-08
6. 日期转年函数: year
语法: year(string date)
返回值: int
说明: 返回日期中的年。
举例:
hive> select year('2011-12-08 10:03:01') from lxw_dual;
2011
hive> select year('2012-12-08') from lxw_dual;
2012
7. 日期转月函数: month
语法: month (string date)
返回值: int
说明: 返回日期中的月份。
举例:
hive> select month('2011-12-08 10:03:01') from lxw_dual;
12
hive> select month('2011-08-08') from lxw_dual;
8
8. 日期转天函数: day
语法: day (string date)
返回值: int
说明: 返回日期中的天。
举例:
hive> select day('2011-12-08 10:03:01') from lxw_dual;
8
hive> select day('2011-12-24') from lxw_dual;
24
9. 日期转小时函数: hour
语法: hour (string date)
返回值: int
说明: 返回日期中的小时。
举例:
hive> select hour('2011-12-08 10:03:01') from lxw_dual;
10
10. 日期转分钟函数: minute
语法: minute (string date)
返回值: int
说明: 返回日期中的分钟。
举例:
hive> select minute('2011-12-08 10:03:01') from lxw_dual;
3
11. 日期转秒函数: second
语法: second (string date)
返回值: int
说明: 返回日期中的秒。
举例:
hive> select second('2011-12-08 10:03:01') from lxw_dual;
1
12. 日期转周函数: weekofyear
语法: weekofyear (string date)
返回值: int
说明: 返回日期在当前的周数。
举例:
hive> select weekofyear('2011-12-08 10:03:01') from lxw_dual;
49
13. 日期比较函数: datediff
语法: datediff(string enddate, string startdate)
返回值: int
说明: 返回结束日期减去开始日期的天数。
举例:
hive> select datediff('2012-12-08','2012-05-09') from lxw_dual;
213
14. 日期增加函数: date_add
语法: date_add(string startdate, int days)
返回值: string
说明: 返回开始日期startdate增加days天后的日期。
举例:
hive> select date_add('2012-12-08',10) from lxw_dual;
2012-12-18
15. 日期减少函数: date_sub
语法: date_sub (string startdate, int days)
返回值: string
说明: 返回开始日期startdate减少days天后的日期。
举例:
hive> select date_sub('2012-12-08',10) from lxw_dual;
2012-11-28
1.3.2 时间戳函数:unix_timestamp,from_unixtime
获取当前时间戳:
select unix_timestamp();
获取"2019-07-31 11:57:25"对应的时间戳:
select unix_timestamp('2019-07-31 11:57:25','yyyy-MM-dd HH:mm:ss');
获取"2019-07-31 11:57"对应的时间戳:
select unix_timestamp('2019-07-31 11:57','yyyy-MM-dd HH:mm');
获取时间戳:1564545445所对应的日期和时分秒:
select from_unixtime(1564545445,'yyyyMMdd HH:mm:ss');
获取时间戳:1564545446所对应的日期和小时(yyyy/MM/dd HH):
select from_unixtime(1564545446,'yyyy/MM/dd HH');
1.3.3 时间格式转换:yyyyMMdd -> yyyy-MM-dd
数据:
t1表
20190730
20190731
编写sql实现如下的结果:
2019-07-30
2019-07-31
答案:
create table if not exists `tab3`
(
dates string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/tab3.txt' into table tab3;
select
from_unixtime(unix_timestamp(dates,'yyyyMMdd'),'yyyy-MM-dd') date1
from tab3;
1.4交差并集
1.4.1 使用hive求出两个数据集的差集
数据
t1表:
id name
1 zs
2 ls
t2表:
id name
1 zs
3 ww
结果如下:
id name
2 ls
3 ww
答案:
create table if not exists `tab4`
(
id int,
name string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/tab4.txt' into table tab4;
create table if not exists `tab5`
(
id int,
name string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/tab5.txt' into table tab5;
SELECT tab4.*
FROM tab4
LEFT JOIN tab5 ON tab4.id=tab5.id
WHERE tab5.id IS NULL
UNION
SELECT tab5.*
FROM tab5
LEFT JOIN tab4 ON tab4.id=tab5.id
WHERE tab4.id IS NULL;
1.4.2 两个表A 和B ,均有key 和value 两个字段,写一个SQL语句, 将B表中的value值置成A表中相同key值对应的value值
A:
key vlaue
k1 123
k2 234
k3 235
B:
key value
k1 111
k2 222
k5 246
使用hive的hql实现,结果是B表数据如下:
k1 123
k2 234
k5 246
答案:
create table if not exists `b`
(
`key` string,
`value` int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/b.txt' into table b;
create table if not exists `c`
(
`key` string,
`value` int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/c.txt' into table c;
select
*
from
(select
tmp.key2 `key`,
case when tmp.value1 is not null then tmp.value1 else tmp.value2 end `value`
from
(select
b.key key1,
b.value value1,
c.key key2,
c.value value2
from b
full join
c
on b.key=c.key) tmp) tmp1
where tmp1.key is not null;
1.4.3 有用户表user(uid,name)以及黑名单表Banuser(uid)
1、用left join方式写sql查出所有不在黑名单的用户信息
2、用not exists方式写sql查出所有不在黑名单的用户信息
答案:
select
user.*
from user
left join banuser
on user.uid=banuser.uid
where
banuser.uid is null;
select
user.*
from user
where not exists (select 1 from banuser t where t.uid=user.uid)
1.5函数
1.5.1 hive中coalesce()、nvl()、concat_ws()、collect_list()、 collect_set()、regexp_replace().这几个函数的意义
非空查找函数: COALESCE
语法: COALESCE(T v1, T v2, …)
返回值: T
说明: 返回参数中的第一个非空值;如果所有值都为NULL,那么返回
nvl(T v1,T v2):空值判断。如果v1为空则返回v2,不为空则为v1.v1,v2 为同类型数据
concat_ws(separator,str1,str2,....):指定分隔符(第一位)连接字符串函数。参数需要字符串。
collect_list(T col):将某列的值连在一起,返回字符串数组,有相同的列值不会去重
collect_set(T col):将某列的值连接在一起,返回字符串数组,有相同的列值会去重
regexp_replace(source_string, pattern[, replace_string [, position[,occurrence, [match_parameter]]]]):用一个指定的 replace_string 来替换匹配的模式,从而允许复杂的"搜 索并替换"操作
1.6 聚合
1.6.1 你们公司使用什么来做的cube
使用with cube 、 with rollup 或者grouping sets来实现cube。
详细解释如下:
0、hive一般分为基本聚合和高级聚合
基本聚合就是常见的group by,高级聚合就是grouping set、cube、rollup等。
一般group by与hive内置的聚合函数max、min、count、sum、avg等搭配使用。
1、grouping sets可以实现对同一个数据集的多重group by操作。
事实上grouping sets是多个group by进行union all操作的结合,它仅使用一个stage完成这些操作。
grouping sets的子句中如果包换() 数据集,则表示整体聚合。多用于指定的组合查询。
2、cube俗称是数据立方,它可以时限hive任意维度的组合查询。
即使用with cube语句时,可对group by后的维度做任意组合查询
如:group a,b,c with cube ,则它首先group a,b,c 然后依次group by a,c 、 group by b,c、group by a,b 、group a 、group b、group by c、group by () 等这8种组合查询,所以一般cube个数=2^3个。2是定 值,3是维度的个数。多用于无级联关系的任意组合查询。
3、rollup是卷起的意思,俗称层级聚合,相对于grouping sets能指定多少种聚合,而with rollup则表示从左 往右的逐级递减聚合,如:group by a,b,c with rollup 等价于 group by a, b, c grouping sets( (a, b, c), (a, b), (a), ( )).直到逐级递减为()为止,多适用于有级联关系的组合查询,如国家、省、市级联组合查 询。
4、Grouping__ID在hive2.3.0版本被修复过,修复后的发型版本和之前的不一样。对于每一列,如果这列 被聚合 过则返回0,否则返回1。应用场景暂时很难想到用于哪儿。
5、grouping sets/cube/rollup三者的区别: 注: grouping sets是指定具体的组合来查询。 with cube 是group by后列的所有的维度的任意组合查询。
with rollup 是group by后列的从左往右逐级递减的层级组合查询。 cube/rollup 后不能加()来选择列,hive是要求这样。
1.7 正则
1.7.1 访问日志正则提取
表t1(注:数据时正常的访问日志数据,分隔符全是空格)
8.35.201.160 - - [16/May/2018:17:38:21 +0800] "GET
/uc_server/data/avatar/000/01/54/22_avatar_middle.jpg HTTP/1.1" 200 5396
使用hive的hql实现结果如下:
ip dt url
8.35.201.160 2018-5-16 17:38:21
/uc_server/data/avatar/000/01/54/22_avatar_middle.jpg
答案:
create table if not exists `logs`
(
log string
);
load data local inpath '/root/downfile/logs.txt' into table logs;
SELECT regexp_extract(log,"([0-9.]+\\d+) - - \\[(.+ \\+\\d+)\\] .+(GET) (.+) (HTTP)\\S+ (\\d+) (\\d+)",1) ip,
from_unixtime(unix_timestamp(regexp_extract(log,"([0-9.]+\\d+) - - \\[(.+ \\+\\d+)\\] .+(GET) (.+) (HTTP)\\S+ (\\d+) (\\d+)",2),"dd/MMM/yyyy:HH:mm:ss Z"),"yyyy-MM-dd HH:mm:ss") dt,
regexp_extract(log,"([0-9.]+\\d+) - - \\[(.+ \\+\\d+)\\] .+(GET) (.+) (HTTP)\\S+ (\\d+) (\\d+)",4) url
FROM logs;
1.8 优化
1.8.1 你做过hive的那些优化
join优化
limit优化
并行执行
严格模式
mapper和reducer的个数
配置jvm重⽤
数据倾斜
索引
分区
推测执⾏
1.9普通查询
1.9.1 使用hive的hql查询用户所在部门
dpt表
dpt_id dpt_name
1 产品
2 技术
user_dpt表
user_id dpt_id
1 1
2 1
3 2
4 2
5 3
result表
user_id dpt_id dpt_name
1 1 产品
2 1 产品
3 2 技术
4 2 技术
5 3 其他部门
答案:
create table if not exists `dpt`
(
dpt_id int,
dpt_name string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/dpt.txt' into table dpt;
create table if not exists `user_dpt`
(
user_id int,
dpt_id int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/user_dpt.txt' into table user_dpt;
select
user_dpt.user_id user_id ,
user_dpt.dpt_id dpt_id,
case when dpt.dpt_name is not null then dpt.dpt_name else '其他' end dpt_name
from user_dpt
left join
dpt on user_dpt.dpt_id=dpt.dpt_id;
1.9.2 查出每个学期每门课程最高分记录
course_score表:
id,userid,course,score,term
1,zhangsan,数学,80,2015
2,lisi,语文,90,2016
3,lisi,数学,70,2016
4,wangwu,化学,80,2017
5,zhangsan,语文,85,2015
6,zhangsan,化学,80,2015
编写sql完成如下查询,一次查询实现最好,也可以写多次查询实现:
1、查出每个学期每门课程最高分记录(包含全部5个字段)
2、查出单个学期中语文课在90分以上的学生的数学成绩记录(包含全部5个字段)
答案:
create table if not exists `course_score`
(
id int,
userid string,
course string,
score int,
term int
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/course_score.txt' into table course_score;
select
id,
userid,
tmp.course,
tmp.max_score,
tmp.term
from course_score cs
inner join
(select
course,
max(score) max_score,
term
from course_score
group by term,course) tmp
on tmp.max_score=cs.score and tmp.course=cs.course and tmp.term=cs.term
;
select
cs.*
from course_score cs
inner join
(select
userid,
term
from course_score
where course='语文' and score>80
group by term,userid) tmp
on cs.userid=tmp.userid and cs.term=tmp.term where cs.course='数学';
1.9.3 设计数据库表,用来存放学生基本信息,课程信息,学生的课 程及成绩,并给出sql语句,查询平均成绩大于85的所有学生
stu_1
id,name,age,addr
1,zs1,22,bj
2,zs2,22,bj
3,zs3,22,bj
4,zs4,22,bj
5,zs5,22,bj
course_1
cid,cname
1,语文
2,数学
3,政治
4,美术
5,历史
course_sc
id,cid,score
1,1,87
1,2,92
1,3,69
2,2,83
2,3,92
2,4,87
2,5,83
答案:
create table if not exists `stu_1`
(
id int,
name string,
age int,
addr string
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/stu_1.txt' into table stu_1;
create table if not exists `course_1`
(
cid int,
cname string
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/course_1.txt' into table course_1;
create table if not exists `course_sc`
(
id int,
cid int,
score int
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/course_sc.txt' into table course_sc;
select
stu_1.*,
tmp.average_scores
from stu_1
inner join
(select
id,
count(cid),
sum(score)/count(cid) average_scores
from course_sc
group by id) tmp
on tmp.id=stu_1.id;
1.9.4 每个渠道的下单用户数、订单总金额
有一个订单表和渠道表,结构如下:
请使用hive hql查询出2019-08-06号 每个渠道的下单用户数、订单总金额。
hql语句实现,结果表头如下:
channel user_num order_amount
答案:
create table if not exists `order`
(
order_id bigint,
user_id bigint,
amount double,
channel string,
`time` string
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/order.txt' into table `order`;
select
count(distinct user_id) user_num,
sum(amount) order_amount
from `order`
group by channel;
1.9.5 登录且阅读的用户数,已经阅读书籍数量及其它
有如下三张表:
表A(登录表):
ds user_id
2019-08-06 1
2019-08-06 2
2019-08-06 3
2019-08-06 4
表B(阅读表):
ds user_id read_num
2019-08-06 1 2
2019-08-06 2 3
2019-08-06 3 6
表C(付费表):
ds user_id price
2019-08-06 1 55.6
2019-08-06 2 55.8
基于上述三张表,请使用hive的hql语句实现如下需求:
(1)、用户登录并且当天有个阅读的用户数,已经阅读书籍数量
(2)、用户登录并且阅读,但是没有付费的用户数
(3)、用户登录并且付费,付费用户书籍和金额
答案:
create table if not exists `a1`
(
ds string,
user_id int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/a1.txt' into table a1;
create table if not exists `b1`
(
ds string,
user_id int,
read_num int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/b1.txt' into table b1;
create table if not exists `c1`
(
ds string,
user_id int,
price double
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/c1.txt' into table c1;
select
count(b1.user_id) user_num,
sum(b1.read_num) sum_read_num
from b1
left join
a1
on a1.user_id=b1.user_id and a1.ds=b1.ds;
select
count(a1.user_id)
from a1
left join
b1 on a1.user_id=b1.user_id
left join
c1 on a1.user_id=c1.user_id
where b1.user_id is not null and c1.user_id is null;
select
c1.user_id,
b1.read_num,
c1.price
from a1
inner join
b1 on a1.user_id=b1.user_id
inner join
c1 on a1.user_id=c1.user_id;
1.9.6 高消费者报表
有三个表,分别是:
区域(district) 区域中有两个字段分别是区域Id(disid)和区域名称(disname)
城市(city) 城市有两个字段分别是城市ID(cityid)和区域ID(disid)
订单(order) 订单有四个字段分别是订单ID(orderid)、用户ID(userid)、城市ID(cityid)和消
费金额(amount)。
district表:
disid disname
1 华中
2 西南
city表:
cityid disid
1 1
2 1
3 2
4 2
5 2
order表:
oid userid cityid amount
1 1 1 1223.9
2 1 1 9999.9
3 2 2 2322
4 2 2 8909
5 2 3 6789
6 2 3 798
7 3 4 56786
8 4 5 78890
高消费者是消费金额大于1W的用户,使用hive hql生成如下报表:
区域名 高消费者人数 消费总额
答案:
CREATE TABLE IF NOT EXISTS `district`
(
disid int,
disname string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/district.txt' into table district;
CREATE TABLE IF NOT EXISTS `city`
(
cityid int,
disid int
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/city.txt' into table city;
CREATE TABLE IF NOT EXISTS `order_tab`
(
oid int,
userid int,
cityid int,
amount double
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/order_tab.txt' into table order_tab;
select
district.disname disname,
count(order_tab.userid) user_num,
sum(order_tab.amount) sum_amount
from
order_tab
inner join
city
on city.cityid=order_tab.cityid
inner join
district
on district.disid=city.disid
where order_tab.amount>10000
group by district.disname;
1.9.7 请使用sql计算pv、uv
数据:
t1表
uid date url
1 2019-08-06 http://www.baidu.com
2 2019-08-06 http://www.baidu.com
3 2019-08-06 http://www.baidu.com
3 2019-08-06 http://www.soho.com
3 2019-08-06 http://www.meituan.com
3 2019-08-06
结果如下:
date uv pv
2019-08-6 3 5
答案:
CREATE TABLE IF NOT EXISTS `t1_table`
(
`uid` int,
`date` string,
`url` string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/t1_table.txt' into table t1_table;
SELECT tmp1.date1 `date`,
count(tmp2.cnt_uid) uv,
tmp1.pv pv
FROM
(SELECT count(uid) cnt_uid,
`date` date1
FROM t1_table t1
GROUP BY uid,
`date`) tmp2
LEFT JOIN
(SELECT `date` date1,
count(uid) pv
FROM t1_table
WHERE url IS NOT NULL
GROUP BY `date`) tmp1 ON tmp1.date1=tmp2.date1
GROUP BY tmp1.date1,
tmp1.pv;
1.9.8 使用hive的hql实现买过商品3的用户及其昨日消费:
orderid userid productid price timestamp date
123,00101,3,200,1535945356,2019-08-28
124,00100,1,200,1535945356,2019-08-28
125,00101,3,200,1535945356,2019-08-29
126,00101,2,200,1535945356,2019-08-29
127,00102,5,200,1535945356,2019-08-29
128,00103,3,200,1535945356,2019-08-29
129,00103,3,200,1535945356,2019-08-29
答案:
SELECT market.orderid,
market.userid,
market.productid,
market.price,
market.date1
FROM market
INNER JOIN
(SELECT userid
FROM market
WHERE date1='2019-08-29'
AND productid=3
UNION SELECT userid
FROM market
WHERE date1='2019-08-28'
AND productid=3) tmp ON tmp.userid=market.userid
WHERE market.date1='2019-08-28';
1.9.9 统计为用户配重了角色的用户角色数量
用户表和角色表,统计为用户配重了角色的用户角色数量(hql)
用户表t1:
uid uname role
1 zs 1,1,2,2,2
2 ls 1,2
3 ww 2,3,3
角色表t2:
roleid rolename
1,唤醒师
2,召唤师
3,魔法师
结果如下:
uid uname roleid cnt
1 zs 1 召唤师 2
1 zs 2 唤醒师 3
3 ww 3 魔法师 2
答案:
create table if not exists `t2`
(
uid int,
uname string,
role string
)row format delimited fields terminated by ' ';
load data local inpath '/root/downfile/t2.txt' into table t2;
create table if not exists `t3`
(
roleid int,
rolename string
)row format delimited fields terminated by ',';
load data local inpath '/root/downfile/t3.txt' into table t3;
SELECT tmp.uid uid,
tmp.uname uname,
t3.roleid roleid,
t3.rolename rolename,
tmp.cnt cnt
FROM t3
INNER JOIN
(SELECT uid,
uname,
roles,
count(roles) cnt
FROM t2 LATERAL VIEW explode(split(t2.role,',')) role1 AS roles
GROUP BY uid,
roles,
uname) tmp ON tmp.roles=t3.roleid
INNER JOIN
(SELECT tmp1.roles roles,
max(tmp1.cnt) max_cnt
FROM
(SELECT uid,
uname,
roles,
count(roles) cnt
FROM t2 LATERAL VIEW explode(split(t2.role,',')) role1 AS roles
GROUP BY uid,
roles,
uname) tmp1
GROUP BY tmp1.roles) tmp2 ON tmp2.max_cnt=tmp.cnt
GROUP BY tmp.uid,
tmp.uname,
t3.roleid,
t3.rolename,
tmp.cnt;
1.10 其它
1.10.1 Hive是否发生过数据倾斜,怎么处理的,原理是什么
数据倾斜造成原因:
1、数据本身就倾斜(某一个key过多造成,再join当中认为hive.skewjoin.key=100000就是倾斜)
2、join 或 group by都容易造成
解决数据倾斜:
1、设置对应属性 如果有group by语句设置如下两个: hive.map.aggr=true
hive.groupby.skewindata=false (建议开启)
如果语句中出现join则可以设置如下属性: hive.optimize.skewjoin=false (建议开启)
2、如果是数据倾斜造成(抽样查询key),可以将倾斜数据单独提取出来进行计算,然后再和非倾斜数据结果进行合 并。
3、如果join的语句出现倾斜,你可以将出现的key加上一个范围的随机数,然后将连接的数量减少,避免倾斜。如 果过后续还有group by则可以按照第2部做。大前提不能影响业务。
1.10.2 Hive中什么时候使用过array和map,为什么使用
array 一字段多值
map key:value形式
1.10.3 hive的hql中,left outer join和left semi join的区别
left outer join 左变表的值都出来,右变没有的则null输出
left semi join 左变表的值都出来,右变没有的过滤了
1.10.4 一张大表A(上亿条记录)和小表B(几千条记录),如果join出现 数据倾斜,有什么解决办法
可以将出现的key加上一个范围的随机数,然后将连接的数量减少,避免倾斜。如果过后续还有group by则可以按照 第2部做。大前提不能影响业务
1.10.5 统计不同天的pv、uv及其它
样例数据:
t1表
gender,cookie,ip,timestampe,ua
F,1707041428491566106,111.200.195.186,1208524973899,Dalvik%2F2.1.0%20%28Linux
%3B%20U%3B%20Android
...具体数据如下图
将图片中的awk修改为使用sql编写,然后将上诉题作出回答?