转载请在文章起始处注明出处,谢谢。
1、取字符串某个位置的值
比如,我要取‘abcd’的b,有两种取法:
select split('abcd','')[1]
select substr('abcd',2,1)
2、我要取a表的用户,b表的用户,还有a、b重复的用户
这里可以用sum() over(partition by) 来取,下面举个例子:
假如a表和b表结构一样,都有dt,pin,type。
取数代码示意如下:
set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.size.per.task = 512000000;
set hive.merge.smallfiles.avgsize = 512000000;
set mapred.max.split.size = 512000000;
set mapred.min.split.size.per.node = 512000000;
set mapred.min.split.size.per.rack = 512000000;
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.hadoop.supports.splittable.combineinputformat = true;
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nostrict;
set hive.exec.max.dynamic.partitions.pernode = 1000;
set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 16;
set hive.new.job.grouping.set.cardinality = 5000;
select
t2.dt,
count(distinct case
when t2.t = 1 then t2.pin
end) a的人数,
count(distinct case
when t2.t = 2 then t2.pin
end) b的人数,
count(distinct case
when t2.t >= 3 then t2.pin
end) ab重复人数
from
(
select
t1.dt,
t1.pin,
t1.type,
sum(t1.type) over(partition by t1.pin) t
from
(
表a a
union all
表b b
)
t1
group by
t1.dt,
t1.pin,
t1.type
)
t2
group by
t2.dt
注意:t2子查询里的t1.type是一定要写的,不然会报错:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 4:16 Invalid column reference 'type'
3、如果拿到一条数据,不需要跑底表可以测试代码是否准确。
比如我想把一段string改成array,底表很大,用代码测试一次需要很长时间,那可以通过with as来做一个临时表,方便测试。
hive> with releases as (select '["us","ca","fr"]' as country)
> select split(regexp_extract(country,'^\\["(.*)\\"]$',1),'","')
> from releases
> ;
OK
_c0
["us","ca","fr"]