1. Environment Preparation
- hadoop-2.6.0-cdh5.15.1 built with compression support (see the Hadoop installation doc)
- hadoop-lzo jar (download: lzo jar download link)
- lzo source tarball (download: lzo download link)
- lzop source tarball (download: lzop download link)
Install the build dependencies first:
yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
2. Install and Configure lzo
[hadoop@hadoop000 software]$ wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
[hadoop@hadoop000 software]$ tar -zxvf lzo-2.10.tar.gz -C ../app/
[hadoop@hadoop000 software]$ cd ../app/
[hadoop@hadoop000 app]$ ll
drwxr-xr-x 13 hadoop hadoop 4096 Sep 29 10:07 lzo-2.10
[root@hadoop000 lzo-2.10]# ./configure
[root@hadoop000 lzo-2.10]# make install
make[1]: Entering directory `/home/hadoop/app/lzo-2.10'
/bin/mkdir -p '/usr/local/lib'
/bin/sh ./libtool --mode=install /bin/install -c src/liblzo2.la '/usr/local/lib'
libtool: install: /bin/install -c src/.libs/liblzo2.lai /usr/local/lib/liblzo2.la
libtool: install: /bin/install -c src/.libs/liblzo2.a /usr/local/lib/liblzo2.a
libtool: install: chmod 644 /usr/local/lib/liblzo2.a
libtool: install: ranlib /usr/local/lib/liblzo2.a
libtool: finish: PATH="/usr/local/openresty/nginx/sbin:/usr/java/jdk1.8.0_45/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/bin:/sbin" ldconfig -n /usr/local/lib
----------------------------------------------------------------------
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the '-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the 'LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the 'LD_RUN_PATH' environment variable
during linking
- use the '-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to '/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
/bin/mkdir -p '/usr/local/share/doc/lzo'
/bin/install -c -m 644 AUTHORS COPYING NEWS THANKS doc/LZO.FAQ doc/LZO.TXT doc/LZOAPI.TXT '/usr/local/share/doc/lzo'
/bin/mkdir -p '/usr/local/lib/pkgconfig'
/bin/install -c -m 644 lzo2.pc '/usr/local/lib/pkgconfig'
/bin/mkdir -p '/usr/local/include/lzo'
/bin/install -c -m 644 include/lzo/lzo1.h include/lzo/lzo1a.h include/lzo/lzo1b.h include/lzo/lzo1c.h include/lzo/lzo1f.h include/lzo/lzo1x.h include/lzo/lzo1y.h include/lzo/lzo1z.h include/lzo/lzo2a.h include/lzo/lzo_asm.h include/lzo/lzoconf.h include/lzo/lzodefs.h include/lzo/lzoutil.h '/usr/local/include/lzo'
make[1]: Leaving directory `/home/hadoop/app/lzo-2.10'
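Note that the install above only produced the static library (liblzo2.a); lzo's configure disables the shared library by default, so pass --enable-shared if you also want liblzo2.so. Either way, /usr/local/lib is not on the default linker search path, so register it as the libtool notice suggests. A minimal sketch, assuming the paths from the install above:
[root@hadoop000 lzo-2.10]# ./configure --enable-shared && make && make install   # optional: rebuild with the shared library
[root@hadoop000 lzo-2.10]# echo "/usr/local/lib" > /etc/ld.so.conf.d/lzo.conf
[root@hadoop000 lzo-2.10]# ldconfig
[root@hadoop000 lzo-2.10]# ldconfig -p | grep lzo2    # liblzo2.so should now be listed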
3. Install and Configure lzop
[hadoop@hadoop000 software]$ wget http://www.lzop.org/download/lzop-1.04.tar.gz
[hadoop@hadoop000 software]$ tar -zxvf lzop-1.04.tar.gz -C ../app/
[hadoop@hadoop000 app]$ ll
drwxr-xr-x 6 hadoop hadoop 4096 Aug 10 2017 lzop-1.04
[root@hadoop000 ~]# cd /home/hadoop/app/lzop-1.04/
[root@hadoop000 lzop-1.04]# ./configure
[root@hadoop000 lzop-1.04]# make && make install
make[1]: Leaving directory `/home/hadoop/app/lzop-1.04'
make[1]: Entering directory `/home/hadoop/app/lzop-1.04'
/bin/mkdir -p '/usr/local/bin'
/bin/install -c src/lzop '/usr/local/bin'
/bin/mkdir -p '/usr/local/share/doc/lzop'
/bin/install -c -m 644 AUTHORS COPYING NEWS README THANKS doc/lzop.html doc/lzop.man doc/lzop.ps doc/lzop.tex doc/lzop.txt doc/lzop.pod '/usr/local/share/doc/lzop'
/bin/mkdir -p '/usr/local/share/man/man1'
/bin/install -c -m 644 doc/lzop.1 '/usr/local/share/man/man1'
make[1]: Leaving directory `/home/hadoop/app/lzop-1.04'
4. Test lzop
[hadoop@hadoop000 ~]$ ll
-rw-rw-r--. 1 hadoop hadoop 4448 Sep 7 23:36 zookeeper.out
[hadoop@hadoop000 ~]$ lzop zookeeper.out
[hadoop@hadoop000 ~]$ ll
-rw-rw-r--. 1 hadoop hadoop 4448 Sep 7 23:36 zookeeper.out
-rw-rw-r-- 1 hadoop hadoop 1630 Sep 7 23:36 zookeeper.out.lzo
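To confirm the round trip, lzop can verify and decompress what it just produced; a quick sketch:
[hadoop@hadoop000 ~]$ lzop -t zookeeper.out.lzo                           # integrity test, silent on success
[hadoop@hadoop000 ~]$ lzop -dc zookeeper.out.lzo | diff - zookeeper.out   # decompress to stdout and compare; no output means identical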
5. Upload the hadoop-lzo jar
[hadoop@hadoop000 common]$ pwd
/home/hadoop/app/hadoop/share/hadoop/common
[hadoop@hadoop000 common]$ ll
-rw-r--r-- 1 hadoop hadoop 193831 Sep 29 09:18 hadoop-lzo-0.4.20.jar
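Before wiring the jar into the config, it is worth sanity-checking that it actually ships the codec classes; a quick sketch:
[hadoop@hadoop000 common]$ unzip -l hadoop-lzo-0.4.20.jar | grep -E 'LzoCodec|LzopCodec|LzoTextInputFormat'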
6. Build hadoop-lzo from Source (Optional)
[root@hadoop000 tar]# wget https://github.com/twitter/hadoop-lzo/archive/master.zip
[root@hadoop000 tar]# unzip -d /home/hadoop/app/ master.zip
Edit the Hadoop version in pom.xml:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>2.6.0-cdh5.15.1</hadoop.current.version>
<hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
Add the Cloudera repository:
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>
Per the project README, export these environment variables before building:
[hadoop@hadoop000 hadoop-lzo-master]$ export C_INCLUDE_PATH=/usr/local/include
[hadoop@hadoop000 hadoop-lzo-master]$ export LIBRARY_PATH=/usr/local/lib
[hadoop@hadoop000 hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:41 min
[INFO] Finished at: 2019-04-14T12:03:05+00:00
[INFO] Final Memory: 37M/1252M
[INFO] ------------------------------------------------------------------------
Copy the native libraries into Hadoop's native lib directory:
[hadoop@hadoop000 hadoop-lzo-master]$ cd target/native/Linux-amd64-64/
[hadoop@hadoop000 Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
[hadoop@hadoop000 ~]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
Add the freshly built jar to Hadoop's common directory:
[hadoop@hadoop000 target]$ cp hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
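On a multi-node cluster, both the jar and the native libgplcompression libraries must be present on every node. A sketch, with worker hostnames assumed for illustration:
[hadoop@hadoop000 ~]$ for h in hadoop001 hadoop002; do    # hypothetical worker hostnames
>   scp $HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar $h:$HADOOP_HOME/share/hadoop/common/
>   scp $HADOOP_HOME/lib/native/libgplcompression* $h:$HADOOP_HOME/lib/native/
> done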
7. Configure core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<!-- implementation class backing the lzo codec (LzoCodec here, per the hadoop-lzo README) -->
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
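With this in place, a quick client-side check is to cat an .lzo file through the codec framework; hdfs dfs -text selects a codec by file extension from the list registered above:
[hadoop@hadoop000 ~]$ hdfs dfs -put zookeeper.out.lzo /tmp/
[hadoop@hadoop000 ~]$ hdfs dfs -text /tmp/zookeeper.out.lzo | head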
8. Configure mapred-site.xml
<!-- compress intermediate map output -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<!-- compress final job output -->
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<!-- codec for the final job output -->
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<!-- let task JVMs find the native lzo library -->
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/usr/local/lib</value>
</property>
9. Test: Generate an lzo File
[hadoop@hadoop000 data]$ ll
-rw-r--r-- 1 hadoop hadoop 533444411 Apr 1 2015 ratings.csv
[hadoop@hadoop000 data]$ du -sh *
509M ratings.csv
[hadoop@hadoop000 data]$ lzop ratings.csv
[hadoop@hadoop000 data]$ du -sh *
509M ratings.csv
220M ratings.csv.lzo
[hadoop@hadoop000 data]$ hdfs dfs -put ratings.csv.lzo /ruozedata/input/
[hadoop@hadoop000 hadoop]$ hadoop jar \
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar \
wordcount \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
/ruozedata/input/ratings.csv.lzo \
/ruozedata/output/lzo_02/
19/09/29 11:12:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/29 11:12:56 INFO input.FileInputFormat: Total input paths to process : 1
19/09/29 11:12:56 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/09/29 11:12:56 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 52decc77982b58949890770d22720a91adce0c3f]
19/09/29 11:12:57 INFO mapreduce.JobSubmitter: number of splits:1
Only 1 split: without an index, an .lzo file is not splittable, so the whole 220 MB file goes to a single map task.
10. Generate the Index File
Build an lzo index so the file becomes splittable:
[hadoop@hadoop000 hadoop]$ hdfs dfs -mkdir /ruozedata/index/
[hadoop@hadoop000 hadoop]$ hadoop jar /home/hadoop/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /ruozedata/input/ratings.csv.lzo
[hadoop@hadoop000 hadoop]$ hdfs dfs -ls /ruozedata/input/*
-rw-r--r-- 1 hadoop hadoop 533444411 2019-09-29 10:25 /ruozedata/input/ratings.csv
-rw-r--r-- 1 hadoop hadoop 230567633 2019-09-29 11:11 /ruozedata/input/ratings.csv.lzo
-rw-r--r-- 1 hadoop hadoop 16280 2019-09-29 11:28 /ruozedata/input/ratings.csv.lzo.index
An .index file has been generated alongside the .lzo file in the same directory.
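DistributedLzoIndexer runs the indexing as a MapReduce job; for smaller files, hadoop-lzo also ships a single-process com.hadoop.compression.lzo.LzoIndexer that builds the index without launching a job:
[hadoop@hadoop000 hadoop]$ hadoop jar /home/hadoop/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar \
com.hadoop.compression.lzo.LzoIndexer /ruozedata/input/ratings.csv.lzo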
Rerun the job, this time forcing the splittable input format:
[hadoop@hadoop000 hadoop]$ hadoop jar \
> share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar \
> wordcount \
> -Dmapreduce.output.fileoutputformat.compress=true \
> -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
> -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
> /ruozedata/input/ratings.csv.lzo \
> /ruozedata/output/lzo_04/
19/09/29 11:40:59 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/29 11:41:00 INFO input.FileInputFormat: Total input paths to process : 1
19/09/29 11:41:00 INFO mapreduce.JobSubmitter: number of splits:2
Note: the .lzo file and its index must be on the HDFS file system.
11. Using lzo in Spark
[hadoop@hadoop000 conf]$ vi spark-defaults.conf
spark.jars /home/hadoop/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar
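Instead of the global spark-defaults.conf entry, the jar can also be supplied per application with --jars; a sketch:
spark-submit --jars /home/hadoop/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar ...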
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object CompressionApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val input = args(0)
    val output = args(1)

    // project helper that deletes the output path if it already exists
    FileUtils.deleteTarget(output, new Configuration())

    // read the indexed .lzo file with the splittable input format, then word count
    val rdd = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](input)
    rdd.map(_._2.toString).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      .saveAsTextFile(output, classOf[com.hadoop.compression.lzo.LzopCodec]) // lzo-compressed output

    sc.stop()
  }
}
spark-submit \
--class com.ruozedata.bigdata.spark.homework.CompressionApp \
--master yarn \
--deploy-mode client \
--executor-memory 3G \
--num-executors 1 \
/home/hadoop/lib/ruozedata-spark-flink-1.0.jar \
/ruozedata/input/ratings.csv.lzo \
/ruozedata/output/lzo_spark/
12. Summary
If the 220 MB file is splittable it yields 2 splits (with the default 128 MB block size); if not, it always yields exactly 1.
A splittable file larger than one block is processed by 2 map tasks in parallel, which improves throughput; an unsplittable file is handled by a single map no matter how large it is, which wastes time.
So in practice, control the size of the .lzo files you generate: keep each one within a single HDFS block, because an unindexed .lzo file is consumed by one map, and an oversized one makes that map run far too long.
Alternatively, pair each .lzo file with an .index file so it becomes splittable. Then the file size is no longer constrained, and files can even be made somewhat larger to cut down the total file count.
The .index files take little space, but generating them does carry some overhead of its own.
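To apply the sizing advice, compare each .lzo file against the cluster's block size; a quick sketch:
[hadoop@hadoop000 ~]$ hdfs getconf -confKey dfs.blocksize                  # e.g. 134217728 = 128 MB
[hadoop@hadoop000 ~]$ hdfs dfs -du -h /ruozedata/input/ratings.csv.lzo    # keep unindexed files under one block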