I recently became interested in Spark, and my work requires it, so the first task was setting up the environment. At first it seemed so difficult that, although I started early, I put it aside whenever I hit a problem and only picked it up again when I actually needed it. With a colleague's help I finally got it running and realized it is actually very simple, so I am writing the steps down here in the hope that they help others.
1. Environment variable configuration:
First download Hadoop and add its path to the environment variables; alternatively, the Hadoop directory can be set from code, as in the sketch below.
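If you prefer not to change the system environment variables, a minimal sketch of the code-based approach, assuming Hadoop is unpacked to D:\hadoop (adjust the path to wherever your own installation lives):

    import org.apache.spark.{SparkConf, SparkContext}

    object HadoopHomeSetup {
      def main(args: Array[String]): Unit = {
        // Assumed path: point this at your local Hadoop directory
        // (on Windows, the one containing bin\winutils.exe)
        System.setProperty("hadoop.home.dir", "D:\\hadoop")
        // The property must be set before the SparkContext is constructed
        val sc = new SparkContext(new SparkConf().setAppName("HadoopHomeSetup").setMaster("local[*]"))
        println(sc.version)
        sc.stop()
      }
    }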
2. Install the Scala plugin in IDEA
Create a new Maven project. In File -> Settings -> Plugins, search for Scala on the right and click Install, then restart IDEA; the Scala entry will now show Uninstall, as in the figure below, which means the plugin was installed successfully.
Right-click src in IDEA, choose New -> Directory, enter scala, and create a scala source directory as shown in the figure below:
3. Add dependencies to pom.xml
Add the Spark and Scala version properties, together with the spark-hive, spark-core, spark-streaming, spark-sql, spark-streaming-kafka, spark-mllib and related dependencies shown below, to pom.xml, then click Maven -> Reimport on pom.xml to refresh the Maven dependencies.
You can choose the Spark version here, for example:
<spark.version>2.3.0.2.6.5.0-292</spark.version>  (Spark 2.3.0)
<spark.version>1.6.3</spark.version>  (Spark 1.6.3)
<profiles>
    <profile>
        <id>test</id>
        <properties>
            <scope>compile</scope>
        </properties>
    </profile>
    <profile>
        <id>package</id>
        <properties>
            <scope>provided</scope>
        </properties>
    </profile>
</profiles>
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compile.source>1.8</maven.compile.source>
    <maven.compile.target>1.8</maven.compile.target>
    <scala.version>2.10.5</scala.version>
    <!--<spark.version>2.3.0.2.6.5.0-292</spark.version>-->
    <spark.version>1.6.3</spark.version>
    <scala.spark.version>2.10</scala.spark.version>
    <file.name>${project.name}</file.name>
    <!--<scala.tools.version>2.10</scala.tools.version>-->
    <!--<spark.version>1.5.0</spark.version>-->
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_${scala.spark.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.spark.version}</artifactId>
        <version>${spark.version}</version>
        <scope>${scope}</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.spark.version}</artifactId>
        <version>${spark.version}</version>
        <scope>${scope}</scope>
    </dependency>
    <dependency><!-- Spark Streaming Kafka -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka_2.10</artifactId>
        <version>1.6.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.spark.version}</artifactId>
        <version>${spark.version}</version>
        <scope>${scope}</scope>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
        <scope>${scope}</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scala.spark.version}</artifactId>
        <version>${spark.version}</version>
        <scope>${scope}</scope>
    </dependency>
    <dependency>
        <groupId>io.codis.jodis</groupId>
        <artifactId>jodis</artifactId>
        <version>0.4.1</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.specs</groupId>
        <artifactId>specs</artifactId>
        <version>1.2.5</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>cz.mallat.uasparser</groupId>
        <artifactId>uasparser</artifactId>
        <version>0.6.0</version>
    </dependency>
</dependencies>
<build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <finalName>${file.name}</finalName>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.0</version>
            <configuration>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <scalaVersion>${scala.version}</scalaVersion>
                <args>
                    <arg>-target:jvm-1.7</arg>
                </args>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-eclipse-plugin</artifactId>
            <configuration>
                <downloadSources>true</downloadSources>
                <buildcommands>
                    <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
                </buildcommands>
                <additionalProjectnatures>
                    <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
                </additionalProjectnatures>
                <classpathContainers>
                    <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
                    <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
                </classpathContainers>
            </configuration>
        </plugin>
    </plugins>
</build>
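One quick way to confirm that the Scala and Spark versions Maven resolved actually match what the pom declares (Scala 2.10.5 and Spark 1.6.3 here) is to print them at runtime. A minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    object VersionCheck {
      def main(args: Array[String]): Unit = {
        // Scala library version on the classpath (should report 2.10.x to match the pom)
        println("Scala: " + scala.util.Properties.versionString)
        val sc = new SparkContext(new SparkConf().setAppName("VersionCheck").setMaster("local[*]"))
        // Version of the resolved spark-core dependency (should report 1.6.3)
        println("Spark: " + sc.version)
        sc.stop()
      }
    }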
4. Final configuration
Finally, for testing, open the Maven Projects panel on the far right of IDEA and tick test under Profiles at the top. This activates the profile that gives the Spark dependencies compile scope so they are available when running locally; tick package instead when building a jar for submission, so Spark is treated as provided. The configuration is now complete and you can write test cases.
5. Test program
In the scala directory, create a new Scala Class, choose Object for Kind, and use the test code below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object Base {
  def main(args: Array[String]): Unit = {
    // Suppress verbose Spark logging
    Logger.getLogger("org").setLevel(Level.ERROR)
    // System.setProperty("hadoop.home.dir", "D:\\hadoop") // set the Hadoop directory if it is not in the environment variables
    val conf0 = new SparkConf().setAppName("Base").setMaster("local[4]")
    val sc = new SparkContext(conf0) // create a SparkContext
    val inputfile = "C:\\Users\\dong.shan\\Documents\\WXWork\\1688852247359795\\Cache\\File\\2018-10\\pom.xml"
    val textfile = sc.textFile(inputfile)
    // find the lines containing "hello world"
    // val lines = textfile.filter(line => line.contains("hello world"))
    textfile.foreach(println)
    println("finished!")
    sc.stop()
  }
}
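When run with the test profile active, the program should print the lines of the input file (the order may vary across partitions) followed by finished!; inputfile here happens to point at a pom.xml saved on my machine, and any local text file will do. If you see that output, the Spark environment is working.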