1. Reading Files
1. SparkSession.read.textFile can read directories, wildcards, and compressed files: textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
2. .wholeTextFiles reads a directory containing many small files and returns (filename, file content) key-value pairs.
2.1 It takes an optional second argument that controls the minimum number of partitions (see the sketch after this list).
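A minimal sketch of both read paths via the underlying SparkContext, assuming a local SparkSession; the app name and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-files").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Directories, wildcards, and compressed files are all accepted.
val lines = sc.textFile("/my/directory/*.txt")

// (filename, content) pairs; the second argument is a hint for the
// minimum number of partitions.
val files = sc.wholeTextFiles("/my/directory", minPartitions = 4)
```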
2. RDD Operations
1. Transformations create a new dataset from an existing one; they are lazy.
2. Actions return a value to the driver after running a computation on the dataset.
3. Persisting RDDs
1. Call the persist() (or cache()) method (see the sketch after this list).
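A minimal sketch covering all three ideas, assuming an existing SparkContext `sc` (e.g. from the sketch above):

```scala
import org.apache.spark.storage.StorageLevel

val nums    = sc.parallelize(1 to 10)
val squared = nums.map(n => n * n)         // transformation: only builds the lineage, runs nothing
squared.persist(StorageLevel.MEMORY_ONLY)  // cache() is shorthand for this storage level
val sum   = squared.reduce(_ + _)          // action: runs the job (sum == 385) and fills the cache
val count = squared.count()                // a second action reuses the cached data
```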
Printing RDDs
1. rdd.collect().foreach(println) gathers the data from every node onto the driver, but this can exhaust the driver's memory.
2. Instead, we recommend fetching and printing only a small sample: rdd.take(100).foreach(println).
Common Action Operations

Action | Meaning
---|---
reduce(func) | Aggregate the elements of the dataset using func, which takes two arguments and returns one; func should be commutative and associative so it can be computed in parallel
collect() | Return all elements of the dataset as an array at the driver; only sensible for small results
count() | Return the number of elements in the dataset
first() | Return the first element of the dataset (similar to take(1))
take(n) | Return an array with the first n elements of the dataset
takeSample(withReplacement, num, [seed]) | Return a random sample of num elements, with or without replacement, optionally with a random seed
takeOrdered(n, [ordering]) | Return the first n elements using their natural order or a custom comparator
saveAsTextFile(path) | Write the elements as text files in the given directory
saveAsObjectFile(path) | Write the elements using Java serialization (Java/Scala)
saveAsSequenceFile(path) | Write the elements as a Hadoop SequenceFile (Java/Scala)
countByKey() | Only available on (K, V) RDDs; return a map of (K, count) pairs
foreach(func) | Run func on each element of the dataset
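A minimal sketch exercising a few of these actions, assuming an existing SparkContext `sc`; the output path is a placeholder:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.count()                      // 3
pairs.first()                      // ("a", 1)
pairs.take(2)                      // Array(("a", 1), ("b", 2))
pairs.countByKey()                 // Map("a" -> 2, "b" -> 1)
pairs.saveAsTextFile("/tmp/pairs") // one text file per partition under this directory
```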
Shuffle
1. The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O.
2. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
3. Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
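A minimal sketch of a shuffling operation, assuming an existing SparkContext `sc`:

```scala
val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // shuffle: all values for a key must end up in one partition
counts.collect()      // Array(("a", 2), ("b", 1))
```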
Removing Data
1. Spark evicts cached data automatically in least-recently-used (LRU) fashion.
2. To remove data manually, call the RDD.unpersist() method (see the sketch below).
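A minimal sketch of manual eviction, assuming an existing SparkContext `sc`:

```scala
val cached = sc.parallelize(1 to 100).cache()
cached.count()     // materializes the cache
cached.unpersist() // remove it manually; pass blocking = true to wait for removal
```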
Shared Variables
1. Spark provides two kinds of shared variables that can be distributed to every machine (see the sketch below):
1.1 Broadcast Variables
1.2 Accumulators
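A minimal sketch of both kinds, assuming an existing SparkContext `sc`; the lookup map and counter name are made up for illustration:

```scala
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only copy shipped once per executor
val misses = sc.longAccumulator("misses")          // tasks add to it; only the driver reads it

sc.parallelize(Seq("a", "b", "c")).foreach { key =>
  if (!lookup.value.contains(key)) misses.add(1)
}
println(misses.value) // 1, since "c" is not in the broadcast map
```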