In summary, stream processing is real-time: as data arrives it is received into a buffer, and the processing node immediately pulls it from the buffer to process it. In batch processing, arriving data is serialized into a buffer and is not processed immediately. Flink supports both modes through a single mechanism, the buffer timeout: a timeout of 0 means pure stream processing.
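As a minimal sketch of this knob (assuming the usual StreamExecutionEnvironment entry point):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Flush network buffers every 100 ms: a middle ground between latency and throughput.
env.setBufferTimeout(100);
// A timeout of 0 flushes after every record, i.e. pure streaming with minimal latency.
// env.setBufferTimeout(0);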
Basic steps of Flink programming
1 Execution environment
First, obtain an existing execution environment or create a new one. There are three ways to get an execution environment (see the sketch after this list):
(1) via getExecutionEnvironment(): this derives the environment from the context; in local mode it creates a local execution environment, and in cluster mode a distributed one;
(2) via createLocalEnvironment(), which creates a local execution environment;
(3) via createRemoteEnvironment(String host, int port, String... jarFiles), which creates a remote execution environment.
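A short sketch of the three calls side by side; the host, port, and jar path below are placeholders:
// (1) Context-dependent: local in the IDE, distributed on a cluster.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// (2) Explicitly local, useful for testing.
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment();
// (3) Explicitly remote; the listed jar files are shipped to the cluster.
StreamExecutionEnvironment remoteEnv =
        StreamExecutionEnvironment.createRemoteEnvironment("jobmanager-host", 6123, "/path/to/job.jar");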
2 Data sources
Flink ships with many predefined data sources and also supports custom sources.
2.1 Socket-based
The DataStream API can read data from a socket via the following three methods (a usage sketch follows the list):
socketTextStream(hostName, port)
socketTextStream(hostName, port, delimiter)
socketTextStream(hostName, port, delimiter, maxRetry)
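For example (hostname and port are placeholders):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read newline-delimited text from a socket.
DataStream<String> lines = env.socketTextStream("localhost", 9999);
// Variant with an explicit delimiter and a bounded number of reconnect attempts.
DataStream<String> retried = env.socketTextStream("localhost", 9999, "\n", 3);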
2.2 File-based
You can use readTextFile(String path) to consume a file as the source of a stream; the default format is TextInputFormat. You can also specify another format via readFile(FileInputFormat inputFormat, String path).
Flink also supports reading files as continuously monitored streams (note that readFileStream is deprecated in recent versions in favor of readFile):
readFileStream(String filePath, long intervalMillis, FileMonitoringFunction.WatchType watchType)
readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)
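A sketch of both variants (the path is a placeholder; PROCESS_CONTINUOUSLY re-scans the path at the given interval):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// One-shot read of a text file.
DataStream<String> text = env.readTextFile("/path/to/input");
// Continuously monitor the path, re-scanning it every 1000 ms for new data.
TextInputFormat format = new TextInputFormat(new Path("/path/to/input"));
DataStream<String> monitored = env.readFile(format, "/path/to/input",
        FileProcessingMode.PROCESS_CONTINUOUSLY, 1000L);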
3 Transformation
A transformation converts data from one form into another; its input may be one or more streams, and its output may be zero, one, or several streams. The individual transformations are introduced below.
3.1 Map
Takes one element and produces one element. Java API:
inputStream.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer value) throws Exception {
        return 5 * value;
    }
});
3.2 FlatMap
Takes one element and produces zero, one, or more elements. Java API:
inputStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});
3.3 Filter
Used for conditional filtering; a record is emitted only when the filter function returns true:
inputStream.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer value) throws Exception {
        return value != 1;
    }
});
3.4 keyBy
Logically partitions the stream by key, internally using hash partitioning, and returns a KeyedStream:
inputStream.keyBy("someKey");
3.5 Reduce
On a KeyedStream, combines the result of the previous reduce with the current element; for example, a sum reduce:
keyedInputStream.reduce(new ReduceFunction<Integer>() {
    @Override
    public Integer reduce(Integer value1, Integer value2) throws Exception {
        return value1 + value2;
    }
});
3.6 Fold
Combines the records of a KeyedStream with an initial value (note that fold is deprecated in later Flink versions), for example:
keyedInputStream.fold("Start", new FoldFunction<Integer, String>() {
    @Override
    public String fold(String current, Integer value) {
        return current + "=" + value;
    }
});
Given the stream (1,2,3,4,5), the result will be: Start=1=2=3=4=5
3.7 Aggregation
Applies built-in aggregations such as min and max on a KeyedStream. Note the difference between min and minBy: min returns the minimum value of the given field (the other fields of the emitted record may come from different input records), whereas minBy returns the entire record that contains the minimum; max and maxBy behave analogously. A combined sketch follows the list:
keyedInputStream.sum(0)
keyedInputStream.sum("key")
keyedInputStream.min(0)
keyedInputStream.min("key")
keyedInputStream.max(0)
keyedInputStream.max("key")
keyedInputStream.minBy(0)
keyedInputStream.minBy("key")
keyedInputStream.maxBy(0)
keyedInputStream.maxBy("key")
4. Further reading
https://www.jianshu.com/p/f9d447a3c48f
http://vinoyang.com/2016/06/22/flink-data-stream-partitioner/
Flink DataStream API Programming Guide: https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/datastream_api.html
5. Event time
Reference: https://blog.csdn.net/lmalds/article/details/52704170
Event time is the time that each individual event occurred on its producing device. This time is typically embedded within the records before they enter Flink, and that event timestamp can be extracted from each record. In event time, the progress of time depends on the data, not on any wall clocks. Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time. This watermarking mechanism is described in a later section, below.
In a perfect world, event time processing would yield completely consistent and deterministic results, regardless of when events arrive, or their ordering. However, unless the events are known to arrive in-order (by timestamp), event time processing incurs some latency while waiting for out-of-order events. As it is only possible to wait for a finite period of time, this places a limit on how deterministic event time applications can be.
Assuming all of the data has arrived, event time operations will behave as expected, and produce correct and consistent results even when working with out-of-order or late events, or when reprocessing historic data. For example, an hourly event time window will contain all records that carry an event timestamp that falls into that hour, regardless of the order in which they arrive, or when they are processed. (See the section on late events for more information.)
Note that sometimes when event time programs are processing live data in real-time, they will use some processing time operations in order to guarantee that they are progressing in a timely fashion.
Event time is the time at which the data was produced.
A watermark is a mechanism for measuring the progress of event time; it can be regarded as a hidden attribute of the data itself. Event-time data usually carries its own timestamp, for example 1472693399700 (2016-09-01 09:29:59.700), and the watermark for such a record might be:
watermark(1472693399700) = 1472693396700 (2016-09-01 09:29:56.700)
What does this watermark mean? It asserts that all records with a timestamp smaller than 1472693396700 (2016-09-01 09:29:56.700) have already arrived.
Watermarks are used to handle out-of-order events, and handling them correctly usually combines the watermark mechanism with windows.
From the moment an event is produced until it flows through the source and reaches an operator, a certain amount of time passes. Although in most cases records reach an operator in the order in which the events were produced, out-of-order (late) elements can arise from network delays, backpressure, and similar causes.
We cannot wait indefinitely for late elements, however; there must be a mechanism guaranteeing that after a specific amount of time a window is triggered and computed. That mechanism is the watermark.
Usually a watermark should be generated as soon as data is received from the source; however, it is also possible to apply a simple map or filter after the source and generate watermarks only then.
There are two main ways to generate watermarks:
(1) With Periodic Watermarks
(2) With Punctuated Watermarks
The first allows you to define a maximum allowed out-of-orderness and is the more commonly used approach.
Programming example
1)To work with event time, streaming programs need to set the time characteristic accordingly.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
2) A periodic watermark assigner with a bounded out-of-orderness (MyEvent is a placeholder event type carrying its creation time):
public class BoundedOutOfOrdernessGenerator implements AssignerWithPeriodicWatermarks<MyEvent> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds

    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        // Use the event's embedded creation time as its timestamp and track the maximum seen so far.
        long timestamp = element.getCreationTime();
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
        return timestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // Return the watermark as the current highest timestamp minus the out-of-orderness bound.
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}
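The assigner is attached to a stream with assignTimestampsAndWatermarks (a DataStream<MyEvent> named stream is assumed); the interval at which getCurrentWatermark() is polled can be tuned via the ExecutionConfig:
env.getConfig().setAutoWatermarkInterval(1000); // poll for a new watermark every second
DataStream<MyEvent> withTimestampsAndWatermarks =
        stream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());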
6. Core concepts of Flink parallelism
Reference: http://vinoyang.com/2016/05/02/flink-concepts/
Basic program flow
A Flink program executes in a parallel and distributed fashion: a stream is split into stream partitions, and an operator into operator subtasks; these subtasks execute independently of each other in different threads, possibly on different physical machines or in different containers.
The number of subtasks of a particular operator is called its parallelism. The parallelism of a stream always equals that of its producing operator. Different operators within one program may have different parallelisms.
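A short sketch of how parallelism is set at the environment and operator level (the host, port, and numbers are illustrative):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // default parallelism for all operators of this job
env.socketTextStream("localhost", 9999)            // source (placeholder host/port)
   .map(s -> s.toUpperCase()).setParallelism(2)    // this map runs with 2 subtasks
   .print();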