Flink official site (Chinese): https://flink.apache.org/zh/
In Flink, applications are composed of streaming dataflows that can be transformed by user-defined operators. These dataflows form directed graphs that start from one or more sources and end in one or more sinks.
Flink supports both stream and batch processing; it is a distributed engine that unifies the two. Flink treats all data as streams produced over time and regards batch data as a special case of streaming: a stream is an unbounded dataflow, while a batch is a bounded dataflow (for example, a fixed-size dataset). As shown in the figure below:
Flink is therefore a general-purpose framework for stateful computations over unbounded and bounded data streams. It offers sophisticated features for processing unbounded streams as well as dedicated operators for efficiently processing bounded ones. Unbounded data is usually called real-time data and comes from streaming sources such as message queues or distributed logs (e.g. Apache Kafka or Kinesis). Bounded data usually means historical data and comes from sources such as files or relational databases. Result streams produced by a Flink application can be sent to a wide variety of systems, and the state held inside Flink can be accessed through its REST API.
When Flink processes a bounded stream it runs in batch mode: we can choose to first ingest the entire dataset and then sort it, compute global statistics, or produce a final report summarizing all input. When Flink processes an unbounded stream it runs in streaming mode: the input may never end, so the data must be processed continuously as it arrives.
Flink offers layered APIs for developing stream/batch applications. The higher the layer, the more abstract and convenient it is; the lower the layer, the closer it is to the runtime and the harder it is to use, as shown in the figure below:
Flink provides three layered APIs. Each API offers a different trade-off between conciseness and expressiveness and targets different application scenarios.
Note: Flink 1.12 introduced unified stream/batch execution, and the DataSet API is no longer recommended. So apart from a few individual examples that use DataSet, the examples in this post use the DataStream API, which supports both unbounded data (stream processing) and bounded data (batch processing). The Table & SQL API is covered separately.
https://ci.apache.org/projects/flink/flink-docs-release-1.14/zh/docs/dev/dataset/overview/
Create a Maven project named flinkbase.
1.4.2. Import the pom dependencies
Repositories:
- aliyun: http://maven.aliyun.com/nexus/content/groups/public/
- apache: https://repository.apache.org/content/repositories/snapshots/
- cloudera: https://repository.cloudera.com/artifactory/cloudera-repos/

Properties: project encoding UTF-8; flink.version 1.14.0; hive.version 3.1.2; mysql.version 5.1.48; vertx.version 3.9.0; fastjson.version 1.2.68; log4j.version 1.7.7; lombok.version 1.18.22; kafka.version 3.0.0; flink-shaded-hadoop.version 3.1.1.7.2.9.0-173-9.0; collections4.version 4.4; avro.version 1.10.2; flink-filesystem.version 1.11.4; java.version 1.8; scala.version 2.12; scala.binary.version 2.12; maven.compiler.source/target ${java.version}.

Flink modules, all at ${flink.version} (with the _${scala.binary.version} suffix where applicable): flink-clients, flink-scala, flink-runtime-web, flink-java, flink-streaming-scala, flink-streaming-java, flink-table-api-scala-bridge, flink-table-api-java-bridge, flink-table-planner, flink-table-common, flink-queryable-state-runtime, flink-connector-kafka, flink-sql-connector-kafka, flink-connector-jdbc, flink-connector-pulsar, flink-csv, flink-json, flink-parquet, flink-avro, flink-connector-hive; plus flink-connector-filesystem at ${flink-filesystem.version} and flink-shaded-hadoop-3-uber at ${flink-shaded-hadoop.version}.

Other dependencies:
- org.apache.bahir:flink-connector-redis_2.11:1.0, excluding flink-streaming-java, flink-runtime, flink-core and flink-java
- org.apache.hive:hive-metastore:${hive.version} (excluding hadoop-hdfs) and org.apache.hive:hive-exec:${hive.version}
- mysql:mysql-connector-java:${mysql.version}
- io.vertx:vertx-core, vertx-jdbc-client and vertx-redis-client at ${vertx.version}
- org.slf4j:slf4j-log4j12:${log4j.version} (runtime scope)
- com.alibaba:fastjson:${fastjson.version}
- org.projectlombok:lombok:${lombok.version}
- org.apache.commons:commons-collections4:${collections4.version}
- org.apache.thrift:libfb303:0.9.3
- org.apache.avro:avro:${avro.version}
- org.apache.kafka:kafka-clients and kafka-streams at ${kafka.version}

Build: sourceDirectory src/main/java; maven-scala-plugin (org.scala-tools, ${scala.version}); maven-compiler-plugin 3.5.1 with source/target 1.8; maven-surefire-plugin 2.18.1 configured for the **/*Test.* and **/*Suite.* test patterns; maven-shade-plugin 2.3 bound to the package phase, shading all artifacts (*:*) while excluding the META-INF/*.SF, META-INF/*.DSA and META-INF/*.RSA signature files.
Write a Flink program that reads strings from a file, splits them into words on spaces, and prints the result.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.*;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * Requirement: count word occurrences in batch mode.
 * Before Flink 1.12 the unified stream/batch architecture was not mature, so batch and
 * stream jobs had to be written against two separate APIs; since that release one API can run both.
 * The data abstraction for batch jobs is DataSet; for stream jobs it is DataStream.
 * Since Flink 1.12, DataStream is the abstraction for both batch and stream jobs.
 */
public class BatchWordCount {
    public static void main(String[] args) throws Exception {
        /**
         * Steps:
         * 1) Get the batch execution environment
         * 2) Read the data from the given file path
         * 3) Split the data on spaces
         * 4) Map each word to a count of one
         * 5) Group the result of step 4 by word
         * 6) Aggregate the counts per word
         * 7) Print the output
         * 8) Launch the job
         */
        // 1) Get the batch execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 2) Read the data from the given file path (Ctrl+P: show parameter hints)
        DataSource<String> textFile = env.readTextFile("./data/input/wordcount.txt");

        // 3) Split the data on spaces (Ctrl+N: search classes)
        FlatMapOperator<String, String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
            // (Ctrl+I: override methods of the parent type)
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                // split the line on spaces
                String[] words = value.split(" ");
                // loop over the array and emit every word
                for (String word : words) {
                    out.collect(word);
                }
            }
        });

        // 4) Map each word to a count of one
        MapOperator<String, Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                return Tuple2.of(value, 1);
            }
        });

        // 5) Group the result of step 4 by word
        UnsortedGrouping<Tuple2<String, Integer>> grouped = wordAndOne.groupBy(0);

        // 6) Aggregate the counts per word
        AggregateOperator<Tuple2<String, Integer>> summed = grouped.sum(1);

        // 7) Print the output
        summed.print();

        // 8) Launch the job (can be skipped in batch mode: print() triggers execution)
    }
}
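The transformation chain above (split → map to (word, 1) → group by word → sum) can be checked outside Flink with plain Java collections. The sketch below is only an illustration of the logic; `countWords` is a hypothetical helper, not part of any Flink API:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class WordCountSketch {
    // Equivalent of flatMap(split on spaces) -> map to (word, 1) -> groupBy(word) -> sum(1)
    static Map<String, Integer> countWords(String... lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum); // add 1 to the word's running total
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("hello flink spark", "hello flink"));
        // {hello=2, flink=2, spark=1}
    }
}
```

Running it on the two sample lines yields each word with its total count, which is exactly what the Flink job prints (in some order) for the same input.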
Write a Flink program that receives words over a socket, splits them on spaces, and prints the result.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * Requirement: receive words over a socket, split them on spaces, and print them.
 */
public class StreamingWordCount {
    public static void main(String[] args) {
        /**
         * Steps:
         * 1) Get the stream execution environment
         * 2) Build a socket source with host and port
         * 3) Split the received data on spaces
         * 4) Map each word to a count of one
         * 5) Group the result of step 4 by word
         * 6) Aggregate the counts per word
         * 7) Print the output
         * 8) Launch the job
         */
        // 1) Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // set the global parallelism
        env.setParallelism(1);

        // 2) Build a socket source with host and port
        DataStreamSource<String> socketTextStream = env.socketTextStream("node1", 9999);

        // 3) Split the received data on spaces
        SingleOutputStreamOperator<String> streamOperator = socketTextStream.flatMap(new FlatMapFunction<String, String>() {
            // (Ctrl+I: override methods of the parent type)
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                // split the line on spaces
                String[] words = value.split(" ");
                // loop over the array and emit every word
                for (String word : words) {
                    out.collect(word);
                }
            }
        });

        // 4) Map each word to a count of one
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = streamOperator.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                return Tuple2.of(value, 1);
            }
        });

        // 5) Group the result of step 4 by word
        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndOne.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
            @Override
            public String getKey(Tuple2<String, Integer> value) throws Exception {
                return value.f0;
            }
        });

        // 6) Aggregate the counts per word
        SingleOutputStreamOperator<Tuple2<String, Integer>> summed = keyedStream.sum(1);

        // 7) Print the output (to stderr)
        summed.printToErr();

        // try-catch (Ctrl+Alt+T)
        try {
            // 8) Launch the job
            env.execute();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
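Note a difference from the batch version: with an unbounded source, `keyBy(...).sum(1)` emits an updated count for every arriving record rather than one final result per word. That rolling behavior can be sketched with plain Java (the names here are illustrative, not Flink API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollingSum {
    // For each incoming word, emit (word, running count so far), like keyBy + sum on a stream.
    static List<String> process(String... words) {
        Map<String, Integer> state = new HashMap<>(); // keyed state: one counter per word
        List<String> emitted = new ArrayList<>();
        for (String w : words) {
            int c = state.merge(w, 1, Integer::sum);  // update this word's counter
            emitted.add("(" + w + "," + c + ")");     // an updated result is emitted per record
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(process("hello", "flink", "hello"));
        // [(hello,1), (flink,1), (hello,2)]
    }
}
```

So typing "hello" twice into the socket produces both (hello,1) and later (hello,2) in the job's output.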
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

/**
 * Requirement: receive words over a socket, split them on spaces, and print them
 * (the same job as StreamingWordCount, written with lambdas).
 */
public class LambdaStreamingWordCount {
    public static void main(String[] args) {
        // 1) Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2) Build a socket source with host and port
        DataStreamSource<String> socketTextStream = env.socketTextStream("node1", 9999);

        // 3) Split the received data on spaces; the lambda loses its generic type, so declare it with returns()
        SingleOutputStreamOperator<String> streamOperator = socketTextStream
                .flatMap((String line, Collector<String> out) -> Arrays.stream(line.split(" ")).forEach(out::collect))
                .returns(Types.STRING);

        // 4) Map each word to a count of one
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = streamOperator
                .map(word -> Tuple2.of(word, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT));

        // 5) Group by word (keyBy(int) is deprecated; a KeySelector such as t -> t.f0 is preferred)
        KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = wordAndOne.keyBy(0);

        // 6) Aggregate the counts per word
        SingleOutputStreamOperator<Tuple2<String, Integer>> summed = keyedStream.sum(1);

        // 7) Print the output (to stderr)
        summed.printToErr();

        try {
            // 8) Launch the job
            env.execute();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
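The `.returns(...)` calls are needed because a Java lambda, unlike an anonymous inner class, does not keep its generic type arguments at runtime, so Flink cannot infer the operator's output type from it. The difference can be demonstrated with the plain JDK, no Flink required:

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

class ErasureDemo {
    public static void main(String[] args) {
        // Anonymous class: the generic superinterface Function<String, Integer> is recorded in the class file.
        Function<String, Integer> anon = new Function<String, Integer>() {
            @Override
            public Integer apply(String s) { return s.length(); }
        };
        // Lambda: the synthetic class only reports the raw interface Function.
        Function<String, Integer> lambda = s -> s.length();

        Type anonIface = anon.getClass().getGenericInterfaces()[0];
        Type lambdaIface = lambda.getClass().getGenericInterfaces()[0];
        System.out.println(anonIface instanceof ParameterizedType);   // true: type arguments recoverable
        System.out.println(lambdaIface instanceof ParameterizedType); // false: erased, must be declared explicitly
    }
}
```

This is why the anonymous-class versions above compile without hints, while the lambda version must declare its output types via `returns(Types.STRING)` and `returns(Types.TUPLE(...))`.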
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * Word count implemented in the unified stream/batch style.
 * The same program can run as a batch job or a streaming job, depending on the source.
 */
public class UnifyWordCount {
    public static void main(String[] args) {
        // 1) Get the stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //env.setRuntimeMode(RuntimeExecutionMode.BATCH);     // force batch execution; with a socket source this fails, because that source cannot run as a bounded batch
        //env.setRuntimeMode(RuntimeExecutionMode.STREAMING); // force streaming execution
        //env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC); // the default: choose the mode automatically from the source

        // 2) Build the source
        //DataStreamSource<String> socketTextStream = env.socketTextStream("node1", 9999);
        DataStreamSource<String> socketTextStream = env.readTextFile("./data/input/wordcount.txt");

        // 3) Split the received data on spaces
        SingleOutputStreamOperator<String> streamOperator = socketTextStream.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                // split the line on spaces
                String[] words = value.split(" ");
                // loop over the array and emit every word
                for (String word : words) {
                    out.collect(word);
                }
            }
        });

        // 4) Map each word to a count of one
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = streamOperator.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                return Tuple2.of(value, 1);
            }
        });

        // 5) Group by word
        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndOne.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
            @Override
            public String getKey(Tuple2<String, Integer> value) throws Exception {
                return value.f0;
            }
        });

        // 6) Aggregate the counts per word
        SingleOutputStreamOperator<Tuple2<String, Integer>> summed = keyedStream.sum(1);

        // 7) Print the output (to stderr)
        summed.printToErr();

        try {
            // 8) Launch the job
            env.execute();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
A Flink program can be submitted in two ways:
- via the web UI
- via the command line
If writing to HDFS fails with a permission problem:
apply the following setting:
hadoop fs -chmod -R 777 /
and add this to the code:
System.setProperty("HADOOP_USER_NAME", "root");
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

/**
 * Write a Flink program that reads strings, splits them into words on spaces, and prints the result.
 */
public class BatchWordCountToYarn {
    public static void main(String[] args) throws Exception {
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        String output = "";
        if (parameterTool.has("output")) {
            output = parameterTool.get("output");
            System.out.println("Using the specified output path: " + output);
        } else {
            output = "hdfs://node1:8020/wordcount/output47_";
            System.out.println("An output path can be given with --output; none was specified, using the default: " + output);
        }

        // TODO 0.env
        //ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //env.setRuntimeMode(RuntimeExecutionMode.BATCH);     // note: run the DataStream program as a batch job
        //env.setRuntimeMode(RuntimeExecutionMode.STREAMING); // note: run the DataStream program as a streaming job
        //env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC); // note: choose stream or batch automatically from the source

        // TODO 1.source
        //DataSet<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");
        DataStream<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");

        // TODO 2.transformation
        // split
        /*
        @FunctionalInterface
        public interface FlatMapFunction<T, O> extends Function, Serializable {
            void flatMap(T value, Collector<O> out) throws Exception;
        }
        */
        /*
        DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                // value is one line of input
                String[] arr = value.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });
        */
        SingleOutputStreamOperator<String> words = lines
                .flatMap((String value, Collector<String> out) -> Arrays.stream(value.split(" ")).forEach(out::collect))
                .returns(Types.STRING);

        // map each word to one
        /*
        @FunctionalInterface
        public interface MapFunction<T, O> extends Function, Serializable {
            O map(T value) throws Exception;
        }
        */
        /*
        DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                // value is a single word
                return Tuple2.of(value, 1);
            }
        });
        */
        DataStream<Tuple2<String, Integer>> wordAndOne = words
                .map((String value) -> Tuple2.of(value, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT));

        // group: note that grouping is groupBy on DataSet but keyBy on DataStream
        //wordAndOne.keyBy(0);
        /*
        @FunctionalInterface
        public interface KeySelector<IN, KEY> extends Function, Serializable {
            KEY getKey(IN value) throws Exception;
        }
        */
        KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);

        // aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        // TODO 3.sink
        // if the job fails with an HDFS permission error, run: hadoop fs -chmod -R 777 /
        System.setProperty("HADOOP_USER_NAME", "root"); // set the HDFS user name
        //result.print();
        //result.writeAsText("hdfs://node1:8020/wordcount/output47_" + System.currentTimeMillis()).setParallelism(1);
        result.writeAsText(output + System.currentTimeMillis()).setParallelism(1);

        // TODO 4.execute: launch and wait for the program to finish
        env.execute();
    }
}

package cn.itcast.day01.b;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * Requirement: count word occurrences in batch mode.
 * Before Flink 1.12 the unified stream/batch architecture was not mature, so batch and
 * stream jobs had to be written against two separate APIs; since that release one API can run both.
 * The data abstraction for batch jobs is DataSet; for stream jobs it is DataStream.
 * Since Flink 1.12, DataStream is the abstraction for both batch and stream jobs.
 */
public class BatchWordCount {
    public static void main(String[] args) throws Exception {
        // parse the arguments
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        String output = "";
        if (parameterTool.has("output")) {
            output = parameterTool.get("output");
            System.out.println("Using the specified output path: " + output);
        } else {
            output = "hdfs://node1:8020/wordcount/output66_";
            System.out.println("An output path can be given with --output; none was specified, using the default: " + output);
        }

        // environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // source
        DataStreamSource<String> source = env.fromElements("hello flink spark hadoop", "hello flink spark", "hello flink");

        // transformation: split
        SingleOutputStreamOperator<String> words = source.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String line, Collector<String> out) throws Exception {
                String[] arr = line.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });

        // (word, 1)
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String word) throws Exception {
                return Tuple2.of(word, 1);
            }
        });

        // group and sum
        KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = wordAndOne.keyBy(0);
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = keyedStream.sum(1);

        // sink: write the result
        System.setProperty("HADOOP_USER_NAME", "root");
        result.writeAsText(output + System.currentTimeMillis()).setParallelism(1);

        // execute
        env.execute();
    }
}
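`ParameterTool.fromArgs` reads `--key value` pairs from the command line, which is how the `--output` flag above reaches the job. Its behavior for this flag can be sketched in plain Java (a simplified stand-in, not Flink's actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

class ArgsSketch {
    // Parse "--key value" pairs the way the job expects its --output argument.
    static Map<String, String> fromArgs(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < args.length - 1; i += 2) {
            if (args[i].startsWith("--")) {
                params.put(args[i].substring(2), args[i + 1]); // strip the leading "--"
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = fromArgs(new String[]{"--output", "hdfs://node1:8020/wordcount/out"});
        // fall back to a default when --output is absent, as the job above does
        String output = p.getOrDefault("output", "hdfs://node1:8020/wordcount/output66_");
        System.out.println(output); // hdfs://node1:8020/wordcount/out
    }
}
```

With no `--output` on the command line the map stays empty and the hard-coded HDFS default is used, matching the if/else in the job.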
Package the project with the package goal in the Maven view.
/onekey/hd1_start.sh
cd /export/server/flink
bin/start-cluster.sh
Location of the jar package
1.7.2.5. View the results
See the official documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.14/zh/docs/dev/datastream/execution_mode/
./bin/flink run \
-Dexecution.runtime-mode=BATCH -m yarn-cluster -yjm 1024 -ytm 1024 \
-c cn.itcast.day01.b.BatchWordCount /root/original-flink-base-01-1.0-SNAPSHOT.jar \
--output hdfs://node1:8020/wordcount/output_50