1. Overview
Definition: Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
How it works: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
A DStream (discretized stream) is the basic abstraction; internally, it is represented as a sequence of RDDs.
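To make the batch model concrete, below is a minimal sketch in Scala, modeled on the standard socket word-count example: each transformation is applied to the RDD of every batch. The hostname, port, and 1-second batch interval are placeholder choices.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one for the socket receiver, one for processing (see 3.2 below).
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Batch interval of 1 second: the live stream is cut into 1-second batches.
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of lines received from a TCP socket (hostname/port are placeholders).
    val lines = ssc.socketTextStream("localhost", 9999)
    // These transformations are applied to the RDD of each batch.
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // wait for the computation to terminate
  }
}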
3. Basic Concepts
3.1 Linking
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.0.1</version>
</dependency>
For ingesting data from sources like Kafka, Flume, and Kinesis that are not present in the Spark Streaming core API, you will have to add the corresponding artifact spark-streaming-xyz_2.11 to the dependencies. For example, some of the common ones are as follows.
Source    Artifact
Kafka     spark-streaming-kafka-0-8_2.11
Flume     spark-streaming-flume_2.11
Kinesis   spark-streaming-kinesis-asl_2.11 [Amazon Software License]
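If the build uses sbt rather than Maven, the same artifacts can be declared roughly as follows. This is only a sketch: the 2.0.1 version is assumed to match the core dependency above, and only the Kafka artifact is shown as an example.

// build.sbt -- versions assumed to match the 2.0.1 core dependency above
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.11"           % "2.0.1",
  "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.0.1"
)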
3.2 Input DStreams and Receivers
Points to remember
When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
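As an illustration of this thread-count rule, the sketch below (with hypothetical hostnames and ports) creates two socket receivers, so the local master needs at least three threads: two are pinned to the receivers and at least one is left for processing the received data.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiReceiverExample {
  def main(args: Array[String]): Unit = {
    // Two socket receivers below, so at least 3 local threads are needed:
    // 2 threads are pinned to the receivers, leaving 1 (or more) for processing.
    // "local" or "local[1]" would starve the processing side entirely.
    val conf = new SparkConf().setMaster("local[3]").setAppName("MultiReceiverExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hostnames and ports are placeholders.
    val stream1 = ssc.socketTextStream("localhost", 9999)
    val stream2 = ssc.socketTextStream("localhost", 9998)

    // Union the two receiver-based DStreams and process them together.
    val merged = stream1.union(stream2)
    merged.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}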
3.3 Basic Sources