Cheatography

Flink Cheat Sheet (DRAFT) by

For revising Flink before interviews

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Flink's API

SQL
High-level language
Table API
Declarative DSL
DataStream/DataSet API
Core APIs
Stateful Stream Processing
Low-level building block

Flink architecture

The client is not part of the runtime and program execution; it is used to prepare and send a dataflow to the JobManager. After that, the client can disconnect (detached mode) or stay connected (attached mode).

DataStream

Data stream
Immutable collections of data that can contain duplicates; can be either finite or unbounded
Flink program
Obtain an execution environment
 
Load/create the initial data
 
Apply transformations on the data
 
Specify where to put the results
 
Trigger the execution
Flink programs are executed lazily
Operations do not run directly; instead, each operation is created and added to a dataflow graph
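Lazy execution can be illustrated with a minimal plain-Java sketch (no Flink dependency; all names here are invented for the example): map() only records an operation, and nothing runs until execute() is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative sketch only: calling map() just records the operation,
// mirroring how Flink adds operators to the dataflow graph lazily.
public class LazyPipeline {
    private final List<UnaryOperator<Integer>> graph = new ArrayList<>();

    public LazyPipeline map(UnaryOperator<Integer> op) {
        graph.add(op);          // operation is only recorded, not run
        return this;
    }

    public int execute(int input) {
        int value = input;      // the recorded graph runs only now
        for (UnaryOperator<Integer> op : graph) value = op.apply(value);
        return value;
    }

    public static void main(String[] args) {
        LazyPipeline p = new LazyPipeline().map(x -> x + 1).map(x -> x * 2);
        System.out.println(p.execute(5)); // prints 12
    }
}
```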

Data source overview

StreamExecutionEnvironment
getExecutionEnvironment();
File-based data sources
env.readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)
watchType can be FileProcessingMode.PROCESS_CONTINUOUSLY or FileProcessingMode.PROCESS_ONCE
Socket-based
env.socketTextStream
Collection-based
env.fromCollection, env.fromElements
Custom source
env.addSource
A sequence of numbers
env.generateSequence(0, 1000)

Data sink overview

writeAsText() / TextOutputFormat
Writes elements line-wise as Strings.
writeAsCsv() / CsvOutputFormat
Writes tuples as comma-separated value files.
print() / printToErr()
Prints the toString() value of each element to standard out / standard error.
writeUsingOutputFormat() / FileOutputFormat
Method and base class for custom file outputs.
writeToSocket
Writes elements to a socket.
addSink
Invokes a custom sink function.

Timely Stream processing

Processing time
System time of the machine that is executing the respective operation
 
No coordination needed between streams and machines
 
Best performance and lowest latency
 
Does not provide determinism in distributed and asynchronous environments
Event time
The time at which each individual event occurred on its producing device
 
Extracted from the records
 
Consistent and deterministic
 
Higher latency while waiting for out-of-order events
Watermark
A mechanism to measure progress in event time
 
Watermarks flow as part of the data stream and carry a timestamp t
 
Watermark(t) declares that event time has reached time t in that stream: there should be no more elements with timestamp <= t
 
Crucial for out-of-order streams
Watermark strategy
TimestampAssigner + WatermarkGenerator
WatermarkGenerator
onEvent: called for every event
 
onPeriodicEmit: called periodically; may or may not emit a new watermark
 
Two styles: punctuated or periodic
WatermarkStrategy.forMonotonousTimestamps();
For monotonically ascending timestamps: the event time itself serves as the watermark
WatermarkStrategy.forBoundedOutOfOrderness
The watermark lags behind the maximum timestamp seen in the stream by a fixed amount of time
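A plain-Java sketch of the bounded-out-of-orderness idea (illustrative only, no Flink dependency; the class name is invented, but the onEvent/onPeriodicEmit shape mirrors Flink's WatermarkGenerator interface). Flink's built-in generator additionally subtracts 1 ms so the watermark sits strictly below any unseen timestamp; that detail is omitted here for simplicity.

```java
// Illustrative sketch: the watermark trails the maximum timestamp
// seen so far by a fixed delay (the allowed out-of-orderness).
public class BoundedLatenessWatermark {
    private final long maxLatenessMillis;
    private long maxTimestamp = Long.MIN_VALUE;

    public BoundedLatenessWatermark(long maxLatenessMillis) {
        this.maxLatenessMillis = maxLatenessMillis;
    }

    // Corresponds to WatermarkGenerator#onEvent: track the max timestamp.
    public void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    // Corresponds to WatermarkGenerator#onPeriodicEmit: the current
    // watermark lags the max timestamp by the allowed lateness.
    public long currentWatermark() {
        return maxTimestamp - maxLatenessMillis;
    }

    public static void main(String[] args) {
        BoundedLatenessWatermark wm = new BoundedLatenessWatermark(5_000);
        wm.onEvent(10_000);
        wm.onEvent(8_000);  // out-of-order event does not move the watermark back
        System.out.println(wm.currentWatermark()); // prints 5000
    }
}
```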

State

Stateful operator
Remembers information across multiple events
Keyed state
Embedded key/value store
 
Partitioned and distributed strictly together with the streams
 
Available only on keyed streams
State persistence
Fault tolerance via stream replay and checkpointing
Checkpoint
Marks a specific point in each of the input streams along with the corresponding state for each of the operators
 
Drawing consistent snapshots of the distributed data stream and operator state
Stream barriers
Injected into the data stream and flow with the records as part of the data stream
 
Separate the records in the data stream into the set of records that goes into the current snapshot and the records that go into the next snapshot
 
The point where the barriers for snapshot n are injected is the position in the source stream up to which the snapshot covers the data
 
Alignment phase: on receiving the barrier for snapshot n on one incoming stream, the operator must wait until it has received the barrier on all other inputs
Snapshot operator state
Taken at the point in time when the operator has received all barriers from its input streams and before emitting the barriers to its output streams
 
For each parallel stream data source: the offset/position in the stream when the snapshot started
 
For each operator: a pointer to the state that was stored
Unaligned checkpoint
The operator reacts to the first barrier that is stored in its input buffers, without waiting for alignment
Checkpoint
Simple external dependencies
 
Immutable and versioned
 
Decouples the stream transport from the persistence mechanism
Backpressure
A slow receiver makes the senders slow down in order not to overwhelm the receiver
Snapshot
Generic term referring to a global, consistent image of the state of a Flink job
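The alignment phase above can be sketched in plain Java (illustrative only, no Flink dependency; class and method names are invented for this example): an operator snapshots its state only once the barrier for snapshot n has arrived on every input channel.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of barrier alignment: an operator with several
// input channels waits for the barrier on all of them before snapshotting.
public class BarrierAlignment {
    private final int inputChannels;
    private final Set<Integer> channelsWithBarrier = new HashSet<>();
    private boolean snapshotTaken = false;

    public BarrierAlignment(int inputChannels) {
        this.inputChannels = inputChannels;
    }

    // Called when the barrier for the current snapshot arrives on a channel.
    public void onBarrier(int channel) {
        channelsWithBarrier.add(channel);
        if (channelsWithBarrier.size() == inputChannels) {
            snapshotTaken = true;   // all barriers received: snapshot state,
                                    // then forward the barrier downstream
        }
    }

    public boolean isSnapshotTaken() {
        return snapshotTaken;
    }

    public static void main(String[] args) {
        BarrierAlignment op = new BarrierAlignment(2);
        op.onBarrier(0);
        System.out.println(op.isSnapshotTaken()); // false: still aligning
        op.onBarrier(1);
        System.out.println(op.isSnapshotTaken()); // true: aligned, snapshot taken
    }
}
```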

RocksDB tuning

Incremental checkpoints
Record the changes compared to the previous completed checkpoint, instead of producing a full, self-contained backup
Timers
Schedule actions for later; can be kept on the JVM heap instead of in RocksDB:
 
state.backend.rocksdb.timer-service.factory: heap
Tuning RocksDB memory
RocksDB uses Flink's managed memory to buffer and cache
 
Increase the amount of managed memory if needed
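The tuning knobs above map to entries in flink-conf.yaml; a sketch (key names are taken from Flink's configuration documentation, the values are examples):

```yaml
# Incremental RocksDB checkpoints: store only the diff vs. the last completed checkpoint
state.backend.incremental: true
# Keep timers on the JVM heap instead of in RocksDB
state.backend.rocksdb.timer-service.factory: heap
# Give RocksDB a larger share of managed memory (fraction of TaskManager memory)
taskmanager.memory.managed.fraction: 0.5
```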

Window

Definition
Splits the stream into buckets of finite size, over which we can apply computations
Keyed windows
.keyBy().window().[.trigger()][.evictor()][.allowedLateness()][.sideOutputLateData()].reduce/aggregate/apply()
 
Performed in parallel by multiple tasks
Non-keyed windows
.windowAll().[.trigger()][.evictor()][.allowedLateness()][.sideOutputLateData()].reduce/aggregate/apply()
 
Performed by a single task (parallelism = 1)
Lifecycle
Created: when the first element belonging to the window arrives
 
Removed: when time passes the window's end timestamp + the allowed lateness
Window Assigner
Responsible for assigning each incoming element to one or more windows
 
Time-based assigners assign based on a start timestamp (inclusive) and an end timestamp (exclusive)
 
TumblingWindows: assigns each element to one window of a specified window size; windows are fixed-size and do not overlap
 
SlidingWindows: assigns each element to multiple windows; fixed size, can overlap (window slide < window size)
 
SessionWindows: groups elements by sessions of activity; sessions do not overlap and have no fixed time; a session closes when it does not receive elements for a certain period of time
 
GlobalWindows: assigns all elements with the same key to the same global window; only useful with a custom trigger, because it has no natural end
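The time-based assigners above boil down to simple arithmetic on timestamps. A plain-Java sketch (illustrative only, no Flink dependency; method names are invented), using the inclusive-start / exclusive-end convention:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of time-window assignment.
public class WindowAssign {

    // Tumbling windows: each timestamp falls into exactly one window.
    public static long tumblingWindowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    // Sliding windows: with slide < size, windows overlap and one
    // timestamp belongs to size / slide windows.
    public static List<Long> slidingWindowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);  // every window [start, start + size) containing the timestamp
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(tumblingWindowStart(7_500, 5_000));         // prints 5000
        System.out.println(slidingWindowStarts(7_500, 10_000, 5_000)); // prints [5000, 0]
    }
}
```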
Window Functions
The computation performed on each window
ReduceFunction
Incrementally aggregates
 
Two elements from the input are combined to produce an output element of the same type
AggregateFunction
Generalised version of a ReduceFunction with 3 types: IN, ACC, OUT
 
Methods: create the initial accumulator, add, merge, extract the output
ProcessWindowFunction
Receives an Iterable containing all elements of the window
 
Plus a Context object with time and state information
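The IN/ACC/OUT shape of an AggregateFunction can be sketched with a running average (plain Java, no Flink dependency; the class is invented, with IN = long, ACC = a (sum, count) pair, OUT = double):

```java
// Illustrative sketch of the AggregateFunction shape and its methods:
// create accumulator, add one input, merge partials, extract the result.
public class AverageAggregate {
    // ACC: mutable (sum, count) pair.
    static class Acc { long sum; long count; }

    public Acc createAccumulator() { return new Acc(); }

    public Acc add(long value, Acc acc) {   // fold one IN into the ACC
        acc.sum += value;
        acc.count += 1;
        return acc;
    }

    public Acc merge(Acc a, Acc b) {        // merge two partial ACCs
        a.sum += b.sum;
        a.count += b.count;
        return a;
    }

    public double getResult(Acc acc) {      // extract OUT from the ACC
        return (double) acc.sum / acc.count;
    }

    public static void main(String[] args) {
        AverageAggregate agg = new AverageAggregate();
        Acc acc = agg.createAccumulator();
        acc = agg.add(4, acc);
        acc = agg.add(6, acc);
        System.out.println(agg.getResult(acc)); // prints 5.0
    }
}
```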