
Data Center and Cloud Cheat Sheet by

Flow table management


TCAM lookup compares data against a predefined ruleset in one operation
Returns the action or address of the first match
Rules consist of 1s, 0s and Xs (don't care)
OF table entries contain Match, Action, Counter, Priority, Timeout
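The ternary lookup above can be sketched in software. This is only an illustration: a real TCAM compares all rules in parallel in hardware, while this loop checks them one by one. The 8-bit patterns and action strings are made up for the example.

```python
def ternary_match(pattern: str, key: str) -> bool:
    """Return True if every non-X position of pattern equals the key bit."""
    return len(pattern) == len(key) and all(
        p in ("X", k) for p, k in zip(pattern, key)
    )


def lookup(rules, key, default="miss"):
    """Return the action of the first rule (in list order) that matches key."""
    for pattern, action in rules:
        if ternary_match(pattern, key):
            return action
    return default


# Hypothetical 8-bit rules, listed from highest to lowest priority.
TCAM_RULES = [
    ("1010XXXX", "forward:port1"),
    ("10XXXXXX", "forward:port2"),
    ("XXXXXXXX", "drop"),
]

print(lookup(TCAM_RULES, "10101111"))
print(lookup(TCAM_RULES, "10001111"))
print(lookup(TCAM_RULES, "01000000"))
```

Rule order doubles as priority here, which is why the catch-all pattern goes last.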

Geometric Representation of Rules

Rules can be specified by prefix/length pairs or operator/number (range)
Rule with d fields -> d-dimensional hyper-rectangle
Match condition is finding the highest-priority hyper-rectangle enclosing point P
When rectangles overlap, smallest rectangle "wins"
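A minimal sketch of the geometric view, assuming rules are stored as inclusive ranges per field and that a larger priority number wins. The rule names, ranges and priorities are hypothetical; a linear scan stands in for whatever lookup structure a real classifier would use.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Rule(NamedTuple):
    name: str
    priority: int  # assumption: larger number = higher priority
    ranges: Tuple[Tuple[int, int], ...]  # one inclusive (low, high) per field


def prefix_to_range(prefix: int, length: int, width: int = 8) -> Tuple[int, int]:
    """Convert a prefix/length pair on a width-bit field to an inclusive range."""
    shift = width - length
    low = (prefix >> shift) << shift
    return low, low | ((1 << shift) - 1)


def encloses(rule: Rule, point: Sequence[int]) -> bool:
    """True if the point lies inside the rule's hyper-rectangle."""
    return all(low <= x <= high for (low, high), x in zip(rule.ranges, point))


def classify(rules: Sequence[Rule], point: Sequence[int]) -> Optional[Rule]:
    """Return the highest-priority rule enclosing the point, or None."""
    matches = [r for r in rules if encloses(r, point)]
    return max(matches, key=lambda r: r.priority, default=None)


# Two hypothetical 2-field rules that overlap in x 6-7, y 5-6.
RECT_RULES = [
    Rule("R1", 2, ((0, 7), (5, 6))),
    Rule("R3", 1, ((6, 7), (0, 15))),
]

print(classify(RECT_RULES, (6, 5)))   # inside both rectangles
print(classify(RECT_RULES, (6, 10)))  # inside R3 only
print(prefix_to_range(0b10100000, 4))
```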

Edge Network Mgmt

Mgmt is 80% of IT budget and responsible for 62% of outages
Networks should be truly transparent
-Large network scalability
-Flexible policies: custom routing, measurement and diagnosis, access control
-Commodity switches: small memory, expensive and power hungry, more link speed, storing lots of states, monitoring flows, QoS
DC Networks:
-VM migration, load balancing, task scheduling, anomaly detection/isolation


DIFANE Design goals:
-Scale with network growth
-Improve per-packet performance: always keep packets in the data plane
-Minimal switch modification: no change to data plane hardware
DIFANE stages:
-Controller proactively generates rules and distributes them to authority switches
-Controller proactively partitions the rule space in wildcards and distributes the partitions to all switches
-Ingress switches receive unknown flows and contact the authority switch for the correct wildcard space
-Authority switch forwards the packet to the correct destination and caches the corresponding rule in the ingress switch for future packets
Caching wildcard rules:
-Controller creates new rules for lower priority rules that overlap with high priority rules (e.g. R1 covers 0-7 x 5-6 and R3 covers 6-7 x 0-15; they overlap in 6-7 x 5-6, so the controller splits R3 into 6-7 x 0-4 and 6-7 x 7-15)
-Rules must be correctly partitioned by the controller to ensure optimal usage of TCAM; some cuts are better than others


Network dynamics:
-Policy change at controller: Timeout cache rules, Change authority rules, No change for partitions
-Topology change at switch: No change in cache rules, no change in authority rules, Change in Partition rules
-Host mobility: Timeout cache rules, No change in authority or partition rules

Caching in Buckets

Partition rulespace in a grid of buckets
-Larger buckets mean more rules are cached each time
-Smaller buckets means more buckets need to be cached
-Partition until number of associated rules is bounded
-Sweet spot for bucket size lies in a middle region; buckets smaller or larger than this lead to memory overflow
-CAB reduces control network BW, flow setup latency and controller load
-Fully compatible with OF standard, resolves rule dependencies in wildcard rule caching

Cloud Security

Typical practices:
-Reinforce application security, strong network perimeter security
-Access control inside cloud for app/service/tenant isolation
-Gauge risk control when using public cloud
-Placing new security hardware is not easy
-Security devices are typically shared, misconfiguration in one compromises many services, apps and hosts
-Tight work between network and security teams, high cost and low efficiency

Policy Aware Switch

-Makes forwarding decisions based on various factors, such as previous hop, input port, source/dest address

Cloud NaaS

-Virtual network isolation
-Custom addressing
-Service differentiation
-Flexible middlebox interposition
Cloud controller: provides VM instance management, self-service provisioning, host virtual switch interconnection
Network controller: provides VM placement directives to cloud controller, generates virtual network between VMs, configures physical and virtual switches

Hybrid Security Architecture

-Tenants everywhere -> Middlebox anywhere
-Flexible traversal: traffic-specific, middlebox type, arbitrary number and order
-Decouple networking from security, creating appliance layer
App layer: App VMs with security groups
Appliance layer: Traversal path of middleboxes
Network layer: Only cares about packet delivery
-Forwarding: MAC rewrite for L2, IP in IP for L3

HSA Benefits

-Scalable and flexible provisioning
-Facilitates virtualization, simplifies service development, testing, deployment and troubleshooting
-Enables dynamic and heterogeneous service provisioning
-Minimizes misconfiguration impacts

SDN Security

Bottlenecks: Weak OF Agent CPU, limited message processing capabilities, limited TCAM/SRAM resources -> table overflows
Solution: Leverage NFV to build a software-based defense line
NFV in edge clouds:
-Elastic resource allocation
-Network function as a service
-Rapid innovation

SDN Shield

-Controller monitors the packet-in message rate from each switch
-When one switch's rate approaches saturation: countermeasure
-Use a second Attack Mitigation Unit
How to Identify Legitimate Flows:
-Use statistical filtering
Conditional Legitimate Probability:
-Analyze header field distribution
-Compare most recent measurement to reference profile
-Build scoreboard to calculate new flow's legitimacy probability
-Threshold to control the rate of passed flows


1: Detect attack -> monitor key parameters of traffic destined to protected targets, contain by limiting resource consumption
2: Differentiate attacking packets from legitimate ones in suspicious traffic: compare against baseline and use CLP to compute the likelihood of each suspicious packet being legitimate
3: Discard suspicious packets selectively by comparing CLP with a dynamic threshold

Attack types

Endpoint: overload a victim or stub network -> Easily isolated by upstream routers, attacking packets have victim IP/subnet
-Monitor traffic rate, flow rate towards each host/stub -> large number of targets monitored
-Use Bloom filter to catch targets under attack, use DDoS control server to aggregate and correlate
Infrastructure: Overload some choke-point (router uplink) -> hard to isolate unless packet traceback infrastructure is in place
-Monitor traffic parameters on links in routers


If packet attributes are independent, the joint probability mass function factors into P(A=a)*P(B=b)...
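The independence assumption makes a per-packet score cheap to compute: one table lookup per attribute. The sketch below is one possible formulation, assuming the score is the product of nominal-to-measured probability ratios for each attribute; the attribute names, distributions and the fixed threshold are hypothetical (the notes call for a dynamic threshold).

```python
def clp_score(packet, nominal, measured, floor=1e-6):
    """Score a packet as the product of per-attribute ratios
    P_nominal(value) / P_measured(value), assuming independent attributes.
    Unseen values fall back to a small floor probability."""
    score = 1.0
    for attr, value in packet.items():
        p_nominal = nominal[attr].get(value, floor)
        p_measured = measured[attr].get(value, floor)
        score *= p_nominal / p_measured
    return score


# Hypothetical reference profile (baseline) and current measurement (under attack).
NOMINAL = {"ttl": {64: 0.5, 128: 0.5}, "proto": {"tcp": 0.8, "udp": 0.2}}
MEASURED = {"ttl": {64: 0.2, 128: 0.8}, "proto": {"tcp": 0.4, "udp": 0.6}}

THRESHOLD = 1.0  # fixed here for illustration only

for pkt in ({"ttl": 64, "proto": "tcp"}, {"ttl": 128, "proto": "udp"}):
    score = clp_score(pkt, NOMINAL, MEASURED)
    print(pkt, round(score, 3), "pass" if score >= THRESHOLD else "discard")
```

Values that became more frequent than in the baseline pull the score down, so packets that resemble the attack surge are discarded first.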


-VXLAN: network virtualization technology to address scalability problems in large cloud deployments
-VLAN-like encapsulation, encapsulates L2 frames in UDP packets with port 4789 using a VNI
-Endpoints are called VTEPs, and may be virtual switches, hypervisors or physical switches
-Overlay network is usually a multicast cloud
-NVGRE uses GRE to encapsulate L2 frames in L3 packets across L3 networks


Challenges: Oversubscription, Scalability, Cost, Mobility and Latency
Network virtualization: Create overlay networks on top of physical network infrastructure
VXLAN 24-bit ID -> 16M networks
-Can cross L3, 50 bytes of overhead
-VMs don't see tag
-L2 broadcast is replaced by IP multicast
-VLAN sprawl
-Single fault domains
-Scalability beyond 4096 segments
-Non-proprietary fabric
-IP mobility
-Physical cluster size and locality improves
-Better multitenancy
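The 24-bit ID and the 50 bytes of overhead can be checked with a short sketch. It packs the 8-byte VXLAN header (flags byte with the VNI-valid bit, 24 reserved bits, 24-bit VNI, 8 reserved bits) and adds up the outer headers, assuming an untagged outer Ethernet frame and outer IPv4. Function and constant names are illustrative.

```python
import struct

VXLAN_UDP_PORT = 4789
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the VNI-valid flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8


# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN (8) bytes.
OVERHEAD_BYTES = 14 + 20 + 8 + 8

print(vxlan_header(5000).hex())
print(parse_vni(vxlan_header(5000)))
print(2**24, OVERHEAD_BYTES)
```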

Arista VXLAN 2

VTEP: Tunnel endpoint
VXLAN GW: Bridges VXLAN to non-VXLAN environment (HW or SW)
VNI: Identifies VXLANs
VTI: Terminates a VTEP
VXLAN Segment: L2 overlay network over which VMs communicate, only VMs within same VXLAN segment can communicate
OVSDB: Allows management of Open vSwitches, create or delete ports, tunnels, and queues


VXLAN Overlay is an L2 broadcast domain identified by a VNI
VXLAN encap:
-Outer header -> IP source and dest from VTEP endpoints, L2 source from VTEP source, L2 dest from next L3 hop, UDP port dest 4789
Gateway types:
L2-> VLAN to VXLAN bridging
L3-> VXLAN to VXLAN routing

VXLAN Flood and Learn

VNI is mapped to a multicast group on a VTEP
Broadcast, Unknown Unicast and Multicast traffic is flooded to the multicast group of the VNI
Remote VTEPs of the group learn host MAC, VNI and source VTEP IP from flooded multicast traffic
Unicast packets for the host are sent directly to the source VTEP IP
Encapsulated packet:
-Outer: UDPd: 4789, IPd: remote VTEP/multicast group, IPs: source VTEP, Md: remote VTEP/multicast MAC
-Inner: IPd: Remhost, IPs: Sourcehost, Md: Remhost/Broadcast, Ms: Sourcehost
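A minimal sketch of the flood-and-learn behaviour, assuming each VTEP keeps a table keyed by (VNI, inner MAC). The class, method names and addresses are hypothetical; real VTEPs also age entries out and handle the outer MAC resolution, which is omitted here.

```python
class Vtep:
    """Toy VTEP that learns remote hosts from decapsulated traffic."""

    def __init__(self, ip, vni_groups):
        self.ip = ip
        self.vni_groups = vni_groups  # VNI -> multicast group IP
        self.mac_table = {}           # (VNI, inner MAC) -> remote VTEP IP

    def learn(self, vni, inner_src_mac, outer_src_ip):
        """Record which VTEP a host sits behind, from a received packet."""
        self.mac_table[(vni, inner_src_mac)] = outer_src_ip

    def outer_destination(self, vni, inner_dst_mac):
        """Known unicast goes to the learned VTEP; everything else
        (broadcast, unknown unicast, multicast) is flooded to the VNI's group."""
        return self.mac_table.get((vni, inner_dst_mac), self.vni_groups[vni])


vtep = Vtep("10.0.0.1", {5000: "239.1.1.1"})
print(vtep.outer_destination(5000, "aa:aa:aa:aa:aa:01"))  # unknown: flood
vtep.learn(5000, "aa:aa:aa:aa:aa:01", "10.0.0.2")
print(vtep.outer_destination(5000, "aa:aa:aa:aa:aa:01"))  # learned: unicast
```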


1 L3 VNI per VRF per VTEP
1 L2 VNI per L2 segment, multiple L2 VNIs per tenant
BGP minimizes network flooding and allows VTEP peer discovery and authentication
All VTEPs keep the same gateway IP address for each L2 VNI (anycast gateway)
-Hosts send out GARP when they come online
-Local VTEP creates local ARP cache and advertises through BGP as Route Type 2
-Remote VTEP puts IP-MAC info into remote ARP cache and suppresses ARP for this IP
-VTEP floods if no match is found in cache


Asymmetric IRB: different path from source to dest and back, VTEP must be configured with both source and dest VNIs for both L2 and L3
Symmetric IRB: same path to destination and back, ingress VTEP routes from source VNI to L3 VNI and changes inner dest MAC to egress VTEP router MAC
Route Types:
Type 2: MAC advertisement -> L2 VNI MAC/MAC-IP -> MAC and ARP resolution
Type 5: IP Prefix Route -> L3 VNI route -> advertise prefix


L2 traffic cannot traverse VNI boundaries
L3 traffic from one VRF is mapped to a L3 VNI
L3 traffic from different VRFs cannot traverse L3 VNI boundaries
BGP update sends Host MAC, Host IP, L3 VNI and VTEP
Remote VTEPs take Host MAC and put it in MAC table, and Host IP and put it in VRF (L3 VNI) IP table
Local host information is learned through conventional L2 learning and GARP, or through mgmt plane integration between VTEP and hosts


-Improves scalability
-Enables control plane learning of L2 end host and L3 reachability
-Reduced network flooding
-Optimal east-west and north-south forwarding
-VTEP discovery and authentication


Asymmetric IRB:
-Ingress VTEP does L2 and L3 lookup, egress VTEP only L2 lookup
-All VTEPs need all VNIs
-Ingress VTEP routes from source VNI to dest VNI
-Not scalable
Symmetric IRB:
-Both VTEPs perform L2 and L3 lookup
-Inter-VXLAN traffic is encapsulated in L3 VNI, which identifies VRFs
-Ingress VTEP does not need to know dest VNI

TCP in the DC

TCP is not a good fit for the DC
-Adds latency
-Wastes buffer space
-Performs badly with shallow-buffer switches
DC Workloads:
-Partition/Aggregate (delay-sensitive, bursty)
-Short messages (delay-sensitive)
-Large flows (throughput-sensitive)
Incast: Synchronized congestion from partition-aggregate workloads
-Seemingly underutilized links become overutilized in short bursts, causing unseen drops

DC Transport Requirements

-High burst tolerance
-Low latency
-High throughput
Traditional TCP:
-Window flow control: lost packets detected by missing ACKs
-W = BW x RTT -> awnd (receiver), cwnd (network), W = min(awnd,cwnd)
Algorithms to calculate cwnd: Tahoe, Reno, NewReno, DCTCP

TCP Tahoe and Reno

Tahoe:
-3 DUP ACKs -> Fast Retransmit, set ssthresh to cwnd/2, reduce cwnd to 1 MSS, reset to slow start
-ACK timeout (RTO) -> Slow start, cwnd -> 1 MSS
Reno:
-3 DUP ACKs -> Fast Retransmit and skip slow start, set cwnd to cwnd/2, enter fast recovery
-ACK timeout (RTO) -> Slow start, cwnd -> 1 MSS
Fast recovery: wait for ACK for entire window before returning to CA, if no ACK enter slow start


Slow start: Start with cwnd = 1, each ACK cwnd <- cwnd + 1, each RTT cwnd <- 2 x cwnd (exponential)
CA: enter when cwnd >= ssthresh, each ACK cwnd <- cwnd + 1/cwnd
-Each RTT: cwnd <- cwnd + 1
Fast Retransmit: flightsize = min(awnd,cwnd), ssthresh = max(flightsize/2,2)
-Enter slow start with cwnd = 1
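The window rules above can be written as three small update functions. This is a simplified sketch in units of MSS, not a TCP implementation: it ignores fast-recovery window inflation and takes the flight size as an input. The function names and the "tahoe"/"reno" labels are illustrative.

```python
def on_ack(cwnd, ssthresh):
    """Grow the window on one new ACK (units: MSS)."""
    if cwnd < ssthresh:
        return cwnd + 1       # slow start: doubles every RTT
    return cwnd + 1 / cwnd    # congestion avoidance: about +1 per RTT


def on_triple_dupack(flightsize, variant):
    """Return (cwnd, ssthresh) after three duplicate ACKs."""
    ssthresh = max(flightsize / 2, 2)
    if variant == "tahoe":
        return 1, ssthresh          # back to slow start
    if variant == "reno":
        return ssthresh, ssthresh   # halve and skip slow start
    raise ValueError(f"unknown variant: {variant}")


def on_timeout(flightsize):
    """Return (cwnd, ssthresh) after an RTO; same for both variants."""
    return 1, max(flightsize / 2, 2)


print(on_triple_dupack(16, "tahoe"))
print(on_triple_dupack(16, "reno"))
print(on_timeout(16))
```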

New Reno

Remember last segment sent before Fast Retransmit
-Deal with partial ACK (new ACK does not cover last remembered segment, i.e. more packets were lost before entering FR)
-Retransmit the newly detected lost packet too and remain in Fast Recovery; exit when an ACK that covers the last segment sent before FR is received
-Each new DUP ACK: cwnd = cwnd + mss
-When partial ACK received: cwnd = cwnd - (currACK - prevACK)*mss + mss


A single flow needs C*RTT buffers for 100% TP
For large N flows C*RTT/sqrt(N) is enough
DCTCP:
-Idea: React to the fraction of ECN marks instead of their mere presence; a few marks cut the window by about 5% (TCP cuts by half regardless of number of marks)
-At switch mark packets when queue length > K
-At sender keep F = #markedACKs/totalACKs, a = (1-g)*a + g*F, and cut cwnd <- cwnd*(1 - a/2)
Benefit: keep queue length short and TP high
Tradeoff: Convergence time is greater for new flows
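The ECN-based sender update above matches DCTCP, and can be sketched as a running average plus a proportional window cut. The gain g = 1/16 is an assumed default and the function names are illustrative; the window is in MSS.

```python
def update_alpha(alpha, marked_acks, total_acks, g=1 / 16):
    """Once per window: a = (1 - g) * a + g * F, with F the marked fraction."""
    fraction = marked_acks / total_acks if total_acks else 0.0
    return (1 - g) * alpha + g * fraction


def dctcp_cwnd(cwnd, alpha):
    """Cut the window in proportion to the marking estimate (floor of 1 MSS)."""
    return max(cwnd * (1 - alpha / 2), 1.0)


alpha = 0.0
for marked in (1, 1, 10, 10):  # marked ACKs out of 10 in successive windows
    alpha = update_alpha(alpha, marked, 10)
    print(round(alpha, 4), round(dctcp_cwnd(100, alpha), 2))
```

With a = 0.1 the cut is 5%, and with a = 1 (every packet marked) it reduces to the usual TCP halving.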

TCP Losses

Block loss: lose a whole window of packets
Double loss: lose a retransmitted packet, protocol can't tell
-Solution: timestamp
Tail loss: one of the last packets of the stream is lost, not enough DUP ACKs to trigger retransmission
-Solution: send dummy data (e.g. reiterated FIN)
PLATO: Send interleaved heartbeats to avoid RTO and infer loss by 3 DUP ACKs; heartbeat is rarely dropped

Traffic Scheduling: D3

Make network aware of flow deadlines
Prioritize based on deadlines
When capacity is greater than desired rates: deadline flows get desired rate + fair share, non-deadline get only fair share
When capacity is not enough: greedily satisfy as many flows as possible according to request rates in order of arrival
-Need to modify hosts and switches, not backward compatible, no incremental deployment
-Not friendly with legacy transport protocols, running in parallel degrades performance
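The two allocation cases above can be sketched in one function, assuming non-deadline flows are represented by a desired rate of 0 and that a flow arriving when capacity is nearly exhausted receives whatever is left. The function name and flow IDs are hypothetical, and the per-RTT request/grant signalling of the real protocol is left out.

```python
def d3_allocate(capacity, flows):
    """flows: list of (flow_id, desired_rate) in arrival order.
    Grants desired rates greedily, then splits leftover capacity evenly."""
    granted = {}
    left = capacity
    for flow_id, desired in flows:
        rate = min(desired, left)  # later arrivals get what remains
        granted[flow_id] = rate
        left -= rate
    share = left / len(flows) if flows else 0.0
    return {flow_id: rate + share for flow_id, rate in granted.items()}


# Enough capacity: deadline flows a, b get desired + fair share, c only fair share.
print(d3_allocate(12, [("a", 4), ("b", 2), ("c", 0)]))
# Not enough capacity: first arrival is satisfied, the second gets the remainder.
print(d3_allocate(5, [("a", 4), ("b", 2)]))
```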

Traffic Scheduling - pFabric

-Prioritize packets based on remaining flow size
-pFabric switch: implement scheduling based on priority (send high priority first, drop low priority first)
-pFabric host: send/retransmit aggressively, use simple flow control (minTCP)
-Very small buffers, 2 x LinkSpeed x RTT
-Worst case: small packets (64B) leave 51.2ns (64*8 bits / 10Gbps) to find min/max of 600 numbers; a binary tree takes 10 clock cycles, at about 1ns per cycle with current ASICs
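A toy model of a pFabric port, assuming priority is the remaining flow size carried in each packet (smaller is more urgent). It is a simplification: linear min/max scans stand in for the hardware comparator tree, and it dequeues the best packet directly rather than the earliest packet of the best flow. The class name and packet IDs are illustrative.

```python
class PFabricPort:
    """Small buffer that sends the most urgent packet and drops the least urgent."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []  # (remaining_flow_size, packet_id)

    def enqueue(self, remaining, packet_id):
        """Add a packet; if the buffer overflows, drop the lowest-priority
        packet (largest remaining size) and return its id."""
        self.queue.append((remaining, packet_id))
        if len(self.queue) > self.capacity:
            victim = max(self.queue)
            self.queue.remove(victim)
            return victim[1]
        return None

    def dequeue(self):
        """Send the highest-priority packet (smallest remaining size)."""
        if not self.queue:
            return None
        best = min(self.queue)
        self.queue.remove(best)
        return best[1]


# Time budget per minimum-size packet: 64 B * 8 bits / 10 Gbit/s, in ns.
SLOT_NS = 64 * 8 / 10

port = PFabricPort(capacity=2)
port.enqueue(100, "a")
port.enqueue(5, "b")
print(port.enqueue(50, "c"))  # buffer full: a packet is dropped
print(port.dequeue(), port.dequeue())
print(SLOT_NS)
```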

Traffic Scheduling - pFabric 2

-Start at line rate, no RTO estimation, reduce window on packet drop, increase same as TCP (SS, CA)
-Simple, yet near-optimal
-Requires new switches and minor host changes (clean-slate)
-Does not meet deadline requirements

Traffic Scheduling - Baraat

-Flow scheduling -> inefficient
-Priority scheduling -> does not meet deadlines
Idea: Task-aware scheduling
-Schedule tasks in Smart Priority Classes
-Switch maps flows to classes and handles heavy tasks
-Flows mapped to higher prio class get preference
-Flows with same priority class fair share
-TaskID is used as priority (FIFO)
-Heavy tasks are identified on the fly by byte count; upon exceeding the threshold, the task and the immediately subsequent task are assigned the same priority

Baraat features

Keep 3 counters: Total demand, total bytes reserved so far, number of flows in task
Also a single aggregate counter for each link to keep track of BW allocations
-Schedule tasks, not flows
-FIFO-LM algorithm
-No need to know flow size
-New transport protocol
-Modifies switches and hosts
-Does not meet deadlines
-Reduces task completion time for partition-aggregate workflows compared to fair share

Green DC

Minimize energy consumed by servers and cooling
-70-80% of total
-Consolidate workload to minimal set of servers and turn off unnecessary ones
-Consolidate workload based on locations to maximize efficiency of cooling
Minimize energy consumed by DC network (switches)
-10-20% of total
-Consolidate traffic to minimal set of paths and turn off switches/links

Green DC 2

Intra DC: dispatch loads to minimal servers and to cooler areas
Inter DC: dispatch loads to DC's with less energy cost or with renewable energy
JEC (Joint inter and intra)
-Considers variation of electricity prices and workload distribution on the efficiency of cooling systems
Random LB < Electricity InterDC < Cooling aware IntraDC < EIR+CIA < JEC

Elastic Tree

Power Knobs: vary link speed, disable links, disable switches, move workload
-Turn off unneeded link and switch
-Create energy proportional DC network
-Takes topology, routing restrictions, power models, traffic matrix
-Produces network subset and flow routes
-Formal: best quality, any topo, not scalable, input: traffic matrix
-Greedy: good quality, any topo, scalable, input: traffic matrix
-Topo-aware: ok quality, structured topo, best scalability, input: port counters
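The greedy option can be sketched as bin packing: each flow goes on the leftmost path that still has room, so traffic concentrates on few paths and the rest can be powered down. This sketch assumes equal-capacity, interchangeable paths and demands expressed as a fraction of link capacity; the function name and flow IDs are hypothetical.

```python
def greedy_place(flows, num_paths, path_capacity=1.0):
    """flows: list of (flow_id, demand). Returns (placement, active_paths),
    where placement maps flow_id to a path index and active_paths lists
    the paths that must stay powered on."""
    load = [0.0] * num_paths
    placement = {}
    for flow_id, demand in flows:
        for path in range(num_paths):
            if load[path] + demand <= path_capacity:
                load[path] += demand
                placement[flow_id] = path
                break
        else:
            raise ValueError(f"no path can carry flow {flow_id}")
    active = [path for path in range(num_paths) if load[path] > 0]
    return placement, active


placement, active = greedy_place([("f1", 0.4), ("f2", 0.3), ("f3", 0.5)], num_paths=4)
print(placement)
print(active)  # paths not listed here can be switched off
```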


Considers BW demand variation over time
Elastic Tree might overestimate demand, wasting power (it uses average or peak demand, while real demand is less)
-Use flow correlation (90th percentile data) to consolidate flows with low correlation using non-peak rate (low probability of peaking together)
-Minimize total power within a consolidation period based on traffic correlation and non-peak data rate
-Link rate adaptation for remaining links
Result: lowest power consumption and most savings, minor delay and drop degradation
