

posted Apr 30, 2014, 10:06 AM by Rick McGee   [ updated May 5, 2014, 10:56 AM ]
Why QoS? 
    Because we have more data/traffic on the wire than there is capacity
    Designs typically use an oversubscription model: 10:1, 40:1, etc.
    More data on the line than we have the capability to switch 

Understand your business requirements 
    HPC/Grid Computing 
        RDMA/RoCE, iWARP (RDMA over TCP/IP, usable across the WAN)

    HFT  High Frequency/Algorithmic Trading     
        Nanoseconds now matter in QoS (e.g. 500ns late is a problem)

    Storage/Virtualized Data Center     
        FCOE, iSCSI, NFS, CIFS, vMotion, Voice, and Video (Jumbo Frames)

    MSDC (Massively Scaled Data Center, e.g. Google and Amazon)
        Becoming more common 
        Hadoop/HDFS/TCP incast (highly oversubscribed)
        Data Center TCP using ECN (Explicit Congestion Notification)

Understanding your applications 

6500 Series QoS
    Was developed by the module group; you could have different 
    QoS commands for a 6148 vs a 6748, or even different generations 
    of line cards 
    QoS is off by default
    Nothing is trusted

Nexus QoS
    QoS is on by default
    Everything is trusted at first 
    Classify at the ingress 
    Queue mapping has to be done at Layer 2 (CoS) 
    Granular hardware ASIC queuing capabilities, but all the CLI now 
    gets standardized 
    Functions stay the same - CLI changes - but only once

Data Needs for QoS

    Voice:  10ms Tc (time interval value); small, not bursty, time sensitive 
    and drop sensitive; ~150ms one-way latency budget, jitter sensitive  
        CoS 5, DSCP EF (CoS 3 for signaling)
    Video:  33ms Tc; big, variable data rate, time sensitive, drop sensitive
        CoS 4, DSCP CS5, CS4, AF4x, AF3x (CoS 3 for signaling) 
    FCoE: consistent, giant frames, somewhat time sensitive, and drop
    intolerant (2112-byte frames)     
        T11 FC-BB-5 recommends CoS 3; no DSCP value since FCoE is not L3

New QoS Protocols 
    Data Center Bridging Exchange (DCBX)
        Uses LLDP with new TLV (Type-Length-Value) fields
        Standard progressed from CIN (Cisco, Intel, Nuova) then CEE
        (Converged Enhanced Ethernet), and finally standard IEEE DCBX
        Proper negotiation between Switch-to-Switch and Switch-to-CNA
        results in PFC, ETS, CoS values, and per-priority-pause frame support
        as well as vFC interface bring-up    
        "sh lldp dcbx int e1/1"

    802.1Qbb Priority Flow Control (PFC)
        Pause frames
        A pause frame basically says "my Rx buffer is full - hold frames in your Tx buffer"
        Historically the recommendation has been to use TCP's drop mechanism 
        rather than PAUSE
        TCP is resilient and can tolerate drops; SCSI drops are devastating
        and cause lost data and OS stops
        FC used something called buffer-to-buffer credits for losslessness 
        FC moving to Ethernet (FCoE) must deal with Ethernet's underlying 
        capabilities and shortcomings 
        With PFC, PAUSE frames are sent per CoS value (unlike classic 802.3x
        PAUSE, which stops the whole link)
        Provides a unique "service lane" for FCoE traffic with larger buffers
        Results in "lossless" or "no-drop" behavior for SCSI/FCoE
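
    The no-drop behavior above can be sketched in NX-OS MQC on an N5K-style
    switch. This is a minimal example, not a default config: the class/policy
    names and qos-group 2 are illustrative.

    ```
    ! Assumed names; traffic must already be marked into qos-group 2
    class-map type network-qos nq-nodrop
      match qos-group 2
    policy-map type network-qos pm-nodrop
      class type network-qos nq-nodrop
        pause no-drop          ! enable PFC (per-priority pause) for this class
        mtu 2158               ! FCoE-sized frames
      class type network-qos class-default
        mtu 1500
    system qos
      service-policy type network-qos pm-nodrop
    ```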
    802.1Qaz Enhanced Transmission Selection (ETS)
         Class-based queuing, or "QoS" within Cisco 
         Strict priority as defined in 802.1p
         Credit-based shaping as defined in 802.1Qav (think token bucket)   
            Set amount of BW on an interface in times of congestion 
        Traffic is based on classes, which are based on CoS priorities 
        Ability to place multiple CoS values in a class or queue, but no more
        than 8 classes (3 bits, 0-7)
        Industry standard means that now CNAs participate 
        Queuing from the server up to the switch
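
    An ETS-style bandwidth guarantee maps to a type queuing policy in NX-OS.
    A minimal sketch (names and percentages are examples, assuming storage
    traffic already classified into qos-group 2):

    ```
    class-map type queuing cq-storage
      match qos-group 2
    policy-map type queuing pm-ets
      class type queuing cq-storage
        bandwidth percent 40     ! guaranteed share during congestion
      class type queuing class-default
        bandwidth percent 60
    system qos
      service-policy type queuing output pm-ets
    ```

    The percentages only matter under congestion; an uncongested class can
    burst above its share.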

    802.1Qau Congestion Notification
           Related: Data Center TCP (DCTCP), which relies on ECN 
           Uses bits in the IP ToS byte (alongside the DiffServ field) to let a switch mark TCP 
           packets and indicate that interfaces are becoming congested 
           Uses RED-style detection to mark ECN rather than drop 
           Uses the two least-significant bits of the ToS byte 
                00 Non-ECN-Capable Transport
                10 ECN-Capable Transport, no congestion encountered
                11 ECN-Capable Transport, congestion encountered 

Order of QoS Operations in Nexus

Within a Nexus switch there are the port SoC ASICs and the EARL forwarding engine
    Port ASIC or SoC (Switch-on-Chip) 
    EARL is Cisco's "Enhanced Address Recognition Logic" 
        This is the fabric - supervisor decisions pushed down into the fabric 

Nexus 7000 performs different functions at different locations

    Ingress Port ASIC
            Performs Ingress Queuing and Scheduling 
            CoS-to-Q Mapping
            Bandwidth Allocation (DWRR)
            Buffer Allocation (Memory)
            Congestion Avoidance (WRED) 
            Set CoS Value 

Ingress to EARL

    CoS/DSCP Mutation 
    Classify on ACL/SMAC/DMAC/SA/DA/L4 ports/CoS/DSCP

    Mark DSCP/QoS Group/Discard Class 

    Police: 1-Rate/2-Color, 2-Rate/3-Color; Aggregate/Flow/Shared 
    Actions to Drop/Transmit/Remark/Markdown
    Set QoS Group, Set Discard Class 

    Remember the module architecture comes into play here:
    M-series modules share 4 ports per ASIC and F-series share 2 ports per ASIC 

    F-series modules primarily do ingress queueing (completely new module design)
        Smaller buffers (memory) on the egress 
        Larger buffers (memory) on the ingress
        Same on the Nexus 5Ks 
        This is the queueing architecture that Cisco is going with 
        Think Hadoop: thousands of requests to hundreds of servers

    M-series modules primarily do egress queueing (extension of Cat6K modules)
        Larger buffers (memory) on the egress 
        Smaller buffers (memory) on the ingress

Use Memory to store frames

7Ks take the forwarding information from the supervisor and write it to the individual line cards. This allows for greater throughput and granularity. Distributed 
forwarding state is sent to the SoC (Switch-on-Chip ASIC) on each line card

Egress to EARL 

    ACL/SMAC/DMAC/SA/DA/L4 ports/CoS/DSCP 
    QoS Group, Discard Class

    DSCP/QoS Group/Discard Class

    1-Rate/2-Color, 2-Rate/3-Color, Aggregate/Flow/Shared 
    Actions to Drop/Transmit/ReMark/Markdown

CoS/DSCP Mutation

Egress Port ASIC 

Performs Egress Queuing and Scheduling
    CoS-to-Q Mapping 
    Bandwidth Allocation 
    Buffer Allocation 
    Congestion Avoidance (WRED and Tail-Drop based on thresholds) 
    Priority Queuing 
    SRR (Shaped Round Robin) will disable the Priority Queue

Switch and Module Buffering 

What is buffering?
    Storing frames in memory until wire/switch is ready to Tx/Rx them

Ingress Buffering vs Egress Buffering 
    Historically, we’ve always been able to do both, but emphasis has been        
    placed on egress buffering 

    Now we want incredibly powerful and fast switches, but we don’t want to 
    drive up cost (any more than is necessary) 

    Egress buffering is very costly

    Ingress buffering drives down cost per port (less SRAM needed at ingress 
    to create aggregate buffers), less power, less heat 

    Statistically speaking, ingress queuing provides the same advantages as 
    a shared memory buffer architecture

Problem introduced by ingress buffering: head-of-line blocking
    If I am buffering (necessarily) traffic destined for a congested egress port, 
    but a frame arrives behind my buffer that is destined for a non-congested 
    egress port, that frame has to wait until the buffer is emptied before it can 
    be forwarded across the fabric to its egress port - it is head-of-line blocked 

    To mitigate this, we utilize something called the Virtual Output Queue (VoQ) 

    8 virtual output queues for every egress port for unicast, and 8 more per 
    egress port for multicast - per ingress port 
        If you had 256 ports in a system, this means 2048 unicast VoQs per    
        ingress port (these are pointer lists, not physical buffers)

Ingress Buffering Devices
    Nexus F-series line cards
        L2 only and built for fabric performance
    Nexus 5000/5500 

Egress Buffering Devices 
    Nexus 7000 M-series line cards 
        L3 and rich feature set
        Basically next-gen Cat6K line cards
        Queuing structure is largely based on the Cat6K line card model, 
        e.g. 1p3q2t, 1p7q7t, etc.  
        (p = priority queue, q = standard queue, t = tail-drop thresholds)

CoS, DSCP, Trust and Trust Boundaries

Catalyst IOS – Not Trusted by default

Nexus NX-OS – Trusted by default 

Trust boundary is also changing
    Now we assume that either the CNAs, Phone, TP Video Endpoints, or 
    other switches are appropriately marking 
    We can still remark if need be

Also, what do we trust? 
    In NX-OS, we primarily use CoS 
    Why? We now have non-IP based traffic 
            FCoE, RoCE 

Disadvantage is now we have less granularity with service lanes 

Mappings and Mutations are certainly possible

Defaults for 7K: 

    Bridged Unicast = CoS trusted, DSCP preserved 
    Routed Unicast = CoS copied from the 3 most-significant bits of the ToS byte, 
    e.g. DSCP EF 101110 > CoS 101 (5) 
    Routed Multicast = CoS copied from 3 MSB of ToS 

    Bridged Multicast with L3 state for the group = CoS copied from 3 MSB of ToS 

    Bridged Multicast with no L3 state for the group = CoS trusted, DSCP preserved     

CoS/DSCP to Queue & Threshold Mapping

Today in Nexus you can:
    Perform CoS-to-Q Mapping
    Perform DSCP-to-Threshold Mapping

Today in Nexus you cannot: 
    Perform DSCP-to-Q Mapping (like you can in Cat IOS)
QoS Configuration and Comparisons

Catalyst IOS uses ‘MLS’ nomenclature along with some MQC

Nexus NX-OS uses 'MQC' nomenclature for everything

    Still the 3-step model:
    Class-Maps (which go into)
    Policy-Maps (which are applied with)     
    Service-Policies (applied to interfaces)
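
The 3-step model looks like this in NX-OS (a minimal classification sketch;
the class/policy names, CoS value, and interface are examples):

    ```
    class-map type qos match-any cm-voice
      match cos 5
    policy-map type qos pm-classify
      class cm-voice
        set qos-group 4          ! internal tag used by queuing/network-qos policies
    interface ethernet 1/1
      service-policy type qos input pm-classify
    ```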

But now 3 types of Class-Maps/Policy-Maps
(actually 4 when you count CoPP)
    QoS = configures classification/marking
    Queuing = configures port-based hardware queuing 
        Analogous to the Catalyst IOS commands:
        mls qos srr-queue
        mls qos queue-set output buffers
        mls qos queue-set output threshold

Network QoS = configures system wide fabric queuing 
    Queuing inside the fabric

QoS Group (new!) 
    Allows for simple mapping: 
    e.g. map one or more classes of traffic to a 
    QoS-Group, then apply the QoS-Group to a queue

When configuring "queuing" (or CoPP) type, you have to
configure in the default VDC which could momentarily
disrupt traffic

Type QoS
    Defines traffic classification 
Type Queuing 
    Here is where we configure queuing that affects the whole system; 
    cards differ in their ASIC capabilities 
        Strict Priority Queue 
"show interface eth x/y capabilities"
Look at the difference between the M line cards and F line cards for the Tx and Rx queuing 

You will notice a pause when applying a CoS value to a queuing class
    This is the switch writing that instruction into the 
    hardware SoC ASIC

Sanity check is performed when applied to an interface

Queuing Attributes in a Policy-Map
    Priority (level) - defines the queue as a priority queue 
    Bandwidth - defines WRR weights for each queue
    Shape - defines SRR weights for each queue 
        enabling shaping disables the PQ for that port
    Queue-Limit - defines queue size/depth and 
    tail-drop thresholds 
    Random-Detect - defines WRED thresholds per queue
        Tail-Drop and WRED are mutually exclusive on a 
        per-queue basis 
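
A sketch of a type queuing policy combining a strict-priority queue with
DWRR weights (class names and qos-group numbers are assumptions, and the
classified traffic must already carry those qos-groups):

    ```
    class-map type queuing cq-realtime
      match qos-group 5
    class-map type queuing cq-video
      match qos-group 4
    policy-map type queuing pm-out
      class type queuing cq-realtime
        priority                  ! strict-priority queue, serviced first
      class type queuing cq-video
        bandwidth percent 30      ! DWRR weight during congestion
      class type queuing class-default
        bandwidth percent 70
    ```

    Queue-limit and random-detect attach in the same place, per class.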

Type Network QoS 
    System class characteristics
        MTU (Layer 2) (2112 FCoE, 1500 Ethernet, 9000+ iSCSI)
        Buffer Size 
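
    Per-class MTU is set here; for example, the common N5K jumbo-frame
    config (a sketch, with an illustrative policy name):

    ```
    policy-map type network-qos pm-jumbo
      class type network-qos class-default
        mtu 9216                 ! jumbo frames system-wide for the default class
    system qos
      service-policy type network-qos pm-jumbo
    ```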

Three attach points for policy types
    Think of this as how the frame forwards across the switch:
        Ingress Port, Crossbar Fabric, Egress Port

    Ingress Interface 
        QoS Type
        Queuing Type

    Unified Crossbar Fabric
        QoS Type
        Queuing Type 
        Network-QoS Type
    Egress Interface
        Queuing Type 

9 Step Configuration Process
  1. Define Type QoS Class-Map
  2. Define Type QoS Policy-Map
  3. Apply Type QoS Policy-Map
    1. Ingress Port
  4. Define Type Network-QoS Class-Map
  5. Define Type Network-QoS Policy-Map
  6. Apply Type Network-QoS Policy-Map
    1. Apply Globally 
  7. Define Type Queuing Class-Map
  8. Define Type Queuing Policy-Map
  9. Apply Type Queuing Policy-Map
    1. Egress or Ingress
      1. Optionally define and apply separate input/output queuing policies
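
The nine steps above can be sketched end-to-end on an N5K-style switch.
All names, the qos-group number, CoS value, and interface are illustrative:

    ```
    ! Steps 1-3: classification (type qos), applied at the ingress port
    class-map type qos match-any cm-iscsi
      match cos 4
    policy-map type qos pm-qos
      class cm-iscsi
        set qos-group 2
    interface ethernet 1/5
      service-policy type qos input pm-qos

    ! Steps 4-6: system class characteristics (type network-qos), applied globally
    class-map type network-qos nq-iscsi
      match qos-group 2
    policy-map type network-qos pm-nq
      class type network-qos nq-iscsi
        mtu 9216
    system qos
      service-policy type network-qos pm-nq

    ! Steps 7-9: scheduling (type queuing), applied system-wide or per interface
    class-map type queuing cq-iscsi
      match qos-group 2
    policy-map type queuing pm-queuing
      class type queuing cq-iscsi
        bandwidth percent 50
      class type queuing class-default
        bandwidth percent 50
    system qos
      service-policy type queuing output pm-queuing
    ```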
Hardware Specifics
    N5K has system default QoS-group classes
        Two are reserved for control traffic (CoS 6 & 7)
        You have 4 left to play with 
    N55K: no FCoE class defined as a default QoS-group
        This allows memory/buffer allocation to other classes if you are not using FCoE
        If you are using FCoE you must allocate a 
        QoS-group for it 
    N55K L3 daughter card (DON'T USE)
        When traffic passes up through the L3 daughter card, the
        CoS is lost 
        Must reclassify/remark, or voice, video, and signaling traffic could end up in the default class 
        If you always set the CoS value - even when just calling a class-map from a policy-map 
        where you had classified based on CoS - you can steer clear of this issue

Nexus 5K/7K FCoE Treatment

FCoE is CoS 3, but Nexus 5K matches on both CoS 3 and EtherType for FIP and FCoE 
    FCoE = 0x8906 
    FIP = 0x8914 

There is a hardware override to ensure they make it to QoS-Group 1 

You can’t get FCoE out of QoS-Group 1 

You can misconfigure and put other things (like signaling) in qos-group 1 

You can tune FCoE ONLY for Distance 
    300m is default 
    mtu 2158 
    pause no-drop 

3km example 
    mtu 2158 
    pause no-drop buffer-size 152000 pause-threshold 103360 resume-threshold     

Nexus 2K
    L2-only line card - not a switch
    Matches on CoS values only 

Nexus 7K 
    FCoE is only supported on F series line cards
    Existing, non-changeable policy templates for FCoE

Catalyst QoS vs. Nexus QoS
    Enable and Trust 
        Catalyst: manually turn on and trust
        Nexus: on and trusted by default 

CoS exists only in the 802.1Q trunk header - a 3-bit value: 
0 Best Effort
1 Scavenger    
3 FCoE (also voice and video signaling)
4 Video 
5 Voice 
6 Reserved - Layer 3 network control (IGP, intra-network control) 
7 Reserved - Layer 2 network control over the LAN (Spanning Tree) 

12 Class Model 

DSCP is a Layer 3 (ToS byte) 6-bit value 
EF = Voice 

Routers primarily do egress QoS
    Done in software
    Fewer queues  

Switches
    Typically done in hardware, with hardware queues 

Remember that QoS does not come into effect unless there is congestion 
on the network 