Milan P. Stanic
mps@arvanta.net
About this document
This document should be (comprehensive) description of tc command utility from iproute2 package.
Primary motivation for this work is my wish to learn about QoS in Linux (and about QoS in general). If you find errors or big mistakes in this document that is because I don’t yet understand QoS. I hope it will improve over time.
It is based on kernel 2.4 and iproute2 version 000305
It is far from to be complete and/or without errors. I am writing it for purpose of my learning only, and I am not sure will (and when) it be finished. But, I am working on it (especially when I have time :).
All of the text is taken from different documents from the net, from linux-diffserv, linux-net mailing lists and from the Linux kernel source files.
Disclaimer: Use at your own risk. I am not responsible if you make loss or damage in any sense by using information from this document.
When the kernel has several packets to send out over a network device, it has to decide which ones to send first, which ones to delay, and which ones to drop. This is the job of the packet scheduler, and several different algorithms for how to do this "fairly" have been proposed.
With Linux QoS subsystem (which is constructed of the building blocks of the kernel and user space tools like ip and tc command line utilities) it is possible to make very flexible traffic control.
tc (traffic controller) is the user level program which can be used to create and associate queues with the network devices. It is used to set up various kinds of queues and associate classes with each of those queues. It is also used to set up filters by which the packets is classified.
Usage: tc [ OPTIONS ] OBJECT { COMMAND | help }
where OBJECT := { qdisc | class | filter }
OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] }
Where it’s expecting a number for BPS; it understands some suffixes: kbps (*1024), mbps (*1024*1024), kbit (*1024/8), and mbit (*1024*1024/8). If I’m reading the code correctly; "BPS" means Bytes Per Second; if you give a number without a suffix it assumes you want BITS per second (it divides the number you give it by 8). It also understands bps as a suffix.
Where it’s expecting a time value, it seems it understands suffixes of s, sec, and secs for seconds, ms, msec, and msecs for milliseconds, and us, usec, and usecs for microseconds.
Where it wants a size parameter, it assumes non-suffixed numbers to be specified in bytes. It also understands suffixes of k and kb to mean kilobytes (*1024), m and mb to mean megabytes (*1024*1024), kbit to mean kilobit (*1024/8), and mbit to mean megabits (*1024*1024/8).
1Mbit == 128Kbps or 1 megabit is 128 kilobytes per second
bps = bits/sec (uhmm...)
kbps = bytes/sec * 1024
mbps = bytes/sec * 1024 * 1024
kbit = bits/sec * 1024
mbit = bits/sec * 1024 * 1024
In the examples Xbit and Xbps are interchangeably, when tc treats them very differently.
note: this is very confusing
note: make sure whenever you are dealing with memory related things like queue size, buffer size that their units are in bytes and when it is bandwidth and rate related parameters the units are in bits.
Each network device has a queuing discipline associated with it, which controls how packets enqueued on that device are treated. It can be viewed with ip command:
root@dl:~# ip link show
1: lo: <LOOPBACK,UP> mtu 3924 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 qdisc pfifo_fast qlen 100
link/ether 52:54:00:de:bf:19 brd ff:ff:ff:ff:ff:ff
3: tap0: <BROADCAST,MULTICAST,NOARP> mtu 1500 qdisc noop
link/ether fe:fd:00:00:00:00 brd ff:ff:ff:ff:ff:ff
Generally, queueing discipline ("qdisc") is a black box, which is able to enqueue packets and to dequeue them (when device is ready to send something) in order and at times determined by algorithm hidden in it.
By default queueing discipline is pfifo_fast which cannot be manipulated with tc. It is assigned to device when the device is started or when the other qdisc’s deleted from the device. That qdiscs have 3 bands which are processed from band 0 to band 2, and when there is a packet in queue in higher priority band (lower number)
Qdisc’s are:
qdisc’s are divided to two categories:
- "queues", which have no internal structure visible from outside.
- "schedulers", which split all the packets to "traffic classes", using "packet classifiers". ? is qdisc’s which can split packets to “traffic classes”
In turn, classes may have child qdiscs (as rule, queues) attached to them etc. etc. etc.
note: Certain qdiscs can have children and they are classfull, and others are leafs (describe it!)
classfull qdiscs: CBQ, ATM, DSMARK, CSZ and the ( p-FIFO ???? or prio )
leaf qdiscs: TBF, FIFO, SFQ, RED, GRED, TEQL
note: classfull qdiscs can be also leafs
The syntax for managing queuing discipline is:
Usage: tc qdisc [ add | del | replace | change | get ] dev STRING
[ handle QHANDLE ] [ root | ingress | parent CLASSID ]
[ estimator INTERVAL TIME_CONSTANT ]
[ [ QDISC_KIND ] [ help | OPTIONS ] ]
tc qdisc show [ dev STRING ] [ingress]
Where:
QDISC_KIND := { [p|b]fifo | tbf | prio | cbq | red | etc. }
OPTIONS := ... try tc qdisc add <desired QDISC_KIND> help
Usage: ... estimator INTERVAL TIME-CONST
INTERVAL is interval between measurements
TIME-CONST is averaging time constant
Example: ... est 1sec 8sec
The time constant for the estimator is a critical parameter; this time constant determines the interval over which the router attempts to enforce the link-sharing guidelines.
[1]Unfortunately, rate estimation is not a very easy task. F.e. I did not find a simple way to estimate the current peak rate and even failed to formulate the problem. So I preferred not to built an estimator into the scheduler, but run this task separately. Ideally, it should be kernel thread(s), but for now it runs from timers, which puts apparent top bounds on the number of rated flows, has minimal overhead on small, but is enough to handle controlled load service, sets of aggregates.
We measure rate over A=(1<<interval) seconds and evaluate EWMA:
avrate = avrate*(1-W) + rate*W
where W is chosen as negative power of 2: W = 2^(-ewma_log)
The resulting time constant is:
T = A/(-ln(1-W))
NOTES.
* The stored value for avbps is scaled by 2^5, so that maximal rate is ~1Gbit, avpps is scaled by 2^10.
* Minimal interval is HZ/4=250msec (it is the greatest common divisor for HZ=100 and HZ=1024 8)), maximal
interval is (HZ/4)*2^EST_MAX_INTERVAL = 8sec. Shorter intervals are too expensive, longer ones can be
implemented at user level painlessly.
You *have* to declare first, the CBQ qdisc, then the CBQ "parent" class, and then (optionally, I think), the CBQ "leaf " classes.
I’m not 100% sure of what I’ve just said. It’s just how I think it works.
to stop QoS completely use the following for eth0:
tc qdisc del dev eth0 root
In CBQ, every class has variables idle and avgidle and parameter maxidle used in computing the limit status for the class, and the parameter offtime used in determining how long to restrict throughput for overlimit classes.
Usage: ... cbq bandwidth BPS avpkt BYTES [ mpu BYTES ]
[ cell BYTES ] [ ewma LOG ]
CBQ class is automatically generated when a CBQ qdisc created. ??
note: rtab is rate table?
note: mariano: should first declare a cbq "parent" class (which uses all the bandwidth) and then declare the two "leaf" classes.
CBQ is complex qdisc and to be fully understood it is good to read Sally Floyds and Van Jacobsons paper.
Simple priority queue
Usage: ... prio bands NUMBER priomap P1 P2...
Where:
So if you define more than 3 bands, make sure to re-define the priomap
In prio as long as there is data to be dequeued in the higher priority queue, prio will favor the higher queue.
Simple First-In-First-Out queue which provides basic store-and-forward capability. FIFO is default qdisc on most real interfaces.
Usage: ... [p|b]fifo [ limit NUMBER ]
"b" stands for bytes, while "p" stands for packets.
This means that the maximum length of the fifo queue is measured in bytes in the first case and in number of packets in the second case.
small note: The fifo queue can be set to 0, but this still allows a single packet to be enqueued.
Token Bucket Filter is qdisc which have tokens and works like that if there is token in the bucket it possible to enqueue packet and take token. Kernel puts token in the bucket in some intervals
Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS
[ mtu BYTES[/BYTES] ] [ peakrate KBPS ] [ latency TIME ]
Jamal: TBF is influenced by quiet a few parameters; peakrate, rate, MTU, burst size etc. It will do what you ask it to ;-> And at times it will let bursts flood the gate i.e you might end up sending at wire speed. What are your parameters like?
Random Early Detection discard packet even when there is space in the queue. As the queue length increases drop probability also increases. This approach enables sender to be notified that there is likelihood of congestion before it is actually appeared.
Usage: ... red limit BYTES min BYTES max BYTES avpkt BYTES burst PACKETS
probability PROBABILITY bandwidth KBPS [ ecn ]
Always make sure that min < max < limit
Generalized RED is used in DiffServ implementation and it has virtual queue (VQ) within physical queue. Currently, the number of virtual queues is limited to 16.
GRED is configured in two steps. First the generic parameters are configured to select the number of virtual queues DPs and whether to turn on the RIO-like buffer sharing scheme. Also at this point, a default virtual queue is selected.
The second step is used to set parameters for individual virtual queues.
Usage: ... gred DP drop-probability limit BYTES min BYTES max BYTES
avpkt BYTES burst PACKETS probability PROBABILITY bandwidth KBPS
[prio value]
OR ... gred setup DPs <num of DPs> default <default DP> [grio]
Stochastic Fair Queue as it’s name implies. It processes queues in round-robin order.
Usage: ... sfq [ perturb SECS ] [ quantum BYTES ]
Used to re-direct flows from the default path to ATM VCs. Each flow can have its own ATM VC, but multiple flows can also share the same VC.
Werner: ATM qdisc is different. It takes packets from some traffic stream (no matter what interface or such), and sends it over specific (and typically dedicated) ATM connections.
Werner: Then there’s the case of qdiscs that don’t really queue data, e.g. sch_dsmark or sch_atm.
Diff-serv marker isn’t really a queuing discipline. It marks packet according to specified rule. It is configured as qdisc first and after that as class (if it is used for classification)
Usage: dsmark indices INDICES [ default_index DEFAULT_INDEX ] [ set_tc_index ]
When invoked to create class it’s parameter are:
Usage: ... dsmark [ mask MASK ] [ value VALUE ]
Outgoing DSCP = (Incoming DSCP AND mask) OR value
Where Incoming DSCP is the DSCP value of the original incoming packet, and Outgoing DSCP is the DSCP that the packet will be assigned as it leaves the queue.
if present, the ingress qdisc is invoked for each packet arriving on the respective interface
ingress is a qdisc that only classifies but doesn’t queue
the usual classifiers, classifier combinations, and policing functions can be used
the classification result is stored in skb->tc_index, a la sch_dsmark
if the classification returns a "drop" result (TC_POLICE_SHOT), the packet is discarded. Otherwise, it is accepted.
Since there is no queue for implicit rate limiting (via PRIO, TBF, CBQ, etc.), rate limiting must be done explicitly via policing. This is still done exactly like policing on egress.
mps: should I explain what is class and their intimacy with qdisc? Yes? Classes are main component of the QoS. (stupid explanation)
The syntax for creating a class is shown below:
tc class [ add | del | change | get ] dev STRING
[ classid CLASSID ] [ root | parent CLASSID ]
[ [ QDISC_KIND ] [ help | OPTIONS ] ]
tc class show [ dev STRING ] [ root | parent CLASSID ]
Where: QDISC_KIND := { prio | cbq | etc. }
OPTIONS := ... try tc class add <desired QDISC_KIND> help
The QDISC_KIND can be one of the queuing disciplines that support classes. The interpretation of the fields:
This algorithm classifies the waiting packets into a tree-like hierarchy of classes; the leaves of this tree are in turn scheduled by separate algorithms (called "disciplines" in this context).
Usage: ... cbq bandwidth BPS rate BPS maxburst PKTS [ avpkt BYTES ]
[ minburst PKTS ] [ bounded ] [ isolated ]
[ allot BYTES ] [ mpu BYTES ] [ weight RATE ]
[ prio NUMBER ] [ cell BYTES ] [ ewma LOG ]
[ estimator INTERVAL TIME_CONSTANT ]
[ split CLASSID ] [ defmap MASK/CHANGE ]
A note about CBQ class setup:
cbq class has fifo qdisc attached by default
You *have* to declare first, the CBQ qdisc, then the CBQ "parent" class, and then (optionally, I think), the CBQ "leaf " classes. I’m not 100% sure of what I’ve just said. It’s just how I think it works.
Filters are used to classify (map) packets based on certain properties of the packet e.g. TOS byte in the IP header, IP addresses, port numbers etc to certain classes. Queuing disciplines uses filters to assign incoming packets to one of its classes. Filters can be maintained per class or per queuing disciplines based on the design of the queuing discipline. Filters are maintained in filter lists. Filter lists are ordered by priority, in ascending order. Also, the entries are keyed by the protocol for which they apply, e.g., IP, UDP etc. Filters for the same protocol on the same filter list must have different priority values.
Filter vary in the scope
Filters have meters associated with them (TB+rate estimator)
Usage: tc filter [ add | del | change | get ] dev STRING
[ pref PRIO ] [ protocol PROTO ]
[ estimator INTERVAL TIME_CONSTANT ]
[ root | classid CLASSID ] [ handle FILTERID ]
[ [ FILTER_TYPE ] [ help | OPTIONS ] ]
or
tc filter show [ dev STRING ] [ root | parent CLASSID ]
Where:
FILTER_TYPE := { rsvp | u32 | fw | route | etc. }
FILTERID := ... format depends on classifier, see there
OPTIONS := ... try tc filter add <desired FILTER_KIND> help
The interpretation of the fields:
Use RSVP protocol for classification
Usage: ... rsvp ipproto PROTOCOL session DST[/PORT | GPI ]
[ sender SRC[/PORT | GPI ]
[ classid CLASSID ] [ police POLICE_SPEC ]
[ tunnelid ID ] [ tunnel ID skip NUMBER ]
Where:
GPI := { flowlabel NUMBER | spi/ah SPI | spi/esp SPI |
u{8|16|32} NUMBER mask MASK at OFFSET}
POLICE_SPEC := ... look at TBF
FILTERID := X:Y
Comparing to general packet classification problem, RSVP needs only several relatively simple rules:
(dst, protocol) are always specified, so that we are able to hash them.
rsvp filter is used to distinguish an application session (dst port dst ip address). In an DiffServ edge router it can be used to mark packets of specific applications in order to be classified in the appropriate PHB.
Alexey: IMPLEMENTATION.
We use a two level hash table: The top level is keyed by destination address and protocol ID, every bucket contains a list of "rsvp sessions", identified by destination address, protocol and DPI(="Destination Port ID"): triple (key, mask, offset).
Every bucket has a smaller hash table keyed by source address (cf. RSVP flowspec) and one wildcard entry for wildcard reservations. Every bucket is again a list of "RSVP flows", selected by source address and SPI(="Source Port ID" here rather than "security parameter index"): triple (key, mask, offset).
All the packets with IPv6 extension headers (but AH and ESP) and all fragmented packets go to the best-effort traffic class.
Two "port id"’s seems to be redundant, rfc2207 requires only one "Generalized Port Identifier". So that for classic ah, esp (and udp,tcp) both *pi should coincide or one of them should be wildcard.
At first sight, this redundancy is just a waste of CPU resources. But DPI and SPI add the possibility to assign different priorities to GPIs. Look also at note 4 about tunnels below.
One complication is the case of tunneled packets. We implement it as following: if the first lookup matches a special session with "tunnelhdr" value not zero, flowid doesn’t contain the true flow ID, but the tunnel ID (1...255). In this case, we pull tunnelhdr bytes and restart lookup with tunnel ID added to the list of keys. Simple and stupid 8)8) It’s enough for PIMREG and IPIP.
Two GPIs make it possible to parse even GRE packets. F.e. DPI can select ETH_P_IP (and necessary flags to make tunnelhdr correct) in GRE protocol field and SPI matches GRE key. Is it not nice? 8)8)
Well, as result, despite its simplicity, we get a pretty powerful classification engine.
Panagiotis Stathopoulos: Well an rsvp filter is used to distinguish an application session (dst port dst ip address). In an DiffServ egde router it can be used to mark packets of specific applications in order to be classified in the appropriate PHB.
note: I have to read more about RSVP
Anything in the header can be used for classification
The U32 filter is the most advanced filter available in the current implementation. It entirely based on hashing tables, which make it robust when there are many filter rules.
Usage: ... u32 [ match SELECTOR ... ] [ link HTID ] [ classid CLASSID ]
[ police POLICE_SPEC ] [ offset OFFSET_SPEC ]
[ ht HTID ] [ hashkey HASHKEY_SPEC ]
[ sample SAMPLE ]
or u32 divisor DIVISOR
Where: SELECTOR := SAMPLE SAMPLE ...
SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} } SAMPLE_ARGS FILTERID := X:Y:Z
The syntax here is match ip <item> <value> <mask>
So match ip protocol 6 0xff matches protocol 6, TCP. (See /etc/protocols) match ip dport 0x17 0xffff is TELNET
(/etc/services). Note that the number is hexadecimal, not decimal.
note: (mps) ht - hash table HTID Hash Table ID is fh - filter handle in filter show
The filters are packed to hash tables of key nodes with a set of 32bit key/mask pairs at every node. Nodes reference next level hash tables etc.
It seems that it represents the best middle point between speed and manageability both by human and by machine.
It is especially useful for link sharing combined with QoS; pure RSVP doesn’t need such a general approach and can use much simpler (and faster) schemes.
Classifier mapping ipchains’ fwmark to traffic class
Usage: ... fw [ classid CLASSID ] [ police POLICE_SPEC ]
POLICE_SPEC := ... look at TBF
CLASSID := X:Y
Use routing table decisions for classification
Usage: ... route [ from REALM | fromif TAG ] [ to REALM ]
[ flowid CLASSID ] [ police POLICE_SPEC ]
POLICE_SPEC := ... look at TBF
CLASSID := X:Y
Use tc_index internal tag in skb to select classes.
Usage: ... tcindex [ hash SIZE ] [ mask MASK ] [ shift SHIFT ] [ pass_on | fall_through ] [ classid CLASSID ] [ police POLICE_SPEC ]
note: key = (skb->tc_index >> shift) & mask
The purpose of policing is to ensure that traffic does not exceed certain bounds. For simplicity, we will assume a broad definition of policing and consider it to comprise all kinds of traffic control actions that depend in some way on the traffic volume.
We consider four types of policing mechanisms:
Usage: ... police rate BPS burst BYTES[/BYTES] [ mtu BYTES[/BYTES] ]
[ peakrate BPS ] [ avrate BPS ] [ ACTION ]
Where: ACTION := reclassify | drop | continue
note: "drop" is only recognized by the following qdiscs: atm, cbq, dsmark, and (ingress - really?). In particular, prio ignores it.
[1] A. N. Kuznetsov, docs from iproute2
[2] Werner Almesberger, Linux Network Traffic Control – Implementation Overview
[3] Jamal Hadi Salim, IP Quality of Service on Linux http://????
Saravanan Radhakrishnan, Linux - Advanced Networking Overview http://qos.ittc.ukans.edu/howto/howto.html
[4] Almesberger, Jamal Hadi Salim, Alexey Kuznetsov - Differentiated Services on Linux
[5] linux-diffserv mailing list linux-diffserv@lrc.di.epfl.ch
[6] Sally Floyd, Van Jacobson - Link-sharing and Resource Management Models for Packet Networks
[7] Sally Floyd, Van Jacobson - Random Early Detection Gateways for Congestion Avoidance
[8] Related Cisco documents from http://www.cisco.com/
[9] Lixia Zhang, Steve Deering, Deborah Estrin, Scott Shenker, Daniel Zapalla - RSVP: A New Resource ReSerVation Protocol
note: flowid is sometimes class handle sometimes something else
mariano - good setup for me: If you remove the router and then the modem line becomes ppp0 (instead of eth0), you should declare that ppp0 has "bandwidth 30K". Then, the classes should use "bandwidth 30K rate 20K" and "bandwidth 30K rate 10K"