Dear reader, I'm not updating these pages anymore. If you have tc or ip related questions, you can post them on the LARTC mailing list.

From help-file :
CSZ packet scheduler
CONFIG_NET_SCH_CSZ
Say Y here if you want to use the Clark-Shenker-Zhang (CSZ) packet
scheduling algorithm for some of your network devices. At the
moment, this is the only algorithm that can guarantee service for
real-time applications (see the top of net/sched/sch_csz.c for
details and references about the algorithm).

Note: this scheduler is currently broken.

This code is also available as a module called sch_csz.o ( = code
which can be inserted in and removed from the running kernel
whenever you want). If you want to compile it as a module, say M


From source-files :
    Clark-Shenker-Zhang algorithm.
=======================================

SOURCE.

David D. Clark, Scott Shenker and Lixia Zhang
"Supporting Real-Time Applications in an Integrated Services Packet
Network: Architecture and Mechanism".

CBQ presents a flexible universal algorithm for packet scheduling,
but it has pretty poor delay characteristics.
apparently contradict minimization of network delay and jitter.
Moreover, correct handling of predictive flows seems to be
impossible in CBQ.

CSZ presents a more precise but less flexible and less efficient
approach. As I understand it, the main idea is to create
WFQ flows for each guaranteed service and to allocate
the rest of bandwith to dummy flow-0. Flow-0 comprises
the predictive services and the best effort traffic;
it is handled by a priority scheduler with the highest
priority band allocated for predictive services, and the rest ---
to the best effort packets.

Note that in CSZ flows are NOT limited to their bandwidth.  It
is supposed that the flow passed admission control at the edge
of the QoS network and it doesn't need further shaping. Any
attempt to improve the flow or to shape it to a token bucket
at intermediate hops will introduce undesired delays and raise
jitter.

At the moment CSZ is the only scheduler that provides
true guaranteed service. Another schemes (including CBQ)
do not provide guaranteed delay and randomize jitter.
There is a proof (Sally Floyd), that delay
can be estimated by a IntServ compliant formula.
This result is true formally, but it is wrong in principle.
It takes into account only round-robin delays,
ignoring delays introduced by link sharing i.e. overlimiting.
Note that temporary overlimits are inevitable because
real links are not ideal, and the real algorithm must take this
into account.

ALGORITHM.

--- Notations.

$B$ is link bandwidth (bits/sec).

$I$ is set of all flows, including flow $0$.
Every flow $a \in I$ has associated bandwidth slice $r_a < 1$ and
$\sum_{a \in I} r_a = 1$.

--- Flow model.

Let $m_a$ is the number of backlogged bits in flow $a$.
The flow is {\em active}, if $m_a > 0$.
This number is a discontinuous function of time;
when a packet $i$ arrives:
$m_a(t_i+0) - m_a(t_i-0) = L^i,$
where $L^i$ is the length of the arrived packet.
The flow queue is drained continuously until $m_a == 0$:
${d m_a \over dt} = - { B r_a \over \sum_{b \in A} r_b}.$
I.e. flow rates are their allocated rates proportionally
scaled to take all available link bandwidth. Apparently,
it is not the only possible policy. F.e. CBQ classes
without borrowing would be modelled by:
${d m_a \over dt} = - B r_a .$
More complicated hierarchical bandwidth allocation
policies are possible, but unfortunately, the basic
flow equations have a simple solution only for proportional
scaling.

--- Departure times.

We calculate the time until the last bit of packet is sent:
$E_a^i(t) = { m_a(t_i) - \delta_a(t) \over r_a },$
where $\delta_a(t)$ is number of bits drained since $t_i$.
We have to evaluate $E_a^i$ for all queued packets,
then find the packet with minimal $E_a^i$ and send it.

This sounds good, but direct implementation of the algorithm
is absolutely infeasible. Luckily, if flow rates
are scaled proportionally, the equations have a simple solution.

The differential equation for $E_a^i$ is
${d E_a^i (t) \over dt } = - { d \delta_a(t) \over dt} { 1 \over r_a} = { B \over \sum_{b \in A} r_b}$
with initial condition
$E_a^i (t_i) = { m_a(t_i) \over r_a } .$

Let's introduce an auxiliary function $R(t)$:

--- Round number.

Consider the following model: we rotate over active flows,
sending $r_a B$ bits from every flow, so that we send
$B \sum_{a \in A} r_a$ bits per round, that takes
$\sum_{a \in A} r_a$ seconds.

Hence, $R(t)$ (round number) is a monotonically increasing
linear function of time when $A$ is not changed
${ d R(t) \over dt } = { 1 \over \sum_{a \in A} r_a }$
and it is continuous when $A$ changes.

The central observation is that the quantity
$F_a^i = R(t) + E_a^i(t)/B$ does not depend on time at all!
$R(t)$ does not depend on flow, so that $F_a^i$ can be
calculated only once on packet arrival, and we need not
recalculate $E$ numbers and resorting queues.
The number $F_a^i$ is called finish number of the packet.
It is just the value of $R(t)$ when the last bit of packet
is sent out.

Maximal finish number on flow is called finish number of flow
and minimal one is "start number of flow".
Apparently, flow is active if and only if $F_a \leq R$.

When a packet of length $L_i$ bit arrives to flow $a$ at time $t_i$,
we calculate $F_a^i$ as:

If flow was inactive ($F_a < R$):
$F_a^i = R(t) + {L_i \over B r_a}$
otherwise
$F_a^i = F_a + {L_i \over B r_a}$

These equations complete the algorithm specification.

It looks pretty hairy, but there is a simple
procedure for solving these equations.
See procedure csz_update(), that is a generalization of
the algorithm from S. Keshav's thesis Chapter 3
"Efficient Implementation of Fair Queeing".

NOTES.

* We implement only the simplest variant of CSZ,
when flow-0 is a explicit 4band priority fifo.
to "dequeue" to implement complete CSZ.
I do not want to do that, unless it is absolutely
necessary.

* A primitive support for token bucket filtering
presents itself too. It directly contradicts CSZ, but
even though the Internet is on the globe ... :-)
"the edges of the network" really exist.

BUGS.

* Fixed point arithmetic is overcomplicated, suboptimal and even
wrong. Check it later.