Network
Every complex computer system involves one or more communication links, usually organized to form a network
The basic abstractions across the board
- send(message)
- receive(message)
The design of communication networks is dominated by three intertwined considerations:
- A trio of fundamental physical properties
- The mechanics of sharing
- A remarkably wide range of parameter values
A trio of fundamental physical properties
- The speed of light is finite
- It takes 20 milliseconds to transmit a signal across 2600 miles from Boston to Los Angeles
- Going to and from a geostationary satellite perched 22,400 miles above the equator: 244 milliseconds — large enough for a human to notice!
- Propagation delay
- There is no way to avoid it without moving the two cities closer together!
- Communication environments are hostile
- Radio signals, wires, and optical fibers must traverse far more hostile environments — under the floor, deep in the ocean
- Much easier to break: wiping out individual bits, severing cables
- Communication media have limited bandwidth
- Every transmission medium has a maximum rate at which one can transmit distinct signals
- This maximum rate is determined by its physical properties, such as the distance between transmitter and receiver and the attenuation characteristics of the medium
- Signals can be multilevel, not just binary, so the data rate can be greater than the signaling rate
- However, noise limits the ability of a receiver to distinguish one signal level from another
- The combination of limited signaling rate, finite signal power, and the existence of noise limits the rate at which data can be sent over a communication link
A network design must deal with
- Propagation delay: range over a few orders of magnitude
- Lack of reliability: Errors may occur
- Bandwidth: Also range over a few orders of magnitude
Communication links are always shared
Sharing arises for two distinct reasons
- Any-to-any connection: when you wish to communicate between any pair of computers, dedicated links connecting all pairs are no longer feasible
- The number of computers a network should connect also spans a wide range of a few orders of magnitude
- Sharing of communication costs: Some parts of a communication system follow Moore’s law as they are silicon
- Other parts, such as building cables deep in the ocean or launching a satellite, are not getting cheaper
Wide range of parameter values
Propagation delays, data rates, the number of communicating computers: all spanning a range of several orders of magnitude
Another wide-ranging parameter
- Load: A single computer may at different times present a network with widely differing loads, ranging from transmitting a file at 30 megabytes per second to interactive typing at a rate of one byte per second!
Designing a network as a system is hard due to these considerations:
- Unyielding physical limits
- Sharing of facilities
- Existence of four different parameters that can each range over seven or more orders of magnitude
- propagation delay
- data rates
- the number of communicating computers
- Load
The significant consequences of sharing
Consider a multiplexed communication link in a telephone network between Boston and Los Angeles.
If there is an earthquake in Los Angeles and a “flash crowd” of incoming phone calls from Boston
The multiplexed link has a limited capacity
At some point the next caller will get a "network busy" signal
Assume each call is 64 kilobits/sec, and the multiplexed link runs at 45 megabits/sec
A call runs at 8000 frames/sec, each frame is an 8-bit block
In between each pair of frames belonging to the call there is room for 702 other frames, carrying bits belonging to other phone calls
The entire link can carry up to 703 simultaneous calls
- $\frac{45 \times 10^6}{64 \times 10^3} \approx 703$
The 704th person will receive the network busy signal
Hard-edged: No resistance to the first 703 calls, but absolutely refuses the 704th call!
Time-division multiplexing
Time-division multiplexing (TDM), used in telephony, is very suitable for voice calls
- Provides a constant rate of data flow, and the delay from one end to the other is the same for every frame
But there is one prerequisite
- There must be some prior arrangement between the sending switch and the receiving switch
A connection requires
- Some previous communication between the two switches to set it up
- Storage for remembered state at both ends of the link
- Some method to discard (tear down) that remembered state at the end
Data network
Why do data networks use a different strategy?
Data is being sent in bursts on an irregular basis - We can call these bursts messages
- Compared with the continuous stream of bits that flows out of a simple digital telephone
Bursty traffic: ill-suited to fixed size and spacing of isochronous frames
A system designer will be happy to give up the guarantee of uniform data rate and uniform latency - If in return an entire message can get through more quickly
This trade-off is achieved using asynchronous multiplexing - The sampling rate is different for each of the signals and a common clock is not required
Packet
The receiver is no longer able to figure out where the message in the frame is destined by simply counting bits
Each frame must include a few extra bits that provide guidance about where to deliver it
A variable-length frame together with its guidance information is called a packet
Guidance information can include the destination address of the message
Also need some way of figuring out where each frame starts and ends, a process known as framing
Asynchronous communication offers the possibility of connectionless transmission - No state needs to be maintained on the switches about particular end-user communications (packets are usually called datagrams)
An additional complication: most links place a limit on the maximum size of a frame
- When a message is larger than this maximum size, the sender must break it up into segments
- The network carries each segment in a separate packet
- Enough information is included with each segment to allow the original message to be reassembled at the other end (a sketch follows below)
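A minimal sketch (hypothetical segment format, not from the original text) of this segmentation and reassembly, carrying a message id, segment index, and segment count with each packet:

```python
# Hypothetical segmentation/reassembly: each segment carries (message_id, index, total).
MAX_PAYLOAD = 512   # assumed maximum frame payload in bytes

def segment(message_id, data):
    chunks = [data[i:i + MAX_PAYLOAD] for i in range(0, len(data), MAX_PAYLOAD)]
    return [(message_id, i, len(chunks), chunk) for i, chunk in enumerate(chunks)]

def reassemble(packets):
    # Packets may arrive out of order; the index lets the receiver restore the order.
    packets = sorted(packets, key=lambda p: p[1])
    assert len(packets) == packets[0][2], "some segments are still missing"
    return b"".join(chunk for (_, _, _, chunk) in packets)

pkts = segment(7, b"x" * 1300)           # three segments of up to 512 bytes each
print(reassemble(list(reversed(pkts))))  # reassembles correctly despite reordering
```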
A packet going from A to B may follow several different paths or routes
- Choosing a particular path for a packet is called routing
- Choosing an outgoing link at a packet switch is called forwarding (using table lookup)
Propagation delay
The time required for the signal to travel across a link is determined by the speed of light in the transmission medium connecting the packet switches, and the physical distance the signals travel
Transmission delay
Since the frame that carries the packet may be long or short, the time required to send the frame at one switch - and receive it at the next switch - depends on the data rate of the link and the length of the frame
Processing delay
Time for a switch to decide which outgoing link to forward a packet to, and to place the packet in the outgoing link’s buffer
Queuing delay
When the packet from A to B arrives at the upper right packet switch, link #3 may already be transmitting another packet, perhaps one that arrived from link #2, and there may also be other packets queued up waiting to use link #3
If so, the packet switch will hold the arriving packet in a queue in memory until it has finished transmitting the earlier packets
The duration of this delay depends on the amount of other traffic passing through that packet switch, so it can be quite variable
Queuing delay as a function of utilization
Asynchronous systems introduced a fundamental trade-off: if we limit the average queuing delay, it will be necessary to leave unused, on average, some capacity of each link!
- If we allow the utilization to approach 100%, delays will grow without bound
- The asynchronous design has replaced the abrupt appearance of the busy signal of the isochronous network with a gradual trade-off: as the system becomes busier, the delays increase
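For illustration (a standard single-queue result, not from the original text): in an M/M/1 queue with average service time $S$ and utilization $\rho$, the mean time a packet spends queued plus being served is

$$T = \frac{S}{1 - \rho}$$

so at $\rho = 0.5$ the average delay is only $2S$, while as $\rho \to 1$ the delay grows without bound.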
Maximum queuing delays
The figure is just about average queuing delays under ideal conditions - If we are serious about keeping the maximum delay almost always below a given value, we must prepare for occasional worse peaks by holding utilization well below the level of $\rho_{max}$ suggested by the previous figure
Queuing theory is not useful in predicting the behaviour of a network
In practice, network systems put a bound on link queuing delays by limiting the size of queues and by exerting control on arrivals - Effectively shifting delays to other places in the network
Design buffer size
Three strategies to design the buffer size:
- Plan for the worst case: Examine the network traffic carefully, figure out what the worst-case traffic situation will be, and allocate enough buffers to handle it
- Plan for the usual case and fight back: Based on a similar calculation, choose a buffer size that will work most of the time, and if the buffers fill up send messages back through the network asking someone to stop sending
- Plan for the usual case and discard overflow: Choose a buffer size that will work most of the time, and ruthlessly discard packets when the buffers are full
Planning for the worst case
Buffer memory is usually inexpensive, so planning for the worst case seems like an attractive idea?
- But it is actually much harder than it sounds!
- Impossible to figure out what the worst case is in a large network
- Packets will queue up for too long in the large buffer: the sender may abort
- Go for the average (usual) case, then?
- What do we do when traffic exceeds the average for long enough to run out of buffer space, that is, congestion?
Fight back
If buffer space begins to run low, send a message back along an incoming link saying “please don’t send any more until you hear from me” - Such a “quench” request may go to the upstream packet switch, or all the way back to the original source
Harder than it sounds, too
- If a packet switch is experiencing congestion, there is a good chance that the adjacent switch is also congested - if it is not already congested, it soon will be once it is told to stop sending data over the link to this switch
- Sending an extra message adds to the congestion
- Worse, a set of packet switches configured in a cycle can easily end up in a form of deadlock: all buffers filled and waiting for the OK signal
Back to source
Problematic, too
- It may not be clear to which source to send the quench
- Such a request may not have any effect because the source you choose to quench may no longer be sending anyway by the time the quench request reaches it
- Assuming that the quench message is itself forwarded back
through the packet-switched network, it may run into congestion
and be subject to queuing delays
- The busier the network, the longer it will take to exert control!
- Even if all the data is coming from one source, by the time the quench gets back and the source acts on it, the packets already in the pipeline may exceed the buffer capacity
Discarding packets
When the buffers fill up, the switches start throwing packets away
- Sounds like an awful choice!
- Eventually each discarded packet will have to be sent again, so the effort to send the packet this far will have been wasted
- But unfortunately, this is the only choice left
- It leads to a remarkable consequence
- Automatic rate adaptation: the sender of a packet can interpret the lack of its acknowledgment as a sign that the network is congested, and can in turn reduce the rate at which it sends new packets into the network
Forwarding networks
The possibility that a network may actually discard packets to cope with congestion leads to a useful distinction between two kinds of forwarding networks
- Best-effort network: If a network switch cannot dispatch a packet
soon after receipt, it may discard it (therefore providing a best-effort
contract to its users)
- Usually in a lower network layer
- The Internet
- Guaranteed-delivery network: It takes heroic measures to avoid ever discarding payload data
- Usually in a higher network layer
- The Internet e-mail delivery
Duplicate
Problem 1: Duplicate Requests
Problem 2: Duplicate Responses
Duplicate Suppression
The general procedure to suppress duplicates has two components
- Nonces
- Each request includes a nonce, which is a unique identifier that will never be reused by A when sending requests to B
- Example in previous figures: Use monotonically increasing serial numbers as nonces
- Receiver must keep a list of nonces
- If a duplicate is received, do not take action, but resend the previous response
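A minimal sketch (hypothetical names, not from the original text) of a receiver that suppresses duplicates by remembering nonces and replaying the cached response:

```python
# Hypothetical nonce-based duplicate suppression at the receiver (B).
class Receiver:
    def __init__(self):
        self.responses = {}  # nonce -> previously computed response

    def handle_request(self, nonce, request):
        if nonce in self.responses:
            # Duplicate: do not take the action again, just resend the earlier response.
            return self.responses[nonce]
        response = self.process(request)   # take the action exactly once
        self.responses[nonce] = response   # remember it for future duplicates
        return response

    def process(self, request):
        return "done: " + request

# The sender (A) uses monotonically increasing serial numbers as nonces.
rx = Receiver()
print(rx.handle_request(1, "transfer funds"))  # acted upon
print(rx.handle_request(1, "transfer funds"))  # duplicate: same response, no second transfer
```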
Broken message
Damaged packets: Detected by error detection algorithms (such as using a checksum)
- Once detected, simply discarded by intermediate packet switches
Broken links: Needs more flexibility in routing
Reordered delivery: May be caused by following different paths - Needs additional information to reconstruct packet order at the receiver - It is not wise for the network system to maintain order
Isochronous vs. asynchronous multiplexing
Adopt Sweeping Simplifications
Fundamental design principle: Adopt sweeping simplifications so you can see what you are doing
All networks use the divide-and-conquer technique known as layering of protocols
Module A may call any of the modules J, K, or L, but A doesn’t even know the existence of X, Y, and Z
The figure here shows A using module K. Module K, in turn, may call any of X, Y, or Z
Analogy: Post office
Consider a customer who walks into a post office to send a letter to a friend
Postal delivery involves an exchange of a message between two sets of parties, called "peer entities"
- The customers, who view the postal network as a black box that accepts letters at one end and delivers them at another
- The postal workers, who handle the letters on behalf of their customers
Advantages of Protocol Layering
The set of rules that peer entities need to follow when they communicate with each other is called a protocol
Three advantages of protocol layering
- It allows a complex problem to be broken into smaller, more manageable pieces
- The implementation detail can be hidden (abstracted) from other layers - we can change it without disturbing the rest of the protocol stack
- Many upper layer protocols can share the services provided by the same lower layer protocol
Problems of Protocol Layering
Should we design upper layer protocols with the knowledge of implementation detail of a lower layer protocol? - Of course not, it violates the layering principle and defeats information hiding
But sometimes information hiding leads to poor performance - An upper layer protocol can throttle the source sending rate when detecting a packet loss, with the knowledge that a lower layer protocol will drop packets when congestion occurs
The art is to leak enough information between layers to allow efficient performance, but not so much that it is hard to change the implementation of a layer!
How many layers do we need?
At the minimum, three
- The link layer: Moving data directly from one point to another
- The network layer: Forwarding data through intermediate points to move it to the place it is wanted
- The end-to-end layer: Everything else required to provide a comfortable application interface
Perhaps we can also divide the link layer into the physical layer and the data link layer
Or divide the end-to-end layer into the transport layer and the application layer
It will then become a total of 5 layers
One end-to-end layer or four layers
Different applications have radically different requirements for transport, session, and presentation services!
- Even to the extent that the order in which they should be applied may be different
- This situation makes it difficult to propose any single layering, since a layering implies an ordering
- Any implementation decisions that a lower layer makes may be counterproductive for at least some applications
Instead, it is likely to be more effective to provide a library of service modules that can be selected and organized by the developer of a specific application
Sending and receiving data using layers
An application starts by asking the end-to-end layer to transmit a message or a stream of data to a correspondent
The end-to-end layer:
- Splits long messages and streams into segments
- Copes with lost or duplicated segments
- Places arriving segments in proper order
- Enforces specific communication semantics
- Performs presentation transformations
- Finally, calls on the network layer to transmit each segment
Typical services in the end-to-end layer
- Presentation services: Translating data formats and emulating the semantics of a procedure call
  - Remote Procedure Call (RPC)
- Transport services: Dividing streams and messages into segments and dealing with lost, duplicated, and out-of-order segments
  - Using serial numbers of the segments
- Session services: Negotiating a search, handshake, and binding sequence to locate and prepare to use a service that knows how to perform the requested procedure
  - Session Initiation Protocol (SIP)
The network layer:
- Accepts segments from the end-to-end layer
- Constructs packets, and transmits those packets across the network
- Choosing which links to follow to move a given packet from its origin to its destination
The link layer:
- Accepts packets from the network layer
- Constructs and transmits frames across a single link between two forwarders or between a forwarder and a customer of the network
The end-to-end argument
The argument against additional layers constitutes part of a design principle
Fundamental design principle: The end-to-end argument
- The application knows best
The basic idea of the end-to-end argument is that the application knows best what its real communication requirements are, and for a lower network layer to try to implement any feature other than transporting the data risks implementing something that is not quite what the application needed.
Moreover, if it is not exactly what is needed, the application will probably have to re-implement that function on its own.
Don’t bury it in a lower layer, let the end points deal with it because they know best what they need!
A simple example: careful file transfer
To transfer a file carefully, the appropriate method is to calculate a checksum from the contents of the file as it is stored in the file system of the originating site
After the file has been transferred and written to the new file system, the receiving site should read the file back out of its file system, recalculate the checksum anew, and compare it with the original checksum
If the two checksums are the same, the file is correctly transferred
An end-to-end approach of checking the accuracy of the file transfer
- Is there any value in having the link layer add a frame checksum?
Despite this protection, the data may still be damaged!
- While it is being passed through the network layer
- While it is buffered by the receiving part of the file transfer application
- While it is being written to the disk
Therefore, the careful file transfer application cannot avoid calculating its end-to-end checksum, despite the protection provided by the link layer checksum!
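A minimal sketch of the end-to-end check (file names and the copy standing in for the transfer are assumptions, not from the original text): the receiving site re-reads the file from its own file system and compares checksums with the originating site:

```python
import hashlib, shutil

def file_checksum(path):
    # Read the file back out of the file system and hash its contents.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Originating site: compute the checksum from the file as stored in its file system.
with open("original.dat", "wb") as f:        # assumed file, created here for the demo
    f.write(b"some application data")
before = file_checksum("original.dat")

shutil.copy("original.dat", "received.dat")  # stands in for the network transfer and write

# Receiving site: re-read the file from its own file system and recompute the checksum.
after = file_checksum("received.dat")
print("file correctly transferred" if before == after else "mismatch: transfer again")
```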
Is the link layer checksum worthless?
- With a checksum, the link layer will discover data transmission errors at a time when they can be easily corrected by resending just one frame.
- Rather than redoing the entire file transfer
There may be a significant performance gain in having a feature in the lower-level layer
This implies that a lower-layer checksum does not eliminate the need for the application layer to implement the function
- It is thus not required for application correctness
It is just a performance enhancement
System Design
What is system design?
- A computer network provides computation, storage and transmission resources
- System design is the art and science of putting together these resources into a harmonious whole
- Ultimately, you wish to extract the most from what you have
Simplicity
Three rules of work:
- Out of clutter find simplicity
- From discord find harmony
- In the middle of difficulty lies opportunity
Simplicity can be an overarching philosophy, but it is too high level
Performance metrics and resource constraints
- In any system, some resources are more freely available than others
- Think about a high-end laptop connected to the Internet by a DSL modem
- The constrained resource is link bandwidth
- CPU and memory are unconstrained
- We wish to maximize a set of performance metrics given a set of resource constraints
- Explicitly identifying constraints and metrics helps in designing efficient systems
- Maximize reliability (mean time between failures) for a car that costs less than 10,000 to manufacture
Real-world system design should be
- Scalable
- Modular
- Extensible
- Elegant
- Future-proof
- Rapid technological change
- Market conditions may dictate changes to design halfway through the process
- International standards, which themselves change slowly, also impose constraints
Most resources are a combination of
- time
- space
- computation
- money
- labor
- scaling
Time
Shows up in many constraints
- Deadline for task completion, time to market, mean time between failures
Metrics
- Response time: Mean time to complete a task
- Throughput: Number of tasks completed per unit time
- Degree of parallelism: $\text{Response time} \times \text{Throughput}$
- 20 tasks completed in 10 seconds, and each task takes 3 seconds
- $3 \times \frac{20}{10} = 6$
Space
Example: A limit on the memory available for a buffer to hold packets in switches and routers
We can also view bandwidth as a space constraint - A T3 link has a bandwidth of 44.736 Mbps. If we use it to carry video streams with a mean bit rate of 1.5 Mbps, we can fit at most 29 streams in this link
Scaling
A design constraint, rather than a resource constraint
- Minimizes the use of centralized elements in the design
- Yet, forces the use of complicated distributed algorithms
- Hard to measure
- But necessary for success
Bottleneck
Bottlenecks are the most constrained elements in a system
System performance improves by removing the bottlenecks
- But inevitably creates new bottlenecks
In a balanced system, all resources are simultaneously bottlenecked
- This is optimal, but nearly impossible to achieve
- In practice, bottlenecks move from one part of the system to another
Multiplexing and virtualization
Multiplexing: Another name for sharing
Trades time and space for money
Users see an increased response time, and take up space when waiting, but the system costs less
Economies of scale make a single large resource cheaper
Examples:
- Multiplexed communication links
- Servers in cloud computing
Another way to look at a shared resource
- Unshared virtual resource - the telephone network with time-division multiplexing
Server controls access to the shared resource
- Uses a schedule to resolve contention
- Choice of scheduling: Critical in providing quality of service guarantees
- Think about boarding a flight
Statistical multiplexing
Suppose resource has capacity C
- Shared by N identical tasks
- Each task requires capacity c
- If $N * c \le C$, then the resource is underloaded
- If at most 10% of tasks active, then $C \ge N * \frac{c}{10}$ is enough
- We used statistical knowledge of users to reduce system cost
- This is the statistical multiplexing gain
Two types of statistical multiplexing
- Spatial: We expect only a fraction of tasks to be simultaneously active
- Temporal: We expect a task to be active only part of the time - its average resource consumption is less than its peak
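A small simulation sketch (all numbers are made up, not from the original text) of the spatial multiplexing gain when each source is active only 10% of the time:

```python
import random

# Illustrative simulation: N independent sources, each active with probability p.
N, c, p, trials = 1000, 64, 0.10, 2000       # e.g. 1000 sources of 64 kb/s each
peak_capacity = N * c                        # provisioning for all sources at once
counts = sorted(sum(random.random() < p for _ in range(N)) for _ in range(trials))
stat_capacity = counts[int(0.999 * trials)] * c   # covers 99.9% of observed moments
print(peak_capacity, stat_capacity, round(peak_capacity / stat_capacity, 1))  # the gain
```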
Batching
Batching: Trading response time for throughput
Group tasks together to amortize overhead
Only works when the overhead for $N$ tasks is less than $N$ times the overhead for one task
Also, time taken to accumulate a batch shouldn’t be too long
We’re getting reduced overhead and increased throughput, but suffering from a longer worst case response time
Randomize
Randomized algorithms may avoid maintaining states and improve performance
- Randomized quicksort
- The Monte Carlo method uses randomness for deterministic problems that are difficult to solve
But what if I have to remember and maintain some states?
- Use soft state which expires after some time without a refresh
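A minimal sketch (hypothetical names and TTL value, not from the original text) of soft state that silently expires unless refreshed:

```python
import time

# Hypothetical soft-state table: entries vanish unless refreshed within TTL seconds.
class SoftStateTable:
    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self.entries = {}          # key -> (value, last_refresh_time)

    def refresh(self, key, value):
        self.entries[key] = (value, time.time())

    def lookup(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        value, stamp = item
        if time.time() - stamp > self.ttl:
            del self.entries[key]  # expired: no explicit teardown message needed
            return None
        return value
```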
Layers
Use layers, but no more than 3
Traditional OSI Model
- Application Layer
- Presentation Layer
- Session Layer
- Transport Layer
- Network Layer
- Link Layer
- Physical Layer
What we need
- End-to-end layer
- Network layer
- Link layer
Review the end-to-end argument
- Error control: Best done in the applications
- Two-phase commit protocols: Do not depend on reliability, FIFO sequencing, or duplicate suppression for their correctness
- RISC architectures: No need to anticipate application requirements for an esoteric feature
Corollary
Use software, unless performance is not good enough
Case in point: Machine learning
Implement more functions in hardware, but only when they definitely help improve the performance of the applications we run.
Hierarchies
Use hierarchies to be more scalable, but no more than 3 levels
Examples:
- Network Time Protocol (NTP)
- Domain name system (DNS)
- Root servers
- Organization servers
- Authorized name servers
- Organization servers
- Root servers
- Certificate Authorities
- Root certificates
- Intermediate certificates
- Certificates
- Intermediate certificates
- Root certificates
- Web service
- Origin servers
- Edge servers in CDNs
- Clients
- Edge servers in CDNs
- Origin servers
Why do we use only 3 levels in the hierarchical design?
- Well, 3 is **scalable** enough based on real-world experience; the complexity from more levels is not necessary
Hierarchical designs are conceptually easy, but difficult to implement correctly
Ideas toward implementing hierarchies
- Cache aggressively in leaf nodes to avoid congesting the root
- Nodes in each hierarchy should depend only on their parents for proper execution
- Leaf-to-leaf communication is expensive to support
- Peer-to-peer vs. client-server
P2P vs. Client-Server

| P2P | Client-Server |
| --- | --- |
| Permissionless Blockchain | Cloud Computing |
| Slow | Fast |

| P2P | Client-Server |
| --- | --- |
| Traditional Routers | Software-Defined Networking |
| Fault-tolerant | Central Point of Failure |
Use the cloud as much as possible, only use peer-to-peer when necessary
Pipelines
Traditional parallelism with more processors requires breaking up a task into multiple independent subtasks
But what if the subtasks are dependent?
- One cannot start until another ends?
Pipelining parallelism increases throughput
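A minimal sketch (hypothetical stages, not from the original text) of pipelining: the second subtask depends on the first for each item, yet the two stages overlap across items, increasing throughput:

```python
import threading, queue

# Illustrative two-stage pipeline built from threads and queues.
def stage(inbox, outbox, work):
    while True:
        item = inbox.get()
        if item is None:                 # end-of-stream marker
            if outbox is not None:
                outbox.put(None)
            break
        result = work(item)
        if outbox is not None:
            outbox.put(result)
        else:
            print("done:", result)

q1, q2 = queue.Queue(), queue.Queue()
t1 = threading.Thread(target=stage, args=(q1, q2, lambda x: x * 2))    # subtask 1
t2 = threading.Thread(target=stage, args=(q2, None, lambda x: x + 1))  # subtask 2 (depends on 1)
t1.start(); t2.start()
for i in range(5):
    q1.put(i)
q1.put(None)
t1.join(); t2.join()
```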
Trade-off
Fundamentally, system design is about trade-offs
Trade-offs between more abundant resources and scarce resources at the bottleneck
Trade-offs in system design
- Multiplexing and virtualization: More time, more space consumed, but costs less
- Parallelism: More computation, less time to complete
- Batching: Trading response time for system throughput
Routing
Network Addr
The conceptual model of a network
- A cloud bristling with network attachment points identified by numbers known as network addresses
A segment enters the network at one attachment point, known as the source
The network layer wraps the segment in a packet and carries the packet across the network to another attachment point, known as the destination, where it unwraps the original segment and delivers it to the end-to-end layer
Managing the forwarding table
The primary challenge in a packet forwarding network is to set up and manage the forwarding tables
Constructing these tables requires first figuring out appropriate paths (sometimes called routes) to follow from each source to each destination - this process is called routing
Setting these tables by hand is not scalable
- When links are added, removed, failed or repaired, the forwarding tables need to be recalculated
- It would be nice for forwarding tables to automatically adapt to avoid congestion
A packet forwarder that also participates in an adaptive routing algorithm is called a router
Challenges
How to construct a consistent, efficient set of forwarding tables - so that there are no loops in routes?
What defines a better routing protocol?
- A smaller number of hops to the destination
- Adaptive routing: Able to adapt to a change in topology
- Scalability: Handles a large number of destinations
Path vector algorithm
- Exchanges information about roughly 100,000 routers in the core of the Internet
Path Vector
Basic Idea
- Each participant maintains, in addition to its forwarding table
- A path vector, each element of which is a complete path to some destination
- Initially, the only path it knows about is the zero-length path to itself
- As the algorithm proceeds it gradually learns about other paths
- Eventually its path vector accumulates paths to every point in the network
- After each step of the algorithm it can construct a new forwarding table from its new path vector, so the forwarding table gradually becomes more and more complete
Two steps
- Advertising
- Path selection
Advertising
In the advertising step, each participant sends its own network address and a copy of its path vector down every attached link to its immediate neighbours
Path selection
G now performs the path selection step by merging the information received from its neighbours with that already in its own previous path vector
- To do this merge, G takes each received path, prepends the network address of the neighbour that supplied it, and then decides whether or not to use this path in its own path vector
- For a previously unknown destination, the answer is yes
- For previously known destinations, G compares the paths that its neighbours have provided with the path it already had in its table, to see if the neighbour has a better path (smaller number of hops to the destination)
Each router discards any paths that a neighbour stops advertising - to discard links that go down
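A minimal sketch (hypothetical data structures, not from the original text) of one advertising/path-selection round, keeping the path with the fewest hops:

```python
# Hypothetical one-round path-vector update for router G.
# A path vector maps destination -> list of router addresses along the path.
def select_paths(my_addr, my_paths, advertisements):
    """advertisements: {neighbour_addr: path vector received from that neighbour}"""
    new_paths = dict(my_paths)                # start from what G already knows
    for neighbour, their_paths in advertisements.items():
        for dest, path in their_paths.items():
            if my_addr in path or dest == my_addr:
                continue                      # never adopt a path that loops back through G
            candidate = [neighbour] + path    # prepend the neighbour that supplied it
            if dest not in new_paths or len(candidate) < len(new_paths[dest]):
                new_paths[dest] = candidate   # previously unknown, or fewer hops
    return new_paths

# G initially knows only the zero-length path to itself.
g_paths = {"G": []}
ads = {"H": {"H": [], "A": ["A"]}, "J": {"J": [], "A": ["K", "A"]}}
g_paths = select_paths("G", g_paths, ads)
print(g_paths)   # {'G': [], 'H': ['H'], 'A': ['H', 'A'], 'J': ['J']}
```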
Loops
Temporary loops are still possible
- If a link has gone down, some packets may loop for a while until everyone agrees on the new forwarding table!
Solution: hop limit
- Add a field to the network-layer header containing a hop limit counter
- Each router decrements the hop limit counter
- If a router finds it to be zero, it discards the packet
Hierarchy in routing
Two problems in our solution so far
- Every attachment point must have a unique address
- It is hard to maintain a complete and accurate list of addresses already assigned when the number of addresses is large
- The path vector grows in size with the number of attachment points
Solution: Introducing hierarchies
- Network addresses should be designed to have a hierarchical structure
- Both for decentralizing address assignments and for reducing the size of forwarding tables and path vectors
Assume that we have two hierarchies
- Region & Station
- We may assign to A the network address “11,75” where 11 is a region identifier and 75 is a station identifier
Key benefit
- Reduction of path vectors
- If we can adopt a policy that regions must correspond to the set of network attachment points served by a group of closely-connected routers
- We can use it to reduce the size of forwarding tables and path vectors
- For example, when a router for region 11 gets ready to advertise its path vectors to a router serving region 12
- It can condense all paths for region 11 into a single path
Now the problem of assigning unique addresses in a large network is also solved
- The station part of a network address needs to be unique only within its region
- A central authority assigns region identifiers
- Local authorities assign station identifiers within each region
The table lookup process is more complicated
- The forwarder needs to first extract the region component of the destination address, and look that up in its forwarding table
- Either the forwarding table contains an entry showing a link over which to send the packet to that region
- Or the forwarding table contains an entry saying that this forwarder is already in the destination region
- It is now necessary to extract the station identifier and look that up in a different part of the forwarding table
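A minimal sketch (hypothetical table layout, not from the original text) of this two-stage lookup on a "region,station" address:

```python
# Hypothetical two-part forwarding table for a router that serves region 11.
MY_REGION = "11"
region_table = {"12": "link#1", "13": "link#2"}    # region -> outgoing link
station_table = {"75": "link#4", "76": "link#5"}   # stations inside region 11

def forward(dest_addr):
    region, station = dest_addr.split(",")      # e.g. "11,75"
    if region != MY_REGION:
        return region_table[region]             # send toward the destination region
    return station_table[station]               # already in the region: look up the station

print(forward("12,20"))   # link#1 - one condensed entry covers every station in region 12
print(forward("11,75"))   # link#4 - station lookup within the local region
```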
The addresses are becoming geographically dependent
Paths may no longer be the shortest possible
Congestion Control
Shared resources
Resource sharing examples in systems:
- Many virtual processors(threads) sharing a few physical processors using a thread manager
- A multilevel memory manager creates the illusion of large, fast virtual memories by combining a small and fast shared memory with large and slow storage devices
In networks, the resource that is shared is a set of communication links and the supporting packet forwarding switches
- They are geographically and administratively distributed
- Managing them is more complex
Analogy: Supermarket vs. Packet switch
Queues are started to manage the problem that packets may arrive at a switch at a time when the outgoing link is already busy transmitting another packet
- Just like checkout lines in the supermarket
Any time there is a shared resource, and the demand for that resource comes from several statistically independent sources, there will be fluctuations in the arrival of load
- Thus there will be fluctuations in the length of the queue, and the time spent waiting for service in the queue
- If $\text{offered load} > \text{capacity}$, the resource is overloaded
How long will overload persist?
- If the duration of the overload is comparable to the service time, it is normal
- The time in a supermarket to serve one customer
- Or the time for a packet forwarding switch to handle one packet
- In this case, a queue handles short bursts of too much demand by time-averaging with adjacent periods when there is excess capacity
- If overload persists for a time significantly longer than the service time, there begins to develop a risk that the system will fail
- To meet some specification, such as maximum delay
- When this occurs, the resource is said to be congested
- If congestion is chronic, the length of the queue will grow without bound
The stability of offered load
The stability of offered load is another factor in the frequency and duration of congestion
- When the load on a resource is aggregated from a large number of statistically independent small sources
- Averaging can reduce the frequency and duration of load peaks
- When the load comes from a small number of large sources, even if the sources are independent
- The probability that they all demand service at about the same time can be high enough
- That congestion can be frequent or long-lasting
Congestion Collapse
Competition for a resource may lead to waste of that resource
Example: Customers who are tired of waiting may just walk out, leaving filled shopping carts behind
- Someone has to put the goods from abandoned carts back to the shelves
- One or two of the checkout clerks leave their registers to do so
- The rate of sales being rung up drops while they are away
- The queues at the remaining registers grow longer
- Causing more people to abandon their carts
- Eventually, the clerks will be doing nothing but restocking
Once temporary congestion induces a collapse, even if the offered load drops back to a level that the resource can handle, the already induced waste rate can continue to exceed the capacity of the resource
- This will cause it to continue to waste the resource and remain congested indefinitely
Resource management
Primary goal of resource management
- Avoid congestion collapse
- By increasing the capacity of the resource
- By reducing the offered load
There is a need to move quickly to a state in which the load is less than the capacity of the resource
- But when offered load is reduced, the amount reduced does not really go away
- It is just deferred to a later time at the source
- The source is still averaging periods of overload with periods of excess capacity, but over a longer period of time
How to increase capacity or reduce load?
- It is necessary to provide feedback to one or more congestion points
- An entity that determines
- The amount of resource that is available
- The load being offered
- A congestion control system is fundamentally a feedback system
- A delay in the feedback path can lead to oscillations in load
In a supermarket, a store manager can be used to watch the queues at the checkout lines
- Whenever there are more than two or three customers in any line
- The manager calls for staff elsewhere in the store
- To drop what they are doing
- Temporarily take stations as checkout clerks
- This practically increases capacity
When you call customer service, you may hear an automatic response message
- “The estimated wait time is 30 minutes”
- This may lead some callers to hang up and try again at a different time
- This practically decreases load
Both may lead to oscillations
Shared resources in a computer network
- Communication links
- The processing and buffering capacity of the packet forwarding switches
Main Challenges
Part 1
There is more than one resource
- Even a small number of resources can be used up in a large number of different ways, which is complex to keep track of
- There can be dynamic interactions among different resources - as one nears capacity, it may push back on another
- Which may push back on yet another
- Which may push back on the first one
It is easy to induce congestion collapse
- As queues for a particular communication link grow, delays grow
- When queuing delays become too long, the timers of higher layer protocols begin to expire and trigger retransmissions of the delayed packets
- The retransmitted packets join the long queues and waste capacity
As memory gets cheaper, the idea is tempting - but it doesn’t work
- Suppose memory is so cheap that a packet forwarder can be equipped with an infinite buffer, which can absorb an unlimited amount of overload
- But as more buffers are used, the queuing delay grows
- At some point the queuing delay exceeds the timeouts of end-to-end protocols
- Packets are retransmitted
- The offered load is now larger, so the queue grows even longer
- It becomes self-sustaining and the queue grows even longer
- The infinite buffer size does not solve the problem, it makes it worse
Part 2
There are limited options to expand capacity
- Capacity is determined by physical facilities (wireless spectrum)
- One can try sending some queued packets via an alternate path
- But these strategies are too complex to work well, as one needs to know the instantaneous state of the queues throughout the network
- Reducing the offered load (the demand) is the only realistic way
Part 3
The options to reduce load are awkward
- The control point for the offered load is too far away
- Feedback path to that point may be long
- By the time the feedback signal gets there
- The sender may have stopped sending already
- The congestion may have disappeared
- The feedback may be lost
- The control point must be capable of reducing its offered load
- Video streaming protocols are not able to do this
- The control point must be willing to cooperate
- The packet forwarder in the network layer may be under a different administration than the control point in the end-to-end layer
- The control point is more interested in keeping its offered load equal to its intended load
- In the hope of capturing more of the capacity in the face of competition
Address challenges
Overprovisioning
Basic idea: Configure each link of the network to have 125% or 200% as much capacity as the offered load at the busiest minute of the day
- Works best on interior links of a large network, where no individual client represents more than a tiny fraction of the load
- Average load offered by a large number of statistically independent sources is relatively stable
Problem
- Odd event can disrupt statistical independence
- Overprovisioning on one link will move congestion to another
- At the edge, statistical averaging stops working (flash crowds)
- User usage patterns may adapt to the traditional capacity
Pricing in a market: The invisible hand
Since network resources are just another commodity with limited availability, it should be possible to use pricing as a congestion control mechanism
- If demand for a resource temporarily exceeds its capacity, clients will bid up the price
- The increased price will cause some clients to defer their use of the resource until a time when it is cheaper, thereby reducing offered load
- Consider hydro charges in different time slots
- It will also induce additional suppliers to provide more capacity
Challenges
- How do we make it work on the short time scales of congestion?
- Clients need a way to predict the costs in the short term, too
- There has to be a minimal barrier to entry for alternate suppliers
Cross-layer feedback
The packet forwarder that notices congestion provides feedback to one or more end-to-end layer sources
The end-to-end source responds by reducing its offered load
The best solution:
- The packet forwarder simply discards the packet
Which packet to discard? The choice is not obvious
- The simplest strategy, tail drop, limits the size of the queue, and any packet that arrives when the queue is full gets discarded
- A better technique, called random drop, chooses a victim from the queue at random
- The sources that are contributing the most to congestion are the most likely to receive the feedback
- Another refinement, called early drop
- Begins dropping packets before the queue is completely full
- In the hope of alerting the source sooner
- The goal of early drop
- Start reducing the offered load as soon as the possibility of congestion is detected, rather than waiting till congestion is confirmed
- Avoidance rather than recovery
- Random drop + Early drop: Random early detection (RED)
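A minimal sketch (thresholds and drop probability are made up; real RED also averages the queue length) combining early drop and random drop:

```python
import random

# Illustrative RED-style drop decision for an arriving packet.
MIN_THRESH, MAX_THRESH, MAX_P = 50, 200, 0.10   # queue lengths in packets

def should_drop(queue_len):
    if queue_len < MIN_THRESH:
        return False                             # plenty of room: never drop
    if queue_len >= MAX_THRESH:
        return True                              # queue effectively full: always drop
    # In between, drop with a probability that grows with the queue length,
    # so the heaviest senders are the most likely to receive the feedback.
    p = MAX_P * (queue_len - MIN_THRESH) / (MAX_THRESH - MIN_THRESH)
    return random.random() < p
```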
What should the end-to-end protocol do?
One idea: Just retransmit the lost packet, and continue to send more data as rapidly as its application supplies it
- This way, it may discover that by sending packets at the greatest rate it can sustain, it will push more data through the congested packet forwarder
- Problem: If this is the standard mode of operation of all end hosts, congestion will set in and all will suffer
- “The tragedy of the commons”
Two things that the end-to-end protocol can do
- Be careful about the use of timers
- Involves setting the timer’s value
- Pace the rate at which it sends data
- Automatic rate adaptation
- Involves managing the flow control window
- Both require having an estimate of the round-trip time between the two ends of the protocol
The retransmit timer
With congestion, an expired timer may imply that either a queue in the network has grown too long, or a packet forwarder has intentionally discarded the packet
We need to reduce the rate of retransmissions
- Idea: RTT estimates
- Develop an estimate of the RTT by directly measuring it
- Longer queuing delays will increase the observed RTT
- These observations will increase the RTT estimate used for setting future retransmit timers
When a timer does expire, exponential backoff for the timer interval should be used for retransmitting the same packet
- Effectively avoids contributing to congestion collapse
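A minimal sketch (constants are illustrative, in the spirit of the classic EWMA estimator; not taken from the original text) of adapting the retransmit timer to measured RTTs and backing off exponentially when it expires:

```python
# Illustrative retransmit-timer management (EWMA smoothing; numbers are made up).
class RetransmitTimer:
    def __init__(self):
        self.srtt = None        # smoothed RTT estimate, in seconds
        self.rto = 1.0          # current retransmit timeout

    def observe_rtt(self, sample):
        # Longer queuing delays show up as larger samples and push the estimate up.
        if self.srtt is None:
            self.srtt = sample
        else:
            self.srtt = 0.875 * self.srtt + 0.125 * sample
        self.rto = 2.0 * self.srtt           # timer set comfortably above the estimate

    def on_timer_expired(self):
        self.rto *= 2.0                      # exponential backoff before retransmitting
        return self.rto
```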
Automatic rate adaptation
The flow control window and the receiver's buffer should both be at least as large as the bottleneck data rate multiplied by the RTT - the BDP (bandwidth-delay product)
But if it is larger, it will result in more packets piling up in the queue of the bottleneck link
We need to ensure that the flow control window is no larger than necessary
TCP design
In the original TCP design, the only form of acknowledgement to the sender was “I have received all the bytes up to X”
- But not in the form of “I am missing bytes Y through Z”
The consequences
- When a timer expired due to a lost packet, as soon as the sender retransmitted that packet, the timer of the next packet expired, causing its retransmission
- This will repeat until the next acknowledgment returns, a full round trip later
- On long-delay routes, the flow control window may be large
- Each discarded packet will trigger retransmission of a window full of packets
- Congestion Collapse
Solution
By the time this effect was noticed, TCP was already widely deployed, so changes to TCP were severely constrained
- They have to be backward compatible
The result
- One expired timer leads to slow start
- Send just one packet, and wait for its acknowledgement
- For each acknowledged packet, add one to the window size
- In each RTT, the number of packets that the sender sends doubles
- This repeats till
- The receiver’s window size has been reached
- The network is not the bottleneck
- A packet loss has been detected
Duplicate acknowledgment
The receiving TCP implementation is modified slightly
- Whenever it receives an out-of-order packet, it sends back a duplicate of its latest acknowledgment - Such a duplicate can be interpreted by the sender as a NAK
The sender then operates in an equilibrium mode
- Upon duplicate acknowledgment
- The sender retransmits just the first unacknowledged packet and also drops its window size to some fixed fraction of its previous size
- After this it probes gently for more capacity by doing
- Additive increase: Whenever all the packets in an RTT are successfully acknowledged, the sender increases the window size by 1
- Multiplicative decrease: Whenever a duplicate acknowledgment arrives again, the sender decreases the size of the window again, by a fixed fraction
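A minimal sketch (hypothetical event hooks, not the actual TCP implementation) of the window dynamics described above: slow start doubles the window every RTT, then equilibrium mode uses additive increase and multiplicative decrease:

```python
# Hypothetical congestion-window dynamics: slow start, then AIMD equilibrium.
class CongestionWindow:
    def __init__(self):
        self.cwnd = 1.0            # window size, in packets
        self.slow_start = True

    def on_ack(self):
        if self.slow_start:
            self.cwnd += 1.0                 # +1 per ACK: doubles the window each RTT
        else:
            self.cwnd += 1.0 / self.cwnd     # additive increase: about +1 per RTT overall

    def on_duplicate_ack(self):
        self.cwnd = max(1.0, self.cwnd / 2)  # multiplicative decrease (fraction = 1/2 here)
        self.slow_start = False              # retransmit first unacked packet, probe gently

    def on_timeout(self):
        self.cwnd = 1.0                      # expired timer: back to slow start
        self.slow_start = True
```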
Real-world measurement studies have shown
- By probing 5000 popular web servers, researchers have found that only 15-20% use AIMD TCP
What about the other 80%?
- It turns out that web servers, which are what define the Internet, use a wide variety of different TCP protocols, each with its own congestion control protocol
Problem of AIMD
As Internet evolves, the number of long latency and high bandwidth networks grows
- The bandwidth-delay product (BDP)
- The total number of packets in flight, determined by the flow control window size on the sender, must fully utilize the bandwidth
- Standard AIMD TCP increases its congestion window size too slowly in high BDP environments
- Example: Bandwidth 10 Gbps, RTT 100 ms, packet size 1250 bytes: it takes 10,000 seconds to fully utilize the link
- Window needed: $\frac{10 \times 10^{9} \times 0.1}{1250 \times 8} = 100{,}000$ packets; growing by one packet per RTT, $100{,}000 \times 0.1\,\text{s} = 10{,}000$ seconds
New congestion control design objectives
- Highly scalable to high BDP environments
- Very slow (gentle) window increase at the saturation point
- Fair to AIMD TCP flows
- Backward compatibility
Binary Increase Congestion Control (BIC)
After a packet loss, BIC reduces its window by a multiplicative factor of $\beta$; the default is 0.2
- The window size just before reduction is set to $W_{max}$, and after reduction to $W_{min}$
- In the next step, it finds the midpoint of these two sizes and jumps there - binary search
- But if the midpoint is very far from $W_{min}$, a constant value called $S_{max}$ limits the jump
- If there is no loss, $W_{min}$ is set to the new window size
- The process continues until the increment is less than a constant value $S_{min}$
- Then the window is set to the maximum window $W_{max}$
- If there is still no loss, a new maximum must be found, and BIC enters a "max probing" phase
- The window growth function there is exactly symmetric to the previous part
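A simplified sketch of the BIC-style update (the $\beta$ value matches the default above; the $S_{max}$ and $S_{min}$ values are made up, and real BIC has more cases):

```python
# Illustrative BIC-style window update, simplified.
BETA, S_MAX, S_MIN = 0.2, 32.0, 0.01

class BICWindow:
    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.w_max = cwnd

    def on_loss(self):
        self.w_max = self.cwnd                      # remember the size just before reduction
        self.cwnd = self.cwnd * (1 - BETA)          # multiplicative reduction -> W_min

    def on_rtt_without_loss(self):
        if self.cwnd < self.w_max:
            target = (self.cwnd + self.w_max) / 2   # binary search toward W_max
            increment = min(target - self.cwnd, S_MAX)   # cap the jump at S_max
            increment = max(increment, S_MIN)            # never creep slower than S_min
        else:
            # Max probing: grow away from W_max, roughly symmetric to the search phase.
            increment = min(self.cwnd - self.w_max + S_MIN, S_MAX)
        self.cwnd += increment
```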
BIC works very well in production, but in low speed or short RTT networks it is too aggressive for TCP
Different phases like binary search increase, max probing, $S_{max}$ and $S_{min}$, make its implementation not very efficient
A new congestion control protocol is required to solve these problems, while keeping its advantages of stability and scalability
CUBIC
As the name suggests, it uses cubic function for window growth
- It uses time instead of RTT to increase the window size
- It contains a “TCP mode” to behave the same as AIMD TCP when RTTs are short
After a packet loss, reduces its window by a multiplicative factor of $\beta$, the default is 0.2
- The window size just before reduction is set to $W_{max}$
- After it enters into congestion avoidance, it starts to increase the window using a cubic function
- The plateau of the cubic function is set to $W_{max}$
- The size of the window grows in concave mode to reach $W_{max}$, then it enters the convex part
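A small worked sketch of the cubic growth curve (C = 0.4 is the commonly cited CUBIC scaling constant, and $\beta$ = 0.2 matches the reduction factor above; this is an illustration, not the kernel implementation):

```python
# Illustrative CUBIC window growth: W(t) = C*(t - K)^3 + W_max.
C, BETA = 0.4, 0.2

def cubic_window(t, w_max):
    """Window size t seconds after the last loss event."""
    # K is the time at which the curve reaches its plateau W_max.
    k = (w_max * BETA / C) ** (1.0 / 3.0)
    return C * (t - k) ** 3 + w_max   # concave up to W_max, convex beyond it

w_max = 100.0
for t in (0, 2, 4, 6, 8):
    print(t, round(cubic_window(t, w_max), 1))
# The window starts at W_max * (1 - beta) = 80, flattens near W_max around t = K,
# then probes upward in the convex region.
```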
BBR
Congestion-Based Congestion Control
BBR algorithm differs from loss-based congestion control in that it pays relatively little attention to packet loss
- Instead, it focuses on the actual bottleneck bandwidth
- In the “startup” state, it behaves like most traditional congestion-control algorithms
- Quickly ramps up the sending rate in an attempt to measure the bottleneck bandwidth
Stop Ramping Up Early
- Instead of continuing to ramp up until it experiences a dropped packet, it watches the bottleneck bandwidth measured for the last several round-trip times
- Once the bottleneck bandwidth stops rising, BBR concludes that it has found the effective bandwidth of the connection and can stop ramping up
- This has a good chance of happening well before packet loss would begin
Drain State
- In measuring the bottleneck bandwidth, BBR probably transmitted packets at a higher rate for a while
- Some of them will be sitting in queues waiting to be delivered
- To drain those packets out of the buffers where they languish, BBR will go into a drain state
- During which it will transmit below the measured bandwidth
- Until it has made up for the excess packets sent before
Steady State
- Once the drain phase is done
- BBR goes into the steady-state mode
- Where it paces outgoing packets
- So that the number of packets in flight is more-or-less the calculated BDP
- Scale the rate up by 25% periodically to probe for an increase in effective bandwidth
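A much simplified sketch of the steady-state idea (hypothetical names, not the actual BBR implementation): pace at the measured bottleneck bandwidth, keep roughly one BDP in flight, and periodically probe 25% above it:

```python
# Simplified illustration of BBR-style steady-state pacing.
class BBRSketch:
    def __init__(self):
        self.bw_samples = []        # delivery-rate samples from recent round trips
        self.min_rtt = None

    def on_sample(self, delivery_rate, rtt):
        self.bw_samples = (self.bw_samples + [delivery_rate])[-10:]   # last several RTTs
        self.min_rtt = rtt if self.min_rtt is None else min(self.min_rtt, rtt)

    def pacing_rate(self, probing=False):
        bottleneck_bw = max(self.bw_samples)
        return bottleneck_bw * (1.25 if probing else 1.0)   # probe 25% above, periodically

    def target_inflight(self):
        return max(self.bw_samples) * self.min_rtt          # roughly one BDP in flight
```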