This is the first draft of the Extended Essay I wrote for the IB Diploma Programme. At the time we were choosing topics for our essays, I was getting interested in Vector Packet Processing. The word limit for the essay is 4,000 words, and I am at 3,425. If you have any improvements I can make, please contact me.

Introduction

Vector Packet Processing is a novel way to process vast amounts of internet traffic. Compared to the traditional method of scalar packet processing (SPP), vector packet processing (VPP) is far more efficient and far faster. So what are SPP and VPP? Internet traffic is transmitted in small segments of information called “packets”. Each packet contains control information (headers describing where the packet should go) and a payload (the user data actually being transmitted). Think of a packet like a letter: what you want to send is inside the envelope, and the source and destination addresses are written on the envelope itself.

However, these envelopes can only be so big, which means computers send thousands to hundreds of thousands of packets per second. A computer processor (the “brain” of the computer) can only process so many packets per second. Modern processors run at about three to four billion cycles per second (three to four gigahertz), but it takes tens of thousands of cycles to process a single packet, so a computer can only process a few hundred thousand packets per second. This is fine for personal computers, which do not need to transmit the whole internet at once. But for large-scale companies such as Facebook, Google, Amazon, Microsoft, and Netflix, which process incredible amounts of traffic, computers run into a roadblock.

A Brief History of Computer Processors

From the 1980s to the early 2000s, the speed of computer processors increased from a few megahertz to several gigahertz. Due to physical constraints, the rate of clock speed development has since slowed down. To keep making processors more powerful than their predecessors, manufacturers began putting multiple processors (cores) onto one chip. Over the past two decades, a paradigm shift has occurred: processors gained more cores per package instead of running each core faster. In 2004, Intel’s Pentium 4 processors reached clock speeds of 3.8 gigahertz. In 2020, sixteen years later, AMD released the Ryzen 9 5950X, with a boost clock of just 4.9 gigahertz. That may seem like a respectable increase, but compared to the climb from a piddly few megahertz in the early 1980s to multiple gigahertz twenty years later (an increase of several orders of magnitude), processor speeds are reaching a plateau.

Computers began as single-core machines, able to run only one program at a time (they appeared to run multiple programs at once by “sharing” running time: in one slice of time, one program updates; in the next, another program updates, and so forth). Modern desktop processors, by contrast, have eight or more cores, meaning they can truly run eight programs at once. SPP cannot take advantage of that. Packets must be processed in order, and parallelizing packet processing runs into an issue called a race condition: if eight packets are processed at once, it is uncertain which will finish first and which will finish last. This is due to incredibly small factors, such as the time it takes electricity to travel two micrometers versus three micrometers, the purity of the silicon in a board, and various other infinitesimal details. Regardless of how small these details are, they still affect timings at the scale of one billionth of a second. Scalar packet processing therefore cannot take advantage of multiple cores. As internet use increases, there is a desperate need for an internet connection that can keep up. This is where Vector Packet Processing comes in.

Principle of VPP

The VPP project processes packets in vectors (hence the name): batches of up to a few hundred packets handled at the same time. Instead of processing one packet at a time, VPP processes many packets at once, which leads to increased throughput and lower latency. But with SPP so widespread, in what situations is VPP better suited? Because of its simplicity, SPP can remain in use on personal devices where high throughput is not needed. VPP will replace SPP where high throughput and low latency are critical, such as at internet service providers and large technology companies.

What Is VPP?

The premises that VPP is built on are twofold: one is to process packets in vectors (up to two hundred and fifty-six packets at a time), and the other is to optimize the processing of those packets so that the processor’s instruction cache is utilized to the fullest. The instruction cache keeps copies of the instructions the processor executed most recently, in case they are needed again. As in a physical cache, only the most important items are kept. The processor decides what is most important through a replacement algorithm, such as LRU (least recently used, which discards the item that has gone unaccessed the longest) or LFU (least frequently used, which discards the least “hit”, or read, item). In VPP, each vector is processed through “nodes”.
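To make the replacement policy concrete, here is a toy illustration of LRU eviction written in C. This is only a sketch of the idea, not how a real processor implements its instruction cache (real caches are set-associative structures built in hardware, and the four-slot size here is made up for illustration).

#include <stdio.h>

/* A toy four-entry, fully associative cache with LRU replacement.
   Real CPU caches do this in hardware; this sketch only
   illustrates the eviction policy described above. */
#define CACHE_SLOTS 4

struct slot {
    int tag;            /* which "instruction address" is cached here */
    unsigned last_used; /* logical timestamp of the most recent access */
    int valid;
};

static struct slot cache[CACHE_SLOTS];
static unsigned now;

/* Returns 1 on a cache hit, 0 on a miss (after loading the entry). */
static int access_cache(int tag)
{
    int lru = 0;
    now++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            cache[i].last_used = now;   /* refresh on hit */
            return 1;
        }
        if (!cache[i].valid || cache[i].last_used < cache[lru].last_used)
            lru = i;                    /* track the least recently used slot */
    }
    cache[lru].tag = tag;               /* miss: evict the LRU slot */
    cache[lru].last_used = now;
    cache[lru].valid = 1;
    return 0;
}

int main(void)
{
    int trace[] = {1, 2, 3, 4, 1, 5, 1}; /* accessing 5 evicts 2; 1 stays hot */
    for (int i = 0; i < 7; i++)
        printf("access %d: %s\n", trace[i], access_cache(trace[i]) ? "hit" : "miss");
    return 0;
}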

[Figure: Vector Packet Processing node graph]

Within each node is a set of instructions that processes the vector partway; the vector is then handed to the next node. Because each node is abstracted away from the others, the set of instructions the processor must keep track of at any one time is small. Each cycle, the processor runs a few instructions per core to move the vector through the graph.
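As a rough sketch of that structure in C: each node is just a function that runs over the whole vector before handing it off. The node names below echo real VPP graph nodes (ethernet-input, ip4-lookup, and so on), but the bodies are empty stubs; in real VPP, nodes are registered with the VLIB_REGISTER_NODE macro and the next node is chosen per packet by the graph dispatcher.

#include <stddef.h>

#define VECTOR_SIZE 256

struct packet {
    unsigned char *data;
    size_t len;
};

typedef void (*node_fn)(struct packet *pkts, int n_pkts);

/* Stub nodes: each would hold the instructions for one processing step. */
void ethernet_input(struct packet *pkts, int n)   { (void)pkts; (void)n; /* parse L2 headers   */ }
void ip4_lookup(struct packet *pkts, int n)       { (void)pkts; (void)n; /* route table lookup */ }
void ip4_rewrite(struct packet *pkts, int n)      { (void)pkts; (void)n; /* rewrite MAC header */ }
void interface_output(struct packet *pkts, int n) { (void)pkts; (void)n; /* hand to the NIC    */ }

/* A fixed pipeline standing in for VPP's dynamic graph. */
static const node_fn graph[] = {
    ethernet_input, ip4_lookup, ip4_rewrite, interface_output,
};

void process_vector(struct packet *pkts, int n_pkts)
{
    /* Each node's instructions stay hot in the instruction cache while
       it works through all n_pkts packets; only then does the next
       node's code need to be loaded. */
    for (size_t i = 0; i < sizeof(graph) / sizeof(graph[0]); i++)
        graph[i](pkts, n_pkts);
}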

Cache Thrashing and SPP

The instructions the processor needs to process a packet are loaded from the computer’s RAM if they are not already in the cache. The cache is then “warm”, or ready to be used. The vector moves through the first node, reusing that warm cache again and again; then it is passed to the next node, and the process repeats until the packets leave the machine toward their destination. With SPP, by contrast, one packet at a time is passed through all the nodes. The cache, which keeps the best instructions according to its replacement algorithm, is eventually invalidated, so the next packet cannot use the cache properly and has to load its instructions from slower sources. That causes a significant slowdown. VPP, by processing many packets at once, maximizes the cache hit rate, decreases latency (the time it takes to process a single packet), and keeps down the number of “wasted” processor cycles in which the processor idles while waiting for instructions to arrive. Decreasing wasted cycles makes the processor more efficient because it always has work to do. Think of a factory with many workers: a worker who stands around doing nothing while still being paid by the hour makes the factory less cost-efficient than it could be. The same principle applies to processors. With SPP, “cache thrashing” occurs: the cache is continuously written to and invalidated but rarely read from, because the processor keeps loading new instructions, minimizing cache hits. This slows the computer significantly, since instructions must be moved into the cache far more often than necessary.
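Reusing the stub nodes from the sketch above, the difference between the two approaches is nothing more than the loop order, yet it decides whether each node’s instructions are loaded once per vector or once per packet:

/* Scalar: every packet walks the whole pipeline before the next one
   starts, so each node's instructions are evicted and reloaded over
   and over -- the cache thrashing described above. */
void process_scalar(struct packet *pkts, int n)
{
    for (int i = 0; i < n; i++) {
        ethernet_input(&pkts[i], 1);
        ip4_lookup(&pkts[i], 1);
        ip4_rewrite(&pkts[i], 1);
        interface_output(&pkts[i], 1);
    }
}

/* Vector: each node runs once over the whole batch, so the first
   packet pays the instruction-cache miss and the remaining n - 1
   packets run on a warm cache. */
void process_vector_batched(struct packet *pkts, int n)
{
    ethernet_input(pkts, n);
    ip4_lookup(pkts, n);
    ip4_rewrite(pkts, n);
    interface_output(pkts, n);
}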

Along with instruction cache hits and misses, processing packets in vectors is also more efficient in terms of the data cache. The data cache keeps frequently used data close to the processor to decrease latency, just as instructions are kept in the instruction cache. Because many packets are processed at once, the per-packet call stack shrinks and the data cache stays in constant use. With SPP, each packet builds up a large call stack from the many operations performed on it, so the data cache is thrashed as well by constant updates and invalidations (a sketch of how VPP counters this with prefetching follows the quote and analogy below). In the words of the FD.io VPP project,

Because the first packet in the vector warms up the instruction cache, the remaining packets tend to be processed at extreme performance. The fixed costs of processing the vector of packets are amortized across the entire vector. This leads not only to very high performance, but also statistically reliable performance. If VPP falls a little behind, the next vector contains more packets, and thus the fixed costs are amortized over a larger number of packets, bringing down the average processing cost per packet, causing the system to catch up. As a result, throughput and latency are very stable. If multiple cores are available, the graph scheduler can schedule (vector, graph node) pairs to different cores.

A simple analogy is building one hundred chairs: it is easier, faster, and more efficient to make all the parts for every chair first, then swap tools and assemble them all, than to build one complete chair at a time.
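On the data-cache side, VPP’s nodes also help the hardware along explicitly: inside their hand-unrolled loops they prefetch the data of packets a few slots ahead (VPP’s source does this with its CLIB_PREFETCH macro). A simplified version of the pattern, using GCC’s __builtin_prefetch and the packet struct from the earlier sketches, might look like this:

/* While working on packet i, ask the hardware to start pulling
   packet i+4's data into the data cache so it is already there
   when its turn comes. */
void ip4_lookup_with_prefetch(struct packet *pkts, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 4 < n)
            __builtin_prefetch(pkts[i + 4].data, 0 /* read */, 3 /* keep */);

        /* ... the per-packet lookup work on pkts[i] goes here ... */
    }
}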

Hardware Acceleration

Not only can packets be processed more efficiently with VPP’s node-based, vector-based architecture, but processing can be sped up further with dedicated hardware. The node-based architecture lends itself very well to hardware acceleration: certain nodes can be replaced with dedicated hardware that performs the same work much faster than a general-purpose processor. Because such a chip is dedicated to one very specific task, these application-specific integrated circuits (ASICs) increase efficiency and improve throughput even more, since the hardware is specialized for processing vectors of packets. A hole can be drilled with a Swiss Army knife, but it is much easier and faster to use a drill press. The Swiss Army knife can do many things, just not as well as specialized equipment. However, specialized equipment is just that: specialized. It cannot be used for anything outside its function. Highly specialized network-processing hardware used to be exorbitantly expensive due to low interoperability between hardware vendors; this is how Cisco gained its dominance in the networking industry. With the VPP platform, however, nodes in the processing graph can be replaced with hardware that is open and standardized, decreasing deployment costs.

VPP and DPDK

There is an important distinction to be made between VPP and DPDK. VPP is the userspace software in which packets are processed; DPDK (the Data Plane Development Kit) is the library VPP uses to interact with the system. VPP is an algorithm built with DPDK as its base. DPDK without VPP offers some speedup on its own, but most importantly it offers kernel networking bypass. This keeps the operating system’s kernel out of the networking path and gives greater flexibility to the software that is actually doing the networking. Because the kernel processes packets through interrupts (everything pauses, the packet is processed, everything resumes), changes in network throughput can heavily impact the performance of the rest of the system. Because VPP and DPDK run in userspace, packet processing can happen on another thread without impacting the rest of the system.
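Here is a minimal sketch of what kernel bypass looks like from the programmer’s side, using DPDK’s actual rte_eth_rx_burst() API. Everything else is assumed: the port is taken to be already configured and started (via rte_eal_init(), rte_eth_dev_configure(), and friends), and the processing step is left as a comment.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 256

/* A dedicated polling thread: instead of the kernel interrupting
   the whole system for each packet, this loop asks the NIC for
   packets in bursts and never blocks. */
void poll_loop(void)
{
    struct rte_mbuf *burst[BURST_SIZE];

    for (;;) {
        /* Non-blocking: returns 0..BURST_SIZE packets immediately. */
        uint16_t n = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                      burst, BURST_SIZE);
        if (n == 0)
            continue;   /* nothing arrived; poll again */

        /* In VPP, this is where the dpdk-input node would turn the
           burst into a vector and push it into the node graph. */

        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(burst[i]);   /* this sketch just drops them */
    }
}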

Experimental Design

Just how much faster is VPP than SPP, and under what conditions should it be used? To determine the difference in speed, and to gain a deeper understanding of VPP through hands-on learning, I created an experiment to measure the speedup of VPP over traditional kernel-based SPP. The original purpose of this essay was to determine experimentally whether VPP is faster than traditional methods.

Experimental Process

First, on the host machine, I installed VPP from the official FD.io package repository. Then I created a virtual machine and repeated the process. Both machines ran Rocky Linux 8.4 with SELinux disabled. For virtualization, I used QEMU backed by the Linux Kernel-based Virtual Machine (KVM) to ensure minimal overhead, and I set up a bridged network connection between the two machines.

Utter Failure

As I continued to read the documentation, I realized there was no feasible way for me to test VPP without purchasing real hardware. VPP does not behave the same in purely virtualized environments as it does on bare metal: it still requires physical network interfaces for the virtual machines to bind to, and I had no dedicated network interfaces to give them. I could only use paravirtualized network interfaces, which offer better performance than fully emulated network adapters. Using SPP in the virtual environment, I was able to hit the upper limit of the virtualized network connection at ten gigabits per second; VPP, unfortunately, simply refused to run on the virtualized connection. After many weeks of attempting to make the experiment work, I realized I needed either to purchase equipment capable of testing VPP or to use pre-existing results and experiments from FD.io, Intel, and other independent researchers. This meant shifting the scope of my extended essay. Originally, my goal was to test VPP at a small scale and determine experimentally whether it is faster or slower than SPP. Because I am now using the data and experiments of other researchers and companies, my scope shifted to determining when VPP ought to be used; as with every other technology, there is a niche that VPP must fill, and I set out to identify it. Although I was disappointed with my experiment’s results (or lack thereof), I believe that what I learned from experiments that could encompass mine hundreds of times over is much more valuable, and it speaks to the reason I chose this topic: to learn about an interesting technology.

However, because the experiments other researchers have done were immense in scale and high in complexity, it took me even more time just to comprehend the data some of them gathered. Although some of the experiments I analyze below were simple, the CSIT experiments turned out to be vast in scale. For the sake of simplicity, I selected only the few experiments from that section that I believe are most representative of VPP’s use case.

Experiment Results and Analysis

Analysis of Intel’s Experiment

One experiment representative of how VPP performs in the real world was done by Intel, as a tutorial for users of its networking hardware and processors.

In this tutorial, three systems named csp2s22c03, csp2s22c04, and net2s22c05 are used. The system csp2s22c03, with VPP installed, forwards packets, while csp2s22c04 and net2s22c05 pass traffic. All three systems are equipped with Intel® Xeon® E5-2699 v4 processors @ 2.20 GHz (two sockets, 22 cores per socket) and run 64-bit Ubuntu 16.04 LTS. The systems are connected with Intel® Ethernet Converged Network Adapter XL710 10/40 GbE cards.

In this experiment, VPP acts as a network switch and router, sending traffic from one machine to another. This is a very common use case in large-scale datacenters, where hundreds of thousands of machines must communicate with each other.

[Figure: Intel connections]

[Figure: Intel throughput]

This experiment produces multiple results. The first is kernel-based packet forwarding using SPP. Using the iperf utility to measure network throughput, the following result is generated:

Connecting to host 10.10.1.2, port 5201
[  4] local 10.10.2.2 port 54074 connected to 10.10.1.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   936 MBytes  7.85 Gbits/sec  2120    447 KBytes
[  4]   1.00-2.00   sec   952 MBytes  7.99 Gbits/sec  1491    611 KBytes
[  4]   2.00-3.00   sec   949 MBytes  7.96 Gbits/sec  2309    604 KBytes
[  4]   3.00-4.00   sec   965 MBytes  8.10 Gbits/sec  1786    571 KBytes
[  4]   4.00-5.00   sec   945 MBytes  7.93 Gbits/sec  1984    424 KBytes
[  4]   5.00-6.00   sec   946 MBytes  7.94 Gbits/sec  1764    611 KBytes
[  4]   6.00-7.00   sec   979 MBytes  8.21 Gbits/sec  1499    655 KBytes
[  4]   7.00-8.00   sec   980 MBytes  8.22 Gbits/sec  1182    867 KBytes
[  4]   8.00-9.00   sec  1008 MBytes  8.45 Gbits/sec  945    625 KBytes
[  4]   9.00-10.00  sec  1015 MBytes  8.51 Gbits/sec  1394    611 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  9.45 GBytes  8.12 Gbits/sec  16474             sender
[  4]   0.00-10.00  sec  9.44 GBytes  8.11 Gbits/sec                  receiver

Here, the bandwidth averages about 8.12 gigabits per second over the whole test. This is similar to the benchmark I ran without VPP in my own experiment. Using VPP, however, raises the throughput by a factor of two and a half:

Connecting to host 10.10.1.2, port 5201
[  4] local 10.10.2.2 port 54078 connected to 10.10.1.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  2.02 GBytes  17.4 Gbits/sec  460   1.01 MBytes
[  4]   1.00-2.00   sec  3.28 GBytes  28.2 Gbits/sec    0   1.53 MBytes
[  4]   2.00-3.00   sec  2.38 GBytes  20.4 Gbits/sec  486    693 KBytes
[  4]   3.00-4.00   sec  2.06 GBytes  17.7 Gbits/sec  1099   816 KBytes
[  4]   4.00-5.00   sec  2.07 GBytes  17.8 Gbits/sec  614   1.04 MBytes
[  4]   5.00-6.00   sec  2.25 GBytes  19.3 Gbits/sec  2869   716 KBytes
[  4]   6.00-7.00   sec  2.26 GBytes  19.4 Gbits/sec  3321   683 KBytes
[  4]   7.00-8.00   sec  2.33 GBytes  20.0 Gbits/sec  2322   594 KBytes
[  4]   8.00-9.00   sec  2.28 GBytes  19.6 Gbits/sec  1690  1.23 MBytes
[  4]   9.00-10.00  sec  2.73 GBytes  23.5 Gbits/sec  573    680 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  23.7 GBytes  20.3 Gbits/sec  13434             sender
[  4]   0.00-10.00  sec  23.7 GBytes  20.3 Gbits/sec                  receiver

Going from 8.12 gigabits per second to 20.3 gigabits per second by bypassing kernel-level networking is a dramatic increase. This experiment sets a precedent for how VPP performs in real-life scenarios.

Analysis of FD.io CSIT Experiments

The FD.io project, to ensure that no update to the software degrades performance, runs a continuous integration system that automatically tests each release of VPP on various configurations of test hardware. According to the FD.io CSIT (Continuous System Integration and Testing) website,

FD.io CSIT system design needs to meet continuously expanding requirements of FD.io projects including VPP, related subsystems (e.g. plugin applications, DPDK drivers) and FD.io applications (e.g. DPDK applications), as well as growing number of compute platforms running those applications. With CSIT project scope and charter including both FD.io continuous testing AND performance trending/comparisons, those evolving requirements further amplify the need for CSIT framework modularity, flexibility and usability.

On the CSIT website, there is historical as well as current data on VPP performance. So how is the CSIT system structured? At the core of the tests are SUTs (systems under test), DUTs (devices under test), and TGs (traffic generators). On top of these sit layers that interpret the raw information, parse it, and present it to the end user.

[Figure: CSIT design]

Each test contains the setup, configuration, test, and verification steps for each configuration of the system. This is done to properly simulate the real-life conditions in which VPP will be used, rather than artificial situations.

VPP and DPDK Measurement Metrics

In order to determine the efficacy of VPP and DPDK, there are multiple metrics that ought to be measured, the most important of which is packets per second. Packet throughput is measured in a Maximum Receive Rate (MRR) test. According to the CSIT website, “Performance trending relies on Maximum Receive Rate (MRR) tests. MRR tests measure the packet forwarding rate, in multiple trials of set duration, under the maximum load offered by the traffic generator regardless of packet loss. Maximum load for specified Ethernet frame size is set to the bi-directional link rate.” Under the MRR test for 64b-l2switching-base-avf on a machine running two threads on one core (2t1c), about 37 million packets are processed per second.

Test Results

Even on this single core, and assuming an average packet size of 1400 bytes, that works out to over four hundred gigabits per second of throughput: 37 million packets per second × 1400 bytes per packet × 8 bits per byte ≈ 414 Gbit/s. (The MRR test itself uses 64-byte frames, so this is an optimistic extrapolation that assumes the packet rate holds at larger sizes.) That is roughly one Ultra-HD Blu-ray disc every second. This is far more than SPP can reasonably achieve (see the Intel performance tests above). Across the rest of the CSIT website, this trend continues with different machines, different tests, and different situations: VPP with DPDK is simply much faster than kernel-level SPP. So why is VPP not installed on everyone’s computer?
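The arithmetic behind those figures, as a quick sanity check in C (the 37 million packets per second is the MRR result quoted above; the frame sizes are my own illustrative choices):

#include <stdio.h>

/* Convert a packet rate into line throughput for a few frame sizes. */
int main(void)
{
    double pps = 37e6;                 /* packets per second from the MRR test */
    int frame_sizes[] = {64, 512, 1400};

    for (int i = 0; i < 3; i++) {
        double gbps = pps * frame_sizes[i] * 8 / 1e9;
        printf("%4d-byte packets: %6.1f Gbit/s\n", frame_sizes[i], gbps);
    }
    return 0;   /* prints ~18.9, ~151.6, and ~414.4 Gbit/s */
}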

Conclusion

Although I was unable to perform my own experiment to determine the difference between VPP and kernel-based SPP, I believe I was still able to support my earlier hypothesis using data from previous experiments and other third-party sources. Because of these difficulties, my extended essay shifted in scope from a pure experiment to an analysis of others’ experiments and of a novel technology. VPP has been around for nearly two decades at this point, but it and DPDK have only begun to gain traction over the past few years. There is a need for both technologies: with Moore’s Law becoming increasingly irrelevant, we must turn to different, more efficacious technologies to solve the same problem.

Through my analysis, I have concluded that VPP ought to be used only in enterprise environments where more than ten gigabits of information flow per second. First, SPP works just fine at speeds at or below ten gigabits per second on commodity hardware. It may not be the most efficient solution, but it is certainly the simplest. Because SPP is supported at the kernel level (as opposed to VPP, which runs in userspace), it requires no additional programs, no constant maintenance, and no configuration tweaking, making it viable for low-cost network deployments. VPP, by contrast, is difficult to set up, debug, and tune for maximum speed, which costs considerable time and money compared to the relative ease of SPP. Since VPP is only efficacious in extremely high-throughput, low-latency environments such as the exascale datacenters of Facebook and Google, the environments where VPP is deployed also ensure that network specialists are on hand to keep it running at peak efficiency. Consumers will not see VPP on their machines in the near future; most consumer devices do not even have network hardware capable of transmitting beyond one gigabit per second, a speed SPP handles perfectly well.

But is this technology worth being excited about if SPP is good enough? Yes. Although SPP will likely reign supreme in the consumer world, the effects of VPP will be felt by consumers. As the amount of data transmitted over the internet continues to grow exponentially, companies need ways to fulfill consumers’ demand for more and more bandwidth without running into exorbitant costs. Given the increasing uptake of VPP, I believe the future looks bright for the development of the internet.

Sources

“Build a Fast Network Stack with Vector Packet Processing (VPP) on an Intel Architecture Server.” Intel.com, https://software.intel.com/content/www/us/en/develop/articles/build-a-fast-network-stack-with-vpp-on-an-intel-architecture-server.html. Accessed 10 Oct. 2021.

Stallings, William. Business Data Communications. Archive.org, https://archive.org/details/businessdatacomm00stal/page/632/mode/2up. Accessed 10 Oct. 2021.

“CSIT-2106 — FD.io CSIT-2106.32 Rls2106 Documentation.” docs.fd.io, https://docs.fd.io/csit/rls2106/report. Accessed 10 Oct. 2021.

Lawson, Stephen. “Cisco’s CRS-1 Router Reaches Five-Year Milestone.” InfoWorld, 27 May 2009, https://www.infoworld.com/article/2632331/cisco-s-crs-1-router-reaches-five-year-milestone.html.

Linguaglossa, Leonardo, et al. “High-Speed Software Data Plane via Vectorized Packet Processing.” telecom-paristech.fr, https://perso.telecom-paristech.fr/drossi/paper/vpp-bench-techrep.pdf. Accessed 10 Oct. 2021.

“The TCP/IP Guide - Understanding the OSI Reference Model: An Analogy.” tcpipguide.com, http://www.tcpipguide.com/free/t_UnderstandingTheOSIReferenceModelAnAnalogy.htm. Accessed 10 Oct. 2021.

“Vector Packet Processing.” TNSR.com, https://www.tnsr.com/vpp. Accessed 10 Oct. 2021.

“What Is Vector Packet Processing? — Vector Packet Processor 01 Documentation.” readthedocs.io, https://fdio-vpp.readthedocs.io/en/latest/overview/whatisvpp/what-is-vector-packet-processing.html. Accessed 10 Oct. 2021.
