The standardization of Distributed Access Architecture (DAA) for Data over Cable Service Interface Specifications (DOCSIS) and further advancements in the flexible MAC (media access control) architecture (FMA) standard have enabled the transition to a software-centric cable network infrastructure. One option that the DAA and FMA architectures allow is for the DOCSIS MAC software to be deployed as a virtual network function (VNF) on general-purpose x86 servers in a multiple-system operator (MSO) headend as a virtualized cable modem termination system (vCMTS), while the DOCSIS physical (PHY) layer is housed in a street unit near subscribers.
The DOCSIS MAC component of a vCMTS is typically built on top of technologies such as Data Plane Development Kit (DPDK) or Vector Packet Processing (VPP), both of which are open-source projects that accelerate packet processing workloads running on a variety of central processing unit (CPU) architectures. One of the main building blocks of DPDK and VPP that enables this acceleration is the poll-mode driver (PMD). PMDs are user-space drivers that continuously poll an Ethernet interface for incoming packets, thereby removing interrupt processing overhead and ultimately boosting performance.
Coinciding with this move to network function virtualization (NFV), there is also a major push across all aspects of society and industry to reduce our carbon footprints. The cable industry is, and should be, no exception. As NFV and vCMTS deployments become increasingly widespread across MSO networks, there have never been more opportunities to find energy efficiencies on a per-server, per-rack, or per-site (headend) basis. This is largely achievable through the use of newly available telemetry and controls to manage hardware and software elements in real time. In fact, it is quite common for the operating system (OS) itself to automatically reduce operating power when it detects that the demand on CPU cores allows entry to a power-optimized state in a given piece of equipment. However, it has traditionally been difficult to identify these periods of lower demand for applications that optimize for performance and relentlessly poll the hardware for packets or events using DPDK PMDs, such as the DOCSIS MAC component of a vCMTS. Because of their polling nature, these applications appear to fully utilize the CPU cores regardless of the network load they are processing. This behavior prevents built-in hardware and OS capabilities from effectively controlling power elements for DOCSIS MAC implementations. Alternative techniques are therefore required to achieve energy efficiencies on such deployments.
This paper will discuss the hardware capabilities present on the latest generations of x86 servers (namely C-states and P-states) and new techniques available to closely match the network load at the lowest possible power for these types of data plane applications. It will also address some common misconceptions regarding the usage of some of these capabilities and techniques. In doing so, the paper aims to set out a pathway towards a greener and more energy-efficient vCMTS. Through further exploration and detailed lab benchmarking, quantifiable benefits for different system and test configurations will provide actionable recommendations for operators and their vendors creating and deploying NFV solutions in the edge or access network.
For the purpose of this paper, the CPU power draw was measured on a dual-processor server running a reference implementation of a vCMTS DOCSIS MAC data plane. The server contained two Intel® Xeon® Scalable Gold 6338N CPUs, each of which has 32 cores. The Intel® vCMTS reference data plane was used as the workload on the platform. Measurements included in the subsequent sections are based on this hardware and software in conjunction with the power management techniques described in detail in the paper. Similar power savings are expected to be achievable on other CPUs and software stacks.
Overview of C-states
C-states are power-saving states that reduce power consumption on a per-core basis by turning off specific portions of a core. Entering a power-optimized C-state leads to power savings. However, C-states also require that execution of instructions on the core be stopped while the core resides in the power-saving state.
Several C-States are supported on x86 server processors. Deeper C-states provide greater power savings as more functions of the core are disabled. However, they also have a higher exit latency. While the increased power savings of deeper C-states are desirable, the increased exit latency results in the core taking longer to start executing instructions upon exit from the power saving state. As more functions of the core are powered off in deeper C-states, there can also be an impact even when the core has restarted instruction execution. For example, data may get flushed from caches in the deepest C-state (C6); this data may have to be reloaded into cache as the core resumes execution. While the power savings provided by C-states may be significant, these associated costs also need to be considered when deciding to enable or disable them.
The following are the traditional C-states on an x86 server:
- C0 is the active core state where the core is executing instructions. In C0 the core is considered fully turned on.
- C1 and C1E (C1 Enhanced) are light C-states. In these states the main CPU clock is stopped via software and the CPU voltage is reduced.
- C6 is the deepest C-state offering the greatest power savings. In C6 the core's L1/L2 cache and last level cache (LLC) are also flushed. Disabling these additional functions increases energy efficiency.
C-states can be controlled either autonomously by hardware or by the OS. When controlled by hardware, they are managed solely by the CPU power control unit. When controlled by the OS, Linux selects C-states through its idle governor, which works with the Linux scheduler and the Advanced Configuration and Power Interface (ACPI). In both cases, entry to a C-state is triggered by reduced load on a core, detected either by the hardware or, at the OS level, by the Linux scheduler.
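The trade-off between power savings and exit latency can be inspected on a running Linux system. The following is a minimal sketch, assuming the standard Linux cpuidle sysfs layout under /sys/devices/system/cpu; the function name read_idle_states and its root parameter are illustrative and not part of any existing tool. The states and latencies reported will vary by CPU, kernel and idle driver.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU attributes.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def read_idle_states(cpu, root=SYSFS_CPU):
    """Return the idle states (C-states) that the kernel exposes for one core.

    Each entry holds the state name, its exit latency and target residency
    in microseconds, and whether the state is currently disabled.
    """
    cpuidle_dir = root / f"cpu{cpu}" / "cpuidle"
    if not cpuidle_dir.is_dir():
        return []  # no cpuidle support exposed for this core
    state_dirs = sorted(
        cpuidle_dir.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for d in state_dirs:
        states.append({
            "index": int(d.name[len("state"):]),
            "name": (d / "name").read_text().strip(),
            "latency_us": int((d / "latency").read_text()),
            "residency_us": int((d / "residency").read_text()),
            "disabled": (d / "disable").read_text().strip() != "0",
        })
    return states
```

Calling read_idle_states(0) on a server lists the states available to core 0, which gives a quick view of how much exit latency each deeper state adds.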
The latest x86 CPU architectures also provide additional, extremely light C-states: C0.1 and C0.2. These new power-saving states allow user-space application software to directly request that the core enter a power-optimized state, with negligible exit latencies on supported CPUs. C0.1 and C0.2 therefore offer power savings without the wake-up cost of the traditional C-states.
Enabling Traditional C-states (C1, C1E, C6)
Due to the exit latencies of the C1, C1E and C6 states and the expectation that they will negatively affect performance, it has generally been considered best practice in the cable industry to disable these C-states entirely for workloads requiring high performance and determinism. Contrary to this view, if used correctly, these C-states can provide significant power savings for such workloads without any negative impact on performance or operation.
Take, for example, a vCMTS deployment on a common off-the-shelf (COTS) server. If C-states are enabled on such a deployment, cores that are not fully utilized will be placed in power-saving C-states. While this is true for cores running threads managed by the Linux scheduler, it is not true for cores running DOCSIS MAC threads that continuously poll the hardware for packets or events via a DPDK PMD. These cores will not enter any idle C-state, as both the CPU power control unit and the OS will detect them as fully utilized regardless of the level of network load they are processing. As a result, these performance-critical polling threads will not be affected, while cores running non-performance-critical threads will transition between C-states. For example, 20 cores of an x86 CPU may be used for performance-optimized polling threads and a further 12 cores for control plane, infrastructure and failover. With such a deployment and at peak network load, we saw power savings of up to 10% by simply enabling C-states. When the system was entirely idle during periods of lower activity, the C-states gave savings of up to 70% of overall CPU power. Figure 3 shows how each CPU core in a dual-CPU x86 server would be used in such a deployment. Enabling C-states on the 2 blue (infrastructure) and 10 yellow (control-plane) cores on each CPU allows the platform to save power.
For greater peace of mind that applications will be unaffected in terms of operation and performance, C-states can be enabled or disabled at a per-core level. Each C-state (C1, C1E and C6) can also be enabled or disabled individually on each core, allowing fine-grained control over which C-states will be used. This becomes significant when we consider that C1 gives good power savings with an exit latency of just a few microseconds, while the C6 exit latency is tens of microseconds (as reported by the ACPI kernel module on some CPUs). Enabling C1 and disabling C6 can be a good tradeoff between achieving power savings and not impacting the operation, performance, and determinism of certain control-plane threads if required. As power becomes increasingly important, identifying previous misconceptions that are costing operators watts is a valuable exercise. Before disabling C-states, operators and vendors should first explore their potential power savings.
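One way to express the "enable C1, disable C6" tradeoff is as a per-core latency budget. The sketch below assumes the standard Linux cpuidle sysfs files (latency and disable under each stateN directory); the function name limit_idle_latency, the core list and the 10 microsecond budget in the usage note are hypothetical choices for illustration. Writing to these files requires root privileges.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU attributes.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def limit_idle_latency(cpus, max_latency_us, root=SYSFS_CPU):
    """Disable idle states whose exit latency exceeds max_latency_us.

    States at or below the budget are (re-)enabled. Returns a list of
    (cpu, state_name, disabled) tuples describing what was applied.
    """
    applied = []
    for cpu in cpus:
        state_dirs = sorted(
            (root / f"cpu{cpu}" / "cpuidle").glob("state[0-9]*"),
            key=lambda p: int(p.name[len("state"):]),
        )
        for d in state_dirs:
            latency_us = int((d / "latency").read_text())
            name = (d / "name").read_text().strip()
            disable = latency_us > max_latency_us
            # "1" disables the state for this core only, "0" enables it.
            (d / "disable").write_text("1" if disable else "0")
            applied.append((cpu, name, disable))
    return applied
```

For example, limit_idle_latency(range(20, 32), 10) would keep light states available on a hypothetical set of control-plane cores while excluding any state slower than 10 microseconds to exit.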
Enabling C-states Efficiencies on DOCSIS MAC Dataplane Cores
While the previous section addressed the use of C-states for threads that do not relentlessly poll hardware for packets, performance optimized DOCSIS MAC threads experiencing periods of lower network load can also benefit from the use of C-states. The main obstacle is that both CPU hardware and the OS are unable to distinguish between high and low load for applications that use optimized polling models. Regardless of the network load they are processing, these types of applications appear to fully utilize the CPU by continuously polling even when they are not receiving packets. This challenge can be overcome by new techniques involving the detection of low loads by the application’s DPDK PMD, the component best placed to determine the true real-time load of the application.
The latest CPUs support the WAITPKG instruction set extension, which allows user-space applications to put the core into one of the two previously mentioned power-optimized states, C0.1 or C0.2. C0.1 has a shorter exit latency than C0.2, and both exit much faster than C1 or C6. Two instructions, in particular, are of interest:
- UMONITOR: Sets up an address range to be monitored by hardware for writes, and activates the monitor.
- UMWAIT: Instructs the core to stop instruction execution until the monitored address range is written to. The core can enter the C0.1 or C0.2 states or switch to a hyper-thread sibling.
Support for these instructions has recently been added to several DPDK Ethernet PMDs via the DPDK power management application programming interface (API). When enabled, these PMDs monitor the number of packets received each time the network interface card (NIC) is polled. After a configurable number of empty reads, the PMD issues the UMONITOR instruction to activate a monitor on the next receive descriptor address of the NIC. It then issues the UMWAIT instruction, allowing the core to enter one of the aforementioned power-optimized states. As soon as the address is written to (signaling that a new packet is available), the core wakes almost immediately and continues processing traffic. Unlike C1 and C6, these lighter C-states provide power savings while remaining suitable for performance-critical cores, thanks to their negligible exit latencies.
This technique can be easily integrated into an application as it is implemented within the PMDs and requires only minor modifications to the application itself to enable the feature. The example code within DPDK can be adapted for any PMD, making this new power saving technique suitable for any user-space application driven by a PMD. Other virtualized workloads have seen a CPU power saving in the region of 2-8%, varying with the network load, and it is expected that a DPDK-based vCMTS deployment would achieve similar energy efficiencies.
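The empty-poll logic described above can be sketched as a small state machine. This is a plain Python model of the control flow only, not DPDK code: the class name EmptyPollTracker and the default threshold of 512 are assumptions made for illustration, and the point at which a real PMD would issue UMONITOR and UMWAIT is represented simply by the method returning True.

```python
from dataclasses import dataclass


@dataclass
class EmptyPollTracker:
    """Decides when a polling core may stop spinning and wait for a packet."""

    empty_poll_threshold: int = 512  # hypothetical default; configurable in practice
    empty_polls: int = 0

    def on_poll(self, packets_received: int) -> bool:
        """Record one poll of the NIC queue.

        Returns True when the core should arm the monitor and wait
        (the UMONITOR/UMWAIT step), and False when it should keep polling.
        """
        if packets_received > 0:
            self.empty_polls = 0  # traffic seen: stay in the fast polling path
            return False
        self.empty_polls += 1
        return self.empty_polls >= self.empty_poll_threshold


def count_wait_requests(bursts, tracker):
    """Count how many polls in a sequence of burst sizes would trigger a wait."""
    return sum(1 for n in bursts if tracker.on_poll(n))
```

In this model a single received packet resets the counter, so a core handling steady traffic never waits, while a core on an idle queue waits on every poll once the threshold is reached.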
Overview of P-states
Performance states, or P-states, describe the specific frequency and voltage at which a core operates while executing instructions. Power can be saved by the reduction in operating frequency and voltage achieved by using lower P-states. P-states allow the frequency and voltage to be controlled on a per-core basis. Unlike C-states, the execution of instructions continues while P-states are adjusted on a core. Thus, P-states provide a mechanism to save power while the core continues to operate, albeit at a lower frequency and voltage, which reduces the instruction execution rate.
The frequency associated with each P-state varies greatly across different CPUs. For this reason, P-states are commonly referred to by name rather than by frequency. P-states are named P1 to Pn, where P1 is the guaranteed base frequency that all cores can run at and Pn is the lowest-frequency P-state, providing the greatest power savings. The operating frequency of P-states is measured in MHz. An example of P-state distribution is shown in Figure 5, ranging from P1 to P15 in 100 MHz decrements.
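The naming scheme can be made concrete with a short calculation. The sketch below assumes a P1 base frequency of 2200 MHz and a 100 MHz step purely for illustration; the actual values depend on the CPU model, and the function name pstate_frequency_mhz is hypothetical.

```python
def pstate_frequency_mhz(pstate, p1_mhz=2200, step_mhz=100):
    """Map a P-state number (P1, P2, ...) to a frequency in MHz.

    p1_mhz and step_mhz are illustrative assumptions, not values
    taken from any specific CPU datasheet.
    """
    if pstate < 1:
        raise ValueError("P-state numbering starts at P1")
    return p1_mhz - (pstate - 1) * step_mhz


# Build the full P1..P15 table under the assumptions above.
PSTATE_TABLE = {f"P{n}": pstate_frequency_mhz(n) for n in range(1, 16)}
```

Under these assumptions P1 maps to 2200 MHz and P15 to 800 MHz, a span of 14 steps of 100 MHz each.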
P-states can be controlled using hardware-controlled power states (HWP), meaning they are controlled entirely by hardware based on the individual load of each core and OS hints. Alternatively, they can be controlled by direct requests from the OS using the intel_pstate or acpi_cpufreq driver. While HWP gives more autonomy in how P-states are controlled, it can, again, be difficult for the hardware to differentiate between peak loads and periods of low demand due to the polling nature of workloads such as a vCMTS. Direct OS control is more useful in such deployments as operators and vendors can use custom software to accurately detect the network load and match the P-states in an energy efficient manner.
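Direct OS control of per-core frequency is exposed on Linux through the cpufreq sysfs interface. The following is a minimal sketch, assuming the standard cpufreq files (cpuinfo_min_freq, cpuinfo_max_freq and scaling_max_freq, all in kHz); the function name set_core_frequency_cap is illustrative. Writing these files requires root privileges, and the exact behavior depends on the active driver and governor.

```python
from pathlib import Path

# Standard Linux sysfs location for per-CPU attributes.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def set_core_frequency_cap(cpu, max_khz, root=SYSFS_CPU):
    """Cap the maximum frequency of one core, clamped to its hardware limits.

    Returns the value (in kHz) that was actually written.
    """
    cpufreq = root / f"cpu{cpu}" / "cpufreq"
    hw_min_khz = int((cpufreq / "cpuinfo_min_freq").read_text())
    hw_max_khz = int((cpufreq / "cpuinfo_max_freq").read_text())
    # Never request a cap outside the range the hardware reports.
    clamped_khz = max(hw_min_khz, min(hw_max_khz, max_khz))
    (cpufreq / "scaling_max_freq").write_text(str(clamped_khz))
    return clamped_khz
```

A custom power agent could call set_core_frequency_cap(4, 1_500_000) to limit a hypothetical data plane core 4 to 1.5 GHz during a quiet period, then raise the cap again when load returns.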
P-states can have an associated transition latency during which core execution is temporarily paused as the core changes from one state to another. This transition latency was once a limiting factor in terms of their usage. However, due to major improvements in the architecture and reductions in this latency, the potential impact on workload latency and jitter is now much less apparent, though it remains a consideration in solutions.
Potential Power Savings using P-state Tuning Techniques
As the P-state is lowered across cores, the CPU power draw reduces accordingly. The potential savings will vary depending on the number of cores being scaled down and the extent to which the P-states are reduced on those cores. Figure 6 below shows the results of an experiment we ran, giving an estimation of the potential power savings for a typical vCMTS deployment on an x86 server where the P-state is lowered to different levels on 40 cores, each running a polling-based implementation of a DOCSIS MAC.
From this initial exploration, we can conservatively predict that the savings achievable by tuning P-states lie somewhere between 5% and 30% of CPU power draw. Energy efficiencies within this range are expected for all the P-state tuning techniques discussed in the subsequent sections.
P-state Tuning Based on Predictable Traffic Patterns
The most basic method to achieve power savings with P-states is to configure a pre-determined core frequency capable of handling the expected network load. For the most part, network load follows a reasonably predictable pattern over a 24-hour period. By studying varying levels of network load and the P-state requirements of a specific DOCSIS MAC implementation, operators and vendors can make accurate estimations of what P-state to set cores to for selected periods of a 24-hour timeframe. Pre-adjusting the frequency of cores in such a manner is sure to provide power savings, particularly at nighttime when networks are usually under-utilized.
This rudimentary approach, however, is not without considerable pitfalls. Pre-configuring the P-state leaves the operator susceptible to unexpected increases in network load atypical of a normal 24-hour period. Such increases in load could be caused by unforeseen events or certain social holidays and would result in degradation of service for the end subscriber. The lack of real-time metrics in this P-state tuning technique gives it clear limitations.
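The pre-determined approach described in this section amounts to a lookup from time of day to core frequency. The sketch below is illustrative only: the schedule boundaries and frequencies are hypothetical and would in practice come from studying the traffic pattern and the requirements of a specific DOCSIS MAC implementation.

```python
import bisect

# Hypothetical schedule: (start_hour, frequency_mhz), sorted by start hour.
# The first entry must start at hour 0 so that every hour is covered.
DAILY_SCHEDULE = [(0, 1200), (7, 1800), (17, 2200), (23, 1500)]


def scheduled_frequency_mhz(hour, schedule=DAILY_SCHEDULE):
    """Return the pre-configured core frequency for a given hour (0-23)."""
    if not 0 <= hour < 24:
        raise ValueError("hour must be in the range 0-23")
    start_hours = [start for start, _ in schedule]
    # Find the last schedule entry whose start hour is <= the given hour.
    idx = bisect.bisect_right(start_hours, hour) - 1
    return schedule[idx][1]
```

Because the lookup depends only on the clock, it cannot react to an unexpected surge in load, which is the limitation noted above.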
In-band P-state Tuning
As was the case with C-states, the application's DPDK PMD is also very well placed to make decisions on the P-state of the cores. The PMD can monitor the load it is feeding to the application and scale the frequency of the core accordingly in a "just-in-time" fashion. As the first reception point of packets, the PMD is the first place capable of detecting an increase in load and quickly scaling up the frequency of the core. Such a technique is an example of in-band P-state tuning, where the decision-making and action to scale the core up or down are contained entirely within the user-space vCMTS application. The exact algorithms used to calculate the P-state setting can vary. The advised approach is to scale down frequency conservatively to save power, but to scale up frequency aggressively to reduce the likelihood of a degradation in service when an unexpected burst of packets is detected. Once the informed P-state decision is made, the PMD adjusts the P-state via the Linux kernel's sysfs pseudo-file system. Again, DPDK contains examples of how this in-band frequency scaling technique can be used on many different CPUs.
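The asymmetric policy advised above (scale down conservatively, scale up aggressively) can be sketched as follows. This is one possible illustration and not the algorithm used in DPDK or in any particular vCMTS: the class name InBandScaler, the watermarks, the frequency range and the number of quiet intervals are all hypothetical parameters.

```python
from dataclasses import dataclass


@dataclass
class InBandScaler:
    """Chooses a core frequency from a load estimate in the range 0.0 to 1.0."""

    min_mhz: int = 800            # hypothetical lowest P-state frequency
    max_mhz: int = 2200           # hypothetical P1 frequency
    step_mhz: int = 100           # one P-state step
    high_watermark: float = 0.7   # at or above this load, scale up at once
    low_watermark: float = 0.3    # at or below this load, consider scaling down
    quiet_intervals_needed: int = 5
    current_mhz: int = 2200
    quiet_intervals: int = 0

    def update(self, load: float) -> int:
        """Feed one load sample and return the frequency to apply."""
        if load >= self.high_watermark:
            # Aggressive scale-up: jump straight to the maximum frequency.
            self.current_mhz = self.max_mhz
            self.quiet_intervals = 0
        elif load <= self.low_watermark:
            # Conservative scale-down: one step after a sustained quiet period.
            self.quiet_intervals += 1
            if self.quiet_intervals >= self.quiet_intervals_needed:
                self.current_mhz = max(self.min_mhz, self.current_mhz - self.step_mhz)
                self.quiet_intervals = 0
        else:
            # Moderate load: hold the current frequency and restart the count.
            self.quiet_intervals = 0
        return self.current_mhz
```

The load value could be derived from, for example, the fraction of polls that return a full burst. The returned frequency would then be applied through whichever sysfs interface the platform's cpufreq driver exposes.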
Although P-state transition latency has been greatly reduced, the response time between an increase in network load and the corresponding increase in P-state remains the most important design consideration when implementing in-band P-state tuning techniques. NIC descriptor rings generally have only enough buffer space to hold a few milliseconds' worth of packets at most, so a response time in this region is required to prevent packet loss. In-band techniques are capable of reacting within this time frame, as the PMD or user-space application itself is responsible for detecting the increase in network load, making a P-state adjustment decision, and applying the determined P-state.
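The response-time budget follows from simple arithmetic: a receive ring with a given number of descriptors fills in (descriptors / packet rate) seconds if the core stops draining it. The sketch below uses a hypothetical ring size and packet rate; the function names are illustrative.

```python
def ring_buffer_time_ms(descriptors, packets_per_second):
    """Time in milliseconds before an undrained receive ring overflows."""
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    return descriptors / packets_per_second * 1000.0


def reaction_fits_budget(reaction_time_ms, descriptors, packets_per_second):
    """True if a P-state reaction completes before the ring would overflow.

    This is a worst-case bound: it assumes no packets are drained while
    the core is waiting for the higher frequency to take effect.
    """
    return reaction_time_ms < ring_buffer_time_ms(descriptors, packets_per_second)
```

For example, a hypothetical ring of 4096 descriptors receiving 2 million packets per second gives a budget of roughly 2 ms, consistent with the "few milliseconds" figure above.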
Out-of-band P-state Tuning
Out-of-band P-state tuning refers to P-state tuning techniques that are managed by an entity outside of the user-space application itself. In the case of a vCMTS DOCSIS MAC implementation, real-time telemetry can be delivered to an external agent. Such an approach not only separates the power management logic from the application itself but also allows for a single agent to control power elements of the entire platform. This is a distinct advantage when placing containerized DOCSIS MAC implementations in orchestrated environments. In a container-based infrastructure, privileged permissions are required to apply power management controls. Restricting such permissions to a single power agent on a node simplifies the placement of network functions and ensures the entity applying power controls is aware of the entire platform.
Similar to in-band P-state tuning, P-state reaction time is a key factor when implementing out-of-band tuning. A fast response time in the region of a few milliseconds becomes more difficult to achieve when P-states are being managed by an external entity. This is in large part due to the requirement for telemetry input for the P-state decision to propagate to the agent. A low-latency mechanism is, as a result, necessary for an effective solution.
Solutions can be based on two categories of telemetry. The first is application-specific metrics including, but not limited to, active subscriber statistics, throughput per active cable service group, and average size of PMD dequeues. These types of metrics provide direct insight into the network load of a vCMTS DOCSIS MAC implementation. With awareness of these metrics, a power management agent can apply informed P-state actions. The telemetry stack used is of prime importance to ensure metrics can be delivered to the external agent within the aforementioned timeframe.
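As an illustration of how an external agent might act on application-specific telemetry, the sketch below maps a throughput sample to a core frequency and falls back to the maximum frequency when the sample is too old to trust. The function name, the threshold table, the 2 ms staleness limit and the frequencies are all hypothetical.

```python
def choose_frequency_mhz(throughput_gbps, sample_age_ms, thresholds,
                         max_mhz=2200, max_age_ms=2.0):
    """Pick a core frequency from one telemetry sample.

    thresholds is a list of (throughput_limit_gbps, frequency_mhz) pairs
    sorted by ascending limit. Stale samples are not trusted: the agent
    then returns max_mhz so that service is protected rather than power.
    """
    if sample_age_ms > max_age_ms:
        return max_mhz
    for limit_gbps, frequency_mhz in thresholds:
        if throughput_gbps <= limit_gbps:
            return frequency_mhz
    return max_mhz  # load is above every threshold


# Hypothetical mapping for one cable service group.
EXAMPLE_THRESHOLDS = [(2.0, 1200), (6.0, 1700)]
```

Failing safe on stale telemetry reflects the importance of the delivery latency discussed above: a decision based on old data is treated the same as having no data.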
Alternatively, the external agent can achieve energy efficiencies by monitoring platform specific metrics on a per-core basis and applying suitable P-state changes. While HWP and the OS are unable to distinguish between varying network load on polling workloads, custom software can be used to do exactly that by monitoring detailed core metrics. Instruction rates, cache utilization and branch distribution are examples of per-core statistics that can provide an accurate representation of the true network load of a core. By first measuring these statistics at known levels of network load during a training period, an accurate determination of network load can then be computed by the agent as it closely monitors real-time metrics.
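The training-period idea can be sketched as a calibration table that is later interpolated. The example below is a simplified illustration under stated assumptions: it uses a single per-core metric that rises monotonically with network load, whereas a real agent would likely combine several metrics. The class name LoadEstimator and the sample values are hypothetical.

```python
import bisect


class LoadEstimator:
    """Estimates network load from a per-core metric using training data."""

    def __init__(self, training_points):
        # training_points: (metric_value, known_load_fraction) pairs measured
        # at known levels of network load during the training period.
        points = sorted(training_points)
        if len(points) < 2:
            raise ValueError("at least two training points are required")
        self.metrics = [m for m, _ in points]
        self.loads = [load for _, load in points]

    def estimate(self, metric):
        """Linearly interpolate the load for a real-time metric value."""
        if metric <= self.metrics[0]:
            return self.loads[0]
        if metric >= self.metrics[-1]:
            return self.loads[-1]
        i = bisect.bisect_right(self.metrics, metric)
        m0, m1 = self.metrics[i - 1], self.metrics[i]
        l0, l1 = self.loads[i - 1], self.loads[i]
        return l0 + (l1 - l0) * (metric - m0) / (m1 - m0)
```

An agent would build the estimator once from the training measurements and then call estimate() on each new sample before choosing a P-state.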
As operators accelerate the deployment of NFV workloads, including vCMTS, opportunities to do so in a more energy-efficient manner must be capitalized upon. This is best enabled via the newly available techniques and controls discussed in this paper. Both power-saving C-states and P-states play a major role in the pathway to a greener, more environmentally friendly vCMTS deployment on COTS x86 servers. In the experiments we ran, legacy and newly available C-states provided CPU power savings of up to 10% under high network load and up to 70% reduction in CPU power draw on an idle system. P-state tuning techniques also showed significant energy savings of up to 30% of CPU power draw on a vCMTS deployment. We strongly recommend that operators and vendors perform detailed analysis of the power management controls available to them and carefully consider their integration either within the vCMTS application itself or as a separate software agent. By developing tailored algorithms and carrying out lab benchmarking, operators and vendors will develop a greater understanding of the power elements under their control, with a view to maximizing energy efficiencies on their vCMTS deployments.
Bibliography and References
 B. Ryan, M. O'Hanlon, D. Coyle, R. Sexton and S. Ravisundar, “Maximizing vCMTS Data Plane Performance with 3rd Gen Intel® Xeon® Scalable Processor Architecture,” [Online]. Available: https://networkbuilders.intel.com/solutionslibrary/maximizing-vcmts-data-plane-performancewith-3rd-gen-intel-xeon-scalable-processor-architecture.
 “DPDK (Data Plane Development Kit),” Linux Foundation Projects, [Online]. Available: https://www.dpdk.org/.
 “FD.io - The World’s Secure Networking Data Plane,” Linux Foundation Projects, [Online]. Available: https://fd.io/.
 Intel Corporation, “Intel vCMTS Reference Dataplane,” [Online]. Available: https://www.intel.com/content/www/jp/ja/developer/topic-technology/open/vcmts-referencedataplane/overview.html.
 K. Devey, D. Hunt and C. MacNamara, “Power Management - Technology Overview,” [Online]. Available: https://builders.intel.com/docs/networkbuilders/power-management-technologyoverview-technology-guide.pdf.
 The Kernel Development Community, “CPU Idle Time Management,” [Online]. Available: https://www.kernel.org/doc/html/v5.0/admin-guide/pm/cpuidle.html#.
 Intel Corporation, “Intel 64 and IA-32 Architectures Software Developer's Manual,” [Online]. Available: https://software.intel.com/content/www/jp/ja/develop/download/intel-64-and-ia-32- architectures-software-developers-manual-volume-2b-instruction-set-reference-m-u.html.
 “Comms Power Management Github,” Intel Corporation, [Online]. Available: https://github.com/intel/CommsPowerManagement.
 “DPDK Power Management,” Linux Foundation Projects, [Online]. Available: https://doc.dpdk.org/guides/prog_guide/power_man.html.