How to Truly Measure Scale

Right-sizing the Stateful Device Under Test (DUT)

When we test application and security devices on the network, an important question to ask is “How will the device or network really work in the production network?”

Traditional ways of measuring the performance of a DUT, such as bandwidth, open connections (OpenCons), or connections per second (Cons/sec), are necessary but not sufficient for producing valid data and metrics.

The pre-deployment engineering scale tests tend to focus on a single sub-system of the DUT like the TCP state table, Application table, or performance of an ASIC, or forwarding engine.

Additionally, the test patterns used to isolate and test these subsystems are synthetic, designed to isolate a specific element in the test.

This methodology intentionally limits higher-layer traffic realism. It converts the finite state machine (FSM) transitions of protocols from dynamic to static variables, and it turns other elements of the DUT into static, unmeasured systems. Systems frozen in that state cannot be evaluated in a system test without taking the behavior of the entire system into consideration, and ignoring that behavior results in errors.

These tester-induced errors tend to generate overly optimistic deployment performance measurements. In addition, the measurements tend not to be repeatable from run to run.

The real-world effect is that one may measure, for example, bandwidth at 100 units of performance, whereas in a deployed environment (with real-world complexities) one may get perhaps 70-80 units, which amounts to an error margin of 20-30%. Therefore, we must design tests that mitigate this unacceptable margin of error.
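The arithmetic behind that margin can be written out as a small Python sketch. The function name `error_margin` is an illustrative assumption, not part of any test tool; the numbers are the ones quoted above.

```python
def error_margin(lab_measured: float, production_observed: float) -> float:
    """Fraction by which the lab figure overstates production performance."""
    if lab_measured <= 0:
        raise ValueError("lab_measured must be positive")
    return (lab_measured - production_observed) / lab_measured


# Lab test reports 100 units; production delivers 70 to 80 units.
for observed in (70, 80):
    margin = error_margin(100, observed)
    print(f"lab=100, production={observed}: margin={margin:.0%}")
```

The margin is expressed relative to the lab figure because that is the number a provisioning decision would otherwise be based on.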

To mitigate such a large margin of error we must build a test that reflects Realism, Reliability, and Repeatability … in the best way possible.

First let’s discuss Realism.

Realism is the emulation of traffic that is close to or indistinguishable from production traffic. It has the same degree of complexity and is measured with the same KPIs (key performance indicators) by which users perceive traffic, i.e., their Quality of Experience (QoE).

Practically, this means that an HTTP GET of a single 64-byte object is not the same degree of complexity as a modern, HTML5-based stateful application.

Next, we need to consider Realism alignment.

When one aligns the basic unit of test traffic with the traffic that will traverse the DUT, the sub-systems inside the DUT are exercised together, and the gap between their behavior under test and their behavior in production approaches zero. This result is considered “Very Good” because, with Realism alignment, there is less chance of false positive or false negative results.

On the analysis side, Realism means measuring the right metrics. Specifically, the correct root metric to measure is the same metric by which the user judges the quality and predictability of the traffic. For a web application, the primary metric is page load time (or render time), with zero errors.

If the page loads in 2 seconds or less (remember, we are talking about hundreds to thousands of lines of code), the user perceives the page as “Good.” If the page loads in 2-5 seconds, the perceived QoE is generally considered “Neutral-Poor.” Past 5 seconds, or if there are any errors such as a “404 Object Not Found,” the user’s perception is “Bad.”
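These thresholds can be expressed as a simple classifier. The following is a minimal Python sketch: the function name `classify_qoe` and its parameters are assumptions for illustration, and treating a load time of exactly 5 seconds as “Neutral-Poor” is a choice made here, since the text only says “past 5 seconds.”

```python
def classify_qoe(load_seconds: float, errors: int = 0) -> str:
    """Map one page load to the perceived-quality bands described above."""
    # Any error (for example a 404) or a load past 5 seconds is "Bad".
    if errors > 0 or load_seconds > 5.0:
        return "Bad"
    # 2 seconds or less with zero errors is "Good".
    if load_seconds <= 2.0:
        return "Good"
    # Everything in between (over 2 and up to 5 seconds) is "Neutral-Poor".
    return "Neutral-Poor"


for sample in [(1.4, 0), (3.2, 0), (6.0, 0), (1.0, 1)]:
    print(sample, "->", classify_qoe(*sample))
```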

In addition, users exhibit a “negativity bias” when perceiving differing performance outcomes, whether those outcomes are rated “Good” or “Bad.” This means that past performance failures affect user perception in a very asymmetrical fashion.

For instance, if the user experiences 99% “Good” instances and 1% “Bad” instances, the “Bad” instances will carry substantially more weight. This distribution of the performance metric now becomes a KPI, in addition to individual load times. And don’t forget that each application has its own set of KPI measurements.

So once you have defined meaningful traffic with a KPI table to measure the traffic, you can truly measure scale!

To measure production network scale, one might begin by making a declaration. This declaration could be “I can scale up users, each of whom generates meaningful traffic and is measured with meaningful KPIs, until any active user sees KPI performance below the lowest allowable minimum.” Next, one can start to ramp up users, either concurrently or at a rate, until failure.

The way one reports performance (with meaningful traffic) is to ask:

“How many users or users-per-second can be concurrently processed across the DUT, such that each and every user-satisfaction ranking remains high and consistent?”

Once you have this number you have a very accurate way of provisioning the DUT in the production network with high reliability. 
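The ramp-until-failure procedure described above can be sketched in Python. Everything named here is hypothetical: `measure_load_times` stands in for whatever runs a test at a given user count and returns one page load time per user, `fake_dut` is a toy model rather than a real device, and the 2-second limit and step size are illustrative defaults.

```python
from typing import Callable, Sequence


def max_users_within_kpi(
    measure_load_times: Callable[[int], Sequence[float]],
    max_load_seconds: float = 2.0,
    start: int = 100,
    step: int = 100,
    limit: int = 10_000,
) -> int:
    """Return the highest tested user count at which every user met the KPI.

    Returns 0 if even the first step fails.
    """
    best = 0
    for users in range(start, limit + 1, step):
        load_times = measure_load_times(users)
        # Stop as soon as any single user falls below the allowable minimum.
        if any(t > max_load_seconds for t in load_times):
            break
        best = users
    return best


def fake_dut(users: int) -> Sequence[float]:
    """Toy stand-in for a real test run: load time grows with user count."""
    return [0.5 + users / 1000.0] * users


print(max_users_within_kpi(fake_dut))  # prints 1500 for this toy model
```

The stop condition uses `any` rather than an average on purpose: the declaration is that no active user may drop below the minimum, so a good mean cannot hide a bad individual result.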

The next thing to consider is Repeatability.

One needs to ask, “Will this test generate the same pattern over and over; or will the limitations of the test alter the ability to regress traffic?”

Repeatability is critical because without it you have no way of trending or comparing results. 

You will always want to see if the DUT changes over time, or compare one DUT result to another with confidence.

Spirent AppSec Test Solutions promote Realism, Reliability, and Repeatability.

We do this first by aligning the basic unit of schedule to the user as opposed to the protocol. By doing this, we eliminate the “unknown unknowns” problem by emulating layers midway in the OSI stack. Second, our Load Spec algorithm and TCP stack are world-class. With TCP, we emulate a real-world, one-arm stack with windowing, congestion avoidance, backoff, etc., just like real hosts.

This means that when there is a fault detected in the DUT, we will go through the RFC procedure to recover, as opposed to simply restarting the connection. In other words, we don’t treat TCP like UDP.

Our Load Spec is a patented heuristic engine that allows for deep Reliability and Repeatability from test to test. A great way of testing a tester is to run the same test 10 times and see what is generated and what is measured. Furthermore, one can ask, “Do the results compare, or is there tester-induced failure?”
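One way to compare repeated runs is to look at the spread of a result relative to its mean. The Python sketch below is a generic statistical check, not a description of how Load Spec works; the function names, the 2% tolerance, and the sample run values are all assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(results: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean (needs 2+ results)."""
    avg = mean(results)
    if avg == 0:
        raise ValueError("mean of results is zero; ratio is undefined")
    return stdev(results) / avg


def is_repeatable(results: Sequence[float], tolerance: float = 0.02) -> bool:
    """True if run-to-run variation stays within the chosen tolerance."""
    return coefficient_of_variation(results) <= tolerance


# Ten hypothetical runs of the same test, e.g. max users sustained per run.
runs = [1500, 1495, 1505, 1500, 1498, 1502, 1500, 1497, 1503, 1500]
print(f"variation: {coefficient_of_variation(runs):.3%}")
print("repeatable:", is_repeatable(runs))
```

If the same check fails with nothing changed on the DUT, the variation is more likely coming from the tester than from the device.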

Lastly, we always build in a sense of temporal cohesion at the user layer. This means that you can program a user to perform specific actions over time, correlated to that user, which matters for Web-based applications. Plus, all of this can be done at extremely high scale.


Testing with the right traffic complexity, with the right KPIs, with high Reliability and Repeatability, makes for very relevant and meaningful testing.

Don’t settle for anything less. 

For more information on how to mitigate large margins of error in enterprise network testing please visit: http://www.spirent.com/Solutions/Security-Applications.


