Netometer
was originally intended to give users a sense of the
"speed" of the network at any given instant, so they could put
their own experience in context.
Most network data collection tools record the AMOUNT of traffic over
particular circuits (e.g., MRTG records), which tell you nothing
about the behavior of individual connections carried by those circuits.
For example, there is no way to tell how many packets are transmitted
successfully or how fast individual streams are running.
Ping times, on the other hand, include in a single value the time
required to traverse a collection of circuits, network equipment
such as switches and routers, and a destination host that is usually
a computer.
The Webometer was developed in response to the contention that ping times
are notoriously unreliable; ICMP ping traffic may be handled differently
than other IP or TCP traffic, so that it may not represent overall
network performance, except in the grossest sense. As it happens,
Netometer and Webometer results do NOT correlate well (usually
less than .5), though they display similar distributions, as will be
discussed later.
Note that the University of Kansas has been connected to the outside
world through a single high-speed link (DS-3 or OC-3) during most
of this recording period. (There were a few months when an ancillary
path directly to a local ISP was in operation.) Both commodity
Internet traffic (Internet1) and Internet2 traffic share this single line
to the Great Plains GigaPop (and thence both I1 and I2), and this
line is usually saturated during business hours of every business day.
After a general description of the archived data, this page will
attempt a discussion with respect to
1) self-similarity and 2) critical phenomena.
(For a brief description of critical phenomena see
"Scaling,
renormalization, and universality: Three pillars of modern critical
phenomena.")
This effort was focused both on understanding these concepts via
the archived data, as well as understanding the data network itself.
(Note that this page has evolved from simple diary form to something
more organized, but still retains much of the disorganization
of its history. Please accept my apologies.)
Various tools were used for demonstrating "self-similar," "scale-invariant,"
"self-affine," etc. traffic, and for characterizing "critical phenomena."
Several data manipulation techniques were employed:
- Log-log plots of frequency tables,
looking for evidence of power-law behavior,
- Pseudo-phase space plots on both raw data and on data differenced
to remove low-frequency periodicity,
- Aggregation of adjacent segments of a data sequence into single mean
values generating a compressed sequence, as suggested by
Sir D.R. Cox in "Long-range dependence,"
a review in Statistics: An Appraisal,
edited by H.A. David and H.T. David and
published in 1984 by Iowa State University Press.
Subsequent analysis of such compressed sequences can be used to build
functions of the segment size (a minimal sketch of this aggregation step
appears after this list). For example, Cox
looked at sequence autocovariance as a function of segment size.
- Autocorrelations on both the raw and differenced data (as described
above for the pseudo-phase space plots), usually presented as "correlograms,"
with autocorrelation values plotted against data lag, and
- Detrended Fluctuation Analysis,
- Benford's Law (just for the heck of it)
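As a concrete (if hypothetical) illustration of the aggregation technique, here is
a minimal Python sketch; the variable names and simulated data are stand-ins, not
the actual Netometer tooling:

    import numpy as np

    def aggregate(series, m):
        """Collapse a sequence into means of non-overlapping segments of length m."""
        series = np.asarray(series, dtype=float)
        usable = (len(series) // m) * m          # drop the ragged tail
        return series[:usable].reshape(-1, m).mean(axis=1)

    # Hypothetical stand-in: one simulated month of pings (96 observations per day).
    rng = np.random.default_rng(0)
    pings = rng.lognormal(mean=4.0, sigma=0.5, size=96 * 31)

    for m in (1, 10, 100, 1000):
        agg = aggregate(pings, m)
        print(f"aggregation {m:5d}: {len(agg):5d} values, mean {agg.mean():6.1f} ms")

Each aggregated sequence can then be analyzed in its own right (frequency tables,
autocovariance, DFA, etc.) as a function of the segment size m.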
Internet traffic has previously been shown to display
self-similarity/scale-invariance, but the Netometer data is
handled differently and includes both educational and commercial
sites, including sites on Internet2.
The PingER
and Active Measurement Project
also do pinging of this sort and archive their results, but they
focus on educational/government sites.
The Netometer data
To get a feeling for the raw Netometer data, take a look at the
ping results for October of 2001, as either
choropleth,
graphical, or
phase space form.
You may also inspect the October 2001
frequency counts against
the whole range of possible ping values, or
frequency
counts against only those ping times equal to or under 150ms.
Note that the former looks like a power-law graph (with a small hump in the
100 to 200ms range) and the latter resembles a log-normal graph.
The difference is due to the size of the categories
used in each chart.
The same data is plotted in a
log-log
frequency chart, which also shows a well-defined hump between 100 and 200ms,
and as a reverse cumulative frequency
chart, which graphs the count of all observations greater than X
against X.
In this reverse cumulative graph the initial rise in the log-normal
distribution is hidden and the hump is smoothed.
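For readers who want to reproduce these frequency and reverse cumulative counts
from a simple array of ping times, here is a minimal Python sketch (the bin width
and the stand-in data are assumptions; the real charts use the archived ping means):

    import numpy as np

    def frequency_table(pings, bin_width=1.0):
        """Count observations falling into each ping-time bin (ms)."""
        pings = np.asarray(pings, dtype=float)
        edges = np.arange(0.0, pings.max() + bin_width, bin_width)
        counts, _ = np.histogram(pings, bins=edges)
        return edges[1:], counts                 # upper edge of each bin, and its count

    def reverse_cumulative(counts):
        """For each bin's upper edge X, the number of observations greater than X."""
        return counts.sum() - np.cumsum(counts)

    # Hypothetical stand-in data; the real charts use the archived ping means.
    rng = np.random.default_rng(1)
    pings = 20 + 30 * rng.pareto(1.9, size=30_000)

    x, freq = frequency_table(pings)
    rev = reverse_cumulative(freq)
    keep = freq > 0
    loglog_points = np.column_stack((np.log(x[keep]), np.log(freq[keep])))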
You can also look at the
correlogram for this data, the
phase chart of the data differenced by 2 and then by 96 time steps
to remove autocorrelation,
and the correlogram on this differenced data, and a
frequency graph of the differenced phase plot X values.
Data values for these correlogram and phase charts were cut off
at 300 ms (observations over 300 ms were reduced to 300).
Finally, you can look at this data as results of a
Detrended Fluctuation
Analysis (DFA), which will be discussed later.
The frequency results above are for means of all successful
(up to 100) pings, but the programs that produce these graphs can
also show mean, minimum, and maximum ping times, as well as percentage
of failed pings (lost packets).
October 2001 is a "canonical month" from the standpoint of usage
(though possibly NOT from the standpoint of network "health", as
will be suggested later);
it includes high traffic rates during business days and hours and
relatively low traffic rates otherwise, along with very quick
transitions from low-use to high-use periods.
As a result, the high-use periods sometimes appear in the graphical
output as craggy mesas above a flat plain, and in the choropleth
output as fairly well defined magenta/red blotches.
You may also wish to see the frequency distributions for
all 5 years, for
2002 in either the lower range or
across the entire range, and for
1999.
Since, as will be mentioned later, Netometer times correlate poorly
with web page access times, Netometer may not actually be very useful
for evaluating the user's experience, and yet, as will be argued later,
it MAY prove useful for evaluating the overall health of the network.
Log-log plots of Netometer frequency tables and data aggregation
The raw Netometer observations were converted to frequency tables,
and then used to construct log-log plots.
For the most part, all the collected data were included with no
cutoffs made on the basis of minimum or maximum observed ping times.
However, observations are included in these charts only when at least
8 of the 10 target sites can be reached by at least one ping during a
particular ping series.
In this section, frequency tables and graphs are presented for the
following years:
- 1998-2002, inclusive, comprising approximately 170,000 records,
- 2002, comprising approximately 30,000 records, and
- 1999, also comprising approximately 30,000 records.
The regression equations calculated for these charts and for other
years are presented in a table following the charts themselves.
Note that the slopes and Y-intercepts are quite similar, and that
the correlations are relatively high.
Note also that these correlations are performed on "frequency" rather than
"cumulative" frequency values.
The first 5 charts, links to which appear in the list below, show mean
ping times collected over the years 1998 through 2002 (in spite of the
labels which refer to the year 2003), inclusive,
aggregated at 5 different levels.
For example, the first chart shows the data unaggregated. The
X-axis shows the natural logarithm of the ping times recorded during
the 5 year period. The Y-axis shows the natural log of the
number of times each ping time was observed.
The second chart shows the 98-02 data aggregated in groups of 10
data values. That is, each set of 10 values was averaged to give a
single data value.
- 1998 through 2002, inclusive, aggregated in (non-overlapping) groups of
- 2002
- 1999
Note that many of these charts show distinctive "valleys" located
close to X-axis values of around 4.5, or e^4.5 = 90 ms.
That is, there are often more observations at values near 90 ms than
at 90 ms itself. This appears to be the transition point from the plain to
the mesa on the raw traffic graphs.
The 1999 graphs appear somewhat similar to graphs of behavior of
abstract models at their "critical points," presented in Figure 5 of
Sole' and Valverde.
(More about that later.)
Relevant regression data are presented in the following table (these
statistical values are ever so slightly different from the ones
presented in the links above because the regression X-values used for the
results below were based on maximum endpoints of each category and
results above were based on category midpoint values). Note that
regression slopes, etc. in this table were calculated on BOTH
frequency values AND reverse cumulative frequency values.
Time interval | Sites reached | Aggregation | Slope (freq.) | Y-intercept (freq.) | Correlation (freq.) | Slope (rev. cum.) | Y-intercept (rev. cum.) | Correlation (rev. cum.) | DFA slope |

1998-2002 | 8-10 | 1 | -1.88 | 16.62 | -0.92 | -2.15 | 20.0 | -0.99 | 0.88 |
1998-2002 | 8-10 | 10 | -2.02 | 15.18 | -0.93 | -2.32 | 18.4 | -0.99 | 0.87 |
1998-2002 | 8-10 | 100 | -2.02 | 13.24 | -0.87 | -2.92 | 18.9 | -0.97 | 1.04 |
1998-2002 | 8-10 | 1000 | -0.71 | 5.37 | -0.52 | -2.13 | 13.4 | -0.95 | 1.16 |
1998-2002 | 8-10 | 10000 | -0.43 | 2.52 | -0.30 | -2.14 | 11.2 | -0.98 | N/A |

2002 | 8-10 | 1 | -1.85 | 13.95 | -0.88 | -1.96 | 16.4 | -0.97 | 0.80 |
2002 | 8-10 | 10 | -1.89 | 12.39 | -0.88 | -2.15 | 14.9 | -0.96 | 0.76 |
2002 | 8-10 | 100 | -2.75 | 14.34 | -0.95 | -3.47 | 18.3 | -0.98 | 1.02 |
2002 | 8-10 | 1000 | -2.87 | 13.49 | -0.92 | -3.63 | 16.9 | -0.99 | 1.53 |

2001 | 8-10 | 1 | -1.83 | 14.56 | -0.90 | -1.85 | 16.5 | -0.96 | 0.84 |
2001 | 8-10 | 10 | -1.81 | 12.63 | -0.90 | -1.85 | 14.7 | -0.96 | 0.73 |
2001 | 8-10 | 100 | -1.67 | 10.20 | -0.88 | -2.46 | 14.8 | -0.93 | 1.03 |
2001 | 8-10 | 1000 | -1.43 | 7.04 | -0.73 | -2.88 | 14.7 | -0.98 | 1.69 |

2000 | 8-10 | 1 | -2.12 | 16.60 | -0.94 | -2.18 | 18.6 | -0.98 | 0.80 |
2000 | 8-10 | 10 | -2.26 | 15.14 | -0.91 | -2.51 | 17.90 | -0.99 | 0.78 |
2000 | 8-10 | 100 | -1.70 | 10.70 | -0.83 | -3.39 | 19.62 | -0.97 | 0.96 |
2000 | 8-10 | 1000 | -1.94 | 10.65 | -0.91 | -3.11 | 16.47 | -0.99 | 1.26 |
2000 | 8-10 | 10000 | -1.97 | 9.17 | -1.0 | -2.50 | 11.39 | -0.71 | N/A |

1999 | 8-10 | 1 | -2.41 | 17.54 | -0.96 | -2.52 | 19.5 | -0.99 | 0.74 |
1999 | 8-10 | 10 | -2.83 | 17.45 | -0.97 | -3.30 | 20.9 | -1.0 | 0.69 |
1999 | 8-10 | 100 | -3.63 | 18.96 | -0.85 | -5.50 | 27.9 | -0.98 | 0.76 |
1999 | 8-10 | 1000 | 1.27 | -3.51 | 0.30 | -6.59 | 30.5 | -0.94 | 1.47 |

1998 | 8-10 | 1 | -1.81 | 15.42 | -0.89 | -2.44 | 20.79 | -0.99 | 0.90 |
1998 | 8-10 | 10 | -1.86 | 13.36 | -0.84 | -2.67 | 19.54 | -0.99 | 0.84 |
1998 | 8-10 | 100 | -1.12 | 7.87 | -0.51 | -2.93 | 18.41 | -0.95 | 0.98 |
1998 | 8-10 | 1000 | 1.58 | -6.26 | 0.68 | -1.14 | 8.31 | -0.74 | 2.94 |
Note that the DFA results for aggregations of 100 observations are
all very close to 1, and that 96 observations are recorded each day
(when the system and network are functioning properly). This indicates
high self-similarity within the daily mean value series, but somewhat
less within the less highly aggregated series.
The slopes show the power to which each ping time must be raised to
get its frequency. That is, the log-log regressions yield the equations
ln( Frequency ) = ( slope * ln( Ping.time) ) + Y.intercept
or, after using each side as an exponent to e,
Frequency = ( e ^ ( slope * ln( Ping.time) ) ) * ( e ^ Y.intercept )
Frequency = ( Ping.time ^ slope ) * ( e ^ Y.intercept )
so, for the 5 year period, from 1998 through 2002, we have, for the
unaggregated data:
Frequency = ( Ping.time ^ -1.88 ) * ( e ^ 16.62 )
or
Frequency = ( 1 / (Ping.time ^ 1.88 ) ) * ( 16490199 )
so that for a Ping time of 50ms we should see approximately 10,548 data
points, which we, in fact, do.
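That arithmetic can be checked directly; here is a minimal sketch using the slope
and Y-intercept taken from the table above:

    import math

    slope, intercept = -1.88, 16.62     # 1998-2002, unaggregated, from the table above
    ping_time = 50.0                    # ms

    expected = (ping_time ** slope) * math.exp(intercept)
    print(round(expected))              # roughly 10,500 observations at 50 ms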
Some readers may expect the data to be exponentially distributed.
The following table shows log-linear and log-log correlations
on the frequency data for each year (using any data point resulting
from a mean of pings from at least 8 sites). (I have not attempted
to evaluate the extent to which the raw frequency data resemble
a log-normal distribution.)
Year | log-linear correlation | log-log correlation |
2002 | .59 | .88 |
2001 | .50 | .89 |
2000 | .63 | .94 |
1999 | .63 | .96 |
1998 | .89 | .84 |
It appears that a power-law distribution will describe the data better
than an exponential distribution, except for 1998. (These correlations
are similar to correlations performed on completely raw data sets.)
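A minimal sketch of how those two correlations can be computed from a frequency
table (it assumes the ping-value and count arrays produced by the frequency-table
sketch shown earlier):

    import numpy as np

    def compare_fits(ping_values, counts):
        """Correlate log(frequency) with ping time (log-linear) and with
        log(ping time) (log-log); the stronger correlation suggests the better model."""
        keep = counts > 0
        x, log_f = ping_values[keep], np.log(counts[keep])
        r_loglinear = np.corrcoef(x, log_f)[0, 1]
        r_loglog = np.corrcoef(np.log(x), log_f)[0, 1]
        return r_loglinear, r_loglog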
Autocorrelations
On the whole this data is autocorrelated at many lag intervals.
Here are the correlograms for:
A quick scan suggests that mean autocorrelation values for each year
correlate well with the DFA results presented below (at least
for the first 2000 lag intervals presented in these plots).
See the DFA section for an example.
The positive peaks in these correlograms occur around 96 observation
lags, which correspond to 1 day time lags. To check for chaotic regions,
the data was filtered by taking differences at various lags.
Differencing the data at lags of 2 and then 96 observations seems to
reduce most data set autocorrelation values to within the 96% confidence
interval, and the phase plots of the differenced data show only unimodal
distributions (rather than distributions expected of chaotic regimes,
although this does not preclude the existence of chaotic regions).
See the October 2001 plots presented earlier for an example of this
process.
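Here is a minimal sketch of the autocorrelation and differencing steps used
throughout this page (the cutoff and lags mirror the October 2001 example above;
raw_pings is a hypothetical array of one month of observations, and the plotting
itself is omitted):

    import numpy as np

    def autocorrelation(series, max_lag):
        """Autocorrelation coefficients for lags 1..max_lag (the correlogram values)."""
        x = np.asarray(series, dtype=float)
        x = x - x.mean()
        denom = np.dot(x, x)
        return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

    def difference(series, lag):
        """Difference a series at a given lag: y[t] = x[t] - x[t - lag]."""
        x = np.asarray(series, dtype=float)
        return x[lag:] - x[:-lag]

    # Example of the two-stage filtering:
    # pings = np.minimum(raw_pings, 300)               # cut off observations above 300 ms
    # filtered = difference(difference(pings, 2), 96)  # remove short- and daily-period structure
    # correlogram = autocorrelation(filtered, max_lag=2000)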
In these correlograms, the "autocorrelation variability" seems to
diminish over time. For example, correlating year against the difference
between first maximum and first minimum for that year yielded a
correlation coefficient of .83.
Detrended Fluctuation Analysis(DFA) of the Netometer data
DFA
was first proposed in:
Peng C-K, Buldyrev SV, Havlin S, Simons M, Stanley HE, Goldberger AL. Mosaic
organization of DNA nucleotides. Phys Rev E 1994;49:1685-1689,
and the implementation used here was obtained from PhysioNet.
The DFA algorithm (informally) takes an input data series of length N
and divides it into a series of equal-sized intervals, or "boxes,"
each containing n entries, and for each box:
- a regression line is determined using the data in each interval,
- the variation of the data with respect to the regression line
is determined,
- a mean variation (relative to the local trend), F, is calculated from
variations for the individual boxes.
The process is repeated for a group of box sizes, n (ranging from 2 (usually)
up to N/4), and the logs of the mean variations, F(n) are plotted
against logs of the box sizes.
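Here is a minimal Python sketch of that procedure. It follows the standard Peng
et al. recipe, including the initial integration (cumulative summing) of the
mean-removed series, which the informal description above leaves implicit; it is
a sketch, NOT the PhysioNet implementation actually used for the charts:

    import numpy as np

    def dfa(series, box_sizes):
        """Detrended Fluctuation Analysis: returns (log n, log F(n)) pairs."""
        x = np.asarray(series, dtype=float)
        y = np.cumsum(x - x.mean())                  # integrate the mean-removed series
        points = []
        for n in box_sizes:
            n_boxes = len(y) // n
            if n_boxes < 2:
                continue
            boxes = y[: n_boxes * n].reshape(n_boxes, n)
            t = np.arange(n)
            fluct = []
            for box in boxes:
                slope, intercept = np.polyfit(t, box, 1)               # local linear trend
                fluct.append(np.mean((box - (slope * t + intercept)) ** 2))
            points.append((np.log(n), 0.5 * np.log(np.mean(fluct))))   # log of RMS fluctuation
        return np.array(points)

    def dfa_slope(series, box_sizes):
        """The DFA exponent: slope of log F(n) against log n."""
        pts = dfa(series, box_sizes)
        return np.polyfit(pts[:, 0], pts[:, 1], 1)[0]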
According to Peng, et al. in
"Quantification of scaling exponents and crossover
phenomena in nonstationary heartbeat time series" (CHAOS, Vol. 5, No. 1, 1995),
a straight line "indicates the presence of scaling," and the slope of
the line indicates the nature of that scaling:
- a slope between 0 and .5 indicates a sequence
"such that large and small values of the time series are likely to alternate."
For example, the 2002 Pittsburgh Pirate/Tennessee Volunteer NFL playoff
game yielded a DFA value of .35 for yardage gains on the ground,
using both positive and negative values to indicate which
team achieved which yardage.
- a completely random series (either uniformly or exponentially distributed)
produces a straight line with a slope of .5,
- a sequence where relatively large observations tend to be followed
after some interval by additional larger observations should produce
a slope between .5 and 1.0.
Mapping the aforementioned playoff game using absolute values of
yardage gains yielded a DFA slope of .59. (This seems to support
the notion of a team being "on a roll.")
- Slopes of, or close to, 1.0 indicate a 1/f noise relationship,
a series possessing dependencies at many, or all, time scales examined,
and, finally,
- slopes above 1.0 are no longer of a power law form; a slope of
1.5 "indicates brown noise, the integration of white noise".
(Postscript: This corresponds to most descriptions of the Hurst
exponent as showing "anti-persistence" when between 0 and .5,
no relationship at .5, and "persistence" or "long-term autocorrelation"
when between .5 and 1.0. However, the literature includes confusing and/or
contradictory descriptions of "white" and "brown" noise.)
The DFA algorithm used here yields values quite close to .5 for datasets
containing sequences of random numbers (either uniformly or exponentially
distributed), and values quite close to 1.5 for datasets
composed of random walk sequences (running sums of positive and
negative uniformly distributed random numbers).
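That sanity check is easy to reproduce with the dfa_slope sketch shown above (the
exact values will wander a bit with the random seed):

    import numpy as np

    rng = np.random.default_rng(2)
    box_sizes = range(4, 2500, 8)                      # up to N/4 for N = 10,000

    white = rng.uniform(-1, 1, size=10_000)            # uncorrelated noise
    walk = np.cumsum(rng.uniform(-1, 1, size=10_000))  # random walk (integrated noise)

    print(round(dfa_slope(white, box_sizes), 2))       # close to 0.5
    print(round(dfa_slope(walk, box_sizes), 2))        # close to 1.5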
Here is a graph of the DFA results for the raw ping data from 1998-2002
(the X-axis represents n, and the Y-axis represents F(n)),
and a graph of the DFA results for 2002:
These graphs show overall regression slopes of 0.88 and 0.80, respectively,
with regression correlation coefficients around 0.98.
(Note that DFA boxes do not usually overlap, but calculations
can be performed with overlapping or "sliding" boxes.
The DFA algorithm applied to the 2002 data using sliding boxes yielded
very similar results.)
These graphs also show "crossover regions" where the lines transition
from one slope to another, forming an s-shaped curve.
In the graphs above, these crossovers occur around box sizes of 12
observations, which represent 3 hours of recording, and
slopes transition from approximately 1.03 to 1.14 for the 5-year data
(<2.0 and >3.25, respectively), and .99 to 1.20 for the 2002 data
(<2.25 and >3.25).
The different slopes on either side of the crossover suggest different
short- versus long-range behaviors. Peng, et al. suggest that these
crossover phenomena are important for characterizing system behavior.
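Here is a minimal sketch of how the two arm slopes can be read off such a plot,
given the (log n, log F(n)) points from the DFA sketch above; the cut points
passed in are illustrative, not necessarily the ones used to produce the charts:

    import numpy as np

    def arm_slopes(points, lower_cut, upper_cut):
        """Fit separate regression lines below and above a crossover region.

        points    : (log n, log F(n)) pairs, as returned by dfa()
        lower_cut : upper bound on log n for the short-range arm
        upper_cut : lower bound on log n for the long-range arm
        """
        log_n, log_f = points[:, 0], points[:, 1]
        low, high = log_n < lower_cut, log_n > upper_cut
        slope_low = np.polyfit(log_n[low], log_f[low], 1)[0]
        slope_high = np.polyfit(log_n[high], log_f[high], 1)[0]
        return slope_low, slope_high

    # e.g., for the 5-year data: arm_slopes(dfa(pings, range(4, 2500)), 2.0, 3.25)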
Results from DFA analyses of heartbeat data from healthy subjects
show slopes close to 1.0. Data from impaired hearts show values close
to 1.3. In addition, data from healthy vs. impaired hearts show different
crossover patterns.
Other physiological processes also show DFA differences between healthy
and impaired behavior (e.g., gait, breathing, etc.).
Using these observations as a model, one might speculate that the
Internet (at least as seen through the Netometer) is "healthy" when
DFA results are close to 1.0 and otherwise impaired, or that it is healthy
when displaying values in a common region, say .7 to .8. One might
also speculate that it COULD work more efficiently, if it were modified
to work closer to a region around 1.0. For example, modifications to
TCP and/or limitations on or redesign of UDP traffic might make it
possible for the network to operate closer to 1.0.
Here are the DFA results for each year (Click the year to see the
chart):
Correlating these DFA values against the first minimum
autocorrelation value for the years 1998 through 2002
yielded a correlation coefficient of .97, and correlating
against the ninth minimum yielded a coefficient of .95.
Correlating against first and seventh maximum values yielded
a coefficient of .74. (Note that the first and seventh maxima
are one week apart, so they would be expected to be similar.
Whoops! The first and EIGHTH maxima would be one week apart.)
Correlating DFA values against the difference between first
maxima and first minima yielded a coefficient of .78.
These correlations suggest that DFA and autocorrelation
reflect similar properties of the data. Usually, correlations
against single autocorrelation values would probably NOT reflect
overall qualities of a data set, but these data display such consistency
that single autocorrelation values DO appear to speak for the whole set.
Here are the DFA slopes for Netometer ping data for the months of 2002
and the first 2 months of 2003:
Month | DFA slope on Netometer ping times |
January 2002 | .90 |
February 2002 | .84 |
March 2002 | .83 |
April 2002 | .83 |
May 2002 | .78 |
June 2002 | .75 |
July 2002 | .76 |
August 2002 | .65 |
September 2002 | .78 |
October 2002 | .96 |
November 2002 | .73 |
December 2002 | .82 |
January 2003 | .84 |
February 2003 | .95 |
Note that most months show only slight, or indeterminate, crossover
regions, and appear closer to a single bend, rather than an s-shape, if
anything at all.
See the table above for more DFA results. I expected that Netometer DFA
values close to 1.0 would correlate with the Netometer correlation coefficients
produced by the log-log frequency tables, and/or with the Netometer regression
coefficients at different levels of data aggregation.
Neither of these expectations was met.
Webometer results
Webometer results are available in fewer formats, as shown for September 2002:
It is also possible to plot Netometer results against Webometer results.
For the most part mean Webometer access times do not correlate very well with
Netometer ping times, as exemplified by the
September 2002 comparison,
which shows a correlation of .75.
Webometer and Netometer frequency charts DO show strikingly similar
distributions, as will be shown below, and those distribution patterns
persist over several orders of aggregation.
For example, see the log-log distributions for November of 2002:
Webometer DFA slopes ARE similar to Netometer DFA slopes.
Here is the DFA chart for Webometer records from March through
December of 2002:
This chart shows a slope of .84 with arms below and above the
transition bend of .90 and .86.
The table below shows Webometer DFA slopes, and mean access times for
the last 10 months of 2002, and the first 2 months of 2003:
Month | DFA slope | Mean access time (ms) |
March 2002 | .81 | 767 |
April 2002 | .78 | 837 |
May 2002 | .76 | 703 |
June 2002 | .75 | 776 |
July 2002 | .94 | 906 |
August 2002 | .72 | 780 |
September 2002 | .84 | 933 |
October 2002 | .89 | 686 |
November 2002 | .75 | 558 |
December 2002 | .75 | 527 |
January 2003 | .84 | 582 |
February 2003 | .73 | 606 |
These DFA plots include single bends in most cases with
lower arms somewhat larger than the upper arms. In several
cases the lower arm slopes are very close to 1.0, and many monthly
charts restricted to business hours give slopes very close to 1.0,
suggesting that individual TCP connections tend towards infinitely
scaled 1/f behavior during highly congested periods.
Neither the Netometer nor Webometer DFA slopes correlate with mean
Webometer access times. However, of the last 12 months, the months
with the 5 fastest times show larger Netometer DFA slopes and better
log-log regression correlations than the 5 slowest months.
Group | Mean Netometer DFA slope | Mean log-log regression correlation on Netometer times |
5 slowest months | .75 | -.81 |
5 fastest months | .86 | -.89 |
This table might be taken to suggest that web efficiency increases as
overall network behavior, as measured by the Netometer, approaches a
completely scaled 1/f power law distribution.
(The Webometer DFAs do NOT display this
relationship; in fact, they show a slight reverse relationship. The
Webometer log-log regression correlations show a decrease from
-.82 to -.86.)
Applying Benford's Law
Benford's Law claims that given a set of numbers related to power-law
behavior, according to
http://www.cut-the-knot.com/do_you_know/zipfLaw.shtml:
"...digit D appears as the first digit with the frequency proportional
to log10(1 + 1/D). In other words, one may expect 1 to be the
first digit of a random number in about 30% of cases, 2 will
come up in about 18% of cases, 3 in 12%, 4 in 9%, 5 in 8%, etc."
See that page for a fuller explanation.
To test this rule vis a vis the Netometer archive data, starting
digits returned by the frequency tables for aggregations of 1, 10,
100, and 1000 for the years 2001, 2002, and 1998-2002, were examined.
35% were seen to begin with the digit "1", and 14% with the digit "2".
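Here is a minimal sketch of that first-digit tally, applied to an arbitrary array
of values (extraction of the actual frequency-table values is not shown):

    import numpy as np
    from collections import Counter

    def first_digit_frequencies(values):
        """Fraction of values whose decimal representation begins with each digit 1-9."""
        digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
        counts = Counter(digits)
        total = sum(counts.values())
        return {d: counts.get(d, 0) / total for d in range(1, 10)}

    # Benford's Law predicts a fraction of log10(1 + 1/d) for leading digit d.
    benford = {d: np.log10(1 + 1 / d) for d in range(1, 10)}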
Details of the October 2001 Netometer data
Much of the Netometer data appears to support the notion that there
are two separate "regimes" in place: one during high-use periods and
one during low-use periods, the two separated by the 90ms valley.
(This is probably the point at which the single KU link to the
outside world becomes saturated.)
To dissect these two regimes, several log-log charts were produced for
October, 2001:
The charts for Week-end days and off-peak hours both look like the
graph for a system at its critical point, shown in Figure 5 of
Sole' and Valverde.
That is, they look like a steep downward linear slide (with negative slope).
The "Business days" and "business hours" graphs, on the other hand, both
tend towards the graphs for systems loaded beyond their critical points.
That is, they look like a slide with a large hump in it. (In fact,
linear-linear frequency graphs tightly restricted to business days and periods
from 10am til noon or 1pm til 4pm (such as that for
October afternoons, 2001) sometimes
resemble a ragged, upside down "U", reminiscent of a normal or log-normal
distribution.)
The "business" charts also look remarkably similar to a chart of
earthquake magnitude frequencies (Fig. 14) appearing in
1/f noise: a pedagogical review by Edoardo Milotti, who considers
that shape to violate the notion of a power-law graph.
Note, however, that aggregated datasets for the "business traffic" and the
"week-end" traffic seem to show similar statistical scaling properties.
That is, they both tend to show fairly high correlations through
aggregations of 10 and 100 observations.
The DFA results for October 2001 are a bit
ambiguous, but they may be interpreted as a slight s-shape with an
overall slope of 1.02, and arms with slopes of 1.43 and .93, respectively,
below (< 1.75) and above (>2.50) the crossover region.
For October business days, the line has only a
single-bend connecting lines of slopes of 1.46 and .28 (below 1.75
and above 2.25, respectively).
Business hours during business days show a single-bend
connecting slopes of 1.21 and .73 (<1.5 and >1.75, respectively).
All hours during non-business days show single-bend slopes of .88
connecting to a slope of .37 (<1.75 and >2.1, respectively),
and off-peak hours during non-business days show single-bend slopes
of 1.01 connecting to .39 (<1.75 and >1.9, respectively).
These graphs seem to corroborate the notion that off-peak behavior
approaches infinitely scaled 1/f noise.
Difference plots were generated for peak and off-peak hours using
lags of 2 and 96 to reduce or eliminate periodicity. These plots
showed no unusual geometric patterning, but rather correlations less
than .24 and simple unimodal frequency distributions.
The Internet as a "critical phenomenon?"
It seems plausible that the Internet could behave as a "critical
phenomenon" and display "self-similar" behavior and critical points.
At the critical point, all components are apparently influencing one
another, either directly or indirectly, and within the Internet
there are at least 3 activities that might be considered to
constitute such interaction:
- communicating computers (using TCP) negotiate with one another to
deal with congestion by setting window sizes and occasionally to restart
interrupted traffic in a controlled way (the "slow-start").
- globally distributed routing tables (OSPF and BGP), including
paths and relative path weights, are continuously adjusted as
routers communicate with one another.
- packets are sometimes simply dropped to relieve local device loads or
adjust for local conditions, and such drops may signal traffic interruption.
This kind of system-wide communication appears to become more frequent
and more relevant as traffic loads grow.
In fact, it seems fair to say that when the generated load begins to
overwhelm the available single-path and/or network-wide capacity for
handling that load, the network will usually have reached a state where some
relatively large proportion of the systems connected to the Internet
are directly or indirectly connected together, perhaps crossing
a "connectivity threshold," as described by
Stanley.
With this model, the Internet may or may not at any given moment
constitute a system (a function of load, structure, and protocols) that
is organized at its critical point.
This would explain the apparent superimposition of graph shapes
typical of behavior within or beyond critical points in
that the Internet would be seen as at critical point during most
low-traffic periods, and beyond critical point during high-traffic
periods.
These Netometer data do seem to show the kind of power law "similarity"
required of "self-similar" traffic.
Log-log regression correlations for frequency data from many months and most
years remain reasonably large through several orders of magnitude of aggregation,
and the DFA results are appropriate for 1/f traffic, though not for infinite
self-similarity, except for data aggregated at 100 observations.
However, the log-log graph shapes often change dramatically as the
data is aggregated, and
some
authors have argued that some systems may show power law behavior
through several orders of magnitude for systems not particularly close
to their critical points.
Park, Kim, and Corella demonstrate that self-similarity
can arise when TCP is used to transfer files whose sizes
display a heavy-tailed distribution.
And Feng and
Tinnakornsrisuphap argue that "TCP itself is the primary source of
self-similarity," without requiring heavy-tailed distributions of file
sizes, fractal network geometries, etc.
And, indeed, it seems possible that self-similar behavior comes about through
a combination of these features, including any (effective) congestion control
mechanism, not just TCP. (In fact, Sole', et al. show simulated
power-law behavior generated by non-TCP congestion control mechanisms.)
Keep in mind that the Netometer displays ping behavior, which is NOT
TCP-based (but rather ICMP-based), so that ping traffic displays patterns
dependent on other network traffic.
Also, Feng, et al. might have trouble explaining occasions when the network
does NOT display power-law behavior. (Postscript: During 2003 autocorrelations
dropped significantly, so that changes in network load and/or
relative load and/or the type of network traffic may be identified as
agents of self-similar behavior.)
Apparently power law behavior is characteristic of some natural
systems that are composed of many subunits and must respond quickly to
changing conditions: heart rhythms, breathing rhythms, gait, etc.
and it has been suggested that this behavior is somehow optimal.
For example, it has been suggested that coastlines take the shape they take
because it is the most efficient way for shores to dissipate the energy
of waves.
Analogously, perhaps "waves of traffic" generated by traffic
characteristics and congestion control mechanisms may be seen to
attack the network infrastructure and generate power-law/multifractal
behavior as an, or possibly the most, effective way to handle the load.
Presumably the building materials used in the "average" natural shoreline
are not subject to replacement. The Internet, however, can be
modified in a number of ways, one of which is the operation of
its dominant protocols. It might be possible, for example, to redesign
protocols to increase efficiency, and it seems reasonable to assume
that doing so would change the traffic distribution.
It would be interesting to see if more efficient protocols maintain
the power-law behavior observed using the Netometer stats, and whether,
in fact, it would approach infinitely scaled 1/f behavior more closely.
If so, such distributions might then become a design goal for
TCP and/or network designers.
What about self-organized criticality (SOC)?
It is also not clear to this author what it would mean for
the Internet to organize itself for critical point behavior.
Would we see the characteristic slide shape, and DFA slopes
very close to 1.0, through all (or most?) loading conditions?
If so, the data archives herein discussed contradict the notion.
For example, we do not see the 1/f log-log slide for most periods
of high load, and DFA results are seldom within 5% of 1.0.
However, log-log graphs of traffic between 1 and 5am show the steep
slide pattern (and high correlations) through aggregations of 500.
So, perhaps at relatively low loads, the Internet adjusts itself so
as to remain around a critical point. This might be explained by the
way TCP increases transmission window sizes "until something breaks,"
and then backs off. So, no matter how lightly loaded the network,
TCP will always push to maximize link utilization, and tend to
produce bursty traffic in the process.
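That "push until something breaks" behavior can be caricatured with a tiny
additive-increase/multiplicative-decrease simulation; the capacity and loss rule
below are arbitrary assumptions, not a model of the KU link:

    import numpy as np

    def aimd(steps=500, capacity=100.0, seed=3):
        """Toy additive-increase / multiplicative-decrease congestion window."""
        rng = np.random.default_rng(seed)
        window, history = 1.0, []
        for _ in range(steps):
            history.append(window)
            if window > capacity and rng.random() < 0.5:   # "something breaks": a loss
                window /= 2.0                              # back off
            else:
                window += 1.0                              # keep pushing
        return np.array(history)

    # The resulting sawtooth is one source of the burstiness discussed above.
    traffic = aimd()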
The log-log
frequency graph of unaggregated 2002 mean ping times is interesting in
this regard. It looks like two log-log charts of systems in their
critical regions superimposed upon one another and separated by the 90 ms
valley.
(In fact, this pattern, or a variation of it, persists through
aggregation by 1000 observations,
and is echoed by the mean maximum and mean minimum ping time frequency charts.)
That is, it appears that the rounded curve shown in high-traffic graphs
of some years, as well as the October business results, has become
another downward slide, possibly reflecting criticality (and possibly
scalable to the low-traffic graph?).
If so, then the network of 2002 may have the flexibility to adjust
itself to the higher load following a brief transition?
To partially isolate the two regimes, charts have been prepared for the 2002
data limited to:
It would be better to compare a chart of business hours during business
days with a chart of low-use days during off-peak hours, but those are
quite time-consuming to produce.
Access to data
The raw data and various programs to manipulate these records can
be accessed directly.
For records of past network performance, see
the Netometer
Archives, look at
network ping behavior in "phase space", and/or look at
Netometer frequency data, either
raw or
log-linear or
log-log. The frequency and log-linear charts were constructed to
help identify exponential distributions in the data. They both include
a regression on the log of the frequencies along with the associated
correlation coefficient.
The log-log charts were constructed to help identify
power-law traffic patterns in the Netometer data, and have been
more successful (or at least more interesting), but not as useful
as the
Detrended Fluctuation Analysis (DFA) charts.
(Postscript: A
facility to estimate the Hurst parameter was added during 2004.)
(You can also
plot sample functions in phase and iterative spaces to help understand
the nature of chaotic behavior that might emerge within the Netometer data.)
You can also look at the
Webometer
for information about Web download times, and at the
Webometer
Archives, and plot
log-log graphs of Webometer archive data,
Webometer data against Netometer data, and/or
DFA results for Webometer records.
Some useful datasets:
Comments?
For more information about this material, please contact Michael Grobe
at grobe@ku.edu. Comments and
interpretations of this data are welcome. Much confusion remains to
be resolved.
Credits
I have been working with this data off and on for several years, and
was inspired to search for power law behavior within the Netometer records
by Mark Ward's Beyond Chaos, which gave me a survey of the field
and pointers to researchers and research papers, many of which are
available via the Web and have been linked in to this page.
(Postscript: Garnett Williams's Chaos Theory Tamed motivated and
informed my understanding and use of autocorrelation, correlograms, and
difference graphs in search of "chaotic behavior." IMHO, the book
displays a wonderful didactic approach.
My understanding of the Hurst exponent came from reading The (Mis)
Behavior of Markets, by Benoit Mandelbrot.
In fact, this book helped me finally weave together an intuitive sense
for some of the basic ideas in fractal geometry and dynamic processes,
aka fractals and "temporal fractals".)
Michael Grobe
Academic Computing Services
The University of Kansas
January 14, 2003 through
March 8, 2003
February 2004 through August 2004
Personal notes for later work:
- is aggregation as described above a form of "scaling" or a form
of "normalization"....or both?
- does aggregation "lead towards the critical point?" (Stanley?)
that is, does it make graphs look more like they are showing
critical phenomena than they are? apparently that can happen with
some normalization functions.
- as more UDP traffic appears on the net (with increase in media
streaming), will the system become less "connected," since
UDP does not have reliable transport? does UDP prevent "clutter up" or
emphasize power-law behavior? should UDP be modified to include more
powerful congestion control? does it degrade overall network operation?