Statistics of Networks
Thursday 24th June 2010 to Friday 25th June 2010
08:50 to 09:00  Opening Remarks  INI 1  
09:00 to 09:45 
H Shen ([North Carolina]) Robust Estimation of the Selfsimilarity Parameter in Network Traffic
We shall discuss the problem of estimating the selfsimilarity parameter of network traffic traces.
Robust estimators will be reviewed that are less sensitive to commonly encountered nonstationary
traffic conditions, such as sudden level shifts and breaks. Numerical results from simulated experiments
and real trace applications will be used for illustration.

INI 1  
09:45 to 10:30 
M Handcock ([UCLA]) Modeling networks when data is missing or sampled
Network models are widely used to represent relational information among interacting units and the structural implications of these relations. Recently, social network studies have focused a great deal of attention on random graph models of networks whose nodes represent individual social actors and whose edges represent a specified relationship between the actors.
Most inference for social network models assumes that the presence or absence of all possible links is observed, that the information is completely reliable, and that there are no measurement (e.g. recording) errors. This is clearly not true in practice, as much network data is collected though sample surveys. In addition even if a census of a population is attempted, individuals and links between individuals are missed (i.e., do not appear in the recorded data).
In this paper we develop the conceptual and computational theory for inference based on sampled network information. We first review forms of network sampling designs used in practice. We consider inference from the likelihood framework, and develop a typology of network data that reflects their treatment within this frame. We then develop inference for social network models based on information from adaptive network designs.
We motivate and illustrate these ideas by analyzing the effect of linktracing sampling designs on a collaboration network.
This is joint work with Krista J. Gile, Nuffield College, Oxford.

INI 1  
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
M Coates ([McGill]) The Value of Clustering for Distributing Content in Mobile Social Networks
In several situations it is desirable to distribute content to all users in a wireless network. Examples
include the dissemination of software and firmware updates, pushbased videoondemand systems,
and news feeds. When the content is not extremely timesensitive, there is an opportunity to
reduce the consumption of serverclient bandwidth by exploiting local connections between users.
Rather than transmitting the entire content to each user individually, the content is divided between
multiple users. When users meet they exchange data, with the result that the content percolates
through the network and eventually each user has the entire content. As a network designer, we
have no control over the contacts between users, which are dictated by social interactions and user
behaviour. We can model the contact events as a random process and, making certain assumptions
about the behaviour of this process, design procedures to optimally distribute the content among
the users so that it spreads through the network as quickly as possible. In this talk, I will describe a
content distribution procedure based on clustering the users into groups using a network clustering
algorithm that strives to maximize the conductance of each cluster. The procedure is suboptimal,
but the performance deterioration is relatively small, and there are significant improvements in
terms of estimation overhead and robustness.

INI 1  
11:45 to 12:30 
M Crovella ([Boston]) Inferring Invisible Traffic
A traffic matrix encompassing the entire Internet would be very valuable. Unfortunately, from any
given vantage point in the network, most traffic is invisible. In this talk I will describe results
that hold some promise for this problem. First, I will show a new characterization result: traffic
matrices (TMs) typically show very low effective rank. This result refers to TMs that are purely
spatial (have no temporal component), over a wide range of spatial granularities. Next, I will define
an inference problem whose solution allows one to infer invisible TM elements. This problem relies
crucially on an atomicity property that I will define. Finally, I will show example solutions of
this inference problem via two different methods: subset regression and matrix completion. The
example consists of an AS inferring the amount of invisible traffic passing between other pairs of
ASes. Using this example I will illustrate the accuracy of the methods as a function of spatial
granularity.
This work is joint with Lokesh Setia, Vineet Bharti, Pankaj Setia, Gonca Gursun, and Anukool
Lakhina.

INI 1  
12:30 to 13:30  Lunch at Wolfson Court  
14:00 to 14:45 
M Newman ([Michigan]) Community Structure and Link Prediction in Networks
We investigate two questions of broad interest in the study of networks: the discovery of “communities”
(regions of unusually high density within networks) and link prediction (judgements about
which links in a network are mostly likely to be either false negatives or false positives). We approach
both problems using likelihoodbased fits to appropriate random graph models, including
stochastic block models, edgeplacement models, and hierarchical models. The overall conclusion is
that these models appear to work well in general, giving useful answers to the questions of interest,
but that there are a number of substantive issues, about which our understanding is currently poor,
that prevent their wider application.

INI 1  
14:45 to 15:30 
MW Mahoney ([Stanford]) Community Structure in Large Social and Information Networks
The concept of a community is central to social network analysis, and thus a large body of work has
been devoted to identifying community structure. For example, a community may be thought of as
a set of web pages on related topics, a set of people who share common interests, or more generally
as a set of nodes in a network more similar amongst themselves than with the remainder of the
network. Motivated by difficulties we experienced at actually finding meaningful communities in
very large realworld networks, we have performed a large scale analysis of a wide range of social and
information networks. Our main methodology used local spectral methods and involved computing
isoperimetric properties of the networks at various size scales—a novel application of ideas from
statistics and scientific computation to internet data analysis, which required the development of
new algorithmic tools, as well as the reinterpretation of the statistical basis underlying traditional
spectral approximation algorithms.
Our empirical results suggest a significantly more refined picture of community structure than has
been appreciated previously. Our most striking finding is that in nearly every network dataset we
examined, we observe tight but almost trivial communities at very small size scales, and at larger
size scales, the best possible communities gradually “blend in” with the rest of the network and thus
become less “communitylike.” This behavior is not explained, even at a qualitative level, by any of
the commonlyused network generation models. Moreover, this behavior is exactly the opposite of
what one would expect based on experience with and intuition from expander graphs, from graphs
that are wellembeddable in a lowdimensional structure, and from small social networks that
have served as testbeds of community detection algorithms. Possible mechanisms for reproducing
our empirical observations will be discussed, as will implications of these findings for clustering,
classification, and more general data analysis in modern large social and information networks.

INI 1  
15:30 to 16:00  Afternoon Tea  
16:00 to 16:45 
N Duffield ([AT&T]) New Methods for Sampling and Estimation in Communications Networks
The scale of communications networks, and the immense amount of operational measurements
they make available, and the large numerical ranges of data values, together present formidable
challenges for data analysis, reduction, and fusion. This talk describes some new sampling based
approaches to these problems.

INI 1  
16:45 to 18:15  Poster Session  
18:45 to 19:30  Dinner at Wolfson Court 
09:00 to 09:45 
P Bühlmann (ETH Zürich) Sparse Graphs and Causal Inference
Understanding causeeffect relationships between variables is of interest in many fields of science.
To effectively address such questions, we need to look beyond the framework of variable selection
or importance from models describing associations only. We will show how graphical modeling and
intervention calculus can be used for quantifying intervention and causal effects, particularly for
highdimensional, sparse settings where the number of variables can greatly exceed sample size.
Besides methodology and theory we illustrate some findings on gene intervention effects (of single
gene deletions) in yeast.

INI 1  
09:45 to 10:30 
M Maggioni ([Duke]) Multiscale Methods for the Analysis of Dynamic Graphs
Dynamic graphs arise in a variety of realworld situations: from social networks, to engineered
physical networks, to graphs associated with data sets (e.g. financial transactions) that vary in
time. The challenges are the need to develop robust tools and metrics for comparing graphs at different
times, in order to model statistical significant changes, and capture anomalies: in realworld
situation a graph/network will vary stochastically in time with vertex/edge additions/deletions,
and classical tools such as graph isomorphism are not robust enough to handle such changes. We
use multiscale decompositions of graph and random walks at multiple scales to introduce metrics
of change (in time) of a graph, that allow use to capture changes of different magnitude at different
scales and “locations” on the graph. We apply these techniques to synthetic graphs as well as real
world data sets, and discuss strengths and weaknesses of this approach.

INI 1  
10:30 to 11:00  Morning Coffee  
11:00 to 11:45 
W Willinger ([AT&T]) When Everything Looks Like a Nail: Graph Models of the Internet
The general appeal of abstracting realworld networks to simple graphs is understandable and has
been partly responsible for fueling the new field of “network science”. However, as the Internet
application has demonstrated, such abstractions that ignore much of what engineers consider as
critical come at a price. For example, they can lead to the study of graph models that have little
in common with the realworld networks that motivated these models in the first place. In turn,
they tend to focus on cyberthreat models that are largely of academic interest because they are
incapable of dealing with the most potent and potentially lethal realworld threats.
Fortunately, the Internet application also suggests an alternative and more engineeringbased approach to the “art” of abstracting realworld networks. This approach emphasizes network function over network structure and requires (1) a basic understanding of the process by which network connectivity measurements are obtained, (2) a careful assessment of the kind of inferences that the available measurements can and cannot support, and (3) a break with tradition when it comes to network modeling and model validation. I will illustrate with specific Internetrelated examples that ignoring any of these issues is bound to lead to specious network models or alluring results that quickly collapse when scrutinized with real data or examined by domain experts. 
INI 1  
11:45 to 12:30 
E Volz ([Michigan]) Diffusion in Networks and Infectious Disease Epidemics
I review several special cases of epidemics spreading through random networks which reduce to
simple solutions based on ordinary differential equations. This reveals a link between traditional
mass action models of epidemics in which contacts are instantaneous and uncorrelated, and networks
which have dynamically rearranging ties. I then present several applications to HIV and
Influenza. A simple modification to our equations allows us to model serosorting of HIV positive
individuals, which is the tendency to rearrange relationships to those with matching infection status.
Other extensions to the model are explored, including diffusion in networks with clustering
(transitivity), and heterogeneous susceptibility and infectiousness. We find that ODE approximations
using probability generating functions are more precise and computationally tractable than
corresponding systems based on pair approximation.

INI 1  
12:30 to 13:30  Lunch at Wolfson Court  
14:00 to 14:45 
A Feldmann (Technische Universität Berlin) An Opportunity for ISP and Application Collaboration
Our analysis of residential broadband Internet traffic reveals a number of surprises in terms of the
mental models we developed from the measurement literature. For example, we find that HTTP—
not peertopeer—traffic dominates by a significant margin; that more often than not the home
user’s immediate ISP connectivity contributes more to the roundtrip times the user experiences
than the WAN portion of the path; and that the DSL lines are frequently not the bottleneck in
bulktransfer performance.
Moreover, we note that applications often put their own traffic optimization strategies in place.
Among the popular examples are Content Distribution Networks, large content providers with
multiple data centers around the world, as well as P2P networks. However, these applications rarely
respect or even consider the underlying network. To overcome this, we suggest that applications
and ISPs should collaborate. We propose and evaluate the feasibility of a solution using as ISP
offered information service.
16

INI 1  
14:45 to 15:30 
M Roughan ([Adelaide]) Statistically Accurate Network Measurements
Estimations of Internet loss rates from measurements often have large statistical errors. Any meaningful
application of Internet loss measurement therefore needs to incorporate these uncertainties
in the results. However, estimating measurement errors is difficult as it requires the knowledge of
the loss process itself. Many experiments use a simple and unrealistic Bernoulli model for the loss
process and severely underestimate these errors.
Here we develop SAIL a measurement tool that can accurately estimate not only the loss rate of
an endtoend path but also its statistical errors. The key idea in our method is to capture the
correlation between packet losses using an ON/OFF renewal model and to use a Hidden Semi
Markov Model algorithm to infer the parameters of the ON and OFF periods from measurement
data. Once the model parameters are known, statistical properties of the loss process such as the
loss rate, its variance, and the distribution of the loss and noloss bursts can be easily computed.
This is cowork with Hung Nguyen.

INI 1  
15:30 to 16:00  Afternoon Tea  
16:00 to 16:45 
E Kolaczyk ([Boston]) (Anti)social Behavior in Malicious Internet Source IPs: Characterisation and Detection
We consider the problem of monitoring Internet traffic at the IP address level, for the purpose of
identifying malicious source IPs. This problem is highly challenging, due to such diverse factors as
data volume, limited measurement vantage, sampling effects, and user privacy concerns. Moreover,
efforts typically are made for traffic from the very IP addresses we seek to detect to blend in with
the rest of (normal) traffic. In this talk, we present work characterising the traffic behavior of IP
source addresses from a social network perspective and exploiting our characterizations to build
simple but effective detection tools. Specifically, we analyze network flow data, collected on a major
Internet backbone network, in combination with log records from Internet security programs, using
both local and global network representations and network analysis tools. Our findings are twofold.
First, we show that malicious source nodes in IP traffic are distinctive in their communication
behavior, in that they “interact” with other nodes without substantively ‘participating’ in the
natural communities within which the latter exist. Second, we demonstrate that, with appropriate
social network analysis tools, this behavior can be exploited in developing detection algorithms.
This is joint work with Qi Ding, Natallia Katenka, Paul Barford, and Mark Crovella.

INI 1  
16:45 to 17:30 
P Thiran ([EPFL]) Locating Congested Links in the Internet with Unicast Probes
How can we locate congested IP (Internet Protocol) links from endtoend measurements using
active probing mechanisms? For practical reasons, we want to avoid using IP multicast protocols,
which are not widely deployed, and to avoid relying on tight temporal synchronization between
beaconing nodes, which is difficult to achieve between distant sites. Like other problems in network
tomography or traffic matrix estimation, this inverse problem is illconditioned: the endofend
measurement outcomes do not allow to uniquely identify the variables representing the status of the
IP links. To overcome this critical problem, current methods often use the unrealistic assumption
that all IP links have the same prior probability of being congested. We find that this assumption
is not needed: spatial correlations are sufficient to either learn these probabilities, or to identify
the variances of the link loss rates. We can then use the learned probabilities or variances as priors
to find rapidly the congested links at any time, with an order of magnitude gain in accuracy over
existing algorithms. These solutions scale well and are therefore applicable in today’s Internet, as
shown by the results obtained both by simulation and real implementation using the PlanetLab
network over the Internet.
This joint work with Hung X. Nguyen (Univ. of Adelaide), Denisa Ghita, Katerina Argyrak, and
Maciej Kurant (EPFL).

INI 1  
18:45 to 19:30  Dinner at Wolfson Court 