counting_subgraphs.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <link href="https://adriann.github.io/feed.rss" rel="alternate" type="application/rss+xml" title="What's new on adriann.github.io" />
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <meta name="generator" content="pandoc" />
  <meta name="author" content="Adrian Neumann (adrian_neumann@gmx.de)" />
  <title>Counting Subgraphs in Streams</title>
  <style type="text/css">
  .displayequation{margin-left:auto;margin-right:auto;}
  </style>
  <style>
  .caption{font-size:66%;text-align:right;}
  .figure{float:right;padding-bottom:1em;padding-left:1em;}
  .figure>img{display:block;margin:0 auto;}
  .footnotes{font-size:80%;}
  .block{border-left:1ex solid gray;padding-left:2em;}
  li{padding:0.25em;}
  a:hover{text-shadow: 0 0 5px;}
  body{font-family:sans-serif;max-width:100ex;padding-left:3em;padding-right:2em;}
  code{font-family:Consolas, Inconsolata, Monaco, monospace;}
  p{text-align:justify;}
  </style>
</head>
<body>
<div id="header">
<h1 class="title">Counting Subgraphs in Streams</h1>
</div>
<div class="figure">
<img src="pictures/stream.jpg" title="CC-BY-NC &#39;Streaming&#39; by Flickr user Dru!" />

</div>
<p>Suppose you have a very large graph, for example a protein-protein interaction graph for some complicated synthesis pathway. To draw conclusions from such a wealth of data, it is typically helpful to look for certain patterns, or subgraphs.</p>
<p>Finding, or indeed even counting, small sized subgraphs in a large graph seems to be a very hard problem: The number of candidate positions against which you need to match your subgraph explodes com­bi­na­to­ri­cal­ly. The situation is even more hopeless if the data doesn’t fit into main memory anymore.</p>
<p>To address the problem of working with graphs that don’t fit into RAM the streaming model was introduced. Here the algorithm has a very limited amount of storage, say O(<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>log</mo><mi>n</mi></mrow><annotation encoding="application/x-tex">\log n</annotation></semantics></math>) bits, and the in­put is pre­sent­ed as a stream of edges. The algorithm is only allowed to make a small number of passes over the data.</p>
<p>Only few problems can be solved exactly in this model, but often it is possible to find good approximation algorithms. In this post we will have a look at how to count triangles in the streaming model. Very similar techniques can be used to count arbitrary constant size subgraphs.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
<p>Triangle counting already is a relevant problem in itself. It has applications for example in community detection in graphs. There the clustering coefficient, i.e. how many of a node’s neighbours are adjacent to each other (and thus form triangles), plays a central role. Indeed there are several papers <a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> on counting triangles using MapReduce as a way to tackle the large networks we face today. The streaming algorithm I’ll present here is a simple alternative to this.</p>
<!-- more -->
<h2 id="the-algorithm">The Algorithm</h2>
<p>This algorithm is from a paper by Jowhari and Ghodsi <a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a>. It is extremely simple. We need a source of sufficiently independent random bits, say an explicit polynomial generator of degree 12. That is a polynomial <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>p</mi><annotation encoding="application/x-tex">p</annotation></semantics></math> over some sufficiently large field with random coefficients. The random numbers are then <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi><mo stretchy="false" form="prefix">(</mo><mn>1</mn><mo stretchy="false" form="postfix">)</mo><mo>,</mo><mi>p</mi><mo stretchy="false" form="prefix">(</mo><mn>2</mn><mo stretchy="false" form="postfix">)</mo><mo>,</mo><mi>…</mi></mrow><annotation encoding="application/x-tex">p(1),p(2),\ldots</annotation></semantics></math> and we use their binary representation for a stream of random numbers <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mn>1</mn><mo stretchy="false" form="postfix">]</mo><mo>,</mo><mi>…</mi><mo>,</mo><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>n</mi><mo stretchy="false" form="postfix">]</mo></mrow><annotation encoding="application/x-tex">b[1],\ldots, b[n]</annotation></semantics></math>, with <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>i</mi><mo stretchy="false" form="postfix">]</mo><mo>∈</mo><mo stretchy="false" form="prefix">{</mo><mo>−</mo><mn>1</mn><mo>,</mo><mn>1</mn><mo stretchy="false" form="postfix">}</mo></mrow><annotation encoding="application/x-tex">b[i] \in \{-1,1\}</annotation></semantics></math>. Note that we can encode this generator using only a logarithmic number of bits for the coefficients.</p>
<p>As we see the edges <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo stretchy="false" form="prefix">(</mo><mi>u</mi><mo>,</mo><mi>v</mi><mo stretchy="false" form="postfix">)</mo><mo>,</mo><mo stretchy="false" form="prefix">(</mo><mi>x</mi><mo>,</mo><mi>y</mi><mo stretchy="false" form="postfix">)</mo><mo>,</mo><mi>…</mi></mrow><annotation encoding="application/x-tex">(u,v),(x,y),\ldots</annotation></semantics></math> we sum up <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>u</mi><mo stretchy="false" form="postfix">]</mo><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>v</mi><mo stretchy="false" form="postfix">]</mo></mrow><annotation encoding="application/x-tex">b[u]b[v]</annotation></semantics></math>, that is the algorithm computes</p>
<p><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Z</mi><mo>=</mo><munder><mo>∑</mo><mrow><mo stretchy="false" form="prefix">(</mo><mi>u</mi><mo>,</mo><mi>v</mi><mo stretchy="false" form="postfix">)</mo><mo>∈</mo><mi>E</mi></mrow></munder><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>u</mi><mo stretchy="false" form="postfix">]</mo><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>v</mi><mo stretchy="false" form="postfix">]</mo></mrow><annotation encoding="application/x-tex">Z = \sum_{(u,v)\in E} b[u]b[v]</annotation></semantics></math></p>
<p>The output is then</p>
<p><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mn>1</mn><mn>6</mn></mfrac><msup><mi>Z</mi><mn>3</mn></msup><mi>.</mi></mrow><annotation encoding="application/x-tex">\frac{1}{6} Z^3.</annotation></semantics></math></p>
<p>That’s it. Algorithm over. This might seem too simple to be correct, but it actually works. Observe that we output</p>
<p><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mn>1</mn><mn>6</mn></mfrac><munder><mo>∑</mo><mrow><mo stretchy="false" form="prefix">(</mo><mi>u</mi><mo>,</mo><mi>v</mi><mo stretchy="false" form="postfix">)</mo><mo>,</mo><mo stretchy="false" form="prefix">(</mo><mi>w</mi><mo>,</mo><mi>x</mi><mo stretchy="false" form="postfix">)</mo><mo>,</mo><mo stretchy="false" form="prefix">(</mo><mi>y</mi><mo>,</mo><mi>z</mi><mo stretchy="false" form="postfix">)</mo><mo>∈</mo><mi>E</mi></mrow></munder><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>u</mi><mo stretchy="false" form="postfix">]</mo><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>v</mi><mo stretchy="false" form="postfix">]</mo><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>w</mi><mo stretchy="false" form="postfix">]</mo><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>x</mi><mo stretchy="false" form="postfix">]</mo><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>y</mi><mo stretchy="false" form="postfix">]</mo><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>z</mi><mo stretchy="false" form="postfix">]</mo></mrow><annotation encoding="application/x-tex">\frac 16 \sum_{(u,v), (w,x), (y,z) \in E} b[u]b[v]b[w]b[x]b[y]b[z]</annotation></semantics></math></p>
<p>Since <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>u</mi><msup><mo stretchy="false" form="postfix">]</mo><mi>k</mi></msup></mrow><annotation encoding="application/x-tex">b[u]^k</annotation></semantics></math> is zero in expectation for odd <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>k</mi><annotation encoding="application/x-tex">k</annotation></semantics></math>, only terms with even powers count (in expectation) in this sum.<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a> Hence only if <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>u</mi><mo>=</mo><mi>z</mi></mrow><annotation encoding="application/x-tex">u=z</annotation></semantics></math>, <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>v</mi><mo>=</mo><mi>w</mi></mrow><annotation encoding="application/x-tex">v=w</annotation></semantics></math>, and <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi><mo>=</mo><mi>y</mi></mrow><annotation encoding="application/x-tex">x=y</annotation></semantics></math> we count this term. This is exactly true if the three edges form a triangle! Of course we over count, since there are <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>3</mn><mo>*</mo><mn>2</mn><mo>*</mo><mn>1</mn><mo>=</mo><mn>6</mn></mrow><annotation encoding="application/x-tex">3*2*1=6</annotation></semantics></math> permutations of the three edges and they all occur in the sum, but we simply divide by six and get the right thing.</p>
<p>Unfortunately this is only true in expectation, the actual value can differ quite a bit. But it is not too difficult to bound the variance of <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>Z</mi><annotation encoding="application/x-tex">Z</annotation></semantics></math> and hence get a good estimate of how often we need to run this with fresh random variables to get a good estimate. Have a look at the paper to get the actual numbers. In the streaming model we can simply run these instances in parallel.</p>
<p>In did a few experiments and it seems like the high independence that we require in the analysis is actually not so important for random graphs. My results indicate that a normal linear congruential generator already provides sufficient randomness to get good estimates, so the algorithm is really trivial to implement.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Kane, Mehlhorn, Sauerwald, Sun: <a href="http://www.mpi-inf.mpg.de/~hsun/SGC12.pdf">Counting Arbitrary Subgraphs in Data Streams</a><a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Suri, Vassilvitskii: Counting Triangles and the Curse of the Last Reducer,<br />
Pagh, Tsourarakis: Colorful triangle counting and a mapreduce implementation<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Jowhari, Ghodsi: <a href="http://sina.sharif.ac.ir/~ghodsi/papers/jowhari-cocoon2005.pdf">New Streaming Algorithms for Counting Triangles in Graphs</a><a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>If you ask yourself why we can take the expectation of the individual <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>b</mi><mo stretchy="false" form="prefix">[</mo><mi>u</mi><mo stretchy="false" form="postfix">]</mo></mrow><annotation encoding="application/x-tex">b[u]</annotation></semantics></math>, even though they appear in a product, this is because they are sufficiently independent. For the expectation calculation six-way independent random numbers would be sufficient (there are six terms in the product), but for the variance calculation that I don’t show here we need <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>E</mi><mo stretchy="false" form="prefix">(</mo><mi>Z</mi><msup><mo stretchy="false" form="postfix">)</mo><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">E(Z)^2</annotation></semantics></math>, hence 12-way independence.<a href="#fnref4">↩</a></p></li>
</ol>
</div>
<hr/>
<div style="display:inline-flex;flex-wrap:wrap;justify-content:space-between;font-size:80%">
<p style="margin-right:2ex">CC-BY-SA <a href="mailto:adrian_neumann@gmx.de">Adrian Neumann</a> (PGP Key <a href="https://adriann.github.io/ressources/pub.asc">A0A8BC98</a>)</p>
<p style="margin-left:1ex;margin-right:1ex"><a href="http://adriann.github.io">adriann.github.io</a></p>
<p style="margin-left:2ex"><a href="https://adriann.github.io/feed.rss">RSS</a></p>
</div>
</body>
</html>