common crawl
Ce Zhang authored and Ce Zhang committed Feb 1, 2015
1 parent 7595303 commit d2fb636
Showing 2 changed files with 65 additions and 1 deletion.
65 changes: 65 additions & 0 deletions doc/doc/opendata/index.md
@@ -466,6 +466,71 @@ indexed by Google Patents in Jan 2015.
Information obtained at [Jan 27, 2015](http://en.wikipedia.org/wiki/Google_Patents).
</i>


## CCRAWL (CommonCrawl)

<div class="panel panel-default" style="position:relative;">
<div style="position:absolute; left: -20px; top: -30px;">
<img src="/images/coming_soon.png" style="width:200px;">
</div>
<div class="panel-heading">Quick Statistics & Downloads</div>
<table class="table">
<tr>
<th> Pipeline </th>
<td colspan=3>

<span class="label label-info">HTML</span>
<span class="label label-warning">&gt;</span>
<span class="label label-info">STRIP (html2text)</span>
<span class="label label-warning">&gt;</span>
<span class="label label-info">NLP (Stanford CoreNLP)</span>

</td>
</tr>
<tr>
<th> Size </th> <td> - </td>
<th> Document Type </th> <td> Government Document </td>
</tr>
<tr>
<th> # Documents </th> <td> - </td>
<th> # Machine Hours </th> <td> - </td>
</tr>
<tr>
<th> # Sentences </th> <td> - </td>
<th> # Words </th> <td> - </td>
</tr>
<tr>
<th> Downloads </th> <td colspan="3">
<div class="btn-group" style="text-align: right;">
<button type="button" class="btn btn-primary dropdown-toggle disabled" data-toggle="dropdown" aria-expanded="false"> Download Full Corpus <span class="caret"></span>
</button>
<ul class="dropdown-menu" role="menu">
<li><a href="#">DeepDive-ready DB Dump</a></li>
<li><a href="#">CoNLL-format Markups</a></li>
</ul>
</div>
<div class="btn-group" style="text-align: right;">
<button type="button" class="btn btn-primary dropdown-toggle disabled" data-toggle="dropdown" aria-expanded="false"> Download Small Teaser <span class="caret"></span>
</button>
<ul class="dropdown-menu" role="menu">
<li><a href="#">DeepDive-ready DB Dump</a></li>
<li><a href="#">CoNLL-format Markups</a></li>
</ul>
</div>
</td>
</tr>
</table>
</div>

We plan for DeepDive's CCRAWL corpus to cover a full
snapshot of the [Common Crawl Corpus](http://commoncrawl.org/), a corpus of web crawl data composed of over 5 billion web pages.


<i>
<img src="https://commoncrawl.atlassian.net/wiki/download/attachments/655375/CRWL?version=1&modificationDate=1341953825985&api=v2" style="width:88px;"/> This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
</i>
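
The pipeline listed in the panel above (HTML, then STRIP, then NLP) can be illustrated with a minimal sketch. This is not DeepDive's actual code: it assumes Python's standard-library `html.parser` as a stand-in for html2text and a naive regex sentence splitter as a placeholder for Stanford CoreNLP. The names `TextExtractor`, `strip_html`, `split_sentences`, and `run_pipeline` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>.

    Hypothetical stand-in for the STRIP (html2text) stage.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """STRIP stage: reduce an HTML page to whitespace-normalized plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    """NLP stage placeholder: split after '.', '!' or '?' followed by whitespace.

    A real run would call Stanford CoreNLP here; this regex is only an assumption
    made to keep the sketch self-contained.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def run_pipeline(html):
    """Run HTML > STRIP > NLP and return one whitespace-token list per sentence."""
    return [sentence.split() for sentence in split_sentences(strip_html(html))]


if __name__ == "__main__":
    page = "<html><body><p>Hello world.</p><p>This is a test.</p></body></html>"
    for tokens in run_pipeline(page):
        print(tokens)
```

Joining text fragments with a space keeps words from adjacent block elements apart, at the cost of occasionally separating punctuation that follows an inline tag; a dedicated converter such as html2text handles those cases more carefully.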


## More Datasets Are Coming -- Stay Tuned!

We are currently working hard to bring more (10+!) datasets
1 change: 0 additions & 1 deletion doc/index.md
@@ -101,7 +101,6 @@ projects.
DeepDive is a project led by [Christopher
Ré](http://cs.stanford.edu/people/chrismre/) at Stanford University. Current
group members include: [Michael Cafarella](http://web.eecs.umich.edu/~michjc/),
[Matteo Riondato](http://cs.brown.edu/~matteo/),
Amir Abbas Sadeghian, [Zifei Shan](http://www.zifeishan.org/),
Jaeho Shin, Feiran Wang, [Sen Wu](http://stanford.edu/~senwu/), and [Ce
Zhang](http://pages.cs.wisc.edu/~czhang/).