---
layout: page
title: About
---
<h1>About</h1>

<div class="row first-row">
	<div class="col-sm-3 section-title">
		<h3>Motivation</h3>
	</div>
	<div class="col-sm-7">
		<p class="about-p">
		Synthesizing data from the published literature is critical to addressing a wide range of questions, from the <a href="http://science.sciencemag.org/content/321/5885/97" target="_blank">history</a> and <a href="http://www.nature.com/nature/journal/v471/n7336/full/nature09678.html" target="_blank">future</a> of global biodiversity to the <a href="http://www.nature.com/nature/journal/v523/n7560/full/nature14584.html" target="_blank">evolution of continental crust</a>. Doing so manually, however, can be prohibitively time-consuming, and it produces a monolithic database that is disconnected from its primary sources, which makes those sources difficult to cite fully.</p>
		<p class="about-p">We are building a scalable, dependable cyberinfrastructure to facilitate <a href="http://dx.doi.org/10.1371/journal.pone.0113523" target="_blank">new approaches</a> to the discovery, acquisition, utilization, and citation of data and knowledge in the published literature.
		</p>
		<p class="about-p">Formerly known as GeoDeepDive, xDD was renamed in order to clarify the core mission: We aim to enable e<u><b>x</b></u>traction of <u><b>d</b></u>ark <u><b>d</b></u>ata from scientific works which would otherwise remain unseen in the enormous volume of literature. The tools are agnostic to discipline and application framework, despite the implication of the prior "Geo" and "DeepDive" nomenclature.
		</p>
	</div>
</div>

<hr>

<div class="row new-row">
	<div class="col-sm-3 section-title">
		<h3>Overview</h3>
	</div>
	<div class="col-sm-7">
		<p class="about-p">
			This project was originally supported by the U.S. National Science Foundation <a href="http://earthcube.org" target="_blank">EarthCube building block</a> project (NSF ICER 1343760). Ongoing critical support is provided by <a href="https://www.darpa.mil/program/automating-scientific-knowledge-extraction">DARPA - ASKEM HR00112220037</a>, and additional support is provided by the DOE and USGS. A common objective of all of these projects is to build a cyberinfrastructure capable of supporting end-to-end text and data mining (TDM) and knowledge base creation and augmentation in any domain of science or scholarship. The xDD infrastructure includes the following key components:
			<ul class="about-p">
				<li>Automated, rate-controlled and authenticated original document fetching</li>
				<li>Secure original document storage and bibliographic/source metadata management</li>
				<li><a href="https://xdd.wisc.edu/api" target="_blank">API</a> for basic full-text search and discovery capabilities</li>
				<li>Ability to pre-index content using external dictionaries (e.g., <a href="https://macrostrat.org/api/v2/defs/lithologies?all" target="_blank">Macrostrat lithologies</a>)</li>
				<li>Ability to generate fully documented, bibliographically complete testing and development datasets based on user-supplied terms</li>
				<li>Capacity to support the deployment of user-developed <a href="http://www.nature.com/news/computers-read-the-fossil-record-1.17868" target="_blank">TDM applications</a> across the full corpus, with on-demand updates as new relevant documents are acquired</li>
			</ul>
		</p>
	</div>
</div>

<hr>

<div class="row new-row">
	<div class="col-sm-3 section-title">
		<h3>Current Partners</h3>
	</div>
	<div class="col-sm-7">
		<p>
			<a href="https://www.elsevier.com" target="_blank">
				<img src="assets/images/logos/elsevier.jpeg" alt="elsevier" class="partner-small"/>
			</a>
			<a href="http://www.wiley.com" target="_blank">
				<img src="assets/images/logos/wiley_wordmark.png" alt="wiley" class="partner"/>
			</a>
			<a href="https://sites.agu.org" target="_blank">
				<img src="assets/images/logos/agu.png" alt="agu" class="partner"/>
			</a>
			<a href="http://www.sepm.org" target="_blank">
				<img src="assets/images/logos/sepm.jpg" alt="SEPM" class="partner-small"/>
			</a>
			<a href="http://www.usgs.gov" target="_blank">
				<img src="assets/images/logos/usgs.jpg" alt="USGS" class="partner"/>
			</a>
			<a href="http://www.geosociety.org/index.htm" target="_blank">
				<img src="assets/images/logos/GSA.jpg" alt="GSA" class="partner-small"/>
			</a>
			<a href="http://www.cdnsciencepub.com" target="_blank">
				<img src="assets/images/logos/CSP.png" alt="CSP" class="partner"/>
			</a>
			<a href="http://taylorandfrancis.com" target="_blank">
				<img src="assets/images/logos/TF.jpg" alt="TF" class="partner tf"/>
			</a>
			<a href="https://www.paleosoc.org" target="_blank">
				<img src="assets/images/logos/PS.jpg" alt="PS" class="partner-small"/>
			</a>
			<a href="https://www.springernature.com/gp" target="_blank">
				<img src="assets/images/logos/springer.jpg" alt="SP" class="partner tf"/>
			</a>
		</p>
		<p>Check back often. We are actively seeking new partnerships and content providers.</p>
	</div>
</div>

<hr>

<div class="row new-row">
	<div class="col-sm-3 section-title">
		<h3>Tools</h3>
	</div>
	<div class="col-sm-7">
		<div class="row product">
			<div class="col-sm-5 center">
				<div class="aligner">
					<h1>NLP</h1>
				</div>
			</div>

			<div class="col-sm-7 center">
				<div class="aligner">
					<p>Natural language analysis is critical to a variety of data and information location and extraction tasks. We deploy NLP software packages, including <a href="http://stanfordnlp.github.io/CoreNLP/index.html" target="_blank">Stanford CoreNLP</a>, over our entire corpus, and we are always seeking to deploy leading new tools for named entity recognition and other NLP tasks. Our campus <a href="https://chtc.cs.wisc.edu" target="_blank">CHTC infrastructure</a> and xDD architecture enable us to rapidly deploy new tools and analyze the output for millions of documents.</p>
				</div>
			</div>
		</div>

		<div class="row product">
			<div class="col-sm-5 center">
				<div class="aligner">
					<h1>Embedding Models</h1>
				</div>
			</div>
			<div class="col-sm-7 center">
				<div class="aligner">
					<p>Text embeddings (the mathematical representation of words, phrases, or entire documents as numerical vectors) provide powerful summarizations of the linguistic context and relationships of text within a corpus. The power of embeddings is that they are a learned representation of text that can provide basic question-answering capabilities (i.e., similarity and analogy) and that is suitable as input for advanced machine learning approaches. We deploy a variety of embedding models over target document sets within xDD, providing a high-level, derived summary of the entirety of the corpus. Basic embedding results are available for document sets via the xDD API.</p>
				</div>
			</div>
		</div>

		<div class="row product">
			<div class="col-sm-5 center">
				<div class="aligner">
					<h1>COSMOS</h1>
				</div>
			</div>
			<div class="col-sm-7 center">
				<div class="aligner">
					<p>Scientific and other publications are made by humans for humans, and they are rich in figures, tables, equations, and other visual elements that are not captured by text-only analysis. Our team has developed <a href="https://cosmos.wisc.edu" target="_blank">COSMOS</a>, an AI-powered technical assistant that extracts and assimilates data from heterogeneous sources to accelerate analytics and knowledge discovery. COSMOS is a <a href="https://www.darpa.mil/program/automating-scientific-knowledge-extraction">DARPA ASKE</a> program innovation.</p>
				</div>
			</div>
		</div>

	</div>
</div>

<hr>

<div class="row new-row">
	<div class="col-sm-3 section-title">
		<h3>Infrastructure Schematic Overview</h3>
	</div>
	<div class="col-sm-7">
		 <p class="about-p">The following image provides a general overview of the xDD pipeline, with an emphasis on the COSMOS endpoint. The process starts on the left-hand side, with document fetching. Here, our secure servers fetch original documents from partner content providers. These documents, and their associated bibliographic and link-back metadata, are stored on secure servers that have restricted access. Each content provider has its own mechanism for providing documents, and our fetching system can be fine-tuned to accommodate the preferred rates specified by each provider, if applicable.</p>
		 <img src="assets/images/diagrams/xdd_overview.jpg" alt="xdd" class="about-img">
		 <p class="about-p">After obtaining original documents and associated bibliographic metadata, document data products are generated for key components of the xDD system, including the xDD API. Subsets of documents targeted towards specific research projects then pass into additional infrastructure endpoints, including the COSMOS AI technical assistant.</p>
	</div>
</div>

<hr>

<div class="row new-row">
	<div class="col-sm-3 section-title">
		<h3>End Use</h3>
	</div>
	<div class="col-sm-7">
		<p class="about-p">
			Every word and datum that can be derived from our infrastructure is fully traceable back to the original content provided by our partner publishers and organizations. Our terms of use require that users provide full citations and, when relevant, URL links back to all of the original works that contributed data to an application or result. The xDD infrastructure can also be cited, and we welcome new collaborations, both scientific and informatic.
		</p>
	</div>
</div>