Skip to content

Latest commit

 

History

History
366 lines (294 loc) · 19.5 KB

Developer-Overview.md

File metadata and controls

366 lines (294 loc) · 19.5 KB
layout title navigation_weight
default
Developer Overview
5

This is an introduction to some of the key concepts and abstractions used within the source code of the Ambra stack, intended for software developers who want to contribute or debug it. (It is not necessary to read this document if you only want to use or administer the system.)

The Ambra stack consists of several service-oriented components. You should follow the Getting Started Guide, both to set up your development system and to give yourself some rough familiarity with what the components are and what each one does.

Wombat

"Wombat" is the stack's display component. It is a Spring Web MVC application. It scoops up a small set of configuration data on start-up from the wombat.yaml file, which points to the system's theme directory. The theme's provide the bulk of the system's configuration values, as well as any custom front-end code and static content that the user plugs in.

Sites and Themes

A key abstraction in Wombat is the site. One instance of the Ambra stack can serve multiple logical sites, potentially representing different journals or publications. Each site can have a distinct set of front-end files, and can even be served on different domain names if the network infrastructure in front of them is set up correctly. But they have the advantage of being able to link to each other, overlapping in such areas as "related article" links and search results. Another application of the "site" abstraction is to have two front-end designs for the same publication, such as one desktop-styled site and one mobile-friendly site.

Each site is mapped to a single theme, which is a directory of files that provide the configuration and front-end code that the site will use. Themes reuse each other's code and content through an inheritance mechanism. An overview of theme inheritance is provided in the Getting Started Guide and on the "Working with PLOS's Themes" page.

In our Java code, the Site and Theme classes represent these two abstractions.

Note the relationship between a site and a journal. A journal may be represented by more than one site, as in the "desktop and mobile" example above. Suppose you have a pointer to a particular journal and want to link to a page on the journal's site. But which site, if the journal has more than one? In a case like this, the Theme.resolveForeignJournalKey Java method finds the site that matches the originating site. In code, this would look like:

Site targetSite = currentSite.getTheme().resolveForeignJournalKey(siteSet, targetJournalKey);

For example, if our currentSite is the MobilePlosMedicine site and the targetJournalKey identifies the journal PLOS Biology, this code would resolve to MobilePlosBiology as the targetSite. Otherwise, there would be no way to tell that we don't want DesktopPlosBiology instead.

Request mapping

Wombat customizes its sites' URLs by tacking some fairly heavy-duty extensions on top of Spring's default request handlers. Most of the gory framework details are contained in the SiteHandlerMapping, SiteRequestCondition and RequestMappingContext classes. In short, RequestMappingContext picks up data from the @RequestMapping annotations that appear on all the controller methods and SiteHandlerMapping exposes our custom logic to Spring.

The @RequestMapping annotations are a common sight in the controllers of any normal Spring Web MVC application, but, because of the extensions described above, we use them in a slightly special way. Each one must supply a name attribute (which ordinarily is optional), or else it will be impossible to link to it as described in the "Linking" section below.

Furthermore, the SiteRequestCondition class tampers with the path pattern in the @RequestMapping(value="...") attribute before it is passed to Spring. The mappings.yaml config file allows the user to override these values on a per-site basis at runtime. See the documentation on the root mappings.yaml file for details. Even if the user doesn't override the path, many sites will need to add a token to the beginning of the pattern as described in the "Request handling" section below, which SiteRequestCondition does automatically.

One final detail is "siteless" request handlers, which are represented in the RequestMappingContext class. We need a few global handlers to be mapped to URLs that belong to no site. Such handlers are marked with the @Siteless annotation. These handlers get a simple mapping that ignores all the details from the "Request handling" section.

Request handling

The actual logic of mapping an HTTP request onto a site is in the SiteResolver class, which is mercifully less gory than the classes in the previous section. Each Site object comes equipped with a SiteRequestScheme object, and the SiteResolver doesn't do much beyond applying each scheme to the request, looking for a hit. The schemes are composed of the various predicate objects in the org.ambraproject.wombat.config.site.url package.

The SiteRequestScheme objects get populated from the sites.yaml file picked up from Wombat's theme directory. For example, one entry in the configuration for PLOS's production site mappings reads as follows:

  - key:   DesktopPlosOne
    theme: DesktopPlosOne
    resolve:
        host: journals.plos.org
        path: plosone
        headers:
          - name:  X-Wombat-Serve
            value: desktop

Each of the three entries under resolve corresponds to a SiteRequestPredicate instance that gets built. Let's unpack them individually:

  • host is a custom hostname. If the Apache server in front of our Tomcat container did the necessary magic to route the request from the journals.plos.org hostname to our webapp, then the request is eligible to hit this site.
  • path is a token that appears at the beginning of the servlet path, in this case distinguishing URLs like http://journals.plos.org/plosone/ from http://journals.plos.org/plosbiology/. Note that it begins only after the URL of the servlet context, which in this case is the domain root because the webapp happens to be deployed to Tomcat as ROOT.war. If it were wombat.war, then the URL might look like http://journals.plos.org/wombat/plosone/ instead.
  • headers is a list of additional HTTP headers to look for. In PLOS's case, the task of identifying mobile clients is offloaded to the Apache server, which we rely on to set a custom X-Wombat-Serve header. If no such header is set, Wombat will resolve the request to neither the desktop nor the mobile site, resulting in a siteless 404 error. (This is a common failure mode for dev instances that have accidentally been set up with the production site resolution config, but with no server setting the X-Wombat-Serve header in front of it.)

Linking

Because of the complexity around resolving URLs, there is necessarily some complexity around building links to those URLs as well. Wombat uses the Link class to do this. Although the private code in the Link class is quite hairy, it is relatively simple to call from the outside.

To build a link from within Java code, call one of the static methods that returns a Link.Factory object: toLocalSite, toForeignSite, or toAbsoluteAddress. Then call a toPattern method and supply the necessary values and query parameters to complete the URL. Details appear in the Link class's Javadoc.

It should be possible to chain these method calls so that you don't ordinarily have to store any variables of the Factory or PatternBuilder types. For example:

Link link = Link.toForeignSite(localSite, targetJournalKey, siteSet)
    .toPattern(requestMappingContextDictionary, "article")
    .addQueryParameter("id", doi)
    .build();
String url = link.get(request);

Always prefer the toPattern method over toPath. The toPattern method points at a handler using the name attribute from its @RequestMapping annotation, and will dynamically generate a URL for its configured URL pattern. If you use toPath, the link may break if the site mappings are changed in sites.yaml or if the handler mapping is changed in mappings.yaml.

However, you mostly will not be building links from Java code, but from FreeMarker. The SiteLinkDirective lets you invoke the same link-building code from a FreeMarker template. As with the Link class, the directive may be used with a handler name or with a raw path; always prefer the handler name. For example:

<@siteLink handlerName="article" queryParameters={"id": article.doi} />

Sometimes you will want a URL to appear within FreeMarker code in a place where it would be unreadable to shove a bulky directive invocation like the one above. In such cases, it is good style to bind the result to a "loop var" as follows:

<@siteLink handlerName="article" queryParameters={"id":article.doi} ; href>
  <a href="${href}">${article.title}</a>
</@siteLink>

Calling services

Wombat has no access to persistent data on its own. It gets everything by making HTTP calls to remote services. The tools to make these calls live in the org.ambraproject.wombat.service.remote package.

For each remote component, there is a interface that extends the RestfulJsonApi interface, which Wombat uses to make requests to that component. The class names avoid using component nicknames like "Rhino" in favor of descriptive terms, as follows:

Target component Wombat's service interface
Rhino ArticleApi
Content Repo ContentApi
NED UserApi

The main way to make requests to the RestfulJsonApi interface is with the ApiAddress class, which encapsulates a query to the API and produces a URL. It uses a builder pattern to construct an ApiAddress instance tersely, and its embedDoi method will apply Rhino's DOI-escaping scheme. For example:

ApiAddress.builder("articles").embedDoi(articleId.getDoi()).addToken("revisions").build()

Each RestfulJsonApi object wraps around a RemoteService object, which does the actual work of making the HTTP request. The RemoteService classes encapsulate performance concerns such as HTTP connection pooling and caching. These details are set up in RootConfiguration.

Rhino

"Rhino" is a service component that provides a loosely RESTful interface to the stored corpus of articles belonging to the system's journals, and all data associated with those articles such as user comments. Although it does not serve any user-facing HTML content, it is also a Spring Web MVC application. It uses this framework mostly to serve metadata describing the state of the corpus in JSON format, but it also serves some forms of raw data.

Rhino's back end is a MySQL database that it accesses via Hibernate. It gets its MySQL connector from a typical context.xml file, and additionally picks up some bespoke configuration from rhino.yaml.

Service responses

Rhino uses the ServiceResponse and CacheableResponse classes to encapsulate raw response data that is intended to be returned to the client as JSON data. Many service classes have ServiceResponse as the return type to their methods. (This is perhaps not the best separation of layers. It may help to think of the response class as being a bridge between the service layer and controller layer.)

A service method calls a static factory method and returns a ServiceResponse object. In cases where the service has access to a "last modified" timestamp of the entity in question, instead construct and return a CacheableResponse. The controller that receives the CacheableResponse supplies to it an "if modified since" timestamp and unpacks it into a ServiceResponse object, which then produces the response as normal (and will provide an empty response with a "not modified" status if appropriate).

Note the semantics of the service methods that sometimes return a ServiceResponse and sometimes don't. For methods that look up a persistent entity, Rhino has a (mostly) consistent naming convention that depends on how to handle missing entities. Please follow this convention when adding new methods.

Return type Verb Meaning
Optional<T> get Fetch something that may or may not exist. Check after returning.
T read Fetch something that must exist. Throw an exception if it doesn't.
ServiceResponse<T> serve Fetch something the client expects to exist. Produce a 404 response if it doesn't.

Note that none of these three method types ever return null.

How metadata is retrieved

Much of Rhino's functionality is concerned with article metadata, which is stored and read in different ways depending on its purpose. It may be helpful to think of article data in these broad categories:

  1. metadata that Rhino extracts from the article at the time of ingestion and stores in the database, in Solr, or both;
  2. metadata that Rhino parses dynamically from the manuscript whenever it needs to serve it up as JSON;
  3. data (including metadata and the main article text) that Wombat parses or transforms directly from the manuscript; and
  4. dynamic data that is captured after ingestion (e.g., comments), so it must be stored in the database or by some external service (such as an ALM server).

Distinguishing between the first two categories is an engineering decision. Prior to Rhino version 2, it was attempted to keep a comprehensive set of article metadata in the database, so that everything could be retrieved efficiently into one-size-fits-all Hibernate entities. Over time, small deficiencies appeared in the designs of these entities: some new things were needed but not at the cost of migration; some things turned out to be insignificant; some complex relationships were difficult to model in Hibernate and led to brittle or unmaintainable code.

When we were developing the data model of versioned articles for Rhino version 2, we explored pulling article metadata aggressively out of the database and into the second category, which is parsed dynamically as needed. This added a new step to the basic metadata service where we had to read and parse the XML manuscript, which incurred a new performance cost, but we determined that the cost was acceptable, especially behind a cache. This approach has the general advantage that bugs in any metadata-parsing code could be fixed without backfilling any data: the more data you persist at ingestion time, the greater the risk that you are persisting it in the wrong form.

As of Rhino version 2, the guiding principle is that we keep as much metadata as possible out of the database by default, with the assumption that we don't mind parsing the XML manuscript when we render the main article page or one of its tabs. The metadata that we do put into the database is whatever we will need when rendering other pages, or the main page for a different article. Mostly, this means data that we need for rendering a link or displaying it in a list, such as the title, publication dates, publication stage (i.e., whether it is an uncorrected proof), and its relationships with other articles.

The golden rule is that Rhino never should parse the manuscripts of a linear number of articles in order for Wombat to render a single page. In a context where metadata from many articles is being aggregated, that data must be read from the database.

Wombat's main role in parsing data (the third category) is transforming the article body text into HTML. Wombat also is responsible for parsing the article's list of references to other works, which it needs both to display to the reader and to provide in <meta> tags. This operation was deemed too complex and specialized to put into Rhino's API, where it would have required baroque JSON structures and equally hairy code for Wombat to consume them.

In general, we try to minimize the types of data that go into the third category, because Rhino is more useful if it can represent as much article metadata as possible in general-purpose JSON services. These services are available not only to Wombat, but to other in-house tools, some of which we don't yet know we need. The services also are useful because developers can query them ad-hoc for debugging and such.