website: docs rewrite
Alex Buchanan authored and buchanae committed Nov 15, 2017
1 parent cd6fbd9 commit a24327e
Showing 45 changed files with 1,868 additions and 573 deletions.
10 changes: 5 additions & 5 deletions website/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,10 @@ baseURL: https://ohsu-comp-bio.github.io/funnel/
canonifyURLs: true
languageCode: en-us
title: Funnel

publishDir: docs
menu:
main:

- name: Reference
url: https://godoc.org/github.com/ohsu-comp-bio/funnel
weight: 30
- name: Reference
parent: Development
url: https://godoc.org/github.com/ohsu-comp-bio/funnel
weight: 30
51 changes: 0 additions & 51 deletions website/content/_index.md
Original file line number Diff line number Diff line change
@@ -1,62 +1,11 @@
---
Demo:
- Title: Start Funnel
Cmd: $ funnel server run

- Title: Run a task
Desc: Returns a task ID.
Cmd: |
$ funnel run 'md5sum $src' -c ubuntu --in src=~/src.txt
b41pkv2rl6qjf441avd0
- Title: Get the task
Desc: Returns state, logs, and more.
Cmd: $ funnel task get b41pkv2rl6qjf441avd0

- Title: List all the tasks
Cmd: $ funnel task list

- Title: View the terminal dashboard
Cmd: $ funnel dashboard

# - Title: Move to the cloud.
# Desc: |
# Google, Amazon, Microsoft, HPC, and more.
# Cmd: |
# $ gcloud auth login
# $ funnel deploy gce
# $ funnel run 'md5sum' \
# --stdin gs://pub/input.txt \
# --stdout gs://my-bkt/output.txt

- Title: Use a remote server
Cmd: $ funnel run --server http://funnel.example.com ...

- Title: Example tasks
Cmd: |
$ funnel example list
$ funnel example hello-world
- Title: Get help
# Desc: The Funnel CLI is extensive.
Cmd: $ funnel help

# - Title: File a bug.
# Desc: It happens.
# Cmd: $ funnel bug

- Title: Get the code
Cmd: $ go get github.com/ohsu-comp-bio/funnel

# - Title: Hack together a workflow.
# Desc: Bash-fu. Hadouken!
# Cmd: |
# $ funnel run <<TPL
# TPL

# - Title: Use a workflow language.
# Desc: Level up with CWL and WDL.

---

Homepage content is written in layouts/index.html
102 changes: 60 additions & 42 deletions website/content/docs.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,62 +3,80 @@ title: Overview
menu:
main:
identifier: docs
weight: -100
weight: -1000
---

# Overview

Funnel aims to make batch processing tasks easier to manage by providing a simple
toolkit that supports a wide variety of cluster types. Our goal is to enable you
to spend less time worrying about task management and more time processing data.
Funnel makes distributed, batch processing easier by providing a simple task API and a set of
components which can be easily adapted to a variety of platforms.

## Background
### Task

### How Does Funnel Work?
A task defines a unit of work: metadata, input files to download, a sequence of Docker containers + commands to run,
output files to upload, state, and logs. The API allows you to create, get, list, and cancel tasks.
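A minimal task document can be sketched in Python. The field names below follow the GA4GH TES schema; the bucket URL and file paths are hypothetical placeholders, not values from this document:

```python
import json

# A minimal TES-style task: download an input file, run md5sum in an
# Ubuntu container, and upload the captured stdout as an output.
# The URLs and paths here are hypothetical placeholders.
task = {
    "name": "md5sum example",
    "inputs": [
        {"url": "s3://my-bkt/src.txt", "path": "/inputs/src.txt"}
    ],
    "outputs": [
        {"url": "s3://my-bkt/out.txt", "path": "/outputs/out.txt"}
    ],
    "executors": [
        {
            "image": "ubuntu",
            "command": ["md5sum", "/inputs/src.txt"],
            "stdout": "/outputs/out.txt",
        }
    ],
}

print(json.dumps(task, indent=2))
```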

Funnel is a combination of a server and worker processes. First, you define a task.
A task describes input and output files, (Docker) containers and commands, resource
requirements, and some other metadata. You send that task to the Funnel server,
which puts it in the queue until a worker is available. When an appropriate Funnel
worker is available, it downloads the inputs, executes the commands in (Docker)
containers, and uploads the outputs.
Tasks are accessed via the `funnel task` command. There's an HTTP client in the [client package][clientpkg],
and a set of utilities and a gRPC client in the [proto/tes package][tespkg].
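Under the hood, these clients wrap a small set of HTTP routes. A rough sketch of the TES-style route shapes, with a hypothetical server address:

```python
# Rough sketch of the TES-style HTTP routes the clients wrap.
# The server address is a hypothetical placeholder.
base = "http://funnel.example.com/v1/tasks"

def create_url():
    return base                        # POST a task document here

def get_url(task_id):
    return f"{base}/{task_id}"         # GET task state and logs

def list_url():
    return base                        # GET the task list

def cancel_url(task_id):
    return f"{base}/{task_id}:cancel"  # POST to cancel a task

print(get_url("b41pkv2rl6qjf441avd0"))
```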

Funnel also comes with some tools related to managing workers and tasks. There's
a dashboard, a scheduler, an autoscaler, some rudimentary workflow tools, and more.
There's a lot more you can do with the task API. See the [tasks docs](/docs/tasks/) for more.

### Why Does Funnel Exist?
### Server

Here at OHSU Computational Biology, a typical project involves coordinating dozens
of tasks across hundreds of CPUs in a cluster of machines in order to process hundreds
of files. That's standard fare for most computational groups these days, and for some
groups it's "thousands" or "millions" instead of "hundreds".
The server serves the task API, web dashboard, and optionally runs a task scheduler.
It serves both HTTP/JSON and gRPC/Protobuf.

Because we're part of a worldwide scientific community, it's important that we're able
to easily share our work. If we create a variant calling pipeline with 50 steps,
we need people outside OHSU to run that pipeline easily and efficiently.
The server is accessible via the `funnel server` command and the [server package][serverpkg].

There's a long list of projects making great strides in the tools we use to tackle
this type of work, but they have a common problem. Every group of users has grown
a different set of tools for managing and interacting with their cluster. Some use
HTCondor and NFS. Some use Open Grid Engine and Lustre. Some prefer cloud providers,
but which one? Google? Amazon? Each cluster comes with a different interface to learn
(and a new set of problems to debug too).
### Storage

Tool authors usually end up writing (and hopefully maintaining) a set of
compute and storage plugins for each type of cluster. Many authors don't have
time for that, and their tools end up being limited to their environment.
Some tools were never meant to be shared, instead they were originally just
a prototype or a set of helper scripts for working with AWS instances.
Storage provides access to file systems such as S3, Google Storage, and local filesystems.
Tasks define locations where files should be downloaded from and uploaded to. Workers handle
the downloading/uploading.
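The mapping between storage locations and container paths can be sketched like this. Bucket names are hypothetical, and which URL schemes are usable depends on the storage backends configured:

```python
# Sketch: each task input/output pairs a storage URL with a path
# inside the container. Bucket names here are hypothetical.
inputs = [
    {"url": "gs://pub/input.txt", "path": "/inputs/input.txt"},
]
outputs = [
    {"url": "s3://my-bkt/output.txt", "path": "/outputs/output.txt"},
]

def scheme(url):
    # The URL scheme selects the storage client (gs, s3, file, ...).
    return url.split("://", 1)[0]

for f in inputs + outputs:
    print(scheme(f["url"]), "->", f["path"])
```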

The [GA4GH Task Execution Schemas][tes] (TES) group aims to ease these problems by
designing a simple API for data processing tasks that can be easily layered on top of,
or plugged into, most existing clusters. Funnel started as the first
implementation of the TES API.
See the [storage docs](/docs/storage/) for more information on configuring storage backends.
The storage clients are available in the [storage package][storagepkg].

Funnel aims to ease these problems. Our goal is to enable easy management of tasks
and tools that need to work across many types of clusters.
### Worker

A worker is responsible for executing a task. There is one worker per task. A worker:

- downloads the inputs
- runs the sequence of executors (usually via Docker)
- uploads the outputs

Along the way, the worker writes logs to event streams and databases:

- start/end time
- state changes (initializing, running, error, etc)
- executor start/end times
- executor exit codes
- executor stdout/err logs
- a list of output files uploaded, with sizes
- system logs, such as host name, docker command, system error messages, etc.
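The worker lifecycle above can be sketched as a simple loop. All names here are illustrative, not Funnel's actual internals:

```python
# Illustrative sketch of a worker's lifecycle: download inputs, run
# each executor in order, upload outputs, and log events along the
# way. Function and event names are hypothetical, not Funnel's.
def run_task(task, log):
    log("state", "initializing")
    for f in task["inputs"]:
        log("download", f["url"])
    log("state", "running")
    for i, ex in enumerate(task["executors"]):
        log("executor-start", i)
        # ... run ex["command"] in the ex["image"] container ...
        log("executor-exit", 0)
    for f in task["outputs"]:
        log("upload", f["url"])
    log("state", "complete")

events = []
run_task(
    {"inputs": [], "outputs": [],
     "executors": [{"image": "ubuntu", "command": ["true"]}]},
    lambda kind, value: events.append((kind, value)),
)
print(events)
```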

The worker is accessible via the `funnel worker` command and the [worker package][workerpkg].

### Node Scheduler

A node is a service that stays online and manages a pool of task workers. A Funnel cluster
runs a node on each VM. Nodes communicate with a Funnel scheduler, which assigns tasks
to nodes based on available resources. Nodes start a worker for each assigned task.

Nodes aren't always required. In some cases it makes sense to rely on an existing,
external system for scheduling tasks and managing cluster resources, such as AWS Batch
or HPC systems like HTCondor, Slurm, Grid Engine, etc. Funnel integrates with
these services without using nodes or the Funnel scheduler.

See [Deploying a cluster](/docs/compute/deployment/) for more information about running a cluster of nodes.

The node is accessible via the `funnel node` command and the [scheduler package][schedpkg].

[galaxy]: https://galaxyproject.org/
[cwl]: http://commonwl.org/
[wdl]: https://software.broadinstitute.org/wdl/
[tes]: https://github.com/ga4gh/task-execution-schemas
[serverpkg]: https://github.com/ohsu-comp-bio/funnel/tree/master/server
[workerpkg]: https://github.com/ohsu-comp-bio/funnel/tree/master/worker
[schedpkg]: https://github.com/ohsu-comp-bio/funnel/tree/master/compute/scheduler
[clientpkg]: https://github.com/ohsu-comp-bio/funnel/tree/master/client
[tespkg]: https://github.com/ohsu-comp-bio/funnel/tree/master/proto/tes
[storagepkg]: https://github.com/ohsu-comp-bio/funnel/tree/master/storage
8 changes: 8 additions & 0 deletions website/content/docs/compute.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
---
title: Compute
menu:
main:
weight: -5
---

# Compute
Original file line number Diff line number Diff line change
@@ -1,19 +1,19 @@
---
title: AWS Deployment

title: AWS Batch
menu:
main:
parent: guides
parent: Compute
weight: 20
---

# Amazon Web Services

# Amazon Batch

This guide covers deploying a Funnel server that leverages [DynamoDB][0] for storage
and [Batch][1] for task execution. You'll need to set up several resources
using either the Funnel CLI or through the provided Amazon web console.

## Create Required AWS Batch Resources
### Create Required AWS Batch Resources

For Funnel to execute tasks on Batch, you must define a Compute Environment,
Job Queue and Job Definition. Additionally, you must define an IAM role for your
Expand Down Expand Up @@ -132,6 +132,10 @@ Worker:
Secret: ""
```
### Known issues
Disk size and host volume management require extra setup. The `Task.Resources.DiskGb` field does not have any effect. See [issue 317](https://github.com/ohsu-comp-bio/funnel/issues/317).

[0]: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
[1]: http://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html
[2]: http://docs.aws.amazon.com/batch/latest/userguide/Batch_GetStarted.html#first-run-step-2
Expand Down
94 changes: 94 additions & 0 deletions website/content/docs/compute/deployment.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
---
title: Deploying a cluster
menu:
main:
parent: Compute
weight: -50
---

# Deploying a cluster

This guide describes the basics of starting a cluster of Funnel nodes.
This guide is a work in progress.

A node is a service
which runs on each machine in a cluster. The node connects to the Funnel server and reports
available resources. The Funnel scheduler process assigns tasks to nodes. When a task is
assigned, a node will start a worker process. There is one worker process per task.

Nodes aren't always required. In some cases it makes sense to rely on an existing,
external system for scheduling tasks and managing cluster resources, such as AWS Batch,
HTCondor, Slurm, Grid Engine, etc. Funnel provides integration with
these services without using nodes or the scheduler.

### Usage

Nodes are available via the `funnel node` command. To start a node, run
```
funnel node run --config node.config.yml
```

To activate the Funnel scheduler, use the `manual` backend in the config.

The available scheduler and node config:
```
# Activate the Funnel scheduler.
Backend: manual
Scheduler:
# How often to run a scheduler iteration.
# In nanoseconds.
ScheduleRate: 1000000000 # 1 second
# How many tasks to schedule in one iteration.
ScheduleChunk: 10
# How long to wait between updates before marking a node dead.
# In nanoseconds.
NodePingTimeout: 60000000000 # 1 minute
# How long to wait for a node to start, before marking the node dead.
# In nanoseconds.
NodeInitTimeout: 300000000000 # 5 minutes
# Node config.
Node:
# If empty, a node ID will be automatically generated using the hostname.
ID: ""
# Files created during processing will be written in this directory.
WorkDir: ./funnel-work-dir
# If the node has been idle for longer than the timeout, it will shut down.
# -1 means there is no timeout. 0 means timeout immediately after the first task.
Timeout: -1
# A Node will automatically try to detect what resources are available to it.
# Defining Resources in the Node configuration overrides this behavior.
Resources:
# CPUs available.
# Cpus: 0
# RAM available, in GB.
# RamGb: 0.0
# Disk space available, in GB.
# DiskGb: 0.0
# For low-level tuning.
# How often to sync with the Funnel server.
# In nanoseconds.
UpdateRate: 5000000000 # 5 seconds
# RPC timeout for update/sync call.
# In nanoseconds.
UpdateTimeout: 1000000000 # 1 second
Logger:
# Logging levels: debug, info, error
Level: info
# Write logs to this path. If empty, logs are written to stderr.
OutputFile: ""
```
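Since the duration fields are plain nanosecond integers, it can help to compute them rather than count zeros. A small helper, not part of Funnel:

```python
# Helper for computing the nanosecond duration values used in the
# config above. Not part of Funnel; it just avoids counting zeros.
def ns(seconds):
    return int(seconds * 1_000_000_000)

print(ns(1))    # ScheduleRate: 1 second -> 1000000000
print(ns(60))   # NodePingTimeout: 1 minute -> 60000000000
print(ns(300))  # NodeInitTimeout: 5 minutes -> 300000000000
```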

### Known issues

The config uses nanoseconds for duration values. See [issue #342](https://github.com/ohsu-comp-bio/funnel/issues/342).
Original file line number Diff line number Diff line change
@@ -1,15 +1,13 @@
---
title: Open Grid Engine

title: Grid Engine
menu:
main:
parent: guides
parent: Compute
weight: 20
---
# Grid Engine

# Open Grid Engine

Funnel can be configured to submit workers to [Open Grid Engine][ge] by making calls
Funnel can be configured to submit workers to [Grid Engine][ge] by making calls
to `qsub`.

The Funnel server process needs to run on the same machine as the Grid Engine master.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,12 +1,10 @@
---
title: HTCondor

menu:
main:
parent: guides
parent: Compute
weight: 20
---

# HTCondor

Funnel can be configured to submit workers to [HTCondor][htcondor] by making
Expand Down
Original file line number Diff line number Diff line change
@@ -1,12 +1,10 @@
---
title: PBS/Torque

menu:
main:
parent: guides
parent: Compute
weight: 20
---

# PBS/Torque

Funnel can be configured to submit workers to [PBS/Torque][pbs] by making calls
Expand Down