Cook has a lot of options in its configuration file.
This document will attempt to cover all the supported configuration options for the Cook scheduler.
Cook is designed to support multiple config file formats (json, yaml, edn), but today, it only supports edn.
The edn format (pronounced eden, like the garden) is described at https://github.com/edn-format/edn.
The most important thing to know about edn is that nothing separates keys from values in maps (there is no : as in Python, or → as in Scala), and commas are treated as whitespace.
In this guide, configuration will be written as fragments.
Before looking over this guide, you should check out the sample config files, dev-config.edn and example-prod-config.edn.

The dev config starts up an embedded Zookeeper and Datomic so that you don’t have to. It also enables all of the introspection features, like JMX metrics and nREPL, without sending any emails or metrics to alerting systems. To use the dev config yourself, just make sure that the Mesos cluster’s :master is set correctly, and that the username exists on your slaves.

The prod config instead uses an external Zookeeper and Datomic to ensure high availability and persistence. It registers with Mesos so that it can fail over successfully without losing any tasks. It is also chattier, sending emails and metrics to other systems to improve visibility.
At this point, you’re probably wondering what all these options actually do, and how to configure them. The next several sections will document what every option does.
:port - This is the port that the REST API will bind to.

:hostname - This is the hostname of the server Cook is running on. Defaults to the system hostname (via getCanonicalHostName).

:database - This configures which database Cook will connect to. Currently, Cook only supports Datomic, so :database must be set to a map with a single key: {:datomic-uri "$DB_URI"}. Datomic is convenient for development because it ships with an in-process, in-memory version, which you select by using the in-memory backend. To use the in-memory DB, use the URI datomic:mem://cook-jobs. An example URI for connecting to a Datomic free transactor on the host $HOST would be datomic:free://$HOST:4334/cook-jobs. See http://docs.datomic.com/getting-started.html for more information on setting up Datomic.
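Putting the options above together, a minimal development fragment might look like the following sketch (the port and hostname values are illustrative, not defaults):

```edn
{; REST API port and advertised hostname
 :port 12321
 :hostname "cook-dev.example.com"
 ; in-memory Datomic, suitable only for development
 :database {:datomic-uri "datomic:mem://cook-jobs"}}
```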
Tip: Datomic Configuration in Production. Cook uses special transaction functions to maintain distributed consistency. When running a standalone transactor (as in most QA and production environments), you’ll need to include a copy of the Cook jar on the Datomic transactor’s classpath. This ensures that all of the transaction functions are available on the transactor.
:zookeeper - This configures which Zookeeper Cook will connect to. You can either have Cook use an embedded Zookeeper (great for development and trying out Cook), or use an external Zookeeper quorum (required for production). To use a production Zookeeper quorum located at $QUORUM (e.g. zk1.example.com,zk2.example.com,zk3.example.com/cook), you should use a map: {:connection "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/cook"}.

To use the embedded Zookeeper for development, use the map {:local? true}. By default, the embedded Zookeeper will bind to port 3291. If you need it to bind to another port, you can specify that with the :local-port key, e.g. {:local? true, :local-port 9001}.

:mesos - This key configures how Cook will interact with the Mesos cluster. See Mesos Configuration for details.
:authorization - This key configures how Cook will validate users for multitenant scheduling. Cook currently supports a single-user development mode, HTTP Basic authentication, and Kerberos authentication. See Authorization Configuration for details.
:authorization-fn - This key specifies which function to use to perform user authorization. Two example authorization functions are provided. cook.rest.authorization/open-auth allows any user to do anything, for testing and development. cook.rest.authorization/configfile-admins-auth consults the :admins key in the config file for a list of admins. Admins may do anything to any object; other users may only manipulate their own objects. It’s easy to write your own custom authorization function; see the cook.rest.authorization docstrings for more information.

:admins - The value of this key is a set of usernames who should be considered administrators when using the configfile-admins-auth
authorization-fn.

:plugins - Cook has two extension points that let plugins reject jobs at job submission time, as well as accept or defer the launching of jobs, via :job-launch-filter and :job-submission-validator.

- :job-launch-filter configures the entrypoint for filtering job launches. It has several keys:
  - :factory-fn contains a string with a namespace-qualified path of a function for creating a JobLaunchFilter. The factory function can fetch the current configuration out of the config defstate in cook.config/config.
  - :age-out-first-seen-deadline-minutes controls how we get rid of jobs whose launch is perpetually deferred by a plugin. When a job 'ages out', we force it to launch immediately, to keep the queue from being cluttered with always-deferred jobs. The clock for aging out starts when a job is near enough to the front of the queue to be eligible to run, not when it is added to the queue. A job is eligible for aging out when it was first seen in the scheduler queue at least this long ago. If you want jobs to sit in the queue longer than the default 10 hours before being aged out, increase this number.
  - :age-out-last-seen-deadline-minutes - A job is eligible for aging out only when it has been seen in the launch queue at least this recently.
  - :age-out-seen-count - We must have attempted to schedule the job at least this many times before we age it out.
- :job-submission-validator configures the entrypoint for filtering job submissions. It has several keys:
  - :factory-fn contains a string with a namespace-qualified path of a function for creating a JobSubmissionValidator. The factory function can fetch the current configuration out of the config defstate in cook.config/config.
  - :batch-timeout-seconds - This sets a critical timeout value. We check job submissions synchronously, so the plugin has to respond quickly; in particular, we must complete all of the checks before the HTTP timeout. To do this we implement another, softer timeout, defaulting to 40 seconds. Once we cross this soft timeout, a default accept/reject policy is applied to the remaining submitted jobs, one that cannot look at jobs individually.
- :job-adjuster configures a plugin for adjusting jobs on submission.
  - :factory-fn contains a string with a namespace-qualified path of a function for creating a JobAdjuster.
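As a sketch, a :plugins stanza combining the keys above might look like this (the factory function names are hypothetical placeholders for your own implementations; the age-out and timeout values echo the defaults mentioned above):

```edn
{:plugins {:job-launch-filter {:factory-fn "my.company.plugins/create-job-launch-filter"
                               ; 600 minutes = the 10-hour default mentioned above
                               :age-out-first-seen-deadline-minutes 600
                               :age-out-last-seen-deadline-minutes 10
                               :age-out-seen-count 10}
           :job-submission-validator {:factory-fn "my.company.plugins/create-job-submission-validator"
                                      ; soft timeout, defaults to 40 seconds
                                      :batch-timeout-seconds 40}
           :job-adjuster {:factory-fn "my.company.plugins/create-job-adjuster"}}}
```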
:rate-limit - Configures rate limits for the scheduler. :rate-limit is a map with three possible keys: :user-limit-per-m, :job-submission, and :expire-minutes.

- :user-limit-per-m is the maximum number of REST requests a single user can send in a minute. The default is 600 requests.
- :job-submission is a map containing a token bucket filter configuration that allows per-user restriction of the job submission rate.
- :expire-minutes - For our token-bucket-filter rate limits, we create one token bucket filter rate limit object for each user. If a user has become very idle, we expire old unused rate-limit entries after this time period. It should be set to several hours, or at least as high as (:bucket-size / :tokens-replenished-per-minute).

A token bucket filter configuration has three keys: :enforce, :tokens-replenished-per-minute, and :bucket-size. Tokens are added to a bucket of maximum size :bucket-size at a rate of :tokens-replenished-per-minute. :enforce decides whether we reject requests that violate this rate limit; even if enforcement is off, violations of the rate limit are logged. See https://en.wikipedia.org/wiki/Token_bucket.
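For example, a :rate-limit map using the default request limit plus a per-user job-submission token bucket might look like this (the bucket numbers are illustrative, not defaults):

```edn
{:rate-limit {:user-limit-per-m 600
              :job-submission {:enforce true
                               :tokens-replenished-per-minute 60
                               :bucket-size 600}
              ; at least (:bucket-size / :tokens-replenished-per-minute);
              ; several hours is a sensible setting
              :expire-minutes 600}}
```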
:agent-query-cache - Configures the cache used to store sandbox locations of tasks on different Mesos agents. :agent-query-cache is a map with two possible keys, :max-size and :ttl-ms. :max-size is the maximum number of elements in the cache before LRU eviction semantics apply. The default is 1000. :ttl-ms is the time in milliseconds that entries are allowed to reside in the cache. The default is 60000, i.e. 1 minute.

:exit-code-syncer - The Cook scheduler throttles the rate at which it publishes task exit codes. This allows us to handle a high rate of incoming exit-code messages in a graceful manner. :exit-code-syncer is a map with the following possible keys: :publish-batch-size and :publish-interval-ms. :publish-batch-size is an integer representing the number of facts that are updated in individual Datomic exit-code update transactions. The default value is 100. :publish-interval-ms is an integer representing the interval in milliseconds at which exit-code updates will be published to Datomic. The default value is 2500.

:sandbox-syncer - The Cook scheduler throttles the rate at which it publishes task sandbox directories. This allows us to handle a high rate of incoming messages in a graceful manner. :sandbox-syncer is a map with the following possible keys: :max-consecutive-sync-failure, :publish-batch-size, :publish-interval-ms and :sync-interval-ms. :max-consecutive-sync-failure is the maximum number of failures before the sandbox sync is no longer retried on that agent. The default value is 15. :publish-batch-size is an integer representing the number of facts that are updated in individual Datomic sandbox directory update transactions. The default value is 100. :publish-interval-ms is an integer representing the interval in milliseconds at which sandbox directory updates will be published to Datomic. The default value is 2500. :sync-interval-ms is the interval at which the sandbox syncer triggers state lookups on pending Mesos agents. This value should ideally be lower than the agent-query-cache :ttl-ms. The default value is 15000, i.e. 15 seconds.
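A fragment that simply spells out the defaults described above would be:

```edn
{:agent-query-cache {:max-size 1000
                     :ttl-ms 60000}
 :exit-code-syncer {:publish-batch-size 100
                    :publish-interval-ms 2500}
 :sandbox-syncer {:max-consecutive-sync-failure 15
                  :publish-batch-size 100
                  :publish-interval-ms 2500
                  ; should stay below the agent-query-cache :ttl-ms
                  :sync-interval-ms 15000}}
```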
Mesos configuration is specified as a map, because there are several properties that can be configured about the way Cook connects to Mesos. We’ll look at the configurable options in turn:
:master - This option sets the Mesos master connection string. For example, if you are running Mesos with a Zookeeper node on the local machine (a common development setup), you’d use the connection string zk://localhost:2181/mesos.

:failover-timeout-ms - This option sets the number of milliseconds that Mesos will wait for the Cook framework to reconnect. In development, you should set this to nil, which means that Mesos will treat any disconnection of Cook as the framework ending; this will kill all of Cook’s tasks when it disconnects. In production, it’s recommended to set this to 1209600000, which is 2 weeks in milliseconds. This means that when the Cook scheduler goes down, you have 2 weeks to reconnect a new instance, during which no tasks will be forcibly killed. Typically, however, you’ll only wait 10-30 seconds for reconnection, since Cook is usually run with hot standbys.

:leader-path - This configures the Zookeeper path that Cook will use for its high-availability configuration. The Zookeeper quorum is the one configured in the top-level :zookeeper option. As long as the Zookeeper quorum and :leader-path are the same, multiple instances of Cook will be able to synchronize, perform leader election, and perform framework recovery and failover automatically. For a production deployment, you can just run two or three copies of Cook on different hosts; even if a host fails, Cook won’t be affected.

:principal - This sets the principal that Cook will connect to Mesos with. You can omit this property unless you’ve enabled security features in Mesos. The value here should match one of the authorized principals for the register_frameworks action in the Mesos authorization file. See http://mesos.apache.org/documentation/latest/authorization/ for details.

:role - This sets the role that Cook will connect to Mesos with. Default: *. You can omit this property unless you’ve enabled security features in Mesos. The value should be in the authorized list for the current :principal in the register_frameworks action in the Mesos authorization file. See http://mesos.apache.org/documentation/latest/authorization/ for details.

:run-as-user - When configured, this sets the user that Cook will override and use to run tasks on Mesos. You can omit this property, in which case the user configured in the job will be used to run the job (the default behavior). Cook’s scheduling algorithm continues to use the user specified in the job to compute job schedules.

:framework-name - This sets part of the name of the framework that Cook will register with Mesos. Default: Cook. When connecting to Mesos, Cook will use a framework name like "YourFrameworkName-e254483"; it appends the current git hash to the value you specify here.

:enable-gpu-support - This enables GPU support for Cook. It is a boolean value, with default value false. This property only works with Mesos 1.0 and above, since that’s when GPU support was added. If you enable this on an earlier version of Mesos, Cook will fail to start and print the error in the log. If you enable this and your cluster doesn’t have any GPU machines, Cook will accept GPU jobs, but they’ll never be scheduled. See https://github.com/apache/mesos/blob/master/docs/gpu-support.md for details on configuring the agents, installing external NVIDIA dependencies, and configuring Docker/GPU integration.

:leader-reports-unhealthy - This configures whether the leader reports its status as healthy by returning 200 from the /debug endpoint. This can be used to isolate the leader from query load. If set to true, the leader will return 503 on the /debug endpoint; if set to false, the leader will return 200. The default value is false.
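A production-leaning :mesos stanza combining these options might look like the following sketch (the host names, principal, and leader path are illustrative):

```edn
{:mesos {:master "zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/mesos"
         :failover-timeout-ms 1209600000 ; 2 weeks, the recommended production value
         :leader-path "/cook-scheduler"
         :principal "cook"
         :role "*"
         :framework-name "Cook"
         :enable-gpu-support false
         :leader-reports-unhealthy false}}
```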
One of Cook’s most valuable features is its fair-sharing of a cluster. But how does Cook know who submitted which jobs? Every request to Cook’s REST API is authenticated, so that we know which user is making the request. Keep in mind that the username used for authentication is also the username that Cook will run the job as, so make sure that user exists on your Mesos slaves. We’ll look at the three authentication mechanisms supported:
:one-user - When doing development with Cook, it’s nice to be able to use it without any authentication. You can have Cook treat every request as coming from a specific user $USER by configuring :authorization like so:

{ ; ... snip ...
  :authorization {:one-user "$USER"}
  ; ... snip ...
}

:http-basic - Most organizations will want to use HTTP Basic authentication. Cook allows you to configure how the user name and password are validated. Currently, Cook supports specifying the logins in the config file or using no validation. This also makes it super easy to get started: to use HTTP Basic, simply use {:http-basic true} as your :authorization. This will use no validation. To use config-file validation, set :authorization to:

{:http-basic {:validation :config-file
              :valid-logins #{["user" "password"] ["user2" "password2"]}}}

:kerberos - If you have Kerberos at your organization, then you can use it to authenticate users with Cook. To use Kerberos, simply use {:kerberos true} as your :authorization.
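Tying the authentication and authorization options together, a fragment enabling config-file HTTP Basic validation alongside config-file admins might look like this (the usernames and passwords are illustrative):

```edn
{:authorization {:http-basic {:validation :config-file
                              :valid-logins #{["alice" "secret"]
                                              ["bob" "hunter2"]}}}
 ; admins may manipulate any object; other users only their own
 :authorization-fn cook.rest.authorization/configfile-admins-auth
 :admins #{"alice"}}
```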
The Cook scheduler comes with a few knobs to tune its behavior under the :scheduler key.
:offer-incubate-ms - This option configures how long Cook will hold onto offers, in order to try to coalesce offers and find better placements for tasks. We recommend setting this to 15000. If you set this to zero, Cook might not be able to find sufficiently large offers for tasks if you’re running other frameworks on your Mesos cluster at the same time.

:mea-culpa-failure-limit - When an instance fails, it can be for a variety of reasons. Some of these are considered "mea culpa reasons", meaning that Cook itself may be to blame for the failure; in these cases, a certain number of these failures won’t count against the job’s retry limit. For example, if Cook preempts a task, the task will fail, but this won’t count against the retry limit. However, if a task fails for the same reason more than a certain number of times (which you can specify using this configuration setting), the excess failures WILL start to count against the job’s retry limit.

:mea-culpa-failure-limit should be a map. The keys of the map should correspond to the names of individual mea-culpa failure reasons (e.g. :preempted-by-rebalancer). Each value is the number of task failures for the specified reason that can occur before subsequent failures begin to count against the job’s retry limit.

The value associated with the key :default applies to any mea-culpa failure reasons that aren’t mentioned by name.

To allow infinite failures for a given failure reason, set its value to -1.
Example:
:mea-culpa-failure-limit {:default 5
:mesos-master-disconnected 8
:preempted-by-rebalancer -1}
:fenzo-max-jobs-considered - This controls the number of jobs (ranked in Cook priority order) Fenzo will be able to see when placing jobs on Mesos agents. Raising this number gives Fenzo more freedom to apply constraints for the purpose of optimization, but may also make it more likely to schedule jobs Cook wouldn’t consider of the highest priority. Default is 1000.

:fenzo-scaleback - If Fenzo fails to place Cook’s most desirable job, Cook will start to limit the number of jobs Fenzo can see until that most desirable job is matched by Fenzo. This number is the factor by which the number of jobs Fenzo can see is reduced on each iteration that fails to match the most desirable job. Eventually, if the job is NEVER matched, Cook will reduce the number of jobs Fenzo can see to 1, meaning that Fenzo will ONLY be able to see the most desirable job. Default is 0.95.

:fenzo-floor-iterations-before-warn - If Cook has been allowing Fenzo to see only 1 job for this number of iterations, warning messages will start to appear in the logs. Default is 10.

:fenzo-floor-iterations-before-reset - If Cook has been allowing Fenzo to see only 1 job for this number of iterations, it means that the cluster is essentially down. In this case, Cook will log an error message and then reset the number of jobs Fenzo can see to the value of :fenzo-max-jobs-considered (see above).

:fenzo-fitness-calculator - By default, Cook will have Fenzo attempt to bin-pack using a combination of memory and CPU when choosing which hosts will field which tasks. By choosing a different option in :fenzo-fitness-calculator, you can specify that Fenzo should use a different implementation of VMTaskFitnessCalculator. This value can either refer to a static member of a Java class on the classpath (e.g. "com.netflix.fenzo.plugins.BinPackingFitnessCalculators/cpuMemBinPacker", the default), or a namespaced Clojure symbol (e.g. "cook.mesos.scheduler/dummy-fitness-calculator").
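Sketching the scheduler knobs above as one fragment (assuming, per the introduction to this section, that they live under the :scheduler key; values shown are the defaults or recommendations mentioned above, except the reset iteration count, which is illustrative):

```edn
{:scheduler {:offer-incubate-ms 15000 ; recommended value
             :mea-culpa-failure-limit {:default 5
                                       :preempted-by-rebalancer -1} ; -1 = infinite
             :fenzo-max-jobs-considered 1000
             :fenzo-scaleback 0.95
             :fenzo-floor-iterations-before-warn 10
             :fenzo-floor-iterations-before-reset 1000 ; illustrative, no documented default
             :fenzo-fitness-calculator "com.netflix.fenzo.plugins.BinPackingFitnessCalculators/cpuMemBinPacker"}}
```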
:task-constraints - This option is a map that allows you to configure limits for tasks, to ensure that impossible-to-schedule tasks and tasks that run forever won’t bog down your cluster. It currently supports 4 parameters to defend the Cook scheduler, which are described in Task Constraints.

:estimated-completion-constraint - This allows you to configure an optional constraint which will not launch jobs on VMs where the job is expected to run longer than the host’s expected lifetime (for instance, public cloud spot VMs). The configuration parameters are described in Estimated Completion Constraint.
Optionally, you can include a "rebalancer" stanza. If you do, on startup Cook will update its Rebalancer configuration to match the values you specify here.

- :interval-seconds - How often to rebalance the cluster for fairness between users. Default is 300 (5 minutes).
- :safe-dru-threshold - See the Rebalancer documentation.
- :min-dru-diff - See the Rebalancer documentation.
- :max-preemption - See the Rebalancer documentation.
- :dru-scale - This is only used to control the metrics reporting of DRU values. On some clusters, the DRUs may be so small that when the values are fed to clj-metrics, they are treated as 0, which makes it impossible to glean insights into the DRUs in play in order to set rebalancer parameters. If you find that this is true on your cluster, it is likely that the user shares are set to a very high value, perhaps the default of Integer.MAX_VALUE. To obtain useful DRU metrics in this situation, you can either adjust your share settings (recommended), or increase the dru-scale setting to e.g. 10^300.
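An illustrative rebalancer stanza could look like this; the interval matches the default above, but the DRU thresholds and preemption count are placeholders, so consult the Rebalancer documentation before using them:

```edn
{:rebalancer {:interval-seconds 300
              ; placeholder values; see the Rebalancer documentation
              :safe-dru-threshold 1.0
              :min-dru-diff 0.5
              :max-preemption 64
              :dru-scale 1}}
```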
The optimizer is a component that can make more global decisions about the cluster, job placement, and autoscaling. By default, Cook uses a no-op optimizer. To plug in an implementation, add the optional "optimizer" stanza. The two pluggable pieces, host-feed and optimizer, each have the same configuration structure: a :create-fn key, which is a namespaced symbol on the classpath, and a :config key, which is an arbitrary map. See the example below.

- :host-feed - The implementation of the host feed to use. The host feed returns a list of maps, where each map describes a host type that can be purchased. Each map should include the following keys:
  - :count - The number of hosts available of that type. Should be a non-negative integer.
  - :instance-type - The name of the host type. Should be a string.
  - :cpus - The number of cpus available for the particular host type. Should be a positive number.
  - :mem - The amount of memory available for the host type, in MB. Should be a positive number.
  - :gpu - (Optional) The number of GPUs available for the host type. Should be a positive number.
- :optimizer - The implementation of the optimizer to use. The optimizer accepts the output of the host feed, the queue, the running tasks, and the available offers, and outputs a schedule of suggestions. A schedule is a map from T milliseconds in the future to a map of optimizer recommendations. Each recommendation map can contain multiple keys; currently there is only one, :suggested-matches. The value of :suggested-matches is a map from a host type map (as described above) to a list of job uuids that the optimizer recommends matching onto that host type at that point in the future.
- :optimizer-interval-seconds - The interval at which to run the optimizer, in seconds. The default is 30.
Example optimizer config:

{
 ; ... snip ...
 :optimizer {:host-feed {:create-fn cook.mesos.optimizer/create-dummy-host-feed
                         :config {}}
             :optimizer {:create-fn cook.mesos.optimizer/create-dummy-optimizer
                         :config {}}
             :optimizer-interval-seconds 30}
 ; ... snip ...
}
:timeout-hours - This specifies the maximum time that a task is allowed to run for. Any tasks running longer than this will be automatically killed.

:timeout-interval-minutes - This specifies how often to check for timed-out tasks. Since checking for timed-out tasks is linear in the number of running tasks, this can take a while. On the other hand, if your timeout is one hour but you only check every 30 minutes, some tasks could end up running for almost one and a half hours!

:memory-gb - This specifies the maximum amount of memory that a task can request. You should make sure this is small enough that users can’t accidentally submit tasks that are too big for your slaves.

:cpus - This is just like :memory-gb, but for CPUs. You should make sure this is small enough that users can’t accidentally submit tasks that are too big for your slaves.

:retry-limit - This limits the number of retries a job is allowed to request. Something in the low tens is often more than sufficient.

:expected-runtime-multiplier - This is a configurable constant factor which multiplies the expected runtime of each job to compute when the job will complete. For instance, if :expected-runtime-multiplier is 2.5 and a job has an expected runtime of 60000 ms, it will not be scheduled on a host expected to die within 2.5 minutes.

:host-lifetime-mins - This is the expected lifetime of hosts in the cluster. To apply the constraint, each host should have a host-start-time attribute in its Mesos offer, which is used with the host-lifetime-mins parameter to determine when the host is expected to die.
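Gathering the parameters above, a :task-constraints map plus the optional estimated-completion settings could look like this sketch (the limits are illustrative, and the nesting of the estimated-completion keys is assumed from the section grouping):

```edn
{:task-constraints {:timeout-hours 24
                    :timeout-interval-minutes 10
                    :memory-gb 48
                    :cpus 6
                    :retry-limit 20}
 :estimated-completion-constraint {:expected-runtime-multiplier 2.5
                                   ; hosts expected to live 24 hours
                                   :host-lifetime-mins 1440}}
```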
The Cook executor is a custom executor written in Python.
It is enabled when the :command
option (see below) is configured to a non-empty string.
When enabled, it replaces the default command executor in order to enable a number of features for both operators and end users.
Please see the Cook Executor README for more detailed information about the Cook executor.
An example configuration looks like:
{...
:executor {:command "./cook-executor"
:default-progress-regex-string "progress:\\s+([0-9]*\\.?[0-9]+)($|\\s+.*)"
:environment {"EXECUTOR_DEFAULT_PROGRESS_OUTPUT_NAME" "stdout"}
:log-level "INFO"
:max-message-length 512
:progress-sample-interval-ms 1000
:uri {:cache true
:executable true
:extract false
:value "file:///path/to/cook-executor"}}
...}
The configuration values are defined as follows:
:command - A string containing the command executed on the Mesos agent to launch the Cook executor. No default value is provided; when missing, the Cook executor is disabled.

:default-progress-regex-string - The string representation of the regex used to identify progress update messages. The regex should have one or two capture groups, the first being a number representing the progress percent, and the second, when present, being a message about the progress. The executor will use the :max-message-length value to trim the progress message string before sending it to the scheduler. Defaults to "progress:\\s+([0-9]*\\.?[0-9]+)($|\\s+.*)".

:environment - A map that represents additional environment variables passed on to the executor. The default is an empty map.

:log-level - The log level for the executor process. Defaults to "INFO".

:max-message-length - The maximum length for the unencoded string messages sent from a task via the Mesos executor HTTP API. The default is 512.

:progress-sample-interval-ms - The interval in ms after which to send progress updates. Care should be taken to avoid setting this value too low, as it can cause a high rate of message transfer between the executor and the scheduler. The default is (1000 * 60 * 5), i.e. 5 minutes.

:uri - A description of the uri used to download the executor executable. The default is an empty map, i.e. no executable to download. The uri structure is defined below:
key | type | description |
---|---|---|
:cache | boolean | Mesos 0.23 and later only: should the URI be cached in the fetcher cache? |
:executable | boolean | Should the URI have the executable bit set after download? |
:extract | boolean | Should the URI be extracted (must be a tar.gz, zipfile, or similar)? |
:value | string | The URI to fetch. Supports everything the Mesos fetcher supports, i.e. http://, https://, ftp://, file://, hdfs:// |
The Cook scheduler throttles the rate at which it publishes progress updates from the Cook executor. This allows us to handle high rate of incoming progress messages in a graceful manner. This also protects the scheduler against potentially bad executors that are sending progress messages at a high rate.
An example configuration looks like:
{...
:progress {:batch-size 100
:pending-threshold 4000
:publish-interval-ms 2500
:sequence-cache-threshold 1000}
...}
The configuration values are defined as follows:
:batch-size - An integer representing the number of facts that are updated in individual Datomic progress update transactions. The default value is 100.

:pending-threshold - An integer representing the maximum number of instances whose pending progress states will be stored in memory. Additional messages (either in the queue or while building the in-memory state) will be dropped. The default value is 4000.

:publish-interval-ms - An integer representing the interval in milliseconds at which progress updates will be published to Datomic. The default value is 2500.

:sequence-cache-threshold - An integer representing the maximum number of items in the task sequence cache. This cache is used to track the latest sequence number of the progress message processed for a given task. In order to avoid the potential for out-of-order progress updates, this cache should be sized to handle the maximum number of active tasks that are reporting progress. The default value is 1000.
Cook is designed to be easy to debug and monitor. We’ll look at the various monitoring and debugging subconfigs:
:metrics - This map configures where and how to report Cook’s internal scheduling and performance metrics. See Metrics for details.

:nrepl - Cook can start an embedded nREPL server. nREPL allows you to log into the Cook server and inspect and modify the code while it’s running. This should not be enabled on untrusted networks, as anyone who connects via nREPL can bypass all of Cook’s security mechanisms. This is really useful for development, though! See nREPL for details.

:log - This section configures Cook’s logging. See Logging for details.

:unhandled-exceptions - This map configures what Cook’s behavior should be when it encounters an exception that doesn’t already have code implemented to handle it. See Unhandled Exceptions for how to configure.
Cook can transmit its internal metrics over a variety of transports, such as JMX, Graphite, and Riemann. Internally, Cook uses Dropwizard Metrics 3, so we can easily add support for any Metrics 3 compatible reporter.
- JMX Metrics - To enable JMX metrics, set the :metrics key to {:jmx true}.

- Graphite Metrics - To enable Graphite metrics, you’ll need to populate the :graphite map. We support setting a prefix on all metrics, choosing which Graphite server to connect to, and whether to use the plain-text or pickled transport format. Here’s an example of enabling Graphite metrics:

:metrics {:graphite {:host "my-graphite-server.example.com"
                     :port 2003
                     :prefix "cook"
                     :pickled? false ; defaults to true
                     }}
Also, keep in mind that you can enable multiple metrics reporters simultaneously, if that’s useful in your environment. For example, you could use JMX and graphite together:
:metrics {:graphite {:host "my-graphite-server.example.com"
:port 2003
:prefix "cook"}
:jmx true}
The :nrepl key takes a map that supports two options:

:enabled? - Set this to true if you’d like to start the embedded nREPL server.

:port - Set this to the port number you’d like the nREPL server to bind to. You must choose a port to enable nREPL.
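For example, a development fragment enabling nREPL on an illustrative port:

```edn
; do not enable on untrusted networks
{:nrepl {:enabled? true
         :port 6789}}
```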
Cook’s logging is configured under :log. Cook automatically rotates its logs daily, and includes information about package, namespace, thread, and the time for every log message.

:file - You must choose a file location for Cook to write its log. It’s strongly recommended to specify a log file under a folder, e.g. log/cook.log, since Cook will rotate the log files by appending .YYYY-MM-dd to the specified path. The path can be relative (from the directory you launch Cook in) or absolute.

:levels - You can also specify log levels to increase or decrease the verbosity of various components of Cook and the libraries it uses. We’ll look at an example, which sets the default logging level to :info, but sets a few Datomic namespaces to use the :warn level. This also happens to be the recommended logging configuration:

:levels {"datomic.db" :warn
         "datomic.peer" :warn
         "datomic.kv-cluster" :warn
         :default :info}

As you can see, specific packages and namespaces are specified as strings in the map’s keys; their values specify their log level override.
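Combined into a full :log stanza, using the recommended levels:

```edn
{:log {:file "log/cook.log" ; rotated daily by appending .YYYY-MM-dd
       :levels {"datomic.db" :warn
                "datomic.peer" :warn
                "datomic.kv-cluster" :warn
                :default :info}}}
```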
Everyone makes mistakes. We’d like to know when errors happen that we didn’t anticipate. That’s what the :unhandled-exceptions key is for. Let’s look at what options it takes:

:log-level - This lets you choose the level to log unhandled errors at. Usually :error is the right choice, although you may want to log these at the :fatal level.

:email - You can also choose to receive emails when an unhandled exception occurs. This key takes a map that is used as a template for the email. Cook uses postal to send email. For advanced configuration, check out postal’s documentation. Cook will append details to whatever subject line you provide, and it will fill in the body with the stacktrace, thread, and other useful info. Here’s a simple example of setting up email:

:email {:to ["[email protected]"]
        :from "[email protected]"
        :subject "Unhandled exception in cook"}
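Putting both options together as one :unhandled-exceptions stanza, using the doc’s own example addresses:

```edn
{:unhandled-exceptions {:log-level :error
                        :email {:to ["[email protected]"]
                                :from "[email protected]"
                                :subject "Unhandled exception in cook"}}}
```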
It can be intimidating to choose JVM options to enable Cook to run with high performance: what GC to use, how much heap, which Datomic options? Here’s a table with some options that should work for a cluster with thousands of machines:

Options | Reasoning |
---|---|
 | This enables the low-pause collector, which gives better API latency characteristics |
 | This means that the JVM will target to never stop the world for more than 50ms |
 | Increase the Datomic read rate to improve table scans |
 | Balance the writes with the read rate for faster job updates |
 | This allows Datomic to index much less often |
 | This allows Datomic to accept writes during slow indexing jobs for longer |
 | Sometimes, we generate big and bad transactions; this helps us to not die |
 | This helps to deal with slow peers |
 | This accelerates queries by caching a lot of data in memory |
 | Set the heap to use 12GB |
 | Don’t bother scaling the heap up; just force it to start at full size |