-
Notifications
You must be signed in to change notification settings - Fork 751
Gobblin Deployment
- Author: Yinan
- Reviewer: Sahil
One important feature of Gobblin is it can be deployed and run in different settings on different platforms. Currently Gobblin supports the standalone mode and Hadoop MapReduce mode. This page summarizes the different deployment modes of Gobblin. It is also helpful to understand the architecture of Gobblin in a specific deployment mode, so this page also describes the architecture of each deployment mode.
The following diagram illustrates the Gobblin standalone architecture. In the standalone mode, a Gobblin instance runs in a single JVM and tasks run in a thread pool, the size of which is configurable. The standalone mode is good for light-weight data sources such as small databases. The standalone mode is also the default mode for trying and testing Gobblin.
In the standalone deployment, the JobScheduler
runs as a daemon process that schedules and runs jobs using the so-called LocalJobLauncher
s. Each LocalJobLauncher
starts and manages a few components for executing tasks of a Gobblin job. Specifically, a TaskExecutor
is responsible for executing tasks in a thread pool, whose size is configurable on a per-job basis. A LocalTaskStateTracker
is responsible for keep tracking of the state of running tasks, and particularly updating the task metrics. The LocalJobLauncher
follows the steps below to launch and run a Gobblin job:
- Starting the
TaskExecutor
andLocalTaskStateTracker
. - Creating an instance of the
Source
class specified in the job configuration and getting the list ofWorkUnit
s to do. - Creating a task for each
WorkUnit
in the list, registering the task with theLocalTaskStateTracker
, and submitting the task to theTaskExecutor
to run. - Waiting for all the submitted tasks to finish.
- Upon completion of all the submitted tasks, collecting tasks states and persisting them to the state store, and publishing the extracted data.
Before Gobblin can be deployed, it needs to be built and packaged. Refer to the Getting Started Guide on how to build Gobblin.
In the standalone mode, the JobScheduler
, upon startup, will pick up job configuration files from a user-defined directory and schedule the jobs to run. An environment variable named GOBBLIN_JOB_CONFIG_DIR
must be set to point to the directory where job configuration files are stored. Note that this job configuration directory is different from gobblin-dist/conf
, which stores Gobblin system configuration files.
Gobblin needs a working directory at runtime, which is defined using an environment variable GOBBLIN_WORK_DIR
. Once started, Gobblin will create some subdirectories under the root working directory, as follows:
GOBBLIN_WORK_DIR\
task-staging\ # Staging area where data pulled by individual tasks lands
task-output\ # Output area where data pulled by individual tasks lands
job-output\ # Final output area of data pulled by jobs
state-store\ # Persisted job/task state store
metrics\ # Metrics store (in the form of metric log files), one subdirectory per job.
Before starting the Gobblin standalone daemon, make sure the environment variable JAVA_HOME is properly set. Gobblin ships with a script for starting and stopping the standalone Gobblin daemon on a single node, found at bin/gobblin-standalone.sh
.
To start the Gobblin standalone daemon, run the following command:
bin/gobblin-standalone.sh start
After the Gobblin standalone daemon is started, the logs can be found under gobblin-dist/logs
. By default, the Gobblin standalone daemon uses the following JVM settings. Change the settings if necessary for your deployment.
-Xmx2g -Xms1g
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
-XX:+UseCompressedOops
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<gobblin log dir>
To restart the Gobblin standalone daemon, run the following command:
bin/gobblin-standalone.sh restart
To stop the running Gobblin standalone darmon, run the following command:
bin/gobblin-standalone.sh stop
The script also supports checking the status of the running daemon process using the status
command.
The digram below shows the architecture of Gobblin on Hadoop MapReduce. As the diagram shows, a Gobblin job runs as a mapper-only MapReduce job that runs tasks of the Gobblin job in the mappers. The basic idea here is to use the mappers purely as containers to run Gobblin tasks. This design also makes it easier to integrate with Yarn. Unlike in the standalone mode, task retries are not handled by Gobblin itself in the Hadoop MapReduce mode. Instead, Gobblin relies on the task retry mechanism of Hadoop MapReduce.
In this mode, a MRJobLauncher
is used to launch and run a Gobblin job on Hadoop MapReduce, following the steps below:
- Creating an instance of the
Source
class specified in the job configuration and getting the list ofWorkUnit
s to do. - Serializing each
WorkUnit
into a file on HDFS that will be read later by a mapper. - Creating a file that lists the paths of the files storing serialized
WorkUnit
s. - Creating and configuring a mapper-only Hadoop MapReduce job that takes the file created in step 3 as input.
- Starting the MapReduce job to run on the cluster of choice and waiting for it to finish.
- Upon completion of the MapReduce job, collecting tasks states and persisting them to the state store, and publishing the extracted data.
A mapper in a Gobblin MapReduce job runs one or more tasks, depending on the number of WorkUnit
s to do and the (optional) maximum number of mappers specified in the job configuration. If there is no maximum number of mappers specified in the job configuration, each WorkUnit
corresponds to one task that is executed by one mapper and each mapper only runs one task. Otherwise, if a maximum number of mappers is specified and there are more WorkUnit
s than the maximum number of mappers allowed, each mapper may handle more than one WorkUnit
. There is also a special type of WorkUnit
s named MultiWorkUnit
that group multiple WorkUnit
s to be executed together in one batch in a single mapper.
A mapper in a Gobblin MapReduce job follows the step below to run tasks assigned to it:
- Starting the
TaskExecutor
that is responsible for executing tasks in a configurable-size thread pool andMRTaskStateTracker
that is responsible for keep tracking of the state of running tasks in the mapper. - Reading the next input record that is the path to the file storing a serialized
WorkUnit
. - Deserializing the
WorkUnit
, creating a task from theWorkUnit
, registering the task with theMRTaskStateTracker
, and submitting the task to theTaskExecutor
to run. If the input is aMultiWorkUnit
, a task is created for eachWorkUnit
wrapped in theMultiWorkUnit
. - Waiting for all the submitted tasks to finish.
- Upon completion of all the submitted tasks, writing out the state of each task into a file that will be read by the
MRJobLauncher
when collecting task states. - Going back to step 2 and reading the next input record if available.
- Home
- [Getting Started](Getting Started)
- Architecture
- User Guide
- Working with Job Configuration Files
- [Deployment](Gobblin Deployment)
- Gobblin on Yarn
- Compaction
- [State Management and Watermarks] (State-Management-and-Watermarks)
- Working with the ForkOperator
- [Configuration Glossary](Configuration Properties Glossary)
- [Partitioned Writers](Partitioned Writers)
- Monitoring
- Schedulers
- [Job Execution History Store](Job Execution History Store)
- Gobblin Build Options
- Troubleshooting
- [FAQs] (FAQs)
- Case Studies
- Gobblin Metrics
- [Quick Start](Gobblin Metrics)
- [Existing Reporters](Existing Reporters)
- [Metrics for Gobblin ETL](Metrics for Gobblin ETL)
- [Gobblin Metrics Architecture](Gobblin Metrics Architecture)
- [Implementing New Reporters](Implementing New Reporters)
- [Gobblin Metrics Performance](Gobblin Metrics Performance)
- Developer Guide
- [Customization: New Source](Customization for New Source)
- [Customization: Converter/Operator](Customization for Converter and Operator)
- Code Style Guide
- IDE setup
- Monitoring Design
- Project
- [Feature List](Feature List)
- Contributors/Team
- [Talks/Tech Blogs](Talks and Tech Blogs)
- News/Roadmap
- Posts
- Miscellaneous