[SPARK-23572][DOCS] Bring "security.md" up to date.
This change rewrites the security documentation so that it's
up to date with new features, more correct, and more complete.

Because security is such an important feature, I chose to move all the
relevant configuration documentation to the security page, instead of
having it peppered all over the configuration page. This makes the
security page almost a one-stop shop for security configuration in
Spark. The only exceptions are some minor YARN-specific features, which
I left in the YARN page.

I also re-organized the page's topics, since the old layout didn't make
a lot of sense: Kerberos features were described inside paragraphs about
UI access control, among other oddities. It should be easier now to find
information about specific Spark security features. I also enabled TOCs
for both the Security and YARN pages, since that makes it easier to see
what is covered.

I removed most of the comments from the SecurityManager javadoc, since
they just replicated information from the security doc with varying
degrees of staleness.

Author: Marcelo Vanzin <[email protected]>

Closes apache#20742 from vanzin/SPARK-23572.
Marcelo Vanzin committed Mar 26, 2018
1 parent eb48edf commit b30a7d2
Showing 6 changed files with 673 additions and 703 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -76,6 +76,7 @@ streaming-tests.log
target/
unit-tests.log
work/
docs/.jekyll-metadata

# For Hive
TempStatsStore/
144 changes: 3 additions & 141 deletions core/src/main/scala/org/apache/spark/SecurityManager.scala
@@ -42,148 +42,10 @@ import org.apache.spark.util.Utils
* should access it from that. There are some cases where the SparkEnv hasn't been
* initialized yet and this class must be instantiated directly.
*
* Spark currently supports authentication via a shared secret.
* Authentication can be configured to be on via the 'spark.authenticate' configuration
* parameter. This parameter controls whether the Spark communication protocols do
* authentication using the shared secret. This authentication is a basic handshake to
* make sure both sides have the same shared secret and are allowed to communicate.
* If the shared secret is not identical they will not be allowed to communicate.
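*
* A minimal sketch of enabling the shared-secret handshake (how the secret
* itself is provisioned is deployment-specific, as described below):
*
* {{{
*   import org.apache.spark.SparkConf
*
*   val conf = new SparkConf()
*     .set("spark.authenticate", "true") // all communication must then present the shared secret
* }}}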
*
* The Spark UI can also be secured by using javax servlet filters. A user may want to
* secure the UI if it has data that other users should not be allowed to see. The javax
* servlet filter specified by the user can authenticate the user and then once the user
* is logged in, Spark can compare that user versus the view acls to make sure they are
* authorized to view the UI. The configs 'spark.acls.enable', 'spark.ui.view.acls' and
* 'spark.ui.view.acls.groups' control the behavior of the acls. Note that the person who
* started the application always has view access to the UI.
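*
* A sketch of view ACL configuration (the user and group names here are
* hypothetical placeholders):
*
* {{{
*   import org.apache.spark.SparkConf
*
*   val conf = new SparkConf()
*     .set("spark.acls.enable", "true")                // turn ACL checking on
*     .set("spark.ui.view.acls", "alice,bob")          // individual users who may view the UI
*     .set("spark.ui.view.acls.groups", "datascience") // groups who may view the UI
* }}}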
*
* Spark has a set of individual and group modify acls (`spark.modify.acls` and
* `spark.modify.acls.groups`) that control which users and groups have permission to
* modify a single application. This includes things like killing the application.
* By default the person who started the application has modify access. For modify access
* through the UI, you must have a filter that does authentication in place for the modify
* acls to work properly.
*
* Spark also has a set of individual and group admin acls (`spark.admin.acls` and
* `spark.admin.acls.groups`) that define the users and groups who always have
* permission to view or modify the Spark application.
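*
* A sketch combining modify and admin ACLs (names are placeholders):
*
* {{{
*   import org.apache.spark.SparkConf
*
*   val conf = new SparkConf()
*     .set("spark.modify.acls", "alice")        // may modify (e.g. kill) the application
*     .set("spark.modify.acls.groups", "ops")
*     .set("spark.admin.acls", "admin1")        // always have both view and modify access
*     .set("spark.admin.acls.groups", "admins")
* }}}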
*
* Starting from version 1.3, Spark has partial support for encrypted connections with SSL.
*
* At this point Spark has multiple communication protocols that need to be secured and
* different underlying mechanisms are used depending on the protocol:
*
* - HTTP for broadcast and file server (via HttpServer) -> Spark currently uses Jetty
* for the HttpServer. Jetty supports multiple authentication mechanisms -
* Basic, Digest, Form, Spnego, etc. It also supports multiple different login
* services - Hash, JAAS, Spnego, JDBC, etc. Spark currently uses the HashLoginService
* to authenticate using DIGEST-MD5 via a single user and the shared secret.
* Since we are using DIGEST-MD5, the shared secret is not passed on the wire
* in plaintext.
*
* We currently support SSL (https) for this communication protocol (see the details
* below).
*
* The Spark HttpServer installs the HashLoginService and configures it to DIGEST-MD5.
* Any clients must specify the user and password. A default
* Authenticator is installed in the SecurityManager to perform the authentication;
* in this case it gets the user name and password from the request.
*
* - BlockTransferService -> The Spark BlockTransferService uses Java NIO to asynchronously
* exchange messages. For this we use the Java SASL
* (Simple Authentication and Security Layer) API and again use DIGEST-MD5
* as the authentication mechanism. This means the shared secret is not passed
* over the wire in plaintext.
* Note that SASL is pluggable as to what mechanism it uses. We currently use
* DIGEST-MD5 but this could be changed to use Kerberos or other in the future.
* Spark currently supports "auth" for the quality of protection, which means
* the connection does not support integrity or privacy protection (encryption)
* after authentication. SASL also supports "auth-int" and "auth-conf" which
* Spark could support in the future to allow the user to specify the quality
* of protection they want. If we support those, the messages will also have to
* be wrapped and unwrapped via the SaslServer/SaslClient.wrap/unwrap APIs.
*
* Since the NioBlockTransferService does asynchronous message passing, the SASL
* authentication is a bit more complex. A ConnectionManager can be both a client
* and a server, so for a particular connection it has to determine what to do.
* A ConnectionId was added to be able to track connections and is used to
* match up incoming messages with connections waiting for authentication.
* The ConnectionManager tracks all the sendingConnections using the ConnectionId,
* waits for the response from the server, and does the handshake before sending
* the real message.
*
* The NettyBlockTransferService ensures that SASL authentication is performed
* synchronously prior to any other communication on a connection. This is done in
* SaslClientBootstrap on the client side and SaslRpcHandler on the server side.
*
* - HTTP for the Spark UI -> the UI was changed to use servlets so that javax servlet filters
* can be used. YARN requires that a specific AmIpFilter be installed for security to work
* properly. For non-Yarn deployments, users can write a filter to go through their
* organization's normal login service. If an authentication filter is in place then the
* SparkUI can be configured to check the logged in user against the list of users who
* have view acls to see if that user is authorized.
* The filters can also be used for many different purposes. For instance filters
* could be used for logging, encryption, or compression.
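*
* As a sketch, such a filter is typically installed through the
* `spark.ui.filters` setting; the filter class named here is a hypothetical
* example of an organization-specific javax.servlet.Filter:
*
* {{{
*   import org.apache.spark.SparkConf
*
*   val conf = new SparkConf()
*     .set("spark.ui.filters", "com.example.auth.MyAuthFilter") // runs before UI requests are served
* }}}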
*
* The exact mechanisms used to generate/distribute the shared secret are deployment-specific.
*
* For YARN deployments, the secret is automatically generated. The secret is placed in the Hadoop
* UGI which gets passed around via the Hadoop RPC mechanism. Hadoop RPC can be configured to
* support different levels of protection. See the Hadoop documentation for more details. Each
* Spark application on YARN gets a different shared secret.
*
* On YARN, the Spark UI gets configured to use the Hadoop YARN AmIpFilter which requires the user
* to go through the ResourceManager Proxy. That proxy is there to reduce the possibility of web
* based attacks through YARN. Hadoop can be configured to use filters to do authentication. That
* authentication then happens via the ResourceManager Proxy and Spark will use that to do
* authorization against the view acls.
*
* For other Spark deployments, the shared secret must be specified via the
* spark.authenticate.secret config.
* All the nodes (Master and Workers) and the applications need to have the same shared secret.
* This again is not ideal, as one user could potentially affect another user's application.
* This should be enhanced in the future to provide better protection.
* If the UI needs to be secure, the user needs to install a javax servlet filter to do the
* authentication. Spark will then use that user to compare against the view acls to do
* authorization. If no filter is in place, the user is generally null and no authorization
* can take place.
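*
* A minimal sketch for such a deployment (the secret value is a placeholder
* and must be identical on the master, the workers, and the application):
*
* {{{
*   import org.apache.spark.SparkConf
*
*   val conf = new SparkConf()
*     .set("spark.authenticate", "true")
*     .set("spark.authenticate.secret", "placeholder-shared-secret")
* }}}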
*
* When authentication is being used, encryption can also be enabled by setting the option
* spark.authenticate.enableSaslEncryption to true. This is only supported by communication
* channels that use the network-common library, and can be used as an alternative to SSL in those
* cases.
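*
* A sketch of layering SASL encryption on top of authentication (only
* effective on channels backed by the network-common library, as noted above):
*
* {{{
*   import org.apache.spark.SparkConf
*
*   val conf = new SparkConf()
*     .set("spark.authenticate", "true")                       // encryption requires authentication
*     .set("spark.authenticate.enableSaslEncryption", "true")
* }}}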
*
* SSL can be used for encryption for certain communication channels. The user can configure the
* default SSL settings which will be used for all the supported communication protocols unless
* they are overwritten by protocol specific settings. This way the user can easily provide the
* common settings for all the protocols without disabling the ability to configure each one
* individually.
*
* All the SSL settings like `spark.ssl.xxx` where `xxx` is a particular configuration property,
* denote the global configuration for all the supported protocols. In order to override the global
* configuration for the particular protocol, the properties must be overwritten in the
* protocol-specific namespace. Use `spark.ssl.yyy.xxx` settings to overwrite the global
* configuration for the particular protocol denoted by `yyy`. Currently `yyy` can only be
* `fs`, for broadcast and file server.
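*
* For example (paths and passwords here are placeholders), a global key-store
* with a per-protocol override for the file server namespace:
*
* {{{
*   import org.apache.spark.SparkConf
*
*   val conf = new SparkConf()
*     .set("spark.ssl.enabled", "true")
*     .set("spark.ssl.keyStore", "/path/to/global.jks")      // global default for all protocols
*     .set("spark.ssl.keyStorePassword", "placeholder")
*     .set("spark.ssl.fs.keyStore", "/path/to/fs-only.jks")  // overrides the global value for `fs`
* }}}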
*
* Refer to [[org.apache.spark.SSLOptions]] documentation for the list of
* options that can be specified.
*
* SecurityManager initializes SSLOptions objects for different protocols separately. SSLOptions
* object parses Spark configuration at a given namespace and builds the common representation
* of SSL settings. SSLOptions is then used to provide protocol-specific SSLContextFactory for
* Jetty.
*
* SSL must be configured on each node and configured for each component involved in
* communication using the particular protocol. In YARN clusters, the key-store can be prepared on
* the client side then distributed and used by the executors as part of the application
* (YARN allows the user to deploy files before the application is started).
* In standalone deployment, the user needs to provide key-stores and configuration
* options for master and workers. In this mode, the user may allow the executors to use the SSL
* settings inherited from the worker which spawned that executor. This can be accomplished by
* setting `spark.ssl.useNodeLocalConf` to `true`.
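*
* A sketch for a standalone cluster where executors inherit the SSL settings
* of the worker that spawned them:
*
* {{{
*   import org.apache.spark.SparkConf
*
*   val conf = new SparkConf()
*     .set("spark.ssl.enabled", "true")
*     .set("spark.ssl.useNodeLocalConf", "true") // executors reuse the spawning worker's SSL config
* }}}
*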
* This class implements all of the configuration related to security features described
* in the "Security" document. Please refer to that document for specific features implemented
* here.
*/

private[spark] class SecurityManager(
    sparkConf: SparkConf,
    val ioEncryptionKey: Option[Array[Byte]] = None)
