apache_spark
This cookbook installs and configures Apache Spark.
- GitHub: https://github.com/clearstorydata-cookbooks/apache_spark
- Chef Supermarket: https://supermarket.chef.io/cookbooks/apache_spark
- Travis CI: https://travis-ci.org/clearstorydata-cookbooks/apache_spark
- Documentation: http://clearstorydata-cookbooks.github.io/apache_spark/chef/apache_spark.html
Overview
This cookbook installs and configures Apache Spark. Currently, only the standalone deployment mode is supported. Future work:
- YARN and Mesos deployment modes
- Support for installing from Cloudera and HDP Spark packages
Compatibility
The following platforms are currently tested:
- Ubuntu 12.04
- CentOS 6.5
The following platforms are not tested but will probably work (tests coming soon):
- Fedora 21
- Ubuntu 14.04
Configuration
- `node['apache_spark']['install_mode']`: `tarball` to install from a downloaded tarball, or `package` to install from an OS-specific package.
- `node['apache_spark']['download_url']`: the URL to download the Apache Spark binary distribution tarball from in the `tarball` installation mode.
- `node['apache_spark']['checksum']`: SHA256 checksum of the Apache Spark binary distribution tarball.
- `node['apache_spark']['pkg_name']`: package name to install in the `package` installation mode.
- `node['apache_spark']['pkg_version']`: package version to install in the `package` installation mode.
- `node['apache_spark']['install_dir']`: target directory to install Spark into in the `tarball` installation mode. In the `package` mode, this must be set to the directory that the package installs Spark into.
- `node['apache_spark']['install_base_dir']`: in the `tarball` installation mode, this is where the tarball is actually extracted; a symlink pointing to the subdirectory containing the specific Spark version is created at `node['apache_spark']['install_dir']`.
- `node['apache_spark']['user']`: UNIX user to create for running Spark.
- `node['apache_spark']['group']`: UNIX group to create for running Spark.
- `node['apache_spark']['standalone']['master_host']`: the host that Spark standalone-mode workers connect to.
- `node['apache_spark']['standalone']['master_bind_ip']`: the IP address the master binds to. This should be set so that workers are able to connect to the master.
- `node['apache_spark']['standalone']['master_port']`: the port the Spark standalone master listens on.
- `node['apache_spark']['standalone']['master_webui_port']`: Spark standalone master web UI port.
- `node['apache_spark']['standalone']['worker_bind_ip']`: the IP address workers bind to. They bind to all network interfaces by default.
- `node['apache_spark']['standalone']['worker_webui_port']`: the port the Spark worker web UI listens on.
- `node['apache_spark']['standalone']['job_dir_days_retained']`: `app-...` subdirectories of `node['apache_spark']['standalone']['worker_work_dir']` older than this number of days are deleted periodically on worker nodes to prevent unbounded accumulation. These directories contain Spark executor stdout/stderr logs. Enough directories are still retained to honor `node['apache_spark']['standalone']['job_dir_num_retained']`.
- `node['apache_spark']['standalone']['job_dir_num_retained']`: the minimum number of Spark executor log directories (`app-...`) to retain, regardless of creation time.
- `node['apache_spark']['standalone']['worker_dir_cleanup_log']`: log file path for the Spark executor log directory cleanup script.
- `node['apache_spark']['standalone']['worker_cores']`: the number of "cores" (threads) to allocate on each worker node.
- `node['apache_spark']['standalone']['worker_work_dir']`: the directory used to store Spark executor logs and Spark job jars.
- `node['apache_spark']['standalone']['worker_memory_mb']`: the amount of memory in MB to allocate to each worker (i.e. the maximum total memory used by all applications' executors running on a worker node).
- `node['apache_spark']['standalone']['default_executor_mem_mb']`: the default amount of memory allocated to a Spark application's executor on each node.
- `node['apache_spark']['standalone']['log_dir']`: the log directory for Spark masters and workers.
- `node['apache_spark']['standalone']['daemon_root_logger']`: the `spark.root.logger` property is set to this value.
- `node['apache_spark']['standalone']['max_num_open_files']`: the maximum number of open files to set with `ulimit` before launching a worker.
- `node['apache_spark']['standalone']['java_debug_enabled']`: whether Java debugging options are enabled for Spark processes. Note: currently, this option does not work as intended.
- `node['apache_spark']['standalone']['default_debug_port']`: default Java debug port to use. A free port is chosen if this port is unavailable.
- `node['apache_spark']['standalone']['master_debug_port']`: default Java debug port to use for Spark masters. A free port is chosen if this port is unavailable.
- `node['apache_spark']['standalone']['worker_debug_port']`: default Java debug port to use for Spark workers. A free port is chosen if this port is unavailable.
- `node['apache_spark']['standalone']['executor_debug_port']`: default Java debug port to use for Spark standalone executors. A free port is chosen if this port is unavailable.
- `node['apache_spark']['standalone']['common_extra_classpath_items']`: common classpath items to add to Spark application drivers and executors (but not to Spark master and worker processes).
- `node['apache_spark']['standalone']['worker_dir']`: set to a non-nil value to tell the Spark worker to use an alternate directory for Spark scratch space.
- `node['apache_spark']['standalone']['worker_opts']`: set to a non-nil value to pass additional settings to the Spark worker, e.g. `-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=86400`. Ideal for worker-only options that you do not want in the default configuration file.
- `node['apache_spark']['conf']['...']`: Spark configuration options that go into the default Spark configuration file. See https://spark.apache.org/docs/latest/configuration.html for details.
- `node['apache_spark']['standalone']['local_dirs']`: a list of local directories to use on workers. This is where map output files are stored, so these directories should have enough space available.
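As an illustration of how these attributes fit together, a wrapper cookbook's attributes file might override a few of them as sketched below. All values are examples only (the hostname, core/memory sizes, and download URL are not defaults of this cookbook), and the checksum is a placeholder that must be replaced with the real SHA256 of the tarball you download.

```ruby
# attributes/default.rb in a hypothetical wrapper cookbook -- example values only.
override['apache_spark']['install_mode'] = 'tarball'
override['apache_spark']['download_url'] =
  'https://archive.apache.org/dist/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.6.tgz'
# SHA256 of whatever tarball download_url points at (placeholder below):
override['apache_spark']['checksum'] = 'replace-with-the-real-sha256'

override['apache_spark']['standalone']['master_host'] = 'spark-master.example.com'
override['apache_spark']['standalone']['worker_cores'] = 4
override['apache_spark']['standalone']['worker_memory_mb'] = 8192

# Arbitrary spark-defaults.conf settings go under node['apache_spark']['conf']:
override['apache_spark']['conf']['spark.serializer'] =
  'org.apache.spark.serializer.KryoSerializer'
```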
Testing
ChefSpec
```bash
bundle install
bundle exec rspec
```
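For reference, a minimal ChefSpec example is sketched below. The spec file itself is hypothetical (it is not claimed to ship with the cookbook); it only exercises the `apache_spark::spark-user` recipe documented further down.

```ruby
# spec/spark_user_spec.rb -- illustrative sketch, not part of the cookbook's own spec suite.
require 'chefspec'

describe 'apache_spark::spark-user' do
  let(:chef_run) do
    ChefSpec::SoloRunner.new(platform: 'ubuntu', version: '12.04').converge(described_recipe)
  end

  it 'creates the Spark group and user' do
    expect(chef_run).to create_group(chef_run.node['apache_spark']['group'])
    expect(chef_run).to create_user(chef_run.node['apache_spark']['user'])
  end
end
```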
Test Kitchen
```bash
bundle install
bundle exec kitchen test
```
Contributing
If you would like to contribute to this cookbook's development, please follow the steps below:
- Fork this repository on GitHub
- Make your changes
- Run tests
- Submit a pull request
License
Apache License 2.0
https://www.apache.org/licenses/LICENSE-2.0
Cookbook Documentation
Recipes Summary
- apache_spark::spark-user
- apache_spark::spark-install
- apache_spark::find-free-port
- apache_spark::spark-standalone-worker
- apache_spark::spark-standalone-master
- apache_spark::force-package-index-update
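The standalone master and worker recipes listed above each include `apache_spark::spark-install`, which in turn includes `apache_spark::spark-user` and `apache_spark::find-free-port` (see the recipe code below), so a wrapper cookbook usually only needs to pick the master or the worker recipe. A minimal sketch, assuming a hypothetical wrapper cookbook named `my_spark_cluster`:

```ruby
# metadata.rb of the hypothetical 'my_spark_cluster' wrapper cookbook
depends 'apache_spark'

# recipes/master.rb -- converges a Spark standalone master node
include_recipe 'apache_spark::spark-standalone-master'

# recipes/worker.rb -- converges a Spark standalone worker node
include_recipe 'apache_spark::spark-standalone-worker'
```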
Recipe Details
apache_spark::spark-user
```ruby
# File 'recipes/spark-user.rb', line 1

# Copyright 2015 ClearStory Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

spark_user = node['apache_spark']['user']
spark_group = node['apache_spark']['group']

group spark_group

user spark_user do
  comment 'Apache Spark Framework'
  uid node['apache_spark']['uid'] if node['apache_spark']['uid']
  gid spark_group
end
```
apache_spark::spark-install
```ruby
# File 'recipes/spark-install.rb', line 1

# Copyright 2015 ClearStory Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include_recipe 'apache_spark::find-free-port'
include_recipe 'apache_spark::spark-user'

spark_user = node['apache_spark']['user']
spark_group = node['apache_spark']['group']
install_mode = node['apache_spark']['install_mode']
spark_install_dir = node['apache_spark']['install_dir']
spark_install_base_dir = node['apache_spark']['install_base_dir']
spark_conf_dir = ::File.join(spark_install_dir, 'conf')

case install_mode
when 'package'
  package node['apache_spark']['pkg_name'] do
    version node['apache_spark']['pkg_version']
  end
when 'tarball'
  install_base_dir = node['apache_spark']['install_base_dir']

  directory install_base_dir do
    user spark_user
    group spark_group
  end

  tarball_basename = ::File.basename(URI.parse(node['apache_spark']['download_url']).path)
  downloaded_tarball_path = ::File.join(Chef::Config[:file_cache_path], tarball_basename)
  tarball_url = node['apache_spark']['download_url']
  Chef::Log.warn("#{tarball_url} will be downloaded to #{downloaded_tarball_path}")

  remote_file downloaded_tarball_path do
    source tarball_url
    checksum node['apache_spark']['checksum']
  end

  extracted_dir_name = tarball_basename.sub(/[.](tar[.]gz|tgz)$/, '')
  Chef::Log.warn("#{downloaded_tarball_path} will be extracted in #{install_base_dir}")
  actual_install_dir = ::File.join(install_base_dir, extracted_dir_name)

  tar_extract downloaded_tarball_path do
    action :extract_local
    target_dir install_base_dir
    creates actual_install_dir
  end

  link spark_install_dir do
    to actual_install_dir
    user spark_user
    group spark_group
  end
else
  fail "Invalid Apache Spark installation mode: #{install_mode}. 'package' or 'tarball' required."
end

local_dirs = node['apache_spark']['standalone']['local_dirs']

(
  [
    spark_install_dir,
    spark_conf_dir,
    node['apache_spark']['standalone']['log_dir'],
    node['apache_spark']['standalone']['worker_work_dir']
  ] + local_dirs.to_a
).each do |dir|
  directory dir do
    mode 0755
    owner spark_user
    group spark_group
    action :create
    recursive true
  end
end

template "#{spark_conf_dir}/spark-env.sh" do
  source 'spark-env.sh.erb'
  mode 0644
  owner spark_user
  group spark_group
  variables node['apache_spark']['standalone']
end

bash 'Change ownership of Spark installation directory' do
  user 'root'
  code "chown -R #{spark_user}:#{spark_group} #{spark_install_base_dir}"
end

template "#{spark_conf_dir}/log4j.properties" do
  source 'spark_log4j.properties.erb'
  mode 0644
  owner spark_user
  group spark_group
  variables node['apache_spark']['standalone']
end

common_extra_classpath_items_str =
  node['apache_spark']['standalone']['common_extra_classpath_items'].join(':')
default_executor_mem_mb = node['apache_spark']['standalone']['default_executor_mem_mb']

template "#{spark_conf_dir}/spark-defaults.conf" do
  source 'spark-defaults.conf.erb'
  mode 0644
  owner spark_user
  group spark_group
  variables options: node['apache_spark']['conf'].to_hash.merge(
    'spark.driver.extraClassPath' => common_extra_classpath_items_str,
    'spark.executor.extraClassPath' => common_extra_classpath_items_str,
    'spark.executor.memory' => "#{default_executor_mem_mb}m",
    'spark.local.dir' => local_dirs.join(',')
  )
end
```
apache_spark::find-free-port
```ruby
# File 'recipes/find-free-port.rb', line 1

# Copyright 2015 ClearStory Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

template '/usr/local/bin/find-free-port.rb' do
  source 'find-free-port.rb.erb'
  mode 0755
  owner 'root'
  group 'root'
  variables ruby_interpreter: RbConfig.ruby
end
```
apache_spark::spark-standalone-worker
```ruby
# File 'recipes/spark-standalone-worker.rb', line 1

# Copyright 2015 ClearStory Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include_recipe 'apache_spark::spark-install'
include_recipe 'monit_wrapper'

worker_runner_script = ::File.join(node['apache_spark']['install_dir'], 'worker_runner.sh')
worker_service_name = 'spark-standalone-worker'

spark_user = node['apache_spark']['user']
spark_group = node['apache_spark']['group']

template worker_runner_script do
  source 'spark_worker_runner.sh.erb'
  mode 0744
  owner spark_user
  group spark_group
  variables node['apache_spark']['standalone'].merge(
    install_dir: node['apache_spark']['install_dir'],
    user: spark_user
  )
end

directory node['apache_spark']['standalone']['worker_work_dir'] do
  mode 0755
  owner spark_user
  group spark_group
  action :create
  recursive true
end

template '/usr/local/bin/clean_spark_worker_dir.rb' do
  source 'clean_spark_worker_dir.rb.erb'
  mode 0755
  owner 'root'
  group 'root'
  variables ruby_interpreter: RbConfig.ruby
end

worker_dir_cleanup_log = node['apache_spark']['standalone']['worker_dir_cleanup_log']

cron 'clean_spark_worker_dir' do
  minute 15
  hour 0
  command '/usr/local/bin/clean_spark_worker_dir.rb ' \
          "--worker_dir #{node['apache_spark']['standalone']['worker_work_dir']} " \
          "--days_retained #{node['apache_spark']['standalone']['job_dir_days_retained']} " \
          "--num_retained #{node['apache_spark']['standalone']['job_dir_num_retained']} " \
          "&>> #{worker_dir_cleanup_log}"
end

# logrotate for the log cleanup script
logrotate_app 'worker-dir-cleanup-log' do
  cookbook 'logrotate'
  path worker_dir_cleanup_log
  frequency 'daily'
  rotate 3  # keep this many logs
  create '0644 root root'
end

# Run Spark standalone worker with Monit

master_host_port = format(
  '%s:%d',
  node['apache_spark']['standalone']['master_host'],
  node['apache_spark']['standalone']['master_port'].to_i
)

monit_wrapper_monitor worker_service_name do
  template_source 'pattern-based_service.conf.erb'
  template_cookbook 'monit_wrapper'
  wait_for_host_port master_host_port
  variables \
    cmd_line_pattern: node['apache_spark']['standalone']['worker_cmdline_pattern'],
    cmd_line: worker_runner_script,
    user: 'root',  # The worker needs to run as root initially to use ulimit.
    group: 'root'
end

monit_wrapper_service worker_service_name do
  action :start
  wait_for_host_port master_host_port

  # Determine the "notification action" based on whether the service is running at recipe compile
  # time. This is important because if the service is not running when the Chef run starts, it will
  # start as part of the :start action and pick up the new software version and configuration
  # anyway, so we don't have to restart it as part of delayed notification.
  # TODO: put this logic in a library method in monit_wrapper.
  notification_action = monit_service_exists_and_running?(worker_service_name) ? :restart : :start

  subscribes notification_action, "monit-wrapper_monitor[#{worker_service_name}]", :delayed
  subscribes notification_action, "package[#{node['apache_spark']['pkg_name']}]", :delayed
  subscribes notification_action, "template[#{worker_runner_script}]", :delayed
end
```
apache_spark::spark-standalone-master
```ruby
# File 'recipes/spark-standalone-master.rb', line 1

# Copyright 2015 ClearStory Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

include_recipe 'apache_spark::spark-install'
include_recipe 'monit_wrapper'

master_runner_script = ::File.join(node['apache_spark']['install_dir'], 'bin', 'master_runner.sh')
master_service_name = 'spark-standalone-master'

spark_user = node['apache_spark']['user']
spark_group = node['apache_spark']['group']

template master_runner_script do
  source 'spark_master_runner.sh.erb'
  mode 0744
  owner spark_user
  group spark_group
  variables node['apache_spark']['standalone'].merge(
    install_dir: node['apache_spark']['install_dir']
  )
end

# Run Spark standalone master with Monit

monit_wrapper_monitor master_service_name do
  template_source 'pattern-based_service.conf.erb'
  template_cookbook 'monit_wrapper'
  variables \
    cmd_line_pattern: node['apache_spark']['standalone']['master_cmdline_pattern'],
    cmd_line: master_runner_script,
    user: spark_user,
    group: spark_group
end

monit_wrapper_service master_service_name do
  action :start

  # Determine the "notification action" based on whether the service is running at recipe compile
  # time. This is important because if the service is not running when the Chef run starts, it will
  # start as part of the :start action and pick up the new software version and configuration
  # anyway, so we don't have to restart it as part of delayed notification.
  # TODO: put this logic in a library method in monit_wrapper.
  notification_action = monit_service_exists_and_running?(master_service_name) ? :restart : :start

  subscribes notification_action, "monit-wrapper_monitor[#{master_service_name}]", :delayed
  subscribes notification_action, "package[#{node['apache_spark']['pkg_name']}]", :delayed
  subscribes notification_action, "template[#{master_runner_script}]", :delayed
end
```
apache_spark::force-package-index-update
```ruby
# File 'recipes/force-package-index-update.rb', line 1

# Copyright 2015 ClearStory Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

case node['platform']
when 'debian', 'ubuntu'
  execute 'apt-get update' do
    command 'apt-get update'
    action :nothing
  end.run_action(:run)
when 'redhat', 'centos', 'fedora'
  execute 'apt-get update' do
    command <<-EOT
      yum check-update
      exit_code=$?
      if [ "${exit_code}" -eq 100 ]; then
        # yum returns 100 when there are updates available.
        exit_code=0
      fi
      exit "${exit_code}"
    EOT
    action :nothing
  end.run_action(:run)
else
  Chef::Log.info("Cannot update package index for platform #{node['platform']} -- doing nothing")
end
```