Thumbtack: How we run Spark and Sqoop in production

综合编程 2016-06-07

Part ll: How we run Spark/Sqoop in production


In the last post, we described our legacy infrastructure and event processing code, along with the key design decisions we made as we architected our new data infrastructure. In this post, we’ll discuss some of the operational details involved with deploying these systems in production and some lessons that we’ve learned along the way. Here, we’ll cover two topics:

  • Running Spark jobs
  • Running Sqoop imports

Spark in Production

As we described in our last post, we elected to run our own Spark + CDH installation on top of EC2 nodes. At this point, we run dozens of batch jobs for event aggregation, experiment analysis, machine learning, etc., all written as Spark jobs. Given that Spark has become such a critical piece of our production infrastructure, we’ve spent a great deal of time automating deployment of Spark, job tuning, and performance optimization. In this section, we provide some of the specific details of how we typically run Spark in production.

Since Spark just runs on top of YARN with Spark-on-YARN
, we tend to upgrade Spark much more frequently than CDH, and we typically run a very recent version of Spark in production. At the time of writing, we are running Spark 1.6.1.

Building Spark

At Thumbtack, we deploy our own Spark builds. This enables us to update some of Spark’s dependencies (the AWS SDK version is a recurring pain point) and also enables us to build Spark with, as the default distributions do not include Hive support
. Hive is required to use the HiveContext object in Spark, which in turn is required for using any Hive tables that are contained in another database outside of the default database in Hive. After building Spark, we distribute our Spark builds to dev/prod clusters using Puppet and an internal Debian repo hosted on Amazon S3.

Our full Spark build and packaging process is as follows. First, we build Spark using Oracle Java 1.8 and Maven 3.3.9. (Check
the Apache project’s list of mirrors

and pick a new one if the downloads for Spark or Maven fail.)

# Accept Oracle license and download JDK 8.
curl -O -L -H "Cookie: oraclelicense=accept-securebackup-cookie"
tar -xf jdk-8u77-linux-x64.tar.gz

# Download Maven 3.3.
# Note that even though Spark can download its own copy of Maven,
# it'll try to use whatever version of mvn is on PATH,
# and then fail because it's not 3.3.
curl -O -L
tar -xf apache-maven-3.3.9-bin.tar.gz

# Add JDK and Maven to environment.
export PATH="$HOME/jdk1.8.0_77/bin:$HOME/apache-maven-3.3.9/bin:$PATH"
export JAVA_HOME="$HOME/jdk1.8.0_77"

# Download Spark.
curl -O -L
tar -xf spark-1.6.1.tgz

# Build Spark, produces distribution in dist/.
cd spark-1.6.1
# Run tests to make sure the build is good. Can take an hour or two.
# The -pl option skips tests that depend on Docker. If your build box has Docker, you can leave that off.
    -pl '!:spark-docker-integration-tests_2.10' 

This builds the Spark distribution with Hadoop 2.6, YARN, and Hive. It will download Scala 2.10.5 and Zinc 0.3, and it includes the netlib-java dependency for native support in MLLib. We then take the built Spark artifact and package it as a Debian package using


gem install --user-install fpm
export PATH="$(ruby -e 'print Gem.user_dir')/bin:$PATH"
    -s dir 
    -t deb 
    -C dist 
    --exclude examples 
    --exclude 'lib/spark-examples*.jar' 
    --prefix /opt/spark-thumbtack 
    --name spark-thumbtack 
    --version 1.6.1 
    --iteration $iteration_number 
    --architecture amd64 
    --depends java8-runtime-headless 
    --vendor Thumbtack 
    --url "$internal_wiki_url" 
    --description 'Spark built with Hadoop 2.6, Hive, and YARN support'

We use

to deploy built Debian package artifacts to S3:

gem install --user-install deb-s3
deb-s3 upload 
    --visibility private 
    --sign $gpg_key_id 
    --bucket $tt_deb_s3_bucket 
    --arch amd64 
    --codename wheezy 
    --component main 

Finally, we pull down this package on our clusters to /opt/spark. Our clusters include

for making signed requests to S3 over HTTPS, which is necessary to work with deb-s3 –visibility private. Here we tell them to fetch the latest version and install it:

pssh -h cluster-hosts.txt -O StrictHostKeyChecking=no -i sudo aptitude update
pssh -h cluster-hosts.txt -O StrictHostKeyChecking=no -i sudo aptitude -y install spark-thumbtack

We validate our Spark builds on a development cluster, and then update the symlink on our production cluster to deploy the latest build.

Running Spark Jobs

Our production Spark jobs are executed with an invocation of Spark that looks like the following:

    --master yarn-client 
    --executor-cores 2 
    --executor-memory 8g 
    --driver-memory 512m 
    --conf spark.dynamicAllocation.enabled=true 
    --conf spark.dynamicAllocation.initialExecutors=$initialExecutors 
    --conf spark.dynamicAllocation.maxExecutors=$maxExecutors 
    --conf spark.yarn.executor.memoryOverhead=2048 
    --conf spark.shuffle.service.enabled=true 
    --conf spark.ui.showConsoleProgress=false 
    --conf spark.eventLog.dir=hdfs://$namenode/user/spark/applicationHistory 
    --conf spark.yarn.queue=root.production.normal 
    --files /opt/spark/conf/,/opt/spark/conf/ 
    --class "com.thumbtack.spear.Spear" 
    # ... command-line args for our jar follow

Some notes on the above:

  • Executors: In our profiling, we’ve found that 2 cores / 8GB RAM per executor works well for most of our workloads. We then scale the total number of executors per job according to the resource requirements of the job and the available cluster resources. For some memory-intensive jobs, we significantly increase the executor memory allocation.
  • Dynamic Allocation: We use Spark’s
    dynamic allocation

    to scale the executor count dynamically with respect to the workload activity. Moving all of our jobs from static executor allocation to dynamic allocation has significantly reduced the the average resource consumption in YARN, since jobs are not unnecessarily retaining a large number of executors.
  • Event Log: We write Spark events to HDFS so that the Spark history server can pick them up. This is useful for debugging issues we encounter with Spark jobs.
  • YARN Resource Queue: By assigning our Spark jobs to a specific resource queue with spark.yarn.queue, we have a great deal of control over how YARN allocates resources to production Spark jobs vs. other ad-hoc jobs running on the cluster. In this case, we assign our production Spark jobs to the root.production.normal queue.

Relational ETL with Sqoop

We implemented our relational ETL using Apache Sqoop. For ETLing this data into HDFS, we at first hoped to do differential updates with Sqoop. However, Sqoop’s
incremental update

mode requires that tables provide a lastmodified timestamp column for every table to be imported. This is an impractical schema change for an existing production database with dozens or hundreds of existing tables, so we opted to do a full import of the entire PostgreSQL schema for each table in our database.

One inconvenient caveat with Snappy on Hadoop is that the Hadoop version of the Snappy codec is incompatible with the default version, so you cannot work offline with Snappy-compressed files as you would other compressed files. Given that Spark’s shell permits loading and exploring Parquet files interactively, this is not too problematic.

Our production Sqoop import job currently takes a full snapshot of our PostgreSQL database every 4 hours, with each table imported using something like the following:

sqoop import 
    -D' -Duser.timezone=GMT' 
    -D mapreduce.job.queuename=root.production.normal 
    --connect jdbc:postgresql://$pg_host/$database 
    --username $user 
    --compression-codec snappy 
    --num-mappers $mapper_count 
    --table $table 
    --target-dir '$import_path' 
    --map-column-java $col=String,... # only when required
    --split-by $split_by_col          # only when required

A few notes regarding the above import command:

  • Timestamps: We found that Sqoop was importing records with incorrect timestamps. A workaround is to specify -D’ -Duser.timezone=GMT’ as noted in

  • YARN Resource Queues: As with Spark jobs, we run all of our production jobs in a
    YARN resource queue

    that is separate from the queue used for ad-hoc jobs. This enables isolation of job groups, and ensures that ad-hoc jobs do not starve our production jobs of cluster resources.
  • Hive Imports: We omit the argument –hive-import, since this is incompatible with –as-parquetfile. Instead, we import the table to $import_path, and in Hive, execute a CREATE EXTERNAL TABLE… command to create a Hive table on top of the imported Parquet files.
  • Compression: We compress all of our data with Snappy.
  • Mapper Count: We found that specifying the mapper count with –num-mappers was important for everything downstream, since this determines the number and size of resulting Parquet files. We’ve set this on a per-table basis to ensure each output Parquet file is approximately the same as our HDFS block size (128MB).
  • PostgreSQL Data Types: Many PostgreSQL column types cannot be imported directly (for example, the

    type). These need to be cast to an importable type like String using –map-column-java $col=String
  • Split Columns: For tables that do not have a primary key, Sqoop requires the specification of a split column, using the argument –split-by $split_by_col. This will enable importing the table, but we’ve found that it can often result in very imbalanced file sizes on HDFS. We’ve worked around this for now by just setting a much higher number of mappers for these tables.

Following this pattern, our automation kicks off a separate Sqoop import for each table in our production PostgreSQL database and other relational databases used across the engineering team. Our end-to-end import process is as follows:

  • Sqoop import table to HDFS at path /data/sql/$table/yyyy/mm/dd/hh/mm
  • Create external Hive table named ${table}_staging on top of HDFS location above
  • Run validation checks to ensure table was imported correctly
  • Drop $table and rename ${table}_staging to $table. This ensures the table is only briefly unavailable.

In the future, we plan to transition to running streaming replication from our PostgreSQL database into HDFS; however, we’ve tabled this effort until our scale requires making those changes. We expect it will require a significant investment of engineering time to build out a relational ETL pipeline around differential updates.


At Thumbtack, we’ve come a long way over the past year. From the MongoDB/PostgreSQL infrastructure we began with, we now have transitioned to data infrastructure that looks like:

We’ve got a lot of work ahead of us, and our team is really excited to continue building out our data infrastructure. Some things on our roadmap:

  • Near Real-time: Develop infrastructure to support near real-time data applications, with a Kinesis or Kafka distributed queue feeding events data into either Spark Streaming
    or Storm
  • Experiments: Thumbtack has some incredibly challenging A/B testing and experimental design problems, and our team is constantly working to improve our infrastructure for setting up, running, and analyzing the outcomes of experiments. Today, we are working towards deploying a unified experiment framework that handles everything in the experiment process: feature flags, randomization / hash seed selection, experiment assignment (including more complex assignment strategies like graphical market-based experiment segmentation), event aggregation, binomial/continuous metric evaluation, and more sophisticated experiment analysis.
  • API we’re building out an API layer, to provide other engineering teams with a standardized way to run jobs which generate datasets, and then store those datasets in persistent stores (e.g., DynamoDB, Elasticache, Elasticsearch, etc.) for use by other backend services.
  • Machine Learning we do quite a bit of predictive modeling on Spark (using Spark’s MLLib
    ), and we’re continually investing in improving our modeling infrastructure (feature extraction, model training, model deployment, model performance evaluation).

We’rehiring! Please reach out if you’re looking to join a fast-growing team that is focused on building out world-class data infrastructure.

At Thumbtack, our mission is to help people accomplish personal projects by matching their needs to the best service professionals in their area. From wedding photographers and DJs to home contractors or French tutors, Thumbtack can help.

责编内容by:Thumbtack Engineering (源链)。感谢您的支持!


Mac 安装 scala & spark # 确保已安装 java 7+# spark 需要 scala 2.11.x (目前最新 2.12.x) wget tar zxvf scala-2.11.11.tgz...
Spark的误解-不仅spark是内存计算, 市面上有一些初学者的误解,他们拿spark和hadoop比较时就会说,Spark是内存计算,内存计算是spark的特性。请问在计算机领域,mysql,redis,ssh框架等等他们不是内存计算吗?依据冯诺依曼体系结构,有什么技术的程序不是在内存中运行,需要数据从硬盘中拉取,然后供cpu进行执行?所...
Monitoring Real-Time Uber Data Using Apache APIs, ... According to Gartner , by 2020, a quarter of a billion connected cars will form a major element of the Internet of Things. Connected vehicles are ...
Spark官方文档翻译:Spark Programming Guide(一)... Overview At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel oper...
快速上手写spark代码系列:01-RDD transformation函数入门... 快速上手写spark代码系列:01-RDD transformation函数入门 标签: RDD transformation 函数 学习Spark的第一步就是要了解RDD的相关概念以及操作,RDD操作的operator分为两类,一类被称为transformat...