1.2 Billion Taxi Rides on AWS RDS running PostgreSQL

2016-06-22

On November 17th, 2015, Todd Schneider published a blog post titled Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
in which he analysed the metadata of 1.1 billion Taxi journeys made in New York City between 2009 and 2015. Included with this work was a link to a GitHub repository
where he published the SQL, Shell and R files he used in his work, along with instructions on how to get everything up and running. The R files also produce a few additional charts that were used in follow-up posts.

In this blog post I'll launch 4 different types of AWS RDS instances running PostgreSQL 9.5.2 and benchmark creating the same graphs that Todd Schneider did in his analysis.

PostgreSQL on AWS RDS Up & Running

I'll launch 4 RDS instances. They will all be running PostgreSQL 9.5.2 in eu-west-1a. Each will have 400 GB of General Purpose SSD capacity and will not be replicated across additional availability zones. Their maintenance windows have been set so they will not run during this benchmark. Each will have its own VPC.

The instance types are as follows:

  • DB1 is a db.t2.large which costs $5.50 / day inc. storage costs for RDS alone.
  • DB2 is a db.r3.large which costs $8.52 / day inc. storage costs for RDS alone.
  • DB3 is a db.m4.large which costs $6.60 / day inc. storage costs for RDS alone.
  • DB4 is a db.m4.xlarge which costs $11.53 / day inc. storage costs for RDS alone.

The instances each took 5 minutes to launch.

Each RDS instance is paired with a t2.large EC2 instance running Ubuntu 14.04.3 LTS that also sits in eu-west-1a. These instances will run the data import and R-based analysis scripts. Each has 500 GB of General Purpose SSD (GP2) capacity and 1,500 IOPS at its disposal. The instances are rated for low to moderate network performance. Each of these instances costs $4.57 / day including storage.
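The 1,500 IOPS figure lines up with GP2's baseline allocation of 3 IOPS per provisioned GB (with a minimum of 100). A quick sanity check of that arithmetic:

```shell
# GP2 volumes earn a baseline of 3 IOPS per provisioned GB,
# so a 500 GB volume gets 1,500 baseline IOPS.
volume_gb=500
iops=$((volume_gb * 3))
echo "${volume_gb} GB GP2 -> ${iops} baseline IOPS"
```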

Downloading 1.2 Billion Taxi Trips

In this benchmark I'll be downloading all the data fresh and importing it into each RDS instance individually from its paired EC2 instance. The original dataset of 1.1 billion taxi journeys has grown with the release of the second half of 2015's data and now has a little over 1.2 billion journeys.

The following are the commands I ran to bootstrap each of the EC2 instances and download the 300+ GB of data.

$ sudo apt-get update
$ sudo apt-get install \
      build-essential \
      git \
      postgis \
      postgresql-client \
      ruby \
      ruby-dev \
      unzip \
      zlib1g-dev

$ sudo gem install \
      activesupport \
      roo \
      rest-client
$ git clone https://github.com/toddwschneider/nyc-taxi-data.git
$ cd nyc-taxi-data
$ cat raw_uber_data_urls.txt \
      raw_data_urls.txt | \
      xargs -n 1 -P 6 \
          wget -P data/
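The xargs invocation fans the combined URL list out to six parallel wget workers, one URL per process (-n 1 passes a single argument to each invocation, -P 6 keeps up to six running concurrently). The same fan-out pattern can be seen with a harmless stand-in for wget:

```shell
# Each input line becomes one invocation (-n 1), with up to
# six running at once (-P 6). With -P the output order may vary.
printf '%s\n' url1 url2 url3 | xargs -n 1 -P 6 echo "fetching"
```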

Importing 1.2 Billion Taxi Trips

In PostgreSQL, nyc-taxi-data is perfectly fine as a database name, but on RDS a database name can only contain letters, numbers and underscores. For this reason I named each database trips. In order for the shell and Ruby scripts to work with this new database name I patched them with the following:

$ sed -i 's/psql nyc-taxi-data/psql trips/' \
      {*.sh,tlc_statistics/import_statistics_data.rb}
$ sed -i 's/psql -d nyc-taxi-data/psql -d trips/' *.sh
$ sed -i 's/createdb/#createdb/' *.sh
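The first sed call relies on bash brace expansion to build the file list; the substitution itself is an in-place rewrite of every matching line. A minimal reproduction of the rename on a throwaway file:

```shell
# Reproduce the database-name patch on a scratch script.
tmp=$(mktemp)
echo 'psql nyc-taxi-data -f create_tables.sql' > "$tmp"

# GNU sed's -i flag rewrites the file in place.
sed -i 's/psql nyc-taxi-data/psql trips/' "$tmp"

cat "$tmp"   # the command now targets the trips database
rm -f "$tmp"
```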

I then created environment variables on each EC2 instance for their respective RDS pairings.

$ read PGHOST
$ read PGUSER
$ read PGPASSWORD

$ export PGHOST
$ export PGUSER
$ export PGPASSWORD

I then ran the database initialisation, population and statistics generation scripts.

$ time (
      ./initialize_database.sh;
      ./import_trip_data.sh;
      ./import_uber_trip_data.sh;
      cat analysis/prepare_analysis.sql \
            tlc_statistics/create_statistics_tables.sql | \
            psql trips;
      cd tlc_statistics;
      ruby import_statistics_data.rb
  )

The following were the durations I observed:

  • The db.t2.large instance took 60 hours, 38 minutes and 42 seconds, costing $13.98 excluding EC2 costs.
  • The db.r3.large instance took 71 hours, 29 minutes and 27 seconds, costing $25.56 excluding EC2 costs.
  • The db.m4.large instance took 62 hours, 13 minutes and 47 seconds, costing $17.05 excluding EC2 costs.
  • The db.m4.xlarge instance took 51 hours, 43 minutes and 56 seconds, costing $24.98 excluding EC2 costs.
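These costs follow from multiplying each run time by the instance's daily rate from the pricing list above; the results land within a few cents of the figures reported, with the small differences presumably down to partial-hour billing. A sketch of the arithmetic for the db.t2.large run:

```shell
# db.t2.large: 60h 38m 42s of import time at $5.50 / day.
awk 'BEGIN {
    hours = 60 + 38/60 + 42/3600          # 60.645 hours
    printf "RDS cost for the import: $%.2f\n", hours / 24 * 5.50
}'
# prints "RDS cost for the import: $13.90", close to the observed $13.98
```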

R Up & Running

The following was run to install R and various other dependencies for the reports Todd Schneider wrote.

$ echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | 
    sudo tee -a /etc/apt/sources.list
$ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
$ gpg -a --export E084DAB9 | sudo apt-key add -
$ sudo apt-get update

$ sudo apt-get install \
    git \
    libgdal-dev \
    libpq-dev \
    libproj-dev \
    r-base \
    r-base-dev
$ mkdir -p $HOME/.R_libs
$ export R_LIBS="$HOME/.R_libs"

$ echo 'requirements = c("ggplot2",
                         "ggmap",
                         "dplyr",
                         "reshape2",
                         "zoo",
                         "scales",
                         "extrafont",
                         "grid",
                         "RPostgreSQL",
                         "rgdal",
                         "maptools",
                         "gpclib")

        sapply(requirements,
               function(x) {
                    if (!x %in% installed.packages()[,"Package"])
                        install.packages(x, repos="http://cran.r-project.org")})' | 
  R --no-save

Before running the analysis.R file I needed to patch the database connector to use the environment variables I set earlier rather than the hard-coded localhost setup.

$ cd ~/nyc-taxi-data/analysis/
$ vi helpers.R

The following line in helpers.R:

con = dbConnect(dbDriver("PostgreSQL"), dbname = "nyc-taxi-data", host = "localhost")

It was replaced with the following:

con = dbConnect(dbDriver("PostgreSQL"),
                dbname = "trips",
                host = Sys.getenv('PGHOST'),
                user = Sys.getenv('PGUSER'),
                password = Sys.getenv('PGPASSWORD'))

Benchmarking RDS

I ran the analysis.R script three times from each respective EC2 pairing. Each of the results has been rounded to the nearest second.

$ time (cat analysis.R | R --no-save)

The db.t2.large instance reported the following times:

  • Run 1: 5 minutes 21 seconds
  • Run 2: 5 minutes 25 seconds
  • Run 3: 5 minutes 39 seconds

The db.r3.large instance reported the following times:

  • Run 1: 6 minutes
  • Run 2: 5 minutes 30 seconds
  • Run 3: 5 minutes 34 seconds

The db.m4.large instance reported the following times:

  • Run 1: 5 minutes 43 seconds
  • Run 2: 5 minutes 34 seconds
  • Run 3: 5 minutes 33 seconds

The db.m4.xlarge instance reported the following times:

  • Run 1: 5 minutes 34 seconds
  • Run 2: 5 minutes 26 seconds
  • Run 3: 5 minutes 41 seconds
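Averaging the three runs puts every instance within roughly ten seconds of five and a half minutes, which is what drives the conclusion below. The means can be computed from the lists above with a little shell arithmetic (each time converted to seconds as m*60 + s; integer division truncates the fraction):

```shell
# Mean analysis.R run time per RDS instance, in seconds.
for inst in "db.t2.large 321 325 339" \
            "db.r3.large 360 330 334" \
            "db.m4.large 343 334 333" \
            "db.m4.xlarge 334 326 341"; do
    set -- $inst   # split into instance name + three run times
    echo "$1: $((($2 + $3 + $4) / 3)) seconds on average"
done
```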

For this workload the reporting speeds don't line up well with the price differences between the RDS instances. I suspect this workload is biased towards R's CPU consumption when generating PNGs rather than RDS' performance when returning aggregate results. The RDS instances also each have the same number of IOPS, which might erase any other performance advantage they could have over one another.

As for the money spent importing the data into RDS, I suspect scaling up is more helpful when you have a number of concurrent users rather than a single, large job to execute.

Written by Mark Litwintschik.
