Sharding a multi-tenant app with Postgres

综合编程 2017-07-19

Whether you’re building marketing analytics, a portal for e-commerce sites, or an application to cater to schools, if you’re building an app and your customer is another business then a multi-tenant approach is the norm. The same code runs for all customers, but each customer sees their own private data set, except in the cases of holistic internal reporting .

Early in your application’s life, customer data has a simple structure which evolves organically. Typically all information relates to a central customer/user/tenant table. With a smaller amount of data (10’s of GB) it’s easy to scale the application by throwing more hardware at it, but what happens when you’ve had so much success that your data no longer fits in memory on a single box, or you need more concurrency? You scale out by re-architecting your application—and it’s often painful (and expensive.)

Options for scaling out your database

This scale-out model for databases has worked well for the likes of Google and Instagram , but doesn't have to be as complicated as you might think.

If you're able to model your multi-tenant data in the right way sharding can be much simpler—you do not need to re-architect your application to scale out, and youcan keep the power you need from a database including joins, indexing, and more. I work at Citus Data , where we’ve created a database that scales out Postgres (an extension to Postgres, actually): we’ve done the hard work of sharding so you don’t have to. While Citus lets you scale out your processing power and memory and storage, how you model your data will determine the ease and flexibility you get from the system. If you're building a multi-tenant SaaS application, hopefully the following example highlights how you can plan early for scaling without having to contort too much of your application.

Scaling out Postgres with ease by adopting a multi-tenant data model

At the core of most SaaS applications, tenancy is already built in, whether you realize it or not. By “tenancy”, we mean the notion that your SaaS application has multiple customers (“tenants”) who are all sharing the same application but whose data needs to be kep separate from each other. (The same way that multiple tenants can live in the same building, but each have their own separate apartment.)

Anyway, as we mentioned above, you may have a users table. Let's look at a very basic SaaS schema that highlights this:

  id UUID,
  owner_email VARCHAR(255),
  owner_password VARCHAR(255),
  name VARCHAR(255),
  url VARCHAR(255),
  last_login_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ

CREATE TABLE products (
  id UUID,
  name VARCHAR(255),
  description TEXT,
  price INTEGER,
  quantity INTEGER,
  store_id UUID,
  created_at TIMESTAMPTZ,
  updated_at TIMESTAMPTZ

CREATE TABLE purchases (
  id UUID,
  product_id UUID,
  customer_id UUID,
  store_id UUID,
  price INTEGER,
  purchased_at TIMESTAMPTZ,

The above schema highlights an overly simplified multi-tenant e-commerce site. Say for example, someone like an Etsy or Shopify. And of course there are a number of queries you would run against this:

List the products for a particular store:

FROM products
WHERE store_id = foo;

Or let’s say you want to compute how many purchases exist weekly for a given store:

SELECT date_trunc('week', purchased_at),
       sum(price * quantity)
FROM purchases,
WHERE = products.stores_id
  AND store_id = ‘foo’

From here you could envision how to give each store its own presence and analytics. Now if we fast-forward a bit and start to look at scaling this out then we have a choice to make on how we'll shard the data.

The easiest level to do this at is the tenant level or in this case on store_id . With the above data model the largest tables over time are likely to be products and purchases, we could shard on both of these. Though if we choose products or purchases, the difficulty lies in the fact that we may want to do queries that focus on some high level item such as store. If we choose store_id then all data for a particular store would exist on the same node, this would allow you push down all computations directly to the a single node.

Multi-tenancy and co-location, a perfect pair

Co-locating data within the same physical instance avoids sending data over the network during joins. This can result in much faster operations. With Citus, there are a number of ways to move your data around so you can join and query it in a flexible manner, but for this class of multi-tenant SaaS apps it’s simple if you can ensure data ends up on the shard. To do this though we need to push down our store_id to all of our tables.

The key that makes this all possible is including your store_id on all tables. By doing this you can easily shard out all your data so it’s located on the same shard. In the above data model we coincidentally had store_id on all of our tables, but if it weren’t there you could add it. This would put you in a good position to distribute all your data so it’s stored on the same nodes. So now let’s try sharding our tenants, in this case stores:

SELECT master_create_distributed_table('stores', 'id', 'hash');
SELECT master_create_distributed_table('products', 'store_id', 'hash');
SELECT master_create_distributed_table('purchases', 'store_id', 'hash');

SELECT master_create_worker_shards('stores', 16);
SELECT master_create_worker_shards('products', 16);
SELECT master_create_worker_shards('purchases', 16);

Now you’re all set. Again, you’ll notice that we shard everything by store_id—this allows all queries to be routed to a single Postgres instance. The same queries as before should work just fine for you as long as you have store_id on your query. An example layout of your data now may look something like:

The alternative to colocation is to choose some lower-level shard key such as orders or products. This has a trade-off of making joins and querying more difficult because you have to send more data over the network and make sure things work in a distributed way. This lower-level key can be useful for consumer focused datasets, if your analytics are always against the entire data set as is often the case in metrics-focused use cases.

To scale out your Postgres database, you have more good choices than you think.

It’s important to note as we did in our sharding JSON in Postgres post that different distribution models can have different benefits and trade-offs. In some cases modeling on a lower-level entity id such as products or purchases can be the right choice— you gain more parallelism for analytics and trade off simplicity in querying a single store. Either choice of picking a multi-tenant data model or adopting a more distributed document model can be made to scale, but each comes with its own trade-offs. If you have the need today to scale out your multi-tenant app then give Citus Cloud a try or if you have any questions on which type of scale-out data model might work best for your application, please don’t hesitate to reach out to my team at Citus . We can help. (And did we mention that Citus is available as open source, as a database service in AWS, and on-prem?)

Originally posted on the Citus Data blog


模型的元数据Meta — Django从入门到精通系列教程... 该系列教程系个人原创,并完整发布在个人官网 刘江的博客和教程 所有转载本文者,需在顶部显著位置注明原作者及www.liujiangblog.com官网地址。 Python及Django学习QQ群:453131687 模型的元数据,指的是“除了字段外的所有内容”,例如排序方式...
PHP set_time_limit&comma... I have an API which is used by a mobile app to create user profiles. It seems that occasionally the app is timing out and displays an error message bu...
Parallax Posters For Everyone One of the coolest UI elements of tvOS and the new Apple TV are the parallax images. They're used for a lot of things on the interface: app icons, fea...
Just a Little 1. So just a quick one for the Moos-pen today. I my lastpost I did create a Test DAD , but then I discovered the problem with doing this is you may...
CUBRID Migration Toolkit v2013.05 has been release... We are glad to announce the release of a new version 2013.05 of CUBRID Migration Toolkit which allows to migrate or extract existing data...