Deduplicating Large Data With Rails

存储架构 2018-03-13

Reading Time:

This article was originally published by Cheyne Wallace
on his blog
, and with his permission, we are sharing it here for Codeship readers.

Storing and retrieving large chunks of data from your database can be tricky if it’s not done correctly. What happens when you want to store a relatively large document or body of text in your database but that large chunk of text is likely to be identical for thousands of new records? You’re faced with the possibility of storing gigabytes of duplicated data that really doesn’t need to be there.

Here’s the scenario: say you have a piece of software that generates a log file on installation with some key information you would like to track and archive. You push this log file back to your server via an API endpoint and store it in the database. The log file is roughly 100KB in file size; for every 100,000 installs, that’s coming close to 10GB in storage.

The problem is 80 percent of these log files are identical — they contain the same information but we treat and store them as though they were unique, wasting disk space and blowing out our table size and tuple count with an unnecessary number of rows.

We all know disk space is cheap, but processing power isn’t. Large tables cause large indexes, which in turn cause slower queries, require more CPU cycles, and mean you need to be more careful about writing your future SQL queries. If you query on a non-indexed field, you can trigger a full table scan, and your app takes a big performance hit in the process.

People love to say, “Storage is cheap! store everything!” But moving, loading and managing 100GB backup files is a lot more difficult and time-consuming than 20GB backup files, so it’s worth optimizing where possible.

So how do we optimize this? We’re going to create three tables, The user table, one for the log files, and one that acts as a join table between the users and logs. Instead of having a direct relation between the Log and the User, there will be a relation from User to the UserLog and from the UserLog to the Log.

Instead of just saving the log file when we receive it, we will instead hash the contents and perform a look-up on this hash in the database to see if we have already seen this exact block of data before. If we get a match on the hash, we will instead use the id of that record and throw away the data. This means that if 1000 people generate the exact same log output, we will still only store it once.

Let’s have a look at how to set this up:


class CreateLog < ActiveRecord::Migration
  def change
    create_table :log do |t|
      t.string :checksum
      t.text :data
    create_table :user_log do |t|
      t.integer :log_id
      t.integer :user_id
    create_table :users do |t|
      t.string :name
    add_index :logs, :checksum
    add_index :user_logs, :user_id
    add_index :user_logs, :log_id

Okay, so we have the base tables we need to deduplicate our data. Now let’s see how we would store the log files. (For the sake of a cleaner read, I’m going to omit all the usual boilerplate code you would find in a normal application, like validation and authentication and get to the point.)

!Sign up for a free Codeship Account

Assuming that we are receiving the logs from an API endpoint somewhere and that we just want to store them for later inspection and return a status code, here is a simple way we could implement it:


class LogController < ApplicationController
  def create
    log =[:log_data])
    user_log = UserLog.create(log_id:, user_id:
    render status: 201


class User < ActiveRecord::Base
  has_many :logs, through: :user_logs
  has_many :user_logs


class UserLog < ActiveRecord::Base
  belongs_to :user
  belongs_to :log


class Log  e
          raise ActiveRecord::Rollback

It’s as simple as that. The
method simply uses Digest::MD5.hexdigest
to calculate an MD5 checksum of the log data to which on the next line we then attempt to look up if the record exists. If it does not, we create a new log file and save it, being sure to wrap the save operation in a transaction just in case another log file with the same content is being stored at the same time.

The User model has a through: :user_logs
relation to the Log model, which means when you call User.logs
, it performs log look-up via the smaller join table, which just consists of ids in order to find relevant logs. This essentially resolves to a query that looks something like:

INNER JOIN user_logs ON = user_logs.log_id 
WHERE user_logs.user_id = 

As you can see from the following console output, trying to store the same data repeatedly will result in the same record being returned and not repeatedly stored.

2.3.1 :004 > log = "some log file text here"
2.3.1 :005 >
 => 4 
2.3.1 :007 >
 => 4 
2.3.1 :008 >
 => 4 
2.3.1 :009 >
 => 4 
2.3.1 :010 > new_log = "something else"
2.3.1 :011 >
 => 5


微信 WCDB for Android的接入 摘要: 微信 WCDB for Android的接入。腾讯开源微信数据库框架WCDB,他是一个高效、完整、易用的移动数据库框架,基于SQLCipher,支持iOS, macOS和Android。 接入与迁移
What Do We Mean by Database Scalability? What Do We Mean by Database Scalability? DZone's Guide to What Do We Mean by Database Scalability? This lesson in database performa...
ivre使用:安装配置 简介 IVRE(又名DRUNK)是一款开源的网络侦查框架工具,IVRE使用Nmap、Zmap进行主动网络探测、使用Bro、P0f等进行网络流量被动分析,探测结果存入数据库中,方便数据的查询、分类汇总统计。 IVRE官方网站: GitHub: ht...
Scaling ARA to a million Ansible playbooks a month The OpenStack community runs over 300 000 CI jobs with Ansible every month with the help of the awesome Zuul . As a core contributor of ...
数据库ROOM-Google新推 今年,谷歌新推了不少东西,在看其新架构 Architecture Components (AC架构是一个类似MVVM的架构,7.0之前,Google公布过不少MVP架构的官方Demo,看来现在其也是与时俱进了)的相关文章时,注意到一个新的东西_ ROOM 数据库,这是一个SQLITE类型数...