Alluxio Provides Distributed Storage at In-Memory Speeds

Storage Architecture, 2017-05-10

Alluxio has become one of the fastest-growing open source big data projects, with more than 500 contributors after four years, CEO Haoyuan Li told an audience at the Vault big data conference in March.

Traditionally, the big data ecosystem has been MapReduce for compute and HDFS for storage, but now there are many different choices for storage, all with different properties. As a result, enterprises end up with data siloed across all the different storage systems, and those silos become hard to manage.

Many of these storage systems were not built for big data workloads, so performance becomes a big issue as well, he said. Alluxio aims to unify data from all the different storage systems, present a single global namespace to the application layer above, and at the same time enable operators to access the data quickly.

“We put Alluxio between the compute and storage systems. We unify the data from different storage and present this global namespace to the upper-level applications to enable them to interact with the data at memory speed,” he said.
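The idea of a global namespace over several storage systems can be illustrated with a toy mount table. This is a minimal sketch, not Alluxio's API: the `MountTable` class, its methods, and the bucket and host names are all hypothetical, chosen only to show how one logical path can be resolved to whichever backing store holds the data.

```python
from dataclasses import dataclass, field


@dataclass
class MountTable:
    """Toy mount table: maps prefixes of one logical namespace to backing-store URIs."""

    mounts: dict = field(default_factory=dict)

    def mount(self, logical_prefix: str, backing_uri: str) -> None:
        # Normalize so "/logs/" and "/logs" name the same mount point.
        self.mounts[logical_prefix.rstrip("/") or "/"] = backing_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Pick the longest mounted prefix that covers the path.
        best = None
        for prefix in self.mounts:
            covers = (
                prefix == "/"
                or logical_path == prefix
                or logical_path.startswith(prefix + "/")
            )
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {logical_path}")
        remainder = logical_path if best == "/" else logical_path[len(best):]
        return self.mounts[best] + remainder


# Hypothetical backing stores; applications only ever see the logical paths.
table = MountTable()
table.mount("/", "hdfs://namenode:8020/data")
table.mount("/logs", "s3://example-bucket/logs")
print(table.resolve("/logs/2017/05/app.log"))
print(table.resolve("/tables/users.parquet"))
```

An application written against the logical paths does not need to change when a directory moves from one storage system to another; only the mount entry does.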

In essence, it’s a programmable interface — distributed node-based memory — between compute frameworks like Spark and MapReduce and the underlying storage systems. It then uses a tiered storage architecture that caches the most often-used data in memory, with less-often-used data on SSDs and traditional hard drives.

As Jowanza Joseph, senior software engineer at One Click Retail, put it:

“Ideally, we’d have some way of specifying which data we’d want to keep on the cache, when to release it and to be able to plan around that. Alluxio is exactly this, with a sophisticated API and support for many data stores out of the box.”
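The tiered caching described above, with hot data in memory and colder data on SSDs and hard drives, can be sketched in a few lines. This is an illustration under stated assumptions, not Alluxio's implementation: the `TieredCache` class is hypothetical, capacities are counted in blocks rather than bytes, and least-recently-used demotion is assumed as the policy.

```python
from collections import OrderedDict


class TieredCache:
    """Toy tiered store: hot blocks stay in the first tier, colder ones are pushed down."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level == len(self.tiers):
            return  # fell off the slowest tier: dropped from the cache entirely
        _, cap, blocks = self.tiers[level]
        blocks[key] = value
        blocks.move_to_end(key)
        if len(blocks) > cap:
            # Demote the least recently used block to the next, slower tier.
            old_key, old_value = blocks.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def _remove(self, key):
        for _, _, blocks in self.tiers:
            if key in blocks:
                return blocks.pop(key)
        return None

    def put(self, key, value):
        self._remove(key)
        self._insert(0, key, value)

    def get(self, key):
        # A hit in any tier promotes the block back to the fastest tier.
        value = self._remove(key)
        if value is not None:
            self._insert(0, key, value)
        return value

    def tier_of(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None


cache = TieredCache([("MEM", 2), ("SSD", 2), ("HDD", 2)])
for name in ("a", "b", "c"):
    cache.put(name, name.upper().encode())
print(cache.tier_of("a"))  # "a" was the oldest block, so it was demoted
cache.get("a")
print(cache.tier_of("a"))  # reading "a" promoted it back to memory
```

A block that is pushed out of the slowest tier is simply dropped here; in a real deployment the data would still be available from the underlying storage system.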

Alluxio supports a range of storage systems, including Amazon S3, Google Cloud Storage, Gluster, Ceph, HDFS, NFS, and OpenStack Swift.

The cache functionality helped Barclays to reduce its workflow iteration time from hours to seconds.

“Even though Spark provides a cache functionality, every time we restart the context, update the dependency jars or re-submit the job, the loaded data is dropped from the memory and the only way to restore it is to reload it from the central warehouse,” Barclays states in a white paper.

Barclays can use Alluxio as storage for data in formats including Parquet, Avro and Kryo, together with compression algorithms (such as Snappy or LZO) to reduce memory usage.

At Vault, Li highlighted how Chinese travel site Qunar uses Alluxio to manage data across disparate storage systems, and how at Chinese search firm Baidu, batch queries that previously took 15 minutes now take less than 30 seconds.

Baidu manages an Alluxio cluster that scales to 1,000 nodes and more than 2PB of storage, including 50TB of memory storage, with the balance on disk.

From Tachyon to Alluxio

Alluxio began as a project at the University of California, Berkeley’s AMPLab around 2012, originally called Tachyon. It was open sourced in 2013 and renamed in 2016. Version 1.5 is due out this quarter, Li said.

Alluxio has announced partnerships with storage vendors EMC and Huawei, and Mesosphere included one-click integration with Alluxio in its updated DC/OS platform release in March.

The commercial enterprise edition was unveiled in January; a free community edition can be downloaded from the Alluxio website.

Alternative technologies include Apache Ignite, Apache Geode and Spark used with Redis.

One of Alluxio’s differentiators, according to an Evaluator Group report, is that it provides fault tolerance by re-computing lost data from logged information rather than by creating three distributed copies at ingest, as distributed file systems typically do. That improves performance and means it can rebuild data sets as of a point in time before a failure.
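One way to picture recovery by re-computation is a store that records how each data set was derived and replays those steps when a copy is lost. The sketch below is illustrative only: the `LineageStore` class and its methods are hypothetical, and it assumes the original source data can always be re-read from durable storage.

```python
class LineageStore:
    """Toy in-memory store that remembers how each data set was derived."""

    def __init__(self):
        self.data = {}     # data set name -> materialized value (may be lost)
        self.lineage = {}  # data set name -> (function, names of input data sets)

    def put_source(self, name, loader):
        # A source has no inputs; its loader re-reads durable storage (assumed available).
        self.lineage[name] = (loader, ())
        self.data[name] = loader()

    def derive(self, name, fn, *inputs):
        # Record the recipe alongside the result instead of replicating the result.
        self.lineage[name] = (fn, inputs)
        self.data[name] = fn(*(self.get(i) for i in inputs))

    def lose(self, name):
        # Simulate a node failure dropping the in-memory copy.
        self.data.pop(name, None)

    def get(self, name):
        if name not in self.data:
            fn, inputs = self.lineage[name]
            # Recompute from the inputs, which are themselves recovered recursively.
            self.data[name] = fn(*(self.get(i) for i in inputs))
        return self.data[name]


store = LineageStore()
store.put_source("raw", lambda: [3, 1, 2])
store.derive("sorted", sorted, "raw")
store.derive("largest", lambda xs: xs[-1], "sorted")
store.lose("sorted")
store.lose("largest")
print(store.get("largest"))  # rebuilt by replaying sorted() and then the lambda
```

The trade-off is that a write needs only one copy, at the cost of extra compute when a failure forces a replay.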

Gartner compared it to other Hadoop operations providers, including Attunity, BlueData Software, DriveScale, GridGain Systems and Pepperdata. It’s a market struggling with skills gaps and technology immaturity, the analyst firm noted.

Feature Image: “Corral at sunset” by Loren Kerns, licensed under CC BY-SA 2.0.

The New Stack
