Beyond HDFS to store massive Hadoop data

Storage Architecture · 2017-12-31

The most popular file system in the Hadoop ecosystem is HDFS, the Hadoop Distributed File System. You may wonder whether storing data in HDFS is expensive in practice. Before I get there, I want to explain how HDFS works in the context of Hadoop.

Some common storage formats that Hadoop supports include:

  • Plain text storage (e.g., CSV and TSV files)
  • Sequence Files
  • Avro
  • Parquet

HDFS can store any type of data, whether text or binary, including image and audio files. HDFS was originally developed for, and is still primarily used by, MapReduce, so a file format that fits MapReduce or Hive workloads is usually chosen.
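To see why a columnar format such as Parquet suits analytical workloads better than plain text, here is a minimal, purely illustrative Python sketch (not a real file format) contrasting row-oriented and column-oriented layouts. Aggregating one field from a row layout touches every record, while a columnar layout reads only the column it needs:

```python
# Illustrative sketch only: contrasts a row-oriented (CSV-like) layout
# with a column-oriented (Parquet-like) layout. Not a real file format.

rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob",   "score": 75},
    {"id": 3, "name": "carol", "score": 88},
]

# Row-oriented: whole records stored one after another (like CSV lines).
row_store = [list(r.values()) for r in rows]

# Column-oriented: each column stored contiguously (like Parquet).
col_store = {k: [r[k] for r in rows] for k in rows[0]}

# Averaging one column from the row store scans every full record...
avg_from_rows = sum(rec[2] for rec in row_store) / len(row_store)

# ...while the column store reads just the one column it needs.
avg_from_cols = sum(col_store["score"]) / len(col_store["score"])
```

Both computations give the same answer; the difference is how much data each layout has to read, which is what makes columnar formats efficient for scans over a few columns of a wide table.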

One challenge a distributed file system must solve is achieving availability and scalability at the same time. A dataset may be too large to fit on a single machine's disk, so it is necessary to distribute the data among multiple machines. HDFS does this automatically and transparently while providing a user-friendly interface to developers. HDFS provides two main properties:

  • High scalability

  • High availability
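A hedged sketch of the mechanism behind both properties: HDFS splits a file into fixed-size blocks (128 MB by default in recent versions) and replicates each block on several DataNodes (3 by default). The round-robin placement and node names below are simplified assumptions for illustration, not the real NameNode placement policy:

```python
# Simplified illustration of HDFS-style block splitting and replication.
# The placement policy and node names are toy assumptions, not NameNode code.

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MB
REPLICATION = 3                  # HDFS default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes becomes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks: int, datanodes, replication: int = REPLICATION):
    """Round-robin placement sketch: each block lands on `replication` nodes."""
    return {
        b: [datanodes[(b + i) % len(datanodes)] for i in range(replication)]
        for b in range(num_blocks)
    }

# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Scalability comes from spreading blocks across many machines; availability comes from the fact that losing any single DataNode still leaves two replicas of every block it held.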

HDFS snapshot

A snapshot is a copy of the filesystem's data at some point in time. A snapshot can be taken of a subtree or of the entire filesystem. Snapshots are usually used to back up data for protection against failures or for disaster recovery. Snapshot data is read-only, because it would be meaningless if you could modify the snapshot after it is created.

HDFS snapshots were designed to copy data efficiently; their main characteristics are:

  • Creating a snapshot takes constant time, O(1) excluding inode lookup time, because it does not copy any actual data but only records a reference.

  • Additional memory is used only when the original data is modified. The size of additional memory is proportional to the number of modifications.

  • Modifications are recorded in reverse chronological order, so the current data can be accessed directly, and the snapshot data is computed by subtracting the modifications from the current data.
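The three points above can be sketched in Python. This is a toy copy-on-write model, not HDFS internals: creating a snapshot records nothing but an empty diff log (O(1)); each later modification appends one reverse-diff entry; and the snapshot view is reconstructed by undoing those diffs against the current data.

```python
# Toy copy-on-write snapshot model (illustrative only, not HDFS internals).
# A directory is modeled as a dict of path -> contents.

class SnapshottableDir:
    def __init__(self):
        self.files = {}     # current data, always accessed directly
        self.diffs = None   # reverse diffs recorded since the snapshot

    def create_snapshot(self):
        # O(1): no data is copied; we only start recording modifications.
        self.diffs = []

    def write(self, path, contents):
        if self.diffs is not None:
            # Record the previous state so the snapshot can be recomputed.
            self.diffs.append((path, self.files.get(path)))
        self.files[path] = contents

    def snapshot_view(self):
        # Subtract the modifications from the current data, applying the
        # recorded diffs newest-to-oldest so the oldest value wins.
        view = dict(self.files)
        for path, old in reversed(self.diffs):
            if old is None:
                view.pop(path, None)   # file did not exist at snapshot time
            else:
                view[path] = old
        return view

d = SnapshottableDir()
d.write("/a.txt", "v1")
d.create_snapshot()
d.write("/a.txt", "v2")     # one diff entry, memory grows per modification
d.write("/b.txt", "new")
# d.files shows the current state; d.snapshot_view() shows snapshot time
```

Note how memory grows only with the number of modifications, never with the size of the snapshotted data, which is the second bullet point above.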

In practice, building and operating your own HDFS cluster in the Hadoop ecosystem is expensive, so you can consider cloud storage instead:

Amazon EMR: Amazon Elastic MapReduce is a cloud service for Hadoop. It provides an easy way to create Hadoop clusters on EC2 instances and to access HDFS or S3. You can also run major distributions such as Hortonworks Data Platform and MapR on Amazon EMR.

The launch process is automated and simplified by Amazon EMR, and HDFS can be used to store intermediate data generated while a job runs on an EMR cluster. Only the input and the final output are put on S3, which is the best practice for EMR storage.

Treasure Data Service: Treasure Data is a fully managed cloud data platform. You can easily import any type of data into a storage system managed by Treasure Data, which uses HDFS and S3 internally but encapsulates their details, so you do not have to pay attention to the underlying storage systems.

Treasure Data mainly uses Hive and Presto as its analytics platform, so you can write SQL to analyze the data imported into Treasure Data storage. Treasure Data uses HDFS and S3 as its backend and takes advantage of each where it fits best. If you do not want to operate HDFS yourself, Treasure Data can be a good choice.

Azure Blob Storage: Azure Blob Storage is a cloud storage service provided by Microsoft. The combination of Azure Blob Storage and HDInsight provides a full-featured, HDFS-compatible storage system, so users accustomed to HDFS can adopt Azure Blob Storage seamlessly. Many Hadoop ecosystem components can operate directly on the data that Azure Blob Storage manages. Azure Blob Storage is optimized for use by a computation layer such as HDInsight, and it provides various interfaces, including PowerShell and, of course, the Hadoop HDFS commands. Developers who are already comfortable with Hadoop can get started easily with Azure Blob Storage.

Mainframe-Srini Blogs


