Hadoop Weekly Issue #214


30 April 2017

While cloud vendors have been adding impressive features to their SaaS SQL-on-big-data engines, other vendors are trying to demonstrate the benefits of their non-hosted products. A couple of weeks ago, Hortonworks was blogging about Hive LLAP, and this week Cloudera is touting Impala. Regardless of which camp you're backing, the competition is good for end users.


Amazon Redshift has added new mechanisms to optimize a cluster by performing certain tasks when a rule (written in their new rules engine) is violated. For example, they show how to terminate a long-running query that is generating too many rows, or one that is using too much CPU and not completing after a certain amount of time. The system also has the ability to log details or "hop" a query to a different queue.
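Redshift's query monitoring rules are expressed as JSON inside a workload management (WLM) queue configuration. A minimal sketch of one such rule is below; the rule name and thresholds are illustrative, so check the Redshift documentation for the full list of supported metrics and actions.

```python
import json

# Sketch of a Redshift query monitoring rule as it would appear in a
# WLM queue's "rules" array. Values here are illustrative only.
rule = {
    "rule_name": "abort_long_wide_queries",
    "predicate": [
        # Query has been running for more than 120 seconds...
        {"metric_name": "query_execution_time", "operator": ">", "value": 120},
        # ...and has already returned more than a billion rows.
        {"metric_name": "return_row_count", "operator": ">", "value": 1000000000},
    ],
    "action": "abort",  # alternatives: "log" details, or "hop" to another queue
}

print(json.dumps(rule, indent=2))
```

All predicates in a rule must match before the action fires, which is why the example pairs a runtime threshold with a row-count threshold.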


Apache Ambari, which is used for managing Hadoop clusters, has a new feature for interacting with Apache Hive in version 2.5. Dubbed "Hive View 2.0," Ambari can now perform most operations against the Hive metastore, including viewing and computing table/column statistics (which are used by the optimizer). Additionally, the new release includes a visualizer for the Hive optimizer (to visualize EXPLAIN output).


With the caveat that it's important to take a critical eye to vendor benchmarks (and to try with your own data), Cloudera has posted impressive benchmark numbers for Impala. They've compared it to Greenplum and several Hadoop-ecosystem engines (including Spark SQL, Hive with LLAP, and Presto) using data from the TPC-DS kit in both multi-user and single-user scenarios.


The Hortonworks blog has an overview of Livy, which is a server that exposes HTTP APIs for interacting with a Spark cluster. The post looks at both the programmatic API, and the RESTful endpoints that can be used to run both a batch job and an interactive session. The post also describes Livy's security and high availability features.
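To give a flavor of what those RESTful endpoints look like, here is a sketch of the JSON payloads involved; the Livy host and file paths are hypothetical, and the actual HTTP calls are shown only as comments so the snippet stands alone.

```python
import json

LIVY_URL = "http://livy-server:8998"  # hypothetical Livy host

# Payload for a batch job: POST {LIVY_URL}/batches
batch_payload = {
    "file": "hdfs:///jobs/wordcount.py",  # application to run (path is illustrative)
    "args": ["hdfs:///data/input"],
}

# Payload to open an interactive session: POST {LIVY_URL}/sessions
session_payload = {"kind": "pyspark"}

# A statement to run inside that session:
# POST {LIVY_URL}/sessions/{session_id}/statements
statement_payload = {"code": "sc.parallelize(range(100)).sum()"}

# With the `requests` library, submitting the batch would look roughly like:
#   requests.post(LIVY_URL + "/batches", data=json.dumps(batch_payload),
#                 headers={"Content-Type": "application/json"})
print(json.dumps(batch_payload))
```

The same session endpoint backs both interactive notebooks and programmatic clients, which is what makes Livy useful as a shared gateway in front of a cluster.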


The Databricks blog has a tutorial that describes how to use Spark's structured streaming with Apache Kafka as the source (and sink) of data. The post includes a real-world example of processing JSON event data from a Nest device.
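In the post, events land on a Kafka topic and are decoded with Spark's from_json() against a schema; the sketch below shows the shape of that pipeline in comments and the parsing step itself in plain Python, since the Nest event's field names here are hypothetical.

```python
import json

# A Nest-style JSON event (field names are hypothetical; the Databricks
# post defines its own schema).
event = '{"device_id": "nest-42", "temp_c": 21.5, "ts": "2017-04-30T12:00:00Z"}'

# In Spark, the stream would be wired up roughly as:
#   spark.readStream.format("kafka")
#        .option("kafka.bootstrap.servers", "host:9092")
#        .option("subscribe", "nest-events").load()
#        .select(from_json(col("value").cast("string"), schema).alias("e"))
# The parsing step itself reduces to ordinary JSON decoding:
parsed = json.loads(event)
print(parsed["device_id"], parsed["temp_c"])
```

The same parsed columns can then feed windowed aggregations or be written back to Kafka as the sink.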


Cloudera has two posts on their new Cloudera Data Science Workbench. In the first, they describe how to write a PySpark job that uses a python library (python packaging is different from the JVM's JAR files, which is often a pain point in these types of applications). In the second, they show how to use BigDL (a deep learning library for Apache Spark) with the workbench.
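The usual workaround for that packaging pain point is to bundle pure-Python dependencies into a zip and ship it to the executors. A minimal sketch (the package name is hypothetical):

```python
import os
import tempfile
import zipfile

# Build a tiny package and zip it the way it would be shipped to a
# PySpark cluster via `spark-submit --py-files deps.zip` (or, at
# runtime, sc.addPyFile("deps.zip")).
workdir = tempfile.mkdtemp()
pkg_dir = os.path.join(workdir, "mylib")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("def greet():\n    return 'hello from the executors'\n")

deps_zip = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(deps_zip, "w") as zf:
    zf.write(os.path.join(pkg_dir, "__init__.py"), "mylib/__init__.py")

# Then: spark-submit --py-files deps.zip my_job.py
# Inside the job, executors can simply `import mylib`.
with zipfile.ZipFile(deps_zip) as zf:
    print(zf.namelist())
```

This approach only works for pure-Python packages; libraries with compiled extensions need to be installed on every node (or shipped via a mechanism like a conda environment), which is part of why Cloudera's workbench handles this for you.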


The LinkedIn blog has a post about how they've transformed testing for their Content Analytics system. By building out a number of mock functions and libraries, they're able to write unit tests for the Kafka/Samza jobs that execute in a few minutes. Previously, testing was UI driven and took days or weeks to complete.
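The core idea of that approach can be sketched in a few lines: replace the messaging system with a mock so processing logic runs as an ordinary unit test. All names below are hypothetical, not LinkedIn's actual code.

```python
from unittest.mock import Mock

def process_event(event, producer):
    """Enrich a page-view event and forward it to an output topic."""
    enriched = dict(event, source="content-analytics")
    producer.send("enriched-views", enriched)
    return enriched

# In a test, the producer is a Mock, so no Kafka cluster is required
# and the whole suite runs in seconds rather than against live infra.
producer = Mock()
result = process_event({"page": "/home"}, producer)
producer.send.assert_called_once_with(
    "enriched-views", {"page": "/home", "source": "content-analytics"}
)
print(result)
```

Mocks also let tests assert on *what* was emitted, not just that the job ran, which is what replaces the slow UI-driven verification.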


The StreamSets Data Collector uses the same expression language (EL) that JSP supports, and this post describes how easy it is to write a custom function for it.



Apache Metron, which is a security-focused analytics system built on the Apache Hadoop ecosystem, has graduated from the Apache incubator.


Videos of presentations for Flink Forward San Francisco, which took place earlier this month, have been posted.


It's been just over a year since Apache Apex became a top-level project. The post celebrates community growth (in contributors, pull requests, and other metrics), adoption (including a few examples), the major features released over the past year, and the project's roadmap going forward.


Cloudera's IPO was Friday. Shares priced at $15 and jumped 20% on the first day, bringing Cloudera's market cap to $2.3 billion.



Version 0.11.0-incubating of Apache PredictionIO was released. PredictionIO is a framework for building predictive services built on Apache Spark, Apache HBase, Spray, and Elasticsearch.


Version 0.3.0 of Scio, the Scala API for Apache Beam, was released. This is the first non-beta release built on the Apache Beam SDK rather than the Google Cloud Dataflow SDK.


A new bug-fix release of Apache Kafka was announced, resolving a number of bugs in the previous release.


Apache Gearpump (incubating) has released version 0.8.3-incubating. Gearpump is a streaming engine built on the Actor model.



Curated by Datadog ( http://www.datadog.com )



Interacting with Spark: Beyond the Basics (San Diego) - Wednesday, May 3


Hands-On Workshop in Real-Time Analytics, Stream Processing with Twitter Heron (Santa Clara) - Saturday, May 6



Near Real-Time Ingest with StreamSets Data Collector (St. Louis) - Wednesday, May 3



Meetup #9 (Vilnius) - Wednesday, May 3



Scaling Data Pipelines + Master BigQuery and Redshift (Athens) - Wednesday, May 3



An Informal Session with Cloudera and Doug Cutting, the Father of Hadoop (Auckland) - Tuesday, May 2


If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit https://hadoopweekly.com

Hadoop Weekly


