25 September 2016
Lots of releases this week: CouchDB, Accumulo, Kylin, and Osso (a new OSS project from Rocana), but most notably Apache Kudu hit version 1.0. There’s a bit less technical content and general news than usual, but that’s to be expected. With Strata + Hadoop World taking place this week in NYC, get ready for tons of news in the next issue.
The Cloudera blog has a post on the recently released Apache Hadoop 3.0.0-alpha1. It describes several of the features of the release, including HDFS erasure coding, v.2 of the YARN Timeline Service, and the shell script rewrite.
MapR has posted a whiteboard walkthrough on how Apache Flink handles event time for stream processing. In addition to the video, there’s a transcript of the presentation.
This post is a great walkthrough of Apache Drill. It covers a bunch of topics, including: quoting reserved keywords, interpreting and fixing JSON parse errors, use of subqueries, conveniences for querying CSV, a basic overview of Drill’s web interface, plugin configuration, querying an RDBMS, and analyzing a query plan.
Cloudera has published a post comparing Apache Impala and Amazon Redshift. There’s an overview of key differences, but the main focus is a performance and cost comparison. As always, these results shouldn’t be viewed as necessarily representative (each dataset is different). With that said, using a TPC-DS derived workload, they show that Impala can often beat Redshift in cost and performance.
The StreamSets blog has a post arguing that Apache Kudu’s support for efficient real-time access and atomic updates provides an alternative to the lambda architecture.
This post describes some of the challenges of moving a data science research project into a production data pipeline. The author argues that it’s important for developers and data scientists to work together to integrate quickly.
IBM Power systems are getting support for Apache Hadoop through an IBM partnership with Hortonworks.
dataArtisans have announced the dA Platform, which is a distribution of Apache Flink with enterprise support.
Oracle and Qubole announced a partnership to bring the Qubole big data as a service offering to the Oracle Cloud Platform.
Omid is a transaction manager for Apache HBase that was recently accepted into the Apache Incubator after a proposal from Yahoo. It both provides snapshot isolation guarantees and can be used in high performance environments (supporting over 100k transactions/second).
Rocana has open sourced Osso, which is a new semi-structured event format. Built on Avro, the standard is meant to be easy, intuitive, efficient, and complementary to existing solutions.
The Google Cloud Platform blog has highlighted three integrations related to Kafka. The Google Cloud Pub/Sub connectors offer a mechanism for moving data between Cloud Pub/Sub and Kafka, the KafkaIO connector for Apache Beam allows Beam pipelines to consume from Kafka, and the Kafka to BigQuery connector can be used to mirror data to BigQuery.
Version 2.0 of Apache CouchDB was released this week. Highlights of the release include new clustering, a new querying language, and a rewritten admin interface.
Apache Kudu announced version 1.0 this week. The release includes support for HA Kudu Master, a rewritten Apache Spark integration, an official client library for Python, and more. To mark the occasion, the Cloudera blog has an overview of the history of the project and a look at its future.
Apache Accumulo 1.6.6 includes a data loss fix, a fix for DataNode decommission, dependency upgrades, and more.
Amazon EMR now supports security configurations to enable encryption for data at rest and in transit. The post has an example of configuring the encryption providers.
Version 1.5.4 of Apache Kylin, the OLAP engine for Hadoop, was released.
Amazon Web Services has open-sourced the Amazon EMR-DynamoDB connector.
Curated by Datadog ( http://www.datadog.com )
Apache Spark Meetup (San Francisco) – Tuesday, September 27
Azure 101: Hadoop on Cloud (Mountain View) – Wednesday, September 28
Scaling Recommenders + Content Embeddings at Facebook (Seattle) – Wednesday, September 28
Apache Nifi (Lafayette) – Monday, September 26
Hadoop Security and Governance with Apache Ranger and Apache Atlas (Manhattan) – Wednesday, September 28
Big Data & Data Science Workshop Using Apache Spark (Houston) – Monday, September 26
Diving Into Big Data Technologies: Hadoop, Hive, and Apache NiFi (Atlanta) – Thursday, September 29
District of Columbia
“Data Analytics with Hadoop” Book Release Celebration (Washington) – Monday, September 26
HBaseCon East 2016 (New York) – Monday, September 26
Intro to Apache Kudu: Fast Analytics on Fast Data (New York) – Tuesday, September 27
The Stream Processor as a Database (New York) – Wednesday, September 28
Let’s Get Started with Hadoop #9 (Oslo) – Thursday, September 29
Criteo Labs Tech Talks Session 3 (Paris) – Wednesday, September 28
Introduction to Apache Flink (Amsterdam) – Thursday, September 29
Data Engineering on AWS by Thorsten Greiner (Dusseldorf) – Thursday, September 29
Hands-On Introduction to Apache Spark & Apache Zeppelin (Gdansk) – Wednesday, September 28
Practical Distributed Stream Processing with Akka Streams (Tel Aviv-Yafo) – Tuesday, September 27
Discuss Key Emerging Big Data Technologies (Bangalore) – Thursday, September 29
Introduction to Hadoop, Yarn, HDFS (Students Only) – Friday, September 30