Spark on Fire! Integrating Pentaho and Spark

June 30, 2014

One of Pentaho’s great passions is empowering organizations to take advantage of amazing innovations in Big Data to solve new challenges using the skill sets they already have in-house today.  Our Pentaho Labs’ innovations around natively integrating data engineering and analytics with Big Data platforms like Hadoop and Storm have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include optimizing data warehousing architectures, leveraging Hadoop as a cost-effective data refinery, and performing advanced analytics on diverse data sources to achieve a broader 360-degree view of customers.

Not since the early days of Hadoop have we seen so much excitement around a new Big Data technology as we see right now with Apache Spark.  Spark is a Hadoop-compatible computing system that makes big data analysis drastically faster, through in-memory computation, and simpler to write, through easy APIs in Java, Scala and Python.  With the second annual Spark Summit taking place this week in San Francisco, I wanted to share some of the early work Pentaho Labs is collaborating on with our partners over at Databricks to deeply integrate Pentaho and Spark for delivering high-performance Big Data Analytics solutions.
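
For readers who have not yet tried Spark, here is a minimal sketch of the kind of program described above, written against the standard Spark Scala API. The application name and input path are placeholders for illustration only.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object WordCount {
      def main(args: Array[String]): Unit = {
        // "local[*]" runs Spark on all local cores; point setMaster at a cluster to scale out.
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // cache() keeps the RDD in memory -- the in-memory computation referred to above.
        val lines  = sc.textFile("hdfs:///data/events.log").cache()
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }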

Big Data Integration on Spark

At the core of Pentaho Data Integration (PDI) is a portable ‘data machine’ for ETL, which today can be deployed as a stand-alone Pentaho cluster or inside your Hadoop cluster through MapReduce and YARN.  The Pentaho Labs team is now taking this same concept and working on the ability to deploy inside Spark for even faster Big Data ETL processing.  The benefit for ETL designers is the ability to design, test and tune ETL jobs in PDI’s easy-to-use graphical design environment, and then run them at scale on Spark.  This dramatically lowers the skill sets required, increases productivity, and reduces maintenance costs when taking advantage of Spark for Big Data Integration.

Advanced Analytics on Spark

Last year Pentaho Labs introduced a distributed version of Weka, Pentaho’s machine learning and data mining platform. The goal was to develop a platform-independent approach to using Weka with very large data sets by taking advantage of distributed environments like Hadoop and Spark. Our first implementation proved out this architecture by enabling parallel, in-cluster model training with Hadoop.

We are now working on a similar level of integration with Spark, including data profiling and the evaluation of classification and regression algorithms.  The early feedback from Pentaho Labs confirms that developing solutions on Spark is faster and easier than with MapReduce. In just a couple of weeks of development, we have demonstrated the ability to perform in-cluster Canopy clustering and are very close to having k-means++ working in Spark as well!
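
The Weka-on-Spark work is still in the lab, so there is no code to share for it yet. Purely as a rough stand-in, the sketch below shows what in-cluster k-means training on Spark looks like today using Spark’s own MLlib API; note that this is MLlib’s implementation, not the Weka integration, and the input path and parameters are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object ClusteringSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ClusteringSketch"))

        // Parse whitespace-separated numeric rows into MLlib vectors; the path is a placeholder.
        val points = sc.textFile("hdfs:///data/features.txt")
          .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
          .cache()

        // Train k-means entirely in the cluster (k = 5, 20 iterations). MLlib initializes
        // centers with k-means||, a parallel variant of k-means++.
        val model = KMeans.train(points, 5, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }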

Next up: Exploring Data Science Pack Integration with MLlib

MLlib is already one of the most popular technologies for performing advanced analytics on Big Data.  By integrating Pentaho Data Integration with Spark and MLlib, Data Scientists will benefit from having an easy-to-use environment (PDI) in which to prepare data for use in MLlib-based solutions.  Furthermore, this integration will make it easier for IT to operationalize the work of the Data Science team by orchestrating the entire end-to-end flow, from data acquisition, to data preparation, to execution of MLlib-based jobs, to sharing the results, all in one simple PDI Job flow.  To get a sense of how this integration might work, I encourage you to look at a similar integration with R we recently launched as part of the Data Science Pack for Pentaho Business Analytics 5.1.
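
To make the idea of orchestrating an MLlib-based job a bit more concrete, the sketch below shows the sort of Spark program such a PDI job flow would launch: a small amount of data preparation followed by a single MLlib training call. The input path, column layout, and model choice are placeholders for illustration, not a description of the actual integration.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    object ScoringJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ScoringJob"))

        // Data preparation: drop malformed rows and turn CSV lines (label, features...)
        // into labeled points. In the integration described above, this shaping is the
        // part PDI would handle.
        val training = sc.textFile("hdfs:///prepared/training.csv")
          .map(_.split(","))
          .filter(cols => cols.length > 1 && cols.forall(_.nonEmpty))
          .map(cols => LabeledPoint(cols.head.toDouble, Vectors.dense(cols.tail.map(_.toDouble))))
          .cache()

        // Model training: one MLlib call; in the end-to-end flow this is the step a PDI
        // job would execute and whose results it would pass downstream for sharing.
        val model = LogisticRegressionWithSGD.train(training, 100)
        println(s"Learned weights: ${model.weights}")

        sc.stop()
      }
    }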

Experiment Today with Pentaho and Spark!

You can experiment with Pentaho and Spark today for both ETL and Reporting.  In conjunction with our partners at Databricks, we recently certified the following use cases combining Pentaho and Spark:

  • Reading data from Spark as part of an ETL workflow by using Pentaho Data Integration’s Table Input step with Apache Shark (a Hive SQL layer that runs on Spark); see the sketch after this list
  • Reporting on Spark data using Pentaho Reporting against Apache Shark
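
In PDI, the Table Input step reads over a standard JDBC connection, so the sketch below shows that style of access directly, assuming a SharkServer that exposes Hive’s JDBC interface. The host, port, database, table, and query are placeholders for illustration only.

    import java.sql.DriverManager

    object SharkQuery {
      def main(args: Array[String]): Unit = {
        // SharkServer speaks the HiveServer protocol, so the standard Hive JDBC driver applies.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive://shark-host:10000/default", "", "")
        try {
          val rs = conn.createStatement().executeQuery(
            "SELECT region, SUM(amount) FROM sales GROUP BY region")
          while (rs.next()) {
            println(s"${rs.getString(1)}\t${rs.getDouble(2)}")
          }
        } finally {
          conn.close()
        }
      }
    }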

We are excited about this first step in what we both hope to be a collaborative journey towards deeper integration.

Jake Cornelius
Sr. Vice President, Product Management
Pentaho

Announcing Pentaho with Storm and YARN

February 11, 2014

One of Pentaho’s core beliefs is that you can’t prepare for tomorrow with yesterday’s tools. In June of 2013, amidst waves of emerging big data technologies, Pentaho established Pentaho Labs to drive innovation through the incubation of these new technologies. Today, one of our Labs projects hatches.  At the Strata Conference in Santa Clara, we announced native integration of Pentaho Data Integration (PDI) with Storm and YARN. This integration enables developers to process big data and drive analytics in real-time, so businesses can make critical decisions on time-sensitive information.

Read the announcement here.

Here is what people are saying about Pentaho with Storm and YARN:

Pentaho Customer
Bryan Stone, Cloud Platform Lead, Synapse Wireless: “As an M2M leader in the Internet of Everything, our wireless solutions require innovative technology to bring big data insights to business users. The powerful combination of Pentaho Data Integration, Storm and YARN will allow my team to immediately leverage real-time processing, without the delay of batch processing or the overhead of designing additional transformations. No doubt this advancement will have a big impact on the next generation of big data analytics.”

Leading Big Data Industry Analyst
Matt Aslett, Research Director, Data Management and Analytics, 451 Research: “YARN is enabling Hadoop to be used as a flexible multi-purpose data processing and analytics platform. We are seeing growing interest in Hadoop not just as a platform for batch-based MapReduce but also rapid data ingestion and analysis, especially using Apache Storm. Native support of YARN and Storm from companies like Pentaho will encourage users to innovate and drive greater value from Hadoop.”

Pentaho founder and Pentaho Labs Leader
Richard Daley, Founder and Chief Strategy Officer, Pentaho: “Our customers are facing fast technology iterations from the relentless evolution of the big data ecosystem. With Pentaho’s Adaptive Big Data Layer and Big Data Analytical Platform our customers are “future proofed” from the rapid pace of evolution in the big data environment. In 2014, we’re leading the way in big data analytics with Storm, YARN, Spark and predictive, and making it easy for customers to leverage these innovations.”

Learn more about the innovation of Pentaho Data Integration for Storm on YARN in Pentaho Labs at pentaho.com/storm.

If you are at the O’Reilly Strata Conference in Santa Clara this week (February 11-13), make sure to stop by Booth 710 to see a live demo of Pentaho Data Integration with Storm and YARN. The Pentaho team of technologists, data scientists and executives will be on hand to share the latest big data innovations from Pentaho Labs.

Donna Prlich
Senior Director, Product Marketing
Pentaho


Bring your Big Data to Life With Pentaho at Strata Santa Clara

January 30, 2014


If you are like most Enterprise IT decision makers, there’s a 50/50 chance you are already knee-deep in Big Data or on a path to figuring out how to get started. One of the “must attend” conferences for anyone involved in Big Data is the O’Reilly Strata Conference (Santa Clara, February 11-13, 2014).

Join Us!

Pentaho is excited to return as a sponsor this year and we have a number of ways you can learn more about getting the most out of your Big Data initiatives.

The Pentaho team of executives, technologists and data scientists will be on hand to share the latest big data innovations from Pentaho Labs, such as integration with Apache Hadoop YARN and Storm. Come get answers to all of your big data integration and analytics questions. Let us help you bring your Big Data to life!

Below is a list of all activities for Pentaho in and around the conference. Register with code Pentaho20 and receive 20% off registration.

Exhibit booth

You will find the Pentaho team in the Sponsor Pavilion at Booth 710 (located near the O’Reilly Media booth). Learn all about how Pentaho can help bring your Big Data to life! Don’t forget to get your Pentaho t-shirt and enter for the chance to win a GoPro camera.

Meetups

Big Data Science Meet-up at Strata Conference

  • Monday, 2/10 at 5:30-9:30 in Ballroom E
  • Nick Gonzalez, Data Scientist at Pentaho, will speak about Real World Big Data Prescriptive Analytics
  • Today’s large and convoluted data landscape, coupled with the abundance of available computing resources, presents unique opportunities for data scientists around the world. To remain competitive in this landscape, we must go beyond generating predictions to generating solutions from big data that are driven by actions derived from data-driven predictions. And we have to do this as fast as possible.  This is the real world of big data prescriptive analytics. This talk will address each of these challenges and present technical solutions and algorithms for them.  By the end of this presentation, each individual solution will come together in a symphony of code and hardware to form a unified, automated process that is the backbone of a successful big data prescriptive analytics solution.

Breakout Sessions

Getting There from Here: Moving Data Science into the Boardroom

  • Rosanne Saccone (Pentaho), Scott Chastain (SAS), Chris Selland (HP Vertica)
  • Tuesday, 2/11 at 11:15 on the Data Driven Business Track, Ballroom CD
  • Pundits and analysts agree: the data-driven enterprise is here to stay. But how will companies balance analysis with action? Will optimization of the current model leave firms more vulnerable than ever to disruption by what’s new and unpredictable? And how do we balance legacy investments in data warehousing and business intelligence with emerging technologies for massive, real-time data processing? Join Scott Chastain, Rosanne Saccone, Chris Selland, and Strata Chair Alistair Croll for a look at the practical concerns facing tomorrow’s data-driven business.

Lessons from the Trenches: edo Interactive Leverages Hadoop to Build Customer Loyalty

  • Thursday, 2/13 at 11:30am, Ballroom G
  • Tim Garnto (edo) & Rob Rosen (Pentaho)
  • Hadoop is an enabling technology for better understanding customer preferences and behaviors, but organizations often struggle with time-consuming data preparation and analytics processes. edo Interactive – a leader in providing card-linked offers to financial services firms and retailers – shares how they drive agile, improved decision-making by complementing native Hadoop technologies with analytical databases, ETL optimization, and data visualization solutions from vendors such as Pentaho.

We hope to see you soon at Strata in Santa Clara. If you would prefer a private meeting with Pentaho at the conference, send us a message via our contacts page or direct message us on Twitter @Pentaho.

