Spark on Fire! Integrating Pentaho and Spark

June 30, 2014

One of Pentaho’s great passions is to empower organizations to take advantage of amazing innovations in Big Data to solve new challenges using the skill sets they already have today.  Our Pentaho Labs’ innovations around natively integrating data engineering and analytics with Big Data platforms like Hadoop and Storm have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include optimizing data warehousing architectures, leveraging Hadoop as a cost-effective data refinery, and performing advanced analytics on diverse data sources to achieve a broader 360-degree view of customers.

Not since the early days of Hadoop have we seen so much excitement around a new Big Data technology as we see right now with Apache Spark.  Spark is a Hadoop-compatible computing system that makes big data analysis drastically faster, through in-memory computation, and simpler to write, through easy APIs in Java, Scala and Python.  With the second annual Spark Summit taking place this week in San Francisco, I wanted to share some of the early work that Pentaho Labs and our partners over at Databricks are collaborating on to deeply integrate Pentaho and Spark and deliver high-performance Big Data analytics solutions.

Big Data Integration on Spark

At the core of Pentaho Data Integration (PDI) is a portable ‘data machine’ for ETL which today can be deployed as a stand-alone Pentaho cluster or inside your Hadoop cluster through MapReduce and YARN.  The Pentaho Labs team is now taking this same concept and working on the ability to deploy inside Spark for even faster Big Data ETL processing.  The benefit for ETL designers is the ability to design, test and tune ETL jobs in PDI’s easy-to-use graphical design environment, and then run them at scale on Spark.  This dramatically lowers the skill sets required, increases productivity, and reduces maintenance costs when taking advantage of Spark for Big Data Integration.

Advanced Analytics on Spark

Last year Pentaho Labs introduced a distributed version of Weka, Pentaho’s machine learning and data mining platform. The goal was to develop a platform-independent approach to using Weka with very large data sets by taking advantage of distributed environments like Hadoop and Spark. Our first implementation proved out this architecture by enabling parallel, in-cluster model training with Hadoop.

We are now working on a similar level of integration with Spark that includes data profiling and evaluating classification and regression algorithms in Spark.  The early feedback from Pentaho Labs confirms that developing solutions on Spark is faster and easier than with MapReduce. In just a couple of weeks of development, we have demonstrated the ability to perform in-cluster Canopy clustering and are very close to having k-means++ working in Spark as well!
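For readers unfamiliar with Canopy clustering, here is a minimal single-machine sketch of the general technique in plain Python. It is not the distributed Weka or Spark implementation described above; the function name `canopy_clusters` and the threshold names `t1` (loose) and `t2` (tight) are illustrative assumptions.

```python
import math


def canopy_clusters(points, t1, t2):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose distance threshold and t2 the tight one (t1 > t2).
    Illustrative sketch only; names and structure are assumptions.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")

    remaining = list(points)
    canopies = []
    while remaining:
        # Take the first remaining point as the next canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Within the loose threshold: p belongs to this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: p may still seed or join
                # another canopy later.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

Because a point only leaves the candidate list when it falls inside the tight threshold, canopies may overlap. That overlap is what makes the method a cheap pre-clustering pass whose centers can then seed a more expensive algorithm such as k-means.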

Next up: Exploring Data Science Pack Integration with MLlib

MLlib is already one of the most popular technologies for performing advanced analytics on Big Data.  By integrating Pentaho Data Integration with Spark and MLlib, Data Scientists will benefit by having an easy-to-use environment (PDI) to prepare data for use in MLlib-based solutions.  Furthermore, this integration will make it easier for IT to operationalize the work of the Data Science team by orchestrating the entire end-to-end flow from data acquisition, to data preparation, to execution of MLlib-based jobs to sharing the results, all in one simple PDI Job flow.  To get a sense for how this integration might work, I encourage you to look at a similar integration with R we recently launched as part of the Data Science Pack for Pentaho Business Analytics 5.1.

Experiment Today with Pentaho and Spark!

You can experiment with Pentaho and Spark today for both ETL and Reporting.  In conjunction with our partners at Databricks, we recently certified the following use cases combining Pentaho and Spark:

  • Reading data from Spark as part of an ETL workflow by using Pentaho Data Integration’s Table Input step with Apache Shark (a Hive-compatible SQL layer that runs on Spark)
  • Reporting on Spark data using Pentaho Reporting against Apache Shark

We are excited about this first step in what we both hope to be a collaborative journey towards deeper integration.

Jake Cornelius
Sr. Vice President, Product Management


Blueprints to Big Data Success

March 21, 2014

It seems like these days everyone falls into one of three categories.  You are either:

  1. Executing a big data strategy
  2. Implementing a big data strategy
  3. Taking a wait and see attitude

If you’re in one of the first two categories, good for you!  If you’re in the third, you might want to rethink your strategy. Companies that get left behind will have a huge hill to climb just to catch up to the competition.

If one of your concerns with moving forward with big data has been a lack of solid guidance to help pave the path to success, then you are in luck!  Pentaho has recently released four Big Data Blueprints to help guide you through the process of executing a strategy.  These are four use cases that Pentaho has seen customers execute successfully.  So, this isn’t just marketing fluff.  These are real architectures that support real big data business value.  The four blueprints now available on Pentaho’s website include:

  • Optimize the data warehouse
  • Streamlined data refinery
  • Customer 360-degree view
  • Monetize my data

These blueprints will help you understand the basic architectures that will support your efforts and achieve your desired results.  If you are like many companies just getting started with big data, these are great tools to guide you through the murky waters that lie ahead.  Here is my quick guide to the four Blueprints, where you may want to get started and why.

The Big Data Blueprints

1.    Optimize the Data Warehouse
Data warehouse optimization (sometimes referred to as data warehouse offloading, or DWO) is a great starter use case for gaining experience and expertise with big data while reducing costs and improving the analytic opportunities for end users.  The idea is to increase the amount of data being stored, not by shoving it into the warehouse, but by adding Hadoop to house the additional data.  Once you have Hadoop in the mix, Pentaho makes it easy to move data into Hadoop from external sources, move data bi-directionally between the warehouse and Hadoop, and process data in Hadoop.  Again, this is a great place to start.  It’s not as transformative to your business as the other use cases can be, but it will build expertise and save you money.

2.    Streamlined Data Refinery
The idea behind the refinery is to provide a way to stream transaction, customer, machine, and other data from their sources through a scalable big data processing hub, where Hadoop is then used to process transformations, store data, and process analytics that can then be sent to an analytic model for reporting and analysis.  Working with several customers, we have seen this as a great next step after the DWO.

3.    Customer 360-Degree View
This blueprint is perhaps the most transformative of all the potential big data use cases. The idea here is to gain greater insight into what your customer is doing, seeing, feeling and purchasing, so that you can serve and retain that customer better and attract more customers into your fold.  This blueprint lays out the architecture needed to start understanding your customer better.  It will require significant effort in accessing all the appropriate customer touch points, but the payoff can be huge.  Don’t worry too much about getting the full 360-degree view at first; starting with even one small slice can drive big gains in revenue and retention.

4.    Monetize My Data
What do you have locked up in your corporate servers, or in machines you own?  This blueprint can be as transformative as the Customer 360, in that it can create new revenue streams that you may not have ever thought about before.  In some cases, it could create a whole new business opportunity.  Whatever your strategy, take time to investigate where and how you can drive new business by leveraging your data.

There are other blueprints that have been defined and developed by Pentaho, but these are four that typically make the most sense for organizations to leverage first.  Feel free to reach out to us for more information about any of these blueprints or to learn more about how Pentaho helps organizations be successful with big data.

Find out more about the big data blueprints at

Please let me know what you think @cyarbrough.

Chuck Yarbrough
Product Marketing, Big Data

