Spark on Fire! Integrating Pentaho and Spark

June 30, 2014

One of Pentaho’s great passions is empowering organizations to take advantage of amazing innovations in Big Data to solve new challenges using the skill sets they already have in their organizations today.  Pentaho Labs’ innovations around natively integrating data engineering and analytics with Big Data platforms like Hadoop and Storm have already led dozens of customers to deploy next-generation Big Data solutions.  Examples include optimizing data warehousing architectures, leveraging Hadoop as a cost-effective data refinery, and performing advanced analytics on diverse data sources to achieve a broader 360-degree view of customers.

Not since the early days of Hadoop have we seen so much excitement around a new Big Data technology as we see right now with Apache Spark.  Spark is a Hadoop-compatible computing system that makes big data analysis drastically faster, through in-memory computation, and simpler to write, through easy APIs in Java, Scala and Python.  With the second annual Spark Summit taking place this week in San Francisco, I wanted to share some of the early work Pentaho Labs and our partners over at Databricks are collaborating on to deeply integrate Pentaho and Spark for delivering high performance, Big Data Analytics solutions.

Big Data Integration on Spark

At the core of Pentaho Data Integration (PDI) is a portable ‘data machine’ for ETL, which today can be deployed as a stand-alone Pentaho cluster or inside your Hadoop cluster through MapReduce and YARN.  The Pentaho Labs team is now taking this same concept and working on the ability to deploy inside Spark for even faster Big Data ETL processing.  The benefit for ETL designers is the ability to design, test and tune ETL jobs in PDI’s easy-to-use graphical design environment, and then run them at scale on Spark.  This dramatically lowers the skill sets required, increases productivity, and reduces maintenance costs when taking advantage of Spark for Big Data Integration.

Advanced Analytics on Spark

Last year Pentaho Labs introduced a distributed version of Weka, Pentaho’s machine learning and data mining platform. The goal was to develop a platform-independent approach to using Weka with very large data sets by taking advantage of distributed environments like Hadoop and Spark. Our first implementation proved out this architecture by enabling parallel, in-cluster model training with Hadoop.


We are now working on a similar level of integration with Spark that includes data profiling and evaluating classification and regression algorithms in Spark.  The early feedback from Pentaho Labs confirms that developing solutions on Spark is faster and easier than with MapReduce. In just a couple weeks of development, we have demonstrated the ability to perform in-cluster Canopy clustering and are very close to having k-means++ working in Spark as well!
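Canopy clustering itself is simple enough to sketch in a few lines; here is a minimal single-machine version for intuition (the in-cluster work described above distributes this across Spark executors; the distance function, thresholds and sample points below are illustrative assumptions, not Weka or Spark APIs):

```python
import math

def canopy_cluster(points, t1, t2, distance=math.dist):
    """Single-pass canopy clustering.

    t1 > t2: points within the loose threshold t1 join the canopy;
    points within the tight threshold t2 are removed from the pool
    of future canopy-center candidates.
    """
    assert t1 > t2, "loose threshold t1 must exceed tight threshold t2"
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)
        canopy = [center]
        remaining = []
        for p in candidates:
            d = distance(center, p)
            if d < t1:
                canopy.append(p)          # close enough to join this canopy
            if d >= t2:
                remaining.append(p)       # still a future center candidate
        candidates = remaining
        canopies.append((center, canopy))
    return canopies

# Two well-separated groups of points yield two canopies.
pts = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (10.3, 9.8)]
print(len(canopy_cluster(pts, t1=3.0, t2=1.0)))  # 2
```

Because each point only needs its distance to the current canopy centers, the expensive inner loop parallelizes naturally across a cluster, which is what makes canopy clustering a good first candidate for an in-cluster implementation.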

Next up: Exploring Data Science Pack Integration with MLlib

MLlib is already one of the most popular technologies for performing advanced analytics on Big Data.  By integrating Pentaho Data Integration with Spark and MLlib, Data Scientists will benefit by having an easy-to-use environment (PDI) to prepare data for use in MLlib-based solutions.  Furthermore, this integration will make it easier for IT to operationalize the work of the Data Science team by orchestrating the entire end-to-end flow from data acquisition, to data preparation, to execution of MLlib-based jobs to sharing the results, all in one simple PDI Job flow.  To get a sense for how this integration might work, I encourage you to look at a similar integration with R we recently launched as part of the Data Science Pack for Pentaho Business Analytics 5.1.

Experiment Today with Pentaho and Spark!

You can experiment with Pentaho and Spark today for both ETL and Reporting.  In conjunction with our partners at Databricks, we recently certified the following use cases combining Pentaho and Spark:

  • Reading data from Spark as part of an ETL workflow by using Pentaho Data Integration’s Table Input step with Apache Shark (the Hive SQL layer running on Spark)
  • Reporting on Spark data using Pentaho Reporting against Apache Shark

We are excited about this first step in what we both hope to be a collaborative journey towards deeper integration.

Jake Cornelius
Sr. Vice President, Product Management


Pentaho 5 has arrived with something for everyone!

September 18, 2013

I am tremendously excited to announce that Pentaho Business Analytics 5 is available for download!  This release represents the culmination of over 30 person-years of engineering effort and contains over 250 new features and improvements.  There truly is something for everyone in Pentaho 5.  Whether you are an end user, administrator, executive or developer, here are what I think are the top 3 areas of improvement for you:

  1. Improving productivity for end users and administrators
  2. Empowering organizations to easily and accurately answer questions using blended big data sets
  3. Simplifying the experience for developers integrating with or embedding Pentaho Business Analytics

Improving Productivity for End Users and Administrators



18 months ago, we challenged ourselves to think deeply about the different profiles of users working with the Pentaho suite and identify the top areas where we could significantly improve our ease-of-use.  Based on the feedback from countless customer interviews and usability studies, the first thing you will notice about Pentaho 5 is a dramatically overhauled User Console.  Beyond the fresh, new, modern look and feel, we’ve introduced a new concept called “perspectives” making it easier than ever for end users to:

  • navigate between open documents
  • browse the repository
  • manage scheduled activities

Throughout the User Console, end users will enjoy numerous improvements and better feedback for common workflows such as designing dashboards or scheduling the execution of a parameterized report. Administrators will appreciate that we have consolidated all administration capabilities directly into the User Console, enhanced security with the ability to create more specific role types that control the types of actions users can perform, and bundled a comprehensive audit mart providing out-of-the-box answers to common questions about usage patterns, performance and errors.

Analytics-ready Big Data Blending



At the dawn of the Big Data era, a wide range of new storage and processing technologies flooded the market, each bringing specialized characteristics to help solve the next wave of data challenges.  Pentaho has long been a leader and innovator in delivering an end-to-end platform for designing scalable and easily maintainable Big Data solutions.  Powered by the Pentaho Adaptive Big Data Layer, we’ve dramatically expanded our support for Hadoop with all-new certifications for the latest distributions from Cloudera, Hortonworks, MapR and Intel.  Furthermore, we’ve integrated our complete analytics platform for use with Cloudera Impala.  Other Big Data highlights in Pentaho 5 include new integration with Splunk and dramatic ease-of-use improvements when working with NoSQL platforms such as MongoDB and Cassandra.

As organizations large and small map out their next-generation data architectures, we see best-practice design patterns emerging that help organizations target the appropriate data technology for each use case.  Evident in all of these design patterns is the fact that Big Data technologies are rarely information silos.  Solving common use cases such as optimizing your data warehousing architecture or performing 360-degree analysis of a customer requires that all data be accessible and blended in an accurate way.  Pentaho Data Integration provides the connectivity and design ease-of-use to implement all of these emerging patterns, and with Pentaho 5 I’m excited to announce the world’s first SQL (JDBC) driver for runtime transformation.  This integration empowers data integration designers to accurately design blended data sets from across the enterprise, and put them directly in the hands of end users through tools they are already familiar with – reporting, dashboards and visual discovery – as well as predictive analytics.
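To get a feel for what SQL access to a blended data set means for the end user, here is a toy sketch using Python’s built-in sqlite3 as a stand-in query engine (PDI’s actual JDBC driver executes a transformation at query time rather than materializing tables; all table names, columns and rows below are invented for illustration):

```python
import sqlite3

# Two "sources" a transformation might blend: a CRM extract and
# a web-log aggregate. In PDI the blend is a transformation whose
# output is exposed as a virtual SQL table; here we materialize it.
crm_rows = [(1, "Acme"), (2, "Globex")]
web_rows = [(1, 42), (2, 17), (1, 8)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE pageviews (customer_id INTEGER, views INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", crm_rows)
conn.executemany("INSERT INTO pageviews VALUES (?, ?)", web_rows)

# The reporting tool simply issues plain SQL against the blend.
blended = conn.execute("""
    SELECT c.name, SUM(p.views) AS total_views
    FROM customers c JOIN pageviews p ON p.customer_id = c.id
    GROUP BY c.name ORDER BY total_views DESC
""").fetchall()
print(blended)  # [('Acme', 50), ('Globex', 17)]
```

The point of the runtime-transformation driver is that the join above can span sources that never live in the same database, while the consuming tool still sees nothing but standard SQL over JDBC.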

Simplified Platform for OEMs and Embedders


Finally, I’d like to highlight how this release further solidifies the Pentaho suite as the best platform for enterprises and OEMs who want to enrich their applications with better data processing or business analytics.  Pentaho 5 delivers a more customizable User Console providing developers with complete control over the menu bar and toolbar, improvements to the underlying theming engine and an all new plugin layer for adding custom perspectives.  Furthermore, we’ve dramatically simplified our service architecture by introducing a brand new REST-based API along with a rich library of integration samples and documentation to get you started.

These enhancements are just a few of the many great improvements in Pentaho 5. If you want a more in-depth overview and demonstration, register for the Pentaho 5.0 webinar on September 24th – two times to choose from: North America/LATAM & EMEA. You can also access great resources, from videos to solution briefs, at

Jake Cornelius

SVP Products

What’s new in Pentaho BI Suite Enterprise Edition 3.8

March 16, 2011

It’s release time once again and I’m pleased to announce that Pentaho BI Suite 3.8 is available for download! This release is packed with new features empowering you to add more interactivity to your Pentaho-based solutions and improve performance and efficiency when working with larger and larger data volumes. In today’s blog, I’ll highlight just a few of these exciting new enhancements:

Guided Analysis
Building upon the hyperlink feature introduced in Pentaho Reporting with our 3.7 BI suite release, Analyzer Reports and Action Sequences now also provide the ability to create contextual hyperlinks to other pieces of Pentaho content or external URLs. You now have complete flexibility to provide information consumers with guided paths to additional detail or related content found in another report.  For example, with a couple of clicks you could create a summary level Analyzer Report for users to explore and analyze product sales, then enable hyperlinks on product names which link out to a detailed inventory report to ensure there are enough units on hand for your top selling products.

Dashboard Content Linking
Pentaho Dashboard Designer also receives a dose of interactivity with a feature we call content linking.  Content linking allows you to let one dashboard element drive the filtering of another element of the dashboard.  This feature is integrated with nearly all dashboard components, including filter controls, dashboard charts, data tables and any items embedded in a dashboard widget such as a Pentaho report or Analyzer view.  It can be used for a variety of use cases, including the creation of chained parameters, where selections in one filter control drive the available selections in another, or allowing dashboard consumers to click on a slice of a pie or a bar in a bar chart and have that drive the filtering of other widgets on the dashboard.  Be sure to check out the new dashboard samples, Product Sales Performance and Product Performance Dashboard, which illustrate the content linking feature along with the new, expanded set of filter controls including radio groups, check boxes, calendar pickers, button controls and more.

Data-less Design Mode for Analyzer Reports
Since its introduction just over a year ago, Pentaho Analyzer’s elegant combination of power and simplicity has driven exponential growth in the use of Pentaho Analysis.  This includes deployments to larger user communities and the development of bigger, more sophisticated Mondrian cubes. Based on feedback from Pentaho customers, we’ve introduced the notion of a data-less report design mode, referred to as ‘auto-refresh’ in the user interface.  This allows users to design or modify the layout of an Analyzer report without querying the underlying RDBMS until the designer is ready.  This can help reduce database traffic for deployments to large user communities or reduce the design time for reports that depend on large queries.  Try it out by clicking the Disable Auto-refresh button on the Analyzer toolbar, designing your query using the Field Layout panel, then clicking the Refresh Report button to issue the query.
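Under the hood this is a classic deferred-execution pattern: layout edits are buffered, and the potentially expensive query runs only on an explicit refresh. A minimal sketch of the idea (the class and method names below are invented for illustration, not Analyzer APIs):

```python
class ReportDesigner:
    """Buffers layout changes; queries only when refresh() is called."""

    def __init__(self, run_query):
        self.run_query = run_query    # the expensive call to the database
        self.fields = []
        self.auto_refresh = True

    def add_field(self, field):
        self.fields.append(field)
        if self.auto_refresh:         # default mode: query on every edit
            return self.refresh()

    def refresh(self):
        return self.run_query(self.fields)

queries = []
def fake_query(fields):
    queries.append(list(fields))      # record each round trip
    return f"result({len(fields)} fields)"

r = ReportDesigner(fake_query)
r.auto_refresh = False                # "Disable Auto-refresh"
r.add_field("territory")
r.add_field("sales")
print(len(queries))                   # 0 -- no database traffic yet
r.refresh()                           # one query for the final layout
print(len(queries))                   # 1
```

With auto-refresh enabled the same two edits would have issued two queries; deferring them trades immediate feedback for a single round trip against the final layout.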

Simplified Hadoop MapReduce Job Design
Also included in the 3.8 suite release is Pentaho Data Integration 4.1.2.  While this is primarily a patch release containing important product fixes and performance improvements, we’ve also added a few new features that simplify the design of transformations used to compose MapReduce jobs for Hadoop.  This includes dedicated steps for MapReduce Inputs and Outputs allowing you to simply choose which fields to use as your OutKey and OutValues, rather than having to explicitly filter and rename fields in the stream down to the key-value pair fields you pass back to Hadoop.  Finally, the Transformation Job Executor step now provides the ability to specify a transformation for use as a Combiner, thereby enabling you to optimize the performance of your PDI-based MapReduce jobs.
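The key-value contract those steps abstract away is easy to see in a plain-Python sketch of the map-shuffle-reduce flow, using the classic word count (the names below are illustrative, not PDI or Hadoop APIs; the same aggregation function serves as both combiner and reducer, which is exactly the case the Combiner option targets):

```python
from collections import defaultdict

def mapper(line):
    # Emit (OutKey, OutValue) pairs -- the narrowed-down stream the
    # MapReduce Output step expects to hand back to Hadoop.
    for word in line.split():
        yield word.lower(), 1

def combine_or_reduce(pairs):
    # Sums values per key; usable as a per-mapper combiner or the
    # final reducer because addition is associative.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["Pentaho Data Integration", "pentaho on Hadoop"]
# Map each input record, then "shuffle" all pairs to the reducer.
pairs = [kv for line in lines for kv in mapper(line)]
counts = combine_or_reduce(pairs)
print(counts["pentaho"])  # 2
```

Running the combiner inside each mapper shrinks the pair stream before the shuffle, which is where the performance benefit of specifying a Combiner transformation comes from.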

We hope you enjoy this exciting new release. Get started today by downloading your copy at

Jake Cornelius
Vice President of Product Management
Pentaho Corporation

Top 3 reasons to download PDI 4.0 and BI Suite 3.6

June 10, 2010

Today we announced General Availability of Pentaho Data Integration 4.0 and Pentaho BI Suite 3.6.  This marks a major step forward in our mission to empower users with the most comprehensive, easy-to-use, and integrated Business Intelligence suite on the market today.  All too often, new BI projects end up on IT’s cutting room floor due to a variety of factors:

  • Licensing costs – it’s too expensive to add the additional users/CPUs for the project to reach its intended audience.  A single BI project can have licensing impacts on several software products including data integration (ETL), reporting, dashboards, analytics, database, and on and on…
  • Lengthy time to ROI – a classic pitfall of new BI projects is spending weeks or months ‘getting the data right’ before business users have an opportunity to interact and provide feedback.  Inevitably, this leads to missed requirements, delayed rollouts, and blown budgets
  • Technical resources – do you have the right technical resources available, and are they knowledgeable in all of the tools and technology involved?  Are you beholden to the availability of IT, or can you accomplish this yourself?

Our Agile BI initiative is breaking down these barriers by delivering a BI Suite with:

  • All of the functionality you actually need at 20% of the cost of a comparable solution from the big guys… bye bye bloat-ware
  • An integrated design environment combining all aspects of a BI solution from Data Integration (ETL) through data visualization, encouraging collaborative, cross-team interaction between solution architects and end users and faster iterations… compare that with a hybrid Informatica/DataStage – Oracle/Business Objects/Cognos solution
  • A modern, standards-based architecture that deploys in minutes and is easy to customize or extend to meet the changing needs of your business… guaranteed 97% less super glue and duct tape under the hood than comparable proprietary BI Suites

So you’re thinking… enough with the marketing bullets.  Why should I download or upgrade to these new releases? The top 3 reasons:

1. Design Perspectives – New perspectives in PDI’s designer (Spoon) providing one click visualization of data and simple, drag-and-drop metadata modeling for OLAP and reporting metadata

2. PDI Enterprise Edition Data Integration Server – All the execution and clustering capabilities of our core offering plus integrated scheduling, advanced security, and enhanced content management, including complete revision history on all of your jobs and transformations

3. User Console improvements – There are numerous improvements to our visualization plugins including:

  • Support for scheduling, emailing, member properties and dashboard filter integration in Analyzer
  • Configurable auto-refresh intervals on dashboards and dashboard widgets and integration of Dashboard filters with Pentaho metadata data sources

Get started today by downloading your copy from

Jake Cornelius
Director of Product Management
Pentaho Corporation

