Spark on Fire! Integrating Pentaho and Spark

June 30, 2014

One of Pentaho’s great passions is to empower organizations to take advantage of amazing innovations in Big Data to solve new challenges using the skill sets they already have today. Our Pentaho Labs’ innovations around natively integrating data engineering and analytics with Big Data platforms like Hadoop and Storm have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include optimizing data warehousing architectures, leveraging Hadoop as a cost-effective data refinery, and performing advanced analytics on diverse data sources to achieve a broader 360-degree view of customers.

Not since the early days of Hadoop have we seen so much excitement around a new Big Data technology as we see right now with Apache Spark. Spark is a Hadoop-compatible computing system that makes big data analysis drastically faster, through in-memory computation, and simpler to write, through easy APIs in Java, Scala and Python. With the second annual Spark Summit taking place this week in San Francisco, I wanted to share some of the early work on which Pentaho Labs and our partners at Databricks are collaborating to deeply integrate Pentaho and Spark and deliver high-performance Big Data analytics solutions.

Big Data Integration on Spark

At the core of Pentaho Data Integration (PDI) is a portable ‘data machine’ for ETL, which today can be deployed as a stand-alone Pentaho cluster or inside your Hadoop cluster through MapReduce and YARN. The Pentaho Labs team is now taking this same concept and working on the ability to deploy inside Spark for even faster Big Data ETL processing. The benefit for ETL designers is the ability to design, test and tune ETL jobs in PDI’s easy-to-use graphical design environment, and then run them at scale on Spark. This dramatically lowers the skill sets required, increases productivity, and reduces maintenance costs when taking advantage of Spark for Big Data Integration.

Advanced Analytics on Spark

Last year Pentaho Labs introduced a distributed version of Weka, Pentaho’s machine learning and data mining platform. The goal was to develop a platform-independent approach to using Weka with very large data sets by taking advantage of distributed environments like Hadoop and Spark. Our first implementation proved out this architecture by enabling parallel, in-cluster model training with Hadoop.

We are now working on a similar level of integration with Spark that includes data profiling and evaluating classification and regression algorithms in Spark. The early feedback from Pentaho Labs confirms that developing solutions on Spark is faster and easier than with MapReduce. In just a couple of weeks of development, we have demonstrated the ability to perform in-cluster Canopy clustering and are very close to having k-means++ working in Spark as well!
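For readers unfamiliar with k-means++, the sketch below illustrates only its seeding step: each new starting center is drawn with probability proportional to its squared distance from the nearest center already chosen. This is a single-machine illustration in plain Python, not the Weka or Pentaho Labs implementation and not Spark code; the function names and sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centers using k-means++ style seeding.

    Assumes `points` contains at least k distinct points; otherwise all
    remaining weights can become zero and sampling fails.
    """
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        # Far-away points are more likely to be picked; chosen points have weight 0.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Made-up sample points, for illustration only.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.5)]
seeds = kmeans_pp_seeds(points, k=3, rng=random.Random(42))
print(seeds)
```

Because an already chosen point has zero weight, the seeds are always distinct, which is what spreads the starting centers apart before the usual k-means iterations begin.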

Next up: Exploring Data Science Pack Integration with MLlib

MLlib is already one of the most popular technologies for performing advanced analytics on Big Data.  By integrating Pentaho Data Integration with Spark and MLlib, Data Scientists will benefit by having an easy-to-use environment (PDI) to prepare data for use in MLlib-based solutions.  Furthermore, this integration will make it easier for IT to operationalize the work of the Data Science team by orchestrating the entire end-to-end flow from data acquisition, to data preparation, to execution of MLlib-based jobs to sharing the results, all in one simple PDI Job flow.  To get a sense for how this integration might work, I encourage you to look at a similar integration with R we recently launched as part of the Data Science Pack for Pentaho Business Analytics 5.1.

Experiment Today with Pentaho and Spark!

You can experiment with Pentaho and Spark today for both ETL and Reporting. In conjunction with our partners at Databricks, we recently certified the following use cases combining Pentaho and Spark:

  • Reading data from Spark as part of an ETL workflow by using Pentaho Data Integration’s Table Input step with Apache Shark (a Hive-compatible SQL layer that runs on Spark)
  • Reporting on Spark data using Pentaho Reporting against Apache Shark

We are excited about this first step in what we both hope to be a collaborative journey towards deeper integration.

Jake Cornelius
Sr. Vice President, Product Management
Pentaho

 


World Cup, Twitter sentiment and equity prices…any correlation?

June 25, 2014

I heard a news story on the radio today about stock markets going quiet during World Cup events, especially when the home country is on the field. This made me think about how live activities affect the major markets. My colleague Bo Borland at Pentaho posed an interesting question on this topic just yesterday at MongoDB World in New York, “Do real time Tweets have an effect on the stock markets?” Working for a Big Data integration and analytics company, Bo of course used Pentaho tools to see if there was indeed a correlation. A cool idea, but what resulted was even cooler than I’d imagined….

Using Pentaho Data Integration, Bo easily pulled minute-by-minute stock tick data which is highly structured, and blended it with unstructured Twitter data. Next, he pushed the blended data into a MongoDB collection to take advantage of its flexibility. (Note: Bo is also the author of Pentaho Analytics for MongoDB). Taking the integration and analysis a step further, he scored the tweet sentiment by including a Weka predictive algorithm as part of the data ingestion process from Twitter. Once the data was in place, he used one of the cool new features in Pentaho 5.1 to “slice and dice” the data stored in MongoDB.

It’s worth pointing out that the ability to analyze data directly from MongoDB with no coding is a first-to-market feature. Pentaho has designed and delivered native integration with MongoDB’s Aggregation Framework, allowing business users and analysts to immediately access, analyze and visualize MongoDB data for superior insight and governance.
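To give a feel for what the Aggregation Framework does underneath this kind of analysis, here is a minimal sketch of a pipeline that averages price and sentiment per minute for one symbol. The stage operators ($match, $group, $sort, $avg) are standard MongoDB ones, but the field names and the symbol are assumptions, since the demo's collection schema is not described here, and this is not the query Pentaho generates.

```python
# Hypothetical field names ("symbol", "minute", "price", "avg_sentiment");
# the actual schema used in the demo is not described in this post.
pipeline = [
    # Keep only documents for one (made-up) ticker symbol.
    {"$match": {"symbol": "XYZ"}},
    # One output document per minute, averaging price and sentiment.
    {"$group": {
        "_id": "$minute",
        "avg_price": {"$avg": "$price"},
        "avg_sentiment": {"$avg": "$avg_sentiment"},
    }},
    # Return the minutes in chronological order.
    {"$sort": {"_id": 1}},
]

# Each stage is a single-key dict; list the operators in order.
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)
```

With a driver such as PyMongo, a list like this would be passed to a collection's aggregate method so the grouping runs inside MongoDB rather than in the client.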

Here’s Bo’s process simplified:

Pentaho Data Integration

  • Ingest data from external data source (TickData) into MongoDB
  • Ingest data from Twitter using public API into MongoDB
  • Execute a Weka Scoring step during the ingestion process to properly score the incoming tweets and calculate the sentiment

Connect Pentaho Analytics to the Mongo Collection(s)

  • Start analyzing data
  • Slice and dice large amounts of data quickly
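The blending idea in the steps above can be sketched in a few lines of plain Python: group already-scored tweets by minute and attach the average sentiment to the stock tick for that minute. This is only an illustration under stated assumptions, not Bo's PDI transformation; the record layouts, field names, symbol and values are all made up, and the sentiment scores are taken as given rather than produced by a Weka model.

```python
from collections import defaultdict
from datetime import datetime


def minute_key(timestamp):
    """Truncate an ISO-8601 timestamp string to the start of its minute."""
    return datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)


def blend(ticks, tweets):
    """Attach the mean tweet sentiment for each minute to that minute's tick."""
    scores = defaultdict(list)
    for tweet in tweets:
        scores[minute_key(tweet["created_at"])].append(tweet["sentiment"])

    blended = []
    for tick in ticks:
        minute = minute_key(tick["time"])
        minute_scores = scores.get(minute, [])
        blended.append({
            "minute": minute.isoformat(),
            "symbol": tick["symbol"],
            "price": tick["price"],
            "tweet_count": len(minute_scores),
            # None when no tweets arrived during that minute.
            "avg_sentiment": (
                sum(minute_scores) / len(minute_scores) if minute_scores else None
            ),
        })
    return blended


# Made-up sample records, for illustration only.
ticks = [
    {"time": "2014-06-24T14:30:00", "symbol": "XYZ", "price": 100.0},
    {"time": "2014-06-24T14:31:00", "symbol": "XYZ", "price": 100.5},
]
tweets = [
    {"created_at": "2014-06-24T14:30:12", "sentiment": 1.0},
    {"created_at": "2014-06-24T14:30:48", "sentiment": 0.0},
]
blended = blend(ticks, tweets)
print(blended)
```

Documents shaped like the entries in `blended` are the kind of record that could then be written to a MongoDB collection and analyzed by minute.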

Here’s what the process looks like:

[Diagram: MongoDB process]

If you want to see this slicing and dicing directly on data in MongoDB check out this video.

Bo presented this demo yesterday live to a standing room only crowd using Tesla data at MongoDB World. You can access his slides here:

So the question still remains, “Does Twitter sentiment correlate to equity prices?” I’ll let you take a look and decide, but I’ve got some stocks to research….

Chuck Yarbrough
Director, Big Data Product Marketing
Pentaho

 


Introducing Pentaho 5.1 – Powering Big Data Analytics at Scale

June 23, 2014

You can’t predict tomorrow with yesterday’s tools. At Pentaho, this has been a core tenet in staying nimble and innovating in this disruptive market. Today, at MongoDB World in New York, we announced Pentaho Business Analytics 5.1, a culmination of speed of innovation and community and customer engagement. Pentaho 5.1 supports our ongoing strategy to make big data analytics faster, at scale, and easier and more accessible for more users.

The most powerful insights are revealed when Big Data can be accessed and blended at the source. Pentaho 5.1 enables users to do this seamlessly, eliminating the need for a specialized set of skills and bridging the data-to-analytics divide. Our recent Data Science Pack blog post references analyst research estimating that the top two time-consuming big data tasks are solving data quality and consistency issues (46%) and preparing data for integration (52%). We know a huge amount of resources are spent just getting data ‘ready’ to discover the greatest land mine or gold mine of data.

In 5.1 we are streamlining the big data process and making big data a reality for all with three innovations including:

  • Direct analytics on MongoDB – Unlocks the value of data in NoSQL through interactive visual analysis. Native integration leverages the MongoDB Aggregation Framework, Replication and Tag Sets for direct analysis on MongoDB collections with no impact on throughput.
  • Data Science Pack – Operationalizes predictive models, drastically reducing data preparation time and effort. The pack includes integration with both R and Weka, two of the most popular machine learning and predictive analytic toolsets in use today by data scientists.
  • Full YARN Support – Reduces complexity for big data developers while leveraging the full power of Hadoop.

Just listen to our customer Chris Palm, Lead Software Architecture Engineer at MultiPlan, share just how daunting the data-to-analytics process can be: “Traditional RDBMS analytics can get very complicated and, quite frankly, ugly when working with semi or unstructured data. The Pentaho 5.1 platform is meeting market needs, allowing users to directly analyze data in MongoDB. We have seen more accurate results with new analyses and are no longer constrained by having to pull only part of our data. We can now look across a more full set of data and govern our system of record to gain greater insights.”

I encourage you to explore the impressive new capabilities in Pentaho 5.1. You can access resources such as videos, a webinar and the download at: http://bit.ly/PTHO5-1.

Chris Dziekan
EVP & Chief Product Officer
Pentaho


Recognizing and Rewarding Your Work: The Pentaho Excellence Awards

June 23, 2014

One of my favorite aspects of being CEO of Pentaho is the opportunity to talk to our customers around the world. Innovative and motivated individuals and teams are turning data into value and making a major impact for their organization, and in some cases for the betterment of society. We are proud to announce the first annual Pentaho Excellence Awards to recognize and honor our customers and users, rewarding those that have deployed Pentaho technologies in impressive and innovative ways.

The Pentaho Excellence Awards offer an opportunity for you and your team to receive industry recognition for your expertise in analytics and big data deployments and thought leadership. While we know your teams are busy helping to make faster and smarter business decisions, here is the link to more information about the Pentaho Customer Excellence Awards and our short nomination process: http://bit.ly/PWorldPEA. Nominations are open until July 11th.

A panel of expert judges will pick a winner in six different categories. Category winners receive a free pass to PentahoWorld in Orlando October 8-10, 2014, along with several additional unique opportunities at the event such as a VIP dinner, speaking opportunities and recognition at a keynote awards ceremony. As a highlight of the Awards ceremony we will announce the overall User of the Year Award.

We look forward to celebrating the amazing accomplishments achieved through our work together. I hope to see you on stage during the awards ceremony at PentahoWorld.

Quentin

Photo/LEGO credit: @kathrineiben


Award time at Pentaho

June 18, 2014

The past few weeks we’ve been giddy with excitement about several awards we’ve received celebrating our big data technology and how customers are applying it to reap big benefits. The latest awards Pentaho along with our customers have added to our growing trophy case include:

The CRN Big Data 100 list identifies vendors that have demonstrated an ability to innovate in bringing to market products and services that help businesses work with big data. Pentaho is proud to be named to the Big Data 100 list for the second year in the business analytics category. The award noted Pentaho’s record 83 percent bookings growth in 2013 for big data and embedded analytics products. It also noted the addition of Christopher Dziekan, previously head of analytics product strategy at IBM, as Pentaho’s new Chief Product Officer.

 

Each year the SD Times 100 recognizes companies, non-commercial organizations, open source projects and other initiatives for their innovation and leadership. Judged by the editors of SD Times, the SD Times 100 recognizes the top innovators and leaders in multiple software development industry areas. Pentaho was selected as a top 10 leader for Big Data, alongside Apache Hadoop, Splunk, Cloudera, DataStax, Hortonworks and MongoDB!

 

Pentaho customer Bywaters was shortlisted for the ComputerWeekly European User Awards for Enterprise Software in the category of Best Technology Innovation! Bywaters is a waste management and recycling company based in the UK. They aim to make it easy and affordable for customers to improve their environmental performance and meet regulatory compliance through a system they created that embeds Pentaho, called BRAD – Bywaters reporting and analytics dashboards. You can read more about their use case on Pentaho.com or in the feature article on ComputerWeekly.

Pentaho/CTools pro-bono customer Leukaemia & Lymphoma Research was awarded GOLD for ‘Best Use of Performance Reporting and Data Visualization’ by the Institute of Fundraising Insight Awards. The CTools team and Dan Keeley (@Codek1) worked with the UK Beating Blood Cancer group, which is dedicated to improving the lives of patients with all types of blood cancer, including Leukaemia, Lymphoma and Myeloma. They created (now) award-winning dashboards to track the charity events they organize – check out the sample version here.

If you love awards as much as we do, then you must check out the Pentaho Excellence Awards. This is our first annual awards program to recognize and honor our customers, partners and users who have deployed Pentaho in interesting and innovative ways. Nominations are open until July 11th.

Rebecca Shomair
Director of Communications
Pentaho

 


Dinosaurs Have Had Their Day

June 16, 2014

Once upon a time, (not so) long ago in 2004, two young technologies were born from the same open source origins – Hadoop and Pentaho. Both evolved quickly from the market’s demand for better, larger-scale analytics that could be adopted faster to benefit more players.

Most who adopt Hadoop want to be disruptive leaders in their market without breaking the bank. Earlier this month at Hadoop Summit 2014, I talked to many people who told me, “I’d like to get off of <insert old proprietary software here> for my new big data applications and that’s why we’re looking at Pentaho.” It’s simple – no company is going to adopt Hadoop and then turn around and pay the likes of Informatica, Oracle or SAS outrageous amounts for data engineering or analytics.

Big data is the asteroid that has hit the tech market and changed its landscape forever, giving life to new business models and architectures based on open source technologies. First the ancient dinosaurs ignored open source, then they fought it and now they are trying to embrace it. But the mighty force of evolution had other plans. Dinosaurs are giving way to a more nimble generation that doesn’t depend on a mammoth diet of maintenance revenue, exorbitant license fees and long-term deals just to survive.

In this new world, companies must continually evolve to survive, and dinosaurs have had their day. It’s incredibly rewarding to be part of a new analytics ecosystem that thrives on open standards, high performance and better value for customers. So many positive evolutionary changes have taken place in the last ten years that I can’t wait to see what the next ten will bring.

Richard Daley
Founder and Chief Strategy Officer
Pentaho

Image: #147732373 / gettyimages.com


Why are WE excited about PentahoWorld?

June 10, 2014

We’re thrilled to announce that registration is open for PentahoWorld, our first global conference that brings together Pentaho users, advocates, and partners to help each other solve challenges around data integration, big data, and embedded analytics.

You can register here, but if you need a bit more convincing, here are a few of the top reasons we’re excited about PentahoWorld – and why you should be too.

Getting on top of industry trends – PentahoWorld will be a unique gathering of hundreds of people who, every day, are solving challenging problems on the cutting edge of a rapidly changing data landscape.  With all those brilliant minds in one room, you’ll learn about solutions they’re crafting today, what they see coming in the future, and what you should be thinking about for your own company. And we have no doubt you’ll contribute some unique insights of your own.

Meet the experts – If you’ve got questions, this is the place for answers.  Between Pentaho product experts, power users, advocates, community leaders, and people who’ve applied Pentaho in every imaginable way, we can’t imagine a denser concentration of Pentaho expertise.  Whatever your challenge might be, start here if you’re looking to extend your use of Pentaho, troubleshoot, brainstorm, and innovate on your Pentaho implementation.

Product training – To extend your Pentaho use even further, don’t miss our training classes that cover how to use Pentaho with Hadoop, NoSQL, Weka, and much more.  If you’re looking to justify your trip to your boss, this opportunity to come back with a solid set of skills in these cutting edge technologies is a no-brainer.

Partner SolutionExpo – You probably know that Pentaho plays well with others – and that’s why we’re so excited about our partner SolutionExpo. Learn about how we integrate with different technologies, tease out the advantages of different technology partners, and see how they can help you with your business challenges. (Plus, here’s where you get the cool giveaways to bring back to your office and your family.)

It’s our birthday!  Pentaho’s 10th Birthday Gala is happening at PentahoWorld, and have you seen where we’re having the party? Besides the awesome lazy river pool (ranked #1 by TripAdvisor’s Top 10 Fantastic Pools), we can’t think of a better way to celebrate than with all of you – our friends, partners, and community – who have been with us for the journey. We promise it will be a special night.

Celebrate our customers’ success – As the conclusion to our Pentaho Excellence Awards, we’ll be announcing the Pentaho User of the Year at PentahoWorld.  These awards celebrate outstanding and creative use of Pentaho across a number of different categories.  Sound like you or someone you know? Submit your entry here through July 11.

You can learn much more and register at www.pentahoworld.com – and make sure you register before Early Bird pricing ends to save up to $600!

Amy Palmer
Sr. Marketing Manager, Integrated Programs
Pentaho

