Spark on Fire! Integrating Pentaho and Spark

June 30, 2014

One of Pentaho’s great passions is to empower organizations to take advantage of amazing innovations in Big Data to solve new challenges using the skill sets they already have today.  Our Pentaho Labs innovations around natively integrating data engineering and analytics with Big Data platforms like Hadoop and Storm have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include optimizing data warehousing architectures, leveraging Hadoop as a cost-effective data refinery, and performing advanced analytics on diverse data sources to achieve a broader 360-degree view of customers.

Not since the early days of Hadoop have we seen so much excitement around a new Big Data technology as we see right now with Apache Spark.  Spark is a Hadoop-compatible computing system that makes big data analysis drastically faster, through in-memory computation, and simpler to write, through easy APIs in Java, Scala and Python.  With the second annual Spark Summit taking place this week in San Francisco, I wanted to share some of the early work Pentaho Labs and our partners at Databricks are collaborating on to deeply integrate Pentaho and Spark and deliver high-performance Big Data analytics solutions.

Big Data Integration on Spark

At the core of Pentaho Data Integration (PDI) is a portable ‘data machine’ for ETL which today can be deployed as a stand-alone Pentaho cluster or inside your Hadoop cluster through MapReduce and YARN.  The Pentaho Labs team is now taking this same concept and working on the ability to deploy it inside Spark for even faster Big Data ETL processing.  The benefit for ETL designers is the ability to design, test and tune ETL jobs in PDI’s easy-to-use graphical design environment, and then run them at scale on Spark.  This dramatically lowers the skill sets required, increases productivity, and reduces maintenance costs when taking advantage of Spark for Big Data Integration.

Advanced Analytics on Spark

Last year Pentaho Labs introduced a distributed version of Weka, Pentaho’s machine learning and data mining platform. The goal was to develop a platform-independent approach to using Weka with very large data sets by taking advantage of distributed environments like Hadoop and Spark. Our first implementation proved out this architecture by enabling parallel, in-cluster model training with Hadoop.

We are now working on a similar level of integration with Spark, including data profiling and evaluation of classification and regression algorithms.  The early feedback from Pentaho Labs confirms that developing solutions on Spark is faster and easier than with MapReduce. In just a couple of weeks of development, we have demonstrated the ability to perform in-cluster Canopy clustering and are very close to having k-means++ working in Spark as well!
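Our in-cluster implementations aren’t public yet, but the k-means++ idea mentioned above is easy to sketch. Below is a minimal, single-machine Python illustration of the k-means++ seeding step on toy 2-D data — the textbook algorithm, not Pentaho’s or Spark’s implementation:

```python
import random

def kmeans_pp_init(points, k, rng=random.Random(42)):
    """k-means++ seeding: choose each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest existing center.
        d2 = [min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
              for px, py in points]
        # Sample the next center with probability proportional to d^2.
        r = rng.uniform(0, sum(d2))
        cumulative = 0.0
        for point, weight in zip(points, d2):
            cumulative += weight
            if cumulative >= r:
                centers.append(point)
                break
    return centers

# Three well-separated toy clusters; seeding should tend to pick one from each.
data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1), (8.8, 0.3)]
centers = kmeans_pp_init(data, k=3)
print(centers)
```

Because the seeds start spread out, the subsequent Lloyd’s iterations typically converge in far fewer passes than with random initialization — which is exactly why this seeding step parallelizes so attractively in a cluster.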

Next up: Exploring Data Science Pack Integration with MLlib

MLlib is already one of the most popular technologies for performing advanced analytics on Big Data.  By integrating Pentaho Data Integration with Spark and MLlib, Data Scientists will benefit by having an easy-to-use environment (PDI) to prepare data for use in MLlib-based solutions.  Furthermore, this integration will make it easier for IT to operationalize the work of the Data Science team by orchestrating the entire end-to-end flow from data acquisition, to data preparation, to execution of MLlib-based jobs to sharing the results, all in one simple PDI Job flow.  To get a sense for how this integration might work, I encourage you to look at a similar integration with R we recently launched as part of the Data Science Pack for Pentaho Business Analytics 5.1.

Experiment Today with Pentaho and Spark!

You can experiment with Pentaho and Spark today for both ETL and reporting.  In conjunction with our partners at Databricks, we recently certified the following use cases combining Pentaho and Spark:

  • Reading data from Spark as part of an ETL workflow by using Pentaho Data Integration’s Table Input step with Apache Shark (a Hive-compatible SQL layer that runs on Spark)
  • Reporting on Spark data using Pentaho Reporting against Apache Shark
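Under the hood, the Table Input step simply sends SQL to Shark over a JDBC connection, just as it would to Hive. Here is a rough sketch of the kind of query string it might issue — the weblogs table and its columns are hypothetical names, used only for illustration:

```python
def table_input_query(table, columns, limit=None):
    """Build the SELECT statement a Table Input step would send over JDBC.
    Shark accepts the same HiveQL dialect as Hive, so nothing Spark-specific
    is needed in the SQL itself."""
    sql = "SELECT {} FROM {}".format(", ".join(columns), table)
    if limit is not None:
        sql += " LIMIT {}".format(limit)
    return sql

# Hypothetical example: preview the first 100 rows of a weblog table.
query = table_input_query("weblogs", ["host", "status", "bytes"], limit=100)
print(query)  # SELECT host, status, bytes FROM weblogs LIMIT 100
```

This is the appeal of the Shark approach: any tool that already speaks SQL over JDBC can read Spark-resident data without modification.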

We are excited about this first step in what we both hope to be a collaborative journey towards deeper integration.

Jake Cornelius
Sr. Vice President, Product Management
Pentaho

 


Dinosaurs Have Had Their Day

June 16, 2014

Once upon a time, (not so) long ago in 2004, two young technologies were born from the same open source origins – Hadoop and Pentaho. Both evolved quickly from the market’s demand for better, larger-scale analytics that could be adopted faster to benefit more players.

Most who adopt Hadoop want to be disruptive leaders in their market without breaking the bank. Earlier this month at Hadoop Summit 2014, I talked to many people who told me, “I’d like to get off of <insert old proprietary software here> for my new big data applications and that’s why we’re looking at Pentaho.” It’s simple – no company is going to adopt Hadoop and then turn around and pay the likes of Informatica, Oracle or SAS outrageous amounts for data engineering or analytics.

Big data is the asteroid that has hit the tech market and changed its landscape forever, giving life to new business models and architectures based on open source technologies. First the ancient dinosaurs ignored open source, then they fought it and now they are trying to embrace it. But the mighty force of evolution had other plans. Dinosaurs are giving way to a more nimble generation that doesn’t depend on a mammoth diet of maintenance revenue, exorbitant license fees and long-term deals just to survive.

In this new world companies must continually evolve to survive and dinosaurs have had their day. It’s incredibly rewarding to be  part of a new analytics ecosystem that thrives on open standards, high performance and better value for customers. So many positive evolutionary changes have taken place in the last ten years, I can’t wait to see what the next ten will bring.

Richard Daley
Founder and Chief Strategy Officer
Pentaho

Image: #147732373 / gettyimages.com


Hadoop Summit 2014 – Big Data Keeps Getting Bigger

June 6, 2014

While most of this year’s Hadoop Summit sessions still conveyed ‘developer conference,’ rife with command-line driven demos and Java, Scala, and Python code snippets, I noticed the ‘commercial’ uniform of khakis, blazers and Docksiders starting to creep in. Indeed, the themes I noticed most at the Summit were “enterprise ready” and “next-generation data platform.”

So if the Summit’s days as an all-out geekfest are history, what does this say about Hadoop? I happen to think it’s great news: it says Hadoop is going mainstream and being embraced as core to the enterprise data platform. Nothing drives this home more convincingly than the fact that the Hadoop “enterprise ready” ecosystem has exploded from less than ten vendors five years ago to more than 80 vendor sponsors at this year’s show.

In this, our fifth year sponsoring the Summit, we were just as pumped as we were attending and sponsoring our first Hadoop Summit back in June 2010, right after we launched our first Hadoop product set.  This year saw a record crowd (3,200+ attendees from 1,100 different companies), informative breakout sessions, fun parties, and lots of energy and passion throughout.

More large enterprise companies than ever laid out specific needs and funded use cases. I noticed companies increasingly talking about bringing Hadoop in house to build proofs of concept so they wouldn’t get left behind, losing ground to competitive Hadoop shops. Hadoop has emerged as the new strategic weapon in companies’ IT arsenals as they wake up to the value of their data assets.

I talked to Hadoop users who were beaming with pride over their projects and hungry to take on more. These techies are the new corporate rock stars, delivering huge returns to their companies. However, as with any young technology, Hadoop projects aren’t completely free of road bumps – mostly around blending different types of data and integration as both corporate data volumes and variety continue to multiply like rabbits.

That’s why at Pentaho, we’re determined to stay on the road less travelled and keep smoothing out these data blending and integration road bumps so that every data professional working with Hadoop – regardless of their dress code – will enjoy a better ride.

See you at Hadoop Summit 2015!

Richard Daley
Founder and Chief Strategy Officer
Pentaho


Highlights From Splunk .conf2013 – Machine Data Meets Big Business

October 4, 2013

Eddie White, EVP Business Development, Pentaho

This week, Pentaho was on site for Splunk .conf2013 in Las Vegas and the show was buzzing with excitement. Organizations big and small shared a range of new innovations leveraging machine data.

Eddie White, executive VP of business development at Pentaho, shares his first-hand impressions and insights on the biggest news and trends coming out of .conf2013.

Q: Eddie, what are your impressions of this year’s Splunk conference?

There’s a different feel at the show this year — bigger companies and more business users in attendance. What has traditionally been more of an “IT show” has evolved to showcase real business use cases, success stories and post-deployment analysis. It’s apparent that machine data has turned a corner. The industry is moving well beyond simple logging of machine data. Users integrate, analyze and leverage their vast resource of device data for business intelligence and competitive advantage.

For example, on the first day ADP shared how they leverage big data for real-time insights. Yahoo! shared details on a deployment of Splunk Enterprise at multi-terabyte scale that is helping to better monitor and manage website properties. Intuit spoke on leveraging Splunk for diagnostics, testing, performance tuning and more. And on the second day, StubHub, Harvard University, Credit Suisse, Sears and Wipro were all featuring compelling uses for Splunk.

What was most exciting to me was the 50+ end users I spoke with who wanted to learn how Pentaho blends data with and in Splunk. Our booth traffic was steady and heavy. Pentaho’s enhanced visualization and reporting demos were a hit not only with the IT attendees, but with the business users who are searching for ways to harness the power of their Splunk data for deeper insights.

Q: Does attendance indicate a bigger/growing appetite for analysis of machine data?

Splunk is helping to uncover new information and insights – tapping into the myriad of data types Splunk can support as a data platform. It’s clearly making an impact in the enterprise. Yet as all these organizations increasingly turn to Splunk to collect, index and harness their machine-generated big data, there is tremendous opportunity for them to turn to Pentaho, a Splunk Powered Technology Partner, to tap and combine Splunk data with any other data source for deeper insights.

Q: How is the market developing for machine data analytics?

We are seeing the market here change from being driven by the technologists, to being driven by the business user.  The technology has advanced and now has the scale, the flexibility and the models to make real business impacts for the enterprise.  The use cases are clearly defined now and the technology fits the customer needs.  The level of collaboration between the major players like Pentaho, Splunk and Hadoop vendors now presents CIOs with real value.

Q: You were invited this year to speak on a CXO Panel addressing Big Data challenges and opportunities. What were some of the highlights?

The CXO panel was fantastic. It was quite an honor to present and be on a panel with four founders and “rock stars” in Big Data: Matt Pfeil (DataStax), M.C. Srivas (MapR), Ari Zilka (Hortonworks) and Amr Awadallah (Cloudera).

Over a panel session that ran for 90 minutes, we tackled subjects on big data challenges. We heard that Splunk users are dealing with quite a few of the same questions and challenges.

Business users and IT professionals who are just getting started are struggling with which project to pick first and how to take the first steps. My advice is to pick a real business use case and push us vendors to do a proof of concept with you and your team, showing quantifiable results in 30 days.

We also heard a lot of questions about which vendor has the right answer to their individual use scenarios and challenges. It was great to see all of the panelists on the same page in their response. No one vendor has all the answers. As I mentioned on the panel, if any Big Data player tells you they can solve all your Big Data problems, you should disqualify them! Users need Splunk, they need Pentaho and they need Hadoop.

Q: Taking a high level view of the conference, what trends can you identify?

There were two major trends taking center stage. Business people were asking business questions, and almost everyone was looking to map adoption to real business use cases.  And again, there’s a clear awareness that no one vendor can answer all of their questions. They are all looking at how to best assemble Hadoop, along with Pentaho and extend their use of Splunk with those technologies.

Q: Pentaho and Splunk are demonstrating the new Pentaho Business Analytics and Splunk Enterprise offering, providing a first look to conference attendees. What kind of reaction are you getting from the demos?

The reaction from the audiences was tremendous. We had two sets of reactions. The end user customers took the time to go in-depth with technology demos and asked questions like where Splunk ends and where Pentaho begins.  The demo we showed drew the business user in too. It was a very powerful visualization of how we can enable a Splunk enterprise to solve business problems.

The Splunk sales teams who visited the booth and saw the demo were able to clearly discuss how to position a total solution for their customer.

Learn more about Splunk and Pentaho.

 


Impala – A New Era for BI on Hadoop

November 30, 2012

With the recent announcement of Impala, also known as Cloudera Enterprise RTQ (Real Time Query), I expect the interest in and adoption of Hadoop to go from merely intense to crazy.  We applaud Cloudera’s investment in creating Impala as it moves Hadoop a huge step forward in making Hadoop accessible using existing BI tools.

What is Impala?  Simply put, it enables all of the SQL-based BI and business analytics tools that have been built over the past couple of decades to now work directly on top of Hadoop, providing interactive response times not previously attainable with Hadoop, and many times faster than Hive, the existing SQL-like alternative. And Impala provides pretty complete SQL support, including join and aggregate functions – must-have functions for analytics.

For enterprises this analytic query speed and expressiveness is huge – it means they are now much less likely to need to extract data out of Hadoop and load it into a data mart or warehouse for interactive visualization.  Instead they can use their favorite business analytics tool directly against Hadoop. But of course only Pentaho provides the integrated end-to-end data integration and business analytics capability for both ingesting and processing data inside of Hadoop, as well as interactively visualizing and analyzing Hadoop data.

Over the past few months Cloudera and Pentaho have been partnering closely at all levels including marketing, sales and engineering.  We are proud of the role we played in assisting Cloudera with validating and testing Impala against realistic BI workloads and use cases.  Based on the extremely strong interest we’ve seen, as evidenced by the lines at our booth at the recent Strata big data conference in New York City, the combination of Pentaho’s visual development and interactive visualization for Hadoop with the break-through performance of Cloudera Impala is very compelling for a huge number of enterprises.

- Ian Fyfe, Chief Technology Evangelist, Pentaho


Are You?

June 15, 2012

You might be a badass… but are you a big data badass??

Happy Friday!


Top 10 Reasons Behind Pentaho’s Success

September 2, 2011

To continue our revival of old blog posts, today we have our #2 most popular blog from last July. Pentaho is now 7 years old, with sales continually moving up and to the right. In a crazy economy, many are asking, “What is the reason behind your growth and success?” Richard Daley reflected on this question after reporting on quarterly results in 2010.

*****Originally posted on July 20, 2010*****

Today we announced our Q2 results. In summary Pentaho:

  • More than doubled new Enterprise Edition Subscriptions from Q2 2009 to Q2 2010.
  • Exceeded goals, making Q2 the strongest quarter in company history and the third record quarter in a row.
  • Became the only vendor that lets customers choose the best way to access BI: on-site, in the cloud, or on the go using an iPad.
  • Led the industry with a series of market firsts including delivering on Agile BI.
  • Expanded globally, received many industry recognitions and added several stars to our executive bench.

How did this happen? Mostly because of our laser focus over the past 5 years to build the leading end-to-end open source BI offering. But if we really look closely over the last 12-18 months there are some clear signs pointing to our success (my top ten list):

Top 10 reasons behind Pentaho’s success:

1.     Customer Value – This is the top of my list. Recent analyst reports explain how we surpassed the $2 billion mark during Q2 in terms of cumulative customer savings on business intelligence and data integration license and maintenance costs. In addition, we ranked #1 in terms of value for price paid and quality of consulting services amongst all Emerging Vendors.

2.     Late 2008-Early 2009 Global Recession – this was completely out of our control, but it helped us significantly by forcing companies to look for lower-cost BI alternatives that could deliver the same or better results than the high-priced mega-vendor BI offerings, making #1 more attractive to companies worldwide.

3.     Agile BI – we announced our Agile BI initiative in Nov 2009 and received an enormous amount of press and positive reception from the community, partners, and customers. We showed previews and released RCs in Q1-Q2 2010 and put PDI 4.0 in GA at the end of Q2 2010.

4.     Active Community – A major contributing factor to our massive industry adoption is our growing number of developer stars (the Pentaho army) that continue to introduce Pentaho into new BI and data integration projects. Our community triples the amount of work of our QA team, contributes leading plug-ins like CDA and PAT, writes best-selling books about our technologies and self-organizes to spread the word.

5.    BI Suite 3.5 & 3.6 – 3.5 was a huge release for the company and helped boost adoption and sales in Q3-Q4 2009. This brought our reporting up to and beyond that of competitors. In Q2 2010 the Pentaho BI Suite 3.6 GA took this to another level, including enhancements and new functionality for enterprise security, content management and team development, as well as the new Enterprise Edition Data Integration Server.  The 3.6 GA also includes the new Agile BI integrated ETL, modeling and data visualization environment.

6.     Analyzer – the addition of Pentaho Analyzer to our product lineup in Sept-Oct 2009 was HUGE for our users – the best web-based query and reporting product on the market.

7.     Enterprise Edition 30-Day Free Evaluation – we started this “low-touch/hassle-free” approach in March 2009 and it has eliminated the pains that companies used to have to go through in order to evaluate software.

8.     Sales Leadership – Lars Nordwall officially took over Worldwide Sales in June 2009 and by a combination of building upon the existing talent and hiring great new team members, he has put together a world-class team and best practices in place.

9.     Big Data Analytics – we launched this in May 2010 and have received very strong support and interest in this area. We currently have a Pentaho-Hadoop beta program with over 40 participants. There is a large and unfulfilled requirement for Data Integration and Analytic solutions in this space.

10.   Whole Product & Team – #1-#9 wouldn’t work unless we had all of the key components necessary to succeed – doc, training, services, partners, finance, qa, dev, vibrant community, IT, happy customers and of course a sarcastic CTO ;-)

Thanks to the Pentaho team, community, partners and customers for this great momentum. Everyone should be extremely proud of the fact that we are making history in the BI market. We have a great foundation on which to continue this rapid growth, and with the right team and passion, we’ll push through our next phase of growth over the next 6-12 months.

Quick story to end the note:  I was talking and white boarding with one of my sons a few weeks ago (yes, I whiteboard with my kids) and he was asking certain questions about our business (how do we make money, why are we different than our competitors, etc.) and I explained at a high level how we are basically “on par and in many cases better” than the Big Guys (IBM, ORCL, SAP) with regards to product, we provide superior support/services, yet we cost about 10% as much as they do. To which my son replied, “Then why doesn’t everyone buy our product?”  Exactly.

Richard
CEO, Pentaho


Pentaho’s support of EMC Greenplum HD – what it means and why you should care

May 12, 2011

On Monday we announced our support for the EMC Greenplum distribution of Hadoop called EMC Greenplum HD. You can read about all the details in our press release, Pentaho Makes Hadoop Faster, More Affordable and Easier to Use with EMC.

This week we have been at EMC World in Las Vegas as a sponsor in booth 211 (if you are at the conference come visit us). We’ve had a great crowd and interest in Pentaho BI Suite for Hadoop, Pentaho Data Integration for Hadoop and our new native support for the Greenplum Database GPLoad high performance bulk loader. Two questions that attendees keep asking are: “How is Pentaho supporting EMC Greenplum HD,” and “Why should I care?” You can read my answers below and more details about our announcement in the press release and Pentaho & EMC web page.

How Pentaho supports EMC Greenplum for Hadoop
Pentaho is the only EMC Greenplum partner to provide a complete BI solution from data integration through to reporting, analysis, dashboarding and data mining, from a single BI platform with shared metadata. Pentaho’s support and certification complements the Greenplum distribution of Hadoop by providing an end-to-end data integration and BI suite with the cost advantages of open source that enables:

  • An easy-to-use, graphical ETL environment for input, transformation, and output of Hadoop data;
  • Massively scalable deployment of ETL processing across the Hadoop cluster;
  • Coordination and execution of Hadoop tasks by enabling them to be managed from within the Pentaho management console;
  • Easy spinning off of high performance data marts for interactive analysis;
  • Integration of data from Hadoop with data from other sources for interactive analysis.

Why this is a good thing and how it changes the industry
EMC Greenplum, in combination with key technology partners, for the first time is giving the industry an integrated, supported and certified data management and BI stack that includes storage, a MapReduce framework for processing unstructured data, an analytic database, predictive analytics and business intelligence.

By combining Pentaho’s powerful BI suite with the strength of EMC Greenplum’s storage and data management domain expertise, the industry benefits from maximum data throughput and significantly shorter implementation cycles for new Hadoop deployments.

Already an industry leader in data and storage, EMC is now well-positioned to play a pivotal role in commercializing Hadoop and giving businesses a more cost-effective and simple way to perform advanced analytics in a massively scalable way. For Hadoop to truly get to the next level, it needs to be as easy-to-install and use as off-the-shelf software.

If you are interested to evaluate Pentaho BI Suite and Pentaho Data Integration for the EMC Greenplum distribution of Hadoop, contact us at Pentaho_EMC@pentaho.com

Ian Fyfe
Chief Technology Evangelist
Pentaho

Photos from the Pentaho booth at EMC World this week



Thoughts on last week’s Strata big data conference

February 8, 2011

Last week I attended O’Reilly’s Strata Conference in Santa Clara, California, where Pentaho was an exhibitor. I gave a 5-minute lightning talk during the preceding Big Data Camp “un-conference” on the topic, The importance of the hybrid data model for Hadoop-driven analytics, focusing on the importance of combining big data analytic results with the data elements already in a firm’s existing systems to give business units the answers to questions that were previously not possible or economic to answer (something that of course Pentaho now makes possible). I also sat down for an interview with Mac Slocum, Online Managing Editor at O’Reilly; you can see the video below, where we discuss what kinds of businesses can benefit from big data technologies such as Hadoop, and what the tipping point is for adopting them.


The high quality of attendees and activity at this sold-out conference further confirms, I think, that although development work on solutions for big data has been happening for a few years, this area is undergoing a quantum leap in adoption at businesses both large and small. Simply put, this technology allows them to glean “information” from enormous quantities of often unstructured or semi-structured data in ways that in the past were simply not possible, or were eye-wateringly expensive to achieve, using conventional relational database technologies.

I found that the level of “Big Data” understanding among attendees was quite varied. Questions spanned the entire spectrum, with a few people asking things like “What is Hadoop?” and many along the lines of “Exactly how does Pentaho integrate with Hadoop’s MapReduce framework, HDFS, and Hive?” Some attendees were clearly still in the discovery and learning phase, but many were confidently moving forward with the idea of leveraging big data, and were looking for solutions that make it easier to work with big data technologies such as Hadoop to deliver new information and insights to their businesses. In fact, it is clear that a new type of database professional is rapidly going mainstream: the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.

Ian Fyfe
Chief Technology Evangelist
Pentaho Corporation

Here are some in-action photos of our booth at the Strata Conference


Pentaho’s week of Hadoop

October 15, 2010

Congrats to our new partner Cloudera for putting on a great event this week. Hadoop World 2010 was a huge success – with over 900 attendees. It was great to talk to companies using Hadoop and those looking to solve their big data problems. We were also excited to have such a great showing at our presentation at the very end of the day with standing room only!

On Wednesday, following Hadoop World, the Pentaho Agile BI Tour arrived in NYC. Pentaho and its partner Project Leadership Associates presented a special half-day seminar focused on Agile BI and Big Data. We also hosted the first of three special OEM Power Lunches for companies interested in embedding Pentaho.

For an insider’s look at our Week of Hadoop check out our slide show below.

If you missed our four announcements Tuesday about the availability of Pentaho Data Integration and Pentaho BI Suite for Hadoop and our new partnerships, you can read what the press and analysts have to say:

Hadoop pitched for business intelligence
ITWorld.com,  Joab Jackson

Pentaho Adds Hadoop Support
CTO Edge, Mike Vizard

Pentaho Brings Business Intelligence to Hadoop
ECRMGuide, Paul Shread

Pentaho brings BI, integration to Hadoop
Computer Business Review, Jason Stamper

You may be thinking, what is Hadoop? If so, I recommend checking out the videos by our Chief Geek, James Dixon. The five videos are short, to the point and very informative.

Have a great weekend!
Rebecca Shomair
Director, Corporate Communications
Pentaho Corporation

