Pentaho, Hadoop, and Data Lakes

October 15, 2010

Earlier this week, at Hadoop World in New York, Pentaho announced the availability of our first Hadoop release.

As part of our initial research into the Hadoop arena, I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

  • 80-90% of companies are dealing with structured or semi-structured data (not unstructured).
  • The source of the data is typically a single application or system.
  • The data is typically sub-transactional or non-transactional.
  • There are some known questions to ask of the data.
  • There are many unknown questions that will arise in the future.
  • There are multiple user communities that have questions of the data.
  • The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past the standard way to handle reporting and analysis of this data was to identify the most interesting attributes, and to aggregate these into a data mart. There are several problems with this approach:

  • Only a subset of the attributes is examined, so only pre-determined questions can be answered.
  • The data is aggregated, so visibility into the lowest levels is lost.

Based on the requirements above and the problems with the traditional approach, we have created a concept called the Data Lake to describe an optimal solution.

If you think of a data mart as a store of bottled water – cleansed, packaged, and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine it, dive in, or take samples.

For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture

Cheers,
James Dixon
Chief Geek
Pentaho Corporation

Originally posted on James Dixon’s blog http://jamesdixon.wordpress.com/


Data, Data, Data

October 12, 2010

It’s everywhere and expanding exponentially every day. But it might as well be a pile of %#$& unless you can turn all of that data into information, and do so in a timely, efficient, and cost-effective manner. The old-school vendors don’t operate in a timely (everything is slow), efficient (everything is over-engineered, over-analyzed, over-staffed, etc.), or cost-effective mode (the bloated supertanker needs feeding, and the customer gets to pay for those inefficiencies), so new technologies and business models will drive the innovation that ultimately serves customers and communities.

Back to Data, Data, Data – enter open source technologies like Hadoop and Pentaho BI/DI to bring next-generation big data analytics to market. Hadoop and Pentaho have both been around for about five years, are both driven by very active communities, and have both experienced explosive growth over the last 18 months. Our community members are the ones who came up with the original integration points between the two technologies – not because it was a fun science project, but because they had real business pains they were trying to solve. This all started in 2009: we began development that year, launched our beta program in June 2010 (we had to cap enrollment at 60), ran a Pentaho for Hadoop roadshow (which was oversubscribed), and are now announcing the official release of Pentaho Data Integration and BI Suite for Hadoop.

I’m in NYC today at Hadoop World and we’re making four announcements:

  1. Pentaho for Hadoop – our Pentaho BI Suite and Pentaho Data Integration are now both integrated with Hadoop
  2. Partnership with Amazon Web Services – Pentaho for Hadoop now supports Amazon Elastic Map Reduce (EMR) and S3
  3. Partnership with Cloudera – Pentaho for Hadoop will support certified versions of Cloudera’s Distribution for Hadoop (CDH)
  4. Partnership with Impetus – a major Solutions Provider (over 1,000 employees) with a dedicated Large Data Analytics practice.

Consider this phase I of building out the ecosystem.

We’re all about making Hadoop easy and accessible. Now you can take on those mountains of data and turn them into value. Download Pentaho for Hadoop.

Richard


Pentaho in October: It’s a Hadoop world, we’re just living in it

October 5, 2010

Big things are happening at Pentaho this month, with an emphasis on Big Data. We are headlining Hadoop events in New York, London, and San Diego. If you are attending Hadoop World in New York City on October 12th, make sure to stop by our booth and attend Richard Daley’s session, ‘Putting Analytics in Big Data Analysis’. Then stay around until the 13th for a special FREE half-day seminar, ‘Agile BI Meets Big Data’, with our partners, Project Leadership Associates.

All the while, we’re still on the road with the Agile BI Tour, hitting full force in October. We’ll be visiting 10 more cities around the world with these info-packed seminar and training sessions. Directly following the Agile BI events in New York, San Mateo, and Houston, Pentaho will host a special OEM Power Lunch, where we will explore Pentaho’s architecture, hear some specific OEM use cases, and introduce you to the Pentaho OEM team.

Hadoop Events

Join us to see how we’re simplifying the complexities of Big Data analytics with Pentaho’s biggest initiative to date, Pentaho for Hadoop.

Agile BI Tour: Data to Dashboards in Minutes

Business and technical users alike will benefit from these information-packed half-day seminars. We will demonstrate, and provide step-by-step training on, the Pentaho BI Suite Enterprise Edition.

OEM Power Lunch Series: Enhance Your Product with Modern Business Intelligence

Join us for lunch as we explore Pentaho’s easily embeddable architecture, review specific OEM partner case studies, and introduce you to our OEM team. These lunch sessions will take place from 12:00 to 2:00pm, directly following the Pentaho Agile BI Tour stops in the following cities:

It’s an action-packed month and we look forward to seeing you somewhere along the way.  Wishing you a Happy Halloween from Pentaho!


Pentaho’s BIG, Fast, and Agile August

August 4, 2010

Pentaho is hitting the road this month to show you the world’s first BI integration for Hadoop with our three-city roadshow, ‘Harnessing Hadoop for Big Data’. Next, prepare to see blazing-fast business intelligence when we pair Ingres VectorWise with Pentaho’s Agile BI initiative.

BIG – We’re rolling into town to show you how Pentaho, as the face of Hadoop, can leverage the power of business intelligence and data integration for your Big Data analysis needs.  These live seminars are free but space is limited, so be sure to register now.

  • Harnessing Hadoop for Big Data – Live Seminar Series

FAST & AGILE – See what is possible when you combine the power of Agile BI with Ingres VectorWise, the next generation of analytic database technology, during this live webcast.

  • Blazing Fast, Agile BI with Ingres VectorWise and Pentaho
    • Webcast: Thursday, August 12, 2010

Want to learn more about Pentaho and meet the team? This month we will be holding classroom training sessions in Buenos Aires, Argentina, and here on the home front in Orlando, Florida.

Where else can you find Pentaho?  This month and every month, we invite you to join the conversation with us on Twitter, Facebook, and LinkedIn.

Visit our Events page for more details and updated events.  Here’s to a BIG, Fast, and Agile August!


Top 10 reasons behind Pentaho’s success

July 20, 2010

Today we announced our Q2 results. In summary Pentaho:

  • More than doubled new Enterprise Edition Subscriptions from Q2 2009 to Q2 2010.
  • Exceeded goals, making Q2 the strongest quarter in company history and the third record-setting quarter in a row.
  • Became the only vendor that lets customers choose the best way to access BI: on-site, in the cloud, or on the go using an iPad.
  • Led the industry with a series of market firsts including delivering on Agile BI.
  • Expanded globally, received many industry recognitions and added several stars to our executive bench.

How did this happen? Mostly because of our laser focus over the past five years on building the leading end-to-end open source BI offering. But if we look closely at the last 12-18 months, there are some clear signs pointing to our success (my top ten list):

Top 10 reasons behind Pentaho’s success:

1.     Customer Value – This is the top of my list. Recent analyst reports explain how we surpassed the $2 billion mark during Q2 in cumulative customer savings on business intelligence and data integration license and maintenance costs. In addition, we ranked #1 in value for price paid and quality of consulting services among all Emerging Vendors.

2.     Late 2008-Early 2009 Global Recession – This was completely out of our control, but it helped us significantly by forcing companies to look for lower-cost BI alternatives that could deliver the same or better results than the high-priced mega-vendor BI offerings, making #1 more attractive to companies worldwide.

3.     Agile BI – We announced our Agile BI initiative in Nov 2009 and received an enormous amount of press and a positive reception from the community, partners, and customers. We showed previews and released RCs in Q1-Q2 2010 and put PDI 4.0 into GA at the end of Q2 2010.

4.     Active Community – A major contributing factor to our massive industry adoption is our growing number of developer stars (the Pentaho army) who continue to introduce Pentaho into new BI and data integration projects. Our community triples the output of our QA team, contributes leading plug-ins like CDA and PAT, writes best-selling books about our technologies, and self-organizes to spread the word.

5.     BI Suite 3.5 & 3.6 – 3.5 was a huge release for the company and helped boost adoption and sales in Q3-Q4 2009, bringing our reporting up to and beyond that of competitors. In Q2 2010, the Pentaho BI Suite 3.6 GA took this to another level, with enhancements and new functionality for enterprise security, content management, and team development, as well as the new Enterprise Edition Data Integration Server. The 3.6 GA also includes the new Agile BI integrated ETL, modeling, and data visualization environment.

6.     Analyzer – The addition of Pentaho Analyzer to our product lineup in Sept-Oct 2009 was HUGE for our users – it is the best web-based query and reporting product on the market.

7.     Enterprise Edition 30-Day Free Evaluation – We started this “low-touch/hassle-free” approach in March 2009, and it has eliminated the pain that companies used to go through in order to evaluate software.

8.     Sales Leadership – Lars Nordwall officially took over Worldwide Sales in June 2009. By building upon the existing talent and hiring great new team members, he has assembled a world-class team and put best practices in place.

9.     Big Data Analytics – We launched this in May 2010 and have received very strong support and interest. We currently have a Pentaho-Hadoop beta program with over 40 participants. There is a large, unmet need for data integration and analytic solutions in this space.

10.   Whole Product & Team – #1-#9 wouldn’t work unless we had all of the key components necessary to succeed: doc, training, services, partners, finance, QA, dev, a vibrant community, IT, happy customers, and of course a sarcastic CTO ;-)

Thanks to the Pentaho team, community, partners, and customers for this great momentum. Everyone should be extremely proud of the fact that we are making history in the BI market. We have a great foundation on which to continue this rapid growth, and with the right team and passion, we’ll push through our next phase of growth over the next 6-12 months.

Quick story to end the note: I was talking and whiteboarding with one of my sons a few weeks ago (yes, I whiteboard with my kids), and he was asking questions about our business (how do we make money, why are we different from our competitors, etc.). I explained at a high level how we are basically “on par and in many cases better” than the Big Guys (IBM, ORCL, SAP) with regard to product, and that we provide superior support and services, yet cost about 10% of what they do. To which my son replied, “Then why doesn’t everyone buy our product?” Exactly.

Richard
CEO, Pentaho


Where to find Pentaho this June

June 15, 2010

June may be halfway over, but there are still 20 opportunities to learn about Pentaho this month at live and virtual events… and in six languages!

This month Pentaho is bringing a ray of open source BI sunshine to some of the industry’s preeminent cloud events. Following the successful announcements of Pentaho’s On-Demand BI Solution and support for Apache Hadoop, we will demonstrate these offerings in action, bringing insight, clarity, and flexibility to data in the cloud.

Pentaho Featured Cloud Events

GigaOm Structure 2010, June 23-24, 2010, San Francisco, CA – Join Pentaho’s CEO, Richard Daley, and CTO, James Dixon, at Structure 2010 to learn more about using Pentaho’s data integration and analytic tools to more quickly and easily load, access, and analyze data in Hadoop, whether it’s on-premises or in the cloud.

In the exhibit hall, see a live preview demo of Pentaho’s integration with Hadoop and learn about the integration of the Pentaho BI Suite with the Hive database. Take advantage of our 25% sponsor discount code.

Hadoop Summit, June 29, 2010, Santa Clara, CA – Pentaho is attending the third annual Hadoop Summit 2010. Organized by Yahoo!, Hadoop Summit sessions span numerous industries and cater to all levels of expertise. Richard Daley and Jake Conelius will be on hand at the conference to demo and discuss Pentaho’s integration with Hadoop and Hive, and the benefits of “Pentaho becoming the face of Hadoop.” They will also pass out limited-edition Hadoop elephants wearing Pentaho sweaters.

Pentaho Agile BI Events

Pentaho’s Agile BI initiative is moving full speed ahead, as we recently delivered the Pentaho Data Integration 4.0 GA. To learn how to get started and why, make sure to attend one of these Agile BI-focused events in the US and Europe:

North America

Worldwide webinars

Italy

Germany

Spain

UK

France

Norway

Visit the Events and Webcast section of our website to stay up-to-date on virtual and live events.

We want to connect with you. Join the conversation on Twitter and Facebook, and get the inside scoop from the BI from the Swamp blog by signing up to receive posts via email or RSS.

Rebecca Goldstein
Director, Corporate Communications
Pentaho Corporation


Six reasons why Pentaho’s support of Apache Hadoop is great news for ‘big data’

May 19, 2010

Earlier today Pentaho announced support for Apache Hadoop – read about it here.

There are many reasons we are doing this:

  1. Hadoop lacks graphical design tools – Pentaho provides pluggable design tools.
  2. Hadoop is Java – Pentaho’s technologies are Java.
  3. Hadoop needs embedded ETL – Pentaho Data Integration is easy to embed.
  4. Pentaho’s open source model enables us to provide technology with great price/performance.
  5. Hadoop lacks visualization tools – Pentaho has those.
  6. Pentaho provides a full suite of ETL, Reporting, Dashboards, Slice ‘n’ Dice Analysis, and Predictive Analytics/Machine Learning.

The thing is, Pentaho is the only technology that satisfies all of these points in combination.

You can see a few of the upcoming integration points in the demo video (above). The ones shown in the video are only a few of the many integration points we are going to deliver.

Most recently I’ve been working on integrating the Pentaho suite with the Hive database. This enables desktop and web-based reporting, integration with the Pentaho BI platform components, and integration with Pentaho Data Integration. Between these use cases, hundreds of different components and transformation steps can be combined in thousands of different ways with Hive data. I had to make some modifications to the Hive JDBC driver, and we’ll be working with the Hive community to get these changes contributed. These are the minimal changes required to get some of the Pentaho technologies working with Hive. Currently the changes live in a local branch of the Hive codebase – more specifically, in a ‘Short-term Rapid-Iteration Minimal Patch’ fork: a SHRIMP Fork.

Technically, I think the most interesting Hive-related feature so far is the ability to call an ETL process within a SQL statement (as a Hive UDF). This enables all kinds of complex processing and data manipulation within a Hive SQL statement.
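To make the UDF idea concrete, here is a toy sketch – not Pentaho’s or Hive’s actual API, just an illustration of the pattern. It uses SQLite via Python’s standard sqlite3 module to register an ordinary function as a SQL UDF, mirroring the way a Hive UDF lets row-level transformation logic (in our case, a full ETL process) run inside a SELECT statement:

```python
import sqlite3

# A toy "ETL transformation" applied per row: normalize a raw log field.
# In the Hive/Pentaho scenario described above, this would be a full PDI
# transformation invoked as a Hive UDF inside the SQL statement.
def clean_region(raw):
    region = raw.strip().lower()
    return region if region in {"us", "eu", "apac"} else "other"

conn = sqlite3.connect(":memory:")
# SQLite lets us register a Python function as a SQL UDF, analogous to
# registering a custom UDF with Hive.
conn.create_function("clean_region", 1, clean_region)

conn.execute("CREATE TABLE weblogs (region TEXT, hits INTEGER)")
conn.executemany("INSERT INTO weblogs VALUES (?, ?)",
                 [(" US ", 10), ("eu", 5), ("mars", 2)])

# The transformation runs inline, per row, during query evaluation.
rows = conn.execute(
    "SELECT clean_region(region), SUM(hits) FROM weblogs "
    "GROUP BY clean_region(region) ORDER BY 1").fetchall()
print(rows)  # [('eu', 5), ('other', 2), ('us', 10)]
```

The table name, column names, and the clean_region function are hypothetical; the point is only that embedding transformation logic in the query itself avoids a separate pre-processing pass over the data.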

There are many more Hadoop-related ETL and BI features and tools to come from Pentaho.  It’s gonna be a big summer.

James Dixon
Chief Geek
Pentaho Corporation

Learn more - watch the demo



Big data should not mean big cost

May 19, 2010

Data is exploding at rates our industry has never seen before, and the huge opportunity to leverage this data is stymied by the archaic licensing practices still in use by the old-school software companies. Currently, the big guys like Oracle, IBM, SAP, Teradata, and other proprietary database and data warehouse vendors have a very simple solution to “big data” environments: just keep charging more money, a lot more money. The only “winners” in this scenario are the software sales reps. Our industry is artificially slowed in order to support these old-school business models – the vendors can’t afford to innovate in licensing, and they surely don’t want to kill the golden goose: the perpetual license fee.

A major gaming company, for example, had been using Oracle for its database and BI technology. With traffic reaching 100 million to 1 billion impressions per day, the database giant’s only answer was to sell more expensive licenses. Even then, the best it could do was analyze four days’ worth of information at a time.

Organizations like Mozilla, Facebook, Amazon, Yahoo, RealNetworks, and many others are now collecting immense amounts of structured and unstructured data. The size of weblogs alone can be enormous. Management wants to be able to triangulate what people are doing at their sites in order to do a better job of:

a)     Turning prospects into customers;
b)     Offering customers what they want in a more timely manner;
c)     Spotting trends and reacting to them in real time.

Any company, small or large, that is trying to sift through terabytes of structured and complex data on an hourly, daily, or weekly basis for any kind of analytics had better take a long, hard look at what it is really paying for. Just as the worldwide recession of 2008-09 brought tremendous attention to lower-cost, better-value alternatives like Pentaho, the “big data” movement is doing the same thing in the DB/DW space. And where do you find some of the best innovations in the tech space? The answer is open source.

Specifically, an open source technology called Apache Hadoop is delivering that better value proposition for Big Data. It is also the only technology capable of handling some of these big data applications. Sounds great, right? Well, not exactly. The issue with Hadoop is that it is a very technical product with a command-line interface. Once data gets into Hadoop, how do you get it out? How do you analyze it? If only there were an ETL and BI product tightly integrated with Hadoop, and available with the right licensing terms…

Today I’m proud to announce that Pentaho has done just that. Earlier today, May 19th, we announced our plans to deliver the industry’s first complete end-to-end data integration and business intelligence platform to support Apache Hadoop. Over the next few months we’ll be rolling out versions of our Pentaho Data Integration product and our BI Suite products that will provide Hadoop installations with a rich, visual analytical solution. Early feedback from joint Hadoop-Pentaho sites has been extremely positive, and the excitement level is high.

Hadoop came out of the Apache open source camp and is the best technology around for storing monster data sets. Until recently, only a small number of organizations used it, primarily those with deep technical resources. As the technology matures, however, the audience is widening – and now, with a rich ETL and analytical solution, it is about to get even bigger.

Stay tuned to our website and to this blog, as I’ll be sharing many success stories over the next 90 days. And most importantly, watch out for the ‘golden goose’ licensing schemes from the old-school vendors.

Richard

Visit www.pentaho.com/hadoop to watch a demo of Pentaho Enterprise integration with Hadoop and reserve your place in the beta program.

