High availability and scalability with Pentaho Data Integration

March 31, 2011

“Experts often possess more data than judgment.” – Colin Powell. Hmm, those experts surely are not using a scalable Business Intelligence solution to optimize that data and help them make better decisions. :-)

Data is everywhere! The amount of data being collected by organizations today is experiencing explosive growth. In general, ETL (Extract, Transform, Load) tools are designed to move, cleanse, integrate, normalize and enrich raw data to make it meaningful and available to knowledge workers and decision support systems. Only once data has been “optimized” can it be turned into “actionable” information using the appropriate business applications or Business Intelligence software. This information can then be used to discover how to increase profits, reduce costs or even write a program that suggests what your next movie on Netflix should be. The capability to pre-process this raw data before making it available to the masses becomes increasingly vital to organizations that must collect and merge data into a centralized repository containing “one version of the truth.” An ETL solution that is always available, extensible and highly scalable is an integral part of processing this data.

Pentaho Data Integration

Pentaho Data Integration (PDI) can provide such a solution for many varying ETL needs. Built upon an open Java framework, PDI uses a metadata-driven design approach that eliminates the need to write, compile or maintain code. It provides an intuitive design interface with a rich library of prepackaged, pluggable design components. ETL developers with skill sets ranging from novice to Data Warehouse expert can take advantage of the robust capabilities available within PDI immediately, with little to no training.

The PDI Component Stack

Creating a highly available and scalable solution with Pentaho Data Integration begins with understanding the PDI component stack.

● Spoon – IDE for creating Jobs and Transformations, including the semantic layer for the BI platform
● Pan – command line tool for executing Transformations modeled in Spoon
● Kitchen – command line tool for executing Jobs modeled in Spoon
● Carte – lightweight ETL server for remote execution
● Enterprise Data Integration Server – remote execution, version control repository, enterprise security
● Java API – write your own plug-ins or integrate into your own applications

Spoon is used to create the ETL design flow in the form of a Job or Transformation on a developer’s workstation. A Job coordinates and orchestrates the ETL process with components that control file movement, scripting, conditional flow logic and notification, as well as the execution of other Jobs and Transformations. The Transformation is responsible for the extraction, transformation and loading or movement of the data. The flow is then published or scheduled to the Carte or Data Integration Server for remote execution. Kitchen and Pan can be used to call PDI Jobs and Transformations from your external command line shell scripts or 3rd party programs. There is also a complete Java SDK available to integrate and embed these processes into your Java applications.
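As a rough sketch of how Kitchen and Pan are typically called from a shell script (the installation directory, file paths and parameter values below are hypothetical examples, not taken from the original post):

```shell
#!/bin/sh
# Hypothetical PDI installation directory and ETL file paths
cd /opt/pentaho/data-integration

# Run a Job file with Kitchen at "Basic" log level
./kitchen.sh -file=/etl/jobs/load_warehouse.kjb -level=Basic

# Run a Transformation file with Pan
./pan.sh -file=/etl/transformations/cleanse_customers.ktr -level=Basic

# Kitchen and Pan exit with a non-zero status on failure, so an
# external scheduler (cron, a batch script, etc.) can detect and
# react to ETL errors
echo "PDI exit code: $?"
```

The non-zero exit status on failure is what makes these command-line tools straightforward to wire into third-party schedulers and monitoring scripts.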

Figure 1: Sample Transformation that performs some data quality and exception checks before loading the cleansed data

PDI Remote Execution and Clusters

The core of a scalable, available PDI ETL solution involves the use of multiple Carte or Data Integration servers defined as “Slaves” in the ETL process. The remote Carte servers are started on different systems in the network infrastructure and listen for further instructions. Within the PDI process, a Cluster Scheme can be defined with one Master and multiple Slave nodes. This Cluster Scheme can be used to distribute the ETL workload in parallel across these multiple systems. It is also possible to define Dynamic Clusters, where the Slave servers are only known at run-time. This is very useful in cloud computing scenarios where hosts are added or removed at will. More information on this topic, including load statistics, can be found in an independent consulting white paper created by Nick Goodman of Bayon Technologies, “Scaling Out Large Data Volume Processing in the Cloud or on Premise.”

Figure 2: “Cx2” means these steps are executed clustered on two Slave servers; all other steps are executed on the Master server
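As an illustrative sketch of how the remote Carte slaves described above are started, here is a minimal command-line example; the host name, port and configuration file contents are placeholder assumptions, not a definitive template:

```shell
# Start a Carte slave listening on a given host and port
# (slave1.example.com and 8081 are example values)
./carte.sh slave1.example.com 8081

# Carte can also be started from an XML configuration file:
# ./carte.sh slave-config.xml
#
# A minimal slave-config.xml might look like this (a sketch):
#
# <slave_config>
#   <slaveserver>
#     <name>slave1</name>
#     <hostname>slave1.example.com</hostname>
#     <port>8081</port>
#   </slaveserver>
# </slave_config>
#
# Once the slaves are running, the Master defined in the Cluster
# Scheme dispatches partitioned work to each registered slave.
```

Repeating this on several machines (or cloud instances, for Dynamic Clusters) gives the pool of slaves that the Cluster Scheme distributes work across.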

The Concept of High Availability, Recoverability and Scalability

Building a highly available, scalable, recoverable solution with Pentaho Data Integration involves a number of different parts, concepts and people. It is not a check box that you simply toggle on or off. It takes careful design and planning to anticipate the events that may occur during an ETL process. Did the RDBMS go down? Did a Slave node die? Did I lose network connectivity during the load? Was there a data truncation error at the database? How much data will be processed at peak times? The list goes on and on. Fortunately, PDI arms you with a variety of components, including complete ETL metric logging, web services and dynamic variables, that can be used to build recoverability, availability and scalability scenarios into your PDI ETL solution.

For example, Jens Bleuel, Managing Consultant in EMEA, developed a PDI implementation of the popular Watchdog concept: a solution that includes checks to monitor whether everything remains on track as its tasks and events execute. Visit the link above for more information on this implementation.

Putting it all together – (Sample)

Diethard Steiner, active Pentaho Community member and contributor, has written an excellent tutorial that explains how to set up PDI ETL remote execution using the Carte server. He also provides a complete tutorial (including sample files provided by Matt Casters, Chief Architect and founder of Kettle) on setting up a simple “available” solution to process files using Pentaho Data Integration. You can get it here. Please note that advanced topics such as this are also covered in greater detail in our training course, available here (designed by Jens Bleuel, our Managing Consultant in EMEA).

Summary

When attempting to process the vast amounts of data collected on a daily basis, it is critical to have a Data Integration solution that is not only easy to use but easily extensible. Pentaho Data Integration achieves this with its open architecture, component stack and object library, which can be used to build a scalable and highly available ETL solution without exhaustive training and with no code to write, compile or maintain.

Happy ETLing.

Regards,

Michael Tarallo
Senior Director of Sales Engineering
Pentaho

This blog was originally published on the Pentaho Evaluation Sandbox, a comprehensive resource for evaluating and testing Pentaho BI.


Top 5 Reasons to join the Pentaho Team

March 30, 2011

Several of our executives at Pentaho have shared reasons why they joined Pentaho – you can read their blogs here, here and here. We surveyed employees at Pentaho and narrowed down the top 5 reasons to join the Pentaho team:

  1. The fridge is stocked with beer. Corona, Amstel, Sierra Nevada, we have it all! Changing the face of business intelligence is no easy feat. To help us out, the Pentaho offices have dedicated full-sized beer refrigerators, so we can enjoy our favorite cold ones at the end of the day.
  2. Be a part of revolutionizing the business intelligence industry. Pentaho was founded in 2004 with the mission “to achieve positive, disruptive change in the BI space by building an innovative, class-leading BI platform and making it available to everyone by releasing it into Open Source.” We are working towards this mission every day and succeeding!
  3. Smart, motivated, and fun co-workers. We may work hard, but we know how to have a good time! From sales meetings to Halloween, we’re always in the company of innovative, intelligent, and friendly people.
  4. International opportunities. Being part of an international team means we get to work with customers, partners, and coworkers from around the world.
  5. We’re growing! With over 120% growth year-over-year in 2010, the opportunities for growth at Pentaho are tremendous.

If you think you would be a good fit with our smart, talented and fun group, and are interested in working at a fast-paced, international, game-changing company, check out our current job openings. We’re growing fast and currently have openings worldwide in everything from sales and marketing to services and information systems.

Bonus reason: Last year Pentaho CTO/Chief Geek James Dixon was picked by CNN as having one of the Best Jobs in America.


IT needs vs. Business needs

March 22, 2011

Can Business and IT finally live in harmony when it comes to BI?

This is not a new concept or question. In fact, for the last several years pretty much every BI vendor has claimed to have solved the “Business and IT Collaboration” problem. Or at least their marketing departments have!

To truly solve a problem, we must first fully understand it. In this case it is important to ask questions such as: Why is there a lack of collaboration between these two groups? What is so drastically different about them that has forced such a gap?

The truth is that IT needs central ownership of information to streamline processes and ensure sustainability, while business users want self-service and ownership of their own to gain results faster. After all, business users have become far more analysis- and data-savvy than in the past; the old-school approach of letting IT do the work while they simply consume canned reports doesn’t cut it anymore.

Perhaps this picture illustrates the differences more clearly.

As you can see, these two groups are clearly in conflict when it comes to how they like to manage their information. So, we ask: What will help these two groups to start working in harmony?

The truth is that it won’t happen… unless there is a ‘balance’ between their needs.

As much as business users want quick time to value out of their BI projects, one-off applications are not sustainable over time. They become monsters that are too hard to keep up to date, considering all the changes that happen to business requirements. Sooner or later, business users will need to reach out to their IT friends for help.

The ‘balance’ lies in letting the business users get fast time to value, but still building applications that are sustainable to change. We define this ‘balance’ with an Agile BI approach:

  • Quick prototyping and visualization of the results
  • Frequent iterations and reviews between business and IT users to ‘get it right’
  • Once the data is ‘fit-for-purpose’, providing self-service tools for business users to be self-sufficient in building their own reports, analysis, and dashboards
  • Having a strong ‘shared’ metadata foundation across the board to adjust to changes quickly and to scale up with cumulative iterations

So back to our point about collaboration between business and IT: Is it possible? Yes. Does it happen because a set of ‘tools’ facilitates this collaboration? Not necessarily, but they can help. What is the secret ingredient, then, to ensure such collaboration occurs? Simple: this collaboration happens as long as these two groups need each other and are working towards a set of common, balanced goals for their BI projects. Something that is only possible with an Agile BI approach.

For more information about this topic and to explore how Pentaho has made Agile BI possible, attend our upcoming webinar on How to Fast Track Your BI Projects with Agile BI and see for yourself how Pentaho customers have come to reap the value of their BI projects with Pentaho’s Agile BI initiative.

Farnaz Erfan
Product Marketing Manager
Pentaho Corporation


What’s new in Pentaho BI Suite Enterprise Edition 3.8

March 16, 2011

It’s release time once again and I’m pleased to announce that Pentaho BI Suite 3.8 is available for download! This release is packed with new features empowering you to add more interactivity to your Pentaho-based solutions and improve performance and efficiency when working with larger and larger data volumes. In today’s blog, I’ll highlight just a few of these exciting new enhancements:

Guided Analysis
Building upon the hyperlink feature introduced in Pentaho Reporting with our 3.7 BI suite release, Analyzer Reports and Action Sequences now also provide the ability to create contextual hyperlinks to other pieces of Pentaho content or external URLs. You now have complete flexibility to provide information consumers with guided paths to additional detail or related content found in another report.  For example, with a couple of clicks you could create a summary level Analyzer Report for users to explore and analyze product sales, then enable hyperlinks on product names which link out to a detailed inventory report to ensure there are enough units on hand for your top selling products.

Dashboard Content Linking
Pentaho Dashboard Designer also receives a dose of interactivity with a feature we call content linking. Content linking allows one dashboard element to drive the filtering of another element of the dashboard. This feature is integrated with nearly all dashboard components, including filter controls, dashboard charts, data tables and any items embedded in a dashboard widget such as a Pentaho report or Analyzer view. It can be used for a variety of use cases, including the creation of chained parameters, where selections in one filter control drive the available selections in another, or allowing dashboard consumers to click on a slice of a pie or a bar in a bar chart and have that drive the filtering of other widgets on the dashboard. Be sure to check out the new dashboard samples, Product Sales Performance and Product Performance Dashboard, which illustrate the content linking feature along with the new, expanded set of filter controls including radio groups, check boxes, calendar pickers, button controls and more.

Data-less Design Mode for Analyzer Reports
Since its introduction just over a year ago, Pentaho Analyzer’s elegant combination of power and simplicity has driven exponential growth in the use of Pentaho Analysis. This includes deployments to larger user communities and the development of bigger, more sophisticated Mondrian cubes. Based on feedback from Pentaho customers, we’ve introduced the notion of a data-less report design mode, referred to as ‘auto-refresh’ in the user interface. This allows users to design or modify the layout of an Analyzer report without querying the underlying RDBMS until the designer is ready. This can help reduce database traffic for deployments to large user communities or reduce the design time for reports that depend on large queries. Try it out by clicking the Disable Auto-refresh button on the Analyzer toolbar, designing your query using the Field Layout panel, then clicking the Refresh Report button to issue the query.

Simplified Hadoop MapReduce Job Design
Also included in the 3.8 suite release is Pentaho Data Integration 4.1.2.  While this is primarily a patch release containing important product fixes and performance improvements, we’ve also added a few new features that simplify the design of transformations used to compose MapReduce jobs for Hadoop.  This includes dedicated steps for MapReduce Inputs and Outputs allowing you to simply choose which fields to use as your OutKey and OutValues, rather than having to explicitly filter and rename fields in the stream down to the key-value pair fields you pass back to Hadoop.  Finally, the Transformation Job Executor step now provides the ability to specify a transformation for use as a Combiner, thereby enabling you to optimize the performance of your PDI-based MapReduce jobs.

We hope you enjoy this exciting new release, get started today by downloading your copy at http://www.pentaho.com/download/.

Jake Cornelius
Vice President of Product Management
Pentaho Corporation


Top 5 Reasons to Attend a Fast Track BI Seminar

March 16, 2011

We are trotting the globe to bring Agile Business Intelligence to a city near you. The latest Pentaho seminar series, Fast Track BI, spans over 40 cities around the world, bringing half-day complimentary training sessions to companies looking to implement BI. With Pentaho’s unique Agile BI approach, speed of thought reporting and analysis, and out-of-the-box data integration capabilities, you’ll learn how to put your business intelligence projects on the fast track to success.

5. It’s free! Each half-day seminar is packed with information, training, and business intelligence best practices. Who doesn’t enjoy free education these days?

4. See real-life use cases and live demonstrations of the Pentaho Agile BI approach in action.

3. Network with other CTOs, CIOs, Architects, and IT executives in your area to share best practices in implementing business intelligence.

2. Meet the dedicated Pentaho partner in your area to start the conversation on how your company can start leveraging open source business intelligence.

1. Of the Top 5 Open Source Companies, 3 are in the business intelligence space. Now is the time to get familiar with the next generation of business intelligence.

See the full schedule of cities and register today for a seminar near you. We look forward to seeing you on the road!


Q&A with Pentaho Trainer Lynn Yarbrough

March 14, 2011

Q&A is a series on the Business Intelligence from the Swamp Blog that interviews key members of the Pentaho team to learn more about their focus at Pentaho and outlook on the Business Intelligence industry.

If you have attended a Pentaho training class, most likely you have had a chance to meet the knowledgeable and entertaining Lynn Yarbrough. To get to know Lynn and our training classes better we asked Lynn 5 questions.

1.  What brought you to Pentaho and what do you do?

I am a trainer at Pentaho. After almost 20 years in the BI and Data Integration industry working at companies like Information Builders and Hyperion, I wanted to work at a smaller Business Intelligence company and I found a job posting for a training position at Pentaho on Monster.

2.  How many classes have you taught and do you have a favorite that stands out?

I have been with Pentaho almost 4 years and probably taught 40 classes last year alone.  My favorite class is the Pentaho BI Suite BootCamp.  I especially love the morning of Day 3, because that is when we complete the creation of an OLAP cube and can view the data in Pentaho Analyzer in the Pentaho User Console.  It is very rewarding to see the result of all of the work, and often this is when students have an ‘aha’ moment and say, ‘this is why we need Pentaho, to give our users this power.’

3.  What elements do you think are necessary to make a successful training class?

Knowledge of the product, knowledge of the industry (in my case Business Intelligence) and a sense of humor.  We in training have also made many of the classes very ‘hands-on,’ which helps students learn and retain the product. One of the challenges we face in the classroom is that we are teaching students with a variety of skill levels; to make the class beneficial to all, we have added “optional” labs for the more advanced students.

4. Can you tell us more about the Agile BI for Business Analyst class and how it contrasts with the ‘early days’ of training?

We began talking about and shaping the Agile BI class over 2 years ago, but it didn’t really take off.  The product didn’t have the “ease of use” it has today. Now, Pentaho Agile BI has matured to the level where the class is engaging and quite fun to teach, since it is so easy to create a data model and analyze the data.

5.  As we kick off 2011, what are you most excited about regarding training for the upcoming year?

2011 will add many new key features to the product, which will not only benefit customers but will also make training easier; as the product improves in “ease of use,” it gets easier and more fun to teach.  We also plan to add new training classes suggested by our customers, such as ‘Pentaho for the End User,’ giving hands-on training for our front-end tools, and ‘Pentaho On-Demand,’ for customers interested in gaining more experience with our cloud solution. Things are never boring at Pentaho.

Do you have additional questions for Lynn? Is there someone or a certain role at Pentaho you would like us to interview? Leave your questions in the comments section below. We’d love to hear from you.


February in review

March 7, 2011

For the shortest month of the year, Pentaho made a lot of noise in the press. Not as much as Charlie Sheen, Egypt or Libya, but we had our fair share, as you can see below in our February in review update.

New customer references

This month we were happy to add Brussels Airport to our list of customer successes. Following a competitive proof-of-concept project to replace both its Oracle reporting and an IBM Extract, Transform and Load tool, Brussels Airport selected Pentaho for its reporting and ETL capabilities to achieve its goal of a single Information Delivery Platform. You can read their story in English, French and Dutch.

“Choosing Pentaho meant we could replace not just our reporting, but also our proprietary ETL tool to achieve our goal of a single Information Delivery Platform. Given we found Pentaho gives us at least 80 to 85% of the functionality of proprietary vendors, the cost savings we achieved are hugely significant at an estimated €350K+.”

Pentaho in the news

Interview: Karl de Bruijn, European IT director, Specsavers
Computing.co.uk, Nicola Brittain

The Microsoft Cloud BI Story, Dateline February 2011
SQL Server Magazine, Mark Kromer

Open source’s emerging opportunity in BI
InfoWorld, Savio Rodrigues

Brussels Airport also made headlines worldwide in Germany, Belgium and the UK
Brussels Airport bespaart met open source BI
Computable, Pim van der Beek

Brussels Airport opte pour le BI open source
Datanews.be, Stefan Grommen

Open-source business intelligence solution lands at Brussels airport
Computing.co.uk, Nicola Brittain

To stay up-to-date with the latest Pentaho highlights in the news subscribe to the Highlights RSS.

Event Recap

Pentaho was a hot topic on webcasts this month, being featured on BI-EDGE and Information Management. If you missed these webcasts, no fear: you can now access them on-demand.

BI-EDGE online conference about Business Intelligence and Analytics for the Responsive Enterprise with IBM Cognos, SAS Institute, Pentaho and Kapow Software
Watch Now

Information Management webcast about Better Decisions: The Top Dos and Don’ts of Agile BI with leading industry analyst, Claudia Imhoff
Watch Now

We are already full-speed-ahead in March. Make sure to read the blog ‘Where is Pentaho this March‘ to see when a webcast, live event, or training session is coming to a city near you.

