Pentaho Data Integration 4 Cookbook – Win a Free Copy

July 19, 2011

Pentaho is very fortunate to have such a fantastic community. There are a few community rockstars that find time in their uber busy lives to write books about using Pentaho. The latest book published, the Pentaho Data Integration 4 Cookbook by co-authors Adrián Pulvirenti & Maria Carina Roldán is making its way to the top of the Amazon bestseller tech list. Even more impressive – this is Maria’s second book about PDI in just 15 months! (In April 2010 she published PDI 3.2: Beginner’s Guide). We were interested to learn more about the book and the authors. Check out our interview below to get the inside scoop about the PDI 4 Cookbook.

Read on to learn how to win a FREE copy of the PDI 4 Cookbook and to get a special discount offer from Packt Publishing.

1) What inspired you to write the PDI 4 cookbook so soon after “PDI 3.2 for beginners”?
Maria: At the time PDI 3.2 for Beginners was published, there was a clear need for a book that revealed the secrets of Kettle, in particular for those who barely knew the tool. The book was very well received, especially by the Pentaho Community. Today I can say that the main inspiration was definitely that rewarding feedback.

On the other hand, when that book was published, Pentaho was about to release PDI 4. From a beginner’s perspective, there are no big differences between Kettle 3.2 and Kettle 4, so nothing prevents you from learning Kettle 4 with the help of the Beginner’s book. However, Kettle 4 brought a lot of new features that deserved to be explained. This was also a motivation for writing this new book.

2) What is the main goal behind the book?  What do you aim to bring across?
Adrián: This book is intended to help the reader quickly solve the problems that might appear while he or she is developing jobs and transformations. It doesn’t cover PDI basics – the Beginner’s book does. Instead, it focuses on giving PDI users quick solutions to particular issues, such as:

  • Can I generate complex XML structures with Kettle?
  • How do I execute a transformation in a loop?
  • What do I need to attach a file to an email?

These are common questions solved in the book through quick, easy-to-follow recipes with different difficulty levels.

3) Where did you find the inspiration for this new book?
Maria: The main inspiration for this book was the PDI forum; many of the recipes explained in the book are the answers to questions that appear in the forum again and again, for example: how to use variables, how to read an XML file, how to create multi-sheet Excel files, how to pass parameters to transformations, etc. Just to give an example, the recipe “Executing part of a job once for every row in a dataset” explains how to loop over a set of entities (people, product codes, filenames, or whatever), which is a recurring issue in the Kettle forum.

Besides that, Kettle itself was an inspiration. While outlining the contents of the book, and with the aim of having a diversified set of recipes, we browsed the list of steps and job entries many times thinking: Is there something that we aren’t covering? Are there steps that deserve a recipe by themselves? Many of the recipes that you can find in the Cookbook came out of that exercise. “Programming custom functionality,” a recipe that explains how to use the UDJC step and quickly explains other scripting-related steps, is just one example of this set of recipes.

4) What do you like so much about Pentaho (Data Integration) to make you write books about it?
Maria: I have used Kettle since the 2.4 version, when many of the tasks could only be done with JavaScript steps. Despite that, I already admired the flexibility and power of the tool. Since then, Kettle has really improved in performance, functionality and look & feel. Its capabilities are endless, and this goes unnoticed by many users. That’s what makes me write about it: the need to uncover those hidden features and explain how easily you can do things with Kettle.

Adrián: In my daily work I integrate all kinds of data: XML files, plain text files, databases, and so on. Anyone facing these tasks knows the time and effort required to accomplish them. Meeting Kettle was love at first sight. Thanks to Kettle I realized that these formerly tedious tasks can be done in a fast, fun and easy way. I liked the idea of writing this book to share my own experiences with other people.

5) When can we expect the next book(s)?
Adrián: Just like Kettle, the whole Pentaho Suite has grown a lot in recent years. There is undoubtedly much to write about it.

However, at this time we’d like to enjoy the recently published book and look forward to the feedback of the Pentaho community.


Win a free Pentaho Data Integration 4 Cookbook. Like Pentaho on Facebook and leave a comment here about which chapter(s) or recipe(s) you think will be most useful for you and why (you can see the full index in the book here). You also have the chance to win on Twitter by following Pentaho and tweeting your comment with the hashtag #PDI4. Maria and Adrián will pick their favorite comment to win. Deadline to leave a comment is July 26 at 12pm/EST.

Packt Publishing is offering Pentaho BI from the Swamp readers an exclusive 20% discount on the Pentaho Data Integration 4 Cookbook when purchased through PacktPub.com. At the shopping cart, simply enter the discount code PentahoDI20 (case sensitive).

***Update July 27***
The free book goes to Mike Dugan. As Adrián explains, “Because he expressed in a few words the essence of chapter 7, which is one of our favorites.”

Mike’s response on his favorite chapter and why: “Chapter 7 is the key here. Who wants to recreate the wheel??? Just like Newton I believe in the conservation of energy…. Especially MY energy. Do it once, use it a lot, look like a rock star with minimal effort.”

Well said! Congrats Mike, you will receive a free copy of the PDI Cookbook courtesy of Packt Publishing soon.

Read all the responses here


High availability and scalability with Pentaho Data Integration

March 31, 2011

“Experts often possess more data than judgment.” – Colin Powell. Hmmm, those experts surely are not using a scalable Business Intelligence solution to optimize that data and help them make better decisions. :-)

Data is everywhere! The amount of data being collected by organizations today is growing explosively. In general, ETL (Extract, Transform, Load) tools have been designed to move, cleanse, integrate, normalize and enrich raw data to make it meaningful and available for knowledge workers and decision support systems. Only once data has been “optimized” can it be turned into “actionable” information using the appropriate business applications or Business Intelligence software. This information could then be used to discover how to increase profits, reduce costs or even write a program that suggests what your next movie on Netflix should be. The capability to pre-process this raw data before making it available to the masses becomes increasingly vital to organizations that must collect and merge data and create a centralized repository containing “one version of the truth.” Having an ETL solution that is always available, extensible and highly scalable is an integral part of processing this data.

Pentaho Data Integration

Pentaho Data Integration (PDI) can provide such a solution for many varying ETL needs. Built upon an open Java framework, PDI uses a metadata-driven design approach that eliminates the need to write, compile or maintain code. It provides an intuitive design interface with a rich library of prepackaged, pluggable design components. ETL developers with skill sets that range from novice to Data Warehouse expert can take advantage of the robust capabilities available within PDI immediately, with little to no training.

The PDI Component Stack

Creating a highly available and scalable solution with Pentaho Data Integration begins with understanding the PDI component stack.

● Spoon – IDE for creating Jobs and Transformations, including the semantic layer for the BI platform
● Pan – command line tool for executing Transformations modeled in Spoon
● Kitchen – command line tool for executing Jobs modeled in Spoon
● Carte – lightweight ETL server for remote execution
● Enterprise Data Integration Server – remote execution, version control repository, enterprise security
● Java API – write your own plug-ins or integrate into your own applications

Spoon is used to create the ETL design flow in the form of a Job or Transformation on a developer’s workstation. A Job coordinates and orchestrates the ETL process with components that control file movement, scripting, conditional flow logic and notification, as well as the execution of other Jobs and Transformations. The Transformation is responsible for the extraction, transformation and loading or movement of the data. The flow is then published or scheduled to the Carte or Data Integration Server for remote execution. Kitchen and Pan can be used to call PDI Jobs and Transformations from your external command line shell scripts or third-party programs. There is also a complete Java SDK available to integrate and embed these processes into your Java applications.
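As a rough illustration of calling Kitchen and Pan from a third-party program, the Python sketch below only assembles the command line; it does not run anything. The install directory, file paths and the LOAD_DATE parameter are hypothetical, and the -file, -level and -param: options are written from memory of the PDI 4 command line tools, so check them against the documentation for your version.

```python
import shlex
from pathlib import PurePosixPath

# Assumption: PDI is unpacked under this directory; adjust for your install.
PDI_HOME = PurePosixPath("/opt/pentaho/data-integration")


def build_command(tool, file_path, level="Basic", params=None):
    """Return the argument list for running a Job (kitchen) or a Transformation (pan)."""
    if tool not in ("kitchen", "pan"):
        raise ValueError("tool must be 'kitchen' or 'pan'")
    args = [str(PDI_HOME / f"{tool}.sh"), f"-file={file_path}", f"-level={level}"]
    # Named parameters are appended as -param:NAME=value, sorted for a stable order.
    for name, value in sorted((params or {}).items()):
        args.append(f"-param:{name}={value}")
    return args


# Hypothetical job file and parameter, for illustration only.
job_cmd = build_command("kitchen", "/etl/load_sales.kjb", params={"LOAD_DATE": "2011-03-31"})
print(shlex.join(job_cmd))
```

The resulting list can be handed to subprocess.run(job_cmd, check=True), which treats any non-zero exit code from the tool as a failure.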

Figure 1: Sample Transformation that performs some data quality and exception checks before loading the cleansed data

PDI Remote Execution and Clusters

The core of a scalable/available PDI ETL solution involves the use of multiple Carte or Data Integration servers defined as “Slaves” in the ETL process. The remote Carte servers are started on different systems in the network infrastructure and listen for further instructions. Within the PDI process, a Cluster Scheme can be defined with one Master and multiple Slave nodes. This Cluster Scheme can be used to distribute the ETL workload in parallel appropriately across these multiple systems. It is also possible to define Dynamic Clusters where the Slave servers are only known at run-time. This is very useful in cloud computing scenarios where hosts are added or removed at will. More information on this topic including load statistics can be found here in an independent consulting white paper created by Nick Goodman from Bayon Technologies, “Scaling Out Large Data Volume Processing in the Cloud or on Premise.”

Figure 2: Cx2 means these steps are executed clustered on two Slave servers
All other steps are executed on the Master server
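The idea of a Cluster Scheme, one Master handing work to several Slaves and gathering the results, can be sketched in a few lines of Python. This is a conceptual model only, not how PDI or Carte is implemented: the slave names are made up, the "step" is an ordinary function, and round-robin distribution is an assumption chosen for simplicity.

```python
from collections import defaultdict
from itertools import cycle


def distribute(rows, slaves):
    """Hand rows to slave names in round-robin order."""
    if not slaves:
        raise ValueError("a cluster needs at least one slave")
    assignment = defaultdict(list)
    for row, slave in zip(rows, cycle(slaves)):
        assignment[slave].append(row)
    return dict(assignment)


def run_clustered(rows, slaves, step):
    """Apply `step` to each slave's share, then gather the results on the master."""
    shares = distribute(rows, slaves)
    partial = {slave: [step(row) for row in share] for slave, share in shares.items()}
    # The master collects whatever each slave produced, one slave after another.
    return [result for slave in slaves for result in partial.get(slave, [])]


# Hypothetical two-slave cluster, matching the "Cx2" notation in Figure 2.
shares = distribute(range(1, 7), ["slave-1", "slave-2"])
totals = run_clustered(range(1, 7), ["slave-1", "slave-2"], lambda amount: amount * 2)
```

Note that the gathered rows come back grouped by slave rather than in input order, which is one reason a clustered flow may need an explicit sort step when order matters.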

The Concept of High Availability, Recoverability and Scalability

Building a highly available, scalable, recoverable solution with Pentaho Data Integration can involve a number of different parts, concepts and people. It is not a check box that you simply toggle when you want to enable or disable it. It involves careful design and planning to prepare for and anticipate the events that may occur during an ETL process. Did the RDBMS go down? Did the Slave node die? Did I lose network connectivity during the load? Was there a data truncation error at the database? How much data will be processed at peak times? The list can go on and on. Fortunately, PDI arms you with a variety of components, including complete ETL metric logging, web services and dynamic variables, that can be used to build recoverability, availability and scalability scenarios into your PDI ETL solution.

For example, Jens Bleuel, Managing Consultant in EMEA, developed a PDI implementation of the popular Watchdog concept. A solution that uses a Watchdog includes checks that monitor whether everything is on track while it executes its tasks and events. Visit the link above for more information on this implementation.
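To make the Watchdog idea concrete, here is a minimal, generic sketch in Python. It is not Jens Bleuel's PDI implementation; the class, its method names and the task names in the usage note are all hypothetical. Each task reports a heartbeat when it makes progress, and the watchdog flags tasks that have been silent for longer than a timeout.

```python
import time


class Watchdog:
    """Tracks the last heartbeat of each task and flags the ones that went quiet."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock  # injectable so the behaviour can be tested without waiting
        self.last_seen = {}

    def heartbeat(self, task):
        """Record that a task made progress just now."""
        self.last_seen[task] = self.clock()

    def finished(self, task):
        """Stop watching a task that completed normally."""
        self.last_seen.pop(task, None)

    def overdue(self):
        """Return the tasks whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            task for task, seen in self.last_seen.items() if now - seen > self.timeout
        )
```

In use, a scheduler would call heartbeat("load_orders") as the load progresses and poll overdue() periodically; anything returned is a candidate for an alert, a retry or a failover.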

Putting it all together – (Sample)

Diethard Steiner, active Pentaho Community member and contributor, has written an excellent tutorial that explains how to set up PDI ETL remote execution using the Carte server. He also provides a complete tutorial (including sample files provided by Matt Casters, Chief Architect and founder of Kettle) on setting up a simple “available” solution to process files, using Pentaho Data Integration. You can get it here. Please note that advanced topics such as this are also covered in greater detail (designed by our Managing Consultant Jens Bleuel – EMEA) in our training course available here.

Summary

When attempting to process the vast amounts of data collected on a daily basis, it is critical to have a Data Integration solution that is not only easy to use but also easily extensible. Pentaho Data Integration achieves this extensibility with its open architecture, component stack and object library, which can be used to build a scalable and highly available ETL solution without exhaustive training and with no code to write, compile or maintain.

Happy ETLing.

Regards,

Michael Tarallo
Senior Director of Sales Engineering
Pentaho

This blog was originally published on the Pentaho Evaluation Sandbox, a comprehensive resource for evaluating and testing Pentaho BI.

