Announcing Pentaho with Storm and YARN

February 11, 2014

One of Pentaho’s core beliefs is that you can’t prepare for tomorrow with yesterday’s tools. In June of 2013, amidst waves of emerging big data technologies, Pentaho established Pentaho Labs to drive innovation through the incubation of these new technologies. Today, one of our Labs projects hatches. At the Strata Conference in Santa Clara, we announced native integration of Pentaho Data Integration (PDI) with Storm and YARN. This integration enables developers to process big data and drive analytics in real time, so businesses can make critical decisions on time-sensitive information.

Read the announcement here.

Here is what people are saying about Pentaho with Storm and YARN:

Pentaho Customer
Bryan Stone, Cloud Platform Lead, Synapse Wireless: “As an M2M leader in the Internet of Everything, our wireless solutions require innovative technology to bring big data insights to business users. The powerful combination of Pentaho Data Integration, Storm and YARN will allow my team to immediately leverage real-time processing, without the delay of batch processing or the overhead of designing additional transformations. No doubt this advancement will have a big impact on the next generation of big data analytics.”

Leading Big Data Industry Analyst
Matt Aslett, Research Director, Data Management and Analytics, 451 Research: “YARN is enabling Hadoop to be used as a flexible multi-purpose data processing and analytics platform. We are seeing growing interest in Hadoop not just as a platform for batch-based MapReduce but also rapid data ingestion and analysis, especially using Apache Storm. Native support of YARN and Storm from companies like Pentaho will encourage users to innovate and drive greater value from Hadoop.”

Pentaho founder and Pentaho Labs Leader
Richard Daley, Founder and Chief Strategy Officer, Pentaho: “Our customers are facing fast technology iterations from the relentless evolution of the big data ecosystem. With Pentaho’s Adaptive Big Data Layer and Big Data Analytical Platform our customers are “future proofed” from the rapid pace of evolution in the big data environment. In 2014, we’re leading the way in big data analytics with Storm, YARN, Spark and predictive, and making it easy for customers to leverage these innovations.”

Learn more about the innovation of Pentaho Data Integration for Storm on YARN in Pentaho Labs at pentaho.com/storm

If you are at the O’Reilly Strata Conference in Santa Clara this week (February 11-13), make sure to stop by Booth 710 to see a live demo of Pentaho Data Integration with Storm and YARN. The Pentaho team of technologists, data scientists and executives will be on hand to share the latest big data innovations from Pentaho Labs.

Donna Prlich
Senior Director, Product Marketing
Pentaho


edo optimizes data warehouse, increases loyalty and targets new customers

February 10, 2014


What do you do when you need to track, store, blend, and analyze over 6 billion financial transactions, with millions more added daily? edo Interactive, Inc. is a digital advertising company that leverages payment networks to connect brands with consumers. Its legacy data integration and analysis system took more than 27 hours to run, making it nearly impossible to meet daily service level agreements. However, after only a few weeks of implementing a Hadoop-based data distribution with Pentaho for data integration, edo Interactive was able to reduce its processing time to less than 8 hours, and often to as little as 2.

Time savings of at least 70% quickly translated into cost savings. With an optimized data warehouse, edo and its clients also spend less time navigating IT barriers. Pentaho’s graphical user interface removes the cumbersome coding of batch-process jobs, enabling sophisticated yet simplified conversion of data from PostgreSQL to Hadoop, Hive and HBase. edo and its clients quickly gain insights into customer preferences, refine marketing strategies and provide their customers with an improved experience and greater satisfaction.

edo Interactive successfully navigated many of the obstacles faced when implementing a big data environment and created a lasting, scalable solution. Its vision of giving end users a better view of their customers has helped shape a new data architecture and embedded analytics capabilities.

To learn more about edo’s Big Data vision and success, read their customer success overview and case study on Pentaho.com. We are excited to announce that Tim Garnto, SVP of Product Engineering at edo, will share his story live when he presents at O’Reilly + Strata on Thursday, February 13th in Santa Clara (11:30AM, Ballroom G).

Strata Santa Clara is already sold out! If you are interested in learning more about edo’s big data deployment, leave your questions in the comments section below and we will ask Tim during his speaking session at Strata.

Ben Mayer
Customer Marketing
Pentaho


Rackspace brings ETL to the Cloud with Pentaho: Hadoop Summit Q&A

June 27, 2013

This week Pentaho has been meeting with the movers and shakers of the Apache Hadoop community in San Jose, at the 6th annual Hadoop Summit. Pentaho and Rackspace are drawing attention on this final day of the show with the announcement of a partnership that brings ETL to the cloud. We’re introducing Rackspace Big Data, a powerful enterprise-grade Hadoop as a Service solution. As the industry leader in cost-effective data integration for Hadoop, Pentaho is proud to team with Rackspace, the industry leader in enterprise IaaS, to deliver this new era of big data in the cloud.

Photo: (L) Eddie White, EVP of business development, Pentaho | (R) Sean Anderson, product marketing manager for cloud big data solutions, Rackspace Hosting

To learn more about the news, we’re talking today with Pentaho’s Eddie White, executive vice president of business development.

Give us a quick overview of this Rackspace news, and how Pentaho is involved.

Rackspace Big Data is an exciting Hadoop as a Service offering with full enterprise features. It is the next evolution in the big data ecosystem, giving enterprise customers the ongoing flexibility to choose among a variety of consumption models over time: managed dedicated servers, and public, private or hybrid cloud options. Pentaho was chosen as the only Hadoop ETL / data integration partner for this Cloud Tools Hadoop offering.

So is this a solution for enterprise customers looking to grow their big data operations?

Yes, absolutely. Hadoop as a Service is an attractive alternative for customers that need enterprise-level infrastructure support. Pentaho gives Rackspace a partner with the skills and talent on-board to deliver big data for production environments, along with the support and stability that Rackspace customers demand from their service-level agreements. Enterprises are looking for a Cloud partner with an enterprise-grade infrastructure to support running their business; not just test and development efforts.

What makes up this Hadoop as a Service model?

Rackspace, Hortonworks and Pentaho have jointly delivered an offering that facilitates ease of use and ease of adoption of Hadoop as a Service. Rackspace Big Data includes the Hortonworks Data Platform for Hadoop, Pentaho Business Analytics as the ETL / big data integration partner, and Karmasphere providing Hadoop analytics.

Rackspace excels at the enterprise IaaS model, and now they’ve partnered with Hortonworks and Pentaho to introduce an easy-to-use, consume-as-you-scale Hadoop as a Service offering – so customers can get started today, confident their solution will scale along with their big data needs. Rackspace chose to partner with Pentaho because it is the industry-leading Hadoop ETL and Big Data Analytics platform. Rackspace Big Data offers a range of models to meet any organization’s changing needs, from dedicated to hybrid, and for private and public clouds. And the offering ensures the ability to bi-directionally move data in and out of enterprise clusters, with minimal technical effort and cost.

What does Pentaho Data Integration bring to Rackspace Big Data?

Rather than speak for our partner, I’ll let Sean Anderson, Rackspace Hosting’s product marketing manager for cloud big data solutions, answer that. He sums up what Pentaho brings to the partnership nicely:

“Pentaho Data Integration is all about easing adoption and enhancing utilization of Rackspace big data platforms, with native, easy-to-use data integration. Pentaho is leading the innovation of Hadoop Integration and Analytics, and the upcoming cloud offering with Rackspace reduces the barriers to instant success with Hadoop, so customers can adopt and deploy quickly, delivering faster ROI,” said Anderson.

“Pentaho’s powerful data integration engine serves as a platform, enabling delivery of that content right into an enterprise’s pre-existing business intelligence and analytics tools,” continued Anderson. “Rackspace Big Data customers who require multiple data stores can leverage the ease of operation inherent in the visual ETL tool Pentaho provides. Customers will be able to complement their platform offering by adding the validated Pentaho tool via the Cloud Tools Marketplace.”

A key takeaway is that Rackspace Big Data customers may choose to bridge to the Pentaho Business Analytics platform. As an example, Pentaho’s full suite can be used where a Rackspace customer wants to use both Hortonworks and ObjectRocket. We bring the data in both of these databases to life for the Rackspace customer.

Why is Pentaho excited about this announcement?

This is exciting news because it is Pentaho’s first strategic cloud partnership. As the big data market has matured, it’s now time for production workloads to be moved over to Big Data Service offerings. Rackspace is the recognized leader providing the enterprise with IaaS, with an enterprise-grade support model. We see Rackspace as a natural partner for us to make our move into this space. We are market leaders in our respective categories, with proven experience that enterprises trust for service, reliability, scalability and support. As the market for Hadoop and Big Data develops and matures, we see Rackspace as the natural strategic partner for Pentaho to begin providing Big Data / Hadoop as a Service.

How can organizations buy Rackspace Big Data?

For anyone looking to leverage Hadoop as a Service, Rackspace Big Data is available directly from Rackspace. For more information and pricing visit: www.rackspace.com/big-data. Pentaho will also be in the Rackspace Cloud Tools marketplace.


Informatica jumps on the Pentaho bandwagon

June 12, 2013

You know that a technology megatrend has truly arrived when the large vendors start to jump on the bandwagon. Informatica recently announced Informatica Vibe™ — its new virtual data machine (VDM), an embeddable data management engine that allows developers to “Map Once, Deploy Anywhere,” including into Hadoop, without generating or writing code. According to Informatica, developers can instantly become Hadoop developers without having to acquire new skills. Sound familiar?

I applaud Informatica’s efforts – but not for innovating or changing the landscape in data integration.  What I applaud them for is recognizing that the landscape for data integration has indeed changed, and it was time for them to join the party. “Vibe” itself may be new, but it is not a new concept, nor unique to the industry.  In fact, Pentaho recognized the need for a modern, agile, adaptive approach to data integration for OEMs and customers. We pioneered the Kettle “design once, run anywhere” embeddable virtual data engine back in 2005. And let’s set the record straight – Pentaho extended its lightweight data integration capabilities to Hadoop over three years ago as noted in this 2010 press release.

Over the past three years, Pentaho has delivered on big data integration with many successful Hadoop customers, such as BeachMint, MobileThink, TravelTainment and Travian Games, and has continued to innovate not only with Hadoop but also with NoSQL, analytical engines, and other specialized big data stores. We have added test, deploy and real-time monitoring functionality. The Pentaho engine is embedded today in multiple SaaS, cloud, and customer applications such as Marketo, Paytronix, Sharable Ink and Soliditet, with many more on the horizon. Our VDM is completely customer-extensible and open. We insulate customers from changes in their data volumes, types, sources, computing platforms, and user types. In fact, what Informatica states as intention and direction with Vibe, Pentaho Data Integration delivers today, and we continue to lead in this new landscape.


The data integration market has changed: the old, heavyweight, proprietary infrastructure players must adapt to current market demands. Agile, extensible, open, embeddable engines with pluggable infrastructures are the base, but it doesn’t end there. Companies of all sizes and verticals require shorter development cycles, broad and deep big data ecosystem support, attractive price points and rich functionality, all without vendor lock-in. Informatica is adapting to play in the big data integration world by rebranding its products and signaling a new direction. Tony Baer, principal analyst at Ovum, summarizes this adaptation in his blog, “Informatica aims to get its vibe back.”

The game is on and Pentaho is at the forefront. We have very exciting big data integration news in store for you at the Hadoop Summit in Santa Clara on June 26-27 that unfortunately I have to keep the lid on for now. Stay tuned!

Richard

Richard Daley

Co-founder and chief strategy officer


Pentaho Data Integration 4 Cookbook – Win a Free Copy

July 19, 2011

Pentaho is very fortunate to have such a fantastic community. There are a few community rockstars that find time in their uber busy lives to write books about using Pentaho. The latest book published, the Pentaho Data Integration 4 Cookbook by co-authors Adrián Pulvirenti & Maria Carina Roldán is making its way to the top of the Amazon bestseller tech list. Even more impressive – this is Maria’s second book about PDI in just 15 months! (In April 2010 she published PDI 3.2: Beginner’s Guide). We were interested to learn more about the book and the authors. Check out our interview below to get the inside scoop about the PDI 4 Cookbook.

Read on to learn how to win a FREE copy of the PDI 4 Cookbook and get a special discount offer from Packt Publishing.

1) What inspired you to write the PDI 4 Cookbook so soon after “PDI 3.2: Beginner’s Guide”?
Maria: At the time PDI 3.2: Beginner’s Guide was published there was a clear need for a book that revealed the secrets of Kettle, in particular for those who barely knew about the tool. The book was very well received, especially by the Pentaho community. Today I can say that the main inspiration was definitely that rewarding feedback.

On the other hand, at the time that book was published, Pentaho was about to release PDI 4. From a beginner’s perspective, there aren’t big differences between Kettle 3.2 and Kettle 4, so there is nothing to stop you from learning Kettle 4 with the help of the Beginner’s Guide. However, Kettle 4 brought a lot of new features that deserved to be explained. This was also a motivation for writing this new book.

2) What is the main goal behind the book? What do you aim to bring across?
Adrián: This book is intended to help the reader quickly solve the problems that might appear while he or she is developing jobs and transformations. It doesn’t cover PDI basics – the Beginner’s Guide does. Instead, it focuses on giving PDI users quick solutions to particular issues.

  • Can I generate complex XML structures with Kettle?
  • How do I execute a transformation in a loop?
  • What do I need for attaching a file in an email?

These are common questions solved in the book through quick, easy-to-follow recipes at different difficulty levels.

3) Where did you find the inspiration for this new book?
Maria: The main inspiration for this book was the PDI forum; many of the recipes explained in the book are the answers to questions that appear in the forum again and again, for example: how to use variables, how to read an XML file, how to create multi-sheet Excel files, how to pass parameters to transformations, and so on. Just to give an example, the recipe “Executing part of a job once for every row in a dataset” explains how to loop over a set of entities (people, product codes, filenames, or whatever), which is a recurring question in the Kettle forum.

Besides that, Kettle itself was an inspiration. While outlining the contents of the book, and with the aim of having a diversified set of recipes, we browsed the list of steps and job entries many times, thinking: Is there something that we aren’t covering? Are there steps that deserve a recipe by themselves? Many of the recipes that you can find in the Cookbook came out of that exercise. “Programming custom functionality,” a recipe that explains how to use the UDJC step and quickly covers other scripting-related steps, is just one example of that set of recipes.
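
For readers who have never opened the User Defined Java Class (UDJC) step that the recipe covers, here is a rough sketch of what a UDJC step body typically looks like. Kettle compiles this snippet into the step at runtime, and helpers such as getRow(), putRow(), createOutputRow() and get() are supplied by the step itself. The field names first_name, last_name and full_name are invented purely for illustration and are not taken from the book:

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
        Object[] r = getRow();               // read one incoming row
        if (r == null) {                     // no more input: signal completion
            setOutputDone();
            return false;
        }

        // Resize the row to hold the extra output field defined on the step's Fields tab
        Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());

        // Illustrative only: combine two (hypothetical) input fields into a new one
        String first = get(Fields.In, "first_name").getString(r);
        String last  = get(Fields.In, "last_name").getString(r);
        get(Fields.Out, "full_name").setValue(outputRow, first + " " + last);

        putRow(data.outputRowMeta, outputRow);
        return true;                         // ask Kettle to call processRow() again
    }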

4) What do you like so much about Pentaho Data Integration that you write books about it?
Maria: I have used Kettle since the 2.4 version, when many of the tasks could only be done with JavaScript steps. Despite that, I already admired the flexibility and power of the tool. Since then, Kettle has really improved in performance, functionality and look and feel. Its capabilities are endless, and this goes unnoticed by many users. That’s what makes me write about it: the need to uncover those hidden features and explain how easily you can do things with Kettle.

Adrián: In my daily work I integrate all kinds of data: XML files, plain text files, databases, and so on. Anyone facing these tasks knows the time and effort required to accomplish them. Meeting Kettle was love at first sight. Thanks to Kettle I realized that these formerly tedious tasks can be done in a fast, fun and easy way. I liked the idea of writing this book to share my own experiences with other people.

5) When can we expect the next book(s)?
Adrián: Like Kettle, the whole Pentaho Suite has grown a lot in recent years. There is undoubtedly much to write about.

However, for now we’d like to enjoy the recently published book and look forward to feedback from the Pentaho community.


Win a free copy of the Pentaho Data Integration 4 Cookbook. Like Pentaho on Facebook and leave a comment here about which chapter(s) or recipe(s) you think will be most useful for you and why (you can see the full index of the book here). You also have the chance to win on Twitter by following Pentaho and tweeting your comment with the hashtag #PDI4. Maria and Adrián will pick their favorite comment to win. The deadline to leave a comment is July 26 at 12pm EST.

Packt Publishing is offering Pentaho BI from the Swamp readers an exclusive 20% discount on the Pentaho Data Integration 4 Cookbook when purchased through PacktPub.com. At the shopping cart, simply enter the discount code PentahoDI20 (case sensitive).

***Update July 27***
The free book goes to Mike Dugan. As Adrián explains, “Because he expressed in a few words the essence of chapter 7, which is one of our favorites.”

Mike’s response to his favorite chapter and why, “Chapter 7 is the key here. Who wants to recreate the wheel??? Just like Newton I believe in the conservation of energy…. Especially MY energy. Do it once, use it a lot, look like a rock star with minimal effort.”

Well said! Congrats Mike, you will receive a free copy of the PDI Cookbook courtesy of Packt Publishing soon.

Read all the responses here


High availability and scalability with Pentaho Data Integration

March 31, 2011

“Experts often possess more data than judgment.” – Colin Powell. Hmm, those experts surely are not using a scalable Business Intelligence solution to optimize that data, which could help them make better decisions. :-)

Data is everywhere! The amount of data being collected by organizations today is experiencing explosive growth. In general, ETL (Extract, Transform, Load) tools have been designed to move, cleanse, integrate, normalize and enrich raw data to make it meaningful and available for knowledge workers and decision support systems. Only once data has been “optimized” can it be turned into “actionable” information using the appropriate business applications or Business Intelligence software. This information could then be used to discover how to increase profits, reduce costs or even write a program that suggests what your next movie on Netflix should be. The capability to pre-process this raw data before making it available to the masses becomes increasingly vital to organizations that must collect, merge and create a centralized repository containing “one version of the truth.” Having an ETL solution that is always available, extensible and highly scalable is an integral part of processing this data.

Pentaho Data Integration

Pentaho Data Integration (PDI) can provide such a solution for many varying ETL needs. Built upon an open Java framework, PDI uses a metadata-driven design approach that eliminates the need to write, compile or maintain code. It provides an intuitive design interface with a rich library of prepackaged, pluggable design components. ETL developers with skill sets ranging from novice to data warehouse expert can take advantage of the robust capabilities available within PDI immediately, with little to no training.

The PDI Component Stack

Creating a highly available and scalable solution with Pentaho Data Integration begins with understanding the PDI component stack.

● Spoon – IDE for creating Jobs and Transformations, including the semantic layer for the BI platform
● Pan – command line tool for executing Transformations modeled in Spoon
● Kitchen – command line tool for executing Jobs modeled in Spoon
● Carte – lightweight ETL server for remote execution
● Enterprise Data Integration Server – remote execution, version control repository, enterprise security
● Java API – write your own plug-ins or integrate into your own applications

Spoon is used to create the ETL design flow in the form of a Job or Transformation on a developer’s workstation. A Job coordinates and orchestrates the ETL process with components that control file movement, scripting, conditional flow logic and notification, as well as the execution of other Jobs and Transformations. The Transformation is responsible for the extraction, transformation and loading or movement of the data. The flow is then published or scheduled to the Carte or Data Integration Server for remote execution. Kitchen and Pan can be used to call PDI Jobs and Transformations from your external command-line shell scripts or third-party programs. There is also a complete Java SDK available to integrate and embed these processes into your Java applications.
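
As a rough illustration of that last point, here is a minimal sketch of running a Spoon-designed transformation from your own Java code using the Kettle API (class names are from the Kettle 4-era SDK; the .ktr path is a placeholder):

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle environment (plugin registry, step definitions)
            KettleEnvironment.init();

            // Load a transformation designed in Spoon (path is a placeholder)
            TransMeta transMeta = new TransMeta("/etl/load_cleansed_data.ktr");

            // Execute it in the local JVM and wait for completion
            Trans trans = new Trans(transMeta);
            trans.execute(null);
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors");
            }
        }
    }

Kitchen and Pan do essentially the same thing for Jobs and Transformations from the command line, without any Java code.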

Figure 1: Sample Transformation that performs some data quality and exception checks before loading the cleansed data

PDI Remote Execution and Clusters

The core of a scalable, available PDI ETL solution involves the use of multiple Carte or Data Integration servers defined as “Slaves” in the ETL process. The remote Carte servers are started on different systems in the network infrastructure and listen for further instructions. Within the PDI process, a Cluster Scheme can be defined with one Master and multiple Slave nodes. This Cluster Scheme can be used to distribute the ETL workload in parallel across these multiple systems. It is also possible to define Dynamic Clusters, where the Slave servers are only known at run time. This is very useful in cloud computing scenarios where hosts are added or removed at will. More information on this topic, including load statistics, can be found here, in an independent consulting white paper created by Nick Goodman from Bayon Technologies, “Scaling Out Large Data Volume Processing in the Cloud or on Premise.”

Figure 2: “Cx2” means these steps are executed clustered on two Slave servers; all other steps are executed on the Master server
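
To make the clustering idea concrete, the sketch below defines a master and two slave Carte servers and assigns a single step to the cluster, which is what produces a “Cx2”-style layout like the one in Figure 2. This is only a sketch: host names, ports, credentials and the step name are placeholders, and the class and constructor names follow the Kettle 4 clustering API as I recall it. In practice you would normally define the slave servers and cluster schema directly in Spoon rather than in code.

    import java.util.Arrays;

    import org.pentaho.di.cluster.ClusterSchema;
    import org.pentaho.di.cluster.SlaveServer;
    import org.pentaho.di.trans.TransMeta;
    import org.pentaho.di.trans.step.StepMeta;

    public class ClusterSetupSketch {
        public static void defineCluster(TransMeta transMeta) {
            // One master and two slaves; host names, ports and credentials are placeholders
            SlaveServer master = new SlaveServer("master", "etl-master", "8080", "cluster", "cluster");
            master.setMaster(true);
            SlaveServer slave1 = new SlaveServer("slave-1", "etl-node1", "8081", "cluster", "cluster");
            SlaveServer slave2 = new SlaveServer("slave-2", "etl-node2", "8082", "cluster", "cluster");

            transMeta.getSlaveServers().addAll(Arrays.asList(master, slave1, slave2));

            // A cluster schema groups the slave servers so steps can be distributed across them
            ClusterSchema schema = new ClusterSchema("two-node-cluster",
                    Arrays.asList(master, slave1, slave2));
            transMeta.getClusterSchemas().add(schema);

            // Only the heavy step runs clustered; everything else stays on the master
            StepMeta heavyStep = transMeta.findStep("Sort rows");
            if (heavyStep != null) {
                heavyStep.setClusterSchema(schema);
            }
        }
    }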

The Concept of High Availability, Recoverability and Scalability

Building a highly available, scalable and recoverable solution with Pentaho Data Integration can involve a number of different parts, concepts and people. It is not a check box that you simply toggle when you want to enable or disable it. It involves careful design and planning to prepare for and anticipate the events that may occur during an ETL process. Did the RDBMS go down? Did the Slave node die? Did I lose network connectivity during the load? Was there a data truncation error at the database? How much data will be processed at peak times? The list can go on and on. Fortunately, PDI arms you with a variety of components, including complete ETL metric logging, web services and dynamic variables, that can be used to build recoverability, availability and scalability scenarios into your PDI ETL solution.
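
As one small example of designing for recoverability (and only an example — the right pattern depends on the failure modes above), a thin Java wrapper can rerun a transformation a limited number of times and pass in a named parameter as a checkpoint so a rerun does not reprocess rows that were already committed. The parameter name, file path and checkpoint store here are hypothetical:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RetryingLoader {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();
            TransMeta transMeta = new TransMeta("/etl/load_orders.ktr");   // placeholder path

            int maxAttempts = 3;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Trans trans = new Trans(transMeta);
                // Hypothetical named parameter: lets the transformation skip rows
                // that were committed before a previous failure
                trans.setParameterValue("LAST_LOADED_DATE", readCheckpoint());
                trans.execute(null);
                trans.waitUntilFinished();

                if (trans.getErrors() == 0) {
                    return;                                                // success
                }
                System.err.println("Attempt " + attempt + " failed, retrying...");
                Thread.sleep(60000L);                                      // back off before retrying
            }
            throw new RuntimeException("Load failed after " + maxAttempts + " attempts");
        }

        private static String readCheckpoint() {
            // Placeholder: in practice this would come from the PDI logging tables
            // or another durable store
            return "2011-03-30";
        }
    }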

For example, Jens Bleuel, Managing Consultant in EMEA, developed a PDI implementation of the popular Watchdog concept: a solution that, while executing its tasks and events, runs checks to monitor that everything is on track. Visit the link above for more information on this implementation.

Putting it all together – (Sample)

Diethard Steiner, an active Pentaho community member and contributor, has written an excellent tutorial that explains how to set up PDI ETL remote execution using the Carte server. He also provides a complete tutorial (including sample files provided by Matt Casters, Chief Architect and founder of Kettle) on setting up a simple “available” solution to process files using Pentaho Data Integration. You can get it here. Please note that advanced topics such as this are also covered in greater detail in our training course, available here, designed by our Managing Consultant in EMEA, Jens Bleuel.

Summary

When attempting to process the vast amounts of data collected on a daily basis, it is critical to have a data integration solution that is not only easy to use but also easily extensible. Pentaho Data Integration achieves this extensibility with its open architecture, component stack and object library, which can be used to build a scalable and highly available ETL solution without exhaustive training and with no code to write, compile or maintain.

Happy ETLing.

Regards,

Michael Tarallo
Senior Director of Sales Engineering
Pentaho

This blog was originally published on the Pentaho Evaluation Sandbox, a comprehensive resource for evaluating and testing Pentaho BI.

