The Pentaho Test Drive

September 2, 2014

[Photo: Tesla Model S sedan]

Before investing in a new car, you first narrow down the features you want, such as speed, ease of use, and cost effectiveness. Next, once you have a strategy, you probably head to the dealer to test drive a few of the targeted options before making a big investment in a car that you know will improve your commute.

With the rapidly changing data landscape, there are newer, faster, and easier-to-use options in the marketplace for accessing, analyzing and predicting with your data. As with flashy car advertisements, it's hard to separate hype from reality and determine which tools will give your organization the performance and flexibility needed to achieve a return on your investment.

Similar to the process you would take to buy a new car, at PentahoWorld, we are giving you a unique opportunity to test drive three different innovative use cases that we see making a huge impact on our customers. Test drive options include: Pentaho Data Integration (PDI) with Hadoop, PDI with MongoDB, and PDI with Weka (Predictive) – see below for descriptions.

At the PentahoWorld Hands-On Product Training, we hand over the keys – or, in reality, we provide the experts, computers and lunch. This really is a great opportunity to take your knowledge to the next level with in-depth instruction in these innovative technologies. I recommend signing up for the test drive sessions ASAP, as space is limited and the early bird pricing ends September 5th.

John Durkin
Senior Training Manager
Pentaho

Test Drive Pentaho with Hadoop

This Test Drive covers the steps to use Hadoop in the process of optimizing your data warehouse. You’ll gain hands-on experience with creating folders in HDFS, loading data, working with PDI transformations, configuring Pentaho MapReduce and reviewing your results in HDFS. You’ll orchestrate it all by using a Hadoop job in PDI.
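The exercise itself runs through PDI’s graphical tools, but the underlying HDFS steps are ordinary Hadoop file operations. Here is a rough sketch of that flow in Python (the paths and file names are illustrative, not the course materials):

    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' command, raising if it fails."""
        subprocess.run(["hdfs", "dfs", *args], check=True)

    # Create a staging folder in HDFS (path is illustrative)
    hdfs("-mkdir", "-p", "/user/pentaho/weblogs/raw")

    # Load a local sample file into HDFS
    hdfs("-put", "sample_weblogs.txt", "/user/pentaho/weblogs/raw/")

    # Review the results after the PDI MapReduce job has run
    hdfs("-ls", "/user/pentaho/weblogs")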

Test Drive Pentaho with MongoDB

This Test Drive outlines a real-world scenario based on creating a 360° view of your customers. You’ll gain hands-on experience with connecting MongoDB to PDI, loading data into a MongoDB document, creating arrays, creating an aggregation pipeline query, and visualizing the results. You’ll orchestrate all of these tasks using a MongoDB job in PDI.
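For a feel of what the document-loading and aggregation steps involve, here is a minimal Python sketch using the standard pymongo driver; the database, collection and field names are invented for illustration:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local instance
    db = client["crm"]  # hypothetical database; names below are invented

    # Load a customer document that embeds an array of interactions
    db.customers.insert_one({
        "customer_id": 1001,
        "name": "Acme Corp",
        "interactions": [
            {"channel": "email", "spend": 120.0},
            {"channel": "web", "spend": 340.5},
        ],
    })

    # Aggregation pipeline: unwind the array, then total spend per channel
    pipeline = [
        {"$unwind": "$interactions"},
        {"$group": {"_id": "$interactions.channel",
                    "total_spend": {"$sum": "$interactions.spend"}}},
    ]
    for row in db.customers.aggregate(pipeline):
        print(row)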

Test Drive Pentaho with Weka

This Test Drive introduces basic data mining concepts and terminology, along with the parts of the Pentaho suite that facilitate the development and application of predictive modelling. If you are new to data mining or you are considering a predictive solution for your business challenge, this is the test drive for you. You’ll gain hands-on experience with the Pentaho tools using a real-world direct-marketing use case. In particular, you will be introduced to some common types of data mining models and guided through the process of creating, evaluating, exporting and deploying a predictive model.
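Weka itself is driven through its own workbench and PDI’s plugins. As a rough Python analogue of the same create/evaluate/export/deploy cycle, here is a scikit-learn sketch on synthetic, direct-marketing-style data (a stand-in for illustration, not the course’s Weka workflow):

    import pickle
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a direct-marketing response data set
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Create and evaluate the model
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # Export the model; a separate scoring job would load and apply it
    with open("campaign_model.pkl", "wb") as f:
        pickle.dump(model, f)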

Sign-up for the Pentaho test drive today!

Photo credit: Slate.com


Spark on Fire! Integrating Pentaho and Spark

June 30, 2014

One of Pentaho’s great passions is to empower organizations to take advantage of amazing innovations in Big Data to solve new challenges using the existing skill sets they have in their organizations today.  Our Pentaho Labs’ innovations around natively integrating data engineering and analytics with Big Data platforms like Hadoop and Storm have already led dozens of customers to deploy next-generation Big Data solutions. Examples of these solutions include optimizing data warehousing architectures, leveraging Hadoop as a cost effective data refinery, and performing advanced analytics on diverse data sources to achieve a broader 360-degree view of customers.

Not since the early days of Hadoop have we seen so much excitement around a new Big Data technology as we see right now with Apache Spark.  Spark is a Hadoop-compatible computing system that makes big data analysis drastically faster, through in-memory computation, and simpler to write, through easy APIs in Java, Scala and Python.  With the second annual Spark Summit taking place this week in San Francisco, I wanted to share some of the early work Pentaho Labs and our partners over at Databricks are collaborating on to deeply integrate Pentaho and Spark for delivering high performance, Big Data Analytics solutions.

Big Data Integration on Spark

At the core of Pentaho Data Integration (PDI) is a portable ‘data machine’ for ETL, which today can be deployed as a stand-alone Pentaho cluster or inside your Hadoop cluster through MapReduce and YARN.  The Pentaho Labs team is now taking this same concept and working on the ability to deploy inside Spark for even faster Big Data ETL processing.  The benefit for ETL designers is the ability to design, test and tune ETL jobs in PDI’s easy-to-use graphical design environment, and then run them at scale on Spark.  This dramatically lowers the skill sets required, increases productivity, and reduces maintenance costs when taking advantage of Spark for Big Data Integration.
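To give a sense of the kind of work such a job performs, here is a minimal stand-alone Spark ETL sketch in Python; the input path, schema and business logic are hypothetical, and a PDI-designed job would generate the equivalent operations rather than hand-written code:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Hypothetical input: raw orders landed as CSV in HDFS
    orders = spark.read.csv("hdfs:///data/orders.csv",
                            header=True, inferSchema=True)

    # Typical ETL steps: filter bad rows, derive a column, aggregate
    daily = (orders
             .filter(F.col("amount") > 0)
             .withColumn("order_date", F.to_date("order_ts"))
             .groupBy("order_date")
             .agg(F.sum("amount").alias("revenue")))

    daily.write.mode("overwrite").parquet("hdfs:///warehouse/daily_revenue")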

Advanced Analytics on Spark

Last year Pentaho Labs introduced a distributed version of Weka, Pentaho’s machine learning and data mining platform. The goal was to develop a platform-independent approach to using Weka with very large data sets by taking advantage of distributed environments like Hadoop and Spark. Our first implementation proved out this architecture by enabling parallel, in-cluster model training with Hadoop.


We are now working on a similar level of integration with Spark that includes data profiling and evaluating classification and regression algorithms in Spark.  The early feedback from Pentaho Labs confirms that developing solutions on Spark is faster and easier than with MapReduce. In just a couple of weeks of development, we have demonstrated the ability to perform in-cluster Canopy clustering and are very close to having k-means++ working in Spark as well!
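For reference, Spark’s MLlib exposes k-means with a “k-means||” initialization mode, its scalable variant of k-means++ seeding. A minimal PySpark sketch on toy data (the underlying Spark primitive, not our Weka integration):

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

    # Toy two-dimensional points standing in for a real training set
    df = spark.createDataFrame(
        [(0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)], ["x", "y"])
    points = VectorAssembler(inputCols=["x", "y"],
                             outputCol="features").transform(df)

    # initMode="k-means||" is Spark's scalable take on k-means++ seeding
    model = KMeans(k=2, initMode="k-means||", seed=1).fit(points)
    print(model.clusterCenters())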

Next up: Exploring Data Science Pack Integration with MLlib

MLlib is already one of the most popular technologies for performing advanced analytics on Big Data.  By integrating Pentaho Data Integration with Spark and MLlib, Data Scientists will benefit from having an easy-to-use environment (PDI) to prepare data for use in MLlib-based solutions.  Furthermore, this integration will make it easier for IT to operationalize the work of the Data Science team by orchestrating the entire end-to-end flow from data acquisition, to data preparation, to execution of MLlib-based jobs, to sharing the results, all in one simple PDI job flow.  To get a sense of how this integration might work, I encourage you to look at a similar integration with R we recently launched as part of the Data Science Pack for Pentaho Business Analytics 5.1.
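As a sketch of what such an end-to-end flow reduces to at the Spark level, here is a hypothetical prepare-train-score pipeline in PySpark; the column names, paths and model choice are all assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("flow-sketch").getOrCreate()

    # 1. Acquire: labeled data prepared and landed by an upstream job
    raw = spark.read.parquet("hdfs:///prepared/customers")

    # 2. Prepare and 3. train as a single MLlib Pipeline
    # (assumes numeric columns 'age', 'spend' and a 0/1 label 'churned')
    assembler = VectorAssembler(inputCols=["age", "spend"],
                                outputCol="features")
    lr = LogisticRegression(labelCol="churned", featuresCol="features")
    model = Pipeline(stages=[assembler, lr]).fit(raw)

    # 4. Share: write scored output where reporting tools can pick it up
    (model.transform(raw)
          .select("customer_id", "prediction")
          .write.mode("overwrite").parquet("hdfs:///scored/customers"))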

Experiment Today with Pentaho and Spark!

You can experiment with Pentaho and Spark today for both ETL and reporting.  In conjunction with our partners at Databricks, we recently certified the following use cases combining Pentaho and Spark (a minimal connection sketch follows the list):

  • Reading data from Spark as part of an ETL workflow by using Pentaho Data Integration’s Table Input step with Apache Shark (a Hive-compatible SQL layer that runs on Spark)
  • Reporting on Spark data using Pentaho Reporting against Apache Shark
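Shark speaks the Hive Thrift protocol, so a Hive client can issue the same SQL a Table Input step would. The host, port and table below are assumptions, and the PyHive client shown targets the HiveServer2 protocol, so treat this as illustrative only:

    from pyhive import hive

    # Host, port and table are assumptions for illustration
    conn = hive.connect(host="shark-host", port=10000)
    cursor = conn.cursor()
    cursor.execute("SELECT product, SUM(amount) FROM sales GROUP BY product")
    for product, total in cursor.fetchall():
        print(product, total)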

We are excited about this first step in what we both hope to be a collaborative journey towards deeper integration.

Jake Cornelius
Sr. Vice President, Product Management
Pentaho

 


Dinosaurs Have Had Their Day

June 16, 2014


Once upon a time, (not so) long ago in 2004, two young technologies were born from the same open source origins – Hadoop and Pentaho. Both evolved quickly from the market’s demand for better, larger-scale analytics that could be adopted faster to benefit more players.

Most who adopt Hadoop want to be disruptive leaders in their market without breaking the bank. Earlier this month at Hadoop Summit 2014, I talked to many people who told me, “I’d like to get off of <insert old proprietary software here> for my new big data applications and that’s why we’re looking at Pentaho.” It’s simple – no company is going to adopt Hadoop and then turn around and pay the likes of Informatica, Oracle or SAS outrageous amounts for data engineering or analytics.

Big data is the asteroid that has hit the tech market and changed its landscape forever, giving life to new business models and architectures based on open source technologies. First the ancient dinosaurs ignored open source, then they fought it and now they are trying to embrace it. But the mighty force of evolution had other plans. Dinosaurs are giving way to a more nimble generation that doesn’t depend on a mammoth diet of maintenance revenue, exorbitant license fees and long-term deals just to survive.

In this new world, companies must continually evolve to survive, and dinosaurs have had their day. It’s incredibly rewarding to be part of a new analytics ecosystem that thrives on open standards, high performance and better value for customers. So many positive evolutionary changes have taken place in the last ten years that I can’t wait to see what the next ten will bring.

Richard Daley
Founder and Chief Strategy Officer
Pentaho



Hadoop Summit 2014 – Big Data Keeps Getting Bigger

June 6, 2014


While most of this year’s Hadoop Summit sessions still conveyed ‘developer conference,’ rife with command-line driven demos and Java, Scala, and Python code snippets, I noticed the ‘commercial’ uniform of khakis, blazers and Docksiders starting to creep in. Indeed, the themes I noticed most at the Summit were “enterprise ready” and “next-generation data platform.”

So if the Summit’s days as an all-out geekfest are history, what does this say about Hadoop? I happen to think it’s great news: it says Hadoop is going mainstream and being embraced as core to the enterprise data platform. Nothing drives this home more convincingly than the fact that the Hadoop “enterprise ready” ecosystem has exploded from less than ten vendors five years ago to more than 80 vendor sponsors at this year’s show.

This was our fifth year sponsoring the Summit, and we were just as pumped as when we attended and sponsored our first Hadoop Summit back in June 2010, right after we launched our first Hadoop product set.  This year saw a record crowd (3,200+ attendees from 1,100 different companies), informative breakout sessions, fun parties, and lots of energy and passion throughout.

More large enterprises than ever laid out specific needs and funded use cases. I noticed companies increasingly talking about bringing Hadoop in house to build proofs of concept so they wouldn’t get left behind and lose ground to competitive Hadoop shops. Hadoop has emerged as the new strategic weapon in companies’ IT arsenals as they wake up to the value of their data assets.

I talked to Hadoop users who were beaming with pride over their projects and hungry to take on more. These techies are the new corporate rock stars, delivering huge returns to their companies. However, as with any young technology, Hadoop projects aren’t completely free of road bumps – mostly around blending different types of data and integration, as both corporate data volumes and variety continue to multiply like rabbits.

That’s why at Pentaho, we’re determined to stay on the road less travelled and keep smoothing out these data blending and integration road bumps so that every data professional working with Hadoop – regardless of their dress code – will enjoy a better ride.

See you at Hadoop Summit 2015!

Richard Daley
Founder and Chief Strategy Officer
Pentaho


Highlights From Splunk .conf2013 – Machine Data Meets Big Business

October 4, 2013

Eddie White, EVP Business Development, Pentaho

This week, Pentaho was on site for Splunk .conf2013 in Las Vegas and the show was buzzing with excitement. Organizations big and small shared a range of new innovations leveraging machine data.

Eddie White, executive VP of business development at Pentaho, shares his first-hand impressions and insights on the biggest news and trends coming out of .conf2013.

Q: Eddie, what are your impressions of this year’s Splunk conference?

There’s a different feel at the show this year — bigger companies and more business users attended. What has traditionally been more of an “IT show” has evolved to showcase real business use cases, success stories and post-deployment analysis. It’s apparent that machine data has turned a corner. The industry is moving well beyond simply logging machine data. Users now integrate, analyze and leverage their vast resource of device data for business intelligence and competitive advantage.

For example, on the first day ADP shared how they leverage big data for real-time insights. Yahoo! shared details on a deployment of Splunk Enterprise at multi-terabyte scale that is helping to better monitor and manage website properties. Intuit spoke on leveraging Splunk for diagnostics, testing, performance tuning and more. And on the second day, StubHub, Harvard University, Credit Suisse, Sears and Wipro were all featuring compelling uses for Splunk.

What was most exciting to me was the 50+ end users I spoke with who wanted to learn how Pentaho blends data with and within Splunk. Our booth traffic was steady and heavy. Pentaho’s enhanced visualization and reporting demos were a hit not only with the IT attendees, but with the business users who are searching for ways to harness the power of their Splunk data for deeper insights.

Q: Does attendance indicate a bigger/growing appetite for analysis of machine data?

Splunk is helping to uncover new information and insights – tapping into the myriad data types Splunk supports as a data platform. It’s clearly making an impact in the enterprise. Yet as all these organizations increasingly turn to Splunk to collect, index and harness their machine-generated big data, there is tremendous opportunity for them to turn to Pentaho, a Splunk Powered Technology Partner, to tap and combine Splunk data with any other data source for deeper insights.

Q: How is the market developing for machine data analytics?

We are seeing the market here change from being driven by the technologists, to being driven by the business user.  The technology has advanced and now has the scale, the flexibility and the models to make real business impacts for the enterprise.  The use cases are clearly defined now and the technology fits the customer needs.  The level of collaboration between the major players like Pentaho, Splunk and Hadoop vendors now presents CIOs with real value.

Q: You were invited this year to speak on a CXO Panel addressing Big Data challenges and opportunities. What were some of the highlights?

The CXO panel was fantastic. It was quite an honor to present and be on a panel with four founders and “rock stars” in Big Data: Matt Pfeil (DataStax), M.C. Srivas (MapR), Ari Zilka (Hortonworks) and Amr Awadallah (Cloudera).

The panel session ran for 90 minutes, and we tackled a range of big data challenges. We heard that Splunk users are dealing with many of the same questions and challenges.

Business users and IT professionals who are just getting started are struggling with which project to pick first and how to take the first steps. My advice is to pick a real business use case and push us vendors to do a proof of concept with you and your team, and to show quantifiable results in 30 days.

We also heard a lot of questions about which vendor has the right answer to their individual use scenarios and challenges. It was great to see all of the panelists on the same page in their response. No one vendor has all the answers. As I mentioned on the panel, if any Big Data player tells you they can solve all your Big Data problems, you should disqualify them! Users need Splunk, they need Pentaho and they need Hadoop.

Q: Taking a high level view of the conference, what trends can you identify?

There were two major trends taking center stage. Business people were asking business questions, and almost everyone was looking to map adoption to real business use cases.  And again, there’s a clear awareness that no one vendor can answer all of their questions. They are all looking at how best to assemble Hadoop alongside Pentaho, and to extend their use of Splunk with those technologies.

Q: Pentaho and Splunk are demonstrating the new Pentaho Business Analytics and Splunk Enterprise offering, providing a first look to conference attendees. What kind of reaction are you getting from the demos?

The reaction from the audiences was tremendous, and we had two sets of reactions. The end user customers took the time to go in-depth with the technology demos and asked questions like where Splunk ends and where Pentaho begins. The demo drew the business users in too. It was a very powerful visualization of how we can enable a Splunk enterprise to solve business problems.

The Splunk sales teams who visited the booth and saw the demo were able to clearly discuss how to position a total solution for their customer.

Learn more about Splunk and Pentaho.

 


Impala – A New Era for BI on Hadoop

November 30, 2012

With the recent announcement of Impala, also known as Cloudera Enterprise RTQ (Real Time Query), I expect the interest in and adoption of Hadoop to go from merely intense to crazy.  We applaud Cloudera’s investment in creating Impala as it moves Hadoop a huge step forward in making Hadoop accessible using existing BI tools.

What is Impala?  Simply put, it enables all of the SQL-based BI and business analytics tools that have been built over the past couple of decades to now work directly on top of Hadoop, providing interactive response times not previously attainable with Hadoop, and many times faster than Hive, the existing SQL-like alternative. And Impala provides pretty complete SQL support, including join and aggregate functions – must-have functions for analytics.
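To make that concrete, here is a minimal sketch using the impyla client (the host, port and schema are assumptions); the point is simply that standard SQL with joins and aggregates runs directly against data in Hadoop:

    from impala.dbapi import connect

    # Connection details and table schema are assumptions for illustration
    conn = connect(host="impalad-host", port=21050)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT c.region, SUM(o.amount) AS revenue
        FROM orders o JOIN customers c ON o.customer_id = c.id
        GROUP BY c.region
    """)
    for region, revenue in cursor.fetchall():
        print(region, revenue)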

For enterprises this analytic query speed and expressiveness is huge – it means they are now much less likely to need to extract data out of Hadoop and load it into a data mart or warehouse for interactive visualization.  Instead they can use their favorite business analytics tool directly against Hadoop. But of course only Pentaho provides the integrated end-to-end data integration and business analytics capability for both ingesting and processing data inside of Hadoop, as well as interactively visualizing and analyzing Hadoop data.

Over the past few months Cloudera and Pentaho have been partnering closely at all levels including marketing, sales and engineering.  We are proud of the role we played in assisting Cloudera with validating and testing Impala against realistic BI workloads and use cases.  Based on the extremely strong interest we’ve seen, as evidenced by the lines at our booth at the recent Strata big data conference in New York City, the combination of Pentaho’s visual development and interactive visualization for Hadoop with the break-through performance of Cloudera Impala is very compelling for a huge number of enterprises.

- Ian Fyfe, Chief Technology Evangelist, Pentaho



Are You?

June 15, 2012

You might be a badass… but are you a big data badass??

Happy Friday!

