Pentaho and Cloudera Impala in 5 words

April 29, 2013

Today our big data partner, Cloudera, joined us in continuing to deliver innovative, open technologies that bring real business value to customers. Pentaho and Cloudera share a common history and approach to simplifying complex but powerful technologies to integrate and analyze big data. Our common open source heritage means that we can innovate at the speed of our customers’ businesses.

What is Cloudera’s latest innovation? Cloudera Impala powers Cloudera Enterprise RTQ (Real-time Query), the first data management solution that takes Hadoop beyond batch to enable real-time data processing and analysis on any type of data (unstructured and structured) within a centralized, massively scalable system. Impala dramatically improves the economics and performance of large-scale enterprise data management.

Pentaho and Cloudera Impala in 5 words = Affordable scalability meets fast analytics. Cloudera Impala enables any product that is JDBC-enabled to get fast results from Hadoop, making Hadoop an ideal component for a data warehouse strategy. Customers no longer have to pay for an expensive proprietary DBMS or analytical database to house their entire data warehouse.

Cloudera’s innovation makes it even easier for customers to use common analytic tools that can access and analyze data in all of these formats. What does this really mean? It means you don’t have to buy expensive, proprietary products that can’t work across all of your data platforms.

With Pentaho and Cloudera, you can analyze large volumes of disparate data significantly faster with Impala than with Hive. Take a look at how Cloudera Impala is driving a major evolutionary step in the growth of the company’s Platform for Big Data, Cloudera Enterprise, and the Apache Hadoop ecosystem as a whole.

Richard Daley


Ensure Your Big Data Integration and Analytics Tools are Optimized for Hadoop

March 27, 2013

Existing data integration and business analytics tools are generally built for relational and structured file data sources, and aren’t architected to take advantage of Hadoop’s massively scalable, but high-latency, distributed data management architecture. Here’s a list of requirements for tools that are truly built for Hadoop.

A data integration and data management tool built for Hadoop must:

  1. Run In-Hadoop: fully leverage the power of Hadoop’s distributed data storage and processing. It should do this via native integration with the Hadoop Distributed Cache, to automate distribution across the cluster. Generating inefficient Pig scripts doesn’t count.
  2. Maximize resource usage on each Hadoop node: each node is a computer, with memory and multiple CPU cores. Tools must fully leverage the power of each node, through multi-threaded parallelized execution of data management tasks and high-performance in-memory caching of intermediate results, customized to the hardware characteristics of nodes.
  3. Leverage Hadoop ecosystem tools: tools must natively leverage the rapidly growing ecosystem of Hadoop add-on projects. For example, using Sqoop for bulk loading of huge datasets or Oozie for sophisticated coordination of Hadoop job workflows.

The widely distributed nature of Hadoop means accessing data can take minutes, or even hours. Data visualization and analytics tools built for Hadoop must mitigate this high data access latency:

  1. Provide end users direct access to data in Hadoop and, after initial access, instant speed-of-thought response times. This must be done in a way that is simple and intuitive for end users, while giving IT the controls it needs to streamline and manage data access.
  2. Create dynamic data marts: make it easy and quick to spin off Hadoop data into marts and warehouses for longer-lived high-performance analysis of data from Hadoop.

Learn how big data analytics provider Pentaho is optimized for Hadoop at www.pentahobigdata.com.

- Ian Fyfe, Pentaho

This blog originally appeared on GigaOM at http://gigaom.com/2012/12/11/ensure-your-big-data-integration-and-analytics-tools-are-optimized-for-hadoop/


How to Get to Big Data Value Faster

March 18, 2013

Summary: Everyone talks about how big data is the key to business success, but the process of getting value from big data is time intensive and complex.  Examining the big data analytics workflow provides clues to getting to big data results faster.

Most organizations recognize that big data analytics is key to their future business success, but efforts to implement are often slowed due to operational procedures and workflow issues.

At the heart of the issue is the big data analytics workflow including loading, ingesting, manipulating, transforming, accessing, modeling and, finally, visualizing and analyzing data. Each step requires manual intervention by IT with a great amount of hand coding and tools that invite mistakes and delays. New technologies such as Hadoop and NoSQL databases also require specialized skills. Once the data is prepared, business users often have new requests to IT for additional data sources and the linear process begins again.

Given the potential problems that can crop up in managing and incorporating big data into decision-making processes, organizations need easy-to-use solutions that can address today’s challenges, with the flexibility to adapt to meet future challenges. These solutions require data integration with support for structured and unstructured data and tools for visualization and data exploration that support existing and new big data sources.

A single, unified business analytics platform with tightly coupled data integration and business analytics, such as Pentaho Business Analytics, is ideal. Pentaho supports the entire big data analytics flow with visual tools that simplify development and remove complexity for developers, and with powerful analytics that allow a broad set of users to easily access, visualize and explore big data. By dramatically improving developer productivity and offering significant performance advantages, Pentaho reduces time to big data value.

- Donna Prlich
Senior Director, Product and Solution Marketing, Pentaho

This blog originally appeared on GigaOM at http://gigaom.com/2012/12/06/how-to-reduce-complexity-and-get-to-big-data-value-faster/


Make Your Voice Heard! – 2013 Wisdom of Crowds Business Intelligence Market Study

March 12, 2013

Make your voice heard!

Participate in the 2013 Wisdom of Crowds® Business Intelligence Market Study and get a complimentary copy of the study findings.

Dresner Advisory Services is inviting all Business Intelligence (BI) users to participate in its annual examination of the state of the BI marketplace focusing on BI usage, deployment trends, and products.

The 2013 report will build on previous years’ research and will expand to include questions on the latest and emerging trends such as Collaborative BI, BI in the Cloud, and Embedded BI. It will also rank vendors and products, providing an important tool for organizations seeking to invest in BI solutions.

BI users in all roles and throughout all industries are invited to contribute their insight, which should take approximately 15 minutes.  The final report is scheduled to be out in late Spring, and qualified survey participants will receive a complimentary copy.

Click here to start the survey today!


Improving Customer Support using Hadoop and Device Data Analytics

March 6, 2013

L – Dave Henry, Pentaho | R – Ben Lloyd, NetApp

At Strata 2013 last week, Pentaho had the privilege to host a speaking session with Ben Lloyd, Sr. Program Manager, AutoSupport (ASUP) at NetApp. Ben leads a project called ASUP.Next, which has the goal of implementing a mission-critical data infrastructure for a worldwide customer support program for NetApp’s storage appliances. With design and development assistance from Think Big Analytics and Accenture, NetApp has reached the “go-live” milestone for ASUP.Next and will go into production this month.

A Big Data Problem

More than 250,000 NetApp devices are deployed worldwide; they “phone home” with device statistics and diagnostic information and represent a continuously growing collection of structured data that must be reliably captured, parsed, interpreted and aggregated to support a large collection of use cases. Ben’s presentation highlighted the business and IT challenges of the legacy AutoSupport environment:

  • The total cost of processing, storing and managing data represents a major ongoing expense ($15M / year). The storage required for ASUP-related data doubles every 16 months — by the end of 2013 NetApp will have more than 1PB of ASUP-related data available for analysis
  • The legacy ETL (PL/SQL) and data warehouse-based approach has resulted in increased latency and missed SLAs. Integrated data for reporting and analysis is typically only available 72 hours after the receipt of device messages
  • For NetApp Customer Support, the information required to resolve support cases is not easily available in the time required
  • For NetApp Professional Services, it’s difficult or impossible to aggregate the volume of performance data needed to provide valuable recommendations
  • For Product Engineering, failure analysis and defect signatures over long time periods are impossible to identify
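
To make the growth figure in the first bullet concrete, here is a minimal sketch of the arithmetic. It assumes storage grows smoothly and exponentially, doubling every 16 months from a 1PB baseline at the end of 2013; the function names are hypothetical and the projections are illustrative, not NetApp figures.

```python
import math


def projected_storage_pb(base_pb: float, months_ahead: float,
                         doubling_months: float = 16.0) -> float:
    """Projected storage in PB after `months_ahead` months.

    Assumes smooth exponential growth: size = base * 2^(t / doubling period).
    """
    return base_pb * 2 ** (months_ahead / doubling_months)


def months_until(target_pb: float, base_pb: float,
                 doubling_months: float = 16.0) -> float:
    """Months needed to grow from `base_pb` to `target_pb` at the same rate."""
    return doubling_months * math.log2(target_pb / base_pb)


if __name__ == "__main__":
    # Baseline of 1 PB at the end of 2013, as stated in the post.
    for months in (0, 16, 32, 48):
        size = projected_storage_pb(1.0, months)
        print(f"{months:>2} months after end of 2013: {size:.1f} PB")
```

Under these assumptions the data set would grow roughly eightfold in four years, which is why the post treats data volume as a constraint that has to be removed rather than managed.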

Cloudera Hadoop: at the Core of NetApp’s Solution

The ASUP.Next project aims to address these issues by eliminating data volume constraints and building a Hadoop-centered infrastructure that will scale to support projected volumes. Ben discussed the new architecture in detail during his presentation. It enables a complete end-to-end workflow including:

  • Receipt of ASUP device messages via HTTP and e-mail
  • Message parsing and ingestion into HDFS and HBase
  • Distribution of messages to case-generation processes and downstream ASUP consumers
  • Long-term storage of messages
  • Reporting and analytic access to structured and unstructured data
  • RESTful services that provide access to AutoSupport data and processes

Pentaho’s Data Integration platform (PDI) is used in ASUP.Next for overall orchestration of this workflow as well as implementation of transformation logic using Pentaho’s visual development solution for MapReduce. Pentaho’s main value to NetApp comes from shortening the development cycle and providing ETL and job control capabilities that span the entire data infrastructure, from HDFS, HBase and MapReduce to Oracle and SAP. Pentaho also worked closely with Cloudera to ensure compatibility with the latest CDH client libraries.

NetApp’s use of Hadoop as a scalable infrastructure for ETL is increasingly common. Pentaho is seeing this use case across a variety of industries including capital markets, government, telecommunications, energy and digital publishing. In general, the reasons these customers use PDI with Hadoop include:

  • Leveraging existing team members for rapid development and ongoing maintenance of the solution. Most organizations have a core ETL team that can bring a decade or more of subject matter expertise to the table. By removing the requirement to use Java, a scripting language or raw XML, team members are able to actively help with the build-out of jobs and transformations. This also lessens the need to recruit, hire and orient outside developers
  • Increasing the “logic density” of transformations. As you can see in the demo example below, it’s possible to express a lot of transformation logic in a single mapper or reducer task. This makes it possible to reduce the number of unique jobs that must be run to achieve a complete workflow. In addition to improving performance, this can result in designs that are easier to document and explain

PDI

  • Focusing on the “what”, not the “how” of MapReduce development. I was surprised (actually shocked) to see how many of the speakers at Strata were still walking through code examples to illustrate a development technique. The typical organization has no desire and little ability to turn itself into a software development shop. The language-based approach may work for the Big Data “Titans”, but not for businesses that need to implement Big Data solutions quickly and with minimal risk
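
For readers who do want to see what “logic density” means in code terms, the sketch below chains several small transformation steps inside a single mapper, so one pass over the input replaces what could otherwise be several separate jobs. It is plain Python for illustration only, not how PDI implements its MapReduce integration; the record layout (`serial,metric,value`) and all function names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# A step takes a record and returns a new record, or None to drop it.
Step = Callable[[dict], Optional[dict]]


def parse_line(line: str) -> Optional[dict]:
    """Split a 'serial,metric,value' line into a record; drop malformed lines."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    serial, metric, value = parts
    return {"serial": serial, "metric": metric, "value": value}


def to_number(rec: dict) -> Optional[dict]:
    """Convert the value field to float; drop records that are not numeric."""
    try:
        return {**rec, "value": float(rec["value"])}
    except ValueError:
        return None


def keep_temperature(rec: dict) -> Optional[dict]:
    """Keep only Fahrenheit temperature readings."""
    return rec if rec["metric"] == "temp_f" else None


def fahrenheit_to_celsius(rec: dict) -> Optional[dict]:
    """Convert the reading to Celsius, rounded to one decimal place."""
    celsius = round((rec["value"] - 32) * 5 / 9, 1)
    return {**rec, "metric": "temp_c", "value": celsius}


def make_mapper(steps: List[Step]) -> Callable[[Iterable[str]], Iterator[Tuple[str, float]]]:
    """Build one mapper that applies every step in order to each input line."""
    def mapper(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
        for line in lines:
            rec = parse_line(line)
            for step in steps:
                if rec is None:
                    break  # an earlier step dropped this record
                rec = step(rec)
            if rec is not None:
                yield rec["serial"], rec["value"]
    return mapper


if __name__ == "__main__":
    mapper = make_mapper([to_number, keep_temperature, fahrenheit_to_celsius])
    sample = ["A1,temp_f,212", "A1,fan_rpm,4000", "bad line", "B2,temp_f,32"]
    for key, value in mapper(sample):
        print(key, value)
```

Parsing, validation, filtering and unit conversion all happen in one mapper here; a visual tool expresses the same chain as connected steps instead of functions.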

Key Takeaways

Since this was a Pentaho-sponsored session, Ben summarized his experience working with the Pentaho Services and Engineering teams. His main points are illustrated in the photo above. Most of his points revolve around how Pentaho provided support during early development and testing. A large number of Pentaho employees contributed their time, energy and brain-power to ensure the project’s success. Many enhancements in PDI 4.4 are a direct result of improvements needed to support ASUP.Next use cases.

What has Pentaho learned from this project? Pentaho gained a number of valuable insights:

  • Big Data architectures to support low-latency use cases can be complex. Not only are multiple functional components needed, but they must integrate with existing systems such as enterprise data warehouses. These architectures demand a high degree of flexibility
  • Big Data projects require customers, system integrators and technology providers to “plumb the last 5%” as the solution is being developed. Inevitably, new capabilities are used for the first time and need to be fine-tuned to support real-world use cases, data volumes and encoding formats. A good example is PDI’s support for AVRO. Although we anticipated needing to adapt the existing AVRO Input Step to work with NetApp’s schemas, we only understood the full set of requirements after seeing their actual data during an early system test
  • Pentaho’s plugin-based architecture isolates the core “engines” from the layer where point-functionality is implemented. Pentaho was able to implement all of the required enhancements without a single architectural change. The AVRO enhancements and other improvements (such as HTableInput format support for MapReduce jobs) were all coded and field-deployed via updates to plug-ins, completely eliminating the possibility of introducing defects into PDI’s data flow engine.
  • Open source is a significant “enabler” making it easy for everyone to understand how integration works. It’s hard to overestimate the importance of code transparency. It allows the customer, the system integrators and the technology partners to get right to the point and experiment quickly with different designs.

It’s been a pleasure working with NetApp and its partners on the ASUP.Next solution. We look forward to continuing our work with NetApp as their use of device data evolves to exploit new opportunities not previously possible with their legacy application.

-Dave

Dave Henry, SVP Enterprise Solutions
Pentaho


Looking for the perfect match

February 28, 2013

I’m at the O’Reilly Strata Big Data Conference in Santa Clara, CA this week where there’s lots of buzz about the value and reality of big data. It’s a fun time to be part of a hot new market in technology. But, of course, a hot new market brings a new set of challenges.

After talking to several attendees, I would not be surprised if someone took out an advertisement in the San Francisco Guardian that reads:

SEEKING BDT (Big Data Talent)

“Middle-aged attractive company seeks hot-to-trot data geek for mutually enjoyable discrete relationship, mostly involving analytics. Must enjoy long discussions about wild statistical models, short walks to the break room and large quantities of caffeine.”

The feedback from the presentations and attendees at Strata mirrors the results of a Big Data survey that Pentaho released last week, which showed a shortage of skills for new big data technologies such as Hadoop, both among existing staff and in the job market generally. This is good news for folks looking for jobs in Big Data and a good signal for others who want to learn new skills.

The market has created the perfect storm – the combination of hot new technology mixed with a myriad of very complex systems plus highly complicated statistical models and calculations. This storm is preventing the typical IT generalist or BI expert from applying.  More experienced data scientists who can spin models on their head with a twist of a mouse are in high demand. The need to garner value quickly from Big Data means there is little time to look for the “perfect match.”

It seems like new companies and technologies pop up almost every week, each with the promise of business benefits, but with the added cost of high complexity.  Shouldn’t things get easier with new technologies?

Pentaho’s Visual MapReduce is a prime example of things getting easier.  Getting data out of Hadoop quickly can be a challenge.  However, with Visual MapReduce any IT professional could pull the right information from a Hadoop cluster, improve the performance of a MapReduce job and make results available in the optimal format for business users.

New technologies might need new talent, but in the case of Pentaho Visual MapReduce, new technologies might only need new tools to help address them.

Looks like Pentaho is the perfect match.

Chuck Yarbrough
Technical Solutions Marketing


The Tesla vs. NY Times – How Analytics Helped Tesla Win

February 21, 2013

In the last couple of weeks, the feud between The NY Times reporter John Broder and Tesla Motors’ CEO, Elon Musk, has played out in the media.

It all started when Broder took a highway trip between Washington D.C. and Boston, cruising in Tesla’s Model S luxury sedan. The purpose of the trip was to range test the car between two new supercharging stations. This 200-mile trip was well under the Model S’s 265-mile estimated range. Nonetheless, the trip was filled with anxiety for Broder. Fearful of not reaching his charging destination, he had to turn off battery-draining amenities such as the radio and heater (in 30-degree weather) to finally reach his destination, feet and knuckles “frozen”.

In rebutting Broder’s claims, Tesla’s chief executive, Elon Musk, has charged that the story was faked, that Mr. Broder intentionally caused his car to fail. On his Tesla blog, he released graphs and charts, based on driving logs that contest many of the details of Mr. Broder’s article.

With the logs now published, one thing is clear: Tesla’s use of predictive analytics helped it warn Broder about what was ahead. By calculating the range based on energy consumption, Tesla signaled Broder to charge the vehicle in time. Had Tesla not been able to call its log files as witness, this futuristic motor tech company could have experienced serious brand damage.
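
As a rough illustration of a range calculation based on energy consumption, the sketch below divides the remaining battery energy by recent consumption per mile and warns when the result falls short of the distance to the next charger. This is a deliberately simple assumption, not Tesla’s actual algorithm; the function names, the reserve margin and the sample numbers are all hypothetical.

```python
def estimated_range_miles(remaining_kwh: float, recent_wh_per_mile: float) -> float:
    """Estimate remaining range from battery energy and recent consumption."""
    if recent_wh_per_mile <= 0:
        raise ValueError("consumption must be positive")
    # Convert kWh to Wh, then divide by Wh consumed per mile.
    return remaining_kwh * 1000.0 / recent_wh_per_mile


def should_warn(remaining_kwh: float, recent_wh_per_mile: float,
                miles_to_charger: float, reserve_miles: float = 10.0) -> bool:
    """Warn when the estimated range does not cover the trip plus a reserve."""
    estimate = estimated_range_miles(remaining_kwh, recent_wh_per_mile)
    return estimate < miles_to_charger + reserve_miles


if __name__ == "__main__":
    # Illustrative values only: 40 kWh left, 320 Wh/mile recent consumption.
    print(estimated_range_miles(40.0, 320.0))   # estimated miles remaining
    print(should_warn(40.0, 320.0, 120.0))      # charger 120 miles away
```

Cold weather or cabin heating raises consumption per mile, which lowers the estimate and triggers the warning earlier.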

What’s interesting is that Tesla’s story is not unique. Today, virtually anything that we use, an appliance, a mobile phone, an application, generates some sort of data – machine-generated data. And the truth exists behind that data. Such data, when analyzed and mined properly, provides indicators that solve problems, ahead of time.

Having real-time access to machine-generated data to foresee problems and improve performance is exactly why NetApp is using Pentaho. Using Hadoop and Pentaho Business Analytics to process and drive insights from 2-5 TBs of incoming data per week, NetApp has built a solution that sends alerts and notifications ahead of the actual hardware failure. The solution has helped NetApp predict its appliance interruptions for the E-Series storage units, offering new ways to exceed customer SLAs and protect the brand’s image.

Tesla, NetApp or other, if you run a data-driven business, the more your company can act on that data to improve your application, service or product performance, the better off your customers and the better your brand will be.

Pentaho Business Analytics gives companies fast and easy ways to collect and analyze data and to predict data patterns. Pentaho’s customers see the value of analytics in many different facets and use cases. NetApp’s use case will be featured at the upcoming Strata conference on Thursday, February 28, 2013.

Join us to find out more.

- Farnaz Erfan, Product and Solution Marketing, Pentaho

