Ensure Your Big Data Integration and Analytics Tools are Optimized for Hadoop

March 27, 2013

Existing data integration and business analytics tools are generally built for relational and structured file data sources, and aren’t architected to take advantage of Hadoop’s massively scalable, but high-latency, distributed data management architecture. Here’s a list of requirements for tools that are truly built for Hadoop.

A data integration and data management tool built for Hadoop must:

  1. Run In-Hadoop: fully leverage the power of Hadoop’s distributed data storage and processing. It should do this via native integration with the Hadoop Distributed Cache to automate distribution across the cluster (a hand-coded sketch of what this involves appears after this list). Generating inefficient Pig scripts doesn’t count.
  2. Maximize resource usage on each Hadoop node: each node is a computer, with memory and multiple CPU cores. Tools must fully leverage the power of each node, through multi-threaded parallelized execution of data management tasks and high-performance in-memory caching of intermediate results, customized to the hardware characteristics of nodes.
  3. Leverage Hadoop ecosystem tools: tools must natively leverage the rapidly growing ecosystem of Hadoop add-on projects. For example, using Sqoop for bulk loading of huge datasets or Oozie for sophisticated coordination of Hadoop job workflows.
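
To make requirement 1 concrete, here is a rough, hand-coded sketch of what distributed-cache usage looks like against the Hadoop Java API of the day (1.x-era MapReduce). The job name, file paths and mapper logic are illustrative assumptions, not anything from Pentaho’s implementation; the point is that a tool claiming native in-Hadoop integration should generate and manage this kind of plumbing for you.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LookupEnrichmentJob {

    // Map-only enrichment: each mapper joins its input records against a small
    // reference file that the distributed cache has copied to every node.
    public static class EnrichMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException {
            // Cached files show up as local paths on each task node.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toUri().getPath()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", 2);
            String enriched = lookup.get(fields[0]);
            context.write(new Text(fields[0]), new Text(enriched == null ? "UNKNOWN" : enriched));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ship the reference file to every node once, instead of having each
        // task read it over the network.
        DistributedCache.addCacheFile(new URI("/reference/product_lookup.csv"), conf);

        Job job = new Job(conf, "lookup-enrichment");   // hypothetical job name
        job.setJarByClass(LookupEnrichmentJob.class);
        job.setMapperClass(EnrichMapper.class);
        job.setNumReduceTasks(0);                       // map-only enrichment
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}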

The widely distributed nature of Hadoop means accessing data can take minutes, or even hours. Data visualization and analytics tools built for Hadoop must mitigate this high data access latency:

  1. Provide end users direct access to data in Hadoop: and, after the initial access, provide instant speed-of-thought response times. This must be done in a way that is simple and intuitive for end users, while giving IT the controls it needs to streamline and manage that access.
  2. Create dynamic data marts: make it quick and easy to spin off Hadoop data into marts and warehouses for longer-lived, high-performance analysis (a minimal hand-coded sketch of this idea follows below).
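
As a rough illustration of the dynamic data mart idea (not Pentaho’s implementation), the sketch below pulls an aggregated MapReduce result out of HDFS and bulk-loads it into a relational mart over JDBC. The paths, connection URL and table layout are assumptions; a tool built for Hadoop should let you spin off such marts without writing this code by hand.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HadoopToMartLoader {
    public static void main(String[] args) throws Exception {
        // Open an aggregated result file produced by an upstream Hadoop job.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path result = new Path("/warehouse/daily_summary/part-r-00000");   // illustrative path

        // Target mart: any JDBC-accessible analytic database (driver assumed on the classpath).
        Connection mart = DriverManager.getConnection(
                "jdbc:postgresql://mart-host/analytics", "etl", "secret");
        mart.setAutoCommit(false);
        PreparedStatement insert = mart.prepareStatement(
                "INSERT INTO daily_summary (metric, total) VALUES (?, ?)");

        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(result)));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");          // MapReduce's default key/value separator
            insert.setString(1, fields[0]);
            insert.setLong(2, Long.parseLong(fields[1]));
            insert.addBatch();
        }
        insert.executeBatch();
        mart.commit();

        reader.close();
        insert.close();
        mart.close();
        fs.close();
    }
}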

Learn how big data analytics provider Pentaho is optimized for Hadoop at www.pentahobigdata.com.

- Ian Fyfe, Pentaho

This blog originally appeared on GigaOM at http://gigaom.com/2012/12/11/ensure-your-big-data-integration-and-analytics-tools-are-optimized-for-hadoop/


How to Get to Big Data Value Faster

March 18, 2013

Summary: Everyone talks about how big data is the key to business success, but the process of getting value from big data is time-intensive and complex. Examining the big data analytics workflow provides clues to getting to big data results faster.


Most organizations recognize that big data analytics is key to their future business success, but efforts to implement it are often slowed by operational procedures and workflow issues.

At the heart of the issue is the big data analytics workflow: loading, ingesting, manipulating, transforming, accessing, modeling and, finally, visualizing and analyzing data. Each step requires manual intervention by IT, with a great deal of hand coding and tools that invite mistakes and delays. New technologies such as Hadoop and NoSQL databases also require specialized skills. Once the data is prepared, business users often come back to IT with new requests for additional data sources, and the linear process begins again.

Given the potential problems that can crop up in managing and incorporating big data into decision-making processes, organizations need easy-to-use solutions that address today’s challenges with the flexibility to adapt to future ones. These solutions require data integration with support for structured and unstructured data, plus tools for visualization and data exploration that support both existing and new big data sources.

A single, unified business analytics platform with tightly coupled data integration and business analytics, such as Pentaho Business Analytics, is ideal. Pentaho supports the entire big data analytics flow, with visual tools that simplify development and remove complexity for developers, and powerful analytics that allow a broad set of users to easily access, visualize and explore big data. By dramatically improving developer productivity and offering significant performance advantages, Pentaho greatly reduces the time to big data value.

- Donna Prlich
Senior Director, Product and Solution Marketing, Pentaho

This blog originally appeared on GigaOM at http://gigaom.com/2012/12/06/how-to-reduce-complexity-and-get-to-big-data-value-faster/


Make Your Voice Heard! – 2013 Wisdom of Crowds Business Intelligence Market Study

March 12, 2013

Make your voice heard!

Participate in the 2013 Wisdom of Crowds® Business Intelligence Market Study and get a complimentary copy of the study findings.

Dresner Advisory Services is inviting all Business Intelligence (BI) users to participate in its annual examination of the state of the BI marketplace focusing on BI usage, deployment trends, and products.

The 2013 report will build on previous years’ research and will expand to include questions on the latest and emerging trends such as Collaborative BI, BI in the Cloud, and Embedded BI. It will also rank vendors and products, providing an important tool for organizations seeking to invest in BI solutions.

BI users in all roles and across all industries are invited to contribute their insights; the survey takes approximately 15 minutes to complete. The final report is scheduled for release in late spring, and qualified survey participants will receive a complimentary copy.

Click here to start the survey today!


Is your Hadoop cluster big enough to hold your development team’s ego?

March 7, 2013

“There isn’t a cluster big enough to hold your ego!”

While Gartner uses the “trough of disillusionment” to describe the hangover that follows a period of commercial hype, on the IT side I see a corresponding “mountain of ego”. Don’t get me wrong. This is not about a sales guy trying to go after the development community – one I proudly belonged to for many years and where I started my journey in this industry. But ask any developer how long it takes to code something and prepare to be amazed by how fast and easy it all is. Three budget cycles and a couple of delay notifications later, we all know better. Agile development tries to cope with this, but it’s no silver bullet.

As companies plough ahead with big data initiatives, the relationship between IT and the business has never been more important. IT and data integration specialists lead most of today’s big data initiatives; it’s uncharted territory, pioneering work and a place to shine a bright and powerful spotlight on IT’s capabilities and potential to add great value to the business. Challenged by the promise of crafting an algorithm that reads like poetry, they dive in head first with the scripting language of their choice: Python, Ruby, Pig, Perl, JavaScript… whatever you prefer. Too bad there isn’t a Hadoop assembler library available, or we could take some real poetic license!

But here’s the problem. It’s one thing to develop beautiful algorithms and dazzling prototypes, but what happens when the inevitable errors, exceptions and irregularities, or – worse – the continuous stream of user changes, surface? These unglamorous, seemingly trivial inconveniences, which seemed hardly worth factoring into the initial delivery estimates, invariably wind up causing major headaches and delays. And fixing them is boring and unworthy of the self-proclaimed and highly paid Data Scientist!

The reason I mention all this is that no group suffers the grave consequences of putting ego before pragmatism more than IT itself. Business Intelligence 1.0 fell into disrepute for taking ages to implement, costing too much money, being inflexible and being plagued with backlogs of IT requests. Our industry must avoid a repeat failure in big data, where searching for the Holy Grail of scripting nirvana gets in the way of delivering solutions to the business on time.

When it comes to data analytics, sidestepping IT is reckless. If you thought ‘Excel Hell’ and ‘rogue spreadsheets’ led to inefficiency and poor decisions, just wait for the mayhem that ensues when ‘rogue analytics’ comes to town! Whether it knows it or not, the business needs IT to handle the data cleansing, warehousing, integration and assimilation that is vital to underpinning fast, meaningful, insightful analytics – especially when big data sets come into the picture.

All this means that IT and the business need to work together and find a rhythm instead of trying to get one over on each other. Since the business rarely gets blamed for bad IT decisions (even when it makes them), this rhythm will only happen when IT gets pragmatic and finds ways to work at the (rapid) pace of the business, especially when using big data sources. In addition to changes in culture and mentality, taking advantage of the promise of big data analytics will certainly involve IT using powerful data integration tools instead of coding scripts and hacking algorithms together. Through these changes, however, IT stands to earn respect and even hero status when the business is able to measure revenue gains and efficiency savings.

On the other hand, if IT insists on putting ego before pragmatism, the business will find shortcuts, and believe you me, they won’t be pretty! So don’t F* around… visit http://www.pentahobigdata.com to learn more.

Davy Nys

*Cartoon drawing by Pentaho’s own Steve Macfarlane


Improving Customer Support using Hadoop and Device Data Analytics

March 6, 2013

L – Dave Henry, Pentaho | R – Ben Lloyd, NetApp

At Strata 2013 last week, Pentaho had the privilege to host a speaking session with Ben Lloyd, Sr. Program Manager, AutoSupport (ASUP) at NetApp. Ben leads a project called ASUP.Next, which has the goal of implementing a mission-critical data infrastructure for a worldwide customer support program for NetApp’s storage appliances. With design and development assistance from Think Big Analytics and Accenture, NetApp has reached the “go-live” milestone for ASUP.Next and will go into production this month.

A Big Data Problem

More than 250,000 NetApp devices are deployed worldwide; they “phone home” with device statistics and diagnostic information and represent a continuously growing collection of structured data that must be reliably captured, parsed, interpreted and aggregated to support a large collection of use cases. Ben’s presentation highlighted the business and IT challenges of the legacy AutoSupport environment:

  • The total cost of processing, storing and managing data represents a major ongoing expense ($15M / year). The storage required for ASUP-related data doubles every 16 months — by the end of 2013 NetApp will have more than 1PB of ASUP-related data available for analysis
  • The legacy ETL (PL/SQL) and data warehouse-based approach has resulted in increased latency and missed SLAs. Integrated data for reporting and analysis is typically only available 72 hours after the receipt of device messages
  • For NetApp Customer Support, the information required to resolve support cases is not readily available when it is needed
  • For NetApp Professional Services, it’s difficult or impossible to aggregate the volume of performance data needed to provide valuable recommendations
  • For Product Engineering, it is impossible to perform failure analysis or identify defect signatures over long time periods

Cloudera Hadoop: at the Core of NetApp’s Solution

The ASUP.Next project aims to address these issues by eliminating data volume constraints and building a Hadoop-centered infrastructure that will scale to support projected volumes. Ben discussed the new architecture in detail during his presentation. It enables a complete end-to-end workflow including:

  • Receipt of ASUP device messages via HTTP and e-mail
  • Message parsing and ingestion into HDFS and HBase (a hand-coded illustration of this step appears after this list)
  • Distribution of messages to case-generation processes and downstream ASUP consumers
  • Long-term storage of messages
  • Reporting and analytic access to structured and unstructured data
  • RESTful services that provide access to AutoSupport data and processes
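
Purely to illustrate the parsing-and-ingestion step above, and not NetApp’s actual code, a hand-coded write of one parsed ASUP message into HBase with the 2013-era client API might look like the following sketch; the table name, column family, fields and row-key scheme are all assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AsupMessageWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "asup_messages");          // assumed table name

        // Row key: device serial number plus receipt timestamp, so that messages
        // from one device cluster together and scans stay time-ordered.
        String serial = "SN-000123";                               // illustrative values
        long receivedAt = System.currentTimeMillis();
        Put put = new Put(Bytes.toBytes(serial + ":" + receivedAt));

        // Parsed fields go into a single column family (assumed name "d").
        put.add(Bytes.toBytes("d"), Bytes.toBytes("model"), Bytes.toBytes("example-model"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("os_version"), Bytes.toBytes("8.1"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("raw_payload"), Bytes.toBytes("<asup>...</asup>"));

        table.put(put);
        table.close();
    }
}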

Pentaho’s Data Integration platform (PDI) is used in ASUP.Next for overall orchestration of this workflow as well as implementation of transformation logic using Pentaho’s visual development solution for MapReduce. Pentaho’s main value to NetApp comes from shortening the development cycle and providing ETL and job control capabilities that span the entire data infrastructure, from HDFS, HBase and MapReduce to Oracle and SAP. Pentaho also worked closely with Cloudera to ensure compatibility with the latest CDH client libraries.

NetApp’s use of Hadoop as a scalable infrastructure for ETL is increasingly common. Pentaho is seeing this use case across a variety of industries including capital markets, government, telecommunications, energy and digital publishing. In general, the reasons these customers use PDI with Hadoop include:

  • Leveraging existing team members for rapid development and ongoing maintenance of the solution. Most organizations have a core ETL team that can bring a decade or more of subject matter expertise to the table. By removing the requirement to use Java, a scripting language or raw XML, team members are able to actively help with the build-out of jobs and transformations. This also lessens the need to recruit, hire and orient outside developers
  • Increasing the “logic density” of transformations. As you can see in the demo example below, it’s possible to express a lot of transformation logic in a single mapper or reducer task. This makes it possible to reduce the number of unique jobs that must be run to achieve a complete workflow. In addition to improving performance, this can result in designs that are easier to document and explain

PDI

  • Focusing on the “what”, not the “how” of MapReduce development. I was surprised (actually shocked) to see how many of the speakers at Strata were still walking through code examples to illustrate a development technique. The typical organization has no desire and little ability to turn itself into a software development shop. The language-based approach may work for the Big Data “Titans”, but not for businesses that need to implement Big Data solutions quickly and with minimal risk

Key Takeaways

Since this was a Pentaho-sponsored session, Ben summarized his experience working with the Pentaho Services and Engineering teams. His main points are illustrated in the photo above. Most of his points revolve around how Pentaho provided support during early development and testing. A large number of Pentaho employees contributed their time, energy and brain-power to ensure the project’s success. Many enhancements in PDI 4.4 are a direct result of improvements needed to support ASUP.Next use cases.

What has Pentaho learned from this project? Pentaho gained a number of valuable insights:

  • Big Data architectures to support low-latency use cases can be complex. Not only are multiple functional components needed, but they must integrate with existing systems such as enterprise data warehouses. These architectures demand a high degree of flexibility
  • Big Data projects require customers, system integrators and technology providers to “plumb the last 5%” as the solution is being developed. Inevitably, new capabilities are used for the first time and need to be fine-tuned to support real-world use cases, data volumes and encoding formats. A good example is PDI’s support for AVRO. Although we anticipated needing to adapt the existing AVRO Input Step to work with NetApp’s schemas, we only understood the full set of requirements after seeing their actual data during an early system test (a minimal example of reading Avro data generically follows this list)
  • Pentaho’s plugin-based architecture isolates the core “engines” from the layer where point-functionality is implemented. Pentaho was able to implement all of the required enhancements without a single architectural change. The AVRO enhancements and other improvements (such as HTableInput format support for MapReduce jobs) were all coded and field-deployed via updates to plug-ins, completely eliminating the possibility of introducing defects into PDI’s data flow engine.
  • Open source is a significant “enabler” making it easy for everyone to understand how integration works. It’s hard to overestimate the importance of code transparency. It allows the customer, the system integrators and the technology partners to get right to the point and experiment quickly with different designs.
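
For reference alongside the AVRO point above (and without implying anything about the PDI step’s internals), reading an Avro data file with the generic API looks roughly like this; because the writer’s schema is embedded in the file itself, the real adaptation work lies in mapping a customer-specific schema, such as NetApp’s, onto a transformation’s fields.

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroPeek {
    public static void main(String[] args) throws Exception {
        // Read records generically: the writer's schema travels inside the data
        // file, so nothing needs to be regenerated when that schema evolves.
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File(args[0]), new GenericDatumReader<GenericRecord>());

        System.out.println("Embedded schema: " + reader.getSchema().toString(true));
        while (reader.hasNext()) {
            GenericRecord record = reader.next();
            System.out.println(record);
        }
        reader.close();
    }
}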

It’s been a pleasure working with NetApp and its partners on the ASUP.Next solution. We look forward to continuing our work with NetApp as their use of device data evolves to exploit new opportunities not previously possible with their legacy application.

-Dave

Dave Henry, SVP Enterprise Solutions
Pentaho


Channel Partners – Carpe Diem!

March 1, 2013

Erik Nolten

In my first blog post for Pentaho, I would like to join our Italian channel partner BNova in a celebration. No, this is not about Italy’s recent elections (I don’t want to get involved in politics!) but a matter closer to my heart. On March 13, I will meet BNova’s Massimiliano Vitali and Serena Arrighi in London to take part in the IT Europa European IT and Software Excellence awards dinner, in recognition of the ground-breaking work BNova did for its customer Infocamere. You can read more about the Infocamere story here.

This is a proud milestone in our partnership with BNova, which was one of our very first European channel partners. We worked with BNova from day one, helping the company design its business strategy and marketing plan and training its people to sell and support Pentaho with confidence. BNova has built a thriving, profitable business whose revenue has tripled since its reseller agreement with Pentaho began in early 2009. All this has taken place against the backdrop of Italy’s and the Eurozone’s tough economic situation, proving that its services offer public and private sector companies excellent value for money.

Seize the day, or the year!

Our CEO Quentin Gallivan recently blogged that 2013 will be the year in which many companies go into production with big data analytics. Thanks to the groundwork we have done together, BNova, which has thrown its support behind big data, is now in pole position to profit from this trend as it emerges in Italy. You can learn more about BNova’s customers and services at the Big Data Conference in Rome on March 12.

When channel partners succeed, everybody wins. We know that many of you have limited marketing resources, so we’re here to help. If you have impressive customer stories like Infocamere, our team is on hand to help you promote and celebrate them by writing up case studies and press releases, co-hosting webinars and completing award applications.

I hope that BNova’s story will inspire other channel partners to get the most out of their partnerships with Pentaho. If you’d like to learn more, please contact me at enolten@pentaho.com.

Erik Nolten
Director Channel EMEA & APAC

