Ensure Your Big Data Integration and Analytics Tools are Optimized for Hadoop

March 27, 2013

Existing data integration and business analytics tools are generally built for relational and structured file data sources, and aren’t architected to take advantage of Hadoop’s massively scalable, but high-latency, distributed data management architecture. Here’s a list of requirements for tools that are truly built for Hadoop.

A data integration and data management tool built for Hadoop must:

  1. Run In-Hadoop: fully leverage the power of Hadoop’s distributed data storage and processing. It should do this via native integration with the Hadoop Distributed Cache to automate distribution across the cluster; generating inefficient Pig scripts doesn’t count. (See the sketch after this list.)
  2. Maximize resource usage on each Hadoop node: each node is a computer, with memory and multiple CPU cores. Tools must fully leverage the power of each node, through multi-threaded parallelized execution of data management tasks and high-performance in-memory caching of intermediate results, customized to the hardware characteristics of nodes.
  3. Leverage Hadoop ecosystem tools: tools must natively leverage the rapidly growing ecosystem of Hadoop add-on projects. For example, using Sqoop for bulk loading of huge datasets or Oozie for sophisticated coordination of Hadoop job workflows.
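
To make the first requirement concrete, here is a minimal, hypothetical sketch of what “running in-Hadoop” looks like at the plain MapReduce API level (an illustration only, not Pentaho’s implementation): the driver ships a small reference file to every node via the distributed cache, and each mapper enriches records locally instead of calling back to a central server. The class, path, and field names are assumptions for the example.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EnrichmentJob {

  // Mapper that enriches each input record using a lookup table shipped
  // to every node via the distributed cache.
  public static class EnrichmentMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
      // The cache file was added with the "#lookup" fragment, so it is
      // available under that name in the task's local working directory.
      try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split(",", 2);
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assume the first comma-separated field is the lookup key (illustrative).
      String[] fields = value.toString().split(",", 2);
      String enriched = lookup.getOrDefault(fields[0], "UNKNOWN");
      context.write(new Text(fields[0]), new Text(enriched));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "in-hadoop enrichment");
    job.setJarByClass(EnrichmentJob.class);
    job.setMapperClass(EnrichmentMapper.class);
    job.setNumReduceTasks(0);                 // map-only enrichment
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Ship the reference data to every node in the cluster; mappers open it
    // from the local cache rather than reaching back to a central service.
    job.addCacheFile(new URI("/reference/country_codes.csv#lookup"));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A tool that is truly built for Hadoop generates and coordinates jobs of this shape natively, inside the cluster, rather than pulling data out of Hadoop to transform it elsewhere.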

The widely distributed nature of Hadoop means accessing data can take minutes, or even hours. Data visualization and analytics tools built for Hadoop must mitigate this high data access latency:

  1. Provide end users direct access to data in Hadoop: and, after the initial access, deliver speed-of-thought response times. This must be done in a way that is simple and intuitive for end users, while giving IT the controls it needs to streamline and manage that access.
  2. Create dynamic data marts: make it quick and easy to spin off Hadoop data into marts and warehouses for longer-lived, high-performance analysis. (A minimal sketch of this pattern follows the list.)
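
To illustrate the second point, here is a minimal, hypothetical sketch (again, not Pentaho’s tooling) of spinning aggregated Hadoop output off into a relational data mart: it reads the part files a job wrote to HDFS and batch-loads them into a mart table over JDBC. The HDFS path, JDBC URL, table and column names are assumptions for the example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MartLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Connection details and table layout are illustrative; point them at your mart.
    try (Connection db = DriverManager.getConnection(
             "jdbc:postgresql://mart-host:5432/analytics", "etl", "secret");
         PreparedStatement insert = db.prepareStatement(
             "INSERT INTO daily_revenue (day, region, revenue) VALUES (?, ?, ?)")) {

      db.setAutoCommit(false);

      // Read the aggregated part files produced by the Hadoop job.
      FileStatus[] parts = fs.globStatus(new Path("/output/daily_revenue/part-*"));
      if (parts == null) parts = new FileStatus[0];

      for (FileStatus status : parts) {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(fs.open(status.getPath()), StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] f = line.split("\t");   // assume tab-delimited job output
            insert.setString(1, f[0]);
            insert.setString(2, f[1]);
            insert.setDouble(3, Double.parseDouble(f[2]));
            insert.addBatch();
          }
        }
        insert.executeBatch();               // one batch per part file
      }
      db.commit();
    }
  }
}
```

In practice, a purpose-built tool would generate this kind of data movement from a visual design, substitute a native bulk loader for row-by-row JDBC where one is available, and schedule it so the mart stays current.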

Learn how big data analytics provider Pentaho is optimized for Hadoop at www.pentahobigdata.com.

- Ian Fyfe, Pentaho

This blog post originally appeared on GigaOM at http://gigaom.com/2012/12/11/ensure-your-big-data-integration-and-analytics-tools-are-optimized-for-hadoop/


Is your big data protected?

October 17, 2012

This morning I participated in a panel discussing the topic of big data privacy at the San Francisco ISACA Fall Conference. The Information Systems Audit and Control Association (ISACA) is a professional association of individuals interested in information systems audit, control and security, with over 50,000 members in 141 countries. Other representatives on the panel were from eBay, PricewaterhouseCoopers, and CipherCloud.

Researching this topic and today’s discussion raised some interesting questions about the intersection of personal privacy and big data in this new age, where it is becoming technically and economically viable to store and analyze enormous data volumes, such as every click on a website and every commerce transaction. All this big data, from both internal systems and externally sourced enrichment data, can now be streamed into giant “data lakes” using open source based big data management platforms such as Hadoop and NoSQL databases. Visual big data development tools and end-user visualization technology such as Pentaho make it easier and easier for organizations to ingest, prepare and analyze this data, resulting in previously unattainable insights that can be used to optimize revenue streams and reduce costs.

However, how can we ensure this “big data” is protected and never used in ways that intrude on individual privacy? Mature and integrated big data analytics platforms such as Pentaho can enforce data access controls as well as audit data usage, but today there is no industry standard for permanently tagging how specific data elements may be used. This has the potential to lead to risks down the road: data collected with an individual’s consent may ultimately be used for purposes beyond the scope of the original consent policy. Is it time for government and industry standards bodies to tackle this issue with new technical standards that enforce data usage and aging policies on an ongoing basis? I have in mind something like an “XBRL” for data privacy: a standard taxonomy and semantics that enforces data usage policies regardless of source and platform.

Let me know what you think. Leave a comment below or tweet me at @ian_fyfe.

Ian Fyfe
Chief Technology Evangelist
Pentaho


Pentaho’s support of EMC Greenplum HD – what it means and why you should care

May 12, 2011

On Monday we announced our support for the EMC Greenplum distribution of Hadoop called EMC Greenplum HD. You can read about all the details in our press release, Pentaho Makes Hadoop Faster, More Affordable and Easier to Use with EMC.

This week we have been at EMC World in Las Vegas as a sponsor in booth 211 (if you are at the conference, come visit us). We’ve seen a great crowd and strong interest in Pentaho BI Suite for Hadoop, Pentaho Data Integration for Hadoop and our new native support for the Greenplum Database GPLoad high-performance bulk loader. Two questions that attendees keep asking are: “How is Pentaho supporting EMC Greenplum HD?” and “Why should I care?” You can read my answers below, with more details about our announcement in the press release and on the Pentaho & EMC web page.

How Pentaho supports EMC Greenplum for Hadoop
Pentaho is the only EMC Greenplum partner to provide a complete BI solution, from data integration through to reporting, analysis, dashboarding and data mining, from a single BI platform with shared metadata. Pentaho’s support and certification complement the Greenplum distribution of Hadoop by providing an end-to-end data integration and BI suite, with the cost advantages of open source, that enables:

  • An easy-to-use, graphical ETL environment for input, transformation, and output of Hadoop data;
  • Massively scalable deployment of ETL processing across the Hadoop cluster;
  • Coordination and execution of Hadoop tasks by enabling them to be managed from within the Pentaho management console;
  • Easy spin-off of high-performance data marts for interactive analysis;
  • Integration of data from Hadoop with data from other sources for interactive analysis.

Why this is a good thing and how it changes the industry
EMC Greenplum, in combination with key technology partners, is for the first time giving the industry an integrated, supported and certified data management and BI stack that includes storage, a MapReduce framework for processing unstructured data, an analytic database, predictive analytics and business intelligence.

By combining Pentaho’s powerful BI suite with the strength of EMC Greenplum’s storage and data management domain expertise, the industry benefits from maximum data throughput and significantly shorter implementation cycles for new Hadoop deployments.

Already an industry leader in data and storage, EMC is now well-positioned to play a pivotal role in commercializing Hadoop and giving businesses a simpler, more cost-effective way to perform advanced analytics at massive scale. For Hadoop to truly get to the next level, it needs to be as easy to install and use as off-the-shelf software.

If you are interested in evaluating Pentaho BI Suite and Pentaho Data Integration for the EMC Greenplum distribution of Hadoop, contact us at Pentaho_EMC@pentaho.com.

Ian Fyfe
Chief Technology Evangelist
Pentaho

Photos from the Pentaho booth at EMC World this week


How fast is lightning fast?

February 15, 2011

A huge congratulations to our partners at Ingres, who today announced that their lightning-fast database VectorWise has set a new record for the Transaction Processing Performance Council’s TPC-H benchmark at scale factor 100. Not only did VectorWise set a new standard, it blew the previous record holder out of the water, delivering 340% of the previous record’s performance.

Equally outstanding is the fact that VectorWise has not only changed the game in terms of performance; the database also comes in at a fraction of the price of its competitors. Forward-thinking innovation, high performance, and low cost… sound familiar? It should.

What does this mean to Pentaho users?

Pentaho and Ingres established a partnership last October with the goal of combining enterprise-class business intelligence with the speed and performance of the fastest analytical database on the market. With over 250,000 QphH (queries per hour) at the 100 GB scale factor, VectorWise is the epitome of agility at the database level. This means lightning-fast query response times, more iterative cycles, and, in essence, even more agile business intelligence.


Thoughts on last week’s Strata big data conference

February 8, 2011

Last week I attended O’Reilly’s Strata Conference in Santa Clara, California, where Pentaho was an exhibitor. I gave a 5-minute lightning talk during the preceding Big Data Camp “un-conference” on the topic “The importance of the hybrid data model for Hadoop-driven analytics,” focusing on combining big data analytic results with the data already in a firm’s existing systems to give business units answers to questions that were previously impossible or uneconomical to answer (something that, of course, Pentaho now makes possible). I also sat down for an interview with Mac Slocum, Online Managing Editor at O’Reilly; you can see the video below, where we discuss what kinds of businesses can benefit from big data technologies such as Hadoop, and what the tipping point is for adopting them.


The high quality of the attendees and the level of activity at this sold-out conference further confirm, I think, that although development work on big data solutions has been happening for a few years, the area is undergoing a quantum leap in adoption at businesses both large and small. Simply put, this technology allows them to glean “information” from enormous quantities of often unstructured or semi-structured data, something that in the past was simply not possible, or was eye-wateringly expensive, using conventional relational database technologies.

I found that the level of “Big Data” maturity among attendees was quite varied. Questions spanned the entire spectrum, from a few people asking things like “What is Hadoop?” to many along the lines of “Exactly how does Pentaho integrate with Hadoop’s Map-Reduce framework, HDFS, and Hive?” Some attendees were clearly still in the discovery and learning phase, but many were confidently moving forward with the idea of leveraging big data, and were looking for solutions that make it easier to work with big data technologies such as Hadoop to deliver new information and insights to their businesses. In fact, it is clear that a new type of data professional, the data scientist, is rapidly becoming mainstream: someone who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.

Ian Fyfe
Chief Technology Evangelist
Pentaho Corporation

Here are some in-action photos of our booth at the Strata Conference

