Existing data integration and business analytics tools are generally built for relational and structured file data sources, and aren’t architected to take advantage of Hadoop’s massively scalable, but high-latency, distributed data management architecture. Here’s a list of requirements for tools that are truly built for Hadoop.
A data integration and data management tool built for Hadoop must:
- Run In-Hadoop: fully leverage the power of Hadoop’s distributed data storage and processing. It should do this via native integration with the Hadoop Distributed Cache, to automate distribution across the cluster. Generating inefficient Pig scripts doesn’t count.
- Maximize resource usage on each Hadoop node: each node is a computer, with memory and multiple CPU cores. Tools must fully leverage the power of each node, through multi-threaded parallelized execution of data management tasks and high-performance in-memory caching of intermediate results, customized to the hardware characteristics of nodes.
- Leverage Hadoop ecosystem tools: tools must natively leverage the rapidly growing ecosystem of Hadoop add-on projects. For example, using Sqoop for bulk loading of huge datasets or Oozie for sophisticated coordination of Hadoop job workflows.
The widely distributed nature of Hadoop means accessing data can take minutes, or even hours. Data visualization and analytics tools built for Hadoop must mitigate this high data access latency:
- Provide end-users direct access to data in Hadoop: and after initial access, provide instant speed-of-thought response times. It must be done in a way that is simple and intuitive for end users, while providing IT with the controls they need to streamline and manage data access for end users.
- Create dynamic data marts: make it easy and quick to spin-off Hadoop data into marts and warehouses for longer-lived high-performance analysis of data from Hadoop.
Learn how big data analytics provider Pentaho is optimized for Hadoop at www.pentahobigdata.com.
- Ian Fyfe, Pentaho