“We haven’t really thought about that”

October 7, 2011

If you have been dealing with business-analytics-related sales activities, or are searching for that “right” business intelligence tool, you will find that most organizations:

  • Use manual processes, including desktop query and reporting tools, to answer their business questions.
  • Have “something” in place that they are not happy with or that is costing them too much money.
  • Have data in multiple silos that they need to access, consolidate and optimize.

Hence, they are usually looking for a low-cost business analytics alternative that can answer their business questions and deliver the ease of use and functionality they need, within their budget. Don’t believe me? Join the many business intelligence groups on LinkedIn, Quora and other social networking sites and you will see the barrage of questions from those looking for recommendations on BI and analytics tools.

I was on a call the other day with a well-known organization where the “prospect” stated: “We need basic reporting with the ability to access all of our data without moving it or massaging it.”

“Okay? That is absolutely possible,” I replied. “However, do you understand the pros and cons associated with doing that?”  [...]  Silence…not only could you hear crickets on their phone, you could hear them in the next conference room over. I took the proverbial saying “Silence is Golden” to another level. It became so uncomfortable that the Account Rep felt he should interject. I promptly cut him off to allow them to answer the question. After about a minute of silence that seemed like hours, they responded:

“We haven’t really thought about that.” – BINGO! Case closed! Next!

Hmm… “We haven’t really thought about that.”

That’s the problem! No one is taking the time to be proactive and think about what it is they need; they are instead just reacting: “Let’s see a demo,” “I just need reporting,” or “We need dashboards.” If that is the case, I would recommend you watch a video demonstration, which may intrigue you and prompt you to start thinking about what you really need. Then…come talk to me when you have more criteria to support your business intelligence and analytics initiative. :-)

I digress…in turn, I took this as an opportunity to educate them by asking pointed questions that would help them see what they actually need vs. what they thought they needed.

Eight questions I always ask customers to determine what they actually need

  1. Is the data you need to access all in one location? – No
  2. Does the data you have support a majority of questions that will be asked of it? – Don’t know
  3. Would you like answers to questions that occur on a regular basis? – Yes
  4. Would you like your users to answer their own questions on a random basis? – Yes
  5. Would you like your users to explore and discover answers to questions they did not think to ask? – Yes
  6. Do you have a predefined set of KPIs to manage and track business performance? – Yes
  7. Would you like your executives to see an at a glance view of those KPIs? – Yes
  8. Would you like to be aware of “something” when a defined threshold is met? – Yes

Alright, now we are getting somewhere. Each of those questions and responses clearly indicates that their needs go beyond the simple reporting they originally asked for. They require a solution that encompasses both data integration and content delivery (ETL, reporting, analysis and dashboards).

I further probed as to why they wanted to access all of their data “without moving it or massaging it.”

They replied: “Because building a data warehouse takes too much time and costs too much money.”

Wow! Clearly a response most likely seeded by a competitor who believes they can access all of the data directly where it sits, without building a data warehouse (which may be true for some of the competition out there). However, they usually leave out the fact that they MAY still need to ‘move or massage’ the data – they just don’t call it ETL, refer to their process as data integration, or even use the words “Data Warehousing.”

I further explained that data integration (ETL) does not have to be about building an Enterprise Data Warehouse (EDW). It can be about building operational data stores that are refreshed periodically to support the questions business users want to ask. It can involve federated queries, where the data is accessed from the source without having to stage it. It can also be about normalizing data into a small, easy-to-maintain data mart that supports speed-of-thought analytics for the power users.

Upon those points I provided a demonstration of Pentaho’s Agile BI capabilities, which involve a rapid, collaborative and iterative approach to building business analytic applications. At the end of the presentation, the prospect was amazed and pleased, stating, “This is exactly what we need.”  Ahh…music to my ears.

People…and I say this with great care…you cannot throw a business analytics tool into your organization and expect it to stick without asking some important questions. It is those answers that will help guide you to the right solution. And most importantly, you cannot put a business analytics tool on top of those “as-is” data sources without knowing what questions are going to be asked of it. I know it is impossible to anticipate every question that may be asked, but at least have those that are important to tracking your business performance and achieving your goals.

On the majority of calls that I participate in, it seems that organizations just don’t have the time to properly plan and discuss the criteria needed to implement a decision support system. Why? Because everyone is doing more with less these days and researching a BI and analytics tool is…usually…an ancillary responsibility for them. If that is the case, allow us to help you with your research and we will ask those questions you haven’t really thought about.

Regards,

Michael Tarallo
Director of Enterprise Solutions
Pentaho


Using Pentaho to Be Aware, Analyze, & Take Action

September 28, 2011

Be Aware

Denial of Service (DoS) attacks, IP spoofing, comment spamming and malware… are malicious activities designed to disrupt services used by many people and organizations. If you are taking advantage of the internet to run your business, create awareness of a product or service, or simply keep in touch with friends and family, your systems are at risk of becoming a target.

Successful internet “intrusions” can cost you money and even expose your identity. DoS attacks can prevent internet sites from running efficiently and in many cases can take them down entirely. IP Spoofing, frequently used in DoS attacks, is a means to “forge” an IP address and make it appear that the internet request or “attack” is coming from some other machine or location. And Comment Spamming, oh brother…where programs or people flood your site with random nonsense comments and links in an attempt to raise their own site’s search engine ranking or drive traffic to it:

“Nice informations for me. Your posts is been helpful. I wish to has valuable posts like yours in my blog. How do you find these posts? Check mind out [link here]”

Huh? – LOL

You may already have defensive measures in place to address some, if not all, of these things. There are programs, filters and services that you can use to look up, track and prevent this sort of activity. However, with the continuous stream of unique, newly produced malware, those programs and services are only as good as the latest “malicious” activity they have captured. No matter what, malware will eventually cause headaches for many people and organizations around the globe. Being able to monitor when something is “just not right” is a great step in the right direction.

Analyze

In September of 2010, I introduced the Pentaho Evaluation Sandbox. It was designed as a tool to assist with Pentaho evaluations as well as showcase many examples of what Pentaho can do. There have been numerous unique visitors to this site, both legitimate and, as I soon discovered…not. Prior to the site’s launch, using Pentaho’s Reporting, Dashboard and Analysis capabilities, I created a simple Web Analytics Dashboard that would highlight metrics and dimensions of the Sandbox’s internet traffic. It was a great example to demonstrate Pentaho Web Analytics embedded in a hosted application. Upon my daily review of the Site Activity dashboard, which includes a real-time visit strip chart monitor, I noticed an unusually large spike in page views that occurred within a one-minute time frame.

Now, that spike could be normal, provided a number of different people are surfing the site at the same time. However, it caught my attention as “unusual” because I knew what normal looked like. The dashboard quickly alerted me to something I should possibly take action on. So I clicked on the point at the peak to drill down into the page visit detail at that time. The detail report revealed that whoever, or whatever, was accessing the Sandbox was rapidly traversing the site’s page map and directories looking for holes in the system. I also noticed that all the page views came from the same IP address in under a minute. Hmmm, I thought. “That could be a shared IP, a person or even a bot ignoring my robots.txt rules.” But…as I scrolled down, I discovered attempts to access the .htaccess and passwd files that protect the site. I immediately clicked on the IP address data value in the detail report (in my admin version of the report), which linked me to an IP Address Blacklist look-up service. The Blacklist look-up informed me that the IP address had been previously reported and was listed as suspicious for malicious activity. BINGO! Goodbye, whoever you are!

Take Action

I quickly took action on my findings by banning the IP address from the system to prevent any further attempts to access the site. I then began to think of some random questions I needed to ask of the data, so I switched gears and turned to Pentaho Analysis. Upon further analysis of the site’s data using a Pentaho Analyzer Report, I was able to see evidence of IP Spoofing and even Comment Spamming coming from certain IP address ranges. The action I took next was to block the IP address ranges that had been accessing the site in this manner. In addition, I created a contact page for those who may be accessing the site legitimately but got blocked because their IP fell within those ranges.
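
For those curious about the mechanics, blocking an address or range on an Apache-hosted site can come down to a few lines of web server configuration. Below is only a sketch using Apache 2.2 mod_authz_host directives; the addresses are illustrative placeholders, not the actual offenders from my logs:

    # .htaccess – deny one address and one range (Apache 2.2 syntax)
    Order Allow,Deny
    Allow from all
    # illustrative addresses – substitute the offending IPs from your own logs
    Deny from 198.51.100.23
    Deny from 203.0.113.0/24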

Wow, talk about taking action on your data huh?

It is not a question of if, but when, an unwarranted attempt will occur on your systems. Make sure you take the appropriate steps to protect them, using software and services that will make you aware of problems. My experience may be an oversimplification, but it is a great example of how I used Pentaho to make me aware of a problem and turn that raw data into actionable information.

Special thanks to Marc Batchelor, Chief Engineer and Co-Founder of Pentaho, for helping me explore the corrective actions to take to protect the Pentaho Evaluation Sandbox.

Regards,

Michael Tarallo
Director of Enterprise Solutions
Pentaho


The Right Tool For the Right Job – Part 1

September 20, 2011

All too Common

You have questions. How do you get your answers? The methods and tools used to answer business questions vary per organization. For those without an established BI solution, desktop database query and spreadsheet tools are…all too common. And…if there is a BI tool in place, its usage and longevity depend on its capabilities, the cost to maintain it and its ease of use for both development staff and business users. Decreased BI tool adoption, due to rising costs, lack of functionality and complexity, may increase dependencies on technical resources and other home-grown solutions to get answers. IT departments have numerous responsibilities. Running queries and creating reports may be ancillary, which can result in information not getting out in a timely manner, questions going unanswered and decisions being delayed. Therefore, the organization may not be leveraging its BI investment for what it was originally designed to do…empower business users to create actionable information.

(Read the similar experiences of Pentaho customer Kiva.org here)

Six of One, Half a Dozen of the Other

The BI market is saturated with BI tools, from the well-known proprietary vendors to the established commercial open source leaders and niche players. There are choices that include the “Cloud,” on-premise, hosted (SaaS) and even embedded. Let’s face it and not complicate things…most, if not all, of the BI tools out there can do the same thing in some form or fashion. They are designed to access, optimize and visualize data that will aid in the answering of questions and the tracking of business performance. Dashboards, Reporting and Analysis fall under a category I refer to as “Content Delivery.” These methods of delivering information are the foundation of a typical BI solution. They provide the most common means for tracking performance and identifying problems that need attention. But…did you know there is usually some sort of prep work to be done before that chart or traffic light is displayed on your screen or printed in that report? That prep work can range from simple ETL scripting to provisioning more robust Data Warehouse and Metadata Repositories.

Data Integration

Content Delivery should begin with some sort of Data Integration. In my 15 years in the BI space, I have not seen one customer or prospect challenge me on this. They all have “data” in multiple silos. They all have a “need” to access it, consolidate it, extrapolate it and make it available for analysis and reporting applications. Whether they use it as second-hand data loaded into an Enterprise Data Warehouse for historical purposes, or produce Operational Data Stores, they are using Data Integration. Whether they are writing code to access and move the data, using a proprietary utility or even some ETL tool, they are using Data Integration. It is important to realize that not all data needs to be “optimized” out of the gate, as it is not only the data that is important: it is how it will be used in the day-to-day activities supporting the questions that will be asked. This requires careful planning and consideration of the overall objectives that the BI tools will be supporting.

Well, how do I know what tools to use? – Stay Tuned

With so many tools available, how will you know what is right for your organization? Thorough investigation of the tools through RFIs, RFPs, self-evaluation and POCs is a good start. However, make sure you are selecting tools based on their ability to solve your specific current AND future needs, and not solely because they look cool and provide the “sex and sizzle” the executives are after. The typical ask is always reporting, analysis and dashboards. Few realize that there is a lot more to it than those three little words. In the next part of this article I will cover a few of the most common “BI Profiles” found in almost every organization. In each profile I will cover the pains, symptoms and impacts that plague organizations today, as well as the solution strategies and limitations you should be aware of when looking at Pentaho.

Stay tuned!

Regards,

Michael Tarallo
Director of Enterprise Solutions
Pentaho

This blog was originally posted on http://michaeltarallo.blogspot.com/ on September 19, 2011

Facebook and Pentaho Data Integration

July 15, 2011

Social Networking Data

Recently, I have been asked about Pentaho’s product interaction with social network providers such as Twitter and Facebook. The data stored within these “social graphs” can provide its owners with critical metrics about their content. By analyzing trends in user growth and demographics, as well as the consumption and creation of content…owners and developers are better equipped to improve their business with Facebook and Twitter. Social networking data can already be viewed and analyzed using existing tools such as FB Insights, or even purchasable 3rd-party software packages created specifically for this purpose. Now…Pentaho Data Integration in its traditional sense is an ETL (Extract Transform Load) tool. It can be used to extract and extrapolate data from these services and merge or consolidate it with other relevant company data. However, it can also be used to automatically push information about a company’s product or service to the social network platforms. You see this in action today if you have ever used Facebook and “Liked” something a company had to offer. At regular intervals, you will sometimes note unsolicited product offers and advertisements posted to your wall from those companies. A great and cost-effective way to advertise to the masses.

Application Programming Interface

Interacting with these systems is made possible because they provide an API (Application Programming Interface). To keep it simple, a developer can write a program in “some language” to run on one machine which communicates with the social networking system on another machine. The API can leverage a 3GL such as Java or JavaScript or, even simpler, RESTful services. At times, software developers/vendors will write connectors against the native API that can be distributed and used in many software applications. These connectors can offer a quicker and easier approach than writing code alone. It may be that a future release of Pentaho Data Integration includes an out-of-the-box Facebook and/or Twitter transformation step – but until then, the RESTful APIs provided work just fine with the simple HTTP POST step. Using Pentaho Data Integration with this out-of-the-box component allows quick access to social network graph data. It can also provide the ability to push content to applications such as Facebook and Twitter without writing any code or purchasing a separate connector.

The Facebook Graph API

Both Facebook and Twitter provide a number of APIs; one worth mentioning is the Facebook Graph API (don’t worry Twitter, I’ll get back to you in my next blog entry).

The Graph API is a RESTful service that returns a JSON response. Simply stated, an HTTP request can initiate a connection with the FB systems and publish or return data that can then be parsed with a programming language – or, better yet, without programming, using Pentaho Data Integration and its JSON Input step.

Since the FB Graph API provides both data access and publish capabilities across a number of objects (photos, events, statuses, people, pages) supported in the FB social graph, one can leverage both automated push and pull capabilities.
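
To make the plumbing concrete, here is a minimal sketch in plain Java of roughly the request/response exchange that the HTTP and JSON Input steps handle for you. It is illustrative only: it assumes the org.json library is on the classpath, uses a public page name as a stand-in, and anything non-public would also require an OAuth access token:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.json.JSONObject;  // assumes the org.json library is on the classpath

    public class GraphApiSketch {
        public static void main(String[] args) throws Exception {
            // Request a public object from the Graph API (the page name is illustrative)
            URL url = new URL("https://graph.facebook.com/pentaho");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            // Read the JSON response body
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            in.close();

            // Parse the JSON and pull out a couple of fields
            JSONObject obj = new JSONObject(body.toString());
            System.out.println("name:  " + obj.optString("name"));
            System.out.println("likes: " + obj.optLong("likes"));
        }
    }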

If you are interested in giving this a try or seeing this in action, take a look at this tutorial available on the Pentaho Evaluation Sandbox.

Kind Regards,

Michael Tarallo
Director of Enterprise Solutions
Pentaho


A Plug for the PLUG

May 24, 2011

On Thursday, May 19, I was a fly on the wall at the third Pentaho London User Group (PLUG) meeting hosted by Dan Keeley and Tom Barber.

[Photo: Dan opening the meeting, spurred on by a little ‘flower power’]

The theme for this meeting was column-oriented databases.

Tom kicked off the proceedings with a short demo of WebDetails CDF (community dashboard framework), a Pentaho extension for making light work of building dashboards.  You can read more about this in Tom’s blog.

Next, we fastened our seatbelts for a lively presentation by special guest John Smedley of Ingres, who briefed us on VectorWise, an exciting open source community project originating from the University of Amsterdam, later acquired by Ingres, for processing data 10x–75x faster than traditional databases. It achieves this by tapping into the power of modern commodity CPUs with a database engine that leverages vector-based processing and on-chip memory.

John’s priceless quote of the evening: “de-normalizing can make your life easier”.  (Hey, I’m no database techie, but I can get down with that sentiment)

[Photo: about the only time John stood still long enough for me to take his picture]

The evening concluded with an Academy Awards-style recorded video demonstration hosted by Nick Goodman, CEO of Seattle-based Pentaho partner DynamoBI.  Nick walked us through LucidDB, billed as the only open source database purpose-built from day one for doing business intelligence.  Among the interesting things Nick showed us was how LucidDB allows users to perform ‘back-in-time’ queries based on data sets from an earlier time period, reminiscent of the Wayback Machine.  LucidDB’s columnar storage means data can be compressed into a much smaller space than with traditional database structures.

The meeting was generously hosted by Skills Matter, an organisation that supports the Agile and Open Source developer community through free events, training courses, conferences and publishing.

PLUG needs your support and involvement to flourish, so if you’re in the Southeast and want to take part or recommend a theme for the next meeting, visit here or contact Dan Keeley.

The next meeting will be held in September or October – we hope to see you there!

Sarah Lafferty
Director and co-founder
Round Earth Consulting


High availability and scalability with Pentaho Data Integration

March 31, 2011

“Experts often possess more data than judgment.” – Colin Powell….hmmm, those experts surely are not using a scalable Business Intelligence solution to optimize that data and help them make better decisions. :-)

Data is everywhere! The amount of data being collected by organizations today is experiencing explosive growth. In general, ETL (Extract Transform Load) tools have been designed to move, cleanse, integrate, normalize and enrich raw data to make it meaningful and available to knowledge workers and decision support systems. Once data has been “optimized,” only then can it be turned into “actionable” information using the appropriate business applications or Business Intelligence software. This information could then be used to discover how to increase profits, reduce costs or even write a program that suggests what your next movie on Netflix should be. The capability to pre-process this raw data before making it available to the masses becomes increasingly vital to organizations that must collect, merge and create a centralized repository containing “one version of the truth.” Having an ETL solution that is always available, extensible and highly scalable is an integral part of processing this data.

Pentaho Data Integration

Pentaho Data Integration (PDI) can provide such a solution for many varying ETL needs. Built upon an open Java framework, PDI uses a metadata-driven design approach that eliminates the need to write, compile or maintain code. It provides an intuitive design interface with a rich library of prepackaged, pluggable design components. ETL developers with skill sets ranging from novice to Data Warehouse expert can take advantage of the robust capabilities within PDI immediately, with little to no training.

The PDI Component Stack

Creating a highly available and scalable solution with Pentaho Data Integration begins with understanding the PDI component stack.

● Spoon – IDE for creating Jobs and Transformations, including the semantic layer for the BI platform
● Pan – command line tool for executing Transformations modeled in Spoon
● Kitchen – command line tool for executing Jobs modeled in Spoon
● Carte – lightweight ETL server for remote execution
● Enterprise Data Integration Server – remote execution, version control repository, enterprise security
● Java API – write your own plug-ins or integrate into your own applications

Spoon is used to create the ETL design flow in the form of a Job or Transformation on a developer’s workstation. A Job coordinates and orchestrates the ETL process with components that control file movement, scripting, conditional flow logic and notification, as well as the execution of other Jobs and Transformations. The Transformation is responsible for the extraction, transformation and loading or movement of the data. The flow is then published or scheduled to the Carte or Data Integration Server for remote execution. Kitchen and Pan can be used to call PDI Jobs and Transformations from your external command-line shell scripts or 3rd-party programs. There is also a complete Java SDK available to integrate and embed these processes into your Java applications, as the sketch below illustrates.

Figure 1: Sample Transformation that performs some data quality and exception checks before loading the cleansed data
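
As a taste of that SDK, here is a minimal sketch of embedding a transformation in a Java program. The class names are from the Kettle 4.x API; the transformation path is a placeholder:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle environment (loads plug-ins and configuration)
            KettleEnvironment.init();

            // Load a transformation modeled in Spoon (the path is a placeholder)
            TransMeta transMeta = new TransMeta("/etl/load_sales.ktr");

            // Execute it and wait for completion
            Trans trans = new Trans(transMeta);
            trans.execute(null);  // no command-line parameters
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors");
            }
        }
    }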

PDI Remote Execution and Clusters

The core of a scalable, available PDI ETL solution involves the use of multiple Carte or Data Integration servers defined as “Slaves” in the ETL process. The remote Carte servers are started on different systems in the network infrastructure and listen for further instructions. Within the PDI process, a Cluster Scheme can be defined with one Master and multiple Slave nodes. This Cluster Scheme can be used to distribute the ETL workload in parallel across these multiple systems. It is also possible to define Dynamic Clusters, where the Slave servers are only known at run-time. This is very useful in cloud computing scenarios where hosts are added or removed at will. More information on this topic, including load statistics, can be found here, in an independent consulting white paper created by Nick Goodman of Bayon Technologies, “Scaling Out Large Data Volume Processing in the Cloud or on Premise.”

Figure 2: Cx2 means these steps are executed clustered on two Slave servers
All other steps are executed on the Master server
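
For the curious, a Slave is simply a Carte instance started with the host name and port it should listen on. A minimal sketch (the addresses are illustrative); the Cluster Scheme defined in Spoon then references these slave entries:

    # start a Carte slave on each box (host and port are illustrative)
    sh carte.sh 192.168.1.101 8081
    sh carte.sh 192.168.1.102 8081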

The Concept of High Availability, Recoverability and Scalability

Building a highly available, scalable and recoverable solution with Pentaho Data Integration can involve a number of different parts, concepts and people. It is not a check box that you simply toggle when you want to enable or disable it. It involves careful design and planning to prepare for and anticipate the events that may occur during an ETL process. Did the RDBMS go down? Did the Slave node die? Did I lose network connectivity during the load? Was there a data truncation error at the database? How much data will be processed at peak times? The list can go on and on. Fortunately, PDI arms you with a variety of components, including complete ETL metric logging, web services and dynamic variables, that can be used to build recoverability, availability and scalability scenarios into your PDI ETL solution.

For example, Jens Bleuel, Managing Consultant in EMEA, developed a PDI implementation of the popular Watchdog concept: a solution that includes checks to monitor whether everything is on track as it executes its tasks and events. Visit the link above for more information on this implementation.

Putting it all together – (Sample)

Diethard Steiner, an active Pentaho Community member and contributor, has written an excellent tutorial that explains how to set up PDI ETL remote execution using the Carte server. He also provides a complete tutorial (including sample files provided by Matt Casters, Chief Architect and founder of Kettle) on setting up a simple “available” solution to process files using Pentaho Data Integration. You can get it here. Please note that advanced topics such as this are also covered in greater detail in our training course, available here (designed by our Managing Consultant in EMEA, Jens Bleuel).

Summary

When attempting to process the vast amounts of data collected on a daily basis, it is critical to have a Data Integration solution that is not only easy to use but easily extensible. Pentaho Data Integration achieves this extensibility with its open architecture, component stack and object library, which can be used to build a scalable and highly available ETL solution without exhaustive training and with no code to write, compile or maintain.

Happy ETLing.

Regards,

Michael Tarallo
Senior Director of Sales Engineering
Pentaho

This blog was originally published on the Pentaho Evaluation Sandbox, a comprehensive resource for evaluating and testing Pentaho BI.


Business Intelligence – In the heads of people – not the software

January 14, 2011

Business Intelligence is part of a bigger picture and not a particular software package. BI in general involves many different factors in order to be successful, no matter what software or skill set is being used. It requires the knowledge and expertise of the individuals who know it best: the customer, who knows what problems they have or want to prevent, as well as the software vendor and/or consultants, who know how to provide solutions for those problems.

Analogy 1: Imagine you want to hang a picture. You have some screws but do not have any tools. You go to the hardware store and say, “I need a hammer,” because you think that this is the proper tool to get the job done. The proper response would be for the salesperson to ask, “What do you need the hammer for?” or “What project will the hammer help you with?” When he finds out that you want to hang a picture and all you have are screws, the salesperson may suggest a screwdriver instead of the hammer, or perhaps give you nails along with the hammer. Better yet, he may offer something that is even easier to use or costs less, like those new gravity hooks. The salesperson needs to ask the proper questions to help the customer find the proper solution. He may even offer a better solution the customer had no idea about. The salesperson was the expert who provided the knowledge for the proper solution or alternative.

BI not only involves helping the customer with existing problems but also involves helping them see problems that they may not know exist.

Analogy 2: Imagine having a piece of food on your lip and not knowing it. Your friend says, “Hey, you got something on your lip!”, and points it out to you. You then take the appropriate action to resolve that problem by wiping it away with a napkin. The napkin was the tool; the wiping was the action, taken from the knowledge and direction provided by your friend. Your friend helped you discover a problem you didn’t know was there.

These two analogies show that Business Intelligence is about collaboration, communication, discovery, knowledge, insight, direction and action, to name just a few. These factors, along with the proper software and services, can provide an organization with a successful BI implementation. The software can be part of a specific BI Platform or simply an application development environment used to create business applications that provide knowledge from data. Business Intelligence is not just about collecting data and reporting; it is a methodology in which experts can provide assistance. If you would like your organization to succeed with BI, it is extremely important to understand its factors and learn how to analyze and use the data created by this methodology.

BI is part of a bigger picture; it is in the heads of people and not just in the software.

Michael Tarallo
Director of Sales Engineering
Pentaho Corporation

