Building Your Big Data Team in 2015 – Top 5 Pieces of Real-World Advice

January 27, 2015

There’s lots of advice out there on building a big data team, from industry or expert analysts and leading publications. But we wanted to see how this is being implemented in real life, so we talked to the real world big data mavericks – those who’ve faced the challenge of gaining true business value from big data and succeeded.  They shared real-world insights into how they made it happen and the advice they’d give to those ready to take the plunge. (Scroll to the bottom to meet our mavericks.)

1. Clearly define your business goal, and don’t be afraid to start small.
“When you work with big data, you have to know first what you’re going to do with that data” – Marc Hayem, VP of Platform Transformation, RichRelevance

It may seem obvious but is often overlooked.  Whether you’re a data-driven company whose entire business model revolves around crunching big data, or a manufacturer looking to optimize your operational efficiency using machine data, you need to be clear about the challenge you’re trying to tackle with big data. Omitting this step, you risk ending up with inappropriate technologies, a lack of executive support, and an ill-prepared team. Saad Khalid, a product manager at Paytronix, echoes the advice about starting small:

“Starting small to get into big data can be useful, because you can get lost in a lot of technical jargon: Hadoop, Hive, MapReduce. My advice to people considering big data as a project would be to take it slow, have a smaller project in mind where you can actually think about the questions that you want to answer and achieve results…. Have a team that is dedicated towards that goal, and those results.  Start slow and then grow big and then scale your project.” Saad Khalid

Andrew Robbins is CEO of Paytronix, a company that helps restaurants build brand loyalty and get rich, big-data driven insights into their customers’ behavior for better sales and marketing.  The questions that big data could answer for them were endless – but in the end, zeroing in on one small, simple question – “Who had breakfast for dinner?” – helped them define the scope of their entire project:

“For us, we sat around and thought of so many ideas and it became so big and we boiled it down to a single question and it was who had breakfast for dinner?  In that question, it seems kind of simple.  The “who” is pretty complicated.  Who are the people?  Can you give me the collection and what are they like?  What are their demographics?  The “had breakfast,” what does that mean?  You got to get into the details of a check.  Is it scrambled eggs?  …All of those pieces led to a simple thing that we could all shoot for and that was our minimal viable product and you can get to it quicker and then the team goes, ‘Aha.  That’s success.’” Andrew Robbins
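To make the idea concrete, here is a minimal sketch of how a question like “Who had breakfast for dinner?” might turn into code. Everything in it is an assumption for illustration: the record layout, the list of breakfast items, and the 5 pm dinner cutoff are hypothetical, not Paytronix’s actual data model.

```python
from datetime import datetime

# Hypothetical menu categories and dinner window; a real project would
# take these from the restaurant's own menu data and business rules.
BREAKFAST_ITEMS = {"scrambled eggs", "pancakes", "omelette"}
DINNER_START_HOUR = 17  # assume dinner starts at 5 pm


def had_breakfast_for_dinner(checks):
    """Return the sorted guest IDs with a breakfast item on a dinner-time check."""
    guests = set()
    for check in checks:
        ordered_at = datetime.fromisoformat(check["ordered_at"])
        is_dinner = ordered_at.hour >= DINNER_START_HOUR
        has_breakfast = any(item in BREAKFAST_ITEMS for item in check["items"])
        if is_dinner and has_breakfast:
            guests.add(check["guest_id"])
    return sorted(guests)


# Made-up sample checks for illustration only.
sample_checks = [
    {"guest_id": "g1", "ordered_at": "2015-01-20T19:30:00", "items": ["scrambled eggs", "coffee"]},
    {"guest_id": "g2", "ordered_at": "2015-01-20T08:15:00", "items": ["pancakes"]},
    {"guest_id": "g3", "ordered_at": "2015-01-20T18:05:00", "items": ["burger", "fries"]},
]

print(had_breakfast_for_dinner(sample_checks))
```

Even in this toy version, each part of the question (the “who”, the “breakfast”, the “dinner”) forces a concrete decision about the data, which is the scoping effect Andrew describes.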

Finally, as you define your scope, make sure the projects have a measurable return to achieve your business goals.  Because big data projects can be complex, people need to be motivated to work through the challenges and that happens when your project impacts the business in a demonstrable way. Marc Hayem is VP of Platform Transformation at RichRelevance, a company that helps retailers provide personalized recommendations to shoppers.

“I think the important thing when you get into big data is to be able to prove the value rapidly, which is to really pick the right problem and demonstrate very rapidly that you can find solutions to that problem. That you can create value around that problem… If you have identified that something that will give you a competitive advantage and the technology is applied right, then the payoff can be monumental.” Marc Hayem

2. Choose your technologies carefully, based on the challenge you’re trying to address and your organizational culture.
“Pick the tools that work and ignore all the religion that’s out there.” – Andrew Robbins, CEO, Paytronix

You should only start to investigate your technologies once you define your problem.  Many of the big data leaders we spoke to acknowledged that the big data technology ecosystem can be complex, and cautioned against being driven by the current frenzy to adopt a particularly hot technology.  Their advice is unanimous: start with one problem, start small, and work backwards from there in picking your technologies.  Always pick the tools that solve the problem at hand, and look for tools that increase your team’s productivity rather than create obstacles.  Andrew Robbins discussed how heated the debate can be:

“I think one of the things that surprised me the most was just how fragmented the tool sets are and it really seems like the wild west of different components and how religious people are that you’re using a component…. ‘If you’re using Hive, you must be crazy.  You must use Impala.  Anybody who is not using Impala is just … that doesn’t make sense.’ Pick the tools that work and ignore all the religion that’s out there.  Be practical.  Pick the tools that work.  You can always switch them out in the future if you need to.” Andrew Robbins

Marc Hayem shares his perspective on what makes a good fit:

“Evaluating the tools can be overwhelming. There are new tools that come out constantly. There is a tendency to look at always the next shiny thing that comes out and think this will solve even more magical problems. At some point, you have to settle. You have to choose your tools. The tools that you’re comfortable with. That you have the tools for. That you have the staff for, more importantly. This is basically your tool set. That’s what you’re going to use. There is definitely with this ecosystem of open source tools, a tendency to go after the next big thing, constantly. It’s something that you have to fight a little bit. We have used a lot of open source software… Essentially, we believe that when you use open source solutions there is a community behind those tools. The tools get better over time, very, very rapidly.” Marc Hayem

Marc’s comments illuminate that in evaluating technologies, vendors, and platforms it’s important to consider what’s a good fit for your organization based on common values like transparency and innovation. Paytronix’s head of technology, Stefan Kochi, also believes this is an important factor:

“Once we decided to implement a big data solution, we started looking at different providers, different vendors. The initial guiding principles were the ones that we use for other decisions we have made, such as they have to feel like an extended part of our company. … Some of the things we look for are – what was the technology based on? Open source versus private? How easy is it for them to innovate? Innovation is critical. Do they serve things that we need? We have some guiding principles that we apply in general: the transparency of the company, how easy it is to communicate with, and how solid and mature the product is. Pentaho was an attractive option early on. They use open source technologies, and that was very attractive to us. Paytronix uses a lot of open source technologies, so right there you have a connection with the approach that Pentaho has taken.”  Stefan Kochi

3. Identify key players on a cross functional team

While in some cases a big data implementation can be done with one person or a very small team, the general consensus is that a dedicated, cross functional team gives you the best chance of success. Such a team is critical to ensuring that business needs are understood and that data is successfully prepared and accessible to meet those needs. So what roles are needed?  We asked our big data leaders and internal big data services team to comment on what is working and compiled the results.  While structures vary from organization to organization, here are some key roles to consider.

  • Executive Sponsor – This senior level person understands the business needs, rallies support, and funds the solution. Andrew Robbins is an example:

“Paytronix is full of bright, curious, empathetic people. I wasn’t the star of this …we have a really bright engineer who is at the forefront of thinking about [big data] and I probably just provided some air cover so that we’re safe to go after it and be successful.”  Andrew Robbins

  • Business User – This individual defines and prioritizes the business requirements and then translates them into high level technical requirements.

“My favorite part about what I do currently is gathering requirements and actually really thinking about what our next product’s going to be.  What our next feature’s going to be.  Talking to our clients, and talking to my internal clients, which is the rest of the team here.  Really start to think about a new feature, a new product, and gathering those requirements, and thinking about design.  I love working with the engineering team, and really trying to think about how to approach problems in several different manners, and really try to come up with a creative solution so our clients can benefit from it.” Saad Khalid

  • Subject matter expert – Especially important in non-technical industries where the gap between a data developer and the business user can be very large, this person knows the business intimately.
  • Data scientist – This individual understands the data and can extract information from that data to meet the business requirements. The data scientist ideally has domain knowledge, a statistical analysis background, and a basic understanding of computer science.

“As I mentioned earlier, we have hundreds of algorithms that basically constantly try to decide what is best for our customer. You have to be able to build those algorithms. You have to understand the mathematics behind it. You have to understand the technologies. You also need very good data scientists. You need people who understand very well the mathematics behind the predictive modeling that takes place in personalization.” Marc Hayem

  • Data Engineer/Software Engineer – This individual has a software engineering background and experience in developing software for distributed or multi-threaded applications. This person typically is a server side Java developer who can implement ETL at scale using various Big Data technologies. Someone with experience in statistics and machine learning is a plus.

“Paytronix has a small engineering group. We’re not a large firm, but we’re fortunate to have a very talented engineering team. Those engineers who have done a lot of existing development of the product are also able to explore and go from an idea and a concept to a real product…. There is a lot to manage when it comes to big data.  We have a dedicated team that looks after our structure and architecture.  There is an architect who oversees big data and we also have 2 software developers. You need to have a dedicated team to take care of this structure.  It is extremely important.” Saad Khalid

  • Data journalist – We’re hearing more and more about a data journalist – someone who looks at the data from a storytelling aspect. Forbes even predicts that storytelling will be the hot new job in big data analytics in 2015. This person serves as the link to the larger audience for the data, making it understandable to the audience consuming the data.
  • Platform/Systems Architect – This is a senior technical architect responsible for designing the entire end-to-end solution that meets the business requirements for both short-term deliverables and long-term needs. Typically this person has a software engineering background in large scale clustering/distributed processing systems and is responsible for technology selection and the implementation process.  The architect defines the big data blueprints, or architectural model, that an organization will implement.

“Another lesson that Paytronix has learned is the importance of building a working model first. You can get caught up in the big picture, being very strategic, but you have to build the working model first. If you have a billion transactions that you want to ETL, you should probably ETL a thousand. You get an idea how the systems are working with a thousand transactions. Another important thing that we learned is that you have to be very focused on system integration, and an architect should always be present as you connect. Systems talking to each other is like building many bridges. You have people focus on each bridge, but someone needs to oversee all the bridges together.” Stefan Kochi
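Stefan’s “ETL a thousand before a billion” advice can be sketched in a few lines. This is only an illustration under stated assumptions: the `extract`, `transform`, and `run_etl` functions and the record fields are hypothetical stand-ins, and the in-memory list takes the place of a real target system.

```python
from itertools import islice


def extract(n_total):
    """Yield hypothetical raw transactions lazily, so nothing is loaded up front."""
    for i in range(n_total):
        yield {"id": i, "amount_cents": str(100 + i)}


def transform(record):
    """Convert the amount from a string in cents to a float in dollars."""
    return {"id": record["id"], "amount": int(record["amount_cents"]) / 100}


def run_etl(source, limit=None):
    """Run extract-transform-load over at most `limit` records (all if None)."""
    loaded = []  # stands in for the real target system
    for record in islice(source, limit):
        loaded.append(transform(record))
    return loaded


# Trial run: pull only the first thousand of a billion potential records.
sample = run_etl(extract(1_000_000_000), limit=1000)
print(len(sample), sample[0], sample[-1])
```

Because the source is consumed lazily, the same pipeline code can serve both the thousand-record trial and the full run; only the `limit` changes.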

  • IT/Operations manager – This person operationalizes, deploys, manages, and monitors the systems. They should understand Hadoop and big data to successfully deploy across systems and scale to hundreds or thousands of servers, instead of just a few.  Yug Muppala, a software engineer at RichRelevance, points out the critical nature of this role:

“We at RichRelevance have a really good operations team that keeps our servers up and running all the time. That is really important: they make the cluster available to us and keep the health of the cluster up and running.”  Yug Muppala

4. Be creative to make the most of your human and technology resources
“Instead of search for the mythical people, we would take people we know and create a team that could be successful”  - Andrew Robbins, Paytronix

While the above list provides a general guideline for a big data team, it’s only a starting point.  There’s a well-known meme about how looking for the perfect data scientist – who combines analytics with business savvy, development skills, and mathematics – is like looking for a unicorn: it doesn’t exist.  Companies who’ve successfully launched big data initiatives haven’t used unicorns – they’ve been innovative and clever in how they resource their projects and leverage their teams.  Andrew Robbins acknowledges this:

“When you make the move to Big Data, what are you concerned about?  What we’re concerned about in Paytronix and probably the biggest one is can you be successful, and then you go back from that and you say, ‘Where are the people?  What people are going to implement this solution?’  Is it internal people or are we going to go hire people?  Then people talk about data scientists.  Have you seen a data scientist?  Do you live next to one?  Can you find them on the street? I think one of the things that made us successful at Paytronix was to say that instead of search for the mythical people, we would take people we know and create a team that could be successful.  To us, a data scientist is a function, not a person.  Data science might include a strategist, an analyst and an engineer.  In between them, they can satisfy the need of data science.” Andrew Robbins

Creative thinking and innovative technologies offer other options to remove the need for unicorns.  There are many emerging technologies that help minimize the dependence on coding and other hard to find skillsets – for smaller companies that can’t afford data scientists, these technologies are attractive options. Yug Muppala, a software engineer at RichRelevance, talks about why they use Hive:

“Hive is very easy for anyone with SQL knowledge to start writing, querying the Hadoop cluster. That’s a big advantage. Not many people have knowledge around Pig scripts and stuff like that and most of our data science team is very comfortable with writing SQL queries. Hive gives them that advantage so that they could just go write queries themselves instead of having to wait for someone else to write the extraction for them.” Yug Muppala

Pentaho’s own visual interface helps here, by reducing the amount of code needed to join data, and reducing the time Paytronix spent on this task from two weeks to a mere hour and a half:

“We have some data in our transactional database and we have some data in Hadoop. Joining these two together was a hassle before and Pentaho helped us solve this problem…. It’s a simple step within Pentaho… We don’t have to write a lot of code which we were doing before and it’s a simple process of dragging and dropping steps to connect these different data sources.” Yug Muppala
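For a sense of what hand-writing such a join involves, here is a rough sketch. It does not show Pentaho or Hadoop: as labeled assumptions, an in-memory SQLite database stands in for the transactional database, a plain Python list stands in for rows already extracted from the cluster, and all table and field names are made up.

```python
import sqlite3

# In-memory SQLite stands in for the transactional database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# A plain list stands in for rows already extracted from the Hadoop cluster.
clickstream_rows = [
    {"customer_id": 1, "page_views": 12},
    {"customer_id": 2, "page_views": 7},
    {"customer_id": 3, "page_views": 4},  # no matching customer record
]

# Index the relational side by key, then look up each cluster row (inner join).
names_by_id = dict(conn.execute("SELECT customer_id, name FROM customers"))
blended = [
    {"name": names_by_id[row["customer_id"]], "page_views": row["page_views"]}
    for row in clickstream_rows
    if row["customer_id"] in names_by_id
]
conn.close()

print(blended)
```

Even this toy join needs connection handling, key indexing, and a decision about unmatched rows, which hints at why a drag-and-drop join step saves time at real scale.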

5. Look to the future
Last – as you look ahead to building a team in 2015, there are a few things to keep in mind:

  1. Consider the cloud. More and more companies are running all or part of their big data environment in the cloud.  As the cloud becomes more widely adopted, mature, and secure, look for team members with cloud experience, in addition to those who have dealt with data governance and compliance issues.
  2. Consider self-service analytics. Whether the end user is a customer or an internal user, you’ll need to consider how to make the insights created from your big data environment available for consumption both inside and outside your firewalls.  How will you deliver high-quality governed data to end users for analysis? Will you embed analytics in customer-facing software, or perhaps within an enterprise application?
  3. Consider the profile of people willing to tackle these big data challenges. In addition to experience with the relevant technologies, look for people who embrace and learn from the challenges that big data presents. Marc Hayem says, “The people I’ve worked with are very much start-up people. They are adventurous a little bit more than your average IT person.”

Meet the Mavericks:

Andrew Robbins, CEO, Paytronix
Learn more about Andrew’s journey with big data here.

Marc Hayem, VP of Platform Transformation, RichRelevance
Learn more about Marc’s journey with big data here.

Saad Khalid, Product Manager, Paytronix

Stefan Kochi, Head of Technology, Paytronix

Yug Muppala, Software Engineer, RichRelevance


Analyze 10 years of Chicago Crime with Pentaho, Cloudera Search and Impala

December 23, 2013

Hadoop is a complex technology stack and many people getting started with Hadoop spend an inordinate amount of time focusing on operational aspects – getting the cluster up and running, obtaining foundational training, and ingesting data. Consequently it can be difficult to get a good picture of the true value that Hadoop provides, namely unlocking insight across multiple data streams that add valuable context to the transactional history comprising most of the core data in the enterprise.

At Strata Hadoop World in October, Pentaho’s Lord of 1’s and 0’s or CTO, James Dixon, unveiled a powerful demonstration of the true value that Hadoop – combined with enabling technology from Pentaho and our partner Cloudera – can provide. He took a publicly available data set provided by the City of Chicago and built a demo around it that enables nontechnical end-users to understand how crime patterns have changed over time in Chicago, unlocking insight into the type of crimes being committed in different areas of the city – not only historically but also broken down by time of day and day of week. As a result, citizenry as well as law enforcement have a much better sense of what to expect on the streets of Chicago from the insight the demonstration provides.

In the demo, end-users start with a dashboard that provides a high-level understanding of the mix of crimes historically committed on the streets of Chicago over the last ten years. Watch the demo here:

This kind of top-to-bottom understanding of (in this case) crime patterns is uniquely enabled by the capability Pentaho delivers to the market, combining dashboarding, analytics and data integration into one easily-embedded platform that leverages blending across multiple data sets.

The deep understanding that Pentaho’s solution delivers to end-users is enabled by two key technologies from Cloudera: Cloudera Search and Impala. The original data set provided by the City of Chicago was loaded into a Cloudera Hadoop cluster using Pentaho’s data integration tool, Pentaho Data Integration (“PDI”). End-user drilldown is powered by Cloudera Search, which executes a faceted search on behalf of Pentaho’s dashboard. Once an area of interest has been located, Cloudera’s Impala runs low-latency SQL queries against the raw data stored in the Hadoop cluster to bring up individual crime records.
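The two-step pattern described here, facet counts for navigation followed by SQL for the underlying records, can be sketched on a toy scale. This is not the demo’s code: a `Counter` stands in for the faceted search, in-memory SQLite stands in for Impala, and the records and column names are made up rather than taken from the City of Chicago schema.

```python
import sqlite3
from collections import Counter

# Made-up records; the column names are assumptions, not the city's schema.
records = [
    ("THEFT", "Loop", "2012-06-01"),
    ("THEFT", "Loop", "2012-06-02"),
    ("BATTERY", "Loop", "2012-06-02"),
    ("THEFT", "Uptown", "2012-06-03"),
]

# Step 1: facet counts, the kind of summary a search engine returns for drill-down.
facets = Counter(crime_type for crime_type, _area, _date in records)

# Step 2: SQL over the raw rows for the chosen facet (SQLite stands in for Impala).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (crime_type TEXT, area TEXT, occurred_on TEXT)")
conn.executemany("INSERT INTO crimes VALUES (?, ?, ?)", records)
chosen = facets.most_common(1)[0][0]
rows = conn.execute(
    "SELECT area, occurred_on FROM crimes WHERE crime_type = ? ORDER BY occurred_on",
    (chosen,),
).fetchall()
conn.close()

print(dict(facets), rows)
```

The split matters because the two steps answer different questions: the facets tell a user where to look, and the SQL query retrieves the individual records once they have decided.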

Although Hadoop is often perceived as a geek’s playground, the power of Pentaho’s business-friendly interface is readily apparent when engaging this demo. Unlocking the power of Hadoop can be as simple as engaging Pentaho’s integrated approach to analytics together with Cloudera’s foundational platform to deliver an integrated solution whose value is apparent to nontechnical executives wondering whether Hadoop is the right choice for a key initiative.

Rob Rosen
Field Big Data Lead
Pentaho


Big Data, Big Revenue for Marketers

December 12, 2013

Why might Big Data mean millions for marketing?  Because it has the potential to create a more complete picture of the buyer, thereby empowering marketers to more effectively deliver the right message to the right individual at the right time – and ultimately increase sales.  In the following brief video from DMA 2013, Marketo VP/Co-founder Jon Miller and Pentaho CMO Rosanne Saccone provide a crash course on what Big Data means for marketers.  It covers:

  • The defining characteristics of Big Data – Velocity, Variety, & Volume
  • How marketers can leverage Big Data to blend operational information (CRM, ERP) and online data (web activity, social networking interactions) for new insights
  • Sample Big Data use cases that organizations are green-lighting today to optimize customer interactions and drive marketing’s contribution to revenue

Note that this is an excerpt from a larger presentation – for the full video please click here.

We’d also recommend this blog post by Jon Miller for more context on Big Data in marketing.

For additional compelling use cases that leverage Big Data for marketing and other functions, see here.

Ben Hopkins
Product Marketing
Pentaho


Big Data 2014: Powering Up the Curve

December 5, 2013

Last year, I predicted that 2013 would be the year big data analytics started to go into mainstream deployment and the research we recently commissioned with Enterprise Management Consultants indicates that’s happened. What really surprised me though is the extent to which the demand for data blending has powered up the curve and I believe this trend will accelerate big data growth in 2014.

Prediction one: The big data ‘power curve’ in 2014 will be shaped by business users’ demand for data blending
Customers like Andrew Robbins of Paytronix and Andrea Dommers-Nilgen of TravelTainment, who recently spoke about their Pentaho projects at events in NY and London, both come from the business side and are achieving specific goals for their companies by blending big and relational data. Business users like these are getting inspired by the potential to tap into blended data to gain new insights from a 360 degree customer view, including the ability to analyze customer behavior patterns and predict the likelihood that customers will take advantage of targeted offers.

Prediction two: big data needs to play well with others!
Historically, big data projects have largely sat in the IT departments because of the technical skills needed and the growing and bewildering array of technologies that can be combined to build reference architectures. Customers must choose from the various commercial and open source technologies including Hadoop distributions, NoSQL databases, high-speed databases, analytics platforms and many other tools and plug-ins. But they also need to consider existing infrastructure including relational data and data warehouses and how they’ll fit into the picture.

The plus side of all this choice and diversity is that after decades of tyranny and ‘lock-in’ imposed by enterprise software vendors, in 2014, even greater buying power will shift to customers. But there are also challenges. It can be cumbersome to manage this heterogeneous data environment involved with big data analytics. It also means that IT will be looking for Big Data tools to help deploy and manage these complex emerging reference architectures, and to simplify them.  It will be incumbent on the Big Data technology vendors to play well with each other and work towards compatibility. After all, it’s the ability to access and manage information from multiple sources that will add value to big data analytics.

Prediction three: you will see even more rapid innovation from the big data open source community
New open source projects like Hadoop 2.0 and YARN, the next-generation Hadoop resource manager, will make the Hadoop infrastructure more interactive. Projects like Storm, a distributed real-time stream processing system, will enable more real-time, on-demand blending of information in the big data ecosystem.

Since we announced the industry’s first native Hadoop connectors in 2010, we’ve been on a mission to make the transition to big data architectures easier and less risky in the context of this expanding ecosystem. In 2013 we made some massive breakthroughs towards this, starting with our most fundamental resource, the adaptive big data layer. This enables IT departments to feel smarter, safer and more confident about their reference architectures and open up big data solutions to people in the business, whether they be data scientists, data analysts, marketing operations analysts or line of business managers.

Prediction four: you can’t prepare for tomorrow with yesterday’s tools
We’re continuing to refine our platform to support the future of analytics. In 2014, we’ll release new functionality, upgrades and plug-ins to make it even easier and faster to move, blend and analyze relational and big data sources. We’re planning to improve the capabilities of the adaptive data layer and make it more secure and easy for customers to manage data flow. On the analytics side, we’re working to simplify data discovery on the fly for all business users and make it easier to find patterns and catch anomalies. In Pentaho Labs, we’ll continue to work with early adopters to cook up new technologies to bring things like predictive, machine data and real-time analytics into mainstream production.

As people in the business continue to see what’s possible with blended big data, I believe we’re going to witness some really exciting breakthroughs and results. I hope you’re as excited as I am about 2014!

Quentin Gallivan, CEO, Pentaho



9 Years Later….

October 8, 2013

Photo taken the day Pentaho was founded – October 8, 2004

On Oct 8, 2004 five guys got some crazy idea to create a commercial open source BI offering to provide customers of all sizes with a better and more affordable solution than existed from proprietary vendors. Nine years later, “BI” has become “BA”, the core platform has just undergone its biggest overhaul since its inception, our UI/UX is the best ever, the open source community is still key, big data has become our core growth strategy, predictive is awakening, and we have come through the biggest financial crisis since the Great Depression and are achieving great Y/Y bookings growth. Last, but not least, we have been very fortunate to attract and retain a fantastic, talented, passionate team to make us the leader in big data analytics. Big Data is one of the biggest shifts our industry has seen in decades and we’re making it happen.

Congrats to the entire company for making this real. Happy Birthday Pentaho.

Richard

Richard Daley
Co-Founder and Chief Strategy Officer, Pentaho


Pentaho 5.0 blends right in!

September 12, 2013

Dear Pentaho friends,

Ever since a number of projects joined forces under the Pentaho umbrella (over 7 years ago) we have been looking for ways to create more synergy across this complete software stack.  That is why today I’m exceptionally happy to be able to announce not just version 5.0 of Pentaho Data Integration but a new way to integrate Data Integration, Reporting, Analysis, Dashboarding and Data Mining through one single interface called Data Blending, available in Pentaho Business Analytics 5.0 Commercial Edition.

Data Blending allows a data integration user to create a transformation capable of delivering data directly to our other Pentaho Business Analytics tools (and even non-Pentaho tools).  Traditionally data is delivered to these tools through a relational database. However, there are cases where that can be inconvenient, for example when the volume of data is just too high or when you can’t wait until the database tables are updated.  This for example leads to a new kind of big data architecture with many moving parts:

Evolving Big Data Architectures

From what we can see in use at major deployments with our customers, mixing Big Data, NoSQL and classical RDBMS technologies is more the rule than the exception.

So, how did we solve this puzzle?

The main problem we faced early on was that the default language used under the covers, in just about any business intelligence user facing tool, is SQL.  At first glance it seems that the worlds of data integration and SQL are not compatible.  In DI we read from a multitude of data sources, such as databases, spreadsheets, NoSQL and Big Data sources, XML and JSON files, web services and much more.  However, SQL itself is a mini-ETL environment on its own as it selects, filters, counts and aggregates data.  So we figured that it might be easiest if we would translate the SQL used by the various BI tools into Pentaho Data Integration transformations. This way, Pentaho Data Integration is doing what it does best, not directed by manually designed transformations but by SQL.  This is at the heart of the Pentaho Data Blending solution.

The internals of Data Blending

In other words: we made it possible for you to create a virtual “database” with “tables” where the data actually comes from a transformation step.
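As a loose illustration of the idea, and not of how PDI implements it, the sketch below maps the clauses of a simple SQL query onto pipeline steps over rows that come from a transformation rather than a stored table. The `sales_transformation` function, the `VIRTUAL_TABLES` registry, and the `run_query` helper are all hypothetical names invented for this example, and no real SQL parsing is attempted.

```python
from collections import defaultdict


def sales_transformation():
    """Hypothetical transformation step: yields rows blended from any sources."""
    yield {"region": "EMEA", "amount": 120}
    yield {"region": "NA", "amount": 80}
    yield {"region": "EMEA", "amount": 50}
    yield {"region": "NA", "amount": 10}


# The "virtual database": table names mapped to the step that produces the rows.
VIRTUAL_TABLES = {"sales": sales_transformation}


def run_query(table, where, group_by, sum_of):
    """Mimic SELECT group_by, SUM(sum_of) FROM table WHERE ... GROUP BY group_by."""
    rows = VIRTUAL_TABLES[table]()                 # FROM: start the transformation
    filtered = (r for r in rows if where(r))       # WHERE: a filter step
    totals = defaultdict(int)
    for r in filtered:                             # GROUP BY + SUM: an aggregate step
        totals[r[group_by]] += r[sum_of]
    return dict(totals)


# Roughly: SELECT region, SUM(amount) FROM sales WHERE amount >= 50 GROUP BY region
result = run_query("sales", where=lambda r: r["amount"] >= 50, group_by="region", sum_of="amount")
print(result)
```

The point of the sketch is the correspondence: each SQL clause becomes a step in a data flow, so a tool that only speaks SQL can still be served by rows that never sat in a relational table.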

To ensure that the “automatic” part of the data chain doesn’t become an impossible-to-figure-out “black box”, we once again made good use of existing PDI technologies.  We’re logging all executed queries on the Data Integration server (or Carte server) so you have a full view of all the work being done:

Data Blending Transparency

In addition to this, the statistics from the queries can be logged and viewed in the operations data mart giving you insights into which data is queried and how often.

We sincerely hope that you like these new powerful options for Pentaho Business Analytics 5.0!

Enjoy!

Matt

–If you want to learn more about the new features in this 5.0 release, Pentaho is hosting a webinar and demonstration on September 24th – Two options to register:  EMEA & North America time zones.

Matt Casters
Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions (Wiley)


Customers Speak out – Wisdom of the Crowds Business Intelligence Study, 2013

May 21, 2013

“Responsiveness”, “professionalism”, “knowledge” and “experience” – these are just a few of the words our customers used in giving Pentaho the honor of being recently named a top business intelligence technology vendor in the third annual independent Wisdom of Crowds® Business Intelligence Market Study conducted by Dresner Advisory Services, LLC. The report recognizes Pentaho as a “High Growth BI Software” company with a critical mass of customers growing well above the average.

We have made and continue to make significant investments in simplifying and delivering real value in big data integration and analytics and our customers’ satisfaction. Being named a ‘high growth vendor’ validates that we are experiencing high growth in concert with the big data market, but not at the expense of our customers.

Pentaho earned high marks from its customers on multiple metrics specifically standing out in product, support, consulting and integrity.  This independent research comes straight from the voice of our customers, which is the best possible acknowledgement that we are indeed delivering the future of analytics.

I encourage you to download the full Wisdom of Crowds report to learn how the top vendors stack up and the top BI trends.

Donna Prlich
Senior Director, Product Marketing

