Digital Transformation Specialists


Hand-in-hand: Splunk and Data Science - Part One

This is part one (1) of a two-part blog on what Data Science means, and how a platform like Splunk makes Data Science more accessible to stakeholders and businesses.

7 min read

Introduction

Businesses have used mathematics and statistics for centuries to compute results such as measuring profits, counting inventory and determining trends.

Yet as more internal and customer-facing services become increasingly digitized and mobile, they generate tremendous amounts of data at a rapid pace.

This data contains a whole host of information that we have only recently begun to recognize as holding competitive value for our businesses, such as financial status, climate temperatures, customer intent and even human behaviours.

Extracting value from this data requires much more than knowing how to calculate means or standard deviations. It has required the development of a dedicated discipline that applies advanced statistical methods to make predictions and communicate the hidden story locked away within the data.

That field is what we now call Data Science.

However, this relatively new and ever-evolving field of Data Science remains shrouded in magic and mystery for many people, and it has led business leaders to start asking themselves questions like:

  • What is Data Science?

  • Do I need a Data Scientist?

  • Is my business Data Science ready? and

  • What tools can/should I use?

These questions have led to much confusion and misguided efforts. Some common areas of struggle include:

  1. Inaccurate Data Scientist job postings. Without really understanding what Data Science means, many organizations have concluded that they need a Data Scientist when, in fact, they perhaps need a Data Engineer.

  2. Premature acquisition of software/hardware. Without really understanding the steps towards predictive insights, businesses invest prematurely in equipment like Hadoop appliances without genuinely knowing the ROI.

  3. Unrealistic expectations of your Data Scientist. After taking the plunge to hire a Data Scientist, business leaders then expect them not only to plan the data insights strategy but also to build all the necessary infrastructure components. This is usually not what the Data Scientist signed up for, and it becomes a frustrating situation for all involved.

Our goal

The goal of this blog is to help eliminate that confusion and provide some clarity around what Data Science is, the tasks that are usually performed, the people who perform them, and the software products/tools, such as Splunk, that can get us there.

Why highlight Splunk when talking about Data Science?

When Splunk was founded back in October 2003 by Michael Baum, Rob Das and Erik Swan, the goal was to build a Google-like search engine for machine-generated data that could identify significant trends and insights.

With this goal in mind, Splunk has made a name for itself over the years as a reputable provider of insights for I.T. operations and cyber-security use cases.

Unfortunately, this reputation has led to Splunk being overlooked for its capability to provide insights in many other vital areas of business.

With Splunk's #DataToEverything vision expressed at this year's .conf19 conference, Splunk has made itself more relevant to Data Science through meaningful efforts to lower the barrier to entry for advanced Machine Learning predictive capabilities.

The steps to predictive analytics

Before businesses can invest in developing a strategy for predictive analytics, clarity is required to help demystify what Data Science means and how it aligns with business outcomes, so a plan of action can be formed.

The following diagram illustrates the three (3) primary steps to deliver predictive analytics:

  1. Identify, sort, inventory and cleanse your data;

  2. Analyze/report on your organized data; and

  3. Build predictions after B.I. Reports are created and discussed.

In the interest of full disclosure, the messaging in this diagram is my interpretation of the "What, Where and How of Data Science" infographic designed by the 365 Data Science group.

The 365 Data Science group has worked hard to demystify Data Science, so I highly recommend you follow them on Twitter and LinkedIn, read their blog regularly, and take their courses on their website and on education sites like Udemy.

Step 1) Identify, sort, inventory and cleanse your data.

Organizations that have both clean, organized data and an employed Data Scientist can usually hit the ground running and start eliciting valuable predictive insights.

However, in most cases, organizations have not even identified, inventoried or understood the information they hold, let alone hired a qualified person who can identify the questions they want to ask of that data.

So the first step in any Data Science initiative is to organize, categorize and clean your data. We try to classify data into two primary buckets:

  • Traditional Data. This data includes standardized structured and semi-structured data such as relational database information, delimited flat files, log files and CSV information.

  • Big Data. This data includes structured, semi-structured and unstructured data that exhibit, at minimum, the following characteristics: Volume, Velocity and Variety. (For example, having a pool of hundreds of thousands of rows of data does not necessarily classify it as "Big Data." However, if that volume of data were streaming in every few minutes and included a variety of text, images and audio, then it certainly could.)

However, it typically is not enough to only classify data this way. In most cases, data will need to be either sub-classified or transformed to make it usable:

  • Data Inventory & Labeling. Where is your data, and how critical is it? Is it deemed sensitive, like personal or credit card information? What are the risks of a data breach, and what is the cost to the organization? The scope of establishing a data inventory will depend on each organization's requirements and can be as simple as maintaining a spreadsheet.

  • Cleansing. Does the data need to be reformatted, transformed or joined with other data to make it more useful? Do we need to remove dirty or unusable data from our raw sources?

  • Dealing with missing values. Many forms of data input come from end-users who may not fill in all of the valuable information. Decisions will need to be made about how to handle these situations. (e.g., for missing ages, do we calculate an average age to fill in the missing rows?)

These are just some of the decisions that must be made and actions that must be taken, and they will require effort and prioritization before the next stage of business intelligence can be fully realized.
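To make the cleansing and missing-value tasks concrete, here is a minimal sketch in Python using pandas. The file and column names are hypothetical placeholders, and mean imputation is only one of several reasonable strategies:

```python
import pandas as pd

# Load a hypothetical raw extract (traditional, delimited data).
df = pd.read_csv("customers_raw.csv")

# Cleansing: drop exact duplicates and rows missing a usable identifier.
df = df.drop_duplicates()
df = df.dropna(subset=["customer_id"])

# Reformatting: normalize a text field and parse dates into a real dtype.
df["country"] = df["country"].str.strip().str.upper()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Missing values: impute the mean age for rows where age was not supplied.
df["age"] = df["age"].fillna(df["age"].mean())

df.to_csv("customers_clean.csv", index=False)
```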

If starting from scratch, consider limiting the scope of cleansing/classifying data to only the most impactful use cases or lines of business. Once complete, broaden your scope from there.

Data Scientist responsibilities - A common mistake

One common mistake that organizations make is expecting their newly hired Data Scientist to be responsible for performing these data inventory and clean-up tasks. 

While many Data Scientists have the skills to do so, it is usually not what they were expecting as part of their role when hired, and it often leads to frustration for both parties.

Organizations serious about producing predictive insights should consider how they build out their Data and Data Science teams with supporting cast members to take on these specific roles.

Supporting cast members for Data can include:

  • Data Architects. The gatekeepers for overall data integrity, flows and pipelines; and

  • Data Engineers / Database Admins. Responsible for the operational management of infrastructure and software tools used for both traditional and big data.

Software and Programming Languages

For traditional data, there are many conventional relational database management systems (RDBMS) such as Oracle, MySQL, SQL Server and Postgres that can be used to manage structured data.

There are also many programming languages specifically designed to manipulate data, such as R or the more popular Python. Each has distinct data structures, such as data frames, and libraries written by Data Scientists to transform data at scale.
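As a small illustration of the data-frame idiom, the following sketch aggregates one hypothetical table and joins it to another; R's data frames support equivalent operations:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [20.0, 35.0, 12.5, 99.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["EMEA", "AMER", "APAC"],
})

# Aggregate per customer, then join the totals onto the customer dimension.
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()
report = customers.merge(totals, on="customer_id", how="left")
print(report)
```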

However, for big data, typical RDBMS systems and programming languages are not well suited to handle the volume, velocity and variety of the data. Software such as Hadoop or Apache Spark is designed around paradigms like MapReduce to crunch this kind of data, which can later be used for B.I. Reporting.
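As a rough sketch of that map-reduce style of computation, the following PySpark snippet reads a hypothetical directory of JSON events and aggregates them across the cluster. It assumes a working Spark installation; the path and field names are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

# Read a (hypothetical) directory of semi-structured JSON events.
events = spark.read.json("hdfs:///data/events/")

# The classic map-reduce pattern: map each record to a key,
# then reduce (aggregate) the counts across the cluster.
counts = events.groupBy("event_type").agg(F.count("*").alias("n"))
counts.show()

spark.stop()
```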

Data Sources and the Splunk platform

Recently at Splunk's 2019 .conf event in Las Vegas, President of Worldwide Field Operations Susan St. Ledger talked about Splunk's new #DataToEverything vision.

It is a concept where the Splunk platform expands beyond querying internal Splunk indexes and extends that functionality to external sources of data, such as Hadoop's HDFS file system and Amazon S3 buckets.

The impact, from a data management and insights perspective, is that Splunk will be able to process and search those sources directly without needing to ingest them and consume additional storage space.

This capability makes Splunk an invaluable tool for Data Architects, Engineers and DBAs.
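As one illustration of how these team members might put the platform to work, here is a minimal sketch that runs a search programmatically using the Splunk SDK for Python. The host, credentials, index and SPL query are placeholders, not a recipe for the external-data feature itself:

```python
import splunklib.client as client
import splunklib.results as results

# Connect to the Splunk management port (placeholder credentials).
service = client.connect(
    host="splunk.example.com", port=8089,
    username="admin", password="changeme",
)

# Run a blocking one-shot search; the SPL query is illustrative only.
rr = service.jobs.oneshot("search index=web_logs status=500 | head 10")

# Stream the results back as dictionaries (skipping diagnostic messages).
for event in results.ResultsReader(rr):
    if isinstance(event, dict):
        print(event)
```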

End of part one

This concludes part one of this two-part blog on Data Science and Splunk. 

Summary

To summarize, we discussed some common points of frustration when it comes to delivering on Data Science initiatives. We then identified the areas of Data Science illustrated in the diagram and considered how step one breaks down into traditional and big data, the data inventory and clean-up tasks that must be performed, and the people who are typically responsible for them.

In part two, I will discuss steps two (2) and three (3), covering how we deliver on Business Intelligence and Predictive Analytics and the people responsible for helping deliver those results.

I will also present how the Splunk platform makes Data Science more accessible using Splunk's Machine Learning and Deep Learning Toolkits.

Stay tuned for part two!