Hand-in-hand: Splunk and Data Science - Part Two

This is part two of a two-part blog on what Data Science means, and how a platform like Splunk makes Data Science more accessible to stakeholders and businesses.

To recap:

In part one, we discussed some points of frustration when it comes to delivering on Data Science initiatives. We then identified the areas of Data Science and considered step one, the Data phase, which is broken up into traditional and big data, the tasks that are performed, and who is typically responsible.

We also started to make an argument for why Splunk should be considered as a valuable software tool for any Data Science initiative.

In part two, I will continue with the Data Science portion of our diagram, which includes Business Intelligence and Predictive Analytics, with a concluding argument on how Splunk and Data Science go hand-in-hand.

If you have not read part one yet, click here to read that post first.

Step 2) Analyze / Report after data is gathered and organized.

Now that step one is complete and our data is organized, we can create different views of our data for Business Intelligence insights as they relate to our business KPIs, SLAs and ROIs.

For example, by measuring the data produced by a variety of sources such as applications, cloud infrastructure, financial systems and user behaviour, we can begin to define the expected SLAs of our web application and the economic impact of any downtime.

And just as with step one, this step requires a supporting cast of individuals, such as B.I. Analysts and Developers, who are trained to understand the organization's business outcomes and align with them by creating relevant reports and dashboards.

To reinforce this point, I invite you to read an article by Kathleen Walch, Managing Partner at Cognilytica in Washington, D.C. and contributing writer to Forbes, where she describes why Data Scientists are not Data Engineers.

Understanding the past

A common misconception is that Business Intelligence is itself a predictive analytics or insight strategy. While predictive analytics builds on reliable Business Intelligence conclusions, and both are part of a broader Data Science initiative, there is one fundamental difference.

The goal of Business Intelligence is to help analyze and explain patterns of past behaviour to provide insights, while Predictive Analytics focuses on the future.

Business Intelligence helps us tell the story of what has happened through visualizations, allowing us to make forecasts and better decisions, and it forms the basis of our predictive analytics strategy.

Software and Programming Languages for B.I.

From a programming language perspective, we have similar options to step one, such as R, Python and SQL-based RDBMS platforms, plus new options such as Jupyter/IPython Notebooks, Tableau and Power BI.

Jupyter (formerly IPython) Notebooks are especially compelling, as they are designed to be an entire interactive computing environment that enables users to create a "notebook" that includes executable live code, plots/visualizations, narrative text, equations, images and video. A Jupyter Notebook is a powerful way to tell a story with your data.
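
To make this concrete, below is a minimal sketch of what a single notebook cell might contain, assuming pandas and matplotlib are installed; the web_traffic.csv file and its column names are hypothetical, chosen purely for illustration.

```python
# A single notebook cell: load data, then render a chart inline.
# 'web_traffic.csv' and its columns are hypothetical, for illustration only.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("web_traffic.csv", parse_dates=["date"])

# In a notebook, the chart renders directly beneath the cell,
# alongside the narrative text that explains it.
df.plot(x="date", y="page_views", title="Daily Page Views")
plt.show()
```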

For an excellent walkthrough tutorial of Jupyter Notebooks, I highly recommend you check out this YouTube video posted by Corey Schafer.

Other typical data visualization and dashboard tools used by many Data Scientists are Tableau and Power BI. 

Similar to Splunk, both Tableau and Power BI allow for the connection and manipulation of different data sources through a user-friendly interface. Each platform can then use this information to create highly customizable dashboards that can be viewed by end-users with the proper access and control permissions.

Using the Splunk platform for B.I.

From a B.I. perspective, we have the advantage of having used Splunk in step one to help cleanse and transform our data. This allows us to quickly create visualizations using built-in charting functions or through apps downloaded from the Splunkbase website.

Dashboards are highly customizable and can be updated through an internal XML editor, but the real advantage of Splunk for B.I. is its ease of use and flexibility.

For a more in-depth demonstration of Splunk data visualizations, please check out one of my previous blogs, which goes through the entire process of creating a dashboard using a Microsoft Azure Automation use case.

Step 3) Build future predictions based on BI results

The most significant difference between this step and the previous steps is that we now use the information about past data gathered by Business Analysts to look forward in time.

This step is firmly in the realm of the Data Scientist, who shines by inspecting our BI reports to predict future behaviour and building a story or narrative for business leaders to follow.

Data Scientists generate these predictions by building models based on statistical algorithms, splitting the data into training data for the algorithm to learn from and test data used to validate the accuracy of the model.

Models are a construct we use in Predictive Analytics to mathematically represent how an event occurs given specific characteristics. For example: Can we predict the weight of a person if we know their gender and height?
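
To illustrate, here is a minimal sketch of that weight example in Python with scikit-learn; the data is synthetic and the coefficients are invented purely for demonstration.

```python
# Minimal sketch of the weight-prediction example: synthetic data,
# a train/test split, and a simple linear regression model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
height_cm = rng.normal(170, 10, n)                 # synthetic heights
gender = rng.integers(0, 2, n)                     # 0/1 encoding, synthetic
weight_kg = 0.9 * height_cm - 85 + 5 * gender + rng.normal(0, 5, n)

X = np.column_stack([height_cm, gender])

# Split into training data (to learn from) and test data (to validate).
X_train, X_test, y_train, y_test = train_test_split(
    X, weight_kg, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out test data:", model.score(X_test, y_test))
```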

We further separate Predictive Analytics into two areas:

  • Traditional Predictive Methods. These methods focus on using conventional statistical and probabilistic Machine Learning techniques to build models and make predictions, including regression, clustering, factor analysis and time series analysis (see the sketch after this list). A high percentage of Machine Learning prediction problems can be solved with traditional predictive models and methods.

  • Machine Learning Methods. These methods build on top of many of the traditional techniques to create advanced Machine Learning models. Advanced models include deep learning, neural networks, natural language processing (NLP) and real-time object recognition, and they require extensive processing of large volumes of data to be accurate.
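
As a concrete instance of one traditional technique from the list above, here is a minimal k-means clustering sketch using scikit-learn; the two-dimensional data is synthetic and purely illustrative.

```python
# Minimal sketch of a traditional technique: k-means clustering.
# The two groups of points below are synthetic, for illustration only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(100, 2)),
])

# Ask k-means to recover the two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("Cluster centers:\n", kmeans.cluster_centers_)
```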

Software and Programming Languages for Predictive Analytics

For traditional methods of predictive analytics, your most versatile programming languages are, once again, R and Python. Both include many libraries for linear, polynomial and random forest regression.
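
As a brief illustration of the other two flavours, here is a scikit-learn sketch of polynomial and random forest regression; the quadratic signal in the data is synthetic and chosen only for demonstration.

```python
# Sketch of polynomial and random forest regression on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, 200)   # synthetic quadratic signal

# Polynomial regression: expand features, then fit a linear model.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# Random forest regression: an ensemble of decision trees.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

print("Polynomial R^2:   ", poly.score(X, y))
print("Random forest R^2:", forest.score(X, y))
```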

From a Machine Learning perspective, the tools include the same programming languages as the traditional methods, plus many other applications that build on top of Python to help develop and train deep learning models, such as Keras and Google's TensorFlow.
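
To give a feel for these frameworks, below is a minimal Keras sketch of a small neural network; the layer sizes and synthetic data are arbitrary illustrative choices, not a recommended architecture.

```python
# Minimal sketch: define and train a small neural network with Keras.
# The data, labels and layer sizes are synthetic and illustrative.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 10).astype("float32")   # 1000 rows, 10 features
y = (X.sum(axis=1) > 5).astype("float32")        # synthetic binary label

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))           # [loss, accuracy]
```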

We can also still use Jupyter Notebooks for both types of Predictive Analytics to build the narrative for how we use the data to define our outcomes.

Predictive Analytics and the Splunk Machine Learning Toolkit

To address the majority of traditional predictive use cases, Splunk has created the Machine Learning Toolkit. It is a free, installable Splunk application from Splunkbase that can efficiently construct traditional machine learning models without requiring you to be a software coding expert in Python or R.

As a practical example, I created a model to help predict employee satisfaction with the company they work for. Information is collected through an eNPS (Employee Net Promoter Score) survey and then run through an easily constructed linear regression model in Splunk, with no coding necessary.
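
For context on the metric itself: eNPS is derived from a 0-10 survey question, where respondents scoring 9-10 count as promoters and 0-6 as detractors, and the score is the percentage of promoters minus the percentage of detractors. A minimal sketch of that calculation, with hypothetical responses:

```python
# Sketch of the eNPS calculation: on a 0-10 scale, 9-10 are promoters,
# 0-6 are detractors, and eNPS = %promoters - %detractors.
def enps(scores):
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

survey_scores = [10, 9, 8, 7, 6, 9, 10, 4, 8, 9]  # hypothetical responses
print(f"eNPS: {enps(survey_scores):+.0f}")         # prints "eNPS: +30"
```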

To explore this practical example further, please visit my previous blog, which demonstrates how to create and build a model with the Splunk Machine Learning Toolkit. The toolkit uses Python in the background to construct all traditional models, so the Python for Scientific Computing add-on from Splunkbase must be installed before any models can be built.

Results from our model can be analyzed to determine its effectiveness, and the model can be re-run and fine-tuned using different attributes and training/testing data ratios.

I encourage you to watch the following Splunk Machine Learning Toolkit video (11 mins) on YouTube, which provides an excellent tour of the toolkit, including how to create and fine-tune models, make predictions and schedule training activities.

Once the model is finalized, dependent data representing a future time can be fed into the model to make predictions, which can be visualized in charts and added to new or existing Splunk dashboards.

The Splunk Deep Learning Toolkit

Built as an extension of the Machine Learning Toolkit, the Splunk Deep Learning Toolkit is designed to handle more advanced Machine Learning requirements that go beyond traditional predictive methods, as identified in our Data Science diagram.

Splunk does this by using a separate environment to build advanced machine learning models, rather than trying to develop these models internally within the platform as it does with traditional methods.

Splunk uses the Deep Learning Toolkit to connect to an external, pre-configured Docker container environment, which Splunk can automatically deploy using standardized Docker images to facilitate model creation.

These Docker images are based on the popular deep learning frameworks and libraries we identified in the previous section, such as Jupyter Notebooks, TensorFlow and Keras. To make configuration even easier, each framework can be pre-built by the Machine Learning Toolkit.

Again, I encourage you to watch the following Splunk Deep Learning Toolkit video (10 mins) on YouTube, which provides an excellent tour of the toolkit and how it integrates with Jupyter Notebooks and TensorFlow to help create the narrative for our predictive scenarios.

Once models are created in the Docker environment, Splunk can leverage these external models to generate advanced predictions that can be executed, visualized and displayed in a Splunk dashboard.

Conclusion

As we enter the next decade, staying competitive and providing business insights through predictive analytics and Data Science has become a "must-have" capability for businesses everywhere.

However, in this ever-evolving field, organizations have found it challenging to know how to start these initiatives, hire the right people and invest the time and money in the right tools and services to see them through.

In this blog, I hope I have helped reduce the confusion by providing a better understanding of the areas that make up Data Science, the tasks that should be performed, the people who should manage them and the software products we will need to invest in to get us there.

I also hope I have demonstrated how Splunk has deeply invested in trying to solve data problems by bringing #DataToEverything and how they are making Data Science more accessible to stakeholders and the business.