Sunday, 1 July, 2018

The Data Mining Process

Author
Lambros Photios


The data mining process requires acquisition, validation, preprocessing, importing, and cleaning. The end-to-end process isn’t simply the ingestion of data into your company’s database. Building on our earlier article, which outlines what data mining is, this article documents the process companies should follow to maximise the value of acquired data. With data forming the basis for decision making in business, it’s increasingly important that data mining practice adheres to a reliable, scalable framework.

Acquisition

The first stage in data mining is acquisition. For the purpose of this article, this doesn’t refer to importing data from an external source. Instead, it refers to the research and development process involved in acquiring data and, consequently, determining which data needs to be acquired. The approach here needs to start by understanding the outcomes of the business. Because we don’t have enough acronyms to remember as it is, we’re going to use the ODS method of acquiring data (yes, I created another!):

  1. Outcome - What are the outcomes of the data mining process, and why has data been identified as the means of achieving that outcome?
  2. Data - What is the data that will be required to achieve the outcome identified?
  3. Source - Where is the data? Does your company have access to it, or will it need to be acquired from elsewhere?

The above process should be carried out as a workshop with your team, not individually. This helps to reduce situational bias and enables a wider array of ideas, which is especially useful when identifying potential sources of information.

Validation

Data imported from external sources needs to be confirmed as valuable for achieving the business outcome. This is best handled through an experimentation process, whereby minimal software work is executed before the validity of the imported data is first confirmed. For instance, let’s assume a key outcome of your data mining project is to determine the performance of salespeople in your business using data. The potential sources of that data have been identified as Salesforce and an internal sales tracking solution. A data mining solution is then implemented to import this data. However, after importing it, we find the internal sales tracking solution doesn’t contain any information beyond what is already in Salesforce, which alone is sufficient to achieve the key outcome. By validating (or testing) the data sooner, we’re able to confirm that any data imported shows the expected information needed to achieve the ideal business outcome.
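
As a rough illustration, this kind of experiment can be run with a lightweight comparison of the two sources before any serious pipeline work is committed. The file paths and the deal_id field below are hypothetical; the idea is simply to check whether the internal tracker contributes any columns or records that Salesforce doesn’t already cover.

    import pandas as pd

    # Hypothetical one-off exports pulled manually for the experiment.
    salesforce = pd.read_csv("salesforce_opportunities.csv")
    internal = pd.read_csv("internal_sales_tracker.csv")

    # Which columns does the internal tracker add beyond Salesforce?
    extra_columns = set(internal.columns) - set(salesforce.columns)

    # Which deals appear in the internal tracker but not in Salesforce?
    # Assumes both exports share a "deal_id" column for matching records.
    extra_records = internal[~internal["deal_id"].isin(salesforce["deal_id"])]

    print("Columns unique to internal tracker:", extra_columns)
    print("Records unique to internal tracker:", len(extra_records))

If both results are (near) empty, the internal tracker adds little towards the outcome, and the pipeline can be built against Salesforce alone.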

Preprocessing

Whether data is sourced internally or externally, it is rarely provided in a useful format. The data will therefore need to undergo a step called preprocessing. This transforms the data into a consistent format that can be leveraged by other systems (such as data visualisation dashboards). Explained slightly differently, preprocessing moulds the data from multiple unique platforms into a standardised format.

By way of example, a wholesaler is selling products via two different retail companies. These retailers provide the information back to the wholesaler in a format that is relevant to their own systems. The data, ultimately, is the same, but formatted slightly differently: Retailer 1 stores dates in DD/MM/YY format, and Retailer 2 stores dates in MM/DD/YYYY format. The difference seems minor, but it can reduce the reliability of reports, impeding decision making. Preprocessing manipulates the data from both of these systems before it enters the wholesaler’s data storage, where reports can be created.
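
A minimal sketch of that normalisation step might look like the following. The retailer labels, date formats, and field handling are assumptions for illustration; both feeds are converted to a single ISO 8601 format before they reach local storage.

    from datetime import datetime

    # Each retailer reports the same sale date in its own format (assumed here).
    RETAILER_FORMATS = {
        "retailer_1": "%d/%m/%y",   # e.g. 01/07/18
        "retailer_2": "%m/%d/%Y",   # e.g. 07/01/2018
    }

    def normalise_date(raw_date, retailer):
        """Convert a retailer-specific date string to ISO 8601 (YYYY-MM-DD)."""
        parsed = datetime.strptime(raw_date, RETAILER_FORMATS[retailer])
        return parsed.strftime("%Y-%m-%d")

    print(normalise_date("01/07/18", "retailer_1"))    # 2018-07-01
    print(normalise_date("07/01/2018", "retailer_2"))  # 2018-07-01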

There are typically three systems involved in this process:

  1. Source (database)
  2. Middleware (software)
  3. Local Storage (database)

These systems are responsible for the following:

  1. Source - The source you’re acquiring data from, such as a marketing platform tracking business analytics.
  2. Middleware - The software responsible for pulling data from the source, preprocessing the data, and saving it to your local storage.
  3. Local Storage - The database storing all information across your systems, from which useful insights can be deduced.

This whole process is critical to data mining, and these systems must be in place before importing commences.
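
To make that division of responsibilities concrete, here is a heavily simplified middleware sketch. The source URL, the preprocessing rule, and the table layout are all placeholders; the point is only the pull, preprocess, and save shape of the flow.

    import sqlite3
    import requests  # generic HTTP client, assumed for the example

    def pull_from_source():
        """Source: fetch raw records from the external platform (placeholder URL)."""
        response = requests.get("https://api.example-source.com/records")
        return response.json()

    def preprocess(record):
        """Middleware: mould each record into the standardised local format."""
        return (record["id"], record["name"].strip().title(), record["amount"])

    def save_to_local_storage(rows):
        """Local storage: persist the standardised rows for reporting."""
        with sqlite3.connect("warehouse.db") as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT PRIMARY KEY, name TEXT, amount REAL)")
            conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)

    save_to_local_storage([preprocess(record) for record in pull_from_source()])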

Importing

After creating systems to preprocess data, we can now develop an importing process. There are two methods of executing this. For the purpose of describing them, we’ll assume the data source we need to acquire data from is Salesforce, and the destination is of course our “middleware” (or systems, for ease of reference).

Batch Import

A batch import uses a schedule to “pull” information from a source at a regular, recurring interval. An example daily batch could work as follows:

  1. At midnight, get all sales pipeline data from Salesforce since the last “batch time” (we’ll refer to this again shortly).
  2. Your server processes this data (successfully imported).
  3. Update the “batch time” to the time of the most recent record.

We update the batch time because a midnight batch may fail. This gives us confidence that even if a batch fails, the next one will import all information since the last successful batch. We set the batch time to the time of the most recent record on Salesforce, as a minuscule time difference between servers (Salesforce’s and your own) could otherwise cause some data to be missed in the next import. To explain this differently: if our server is 5 seconds ahead, and we were to save the last batch time as our server’s time, the next import would lose those 5 seconds of data. Instead, we use the Salesforce time to ensure batches capture all data.
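
A sketch of that daily batch, assuming a generic fetch_records_since callable in place of any specific Salesforce client library, might look like this. The key detail is that the stored batch time comes from the newest record returned, not from the local clock.

    import json
    import os

    STATE_FILE = "batch_state.json"  # where the last successful batch time lives

    def load_last_batch_time(default="1970-01-01T00:00:00Z"):
        if not os.path.exists(STATE_FILE):
            return default
        with open(STATE_FILE) as f:
            return json.load(f)["last_batch_time"]

    def save_last_batch_time(timestamp):
        with open(STATE_FILE, "w") as f:
            json.dump({"last_batch_time": timestamp}, f)

    def run_daily_batch(fetch_records_since, import_records):
        """fetch_records_since and import_records are placeholders for whatever
        client pulls records from Salesforce and whatever routine stores them."""
        records = fetch_records_since(load_last_batch_time())
        if not records:
            return  # nothing new; keep the previous batch time

        import_records(records)

        # Advance the batch time to the newest record's own timestamp, not our
        # clock, so skew between the two servers can never drop data.
        save_last_batch_time(max(record["last_modified"] for record in records))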

Hook Based Import

A hook based import is a purely real time method of acquiring data. A “hook” in computing refers to a process whereby an external system (such as Salesforce) sends a message (containing data) to your server when an event occurs. An example hook based import could be executed as follows, with a minimal receiver sketched after the steps:

  1. A new lead is created in your sales pipeline within Salesforce.
  2. A hook fires, which sends the data to your server.
  3. Your server processes this data (successfully imported).
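
A minimal receiving end for such a hook might look like the sketch below, using Flask purely as an illustrative web framework; the endpoint path and payload handling are assumptions, not anything Salesforce prescribes.

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/hooks/new-lead", methods=["POST"])
    def receive_new_lead():
        """Steps 2 and 3: the external system POSTs the event; we process it immediately."""
        payload = request.get_json(force=True)
        store_lead(payload)  # placeholder for preprocessing and saving to local storage
        return "", 204       # acknowledge receipt so the sender doesn't keep retrying

    def store_lead(payload):
        # In practice this would write to the local database; printing keeps the sketch simple.
        print("Imported lead:", payload)

    if __name__ == "__main__":
        app.run(port=8000)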

While hook based importing methods are preferred from a “real time” perspective, they carry two primary risks:

  1. You need to expose access to your server from an external location. This creates a cybersecurity risk, even when done properly.
  2. If a hook’s data isn’t delivered (for example, if your server was unreachable), it becomes incredibly cumbersome to retrieve that data without reliance on the integrity of the external platform.

We tend to use batch importing wherever possible to avoid this, with shorter “intervals” where required for more up-to-date information.

Cleaning

Even after it has been imported, data is rarely in a perfect state for analysis. It therefore needs to be “cleaned” to ensure it’s fit for purpose. Data cleaning can use business rules, statistics, or machine learning, with the method chosen based on the particulars of the project. The clean is generally executed using one of two methodologies:

  1. Update - The update method involves manipulating inconsistently formatted data sets to transform them into a usable state.
  2. Delete - The delete method involves deleting any data sets that are broken or unusable. This is a last resort, but it is particularly valuable in statistical analysis, where outliers can cause undesirable outcomes if included in calculations.

The cleaned data will of course differ depending on the methodology used (whether business rules, statistics, or machine learning). In practice, we find it’s best to clean data after it has already been imported. This is achieved by storing two data sets, as sketched after the list below:

  1. Raw - The original data directly from the source, completely unmanipulated.
  2. Cleaned - The cleaned data, including updated results, and not including data flagged for deletion.
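
Here is a minimal sketch of that raw/cleaned split using pandas and simple business rules; the column names, file paths, and thresholds are assumptions for illustration.

    import pandas as pd

    # Raw: the original data exactly as imported, never manipulated.
    raw = pd.read_csv("imported_sales_raw.csv")
    cleaned = raw.copy()

    # Update: standardise inconsistently formatted fields.
    cleaned["region"] = cleaned["region"].str.strip().str.title()
    cleaned["sale_date"] = pd.to_datetime(cleaned["sale_date"], dayfirst=True, errors="coerce")

    # Delete (last resort): drop broken rows and implausible outliers.
    cleaned = cleaned.dropna(subset=["sale_date", "amount"])
    cleaned = cleaned[cleaned["amount"].between(0, 1_000_000)]

    # Store both data sets side by side.
    raw.to_csv("sales_raw.csv", index=False)
    cleaned.to_csv("sales_cleaned.csv", index=False)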

Once the data has been cleaned, the data mining process is complete! At this point, the data can be put to use in analysis, whether manual, via business rules, statistical analysis, or machine learning.

About the Author.

I embarked on my entrepreneurial journey six years ago with one goal: To build a culture and technology focused company. Working with industry leaders, I’ve had the honour of delivering challenging projects with intricate specifications, and within tight deadlines.
