The data mining process comprises acquisition, validation, preprocessing, importing, and cleaning. The end-to-end process isn’t simply the ingestion of data into your company’s database. Following on from this article, which outlines what data mining is, this article documents the process companies should follow to maximise the value of acquired data. With data forming the basis of decision making in business, it’s increasingly important that data mining practice adheres to a reliable, scalable framework.
The first stage in data mining is acquisition. For the purpose of this article, this doesn’t refer to importing data from an external source. Instead, it refers to the research and planning involved in determining what data needs to be acquired, and from where. The approach needs to start from the outcomes the business wants to achieve. Because we don’t have enough acronyms to remember as it is, we’re going to use the ODS method of acquiring data (yes, I created another!):
Source - Where is the data? Does your company have access to it, or will it need to be acquired from elsewhere?
The above process should be carried out as a workshop with your team, not individually. This helps to reduce situational bias and enables a wider array of ideas, which is particularly useful when identifying potential sources of information.
Data imported from external sources needs to be confirmed as valuable for achieving the business outcome. This is best handled through experimentation, whereby minimal software work is carried out before first confirming the validity of the imported data. For instance, let’s assume a key outcome of your data mining project is to measure the performance of salespeople in your business using data, and we’ve identified the potential sources of that data as Salesforce and an internal sales tracking solution. A data mining solution is then implemented to import this data. After importation, however, we find the internal sales tracking solution doesn’t contain any information beyond what Salesforce already provides for the key outcome. By validating (or testing) the data sooner, we can confirm that any data imported actually shows the information needed to achieve the business outcome.
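This kind of validation experiment can be as simple as comparing the fields each source actually provides before building any import machinery. A minimal sketch (the field names below are made up for illustration, not real Salesforce fields):

```python
# Hypothetical field lists, as would be gathered from a small sample
# export of each system during the experimentation phase.
salesforce_fields = {"rep_id", "deal_value", "close_date", "stage"}
internal_fields = {"rep_id", "deal_value", "close_date"}

# Fields the internal tracker provides beyond Salesforce.
extra = internal_fields - salesforce_fields

if not extra:
    print("Internal tracker adds nothing - Salesforce alone may suffice")
else:
    print(f"Internal tracker adds: {sorted(extra)}")
```

If the comparison shows no additional information, the second import pipeline never needs to be built, which is exactly the waste the experiment is designed to avoid.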
Whether data is sourced internally or externally, it is rarely provided in a useful format. The data therefore needs to undergo a step called preprocessing. This transforms the data into a consistent format that can be leveraged by other systems (such as data visualisation dashboards). Put slightly differently, preprocessing moulds data from multiple unique platforms into a standardised format.
By way of example, a wholesaler sells products via two different retail companies. These retailers provide information back to the wholesaler in a format relevant to their own systems. The data, ultimately, is the same, but formatted slightly differently: Retailer 1 stores dates in DD/MM/YY format, and Retailer 2 stores dates in MM/DD/YYYY format. The difference seems minor, but it can reduce the reliability of reports, impeding decision making. Preprocessing manipulates the data from both of these systems before it enters the wholesaler’s data storage, where reports can be created.
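The date example above can be sketched in a few lines. This assumes the two formats described in the text and normalises both to ISO 8601; a real preprocessing layer would handle many more fields than dates:

```python
from datetime import datetime

def normalise_date(value: str, retailer: str) -> str:
    """Convert a retailer-specific date string to ISO 8601 (YYYY-MM-DD)."""
    # Formats taken from the example in the text; retailer keys are assumptions.
    formats = {
        "retailer_1": "%d/%m/%y",   # DD/MM/YY
        "retailer_2": "%m/%d/%Y",   # MM/DD/YYYY
    }
    return datetime.strptime(value, formats[retailer]).strftime("%Y-%m-%d")

print(normalise_date("05/03/24", "retailer_1"))    # 2024-03-05
print(normalise_date("03/05/2024", "retailer_2"))  # 2024-03-05
```

Both inputs represent 5 March 2024; after preprocessing they are indistinguishable, which is exactly what downstream reporting needs.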
There are typically three systems involved in this process:
These systems are responsible for the following:
This whole process is critical to data mining, and the preprocessing systems must be in place before importing commences.
After creating systems to preprocess data, we can now develop an importing process. There are two methods of executing this, and for the purpose of describing them, we’ll assume the source we need to acquire data from is Salesforce, and the destination is, of course, our “middleware” (or systems, for ease of reference).
A batch import uses a schedule to “pull” information from a source at a regular, recurring interval. An example daily batch could work as follows:
We update the batch time because a midnight batch may fail. This gives us confidence that even if a batch fails, the next run will import all information since the last successful batch. We save the batch time as the timestamp of the most recent record in Salesforce, because even a minuscule clock difference between servers (Salesforce’s and yours) could cause some data to be missed on the next run. To explain this differently: if our server is 5 seconds ahead and we save the last batch time as our server’s time, the next import will lose those 5 seconds of data. Instead, we use the Salesforce time to ensure batches capture all data.
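The watermark logic above can be sketched as follows. The in-memory `SOURCE` list and the record shape (a dict with a `last_modified` timestamp) stand in for a real Salesforce query and are assumptions, not Salesforce’s actual API:

```python
from datetime import datetime

# Stand-ins for Salesforce and our own storage.
SOURCE = [
    {"id": 1, "last_modified": datetime(2024, 1, 1, 23, 59, 55)},
    {"id": 2, "last_modified": datetime(2024, 1, 2, 0, 0, 2)},
]
IMPORTED = []

def fetch_records_since(since):
    """Placeholder for a Salesforce query: records modified after `since`."""
    return [r for r in SOURCE if r["last_modified"] > since]

def run_batch(last_batch_time):
    records = fetch_records_since(last_batch_time)
    IMPORTED.extend(records)
    if not records:
        return last_batch_time  # nothing new; keep the old watermark
    # Save the *source's* most recent record time, not our server clock,
    # so clock skew between servers can't drop data in the next batch.
    return max(r["last_modified"] for r in records)

watermark = run_batch(datetime(2024, 1, 1))
print(watermark)  # 2024-01-02 00:00:02 - the newest source record's time
```

Because the returned watermark comes from the source’s own timestamps, the next batch picks up exactly where the data (not our clock) left off.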
A hook based import is a pure real time method for acquiring data. A “hook” in computing refers to a process whereby an external system (such as Salesforce) sends a message (containing data) to your server when an event occurs. An example hook based import could be executed as follows:
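As a minimal sketch of the receiving side, using only the standard library: the endpoint, port, and JSON payload shape below are assumptions (Salesforce’s real outbound messages use their own schema), but the pattern of accepting a POST, handing the event off, and acknowledging quickly is the essence of hook-based importing:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED = []

def save_event(event):
    """Placeholder: would hand the event to storage/preprocessing."""
    RECEIVED.append(event)

class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the message body the external system sent us.
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        save_event(event)
        # Acknowledge quickly so the source doesn't retry or drop the hook.
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence default per-request logging

# To run standalone: HTTPServer(("", 8000), HookHandler).serve_forever()
```

Each event arrives the moment it occurs at the source, which is what makes this method “real time”.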
While hook based importing methods are preferable from a “real time” perspective, they carry two primary risks:
We tend to use batch importing wherever possible to avoid these risks, with shorter intervals where more up-to-date information is required.
After importing, the data is rarely in a perfect state for analysis. It therefore needs to be “cleaned” to ensure it’s fit for purpose. Data cleaning can use business rules, statistics, or machine learning, with the method chosen based on the particulars of the project. The cleaning is generally executed using one of two methodologies:
The cleaned data will, of course, differ depending on the methodology used (business rules, statistics, or machine learning). Generally, we find it’s best to clean data after it has been imported. This is achieved by storing two data sets:
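Keeping the raw import untouched and deriving a cleaned copy from it can be sketched as follows. The cleaning rule here (drop records with no deal value) is a made-up business rule purely for illustration:

```python
# Raw import is stored as-is and never modified.
raw_records = [
    {"rep": "alice", "deal_value": 1200},
    {"rep": "bob", "deal_value": None},   # incomplete record
    {"rep": "carol", "deal_value": 800},
]

def clean(records):
    """Derive the cleaned set via business rules; raw data stays intact."""
    return [r for r in records if r["deal_value"] is not None]

cleaned_records = clean(raw_records)
print(len(raw_records), len(cleaned_records))  # 3 2
```

Because the raw set survives, the cleaning rules can be revised later and re-run without re-importing anything from the source.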
Once the data has been cleaned, the data mining process is complete! At this point, the data can be put to use in analysis, whether manual, via business rules, statistical analysis, or machine learning.