Warehouse Science

Data-Mining the warehouse

Purpose

Warehouse profiling is a special case of data-mining, which is simply jargon for sifting through historical data for opportunities and insights that might confer advantage. Amazon studies its customer orders so that it can sell more; we study customer orders so that they can be delivered more efficiently.

As the name “data-mining” suggests, there is a certain amount of serendipity involved, along with a knowledge of what to look for and how to search most efficiently. It is also important to have the right tools.

The data of most enterprises resides in large relational database management systems, such as Oracle, IBM Informix, SAP ASE, MS SQL Server, and others. (Many have small free versions.) High-quality open-source databases include MySQL and PostgreSQL.

A database contains a collection of tables, each of which is similar to a spreadsheet: each row describes some object, such as a sku, and each column describes some attribute of that object, such as its name or weight.
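As a minimal sketch of the row-and-column picture above, the following uses Python's built-in sqlite3 module to build a tiny table in memory. The table name sku, its columns (sku_id, name, weight_kg), and the three sample rows are hypothetical, invented only for illustration.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per sku, one column per attribute.
conn.execute(
    "CREATE TABLE sku (sku_id TEXT PRIMARY KEY, name TEXT, weight_kg REAL)"
)
conn.executemany(
    "INSERT INTO sku VALUES (?, ?, ?)",
    [
        ("A100", "Stapler", 0.45),
        ("B200", "Copy paper, case", 10.2),
        ("C300", "Ballpoint pens, box", 0.12),
    ],
)

# Read the table back: column names first, then the rows.
cursor = conn.execute("SELECT sku_id, name, weight_kg FROM sku ORDER BY sku_id")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

Each tuple printed is one row (one sku), and each position within the tuple corresponds to one column (one attribute).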

Mining data requires first that you be able to manage large datasets. The main tool you will need is a program that allows you to query tables and to connect information in one table with that in another. The industry-standard way to do this is via the Structured Query Language (SQL).
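To illustrate what connecting one table with another looks like in SQL, here is a small sketch, again using Python's built-in sqlite3 module. It assumes two hypothetical tables: sku (one row per sku) and order_line (one row per line of a customer order). The table names, column names, and data are invented for illustration and are not taken from any particular warehouse system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and sample data, for illustration only.
conn.executescript("""
CREATE TABLE sku (sku_id TEXT PRIMARY KEY, name TEXT, weight_kg REAL);
CREATE TABLE order_line (order_id INTEGER, sku_id TEXT, quantity INTEGER);

INSERT INTO sku VALUES
    ('A100', 'Stapler', 0.45),
    ('B200', 'Copy paper, case', 10.2);

INSERT INTO order_line VALUES
    (1, 'A100', 2),
    (1, 'B200', 1),
    (2, 'B200', 3),
    (3, 'B200', 1);
""")

# JOIN connects each order line to its sku record; GROUP BY then
# summarizes, per sku, how many order lines requested it (picks)
# and how many units were requested in total.
query = """
SELECT s.sku_id, s.name, COUNT(*) AS picks, SUM(o.quantity) AS units
FROM order_line AS o
JOIN sku AS s ON s.sku_id = o.sku_id
GROUP BY s.sku_id, s.name
ORDER BY picks DESC, s.sku_id
"""
results = conn.execute(query).fetchall()

for sku_id, name, picks, units in results:
    print(sku_id, name, picks, units)
```

The same query text should run with little or no change on the larger database systems named above; only the connection code would differ.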

There are many tools built on top of SQL to make data-mining easier. (This area is developing very rapidly; for commercial software see offerings by Tableau, MicroStrategy, or Qlik.) Nevertheless it is still good to have some familiarity with SQL, which underlies these tools.

Resources

Here is an introduction to the use of relational databases and SQL to examine warehouse activities. This exercise guides you in setting up some basic database tools and it provides you with a clean set of data on which to practice.
Here is a list of all the data for which you should ask if undertaking an independent profiling project.
Here is a great reminder to plot results in addition to reporting statistics.