Affinity Analyzer

A tool for identifying SKUs that are ordered together.

If your customers generally order an oil filter gasket when they order an oil filter, then you may be able to reduce travel in your warehouse by storing the filter and the gasket close to each other. This program identifies such opportunities.

To search for such opportunities, prepare an order history that combines the shopping lists of recent customers and load it into this program. The program reads the orders, searches for patterns, and reports highly correlated SKUs. It also identifies SKUs that tend to complete orders.

Download the program

Please read the license and disclaimer, then click here to download the program. It will appear as a jar file, which most systems will run if you double-click on it. Alternatively, open a terminal and enter

java -jar WhAffinity.jar

If the program does not run, make sure that you have the latest version of Java installed and that your security settings allow execution of Java programs.

How to use the program

  1. Prepare an order history that lists all pick-lines, sorted by customer order. This should be a text file in CSV format, with the order ID in the first field and the SKU in the second. (Here is an example.)
  2. Start the Affinity Analyzer program and open the order history file. The program will parse the file and analyze the patterns of customer orders.
  3. Examine the statistics to find SKUs that have been ordered together.
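
As an illustration of step 1, here is a minimal Python sketch that writes pick-lines in the two-field layout described above: order ID first, SKU second, with all lines of one order kept together. The function name `write_order_history` and the sample order IDs and SKUs are hypothetical. This page does not say whether the program expects a header row, so the sketch assumes none.

```python
import csv
import io


def write_order_history(pick_lines, out):
    """Write (order_id, sku) pick-lines as CSV rows, grouped by order ID.

    Assumption: no header row, order ID in field 1, SKU in field 2.
    """
    writer = csv.writer(out, lineterminator="\n")
    # sorted() is stable, so lines keep their original sequence within an order.
    for order_id, sku in sorted(pick_lines, key=lambda line: line[0]):
        writer.writerow([order_id, sku])


# Hypothetical pick-lines, deliberately out of order.
picks = [
    ("1002", "GASKET-7"),
    ("1001", "FILTER-3"),
    ("1001", "GASKET-7"),
    ("1002", "FILTER-3"),
]

buffer = io.StringIO()
write_order_history(picks, buffer)
history_csv = buffer.getvalue()
print(history_csv)
```

To produce a real file, pass an open file object (for example `open("order-history.csv", "w", newline="")`) instead of the in-memory buffer.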


You can find more information and tools like this in our textbook and associated web pages.


Why does the program not look for groups of 3 or 4 or more SKUs that are frequently ordered together?

This is impractical and unnecessary. It is impractical because the time to process the sales history and the space to store the results both increase exponentially with the size of the affinity groups. It is unnecessary because, if a group of, say, 4 SKUs is frequently ordered together, the current pairs analysis will recognize it by reporting the 6 pairs within that group as frequently ordered together.
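
The counting behind this answer can be checked with a short Python sketch. The function names are hypothetical, and the 10 000-SKU catalog is only an example size; the sketch counts candidate groups, not the program's actual running time or storage.

```python
from math import comb


def pairs_within_group(group_size):
    """Number of distinct SKU pairs contained in one affinity group."""
    return comb(group_size, 2)


def candidate_groups(num_skus, group_size):
    """Number of distinct groups of a given size that could be tabulated."""
    return comb(num_skus, group_size)


# A group of 4 SKUs shows up in the pairs analysis as this many pairs.
print(pairs_within_group(4))

# Candidate groups of size 2, 3, and 4 for an example catalog of 10 000 SKUs.
for size in (2, 3, 4):
    print(size, candidate_groups(10_000, size))
```

Each step up in group size multiplies the number of candidates by several thousand for a catalog of this size, which is why the analysis stops at pairs.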

Why do I get an out-of-memory error?

If your order history contains fifty thousand SKUs, then the program must tabulate statistics on about (50 000)(50 000) = 2 500 000 000 different pairs of SKUs, which can exhaust memory. The program can take advantage of as much memory as you have available, so quit other applications or move to a machine with more memory. Addendum: Some versions of MS Windows apparently allocate no more than 1GB of memory to any Java process. This does not seem to be the case for Mac or Linux.
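
A rough Python sketch of this estimate follows. The 4 bytes per counter is purely an assumption for illustration; this page does not describe how the program stores its statistics, so treat the result as an order-of-magnitude guide only. The function names are hypothetical.

```python
def ordered_pair_slots(num_skus):
    """The (n)(n) figure used above: one slot per ordered pair of SKUs."""
    return num_skus * num_skus


def unordered_pairs(num_skus):
    """Distinct pairs when (A, B) and (B, A) count as the same pair."""
    return num_skus * (num_skus - 1) // 2


def estimated_gib(num_counters, bytes_per_counter=4):
    """Memory in GiB, assuming a fixed number of bytes per counter (a guess)."""
    return num_counters * bytes_per_counter / 2**30


slots = ordered_pair_slots(50_000)
print(slots)
print(unordered_pairs(50_000))
print(round(estimated_gib(slots), 1), "GiB at an assumed 4 bytes per counter")
```

Even under the more economical unordered count, the total is on the order of a billion counters, far beyond the 1GB limit mentioned above.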

Why is the program taking so long to run?

The likely cause is that your computer does not have enough memory (see the previous question) and is continually swapping the contents of memory to disk (also known as “thrashing”). You can add more memory to your computer, use another computer that has more memory, or truncate the data file (for example, use only the first 10 000 lines). For comparison, a laptop with 3GB RAM takes less than 1 minute to process 200 000 lines of sales history describing 10 000 SKUs.
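
One way to truncate the data file is sketched below in Python. The function name `truncate_history` and the sample lines are hypothetical. Note that cutting at a fixed line count may split the last order in two, so the final order in the truncated file may be incomplete.

```python
import io
from itertools import islice


def truncate_history(src, dst, max_lines=10_000):
    """Copy at most max_lines lines from src to dst; return the number copied."""
    copied = 0
    for line in islice(src, max_lines):
        dst.write(line)
        copied += 1
    return copied


# Small in-memory demonstration with hypothetical order lines.
demo_src = io.StringIO("A1,sku1\nA1,sku2\nA2,sku1\n")
demo_dst = io.StringIO()
copied = truncate_history(demo_src, demo_dst, max_lines=2)
print(copied)
print(demo_dst.getvalue())
```

With real files, open the full history for reading and a new file for writing, then pass both file objects to the function.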

What can I do about an out-of-memory error?

Here is a stripped-down, command-line version of the program. It reports only the popularity of SKU pairs. Furthermore, it makes a simplifying assumption that may not apply to your data: it assumes that the popularity of SKU pairs does not vary appreciably over the data, so that a SKU pair that is popular at the start of the data will also tend to be popular toward the end of the data, and similarly for unpopular pairs.

You must have Java 8 installed on your computer. Invoke the program by entering the following in a terminal window.

java -jar WhAffinityCmdLn.jar order-history.csv m n k

The program will then read the order history and tabulate appearances of every pair of SKUs. It will pause after every m orders and purge from memory all but the n SKU pairs most popular to this point. Larger values of m or n consume more memory but better protect against missing pairs that are popular only seasonally.

At the end, the program will write to stdout (print to the screen) the k most popular pairs of SKUs. You will probably want to capture the output by redirecting it: just append

 > output-file.csv

to the line invoking the program.

For comparison purposes, on a typical laptop the program processed 13 million lines of sales history in about five minutes, with m = 100,000 and n = 100,000.
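
The count-and-purge procedure described above can be sketched in Python as follows. This is an illustration of the idea under stated assumptions, not the code inside WhAffinityCmdLn.jar: the function name `top_pairs` is hypothetical, the input is assumed to be sorted by order with the order ID in the first field and the SKU in the second, and each pair is counted at most once per order.

```python
import csv
import io
from collections import Counter
from itertools import combinations, groupby


def top_pairs(rows, m, n, k):
    """Return the k most frequent SKU pairs, purging to the top n every m orders.

    rows: iterable of [order_id, sku, ...] records, sorted by order.
    """
    counts = Counter()
    orders_seen = 0
    usable = (row for row in rows if len(row) >= 2)  # skip blank or short lines
    for _, lines in groupby(usable, key=lambda row: row[0]):
        skus = sorted({row[1] for row in lines})  # distinct SKUs in this order
        counts.update(combinations(skus, 2))      # every pair within the order
        orders_seen += 1
        if orders_seen % m == 0:
            # Purge: keep only the n pairs most popular so far.
            counts = Counter(dict(counts.most_common(n)))
    return counts.most_common(k)


# Tiny hypothetical history: three orders, two of them with just two SKUs.
history = "1,filter\n1,gasket\n2,filter\n2,gasket\n2,wiper\n3,filter\n3,gasket\n"
result = top_pairs(csv.reader(io.StringIO(history)), m=2, n=1, k=1)
print(result)
```

In this sketch a purged pair restarts from zero if it reappears later, so the final counts are approximate. That is the practical meaning of the simplifying assumption stated earlier: pairs that become popular only late in the data may be undercounted when m or n is small.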