Saturday, November 22, 2008

What is Predictive Analytics?

I just saw this link about the difference between BI and Predictive Analytics. This comes on the heels of a meeting I had with UCSD Extension folks, talking about predictive analytics and data mining in the context of teaching courses for professionals, and this topic came up: how is predictive analytics different from BI?

First, I'd like to applaud the author, Vladimir Stojanovski, for concluding there are differences, and for trying to get at what those differences are.

The article puts it this way:

To tie this all back to the question of BI vs. Predictive Analytics (PA), a metaphor I've heard used to describe the difference goes something like this: if BI is a look in the rearview mirror, predictive analytics is the view out the windshield.


In my experience, this is a common definition. Predictive Analytics and Data Mining are seen as predicting future events, whereas OLAP looks at past data.

While I'd love to jump on this bandwagon because it makes for a simple and compelling story, I cannot ride this one. And that's because both BI and PA look at historic data. PA isn't magic in coming up with predictions of the future. In fact, both BI and PA ultimately look at and use the same data (or variations of the same historic data). Both can predict the future, so long as the future is consistent with the past, either in a static sense or in a dynamic sense (by extrapolating past data into the future).

I think it is better to describe the difference this way: BI reports on historical data based upon an analyst's perspective on which fields and statistics are interesting, whereas PA induces which fields, statistics, and relationships are interesting from the data itself. I think it is the combinatorial, sifting, iterative nature of PA that gives it better predictive accuracy of the future (coupled with using business metrics to assess whether the fields found truly are predictive or not).
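
To make that concrete, here is a minimal sketch of what I mean by inducing the interesting fields from the data itself, rather than having an analyst pre-select them. It uses Python with pandas and scikit-learn, and the file and column names (customer_history.csv, responded) are hypothetical stand-ins for whatever historical data you have on hand.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical historical data: one row per customer with a known outcome field
df = pd.read_csv("customer_history.csv")      # assumed file name
target = df["responded"]                      # assumed binary outcome column
candidates = pd.get_dummies(df.drop(columns=["responded"]))  # one-hot encode categorical fields

# Fit a model over all candidate fields and let it sift through the combinations
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(candidates, target)

# Rank the fields by how much they actually contribute to prediction
importance = pd.Series(model.feature_importances_, index=candidates.columns)
print(importance.sort_values(ascending=False).head(10))
```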

So let's not oversell--what PA does is reason enough for it to be an integral part of any analytics or BI group.

Monday, October 20, 2008

What topics would you like to see covered at a KDD conference?

This is your chance to voice your opinion!

What topics, sessions, or tutorials would be most useful for you at a conference like KDD? Would a full industrial track be of interest, or are industries so diverse that we really need tracks narrowed to specific industries?

Please--practitioners only. I'm defining practitioners as those who get paid to develop models that are actually used in industry.

I'll kick it off with one idea:

Tutorials (1/2 day) geared toward the practitioner. This means that if techniques are described (such as social networking), there must be implementations of the algorithmic ideas available in competitive commercial software. As great as R and Matlab are, for example, relatively few practitioners are programmers who can take advantage of these kinds of frameworks.

I know there are tutorials at KDD every year. This year I didn't go because they were all on Sunday and I wasn't able to attend then, but I would have wanted to go to the Text Mining tutorial, as that is a topic that has become a significant part of my business over the past couple of years.

One last thought: I think one thing that may happen (understandably) is that topics that have been covered in years past are not revisited. For those of us who live in the data mining world, it is far more interesting to continue to explore new ideas, especially those that build on ideas we have already explored in depth. However, as data mining increases in its use, we are bringing in folks who have not had that same benefit. For many, a tutorial on decision trees would be very useful and interesting (like the KDD 2001 tutorial--to my knowledge, trees have not been revisited since, except in the framework of ensembles in 2007).

Thursday, October 09, 2008

Two Books of Interest

Recently, I have been reading two books which may be of interest to data miners: Statistical Rules of Thumb by Gerald van Belle (ISBN-13: 978-0471402275) and Common Errors in Statistics (and How to Avoid Them) by Phillip I. Good and James W. Hardin (ISBN-13: 978-0471794318). Both impart practical advice based on extensive experience and statistical rigor, yet avoid becoming hung up on academic issues.

While both are written from the point of view of traditional statisticians, they do suggest the use of some less traditional techniques, such as the bootstrap and robust regression. A wide range of topics is covered, such as sample size determination, hypothesis testing and treatment of missing values. Both books also include some material written for audiences working in specific fields, such as environmental science and epidemiology. Material in these two books will vary in applicability to data mining, given the traditional statistical focus on smaller data sets and parametric modeling.

I highly recommend both of them. Tables of contents can easily be found on-line, and an entire chapter of Statistical Rules of Thumb is available at: Chapter 2: Sample Size.

Thursday, September 25, 2008

KDD 2008

It's hard to believe that KDD2008 was the first KDD I've attended in seven years. It was striking how much had changed in that time, and that was one of the primary reasons I attended this past year--to see for myself whether the reports I'd heard were true. Sure enough, they were.

These reports, primarily from colleagues in industry, were that KDD didn't have anything they could "take home and use". Many of these folks are analysts who are decidedly not academic, so I thought I had a sense for what they meant.

I found their reports hit the mark. Seven years ago I was able to find (1) significant numbers of industry personnel at the conference and (2) many talks that were accessible enough for non-academics to understand. This time around, few of the industry practitioners I met were not PhDs. That's not to say there weren't interesting talks. Two I didn't see in person but read later were the Elkan paper on learning from positive and unlabeled examples and the Grossman paper on Data Clouds--thought-provoking, both. The lunch talk by Trevor Hastie on regularization was very interesting, but it was geared toward those who can digest his textbook (which is among the finest data mining / statistical learning texts out there).

Social networking was a key theme of the conference, and it was such a dominant force at the conference that it deserves a separate post.

Lastly, the decline in participation by the business community was nowhere more evident than in the vendor room--only a few data mining software vendors were there, which indicates to me that the conference isn't viewed as a place to increase sales: if I remember correctly, only Microsoft, Oracle, Statsoft, Salford Systems, and SAS were there. A quick look at the kdnuggets software survey shows who wasn't there.

So it seems that KDD has wandered from a business/academic mix to a more academic conference, which is, of course, the prerogative of the organizers. I'm still searching for a great conference for the data mining practitioner who has the level of understanding of data mining to read and absorb a book like the Witten/Frank machine learning book but desires a more practical approach to the subject.

Wednesday, May 28, 2008

What data mining software to buy?

This post (http://www.dmreview.com/issues/2007_46/10001040-1.html?portal=analytics) is an interesting example of the assessment of analytics software. The key paragraph is the conclusion, where Mr. Raab states:
Instead of a horserace between product features, this approach puts the focus where it should be: on value to your business. It recognizes that the value of a new tool depends on the other tools already available, and it forces evaluation teams to explicitly study the impact of different tools on different users. By creating a clearer picture of how each new tool will impact the way work actually gets done within the company, it leads to more realistic product assessments and ultimately to more productive selection choices.


I couldn't agree more. For the past 10 years, since the Elder and Abbott review of data mining software presented at KDD-98 (on my web site), I've tried to think of ways to summarize data mining software. The obvious way is by features, such as which algorithms a product has. The usability of a tool is another characteristic to add, as John, Philip Matkovsky and I wrote about in "An Evaluation of High-End Data Mining Tools for Fraud Detection". I've also described the different packages by the kind of interface (wizard, menu-driven, block-diagram, command line, etc.).

It's not easy to provide a summary in this multi-dimensional view of data mining tools. Sounds like an opportunity for predictive modeling!

Monday, May 26, 2008

What Makes a Data Mining Skeptic?

I just found this post expressing skepticism about data mining (I'll let go the comment about predictive analytics being the holy grail of data mining--not sure what this means).

The fascinating part for me was this paragraph:

Anyway. Lindy and I were a bit squirmy through the whole discussion. It seemed like so many hopes and dreams were being placed at the altar of the goddess Clementine... but I had to ask myself, could you REALLY get any more analysis out of it then you could get simply by asking your members what events they attend, plan to attend, ever attended, or might attend in the future, and why? Since when did we stop talking to our members about this stuff? A good internal marketing manager could give you all the answers you seek about which of your various audiences are likely to respond to which of your messages, who's going to engage with you, why and when, who's going to participate in which of your events, etcetera, and they would know these answers not through stats and charts (even if you ask for them) but through experience and listening.


It is interesting on several fronts. First, there is a strong emphasis on personal expertise and experience. But at the heart of the critique is apparently a belief that the data cannot reveal insights, or in other words, a data-driven approach doesn't give you any "analysis". Why would one believe this? (and I do not doubt the sincerity of the comment--I take it at face value).

One reason may be that this individual has never seen or experienced a predictive analytics solution. While this may be true, it also misses what I think is at the heart of the critique. There is a false dichotomy set up here between data analysis and individual expertise. Anyone who has built predictive models successfully knows that one usually must have both: expert knowledge and representative data (to build predictive models).

There are undoubtedly some individuals who can "give you all the answers you seek about which of your various audiences are likely to respond to which of your messages". But usually, this falls short for two reasons:
1) most individuals who have to deal with large quantities of data don't know as much as they think they know, and, related to this,
2) it is difficult, if not impossible, for anyone to sort through all the data with all of the permutations that exist.

Data mining usually doesn't tell us things that experts scratch their heads at in amazement. It usually confirms what one suspects (or one of many possible conclusions one may have suspected), but with a few unexpected twists.

So how can we persuade others that there is value in data mining? The first step is realizing there is value in the data.

Friday, April 18, 2008

When Distributions Go Bad

Recently I was working with an organization, building estimation models (rather than classification). They were interested in using linear regression, so I dutifully looked at the distribution, as shown to the left (all pictures were generated by Clementine, and I also scaled the distribution to protect the data even more, but didn't change the shape of the data). There were approximately 120,000 examples. If this were a typical skewed distribution, I would log transform it and be done with it. However, in this distribution there are three interesting problems:


1) skew is 57--heavy positive skew
2) kurtosis is 6180--heavily peaked
3) about 15K of these had value 0, contributing to the kurtosis value

So what to do? One answer is to create the log transform but maintain the sign, using sgn(x)*log10( 1 + abs(x) ). The transformed distribution looks like this:


This takes care of the summary statistics problems, as the skew became 0.6 and the kurtosis -0.14. But it doesn't look right--the spike at 0 looks problematic (and it turned out that it was). Also, the distribution actually ends up with two approximately normal distributions of different variance, one to the left and one to the right of 0.
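
For anyone who wants to try this on their own data, here is a minimal sketch of the signed log transform and the skew/kurtosis check described above, using numpy and scipy; the file name is an assumption standing in for wherever your target variable lives.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def signed_log10(x):
    """sgn(x) * log10(1 + abs(x)): compresses the tails while preserving the sign."""
    return np.sign(x) * np.log10(1.0 + np.abs(x))

# x is the raw target variable (the file name here is an assumption)
x = np.loadtxt("baddist.csv")
x_log = signed_log10(x)

for name, values in [("raw", x), ("signed log10", x_log)]:
    print(name, "skew =", round(skew(values), 2), "kurtosis =", round(kurtosis(values), 2))
```
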
Another approach to this is to use the logistic transform 1 / ( 1 + exp(-x/A) ) where A is a scaling factor. Here are the distributions for the original distribution (baddist), the log-transformed version (baddist_nlog10), and the logistic transformed with 3 values of A: 5, 10, and 20, with the corresponding pictures for the three logistic transformed versions.
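
And here is the corresponding sketch for the logistic transform, looping over the three values of A I tried (again, the file name is just a placeholder):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def logistic_transform(x, A):
    """1 / (1 + exp(-x/A)): squashes the target into (0, 1); A controls the scaling."""
    return 1.0 / (1.0 + np.exp(-x / A))

x = np.loadtxt("baddist.csv")   # assumed file, as above
for A in (5.0, 10.0, 20.0):
    y = logistic_transform(x, A)
    print("A =", A, "skew =", round(skew(y), 2), "kurtosis =", round(kurtosis(y), 2))
```
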
Of course, going solely on the basis of the summary statistics, I might have a mild preference for the nlog10 version. As it turned out, the logistic transform produced "better" scores (we measure model accuracy by how well the model rank-orders the predicted amounts, and I'll leave it at that). That was interesting in and of itself, since none of the distributions really looked very good. However, another interesting question was which value of "A" to use: 5, 10, 20 (or some other value I don't show here). We found the value that worked best for us, but because of the severity of the logistic transform in how it scales the tails of the distribution, the selection of "A" depended on which range of the target values we were most interested in rank-ordering well. The smaller values of A produced bigger spikes at the extremes, and therefore the model did not rank-order these values well (these models did better on the lower end of the distribution magnitudes). If we wanted to identify the tails better, we needed to increase the scaling factor "A", and doing so did in fact improve the rank-ordering at the extremes.
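
For reference, one simple way to compute that kind of rank-ordering measure is the Spearman rank correlation between actual and predicted amounts; the sketch below is an illustration of the idea, not the exact metric we used.

```python
from scipy.stats import spearmanr

def rank_order_quality(actual, predicted):
    """Spearman rank correlation: 1.0 means the predictions rank-order the actual amounts perfectly."""
    rho, _ = spearmanr(actual, predicted)
    return rho

# Example usage: rank_order_quality(test_amounts, model_scores)
```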

So, in the end, the scaling of the target value depends on the business question being answered (no surprises here). So now I open it up to all of you--what would you do? And, if you are interested in this data, I have posted it on my web site; you can access it here.

Thursday, April 17, 2008

Data Mining survey

Karl Rexer of Rexer Analytics conducted an extensive survey of data miners in 2007, and reported on those results here at Quirks.com (a site I had never heard of before--unfortunately, you have to register to see it).

This is not to be confused with their 2008 survey, results due out soon I would expect.

A few interesting items in the survey results:

• Correspondingly, the most commonly used algorithms are regression (79 percent), decision trees (77 percent) and cluster analysis (72 percent). Again, this reflects what we have seen in our own work. Regression certainly remains the algorithm of choice for large sections of the academic community and within the financial services sector. More and more data miners, however, are using decision trees, and cluster analysis has long been the bedrock of the marketing community.


I find it interesting in and of itself that academics are participating in a data mining survey, and I don't mean that in a negative way. I have viewed data mining more as a business-centric way of thinking, and to have regression advocates participate in a survey of this type is a good sign. Of course, it could also mean that business folks don't have the time to fill out surveys :)

• SPSS, SPSS Clementine, and SAS are the three most frequently utilized analytic tools and were each used in 2006 by more than 40 percent of data miners. Forty-five percent of data miners also employed their own code in 2006. Respondents were asked about 26 different software packages from the powerhouses above to less-visible and -utilized packages such as Chordiant, Fair Isaac and KXEN.


Clementine usually shows up at the top of the KDNuggets survey, and I've never been sure if that was because of the typical kdnuggets user, or if it reflected true general use in the data mining community. This gives further evidence that its use is more widespread. The fact that SPSS and SAS are the others shows the dominance of statisticians or academicians in the survey. I rarely find heavy SPSS or SAS users among technical business analysts.

• Comparisons of reported 2006 use and planned 2007 use show that there is increasing interest in the Oracle Data Mining tool, and decreasing interest in C4.5/C5.0/See5. It will be interesting to see how these trends develop over time and if other tools find greater prominence in the future.


I concur from my experience. I would put SQL Server in that category as well. I think the C4.5 popularity was largely due to licensing.

• The primary factors data miners consider when selecting an analytic tool are: 1) the dependability and stability of software, 2) the ability to handle large data sets, and 3) data manipulation capabilities. Data miners were least interested in the reputation of the software and the software’s compatibility either with other programs or with software used by colleagues.


This looks like the response of technical people--very much common sense. I wonder what decision makers would say? Reputation, I would think, would rank much higher among them.

• The top challenges facing data miners are dirty data, data access and explaining data mining to others. Over three-quarters of data miners listed dirty data as one of the major challenges that they face. This is again consistent with our own experience and the conventional wisdom discussed at data mining conferences: a significant proportion of most projects consist of data understanding, data cleaning and data preparation.


No surprises here! However, once one goes through this process, its importance is reduced (because it is solved).

Thanks to Rexer Analytics for putting this and the 2008 survey together. I'm looking forward to those results.

Wednesday, April 16, 2008

Data Mining Data Sets

Every once in a while I receive a request, or see one posted on some bulletin board, about data mining data sets. I have to say, I have little patience for many of these requests because a simple Google (or Clusty) search will solve the problem. Nevertheless, here are four sites I've used in the past to grab data for testing algorithms or software packages:

UC Irvine Machine Learning Repository: http://archive.ics.uci.edu/ml/

Carnegie Mellon Statlib Archive: http://lib.stat.cmu.edu/datasets/

DELVE Datasets: http://www.cs.utoronto.ca/~delve/data/datasets.html

MIT Broad Institute Cancer Datasets: http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi
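
As a quick illustration, most of these data sets can be pulled straight into an analysis environment. Here is a minimal sketch using Python and pandas to read the UCI Adult data set; the URL and column names should be verified against the repository before relying on them.

```python
import pandas as pd

# Example: pulling the UCI "Adult" data set directly from the repository.
# The URL and column names below should be checked against the repository's documentation.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "sex",
           "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"]
adult = pd.read_csv(url, names=columns, skipinitialspace=True)
print(adult.shape)   # roughly (32561, 15)
```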

Tuesday, April 15, 2008

DM Radio and Text Mining

I'll be interviewed on the topic of text mining this coming Thursday, April 17th, at 3pm EDT on DM Radio, along with Barry DeVille of SAS and Jeff Catlin of Lexalytics. The title of this entry links to the DM Review site.

I think you have to register to listen.

The schedule will go something like this:


3:00 PM
Hosts Eric Kavanagh and Jim Ericson frame the argument: What is text analytics, and how can it be used to find those golden needles in the haystack?

3:12 PM
Hosts interview Barry DeVille of SAS Institute: What are some good examples of customer success? What are some common mistakes?

3:24 PM
Hosts interview Jeff Catlin, CEO of Lexalytics: How does his application work? What are some examples of text mining at work?

3:36 PM
Hosts interview Dean Abbott of The Modeling Agency: We heard what the vendors said, but what does that all really mean?

3:48 PM
Roundtable discussion: All bets are off! Guests are encouraged to engage in open dialogue, and listeners can email their questions to dmradio@sourcemedia.com

Thursday, April 10, 2008

Data Mining: Widespread Acceptance When?

Data mining is widely accepted today among industries which have a history of "management by numbers", such as banking, pure science and market research. Data mining is easily viewed by management in such industries as a logical extension of less sophisticated quantitative analysis which already enjoys currency there. Further, information infrastructure necessary to feed the data mining process is typically already present.

It seems likely that at least some (if not many) other industries could realize a significant benefit from data mining, yet this has emerged in practice only sporadically. The question is: Why?

Under what organizational conditions will data mining spread to a broader audience?

Friday, April 04, 2008

Data modeling infrastructure in data mining

I've had two inquiries in the last day relating to the building of data infrastructure between the database and predictive modeling tool, which I find to be an interesting coincidence. I hadn't even thought about a need here before (perhaps because I wasn't aware of the vendors that address this issue), but am curious if others have thought through this issue/problem.

I have seen situations where the analyst and DBA need to coordinate but, due to the politics or personalities in an organization, do not. In these cases, a data miner may need tables that actually exist, but the miner doesn't have permission to access the tables, or perhaps doesn't have the expertise to know how to join all the requisite tables. In these cases, I can imagine this middleware, if you will, could be quite useful if it were more user-friendly. However, I'm not yet convinced this is a real issue for most organizations.

Any thoughts?

Wednesday, April 02, 2008

Another Moneyball quote

Gotta get back in the habit of posting...

A quick way is to post another quote from Moneyball that I really liked:

Intelligence about baseball statistics had become equated in the public mind with the ability to recite arcane baseball stats. What James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible; and that point, somehow, had been lost. "I wonder," James wrote, "if we haven't become so numbed by all these numbers that we are no longer capable of truly assimilating any knowledge which might result from them."
(p.95)

What I like about this quote is that it is something many of us in the analytics world have experienced: losing the point of the modeling or summary statistics by forgetting why we are doing the analysis in the first place. Or, as my good friend John Elder used to describe it, "rapture of the depths."

Sunday, January 13, 2008

Data Mining: Interesting Ethical Questions

Data mining permits useful extrapolation from sometimes obscure clues. Information which human experts have ignored as irrelevant has been eagerly snapped up by data mining software. This leads to interesting ethical questions.

Consider the risk of selling an individual automobile insurance for one year. Many factors are related to this risk. Some are obvious, such as incidence of previous accidents, traffic violations or average number of miles driven per year. Other risk factors may not be so obvious, but are nonetheless real. Suppose that it could be shown statistically that, when added to information already in use, late payment of utility bills incrementally improved prediction.
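
To make "incrementally improved prediction" concrete, here is a hedged sketch of how one might test whether such a field adds predictive value on top of the existing risk factors. The data file and field names are hypothetical, and the model and metric (logistic regression, AUC on a holdout set) are just one reasonable choice.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical historical policy data with a binary claim outcome
df = pd.read_csv("policy_history.csv")
baseline_fields = ["prior_accidents", "violations", "annual_miles"]   # assumed existing risk factors
candidate_field = "late_utility_payments"                             # the new, controversial predictor
target = "claim_filed"

train, test = train_test_split(df, test_size=0.3, random_state=0)

def test_auc(fields):
    """Fit a simple risk model on the given fields and score it on held-out data."""
    model = LogisticRegression(max_iter=1000)
    model.fit(train[fields], train[target])
    return roc_auc_score(test[target], model.predict_proba(test[fields])[:, 1])

print("baseline AUC:      ", test_auc(baseline_fields))
print("with new predictor:", test_auc(baseline_fields + [candidate_field]))
```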

One might take the perspective that this is a business of prediction, not explanation, so- whatever the connection- this information should be added to the insurance risk model. This perspective reasons: if the connection is statistically significant, however strange it may seem, we should conclude that it is real and it should be exploited for business purposes.

Obviously, there is a countervailing perspective which has the customer asking, "What the... ? What do my utility bills have to do with my car insurance?" Even extremely laissez-faire governments may intervene in markets and forsake economic efficiency in favor of other priorities. In the United States, for example, certain types of discrimination in lending are illegal.

Another thing to consider (again, granting that the utility bill-automobile risk connection is real) is that prohibiting the use of utility bill payments in auto insurance risk prediction implies that less risky customers will end up paying for riskier customers.

Thoughts?