Hal Varian on the Need for Data Interpreters

Hal Varian, Google’s chief economist, gave a nice summary of a major need of our era.

Emphasis added:

“The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.

“I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. … being able to access, understand, and communicate the insights you get from data analysis —are going to be extremely important.”

Hal Varian, Google’s Chief Economist, 2009

KazAnova on Stacking: leveraging multiple machine learning algorithms for better predictive models

Machine learning can be a powerful tool in the creation of predictive models. But it doesn’t provide a magic bullet. In the end, effective machine learning works very much like other high-value human endeavors. It requires experimentation, evaluation, lots of work, and a measure of hard-earned wisdom.

As Kaggle Competitions Grandmaster Marios Michailidis (AKA KazAnova) explains:

No model is perfect. Almost every time the models make mistakes. Plus, each model has different advantages and disadvantages and they tend to seize the data from different angles. Leveraging the uniqueness of each model is of the essence for building very predictive models.

To help with this process, David H. Wolpert introduced the concept of stacked generalization in a 1992 paper.

Michailidis explains the process as follows:

Stacking or Stacked Generalization … normally involves a four-stage process. Consider 3 datasets A, B, C. For A and B we know the ground truth (or in other words the target variable y). We can use stacking as follows:

  1. We train various machine learning algorithms (regressors or classifiers) in dataset A.
  2. We make predictions for each one of the algorithms for datasets B and C and we create new datasets B1 and C1 that contain only these predictions. So if we ran 10 models then B1 and C1 have 10 columns each.
  3. We train a new machine learning algorithm (often referred to as Meta learner or Super learner) using B1.
  4. We make predictions using the Meta learner on C1.
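
The four stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Michailidis's code and not StackNet: the toy data, the two base models (fit_line, fit_nearest), and the least-squares meta learner (fit_meta) are hypothetical stand-ins chosen so the example runs with only the standard library.

```python
import math
import random
from statistics import mean


def fit_line(xs, ys):
    """Base model (assumed for illustration): one-feature least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Base model (assumed for illustration): 1-nearest-neighbour lookup."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_meta(rows, ys):
    """Meta learner: least-squares weights over two prediction columns,
    found by solving the 2x2 normal equations with Cramer's rule."""
    a11 = sum(r[0] * r[0] for r in rows)
    a12 = sum(r[0] * r[1] for r in rows)
    a22 = sum(r[1] * r[1] for r in rows)
    t1 = sum(r[0] * y for r, y in zip(rows, ys))
    t2 = sum(r[1] * y for r, y in zip(rows, ys))
    det = a11 * a22 - a12 * a12
    w1 = (t1 * a22 - t2 * a12) / det
    w2 = (a11 * t2 - a12 * t1) / det
    return lambda r: w1 * r[0] + w2 * r[1]


def make_data(n, rng):
    """Toy data: a noisy trend plus a wiggle (purely synthetic)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [0.5 * x + math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


rng = random.Random(0)
xa, ya = make_data(60, rng)  # dataset A: target known
xb, yb = make_data(40, rng)  # dataset B: target known
xc, _ = make_data(20, rng)   # dataset C: target treated as unknown

# Stage 1: train the base models on A.
base_models = [fit_line(xa, ya), fit_nearest(xa, ya)]

# Stage 2: predict on B and C; B1 and C1 hold only those predictions,
# one column per base model.
B1 = [[model(x) for model in base_models] for x in xb]
C1 = [[model(x) for model in base_models] for x in xc]

# Stage 3: train the meta learner on B1 against B's known target.
meta = fit_meta(B1, yb)

# Stage 4: predict on C1 with the meta learner.
c_pred = [meta(row) for row in C1]

print("first stacked predictions for C:", [round(p, 2) for p in c_pred[:3]])
```

On dataset B, the stacked predictions cannot have a higher squared error than either base model alone, because the least-squares fit could always fall back to copying a single column. Whether that advantage carries over to C is what a held-out evaluation has to show.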

As part of his own PhD work, Michailidis developed a software stack, named StackNet, to speed up the process.

Marios Michailidis describes StackNet in this way:

StackNet is a computational, scalable and analytical framework implemented with a software implementation in Java that resembles a feedforward neural network and uses Wolpert’s stacked generalization in multiple levels to improve accuracy in classification problems. In contrast to feedforward neural networks, rather than being trained through back propagation, the network is built iteratively one layer at a time (using stacked generalization), each of which uses the final target as its target.

StackNet is available on GitHub under the MIT license.

Be sure to read the interview with Michailidis about stacking and StackNet on the Kaggle blog.

Strategy Tips for Kaggle Competitors

Martin O’Leary recently posted some sound advice for Kaggle competitors. You can find the three-paragraph version in the Kaggle wiki.

Here I’ll break it into four key points:

  1. Spend a while on visualization, making graphs of various properties of the data and trying to get a feel for how everything fits together.
  2. Test the performance of a variety of standard algorithms (random forests, SVMs, elastic net, etc.) to see how they compare. It’s often very informative to look at which data points are the least well predicted by standard algorithms, as this can give you a good idea of what direction to move in. (Be warned: Home-brew algorithms can be useful later on in a project, but in the early stages you want to try out as many things as possible, not get bogged down in the details of implementing a particular algorithm.)
  3. Then move into the nitty-gritty details once you have a sense for the lay of the land.
  4. Of course, all this assumes a certain kind of problem, where the data is already in numeric/categorical form. For more “interesting” datasets, such as the recent Automated Essay Scoring competition, a lot of the early work is in feature extraction — just looking for numbers which you can pull out of the data. That tends to be a bit more creative, and I use a variety of tools to see what works best. However, one of the joys of this kind of problem is that every one is different, so it’s hard to give general advice.
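
Point 2 can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the three toy models (fit_mean, fit_line, fit_nearest) are hypothetical stand-ins for the random forests, SVMs, and elastic nets O’Leary mentions, the data is synthetic, and one point is corrupted on purpose so there is something to find. It compares the models with leave-one-out errors and then lists the data points that are least well predicted across all of them.

```python
import math
import random
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """One-feature least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """1-nearest-neighbour lookup."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def loo_errors(fit, xs, ys):
    """Absolute leave-one-out error of `fit` at every data point."""
    errors = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append(abs(model(xs[i]) - ys[i]))
    return errors


# Synthetic data with one deliberately corrupted target value.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [0.5 * x + math.sin(x) + rng.gauss(0, 0.3) for x in xs]
OUTLIER = 7
ys[OUTLIER] += 50.0

# Compare the standard models on the same data.
fitters = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}
errors = {name: loo_errors(fit, xs, ys) for name, fit in fitters.items()}
for name, errs in errors.items():
    print(f"{name:8s} mean abs error = {mean(errs):.2f}")

# Rank points by their average error across all models, worst first.
avg_error = [mean(errs[i] for errs in errors.values()) for i in range(len(xs))]
worst = sorted(range(len(xs)), key=lambda i: avg_error[i], reverse=True)[:5]
print("least well predicted points:", worst)
```

Points that every model gets wrong are worth a closer look: they may be data errors, as in this contrived case, or they may hint at structure that none of the standard algorithms captures.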