Strategy Tips for Kaggle Competitors

Martin O’Leary recently posted some sound advice for Kaggle competitors. You can find the three-graph version in the Kaggle wiki.

Here I’ll break it into four key points:

  1. Spend a while on visualization, making graphs of various properties of the data and trying to get a feel for how everything fits together.
  2. Test the performance of a variety of standard algorithms (random forests, SVMs, elastic net, etc.) to see how they compare. It’s often very informative to look at which data points are the least well predicted by standard algorithms, as this can give you a good idea of what direction to move in. (Be warned: Home-brew algorithms can be useful later on in a project, but in the early stages you want to try out as many things as possible, not get bogged down in the details of implementing a particular algorithm.)
  3. Then move into the nitty-gritty details once you have a sense for the lay of the land.
  4. Of course, all this assumes a certain kind of problem, where the data is already in numeric/categorical form. For more “interesting” datasets, such as the recent Automated Essay Scoring competition, a lot of the early work is in feature extraction — just looking for numbers which you can pull out of the data. That tends to be a bit more creative, and I use a variety of tools to see what works best. However, one of the joys of this kind of problem is that every one is different, so it’s hard to give general advice.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s