It’s the data, not the algorithm
Chris Dixon points out that the challenge in “artificial intelligence” — which in a mainstream definition would be “making computers learn like humans” — lies not so much in the cleverness of the algorithm as in finding useful data. He cites the Google example, where the breakthrough was realizing that links are a good (and previously untapped) source of data on what people think is a relevant web page. Modern AI algorithms are very powerful, but the reality is there are thousands of programmers/researchers who can implement them with about the same level of success. The Netflix Challenge demonstrated that a massive, world-wide effort only improves on an in-house algorithm by approximately 10%. […] It’s relatively easy to build systems that are right 80% of the time, but very hard to go beyond that.
Algorithms are, as they say in business school, “commoditized.” The order-of-magnitude breakthroughs (and companies with real competitive advantages) are going to come from those who identify or create new data sources.
(That darned 80–20 rule should be promoted to a law.)
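To make the Google example concrete: the insight wasn’t a fancier learner, it was that the link graph itself is a data source about relevance. Here is a minimal, illustrative sketch of that idea, a toy PageRank-style power iteration over a made-up link graph (the graph, damping factor, and iteration count are all invented for illustration):

```python
# Toy illustration of "links as data": pages voted for by well-linked
# pages rank higher. A PageRank-style power iteration over a tiny,
# made-up link graph; every value here is illustrative.

damping = 0.85
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # fixed iteration count for simplicity; ranks settle quickly
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))  # "c" and "a" come out on top
```

The algorithm fits in twenty lines; the valuable part was noticing that the web had already produced the data.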
I’ve experienced this lately while working on my own startup and flirting with another. In both cases, the algorithms are intended to figure out what people find relevant or related. The algorithms aren’t that hard to write to achieve an 80% success rate. (In theory.) But they need to be fed.
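To give a flavor of what I mean by “related” (a generic sketch, not our actual algorithm): even simple co-occurrence counting, where items that show up in the same people’s histories score as similar, gets you a long way toward that 80%.

```python
# A generic sketch of a "relatedness" computation: items that co-occur
# in the same users' histories are scored as related, using cosine
# similarity over co-occurrence counts. The data below is made up.
import math
from collections import defaultdict

# The input this needs to be fed: each user's set of items.
histories = [
    {"article1", "article2", "article3"},
    {"article1", "article3"},
    {"article2", "article4"},
]

item_counts = defaultdict(int)   # users who touched each item
pair_counts = defaultdict(int)   # users who touched both items of a pair

for items in histories:
    for item in items:
        item_counts[item] += 1
    for a in items:
        for b in items:
            if a < b:
                pair_counts[(a, b)] += 1

def relatedness(a, b):
    """Cosine similarity between the sets of users who touched a and b."""
    both = pair_counts[tuple(sorted((a, b)))]
    denom = math.sqrt(item_counts[a] * item_counts[b])
    return both / denom if denom else 0.0

print(relatedness("article1", "article3"))  # 2 / sqrt(2 * 2) = 1.0
```

The code is the easy part; notice that the `histories` variable is doing all the work, and that is exactly the part a new company doesn’t have.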
The data we require is largely public and already sits in big companies’ databases — but collecting it ourselves would be a long slog of unintelligent, undifferentiated work.
Luckily, those databases are available via APIs. (I don’t want to reveal the companies for fear of tipping others off to what we’re up to.) Without those APIs, my company would be much harder to get off the ground.
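In practice, bootstrapping on someone else’s data looks roughly like the sketch below. The endpoint is deliberately hypothetical, since I’m not naming the companies, but the shape is typical of public JSON APIs:

```python
# A hedged sketch of pulling pre-collected public data from a provider's
# API. The endpoint and response shape are hypothetical stand-ins for
# the real (unnamed) services.
import json
import urllib.request

def fetch_related_items(item_id: str) -> list:
    """Fetch the provider's already-collected relatedness data for one item."""
    url = f"https://api.example.com/v1/items/{item_id}/related"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

# Records like these become the feed for our own relevance algorithm,
# in place of months of undifferentiated crawling and collection.
```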
So I take some comfort in knowing that my algorithm needs to be smart, but it doesn’t need to be rocket science. And I feel lucky that these APIs exist.
But it does demonstrate one thing that can prevent small companies from unseating the big ones — data.