Sunday, May 22, 2011

The Privacy Challenge in Online Prize Contests

Two big new prize contests just getting under way take a page from the innovative, exciting competition run by Netflix. In a nail-biting finish in the fall of 2009, the movie rental service paid $1 million to a global team of data mavens, who just edged out another group, for most improving its online film recommendations.

The Netflix contest was celebrated as a triumph for the company and as a catalyst for bringing new techniques to data analysis. But in 2010, Netflix was forced to shelve a planned second contest because of privacy concerns. Two researchers showed that the supposedly anonymous data from the first contest could be used to identify customers, which eventually brought an inquiry from the Federal Trade Commission and a lawsuit.

Earlier this month, Overstock.com, an online retailer, announced it would sponsor a $1 million contest, with the prize going to the person or team that could most improve its product recommendations. And a few weeks earlier, the Heritage Provider Network, a medical group in California, released the data and the details for its $3 million contest. Its prize will go to the team that comes up with a technique for most accurately predicting which patients will be admitted to hospitals in the next year.

Both contests, like the Netflix competition, require contestants to come up with predictive algorithms, using anonymized personal data as the test bed.

So how to avoid a Netflix-style privacy blowup in the new contests?

Darren Vengroff, chief scientist at Rich Relevance, a start-up that develops recommendation technology for online retailers, has a plan. Rich Relevance is running the RecLab Prize, with Overstock putting up the $1 million.

Mr. Vengroff’s strategy involves limiting the number of contestants who receive real customer data, with names and other identifying information stripped off. In the early round of competition, teams will instead get a hypothetical data set.

Then, in the semifinals (10 contestants) and finals (down to three), he explains, the competing algorithms will run on real customer data. But that customer data will reside on Rich Relevance’s computers, in a private “cloud” environment. That is a different approach from the model used by Netflix, which released the anonymized data to contestants.

Mr. Vengroff, who called the Netflix contest “tremendously valuable” for elevating the field of data analysis, said the privacy model in his contest was far more secure.

The organizers of the $3 million Heritage Health Prize, it seems, are counting on Arvind Narayanan. He was one of the two researchers who took the Netflix data and showed it could be mined and massaged to identify customers.

Heritage has put Mr. Narayanan on its advisory board. Today, he is a postdoctoral researcher at Stanford University and a scholar at Stanford’s Center for Internet and Society.

Mr. Narayanan said he had not yet finished his report for the Heritage prize organizers, so he would not speak of that contest in detail.

But Mr. Narayanan did have some advice for any company or institution that wanted to use anonymized personal data for research. “Be honest and ask nicely,” he said. Handling personal data on the Web, even when stripped of personally identifying information like names and credit card numbers, is a risk management game.

“There are privacy risks, even if they are small,” he said.

Mr. Narayanan points to the consent request by 23andMe, the genetic testing service, to use anonymized personal data for medical research as a model of candor.

As sensitive as people understandably are about personal health information, Mr. Narayanan suggested that it might be easier to protect privacy in the Heritage prize contest than it was for Netflix.

Each anonymized account in the Netflix database, he recalled, had an average of more than 200 movie ratings or reviews. “That’s a lot of behavioral information,” rich in clues to identity, Mr. Narayanan said. “There are many subtle differences in different kinds of personal information.”
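The linkage idea behind that re-identification risk can be sketched in a few lines. The toy data and names below are invented for illustration; the point is simply that a handful of externally observed ratings (say, from a person's public movie reviews) can match exactly one record in a large "anonymized" database:

```python
# Toy illustration of a linkage attack on "anonymized" ratings data.
# All user IDs, movies, and ratings here are fabricated for the example.

anonymized_db = {
    "user_001": {"Heat": 5, "Alien": 4, "Amelie": 2, "Fargo": 3},
    "user_002": {"Heat": 5, "Alien": 4, "Up": 5},
    "user_003": {"Alien": 1, "Amelie": 2, "Fargo": 3},
}

# An attacker's auxiliary knowledge about a target, e.g. gathered
# from reviews the person posted publicly under their real name.
known_about_target = {"Heat": 5, "Amelie": 2, "Fargo": 3}

def matching_records(db, clues):
    """Return the anonymized IDs whose ratings agree with every clue."""
    return [
        user_id
        for user_id, ratings in db.items()
        if all(ratings.get(movie) == score for movie, score in clues.items())
    ]

print(matching_records(anonymized_db, known_about_target))  # ['user_001']
```

With an average of 200 ratings per account, far fewer clues than that are typically enough to narrow the candidates to a single record, which is why stripping names alone does not anonymize rich behavioral data.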
