Data Mining: The New Gold Rush
Photo via Arbeck of Wikimedia Commons
Data, and the insight it provides, is power. Simply look at the rash of privacy breaches that struck the NSA, Target, iCloud, Samsung and the United States Postal Service to see what most organizations consider private. Data is growing exponentially, and now more than ever, online users need to understand what happens to their data in order to avoid, as Dropbox CEO Drew Houston infamously put it, a “trade off between privacy and convenience.”
The Digital Universe is doubling in size every two years. According to a 2014 study by the International Data Corporation, the amount of data will grow from 4.4 trillion gigabytes in 2013 to 44 trillion gigabytes by 2020. In more human terms, the average household today creates enough data to fill 65 32GB iPhones per year. By 2020 this will increase to 318 iPhones, according to EMC, a corporation that offers data storage and analysis.
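To see how the "doubling every two years" claim squares with the tenfold jump quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes smooth exponential growth between 2013 and 2020, and the function and variable names are made up for this example rather than taken from the IDC or EMC reports.

```python
import math

# Figures quoted in the article (IDC, 2014), in trillions of gigabytes.
SIZE_2013 = 4.4
SIZE_2020 = 44.0
YEARS = 2020 - 2013

# Household figures quoted in the article (EMC): 32GB iPhones filled per year.
IPHONE_GB = 32
HOUSEHOLD_IPHONES_TODAY = 65
HOUSEHOLD_IPHONES_2020 = 318


def implied_doubling_time(start, end, years):
    """Years needed to double, assuming smooth exponential growth."""
    return years * math.log(2) / math.log(end / start)


def projected_size(start, years, doubling_time):
    """Size after `years` if the quantity doubles every `doubling_time` years."""
    return start * 2 ** (years / doubling_time)


doubling = implied_doubling_time(SIZE_2013, SIZE_2020, YEARS)
strict_two_year = projected_size(SIZE_2013, YEARS, 2.0)

print(f"Implied doubling time: {doubling:.2f} years")
print(f"Strict two-year doubling by 2020: {strict_two_year:.1f} trillion GB")
print(f"Household data today: {HOUSEHOLD_IPHONES_TODAY * IPHONE_GB} GB per year")
print(f"Household data in 2020: {HOUSEHOLD_IPHONES_2020 * IPHONE_GB} GB per year")
```

Under that assumption, a tenfold increase over seven years works out to a doubling time of a little over two years, so the two figures roughly agree. A strict two-year doubling would land slightly higher, at close to 50 trillion gigabytes.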
“The amount of data created in the past two years is more than the amount of data we’ve ever had… So there is a huge amount of data and a need for a way to sort through them,” says Hui Yang, an assistant professor in the Computer Science department at SF State.
The bulk of this data is metadata, or information generated when you use technology. It is everyday data collected from consumers’ activities and can contain information such as locations, IP addresses, web searches and other browser histories. By law, most metadata can be stored indefinitely and, through data mining – a field in computer science that analyzes the patterns and connections among data – can be used to classify anything from relationships between genes and diseases to which internet users are more likely to buy a company’s product. Using this information for commercial purposes is where data mining gets a bad rap.
Although “data mining” has only recently become a pop culture term, for decades it has played an intangible role in the growth and comprehension of the digital universe. It helps find patterns among vast amounts of data that human eyes cannot discover. And while data mining is applied to everything from medical data to business data to human rights data, it is also one of the tools used by data brokers – companies that collect, maintain, and sell data on millions of consumers, generally without the consumer’s permission or knowledge.
The stigma that now surrounds large-scale data collection of any kind is a more recent development, one that is often more visible than the data being acquired, and it can largely be attributed to the business built around selling people’s metadata.
According to last year’s report from the International Data Corporation, a market research and analysis firm, “In 2013, two-thirds of the digital universe bits were created or captured by consumers and workers, yet enterprises had liability or responsibility for 85% of the digital universe.”
Data brokers are among these enterprises.
Much of the personal information analyzed through data mining and collected by data brokers is demographic and transaction information about the user, the device, and the activities occurring in between. But credit card information, census data, and public records are also included.
“This information makes clear that consumers going about their daily activities – from making purchases online and at brick-and-mortar stores, to using social media, to answering surveys to obtain coupons or prizes, to filing for a professional license – should expect that they are generating data that may well end up in the hands of data brokers… without their permission to construct detailed profiles on them reflecting judgments about their characteristics and predicted behaviors,” reads a 2014 Senate committee report.
Generally, metadata analysis aims only to deduce codes and statistics like IP addresses, but when a user is tracked across multiple platforms, the paper trail can become pretty direct.
Even then, the Senate report goes on to say, “Some privacy and information experts have expressed concerns that re-identification techniques may be used with such data, and questioned whether data that identifies specific computers and devices can truly be considered anonymous.”
Anonymous from whom? When the Senate asked data brokers who buys their gathered information, companies across many industries were named.
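The re-identification concern can be illustrated with a small Python sketch. Everything in it is hypothetical: the records are invented, the IP addresses come from ranges reserved for documentation, and `link_by_ip` is a name made up for this example. It does not show any data broker's actual method, only the general idea that a shared identifier can tie "anonymous" activity to a name.

```python
from collections import defaultdict

# Made-up records. The search log holds no names, only IP addresses.
search_log = [
    {"ip": "203.0.113.7", "query": "knee surgery recovery"},
    {"ip": "203.0.113.7", "query": "flights to Denver"},
    {"ip": "198.51.100.4", "query": "fantasy football rankings"},
]

# A second made-up dataset in which the same IP address appears next to a name.
store_orders = [
    {"ip": "203.0.113.7", "name": "Pat Example", "item": "knee brace"},
]


def link_by_ip(anonymous_events, identified_events):
    """Attach names from one dataset to 'anonymous' events in another via a shared IP."""
    names_by_ip = {event["ip"]: event["name"] for event in identified_events}
    profiles = defaultdict(list)
    for event in anonymous_events:
        name = names_by_ip.get(event["ip"])
        if name is not None:  # only events whose IP also appears with a name
            profiles[name].append(event["query"])
    return dict(profiles)


profiles = link_by_ip(search_log, store_orders)
print(profiles)
```

In this toy case, two searches that carried no name end up in a profile for "Pat Example" simply because the same IP address shows up in both datasets. The third search stays unlinked only because its address appears nowhere else.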
“12 of the top 15 credit card issuers; seven of the top 10 retail banks; eight of the top 10 telecom/media companies… three of the top 10 pharmaceutical manufacturers; five of the top 10 life/health insurance providers; nine of the top 10 property and casualty insurers,” reads the 2014 Senate report.
Some of the best-known offenders: Yahoo, Twitter, YouTube, Google or DoubleClick, and AOL. But what’s surprising is the type of companies that buy and sell consumers’ metadata.
Just recently, the Associated Press reported that the Affordable Care Act website, where Americans can sign up to receive health care, was sending users’ information to a number of third-party companies.
So it seems that no matter how personal, some information is not private information, at least not to these companies. The amount and type of information gathered and analyzed is ultimately unknown to most users, and this lack of transparency makes opting out of having your data collected almost impossible.
But fear not: online user security is becoming a more immediate concern. In February, President Obama announced new rules requiring intelligence analysts, such as those at the NSA, to delete private information they may accidentally collect about Americans. The President also spoke at The White House Summit on Cybersecurity and Consumer Protection at Stanford University on February 13, discussing legislation intended to strengthen cybersecurity, an issue that he likened to “the wild wild west,” according to the New York Times.
Since 2009 and continuing into 2014, the United States Federal Trade Commission has recommended that Congress develop legislation that allows consumers to view the information data brokers hold about them. One of the few online consumer rights laws is California’s “Shine the Light” law, which requires companies doing business with Californians either to allow customers to opt out of information sharing or to disclose how personal information will be shared.
Hence, obscure and needlessly long privacy terms and agreements are more relevant than ever.
“Data mining is relatively new and it’s affecting everyone but it does not have many laws. It’s like a free market,” Yang says. “People definitely feel like they are being watched, but if you look at privacy and then what people post, (privacy) needs a lot of work.”
To some extent, the fear about data mining can be attributed to a general lack of knowledge and regulation, fueled by headlines about the NSA. On the other hand, users are actively creating and allowing the collection and analysis of their information.
Last quarter Facebook reported an average of 890 million daily active users. A 2012 survey done by Pew Research Center shows that “More than half of social networking site users (58 percent) say their main profile is set to private.” That still leaves the data of 42 percent of social media users unprotected.
Data is constantly being created, but in the current age it has also come to mean more, not only to users but also to the companies that consume the data. Data has become a panopticon, a platform on which we create our own images and through which others see our constant updates.
Ultimately, it is up to the user to manage what information they put online. Data mining and other computer sciences can work both to consumers’ advantage and to their disadvantage.
When data is pooled about locations and transactions, business with the companies who analyze this data can be much more personalized. Take Google Now as an example. If you input information such as the location of your home or work, your favorite sports teams, your most frequently made food or grocery orders, or even your airplane tickets, Google Now will provide “relative suggestions” on routes to work, restaurants and events in your area, provide updates on your favorite team, and remind you of when your flight is and when you should leave to arrive on time.
In this setting, what can be considered private information can be sacrificed for convenient personalization.
On the other hand, organizations like Stop Data Mining provide “opt out lists” with links to the opt out pages of companies that collect data. Or for a simpler solution, almost all major browsers contain a “Do Not Track” preference. Other options include removing or manually managing the “cookies” that collect metadata, switching to alternative search engines like DuckDuckGo, which doesn’t collect or share personal information, and of course adjusting privacy settings on social media.
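For the curious, the “Do Not Track” preference is technically modest: when it is switched on, the browser adds a header named `DNT` with the value `1` to the requests it sends. The Python sketch below builds such a request without sending it. The URL and the helper name `build_request_with_dnt` are placeholders chosen for this example. Honoring the header is voluntary on the website's side, so it expresses a preference rather than enforcing one.

```python
import urllib.request


def build_request_with_dnt(url):
    """Build (but do not send) an HTTP request that carries the Do Not Track header."""
    return urllib.request.Request(url, headers={"DNT": "1"})


# example.com is a placeholder domain, not a site discussed in the article.
request = build_request_with_dnt("https://example.com/")

# urllib stores header names capitalized ("Dnt"); HTTP header names are case-insensitive.
for name, value in request.header_items():
    print(f"{name}: {value}")
```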
As the amount of data continues to grow exponentially, there will be a need for more ways to organize and sort it. What’s data mining’s future?
“More of it. More people from more backgrounds becoming data scientists. More tools for data scientists. More schools teaching data science. More products built on data understanding. Oh, and robots,” says Todd Holloway, a data analyst at Trulia and an organizer of the San Francisco Data Mining Group, which teaches how to use data mining effectively to, say, win at fantasy football.
Whichever way you lean, know the power of the data you put out and the transparency that it carries.