Data mining made faster: New method eases analysis of 'multidimensional' information

July 22, 2010 - To many big companies, you aren't just a customer, but are described by multiple "dimensions" of information within a computer database. Now, a University of Utah computer scientist has devised a new method for simpler, faster "data mining," or extracting and analyzing massive amounts of such data.
"Whether you like it or not, Google, Facebook, Walmart and the government are building profiles of you, and these consist of hundreds of attributes describing you" - your online searches, purchases, shared videos and recommendations to your Facebook friends, says Suresh Venkatasubramanian, an assistant professor of computer science.
"If you line them up for each person, you have a line of hundreds of numbers that paint a picture of a person: who they are, what their interests are, who their friends are and so forth," he says. "These strings of hundreds of attributes are called high-dimensional data because each attribute is called one dimension. Data mining is about digging up interesting information from this high-dimensional data."
A group of data-mining methods named "multidimensional scaling" or MDS was first used in the 1930s by psychologists and has been used ever since to make data analysis simpler by reducing the "dimensionality" of the data. Venkatasubramanian says it is "probably one of the most important tools in data mining and is used by countless researchers everywhere."
Now, Venkatasubramanian and colleagues have devised a new method of multidimensional scaling that is faster, simpler, can be used universally for numerous problems and can handle more data, basically by "squashing things [data] down to size."
He is scheduled to present the new method on Wednesday, July 28 in Washington at the premier meeting in his field, the Conference on Knowledge Discovery and Data Mining sponsored by the Association for Computing Machinery.
"This problem of dimensionality reduction and data visualization is fundamental in many disciplines in natural and social sciences," says Venkatasubramanian. "So we believe our method will be useful in doing better data analysis in all of these areas."
"What our approach does is unify into one common framework a number of different methods for doing this dimensionality reduction" to simplify high-dimensional data, he says. "We have a computer program that unifies many different methods people have developed over the past 60 or 70 years. One thing that makes it really good for today's data - in addition to being a one-stop shopping procedure - is it also handles much larger data sets than prior methods were able to handle."
He adds: "Prior methods on modern computers struggle with data from more than 5,000 people. Our method smoothly handles well above 50,000 people."
Venkatasubramanian conducted the research with University of Utah computer science doctoral student Arvind Agarwal and postdoctoral fellow Jeff Phillips. It was funded by the National Science Foundation.
The Curse of Dimensionality
When analyzing long strings of attributes describing people, "you are looking at not just the individual variables but how they interact with each other," he says. "For example, if you describe a person by their height and weight, these are individual variables that describe a person. However, they have correlations among them; a person who is taller is expected to be heavier than someone who is shorter."
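The height-and-weight example can be made concrete with a few lines of Python. This is only an illustrative sketch: the data are synthetic, and the means, spreads and the linear relation between the two attributes are assumptions invented for the example, not figures from the research.

```python
import numpy as np

# Synthetic, made-up data: 1,000 hypothetical people described by two attributes.
rng = np.random.default_rng(0)
n_people = 1000

# Assumed distribution of heights (cm); the numbers are illustrative only.
height_cm = rng.normal(170.0, 10.0, size=n_people)

# Assumed relation: taller people tend to be heavier, plus individual variation.
weight_kg = 0.9 * height_cm - 90.0 + rng.normal(0.0, 5.0, size=n_people)

# Pearson correlation coefficient between the two attributes.
r = np.corrcoef(height_cm, weight_kg)[0, 1]
print(f"correlation between height and weight: {r:.2f}")
```

A coefficient near 1 means the two attributes carry largely overlapping information, which is the kind of redundancy that dimensionality reduction exploits.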
The high "dimensionality" of data stems from the fact that "the variables interact with each other. That's where you get a [multidimensional] space, not just a list of variables."
"Data mining means finding patterns, relationships and correlations in high-dimensional data," Venkatasubramanian says. "You literally are digging through the data to find little veins of information."
He says uses of data mining include Amazon's recommendations to individual customers based not only on their past purchases, but on those of people with similar preferences, and Netflix's similar method for recommending films. Facebook recommends friends based on people who already are your friends, and on their friends.
"The challenge of data mining is dealing with the dimensionality of the data and the volume of it. So one expression common in the data mining community is 'the curse of dimensionality,'" says Venkatasubramanian.
"The curse of dimensionality is the observed phenomenon that as you throw in more attributes to describe individuals, the data mining tasks you wish to perform become exponentially more difficult," he adds. "We are now at the point where the dimensionality and size of the data is a big problem. It makes things computationally very difficult to find these patterns we want to find."
Multidimensional scaling to simplify multidimensional data is an attempt "to reduce the dimensionality of data by finding key attributes defining most of the behavior," says Venkatasubramanian.
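For readers who want to see what multidimensional scaling looks like in practice, here is a minimal sketch of classical (Torgerson) MDS, one of the long-established variants. It is not the new Utah method. The function names, the data sizes and the assumption that the 100 attributes are really driven by just two hidden factors are all hypothetical choices made for the illustration.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed n points in k dimensions from an
    (n, n) matrix D of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Hypothetical data: 50 people, 100 attributes, all driven by 2 hidden factors.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 100))
X = hidden @ mixing                            # shape (50, 100)

D = pairwise_distances(X)
Y = classical_mds(D, k=2)                      # shape (50, 2)

# How well do two dimensions reproduce the original distances?
max_error = np.abs(D - pairwise_distances(Y)).max()
print("largest distance error after reduction:", max_error)
```

Because the synthetic data were built from two hidden factors, two output dimensions are enough to preserve the distances between people; real data rarely reduce this cleanly.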
Universal, Fast Data Mining
Venkatasubramanian's new method is universal - "a new way of abstracting the problem into little pieces, and realizing many different versions of this problem can be abstracted the same way." In other words, one set of instructions can be used to do a wide variety of multidimensional scaling that previously required separate instructions.
The new method can handle large amounts of data because "rather than trying to analyze the entire set of data as a whole, we analyze it incrementally, sort of person by person," Venkatasubramanian says. That speeds data mining "because you don't need to have all the data in front of you before you start reducing its dimensionality."
Venkatasubramanian and colleagues performed a series of tests of their new method with "synthetic data" - data points in a "high-dimensional space."
The tests show the new way of data mining by multidimensional scaling "can be faster and equally accurate - and usually more accurate" than existing methods, he says.
The method has what is known as "guaranteed convergence," meaning that "it gets you a better and better and better answer, and it eventually will stop when it gets the best answer it can find," Venkatasubramanian says. It also is modular, which means parts of the software are easily swapped out as improvements are found.
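The person-by-person idea and the "better and better answer" property can be illustrated with a small sketch. It is not the researchers' algorithm: it uses a standard majorization step for the squared-error "stress" cost, moving one point at a time while the others stay fixed. The function names, data sizes and number of sweeps are assumptions chosen for the example.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(Y, D):
    """Sum over pairs of (embedded distance - target distance)^2."""
    return ((pairwise_distances(Y) - D) ** 2).sum() / 2.0  # count each pair once

def place_point(Y, D, i):
    """Move point i alone, keeping all other points fixed. Each other point j
    'votes' for the spot at distance D[i, j] from itself in the direction of
    the current Y[i]; the new position is the average of the votes. This
    majorization step never increases the stress."""
    others = np.delete(np.arange(len(Y)), i)
    diff = Y[i] - Y[others]
    norm = np.linalg.norm(diff, axis=1, keepdims=True)
    unit = diff / np.maximum(norm, 1e-12)      # guard against zero distance
    votes = Y[others] + D[i, others][:, None] * unit
    return votes.mean(axis=0)

# Synthetic data: 30 points in a 20-dimensional space (sizes are arbitrary).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))
D = pairwise_distances(X)

Y = rng.normal(size=(30, 2))                   # random 2-D starting layout
history = [stress(Y, D)]
for sweep in range(20):
    for i in range(len(Y)):                    # one point at a time
        Y[i] = place_point(Y, D, i)
    history.append(stress(Y, D))

print("stress before:", history[0], "after:", history[-1])
```

The recorded stress values never go up from one sweep to the next, which is the sense in which such a procedure yields a steadily better answer until it can improve no further.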
Privacy and Data Mining
What of concerns that we are sacrificing our privacy to marketers?
"The issue of privacy in data mining is like any set of potentially negative consequences of scientific advances," says Venkatasubramanian, adding that much research has examined how to mine data in a manner that protects individual privacy.
He cites Netflix's movie recommendations, for example, noting that "if you target advertising based on what people need, it becomes useful. The better the advertising gets, the more it becomes useful information and not advertising."
"And the way we are being inundated with all forms of information in today's world, whether we like it or not we have no choice but to allow machines and automated systems to sift through all this to make sense of the deluge of information passing our eyes every day."
Provided by University of Utah