Semantic Search: The Myth and Reality - ReadWriteWeb

Semantic Search: The Myth and Reality
Written by Alex Iskold / May 29, 2008 2:15 PM / 24 Comments
For a few years now people have been talking about semantic search. Any technology that stands a chance to dethrone Google is of great interest to all of us, particularly one that takes advantage of long-awaited and much-hyped semantic technologies. But no matter how much progress has been made, most of us are still underwhelmed by the results. In head-to-head comparisons with Google, the results have not come out much different. What are we doing wrong?
For example, when asked "What is the capital of France?" both approaches come back with the correct answer - Paris. Also, a lot of queries that we are used to typing into Google in abbreviated form come back with similar results if we type them using natural language. Clearly something is off. We all know that semantic technologies are powerful, but how and why? In this post we will show that the problem is that we are asking the wrong questions.
The mistake is that semantic search engines present us with a Google-like search box and allow us to enter free-form queries. So we type the things that we are used to asking - primitive queries. It never occurs to us to type in "What actor starred in both Pulp Fiction and Saturday Night Fever?" or "What two US Senators received donations from a foreign entity?" We type simple questions, but this is not where the power of semantic search lies. Let's look at the spectrum of semantic technologies, from Google to SearchMonkey to Powerset and Freebase, to understand what is going on.
What Problem Are We Trying to Solve?
The first confusion in the space comes from the fact that semantic search is being positioned as the answer to all possible problems - from modern search, currently dominated by Google, to problems that are computationally impossible. The situation is made more difficult by the fact that right now there is only a thin range of problems where semantic search can clearly do better. This range is complex queries involving inferencing and reasoning over a complex data set.

As shown in the diagram above, basic queries are easily handled by Google. Sadly, natural language processing gives little advantage when it comes to this category of problems. Google correctly answers the question about Leonardo da Vinci's birthday, leaving no opportunity to improve the search by understanding the nouns and the verbs that the user typed in.
Before looking at the problems that are perfect for semantic search, let's look at the hardest problems. These are computationally challenging problems that really have nothing to do with understanding semantics. The misconception has been perpetuated since the early days of the Semantic Web that somehow, because we will annotate the web, we will be able to solve these super complex problems. This is simply not true. There are fundamental limits to what we can compute, and a class of problems that have an exponential number of possible solutions is not going to be magically solved because we represent data as RDF.
The good news is that there is a set of problems that are great for semantic search. These are the problems we have been solving so wonderfully with relational databases. Way too often we forget that semantic technologies are here to help us represent relational data spread over the entire web - so it should be no surprise to us that it is relational queries that semantic search engines would excel at.
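To make the relational framing concrete, here is a minimal sketch (not taken from any of the engines discussed) of the Pulp Fiction / Saturday Night Fever question posed as a join over three tables. The schema, the table names `actor`, `film` and `starred_in`, the helper names, and the handful of hand-entered rows are all assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, hand-entered rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE film (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE starred_in (actor_id INTEGER, film_id INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO actor VALUES (?, ?)",
        [(1, "John Travolta"), (2, "Uma Thurman"), (3, "Karen Lynn Gorney")],
    )
    conn.executemany(
        "INSERT INTO film VALUES (?, ?)",
        [(1, "Pulp Fiction"), (2, "Saturday Night Fever")],
    )
    conn.executemany(
        "INSERT INTO starred_in VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1), (3, 2)],
    )
    return conn


def actors_in_all(conn, titles):
    """Return sorted names of actors credited in every film in `titles`.

    Assumes `titles` contains no duplicates and that film titles are unique.
    """
    placeholders = ",".join("?" for _ in titles)
    rows = conn.execute(
        f"""
        SELECT a.name
        FROM actor AS a
        JOIN starred_in AS s ON s.actor_id = a.id
        JOIN film AS f ON f.id = s.film_id
        WHERE f.title IN ({placeholders})
        GROUP BY a.id
        HAVING COUNT(DISTINCT f.id) = ?
        ORDER BY a.name
        """,
        (*titles, len(titles)),
    ).fetchall()
    return [name for (name,) in rows]


conn = build_demo_db()
print(actors_in_all(conn, ["Pulp Fiction", "Saturday Night Fever"]))
```

The point of the sketch is that the question is an intersection over a relation, which a keyword index has no direct way to express but a structured store answers with one join.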
The Spectrum of Semantic Search Players
But semantic search is not just about the questions that we are asking. Because the web is just a bunch of unstructured HTML pages, semantic search is also about the underlying data. At its most structured extreme we find Freebase - the semantic database of everything. Freebase is accessible via free text search, but more importantly via MQL (Metaweb Query Language). MQL is essentially JSON with wildcards. Using it you can construct any query against Freebase and the result will be the same query with answers filled in.
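The fill-in-the-blanks idea can be illustrated with a toy matcher. This is not MQL and does not call Freebase; it only mimics the query-by-example shape described above, using Python's `None` as the wildcard. The `fill_in` function, the record fields, and the sample data are hypothetical.

```python
def fill_in(template, records):
    """Return one filled-in copy of `template` per matching record.

    A field whose value is None acts as a wildcard: it matches any value
    and is replaced by the record's value in the result. Every other field
    must be present in the record and equal to the template's value.
    """
    results = []
    for record in records:
        matches = all(
            key in record and (wanted is None or record[key] == wanted)
            for key, wanted in template.items()
        )
        if matches:
            # Echo the query's own shape back, with the blanks filled in.
            results.append({key: record[key] for key in template})
    return results


# Hand-entered sample records (illustrative only).
RECORDS = [
    {"type": "person", "name": "Leonardo da Vinci", "born": "1452-04-15"},
    {"type": "city", "name": "Paris", "capital_of": "France"},
]

query = {"type": "city", "capital_of": "France", "name": None}
print(fill_in(query, RECORDS))
```

Because the answer has the same shape as the question, a client can build queries and read results with the same data structure, which is the property the paragraph above attributes to MQL.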

Powerset, in a way, is just a relational database. It operates against certain, structured information. On the other end of the spectrum is Google, which is all about statistical frequencies and very little semantics. The recently launched SearchMonkey from Yahoo! is an interesting twist. It does not add anything to the result set, but instead uses semantic annotations to present a richer, more interactive and useful user interface.
Companies like Hakia and Powerset are probably working the hardest. These companies are trying to simultaneously build Freebase-like structures on the fly and then do natural language queries on top of them. The difference is that Hakia is using (likely similar) technology to query over the entire web, while Powerset has (probably shrewdly) chosen to restrict the search to Wikipedia.
Are Hakia, Powerset and Freebase All That Different?
This analysis brings up a question - which of these technologies are different and which are essentially the same? Let's get the easy one out of the way first. Yahoo!'s SearchMonkey is no different from Google or any other search, as far as the core search technology is concerned. The difference is simply in the presentation layer. SearchMonkey is smart about creating a better user experience by letting publishers present the search results to the users in the best possible way.
But when it comes to Hakia, Powerset and Freebase the situation is much more complicated. On the surface all these products are different - Hakia lets you search the whole web, Powerset is restricted to Wikipedia (and Freebase!) and Freebase itself has two search interfaces - the search box and query language. Here is the problem - the natural language interface has nothing to do with the underlying data representation.
The fact is that all of these semantic search technologies allow people to type in arbitrarily complex questions and then interpret these queries and execute them against their databases. Fundamentally, Hakia, Powerset, and Freebase are databases. Fundamentally, all of them have some kind of Natural Language Processing that translates the question into a canonical query over the database.
To gain insight into all of this, think about Freebase and its query language MQL. Unlike natural language, which allows all sorts of constructs, MQL is unambiguous. This JSON-like language allows users to construct precise statements against Freebase. The fact that Powerset allows natural language queries does not mean that there is no database inside Powerset. There surely is one, similar in kind to the database beneath the Freebase search box. What is really different about Freebase and Powerset is the data gathering approach and the user experience.
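The translation step, from a free-form question to a canonical query, can be sketched with a single hard-coded pattern. Real natural language front ends are far more elaborate; this toy handles exactly one question shape and is only meant to show that the ambiguous text and the unambiguous query are separate layers. The pattern, the `to_canonical` function, and the output field names are assumptions made for illustration.

```python
import re

# One hand-written pattern for questions of the form
# "What <role> starred in both <A> and <B>?"
PATTERN = re.compile(
    r"^What (?P<role>\w+) starred in both (?P<a>.+?) and (?P<b>.+?)\?$",
    re.IGNORECASE,
)


def to_canonical(question):
    """Translate a question into a structured query dict, or None if unrecognized."""
    match = PATTERN.match(question.strip())
    if match is None:
        return None
    return {
        "select": match.group("role").lower(),
        "relation": "starred_in",
        "all_of": [match.group("a"), match.group("b")],
    }


print(to_canonical("What actor starred in both Pulp Fiction and Saturday Night Fever?"))
print(to_canonical("capital of France"))
```

Once a question has been reduced to a structure like this, it no longer matters whether it arrived through a search box, a form, or a query language; the database underneath sees the same request.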
Back to the Future: It's All About UI
Probably the most striking revelation about the semantic search space concerns the user interface. First, to go off on a tangent, Powerset got it right by realizing that semantics needs to be surfaced in the UI. After a user searches Powerset, a contextual gadget, aware of the semantics of the results, helps the user complete the search experience.
Yet the biggest mistake that I think Powerset is making is also in the UI. The search box that everyone is familiar with via traditional web search engines needs to go. Having a simplistic search interface hurts Powerset and Hakia, and to a lesser extent Freebase, which is not positioning itself as generic search.
Think about the recent launch of Powerset. The company released a vastly better way to interact with one of the most important sources of information on the web - Wikipedia. But what did the critics say? "Let's see if this is a Google killer." And the answer to that is "no."
But what if Powerset restricted what can be searched? What if instead of a search box there was another interface, or what if they told users not to look up things that they can find easily on Google? Why is it that new companies are expected to improve on the algorithm that has ruled the web for over a decade? Instead, the expectation should really be to solve the problems that cannot be solved by Google today.
Conclusion
Semantic search is an upcoming technology that has set expectations way too high. We have all been misled into thinking that these technologies are here to dethrone Google by delivering better search results. Neither of those things is true. What is true, however, is that semantic search is going to be big and it is going to help us answer questions that we simply cannot answer today - complex, inferencing queries asked over the entire web as if it were a database.
In order for these semantic search technologies to make a dent in the market, they need to clean up their messaging and, most importantly, their user interface. Presenting a search box is both misleading and detrimental, as people associate it with the simplistic questions that Google solves without any problems. To really showcase semantic search, these companies need to come up with innovative UIs that will help users understand the power that is being put at their fingertips.
As always, please tell us what you think. What should semantic search companies do to gain their place in the marketplace?

Comments
 
First!
Posted by: Dmitry | May 29, 2008 2:33 PM

Alex:
Excellent points. Thinking in terms of potential outlets for monetization (advertising and licensing), I believe there are two fundamental problems with how semantic search has been positioned:
a.) In the consumer search space, Google has no ROI to consumers (as they do to advertisers). To the best of my knowledge, Google has never published precision or recall figures and probably never will. It is *discernibly* good enough, convenient, lightning fast, *comprehensive* (hence its part of speech is now as much a verb as a noun), and free. Google doesn’t promise much. But what it promises, it does *very* well. And it does the job for free. Google's relevance was significantly better than its predecessors without requiring any change in behavior - this is the critical point. PageRank produced a very noticeable leap in quality *with the exact same user model as its predecessors*. It was classic "embrace and extend."
Promises are tempting but dangerous. Until there is another such leap (that the average Joe can notice *without* any change in searching behavior), Google will remain King.
b.) Taking ROI into account, most consumers (and business users) don’t care about the "how." They care about the "what" and the "why." Some IT managers care about the "how" either because they are paid to care (to provide needed expertise and due diligence for internal ops) or because they are enthusiasts that track the latest trends in technology. But at critical mass, most people don’t care.
Semantics represents the "how." The industry should, instead, focus on the "what" and the "why." Once there is a clear business case for the "what" and the "why," the market will determine the best "how" that meets the objectives of the "what" and the "why."
As an example, Google *sells* better targeting as the "what" and the "why." Search is the "how." Search is merely a means to an end - the end is Google's value proposition to advertisers. If semantic search represents a clear leap in terms of better targeted, more quantifiable advertising, advertisers will take notice. But again, the operative phrase is *targeting* (the "what") NOT semantics (the "how").
c.) Businesses care *primarily* about business processes, not enabling technologies. Business processes have ROIs, budgets, and buyers. What is the ROI of semantics? Pivoting the conversation this way is a non-starter for an enterprise buyer with highly competitive budget line items. Businesses do indeed buy infrastructure (e.g., storage, routers, etc.) but only because such infrastructure supports existing business processes which have (or *should* have) measurable or perceived ROIs. Semantics must either have a clear value-add to existing business processes, facilitate the creation (yes, creation) of new business processes that might not be possible or practical absent semantics, or clearly constitute infrastructure that underlies enhanced or novel business applications.
I am confident though that this positioning problem will be addressed - in the near future. Any new technology goes through this phase. Eventually, there will be big winners, with a diverse range of business models.
Cheers.
Posted by: Nosa | May 29, 2008 4:20 PM

Nosa,
Great insight and addition to the post - thank you!
Alex
Posted by: Alex Iskold | May 29, 2008 6:10 PM

The value in semantic search is simply about creating ontologies that allow for results refinement that can deliver a much higher degree of relevance than currently possible with query refinement. This can have huge value to users and advertisers.
I would also caution people to not see semantic search engines as destination sites the way people regard search engines. This is a whole new ball game.
Posted by: Jonathan Mendez | May 29, 2008 6:33 PM

Alex - one of the most thoughtful posts on semantic search I've read.
We'd be better off coming up with a different name for it than Semantic Search, since search tends to position it incorrectly to users who are accustomed to Google. More importantly, we all have to be careful not to overhype semantic search. I started in the text analytics space in 2000 with ClearForest and too many companies (including us) were overpromising and under-delivering.
For ClearForest and others, rather than search, we attempted to position it as "business intelligence for text", but that drove users to expect simple dashboards a la Business Objects, which is also not the right paradigm.
The analytics are what makes semantic analysis useful. So, tools like visualization that can surface relationships in the tagged content become ways to navigate large bodies of semantic knowledge. The more we think about it as an analysis tool, the more likely it will be that we come up with the appropriate problems for it to solve.
For those familiar with structured databases, I look at semantic analysis more like a SAS tool than a Business Objects dashboard. If you think about the data problems that SAS can solve, they're probably the equivalents of what semantic analysis can solve for text.
Posted by: Barry Graubart | May 29, 2008 7:34 PM

to really do the meaning thing, we will have to get away from the alphabet...
either ideograms, categorization by color, the five senses, in short, a different kind of coding system...
prior to that, i think this overall problem would be solved very differently if the starting place was, say, mandarin. a calligrapher can communicate many many layers of meaning with just four symbols, impossible with four roman letters.
and the learning curve for any new "system" will probably not be as long as feared, so simple a child can do it
Posted by: gregory | May 29, 2008 8:29 PM

The future of search will be a combination of semantics, which is symbolic (computer manipulation of symbols or objects), and numeric-based search (manipulation of numbers, i.e., high-speed number-crunching), e.g., Google, LiveSearch, etc. The two will complement each other. Currently, numeric-based search still outperforms semantic search in terms of recall and precision.
The current Google PageRank algorithm only computes a 2D (rows & columns) frequency matrix of links (outward from & inward to a page), but multi-dimensional (greater than 2D, such as 3D, 4D or more) matrix analysis (called Tensor calculus) is starting to appear from the community of data analysts, which I quoted in my message on the thread: sezwho acquires tejit semantic platform. I haven't seen any Tensor-PageRank yet, but it won't be too long before it appears in the literature. However, the HITS algorithm (similar to PageRank) has been tensorised (3D matrix) as described in the abstract of the following paper:
Abstract The TOPHITS Model for Higher-Order Web Link Analysis
The third dimension in addition to the outward & inward links is the anchor text of the links between them. This is only the beginning of tensorisation of current algorithms. Imagine a search engine that is based on say 20 tensor dimensions?
To avoid the shortfall of numeric-based search in today's environment, one can use a guided search (interactive search), i.e., start with a narrow bag of words, then the engine will refine the search and narrow it down to the target the user wants. Such guided search is described in the following paper:
Interactive Search Grouping - Search result grouping using Independent Component Analysis
See, the thing about semantic search's advantage is that it relies on the user to give it a full natural language sentence, as in Alex's example: What actor starred in both Pulp Fiction and Saturday Night Fever? Most users don't like typing a query like this, because it is too long to type; however, if it is voice-enabled search, then long phrases are no problem. This is where numeric-based guided search comes into the picture. Users can type in a short phrase and from there the engine guides the user through feedback and more queries.
Finally, here is a good video from Peter Norvig, director of research at Google, from his talk at the Future of Search meeting organised by Berkeley in 2007. Interested readers should watch the video, as he raised interesting things about the future of search for Google.
Future of Search - 2007 : Peter Norvig
Posted by: Falafulu Fisi | May 30, 2008 1:34 AM

Alex,
There is a big assumption here... that Google is not semantic search.
How do we know that? How do you know where to place Google on the "matrix" above?
I have no doubt Google are doing whatever they can - using whatever technology delivers the goods - to produce the best search experience for its users.
Unless PROVEN wrong, I think we can't just call Google Search the "semantic outsider" which is what I hear in your article.
Can we prove they're not hiding a lot of PLSI or whatever under the hood???
-Alister
Posted by: Alister Cameron // Blogologist | May 30, 2008 3:27 AM

Alex, great post that is really helpful in getting to grips with a complex subject. The main takeaway for me was the idea that copying the Google search box is a recipe for failure. There are plenty of great user interfaces that can be constructed around a structured query. Google will never do that; it would damage their brand. They will put in all the smarts (semantic or whatever) under the cover to keep that 85% growing to 90%. The only way I can see to make a dent in that is to fundamentally change the value proposition for publishers and advertisers. Not sure how to do that of course (if I was I would be keeping quiet about it).
Posted by: bernard lunn | May 30, 2008 5:17 AM

Bernard,
Absolutely, you are right: the main takeaway is that the search box invites people to enter old-style queries, and the results for those are not going to be impressive.
Alex
Posted by: Alex Iskold | May 30, 2008 5:54 AM

@Alister,
Google is not a semantic search engine today, at least not in the same way that others are trying to be. The main algorithm is based on frequency analysis and PageRank.
There is also a lightweight semantic analysis - for example, when you search for books or movies Google knows about these types of objects. But it does not appear to be using deeper semantics. Nor does it need to, because in a head-to-head comparison there is no advantage for the types of searches that people perform.
Alex
Posted by: Alex Iskold | May 30, 2008 5:57 AM

Alex - very nice overview of a confusing topic area.
I'm skeptical that semantic search will be anything more than a niche technology anytime soon, for one reason: Most searchers don't dig further than the top three search results.
If people are happy with what they see in Google's top three search results, they aren't going to use advanced search or semantic search.
The focus of these companies probably needs to be on finding profitable niches or on turning this into background technology.
Posted by: James Lewin | May 30, 2008 8:38 AM

Alex - great article, and close to home for Snooth. We're a vertical search engine, and so have some semantic-ish search functionality, yet we also use the plain ol' search box, and find that most queries don't take into account the full potential of our search algorithms.
--Philip
Posted by: Philip James | May 30, 2008 10:35 AM

No amount of RDF will let your computer answer "What is the best vocation for me now?", I agree.
But it can get you most of the way there - it wouldn't take much for your computer to be able to answer "What jobs are available in my area that match my skills A, B and C, my interests X, Y and Z and pay at least $$$?"
Posted by: Brendan Taylor | May 30, 2008 11:27 AM

Alex, great post! I particularly like the categorization of search companies/technologies along the two dimensions "structured data" and "query complexity". I guess that coming from a GOFAI background I would add a third one called "reasoning" - how much and what kind of processing is done to that structured information (an aside: at what point do we call it knowledge?). Along this dimension there would be various points: from "simple inheritance" (a dog is a mammal) to "heuristic reasoning" (what evidence do I need to look for to answer questions like "does coffee grow in Russia?").
Posted by: Dan Tecuci | May 30, 2008 12:03 PM

Searching for multiple concepts that can be described in a variety of ways on Google (such as prior art searching in the patent world) is challenging or impossible. These searches can be more easily conducted using combinations of boolean and proximity operators in large scale commercial databases. LSA or 'semantic' search sometimes provides results that neither system retrieves, but also provides many many 'false drops' or too much noise.
LSA systems have come a long way since I first used one over 10 years ago, but still have a ways to go. I think that one of the answers is greater user control. The 'black box' system ala Google and others is fine for simple queries, but isn't nearly as powerful as a system that allows significant user interaction via a re-iterative search process.
I know that one developer at Yahoo has commented that the future lies in training the user, not enhancing the algorithm. I think that this is the quickest way to improve results. And may be part of the solution that Alex mentions above re: the searchbox problem.
Additional solutions in creative types of data visualization in both the results and input screen can enhance the process by providing a less steep learning curve for the final user.
David.
Posted by: David Holloway | May 30, 2008 10:25 PM

LSA (for search, clustering or recommendation) is solved via an algorithm called SVD (Singular Value Decomposition), and again LSA works on a 2D (rows & columns) matrix.
SVD has been tensorised recently. I have seen 3D Tensor-SVD used for online sentiment monitoring. I haven't seen anyone using it for a search engine yet (based on what I've seen in the literature); however, one cannot discount commercial applications that already adopt it. Here are some analytic examples of its use:
Eigen-Trend: Trend Analysis in the Blogosphere Based on Singular Value Decompositions
A tensor higher-order singular value decomposition for integrative analysis of DNA microarray data from different studies
MPEG Video Watermarking Using Tensor Singular Value Decomposition
I believe that Eigen-Trend tool has been made available as a commercial tool. Eigen-Trend uses higher order (multidimensional) SVD or multi-linear algebra eigen decompositions.
Posted by: Falafulu Fisi | May 30, 2008 11:39 PM

What a well thought out article. Thank you.
My opinion on the matter is that it could be a white elephant of an issue. As others have pointed out, customers don't care for the details, they just want results and that is the issue for semantic searching - nobody cares.
What would make the difference is when the public learn what semantic searching could do for them. Maybe the name is wrong and it should be called AI search, but the big picture question is still, Why are we doing this?
Are we looking to topple Google or is it to find the next big thing?
Google cannot be toppled. It is like Microsoft. However, like Microsoft, Google can be made irrelevant and that is what semantic searching should do.
Think of it as a hidden tab on your browser which uses the document you are working on to find every piece of information which is relevant and presents it to you in an easy to cut / paste form.
This would in essence be a 200 word search box but it would answer the issues being highlighted in this article.
Thanks again for a very interesting article - I hope it helps encourage smarter people than me to give us the future of the internet!!
Posted by: Oli Rhys | May 31, 2008 6:41 AM

When I asked Google "What actor starred in both Pulp Fiction and Saturday Night Fever?", the top few answers were John Travolta, and for "What two US Senators received donations from a foreign entity?" it finds your article, wherein you said "Turns out that both Barak Obama and Hillary Clinton received donations from UBS AG."
Then again, if you go to McCain's page at http://www.opensecrets.org/pres08/contrib.php?cycle=2008&cid=N00006424 you'll see that UBS contributed $93,000 to his campaign, so the information you found .. wasn't correct. At least three US Senators received money from UBS AG.
I'm not convinced that your analysis showed anything other than your enthusiasm.
Posted by: Andrew Dalke | May 31, 2008 10:53 AM

Good job again, Alex.
I've commented in the past few weeks on my blog that the killer app powered by web semantics was most likely NOT going to be about search, and that Powerset should not try to look like a search engine. To a new technology, a new value prop. Glad the message is being amplified here.
I believe, as (for full disclosure) does my employer, that web semantics (note I'm not using the "semantic web" terminology, which is too marked with a specific set of technologies) will best support a personalized knowledge ecosystem, a customized web that assembles and delivers just the data you need, in real time. When that works, we'll be able to say "SEARCH = CONTENT", since searching will really be about telling the computer what you want, and having it delivered tailor-made to you based on your query and, when needed, an understanding of who you are.
Really looking forward to seeing the industry jump on this bandwagon!
Posted by: Greg Boutin | June 2, 2008 7:21 AM

I gave this more complex query a try and instead found the opposite. Google was able to answer, in search position number two, the query: What cartoon involves a boy who becomes a girl? However, Powerset had no answer for me.
Additionally, Google provided other series that fit the criteria, of which I was previously unaware.
Posted by: Joshua Drake | June 2, 2008 8:09 AM

I enjoyed reading your thoughts.
Here are my two cents:
The Semantic Web is a web of data, in some ways like a global database, characterized by and relying on correct tagging of elements. This should make the work of bots much easier and allow algorithms to rely on information versus having to spend computing power on figuring out if 'apple' is a fruit or a company name.
So, the way I see it, this does not really change a whole lot about the way the data is presented (search result).
But search engines use one algorithm for every user. This approach seems to be inherently flawed. Search results should reflect reality .. in the sense that we all have our own reality and hence our 'own' algorithm.
It seems like Google is gradually moving in that direction ...
Posted by: Kameir | June 3, 2008 1:03 PM

What you seem to be missing is any consideration for the user's willingness to learn to use these 'powerful search interfaces' - all experience shows it is close to none. In general users are not able to understand, willing to learn, or motivated to use complex search interfaces.
If you entice users to ask real questions (and not use keyword queries) you're facing a different problem: no algorithm in existence today is able to correctly answer even half of arbitrarily worded questions, hence users will be faced with a search engine giving mostly false answers.
It is for these reasons that Semantic Search approaches only recently started to mimic the text input field of Google, pushing any fancy user interaction into the query refinement phase and thereby lowering the entrance barrier for users.
Posted by: Valentin | June 3, 2008 11:37 PM

Alex,
I was gladdened by the conclusion of your post: user experience is king. When the dust settles around this developing technology, its utility and value will be determined 1) by the user's willingness to interact with it in a way that surfaces its advantages and 2) by the interface's ability to return results that represent its utility and value at a glance.
As to the question of who is most likely to embrace the technology, I believe most of your skeptical commenters are thinking too broadly about the audience. There are many deep searchers in large international companies (e.g. think pharma) trying to leverage internal assets and business intelligence across disciplines and geographies who would gladly enter more than a simple keyword query if they believed it would improve their results and save them time.