The semantic web is stupid
For whatever reason I read some blog post about the “semantic web”. Often, and to my even greater annoyance, people capitalize semantic web as if it were a proper noun. It is not even a properly thought-out idea, let alone a proper noun.
What is the semantic web anyway? Well, the notion is that the internet is full of “stuff”, most of it unstructured and hard to understand, like web pages which contain text and images. So the semantic web is some sort of web of semantically connected ideas. The name comes from Sir Tim Berners-Lee, the inventor of the World Wide Web (which, in contrast to the semantic web, actually does exist as a unique entity and thus deserves its capital letters). I don’t hold Berners-Lee in any particularly high regard, and at the time (1999) when he suggested there would be a “semantic web” separate or extended from the WWW, the internet was a much stupider place and much less interconnected. Now most good sites have APIs, cloud computing is an everyday reality, and many people spend a significant part of the day inside web apps, which may use HTML as an interface but are still based on structured data.
So a combination of two things has happened in the past decade: the semantic web has happened without any lame W3C standards to guide it, and the semantic web is a dumb idea that will never exist and is used by wayward companies to gain traction in a crowded market.
What really happened
First of all, let’s address why the semantic web has already happened. It’s called APIs. The real problem that the semantic web addresses is that there is little portability of data across the web. APIs of all types and interoperability solve this problem, but not through some grand plan of the W3C. Hilariously, the semantic web stack starts with “User Interface and Applications”, and then “Trust”. Those two items are basically most of the Internet. Worse than that, below these is “Proof”, a clear sign that this is in the dark realm of academics. If you want to create a Facebook app that shows you a map with travel videos from all your friends, you can use the Facebook, YouTube, and Google Maps APIs to achieve this. Why would you want to port everything down to RDF, or whatever the hell the semantic web specifies, and then build it back up? And with cloud computing tools that have great toolkits already built for them, you can manipulate the data in whatever language you’re using. You simply don’t need some lowest-common-denominator tool.
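To make the mashup point concrete, here is a minimal sketch in Python. It assumes the friend list and the video list have already been fetched and decoded into plain dictionaries. The field names (`name`, `uploader`, `title`, `lat`, `lng`) and the function `build_markers` are hypothetical stand-ins, not the actual Facebook, YouTube, or Google Maps response formats.

```python
# Hypothetical, simplified payloads standing in for real API responses.
friends = [{"name": "Ana"}, {"name": "Raj"}]
videos = [
    {"uploader": "Ana", "title": "Tram 28", "lat": 38.72, "lng": -9.14},
    {"uploader": "Raj", "title": "Marine Drive", "lat": 18.94, "lng": 72.82},
    {"uploader": "Zoe", "title": "Not a friend", "lat": 48.86, "lng": 2.35},
]


def build_markers(friends, videos):
    """Keep only videos uploaded by friends and turn each into a map marker."""
    friend_names = {friend["name"] for friend in friends}
    markers = []
    for video in videos:
        if video["uploader"] not in friend_names:
            continue  # skip videos from people who are not friends
        markers.append({
            "label": "{}: {}".format(video["uploader"], video["title"]),
            "position": (video["lat"], video["lng"]),
        })
    return markers


markers = build_markers(friends, videos)
```

The join is a plain dictionary lookup on data each service already returns in a structured form, with no shared ontology in between.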
The second item, that the semantic web is a stupid idea, is ultimately the real reason there isn’t and never will be a semantic web. So far all the specification work has been at the bottom of the stack, but most businesses haven’t been leveraging it, and only a scant few actually claim to be “semantic”. Nearly all of these perform some sort of natural language processing, or NLP. This basically means that they analyze text (usually only English) to extract meaning (semantics) and then send that somewhere. This semantic analysis has far more history than the semantic web, which perhaps explains why people want to associate semantic analysis with the semantic web. Good examples of companies applying the “semantic web” tag to their work are Powerset, Spock, and TripIt.
Powerset
Powerset is an NLP-driven search engine. I personally find their work within the Wikipedia corpus to be pretty impressive, but there isn’t any demo outside that corpus, precisely because it is hard to make NLP work for all domains. Semantic webbers might argue that if everyone were to publish in RDF (aka do all the work of classifying the data and tagging it) then it would be simple to make Powerset’s NLP work everywhere. That assumes that Powerset isn’t really flexing its NLP muscle the way it claims to be and is instead relying on Wikipedia’s consistent structure. The reality is that Powerset has trouble applying NLP to the web as a whole because NLP analysis is very compute-intensive and complex and does not scale at all.
Why Powerset isn’t semantic
It’s tough to know what Powerset’s algorithms are actually doing, but it’s clear that they are doing a lot of work on the Wikipedia corpus. However, I think they stop short of extracting meaning. They use the same techniques as any search engine to find your answer, and on top of that they use very basic and ineffective summarization. My single query to Powerset hints that they are trying hard but are still not very far along: “How many countries are there in the world?” yields a reasonable article as the first result: List of Countries. The answer (one of the many possible answers, anyway) is right in the article, but it’s not in the snippet under the search result, and Powerset can’t find it on the page using the right sidebar either. Google, on the other hand, puts an answer in the snippet along with an indication that the answer is ambiguous. That’s because Google relies on information being duplicated across the Internet and assumes that somewhere someone will have phrased the question in the exact same way you have, and other people will link to it, so you’ll get the right answer. Powerset doesn’t have faith in people and their behavior; it places much greater faith in its machines’ ability to analyze text and pull out answers. That faith is misplaced, at least for the time being. At some point in the future machines may be able to answer questions by understanding the semantics of the question and all the information on the internet, but not today.
Spock
Spock and TripIt are similarly limited in domain. Spock is a “people search” that apparently thinks I am 51 years old and live in Nicholasville. Neither is true, but it’s scraping through a very small number of sites trying to find structured bits of data to tie together and present to me. To say that it is useless is to be kind to it. It is absolutely filled with ads, and devoid of useful info. A Google search tells you far more about me than Spock does, and better yet doesn’t seem creepy. I’m not 51, but if I were, is that what you’d want to know about me?
Why Spock isn’t semantic
Spock looks through a few sites which tend to have people on them, looks in the typical spots where interesting points of data are, and then constructs a profile. The best reason I can give for why this isn’t semantic is that there is a ton of data on the net about me. You can quickly find out that I had trouble with an AIC7xxx driver for Linux in 1999 if you’re interested; just use Google. If you go one deeper and figure out all my aliases (not difficult) you can unlock reams of information. Spock doesn’t do that because it is stupid. It may aspire to actually construct a semantic profile, but right now a human being and Google can do far better with fewer ads.
TripIt
TripIt is the only one I vaguely like, although I don’t use it. Basically you forward all your travel emails to TripIt and it scrapes them and combines them. So if you are flying to Chicago, staying at the Hilton, and renting a car from Enterprise, it will tell you that in one place.
Why TripIt isn’t semantic
This is supposedly semantic because it extracts the text of the email and figures out where you are going and when. I think that stops pretty short of “semantic”. It knows a bunch of places and formats for dates, and it scans the email for dates and places. I sent TripIt the plans for the trip I’m currently on and it didn’t combine the hotel and flight, so I have two entries. Both entries even have dates, and both say San Francisco. Perhaps it expects that I will be in the hotel for the entire time, when I only have it booked for part of the trip. The real reason is that it has no clue about the “semantics” of a trip to San Francisco. If it can’t even combine a flight and a hotel stay, good luck with understanding anything more esoteric.
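For illustration, the kind of scanning described above can be sketched in a few lines of Python. This is not TripIt’s code, which I have never seen. The place list, the two date formats, and the names `scan_email` and `same_trip` are all assumptions made up for the example.

```python
import re
from datetime import datetime

# Hypothetical lookup table; a real service would need a far larger one.
KNOWN_PLACES = ["San Francisco", "Chicago", "New York"]

# Each entry pairs a regex with the strptime format used to parse its matches.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),
]


def scan_email(text):
    """Return the known places and parseable dates mentioned in the text."""
    lowered = text.lower()
    places = [place for place in KNOWN_PLACES if place.lower() in lowered]
    dates = []
    for pattern, fmt in DATE_PATTERNS:
        for match in pattern.findall(text):
            try:
                dates.append(datetime.strptime(match, fmt).date())
            except ValueError:
                pass  # looked like a date but was not a valid one
    return {"places": places, "dates": sorted(dates)}


def same_trip(a, b):
    """Naive merge rule: a shared place and overlapping date ranges."""
    if not set(a["places"]) & set(b["places"]):
        return False
    if not a["dates"] or not b["dates"]:
        return False
    return a["dates"][0] <= b["dates"][-1] and b["dates"][0] <= a["dates"][-1]


flight = scan_email("Flight to San Francisco departing 2009-03-02, returning 2009-03-06")
hotel = scan_email("Hotel in San Francisco, check-in 03/03/2009, check-out 03/05/2009")
merged = same_trip(flight, hotel)
```

Nothing here understands what a trip is. It matches strings against a list and compares date ranges, which is why I hesitate to call this kind of thing semantic.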
Will anything ever be semantic?
My gut feeling is that over time we’ll be able to leverage NLP and machine learning in more clever ways, but I don’t believe that it will be based on any type of semantic tagging. Instead it will be based on loads of data and loads of processing time using relatively unsophisticated algorithms. Google has two parallel mechanisms for connecting search queries to the (regular) web: results and ads. Results are generated from analyzing the link structure, and ads leverage the principles of economics and scarcity. If I want “Tumi T3 luggage” and I ask Google for it, by damn, I get it. Google doesn’t need to know what that is ontologically (as in classifying Tumi as a manufacturer of luggage, T3 as a line that Tumi makes, and using luggage to reinforce the previous two classifications), but it does know that there are images of Tumi T3 as well as a load of sellers who are willing to pay to be in front of me when I type “Tumi T3 luggage”. Simply put, there’s no additional value in knowing the semantics if you can provide me good links without them. I simply know of no situation where this sort of semantic information is hugely useful, and I challenge someone to suggest one.
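As a rough illustration of “unsophisticated”, here is a toy keyword index in Python. It is not how Google ranks anything (the link analysis and ad auctions mentioned above are not modeled at all). The documents, their IDs, and the function names are invented for the example. The point is only that a query can reach the right page without any ontology of manufacturers and product lines.

```python
from collections import defaultdict

# Invented mini-corpus; the IDs and text are hypothetical.
DOCS = {
    "shop-a": "Tumi T3 luggage on sale with free shipping",
    "shop-b": "Discount luggage from many brands",
    "blog-c": "Review of the T3 carry-on line",
}


def tokenize(text):
    """Lowercase and split on whitespace; no stemming, no tagging."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many query terms they contain."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken alphabetically by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


index = build_index(DOCS)
results = search(index, "Tumi T3 luggage")
```

The page that mentions all three terms comes out on top, and at no point did the code need to know that Tumi makes luggage.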
Finally, I think the entire idea of structuring the data of the web to be more machine readable is a fantasy of lazy academics. Google has done fine without such structure, and it’s not clear to me that it would do any better with it. Further, if you are, say, Delta, there is little incentive for you to use some lame duck format like RDF to make things easier for TripIt. You want to make your customers happy, not TripIt. Customers want email and web sites, and care very little about RDF. If you do have data you want to share around, you create an API and require people to use it to access your apps, because that puts the onus on them if they want to convert it out of some format convenient to you into a format convenient to them (including RDF).
It’s sad to say that the semantic web is empty except for academics and wishful thinkers, but that’s what happens when you take what one guy says too seriously. You end up chasing the rabbit down the hole without checking to see if anyone’s following or if it’s even worth the time.









