It may be hard to tell, but the world is about to undergo another monumental revolution. It's going to be disruptive, empowering and irrefutable. This is a revolution in how we deal with information, and by extension, what we are able to do with it.
Up to this point, computers have created data, stored data, parsed data and analyzed data, but they haven't understood data. Excel can run massive calculations for an in-depth analysis of state retail sales, but it only thinks of "Pennsylvania" as a label or data point. It doesn't know that Pennsylvania is a state (or that a state is an artificially delineated area of land with its own level of government). Our computers are incredibly smart and very dumb at the same time.
Slicing, dicing and julienning
Let's say I'm curious about how many restaurants within 20 miles of Hoboken list their menu on their website. Assuming I don't already have access to a database that will tell me that, I might try to search for "number of restaurants within 20 miles of Hoboken with online menus". My chances of getting a useful result for that query are almost exclusively related to whether someone else has already compiled that research and labeled it in a similar way. In other words, my chances aren't good.
My chances should be much better. All of the data exists, and is publicly available. This is NOT a data problem, it is a knowledge problem.
If I want to answer that question today, I have to find a reasonably authoritative source (like Yelp), figure out a set of search parameters, then manually work my way through their results. I probably have to keep a list of each one I count, so I can cross-check against another source (maybe Google) for better accuracy. If I start working on this now, maybe I'll have an answer by August.
How much does a computer have to understand in order to turn my months of miserable drudgery into a task that can be done by the time I come back from lunch? Not all that much. It has to understand what a restaurant is, what data sources list information for many restaurants, how to identify and plot the physical addresses of those listings, how to identify the websites that correspond to those listings, and how to identify a menu from the content presented on those websites. Doesn't seem very far-fetched when I write out the process like that, does it?
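To make that concrete, here is a minimal Python sketch of just the last, easy part of the process: counting listings inside a radius once the hard understanding work is done. Everything in it is an assumption for illustration. The `Listing` fields, the sample restaurants and their coordinates are made up, and the `has_online_menu` flag stands in for the genuinely difficult step of recognizing a menu on a website.

```python
import math
from dataclasses import dataclass
from typing import Optional

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Listing:
    """A hypothetical restaurant record, already extracted from some source."""
    name: str
    lat: float
    lon: float
    website: Optional[str]
    has_online_menu: bool  # assumed to come from a separate menu-detection step


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def count_with_online_menus(listings, center, radius_miles):
    """Count listings within radius_miles of center that have an online menu."""
    lat0, lon0 = center
    return sum(
        1
        for r in listings
        if r.has_online_menu
        and miles_between(lat0, lon0, r.lat, r.lon) <= radius_miles
    )


# Approximate coordinates; the restaurants themselves are invented.
HOBOKEN = (40.7440, -74.0324)
SAMPLE = [
    Listing("Pizzeria A", 40.7450, -74.0300, "https://example.com/a", True),
    Listing("Diner B", 40.7580, -73.9855, None, False),
    Listing("Bistro C", 39.9526, -75.1652, "https://example.com/c", True),  # Philadelphia
]

nearby_with_menus = count_with_online_menus(SAMPLE, HOBOKEN, 20)
print(nearby_with_menus)
```

The counting is trivial. The point is that everything feeding `SAMPLE` (finding the listings, geocoding them, matching websites, spotting menus) is what the software still has to learn to do on its own.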
Diffbot just raised $2 million to further develop its "web content extraction technology". It's building systems and APIs to allow software to identify and classify content from websites. Given how much funding has flowed into other areas of technology with much less potential than this one, I'd expect to see a lot of other VC action in the field over the next few years.
"Siri, find me a computer that understands what I'm asking"
If the type of task I described earlier sounds vaguely familiar, it's because we are starting to see the first examples of this type of search. I'm talking about Siri, and her ability to answer questions like "Where can I get pizza delivered from?" Yes, she is still very primitive, and yes, she tries too hard to answer questions that she probably shouldn't (like "What's the best smartphone"), but between all of that, she actually executes on the three key elements of semantic search:
- Understanding the meaning of the query
- Understanding the structure and meaning of the content
- Delivering intelligently matched content in a usable way
She's not perfect, but she'll get a lot better. This is always the progression. The first mobile phone was so heavy it was barely mobile. The first sea creature to crawl out onto dry land would seem pretty sad by our land-dwelling standards. But once that first step is taken, things evolve quickly.
Smart software to make us look smarter
Remember how Abraham Lincoln said "I cannot tell a lie" after he chopped down the cherry tree? Oh, he didn't say that? Whoops! My face is as red as a cherry after that gaffe.
Honest Abe is a pretty well documented figure, and that is a pretty well known quote, but Microsoft Word is perfectly happy to sit by silently while I make an embarrassing mistake by attributing it to the wrong person. This will not be the case in the future. Just as "check grammar" appeared next to "check spelling", the next to appear will be "check facts".
Again, the data is out there. There are databases of quotes, and databases of historical and public figures. And even if there isn't a single authoritative source, there is enough to say "76% of the references attribute the quote to X, and 24% attribute it to Y".
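The "76% versus 24%" idea is simple enough to sketch. This is only an illustration under loud assumptions: the list of attributions is invented, and gathering real references and matching them to the same quote is the hard part that this snippet skips entirely. The function name `attribution_shares` is my own.

```python
from collections import Counter


def attribution_shares(attributions):
    """Return each attributed name's share of the references, as whole percentages."""
    counts = Counter(attributions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total) for name, n in counts.most_common()}


# Hypothetical tally: 25 references found for "I cannot tell a lie".
references = ["George Washington"] * 19 + ["Abraham Lincoln"] * 6

shares = attribution_shares(references)
print(shares)  # {'George Washington': 76, 'Abraham Lincoln': 24}
```

A "check facts" feature could flag my sentence whenever the name I typed is not the one with the largest share.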
Google is searching for the answer
If there's a next evolution in search, you can bet that Google is trying to get there first. When it comes to semantic search, they have actually been building bits of it into their core product for years. They started with very discrete types of queries. If you typed in what looked like an address, they would try to find a map result that matched it. If you typed in "weather in philadelphia", they would slide in an actual 5-day forecast before the organic search results.
The recent launch of Google Knowledge Graph takes all of that a step further. The mantra "things, not strings" really does seem to sum up the vision. Suddenly, a vast array of searches turns up more than just web links sorted by PageRank and keywords; it turns up actual information related to the query. If I type in "Wharton School acceptance rate", Google reports back to me that The Wharton School of the University of Pennsylvania had an acceptance rate of 9% in 2009. It also gives me a photo, the summary from the school's entry in Wikipedia, and other related searches (related by its understanding of what the Wharton School is, NOT related by keyword).
Once again, this is merely a baby step. Presumably I want to know the most recent acceptance rate data that is available. Google's Knowledge Graph gives me the 2009 data, but the top organic search result is the Wikipedia page which clearly contains a 2011 undergraduate acceptance rate, and the #3 organic result is a US News profile which has the 2011 MBA acceptance rate. So there's that same old problem again. Google HAS the data, but they haven't been able to analyze and understand it enough to turn that information into knowledge.
Google claims that its supplemental database contains "more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects." I expect this database will expand in line with Moore's law, or even faster. What I also expect is that Google will achieve a breakthrough that will allow it to process, parse, dissect, extract and connect the knowledge that it already retrieves every day. Once it does that, the world of "search" will change forever.
Great act, but what do you do for an encore?
Restaurant menus, auto-fact-checking, acceptance rates. These all sound useful, but they fall far short of the disruptive revolution I promised in the opening paragraph, right? Ahh, but the match, when struck, never looks as impressive as the bonfire it leads to!
When we look at the human potential for innovation, it really comes down to three stages:
- The idea
- The research
- The execution
Ideas are easy, cheap and plentiful. They come to us out of the blue, like heavy drops of rain. It's the other two steps that are arduous and expensive. The research you have to undertake to prove to yourself (and/or sources of funding) that the idea is a) feasible and b) destined for success, is often hobbled by our lack of research skills and the incredibly time-consuming process of manual searching. The execution phase is usually no walk in the park either.
Any time you can lower the cost and barriers for either of these phases, you will unleash a wave of new innovation and creation. Desktop publishing software gave everybody the ability to set their own type and quickly create their own signage. Amazon's Kindle and eBook publishing gave everybody the ability to publish and publicly sell something they write, giving rise to whole new armies of authors. YouTube and digital video opened the door to millions of moviemakers. Each "democratization" has significant, far-reaching effects.
Knowledge mining – the use of software and technology to manipulate, combine and analyze the world's knowledge, based purely on natural language requests – will tear down the barriers of research. It will allow everyone to quickly evaluate a new idea and determine whether it is worth pursuing. Everything from "Where should I open my new pizza restaurant?" to "What is the market potential for an all-natural arthritis remedy?" will be met with meaningful data and cross-referenced information, without having to spend weeks or months manually searching and compiling it. Fewer good ideas will die from neglect. More breakthroughs will sprout up from inspired amateurs outside of the affected industries.
In other words, rather than drowning in the sea of data we have created, knowledge mining will let us sail that sea to fantastic new lands. This is the future I see ahead of us. It's the future we've been searching for.