Things Not Strings: Google’s New Hotel Profiles Exemplify Its Approach to Entities
Back in 2012, Google’s Amit Singhal published a now-famous blog post announcing the Knowledge Graph. The post, titled “Introducing the Knowledge Graph: Things, Not Strings,” described a fundamental change in the methodology behind Google’s long-standing mission “to organize the world’s information and make it universally accessible and useful.”
Whereas Google’s original PageRank algorithm, still the basis for populating SERPs today, uses links and other contextual signals to determine the relevance of a webpage, the Knowledge Graph represents a fundamentally different approach, one that potentially realizes Tim Berners-Lee’s early notion of a semantic web in which all content is linked according to its meaning.
The Knowledge Graph, as Singhal describes it, shifts Google’s attention from “strings” to “things.” Search is no longer a matter of finding text that matches the text in your query (string matching), but of understanding the concepts in a query, inferring its probable intent, and mining Google’s datastore for a response that represents what Google knows about those concepts.
The Knowledge Graph’s early incarnation
Ask “Who is Charles Dickens?” for example, and the Knowledge Graph “knows” that you are asking about an author from a certain historical period and country of origin who is celebrated for having written certain books. The search results Google displays for such a query, with components we’ve come to recognize such as Rich Snippets, Related Questions, and the Knowledge Panel, represent a semantically structured body of knowledge about the entity Charles Dickens.
We’ve all seen evidence of the Knowledge Graph at work in examples like this one, though for the most part the evidence has demonstrated not so much Google’s ability to mine the web for meaning as its ability to regurgitate Wikipedia entries.
In a Washington Post article in 2016, for example, Caitlin Dewey referred skeptically to Google’s “sketchy quest to control the world’s knowledge,” pointing out that when it came to sensitive topics like the status of Taiwan or Jerusalem, or even the mysteriously contested height of Hillary Clinton, Google relied too heavily on a mechanically redacted version of the facts that tended to flatten crucial nuances and even distort the truth.
A patent for graphing the world
In the time since, Google has worked to improve its handling of such sensitive information, but it has also been quietly expanding its ambitions. That fact is nowhere clearer than in Bill Slawski’s recent analysis of a patent issued to Google in February of this year, which explains Google’s methodology for extracting and classifying information about entities.
An entity, in Google’s terms, is any place, idea, thing, object, or otherwise classifiable node in a stream of data. The patent describes Google’s attempt to record in a massive database not only the more obvious entity classifications—famous authors like Charles Dickens, celebrities, historical events, nations, and so on—but also the superclasses to which those classes belong, such as humans, men, women, or citizens of certain nations, and the subclasses they contain, such as lifespan, marital status, or important works. The system outlined in the patent would also capture the relationships between each node of meaning.

To expand on the example we started with: Charles Dickens can be described as an entity belonging to the class “famous authors,” which is part of the class “humans.” Dickens, the entity, contains subclasses such as “important works,” which include entities like Great Expectations. Relationships between entities may be expressed in forms like, “Dickens was the author of a novel called Great Expectations.”
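To make that hierarchy concrete, here is a minimal sketch in Python of how such an entity record might be modeled. The structure and field names are illustrative assumptions, not Google’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A node in a toy knowledge graph: a 'thing,' not a string."""
    name: str
    classes: list[str] = field(default_factory=list)          # e.g. "famous authors"
    attributes: dict[str, str] = field(default_factory=dict)  # e.g. lifespan
    relations: list[tuple[str, "Entity"]] = field(default_factory=list)

great_expectations = Entity(name="Great Expectations", classes=["novels"])

dickens = Entity(
    name="Charles Dickens",
    classes=["famous authors", "humans"],
    attributes={"lifespan": "1812-1870", "country of origin": "England"},
)

# A relationship expressed as a (predicate, object) pair:
# "Dickens was the author of a novel called Great Expectations."
dickens.relations.append(("author of", great_expectations))

for predicate, obj in dickens.relations:
    print(f"{dickens.name} -- {predicate} --> {obj.name}")
```

Even this toy version illustrates the shift from strings to things: a question about Dickens is answered by traversing typed relationships rather than by matching query text.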
The patent doesn’t say, of course, that Google merely intends to map the universe of famous authors. It implies that Google intends to map—well, everything. In Slawski’s analysis, the true scope of the Knowledge Graph comes into view, and that scope is massive.
The trick is for Google to divorce itself from heavy reliance on secondary sources like Wikipedia and instead to classify and cross-reference information natively, as a self-sustaining activity performed on web pages themselves. That’s what makes the patent filing a little different from the evidence of the Knowledge Graph we’ve already seen in the wild.
One might think we’ve seen this before with Schema markup. After all, the Schema.org standard, created by Google in collaboration with other search engines, provides a standard tagging language that helps to reveal the content of text, at least as it applies to certain classes of information, such as local businesses or consumer reviews. The entities Schema markup is designed to identify, and the relationships between those entities which the standard helps to describe, are structurally very similar to the entities described in the new patent.
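To see what that prescriptive markup looks like in practice, here is a small Python script that emits Schema.org JSON-LD for a hotel. Hotel, PostalAddress, and AggregateRating are real Schema.org types; the hotel itself and its property values are invented for illustration:

```python
import json

# Schema.org structured data for a fictional hotel, expressed as JSON-LD.
hotel_markup = {
    "@context": "https://schema.org",
    "@type": "Hotel",
    "name": "The Copperfield Inn",  # invented example hotel
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "48 Doughty Street",
        "addressLocality": "London",
        "addressCountry": "GB",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.3",
        "reviewCount": "212",
    },
}

# Embedded in a page inside <script type="application/ld+json">...</script>,
# this tells a crawler unambiguously which entity the page describes.
print(json.dumps(hotel_markup, indent=2))
```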
But Schema markup is more like a crutch than a comprehensive solution. The requirement it imposes—that human beings apply semantic tags to text on web pages—will never scale. That lack of scalability is largely why Tim Berners-Lee’s original semantic web vision never took hold, while the simple display language of HTML flourished. Schema will never serve the ultimate purpose the Knowledge Graph takes as its aim, because the vast majority of content classifications cannot be captured with prescriptive markup. Google needs to train its technology to impose semantic structure on raw, unstructured text, just as our brains do.
It’s only when the Knowledge Graph is able to apply what it knows to any new content it encounters, learning as it goes about entities it hasn’t seen before, that its larger intent will be realized.
In the patent, there’s a lot of language about this learning process. From the abstract:
“Computer-implemented systems and methods are provided for extracting and storing information regarding entities from documents, such as webpages. In one implementation, a system is provided that detects an entity candidate in a document and determines that the detected candidate is a new entity. The system also detects a known entity proximate to the new entity based on the one or more entity models. The system also detects a context proximate to the new and known entities having a lexical relationship to the known entity. The system also determines a second entity class associated with the known entity and a context class associated with the context. The system also generates a first entity class based on the second entity class and the context class. The system also generates an entry in the one or more entity models reflecting an association between the new entity and the first entity class.”
In other words, the Knowledge Graph learns about new entities by inference, setting aside what it already knows in order to isolate what it doesn’t. If successful, this process could eventually run itself, becoming ever more effective at learning new things the more new things it learns.
Take for example a web page that discusses Bill and Hillary Clinton. Imagine that the Knowledge Graph has already collected and organized a significant amount of information about Bill but has never heard of Hillary. The patent describes an entity extraction process whereby Google draws a circle around what it already knows, leaving the rest as new information to be associated with a new entity called Hillary Clinton.
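As a loose paraphrase of that extraction flow (not the patent’s actual algorithm), the logic might look something like the following sketch, in which every name and data structure is a simplifying assumption:

```python
# A toy model of inferring a class for a new entity from a nearby known
# entity plus the lexical context that connects them.

KNOWN_ENTITIES = {
    "Bill Clinton": {"classes": ["humans", "politicians"]},
}

# Contexts whose lexical relationship to a known entity hints at the
# class of a nearby unknown entity (toy examples, not Google's data).
CONTEXT_CLASSES = {
    "and his wife": "spouses",
    "senator from": "politicians",
}

def classify_candidate(candidate: str, known_neighbor: str, context: str) -> dict:
    """Generate a class for a new entity and record it in the entity model."""
    known = KNOWN_ENTITIES[known_neighbor]
    context_class = CONTEXT_CLASSES.get(context, "unknown")
    inferred = {"classes": known["classes"] + [context_class]}
    KNOWN_ENTITIES[candidate] = inferred  # the model now "knows" the new entity
    return inferred

# "Bill Clinton and his wife Hillary Clinton ..." -- Hillary is the unknown.
print(classify_candidate("Hillary Clinton", "Bill Clinton", "and his wife"))
# {'classes': ['humans', 'politicians', 'spouses']}
```

Each new entity the model records becomes a potential known neighbor for the next inference, which is what gives the process its self-sustaining character.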
Of course, no document contains only two entities. Indeed, an entity like “Bill Clinton” is a member of a superclass called “humans” and another called “politicians” and another called “Caucasian men,” each superclass containing subclasses like age, height, time in office, ancestry, and so on. In short, this type of analysis is incredibly complex, but the issued patent suggests Google is attempting exactly that.
So far, this more fully realized version of the Knowledge Graph rarely shows up in actual search results—though this may be partly because, when successful, its results are hard to distinguish from data mined in a cruder fashion.
Google now treats hotels as entities
While this more ambitious way of surfacing entity information is not yet standard, in researching Google’s new interface for hotels I believe I have found evidence of a real-world example. The Google Hotels interface contains structured information about each listed hotel, culled from a variety of internal and third-party sources. You can visit the new interface, which Google quietly announced in March, at google.com/hotels.
In the blog post announcing the new interface, Google’s Richard Holden mentions machine learning in a couple of key sentences. The first relates to the search experience, with Holden introducing Google’s new Deals filter by explaining, “This filter uses machine learning to highlight hotels where one or more of our partners offer rates that are significantly lower than the usual price for that hotel or similar hotels nearby.”
The second mention of machine learning hints that Holden is talking about the Knowledge Graph: “You can also view a hotel’s highlights—like a fancy pool, if it’s a luxury hotel, or if it’s popular with families—with expanded pages for photos and reviews curated with machine learning.”
It would be easy to conflate these two references to machine learning and assume that Holden is talking about the same thing in both cases, but he’s not. In the first case, historical trend analysis of hotel prices helps Google highlight deals that are lower than the usual price. The Deals filter incorporates machine learning, we can assume, in the sense that it continually takes in new information about hotel prices and adjusts its recommendations accordingly. As such, this is a classic application, not unlike the machine learning one sees in programmatic advertising or Netflix movie recommendations.
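A purely hypothetical reduction of that idea to code might flag a partner rate as a deal when it falls well below the hotel’s typical historical price:

```python
from statistics import median

def is_deal(offered_rate: float, historical_rates: list[float],
            discount_threshold: float = 0.8) -> bool:
    """Flag a rate as a deal if it is significantly below the usual price.

    A hypothetical stand-in for Google's Deals filter: the real system
    reportedly also compares against similar hotels nearby and would
    learn its threshold rather than hard-coding one.
    """
    usual_price = median(historical_rates)
    return offered_rate < usual_price * discount_threshold

print(is_deal(119.0, [180.0, 165.0, 172.0, 190.0]))  # True: well under the median
print(is_deal(159.0, [180.0, 165.0, 172.0, 190.0]))  # False: close to the usual price
```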
On the other hand, curation of content like photos and reviews sounds a lot more like the process of linking data to entities. In this case, machine learning likely comes into play in constructing a dynamic user interface that presents the most engaging, useful, or popular reviews and photos from various sources.
The major entity at play, of course, is the hotel property itself, to which other entities are linked (a simplified sketch of such a record follows this list), such as:
- Contact information sourced from the hotel, Google’s local data, Google users, and third-party sources
- Hotel class ratings assigned by Google based on data from “third-party partners, direct research, feedback from hoteliers, and machine learning inference that examines and evaluates hotel attributes, such as price, location, room size, and amenities”
- Booking rates from Google’s Hotel Ads marketplace
- Reviews from Google users, third-party reviews from booking sites like Expedia and travel sites like TripAdvisor, and first-party reviews (if available) sourced by the hotel itself
- Review summaries created by Google partner TrustYou
- Amenity data sourced from the hotel and from Google users
- Photos, videos, and 360-degree images sourced from the hotel, Google users, and third-party sites like Oyster
- Descriptive text sourced from Google’s editorial team, partially powered by web mining
- Neighborhood data including descriptive text and Google’s own location ratings
- Neighborhood maps with points of interest identified
- Nearby attractions pulled in from Google Maps
- Links to transportation and directions also from Google Maps
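Pulled together, the hotel profile behaves like a single entity record populated from many feeds. The sketch below is an invented illustration of what such an aggregation might look like, with every field name and source label assumed rather than taken from Google’s actual data model:

```python
# An invented sketch of a hotel entity record aggregated from multiple
# sources; the hotel and all field names are hypothetical.
hotel_entity = {
    "entity_type": "hotel",
    "name": "The Copperfield Inn",
    "contact": {"sources": ["hotel", "google_local", "google_users", "third_party"]},
    "hotel_class": {"value": 4,
                    "sources": ["third_party_partners", "direct_research",
                                "hotelier_feedback", "ml_inference"]},
    "rates": {"sources": ["hotel_ads_marketplace"]},
    "reviews": {"sources": ["google_users", "expedia", "tripadvisor", "hotel"]},
    "review_summary": {"sources": ["trustyou"]},
    "amenities": {"sources": ["hotel", "google_users"]},
    "media": {"sources": ["hotel", "google_users", "oyster"]},
    "description": {"sources": ["google_editorial", "web_mining"]},
    "neighborhood": {"sources": ["google_location_ratings", "google_maps"]},
}

# Every attribute carries its provenance, so the interface can weigh,
# merge, or display each source independently.
for attribute, record in hotel_entity.items():
    if isinstance(record, dict) and "sources" in record:
        print(f"{attribute}: {', '.join(record['sources'])}")
```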
It’s an impressive compendium of data. For a while now, it has been assumed that Knowledge Panels represented the best showcase of Google entity data, and this remains true for many other businesses, celebrities, and historical figures. But Knowledge Panels generally work from a smaller set of data sources, such as Google Maps itself for businesses or Wikipedia for historical figures. Google Hotels represents a significantly expanded dataset with a more complex mix of sources, one that represents a step away from dependence on a pre-existing body of knowledge, and a step toward self-sufficient population of the Knowledge Graph.
To be sure, there are reasons why hotels would be an obvious choice for this evolutionary step. They are, after all, discrete entities in the world about which it is relatively straightforward to accumulate reliable, verifiable information. It’s a big leap from hotels to more controversial entities like Taiwan, Jerusalem, and Hillary Clinton. Wikipedia has been able to maintain a neutral position when dealing with controversy, partly because well-managed crowdsourcing does an excellent job of evening out bias, and partly because of its old-school insistence on authoritative sources.
It remains to be seen how Google will ensure that the Knowledge Graph doesn’t place undue reliance on sources that shouldn’t be trusted but happen to contain readily parseable information. That’s just one of the many challenges inherent in the project. But the progress represented by Google Hotels suggests that Google is making a serious attempt at turning its patented technology into a reality, one that may augur a fundamental transformation in search.