REBLOG: “Leaves of Graph” by Pete Coco

About Aaron McCollough

English Literature Librarian, University of Michigan

Originally posted at ACRLog (http://acrlog.org/2012/08/23/leaves-of-graph/) by Pete Coco. Pete is the Humanities Librarian at Wheaton College in Norton, MA and Managing Editor at Each Moment a Mountain: Archivally Inspired Art and Inquiry.

Note: This post makes heavy use of web content from Google Search and Knowledge Graph. Because this content can vary by user and is subject to change at any time, this essay uses screenshots instead of linking to live web pages in certain cases. As of the completion of this post, these images continue to match their live counterparts for a user from Providence, RI not logged in to Google services.

This That, Not That That

Early this July, Google unveiled its Knowledge Graph, a semantic reference tool nestled into the top right corner of its search results pages. Google’s video announcing the product runs no risk of understating Knowledge Graph’s potential, but there is a very real innovation behind this tool and it is twofold. For one, Knowledge Graph can distinguish between homonyms and connect related topics. For a clear illustration of this function, consider the distinction one might make between bear and bears. Though the search results page for either query includes content related to both grizzlies and quarterbacks, Knowledge Graph knows the difference.

Second, Knowledge Graph purports to contain over 500 million articles. This puts it solidly ahead of Wikipedia, which reports having about 400 million, and light-years ahead of professionally produced reference tools like Encyclopaedia Britannica Online, which comprises an apparently piddling 120,000 articles. Combine that almost incomprehensible scope with integration into Google Search, and without much fanfare suddenly the world has its broadest and most prominently placed reference tool.

For years, Google’s search algorithm has been making countless, under-examined choices on behalf of its users about the types of results they should be served. But at its essence, Knowledge Graph presents a big symbolic shift away from (mostly) matching a query to web content (content that, per extrinsic indicators, the search algorithm serves up and ranks for relevance) toward the act of openly interpreting the meaning of a search query and making decisions based on that interpretation. Google’s past deviations from the relevance model, when made public, have generally been motivated by legal requirements (such as those surrounding hate speech in Europe or dissent in China) and, more recently, the dictates of profit. Each of these moves has met with controversy.

And yet in the two months since its launch, Knowledge Graph has not been a subject of much commentary at all. This is despite the fact that the shift it represents has big implications that users must account for in their thinking, and can be understood as part of larger shifts the information giant has been making to leverage the reputation earned with Search toward other products.

Librarians and others teaching about internet media have a duty to articulate and problematize these developments. Being in many ways a traditional reference tool, Knowledge Graph presents a unique pedagogic opportunity. Just as it is critical to understand the decisions Google makes on our behalf when we use it to search the web, we must be critically aware of the claim to a newly authoritative, editorial role Google is quietly staking with Knowledge Graph — whether it means to be claiming that role or not.

Perhaps especially if it does not mean to. With interpretation comes great responsibility.

Some Questions

The value of the Knowledge Graph is in its ability to authoritatively parse semantics in a way that provides the user with “knowledge.” Users will use it assuming its ability to do this reliably, or they will not use it at all.

Does Knowledge Graph authoritatively parse semantics?

What is Knowledge Graph’s editorial standard for reliability? What constitutes “knowledge” by this tool’s standard? “Authority”?

What are the consequences for users if the answer to these questions is unclear, unsatisfactory, or both?

What is Google’s responsibility in such a scenario?

He Sings the Body Electric

Consider an example: Walt Whitman. As of this writing, the poet’s entry in Knowledge Graph looks like this:

You might notice the most unlikely claim that Whitman recorded an album called This is the Day. Follow the link and you are brought to a straight, vanilla Google search for this supposed album’s title. The first link in that result list will bring you to a music video on YouTube:

Parsing this mistake might bring one to a second search: “This is the Day Walt Whitman.” The results list generated by that search yields another YouTube video at the top, resolving the confusion: a second, comparably flamboyant Walt Whitman, a choir director from Chicago, has recorded a song by that title.

Note the perfect storm of semantic confusion. The string “Walt Whitman” can refer to either a canonical poet or a contemporary gospel choir director while, at the same time, “This is the Day” can refer either to a song by The The or to one by that second, lesser-known Walt Whitman.

Further, “This is the Day” is in both cases a song, not an album.

Knowledge Graph, designed to clarify exactly this sort of semantic confusion, here manages to create and potentially entrench three such confusions at once about a prominent public figure.

Could there be a better band than one called The The to play a role in this story?

Well Yeah

This particular mistake was first noted in mid-July. More than a month later, it still stands.

At this new scale for reference information, we have no way of knowing how many mistakes like this one are contained within Knowledge Graph. Of course it’s fair to assume this is an unusual case, and to Google’s credit, they address this sort of error in the only feasible way they could, with a feedback mechanism that allows users to suggest corrections. (No doubt bringing this mistake to the attention of ACRLog’s readers means Walt Whitman’s days as a time-traveling new wave act are numbered.)

Is Knowledge Graph’s mechanism for correcting mistakes adequate? Appropriate?

How many mistakes like this do there need to be to make a critical understanding of Knowledge Graph’s gaps and limitations crucial to even casual use?

Interpreting the Gaps

Many Google searches sampled for this piece do not yield a Knowledge Graph result. Consider an instructive example: “Obama birth certificate.” Surely, there would be no intellectually serious challenge to a Knowledge Graph stub reflecting the evidence-based consensus on this matter. Then again, there might be a very loud one.

Similarly not available in Knowledge Graph are stubs on “evolution” or “homosexuality.” In each case, it should be noted that Google’s top ranked search results are reliably “reality-based.” Each is happy to defer to Wikipedia.

In other instances, the stub for topics that seem to reach some threshold of complexity and/or controversy defers to “related” stubs in favor of making nuanced editorial decisions. Consider the entries for “climate change” and the “Vietnam war,” here presented in their entirety.

In moments such as these, is it unreasonable to assume that Knowledge Graph is shying away from controversy and nuance? More charitably, we might say that this tool is simply unequipped to deal with controversy and nuance. But given the controversial, nuanced nature of “knowledge,” is this second framing really so charitable?

What responsibility does a reference tool have to engage, explicate or resolve political controversy?

What can a user infer when such a tool refuses to engage with controversy?

What of the users who will not think to make such an inference?

To what extent is ethical editorial judgment reconcilable with the interests of a singularly massive, publicly traded corporation with wide-ranging interests cutting across daily life?

One might answer some version of the above questions with the suggestion that Knowledge Graph avoids controversy because it is programmed only to feature information that meets some high standard of machine-readable verification and/or cross-referencing. The limitation is perhaps logistical, baked into the cake of Knowledge Graph’s methodology, and it doesn’t necessarily limit the tool’s usefulness for certain purposes so long as the user is aware of the boundaries of that usefulness. Perhaps in that way this could be framed as a very familiar sort of challenge, not so different from the one we face with other media, whether it’s cable news or pop-science journalism.

This is all true, so far as it goes. Still, consider an example like the stub for HIV:

There are countless reasons to be uncomfortable with a definition of HIV implicitly bounded by Ryan White on one end and Magic Johnson on the other. So many important aspects of the virus are omitted here — the science of it, for one, but even if Knowledge Graph is primarily focused on biography, there are still important female, queer or non-American experiences of HIV that merit inclusion in any presentation of this topic. This is the sort of stub in Knowledge Graph that probably deserves to be controversial.

What portion of useful knowledge cannot — and never will — bend to a machine-readable standard or methodology?

Ironically, it is Wikipedia that, for all the controversy it has generated over the years, provides a rigorous, deeply satisfactory answer to the same problem: a transparent governance structure guided in specific instances by ethical principle and human judgment. This has more or less been the traditional mechanism for reference tools, and it works pretty well (at least up to a certain scale). Even more fundamental, length constraints on Wikipedia are forgiving, and articles regularly plumb nuance and controversy. Similarly, a semantic engine like Wolfram Alpha successfully negotiates this problem by focusing on the sort of quantitative information that isn’t likely to generate so much political controversy. The demographics of its user-base probably help too.

Of course, Google’s problem here is that it searches everything for every purpose. People use it every day to arbitrate contested facts. Many users assume that Google is programmatically neutral on questions of content itself, intervening only to organize results for their relevance to our questions; Google, then, has no responsibility for the content itself. This assumption is itself complicated and, in many ways, was problematic even before the debut of Knowledge Graph. All the same, it is a “brand” that Knowledge Graph will no doubt leverage in a new direction. Many users will intuitively trust this tool and the boundaries of “knowledge” enforced by its limitations and the prerogatives of Google and its corporate actors.

So:

Consider the college freshman faced with all these ambiguities. Let’s assume that she knows not to trust everything she reads on the internet. She has perhaps even learned this lesson too well, forfeiting contextual, critical judgment of individual sources in favor of a general avoidance of internet sources. Understandably, she might be stubbornly loyal to the internet sources that she does trust.

Trading on the reputation and cultural primacy of Google search, Knowledge Graph could quickly become a trusted source for this student and others like her. We must use our classrooms to provide this student with the critical engagement of her professors, librarians and peers on tools like this one and the ways in which we can use them to critically examine the gaps so common in conventional wisdom. Of course Knowledge Graph has a tremendous amount of potential value, much of which can only proceed from a critical understanding of its limitations.

How would this student answer any of the above questions?

Without pedagogical intervention, would she even think to ask them?

One Response to “REBLOG: “Leaves of Graph” by Pete Coco”

  1. Aline Soules says:

    Thanks for posting this most thoughtful piece.

    As always, the key lies in “authority work.” This pinnacle of traditional cataloging has periodically been decried (both pre- and post-Web) as a time-consuming element that doesn’t “show” to the public. As a result, on the high-low quadrant, it falls into the square of high work, low impact. Administrators have tried to get catalogers to drop this practice over the years (I was subject to such pressure many times when I worked in tech services).

    Yet, here we are again–talking about reliability and authority.

    My guess is that Google probably spends more time on such activities than we do, even though their focus is on creating algorithms to do this work rather than having individual humans spend time on the actual activity. That’s just a guess, though, because Google says little about its proprietary, corporate intelligence practices (understandably).

    I have not paid attention to this tool myself, but I will definitely check it out and follow the trail provided by this wonderful article to try to understand better just what is going on.

    The future is exciting, but I suspect the future is more Google’s than libraries’ when it comes to this sort of thing. Our job, more and more, is about brokering information and helping users to understand both what is going on behind the scenes and also what attention they need to bring to the information they retrieve.
