EndUser 2006 notes on opening session [Updated]

[Through a series of missteps that I won't go into here, I discovered that I had accidentally deleted this post, first published a few weeks ago. I feel pretty dumb. When I figured out what happened, I sat here, stunned, wondering what to do. Then I remembered Google's good ol' caching capability, did a quick search to call up the cached version of this post, did a quick copy and paste, and voila, problem solved. Well, almost. My error wiped out the original post entirely, meaning that it automatically broke the link to that post, as well. There's nothing I can do about that. In the process of reconstituting the content, I decided on some editorial tweaks throughout.]

(Warning, this is a pretty lengthy post.)

Yesterday was the start of EndUser 2006, Endeavor’s customer conference. Somewhere around 1,000 customers have shown up for this event, some coming from as far away as Australia, New Zealand, several European countries, as well as Canada, Latin America, and of course, the U.S. As I’ve noted before, there are several conference sessions dealing with topics of interest, but yesterday’s highlight was the opening general session featuring a representative from Google who spoke in depth about Google’s Book Search project. Tom Turvey, Head, Google Book Search Partnerships, gave a brief overview of Google and how it makes money, defined the elements of Google Book Search, described the Google Book Search Partner Program (which he oversees), and finally discussed the Library Program portion of Google Book Search. Tom has a long history of working with online content, serving in numerous roles in the publishing industry relating to online delivery, including launching Barnes & Noble’s ebook offerings and most recently holding a senior post at HarperCollins.

Tom began by describing Google’s business. He mentioned that Google now provides 59% of all Internet search referrals. Google’s oft-repeated mission is “to organize the world’s information and make it universally accessible and useful.” Its core business, i.e. how the company makes money, is advertising revenue generated via paid search ads using Google AdSense. Tom also mentioned that Google is the leader, by far, in referrals to book sites (currently it processes about 60% of all such referrals). In describing Google’s business, Tom pointed out some interesting statistics about book purchasing. Thirteen percent of all book purchases are now made online; schools/libraries make up about 24% of the book buying market; direct to consumer purchasing (direct from publishers) is about 2%; and the biggest growth area recently has been in non-bookstore retail (books being purchased in Costco, Sam’s Club, Wal-Mart, etc.).

The next portion of the presentation focused on an explanation of Google Book Search. Tom pointed out that in his experience, never has there been so much misinformation about a product as there has been with Google Book Search (GBS). He made some comment that 90% of what has been published in the news media is false, thus the importance of explaining exactly what it’s about. GBS, at its heart, is an attempt to associate book content with what searchers are looking for in search engines. There are two main parts to GBS: the Partner Program, and the Library Program. The Partner Program involves relationships and agreements between Google and publishers. GBS launched in October 2004 at the Frankfurt Book Fair. As of now there are literally thousands of publisher partners spanning seven languages. One of the most frequent questions publishers ask Google is, what books are good choices for discovery via GBS? One of Tom’s funnier statements was “we don’t need to help Harry Potter find an audience.” What Google is mostly interested in is the arcane, the obscure, and bringing this material to light via searching GBS. Every page is searchable; users are searching books from cover to cover. There are two ways of providing search on book content: a dedicated search (books.google.com), and integrating book content within the general Google search. The main intent of working with publishers is to drive book sales. Content is protected in a variety of ways (Tom mentioned that as you can imagine, this element of agreements with publishers often gets “into the weeds”). Only 20% of a book is viewable by one user during the course of a month. Print, copy, and save are disabled. Scanned images are purposely low resolution. Publishers can add/remove their material at any time. There is page level security as well. A percentage of pages is never visible at one time.
Google’s process for receiving publisher content is pretty straightforward: the publisher usually sends either a PDF or a print copy. If the latter, Google digitizes it. As an interesting aside to closing out this portion of the talk, Tom mentioned “Oh by the way, the five publishers who are suing Google over the Library Project are actually members of the Partner Program.”

In turning to the third and last portion of the presentation, Tom outlined the elements of the Library Project. Partner libraries, as most people are aware by now, include Stanford, NYPL, Oxford, Michigan, and Harvard. In researching and comparing collections from each partner library, Google discovered that 60% of books are held in only one of the partner libraries. For legal and other reasons, Google began the project by focusing on public domain books. However, public domain books make up only about 20% of a typical library collection. Ten percent of a typical collection is made up of books that are still in print (i.e. the stuff that is handled via the Partner Program). Most books, 90%, are not in print and sit in a fuzzy area in which they may be out of print but still in copyright, or perhaps out of copyright. Seventy percent of collections were published after 1923 and fall into three categories: in copyright, in public domain, or the rights may have reverted. Obviously Google needed to figure out how to solve or address these complexities. Their solution was to offer to scan everything but provide three views: sample pages (partner view), snippet view (book under copyright w/out agreement with a publisher partner), and full book view (book is in public domain). The snippet view means that the full text of each book is indexed; users can only view three snippets from the book; there are links to “buy this book” as well as “find in a library”; different categories of books are handled in different ways; and copyright holders may opt out of display and/or scanning.

Obviously a critical factor for Google is optimizing and streamlining the workflow. For example, a key consideration was figuring out how long it takes to scan a typical book. Tom mentioned that in the early days of the project, founder Larry Page and another staff member would use a metronome to time each other over and over again as they tried to figure out how best to scan a book. (Why a metronome? I have no idea and neither did Tom.) Books are scanned as is, including scribbles, marginalia, notes, whatever. Google is aiming to build a comprehensive collection of indexed books but has a long way to go yet on achieving that goal. Some of the challenges they face on a daily basis are achieving 100% OCR accuracy and 100% image quality, search and integration with web search, the accuracy of any affiliated metadata, the existence of lots of “edge cases” in terms of how to process and display the scanned results, how to address books that contain multiple languages and/or scripts, and how best to achieve a good level of speed/automation of the entire process. As with their much vaunted (and top secret) search algorithms, Google is constantly tweaking the process to try to improve the quality. How do they handle math formulas, spelling correction (Tom used the example of vernacular language that is meant to be spelled a certain way but which looks wrong to a typical spell checker), etc.? What is the best way to deal with automated metadata extraction? Can they figure out an automated way to detect (and appropriately handle) different languages and/or scripts?

Tom made a big point of the fact that Google is actively engaging the library community. Librarians tell Google the good and the bad about GBS (e.g. of bad: too overwhelming for users, hard to know which stuff is authoritative and what is junk, desire to know exactly how the process for scanning and indexing works). Google wants to ensure that GBS works for libraries by making information more discoverable, driving more library usage, and supporting a worldwide community, which is especially relevant for remote and distributed library users. Google has no desire whatsoever to put libraries out of business; in fact, Tom claims that the opposite is true.

[One of the things that I thought was particularly striking was that at one point during the session, Mr. Turvey asked for a show of hands from the audience of those people who were aware of the facts and details he had provided about Google Book Search. To my astonishment, I was one of the few people to raise their hands. Maybe this was just due to some people not fully understanding the question or to some people's innate shyness, who knows. But if it was an indicator of professional ignorance of these matters, then we're in big trouble.]

After concluding his prepared remarks, Tom invited the audience to pose questions. This was perhaps the most interesting portion of the session and Tom handled the questions with aplomb and a dose of wit. Below are my notes of the substance of some of the questions posed, followed by the substance of what I could jot down of Tom’s answers.

Question: When a user sees a link to “find in a library” which leads to Open WorldCat, what librarians want is to have that user come to us rather than use Google and/or buy the book from the publisher. What is your view on this?
Answer: It appears that this is in fact what is happening. Logs show that adding the “find in a library” link, directed to Open WorldCat, has driven a tremendous growth in traffic to WorldCat. Presumably this leads to higher library use.

Question: I’d like to see much more powerful search options, including things like truncation, proximity searching, and boolean capabilities. Is this something Google is considering?
Answer: That’s a very good question, what I’d expect from a librarian <laughter from the audience>. Some of these capabilities are things we are indeed working on, while some of them are already available via the Advanced Search option.

Question: I believe that in search results from publisher content, there is no link to “find in a library” when there is such a link provided in the library search. Why is that?
Answer: Good question. Remember that the goal of GBS is to have a relevant search. The vast majority of books available in GBS at this time are from publishers. Over the next few years, that proportion will flip to emphasize library-owned material. Honestly there is a constant tug and pull between publishers and Google over this issue of how to direct users. Publishers, obviously, participate in GBS to sell more books.

Question: Is there any plan to include Library of Congress Subject Headings (LCSH) as part of the GBS search?
Answer: LCSH and other taxonomies are already used to some extent behind the scenes to assist with determining relevance as well as identifying relationships between books (linking from one book to a related book).

Question: Can you speak about why you are being sued by some of your publisher partners?
Answer: Attorneys love it when you talk publicly about their litigation <much laughter from audience>. Seriously, though, no, I can’t answer that.

Question: Are you indexing each book cover to cover (i.e. full text)? How do you determine relevancy? [Editorial aside: Was this person paying attention? This question was clearly answered in the context of the presentation.]
Answer: Yes, we are doing full text. The ranking/relevancy algorithms used in GBS are pretty much the same as those used in the regular Google search. Some tweaking is of course necessary to make the algorithms relevant for book search. We do user interface testing every month and as a result, we constantly tweak/change the algorithms.

Question: Do you have a formal digital preservation strategy?
Answer: We have agreements with our library partners that cover preservation to whatever degree they have specified in their legal agreements. It really depends on what partner libraries want. Other than that, no, we do not have a formal preservation strategy and do not feel that that is a role we should assume.

Question: Elaborate on how relevant metadata is in GBS.
Answer: Well, first of all, metadata does play a role in GBS but our bias is always toward full text, with metadata/abstracts thought of as secondary. This is probably the opposite of how most libraries would prioritize things.

Question: I have a question on the issue of fair use. Are you working to expand the concept of fair use in terms of scholarly material in particular?
Answer: We feel that our stance on fair use and GBS is very, very significant. We do not have any formal focus on scholarly material in GBS, though.

Question: What is Google’s stance toward the Open Content Alliance? Does Google view them as partners, or competitors?
Answer: We have an open door, a desire to partner and share in digitizing material. We believe that initiatives such as the Open Content Alliance are worthy of our support. However, as you can imagine, there are certain complexities and a lot of politics involved in this kind of interaction. We want to participate in initiatives like this in as open a way as possible.

Question: “Find in a library” links only to WorldCat at present. Does Google have any plans for directing traffic to other bibliographic (i.e. library) databases (this is particularly important for those libraries who aren’t linked from WorldCat)?
Answer: We’d be interested in any other worthwhile bibliographic databases, but WorldCat is it for now.

Question: A single search box is very attractive, but when you expand your data sources (as Google is doing), the simplicity and relevance of this one search become more difficult to maintain. How do you handle this?
Answer: We constantly reevaluate the one box concept and it is an ongoing problem to solve. There is no ready answer.

Question: How do you handle materials from publishers once those materials have gone out of print?
Answer: Good question. Once a publisher’s book goes out of print, they request that it be removed from the index and then it no longer appears in the search. The exception to this would be if there happens to be a copy of that same book that has been scanned and indexed as part of the Library Project. In that case, the book would remain in the index.

Question: Do you have plans for providing regional Google book searches (e.g. one for New Zealand imprints)? This is important for those outside of the U.S. because currently there is such a predominance of U.S. imprints in GBS.
Answer: We already do this, e.g. currently we have 65 regional book searches.

Question: The exposure from GBS for libraries is great, but it needs to be more two way, e.g. to direct users looking for material in a local library catalog to GBS and/or elsewhere. Are there any plans to extend the Google API to be used by libraries for integration into their online catalogs?
Answer: Something like this functionality is present in Google Scholar. We are very happy with this integration with library services and we want to figure out ways to extend this further.

Question: What’s your view on libraries’ development of customized Greasemonkey scripts to integrate library results in with GBS?
Answer: Anything that doesn’t violate copyright, we’re all for.

Question: GBS is very exciting. What about developing Google Journals?
Answer: <tongue in cheek> …So we have this thing called Google Scholar…Actually we are working on ways to better integrate or link between GBS and Google Scholar.

Question: There is clearly a balance of power issue relating to the premise that allowing Google to do all this scanning and digitizing of book content puts the burden of proof on the content creator rather than the user. What are your thoughts about this?
Answer: We believe that this is a very important issue and our stance on this hinges on the belief that we are simply being consistent between the indexing of website content and indexing the content of books.

Question: What about working to include government documents, because they do not present a copyright problem?
Answer: Yes, we have a team devoted to this very issue. It is a bigger challenge to do this than it may at first appear because in order to do it we need to work out who is responsible (i.e. the publisher) for the multitude of gov docs. Expect progress on this front.

Library online catalogs and relevancy ranking [Updated]

Karen Schneider’s post on the ALA Techsource blog, “How OPACs Suck, Part 1: Relevance Rank (Or the Lack of It),” is a rant by a librarian who either presents a foregone conclusion due to incomplete research, or one who reaches a conclusion out of misunderstanding. Unfortunately such rants are fairly common. Karen complains about the lack of relevancy ranking in most online catalogs, something that most search engines routinely employ. She sums up the result of her research with the following statement:

“Relevance ranking is just one of many basic search-engine functionalities missing from online catalogs.”

Be sure to read the post as well as all of the comments (28 so far).

So why do I find this post problematic? Well, first of all, Karen makes a blanket statement like the one quoted above, without qualification. The fact is that library online catalogs do include relevancy ranking, and they have for years. The online catalog for Endeavor, for example, called WebVoyage, has had relevancy ranking for just about all of its existence (about nine years). It has never been “perfect” but it has been there. No, it doesn’t work in the same manner as, say, Google’s PageRank algorithm. (It predates that technology, anyway.) And I don’t think it should be expected to, either. I agree that the ease of use and the transparency of the results for library online catalogs should be close or very similar to Google’s but comparing library online catalogs to Google in this way is like comparing apples to oranges. For one thing, the underlying data and databases for library online catalogs are almost entirely different from the data and database(s) underlying a major search engine. See screen shots here that illustrate this capability in WebVoyage.

Another problem I have with this post is that it blames vendors of library online catalogs for the fact that relevancy ranking isn’t apparently present in many instances. There is no consideration given by Karen to the possibility that relevancy ranking may not appear to be available because libraries themselves have chosen not to implement it or make it readily available to their users. The perspective here is very one-sided. Let’s all blame the vendors for inhibiting us librarians from properly serving our users and meeting their expectations. Vendors are by no means blameless, but neither are librarians. Just once, I’d like to see Karen and others of her ilk acknowledge that situations like these are not as black and white as they may like to believe. Sometimes I think it’s a matter of convenience because many librarians have long since cast “the vendor” as the bogeyman (“how dare they actually care about making money?!”). I say, look at both sides of the issue and especially do not be so quick to lay blame without truly understanding the reality of what vendors provide and what vendors do. Here is another quote from Karen’s post:

“But the interesting questions are: Why don’t online catalog vendors offer true search in the first place? and Why we don’t demand it? Save the time of the reader!”

OK, so what is “true search,” Karen?! (I don’t believe that is defined anywhere in the post.) What you define as “true search” isn’t necessarily how another person might define it. This is just common sense. If “true search” is meant as relevancy ranking, as I’ve already pointed out, vendors HAVE offered and DO offer “true search.”

But I’m beginning to see that that kind of answer doesn’t fit the simplistic, librarians-as-hapless-victims paradigm Karen has preconstructed so it wouldn’t count. It wouldn’t be relevant.

P.S. In one of her comments responding to another person’s comment, Karen talks about how vendors don’t offer field-weighted searching in online catalogs, either. I can’t wait to read “the facts” she will present. [Updated 3/20/2006: Especially since Endeavor's WebVoyage does already provide field-weighted searching.]
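
For readers unsure what "field-weighted searching" means in practice, here is a minimal sketch in Python. To be clear, this is not how WebVoyage or any other vendor's catalog actually ranks results. The field names, the weights, and the sample records are all made up for illustration. The idea is simply that a query term found in the title counts for more than the same term found in a note.

```python
from collections import Counter

# Hypothetical weights: a hit in the title counts more than a hit in the notes.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "notes": 1.0}

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase a string, split on whitespace, and trim basic punctuation."""
    words = (w.strip(PUNCTUATION) for w in text.lower().split())
    return [w for w in words if w]


def score(record, query):
    """Sum, over all fields, (occurrences of query terms) x (field weight)."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(record.get(field, "")))
        total += weight * sum(counts[t] for t in terms)
    return total


def rank(records, query):
    """Return only the matching records, highest score first."""
    scored = [(score(r, query), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in hits]


# Made-up sample records.
SAMPLE_RECORDS = [
    {"id": 1, "title": "Gardening in small spaces", "subject": "Gardening", "notes": ""},
    {"id": 2, "title": "A history of England", "subject": "Great Britain -- History",
     "notes": "Includes a chapter on gardening."},
    {"id": 3, "title": "Cooking for two", "subject": "Cookery", "notes": ""},
]

ranked_ids = [r["id"] for r in rank(SAMPLE_RECORDS, "gardening")]
print(ranked_ids)
```

The record with "gardening" in its title and subject outranks the one that only mentions it in a note, and the cookbook drops out entirely. Real systems layer a lot more on top (term rarity, phrase matching, and so on), so treat this as the bare concept only.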

Some Thoughts on RDA and ILS vendors [Updated]

Some time ago I noted here that an acquaintance of mine had snagged an interesting job at ALA as RDA Project Manager. Yesterday I sat down and read more about RDA, which stands for Resource Description and Access. In particular I read through the RDA Prospectus, published by an international group called the Joint Steering Committee for Revision of AACR, or JSC for short. This group is responsible for implementing changes to the cataloging code of practice in use by the majority of libraries in North America, the U.K., and Canada. The current cataloging code is known as the Anglo-American Cataloging Rules (AACR) and this has been the standard code for cataloging since the 1960s when the first edition of AACR was published. Having taken all of the cataloging coursework in library school and then starting out in the profession as a serials cataloger at the University of Chicago Library and then managing a large cataloging unit there for quite a while, I have “grown up” on AACR and have been actively involved in the cataloging community, particularly the serials cataloging part, in the past. I’ve since moved away from that professional focus somewhat and am no longer as current in my knowledge as I used to be. I had heard about RDA but didn’t really pay much attention to it. So it was a big surprise to me to read yesterday that RDA will be replacing AACR (or rather, AACR2R, which is the 2nd, rev. ed. of AACR that is currently in use). I decided to delve into RDA in more detail.

What I learned from the prospectus and from some of the discussion surrounding RDA that I could find is very intriguing. This is a very big change, and, in my view, a positive one. It is a big change on many levels but since I work for a major ILS (integrated library systems) vendor, I focused on what this new standard might mean for them. Here are some thoughts or impressions that came to mind:

  • Acceleration of the end of MARC, or at least, the lessening of emphasis on MARC. MARC (which stands for MAchine Readable Cataloging) is not directly tied to AACR2R or RDA in theory but nevertheless the two are closely entwined in practice. While AACR2R (and soon, RDA) describes cataloging rules such as how to choose the title of a book, MARC is the standard for how to record and transmit cataloging information electronically. MARC also drives or controls much of what cataloging information gets displayed to users in online catalogs. My reading of the prospectus makes it seem very clear that RDA will not assume the use of MARC but instead will be designed to be of use in a variety of metadata formats, of which MARC will be one of many. Of course there are already many other metadata formats in use by libraries other than MARC (e.g. EAD, Dublin Core, etc.), but this kind of emphasis by RDA on multiplicity of formats has far-reaching implications and solidifies or adds weight to the trend toward multiplicity of formats that’s been underway for several years. Why does this matter to ILS vendors? It matters because the core record or basis for just about every major ILS system is the MARC record. Expansion of multiplicity of metadata formats supported by an ILS calls for radical system redesign, assuming, of course (which I personally do not), that the need for an integrated (some say, monolithic) library system continues to exist.
  • The prospectus makes it clear that RDA will be predicated on FRBR (Functional Requirements for Bibliographic Records) and FRAR (Functional Requirements for Authority Records), conceptual models developed under the auspices of IFLA (the International Federation of Library Associations and Institutions). These models have been around for quite a while yet very few ILS vendors have made their systems compatible with them as of yet. Implementation of RDA, as it is currently proposed, anyway, will change that from “it would be nice, but…” to “must be capable of…” In other words, it will no longer be desirable, but required. That’s a big difference. Those ILS vendors who have maintained the status quo on this one won’t be able to do so for much longer.
  • According to the prospectus, “RDA is being developed to provide a better fit with emerging database technologies, and to take advantage of efficiencies and flexibility that such technologies offer with respect to data capture, storage, retrieval, and display.” This could mean all kinds of things for ILS vendors and I am not certain really of what JSC has in mind. However, database design and maintenance is perhaps the most integral, complicated, and proprietary aspect of modern library systems. Any changes in that aspect of ILS work will be of huge significance for vendors.
  • Perhaps if RDA is successfully implemented, the idea of an ILS will enjoy a renaissance if/when vendors and/or libraries develop a system that can readily ingest, output, and manipulate library data no matter how it is encoded. Rather than component-izing (a made-up word) the disparate pieces of traditional ILS functionality as seems to be the general trend nowadays, maybe RDA, with its inherent tolerance for a multiplicity of metadata formats, will result in one central system that can handle those formats in one place with the flexibility that libraries need. Who knows?
  • One major portion of RDA will be dedicated to relationships. I find this interesting and a good thing. One of the biggest failings of ILS systems is that they have largely failed to readily help librarians piece together disparate works so that the user of the online catalog can readily see relationships among them.
  • One thing not mentioned at all in the prospectus is the whole concept of user-supplied metadata, e.g. tagging, and how that will play a role in the future for online catalogs and bibliographic utilities. I believe that tagging as a phenomenon is here to stay, even if I have my doubts about its efficacy right now. How can or should ILS vendors enable user-supplied metadata in conjunction with library-supplied cataloging?
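
To picture what "one description, many metadata formats" could look like, here is a rough sketch in Python. It is my own illustration and not anything taken from the RDA prospectus: the format-neutral element names and the sample values are invented, and the two crosswalks cover only a handful of common MARC 21 fields (100, 245, 260) and Dublin Core elements.

```python
# One format-neutral description. The element names and values are made up.
description = {
    "title": "An example title",
    "creator": "Doe, Jane",
    "publisher": "Example Press",
    "date": "2006",
}

# Illustrative crosswalks only; a real mapping has far more fields and rules.
TO_MARC = {
    "title": ("245", "a"),      # title statement
    "creator": ("100", "a"),    # main entry, personal name
    "publisher": ("260", "b"),  # publication info: name of publisher
    "date": ("260", "c"),       # publication info: date of publication
}
TO_DUBLIN_CORE = {
    "title": "dc:title",
    "creator": "dc:creator",
    "publisher": "dc:publisher",
    "date": "dc:date",
}


def as_marc(desc):
    """Return (tag, subfield code, value) tuples, sorted by tag and subfield."""
    return sorted(
        (TO_MARC[key][0], TO_MARC[key][1], value)
        for key, value in desc.items()
        if key in TO_MARC
    )


def as_dublin_core(desc):
    """Return a dict keyed by Dublin Core element name."""
    return {TO_DUBLIN_CORE[key]: value for key, value in desc.items() if key in TO_DUBLIN_CORE}


marc_fields = as_marc(description)
dc_record = as_dublin_core(description)

for tag, code, value in marc_fields:
    print(f"{tag} ${code} {value}")
for element, value in dc_record.items():
    print(f"{element}: {value}")
```

The point of the sketch is that the description lives in one place and each format is just an output. A system whose core record is MARC has that relationship the other way around, which is why supporting many formats is a redesign rather than a feature.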

I admit that I don’t know as much as I should know about RDA and surrounding issues, and I may have misinterpreted some of what I’ve read. Or maybe there are even more radical implications for ILS vendors than what I can think of right now. Regardless, I am fairly confident that RDA’s progressive approach promises a lot of upheaval for a lot of stakeholders. I’m going to pay a lot more attention to it than I have heretofore!

RSS and Aleph online catalogs

A lot of people in the library blogosphere get excited when an ILS vendor announces some kind of RSS capability for their online catalogs. I wanted to mention here some excitement of my own when I recently discovered some interesting RSS functionality for the ILS I maintain (Ex Libris Aleph 500), developed by Peter Corrigan of the National University of Ireland, Galway, James Hardiman Library. Peter has implemented this in relation to A9.com’s OpenSearch technology.

See his entry at A9.com and also click here to see a sample search. (Note the orange icons for RSS and Permalink in the upper lefthand side.) Yes, this is cool!
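
For anyone curious what a program on the receiving end of such a feed might do, here is a small Python sketch. It is not Peter's code and says nothing about how Aleph produces its output. The feed below is a made-up sample, the catalog.example.org URLs are placeholders, and I am assuming the OpenSearch 1.1 response namespace (a given implementation may well use a different version of the spec).

```python
import xml.etree.ElementTree as ET

# Assumption: the feed uses the OpenSearch 1.1 namespace for its extra elements.
OPENSEARCH_NS = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}

# Made-up RSS 2.0 search results feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Catalog search: example</title>
    <link>http://catalog.example.org/search?q=example</link>
    <description>Made-up results for illustration</description>
    <opensearch:totalResults>2</opensearch:totalResults>
    <item>
      <title>First example record</title>
      <link>http://catalog.example.org/record/1</link>
    </item>
    <item>
      <title>Second example record</title>
      <link>http://catalog.example.org/record/2</link>
    </item>
  </channel>
</rss>"""


def parse_results(feed_text):
    """Return (total result count, list of (title, link) pairs) from an RSS feed."""
    root = ET.fromstring(feed_text)
    channel = root.find("channel")
    total = channel.findtext("opensearch:totalResults", default="0", namespaces=OPENSEARCH_NS)
    items = [(item.findtext("title"), item.findtext("link")) for item in channel.findall("item")]
    return int(total), items


total, items = parse_results(SAMPLE_FEED)
print(total)
for title, link in items:
    print(title, link)
```

Because the results are ordinary RSS with a few extra elements, anything that can read a feed (an aggregator, a portal page, another library's script) can consume a catalog search without knowing anything about the ILS behind it.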

I also read on the North American Aleph Users Group discussion list that the new product manager for Aleph, Katriel Reichman, is actively tracking and investigating the use of RSS in Ex Libris products, including SFX and MetaLib.