Tuna Breath: Search Behavior Archives

November 21, 2003

Frustration Factor

We've had a number of people write us to ask, "Where is it?" They know it's not in their local library, and they don't want to search every library to which we provide a link (such a small subset of the possible world, too). To those few, we've responded that they should take the full catalog details to their favorite reference librarian. This morning a reference librarian wrote, "I was taking a look at RedLightGreen and found two items that we would be interested in obtaining through Interlibrary Loan. ... I searched WorldCat, Melvyl & Google and found no other references to these items. I do not have access to Eureka, so I seek your advice on how I could obtain location info. on these items." The librarian provided the reference for a typescript of over 100 pages and a microform of a 4-page discussion guide for a CBS broadcast.

So, the "go ask your reference librarian" answer isn't going to work. Here's my response.

Thank you for your interest in RedLightGreen. In designing this service, we had to work with the constraint that we would not be able to identify the cataloging institution. We struggled with the frustration factor that some things are unique and are not accessible outside of the cataloging institution or items have access restricted to academic researchers: in short, some cataloged items are not available by interlibrary loan. To limit this frustration, we've excluded things that are clearly cataloged as archival materials. On the other hand, a number of us believe that knowing of the existence of something will assist those researchers beyond the traditional walls of the academy. As you've provided the actual items, i was motivated to treat this as a brief use case study and see what i could discover, knowing the information in the bibliographic record.
A further Google search on Eugene DuBow (note the removal of the space) demonstrated that he "headed the Chicago office of American Jewish Committee and was coordinating counter-demonstration work greater Chicago area" [1]. Eugene DuBow's bio appears here: http://www.uky.edu/Alumni/hoda/dubowEC.htm . It indicates that as of 2000 he was still working for the AJC. You might contact them directly for information on Eugene DuBow's work: http://www.ajc.org/. The AJC has a library that may be of assistance: Library at ajc org.

As far as "CIStems, Inc., Cultural information service, " it seems they did work for the major networks as late as 1988 [2]. As all three guides associated with CIStems had to do with Jewish issues, I suspect CIStems is the same as this group: "Cultural Information Service, a New York-based ecumenical resourcing agency. ...Cultural Information Service, PO Box 92, New York, NY, 10016." [3] It seems CIStems is likely the same Cultural Information Service run by Frederic and Mary Ann Brussat (brussat at spiritualrx com) [4], who seem active on the web. You may be able to contact them directly for a copy of the guide. If these links between instances of such a generic name are correct, it will be pretty remarkable. My expectation was that your best bet was going to be contacting CBS to find the educational division, etc.

I am familiar with the Vanderbilt Television News Archive [5] and thought that, while they likely did not have the viewing guide, they might have the actual TV show. I did a search on CBS and Skokie and didn't turn up an October 1981 show. I'm not sure when they started taping more than just the evening newscasts; this may have been before that. Depending on your research interests, you may find the other Skokie/Nazi news episodes of interest. The Vanderbilt archive is a unique institution in that their right to loan tapes directly to individuals (instead of to another library) is expressly granted in federal law.

Thanks again for your interest in RedLightGreen. I hope that you continue to find the information and the service useful. We welcome comments and critiques as we continue through this pilot period in developing the service.

Sincerely,

Judith Bush

[1] http://www.colorado.edu/conflict/civil_rights/topics/0700.html
[2] http://www.va-holocaust.com/page106.html and a search in RedLightGreen turns up a guide for "Holocaust"
[3] http://www.georgiabulletin.org/local/1978/09/14/d/ from a Google search on ["Cultural information service" discussion guide]
[4] http://www.spiritualityhealth.com/newsh/items/blank/item_227.html#brussat
[5] http://tvnews.vanderbilt.edu/TV-NewsSearch/tvn-database-info.pl?SID=&UID=&CID=&auth=&code=

Posted by judielaine at 12:47 PM

October 24, 2003

Quarter Life Crisis's review -- part II -- LC Subject Headings

In Quarter Life Crisis's section Refining, Prost writes:

The catalogue also supports refining searches, which is a very good idea and can be very helpful. Unfortunately the Subjects provided seem very random. When searching for Vector Bundles Why can I choose Vector Bundles – Congresses but it’s not related to Vector Bundles? Why can’t I simply limit my search results to a broad area like Mathematics, say? For example in this search, all the top results are maths related. Still the top Subjects to come up are completely unrelated.

The "Refine Search" section is based primarily on the LC Subject Headings. It's tempting to believe these are hierarchical. They're not. They're *linear*. They are designed for *DRAWERS*. ANALOG. Little cards.

(deep breath -- sorry for the rant)

We spent MONTHS trying to use Recommind's automated classification to create something more intuitive. We tried the fragments of the subject headings, we tried using Dewey and LC Class Numbers. (Designed for collocating *books* on *shelves.* Physical object. One to one relations. More ANALOG.) It didn't really work. We had problems with granularity, problems with antique classifiers (where books on contemporary Iraqi politics have Dewey class numbers that put them "under" Archaeology -- which makes sense if you want your books on Mesopotamia and Ancient Persia to be near Iran and Iraq), and problems generating labels.

In the end, we do a rapid analysis of all the subject headings associated with the first hundred works that are returned. Catalogers gave subject headings of Vector Bundles – Congresses with the understanding that they'd be right behind the books on subject cards for Vector Bundles. They knew that the researcher intent on discovery would keep flipping back through the cards. No point in creating two different cards for the user to flip by, was there? And there was certainly no point in labeling this book "Mathematics" -- only books that address the whole broad subject area would get a subject heading like that.

...rambling about an example use and some bugs and forthcoming fixes...

I don't know much about vector bundles, so i'll switch to something i do: halo nuclei. The disambiguation problem is clear to me in this case. There are books about the structures of galaxies -- halo galaxies and the nuclei -- the thick centers -- of galaxies. Not my interest. I want halo nuclei -- stable nuclei that are believed to have a density distribution that falls off much less rapidly than the lighter isotopes of the same element. (Usually the next lighter nuclide is not stable.) Many of these elements are important in nucleosynthesis, though (which occurs in stars).

These "related" subjects, then, are the frequency ranked subjects assigned to the titles in the result set. If i'm looking for halo nuclei in the context of nucleosynthesis, i can limit by astrophysics and then check the refined list of subjects. I notice a bug i thought we'd gotten rid of -- astrophysics remains as "refine search by" subject. I check all the remaining subjects to see if i can limit on nucleosynthesis -- i can.
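The frequency ranking described above can be sketched in a few lines. This is a minimal illustration only, not the RedLightGreen or Recommind code: the record layout, the `related_subjects` name, and the sample headings are all assumptions made up for the example.

```python
from collections import Counter


def related_subjects(results, sample_size=100, top_n=5):
    """Rank subject headings by how often they occur in the first results."""
    counts = Counter()
    for record in results[:sample_size]:
        # Count each heading once per record, even if it is repeated.
        counts.update(set(record["subjects"]))
    # Most frequent first; alphabetical order breaks ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [heading for heading, _ in ranked[:top_n]]


# Hypothetical result set for a search on "halo nuclei".
results = [
    {"title": "A", "subjects": ["Astrophysics", "Nucleosynthesis"]},
    {"title": "B", "subjects": ["Astrophysics", "Galaxies"]},
    {"title": "C", "subjects": ["Nuclear structure", "Astrophysics"]},
    {"title": "D", "subjects": ["Nucleosynthesis"]},
]

print(related_subjects(results))
# ['Astrophysics', 'Nucleosynthesis', 'Galaxies', 'Nuclear structure']
```

Because only the first `sample_size` records are examined, the counts say nothing about how many records in the whole catalog carry a heading, which is why a count taken from the sample would mislead if displayed.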

We're planning on changing the "show more subjects" listing to an alphabetic list. The current list puts the most frequently occurring at the top. Without numbers, it's hard to understand the ordering. [We don't display any numbers because we do the analysis across the first one hundred results -- so if twenty-five of the first one hundred results have "astrophysics" as a subject, i might get forty results when i select that limit.]

Posted by judielaine at 10:14 PM

September 20, 2003

Stop Words

We are fast approaching launch, and in some strange mixing of metaphors, i find myself suffering bouts of vertigo.

Our information architects felt we needed a number of sample searches to help orient new users to RedLightGreen -- give a sense of the breadth of content, the types of searches that are supported. Pam Dewey, from Member Services, suggested that we arrange it as different sample paper topics, and she looked up assignments on-line. Once we were able to get the full database to search, Joe Zeeman began vetting the assignments.

We had declined to use Recommind's stop words when indexing the documents. Over half of RLG's Union Catalog are records for items not in English. While the records would be coded for the language of the item, much of the cataloging is in English. Still, even were we to decide that we would use the 008 language fields to catalog the transcribed MARC fields and treat the rest as English, we would still apply English stop words to non-English fields. Not only are uniform titles not necessarily in the language of the item *or* in the cataloging language, but we also have catalog records from libraries that catalog in languages other than English. (For example, the Swiss National Library is now a member and catalogs in three languages.) Since using the Recommind relationship analysis to provide cross language access was a goal, we didn't want to sabotage it by applying stop words across the board.

With these stop words indexed, the Recommind engine gives the same results for a search on "caste in India" as "caste India." The very common articles and prepositions don't take a prominent position in the word vectors that determine relatedness. The same results, though, take an order of magnitude longer to return.

My initial response was to blow that off. Who searches with whole phrases? Well, it seems around half the folks with whom i talked search whole phrases. They know the articles and prepositions don't affect the results; they expect them to be stripped away, Google-like, so they type the whole phrase.

I run to the opposite extreme. I'll often just use "francisco" with other terms when trying to find San Francisco related results. After hall conferences, a quick call to Recommind indicated we could turn on the stop words now, leaving the words indexed in the system. Enough RLG staff seemed to expect the words to be stripped out, and the delay in the search did seem significant. "Go ahead," i said, "and we'll review the stop word list later."

The stop word list was extensive. Recommind's large legal clientele seems obvious when noting the stop words of "accordingly," "consequently," "furthermore," "howbeit," and especially "thereafter," "thereby," "therefore," "therein," "thereof," "thereto," and "thereupon." The amount of correspondence indexed reveals itself in excluding a number of salutations.

"Death Becomes Her" became a search for "death" with this stop word list. I rapidly revised down the 576 stop words to 58. Twenty-five of those were the letters of the alphabet except for X. "Generation X" is, apparently, a well used subject heading.
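To make the effect of the two lists concrete, here is a rough sketch. The actual Recommind list and the revised 58-word list are not reproduced here; the word sets below are hypothetical stand-ins, and the long list simply assumes that "becomes" and "her" were among the 576 words.

```python
import string

# Hypothetical short list: a few common English words plus every single
# letter except "x" (kept so that "Generation X" stays searchable).
COMMON_WORDS = {"a", "an", "and", "in", "of", "the", "to"}
SINGLE_LETTERS = set(string.ascii_lowercase) - {"x"}
SHORT_STOP_WORDS = COMMON_WORDS | SINGLE_LETTERS

# Hypothetical stand-in for the longer list; assumes it held these words.
LONG_STOP_WORDS = SHORT_STOP_WORDS | {"becomes", "her", "thereof", "therein"}


def strip_stop_words(query, stop_words):
    """Lowercase the query and drop any term found in stop_words."""
    return [word for word in query.lower().split() if word not in stop_words]


print(len(SINGLE_LETTERS))                                      # 25
print(strip_stop_words("Death Becomes Her", LONG_STOP_WORDS))   # ['death']
print(strip_stop_words("Death Becomes Her", SHORT_STOP_WORDS))  # ['death', 'becomes', 'her']
print(strip_stop_words("Generation X", SHORT_STOP_WORDS))       # ['generation', 'x']
```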

We've got the shorter stop word list in place now; i hope no one expects to find the results of "Who, What, Where, and Why In Having A Will."

Constant review of the duration of searches and the frequency certain words are used is going to be required for us to tune that short list. We’ll likely implement the stop word list at the application layer, so that we can allow a user to require a stop word as well.
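One way an application-layer filter might let a user require a stop word is sketched below. Nothing here is the planned implementation: the leading "+" syntax, the `prepare_query` name, and the word list are all assumptions chosen for illustration.

```python
# Hypothetical stop word list for the example.
STOP_WORDS = {"a", "and", "in", "of", "the", "who", "what"}


def prepare_query(raw_query, stop_words=STOP_WORDS):
    """Drop stop words before the query reaches the search engine,
    unless the user marked a term as required with a leading "+"."""
    kept = []
    for token in raw_query.lower().split():
        if token.startswith("+"):
            term = token[1:]
            if term:  # ignore a bare "+"
                kept.append(term)
        elif token not in stop_words:
            kept.append(token)
    return kept


print(prepare_query("caste in India"))   # ['caste', 'india']
print(prepare_query("the who"))          # []
print(prepare_query("+the +who"))        # ['the', 'who']
```

Since the words stay indexed in the engine, a filter like this only decides what gets sent; a required term can still be matched.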

Posted by judielaine at 03:06 PM

September 08, 2003

Heidi Chronicles

Given a search on "Heidi," we don't return the Heidi Chronicles.

If the problem is that there are 1000 authors named Heidi and no way for the Recommind engine to prefer one author to another (meaning if you were looking for an author named Heidi you'd be missing 500+ of them), should we decide to err on the side of titles and make titles slightly more relevant than authors? We can't perform the miracle of turning up that author you vaguely remember as having the name Heidi, no matter how hard we try -- we can't handle results larger than 500 in a timely manner. (Now, remember that her book was about biscuits and we can help.) So, we should decrease the relevancy value of the author fields.
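A toy illustration of what lowering the author weight does to ranking follows. It is a sketch under stated assumptions, not how the Recommind engine scores records: the `FIELD_WEIGHTS` values, the `score` function, and the "Heidi Smith" record are invented for the example.

```python
# Hypothetical per-field weights: a title match counts more than an author match.
FIELD_WEIGHTS = {"title": 1.0, "author": 0.6}


def score(record, term, weights=FIELD_WEIGHTS):
    """Sum the weights of every field that contains the search term."""
    term = term.lower()
    total = 0.0
    for field, weight in weights.items():
        words = record.get(field, "").lower().split()
        if term in words:
            total += weight
    return total


records = [
    {"title": "Baking Biscuits", "author": "Heidi Smith"},  # made-up record
    {"title": "The Heidi Chronicles", "author": "Wendy Wasserstein"},
]

ranked = sorted(records, key=lambda r: score(r, "heidi"), reverse=True)
print([r["title"] for r in ranked])
# ['The Heidi Chronicles', 'Baking Biscuits']
```

With equal weights the two records would tie; the lower author weight is what moves the title match ahead of the many author matches, and so inside whatever cutoff applies.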

Posted by judielaine at 05:43 PM

August 05, 2003

Mind Reading and Other Unattained Miracles

There have been times, in discussing plans for RedLightGreen, that we reminded ourselves that there are users who ask questions like these. While some of these are out of scope and some -- well, we don't have any books here -- there are some that point to the problems that new library users have. (Even if my image of a "new library user," in the generic sense, is a child under the age of five.) And some of these point to problems of those who have poor memories. (I'll count myself in that group.)

A browse through some of these "sample queries"

I am disappointed to find our current engine returns nothing on "Robert James Waller Waltzing through Grand Rapids" or on "Waltzing through Grand Rapids." (I have a slim chance of remembering a book title, and even slimmer chance of remembering a person's name.) "Waltzing Grand Rapids" returns seven books. Stemming is not enabled on this test engine, so i try "waltz grand rapids." "Grand Rapids" is found in the place of publication of many of these results, and, yet again, points to the need to remove that data from the keyword search. I recognize that there is no way the current system would really know that "Grand Rapids" and "Cedar Bend" are both rivers.

The "civil war" search won't help the student who assumes that they can write a paper about the whole war, although there are plenty of books about the US Civil War. It does have the functionality to help them discern that, "oh yeah, there have been other 'civil wars' than the one i had in mind." The disambiguation is not as useful as I'd hoped it would be, but it's a start, appropriate for a pilot system. The person who asked, "Can you tell me why so many famous Civil War battles were fought on National Park Sites," might pose a query to the RedLightGreen interface as "civil war national park." This does return some results -- National Park Service publications about the US Civil War -- although I'm not sure one would ever come across the answer to the particular question in a book.

My search on "time machines" entertained me a great deal and illustrated why stemming is a critical addition -- without stemming, HG Wells doesn't show up until result 55.
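Why stemming matters for these searches can be shown with a deliberately crude stemmer. This is a toy suffix stripper written for the example, not the stemming the engine would use; the function names and the two rules are assumptions.

```python
def naive_stem(word):
    """Toy stemmer: strip a trailing "ing" or a plural "s"."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def matches(query, title, stem=True):
    """True when every query term appears in the title, optionally after stemming."""
    normalize = naive_stem if stem else str.lower
    title_terms = {normalize(word) for word in title.split()}
    return all(normalize(term) in title_terms for term in query.split())


print(matches("time machines", "The Time Machine", stem=False))        # False
print(matches("time machines", "The Time Machine"))                    # True
print(matches("waltz grand rapids", "Waltzing through Grand Rapids"))  # True
```

Without stemming, "machines" never matches the title word "Machine", so a title like "The Time Machine" can only surface through other, weaker evidence.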

Enough entertainment for the day.

Posted by judielaine at 05:32 PM

July 31, 2003

Shelf Life, No. 117 (July 31 2003) ISSN 1538-4284

Today's Shelf Life has three summaries that seem to build on a theme.

In one summary, Rita Vine, of Workingfaster.com states that researchers need a tool box of starting search places.

The next summary points to emerging subject-specific search engines that specialize. The example given is CiteSeer. What's not mentioned is that the software that drives CiteSeer, ResearchIndex, is available for developing new indexes. I remember Kurt battling with NEC lawyers to get a decent license for the code -- there isn't an obvious link that makes it clear it's available, but it is. (At the last JCDL there were several talks about making the extraction of metadata/features in the crawled articles more accurate.) Thus, a proliferation of ResearchIndex-driven search engines leads to Rita Vine's multiplicity of rich starting places.

Finally, there's an article about how all these tools will need to implement methods for visualizing the result sets. I suspect that these tools are going to be for fee. "Most users" (in particular, the students we surveyed when developing RedLightGreen) want the Google "I feel lucky" result. They did their search, now tell them what they want to know. I have a cat who is having a biopsy for a lump on his gum -- i want a few websites about feline mouth cancers. I don't want to see a huge widget that shows the ratio of word choices like oncology vs cancer, feline vs cat, but if i was trying to figure out the best treatment for my cat, I may very well be interested in subscribing to a resource that would help me quickly sort out the many results in the tail of the popularity distribution.

Posted by judielaine at 07:38 AM