File /Humanist.vol22.txt, message 149


Date:         Mon, 28 Jul 2008 06:59:58 +0100
From: Humanist Discussion Group <willard.mccarty-AT-MCCARTY.ORG.UK>
Subject: 22.147 indexing
To: humanist-AT-Princeton.EDU


               Humanist Discussion Group, Vol. 22, No. 147.
       Centre for Computing in the Humanities, King's College London
                        www.princeton.edu/humanist/
                     Submit to: humanist-AT-princeton.edu



         Date: Mon, 28 Jul 2008 06:55:46 +0100
         From: amsler-AT-cs.utexas.edu
         Subject: Re: 22.143 indexing


I am not sure more data will improve the situation. The fundamental
problem is that when computers present "relevant" articles, they base
relevance on computational heuristics that are vastly impoverished
compared to human judgement. The common methodology is to look for the
same words as those used in the primary article, or for a set of
citations similar to that of the original article.
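The two heuristics just described — shared vocabulary and shared
references (what bibliometricians call bibliographic coupling) — can be
sketched in a few lines. The Jaccard measure, the equal weighting, and
the toy records below are my own illustrative assumptions, not the
method of any particular retrieval system:

```python
# Minimal sketch of the two relevance heuristics described above:
# (1) word overlap and (2) overlap in cited works.
# Scoring scheme and toy records are illustrative assumptions only.

def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def relevance(primary, candidate):
    # Equal weight to word overlap and citation overlap
    # (an arbitrary choice, for illustration).
    return 0.5 * jaccard(primary["words"], candidate["words"]) \
         + 0.5 * jaccard(primary["cites"], candidate["cites"])

primary = {"words": {"hypertext", "indexing", "retrieval"},
           "cites": {"Bush1945", "Salton1971"}}
older   = {"words": {"memex", "associative", "trails"},  # different terminology,
           "cites": {"Bush1945"}}                        # same underlying idea

print(relevance(primary, older))
```

The example makes the message's point concrete: the older article shares
no terminology with the contemporary one, so its word-overlap score is
zero, and only the citation link reveals any kinship at all.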

Word matches by computer are typically morphologically related matches.
As articles age, the language around them changes its meaning. Searching
for articles "relevant" to a contemporary article by means of recently
invented terminology will not find older articles that use different
terminology, and so a search can appear to have exhaustively found all
relevant articles when in fact the concept was discovered and explored
far earlier under different names.

Citations go back as far as citation indexes do, but that isn't back to
the beginning of the literature. I do not believe citation indexes are
extending their coverage backwards into earlier and earlier years of
publication. They may well be satisfied that they now have sufficient
depth of coverage that earlier citations wouldn't improve their
retrievals. Once again, the related articles could simply dry up as one
goes backwards and reaches the digitization horizon. A good question is
whether a resource such as JSTOR, dedicated to the past, could benefit
from citation indexing, or could at minimum act as a set of milestones
for conventional citation indexing to reach in extending its coverage
backwards in time.

What can be done? First, I believe some new studies of paper-only
research should be undertaken. The computational basis for "relevant"
articles should be studied more formally, asking whether the
computational processes in use are equivalent to human judgement or are
merely doing what is easy to compute while ignoring what can't be
computed.

For example, when researching I often identify more than related
terminology. I look for authors, institutions, journals and library
call numbers with multiple relevant works, and then research those
authors, institutions, journals and call numbers themselves to see
what's there. One can often discover a pivotal organization, or an
individual who mentored generations of students following a theoretical
approach that transcends the terminology, or a journal that has
published the bulk of the articles about a theory (and scanning journal
tables of contents can restore the missing connectivity to
non-terminologically related works). Library call numbers are an
excellent way of discovering related works, and most electronic
catalogs allow you to scan the shelves electronically if you can't do
it in person.
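The strategy described here — noticing which authors, journals, or
call-number classes recur across one's relevant hits, then investigating
those facets directly — amounts to faceted counting over result
metadata. A minimal sketch, in which the record fields, the toy data,
and the prefix-truncation of call numbers are all my own illustrative
assumptions:

```python
# Sketch of faceting a set of relevant hits by a metadata field, so
# that recurring authors, journals, or call-number classes stand out.
# Field names and toy records are illustrative assumptions.
from collections import Counter

hits = [
    {"author": "Smith", "journal": "J. Doc.", "call_no": "Z699.S5"},
    {"author": "Jones", "journal": "J. Doc.", "call_no": "Z699.J1"},
    {"author": "Smith", "journal": "JASIS",   "call_no": "Z699.S7"},
]

def facet(records, field, prefix_len=None):
    """Count the values of one metadata field across the hits;
    optionally truncate values first (e.g. call numbers to their
    classification prefix) before counting."""
    vals = (r[field] for r in records)
    if prefix_len is not None:
        vals = (v[:prefix_len] for v in vals)
    return Counter(vals).most_common()

print(facet(hits, "author"))                 # one author recurs
print(facet(hits, "call_no", prefix_len=4))  # one shelf class dominates
```

Counting full call numbers would find no repeats at all; truncating to
the class prefix is what surfaces the shared shelf neighborhood — the
electronic analogue of scanning the shelves in person.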

Terminology is often invented to separate one's research from others'.
Matching terminology isn't proof of exhaustive coverage, as others
looking at the same task may likewise have invented their own
terminology. That terms from different schools of thought in fact
address a common problem is hard to determine through computer indexes
alone.
