Freedom of Information Conference 2000
Professor Blaise Cronin Indiana University
at Bloomington
Bibliometrics and Beyond: Some thoughts on webometrics and influmetrics
I was asked
at short notice to talk about the future of bibliometrics in the world
of the web, the internet, and developments such as PubMed and BioMed Central.
What bibliometricians
do basically is count things and, at the risk of oversimplifying, what
they count are publications, typically publications which have appeared
in peer-reviewed journals. They also count citations to the work of scientists,
scholars, and researchers, and they count at the nano-level - how many
times have you been cited, how frequently has your work been invoked.
They do it at numerous levels, from a micro level, such as a research
team, to a macro level, such as a nation state. Basically for the last
thirty or forty years bibliometricians have looked at publications and
citations to scholars. They've plotted, they've measured, they've tracked,
and they've tried to map the information theoretic structure of disciplines,
the intersections of disciplines, and the evolution of fields, and have
tried to use these techniques, amongst other things, to pick winners in
science.
The kind
of emerging environment of publishing that this event is addressing suggests
a raft of new opportunities for people interested in measuring. Citations,
after all, are the links that scientists make in their papers to the rest
of the literature. I often think it would have been extremely interesting
had, say thirty years ago, Eugene Garfield - who is the patriarch of citation
indexing - met Ted Nelson, who was the conceptual grandfather of hypertext,
which underpins the world wide web. In a sense, citation indexing is a
conceptual linking scheme that's been waiting for years for something
like the world wide web. I think we're going to see over the next few
years the emergence of non-commercial, public domain, open, autonomous
citation indexing techniques and I think we're at an extremely interesting
moment in the evolution of bibliometrics.
The origins of science citation
Let me start
with a very brief history of citation indexing and analysis. Most, if
not all, of you are familiar with the Science Citation Index. The name
was coined by none other than Joshua Lederberg who, way back in the 1950s,
was an ardent supporter of the idea of citation indexing and analysis
and was a redoubtable supporter of Eugene Garfield and his efforts to
get funding and ultimately develop the company ISI, which brought us the
Science Citation Index (SCI) and its sister products. In fact, the SCI
grew out of a subject specialty index and owes a great deal to Shepard's
Citations, a tool which is still being used in law to track citations to legal
literature.
If we look
from 1950 through to the 1990s, the primary application of citation indexing
was to retrieve the literature of science. It was conceived by Garfield
as a retrieval tool, not as something to be used in evaluating academics,
departments, research groups, or the state of the health of high energy
physics in Britain, France or anywhere else. It really was a retrieval
tool but quite a number of individuals recognised that it had far more
serious and significant implications in terms of our understanding of,
if you like, the structural dynamics and the social nature of scientific
activity. But basically the Science Citation Index is limited to three-and-a-half
to four thousand journals. That's what I call a bounded set and they are
primarily print-based, peer-reviewed journals. It is a slice of life, it
is the crème de la crème, ISI will argue. Others will contest
it and challenge the reliability of the data set but by and large it is
the crème de la crème of science literature, the social
sciences, and also the arts and humanities.
Creating "silent" scientists
Over the
years researchers in different fields and individuals such as Henry Small,
the director of research at ISI, have developed rather more sophisticated
techniques for mapping and modelling the growth, evolution, and interaction
of scientific fields by developing such things as co-citation analyses,
citation mapping, and visualisation techniques. Those of us who've done
this kind of work or have played with these kinds of things have been
historically reliant upon the tool sets from ISI. Those tool sets are
not just limited to a finite slice of the journal literature, they basically
exclude things such as monographs. Now that won't matter in biomedicine;
it does, however, matter not only in the humanities but quite significantly
in social science. So if you are trying to identify significant thought
leaders in fields such as the sociology of science, reliant upon ISI indexes,
excellent though they be, you will not get the full picture because of
the absence of monographs. I've argued for years with ISI and tried to
persuade them to include them but they declined so to do for cost reasons.
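As a rough illustration of the co-citation analyses mentioned above, the following is a minimal sketch, assuming each citing paper is represented simply as a list of the works it references. The labels "A" to "D" and the function name cocitation_counts are made up for the example; this is not how ISI's tools are implemented.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of works appears together in one reference list."""
    pair_counts = Counter()
    for refs in reference_lists:
        # Deduplicate and sort so each unordered pair has a single canonical key.
        for a, b in combinations(sorted(set(refs)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical citing papers; each inner list is one paper's references.
papers = [
    ["A", "B", "C"],
    ["A", "B"],
    ["C", "D"],
]
counts = cocitation_counts(papers)
print(counts[("A", "B")])  # works A and B are cited together by two papers
```

Pairs with high counts are the raw material for the maps of fields described here; a monograph that never enters the reference data can never appear in such a map.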
The last
eight or ten years I've been trying to persuade ISI to also include acknowledgements,
the goat's droppings of academia and scholarship some would say, but that
is to underestimate the social significance of acknowledgements. If we
broaden it to look at the debate currently in the medical literature,
suppose we do get rid of the author and replace it with contributors and,
in some cases, guarantors, wouldn't we like to identify all those who
have contributed? If you've contributed by critiquing a paper or assisting
in data analyses, you at least want to find your name in the acknowledgement
section. The trouble is, like second authors, third authors, and nth authors
of a paper, you disappear into the black hole of citation space, you don't
exist. Jack Meadows, a historian of science, talks about silent scientists
and there are many academics, not just scientists, who have contributed
significantly to the evolution of ideas, the maturation of doctoral and
post-doctoral students but it doesn't show up in the two most evident
measures, publication counts and citation counts. And ISI has, perhaps
for good commercial reasons, resisted incorporating acknowledgees. Now
imagine a database of all the individuals who were acknowledged. You may
dismiss it but if you happen to be a telescope operator this is precisely
what you're looking for. If you've taken part in a clinical trial and
are not a co-author this is just the sort of thing you're looking for.
Hearing silence through the web
Now I think
that is set to change as we move into the world of the web. Why? Because
the extensive growth in links and the increased transparency that links
provide us with allow us to see a much wider range of scholarly contribution.
New modes of contribution or historically invisible modes of contribution
will become potentially visible. The development of e-print archives is
significant not just because it challenges certain historical assumptions
about the usability of unvetted information but also because you can find
that that paper has been subsequently cited, even if it's not yet appeared
in the Journal of Record. So it provides further insight into the historically
invisible linkages between documents and texts.
Developments
such as PubMed Central, BioMed Central, and HighWire are going to create
the conditions, I suspect, for different modalities of publication, ranging
across the spectrum of peer review from highly robust, full-blooded, double-blind
peer review through to various lighter options. What that's going
to do is throw up a much broader range of objects for bibliometricians
and citation analysts to look at, analyse, track, and measure. So it's
an extremely interesting moment, a cusp in the evolution of research and
bibliometrics because we now have an incipient infrastructure which will
allow us to look at a much finer level of granularity. And we have developments
like CrossRef. What is CrossRef but an initial commercial variant on what
ISI is or has historically done? It represents, I suspect, a threat to
ISI's near historic monopoly on citation indexing and linking.
And then
we have something we call research indexing. This allows you to see citations
in context. For example, you do not just get Smith 1994 but you
can actually see the context, if you like the semantic wrap-around, in
which Smith is being cited. It tells you more about the nature, purpose,
and motivation of this citing author. This is giving us a level of detail
and contextualisation that we have not previously had and it's going to
allow us to explore and understand at a deeper, richer level what actually
is happening, what individuals link to and cite.
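To give a sense of what seeing a citation in context involves, here is a minimal sketch, assuming citations appear in the text as "Author Year" or "Author (Year)". The function name citation_contexts, the window size, and the sample sentences are invented for illustration; real research-indexing systems are considerably more elaborate.

```python
import re


def citation_contexts(text, author, year, window=40):
    """Return the text surrounding each 'Author Year' or 'Author (Year)' mention."""
    pattern = re.compile(rf"{re.escape(author)},?\s+\(?{year}\)?")
    contexts = []
    for match in pattern.finditer(text):
        # Keep a fixed number of characters on either side of the mention.
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        contexts.append(text[start:end].strip())
    return contexts


# Made-up citing text with one approving and one critical mention.
sample = (
    "The method follows Smith 1994 closely. "
    "We disagree with Smith (1994) on the question of sampling."
)
contexts = citation_contexts(sample, "Smith", 1994)
for snippet in contexts:
    print(snippet)
```

Each snippet carries the wording around the mention, which is what lets a reader judge whether the citing author is building on Smith or disputing Smith.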
Now you may
think that acknowledgements sitting down at the bottom are trivial but
the Wellcome Trust has been tracking acknowledgement data so they can
provide information to funding agencies in Europe and the UK. We could
eventually see systems being developed that would allow us to track individuals'
contributions in the context of clinical trials and other studies. One
could also imagine, for example, citations to clinical guidelines being
tracked in the context of evidence based medicine. So we're at a moment
in time where new tools are likely to be coming out of unusual stables
and contexts and applied by new populations to new outputs for new purposes.
New publishing - messy, slippy, evanescent and promiscuous
Traditional
publishing and traditional bibliometrics dealt with printed, peer-reviewed
journal articles in persistent scholarly journals. To exaggerate for effect,
what we're going to see is a sort of free-form publishing, a sort of libertarian
publishing. We're going to see many different modes of output, we're going
to see many different forums in which scholars, researchers, and scientists
can publish their work. So the units of analysis with which bibliometricians
are going to deal are going to become much more diverse and it seems to
me there are three elementary questions to be asked. What is it that's
being measured? Where is it exactly and how do we access it? And what
is the "it" which we're counting part of? The "what is the it"
- it could be a traditional journal article, an overhead transparency
of Nancy Kerrigan, an e-print paper, a self-posting, or it may be a version
3 of a working paper which is being dynamically revised. These are not
traditionally the potential units of analysis addressed by bibliometricians.
They are messy, they are slippy, sometimes evanescent, and we're not quite
sure how to deal with them.
So, what
is being measured? And where is it? How do we identify this miscellany
of new age, post-modern scholarly outputs? What's an accepted format in
each case for representing and labelling such output? Where do they reside?
Who owns them? How can we ensure persistence and stability over time?
How do we deal with link rot? How do we deal with vanishing URLs? Does
it belong to a traditional journal? Some new-age variant on our historic
concept of a journal? Is it part of a host service? An archive? An e-depository?
All this brings to mind a phrase of John Seely Brown: "The social life
of documents". Documents have a social life, they have kinship structures,
and we need to understand how they are socially contextualised.
One presumes
bibliometricians are concerned about the integrity, quality, and significance
of the things they are measuring, tracking, and counting. As we move into
what I call promiscuous publishing there are serious questions to be asked
about pedigree and persistence of the publication itself, of the source
and the host. If it's resident in PubMed Central, which is legitimated
through its association with the NIH, we feel pretty comfortable and relaxed.
If it's sitting in my mother's server at home, well we may or may not
have some grounds for concern! There are issues of implied or perceived
quality - are these outputs? Remember it could be Nancy Kerrigan's overhead,
it could be scholarly skywriting, it could be the latest version of my
paper sitting in granny's server. Are they covered by abstracting and indexing
services, historically a good indicator of presumed quality? Has any of
the work been funded by agencies such as the NIH or NSF? So what are the
kinds of indicators of perceived quality that we're going to want to rely
upon or invoke when dealing with multi-modal, promiscuous output from
publishing? And whose imprimatur undergirds these units? Have they been
subjected to full peer review or is it rather laid back? Is there no peer
review whatsoever? We need to understand the significance of different
peer-review apparatus and paraphernalia in different disciplines, fields,
and sub-fields.
Differing cultures in different fields of science
One of our
colleagues asked me why the Los Alamos model hasn't been adopted by every
other field around the world. The answer is quite simply that the structure
of high energy particle physics research is fundamentally different from
biomedicine, oceanography, or fruit fly research. By that I mean the nature
of the collaboration, the intense level of internal, intramural, intra-group
review, and the ground rules, procedures, and processes which are formally
in place before work gets released. It is a very self-knowing community. It's
unlikely that dubious work, flawed work, or fraudulent work will escape
in a way that, one has to acknowledge, happens in medical research. And
it is the intense social relationships and the material practices of those
people who are requesting beam time and trying to find the next particle,
that allow an initiative such as the Los Alamos e-print archive to develop.
There is such inherent credibility in the work being produced even though
it may not have gone through the formal, final stages of peer review.
That model technologically could be replicated in other fields but sociologically
it may not be acceptable and so one has to acknowledge the socio-cognitive
differences across fields and sub-fields and how those psychic differences,
cognitive differences, behavioural differences will either accelerate
or delay the adoption of new modes of publishing, storing and archiving,
and communicating information.
The last
point in this has to do with the prevailing reward system in scientific
fields. What counts in one field may not count in another. For example,
in computer science conference papers are taken fairly seriously and,
in the context of promotion and tenure, count for something. In other
fields, such as information science, we look rather less favourably upon
conference proceedings and papers. What counts in medicine and what counts
in other fields will be reflected differently in the rewards systems of
those fields.
What value a citation?
Many people
challenge some of the claims made for citation data in evaluative contexts,
whether evaluating journals using impact factors or evaluating research
groups through citation counts. Nonetheless there is a considerable
body of rigorous research that shows by and large that citations are perhaps
the single most useful indicator of something and that something may be
impact, may be utility, may, at a stretch, be quality and, of course,
it may be lots of other things that are not quite so acceptable. But what
exactly are bibliometricians measuring? Is it quality? Is it faddishness?
Is it a critical flash in the pan? What exactly does a citation signify?
These questions
have not been answered to everybody's satisfaction. Some citations are
negative, that doesn't matter - at least I took the time to say that your
work was flawed, the fact that I selected you is in itself revealing.
But what do citations tell us? Is being linked to analogous with being
heavily cited? If so, how might such measures be used in conjunction with
others to develop new metrics for the age of the world wide web? How
do we determine what is substantive and what is transient in the world
of volatile electronic publishing and posting? Do we utterly ignore motivations
- I don't care why you voted for Reagan, I don't care why you voted for
Thatcher, I'm simply looking at the broad distribution of votes. And then
if we're going to be looking at metrics in the context of the web there
are serious issues as to the reliability of the search engines. Run a
search on AltaVista today on subject X, run it twelve hours later, and
you're going to get a very different output. Run a search using AltaVista
and a search using HotBot and you'll get very different outputs. The coverage
of these engines is partial and so if you're going to rely upon indicators
or measures derived from the web you need to take account of reliability.
Are we going to weight all indicators of an individual's presence on the
web equally? Same question in citations, are all citations equal? If Dr
Varmus cites my work I'll feel good, if some third rate masters student
does I might feel less good, yet in counting citations we count the Varmus
one the same way that we count the masters student. If somebody mentions
me in a news group, if somebody mentions me in a conference programme,
if I'm the subject of an animated discussion on some issue, is that significant,
is it trivial? Who determines and how do we weight? I don't have the answers
and nor am I proposing any; I'm simply saying that with the advent of the
transparency of the web, with the availability of all kinds of indicators
of people's cognitive and social activity, the opportunity for counting,
for name-numerology, is rampant.
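The point about partial and differing search engine coverage can be made concrete by measuring how much two result sets overlap. This is a minimal sketch using the Jaccard index (shared results divided by all distinct results); the function name result_overlap and the placeholder URLs are assumptions for the example, not output from any real engine.

```python
def result_overlap(results_a, results_b):
    """Jaccard overlap of two result lists: shared items / all distinct items."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 1.0  # two empty result sets are treated as identical
    return len(a & b) / len(a | b)


# Placeholder result lists standing in for two engines' answers to one query.
engine_one = ["url1", "url2", "url3", "url4"]
engine_two = ["url3", "url4", "url5"]
overlap = result_overlap(engine_one, engine_two)
print(overlap)  # 2 shared results out of 5 distinct ones
```

A low overlap between engines, or between two runs of the same engine hours apart, is a warning that any count derived from either one is a sample of the web rather than a census.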
And the last
thing: Stephen Adler coined the term "Slashdot effect" to describe
the surge or swarming behaviour that takes place when everybody gallops
to a particular URL. You saw it during the Super Bowl. The Victoria's Secret
web site crashed as a result of the number of people who raced out of
their rooms having seen the TV advert during the Super Bowl and logged
onto Victoria's Secret. What does that tell us? But what are the equivalents
of Victoria's Secret hits in the world of scholarship?
The Clever
project folk at IBM have been looking at the web in terms of the pattern
of linkages to and from sites and they talk in terms of authorities and
hubs, but what's most interesting is that, if you read their work, they explicitly
acknowledge the Science Citation Index, they explicitly acknowledge the
contributions of Garfield, and they explicitly acknowledge the utility,
from their point of view, of the journal impact factor as a measure which
could be ported across to the world wide web.
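The hubs-and-authorities idea can be sketched as a simple iteration: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. The code below is a minimal illustration of that scheme under stated assumptions, not the Clever project's actual implementation; the function name hits, the fixed iteration count, and the tiny link graph are all invented for the example.

```python
def normalise(scores):
    """Scale scores so that their squares sum to one."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auths = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auths[target] += hubs[source]
        auths = normalise(auths)
        # Hub score: sum of the authority scores of the pages linked to.
        hubs = normalise(
            {page: sum(auths[t] for t in links.get(page, [])) for page in pages}
        )
    return hubs, auths


# Hypothetical link graph: three linking pages and two papers.
web = {
    "portal": ["paperA", "paperB"],
    "review": ["paperA", "paperB"],
    "blog": ["paperA"],
}
hub_scores, authority_scores = hits(web)
print(max(authority_scores, key=authority_scores.get))
```

In this toy graph the most linked-to paper emerges as the top authority, which is the web analogue of a heavily cited article, while the pages that point to both papers score highest as hubs.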
Prospect mining, web style
The last
thing - and I don't have any concrete examples - is that I suspect we'll
see a battery of tool sets and engines that go beyond today's search engines
and allow us to prospect. Where are the ideas germinating that haven't
quite worked their way into formal literature? What about those post docs
who don't have the social status and access to resources to get their
ideas across? With new modes of publishing, with new modes of communication,
we're going to be able to see perhaps early signs of subjects of interest
or incipient trends in the conduct of science.
And so I
conclude with the two words: "indicator mining". I can imagine
X years from now that someone will be developing tools that will allow
us to marshal these distributed goat's droppings way beyond acknowledgements
and citation counts. There will be richer, multi-dimensional pictures
of individuals' presence in particular scientific communities and of the differential
and multifarious impacts their scholarship is having upon their peer communities.
Questions from the floor
Questioner 1: I'm interested to hear your thoughts on including acknowledgement
and contributorship as a way of helping to judge people's value on the
web. One of the campaigns that I am working on is open peer review and
trying to find ways of rewarding peer reviewers for the work they do.
It seems to me that the way that you're suggesting is exactly a way to
do that. That if open peer review exists on the web - a signed commentary
attached to a paper - the peer reviewer can then be rewarded in the way
that you are indicating for the work they're doing.
Professor Blaise Cronin: I have done a number of very tedious analyses of
acknowledgements in different fields. I looked at them over the years, gathering literally
tens of thousands of acknowledgements to look for patterns and to see
whether there was evidence that there is a population of mentors who may
not be highly visible in terms of publication and citation output but
who clearly are having instrumental effects upon the development and growth
of their field. We have done rank ordered analyses of these individuals
and I have done some very vulgar and crude correlations of ranking acknowledgees
with rankings of citations. In other words, looking at people who are
highly cited and people who are highly acknowledged and it doesn't always
work out. There's a Russian astrophysicist who has created an astronomy
acknowledgement index and he said, after some initial scepticism, it's
been very warmly received within the astronomy community. And I mentioned
in passing earlier on the telescope operators. They were one of the groups
that felt vindicated, perhaps validated, it was almost a legitimisation
of their existence. They are the unsung heroes and heroines of our working
lives - it doesn't attract kudos and visibility but without people fixing,
managing, operating, and setting telescopes, the real astronomy, the science,
doesn't get done. They were delighted that somebody had actually produced
what was in fact a public register of their contribution.
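The crude correlations between acknowledgement rankings and citation rankings described above could be computed along the following lines. This is a minimal sketch of Spearman's rank correlation, assuming no tied scores and at least two people in common; the names "P" to "S" and all the counts are made up, and this is not the analysis actually carried out.

```python
def ranks(scores):
    """Rank names by descending score (1 = highest); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(scores_x, scores_y):
    """Spearman rank correlation over the names present in both score sets."""
    names = sorted(set(scores_x) & set(scores_y))
    rank_x = ranks({name: scores_x[name] for name in names})
    rank_y = ranks({name: scores_y[name] for name in names})
    n = len(names)  # must be at least 2 for the formula below
    d_squared = sum((rank_x[name] - rank_y[name]) ** 2 for name in names)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented counts for four hypothetical researchers.
citations = {"P": 120, "Q": 80, "R": 15, "S": 5}
acknowledgements = {"P": 3, "Q": 10, "R": 40, "S": 25}
rho = spearman(citations, acknowledgements)
print(rho)
```

A value near 1 would mean the highly cited are also the highly acknowledged; a value near zero or below, as in this invented data, would point to a population of heavily acknowledged people who are largely invisible in citation counts.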
I'm not suggesting
we start counting these things and that people's annual pay rises are
based on being acknowledged 2.7 times or 9.3 times. That trivialises the
potential. But it does provide some social recognition, if nothing else,
for people who make significant contributions. Sociologists talk in terms
of trusted assessorship for this phenomenon. So yes, one could imagine
in the context of open peer review, if I take the trouble to critique
constructively somebody else's paper it may be archived for us all to
see. That is how things like Epinions on the web work. They have ratings.
There are people who offer opinions and they struggle with each other to
be the most highly vaunted, highly cited opinion giver. They do it for
nothing but it matters. The prestige, the status, so the social factors
in this are much more important than some of the rather crude quantitative
aspects which could be exploited. So I really do see it as throwing a
sort of search light on some of the shadow lands of scientific research
so that all those who make contributions can receive the due recompense.