Accuracy in Electronic Archives:
An Investigation

By Bruce William Oakley

Knight Foundation Copy Editing Fellow
University of North Carolina at Chapel Hill
April 1997

The author is currently the editor of Arkansas Online for the Arkansas Democrat-Gazette
in Little Rock, Arkansas. His email address is bruce_oakley@adg.ardemgaz.com

Introduction

The electronic age has made it easier than ever to get more information faster to flesh out newspaper articles. But this places an ever-increasing and ever more complex burden of quality control on newspaper librarians. A four-month project at the University of North Carolina at Chapel Hill looked into the question of whether electronic archives are saving or suffocating good journalism. Let the story begin:

Brummett's bewilderment

John Brummett knew he'd made a mistake, and now it was staring him in the face. Brummett, a political columnist at the Arkansas Democrat-Gazette in Little Rock, couldn't let this kind of mistake be published: A responsible journalist mustn't write that a public figure served jail time when in fact the conviction was overturned and the named party had never been behind bars during his appeal!

Fortunately, Brummett realized his error early in the process, grabbed the page proof he was reading and dutifully marked the incorrect passage. He turned the proof over to the op-ed page designer, the corrections were made and all was well.

Or so Brummett thought.

Several days later, he was reading a letter from a lawyer for the public figure, quoting the excised passage and demanding corrective action. Brummett was stunned to learn that the offending language, which never appeared in newsprint, had made it into the paper's archives in the Lexis/Nexis database.

How did it happen?

The Democrat-Gazette had mixed computer platforms, with programmed translation of material that crossed a ''bridge'' from the editorial to the design platform. Corrections from page proofs were entered on the design platform and did not go back across the bridge to the editorial system. The archival capture came after the material crossed the bridge, but did not incorporate these final page proof corrections. Thus, the paper in effect transmitted a draft to Nexis, and inaccurate material that the paper never printed was available in its electronic archives to anyone with a modem and a few minutes to explore cyberspace.

Is Brummett's problem the rarest of disasters or the tip of the iceberg? Salvation or suffocation?

Perceiving and pursuing a problem

Four months of investigation show that misunderstood technology, misguided assumptions, poor planning and plain inattention play roles in dirtying the electronic archives of our nation's newspapers.

The investigation, during a fellowship at the University of North Carolina at Chapel Hill under a grant from the John S. and James L. Knight Foundation, aimed to determine the scope of the problem.

A survey of newspaper librarians sought to assess the systems by which newspapers deal with electronic archiving: How many have badly mixed technology that practically builds inaccuracy into their systems, and how many insist on rigorous quality control and careful programming?

A comparison of electronic archives to newsprint and a search of electronic archives for selected published corrections tried to give a hint of the frequency of accuracy problems in electronic archiving.

Quality control

Accuracy is the very essence of archiving: An archive is a reliable storehouse of information that helps those who use it build on history. Librarians have long emphasized rigorous quality control as the key element in planning, setting up and maintaining an electronic archive. Librarians have been warning of the pitfalls since the very advent of electronic archiving with The Toronto Globe and Mail's full-text online archive in 1977.

The complexity of the problem is summed up beautifully on page 228 of News Media Libraries: A Management Handbook, published in 1993 by Greenwood Press and edited by Barbara Semonche:

No one wants an inaccurate database. Everyone makes mistakes. These two indisputable statements add up to the necessity for quality control carried out on a regular basis.

The solution is difficult (and expensive) to apply. Rigorous, relentless quality control takes a lot of work by a lot of people:

A daily quality control list of stories added to the database the previous day, faithfully checked for typographical errors and incorrect page or sections information, is a start for an adequate quality control procedure. If it is possible, the quality control assistant should not be one of the persons who enhanced that issue of the paper, because it is easier to spot errors in a first-time reading. ...

Quality control is sometimes a seemingly expendable task if the library staff is overburdened, as most are, but it is important, if not vital, for the integrity of the database. It is the final step toward a complete and excellent online system.

If newspapers can't make that investment at the back end, the system must be set up for maximum efficiency at the beginning. Again this means involving many: Editors, library staff and network administrators must have a voice in developing the database software. Only with a full understanding of the information flow through the system can a reliable, efficient electronic archive be created.

So, are papers investing in this kind of quality control at the back end? Are they involving all of the right people at the front end? Four months of contacts with librarians from North Carolina to California indicate the levels of commitment are oceans apart.

Degrees of separation

At one daily in a Top 50 metro market, the librarian has been placed at the service of a newsroom research director who is not an information science professional. Attendance at costly professional conferences has been ruled out, and the librarian is left to contend alone with issues of archival integrity.

The computer system is a mix of platforms, and items that arrive in the library queue may not contain all page proof corrections. Transmission to the library queue also is not faultless. The librarian and an assistant spend most of their day in retrieval and pre-archive markup of published material.

Some enhancing is done, with the attachment of keywords (or descriptors) to each file. These keywords, however, are not chosen from a carefully controlled list (or thesaurus), but are assigned by the enhancer after a quick read of the material. Anyone looking for such articles by keyword must be of the same mind-set as the enhancer, which research has shown is problematic even when using a thesaurus: A 1992 study by University of Minnesota graduate student Mark Neuzil found that descriptor searches using three metro papers' lists produced no better than 55 percent of the stories that a full-text search for these terms uncovered in the papers' 1989 archives.
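
The arithmetic behind a figure like that 55 percent is simple recall: of the stories a full-text search finds, what share does the descriptor search also find? Below is a minimal sketch of the calculation. The function name and the story IDs are hypothetical, invented for illustration, and are not data from Neuzil's study.

```python
def descriptor_recall(descriptor_hits, fulltext_hits):
    """Share of full-text hits that the descriptor search also found."""
    fulltext = set(fulltext_hits)
    if not fulltext:
        return 0.0
    found = set(descriptor_hits) & fulltext
    return len(found) / len(fulltext)


# Hypothetical story IDs, invented for illustration only.
fulltext_hits = ["a1", "a2", "a3", "a4"]
descriptor_hits = ["a1", "a3"]
print(f"{descriptor_recall(descriptor_hits, fulltext_hits):.0%}")
```

A library could run the same comparison on a sample of its own keyword terms to see how much an uncontrolled descriptor list is costing searchers.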

With retrieval, markup and enhancement absorbing so much time at this paper's library, quality control is negligible. A quick check of the headline, lede and end of the article is all that time allows. The librarian sighs and says microfilm or microfiche is the archive of record anyway, as an uncorrupted physical reproduction of the newspaper. The problem, of course, is that the newsroom does not turn to microfilm when searching for background.

The librarian laments that there is no way for her small staff to take about 100 articles a day and ''edit them line by line and get it right. We're lucky to have enough time to check the headline and the lede.''

Corrections are appended at the tops of stories they apply to, and stories fetched from the archive always open at the beginning of the file, but it is up to any editorial staff using the archive to deal responsibly with correction information. Fortunately, there have been no reported instances of corrupt information being retrieved and making it into newsprint.

These problems are all well-recognized and long familiar to librarians, and many newspapers have already found some solutions. But this librarian, cut off from professional associations and most other contact with members of her profession, cannot benefit from the experience of others. This archive is trouble in waiting.

Democrat-Gazette in the basement

At the author's employer, the Arkansas Democrat-Gazette, library manager Alfred Thomas and his staff of three work in the basement, removed from the third-floor newsroom. The library does not have a terminal with direct access to newsroom articles -- it is a receiving point only, for articles awaiting archiving and transmission to Lexis/Nexis. The librarians generally no longer search Nexis, because of budget concerns. The newsroom does use a Nexis link.

A new computer system was installed in late 1996 that was supposed to eliminate some of the problems of the previous system, which required a bridge from the System Integrators Inc. (SII) editing system to the Macintosh-based design platform. Under the old system, archival capture came at a stage that did not always include final proofing corrections and so effectively created an archive of polished drafts, not final copy. The new system was also supposed to eliminate the need for time-consuming searches for items that didn't make it to the library queue. This has not been the case, according to librarians. In fact, librarians said the number of missing stories has risen from perhaps three each week to four each day. With a limited staff, search time cuts into time for quality control or research. Quality control of daily archiving consists of a quick visual check of the headline, the lede and the last paragraph to make sure that the entire file is there. The library does not generally verify that final page proof corrections have been incorporated in the archival feed. The choice of archiving software was made by systems people and ran counter to Thomas' preference.

Also, for reasons not yet fathomed, duplicates appear when anyone attempts to retrieve material from the in-house archive. This is a particular nuisance in a full-text display, where a reporter ends up scrolling twice through a 50-inch article.

Library staff feels distanced from the newsroom and unable to devote much attention to research and resources. Librarians would like their department to be more of a resource center, but even acquiring an updated set of encyclopedias has proven difficult. (The paper recently acquired a new set of encyclopedias, but put them in the newsroom. So, the librarians trained in research aren't readily able to use them, and reporters and editors don't have the advantage of specialists' assistance.) Standard desktop references such as legal and medical dictionaries, Bartlett's, and film and music guides are also in the newsroom. Only specialized research volumes and indexes are exclusive to the library.

The problems of this setup came home to roost in November with Brummett's disaster. Discovering the error actually revealed another layer of problems. Getting the error purged from Nexis took long and insistent effort. First, the version that was actually printed was queued up and transmitted. A check of Nexis revealed that this correct version appeared, but the prior version had not been removed. Instead of one wrong version, there were conflicting versions. A second prolonged stint on the phone followed before the never-printed version was finally purged.

Corrections in general require complicated effort at the Democrat-Gazette. A written policy prescribes standard newsroom procedures for ensuring that corrections are attached to archival versions, but librarians describe their part of it as an eight- or nine-step process that has grown more rather than less complicated with advancing technology. The process sometimes includes retyping the correction rather than electronically cutting and pasting -- again introducing greater chance of new error.

Tracking trouble

Brummett's case is the worst the Democrat-Gazette has faced, but a check of a few days of section-front material in Nexis suggests that everything bad that can happen in this setup does happen. A review of material published Sept. 15, 1996 (chosen because Sept. 15 is the author's birthday), on the paper's old mixed-platform computer system found a caption with typos that did not appear in the print edition: ''seeking'' became ''speeking'' and ''miracle'' became ''milracle.''

The archival versions of stories published on these days also did not include paragraph breaks. The last sentence of each paragraph ran together with no space between its period and the first word of the next paragraph. This leaves no particular confusion about meaning, but it is an awful nuisance to a reader.
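
A run-together paragraph of this kind leaves a recognizable trace: a period followed immediately by a capital letter. The sketch below shows one way a library might flag such spots automatically. It is a heuristic under stated assumptions, not a description of any tool the Democrat-Gazette or Nexis used, and the pattern and function names are hypothetical.

```python
import re

# A sentence-ending period followed immediately by a capital letter,
# e.g. "passed.The", suggests a lost paragraph break. Requiring two or
# more lowercase letters before the period skips initials such as "U.S."
# This heuristic is an assumption, not a rule from any archive vendor.
LOST_BREAK = re.compile(r"[a-z]{2,}\.(?=[A-Z])")


def find_lost_breaks(text):
    """Return character offsets where a paragraph break was probably lost."""
    return [m.end() for m in LOST_BREAK.finditer(text)]


def restore_breaks(text):
    """Insert a blank line at each suspected lost paragraph break."""
    return LOST_BREAK.sub(lambda m: m.group(0) + "\n\n", text)
```

Any automatic repair would still need a human check against newsprint, since the pattern cannot tell a lost paragraph break from a mere missing space between sentences.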

The archived version of the front page ''In the News'' compendium of briefs did not include final page proof corrections -- ''Santa Ana'' was spelled correctly in the final (City) edition but was captured as ''Santa Anna'' in the archival version.

Still another article appeared in two versions in the archive. Under the headline ''It's Hope against Hope,'' one version was tagged by the archive as 1,154 words; the other, 834 words. Neither version gave any indication of which edition it appeared in or how the versions differed. And neither of these versions matched the City (final) edition, which was still longer. (The Democrat-Gazette librarians, like most newspaper librarians, generally save either the final edition or the longest version -- or both, but in this case they got neither.)

This was not the only article for this day available in two versions. Under the headline ''Convict's rehab center raises questions,'' one version of a 1A story ran 1,818 words and the other 1,848 words. In this case, the difference was clear -- and frightening. The extra 30 words in the longer version included the notice in the header that this version was a corrected version! The correction, attached at the top, fixed a name appearing a few paragraphs from the end of the lengthy piece. By the time a searcher scrolled to the point of error, however, this notice would no longer be visible and there is no reminder at the point of error.

And the original version without the correction notice remained in the archive!

The same thing occurred with a 1B story under the headline ''2 challenging incumbent for LR school board seat.'' One version contained a correction notice dated Sept. 18, 1996, amending information about a candidate's academic degree. But the original remained available unmarked in any way.
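
Conflicts like these can be hunted systematically rather than stumbled upon. The sketch below groups archive records by headline and reports any headline with more than one version on file, noting whether an uncorrected copy sits beside a corrected one. The record layout (the "headline", "words" and "corrected" fields) is hypothetical; the word counts in the example are the ones reported above.

```python
from collections import defaultdict


def find_conflicting_versions(records):
    """Group archive records by headline and report headlines that
    have more than one version on file."""
    by_headline = defaultdict(list)
    for rec in records:
        # Normalize case and spacing so near-identical headlines group together.
        by_headline[rec["headline"].strip().lower()].append(rec)

    report = {}
    for key, versions in by_headline.items():
        if len(versions) > 1:
            flags = [v["corrected"] for v in versions]
            report[key] = {
                "word_counts": sorted(v["words"] for v in versions),
                # True when a corrected copy and an uncorrected copy coexist.
                "uncorrected_copy_remains": any(flags) and not all(flags),
            }
    return report


# Field names are hypothetical; word counts come from the Sept. 15 review.
records = [
    {"headline": "Convict's rehab center raises questions", "words": 1818, "corrected": False},
    {"headline": "Convict's rehab center raises questions", "words": 1848, "corrected": True},
    {"headline": "It's Hope against Hope", "words": 1154, "corrected": False},
    {"headline": "It's Hope against Hope", "words": 834, "corrected": False},
]
report = find_conflicting_versions(records)
```

Grouping on the headline alone is a simplification: a retyped headline with a typo would escape the grouping, which is one more argument against retyping.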

Switching to a seamless computer system has helped, but some new problems have arisen and some of the old ones linger.

A check of 1A and 1B articles published Jan. 17-18, 1997 (dates chosen at random), found two articles with headline typos in the commercial database. The archive showed a front page article with a name in a headline that did not match newsprint: It had been mangled from ''Flanagin'' (already an unusual spelling of a fairly familiar surname) to ''Falagin.'' In this same archive headline, the surname ''huckabee'' had been incorrectly typed with a lower case h. The lack of capitalization may not confuse anyone about the name, but it could mean a miss in a case-sensitive headline search -- and it simply doesn't match what was printed! This headline had been stripped somehow from the electronic version transmitted to the library, and had been badly rekeyed. A headline on another article this day also was obviously retyped, because ''souls'' in print became ''sould'' in the archive.

Why, in this advanced electronic age, does our industry spend so much time retyping? How, in this litigious age, do we dare?

These articles published in January fall under the Democrat-Gazette's new computer system, which is supposed to provide seamless operation. The articles checked under this system did match newsprint in nearly all cases, with the exception of the headlines mentioned. As it turns out, different designers build their pagination documents in different ways, and some of these methods disconnect headlines from body type, or section-front segments from jumps. Also, it seems the program that strips typesetting markup removes headlines under some circumstances.

New technology poses new problems.

Right from the start -- Raleigh

In the Democrat-Gazette's case, errors arose at every conceivable stage of archiving: Electronic capture at the wrong time; code stripping that adulterates text and eventually leads to what should be unnecessary retyping that then is poorly proofread; and inadequate policing of the archive after publication of a correction. Quality control at every stage is the answer, and that takes people and resources.

One paper that does it right is The News & Observer in Raleigh, N.C. The N&O has 20 employees in four divisions in its library covering archiving, periodicals and resources, newsroom research and public research. The public research is kept separate from the newsroom and conducted in a different building to avoid questions of conflict of interest.

The entire newsroom library operation is highly automated, but every step is carefully monitored. A computer script fetches every article in the pagination system after the press run, translates it from a pagination file to a plain-text file and attaches an in-house-designed header with archiving fields. These files are then sorted into computer folders by section. A team of four enhancers (three on weekends) then goes to work, scanning header fields and verifying their content, then checking paragraph by paragraph for a match to newsprint. Keywords are attached from a thesaurus of about 100 terms.

The enhancers work from a story list called a ''slug report'' that helps them verify header fields and keep an accurate count of their work. Computer scripts written by a former enhancer who now works in systems at the paper also are employed to check coding in the date and summary fields. In all, enhancers handle 140-145 articles each weekday, 200-235 each on Saturdays and Sundays. During the author's visit to the library March 2, 1997, the slug report showed only one headline dropped by the translator script over the course of several days. (Consider that in a check of only two section covers of a randomly chosen day's archive of the Democrat-Gazette, two headlines were found to have been retyped by librarians.)
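
The N&O's own scripts are not reproduced here, but a field check of this general kind might look like the sketch below. Everything in it is an assumption made for illustration: the field names, the date format and the function name are hypothetical and do not describe the paper's actual header design.

```python
from datetime import datetime

# Hypothetical field names; the N&O's real header layout is not documented here.
REQUIRED_FIELDS = ("headline", "date", "summary")


def check_header(header):
    """Return a list of problems found in one article's header fields."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not header.get(field, "").strip():
            problems.append(f"missing {field}")

    date_text = header.get("date", "").strip()
    if date_text:
        try:
            # Assumed date format (YYYY-MM-DD); impossible dates raise ValueError.
            datetime.strptime(date_text, "%Y-%m-%d")
        except ValueError:
            problems.append(f"bad date: {date_text}")
    return problems
```

A check like this catches a dropped headline or a malformed date before an enhancer ever opens the file, leaving human attention for the paragraph-by-paragraph match to newsprint.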

Occasionally, the Raleigh scripts also place bylines in the wrong header field, but even here enhancers merely cut and paste electronically rather than retype. The most significant other problem with the translation script is handling fractions. Any article that contains fractions requires a careful check of newsprint against the electronic version, and the fractions must be typed in over the inevitable mistranslation.

Handling of corrections remains a bit cumbersome. The correction itself is archived, but the correction and the original article are not simply appended electronically. The original must be called up on a computer terminal and the correction retyped at the top. The amended article is then refiled and the entire year's archival database ''reindexed.'' This reindexing runs in computer background and can take several hours as the year passes and the database grows. The amended version is then sent by file transfer protocol to the various commercial databases on which the archives appear.

Overall, the bugs are few, but still galling to a staff that prides itself on unceasing quality control.

The process is so refined because archive manager Colline Roberts had a strong, direct hand in development of the computer scripts when the archive came on line in 1994 and Macintosh-based pagination followed half a year later. She helped design the header format and worked closely with systems programmers as they automated processes. An enhancer with programming propensities wrote some small scripts to handle repetitive cleanup tasks and eventually worked his way into a systems job -- with the stipulation that he remain especially available for library technical support.

This is a library at a paper committed to a top-flight archive.

Diligence in distress -- Riverside

The complexity of our age makes this sort of commitment essential. Problems can arise even when all procedures are followed to the letter and everything is done just so. For the library staff of The Press-Enterprise in Riverside, Calif., the commitment to excellence has meant intense effort to get things right even after material goes out to commercial databases.

While tracking some other concerns, library director Jackie Chamberlain learned in mid-1996 that The Orange County Register was having trouble with parts of stories missing from the Nexis database. Chamberlain found a similar problem at The Press-Enterprise and in chasing down the failed transmissions came across a number of truncated stories. Lots of investigation, e-mailing and a two-hour teleconference involving Nexis revealed that long stories that had corrections appended or that were resent with fixes became truncated on retransmission.

Riverside uses Basis/DataTimes software, and its files are sent in chunks that are reassembled at the receiving point. Basis users also can transmit just the pieces of an article that update or are amended and have them woven into the original. At Nexis, this reassembly did not proceed properly with the Basis files. Tops of long articles never got their bottoms. Worse, it turned out that corrections attached at the top in Basis were moved to the bottom in Nexis, but since the bottoms and tops never were rejoined, articles not only were truncated but also were left without their corrections.
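
The failure described here is a reassembly step that quietly accepts a partial article. As a minimal sketch of the safeguard, assuming a hypothetical chunk format in which each piece carries its sequence number and the total count, a receiving program could refuse to file anything that is missing a piece. Neither the Basis nor the Nexis file format is documented in this report, so the field names below are invented.

```python
def reassemble(chunks):
    """Join an article's chunks in order, refusing to return a partial text.

    Each chunk is a dict with hypothetical fields: "seq" (1-based position),
    "total" (number of chunks in the article) and "text".
    """
    if not chunks:
        raise ValueError("no chunks received")
    total = chunks[0]["total"]
    by_seq = {c["seq"]: c["text"] for c in chunks}
    missing = [n for n in range(1, total + 1) if n not in by_seq]
    if missing:
        # Fail loudly instead of filing a truncated story.
        raise ValueError(f"missing chunks: {missing}")
    return "".join(by_seq[n] for n in range(1, total + 1))
```

The design choice is to raise an error rather than return what arrived, so a top without its bottom, or a story without its correction, never reaches searchers unnoticed.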

This raises inevitable questions of legal liability: Whose neck is on the line if someone happens to go to the database and retrieve an article that's missing its correction? In wrestling with the problem, the earliest solution was to send the entire text of any such articles, not just the revised segments. There was still a problem of reassembly, however, and eventually Nexis wrote a software patch to handle it. But the user still must be sure to resend all the pieces of articles corrected in any way.

The problem that remains is that any files sent before the software patch was developed may still be corrupt. Basis users could face a needle-in-a-haystack search through their archives or transmission lists for such corrupted files, but a simpler method is to retransmit entire batches of files. The new versions then supplant the old, if the new versions are encoded with the same access number or identification stamp as the old.
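
The condition in that last sentence is the crux, and a small sketch shows why. If the archive is keyed by access number, a resent file replaces the old one only when the keys match; with a different key, the corrupt copy stays on file beside the new one. The dictionary layout and the "access_no" field name are hypothetical, not Nexis's actual design.

```python
def apply_batch(archive, batch):
    """Load a retransmitted batch into an archive keyed by access number.

    Returns two lists of access numbers: those whose old versions were
    replaced, and those added as new entries.
    """
    replaced, added = [], []
    for article in batch:
        key = article["access_no"]  # hypothetical field name
        (replaced if key in archive else added).append(key)
        archive[key] = article  # same key: the new version supplants the old
    return replaced, added
```

After a batch retransmission, any access number that turns up in the "added" list rather than the "replaced" list deserves a look, because it may mean an old version is still retrievable under another number.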

During the course of her investigation, Chamberlain alerted other librarians across the country and kept in steady contact with Nexis, searching for an industry solution to what she might have looked at as a local problem. And she got lots of assistance from Jim Zikratch of the paper's technical staff. Quality control must be the concern of all, or there is no quality control.

So, Press-Enterprise librarians have one more item on their to-do list. Chamberlain now does occasional verification of the commercial database. Nexis has helped by providing an early morning e-mail report on updates. The report alerts The Press-Enterprise to any reassembly problems so they can be fixed. Chamberlain believes that these remaining stitching problems arise from librarians occasionally forgetting the new need to resend all the pieces of an amended article.

The watch never ends.

Watching and waiting -- Winston-Salem

So, what can anyone take away as a lesson from watching such situations unfold? Well, for library director Ginny Hauswald at The Winston-Salem (N.C.) Journal, the message may be that patience is a virtue. The Journal (circulation 92,000 daily, 102,000 Sunday) is due to put its archive on line in June 1997. The move to an electronic text archive has been on hold for a few years during beta testing at the paper of pagination software and installation of a digital photo archive.

In the meantime, though, Hauswald and other newsroom leaders have been attending conferences and expositions and reviewing various archiving programs. She speaks well of the cooperative spirit with which the move has been approached, though she admits she has had to be a fighter at times for some of her key concerns. She has a staff of four and is hoping to add a part-time on-call person.

Her trips to conferences and professional networking give her a good view of the road ahead. Specific questions already resolved include timing of the archival capture -- the archival feed will be the same as the feed to press. (A former managing editor insisted -- it's good to have support throughout the newsroom.) A thesaurus of descriptors will be used, and an index of newsclips with 20,000 entries that has been building since 1992 also will be included in the new electronic archive. A question that remains is whether there will be enough time under the new system for continued cross-training of library staff in research and database enhancement roles.

Knowing what other papers have gone through should help make for a smooth, efficient transition to a clean electronic archive.

Summary

This four-month investigation of electronic archiving uncovered problems at every step in the process, from the first capture of information to the last connection between a commercial database and a searcher. Problems were fewer and smaller at libraries with the most rigorous quality control, but no library was immune.

Technology puts more and more information in any given place with less and less human handling, but this creates a crying need for more and more human monitoring. In our communication business, it seems the computers are doing more of the communicating these days. But every time computers communicate, people need to be communicating, too -- talking with each other to do a better job of setting up the rules for these silicon connections and then monitoring these connections to keep them pure.

There is no rest in this task and no part of the newsroom can avoid sharing in it.

Surely, back at the Democrat-Gazette in November, columnist John Brummett was convinced he had done everything right. He had caught an error before it could be published and later seen the amended version in newsprint. How could he have imagined a system that would preserve the unfinished product?

In the newspaper business, one is constantly reminded to check one's assumptions and sources. Assume the archival capture comes after final page proof corrections? Better check. Assume headlines, captions and corrections are electronically cut and pasted where they belong rather than retyped before archiving? Better check. Assume corrected versions sent to a commercial database supplant incorrect ones? Better check. Assume that an article retrieved from a commercial database is actually what was put on newsprint? Better check!

Some will read the details of this investigation and see a great deal of extreme nit-picking. Perhaps many newspaper leaders will continue to bet that the sorts of errors uncovered here are not worth the investment in people and time needed to safeguard against them. But the investigation shows that at the very least newspapers should check their assumptions about the validity of their electronic archives. Searchers of these archives similarly need to ask some questions to establish the level of confidence they can have in a given database.

Quality control is everyone's job.

Data Analysis

This study of electronic archiving proceeded on several fronts and was designed as a broad-brush effort to uncover problems rather than as a rigorous scientific exploration of any single concern. The investigation included site visits to newsroom libraries, a survey posted on an Internet listserv, e-mail and phone contacts with librarians, line-by-line comparisons of electronic archives to newsprint and a search for selected published corrections. Each paper used for line-by-line comparison was also the subject of a site visit, so some conclusions may be drawn about the relationship between the library setup and the quality of the archives. Organizational details on the site visits, line-by-line comparisons and survey follow. Results are detailed in the next section.

LINE-BY-LINE COMPARISON

The front page and local covers of several newspapers from varied dates were compared word for word with commercial databases. The papers and dates were chosen capriciously for the most part, but occasionally with comparisons and convenience in mind. The mix ultimately incorporated libraries with varying degrees of sophistication using systems with varying degrees of seamlessness and allowed for some comparisons of databases of different vendors.

The papers and dates examined (with the reasoning behind the choices in parentheses), and the databases:

The Arkansas Democrat-Gazette (author's employer)
    Searched on Nexis:

The News & Observer in Raleigh, N.C. (library with strong reputation, within driving distance of Chapel Hill, N.C., for site visit)
    Searched on Nexis and DataTimes (Retrieved same articles from each for Jan. 17 for comparison):

The Charlotte Observer (home paper to another Knight fellow, has mixed computer system similar to Democrat-Gazette's old system and lies within driving distance for site visit)
    Searched on Dialog:

The Tennessean in Nashville (roughly halfway along route from Little Rock to Chapel Hill, convenient for site visit)
    Searched on DataTimes:

CORRECTIONS SEARCH

Corrections published in the Arkansas Democrat-Gazette on March 11, 12 and 15, 1997, were sought in Nexis archives to determine if the original incorrect information could still be retrieved without notification of the error. The publication dates were chosen because they coincided with a trip to Little Rock by the author.

SITE VISITS

Librarians answered questions, described their systems and demonstrated their procedures. The papers in this group are also the ones chosen for line-by-line comparisons. The visits helped give a sense of the work flow and load and also provided a glimpse of different software, staffing levels and procedures.

THE SURVEY QUESTIONNAIRE

A 29-question survey was posted on the Newslib listserv owned by Barbara Semonche, director of JoMC Library at UNC-CH, with two weeks allowed for replies by e-mail, facsimile or ''snail mail.'' The listserv has about 740 subscribers, including journalists, students, vendors and newspaper librarians. The survey drew responses from 12 libraries, including 11 newspapers (one Canadian) and one news service for a group of newspapers. Two sets of responses were generated from notes from the site visits to Raleigh and Little Rock for comparison to the unnamed respondents.

 
Survey questions:  
GENERAL 
1) Your paper's name: 
2) Web URL (if you have one): 
3) Commercial databases on which your archives are available:
4) Circulation of your newspaper (daily, Sunday): 
5) How many editions do you publish? 
6) How many people are employed in the newsroom? (Include 
	photo, graphics and special projects staff)
7) Identify your computer system(s) for editing, design, archiving. 

THE LIBRARY
8) How many people are on the library staff? (List their job titles.)
9) Do library shifts cover night hours? Weekends?
10) What archiving software does the library use?
11) If you have different platforms, which platform feeds the archives?
12) Whether you have mixed platforms or seamless system:
    a) Does the archival feed include final proof corrections?
    b) Who monitors the feed? Library staff? Editorial staff? Mixed team? 
    c) What quality control measures do you take to ensure the archives 
	match the printed product? 
13) Does library staff enhance archived articles with keywords?
14) Is there a thesaurus of keywords?
15) Are all editions archived and do all versions of an article that 
	updates over multiple editions make it into archives? 
16) If all editions and versions are not archived, what is the selection 
	process for determining which ones are? 
17) Does the newsroom have terminals with access to archival database? 
18) Does the library have terminals with access to the newsroom database? 

DATABASES 
19) If your archives are available on multiple commercial databases, 
	does the identical text appear in each database?
20) How is the transmission to the on-line database(s) handled? 
	Automated? Markup by librarians? By editors? By MIS personnel? 
	By vendor personnel? Other? 
21) What quality control is done at transmission and to verify transmission 
	and who handles it?   
22) Please describe the work flow and quality control at each step from 
	editor to librarian to archive.   

CORRECTIONS
23) Do you have written policies on publishing and archiving corrections?
    a) Please describe how corrections are attached. And are they placed 
	at the top, the bottom, the point of error, or is some other 
	method used? 
    b) Do you verify that a correction has reached the on-line database(s)? 
	If not, describe what other means are used for quality control.
24) What other kinds of correcting do you do in the database (e.g.,  
	checking notes, fixing validation errors, cleaning up typos in 
	the header fields) 

THE WEB
25) Do you have a web edition?
26) If not, is one planned?
27) Is the library involved in current or planned web edition? 
28) Describe any ways in which text on the web edition varies from the 
	newsprint one? For example, material removed because of length 
	might be restored before web publication.
29) Is the web edition archived? If so, how is it handled? Is there a 
	search engine? If so, please identify it.

The Survey Results

THE SURVEY

Staffing in the newsroom libraries responding generally runs about 1 for every 35 people in the newsroom, with the highest level better than 1 to 20 and the lowest around 1 to 60. Library staff size in the sample varies widely in relationship to circulation and number of editions.

The table below does not include the news service that responded, but its library has a staff of three in a newsroom of 35 employees. No respondent's paper or news service had encountered legal problems arising from corrupt archives.

Paper         Circulation            News    Lib.        Number
              Daily      Sunday      Staff   Staff       of editions
_____________________________________________________________________
1)            71,000     84,000      90      2           3
2)            211,500    307,000     175     5           3
3)            180,000    240,000     300     5           7
4)            166,605    173,533     238     7           8
5)            67,000     100,000     80      1 F/2 PT    2
6)            340,000    530,000     250     13          3
7)            240,000    300,000     270     8           3
8)            135,710    N/A         111     4           1
9)            375,000    850,000     260     7           3
10)           382,484    461,620     352     15          8
11)           240,000    440,000     235     8           3
12)           112,000    140,000     125     4 F/2 PT    6
13) Raleigh   169,028    196,434     250     3
14) LR        175,218    288,250     210     4           3

The Democrat-Gazette is the only paper in the group where the library does not have a newsroom terminal.
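
The staffing ratios cited for the survey respondents can be rechecked from the table with a few lines of arithmetic. The sketch below is only an illustration in Python: the dictionary name and function name are invented here, the two site-visit rows (13 and 14) are left out because the ratios describe the survey respondents, and papers 5 and 12 are left out because their part-time staff cannot be turned into a head count without an assumption.

```python
# Newsroom staff and library staff for survey respondents, copied from the table.
# Papers 5 and 12 (part-time library staff) and the site-visit rows are omitted.
STAFF = {
    1: (90, 2), 2: (175, 5), 3: (300, 5), 4: (238, 7),
    6: (250, 13), 7: (270, 8), 8: (111, 4), 9: (260, 7),
    10: (352, 15), 11: (235, 8),
}


def newsroom_per_librarian(newsroom: int, library: int) -> float:
    """Return the number of newsroom employees per library staff member."""
    return newsroom / library


ratios = {paper: newsroom_per_librarian(n, lib) for paper, (n, lib) in STAFF.items()}

# List papers from the best-staffed library to the leanest.
for paper, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"Paper {paper}: 1 library staffer per {ratio:.1f} newsroom employees")
```

On these figures, paper 6 has the richest staffing (better than 1 to 20) and paper 3 the leanest (1 to 60).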

Four respondents do not have archives available on commercial databases. Five have archives available on more than one database and a sixth is negotiating with a second vendor. Six are with DataTimes, five are with Lexis/Nexis, four with Dialog, two with NewsBank. Infomart is also represented.

Five respondents had Atex editorial systems, four had SII, three had Macintosh networks and the other had a PC-based system.

Eight of the respondents said that archival capture does not always incorporate final proof corrections. Of these, four reported extensive quality control at this stage by enhancers to counter the problem, three reported at least a quick scan at the enhancing stage, and one admitted the paper's approach was to ''hope for the best.''

Quality control in the overall process ranged from paragraph-by-paragraph checks at enhancing, along with software and human verification at nearly every technical stage, to relying on the hope that the software works. The commitment to quality control was independent of size: Papers large and small fell on each side of the divide.

All of the respondents except the news service and the Democrat-Gazette used keywords in enhancing, and only one did not use a thesaurus. Three papers archived all editions and all versions of an article updating across editions and a fourth paper microfilmed all editions and electronically archived the final. The remaining respondents generally archived either the latest or longest version of articles.

Librarians monitored the various communication points in the process at most papers, with systems personnel aiding at two places and editors playing a role at three.

Two respondents append corrections at the bottom of articles, with one of these using a special field and the other also filing the correction as a separate article. The remaining respondents place corrections at the top, with four using a special field. Of the respondents filing corrections at the top, one also files the correction as a separate piece and two place a ''See correction'' notice at the point of occurrence. Two respondents verify that a correction reached the commercial database, another respondent conducts spot checks and the others rely on software verification of transmission and on adherence to procedures.

SITE VISITS

Findings from the site visits are incorporated throughout the report and particularly in the detailed results of the line-by-line comparisons. Librarians and systems personnel encountered in all cases were informative, dedicated, polite and helpful.

LINE-BY-LINE COMPARISON

Arkansas Democrat-Gazette
The Arkansas Democrat-Gazette converted to an all-Macintosh computer system late in 1996. The former system linked a System Integrators Inc. (SII) editing platform to a Macintosh-based design platform. The switch seems to have reduced problems in the archive, but it has certainly not eliminated them.

The paper's electronic archive is found in the news library of Nexis under the file name arkdem. The file does not use ''page'' as a searchable field, so the search string took the form DATE IS MM-DD-YYYY. This yielded all of the given date's stories. The focus command .fo 1A or 1B winnowed the list of citations to articles on the appropriate section covers. Nexis files give word counts.

    Sept. 15, 1996:
Three front-page wire service articles do not appear in the archive. One article appearing on 1B in City (final) edition ran on the front page in an earlier regional edition and was archived as a 1A story, so it was not compared.

Page	Partial headline 		Word count                
1A	Convict's rehab center....	two versions -- 1,818/1,848
1A	Buildup in gulf...		753
1A	It's Hope against...		two versions -- 1,154/834
1A	In the news...			471
1B	Medical board allows...		697
1B	Playing the blues (caption)	150
1B	Four-car crash leaves...	236
1B	Henry sent in to give...	935
1B	Little Rock, Pulaski County...	two versions -- 889/930
1B	School board candidates...	869
Can't get a break: The archival versions of all articles on this date were missing paragraph breaks. The last word and period of each paragraph and first word of the succeeding paragraph were combined. Presumably, this is a problem with the Democrat-Gazette's code-stripping software or a problem in transmission to the database.
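
A lost paragraph break of this kind leaves a recognizable trace: a period followed immediately by a capital letter. The Python sketch below is a rough heuristic for flagging such spots, not a description of any software the paper used. The function name and the sample sentence are invented for illustration, and abbreviations or initials could still trip the pattern.

```python
import re

# A lowercase letter, sentence-ending punctuation, an optional closing quote,
# then a capital letter with no space in between, e.g. "night.Police".
FUSED_BREAK = re.compile(r"[a-z][.!?][\"']?(?=[A-Z])")


def find_fused_breaks(text: str) -> list[str]:
    """Return a short snippet around each suspected lost paragraph break."""
    snippets = []
    for match in FUSED_BREAK.finditer(text):
        start = max(match.start() - 10, 0)
        end = min(match.end() + 10, len(text))
        snippets.append(text[start:end])
    return snippets


# Invented sample text, not taken from the archive.
sample = "The crash closed the road Saturday night.Police said the driver was hurt."
print(find_fused_breaks(sample))
```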

Problem after problem: ''It's Hope against ...'' appeared in two versions, but no notation anywhere indicated which edition the versions came from or how or why they differed. Neither version exactly matched the City (final) newsprint edition used for comparison, which was longer yet. How is a searcher to sort this out without some guidance in some header or memo field?

In the longer version, the ''I'' in ''It's'' was missing from the headline. At the least, this means sloppy proofreading during archival preparation. A photo caption archived with the text misspells ''seeking'' as ''speeking'' and ''miracle'' as ''milracle.'' This indicates text that wasn't captured or appended electronically as it should have been and then was badly retyped and proofread.

Who's on first: ''Convict's rehab center ...'' appeared in two versions, the first listed at 1,818 words and the second at 1,848. The longer version contains a correction notice at the top (published Sept. 16, 1996, according to the notation) correcting the first name of a man quoted late in the article. Otherwise-duplicate versions with correction notices should supplant originals in commercial databases, because they ordinarily should be transmitted with the same file identifier used for the original. This method apparently was not used in this case. Searches also usually produce lists with the most recently filed version first, because the default arrangement is latest-date first. The search parameters in this case turned even that upside down, and the search yielded the corrupt version first, with no hint of the problem. Had the search not continued to the next citation, the error easily could have been retrieved and propagated.

''Little Rock, Pulaski County ...'' suffers this same problem. It appears in two versions on the archive, with the second, longer version including a correction dated Sept. 18, 1996, on the academic credentials of a man named late in the article. In this case, both versions also include two paragraphs not in the newsprint version, which suggests the page was changed and trims made after the archival capture.
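
The reason a shared file identifier makes a corrected version supplant the original can be shown with a toy model. The Python sketch below is an assumption-laden illustration, not the design of Nexis or of the paper's archiving software: the class names, the identifier "A-001" and the article text are all invented.

```python
from dataclasses import dataclass


@dataclass
class ArchivedArticle:
    file_id: str          # hypothetical identifier assigned at first transmission
    headline: str
    text: str
    correction: str = ""  # correction notice; empty if none


class Archive:
    """Toy in-memory archive keyed by file identifier."""

    def __init__(self) -> None:
        self._articles: dict[str, ArchivedArticle] = {}

    def transmit(self, article: ArchivedArticle) -> None:
        # Reusing the original file_id overwrites the earlier version,
        # so only one version of the article remains retrievable.
        self._articles[article.file_id] = article

    def search(self, word: str) -> list[ArchivedArticle]:
        """Return every stored article whose text contains the word."""
        return [a for a in self._articles.values() if word.lower() in a.text.lower()]


archive = Archive()
archive.transmit(ArchivedArticle("A-001", "Example headline", "Text quoting Jon Smith."))
archive.transmit(ArchivedArticle("A-001", "Example headline", "Text quoting Jon Smith.",
                                 correction="Correction: The first name is John."))
results = archive.search("smith")
print(len(results), results[0].correction)
```

Had the corrected version gone out under a new identifier, the same search would return both versions, which matches what was observed here.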

Timing and translation: ''Four-car crash leaves ...'' appears in the archive with a sentence that concludes ''was critical Saturday night in University Medical Center.'' The newsprint version reads ''was listed as critical ...'' (emphasis added). This is the sort of language a copy editor would have added late on a page proof. Similarly, the daily 1A ''In the news'' feature has ''Santa Ana'' misspelled ''Santa Anna'' although the newsprint edition has the correct spelling. The 471-word archival version of these briefs also includes a sentence that reads: ''Tilmer Everett, 25, was arrested, after Bismarck, N.D., police said the when he was...'' but ''after'' and ''the'' do not appear in newsprint. These errors again indicate archival capture not incorporating final proofreading corrections.
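
Differences like these were found by reading, but a word-level comparison can surface them mechanically when both texts exist in electronic form. The sketch below uses Python's standard difflib module; the function name is invented, and the newsprint sentence is assumed to continue exactly as the archived one does after the added words.

```python
import difflib


def word_diff(archive: str, newsprint: str) -> list[tuple[str, str, str]]:
    """List (operation, archive words, newsprint words) for each difference."""
    a, b = archive.split(), newsprint.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


archive_text = "was critical Saturday night in University Medical Center."
# Assumed continuation of the newsprint sentence after "was listed as critical".
newsprint_text = "was listed as critical Saturday night in University Medical Center."

print(word_diff(archive_text, newsprint_text))
```

An "insert" result means the words appear in newsprint but not in the archive, the pattern expected when late proof corrections miss the archival capture.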

Also, each item in this roundup of briefs begins in newsprint with a black box instead of a paragraph indent, and the boxes are normally translated to paragraph indents in the archive. In two cases here, however, the indents are missing and an ''m'' appears where the items run together. This indicates a computer translation problem.

Little things: ''Buildup in gulf ...'' had two odd paragraph breaks in the archive, one in the middle of a word and the other in the middle of a sentence, that do not appear in newsprint. In ''Playing the blues,'' a set of captions for a photo essay, one caption was missing the notation ''(right)'' that indicated which picture it accompanied. This notation is meaningless in a text-only archive, so perhaps there is no harm in this imperfect match with newsprint.

Good things: ''Medical board allows...,'' ''Henry sent in to give ...'' and ''School board candidates ...'' matched word for word.

Everybody's doing it: A deck headline with ''Medical board allows ...'' flowed right into the main headline with no punctuation or break, making comprehension difficult. Nearly all newspaper archives in this study handled deck heads this way -- and nearly all were confusing.

    Jan. 17, 1997:
One front-page wire service article does not appear in the archive. The front-page ''In the news'' for this date was not compared.

Page	Partial headline 		Word count
1A	Dumond granted ...		1,537
1A	Issue No. 1 in survey? ...	  620
1A	Check for park diamonds ...	  687
1A	LRPD has vested interest ...	  657
1A	Inauguration: Circumstances ...	1,261
1B	LR squadron leaving ...		  748
1B	Gas spill closes ...		  548
1B	Ex-police officer guilty ...	  615
1B	Weather radar site ...		  873
1B	Teleconferencing acquits ...	  798
1B	Tax decision for church ...	  490

Off with the heads: ''Dumond granted ...'' has a deck headline that runs into the main head in the archive, and the archived deck contains two typos besides: The 'H' in ''Huckabee'' is incorrectly lower case and ''Flanagin'' has become ''Falagin.'' The headline likely was stripped by software at some stage and retyped badly.

Getting better: All other articles compared for this date matched newsprint, including faithful reproduction of some minor miscues. For example, ''Inauguration: Circumstances ...'' includes a reference to ''65-foot flag polls'' and ''LRPD has vested interest ...'' includes a paragraph that ends with a comma instead of a period. It is perhaps a mixed blessing that these also appear in the archive.

    Jan. 18, 1997:
Two front-page wire service articles for this date are not in the archive. The front-page ''In the news'' feature was not compared. The local section was not compared.

Page	Partial headline 		Word count
1A	Women lawmakers blast ...	1,048 
1A	Victim: Governor, parole ...	  637
1A	2nd inaugural lacks ...		1,029
1A	'Hearts and souls' must ...	1,303

Wrong-headed: '' 'Hearts and souls' must ...'' became 'Hearts and sould' in the archive, again indicating a headline inadvertently lost and badly retyped.

Missing something: Two captions archived with ''Women lawmakers blast ...'' and one with '''Hearts and souls' must ...'' were missing all apostrophes. This is a computer translation problem difficult to catch without a word-for-word scan.

Better yet: Articles otherwise matched in all particulars.

CORRECTIONS SEARCH

Arkansas Democrat-Gazette
The Democrat-Gazette has a written policy on corrections that includes approval at the assistant managing editor level, and completion of a form that goes to the library. This has not helped the electronic archive.

Corrections published March 11, 12, and 15, 1997, were tracked in Nexis. Searches were conducted in the library news, file arkdem, with search strings of the form DATE IS MM-DD-YYYY AND (identifying text). The identifying text in most cases was a name appearing in the correction. Dates of the original article containing the error were chosen to see if amended versions had supplanted the inaccurate ones. Five corrections on these three days were checked. In all five cases, the original article containing the error was retrieved. Not a single one had a correction notice or any other memo regarding the error attached.

A subsequent search for the corrections themselves, using the dates they were published and the same identifying text that produced the faulty originals, yielded no matches. Not one of the corrections was archived as a separate file, and no corrected versions were found. The search was then expanded to include a range of 10 days after publication of the original, on the chance that an amended file might have appeared later. This, too, produced no matches.
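
The expanded search amounts to repeating the same search string over a run of dates. The Python sketch below generates one string per day in the form given above. It rests on assumptions: the function name is invented, the parentheses around the identifying text are taken literally from the form shown, the sample date and text are illustrative only, and whether Nexis offered a single date-range operator is not stated here, so none is used.

```python
from datetime import date, timedelta


def search_strings(original: date, identifying_text: str, days_after: int = 10) -> list[str]:
    """Build one search string for the original date and each following day."""
    strings = []
    for offset in range(days_after + 1):
        day = original + timedelta(days=offset)
        strings.append(f"DATE IS {day:%m-%d-%Y} AND ({identifying_text})")
    return strings


# Illustrative values only: a hypothetical original date and identifying text.
queries = search_strings(date(1997, 3, 14), "House Bill")
for query in queries:
    print(query)
```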

In one of the March 15 cases, the correction amended the amount owed on a fine. The original article was still available, but an article published the following day also was uncovered by the expanded search range. This follow-up gave the correct figure.

In another March 15 case, the correction amended a description of the contents of House Bill 1109. A check of the previous day's archive for the original article showed that it referred to House Bill 1108. Policy is fine, but performance is faulty.

LINE-BY-LINE COMPARISON

The Charlotte Observer
The Observer uses an SII editing platform linked to a Macintosh-based design platform, like the Democrat-Gazette's old system. It is a Knight-Ridder paper with its archives available commercially in Dialog. The Dialog identifying number for The Observer is 642, and date and page are searchable fields. A three-set search produced 15 items. Dialog gives word counts. Section C contained local news for the randomly chosen date of April 3, 1997.

The search sets were:
    s1=DATE IS 04-03-1997 (165 items)
    s2=s1 AND PG=1A (7 items)
    s3=s1 AND PG=1C (8 items).

Sets 2 and 3 were sent directly to a printer rather than saved on disk.

The communication line took some sort of hit during transmission from Dialog, producing garble and requiring retransmission of items 5-7 of s2. And neither set's final page printed, so add a couple more lines to the list of communications difficulties in our electronic age.

April 3, 1997:
Two 1A items and two 1C items appearing in the search did not come from the Metro edition used and so were not compared to newsprint. Headlines appear in all caps in the archive, with deck heads running into main heads without separators. This complicates comprehension.

The Observer makes good use of the memo field to indicate graphics, photos or info boxes that accompany articles. These notations also specify if that material has not been archived. Because of the mixed platforms, Macintosh-generated graphics often are not archived. The memo field is also used to indicate whether a story appeared in other editions, and if so whether it was on a different page or was of a different length. Keywords or descriptors are used for indexing and appear at the end of the archived article. The library uses a thesaurus of descriptors.

Page	Partial headline 		Word count
1A	Vote could hurt ...		  775
1A	One option: Zero ...		1,305
1A	Teens face longer ...		  964
1A	Memos show Clinton ...		1,040
1A	Minivan's child seat ...	  951
1A	Pentagon forecast sees...	  692
1C	Bill would let people see ...	  429
1C	Killer silent; time runs ...	  926
1C	6-year-old runs into ...	  373
1C	Plates turn staff BMWs ...	  590
1C	1960 election ballots found ...	  314
1C	STATESVILLE AUTO SHOP ...	   77

Taking good notes: ''Teens face longer ...'' includes a deck head from a different edition than the one compared, but the archive specifies which edition is saved. A memo notes that a chart that accompanied the article can only be retrieved from microfilm. The memo also informs searchers that an info box is attached at the end. A memo on the archival version of ''Plates turn staff BMWs ...'' notes that it is a longer version taken from another edition. Except for the additional material, the text matches.

Picture imperfect: A caption archived with ''Memos show Clinton ...'' incorrectly gives the name ''Harold Ickesin.'' The picture was a mug shot of Harold Ickes used with a quote block, and the newsprint text included an identifying phrase following Ickes' name that began with ''in.'' The incorrect name could have come from sloppy deletion of this identifying phrase. Three photos accompanied ''Killer silent; time runs ...'' in the Metro edition, but only two captions appear in the archives.

Mixed platforms, clean copy: All other articles matched letter for letter. During the site visit, librarian Marion Paynter indicated that enhancers check the first and last word of each paragraph to verify validity of the archival capture. This pays off in a clean archive despite a mixed system.
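
The Observer's enhancers make this check by eye. Where both texts are available electronically, the same first-and-last-word test could be approximated as in the Python sketch below. It is a hypothetical illustration, not The Observer's procedure: the function names and sample text are invented, and paragraphs are assumed to be separated by blank lines.

```python
def paragraph_edges(text: str) -> list[tuple[str, str]]:
    """Return the (first word, last word) pair of each paragraph."""
    edges = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if words:
            edges.append((words[0], words[-1]))
    return edges


def mismatched_paragraphs(archive: str, newsprint: str) -> list[int]:
    """Return 1-based numbers of paragraphs whose edge words differ."""
    a, b = paragraph_edges(archive), paragraph_edges(newsprint)
    mismatches = [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]
    if len(a) != len(b):
        # Flag the first paragraph that exists in only one version.
        mismatches.append(min(len(a), len(b)) + 1)
    return mismatches


# Invented sample text for illustration.
newsprint = "The board met Monday.\n\nIt approved the budget."
archived = "The board met Monday.\n\nIt approved the plan."
print(mismatched_paragraphs(archived, newsprint))
```

The check is cheap because it ignores the middle of each paragraph, so it catches dropped or fused paragraphs and late trims more readily than a changed word in mid-sentence.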

LINE-BY-LINE COMPARISON

The Tennessean, Nashville
The Tennessean also has a mixed SII-Macintosh system. Descriptors are used, but these are added at whim after a quick scan of the article and are not pulled from a thesaurus. The problem with this kind of indexing is discussed above. The Tennessean is a Gannett paper in a joint-operating agreement with The Banner, and its archive appears in DataTimes under the identifier TNNS. Date and page are searchable fields. The search was for DATE IS 02/14/97 AND PAGE (1A OR 1B). The files give printer-page estimates rather than word counts. The headlines appeared in all caps, and deck heads generally were not included.

Feb. 14, 1997:

Page	Partial headline 	
1A	Hostage safe, gunman ...
1A	Sundquist 'unaware' ...
1A	March house officially ...
1B	Mayor plans schools ...
1B	Hunting harbingers ...
1B	Singing the praises ...
1B	Sundquist defends ...
1B	Traffic delays cost ...

Dash it all: In all articles, dashes used to set off phrases had been converted into spaces, which made comprehension difficult.

With words unspoken: In ''Traffic delays cost ...,'' square brackets used to insert phrasing parenthetically in a quote were lost in translation. This dangerously puts words in a speaker's mouth.

Troubling picture: Two captions of photos with ''Hostage safe, gunman ...'' are run together in the archive, again making it hard for a searcher to sort out. The archival version of a caption with ''Sundquist 'unaware' ...'' cuts off after only seven of 13 words in newsprint.

What's in a name: A photo caption credit on ''Singing the praises ...'' appeared as ''SHELLEY MAYS STAFF'' and one is left to wonder if Staff is the photographer's surname. The use of all caps and the lack of a separator make this especially hard to sort out. The form ''Staff photo by Shelley Mays'' would eliminate any chance of confusion. The byline on this story is handled similarly: ''RAY WADDLE RELIGION EDITOR.'' This is easier to sort out, but easier still would be the form used on ''Sundquist defends ...'': ''DUREN CHEEK Staff Writer.'' Here the use of caps for the name and upper and lower case for the credit provide clarity.

Pardon the interruptions: Line breaks not in newsprint are added in the middle of sentences in the archived versions of ''Mayor plans schools ...'' and ''Traffic delays cost ...'' These breaks appear at the top of the second column of the newsprint articles, where each had a white-on-gray reverse head identifying the story locale as ''Davidson.'' The computer translation for the archive presumably stripped this tag but left a line-end command in its place. On ''Singing the praises ...,'' a quote block in display type on the front page appears in the archive in the middle of the sentence where the article jumped from the front page. After the quote, the article's jumphead appears, followed by the interrupted text. This was confusing even with newsprint right at hand for comparison. Worse, the quote used for the display block follows just three paragraphs after this archival insertion, strengthening the impression that the transmission has been somehow garbled.

LINE-BY-LINE COMPARISON

The News & Observer in Raleigh, N.C.
The N&O archives are available in Nexis and DataTimes. Articles published Jan. 17, 1997, were retrieved from Nexis and DataTimes to see if identical text appears in each. The Nexis identifier for The N&O is nwsobs; in DataTimes, it is rnobs. Word counts are provided in Nexis, while DataTimes gives an estimated printer-page count. Date and page are searchable fields in The N&O's DataTimes file. The N&O archives its articles placing the section letter first, though page numbers come first in newsprint. 1A thus appears as A1 in archives. The DataTimes search command was DATE IS MM-DD-YYYY AND PAGE (A1 or B1). The Nexis search specified DATE IS MM-DD-YYYY and the resulting list was winnowed by the focus command .fo A1 or B1.

The N&O uses descriptors from a thesaurus in enhancing articles for the archives. File headers cite the edition in which the article appeared. Deck heads generally are not included.

Bylines and credits appear in separate fields, eliminating confusion that arises when these run without separators in a single line.

Commitment to quality control and careful software development give Raleigh a generally clean archive, but a few small things indicate the magnitude of the task librarians face: Captions dutifully archived in DataTimes were absent in Nexis, one byline was wrong in both Nexis and DataTimes, and a headline on one article in DataTimes appeared at the end of the file rather than the beginning. Diligence remains the watchword, though: Library staff at The N&O learned of the misplaced headline and mistyped byline from these findings and moved to correct them.

Some headlines are changed in the archive into all caps, while others are upper and lower case. Colons are inserted after subheads in the archive, which helps with comprehension. Since the electronic version does not reproduce newsprint's visual cues of larger, bolder type and centering, the punctuation makes the subhead seem less like a disembodied phrase accidentally inserted.

Dashes, which posed a translation problem at The Tennessean, are accurately reproduced in The N&O archive. Square brackets, also a problem at The Tennessean, are translated into curly braces in The N&O archive. While not a match in typography, this is at least a match in spirit that cannot corrupt a quote that includes parenthetical material.
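
The contrast between the two papers comes down to whether the character translation maps a troublesome character to a stand-in or simply drops it. The Python sketch below illustrates the difference. It is not either paper's actual software: the function names are invented, the sample quote is made up, and the mappings are modeled only on the results seen in the archives.

```python
# Mapping modeled on the result seen in The N&O archive: brackets become braces.
BRACKETS_TO_BRACES = str.maketrans({"[": "{", "]": "}"})

# Lossy handling of the kind seen in The Tennessean archive: brackets vanish.
DROP_BRACKETS = str.maketrans("", "", "[]")


def translate_for_archive(text: str) -> str:
    """Replace square brackets with curly braces; leave everything else alone."""
    return text.translate(BRACKETS_TO_BRACES)


def strip_brackets(text: str) -> str:
    """Delete square brackets, merging the insertion into the quote."""
    return text.translate(DROP_BRACKETS)


# Invented quote for illustration.
quote = '"We think [the project] will cost more," he said.'
print(translate_for_archive(quote))
print(strip_brackets(quote))
```

In the first output a searcher can still see which words were inserted by an editor. In the second the insertion reads as the speaker's own words.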

Notes about any graphics not archived appear at the end of articles, where captions are also attached.

Jan. 17, 1997:

Page	Partial headline 		Word count
1A	N.C. schools' grade ...		1,040
1A	Report assails Gingrich ...	671
1A	Watershed rezoning ...		1,130
1A	Cosby's son shot ...		895
1A	Firm to get postal ...		506
1B	Twice is too much ...		560
1B	Garner man pockets ...		596
1B	Public input on arena ...	907
1B	Arctic chill hits ...		703

Double, double, a bit of trouble: The archival versions matched in DataTimes and Nexis, except that captions with ''Garner man pockets ...'' and ''Arctic chill hits ...'' did not appear in Nexis. The databases were identical down to errors: The archived byline on ''Public input on arena ...'' is ''MATTHEW EISELY,'' though it appears correctly as ''MATTHEW EISLEY'' in newsprint. In ''Twice is too much ...,'' both databases read ''That showed intelligence and understanding, but also that ...,'' where newsprint has ''but it also that ...'' In this same article, an ellipsis that appears in newsprint is absent from both databases. The dropped ellipsis seems to be a computer translation glitch, because ''Report assails Gingrich ...'' also has this problem in the archived versions.

Unclear the deck: A deck head is included with ''Arctic chill hits ...,'' but it runs into the main head without separators: ''Arctic chill hits area Dress in layers, keep head covered, experts say.'' Again, this is difficult to read.

Jan. 18, 1997:
Only DataTimes versions were compared for this date.

Page	Partial headline 		
1A	Panel urges reprimand ...	
1A	Board OKs funding ...		
1A	Focus groups becoming ...
1A	Heroes of Brinks money ...
1A	Boxer's one-night paycheck ...
1B	Spreading the word ...
1B	All South owes debt to ...
1B	County supports ...
1B	GTE hustles to restore ...
1B	NCCU cheers author ...
1B	BY MONDAY, ...

Getting braces: Parenthetical insertions in quotes in ''Focus groups becoming ...'' and ''Board OKs funding ...'' appear with square brackets in newsprint and as curly braces in the archive. This ensures that the insertions are not interpreted as the speaker's own words.

Colon-ized: Subheads are made more comprehensible by the insertion of colons in ''Focus groups becoming ...'' and ''Board OKs funding ...''

Lost his head: The headline on ''All South owes debt to ...'' appears at the end of the article.

Different dialects: An information box accompanying ''Focus groups becoming ...'' is appended to the archived version. Black boxes at the start of each item in its list have been translated into dashes, and a line end is missing before the credit and is translated into two tildes. A space separates these characters from the credit, so the result is not hard to read. ''Source:'' in the credit for the info box appears in the archive as ''SRCE.'' An info box appended to ''Panel urges reprimand ...'' also had tildes and dashes instead of a line break and boxes. None of these variations from the original poses a threat to understanding.