Fortunately, Brummett realized his error early in the process, grabbed the page proof he was reading and dutifully marked the incorrect passage. He turned the proof over to the op-ed page designer, the corrections were made and all was well.
Or so Brummett thought.
Several days later, he was reading a letter from a lawyer for the public figure, quoting the excised passage and demanding corrective action. Brummett was stunned to learn that the offending language, which never appeared in newsprint, had made it into the paper's archives in the Lexis/Nexis database.
How did it happen?
The Democrat-Gazette had mixed computer platforms, with programmed translation of material that crossed a ''bridge'' from the editorial to the design platform. Corrections from page proofs were entered on the design platform and did not go back across the bridge to the editorial system. The archival capture came after the material crossed the bridge, but did not incorporate these final page proof corrections. Thus, the paper in effect transmitted a draft to Nexis, and inaccurate material that the paper never printed was available in its electronic archives to anyone with a modem and a few minutes to explore cyberspace.
Is Brummett's problem the rarest of disasters or the tip of the iceberg? Salvation or suffocation?
The investigation, during a fellowship at the University of North Carolina at Chapel Hill under a grant from the John S. and James L. Knight Foundation, aimed to determine the scope of the problem.
A survey of newspaper librarians sought to assess the systems by which newspapers deal with electronic archiving: How many have badly mixed technology that practically builds inaccuracy into their systems, and how many insist on rigorous quality control and careful programming?
A comparison of electronic archives to newsprint and a search of electronic archives for selected published corrections tried to give a hint of the frequency of accuracy problems in electronic archiving.
The complexity of the problem is summed up beautifully on page 228 of News Media Libraries: A Management Handbook, published in 1993 by Greenwood Press and edited by Barbara Semonche:
No one wants an inaccurate database. Everyone makes mistakes. These two indisputable statements add up to the necessity for quality control carried out on a regular basis.
The solution is difficult (and expensive) to apply. Rigorous, relentless quality control takes a lot of work by a lot of people:
A daily quality control list of stories added to the database the previous day, faithfully checked for typographical errors and incorrect page or sections information, is a start for an adequate quality control procedure. If it is possible, the quality control assistant should not be one of the persons who enhanced that issue of the paper, because it is easier to spot errors in a first-time reading. ...Quality control is sometimes a seemingly expendable task if the library staff is overburdened, as most are, but it is important, if not vital, for the integrity of the database. It is the final step toward a complete and excellent online system.
If newspapers can't make that investment at the back end, the system must be set up for maximum efficiency at the beginning. Again this means involving many: Editors, library staff and network administrators must have a voice in developing the database software. Only with a full understanding of the information flow through the system can a reliable, efficient electronic archive be created.
So, are papers investing in this kind of quality control at the back end? Are they involving all of the right people at the front end? Four months of contacts with librarians from North Carolina to California indicate the levels of commitment are oceans apart.
The computer system is a mix of platforms, and items that arrive in the library queue may not contain all page proof corrections. Transmission to the library queue also is not faultless. The librarian and an assistant spend most of their day in retrieval and pre-archive markup of published material.
Some enhancing is done, with the attachment of keywords (or descriptors) to each file. These keywords, however, are not chosen from a carefully controlled list (or thesaurus), but are assigned by the enhancer after a quick read of the material. Anyone looking for such articles by keyword must be of the same mind-set as the enhancer, which research has shown is problematic even when using a thesaurus: A 1992 study by University of Minnesota graduate student Mark Neuzil found that descriptor searches using three metro papers' lists produced no better than 55 percent of the stories that a full-text search for these terms uncovered in the papers' 1989 archives.
With retrieval, markup and enhancement absorbing so much time at this paper's library, quality control is negligible. A quick check of the headline, lede and end of the article is all that time allows. The librarian sighs and says microfilm or microfiche is the archive of record anyway, as an uncorrupted physical reproduction of the newspaper. The problem, of course, is that the newsroom does not turn to microfilm when searching for background.
The librarian laments that there is no way for her small staff to take about 100 articles a day and ''edit them line by line and get it right. We're lucky to have enough time to check the headline and the lede.''
Corrections are appended at the tops of stories they apply to, and stories fetched from the archive always open at the beginning of the file, but it is up to any editorial staff using the archive to deal responsibly with correction information. Fortunately, there have been no reported instances of corrupt information being retrieved and making it into newsprint.
These problems are all well-recognized and long familiar to librarians, and many newspapers have already found some solutions. But this librarian, cut off from professional associations and most other contact with members of her profession, cannot benefit from the experience of others. This archive is trouble in waiting.
A new computer system was installed in late 1996 that was supposed to eliminate some of the problems of the previous system, which required a bridge from the System Integrators Inc. (SII) editing system to the Macintosh-based design platform. Under the old system, archival capture came at a stage that did not always include final proofing corrections and so effectively created an archive of polished drafts, not final copy. The new system was also supposed to eliminate the need for time-consuming searches for items that didn't make it to the library queue. This has not been the case, according to librarians. In fact, librarians said the number of missing stories has risen from perhaps three each week to four each day. With a limited staff, search time cuts into time for quality control or research. Quality control of daily archiving consists of a quick visual check of the headline, the lede and the last paragraph to make sure that the entire file is there. The library does not generally verify that final page proof corrections have been incorporated in the archival feed. The choice of archiving software was made by systems people and ran counter to Thomas' preference.
Also, for reasons not yet fathomed, duplicates appear when anyone attempts to retrieve material from the in-house archive. This is a particular nuisance in a full-text display, where a reporter ends up scrolling twice through a 50-inch article.
Library staff feels distanced from the newsroom and unable to devote much attention to research and resources. Librarians would like their department to be more of a resource center, but even acquiring an updated set of encyclopedias has proven difficult. (The paper recently acquired a new set of encyclopedias, but put them in the newsroom. So, the librarians trained in research aren't readily able to use them, and reporters and editors don't have the advantage of specialists' assistance.) Standard desktop references such as legal and medical dictionaries, Bartlett's, and film and music guides are also in the newsroom. Only specialized research volumes and indexes are exclusive to the library.
The problems of this setup came home to roost in November with Brummett's disaster. Discovering the error actually revealed another layer of problems. Getting the error purged from Nexis took long and insistent effort. First, the version that was actually printed was queued up and transmitted. A check of Nexis revealed that this correct version appeared, but the prior version had not been removed. Instead of one wrong version, there were conflicting versions. A second prolonged stint on the phone followed before the never-printed version was finally purged.
Corrections in general require complicated effort at the Democrat-Gazette. A written policy prescribes standard newsroom procedures for ensuring that corrections are attached to archival versions, but librarians describe their part of it as an eight- or nine-step process that has grown more rather than less complicated with advancing technology. The process sometimes includes retyping the correction rather than electronically cutting and pasting -- again introducing greater chance of new error.
The archival versions of stories published on these days also did not include paragraph breaks. The last sentence of each paragraph ran together with no space between its period and the first word of the next paragraph. This leaves no particular confusion about meaning, but it is an awful nuisance to a reader.
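A fused paragraph boundary of this kind leaves a recognizable trace: a lowercase word, a period and a capital letter with no space between them. As a minimal sketch, assuming the archived text is available as a plain Python string, a check like the following could flag likely run-together paragraphs for a human to review. The function name and the pattern are illustrative only, not anything the paper or Nexis is known to have used, and the heuristic will miss paragraphs that begin with a numeral or a quotation mark.

```python
import re

# Heuristic: two lowercase letters, sentence-ending punctuation, an optional
# closing quote, then a capital letter with no whitespace in between.
FUSED_BREAK = re.compile(r"[a-z]{2}[.?!][\"']?(?=[A-Z])")

def find_fused_breaks(text, context=20):
    """Return short excerpts around each suspected missing paragraph break."""
    excerpts = []
    for match in FUSED_BREAK.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        excerpts.append(text[start:end])
    return excerpts

# Made-up sample text: one fused boundary, one normal sentence break,
# and an abbreviation that should not be flagged.
sample = ("The board met Tuesday.Members voted to delay the plan. "
          "Police in Bismarck, N.D., made an arrest.")
for excerpt in find_fused_breaks(sample):
    print(excerpt)
```

A scan like this only points a proofreader at suspect spots; it does not replace the comparison with newsprint.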
The archived version of the front page ''In the News'' compendium of briefs did not include final page proof corrections -- ''Santa Ana'' was spelled correctly in the final (City) edition but was captured as ''Santa Anna'' in the archival version.
Still another article appeared in two versions on the archive. Under the headline ''It's Hope against Hope,'' one version was tagged by the archive as 1,154 words; the other, 834 words. Neither version gave any indication of which edition it appeared in or how the versions differed. And neither of these versions matched the City (final) edition, which was still longer. (The Democrat-Gazette librarians, like most newspaper librarians, generally save either the final edition or the longest version -- or both, but in this case they got neither.)
This was not the only article for this day available in two versions. Under the headline ''Convict's rehab center raises questions,'' one version of a 1A story ran 1,818 words and the other 1,848 words. In this case, the difference was clear -- and frightening. The extra 30 words in the longer version included the notice in the header that this version was a corrected version! The correction, attached at the top, fixed a name appearing a few paragraphs from the end of the lengthy piece. By the time a searcher scrolled to the point of error, however, this notice would no longer be visible and there is no reminder at the point of error.
And the original version without the correction notice remained in the archive!
The same thing occurred with a 1B story under the headline ''2 challenging incumbent for LR school board seat.'' One version contained a correction notice dated Sept. 18, 1996, amending information about a candidate's academic degree. But the original remained available unmarked in any way.
Switching to a seamless computer system has helped, but some new problems have arisen and some of the old ones linger.
A check of 1A and 1B articles published Jan. 17-18, 1997 (dates chosen at random), found two articles with headline typos in the commercial database. The archive showed a front page article with a name in a headline that did not match newsprint: It had been mangled from ''Flanagin'' (already an unusual spelling of a fairly familiar surname) to ''Falagin.'' In this same archive headline, the surname ''huckabee'' had been incorrectly typed with a lower case h. The lack of capitalization may not confuse anyone about the name, but it could mean a miss in a case-sensitive headline search -- and it simply doesn't match what was printed! This headline had been stripped somehow from the electronic version transmitted to the library, and had been badly rekeyed. A headline on another article this day also was obviously retyped, because ''souls'' in print became ''sould'' in the archive.
Why, in this advanced electronic age, does our industry spend so much time retyping? How, in this litigious age, do we dare?
These articles published in January fall under the Democrat-Gazette's new computer system, which is supposed to provide seamless operation. The articles checked under this system did match newsprint in nearly all cases, with the exception of the headlines mentioned. As it turns out, different designers build their pagination documents in different ways, and some of these methods disconnect headlines from body type, or section-front segments from jumps. Also, it seems the program that strips typesetting markup removes headlines under some circumstances.
New technology poses new problems.
One paper that does it right is The News & Observer in Raleigh, N.C. The N&O has 20 employees in four divisions in its library covering archiving, periodicals and resources, newsroom research and public research. The public research is kept separate from the newsroom and conducted in a different building to avoid questions of conflict of interest.
The entire newsroom library operation is highly automated, but every step is carefully monitored. A computer script fetches every article in the pagination system after the press run, translates it from a pagination file to a plain-text file and attaches an in-house-designed header with archiving fields. These files are then sorted into computer folders by section. A team of four enhancers (three on weekends) then goes to work, scanning header fields and verifying their content, then checking paragraph by paragraph for a match to newsprint. Keywords are attached from a thesaurus of about 100 terms.
The enhancers work from a story list called a ''slug report'' that helps them verify header fields and keep an accurate count of their work. Computer scripts written by a former enhancer who now works in systems at the paper also are employed to check coding in the date and summary fields. In all, enhancers handle 140-145 articles each weekday, 200-235 each on Saturdays and Sundays. During the author's visit to the library March 2, 1997, the slug report showed only one headline dropped by the translator script over the course of several days. (Consider that in a check of only two section covers of a randomly chosen day's archive of the Democrat-Gazette, two headlines were found to have been retyped by librarians.)
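The kind of check a slug report makes possible can be sketched in a few lines. This is an illustration under assumptions, not the N&O's actual scripts: the slugs, the ''headline'' field name and the function name are all hypothetical. The idea is simply to compare the list of stories expected from the press run with what actually arrived in the archive, and to flag any file whose headline field came through empty.

```python
def check_against_slug_report(slug_report, archived):
    """Compare the expected story list with what reached the archive.

    slug_report: iterable of slugs expected from the press run.
    archived: dict mapping slug -> header dict (hypothetical field names).
    """
    expected = set(slug_report)
    received = set(archived)
    no_headline = sorted(
        slug for slug, header in archived.items()
        if not header.get("headline", "").strip()
    )
    return {
        "missing": sorted(expected - received),      # never reached the library
        "unexpected": sorted(received - expected),   # not on the slug report
        "no_headline": no_headline,                  # headline lost in translation
    }

# Made-up example data.
slug_report = ["council", "storm", "schools"]
archived = {
    "council": {"headline": "Council delays vote"},
    "storm": {"headline": ""},
}
report = check_against_slug_report(slug_report, archived)
print(report)
```

A report like this tells the enhancer where to look; the fix itself should still be an electronic cut and paste from the pagination file rather than a retyped headline.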
Occasionally, the Raleigh scripts also place bylines in the wrong header field, but even here enhancers merely cut and paste electronically rather than retype. The most significant other problem with the translation script is handling fractions. Any article that contains fractions requires a careful check of newsprint against the electronic version, and the fractions must be typed in over the inevitable mistranslation.
Handling of corrections remains a bit cumbersome. The correction itself is archived, but the correction and the original article are not simply appended electronically. The original must be called up on a computer terminal and the correction retyped at the top. The amended article is then refiled and the entire year's archival database ''reindexed.'' This reindexing runs in computer background and can take several hours as the year passes and the database grows. The amended version is then sent by file transfer protocol to the various commercial databases on which the archives appear.
Overall, the bugs are few, but still galling to a staff that prides itself on unceasing quality control.
The process is so refined because archive manager Colline Roberts had a strong, direct hand in development of the computer scripts when the archive came on line in 1994 and Macintosh-based pagination followed half a year later. She helped design the header format and worked closely with systems programmers as they automated processes. An enhancer with programming propensities wrote some small scripts to handle repetitive cleanup tasks and eventually worked his way into a systems job -- with the stipulation that he remain especially available for library technical support.
This is a library at a paper committed to a top-flight archive.
While tracking some other concerns, library director Jackie Chamberlain learned in mid-1996 that The Orange County Register was having trouble with parts of stories missing from the Nexis database. Chamberlain found a similar problem at The Press-Enterprise and in chasing down the failed transmissions came across a number of truncated stories. Lots of investigation, e-mailing and a two-hour teleconference involving Nexis revealed that long stories that had corrections appended or that were resent with fixes became truncated on retransmission.
Riverside uses Basis/DataTimes software, and its files are sent in chunks that are reassembled at the receiving point. Basis users also can transmit just the pieces of an article that update or are amended and have them woven into the original. At Nexis, this reassembly did not proceed properly with the Basis files. Tops of long articles never got their bottoms. Worse, it turned out that corrections attached at the top in Basis were moved to the bottom in Nexis, but since the bottoms and tops never were rejoined, articles not only were truncated but also were left without their corrections.
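The failure is easier to see in miniature. The sketch below assumes a made-up chunk format (article identifier, sequence number, total number of chunks, text); it is not the actual Basis or Nexis format, which the investigation did not document. It shows the safeguard that was missing: an article is filed only when every piece has arrived, and incomplete ones are reported rather than stored as truncated text.

```python
from collections import defaultdict

def reassemble(chunks):
    """Join chunks per article; report articles with missing pieces.

    Each chunk is a tuple: (article_id, sequence_number, total_chunks, text).
    """
    by_article = defaultdict(dict)
    totals = {}
    for article_id, seq, total, text in chunks:
        by_article[article_id][seq] = text
        totals[article_id] = total

    complete, truncated = {}, {}
    for article_id, parts in by_article.items():
        expected = range(1, totals[article_id] + 1)
        missing = [n for n in expected if n not in parts]
        if missing:
            truncated[article_id] = missing   # do not file a partial article
        else:
            complete[article_id] = "".join(parts[n] for n in expected)
    return complete, truncated

# Made-up transmission: article B7 never receives its third chunk.
chunks = [
    ("A1", 1, 2, "Top of story. "),
    ("A1", 2, 2, "Bottom of story."),
    ("B7", 1, 3, "Correction notice. "),
    ("B7", 2, 3, "Top of long story. "),
]
complete, truncated = reassemble(chunks)
print(complete)
print(truncated)
```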
This raises inevitable questions of legal liability: Whose neck is on the line if someone happens to go to the database and retrieve an article that's missing its correction? In wrestling with the problem, the earliest solution was to send the entire text of any such articles, not just the revised segments. There was still a problem of reassembly, however, and eventually Nexis wrote a software patch to handle it. But the user still must be sure to resend all the pieces of articles corrected in any way.
The problem that remains is that any files sent before the software patch was developed may still be corrupt. Basis users could face a needle-in-a-haystack search through their archives or transmission lists for such corrupted files, but a simpler method is to retransmit entire batches of files. The new versions then supplant the old, if the new versions are encoded with the same access number or identification stamp as the old.
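Why the identifier matters can be shown with a toy archive keyed by access number. This is a sketch only: the access numbers and the function name are invented, and a real commercial database is far more elaborate. A resend under the same number replaces the old version, while a resend under a new number leaves both versions available to searchers.

```python
def file_article(archive, access_number, text):
    """File an article; a matching access number replaces the old version."""
    replaced = access_number in archive
    archive[access_number] = text
    return replaced

archive = {}
file_article(archive, "1996-0915-001", "Original text with the wrong name.")

# Resent under the same access number: the corrected version supplants it.
file_article(archive, "1996-0915-001", "Correction appended. Corrected text.")

# Resent under a different number: a second copy now sits in the archive.
file_article(archive, "1996-0915-001b", "Correction appended. Corrected text.")
print(len(archive))
```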
During the course of her investigation, Chamberlain alerted other librarians across the country and kept in steady contact with Nexis, searching for an industry solution to what she might have looked at as a local problem. And she got lots of assistance from Jim Zikratch of the paper's technical staff. Quality control must be the concern of all, or there is no quality control.
So, Press-Enterprise librarians have one more item on their to-do list. Chamberlain now does occasional verification of the commercial database. Nexis has helped by providing an early morning e-mail report on updates. The report alerts The Press-Enterprise to any reassembly problems so they can be fixed. Chamberlain believes that these remaining stitching problems arise from librarians occasionally forgetting the new need to resend all the pieces of an amended article.
The watch never ends.
In the meantime, though, Hauswald and other newsroom leaders have been attending conferences and expositions and reviewing various archiving programs. She speaks well of the cooperative spirit with which the move has been approached, though she admits she has had to be a fighter at times for some of her key concerns. She has a staff of four and is hoping to add a part-time on-call person.
Her trips to conferences and professional networking give her a good view of the road ahead. Specific questions already resolved include timing of the archival capture -- the archival feed will be the same as the feed to press. (A former managing editor insisted -- it's good to have support throughout the newsroom.) A thesaurus of descriptors will be used, and an index of newsclips with 20,000 entries that has been building since 1992 also will be included in the new electronic archive. One question that remains is whether there will be enough time under the new system for continued cross-training of library staff in research and database enhancement roles.
Knowing what other papers have gone through should help make for a smooth, efficient transition to a clean electronic archive.
Technology puts more and more information in any given place with less and less human handling, but this creates a crying need for more and more human monitoring. In our communication business, it seems the computers are doing more of the communicating these days. But every time computers communicate, people need to be communicating, too -- talking with each other to do a better job of setting up the rules for these silicon connections and then monitoring these connections to keep them pure.
There is no rest in this task and no part of the newsroom can avoid sharing in it.
Surely, back at the Democrat-Gazette in November, columnist John Brummett was convinced he had done everything right. He had caught an error before it could be published and later seen the amended version in newsprint. How could he have imagined a system that would preserve the unfinished product?
In the newspaper business, one is constantly reminded to check one's assumptions and sources. Assume the archival capture comes after final page proof corrections? Better check. Assume headlines, captions and corrections are electronically cut and pasted where they belong rather than retyped before archiving? Better check. Assume corrected versions sent to a commercial database supplant incorrect ones? Better check. Assume that article retrieved from a commercial database is actually what was put on newsprint? Better check!
Some will read the details of this investigation and see a great deal of extreme nit-picking. Perhaps many newspaper leaders will continue to bet that the sorts of errors uncovered here are not worth the investment in people and time needed to safeguard against them. But the investigation shows that at the very least newspapers should check their assumptions about the validity of their electronic archives. Searchers of these archives similarly need to ask some questions to establish the level of confidence they can have in a given database.
Quality control is everyone's job.
The papers and dates examined (with the reasoning behind the choices in parentheses), and the databases:
The Arkansas Democrat-Gazette (author's employer)
    Searched on Nexis:
The News & Observer in Raleigh, N.C. (library with strong reputation,
within driving distance of Chapel Hill, N.C., for site visit)
    Searched on Nexis and DataTimes
(Retrieved same articles from each for Jan. 17 for comparison):
The Charlotte Observer (home paper to another Knight fellow, has mixed
computer system similar to Democrat-Gazette's old system and lies
within driving distance for site visit)
    Searched on Dialog:
The Tennessean in Nashville (roughly halfway along route from Little
Rock to Chapel Hill, convenient for site visit)
    Searched on DataTimes:
Survey questions:
GENERAL
1) Your paper's name:
2) Web URL (if you have one):
3) Commercial databases on which your archives are available:
4) Circulation of your newspaper (daily, Sunday):
5) How many editions do you publish?
6) How many people are employed in the newsroom? (Include
photo, graphics and special projects staff)
7) Identify your computer system(s) for editing, design, archiving.
THE LIBRARY
8) How many people are on the library staff? (List their job titles.)
9) Do library shifts cover night hours? Weekends?
10) What archiving software does the library use?
11) If you have different platforms, which platform feeds the archives?
12) Whether you have mixed platforms or seamless system:
a) Does the archival feed include final proof corrections?
b) Who monitors the feed? Library staff? Editorial staff? Mixed team?
c) What quality control measures do you take to ensure the archives
match the printed product?
13) Does library staff enhance archived articles with keywords?
14) Is there a thesaurus of keywords?
15) Are all editions archived and do all versions of an article that
updates over multiple editions make it into archives?
16) If all editions and versions are not archived, what is the selection
process for determining which ones are?
17) Does the newsroom have terminals with access to archival database?
18) Does the library have terminals with access to the newsroom database?
DATABASES
19) If your archives are available on multiple commercial databases,
does the identical text appear in each database?
20) How is the transmission to the on-line database(s) handled?
Automated? Markup by librarians? By editors? By MIS personnel?
By vendor personnel? Other?
21) What quality control is done at transmission and to verify transmission
and who handles it?
22) Please describe the work flow and quality control at each step from
editor to librarian to archive.
CORRECTIONS
23) Do you have written policies on publishing and archiving corrections?
a) Please describe how corrections are attached. And are they placed
at the top, the bottom, the point of error, or is some other
method used?
b) Do you verify that a correction has reached the on-line database(s)?
If not, describe what other means are used for quality control.
24) What other kinds of correcting do you do in the database (e.g.,
checking notes, fixing validation errors, cleaning up typos in
the header fields)
THE WEB
25) Do you have a web edition?
26) If not, is one planned?
27) Is the library involved in current or planned web edition?
28) Describe any ways in which text on the web edition varies from the
newsprint one? For example, material removed because of length
might be restored before web publication.
29) Is the web edition archived? If so, how is it handled? Is there a
search engine? If so, please identify it.
The table below does not include the news service that responded, but its library has a staff of three in a newsroom of 35 employees. No respondent's paper or news service had encountered legal problems arising from corrupt archives.
Paper          Circulation            News     Lib.        Number
               Daily      Sunday      Staff    Staff       of editions
_____________________________________________________________________
 1)             71,000     84,000      90      2            3
 2)            211,500    307,000     175      5            3
 3)            180,000    240,000     300      5            7
 4)            166,605    173,533     238      7            8
 5)             67,000    100,000      80      1 F/2 PT     2
 6)            340,000    530,000     250      13           3
 7)            240,000    300,000     270      8            3
 8)            135,710    N/A         111      4            1
 9)            375,000    850,000     260      7            3
10)            382,484    461,620     352      15           8
11)            240,000    440,000     235      8            3
12)            112,000    140,000     125      4 F/2 PT     6
13) Raleigh    169,028    196,434     250                   3
14) LR         175,218    288,250     210      4            3

The Democrat-Gazette is the only paper in the group where the library does not have a newsroom terminal.
Four respondents do not have archives available on commercial databases. Five have archives available on more than one database and a sixth is negotiating with a second vendor. Six are with DataTimes, five are with Lexis/Nexis, four with Dialog, two with NewsBank. Infomart is also represented.
Five respondents had Atex editorial systems, four had SII, three had Macintosh networks and the other had a PC-based system.
Eight of the respondents said that archival capture does not always incorporate final proof corrections. Of these, four reported extensive quality control at this stage by enhancers to counter the problem, three reported at least a quick scan at the enhancing stage, and one admitted the paper's approach was to ''hope for the best.''
Quality control in the overall process ranged from paragraph-by-paragraph checks at enhancing, along with software and human verification at nearly every technical stage, to relying on the hope that the software works. The commitment to quality control was independent of size: Papers large and small fell on each side of the divide.
All of the respondents except the news service and the Democrat-Gazette used keywords in enhancing, and only one did not use a thesaurus. Three papers archived all editions and all versions of an article updating across editions and a fourth paper microfilmed all editions and electronically archived the final. The remaining respondents generally archived either the latest or longest version of articles.
Librarians monitored the various communication points in the process at most papers, with systems personnel aiding at two places and editors playing a role at three.
Two respondents append corrections at the bottom of articles, with one of these using a special field and the other also filing the correction as a separate article. The remaining respondents place corrections at the top, with four using a special field. Of the respondents filing corrections at the top, one also files the correction as a separate piece and two place a ''See correction'' notice at the point of occurrence. Two respondents verify that a correction reached the commercial database, another respondent conducts spot checks and the others rely on software verification of transmission and on adherence to procedures.
The paper's electronic archive is found in the news library of Nexis under the file name arkdem. The file does not use ''page'' as a searchable field, so the search string took the form DATE IS MM-DD-YYYY. This yielded all of the given date's stories. The focus command .fo 1A or 1B winnowed the list of citations to articles on the appropriate section covers. Nexis files give word counts.
    Sept. 15, 1996:
Three front-page wire service articles do not appear in the archive. One
article appearing on 1B in City (final) edition ran on the front page in
an earlier regional edition and was archived as a 1A story, so it was not
compared.
Page   Partial headline                   Word count
1A     Convict's rehab center....         two versions -- 1,818/1,848
1A     Buildup in gulf...                 753
1A     It's Hope against...               two versions -- 1,154/834
1A     In the news...                     471
1B     Medical board allows...            697
1B     Playing the blues (caption)        150
1B     Four-car crash leaves...           236
1B     Henry sent in to give...           935
1B     Little Rock, Pulaski County...     two versions -- 889/930
1B     School board candidates...         869

Can't get a break: The archival versions of all articles on this date were missing paragraph breaks. The last word and period of each paragraph and first word of the succeeding paragraph were combined. Presumably, this is a problem with the Democrat-Gazette's code-stripping software or a problem in transmission to the database.
Problem after problem: ''It's Hope against ...'' appeared in two versions, but no notation anywhere indicated which edition the versions came from or how or why they differed. Neither version exactly matched the City (final) newsprint edition used for comparison, which was longer yet. How is a searcher to sort this out without some guidance in some header or memo field?
In the longer version, the ''I'' in ''It's'' was missing from the headline. At the least, this means sloppy proofreading during archival preparation. A photo caption archived with the text misspells ''seeking'' as ''speeking'' and ''miracle'' as ''milracle.'' This indicates text that wasn't captured or appended electronically as it should have been and then was badly retyped and proofread.
Who's on first: ''Convict's rehab center ...'' appeared in two versions, the first listed at 1,818 words and the next at 1,848. The longer version contains a correction notice at the top (published Sept. 16, 1996, according to the notation) correcting the first name of a man quoted late in the article. Otherwise-duplicate versions with correction notices should supplant originals in commercial databases, because they ordinarily should be transmitted with the same file identifier used for the original. This method apparently was not used in this case. Also, it should be noted that searches usually produce lists with the most recently filed version appearing first because the default arrangement is latest-date first. The search parameters in this case turned even that upside down, and the search yielded the corrupt version first, with no hint of the problem. Had the search not continued to the next citation, the error easily could have been retrieved and propagated.
''Little Rock, Pulaski County ...'' suffers this same problem. It appears in two versions on the archive, with the second, longer version including a correction dated Sept. 18, 1996, on the academic credentials of a man named late in the article. In this case, both versions also include two paragraphs not in the newsprint version, which suggests the page was changed and trims made after the archival capture.
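The supplanting mechanism described above can be pictured as a store keyed by file identifier: a second transmission under the same identifier overwrites the first, while a transmission under a new identifier leaves both versions on file. The following is a minimal sketch of that idea only. The identifiers, the article text and the function name are hypothetical, and nothing here describes how Nexis or any other vendor actually stores files.

```python
def file_article(archive, file_id, text):
    """Store text under file_id; a later filing with the same id replaces the earlier one."""
    archive[file_id] = text
    return archive


# Hypothetical identifiers and text, for illustration only.
archive = {}
file_article(archive, "DG-0001", "Original article text.")

# Corrected version sent with the SAME identifier: the original is supplanted.
file_article(archive, "DG-0001", "CORRECTION NOTICE. Original article text.")
print(len(archive))

# Corrected version sent with a NEW identifier: both versions remain searchable.
file_article(archive, "DG-0002", "CORRECTION NOTICE. Original article text.")
print(len(archive))
```

The second case is the situation a searcher meets when two versions of one article turn up with nothing to say which is authoritative.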
Timing and translation: ''Four-car crash leaves ...'' appears in the archive with a sentence that concludes ''was critical Saturday night in University Medical Center.'' The newsprint version reads ''was listed as critical ...'' (emphasis added). This is the sort of language a copy editor would have added late on a page proof. Similarly, the daily 1A ''In the news'' feature has ''Santa Ana'' misspelled ''Santa Anna'' although the newsprint edition has the correct spelling. The 471-word archival version of these briefs also includes a sentence that reads: ''Tilmer Everett, 25, was arrested, after Bismarck, N.D., police said the when he was...'' but ''after'' and ''the'' do not appear in newsprint. These errors again indicate archival capture not incorporating final proofreading corrections.
Also, each item in this roundup of briefs begins in newsprint with a black box instead of a paragraph indent, and the boxes are normally translated to paragraph indents in the archive. In two cases here, however, the indents are missing and an ''m'' appears where the items run together. This indicates a computer translation problem.
Little things: ''Buildup in gulf ...'' had two odd paragraph breaks in the archive, one in the middle of a word and the other in the middle of a sentence, that do not appear in newsprint. ''Playing the blues,'' a set of captions for a photo essay, was missing ''(right)'' in one caption, indicating which picture it accompanied. This notation is meaningless in a text-only archive, so perhaps there is no harm in this imperfect match with newsprint.
Good things: ''Medical board allows...,'' ''Henry sent in to give ...'' and ''School board candidates ...'' matched word for word.
Everybody's doing it: A deck headline with ''Medical board allows ...'' flowed right into the main headline with no punctuation or break, making comprehension difficult. Nearly all newspaper archives in this study handled deck heads this way -- and nearly all were confusing.
    Jan. 17, 1997:
One front-page wire service article does not appear in the archive. The
front-page ''In the news'' for this date was not compared.
Page  Partial headline                  Word count
1A    Dumond granted ...                1,537
1A    Issue No. 1 in survey? ...        620
1A    Check for park diamonds ...       687
1A    LRPD has vested interest ...      657
1A    Inauguration: Circumstances ...   1,261
1B    LR squadron leaving ...           748
1B    Gas spill closes ...              548
1B    Ex-police officer guilty ...      615
1B    Weather radar site ...            873
1B    Teleconferencing acquits ...      798
1B    Tax decision for church ...       490
Off with the heads: ''Dumond granted ...'' has a deck headline that runs into the main head in the archive, and the archived deck contains two typos besides: The 'H' in ''Huckabee'' is incorrectly lower case and ''Flanagin'' has become ''Falagin.'' The headline likely was stripped by software at some stage and retyped badly.
Getting better: All other articles compared for this date matched newsprint, including faithful reproduction of some minor miscues. For example, ''Inauguration: Circumstances ...'' includes a reference to ''65-foot flag polls'' and ''LRPD has vested interest ...'' includes a paragraph that ends with a comma instead of a period. It is perhaps a mixed blessing that these also appear in the archive.
    Jan. 18, 1997:
Two front-page wire service articles for this date are not in the archive.
The front-page ''In the news'' feature was not compared. The local section
was not compared.
Page  Partial headline                  Word count
1A    Women lawmakers blast ...         1,048
1A    Victim: Governor, parole ...      637
1A    2nd inaugural lacks ...           1,029
1A    'Hearts and souls' must ...       1,303
Wrong-headed: '' 'Hearts and souls' must ...'' became 'Hearts and sould' in the archive, again indicating a headline inadvertently lost and badly retyped.
Missing something: Two captions archived with ''Women lawmakers blast ...'' and one with '' 'Hearts and souls' must ...'' were missing all apostrophes. This is a computer translation problem difficult to catch without a word-for-word scan.
Better yet: Articles otherwise matched in all particulars.
Corrections published March 11, 12, and 15, 1997, were tracked in Nexis. Searches were conducted in the library news, file arkdem, with search strings of the form DATE IS MM-DD-YYYY AND (identifying text). The identifying text in most cases was a name appearing in the correction. Dates of the original article containing the error were chosen to see if amended versions had supplanted the inaccurate ones. Five corrections on these three days were checked. In all five cases, the original article containing the error was retrieved. Not a single one had a correction notice or any other memo regarding the error attached.
A subsequent search for the corrections themselves, using the dates they were published and the same identifying text that produced the faulty originals, yielded no matches. Not one of the corrections was archived as a separate file, and no corrected versions were found. The search was then expanded to include a range of 10 days after publication of the original, on the chance that an amended file might have appeared later. This, too, produced no matches.
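For readers who want to repeat this kind of check, the search strings can be generated rather than typed by hand. The sketch below assumes the form quoted above, DATE IS MM-DD-YYYY AND (identifying text), with the parentheses kept literally. Because the report does not give the syntax used for the 10-day range, the sketch simply produces one single-date string per day as a stand-in. The function names and the name ''Smith'' are hypothetical.

```python
from datetime import date, timedelta


def search_string(day, identifying_text):
    """Build one search string of the form quoted in the text."""
    return f"DATE IS {day:%m-%d-%Y} AND ({identifying_text})"


def follow_up_searches(original, identifying_text, days_after=10):
    """One string for the original date plus each of the following days.

    Assumption: the report does not show a date-range syntax, so this
    issues a separate single-date search for every day in the window.
    """
    return [
        search_string(original + timedelta(days=n), identifying_text)
        for n in range(days_after + 1)
    ]


# "Smith" is a placeholder, not a name from any of the corrections.
print(search_string(date(1997, 3, 15), "Smith"))
for query in follow_up_searches(date(1997, 3, 15), "Smith"):
    print(query)
```

Running the same identifying text against the original date, the correction date and the days between makes it plain whether an amended file ever reached the archive.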
In one of the March 15 cases, the correction amended the amount owed on a fine. The original article was still available, but an article published the following day also was uncovered by the expanded search range. This follow-up gave the correct figure.
In another March 15 case, the correction amended a description of the contents of House Bill 1109. A check of the previous day's archive for the original article showed that it referred to House Bill 1108. Policy is fine, but performance is faulty.
The search sets were:
    s1=DATE IS 04-03-1997 (165 items)
    s2=s1 AND PG=1A (7 items)
    s3=s1 AND PG=1C (8 items).
Sets 2 and 3 were sent directly to a printer rather than saved on disk.
The communication line took some sort of hit during transmission from Dialog, producing garble and requiring retransmission of items 5-7 of s2. And neither set's final page printed, so add a couple more lines to the list of communications difficulties in our electronic age.
April 3, 1997:
Two 1A items and two 1C items appearing in the search did not come from
the Metro edition used and so were not compared to newsprint. Headlines
appear in all caps in the archive, with deck heads running into main heads
without separators. This complicates comprehension.
The Observer makes good use of the memo field to indicate graphics, photos or info boxes that accompany articles. These notations also specify if that material has not been archived. Because of the mixed platforms, Macintosh-generated graphics often are not archived. The memo field is also used to indicate whether a story appeared in other editions, and if so whether it was on a different page or was of a different length. Keywords or descriptors are used for indexing and appear at the end of the archived article. The library uses a thesaurus of descriptors.
Page  Partial headline                   Word count
1A    Vote could hurt ...                775
1A    One option: Zero ...               1,305
1A    Teens face longer ...              964
1A    Memos show Clinton ...             1,040
1A    Minivan's child seat ...           951
1A    Pentagon forecast sees ...         692
1C    Bill would let people see ...      429
1C    Killer silent; time runs ...       926
1C    6-year-old runs into ...           373
1C    Plates turn staff BMWs ...         590
1C    1960 election ballots found ...    314
1C    STATESVILLE AUTO SHOP ...          77
Taking good notes: ''Teens face longer ...'' includes a deck head from a different edition than the one compared, but the archive specifies which edition is saved. A memo notes that a chart that accompanied the article can only be retrieved from microfilm. The memo also informs searchers that an info box is attached at the end. A memo on the archival version of ''Plates turn staff BMWs ...'' notes that it is a longer version taken from another edition. Except for the additional material, the text matches.
Picture imperfect: A caption archived with ''Memos show Clinton ...'' incorrectly gives the name ''Harold Ickesin.'' The picture was a mug shot of Harold Ickes used with a quote block, and the newsprint text included an identifying phrase following Ickes' name that began with ''in.'' The incorrect name could have come from sloppy deletion of this identifying phrase. Three photos accompanied ''Killer silent; time runs ...'' in the Metro edition, but only two captions appear in the archives.
Mixed platforms, clean copy: All other articles matched letter for letter. During the site visit, librarian Marion Paynter indicated that enhancers check the first and last word of each paragraph to verify validity of the archival capture. This pays off in a clean archive despite a mixed system.
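The first-and-last-word check lends itself to a simple automated aid. The sketch below is an assumption about how such a check could be mechanized, not a description of the Observer's procedure, which the site visit described as work done by enhancers. It assumes both texts are available electronically with paragraphs separated by blank lines; the function names and sample text are hypothetical.

```python
def paragraph_edges(text):
    """Return (first_word, last_word) for each non-empty paragraph.

    Assumption: paragraphs are separated by a blank line.
    """
    edges = []
    for para in text.split("\n\n"):
        words = para.split()
        if words:
            edges.append((words[0], words[-1]))
    return edges


def capture_mismatches(published, archived):
    """Return (paragraph numbers whose edges differ, whether counts match)."""
    pub = paragraph_edges(published)
    arc = paragraph_edges(archived)
    mismatches = [
        number
        for number, (p, a) in enumerate(zip(pub, arc), start=1)
        if p != a
    ]
    return mismatches, len(pub) == len(arc)


# Hypothetical two-paragraph article and a truncated archival copy.
published = "Alpha beta gamma.\n\nDelta epsilon zeta."
archived = "Alpha beta gamma.\n\nDelta epsilon."
print(capture_mismatches(published, archived))
```

A check like this catches truncation, lost breaks and text spliced in at paragraph boundaries, but it will not notice a change in the middle of a paragraph, so it complements a proofreader's eye rather than replacing it.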
Feb. 14, 1997:
Page  Partial headline
1A    Hostage safe, gunman ...
1A    Sundquist 'unaware' ...
1A    March house officially ...
1B    Mayor plans schools ...
1B    Hunting harbingers ...
1B    Singing the praises ...
1B    Sundquist defends ...
1B    Traffic delays cost ...
Dash it all: In all articles, dashes used to set off phrases had been converted into spaces, which made comprehension difficult.
With words unspoken: In ''Traffic delays cost ...,'' square brackets used to insert phrasing parenthetically in a quote were lost in translation. This dangerously puts words in a speaker's mouth.
Troubling picture: Two photo captions with ''Hostage safe, gunman ...'' are run together in the archive, making them hard for a searcher to sort out. The archival version of a caption with ''Sundquist 'unaware' ...'' cuts off after only seven of the 13 words in newsprint.
What's in a name: A photo caption credit on ''Singing the praises ...'' appeared as ''SHELLEY MAYS STAFF'' and one is left to wonder if Staff is the photographer's surname. The use of all caps and the lack of a separator make this especially hard to sort out. The form ''Staff photo by Shelley Mays'' would eliminate any chance of confusion. The byline on this story is handled similarly: ''RAY WADDLE RELIGION EDITOR.'' This is easier to sort out, but easier still would be the form used on ''Sundquist defends ...'': ''DUREN CHEEK Staff Writer.'' Here the use of caps for the name and upper and lower case for the credit provide clarity.
Pardon the interruptions: Line breaks not in newsprint are added in the middle of sentences in the archived versions of ''Mayor plans schools ...'' and ''Traffic delays cost ...'' These breaks appear at the top of the second column of the newsprint articles, where each had a white-on-gray reverse head identifying the story locale as ''Davidson.'' The computer translation for the archive presumably stripped this tag but left a line-end command in its place. On ''Singing the praises ...,'' a quote block in display type on the front page appears in the archive in the middle of the sentence where the article jumped from the front page. After the quote, the article's jumphead appears, followed by the interrupted text. This was confusing even with newsprint right at hand for comparison. Worse, the quote used for the display block follows just three paragraphs after this archival insertion, strengthening the impression that the transmission has been somehow garbled.
The N&O uses descriptors from a thesaurus in enhancing articles for the archives. File headers cite the edition in which the article appeared. Deck heads generally are not included.
Bylines and credits appear in separate fields, eliminating confusion that arises when these run without separators in a single line.
Commitment to quality control and careful software development give Raleigh a generally clean archive, but a few small things indicate the magnitude of the task librarians face: Captions dutifully archived in DataTimes were absent in Nexis, one byline was wrong in both Nexis and DataTimes, and a headline on one article in DataTimes appeared at the end of the file rather than the beginning. Diligence remains the watchword, though: When library staff at The N&O learned of the misplaced headline and mistyped byline, they moved to correct them.
Some headlines are changed in the archive into all caps, while others are upper and lower case. Colons are inserted after subheads in the archive, which helps with comprehension. Since the electronic version does not reproduce newsprint's visual cues of larger, bolder type and centering, the punctuation makes the subhead seem less like a disembodied phrase accidentally inserted.
Dashes, which posed a translation problem at The Tennessean, are accurately reproduced in The N&O archive. Square brackets, also a problem at The Tennessean, are translated into curly braces in The N&O archive. While not a match in typography, this is at least a match in spirit that cannot corrupt a quote that includes parenthetical material.
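The difference between the two papers comes down to whether a character the archive cannot carry is dropped or is mapped to a visible substitute. The sketch below illustrates the substitution approach with a translation table, along with a rough test for insertions that have lost their delimiters altogether. It is not The N&O's software; the table, the function names and the sample quote are hypothetical.

```python
# Hypothetical translation table: characters are mapped to visible
# substitutes instead of being deleted, so an editor's insertion in a
# quote stays marked as an insertion.
ARCHIVE_MAP = str.maketrans({"[": "{", "]": "}"})


def to_archive(text):
    """Translate square brackets to curly braces, leaving all else alone."""
    return text.translate(ARCHIVE_MAP)


def lost_insertions(published, archived):
    """True if newsprint has more opening brackets than the archive has
    opening brackets and opening braces combined."""
    kept = archived.count("[") + archived.count("{")
    return published.count("[") > kept


# Hypothetical quote, not taken from any article in the study.
quote = '"He said [the mayor] agreed."'
print(to_archive(quote))
print(lost_insertions(quote, to_archive(quote)))
print(lost_insertions(quote, '"He said the mayor agreed."'))
```

Counting delimiters is a crude test, but it is enough to flag an article in which bracketed material has been silently absorbed into a speaker's words.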
Notes about any graphics not archived appear at the end of articles, where captions are also attached.
Jan. 17, 1997:
Page  Partial headline                  Word count
1A    N.C. schools' grade ...           1,040
1A    Report assails Gingrich ...       671
1A    Watershed rezoning ...            1,130
1A    Cosby's son shot ...              895
1A    Firm to get postal ...            506
1B    Twice is too much ...             560
1B    Garner man pockets ...            596
1B    Public input on arena ...         907
1B    Arctic chill hits ...             703
Double, double, a bit of trouble: The archival versions matched in DataTimes and Nexis, except that captions with ''Garner man pockets ...'' and ''Arctic chill hits ...'' did not appear in Nexis. The databases were identical down to errors: The archived byline on ''Public input on arena ...'' is ''MATTHEW EISELY,'' though it appears correctly as ''MATTHEW EISLEY'' in newsprint. In ''Twice is too much ...,'' both databases read ''That showed intelligence and understanding, but also that ...,'' where newsprint has ''but it also that ...'' In this same article, an ellipsis that appears in newsprint is absent from both databases. The dropped ellipsis seems to be a computer translation glitch, because ''Report assails Gingrich ...'' also has this problem in the archived versions.
Unclear the deck: A deck head is included with ''Arctic chill hits ...,'' but it runs into the main head without separators: ''Arctic chill hits area Dress in layers, keep head covered, experts say.'' Again, this is difficult to read.
Jan. 18, 1997:
Only DataTimes versions were compared for this date.
Page  Partial headline
1A    Panel urges reprimand ...
1A    Board OKs funding ...
1A    Focus groups becoming ...
1A    Heroes of Brinks money ...
1A    Boxer's one-night paycheck ...
1B    Spreading the word ...
1B    All South owes debt to ...
1B    County supports ...
1B    GTE hustles to restore ...
1B    NCCU cheers author ...
1B    BY MONDAY, ...
Getting braces: Parenthetical insertions in quotes in ''Focus groups becoming ...'' and ''Board OKs funding ...'' appear with square brackets in newsprint and as curly braces in the archive. This ensures that the insertions are not interpreted as the speaker's own words.
Colon-ized: Subheads are made more comprehensible by the insertion of colons in ''Focus groups becoming ...'' and ''Board OKs funding ...''
Lost his head: The headline on ''All South owes debt to ...'' appears at the end of the article.
Different dialects: An information box accompanying ''Focus groups becoming ...'' is appended to the archived version. Black boxes at the start of each item in its list have been translated into dashes, and a line end is missing before the credit and is translated into two tildes. A space separates these characters from the credit, so the result is not hard to read. ''Source:'' in the credit for the info box appears in the archive as ''SRCE.'' An info box appended to ''Panel urges reprimand ...'' also had tildes and dashes instead of a line break and boxes. None of these variations from the original poses a threat to understanding.