Preserving Web history

With 60 Web sites, 20,000 Web pages and approximately 100 page changes per month to manage, you would think that Chris Strout wouldn't dwell on the past. But Strout, Web site manager at Chicago-based insurance brokerage Aon Corp., says that preserving historical Web site information is critical to meeting his company's regulatory obligations.

"We've had some compliance issues with the SEC where they've said, 'Information we're looking for is not on the site. Where is it? Has it been on the site in the past?' " he says. Using TeamSite, a content management tool from Sunnyvale, Calif.-based Interwoven Inc., Strout says, he can show where the requested content appeared at a given time -- and how users navigated to it.

Regulatory compliance is just one reason to maintain access to historical Web site information, corporate archivists say. Information in Web archives can also provide critical evidence to protect a company in legal matters, or allow the marketing department to look back at previous online marketing efforts to see how the company presented itself and its products over time.

Unfortunately, in many organizations, Web site content -- including the original context, look and feel -- is disappearing into oblivion.

"This could be a period that is relatively undocumented, given the amount of information that's out there," laments Bruce Bruemmer, corporate archivist at Cargill Inc. in Wayzata, Minn. "We're going to just lose a lot of information."

Part of the problem is complexity: How do you archive continually changing Web sites with thousands of pages that include active content and dynamically generated page elements? Many organizations avoid that question and just try to get the basics. A simple tool like Adobe Systems Inc.'s Acrobat can encapsulate static Web page content and maintain active hyperlinks within searchable Portable Document Format (PDF) images. On the high end, content management tools from companies such as Pleasanton, Calif.-based Documentum Inc. can provide more detailed snapshots of previous Web site content. Interwoven's TeamSite virtualization engine can even re-create historical application servers, JavaServer Pages, Extensible Stylesheet Language (XSL) style sheets and other code. But IT must weigh the cost of such systems, which can easily run into six figures.

Getting the Basics

Bruemmer takes a piecemeal approach to archiving. "Right now, I'm in the sticks and stones era," he says. "If there's Web content I want to capture, I'll capture it and move it off-line on CD-ROMs as PDFs."

He recommends establishing policies for determining what content should be archived and for how long. But applying such policies is difficult because, like many companies, Cargill has different groups managing its many Web sites. "That sort of stymies a uniform approach to capturing the Web sites," he says.

Becky Haglund-Tousey, archives manager at Northfield, Ill.-based Kraft Foods Inc., began thinking about Web site archiving in 1996, with the launch of the Kraft Kitchen interactive Web site. "Once we moved to the Web age, a lot of records were no longer created in hard copy, so we have to deal with them in this format," she says. She doesn't rely on administrators to archive Kraft's 40 to 50 product-specific Web sites, which include Jell-o.com and Nabisco.com. Instead, two employees use Adobe Acrobat to create quarterly PDF images of the static Web site content, and Haglund-Tousey stores those images on CD-ROMs. The process is time-intensive, she says, so she limits PDF captures to the first three levels of the Web site. "We don't capture down to every level, because [then] you're capturing a lot of redundant information," she says.
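The three-level limit can be pictured as a breadth-first walk of the site that stops following links once it reaches the deepest captured level. The sketch below is illustrative only and is not how Acrobat or Kraft's staff perform the capture; the site map, the function name pages_within_depth and the choice to count the home page as level 1 are all assumptions.

```python
from collections import deque


def pages_within_depth(links, start, max_depth=3):
    """Return pages reachable from `start` within `max_depth` levels, breadth-first."""
    seen = {start}
    order = [start]
    queue = deque([(start, 1)])  # assumption: the home page counts as level 1
    while queue:
        page, level = queue.popleft()
        if level == max_depth:
            continue  # do not follow links below the deepest captured level
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, level + 1))
    return order


# Hypothetical site map: each page maps to the pages it links to.
SITE = {
    "/": ["/products", "/recipes"],
    "/products": ["/products/desserts"],
    "/recipes": ["/recipes/holiday"],
    "/products/desserts": ["/products/desserts/flavors"],
}

print(pages_within_depth(SITE, "/"))
```

In this made-up site map, the fourth-level page /products/desserts/flavors is left out of the capture list, which mirrors the trade-off Haglund-Tousey describes between completeness and effort.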

Acrobat converts text, page links and some graphics but will only reference Flash, Java or dynamically generated content, which must be stored separately. "I would like to have a product that would capture application-driven and interactive elements," acknowledges Haglund-Tousey. But for now, she says, PDFs capture enough of the look and feel of the original Web site.

Acrobat allows searching across PDFs for embedded content. It uses a format called Extensible Metadata Platform, or XMP, to embed basic metadata elements, such as the creator's name and the creation date, with each file. Acrobat also supports XMP extensions to allow user-defined fields, although Kraft doesn't use this function. (An Adobe spokesperson acknowledges that the way to apply extensions in the product is "not obvious" and says the company is working to correct that.)

Haglund-Tousey considered more sophisticated -- and expensive -- content management systems before settling on a US$249 version of Acrobat. However, she says, her main concern wasn't price. "A lot of the content management software packages out there aren't designed for long-term preservation of site content," she says.

Going for Broke

Strout sees things differently. Aon, a $7 billion insurance brokerage firm, was forced into thinking through content management issues in order to comply with regulatory disclosure rules that require some Web content to be preserved for up to seven years. Interwoven's TeamSite provides access to archival Web content as a logical extension to the content creation and management process.

The system allows Aon to apply metadata tags that describe content in detail as it creates and maintains more than 20,000 pages. As pages move off the live sites, they're still available on a staging server, where TeamSite's virtualization engine allows system users to view previous Web site content. Strout uses TeamSite to take a snapshot of the Web sites four times per day. "At the click of a button, we're able to go back a year," Strout says. Well, almost. Aon stores the underlying content and its Web page design properties separately. "It takes some work to marry those together," he says.
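Answering a question like "what was on the site at a given time?" from periodic snapshots comes down to finding the most recent snapshot taken at or before that moment. The following is a minimal sketch of that lookup, not TeamSite's interface; the function name snapshot_at, the six-hour schedule and the dates are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def snapshot_at(snapshot_times, when):
    """Return the latest snapshot time at or before `when`, or None if there is none.

    `snapshot_times` must be sorted in ascending order.
    """
    index = bisect_right(snapshot_times, when)
    return snapshot_times[index - 1] if index else None


# Hypothetical schedule: four snapshots in one day, six hours apart.
SNAPSHOTS = [datetime(2003, 5, 1, hour) for hour in (0, 6, 12, 18)]

print(snapshot_at(SNAPSHOTS, datetime(2003, 5, 1, 14, 30)))  # 2003-05-01 12:00:00
```

A request for mid-afternoon content resolves to the noon snapshot, so with four snapshots a day a change could go unrecorded for up to six hours under this assumed schedule.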

Strout acknowledges that a copy of Acrobat is considerably less expensive than a content management system. But he sees PDFs as inadequate. "Having a snapshot of the site allows us to re-create not just the content but also the user experience, and that sometimes is more important," he says.

Direct access to the archive is limited, however. Of Aon's 53,000 employees, only about 100 need access to the system. Thirty people have direct access; others must make requests. "We can send them a virtual link, and they can view that," Strout says.

Strout also claims that vendor lock-in isn't an issue because Aon's Web site metadata is stored in properties files built using XML. "If we had to, we could take it out of Interwoven," he says.
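Strout's portability argument rests on the fact that XML can be read by general-purpose tools. As a rough illustration, the sketch below parses a made-up properties file with Python's standard library alone. The element names, attribute names and values are invented for the example and do not reflect Interwoven's or Aon's actual file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical properties file; the schema here is an assumption, not TeamSite's.
PROPERTIES_XML = """
<properties page="/investors/annual-report.html">
  <property name="owner">Investor Relations</property>
  <property name="retain-years">7</property>
</properties>
"""


def read_properties(xml_text):
    """Flatten a properties document into a plain dictionary of strings."""
    root = ET.fromstring(xml_text.strip())
    metadata = {"page": root.get("page")}
    for prop in root.findall("property"):
        metadata[prop.get("name")] = prop.text
    return metadata


print(read_properties(PROPERTIES_XML))
```

Once the metadata is in a neutral structure like this, loading it into another system is a matter of mapping field names, although the page content and design properties would still have to be migrated separately.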

Still, the rapid obsolescence of Web application software and file formats leaves some people wondering how well these tools will work in the long term. "I don't know that [the content management vendors] have addressed the 10-year problem, as opposed to the two-year problem," says Susan Feldman, an analyst at Framingham, Mass.-based IDC, adding that, over time, scalability may also become an issue.

But given the rapid pace of change in Web site content, Haglund-Tousey doesn't think companies can wait for a perfect solution. She says that while she'd like to have a system that would capture all Web site content, PDFs capture the basics, and "we're able to do that now."

Coping With Web Obsolescence

For corporate archivists, the Web brings a new kind of planned obsolescence. Where manufacturers were once accused of designing products that wore out too quickly in order to sell more of everything from new cars to washing machines, it's now the rapidly changing technological underpinnings of Web sites that threaten to render archival information unusable.

"Are we going to know what a TIFF file is in 20 years?" asks Darrell Delahoussaye, manager of collaborative systems at San Francisco-based Bechtel Corp. In addition, vendor-controlled file formats such as PDF, Flash and Windows Media could evolve in ways that may force time-consuming conversions of older files to maintain readability.

"Not only are the formats changing," adds IDC's Susan Feldman, "but [so are] the platforms on which they run." That includes everything from JavaScripts to the Web server software and hardware that a site runs on.

"At some point, you hit the roadblock of, Do I maintain all the components or make the transfer to a new environment?" says Delahoussaye. That expense is likely to force hard choices, so it's important to appraise what's important, he says.

Delahoussaye has been down that road before, with word processing documents that migrated from Wang format to WordPerfect and finally to Word. While the text was preserved, he says, some things probably got lost in the translation. But for important content, he says, companies may have no choice but to convert files.
