American Printer's mission is to be the most reliable and authoritative source of information on integrating tomorrow's technology with today's management.

File Formats: What in the World is XML?

May 1, 1998 12:00 AM

Lately we have been hearing a lot about file formats such as PDF, TIFF/IT and XML. While many perceive each of these as the panacea for all prepress ills--and the path to untold riches--the reality is that they are just different file formats, each with a different focus. In past issues, we have covered TIFF/IT and PDF, so now it is time to take a closer look at XML.

In order to better understand each of these file formats, and specifically what XML brings to the table, let's start with a brief explanation of file formats and how they affect what we do.

For all intents and purposes, there are two types of file formats--native application formats and intermediate transfer formats. Most file formats are the "native application" variety, existing in a language used to describe the output of a specific software application.

For example, a document created in Microsoft Word has its own format, which is different from a document created in WordPerfect. While they are both word processor software packages and have a common core of "standard" ASCII text, each file format includes many application-specific commands, because each package handles styles, formatting and other unique features in its own way.

While you may be able to open a file created in one application in another, that usually depends on a filter or translator that interprets the differences. So, most of the time these differences may not be a problem, at least until one of the software companies decides to upgrade its file format. Then you can run into a problem in which the filter or translator for that specific application no longer works. Of course this can cause all sorts of headaches in a production environment, and it is one reason why many organizations try to establish common format standards.

Some of these standards do exist, and more are coming, but that won't necessarily affect the Microsoft Word/WordPerfect conversion problem or the Quark/PageMaker, etc. conversion issues. Applications will always have their own file format since that is usually what allows them the "value added" that differentiates products.

Transfer formats, on the other hand, are specifically designed with a level of compatibility across applications and perhaps even platforms.

Examples of these formats in our prepress world are EPS and TIFF and, now, TIFF/IT and PDF. Each of these formats is designed to import files from one application into another and/or as a standalone transfer or output format.

The actual information represented in these formats is usually raster or vector images. However, as we move into an era in which there are a lot of files needing to be managed, we need to look at another potential file attribute-- "meta data," or the data that describes what is represented in that file.

For example, you could have a file of a red car that was used in a specific job. That file has a name that may or may not reflect its actual description (such as CT46982, or redcar.tif). However, that file may need to be used in other projects, but based on a different premise.

For example, the red car may be a Porsche, or perhaps it is a convertible, or a four-door or a 1998 model. Each of these attributes may be a valid way to describe that file. Depending on the use and the person looking, one of these attributes may be a good way to find the file or image. But let's take it further--how about other information such as the price, availability or shipping information, etc.

Many of you have already had to look at these types of file descriptions and file management in your own businesses as a way to keep track of work or as a way to offer data management services to customers. However, most of these asset management systems are designed to handle only your data and its corresponding meta data. Should you want to move a file to another database (at your customer's site, for example), you will only be able to transfer the image files. The meta data in your database is not easily transferable or compatible with the other systems.

XML (Extensible Markup Language) has the potential to address this problem. The roots of XML go back to SGML (Standard Generalized Markup Language), the file format used for a lot of technical document publishing over the past decade.

What made SGML distinctive was its ability to define a standard set of "tags" to identify specific text attributes used in a document. This allows an SGML document to be formatted properly even when the page size and structure change. The kinds of tagged information that SGML supports are titles, subtitles, paragraphs, footnotes, emphasis, etc. However, SGML has become a complex specification. Its creators have added so many bells and whistles that SGML has become unwieldy. This problem is one of the main reasons SGML has not gained a broader range of acceptance.

HTML (Hypertext Markup Language) is an application of SGML. HTML is being used for the World Wide Web (WWW) because of its ability to readily display text and images, as well as link information across platforms and operating systems.

HTML's tagset and use have also grown in presentation-related applications. The fact that HTML has been selected as the basic document format for the Web, combined with its simplicity, has led to its overwhelming success. In concept, HTML is to Web pages what PDF is to printed pages--it is concerned with describing how a page should look, but describes almost nothing about the contained information itself. Therefore, HTML is poorly suited to information exchange. About all you can do with an HTML file is display it.

That's where XML comes in. XML has an "extensible tag set": it enforces a defined tagging syntax without defining the specific tags, which could limit the usage. The XML creators have taken 90 percent of the power of SGML and combined it with 10 percent of its complexity. They also added Internet-related features. XML not only offers a more flexible and open way of tagging objects, it also keeps the information with the object, not in a specific database. This allows objects to be identified and shared with any system that recognizes XML tags.
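As a rough illustration of the "extensible tag set" idea, the sketch below uses the XML parser in Python's standard library to read a small record describing the red car image from earlier. The tag names (asset, make, body_style and so on) and the values are invented for this example; XML itself does not define them. The point is only that a parser accepts whatever tags you choose as long as the syntax is well formed, and rejects a record whose tags do not nest properly.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record for the red car image; the tag names are
# made up for illustration and are not part of any standard.
RECORD = """
<asset file="redcar.tif">
  <color>red</color>
  <make>Porsche</make>
  <body_style>convertible</body_style>
  <model_year>1998</model_year>
</asset>
"""


def read_metadata(xml_text):
    """Return the file name and a dict of tag -> text for one asset record."""
    root = ET.fromstring(xml_text.strip())
    fields = {child.tag: child.text for child in root}
    return root.get("file"), fields


def is_well_formed(xml_text):
    """True if the text obeys XML syntax rules, whatever tags it uses."""
    try:
        ET.fromstring(xml_text.strip())
        return True
    except ET.ParseError:
        return False


file_name, metadata = read_metadata(RECORD)
print(file_name, metadata)

# The closing </make> tag is missing, so the parser refuses the record.
print(is_well_formed("<asset><make>Porsche</asset>"))
```

Because the descriptions travel inside the record itself, any system with an XML parser could read them without access to the database that produced them.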

This acceptance has been forthcoming from most of the major vendors including Microsoft, Apple, Adobe, Bitstream, Oracle and Sun. XML has been incorporated into Microsoft Internet Explorer 4.0, and Netscape is planning to support it in Communicator. XML is currently going through the process to designate it as a standard format, which will most likely drive its usage across many application solutions.

The potential that XML offers is very exciting. Not only will it allow users to more readily share database objects and corresponding meta data with customers, it also opens the door for a wider use of shared database information.

Well, all of that is very interesting, but as a printer, why should you care? Many of you are already sharing objects with customers, albeit in most cases without the meta data--but it works. What else can XML offer?

Imagine your client is a catalog publisher, a magazine publisher or a corporate client with products and services to sell. These customers have found that increasingly they need to publish and distribute their products in paper and on the Web. Not only that, but they have also found that, as a result of the increasing on-demand mentality of our culture, they need to target their message and add immediacy. In order to satisfy these needs, they need either a large staff constantly building pages targeted to print or the Web or, ideally, a way to automatically assemble these pages on demand.

Ultimately, the common thread and driving force for significant growth of the Web and "variable data, on-demand printing" is database publishing. These pages will need to be automatically assembled based on the target audience and media used (print or display).

XML has the potential to allow that information to reach the targeted audience and media through the identifications specified in its tags. The information for this type of publishing can be collected from many disparate databases using XML tagged objects, then assembled into a single document.
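To make the idea of collecting tagged objects from several sources a little more concrete, here is a minimal sketch, again in Python. The two strings stand in for records exported from two separate databases; the tag names, the id value, the price and the audience label are all hypothetical. Records that share an id are merged into one item inside a single document that carries the target audience and medium as attributes.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources standing in for separate databases.
PRODUCT_DB = (
    '<item id="CT46982"><name>Red convertible</name>'
    "<image>redcar.tif</image></item>"
)
PRICING_DB = (
    '<item id="CT46982"><price currency="USD">0.00</price>'
    "<availability>in stock</availability></item>"
)


def assemble(audience, medium, *sources):
    """Merge records that share an id into one <document> element."""
    doc = ET.Element("document", audience=audience, medium=medium)
    merged = {}  # id -> the <item> element already placed in the document
    for text in sources:
        item = ET.fromstring(text)
        item_id = item.get("id")
        target = merged.get(item_id)
        if target is None:
            target = ET.SubElement(doc, "item", id=item_id)
            merged[item_id] = target
        # Copy every tagged field from this source onto the merged item.
        target.extend(list(item))
    return doc


doc = assemble("collectors", "print", PRODUCT_DB, PRICING_DB)
print(ET.tostring(doc, encoding="unicode"))
```

A real system would also have to reconcile conflicting fields and differing tag vocabularies between sources, which this sketch ignores.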

Applications such as PageFlex (being developed by Archetype, the applications division of Bitstream) create the necessary tools to take advantage of this potential. PageFlex's architecture is based on the concept of separating a document's form from its content. That means it handles the fonts, colors, layout and other design elements separately from the text, images and graphics. The form aspects are encoded into an "intelligent template" that captures the general design intent of the document. The text, images and graphics are XML tagged with their relevant meta data.

Once the template has been created (it can be created for print, Web delivery or multiple variations of each), the content can be automatically assembled into the designated "hot spots" within the template to create a unique document. This allows for variable pages with variable data and personalization, if desired.
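The separation of form and content can be sketched in a few lines of Python. This is not how PageFlex works internally, which the article does not describe; it is only an illustration using the standard library's string templates, with made-up tag names and content. Each template holds the "form" for one medium, and the $name, $image and $blurb placeholders play the role of hot spots that are filled from the XML-tagged content.

```python
from string import Template
import xml.etree.ElementTree as ET

# Content: an XML-tagged record (hypothetical tag names and text).
CONTENT = (
    "<item><name>Red convertible</name><image>redcar.tif</image>"
    "<blurb>Top down, 1998 model.</blurb></item>"
)

# Form: one template per medium; the $placeholders act as hot spots.
TEMPLATES = {
    "print": Template("HEADLINE: $name\nPLACE IMAGE: $image\nBODY: $blurb"),
    "web": Template('<h1>$name</h1><img src="$image"><p>$blurb</p>'),
}


def render(content_xml, medium):
    """Fill the chosen medium's template from the tagged content."""
    fields = {child.tag: child.text for child in ET.fromstring(content_xml)}
    return TEMPLATES[medium].substitute(fields)


print(render(CONTENT, "print"))
print(render(CONTENT, "web"))
```

The same content record produces two different outputs, and swapping in a different record produces a different page from the same template.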

This type of tool will further drive cross-media targeted publishing. It opens up the doors to a closer collaboration between printers and their customers, with the potential for many new types of products and services.

Database publishing, in general, will play an important role in the future of publishing. XML will, once fully accepted and implemented, help drive its growth. XML has just recently been formally accepted as an official specification of the World Wide Web Consortium (W3C). Its use is being implemented in many new applications such as Web browsers, as well as many database systems. Applications such as PageFlex, which take advantage of this type of data, will increasingly be developed over the next few years. This isn't necessarily technology you can implement today, but you can start to prepare.

Ensure that either your current or future database vendor will support XML. Start to look at the way you set up your meta data for the objects in your asset management database. Ensure that you have thought through the logical identification of your objects, not only for present use, but also for future use. If you are unsure of how to go about this, there are resources available that can assist you in this important task.
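One simple way to start thinking through your meta data is to decide which descriptive fields every object must carry and then check your records against that list. The sketch below assumes a hypothetical house rule (description, make and model year are required) and a hypothetical record layout; your own fields will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rule: every asset record must carry these fields.
REQUIRED_FIELDS = ("description", "make", "model_year")


def missing_fields(record_xml, required=REQUIRED_FIELDS):
    """List the required fields that are absent or empty in a record."""
    root = ET.fromstring(record_xml)
    present = {child.tag for child in root if (child.text or "").strip()}
    return [name for name in required if name not in present]


record = '<asset file="CT46982"><make>Porsche</make><model_year/></asset>'
print(missing_fields(record))
```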

Work with your customers and their data on this important task, since it can help in developing a closer and stronger relationship for the future. So while XML may not be ready for full implementation in your day-to-day processes, it is definitely something that you should keep on your radar screen.

The W3C, or World Wide Web Consortium, is an organization created in 1994 by Tim Berners-Lee, the "father" of the Web, as the "unofficial" guardian of browser standards. The organization's membership reads like a Who's Who of computing, including companies such as Hewlett-Packard, Netscape, Sun Microsystems and Microsoft, to mention a few. While its members don't have any official responsibility for setting and maintaining standards, they collectively carry enough weight that they are usually able to steer the industry in their determined direction.