What is a tagged document?

A tagged document is one that ‘separates information and structure from presentation’ by the use of tags.

By means of this separation:

  • authors and editors determine the information and structure of the document
  • authors, publishers and designers can efficiently control the physical presentation of hard-copy versions and the default presentation of tagged electronic versions
  • readers of tagged electronic versions can readily control the presentation to suit their requirements.

By contrast, the presentation of an untagged document relies on visual formatting to indicate structure. Appearance alone is used to signal headings. Assistive technologies will see these headings and other structural elements as default paragraphs (p tags). In WCAG terminology, the structural elements of the document cannot be ‘programmatically determined’ by the technology, and therefore the technology cannot present that information in accordance with the reader’s requirements. The presentation of these elements can’t be overridden by the reader’s custom style sheet, they can’t be announced or listed as headings by screen readers, and they can’t be found by a tag search.
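To make ‘programmatically determined’ concrete, here is a minimal sketch using Python’s standard html.parser module. The sample markup and the names HeadingFinder and find_headings are illustrative assumptions, not part of any assistive technology. The sketch simply shows that software can list headings only when they are tagged as headings.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingFinder(HTMLParser):
    """Collects (tag, text) pairs for every heading element it meets."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # the heading tag we are inside, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, "".join(self._buffer).strip()))
            self._current = None


def find_headings(markup):
    """Return the headings that software can detect from the tags alone."""
    finder = HeadingFinder()
    finder.feed(markup)
    finder.close()
    return finder.headings


# Hypothetical sample content: the same text, tagged and untagged.
TAGGED = "<h1>Annual report</h1><p>Sales rose.</p>"
UNTAGGED = (
    '<p style="font-size: 2em; font-weight: bold">Annual report</p>'
    "<p>Sales rose.</p>"
)

print(find_headings(TAGGED))    # the h1 is found
print(find_headings(UNTAGGED))  # empty: the text only looks like a heading
```

In the second sample the title is visually prominent, but nothing in the markup identifies it as a heading, so there is nothing for the software to announce, list or search for.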

There are multiple tagged formats, and tagging in itself does not mean that a document is accessible. Tagged formats are only useful if readers have the relevant assistive technologies to interact with those formats.

It’s useful to understand how tags operate in the three most common accessible file formats, and also to understand the ecosystems that currently exist for these formats:

  • HTML
  • PDF
  • MS Word and RTF.

HTML

HTML is the only one of the three formats that is strictly a tagged format. The tagging features of the other formats are more idiosyncratic, as I’ll explain below.

An HTML file is a text-only file that clearly separates text (information) from the structure (tags). The tags and their purpose are defined by HTML standards.

When this text-only file is opened in a browser, it will only display the text, not the tags. The tags will be used by the browser to set up the presentation in accordance with the browser’s default style sheet. These defaults will reflect the tag’s function (e.g. the h1 tag will use a larger font size than the h2 tag).
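The separation described above can be sketched in a few lines of Python, again with the standard html.parser module. The Separator class, the separate function and the sample markup are hypothetical names used only for illustration. A browser does a great deal more than this, but the principle is the same: the tags and the text arrive as distinct things.

```python
from html.parser import HTMLParser


class Separator(HTMLParser):
    """Splits a document into its structure (tags) and its information (text)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Ignore whitespace-only runs between elements.
        if data.strip():
            self.text.append(data.strip())


def separate(markup):
    parser = Separator()
    parser.feed(markup)
    parser.close()
    return parser.tags, parser.text


# Hypothetical sample file content.
SAMPLE = "<h1>Tagged documents</h1>\n<p>Tags carry <em>structure</em>.</p>"

tags, text = separate(SAMPLE)
print(tags)  # the structure: h1, p, em
print(text)  # the information, with no tags in it
```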

Typically, publishers of HTML pages will append a style sheet to their pages. Style sheets define the presentation attributes of the structure tags. The style sheets override the browser’s default presentation and ensure that the pages reflect the publisher’s branding and that they look relatively similar in all browsers.

In turn, an individual reader of an HTML page can apply their own style sheet to override both the browser’s default style sheet and the publisher’s appended style sheet. For example, a reader with low vision can specify large font sizes for all text, and their preferred colours for headings. A reader with dyslexia can specify their preferred font for all text.

Because all HTML documents use the same palette of tags, a reader can set up a personal style sheet once and apply that style sheet as the default for all HTML pages they read.
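The layering of style sheets described above can be modelled as a simple merge in which later sheets win. This is a deliberately simplified sketch, not the real CSS cascade: the property values are invented, and in actual browsers a reader’s rules usually need to be marked !important to take precedence over the publisher’s rules.

```python
# Each style sheet is modelled as {tag: {property: value}}.
# All values below are invented for illustration.
BROWSER_DEFAULT = {
    "h1": {"font-size": "2em", "color": "black"},
    "h2": {"font-size": "1.5em", "color": "black"},
    "p": {"font-size": "1em", "color": "black"},
}
PUBLISHER = {
    "h1": {"color": "navy", "font-family": "Georgia"},
    "h2": {"color": "navy"},
}
READER = {
    "h1": {"font-size": "3em", "color": "yellow"},
    "p": {"font-size": "1.5em"},
}


def resolve(tag, *sheets):
    """Merge the declarations for one tag; sheets listed later win."""
    resolved = {}
    for sheet in sheets:
        resolved.update(sheet.get(tag, {}))
    return resolved


# Order follows the article: browser default, then publisher, then reader.
for tag in ("h1", "h2", "p"):
    print(tag, resolve(tag, BROWSER_DEFAULT, PUBLISHER, READER))
```

Because the reader’s sheet is keyed by tag rather than by document, the same sheet can be reused for any page that uses the standard tags.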

Characteristics of the HTML format and ecosystem

  • It is an outstanding general-purpose tagging language
  • There are outstanding authoring tools to suit all budgets
  • Documents can be updated easily, and global changes can be made quickly, even across thousands of documents
  • Accessibility is a highly integral aspect of the language
  • Accessibility is highly integrated into the work practices of corporate web development teams
  • Browsers are free and available on desktop and mobile devices
  • Browsers offer strong support for custom style sheets
  • Assistive technologies generally offer better support for HTML than for any other document format.

PDF

The PDF file format began life with its focus firmly on presentation. Even information sometimes took a back seat, since in the early days the PDF format was commonly used to publish scanned documents which contained no readable electronic text. Before 2001, there was no electronic mechanism for indicating structure in a PDF document.

In view of this background, it’s remarkable that Adobe was able to transform the PDF into a viable accessible format. To do so, it didn’t really overturn its original paradigm; instead, the tagged PDF format functions as a document within a document. This is the core idiosyncrasy of tagged PDFs.

What does this mean in practice?

The first thing to note is that, like HTML, the tagged PDF format has its own palette of tags, which is more limited than the HTML palette. PDF does not support character-based span tags such as strong/bold and italic/emphasis tags, hence these formats are not announced by screen readers even though they appear in the visual layout. (Note, however, that screen reader users often prefer not to have these tags announced when they read HTML or MS Word documents.)

Programs that export a tagged PDF, for example InDesign and MS Word, in effect export two documents – the visual document and the tagged document. These authoring programs have protocols for tagging the document’s contents when the file is exported to PDF. They also provide additional mechanisms that allow the user of the authoring program to control how elements are tagged – InDesign much more so than MS Word.

It is not the content of the visual document that determines the content of the tagged PDF, but a combination of the content, the formatting, and the export setup. Without the correct setup, particularly in InDesign, it is entirely possible and even likely that the tagged PDF will match neither the visual layout nor the correct structure.

Authoring programs do not export the full palette of PDF tags, so if some of the unsupported tags are required in the PDF, they need to be added manually in Adobe Acrobat. Strange things can happen at this point, because the connection between the tagged PDF and the visual PDF is tenuous. As tags are edited to fix structures that could not be set in the authoring program, nothing changes in the visual layout. Tag editing offers the opportunity to get the tagged PDF right, but it equally has the potential to move the tagged PDF further adrift from the visual layout and intended structure if it’s not done correctly. The visual layout provides no indication that the tags are incorrect.

So, is a tagged PDF accessible in the same way an HTML file is? No, it isn’t.

As explained above, a reader can customise the presentation of an HTML file. By means of a custom style sheet, they use their browser as an assistive technology to display the document to suit their requirements.

There are no PDF readers available that can customise the presentation of a tagged PDF. Adobe Reader offers Reflow view and some colour display options, but these features are poorly developed and they are offshoots of the visual PDF, not the tagged PDF.

Accordingly, the tagged PDF format is currently of most relevance for screen reader users. Compatible assistive technologies use the tags to read the structure and provide navigational features similar to those available in other tagged formats.

Characteristics of the PDF format and ecosystem

  • The tagging language is quite good but inferior to HTML
  • There are no authoring programs currently capable of generating all the required tags
  • The best authoring program, InDesign, is expensive and has a steep learning curve
  • Setting up an InDesign file for exporting tags can be time-consuming if the layout is complex
  • Exported PDFs often require some postprocessing, and sometimes significant postprocessing, in Adobe Acrobat
  • Tags and content are difficult and in many cases impossible to edit within Adobe Acrobat
  • Documents can be difficult and time-consuming to update
  • The tagged PDF is not inherently linked to Acrobat’s two visual views (layout and reflow view)
  • Accessibility is poorly integrated into the work practices of corporate and freelance graphic designers
  • Many publishers of PDFs rely on the inadequate Adobe Acrobat accessibility checker as the only ‘proof’ that a PDF is accessible
  • Many published PDFs that are claimed to be accessible are in fact not accessible, and are often seriously deficient in other respects (for example, the reading order is scrambled)
  • Many users of assistive technologies are wary of the PDF format because they have experienced difficulty with PDFs erroneously claimed to be accessible
  • Accessibility support for PDF documents on mobile devices is poor
  • Assistive technologies are not available to apply custom style sheets to a PDF document
  • Free and cheap assistive technologies generally offer weaker support for PDF than for HTML – however, fully featured (usually expensive) assistive technologies do offer good support for PDF.

Over time, we will be posting articles discussing some of these issues.

MS Word and RTF

Microsoft Word refers to the word processing program as well as two file formats (.doc and .docx). The program is not free, but there are a number of free programs that can read the file formats. In practice this makes MS Word as open a format as RTF, and for that reason I generally refer only to the more commonly used MS Word format.

MS Word is not obviously a tagged format, but the fact that an MS Word file can readily be resaved as an HTML file indicates there is an implicit tagged structure in MS Word documents. For example, paragraph styles can be regarded as de facto tags, with headings functioning as h1, h2 etc., lists functioning as ol and ul, and most other styles functioning as the default p tag.
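The idea of paragraph styles as de facto tags can be sketched as a small lookup. The function name style_to_tag is hypothetical, and this is not how MS Word’s own HTML export is implemented. The style names are MS Word’s built-in ones. Capping Heading 7 to Heading 9 at h6 is an assumption made here, because HTML defines only six heading levels.

```python
import re

# Built-in list styles and the HTML list element each one implies.
LIST_STYLES = {"List Bullet": "ul", "List Number": "ol"}


def style_to_tag(style_name):
    """Return the HTML tag that a paragraph style stands in for."""
    match = re.fullmatch(r"Heading ([1-9])", style_name)
    if match:
        level = int(match.group(1))
        # Assumption: HTML stops at h6, so deeper headings are capped there.
        return f"h{min(level, 6)}"
    if style_name in LIST_STYLES:
        return LIST_STYLES[style_name]
    # Most other styles behave like the default paragraph tag.
    return "p"


for name in ("Heading 1", "Heading 2", "List Bullet", "List Number", "Normal"):
    print(name, "->", style_to_tag(name))
```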

Compatible word processors can function as assistive technologies for MS Word files. Readers can use magnification and display options, customise style sheets and, in the case of .docx format, apply a custom theme in MS Word.

Like HTML and tagged PDF, MS Word provides a standard palette of de facto tags via the built-in default paragraph styles (Heading 1 to Heading 9, Normal, Body Text, List Bullet, List Number, Footnote Text and many more). Each paragraph style is assigned an outline level, so for each Heading X the outline level is Level X, while for other styles the outline level is Body text.

Users can and often do create additional paragraph styles, and they can designate the outline level of these styles. Long documents may use dozens of paragraph styles, usually a combination of default styles and user-defined styles. Readers who wish to take control of the presentation of such a document can theoretically do so by editing the style definitions, but in most cases this is not practical. This is a consequence of MS Word not using a standard and finite palette of styles.

Accordingly, an accessible MS Word file should ideally use a restricted palette of styles, consisting of Headings 1 to X as required, a single style for all body text, a minimum number of styles for lists, and additional styles only used for structural purposes such as captions, footnotes, hyperlinks and so on. Wherever possible, default style names should be used. There are hazards involved with using default style names, but these can be addressed by the correct setup of templates.
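A restricted palette is easy to check mechanically once the list of styles used in a document is known. The sketch below assumes that list has already been extracted by some other means. The allowed palette and the sample style names are hypothetical and would differ for each organisation’s template.

```python
import re

# Hypothetical restricted palette: headings plus a few structural styles.
ALLOWED_STYLES = {
    "Normal",
    "List Bullet",
    "List Number",
    "Caption",
    "Footnote Text",
    "Hyperlink",
}
HEADING_PATTERN = re.compile(r"Heading [1-9]")


def unexpected_styles(styles_used):
    """Return, in alphabetical order, the styles outside the palette."""
    return sorted(
        name
        for name in set(styles_used)
        if name not in ALLOWED_STYLES and not HEADING_PATTERN.fullmatch(name)
    )


# Hypothetical list of the style applied to each paragraph of a document.
STYLES_USED = [
    "Heading 1", "Normal", "Normal", "Intro Para Bold",
    "List Bullet", "Heading 2", "Body Indent 2", "Caption",
]

print(unexpected_styles(STYLES_USED))  # the two user-defined styles
```

A check of this kind could flag documents that have drifted away from the template before they are published.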

When a restricted palette of standard styles is used, it becomes viable for readers to apply their own style sheet.

This is a very important point not only for online web documents but also for intranet documents and documents shared by email. When organisations use accessible templates for in-house documents, it becomes feasible for an individual staff member to apply their custom style sheets or themes in order to read and work with these documents comfortably and productively.

Fully featured screen readers provide strong support for the MS Word format. However, even some of the best screen readers do not fully support outline levels. For example, a paragraph style might be designated as Level 1, but the screen reader will not treat it as an h1 in the way it treats the default style Heading 1.
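The gap can be illustrated with a toy model. This is not how any particular screen reader works. It only contrasts two possible ways of recognising a heading: by the built-in style name, or by the outline level. The ParagraphStyle class, both functions and the style called Chapter Title are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParagraphStyle:
    name: str
    outline_level: Optional[int]  # None stands for 'Body text'


def heading_level_by_name(style):
    """Recognise a heading only when it uses a built-in Heading X name."""
    match = re.fullmatch(r"Heading ([1-9])", style.name)
    return int(match.group(1)) if match else None


def heading_level_by_outline(style):
    """Recognise a heading from the outline level, whatever the name."""
    return style.outline_level


BUILT_IN = ParagraphStyle("Heading 1", outline_level=1)
CUSTOM = ParagraphStyle("Chapter Title", outline_level=1)  # user-defined

for style in (BUILT_IN, CUSTOM):
    print(
        style.name,
        heading_level_by_name(style),
        heading_level_by_outline(style),
    )
```

Under the name-based approach the user-defined style is not recognised as a heading, even though its outline level says it is one. This is one reason to prefer the default style names wherever possible.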

Because MS Word is an authoring program, it provides a large number of keyboard shortcuts for getting around the document. These shortcuts are well known by many blind readers and they integrate well with screen reader navigational features.

Characteristics of the MS Word format and ecosystem

  • It is not actually a tagged format, but assistive technologies can perceive it as a tagged format to a considerable degree
  • It is by far the most commonly used authoring program for corporate documents
  • In general, good practice for layout purposes equates to good practice for accessibility purposes
  • As a result, it is feasible (but not easy) for an organisation to implement good accessibility practices throughout the organisation, even among staff who are not working in the area of publishing
  • Without organisational commitment in this area, poor layout and formatting practices are likely to prevail, and will result in inaccessible documents
  • Documents are easy to update
  • MS Word format is preferred over PDF by many users of assistive technologies (partly because many published PDFs are either untagged or poorly tagged)
  • As a powerful authoring program, MS Word has a large number of keyboard shortcuts, many of which are of great help in reading and navigating a document efficiently
  • Accessibility support for MS Word documents on mobile devices is poor
  • It is significantly more difficult to apply custom style sheets to an MS Word document than to an HTML document
  • Free and cheap assistive technologies generally offer weaker support for MS Word than for HTML – however, fully featured (usually expensive) assistive technologies do offer outstanding support for MS Word
  • Readers who do not have MS Word and instead use free software to read MS Word files will generally not experience the same degree of accessibility using assistive technologies as users who do have MS Word.