PDF Files - the Essentials for Electronic Publishing

Overview

The widely used Portable Document Format (PDF) for page-based documents was introduced by Adobe in 1993. It is based in large part on the PostScript page description language for printers, which had been developed in the late 1970s/early 1980s. In 2008 Adobe made the PDF specification available as a royalty-free open standard, and it was adopted by the International Organization for Standardization as ISO 32000-1 (referred to here as PDF32000). A very good summary of the history and technical aspects of PDF files is provided on Wikipedia. The sections below provide more practical information based on our own experience of using PDFs and PDF tools over the past decade.

One of the most important features of PDF documents is that they are defined by a PAGE BASED model - this describes how individual pages in the document are made up, in terms of the text, the fonts used, graphical objects, interactive elements and possibly other features associated with the page. This PAGE BASED model means that when you look at a PDF page on-screen or on printed output, it should always look the same, as specified by the designer. This is completely different from formats such as ePUB and HTML, which are not page based - they are effectively a linear stream of items, one after another, with limited "layout" elements (ePUB3 and HTML5 have improved on this of course, but they are still very much flexible, flowable formats). The two approaches have been designed independently, with a major aim of formats like ePUB being to allow the text to be the dominant element, re-sizable and re-flowable, ignoring the page concept and focusing on the size and orientation of the device on which it is viewed. ePUB, with its variants and versions, is the most widely used format for reading eBooks on mobile devices, including Nook and other specialized ebook reader devices (Amazon Kindle devices use Amazon's own related formats).

Page size and orientation

Some comments should be made at this stage about page size. Because PDF files are page-based, the underlying page size (or effective page size) matters - both for successful output to print devices and for reading on-screen. The great majority of documents that are converted into PDF format are based on A4 or US Letter page size in Portrait orientation. These have a total height of 11+ inches (US Letter is 11 inches, or about 279mm; A4 is 297mm, or about 11.7 inches). If a computer screen displays at 100 pixels per inch, the screen would need a resolution of at least 1100 pixels vertically to display the page at full size, and even then it would be very difficult to read typical text at, say, 12 point size. Most computer displays, including all laptops, are less than 11 inches high, so any entire page will be shrunk to fit the available screen space. Technical drawings can be even more difficult to decipher as they often include small fonts and fine lines. This does not present a serious problem for print output, as long as the printing is carried out at 300dpi (dots per inch) or more, preferably 600-2400+ dpi.

There are a number of solutions to this problem, the most obvious of which is zooming. Most PDF readers support immediate zooming to page width, which increases the displayed font size dramatically so that the text is easy to read. The downside is that the page then has to be scrolled downwards, and maybe sideways, in order to read the entire text on a page. This in turn has some impact on the ease with which the user can read the document, particularly if it has many pages. For reference works, where small sections are referred to, this is not really a problem, but for fully reading and absorbing a long document, it is a limiting factor.

The alternative to zooming the page is for the source document to be arranged (designed to fit) on a smaller page size (e.g. A5 or one-half US Letter) and/or a landscape orientation used, and/or for the document to be created with larger font sizes.
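
The arithmetic behind these page size comments can be sketched in a few lines of Python. This is only an illustration: the function names and the 768-pixel screen height are our own assumptions, and real rendering also depends on the reader software and the font used.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # 1 point = 1/72 inch

# Portrait page heights in inches (A4 is 297 mm, A5 is 210 mm, US Letter is 11 in).
PAGE_HEIGHTS_IN = {
    "A4": 297 / MM_PER_INCH,
    "US Letter": 11.0,
    "A5": 210 / MM_PER_INCH,
}


def pixels_for_full_page(page_height_in, screen_ppi):
    """Vertical pixels needed to show the whole page at its true physical size."""
    return page_height_in * screen_ppi


def fitted_text_height_px(page_height_in, screen_height_px, font_pt):
    """Nominal pixel height of font_pt text once the page is shrunk to fit the screen height."""
    pixels_per_page_inch = screen_height_px / page_height_in
    return font_pt / POINTS_PER_INCH * pixels_per_page_inch


print(pixels_for_full_page(PAGE_HEIGHTS_IN["US Letter"], 100))  # 1100.0

# Hypothetical 768-pixel-high screen showing a whole page of 12 point text:
for name in ("US Letter", "A5"):
    px = fitted_text_height_px(PAGE_HEIGHTS_IN[name], 768, 12)
    print(name, round(px, 1))  # about 11.6 for US Letter, about 15.5 for A5
```

On these assumptions, the same 12 point text gains roughly a third in on-screen height simply by moving from US Letter to A5, which is the effect the smaller page size recommendation relies on.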

Recommendations for successful web-based PDF viewing (from Mozilla.org)

The following recommendations are from the Mozilla organization, which is responsible for the Firefox web browser, combined with some of our own recommendations. See also the page "Optimize a PDF" on Adobe's website for more information. The improvement techniques we suggest are:

  • Avoid using high resolution images - 150 dpi resolution for scanned images should be enough for screens
  • Try to use JPEG encoding for color images/photos in RGB colorspace when possible
  • Avoid using compositions/effects such as transitions/masking - flatten transparency
  • Avoid using PDF generators (or do not create content) that can produce poorly structured PDF output (e.g. LibreOffice and several other PDF creators produce lots of tiny images for vector elements/pictures that they do not understand)
  • If there is such a setting, use web-optimized PDF output / linearization
  • Fix, or avoid producing, corrupted PDFs that do not conform to the PDF32000 specification - aim for compatibility with Adobe Reader 7 (PDF 1.6) or earlier
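
As a rough way of checking the web-optimized (linearized) recommendation above, the following Python sketch looks for the /Linearized key that a linearized PDF carries in its first object, near the start of the file. The function name is our own, and this is a heuristic only: it does not confirm that the linearization data is still valid (for example after a later incremental save).

```python
def looks_linearized(pdf_bytes):
    """Heuristic: report whether /Linearized appears within the first 1024 bytes."""
    return b"/Linearized" in pdf_bytes[:1024]


# Skeleton byte strings for illustration only (not complete PDF files).
web_optimized = b"%PDF-1.6\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
plain = b"%PDF-1.6\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(looks_linearized(web_optimized))  # True
print(looks_linearized(plain))          # False
```

For a real file, read only the start of it, e.g. open(path, "rb").read(1024), and pass the result to the function.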

Another common issue with PDF creation is the use of implicit rather than explicit URLs. Suppose you include www.adobe.com as text in your source document and then output the file to PDF. Some PDF readers, including the Adobe Reader, will "guess" that you meant this to be a web address and will act on it accordingly. However, the item will not be highlighted as a link (which is the case for web browsers also, as can be seen from www.adobe.com not showing up as a link) and for many PDF viewers on different technology platforms the result will be no action at all. The solution is to make the link explicit rather than implicit (when you want it to be a link instead of just text). You can do this by selecting the text in the source file and then telling the editor to add a hyperlink at that point. This is like ensuring that a link in text like "Click here" actually specifies where the resulting click should take you. Adobe Acrobat includes a facility to automatically convert implicit URLs into explicit links.
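
Before exporting, it can help to scan the source text for implicit web addresses so that each one can be turned into an explicit hyperlink. The Python sketch below is one simple way of doing this. The pattern, the function names and the choice of https:// as the default scheme are our own assumptions, and the pattern will miss addresses written without www. or a scheme.

```python
import re

# Assumption: an "implicit" address is a bare www.* host or an http(s) URL in plain text.
IMPLICIT_URL = re.compile(r"\b(?:https?://|www\.)[^\s<>()]+", re.IGNORECASE)


def find_implicit_urls(text):
    """Return candidate web addresses with trailing sentence punctuation removed."""
    return [m.group(0).rstrip(".,;:!?") for m in IMPLICIT_URL.finditer(text)]


def to_explicit(url):
    """Add a scheme (https:// is assumed) so the address can be used as a link target."""
    if url.lower().startswith(("http://", "https://")):
        return url
    return "https://" + url


text = "Suppose you include www.adobe.com as text in your source document."
for address in find_implicit_urls(text):
    print(address, "->", to_explicit(address))
# www.adobe.com -> https://www.adobe.com
```

The output is a list of link targets to apply in the source document's editor. The sketch does not itself modify any PDF.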

Video and audio media files can be embedded in some Adobe-specific PDFs, but should not be embedded for general use: they will not work across platforms and readers, and the resulting PDF files are generally huge and unsuitable for downloading. We recommend creating video and audio files in MP3 or MP4 format and placing them on an in-network server for user access. Then place links to these files within the PDF, so that they are linked rather than embedded. The resulting PDF file will work on all platforms and situations, and will remain small, even where the media files are large.

PDF file formats

Note - this section is based on materials published by Adobe and others - links are provided to original technical reference material below. When creating a PDF file, many software packages offer you the option of choosing a particular version of PDF, and for most purposes choosing PDF version 1.5, 1.6 or 1.7 is best. These versions will be readable by all modern PDF readers, not just the Adobe Reader. Many of the advanced features of Adobe PDFs, such as JavaScript, audio and video embedding, and 3D model support, are very specific to Adobe, so will not work on most other readers. Even form-filling and markup, which were introduced into the standards many years ago, are not supported by many PDF readers. This applies not just to offline readers on desktops, laptops and mobile and tablet devices, but also to almost all online PDF viewers that are supported by the main web browsers (Chrome, Firefox, Safari etc).

Basic structure of a PDF file

(source: Adobe, 2004)

The general structure of a PDF file is composed of the following code components: header, body, cross-reference (xref) table, and trailer, as shown in figure 1.

Figure 1. Basic structure of a PDF file

The header contains just one line that identifies the version of PDF. For example, %PDF-1.4 is the first line of the testfonts.pdf file. If you add the two digits of the version number, e.g. 1.4 -> 1+4, you get 5, which is the version of Adobe Reader needed to view a document in that version of PDF. So version 1.6, which is probably the last overall "standard" version that is most widely used, requires Adobe Reader V7 or later (or other PDF readers that handle PDF version 1.6). The trailer contains pointers to the xref table and to key objects contained in the trailer dictionary. It ends with %%EOF to identify end of file. The xref table contains pointers to all the objects included in the PDF file. It identifies how many objects are in the table and, for each object, where it begins (its byte offset), its generation number and whether it is in use. The body contains all the object information - fonts, images, words, bookmarks, form fields, and so on.
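
The header and trailer described above can be inspected with a short Python sketch. The function names are our own, and the sample bytes are a hand-made skeleton rather than a complete, viewable PDF. The startxref keyword, which is not mentioned in the text above, is the line in the trailer section that holds the byte offset of the xref table.

```python
import re


def pdf_version(pdf_bytes):
    """Return the version from the %PDF-x.y header line, or None if it is missing."""
    m = re.match(rb"%PDF-(\d)\.(\d)", pdf_bytes)
    if m is None:
        return None
    return m.group(1).decode() + "." + m.group(2).decode()


def min_adobe_reader(version):
    """Apply the 'add the two digits' rule of thumb from the text (PDF 1.x only)."""
    major, minor = version.split(".")
    return int(major) + int(minor)


def xref_offset(pdf_bytes):
    """Return the byte offset given after the last 'startxref' keyword, or None."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", pdf_bytes)
    return int(matches[-1]) if matches else None


# Skeleton for illustration: header, one object, xref table, trailer.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)

print(pdf_version(sample))                    # 1.4
print(min_adobe_reader(pdf_version(sample)))  # 5
print(xref_offset(sample))                    # 45
```

In the skeleton, each xref entry gives a 10-digit byte offset, a 5-digit generation number and an n (in use) or f (free) flag, and the offset 45 after startxref is the position at which the xref keyword begins.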

The following links provide access to the technical specifications - these are large files, often 30MB+

Save and Save As

When you perform a Save operation on a PDF file, the new, incremental information is appended to the original structure (see figure 2); that is, a new body, xref table, and trailer are added to the original PDF file.

Figure 2. Structure of a PDF file after updates
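
Because each Save appends a new body, xref table and trailer, and every trailer ends with %%EOF, counting those markers gives a rough indication of how many times a file has been incrementally saved. The Python sketch below illustrates the idea. The function name is our own, and the count is only approximate: it can be inflated, for example by linearized files or by stream data that happens to contain the marker.

```python
def count_eof_markers(pdf_bytes):
    """Count %%EOF markers; a file saved incrementally n times typically has n + 1."""
    return pdf_bytes.count(b"%%EOF")


# Skeleton byte strings for illustration only (not complete PDF files).
original = b"%PDF-1.4\n...body...\nxref\n...\ntrailer\n...\nstartxref\n45\n%%EOF\n"
updated = original + b"...new body...\nxref\n...\ntrailer\n...\nstartxref\n120\n%%EOF\n"

print(count_eof_markers(original))  # 1
print(count_eof_markers(updated))   # 2
```

A "Save As" in most editors rewrites the file as a single section, which is one reason it is often used to tidy up a PDF that has been saved many times.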

PDF Creation software

In the past, creating PDF files meant purchasing Adobe Acrobat software from Adobe Inc, and even today Acrobat is one of the best and most powerful software packages for the creation and amendment of PDF files. However, there are many other ways of generating a PDF, typically as an Export option within major document creation software products. This applies to the full set of current MS Office and OpenOffice applications - whether Word, Excel, PowerPoint etc or their equivalents in OpenOffice. The OpenOffice export to PDF option is very fast and produces good quality PDF files, but has almost no options for specifying attributes of the generated file. MS Office Export to PDF does include a range of options, with the most relevant being those that create structured PDF files and files that are optimized for screen or print. MS Word files that include Heading styles can be set to create a Contents tree automatically (also known as an Outline or Bookmark tree), which is very important for fast navigation of larger documents, particularly when viewed on mobile devices. Recommended "options" settings are shown below:

Word export to PDF

All modern desktop publishing software products also generate PDF files, with preset options for print production and in some cases, for screen viewing. Amongst the best and most widely used of these are Adobe's InDesign software for PCs and Macs and QuarkXPress for Macs. Example "options" settings for InDesign (CS4 in this example) are shown below. Note that for fast display of bookmarks on all tech platforms it may be necessary to convert InDesign "named destination" bookmarks to explicit page references (Evermap tools, see below, include an autobookmark facility for this):

CS4 export to PDF

Mac computers will create PDF files from almost all appropriate applications, including the basic Pages and office-related facilities included as standard with OS X. Similar facilities exist within Linux. In addition to the above options, there are many "print drivers" which will create a simple PDF as output from any desktop application under Windows, just by printing the material to a specially installed printer - in this case, a non-physical print device. The result is a PDF with very little functionality, but usable for many basic applications.

Recommended links:

  • Adobe - for Acrobat and InDesign software
  • A-PDF - for software from A-PDF (affordable PDF Tools), including an excellent watermarking tool
  • jPDFBookmarks - free PDF bookmarking tool
  • CutePDF - for software from CutePDF (PDF print driver/writer and editor)
  • PDF Creator Plus - from Peernet
  • Setasign - PHP software for creating and augmenting web-based PDFs
  • Evermap - providers of a wide range of PDF tools

PDF Editing software

As explained earlier, PDF files have quite a complex structure, and there are many variants in terms of the standards applied and the way in which these standards have been implemented. In fact, some would argue that there are almost no real standards because there is so much variation in implementations. The result is that the structure of any particular PDF file can be extremely complicated, and thus very difficult to amend (edit) post creation. It is, of course, possible to perform a range of functions which come under the general heading of editing, i.e. are not simply the amendment of text on a page. These features include changing the content of specific pages; changing the use of particular fonts; cleaning/optimizing files to remove duplications and poor structure; splitting and combining PDF files; extracting and inserting pages; saving the content in alternative formats (other PDF standard variants or completely different formats, such as Rich Text (RTF)); and more. In the case of Adobe Acrobat, which is the most widely used PDF Editor, the standard editing tools are arranged into groups of functions:

  • Content editing - includes adding and editing text and images, plus adding hyperlinks and bookmarks
  • Page manipulation - includes rotation, deletion, splitting, watermarking, style changes etc
  • Forms and button management - includes adding fields for text/data input and related functions
  • Text recognition, i.e. OCR functionality, mainly used for conversion of scanned-in files to text and numbers where possible
  • Document processing - which includes a range of functions, from aspects of page layout to page numbering and auto-identifying web content and URLs (converting implicit URLs into explicit URLs)
  • Additional tools are also provided, either built-in or downloadable from Adobe's website

Other PDF Editors provide similar functionality, with their own take on the most important aspect of their usage - for example, the Infix editor from Iceni is much closer to a Word Processor style of editor than Acrobat. Likewise, Foxit's PDF Editor software provides a very wide range of functionality, similar to the features described above, at a very competitive price. And there are innumerable online PDF conversion and editing websites, most of which provide basic functionality as a free service with more advanced features on a subscription basis. Software providers include:

  • Adobe - for Acrobat and InDesign software
  • Iceni - PDF Editor provider
  • Foxit - PDF Editor and Reader provider
  • PDF Architect - Tailorable PDF Editor provider

PDF Online Conversion services

There are many online PDF conversion and enhancement services, some built using the Setasign PHP software mentioned earlier. Examples include:

  • VeryPDF - PDF conversion and watermarking
  • Small PDF - Conversion to/from PDF, Merge/Split, Basic security (lock/unlock)