Generating PDF/A compliant PDFs from pdftex
Revision as of 23:17, 23 November 2008
Introduction
This page describes the steps necessary to create PDF/A compliant PDFs with pdftex, and related issues. When we compile a latex document with pdftex, a few issues can prevent the result from being pdf/a compliant, such as:
- problems with fonts:
  - font files are not embedded,
  - mismatch of character widths,
  - characters of zero width,
  - fonts don't have a ToUnicode mapping;
- problems with metadata:
  - XMP data not included,
  - XMP data don't match the document info set with \pdfinfo;
- problem with interword spacing: pdftex doesn't use space characters to separate words in pdf output;
- problem with color data.
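Two of the issues above can often be addressed directly in the preamble. The following is only a sketch and not part of the examples on this page: it assumes that the file glyphtounicode.tex is available in the tex installation, and the profile file name sRGB.icc is a placeholder for whatever ICC profile is actually on disk. Whether a given validator requires either step for a particular document should be checked with the tool.

<geshi lang="latex">
%***************************************************************
% ToUnicode mappings (assumes glyphtounicode.tex is installed)
%_______________________________________________________________
\input{glyphtounicode}   % table mapping glyph names to Unicode
\pdfgentounicode=1       % let pdftex write ToUnicode CMaps for Type 1 fonts

%***************************************************************
% output intent for color data
% sRGB.icc is a hypothetical file name; substitute a real profile
%_______________________________________________________________
\immediate\pdfobj stream attr{/N 3} file{sRGB.icc}
\pdfcatalog{%
  /OutputIntents [ <<
    /Type /OutputIntent
    /S /GTS_PDFA1
    /DestOutputProfile \the\pdflastobj\space 0 R
    /OutputConditionIdentifier (sRGB IEC61966-2.1)
    /Info (sRGB IEC61966-2.1)
  >> ]
}
</geshi>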
The usual way to verify whether a pdf file is pdf/a compliant is to use a validation tool. There are a few pdf/a checking tools; the most common one is the Preflight tool in Acrobat Professional version 8 or newer. Beware that these tools can give very different results for the same pdf: a file that passes pdf/a compliance checking in acrobat 8 can still fail a check by another tool. In this document, we assume the following:
- input files are latex documents
- tex live 2008 (which includes pdftex version 1.40.9) is used for latexing
- Acrobat 8.0 is used for pdf/a validation
We start with a minimal example, and then move to more complex ones, to illustrate the issues one may encounter when trying to achieve pdf/a compliance.
A minimal example
Let's have a minimal document hello.tex that looks as follows:
<geshi lang="latex">
\documentclass{report}
\begin{document}
Hello, world!
\end{document}
</geshi>
When we compile it with pdflatex and check for pdf/a compliance, we will get a report like this:
So it looks like our pdf is missing metadata. To fix this, we make a copy of hello.tex named hello-pdfa-1b.tex that looks as follows: <geshi lang="latex">
\documentclass{report}
%****************
% define metadata
%________________
\def\Title{An Example Document}
\def\Author{Some Name}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}
%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
    \getYear
}
{\catcode`\D=12
 \gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
    \edef\xTZm{#1#2}%
    \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}
\expandafter\convertDate\pdfcreationdate
%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}
%*********
% XMP data
%_________
\usepackage{xmpincl}
\includexmp{pdfa-1b}
%********
% pdfInfo
%________
\pdfinfo{%
  /Title (\Title)
  /Author (\Author)
  /Subject (\Subject)
  /Keywords (\Keywords)
  /ModDate (\pdfcreationdate)
  /Trapped /False
}
\begin{document}
Hello, world!
\end{document}
</geshi>
Some notes on the example:
- it uses the latex package xmpincl to include XMP data in the pdf;
- it assumes there is a file pdfa-1b.xmp in the current directory. That file is included in this zip
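One limitation of the date conversion is worth noting: \getTZh expects a literal + in front of the time zone offset, so the conversion stops with an error when \pdfcreationdate reports a negative offset such as -05'00'. A possible variant, given here as a sketch rather than as part of the original example, captures the sign as an argument instead. The macro name \xTZs is introduced only for illustration, and an offset written as Z is still not handled. These lines would replace the definitions of \getTZh and \getTZm and the conversion call; the \message line is an optional check.

<geshi lang="latex">
% #1 is the sign (+ or -), #2#3 are the hours of the offset
\def\getTZh#1#2#3{\edef\xTZs{#1}\edef\xTZh{#2#3}\getTZm}
\def\getTZm '#1#2'{%
    \edef\xTZm{#1#2}%
    % \xTZs (hypothetical helper) carries the sign into the result
    \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec\xTZs\xTZh:\xTZm}%
}
\expandafter\convertDate\pdfcreationdate
% optional check: write the converted date to the log
\message{XMP date: \convDate}
</geshi>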
When we check the resulting pdf using acrobat 8, we get this report:
With a little more effort, we can make our example pass pdf/a-1a checking:
- use pdftex with a patch available at http://sarovar.org/tracker/index.php?func=detail&aid=945&group_id=106&atid=495
- make a copy of hello-pdfa-1b.tex named hello-pdfa-1a.tex and make the following changes:
  - replace
<geshi lang="latex">
\includexmp{pdfa-1b}
</geshi>
by
<geshi lang="latex">
\includexmp{pdfa-1a}
</geshi>
  - add the following code:
<geshi lang="latex">
%*************************
% explicit interword space
%_________________________
\pdfmapline{+dummy-space <dummy-space.pfb}
\pdfgeninterwordspace=1
</geshi>
Compile the file with patched pdftex, and we should get this report from checking:
Another trivial example
Let's apply what we did above to another example: small2e.tex, which is part of the standard latex distribution.
- We put all the additional latex code in a file called pdfa-supp.tex, which looks as follows:
<geshi lang="latex">
%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
    \getYear
}
{\catcode`\D=12
 \gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
    \edef\xTZm{#1#2}%
    \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}
\expandafter\convertDate\pdfcreationdate
%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}
%********
% pdfInfo
%________
\pdfinfo{%
  /Title (\Title)
  /Author (\Author)
  /Subject (\Subject)
  /Keywords (\Keywords)
  /ModDate (\pdfcreationdate)
  /Trapped /False
}
%*************************
% explicit interword space
%_________________________
\expandafter\ifx\csname pdfgeninterwordspace\endcsname\relax
\message{\string\pdfgeninterwordspace\space not supported by this version of pdftex}
\else
  \pdfmapline{+dummy-space <dummy-space.pfb}
  \pdfgeninterwordspace=1
\fi
</geshi>
- let's add these lines to small2e.tex to make it pass the pdf/a-1b check:
<geshi lang="latex">
\def\Title{An Example Document}
\def\Author{Leslie Lamport}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}
\input{pdfa-supp}
\usepackage{xmpincl}
\includexmp{pdfa-1b}
</geshi>
- to pass the pdf/a-1a check, we change
<geshi lang="latex">
\includexmp{pdfa-1b}
</geshi>
to
<geshi lang="latex">
\includexmp{pdfa-1a}
</geshi>
and compile the file with pdftex including the patch mentioned above.
The result should be the same.
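As a side note on pdfa-supp.tex, the version string is derived by simple arithmetic: \pdftexversion holds the version as an integer (140 for pdftex 1.40) and \pdftexrevision holds the revision as a string. Assuming pdftex 1.40.9 as in tex live 2008, the computation can be traced as follows; the \message line is an addition for illustration only and is not part of pdfa-supp.tex.

<geshi lang="latex">
\newcount\countA
\countA=\pdftexversion      % 140 for pdftex 1.40.x
\advance \countA by -100    % 140 - 100 = 40
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}
% with \pdftexrevision equal to 9, this writes pdfTeX-1.40.9 to the log
\message{\pdftexVersionStr}
</geshi>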
A less trivial example
Now let's move on to sample2e.tex, which is another sample that is part of the latex distribution.