Friday, 3 June 2011

Compressing PDF files

If you have got hold of a PDF file that consists of lots and lots of images and nothing else, it may well be huge if the images are not compressed. You can fix this at the command line in Ubuntu, as follows.

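You will need pdftk, ImageMagick (which provides convert) and poppler-utils (which provides pdfimages):

sudo apt-get install pdftk imagemagick poppler-utils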

First you need to unpack the images from the PDF. Start in a blank directory, because the later commands automatically act on every file in the directory that has a particular extension.

pdfimages /path/to/filename.pdf imageroot

You get lots of:

imageroot-[three digit number].somethings

These somethings are either ppm or pbm files. These are very, very basic uncompressed image dumps, like a bmp image. pbm is used for pure black and white (texty) images, and ppm for colour (imagy) ones. I therefore use these extensions as wildcards in the next command, but you will need to replace them with whatever image types the command actually generated if your results are different, for example if you choose to output jpeg files by using the [-j] option.
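To see which file types the extraction actually produced before picking your wildcards, something along these lines could help. It is only a sketch: the list_extensions name is made up for illustration, and it assumes the files are named imageroot-number.extension as described above.

```shell
# Hypothetical helper (not part of pdfimages): count the extracted files
# per extension so you know which wildcards the next command needs.
list_extensions() {
    root="$1"
    for f in "$root"-*.*; do
        [ -e "$f" ] || continue      # skip the unexpanded pattern if nothing matched
        printf '%s\n' "${f##*.}"     # keep only the text after the last dot
    done | sort | uniq -c
}

list_extensions imageroot
```

Each output line is a count followed by an extension, so anything other than ppm or pbm in that list is a type you would need to add to the wildcards.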

Next you compress each image to a pdf, one page long. This will not work properly if you did not start with a blank directory, because the command acts on EVERY file in this directory with the extensions produced by the last command.
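If you are not sure the directory really was blank, a quick check like the one below lists anything that does not belong. This is a sketch under the assumption that every extracted file starts with your image root followed by a dash; stray_files is a made-up name, not a standard command.

```shell
# Hypothetical helper: print every file in the current directory whose
# name does not start with the given image root.
stray_files() {
    root="$1"
    for f in *; do
        [ -e "$f" ] || continue          # empty directory: nothing to report
        case "$f" in
            "$root"-*) ;;                # expected output of pdfimages
            *) printf '%s\n' "$f" ;;     # anything else is a stray
        esac
    done
}

stray_files imageroot
```

If this prints nothing, only the extracted images are present and the wildcard commands will not touch anything else.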

For colour sources, you need jpeg compression:

for file in *.{ppm,pbm}; do convert -compress jpeg -quality 50 "$file" "${file%.???}.pdf"; echo "$file"; done

That applies the commands between the [do] and the [done] to every [file] whose extension is one of the alternatives in the braces [{,}], here [ppm] or [pbm]. The command in the middle [convert]s each image file into a pdf file using [jpeg] [compress]ion at [quality] [50]. You can change the quality value to get the best results for your source material. The result is written to a file whose name is built from the input [file] variable [$] with its three-character [???] extension [.] stripped off [%], plus the characters [.pdf]. The [echo] just prints each filename as it is processed, so you can make sure it is doing something if you are processing LOTS of files.
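The filename trick is easier to see on its own. Here is a small illustration using made-up filenames; nothing in it touches real files.

```shell
file="imageroot-000.ppm"
# %.??? removes a trailing dot plus exactly three characters, if present.
out="${file%.???}.pdf"
echo "$out"                  # imageroot-000.pdf

# A four-letter extension does not match .??? so nothing is stripped.
other="scan.jpeg"
echo "${other%.???}.pdf"     # scan.jpeg.pdf
# %.* strips an extension of any length instead.
echo "${other%.*}.pdf"       # scan.pdf
```

So the [???] form is fine for ppm and pbm, but [.*] is the safer pattern if your image types have longer extensions.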

For black and white sources you need fax compression:

for file in *.{ppm,pbm}; do convert -alpha off -monochrome -compress Group4 -quality 100 "$file" "${file%.???}.pdf"; echo "$file"; done

This is basically the same approach as last time, just with a change to the [convert] command. This time the [jpeg] stuff is gone, and we have the [-alpha off -monochrome -compress Group4 -quality 100] bit instead. I can't get the quality setting to do anything here, which is probably because Group4 is a lossless scheme with no quality level to adjust. Group4 refers to the particular flavour of fax compression (CCITT Group 4) that is applied.
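If a document mixes colour and black and white images, one possible refinement is to choose the compression per file from its extension. Treat this as a sketch rather than a tested recipe: compress_opts is a made-up helper, and it assumes that every pbm file really is pure black and white and that everything else should be treated as colour.

```shell
# Hypothetical helper: print the convert options for one extracted image,
# based only on its extension.
compress_opts() {
    case "$1" in
        *.pbm) echo "-alpha off -monochrome -compress Group4" ;;  # black and white
        *)     echo "-compress jpeg -quality 50" ;;               # colour
    esac
}

for file in imageroot-*.ppm imageroot-*.pbm; do
    [ -e "$file" ] || continue    # skip a pattern that matched nothing
    # $(...) is deliberately unquoted so the options split into separate words.
    convert $(compress_opts "$file") "$file" "${file%.???}.pdf"
    echo "$file"
done
```

The output names still come from the original numbers, so the order in which the loop visits the files does not affect the page order later.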

Finally we take all of those individual pdfs and combine them into one big one. Again, this will not work properly if you did not start with a blank directory, because EVERY pdf that matches the search string (the imageroot*.pdf bit, where * means anything) is copied into the final pdf. The classic error here is making your image root name too similar to your original pdf name, with the result that the built pdf incorporates the original - hardly reducing the file size!

pdftk imageroot*.pdf cat output name_of_final_file.pdf

That uses the [pdf] [t]ool[k]it program to take every [*] file that starts with [imageroot] and ends with [.pdf] and con[cat]enate them into the [output] file named [name_of_final_file.pdf].
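To check that the rebuilt pdf really is smaller than the original, you can compare the two sizes. A minimal sketch follows; size_of and report_sizes are made-up names, and the two paths are the same placeholders used in the commands above.

```shell
# Hypothetical helper: print the size of one file in bytes.
size_of() {
    wc -c < "$1" | tr -d ' '
}

# Hypothetical helper: print the original and rebuilt sizes, one per line.
report_sizes() {
    echo "original: $(size_of "$1") bytes"
    echo "rebuilt:  $(size_of "$2") bytes"
}

# Only report when both files are actually present.
if [ -f /path/to/filename.pdf ] && [ -f name_of_final_file.pdf ]; then
    report_sizes /path/to/filename.pdf name_of_final_file.pdf
fi
```

If the rebuilt file turns out larger, lowering the jpeg quality value is the obvious thing to try first.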

