You used the split feature on your PDF file (10 pages, 1MB) and the result is 10 PDF files of almost 1MB each… Something must be wrong, it should have been 10 PDF files of roughly 100kB each, right? Wrong.

How PDF works

Oversimplifying, a PDF page is just a set of draw operations, something like "draw this line here", "write this text there using this font" or "draw this image here". Each page has resources attached to it in something called the Resources Dictionary, a bucket of resources (fonts, images...) used by the page, so when the operation "draw this image here" is met, the image is looked up in the page's Resources Dictionary.

Now let’s imagine the same company logo appears on each page. Does this mean each page has its own duplicated logo image in its Resources Dictionary? Of course not: each Resources Dictionary can point to the same image resource, which is shared among these pages. We could even go a bit further and have every page point to the same Resources Dictionary which, in this scenario, would act as a document-wide bucket of resources containing all the images and fonts used in the document.
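The two sharing layouts described above can be pictured with plain Python objects. This is only a toy model: the resource names and sizes are made up for illustration, and a real PDF uses indirect object references rather than Python object identity.

```python
# Toy model: a resource is stored once, however many pages point to it.
# Names and sizes (kB) are illustrative assumptions, not real PDF data.
logo = {"type": "image", "size_kb": 300}
font = {"type": "font", "size_kb": 600}

# Layout 1: every page has its own Resources Dictionary,
# but all of them point to the same logo and font objects.
pages_own_dict = [{"resources": {"Logo": logo, "F1": font}} for _ in range(10)]

# Layout 2: every page points to one document-wide Resources Dictionary.
shared_dict = {"Logo": logo, "F1": font}
pages_shared_dict = [{"resources": shared_dict} for _ in range(10)]


def stored_resource_kb(pages):
    """Sum the size of each distinct resource object exactly once."""
    seen = {}
    for page in pages:
        for resource in page["resources"].values():
            seen[id(resource)] = resource["size_kb"]
    return sum(seen.values())


print(stored_resource_kb(pages_own_dict))
print(stored_resource_kb(pages_shared_dict))
```

In both layouts the logo and the font are stored only once in the document, no matter how many pages use them.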

What happens when we split

In our example we are splitting the 1MB PDF into 10 new files. Each of the 10 new files must contain all the resources needed to draw its page, meaning that each file must have its own copy of the company logo, the fonts and all the other resources. This explains why the files, even if they are just one page each, are almost the same size as the original 10-page document: they still need all the fonts and images that made up most of the original 1MB.
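A back-of-the-envelope version of this arithmetic, as a sketch. The split between shared resources and per-page content below is an assumption chosen to add up to roughly 1MB, not a measurement of any real file.

```python
# Illustrative numbers only (kB): resources shared by all pages,
# plus a small amount of draw operations per page.
SHARED_RESOURCES_KB = {"logo": 300, "font_regular": 350, "font_bold": 250}
PAGE_CONTENT_KB = 10


def original_size_kb(page_count):
    # Shared resources are stored once, page content once per page.
    return sum(SHARED_RESOURCES_KB.values()) + page_count * PAGE_CONTENT_KB


def split_sizes_kb(page_count):
    # Every one-page file carries its own copy of all the resources.
    per_file = sum(SHARED_RESOURCES_KB.values()) + PAGE_CONTENT_KB
    return [per_file for _ in range(page_count)]


print(original_size_kb(10))    # size of the 10-page original
print(split_sizes_kb(10)[0])   # size of each one-page file
print(sum(split_sizes_kb(10))) # total size of the split output
```

With these assumed numbers the original is 1000kB while each single-page file is 910kB, because only the tiny per-page content is actually divided among the files.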

Is it always the case?

No, sometimes PDFsam is just not smart enough. Imagine page 5 (and only page 5) has a nice big image, and all the pages point to the same shared Resources Dictionary containing this image. When we split, we duplicate the resources and attach them to each of the 10 PDF files created by the task, but only the file containing page 5 needs the full Resources Dictionary; all the others don’t need the big image.

PDFsam has an algorithm that tries (and most of the time succeeds) to identify this kind of situation and optimize the resources attached to the resulting files by removing the unused ones, in this case the big image for all but one of the generated files. This is why most of the time you get files of the size you would expect. The process can be slow, because we need to parse each page to figure out which resources are actually used, so we don’t always apply it; instead we try to identify files that potentially contain unused resources. This is where PDFsam sometimes misses a valid candidate and skips the optimization. It happens very rarely, but it can happen.
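The general idea of dropping unused resources can be sketched as follows. This is not PDFsam's actual algorithm: it is a simplified illustration that assumes a content stream can be scanned for names such as `/BigImage` with a regular expression, whereas a real implementation has to parse the page content properly. The resource names, sizes and page contents are hypothetical.

```python
import re

# Document-wide Resources Dictionary: name -> size in kB (hypothetical).
shared_resources = {"F1": 80, "Logo": 20, "BigImage": 800}

# Heavily simplified content streams. In real PDFs, "Tf" selects a font
# and "Do" paints an image, each referring to a resource by name.
page_contents = {
    4: "/F1 12 Tf (Hello) Tj /Logo Do",
    5: "/F1 12 Tf (Chart) Tj /Logo Do /BigImage Do",
}


def used_resource_names(content):
    """Collect every /Name that appears in the page content."""
    return set(re.findall(r"/(\w+)", content))


def prune_resources(content, resources):
    """Keep only the resources the page content actually refers to."""
    used = used_resource_names(content)
    return {name: size for name, size in resources.items() if name in used}


for number, content in page_contents.items():
    kept = prune_resources(content, shared_resources)
    print(number, sorted(kept), sum(kept.values()))
```

In this toy example the file for page 4 keeps only the font and the logo, while the file for page 5 keeps the big image as well.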

Most of the splitters out there don’t even perform this kind of optimization so rest assured, PDFsam remains one of the best tools for the job 🙂

3 Replies to “I split my PDF but the resulting files are so big!”
  1. I found a very strange one, and it happened with only one of several similar files. A collection of medical records, each PDF about 1,500 to 2,000 pages, maybe 300 MB in size, to be split into individual PDFs. This particular one created individual PDFs, enormous to begin with, but of increasing size as the process went on. It took over an hour, and the last one-page PDF file was 33 MB. The resulting folder was in the hundreds of GB.

    Even when I tried to extract using the PDF editor’s function, (200 pages at a time, going backward) the same thing happened, and the last PDFs in the collection were much larger even though they were extracted first. So I concluded it had to be some anomaly in the original PDF itself, not with pdfSam’s function.

    This was one of three original PDFs, created at the same time, totaling about 4,800 pages. The other two were properly split using pdfSam, with each individual PDF a couple of hundred KB in size.
