You used the split feature to split your PDF file (10 pages, 1MB) and the result is 10 PDF files of almost 1MB each… something must be wrong, it should have been 10 PDF files of roughly 100kB each, right? Wrong.
How PDF works
Oversimplifying, a PDF page is just a set of draw operations, something like draw this line here, write this text there using this font, or draw this image here. Each page has resources attached to it in something called a Resources Dictionary, a bucket of resources (fonts, images…) used by the page, so when the operation draw this image here is met, the image can be found there.
Now let’s imagine the same company logo appears on every page: does this mean each page has its own duplicated copy of the logo image in its Resources Dictionary? Of course not, each Resources Dictionary can point to the same image resource, which is shared among these pages. We could even go a bit further and have every page point to the same Resources Dictionary which, in this scenario, would act as a document-wide bucket of resources containing all the images and fonts used in the document.
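As a rough mental model, the sharing can be sketched in a few lines of plain Python. None of this is real PDF syntax, and names like `Im0` or `resources` are made up for illustration; the point is only that every page can reference the same bucket, so the logo bytes live in the document once:

```python
# Toy model of ten pages sharing one Resources Dictionary.
logo = b"\x89PNG-logo-bytes" * 6_000   # stand-in for a ~90kB company logo

# One shared bucket of resources, referenced by every page.
shared_resources = {"Im0": logo, "F0": b"font-data"}

pages = [
    {"content": f"/Im0 Do  (draw the logo on page {n})",
     "resources": shared_resources}
    for n in range(10)
]

# Every page points at the *same* dictionary, so the logo is
# stored once in the document, not ten times.
assert all(page["resources"] is shared_resources for page in pages)
```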
What happens when we split
In our example we are splitting the 1MB PDF into 10 new files. What happens is that each of the 10 new files must contain all the resources needed to draw its page, meaning each file must have its own copy of the company logo, the fonts and all the other resources. This explains why, even though the new files are just one page each, they are almost the same size as the original 10-page document: they still need all the fonts and images that made up most of the original 1MB.
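Continuing the toy model (again, made-up structures, not real PDF internals), a naive split deep-copies the shared resource bucket into every output file, which is exactly why each one-page file ends up nearly as big as the whole original:

```python
import copy

logo = b"\x89PNG-logo-bytes" * 6_000   # stand-in for a ~90kB company logo
shared_resources = {"Im0": logo, "F0": b"font-data"}
pages = [{"content": f"page {n}: /Im0 Do", "resources": shared_resources}
         for n in range(10)]

def split(pages):
    """Naive split: each single-page file gets its own full copy
    of the Resources Dictionary its page points to."""
    return [{"pages": [dict(page, resources=copy.deepcopy(page["resources"]))]}
            for page in pages]

def file_size(pdf_file):
    return sum(len(data)
               for page in pdf_file["pages"]
               for data in page["resources"].values())

files = split(pages)
original_resources_size = sum(len(data) for data in shared_resources.values())

# Every one-page file carries the full resource bucket, so each is
# roughly as big as the original 10-page document.
assert all(file_size(f) == original_resources_size for f in files)
```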
Is it always the case?
No, sometimes PDFsam is just not smart enough. Imagine page 5 (and only page 5) has a nice big image and all the pages point to the same shared Resources Dictionary containing this image. When we split, we duplicate and attach the resources to each of the 10 PDF files created by the task, but only the file containing page 5 actually needs the full Resources Dictionary; the others don’t need the big image. PDFsam has an algorithm that tries (and most of the time succeeds) to identify this kind of situation and optimizes the resources attached to the resulting files, removing unused ones: in this case the big image in all but one of the generated files. This is why most of the time you get files of the size you would expect. The process can be slow, because we need to parse each page to figure out what resources are actually used, so we don’t always apply it; instead we first try to identify files where some resources are potentially unused. This is where PDFsam sometimes misses a valid candidate and skips the optimization. It happens very rarely, but it can happen.
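The pruning step can be sketched with the same toy model. In a real PDF the used resources are found by parsing the page’s content stream; here each page simply carries a made-up `used` set, so the sketch only shows the idea of dropping whatever the page never references:

```python
# Sketch of the optimization: after the split has duplicated the shared
# Resources Dictionary into every file, drop the entries a page never uses.
big_image = b"\x89PNG-big-image" * 10_000   # stand-in for a large image
shared = {"BigIm": big_image, "F0": b"font-data"}

# Only page 5 draws the big image; every page uses the font.
# In a real PDF the "used" set comes from parsing the content stream.
pages = [{"used": {"F0"} | ({"BigIm"} if n == 5 else set()),
          "resources": dict(shared)}   # the split already duplicated resources
         for n in range(10)]

def prune(page):
    """Keep only the resources the page's draw operations reference."""
    page["resources"] = {name: data
                         for name, data in page["resources"].items()
                         if name in page["used"]}

for page in pages:
    prune(page)

sizes = [sum(len(d) for d in p["resources"].values()) for p in pages]
# Page 5 keeps the big image; the other nine shrink to just the font.
assert sizes[5] > 100_000
assert all(s < 1_000 for n, s in enumerate(sizes) if n != 5)
```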
Most of the splitters out there don’t even perform this kind of optimization, so rest assured, PDFsam remains one of the best tools for the job 🙂