When "preservation" hinders accessibility

So I have been trying to download this manual from Archive.org: https://archive.org/download/WRQReflection1Version4.2119943.5-1.44MBEnglish/Technical Reference/

Weighing in at about 50 gigabytes, that is taking forever, and it is likely to get corrupted somewhere along the way.

If this archive is like the other archives in this set, I expect it contains uncompressed TIF files: full 24-bit color, un-cropped, un-color-reduced flatbed scans.

Now, hats off to whoever went to the trouble to scan this bound manual a page at a time. But without further processing, it is hardly even possible to DO anything with it.

A few small tweaks would have shrunk the size down quite a bit. First of all, when I use my flatbed scanner to scan a manual, I select an area to scan that is about the size of the page. Then I place each page over the same area on the scanner. Scanning a little beyond the page is fine and better than accidentally cutting off content without noticing. But that way I don't waste time or storage scanning nothing. (I usually place some object the same height as the scanner next to the scanner so I can place the manual down flat).

I think there is even some misunderstanding about what "cropped" and "un-cropped" mean. If you scan a 3in x 3in object, "un-cropped" does not mean including the entire flatbed area of the scanner. It only means that the object being scanned is not cut off anywhere.
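To put rough numbers on the cropping point, here is a back-of-the-envelope calculation in Python. The 8.5in x 11.7in flatbed size is an assumption chosen for illustration (the post does not say which scanner was used), and `raw_scan_bytes` is a hypothetical helper name. It only computes uncompressed pixel data, so real TIF files will differ a little because of headers.

```python
def raw_scan_bytes(width_in, height_in, dpi=600, bits_per_pixel=24):
    """Uncompressed size, in bytes, of a scan covering the given area."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Assumed flatbed size, for illustration only.
FLATBED_IN = (8.5, 11.7)
# The 3in x 3in object from the example above.
OBJECT_IN = (3, 3)

whole_bed = raw_scan_bytes(*FLATBED_IN)
object_only = raw_scan_bytes(*OBJECT_IN)

print(f"whole flatbed: {whole_bed / 1e6:.1f} MB")
print(f"object only:   {object_only / 1e6:.1f} MB")
print(f"ratio:         {whole_bed / object_only:.1f}x")
```

Under those assumptions the whole-bed scan is roughly eleven times larger than a scan of just the object, before any compression or color reduction is even considered.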

TIF is fine for intermediate processing. Since it usually has little or no compression, processing tools don't have to chug away de-compressing and re-compressing. But when distributing, there is zero reason not to use PNG instead.

If all pages have minimal color or are black and white, you can save a LOT of storage space by color reducing. Typical text page scan images can be color reduced to "thousands" of colors without really losing anything. If pages are black and white, there is no reason not to use gray-scale. Since this reduces the "noise" in the image, the PNG files will come out much smaller, and you will still not get ugly artifacting as with JPG.
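The effect of noise on file size can be sketched with nothing but Python's standard library. PNG compresses its pixel data with DEFLATE, which is what `zlib` implements, so compressing raw pixel bytes with `zlib` gives a rough feel for it. This is only an approximation: real PNG encoders also apply per-row filters first, and the "page" here is synthetic (made-up noise levels and a made-up ink pattern), not a real scan.

```python
import random
import zlib

random.seed(0)  # fixed seed so repeated runs give the same numbers

WIDTH, HEIGHT = 200, 200

def synthetic_page():
    """Gray values for a fake scanned page: noisy paper plus dark 'ink' marks."""
    pixels = []
    for y in range(HEIGHT):
        for x in range(WIDTH):
            is_ink = (y % 20 < 4) and (x % 7 < 4)
            base = 20 if is_ink else 240
            pixels.append(base + random.randint(-15, 15))
    return pixels

def as_noisy_rgb(gray):
    """Simulate a 24-bit color scan: each channel gets its own small noise."""
    out = bytearray()
    for g in gray:
        for _ in range(3):
            out.append(max(0, min(255, g + random.randint(-5, 5))))
    return bytes(out)

def clean_levels(gray, black=60, white=200):
    """Push near-black to 0 and near-white to 255, leaving mid-grays alone."""
    return bytes(0 if g <= black else 255 if g >= white else g for g in gray)

gray = synthetic_page()
rgb_size = len(zlib.compress(as_noisy_rgb(gray), 9))
gray_size = len(zlib.compress(bytes(gray), 9))
clean_size = len(zlib.compress(clean_levels(gray), 9))

print("24-bit noisy color:", rgb_size, "bytes")
print("8-bit noisy gray:  ", gray_size, "bytes")
print("8-bit cleaned gray:", clean_size, "bytes")
```

The three sizes should come out in strictly decreasing order, with the cleaned version far smaller than the other two, because solid white and solid black areas compress to almost nothing while random scanner noise barely compresses at all.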

Now, I usually do even more processing than that. It is a lot of trouble, but when I create a PNG archive of a manual, I want those PNG files to be nice and clean and ready to print nice and crisp on a printer. This includes de-skewing, adjusting brightness/contrast so white backgrounds are solid white and black text is mostly pure black (printing grey to almost-black text can cause ugly dithering on a printout) while keeping grey around the edges to "soften" the appearance. Time permitting, I also remove punch holes or visible page-edges. Black and white manuals get gray-scaled, manuals with just a bit of crappy color get color reduced, but manuals or pages with detailed full color photos remain 24-bit color. This is usually at 600 DPI, although many early manuals were actually printed with a lower DPI than that.

All of that greatly reduces the size of PNG files.

There is a really nice command line tool called ImageMagick that can automate most of that.
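As a rough illustration of what such automation could look like, here is a small Python sketch that builds an ImageMagick command line for one page. The options used (`-deskew`, `-colorspace`, `-colors`, `-level`) are real ImageMagick options, but the threshold and level values, the file names, and the `cleanup_command` helper itself are assumptions made up for this example, not necessarily the settings described above. Every scan needs its own tuning, and punch-hole or page-edge removal is not covered.

```python
import shlex

def cleanup_command(src, dst, mode="gray", black_point="10%", white_point="90%",
                    deskew_threshold="40%", colors=4096):
    """Build an ImageMagick argument list that cleans up one scanned page.

    mode: "gray" for black and white pages, "reduced" for pages with a
    little color, "color" to leave full 24-bit color untouched.
    """
    # "magick" is the ImageMagick 7 program; on version 6 it is "convert".
    cmd = ["magick", src, "-deskew", deskew_threshold]
    if mode == "gray":
        cmd += ["-colorspace", "Gray"]
    elif mode == "reduced":
        cmd += ["-colors", str(colors)]
    elif mode != "color":
        raise ValueError(f"unknown mode: {mode}")
    # Stretch levels so paper becomes solid white and ink solid black,
    # while values in between (soft text edges) stay gray.
    cmd += ["-level", f"{black_point},{white_point}"]
    cmd.append(dst)
    return cmd

# Hypothetical file names, printed rather than executed.
print(shlex.join(cleanup_command("page_001.tif", "page_001.png")))
```

To actually run it, the list can be handed to `subprocess.run(cmd, check=True)`, looping over every page in a folder. Try it on a couple of pages first and compare against the originals before committing to a whole manual.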

Now, even those get quite huge, and therefore I usually don't post them on WinWorld. Instead I convert to a PDF file. That does mean losing some resolution, and most compression modes create JPG artifacting. But with OCR, that has the huge advantage of being easily searchable, printable, and navigable. Most ~300 DPI PDF files come out at a reasonable size. Small manuals can be 10-100 megabytes, larger ones perhaps a couple hundred megabytes.

Now, this same sort of thing applies to other formats as well. For example, due to size, I may omit Kryoflux or SuperCard Pro images if a title is obviously not copy protected (a set of perhaps 20 1.44MB disks can get annoyingly large). We have also refrained from using MDF format instead of ISO or BIN/CUE because it is considered proprietary (although more tools have grown to support it over the years).

I remember a long time ago getting complaints like "why is this archive so big?!" when an archive contained a few low-resolution JPG label scans rather than just disk .IMG. I guess dial-up is still around in a few spots. Old ADSL is slowly getting replaced by VDSL, cable, or fiber "broadband". I try to be reasonable about file sizes and formats where possible.

The point is, "preserving" stuff does little good if people can't actually DO anything with it.



BTW, on a side note this last forum upgrade must have also resulted in the editing form saving drafts more often. The pause and lag every minute or so get annoying. Oh, right, I'm the only person left on this planet that knows how to type more than 140 characters at a time :P . Also still annoyed at those dumb full page width buttons.

Comments

  • This is very helpful. I have several manuals I have been wanting to scan, mostly music programs such as Cakewalk, Roland EASE, etc. However, my access is to a large copier/scanner at a school, so I would likely have to work the jobs piecemeal.