From: Pieter Praet
Subject: Re: [OT] Scanning for archiving
Date: Wed, 09 Nov 2011 08:40:42 +0100
Message-ID: <871uthwxol.fsf@praet.org>
References: <87vcqy6vtl.fsf@praet.org> <2011-11-07T18-01-23@devnull.Karl-Voit.at>
In-Reply-To: <2011-11-07T18-01-23@devnull.Karl-Voit.at>
List-Id: "General discussions about Org-mode."
To: news1142@Karl-Voit.at, emacs-orgmode@gnu.org

On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit wrote:
> Hi!
>
> Inspired by «Total Recall»[3], a book of two MS Research guys, I
> started life logging on my own two months ago.
>

Dammit, that's been on my reading list for almost 2 years now, and
*still* it isn't available in ebook format. One would think they'd walk
their talk [1], no?

> [...]
> * Pieter Praet wrote:
> >
> > Using PDF for scanned documents results in *huge* files with a seriously
> > disappointing image quality.
>
> I cannot reproduce that at all:
>
> ,----
> | vk@gary ~2d % l 2011-11-02_13-22-45.png
> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
> | vk@gary ~2d %
> `----
>
> In this example, the compression of PDF is much better than the
> original PNG one. PDF is only a container format.
>

The conversion to PDF has indeed reduced the filesize, but not for the
reasons you might think: If you don't explicitly provide ImageMagick's
`convert' with a compression level (`-quality' option), it will use a
default of 75%. Thus I (perhaps incorrectly) infer that you've just
lost 25% of the image quality for a meager 6.5% reduction in filesize
(103150 -> 96457 bytes).

I do admit that the whole quality vs. filesize statement I made
regarding using PDF for scanned documents wasn't entirely correct:
I cut some corners.

The real issue is that most folks use their scanner software to save
directly to PDF, and for some reason, scanner software (especially the
proprietary variety) predominantly uses JPEG compression by default
when saving to PDF.

JPEG was developed for storing images with smooth transitions and a
high bit depth (i.e. photographs), not hard transitions and a low bit
depth (i.e. documents), so you're likely to suffer a noticeable
degradation in text quality, even when using 1:1 JPEG compression.

You're using PNG compression though, so the whole JPEG deal doesn't
apply.

So, that just leaves the never-ending stream of PDF security issues :)

> > Consider storing your scans in DjVu format
> > [1], which was developed specifically for this purpose.
>
> PDF is a common standard whereas DjVu is something I - as an
> advanced computer user - never faced before in real life. I am not
> sure whether any of my computers can handle DjVu files at all.
>

How about the Million Book Project / Universal Digital Library [2]?
Even though every computing device is most likely to support PDF, their
collection is only available in TIFF and DjVu format.

The list of participants and partners [3] (not to mention the magnitude
and cost of their undertaking) is reason enough (for me, at least) to
assume that DjVu is deemed to be rather future-proof. I'm guessing ISO
standardization will be only a matter of time.

> The goals of DjVu sound great but I get everything with PDF too.
> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
> mp3 again because I could not use many music devices or music
> management software packages.
>

Ahhh, VHS vs. Betamax, over and over again...

Companies only succeed in getting everyone stuck with mediocre tools if
we allow them to. You don't *need* all devices/software to support the
superior format. Just get the ones that do (if there are any...), try
to enlighten the people in your monkeysphere [4], and then let the free
market do its work. Joe Average Consumer will eventually follow (unless
pornography is at stake, apparently), and the industry will be right on
his tail.

> I stick to the format *any* computer can handle without special
> software products. [...]

Somehow this implies that *every* computer is infected with Adobe's
malware. I find that rather disconcerting, to be honest :D

> [...] And I do think that I get a higher chance of
> being able to read my documents twenty years from now.
>

For your sake, I hope you're right!

> For scanned images I'd prefer PNG instead but the OS X Software of
> my OfficeJet offers me the ability to generate PDF files where an
> OCR software adds a searchable text layer above the scanned text.
> This is *very* important to me since I am able to do full text
> search on the content of my archived documents.
>

May be a bit less convenient in daily usage, but you could stick to your
preference of keeping all your scans in PNG format by keeping the OCR
output in a separate plain-text file:

#+begin_src sh
  # Glob directly instead of parsing `ls', and quote paths with spaces.
  for i in "${HOME}"/msg/paper/inbox/*.png ; do
      # tesseract appends ".txt" to the output base: writes <scan>.png.txt
      tesseract "${i}" "${i}"
  done
#+end_src

That way you can access your data even on text-only machines, and
full-text search is only a `grep' away.

> And I plan to archive *all* of my documents. Really all of them.
>

Then you'll probably be interested in Joey Hess' git-annex [5] to keep
your archive versioned and in sync across all your devices.

> Storage space does not matter (any more) to me since I have more
> disk space now already than I could possibly fill with my lifetime
> paper correspondence. And I do think that my disk space continues to
> grow in future.
>

I'd argue it still does, otherwise you'd be keeping your scans in TIFF
format. And digitized trees surely aren't the only type of
correspondence you are (or will be) archiving.

Efficiency should always play a major role IMO, even if the available
resources are (perceived to be) infinite. Having a hangar instead of a
garage doesn't warrant driving a schoolbus to work, even if it doesn't
guzzle a drop of gas.

> [...]
>
> Funny side fact: grayscale scan document settings produce slightly
> larger files than colored ones.
>

That's odd. Probably depends on which type of compression is used.

> > gscan2pdf also supports a number of OCR utils, but the UI for this is
> > clumsy (aren't they all...), so you're better off using the CLI tools
> > directly. Tesseract is recommended.
>
> I played around with ocropus, tesseract, ocroscript, hocr2pdf,
> exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
> PDF documents (OCR text above the scanned images) on GNU/Linux.
> Unfortunately none of those (very cool projects) produced reliable
> results on my side. The results vary from «no error but overlay font
> size is incorrect and produces loss of layout» to «library error
> messages I can not read or handle».
>
> Whereas the HP OfficeJet bundles its OS X software with OCR from
> Readiris which produces perfect results even in different languages
> and using a usable user interface.
>

Sadly, I can only agree with this. Google's involvement in Tesseract
and OCRopus does instill hope though :)

> > NOTE: When attempting something like this, a fast scanner with a *reliable*
> > automatic document feeder will help prevent premature hair loss ;)
>
> I have found several scanner products I was interested in:
>
> "Canon imageFORMULA P-150": very small form factor with basic Linux
> support. Price tag starts with € 260. Neat form factor and very
> portable. Different version "P-150m" for Mac OS X.
>
> The authors of [3] use Fujitsu ScanSnap starting at € 400.
>
> I ended up with the Office Jet Pro (mentioned above) at € 250
> because I got flatbed scanner *and* ADF-scanner *and* a
> full-duplex/full-color network printer with a very good
> price-per-printed-page-ratio (better than many laser printers!). And
> all of this with a cheaper price tag than any scan-only-product I
> was interested in.
>
> So far I am almost satisfied. «Almost»? Well, HP did a good job with
> this printer but they made only a 90% solution on almost all levels.
> Whereas 100% would be possible with small additional effort when
> creating the printer. But those resulting 90% are pretty usable.
>
> 3. http://qr.cx/sAHU
> --
> Karl Voit
>
>

Peace

--
Pieter

[1] http://www.youtube.com/watch?v=zDcq2lmw0ls
[2] http://www.ulib.org/
[3] http://www.ulib.org/ULIBAboutUs.htm
[4] http://en.wikipedia.org/wiki/Dunbar's_number
[5] http://git-annex.branchable.com/
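
P.S. The `grep' step mentioned above could look roughly like the sketch
below. It assumes every scan has a sibling text file ending in .txt
holding its OCR output; `search_scans' is a made-up helper name for
illustration, not part of tesseract or any other tool.

```sh
# Sketch only: list the OCR text files that mention a search term.
# Assumes the OCR output sits next to the scans as *.txt files;
# search_scans is a hypothetical helper, not part of tesseract or grep.
search_scans() {
    term=$1
    dir=$2
    # -i: ignore case, -l: print only the names of matching files,
    # -F: treat the term as a literal string rather than a regex
    grep -ilF -- "$term" "$dir"/*.txt
}
```

Called as `search_scans invoice "${HOME}/msg/paper/inbox"', it prints
the matching text files one per line; stripping the trailing .txt from
each name should lead back to the scan it was generated from.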