From: Johnny
Subject: Re: [OT] Scanning for archiving
Date: Wed, 09 Nov 2011 09:06:51 +0000
Message-ID: <87lirpk6l0.fsf@gmx.co.uk>
List-Id: "General discussions about Org-mode."
To: Pieter Praet
Cc: news1142@Karl-Voit.at, emacs-orgmode@gnu.org

Apologies for top-posting, but my comment is only inspired by the
conversation and doesn't exactly build on it, so here we go.

I use predominantly pdf in scanning, for one main reason only - it
handles *metadata* nicely (with gscan2pdf). This is nice for searching
later. When playing with DjVu, I didn't find an easy way to amend
metadata - is there any good working method and tools to recommend for
adding metadata for DjVu files?

Thanks.

Pieter Praet writes:

> On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit wrote:
>> Hi!
>>
>> Inspired by «Total Recall»[3], a book of two MS Research guys, I
>> started life logging on my own two months ago.
>>
>
> Dammit, that's been on my reading list for almost 2 years now, and
> *still* it isn't available in ebook format. One would think they'd walk
> their talk [1], no?
>
>> [...]
>> * Pieter Praet wrote:
>> >
>> > Using PDF for scanned documents results in *huge* files with a seriously
>> > disappointing image quality.
>>
>> I can not copy that at all:
>>
>> ,----
>> | vk@gary ~2d % l 2011-11-02_13-22-45.png
>> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
>> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
>> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d %
>> `----
>>
>> In this example, the compression of PDF is much better than the
>> original PNG one. PDF is only a container format.
>>
>
> The conversion to PDF has indeed reduced the filesize, but not for the
> reasons you might think: If you don't explicitly provide ImageMagick's
> `convert' with a compression level (`-quality' option), it will use a
> default of 75%. Thus I (perhaps incorrectly) infer that you've just
> lost 25% of the image quality for a meager 7% reduction in filesize.
>
>
> I do admit that the whole quality vs. filesize statement I made
> regarding using PDF for scanned documents wasn't entirely correct:
> I cut some corners.
>
> The real issue is that most folks use their scanner software to save
> directly to PDF, and for some reason, scanner software (especially the
> proprietary variety) predominantly uses JPEG compression as default when
> saving to PDF.
>
> JPEG was developed for storing images with smooth transitions and a high
> bit depth (i.e. photographs), not hard transitions and a low bit depth
> (i.e.
documents), so you're likely to suffer a noticeable degradation in
> text quality, even when using 1:1 JPEG compression.
>
> You're using PNG compression though, so the whole JPEG deal doesn't apply.
>
> So, that just leaves the neverending stream of PDF security issues :)
>
>> > Consider storing your scans in DjVu format
>> > [1], which was developed specifically for this purpose.
>>
>> PDF is a common standard whereas DjVu is something I - as an
>> advanced computer user - never faced before in real life. I am not
>> sure whether any of my computers can handle DjVu files at all.
>>
>
> How about the Million Book Project / Universal Digital Library [2] ?
> Even though every computing device is most likely to support PDF, their
> collection is only available in TIFF and DjVu format.
>
> The list of participants and partners [3] (not to mention the magnitude
> and cost of their undertaking) is reason enough (for me, at least) to
> assume that DjVu is deemed to be rather future-proof.
>
> I'm guessing ISO standardization will be only a matter of time.
>
>> The goals of DjVu sound great but I get everything with PDF too.
>> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
>> mp3 again because I could not use many music devices or music
>> management software packages.
>>
>
> Ahhh, VHS vs. Betamax, over and over again...
>
> Companies only succeed in getting everyone stuck with mediocre tools if
> we allow them to. You don't *need* all devices/software to support the
> superior format. Just get the ones that do (if there are any...), try
> to enlighten the people in your monkeysphere [4], and then let the free
> market do its work. Joe Average Consumer will eventually follow (unless
> pornography is at stake, apparently), and the industry will be right on
> his tail.
>
>> I stick to the format *any* computer can handle without special
>> software products. [...]
>
> Somehow this implies that *every* computer is infected with Adobe's
> malware. I find that rather disconcerting, to be honest :D
>
>> [...] And I do think that I get a higher chance of
>> being able to read my documents twenty years from now.
>>
>
> For your sake, I hope you're right!
>
>> For scanned images I'd prefer PNG instead but the OS X Software of
>> my OfficeJet offers me the ability to generate PDF files where an
>> OCR software adds a searchable text layer above the scanned text.
>> This is *very* important to me since I am able to do full text
>> search on the content of my archived documents.
>>
>
> May be a bit less convenient in daily usage, but you could stick to your
> preference of keeping all your scans in PNG format by keeping the OCR
> output in a separate ASCII file:
>
>   #+begin_src sh
>     # Glob directly instead of parsing `ls', and quote the expansions
>     # so that paths containing spaces survive.
>     for i in "${HOME}"/msg/paper/inbox/*.png ; do
>       # tesseract appends ".txt" to the output base: writes <scan>.png.txt
>       tesseract "${i}" "${i}"
>     done
>   #+end_src
>
> That way you can access your data even on text-only machines,
> and full-text search is only a `grep' away.
>
>> And I plan to archive *all* of my documents. Really all of them.
>>
>
> Then you'll probably be interested in Joey Hess' git-annex [5] to keep
> your archive versioned and in sync across all your devices.
>
>> Storage space does not matter (any more) to me since I have more
>> disk space now already than I could possibly fill with my lifetime
>> paper correspondence. And I do think that my disk space continues to
>> grow in future.
>>
>
> I'd argue it still does, otherwise you'd be keeping your scans in
> TIFF format. And digitized trees surely aren't the only type of
> correspondence you are (or will be) archiving.
>
> Efficiency should always play a major role IMO, even if the available
> resources are (perceived to be) infinite. Having a hangar instead of a
> garage doesn't warrant driving a schoolbus to work, even if it doesn't
> guzzle a drop of gas.
>
>> [...]
>>
>> Funny side fact: grayscale scan document settings produce slightly
>> larger files than colored ones.
>>
>
> That's odd. Probably depends on which type of compression is used.
>
>> > gscan2pdf also supports a number of OCR utils, but the UI for this is
>> > clumsy (aren't they all...), so you're better off using the CLI tools
>> > directly. Tesseract is recommended.
>>
>> I played around with ocropus, tesseract, ocroscript, hocr2pdf,
>> exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
>> PDF documents (OCR text above the scanned images) on GNU/Linux.
>> Unfortunately none of those (very cool projects) produced reliable
>> results on my side. The results vary from «no error but overlay font
>> size is incorrect and produces loss of layout» to «library error
>> messages I can not read or handle».
>>
>> Whereas the HP OfficeJet bundles its OS X software with OCR from
>> Readiris which produces perfect results even in different languages
>> and using a usable user interface.
>>
>
> Sadly, I can only agree with this. Google's involvement in Tesseract
> and OCRopus does instill hope though :)
>
>> > NOTE: When attempting something like this, a fast scanner with a *reliable*
>> > automatic document feeder will help prevent premature hair loss ;)
>>
>> I have found several scanner products I was interested in:
>>
>> "Canon imageFORMULA P-150": very small form factor with basic Linux
>> support. Price tag starts with € 260. Neat form factor and very
>> portable. Different version "P-150m" for Mac OS X.
>>
>> The authors of [3] use Fujitsu ScanSnap starting at € 400.
>>
>> I ended up with the Office Jet Pro (mentioned above) at € 250
>> because I got flatbed scanner *and* ADF-scanner *and* a
>> full-duplex/full-color network printer with a very good
>> price-per-printed-page-ratio (better than many laser printers!).
And
>> all of this with a cheaper price tag than any scan-only-product I
>> was interested in.
>>
>> So far I am almost satisfied. «Almost»? Well, HP did a good job with
>> this printer but they made only a 90% solution on almost all levels.
>> Whereas 100% would be possible with small additional effort when
>> creating the printer. But those resulting 90% are pretty usable.
>>
>> 3. http://qr.cx/sAHU
>> --
>> Karl Voit
>>
>>
>
>
> Peace

-- 
Johnny
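
A note on the "full-text search is only a `grep' away" remark quoted
above: below is a minimal sketch of what such a search could look like.
It assumes each scan (say scan.png) has its OCR text saved next to it as
scan.png.txt. The function name find_scans and that sidecar naming are
assumptions made for illustration; neither comes from the thread.

```sh
#!/bin/sh
# Hypothetical helper: print the scans whose OCR sidecar text mentions a term.
# Assumption: the OCR text for <scan>.png lives in <scan>.png.txt alongside it.
find_scans() {
    dir=$1
    term=$2
    # -l: print only the names of matching files
    # -i: ignore case
    # -F: treat the term as a fixed string, not a regular expression
    grep -l -i -F -- "$term" "$dir"/*.txt 2>/dev/null |
    while IFS= read -r txt; do
        # Strip the trailing ".txt" to get back to the image path.
        printf '%s\n' "${txt%.txt}"
    done
}
```

Something like `find_scans "${HOME}/msg/paper/inbox" invoice` would then
list the PNG files whose recognized text contains "invoice". File names
containing newlines are not handled by this sketch.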