From mboxrd@z Thu Jan 1 00:00:00 1970
From: Karl Voit
Subject: Re: [OT] Scanning for archiving
Date: Mon, 7 Nov 2011 18:44:24 +0100
Message-ID: <2011-11-07T18-01-23@devnull.Karl-Voit.at>
References: <87vcqy6vtl.fsf@praet.org>
Reply-To: news1142@Karl-Voit.at
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
List-Id: "General discussions about Org-mode."
To: emacs-orgmode@gnu.org

Hi!

Inspired by «Total Recall»[3], a book by two MS Research guys, I
started life logging on my own two months ago.

For this purpose I bought an HP OfficeJet Pro 8500A Plus, which costs
€ 250 and has a decent scanner. It can scan and print full duplex.
The scanner has a 30-page ADF, which is quite reliable as long as the
paper was not bent or stapled before.

* Pieter Praet wrote:
>
> Using PDF for scanned documents results in *huge* files with a seriously
> disappointing image quality.

I cannot reproduce that at all:

,----
| vk@gary ~2d % l 2011-11-02_13-22-45.png
| -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
| vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
| vk@gary ~2d % l 2011-11-02_13-22-45.pdf
| -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
| vk@gary ~2d %
`----

In this example, the PDF comes out roughly 6 % smaller than the
original PNG (96457 vs. 103150 bytes), so the compression inside the
PDF is at least as good as that of the PNG. PDF is only a container
format.

> Consider storing your scans in DjVu format
> [1], which was developed specifically for this purpose.

PDF is a common standard, whereas DjVu is something I, as an advanced
computer user, have never come across in real life. I am not sure
whether any of my computers can handle DjVu files at all. The goals
of DjVu sound great, but I get everything with PDF too.

Although I like the idea of Ogg Vorbis, I re-ripped all my CDs as MP3
again because I could not use many music devices or music management
software packages with it. I stick to the format *any* computer can
handle without special software products. And I do think that this
gives me a better chance of being able to read my documents twenty
years from now.

For scanned images I'd prefer PNG, but the OS X software of my
OfficeJet can generate PDF files in which OCR software adds a
searchable text layer above the scanned text. This is *very*
important to me since it lets me do full-text search on the content
of my archived documents.

And I plan to archive *all* of my documents. Really all of them.
Storage space does not matter (any more) to me since I already have
more disk space than I could possibly fill with my lifetime paper
correspondence. And I do think that my disk space will continue to
grow in the future.
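
As a rough sanity check of the two size claims above, here is a small
back-of-the-envelope sketch. The two byte counts are taken from the
listing above; the 1 TB disk size and the function names are purely
assumptions for illustration, and real scans will of course vary in
size from page to page.

```python
# Back-of-the-envelope estimate. The disk size below is an assumption
# for illustration, not a figure from this mail.

PNG_BYTES = 103150  # size of the PNG in the listing above
PDF_BYTES = 96457   # size of the converted PDF in the listing above

ASSUMED_DISK_BYTES = 10**12  # hypothetical 1 TB disk


def relative_saving(before: int, after: int) -> float:
    """Return the fraction by which `after` is smaller than `before`."""
    return (before - after) / before


def pages_that_fit(disk_bytes: int, bytes_per_page: int) -> int:
    """Return how many whole pages of the given size fit on the disk."""
    return disk_bytes // bytes_per_page


if __name__ == "__main__":
    saving = relative_saving(PNG_BYTES, PDF_BYTES)
    pages = pages_that_fit(ASSUMED_DISK_BYTES, PDF_BYTES)
    print(f"PDF is {saving:.1%} smaller than the PNG")
    print(f"{pages:,} pages of that size fit on the assumed disk")
```

Under these assumptions the page count is on the order of ten million,
which is far more than a lifetime of paper correspondence is likely to
produce.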
> I scan all docs @ 600dpi, predominantly gray-scale (only in colour when
> it's *really* necessary) and store in DjVu format, all using gscan2pdf [2].
>
> Even at that seemingly overkill resolution, single-page documents are
> generally (if they aren't too "grainy") only a few 100 KiB in size.

My HP software uses 300 dpi by default and that is OK for me too.

Funny side fact: the grayscale scan setting produces slightly larger
files than the color one.

> gscan2pdf also supports a number of OCR utils, but the UI for this is
> clumsy (aren't they all...), so you're better off using the CLI tools
> directly. Tesseract is recommended.

I played around with ocropus, tesseract, ocroscript, hocr2pdf,
exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich PDF
documents (OCR text above the scanned images) on GNU/Linux.
Unfortunately, none of those (very cool) projects produced reliable
results on my side. The results vary from «no error but overlay font
size is incorrect and produces loss of layout» to «library error
messages I cannot read or handle».

In contrast, the HP OfficeJet bundles its OS X software with OCR from
Readiris, which produces perfect results even in different languages
and has a usable user interface.

> NOTE: When attempting something like this, a fast scanner with a *reliable*
> automatic document feeder will help prevent premature hair loss ;)

I have found several scanner products I was interested in:

"Canon imageFORMULA P-150": very small form factor with basic Linux
support. Prices start at € 260. Neat form factor and very portable.
There is a different version, "P-150m", for Mac OS X.

The authors of [3] use a Fujitsu ScanSnap, starting at € 400.

I ended up with the OfficeJet Pro (mentioned above) at € 250 because
I got a flatbed scanner *and* an ADF scanner *and* a
full-duplex/full-color network printer with a very good price per
printed page (better than many laser printers!).
And all of this at a lower price than any scan-only product I was
interested in.

So far I am almost satisfied. «Almost»? Well, HP did a good job with
this printer, but they delivered only a 90% solution on almost all
levels, whereas 100% would have been possible with a little
additional effort when creating the printer. But those resulting 90%
are pretty usable.

 3. http://qr.cx/sAHU

-- 
Karl Voit