emacs-orgmode@gnu.org archives
From: Karl Voit <devnull@Karl-Voit.at>
To: emacs-orgmode@gnu.org
Subject: Re: [OT] Scanning for archiving
Date: Mon, 7 Nov 2011 18:44:24 +0100
Message-ID: <2011-11-07T18-01-23@devnull.Karl-Voit.at>
In-Reply-To: 87vcqy6vtl.fsf@praet.org

Hi!

Inspired by «Total Recall»[3], a book by two MS Research guys, I
started life logging on my own two months ago.

For this purpose I bought an HP OfficeJet Pro 8500A Plus, which costs
€ 250 and has a decent scanner. It can scan and print full duplex.
The scanner has a 30-page ADF, which is quite reliable as long as the
paper has not been bent or stapled before.

* Pieter Praet <pieter@praet.org> wrote:
>
> Using PDF for scanned documents results in *huge* files with a seriously
> disappointing image quality.  

I cannot reproduce that at all:

,----
| vk@gary ~2d % l 2011-11-02_13-22-45.png
| -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
| vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
| vk@gary ~2d % l 2011-11-02_13-22-45.pdf
| -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
| vk@gary ~2d %
`----

In this example, the PDF is even slightly smaller than the original
PNG (96457 vs. 103150 bytes). PDF is only a container format.
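
As a rough check on the listing above, the relative size difference
can be computed from the two byte counts. This is only a minimal
sketch; the helper name `size_change_percent` is made up for
illustration and is not part of any tool mentioned here.

```python
def size_change_percent(old_bytes: int, new_bytes: int) -> float:
    """Return the size change from old to new in percent (negative = smaller)."""
    if old_bytes <= 0:
        raise ValueError("old_bytes must be positive")
    return (new_bytes - old_bytes) * 100.0 / old_bytes


# Byte counts taken from the directory listing above.
png_size = 103150
pdf_size = 96457
print(f"PDF vs. PNG: {size_change_percent(png_size, pdf_size):+.1f} %")
# prints "PDF vs. PNG: -6.5 %"
```

A single file proves little in general, of course; the result depends
on how the converter re-encodes the image inside the PDF container.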

> Consider storing your scans in DjVu format
> [1], which was developed specifically for this purpose.

PDF is a common standard, whereas DjVu is something that I, as an
advanced computer user, have never come across in real life. I am
not sure whether any of my computers can handle DjVu files at all.

The goals of DjVu sound great, but I get all of that with PDF too.
It is like Ogg Vorbis: although I like the idea, I re-ripped all my
CDs to MP3 because many music devices and music management software
packages could not handle Vorbis files.

I stick to formats that *any* computer can handle without special
software. I think that gives me a better chance of still being able
to read my documents twenty years from now.

For scanned images I would prefer PNG, but the OS X software of my
OfficeJet can generate PDF files in which OCR software adds a
searchable text layer on top of the scanned page. This is *very*
important to me because it lets me run full-text searches over the
content of my archived documents.
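
Such a full-text search could look roughly like this on GNU/Linux.
This is a minimal sketch, assuming the `pdftotext` tool from
poppler-utils is installed and the PDFs already carry an OCR text
layer; the function names are hypothetical and not taken from any
software mentioned in this thread.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterator


def pdf_text(pdf: Path) -> str:
    """Extract the text layer of a PDF via pdftotext (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def search_archive(
    root: Path,
    term: str,
    extract: Callable[[Path], str] = pdf_text,
) -> Iterator[Path]:
    """Yield every PDF below root whose text contains term (case-insensitive)."""
    needle = term.lower()
    for pdf in sorted(root.rglob("*.pdf")):
        if needle in extract(pdf).lower():
            yield pdf
```

Calling `list(search_archive(Path("/some/archive"), "invoice"))` would
then return the matching files. The `extract` parameter exists only
so that a different text extractor can be swapped in.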

And I plan to archive *all* of my documents. Really all of them.

Storage space does not matter to me (any more), since I already have
more disk space than I could possibly fill with a lifetime of paper
correspondence. And I expect my disk space to keep growing in the
future.

> I scan all docs @ 600dpi, predominantly gray-scale (only in colour when
> it's *really* necessary) and store in DjVu format, all using gscan2pdf [2].
>
> Even at that seemingly overkill resolution, single-page documents are
> generally (if they aren't too "grainy") only a few 100 KiB in size.

My HP software uses 300 dpi by default, and that is fine for me too.

Funny side note: the grayscale scan setting produces slightly
larger files than the color one.

> gscan2pdf also supports a number of OCR utils, but the UI for this is
> clumsy (aren't they all...), so you're better off using the CLI tools
> directly.  Tesseract is recommended.

I played around with ocropus, tesseract, ocroscript, hocr2pdf,
exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
PDF documents (OCR text on top of the scanned images) on GNU/Linux.
Unfortunately none of those (very cool) projects produced reliable
results for me. The results ranged from «no error, but the overlay
font size is incorrect and the layout is lost» to «library error
messages I cannot read or handle».

The HP OfficeJet, by contrast, bundles its OS X software with OCR
from Readiris, which produces perfect results even in different
languages and has a usable user interface.
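
For the GNU/Linux route, newer Tesseract releases can write a
searchable PDF themselves via the `pdf` output config, which may
avoid a separate overlay step. This is a minimal sketch, assuming
such a Tesseract version and the needed language data are installed;
the wrapper functions are hypothetical, and result quality is not
guaranteed.

```python
import subprocess
from pathlib import Path


def sandwich_command(image: Path, languages: str = "eng") -> list[str]:
    """Build the tesseract call that writes <image stem>.pdf next to the image."""
    output_base = image.with_suffix("")  # tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", languages, "pdf"]


def make_sandwich_pdf(image: Path, languages: str = "eng") -> Path:
    """Run tesseract (assumed to be on PATH) and return the searchable PDF path."""
    subprocess.run(sandwich_command(image, languages), check=True)
    return image.with_suffix(".pdf")
```

For documents in several languages, a combined language string such
as "deu+eng" can be passed as `languages`.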

> NOTE: When attempting something like this, a fast scanner with a *reliable*
> automatic document feeder will help prevent premature hair loss ;)

I have found several scanner products I was interested in:

"Canon imageFORMULA P-150": very small and portable, with basic
Linux support. Prices start at € 260. There is a separate version,
the "P-150m", for Mac OS X.

The authors of [3] use a Fujitsu ScanSnap, which starts at € 400.

I ended up with the OfficeJet Pro (mentioned above) at € 250
because I got a flatbed scanner *and* an ADF scanner *and* a
full-duplex/full-color network printer with a very good price per
printed page (better than many laser printers!). And all of this
for less than any scan-only product I was interested in.

So far I am almost satisfied. «Almost»? Well, HP did a good job with
this printer, but they delivered only a 90% solution on almost all
levels, whereas 100% would have been possible with a little
additional effort. But those 90% are pretty usable.

  3. http://qr.cx/sAHU
-- 
Karl Voit

