Re: [OT] Scanning for archiving


From: Johnny <yggdrasil@gmx.co.uk>
To: Pieter Praet <pieter@praet.org>
Cc: news1142@Karl-Voit.at, emacs-orgmode@gnu.org
Subject: Re: [OT] Scanning for archiving
Date: Wed, 09 Nov 2011 09:06:51 +0000
Message-ID: <87lirpk6l0.fsf@gmx.co.uk>
In-Reply-To: <871uthwxol.fsf@praet.org> (Pieter Praet's message of "Wed, 09 Nov 2011 08:40:42 +0100")

Apologies for top-posting, but my comment is only inspired by the
conversation and doesn't exactly build on it, so here we go.

I predominantly use PDF for scanning, for one main reason: it handles
*metadata* nicely (with gscan2pdf), which helps with searching later.
When playing with DjVu, I didn't find an easy way to amend metadata.
Are there any good methods or tools to recommend for adding metadata
to DjVu files?

Thanks.

Pieter Praet <pieter@praet.org> writes:

> On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit <devnull@Karl-Voit.at> wrote:
>> Hi!
>> 
>> Inspired by «Total Recall»[3], a book by two MS Research guys, I
>> started life logging on my own two months ago.
>> 
>
> Dammit, that's been on my reading list for almost 2 years now, and
> *still* it isn't available in ebook format.  One would think they'd walk
> their talk [1], no?
>
>> [...]
>> * Pieter Praet <pieter@praet.org> wrote:
>> >
>> > Using PDF for scanned documents results in *huge* files with a seriously
>> > disappointing image quality.  
>> 
>> I cannot reproduce that at all:
>> 
>> ,----
>> | vk@gary ~2d % l 2011-11-02_13-22-45.png
>> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
>> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
>> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d %
>> `----
>> 
>> In this example, the compression of PDF is much better than the
>> original PNG one. PDF is only a container format.
>> 
>
> The conversion to PDF has indeed reduced the filesize, but not for the
> reasons you might think: If you don't explicitly provide ImageMagick's
> `convert' with a compression level (`-quality' option), it will use a
> default of 75%.  Thus I (perhaps incorrectly) infer that you've just
> lost 25% of the image quality for a meager 7% reduction in filesize.
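
As a quick check of the file sizes in the listing quoted above, the
reduction can be computed directly. This is only a sketch: `percent_saved`
is a hypothetical helper name, not part of any tool mentioned in this
thread, and it assumes `awk` is available.

```shell
#!/bin/sh
# percent_saved ORIGINAL_BYTES CONVERTED_BYTES
# Prints the size reduction as a percentage of the original size,
# rounded to one decimal place.
percent_saved() {
    awk -v orig="$1" -v conv="$2" \
        'BEGIN { printf "%.1f\n", (orig - conv) * 100 / orig }'
}

# Sizes taken from the quoted listing: the PNG first, then the PDF.
percent_saved 103150 96457    # prints 6.5
```

So the saving is about 6.5%, in line with the roughly 7% mentioned above.
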
>
>
> I do admit that the whole quality vs. filesize statement I made
> regarding using PDF for scanned documents wasn't entirely correct:
> I cut some corners.
>
> The real issue is that most folks use their scanner software to save
> directly to PDF, and for some reason, scanner software (especially the
> proprietary variety) predominantly uses JPEG compression as default when
> saving to PDF.
>
> JPEG was developed for storing images with smooth transitions and a high
> bit depth (i.e. photographs), not hard transitions and a low bit depth
> (i.e. documents), so you're likely to suffer a noticeable degradation in
> text quality, even when using 1:1 JPEG compression.
>
> You're using PNG compression though, so the whole JPEG deal doesn't apply.
>
> So, that just leaves the neverending stream of PDF security issues :)
>
>> > Consider storing your scans in DjVu format
>> > [1], which was developed specifically for this purpose.
>> 
>> PDF is a common standard whereas DjVu is something I - as an
>> advanced computer user - never faced before in real life. I am not
>> sure whether any of my computers can handle DjVu files at all.
>> 
>
> How about the Million Book Project / Universal Digital Library [2] ?
> Even though every computing device is most likely to support PDF, their
> collection is only available in TIFF and DjVu format.
>
> The list of participants and partners [3] (not to mention the magnitude
> and cost of their undertaking) is reason enough (for me, at least) to
> assume that DjVu is deemed to be rather future-proof.
>
> I'm guessing ISO standardization will be only a matter of time.
>
>> The goals of DjVu sound great but I get everything with PDF too.
>> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
>> mp3 again because I could not use many music devices or music
>> management software packages.
>> 
>
> Ahhh, VHS vs. Betamax, over and over again...
>
> Companies only succeed in getting everyone stuck with mediocre tools if
> we allow them to.  You don't *need* all devices/software to support the
> superior format.  Just get the ones that do (if there are any...), try
> to enlighten the people in your monkeysphere [4], and then let the free
> market do its work.  Joe Average Consumer will eventually follow (unless
> pornography is at stake, apparently), and the industry will be right on
> his tail.
>
>> I stick to the format *any* computer can handle without special
>> software products. [...]
>
> Somehow this implies that *every* computer is infected with Adobe's
> malware.  I find that rather disconcerting, to be honest :D
>
>> [...] And I do think that I get a higher chance of
>> being able to read my documents twenty years from now.
>> 
>
> For your sake, I hope you're right!
>
>> For scanned images I'd prefer PNG instead but the OS X Software of
>> my OfficeJet offers me the ability to generate PDF files where an
>> OCR software adds a searchable text layer above the scanned text.
>> This is *very* important to me since I am able to do full text
>> search on the content of my archived documents.
>> 
>
> May be a bit less convenient in daily usage, but you could stick to your
> preference of keeping all your scans in PNG format by keeping the OCR
> output in a separate ASCII file:
>
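>   #+begin_src sh
>     # Glob directly instead of parsing `ls', and quote for paths with spaces.
>     for i in "${HOME}"/msg/paper/inbox/*.png ; do
>         # tesseract appends `.txt' to the output base name itself
>         tesseract "${i}" "${i%.png}"
>     done
>   #+end_src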
>
> That way you can access your data even on text-only machines,
> and full-text search is only a `grep' away.
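
The `grep' search mentioned above could look roughly like this. It is a
minimal sketch, assuming the OCR output sits in `.txt` files next to the
scans; `search_scans` is a hypothetical helper name, and the default
directory is simply the inbox path from the loop above.

```shell
#!/bin/sh
# search_scans PATTERN [DIR]
# Lists the OCR text files under DIR whose contents match PATTERN,
# ignoring case. DIR defaults to the inbox path used in the loop above
# (an assumption; adjust it to wherever the text files actually live).
search_scans() {
    pattern=$1
    dir=${2:-"${HOME}/msg/paper/inbox"}
    # find + grep keeps this portable; -l prints only the file names.
    find "$dir" -type f -name '*.txt' -exec grep -il -- "$pattern" {} +
}
```

For example, `search_scans invoice` would print the paths of all text
files that mention "invoice", from which the matching scan can be opened.
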
>
>> And I plan to archive *all* of my documents. Really all of them.
>> 
>
> Then you'll probably be interested in Joey Hess' git-annex [5] to keep
> your archive versioned and in sync across all your devices.
>
>> Storage space does not matter (any more) to me since I already have
>> more disk space than I could possibly fill with my lifetime paper
>> correspondence. And I do think that my disk space will continue to
>> grow in the future.
>> 
>
> I'd argue it still does, otherwise you'd be keeping your scans in
> TIFF format.  And digitized trees surely aren't the only type of
> correspondence you are (or will be) archiving.
>
> Efficiency should always play a major role IMO, even if the available
> resources are (perceived to be) infinite.  Having a hangar instead of a
> garage doesn't warrant driving a schoolbus to work, even if it doesn't
> guzzle a drop of gas.
>
>> [...]
>> 
>> Funny side fact: the grayscale scan setting produces slightly
>> larger files than the color one.
>> 
>
> That's odd.  Probably depends on which type of compression is used.
>
>> > gscan2pdf also supports a number of OCR utils, but the UI for this is
>> > clumsy (aren't they all...), so you're better off using the CLI tools
>> > directly.  Tesseract is recommended.
>> 
>> I played around with ocropus, tesseract, ocroscript, hocr2pdf,
>> exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
>> PDF documents (OCR text above the scanned images) on GNU/Linux.
>> Unfortunately none of those (very cool projects) produced reliable
>> results on my side. The results vary from «no error but overlay font
>> size is incorrect and produces loss of layout» to «library error
>> messages I can not read or handle». 
>> 
>> Whereas the HP OfficeJet bundles its OS X software with OCR from
>> Readiris which produces perfect results even in different languages
>> and using a usable user interface.
>> 
>
> Sadly, I can only agree with this.  Google's involvement in Tesseract
> and OCRopus does instill hope though :)
>
>> > NOTE: When attempting something like this, a fast scanner with a *reliable*
>> > automatic document feeder will help prevent premature hair loss ;)
>> 
>> I have found several scanner products I was interested in:
>> 
>> "Canon imageFORMULA P-150": very small form factor with basic Linux
>> support. Price tag starts at € 260. Neat form factor and very
>> portable. Different version "P-150m" for Mac OS X.
>> 
>> The authors of [3] use Fujitsu ScanSnap starting at € 400.
>> 
>> I ended up with the Office Jet Pro (mentioned above) at € 250
>> because I got flatbed scanner *and* ADF-scanner *and* a
>> full-duplex/full-color network printer with a very good
>> price-per-printed-page-ratio (better than many laser printers!). And
>> all of this with a cheaper price tag than any scan-only-product I
>> was interested in.
>> 
>> So far I am almost satisfied. «Almost»? Well, HP did a good job with
>> this printer, but they delivered only a 90% solution on almost all
>> levels, whereas 100% would have been possible with a small additional
>> effort. Those resulting 90% are pretty usable, though.
>> 
>>   3. http://qr.cx/sAHU
>> -- 
>> Karl Voit
>> 
>> 
>
>
> Peace

-- 
Johnny
