From mboxrd@z Thu Jan  1 00:00:00 1970
From: Johnny <yggdrasil@gmx.co.uk>
Subject: Re: [OT] Scanning for archiving
Date: Wed, 09 Nov 2011 09:06:51 +0000
Message-ID: <87lirpk6l0.fsf@gmx.co.uk>
References: <CACHMzOHL_PUxsY=PLObyaNkjOQeMhZdxkeWHLCAB=L_eVqMpeg@mail.gmail.com>
	<87vcqy6vtl.fsf@praet.org> <2011-11-07T18-01-23@devnull.Karl-Voit.at>
	<871uthwxol.fsf@praet.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <871uthwxol.fsf@praet.org> (Pieter Praet's message of "Wed, 09
	Nov 2011 08:40:42 +0100")
To: Pieter Praet <pieter@praet.org>
Cc: news1142@Karl-Voit.at, emacs-orgmode@gnu.org

Apologies for top-posting, but my comment is only inspired by the
conversation and doesn't exactly build on it, so here we go.

I predominantly use PDF for scanning, for one main reason: it handles
*metadata* nicely (with gscan2pdf), which is handy for searching
later. When playing with DjVu, I didn't find an easy way to amend
metadata. Are there any good working methods or tools you would
recommend for adding metadata to DjVu files?

Thanks.

Pieter Praet <pieter@praet.org> writes:

> On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit <devnull@Karl-Voit.at> wrote:
>> Hi!
>>
>> Inspired by «Total Recall»[3], a book of two MS Research guys, I
>> started life logging on my own two months ago.
>>
>
> Dammit, that's been on my reading list for almost 2 years now, and
> *still* it isn't available in ebook format.  One would think they'd walk
> their talk [1], no?
>
>> [...]
>> * Pieter Praet <pieter@praet.org> wrote:
>> >
>> > Using PDF for scanned documents results in *huge* files with a seriously
>> > disappointing image quality.
>>
>> I can not copy that at all:
>>
>> ,----
>> | vk@gary ~2d % l 2011-11-02_13-22-45.png
>> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
>> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
>> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d %
>> `----
>>
>> In this example, the compression of PDF is much better than the
>> original PNG one. PDF is only a container format.
>>
>
> The conversion to PDF has indeed reduced the filesize, but not for the
> reasons you might think: If you don't explicitly provide ImageMagick's
> `convert' with a compression level (`-quality' option), it will use a
> default of 75%.  Thus I (perhaps incorrectly) infer that you've just
> lost 25% of the image quality for a meager 7% reduction in filesize.
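
A quick check of the "meager 7%" figure, as a rough sketch only: the
byte counts are the ones from the listing above, and the function name
`pct_reduction` is made up for illustration.

```sh
# Hypothetical helper: percentage by which a file shrank, one decimal place.
pct_reduction() {
    # $1 = original size in bytes, $2 = new size in bytes
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / a }'
}

# PNG was 103150 bytes, the converted PDF 96457 bytes.
pct_reduction 103150 96457   # prints 6.5
```

So the saving is roughly 6.5%, in line with the estimate quoted above.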
>
>
> I do admit that the whole quality vs. filesize statement I made
> regarding using PDF for scanned documents wasn't entirely correct:
> I cut some corners.
>
> The real issue is that most folks use their scanner software to save
> directly to PDF, and for some reason, scanner software (especially the
> proprietary variety) predominantly uses JPEG compression as default when
> saving to PDF.
>
> JPEG was developed for storing images with smooth transitions and a high
> bit depth (i.e. photographs), not hard transitions and a low bit depth
> (i.e. documents), so you're likely to suffer a noticeable degradation in
> text quality, even when using 1:1 JPEG compression.
>
> You're using PNG compression though, so the whole JPEG deal doesn't apply.
>
> So, that just leaves the neverending stream of PDF security issues :)
>
>> > Consider storing your scans in DjVu format
>> > [1], which was developed specifically for this purpose.
>>
>> PDF is a common standard whereas DjVu is something I - as an
>> advanced computer user - never faced before in real life. I am not
>> sure whether any of my computers can handle DjVu files at all.
>>
>
> How about the Million Book Project / Universal Digital Library [2] ?
> Even though every computing device is most likely to support PDF, their
> collection is only available in TIFF and DjVu format.
>
> The list of participants and partners [3] (not to mention the magnitude
> and cost of their undertaking) is reason enough (for me, at least) to
> assume that DjVu is deemed to be rather future-proof.
>
> I'm guessing ISO standardization will be only a matter of time.
>
>> The goals of DjVu sound great but I get everything with PDF too.
>> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
>> mp3 again because I could not use many music devices or music
>> management software packages.
>>
>
> Ahhh, VHS vs. Betamax, over and over again...
>
> Companies only succeed in getting everyone stuck with mediocre tools if
> we allow them to.  You don't *need* all devices/software to support the
> superior format.  Just get the ones that do (if there are any...), try
> to enlighten the people in your monkeysphere [4], and then let the free
> market do its work.  Joe Average Consumer will eventually follow (unless
> pornography is at stake, apparently), and the industry will be right on
> his tail.
>
>> I stick to the format *any* computer can handle without special
>> software products. [...]
>
> Somehow this implies that *every* computer is infected with Adobe's
> malware.  I find that rather disconcerting, to be honest :D
>
>> [...] And I do think that I get a higher chance of
>> being able to read my documents twenty years from now.
>>
>
> For your sake, I hope you're right!
>
>> For scanned images I'd prefer PNG instead but the OS X Software of
>> my OfficeJet offers me the ability to generate PDF files where an
>> OCR software adds a searchable text layer above the scanned text.
>> This is *very* important to me since I am able to do full text
>> search on the content of my archived documents.
>>
>
> May be a bit less convenient in daily usage, but you could stick to your
> preference of keeping all your scans in PNG format by keeping the OCR
> output in a separate ASCII file:
>
>   #+begin_src sh
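>     # Glob directly instead of parsing `ls`, so odd file names survive.
>     for i in "${HOME}"/msg/paper/inbox/*.png ; do
>         # tesseract appends ".txt" itself, so pass the name without ".png".
>         tesseract "${i}" "${i%.png}"
>     done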
>   #+end_src
>
> That way you can access your data even on text-only machines,
> and full-text search is only a `grep' away.
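
To make the "only a `grep' away" part concrete, here is a minimal
sketch. It assumes every scan NAME.png has its OCR text sitting next to
it as NAME.txt (adjust the suffix handling if your files are named
differently); the function name `find_scans` is purely illustrative.

```sh
# Hypothetical helper: list the scans whose OCR text mentions a term.
# Assumes the OCR output for NAME.png lives next to it as NAME.txt.
find_scans() {
    # $1 = search term (fixed string, case-insensitive)
    # $2 = directory holding the .png/.txt pairs
    grep -ilF -- "$1" "$2"/*.txt 2>/dev/null | sed 's/\.txt$/.png/'
}

# Example: find_scans "invoice" "${HOME}/msg/paper/inbox"
```

The function prints one image path per matching text file, and prints
nothing when no file matches.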
>
>> And I plan to archive *all* of my documents. Really all of them.
>>
>
> Then you'll probably be interested in Joey Hess' git-annex [5] to keep
> your archive versioned and in sync across all your devices.
>
>> Storage space does not matter (any more) to me since I have more
>> disk space now already than I could possibly fill with my lifetime
>> paper correspondence. And I do think that my disk space continues to
>> grow in future.
>>
>
> I'd argue it still does, otherwise you'd be keeping your scans in
> TIFF format.  And digitized trees surely aren't the only type of
> correspondence you are (or will be) archiving.
>
> Efficiency should always play a major role IMO, even if the available
> resources are (perceived to be) infinite.  Having a hangar instead of a
> garage doesn't warrant driving a schoolbus to work, even if it doesn't
> guzzle a drop of gas.
>
>> [...]
>>
>> Funny side fact: grayscale scan document settings produce slightly
>> larger files than colored ones.
>>
>
> That's odd.  Probably depends on which type of compression is used.
>
>> > gscan2pdf also supports a number of OCR utils, but the UI for this is
>> > clumsy (aren't they all...), so you're better off using the CLI tools
>> > directly.  Tesseract is recommended.
>>
>> I played around with ocropus, tesseract, ocroscript, hocr2pdf,
>> exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
>> PDF documents (OCR text above the scanned images) on GNU/Linux.
>> Unfortunately none of those (very cool projects) produced reliable
>> results on my side. The results vary from «no error but overlay font
>> size is incorrect and produces loss of layout» to «library error
>> messages I can not read or handle».
>>
>> Whereas the HP OfficeJet bundles its OS X software with OCR from
>> Readiris which produces perfect results even in different languages
>> and using a usable user interface.
>>
>
> Sadly, I can only agree with this.  Google's involvement in Tesseract
> and OCRopus does instill hope though :)
>
>> > NOTE: When attempting something like this, a fast scanner with a *reliable*
>> > automatic document feeder will help prevent premature hair loss ;)
>>
>> I have found several scanner products I was interested in:
>>
>> "Canon imageFORMULA P-150": very small form factor with basic Linux
>> support. Price tag starts with € 260. Neat form factor and very
>> portable. Different version "P-150m" for Mac OS X.
>>
>> The authors of [3] use Fujitsu ScanSnap starting at € 400.
>>
>> I ended up with the Office Jet Pro (mentioned above) at € 250
>> because I got flatbed scanner *and* ADF-scanner *and* a
>> full-duplex/full-color network printer with a very good
>> price-per-printed-page-ratio (better than many laser printers!). And
>> all of this with a cheaper price tag than any scan-only-product I
>> was interested in.
>>
>> So far I am almost satisfied. «Almost»? Well, HP did a good job with
>> this printer but they made only a 90% solution on almost all levels.
>> Whereas 100% would be possible with small additional effort when
>> creating the printer. But those resulting 90% are pretty usable.
>>
>>   3. http://qr.cx/sAHU
>> -- 
>> Karl Voit
>>
>>
>
>
> Peace

-- 
Johnny