From mboxrd@z Thu Jan  1 00:00:00 1970
From: Karl Voit <devnull@Karl-Voit.at>
Subject: Re: [OT] Scanning for archiving
Date: Wed, 9 Nov 2011 12:05:09 +0100
Message-ID: <2011-11-09T11-10-04@devnull.Karl-Voit.at>
References: <CACHMzOHL_PUxsY=PLObyaNkjOQeMhZdxkeWHLCAB=L_eVqMpeg@mail.gmail.com>
	<87vcqy6vtl.fsf@praet.org> <2011-11-07T18-01-23@devnull.Karl-Voit.at>
	<871uthwxol.fsf@praet.org>
Reply-To: news1142@Karl-Voit.at
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Return-path: <emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([140.186.70.92]:41090)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <geo-emacs-orgmode@m.gmane.org>) id 1RO5yF-0007nP-8b
	for emacs-orgmode@gnu.org; Wed, 09 Nov 2011 06:05:36 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <geo-emacs-orgmode@m.gmane.org>) id 1RO5yC-0005vX-S6
	for emacs-orgmode@gnu.org; Wed, 09 Nov 2011 06:05:31 -0500
Received: from lo.gmane.org ([80.91.229.12]:42846)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <geo-emacs-orgmode@m.gmane.org>) id 1RO5yC-0005vT-EK
	for emacs-orgmode@gnu.org; Wed, 09 Nov 2011 06:05:28 -0500
Received: from list by lo.gmane.org with local (Exim 4.69)
	(envelope-from <geo-emacs-orgmode@m.gmane.org>) id 1RO5y8-0006db-Nd
	for emacs-orgmode@gnu.org; Wed, 09 Nov 2011 12:05:24 +0100
Received: from mail.michael-prokop.at ([88.198.6.110])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <emacs-orgmode@gnu.org>; Wed, 09 Nov 2011 12:05:24 +0100
Received: from news1142 by mail.michael-prokop.at with local (Gmexim 0.1
	(Debian)) id 1AlnuQ-0007hv-00
	for <emacs-orgmode@gnu.org>; Wed, 09 Nov 2011 12:05:24 +0100
List-Id: "General discussions about Org-mode." <emacs-orgmode.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-orgmode>,
	<mailto:emacs-orgmode-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-orgmode>
List-Post: <mailto:emacs-orgmode@gnu.org>
List-Help: <mailto:emacs-orgmode-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-orgmode>,
	<mailto:emacs-orgmode-request@gnu.org?subject=subscribe>
Errors-To: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
Sender: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
To: emacs-orgmode@gnu.org

(I enjoy the OT discussion here and hope that no one gets upset
because of it on this ML ...)

* Pieter Praet <pieter@praet.org> wrote:
> On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit <devnull@Karl-Voit.at> wrote:
>> Hi!
>> 
>> Inspired by «Total Recall»[3], a book of two MS Research guys, I
>> started life logging on my own two months ago.
>
> Dammit, that's been on my reading list for almost 2 years now, and
> *still* it isn't available in ebook format.  One would think they'd walk
> their talk [1], no?

I personally do not want to read a book other than on paper - for
now. Annotating, highlighting and placing different kinds of
Post-it markers still have no digital equivalent of the kind I
would like to see :-(

I recommend [1] mainly because of its chapters on how to start and
on best practices. The earlier chapters cover future visions and
motivation, which I think we do not need (any more).

Apart from the raw paper cut, which makes the book horrible to
handle, it is quite easy and fast to read.

>> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
>> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
>> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
>
> The conversion to PDF has indeed reduced the filesize, but not for the
> reasons you might think: If you don't explicitly provide ImageMagick's
> `convert' with a compression level (`-quality' option), it will use a
> default of 75%.  Thus I (perhaps incorrectly) infer that you've just
> lost 25% of the image quality for a meager 7% reduction in filesize.

Ah, thanks for the clarification! This is indeed an interesting fact.

Still: I have never noticed any problem with the 75% result
:-) It is clearly readable and zoomable on screen and produces very
good results when printed out again.
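
For the record, the saving in the two listings quoted above can be
worked out in a few lines. This is only a small sketch in Python: the
two byte counts are copied from the listings, and the function name
is made up for illustration.

```python
def relative_reduction(before: int, after: int) -> float:
    """Return the size reduction as a fraction of the original size."""
    return (before - after) / before


png_bytes = 103150  # 2011-11-02_13-22-45.png, from the listing above
pdf_bytes = 96457   # 2011-11-02_13-22-45.pdf, from the listing above

saved = relative_reduction(png_bytes, pdf_bytes)
print(f"{png_bytes - pdf_bytes} bytes saved ({saved:.1%})")
```

That is 6693 bytes, or roughly 6.5% of the original PNG size.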

> The real issue is that most folks use their scanner software to save
> directly to PDF, and for some reason, scanner software (especially the
> proprietary variety) predominantly uses JPEG compression as default when
> saving to PDF.

OK, this is interesting (again). So I took a closer look at the
result files my HP OfficeJet produces when I scan to PDF.

«pdfimages» and «file» show me that the PDF files contain embedded
«Netpbm PPM "rawbits" image data» image files.[6] Another format I
had not come across until now.

Seems to be a pixel-based, uncompressed and standardized format.
This is fine with me so far. JPEG would be horrible ...
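
One caveat: «pdfimages» writes PPM/PBM files by default no matter how
the images are stored inside the PDF (its «-j» option is what keeps
JPEG data as JPEG), so the PPM output alone does not say which
compression the scanner software used. A rough way to check is to
look for the stream filter names in the PDF file itself. The
following is a minimal sketch in Python using only the standard
library. The function name is made up, and the byte search is just a
heuristic: it also counts filters of non-image streams and it cannot
see into compressed object streams.

```python
import re
from collections import Counter

# Standard PDF stream filter names and the kind of encoding they stand for.
FILTERS = {
    b"DCTDecode": "JPEG (lossy)",
    b"JPXDecode": "JPEG 2000",
    b"FlateDecode": "zlib/deflate (lossless)",
    b"LZWDecode": "LZW (lossless)",
    b"CCITTFaxDecode": "CCITT fax (bilevel)",
    b"JBIG2Decode": "JBIG2 (bilevel)",
}


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count how often each known filter name occurs in raw PDF bytes."""
    counts = Counter()
    for name in FILTERS:
        hits = re.findall(rb"/" + re.escape(name) + rb"\b", pdf_bytes)
        if hits:
            counts[name.decode("ascii")] = len(hits)
    return counts


# Tiny hand-written fragment of an image dictionary, for illustration only.
sample = b"<< /Type /XObject /Subtype /Image /Filter /DCTDecode >>"
for name, hits in count_filters(sample).items():
    print(name, hits, FILTERS[name.encode("ascii")])
```

For a real scan, the bytes would come from something like
open("scan.pdf", "rb").read(), where the file name is only a
placeholder. Any «DCTDecode» hit would suggest that JPEG data is
embedded after all.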

> JPEG was developed for storing images ...

Part of my job was to explain to first-term students the
differences, advantages and disadvantages of file formats like JPEG
and PNG. I can tell stories ...

> You're using PNG compression though, so the whole JPEG deal doesn't apply.

Oh, this was just an example of how «convert *png *pdf» reduces file
size. (Which was corrected by you.)

I usually scan directly into «searchable PDF» which the HP Scan
offers me. My OfficeJet even allows me to simply put pages onto the
ADF, press (more) buttons (than necessary) on the printer itself and
without any further interaction, the searchable PDF files are placed
into a folder of my choice, using my file name convention containing
a time stamp of the scan process. This is kinda neat :-)

So I do not even need to turn on my TFT for scanning stuff. (My Mac
Mini is on 24/7 anyway.)
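
The file name convention is the time stamp pattern visible in the
listing quoted above (2011-11-02_13-22-45.png). As a minimal sketch
of how such names can be generated, assuming Python and a made-up
function name (HP Scan does this on its own, this is just the same
idea in code):

```python
from datetime import datetime


def scan_filename(scanned_at: datetime, extension: str = "pdf") -> str:
    """Build a file name from the scan time, e.g. 2011-11-02_13-22-45.pdf."""
    return scanned_at.strftime("%Y-%m-%d_%H-%M-%S") + "." + extension


print(scan_filename(datetime(2011, 11, 2, 13, 22, 45)))
```

A nice side effect of this pattern is that sorting the names
alphabetically also sorts the scans chronologically.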

> So, that just leaves the neverending stream of PDF security issues :)

I do not publish blacked out PDF files. Or do you mean something
else? 

There is no security related issue that worries me for now. I have
to protect my data anyhow from being accessed by anyone else,
independent of the file formats.

>> > Consider storing your scans in DjVu format
>> > [1], which was developed specifically for this purpose.
>> 
>> PDF is a common standard whereas DjVu is something I - as an
>> advanced computer user - never faced before in real life. I am not
>> sure whether any of my computers can handle DjVu files at all.
>
> How about the Million Book Project / Universal Digital Library [2] ?

Well, they are so big that they could even use a proprietary format
on purpose, and with valid arguments. Their requirements differ
from mine.

> Even though every computing device is most likely to support PDF, their
> collection is only available in TIFF and DjVu format.

TIFF is a perfectly widespread standard that I would choose for
storing uncompressed raw data. The automotive industry here uses
TIFF images for many purposes outside of CAD design. Still, I would
not choose TIFF as a long-term archive format for my personal data.

> I'm guessing ISO standardization will be only a matter of time.

Hope so. Looks like a promising format.

>> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
>> mp3 again because I could not use many music devices or music
>> management software packages.
>
> Ahhh, VHS vs. Betamax, over and over again...

... and my beloved MiniDisc vs. some other formats :-)

Yes. And still: with my current knowledge, I'd choose VHS again. I
can still find VHS devices but Betamax is only available on flea
markets. There are more criteria than technical ones alone.

Please rest assured: I preach open standards and best solutions
whenever I can. I fought (too) many wars for privacy concerns,
security awareness, LaTeX usage, Ogg Vorbis, OpenPGP, GNU/Linux,
W3C, banning Flash, using open source software and so forth.

But for music I now prefer encoding in mp3 and for long-term
archiving of documents I prefer searchable PDF files. The first one
is a pseudo-standard widely used and interpreted even with Open
Source tools. And PDF is also an ISO standard (ISO 32000) and many
free software products are able to generate and interpret PDF files
independent of Adobe products.

That does not mean that there will not - someday - be a better
choice even for my requirements. But for now, the long-term support
and broad availability of PDF readers push me towards PDF. And: It
would be very awkward if I tried to generate DjVu files with my
HP OfficeJet scanner and its software :-(

When you start to digitize *all* of your paper, the process
of scanning has to be as smooth as possible. Even small steps in
between cost you reliability and time.

And this is why I love Org-mode so much: a really *great* feature
set and visualization techniques that work on simple text files. My
life logging framework[7] is very easy to understand because there
are just text files that have to be generated from different kinds
of information sources. That is pure beauty.

> Companies only succeed in getting everyone stuck with mediocre tools if
> we allow them to.  You don't *need* all devices/software to support the
> superior format.  

On this point, our opinions differ greatly: I want my data to be
interpretable on as many platforms as possible. I do not even want
to bind myself to a single operating system. I want to be able to
switch my productive environment whenever I think that another one
lets me work more efficiently than my current one.

I am a constant optimizer, always on the search for the best
solution out there.

Still I do have a strong feeling that I will stick to Org-mode for a
long time :-)

>> I stick to the format *any* computer can handle without special
>> software products. [...]
>
> Somehow this implies that *every* computer is infected with Adobe's
> malware.  

Fortunately not.[8]

> I find that rather disconcerting, to be honest :D

:-) I am more positive about the Adobe alternatives.

>> [...] And I do think that I get a higher chance of
>> being able to read my documents twenty years from now.
>
> For your sake, I hope you're right!

Let's meet again here in twenty years! :-) We are still in a time
frame where .+20y is within the UNIX epoch *ggg*
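
For the curious, the arithmetic behind that remark can be checked
quickly. A minimal sketch in Python, assuming the classic signed
32-bit time_t as the limit in question; the variable names are made
up for illustration:

```python
from datetime import datetime, timezone

# Largest value of a signed 32-bit time_t (seconds since 1970-01-01 UTC).
MAX_INT32 = 2**31 - 1

rollover = datetime.fromtimestamp(MAX_INT32, tz=timezone.utc)
posted = datetime(2011, 11, 9, tzinfo=timezone.utc)  # date of this mail
reunion = posted.replace(year=posted.year + 20)      # the ".+20y" date

print("32-bit rollover:", rollover.isoformat())
print("reunion date:   ", reunion.date().isoformat())
print("still in range: ", reunion < rollover)
```

The rollover is on 2038-01-19, so a reunion in November 2031 still
fits with a bit more than six years to spare.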

> May be a bit less convenient in daily usage, but you could stick to your
> preference of keeping all your scans in PNG format by keeping the OCR
> output in a separate ASCII file:

I see your point. And I too am a big fan of the power of «grep» and
such.

But I prefer to keep the text data and the image data combined. I am
using long file names with basic content description and tags. I can
not think of a usable system that does not require file name
synchronization and such.

And: I do think that the availability and reliability of desktop
search engines are getting even better. Whenever I enter a phrase
search, I want to get the original file and not the extracted txt
file where I have to search for the corresponding original file.

>> And I plan to archive *all* of my documents. Really all of them.
>
> Then you'll probably be interested in Joey Hess' git-annex [5] to keep
> your archive versioned and in sync across all your devices.

Absolutely. I stumbled upon git-annex a while ago but have not taken
a look at it yet. So far, I am only using Unison to sync my three
computers and Time Machine + rsync to back up to different disks.

>> Storage space does not matter (any more) to me since I have more
>> disk space now already than I could possible fill with my lifetime
>> paper correspondence. And I do think that my disk space continues to
>> grow in future.
>
> I'd argue it still does, otherwise you'd be keeping your scans in
> TIFF format.  

I do not think so, since TIFF is (mostly) used uncompressed and my
method does not cut vital parts of the information. It is good
enough for me. A classic trade-off.

> And digitized trees surely aren't the only type of
> correspondence you are (or will be) archiving.

Of course not. But still: I am not able to fill the disks I have
*now* with all of the papers I own!

That does not mean that I want to use as many GBs as possible
though. And my disk space will grow constantly for the following
decades too. So: disk space is no issue any more. It is just there.

In «Total Recall» you will also find interesting figures on this
issue. I basically knew this before, but I was not aware of the
consequences.

> Efficiency should always play a major role IMO, even if the available
> resources are (perceived to be) infinite.  Having a hangar instead of a
> garage doesn't warrant driving a schoolbus to work, even if doesn't
> guzzle a drop of gas.

Please do not use examples from the real world for this kind of
question since I can very easily falsify your analogy on multiple
levels.

You cannot «copy a chair» in the real world. Think about that.

Besides: yes, *unnecessary* waste of storage is not the goal. But
when every hassle to be more efficient is a trade-off against
«being simple to generate» or «being usable on all major
platforms», I am willing to spend more MBs of storage to ease my
life.

>> [...]
>> 
>> Funny side fact: grayscale scan document settings produces slightly
>> larger files than colored ones.
>
> That's odd.  Probably depends on which type of compression is used.

Probably. I did not look into that one in such detail.

>> Whereas the HP OfficeJet bundles its OS X software with OCR from
>> Readiris which produces perfect results even in different languages
>> and using a usable user interface.
>
> Sadly, I can only agree with this.  Google's involvement in Tesseract
> and OCRopus does instill hope though :)

Full Ack!

I'd *love* to see a perfectly usable free solution!

  6. http://en.wikipedia.org/wiki/Netpbm_format
  7. https://github.com/novoid/Memacs
  8. http://en.wikipedia.org/wiki/List_of_PDF_software
-- 
Karl Voit