[OT] Scanning for archiving

emacs-orgmode@gnu.org archives

* [OT] Scanning for archiving
@ 2011-11-05 20:03 Marcelo de Moraes Serpa
  2011-11-05 20:34 ` Achim Gratz
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Marcelo de Moraes Serpa @ 2011-11-05 20:03 UTC (permalink / raw)
  To: Org Mode

Hi list,

I just bought a scanner and started to scan important documents as a
backup, and archiving them with meaningful metadata in orgmode files. Then
a question came to mind - what dpi to use? I'm not really savvy when it
comes to scanning or printing, and I would like a dpi that allows me to
reprint the document at an acceptable quality later if necessary, but that
also doesn't take that much space (600dpi pdfs take around 5MB).

Any insights welcome,

Thanks,

Marcelo.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [OT] Scanning for archiving
  2011-11-05 20:03 [OT] Scanning for archiving Marcelo de Moraes Serpa
@ 2011-11-05 20:34 ` Achim Gratz
  2011-11-05 20:52   ` Marcelo de Moraes Serpa
  2011-11-05 21:01 ` Jan Böcker
  2011-11-05 22:36 ` Pieter Praet
  2 siblings, 1 reply; 16+ messages in thread
From: Achim Gratz @ 2011-11-05 20:34 UTC (permalink / raw)
  To: emacs-orgmode

Marcelo de Moraes Serpa <celoserpa@gmail.com> writes:
> I just bought a scanner and started to scan important documents as a
> backup, and archiving them with meaningful metadata in orgmode files.
> Then a question came to mind - what dpi to use? I'm not really savvy
> when it comes to scanning or printing, and I want like a dpi that
> allows me to reprint the document at an acceptable quality later if
> necessary, but that also doesn't take that much space (600dpi pdfs
> take around 5MB).

Fax in fine mode has about 200dpi resolution.  The raw scan should be in
higher resolution (usually 2x-4x the target resolution depending on the
document quality).  The file to be archived then needs to be compressed
(lossless compression is preferred, e.g. TIFF or PNG) and the bit depth
reduced (black and white, usually).  When making PDF files you need to
make sure that the image data doesn't get re-coded (often into much
inferior JPEG).  For documents containing (color) images it is often
preferable to treat text and images separately.  The best compression
would be achieved if the whole text was extracted via OCR, but that is
probably a lot more effort than you're willing to spend.
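
To get a feeling for how much the resolution and the bit depth matter
before any compression is applied, the raw size of a page can be
estimated with plain shell arithmetic.  This is only a rough sketch:
it assumes an A4 page (210 x 297 mm), ignores row padding and file
headers, and the function name `raw_scan_bytes' is made up for this
example.

```sh
#!/bin/sh
# Uncompressed size in bytes of one A4 page (210 x 297 mm) at a given
# resolution (dpi) and bit depth (bits per pixel).
# Assumption: no row padding and no file header are counted.
raw_scan_bytes() {
    dpi=$1
    bits=$2
    # mm -> pixels is mm * dpi / 25.4, rounded to the nearest integer
    width=$(( (210 * dpi * 10 + 127) / 254 ))
    height=$(( (297 * dpi * 10 + 127) / 254 ))
    # round the total number of bits up to whole bytes
    echo $(( (width * height * bits + 7) / 8 ))
}

# Compare black and white, grayscale and color at 300 dpi
for depth in 1 8 24; do
    printf '300 dpi, %2d-bit: %d bytes\n' "$depth" "$(raw_scan_bytes 300 "$depth")"
done
```

At 300 dpi this works out to roughly 1 MiB for black and white, about
8.3 MiB for 8-bit grayscale and about 25 MiB for 24-bit color, which is
why reducing the bit depth pays off even before compression.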

Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Samples for the Waldorf Blofeld:
http://Synth.Stromeko.net/Downloads.html#BlofeldSamplesExtra

* Re: [OT] Scanning for archiving
  2011-11-05 20:34 ` Achim Gratz
@ 2011-11-05 20:52   ` Marcelo de Moraes Serpa
  0 siblings, 0 replies; 16+ messages in thread
From: Marcelo de Moraes Serpa @ 2011-11-05 20:52 UTC (permalink / raw)
  To: Achim Gratz; +Cc: emacs-orgmode

Thanks Achim!

* Re: [OT] Scanning for archiving
  2011-11-05 20:03 [OT] Scanning for archiving Marcelo de Moraes Serpa
  2011-11-05 20:34 ` Achim Gratz
@ 2011-11-05 21:01 ` Jan Böcker
  2011-11-05 21:06   ` Marcelo de Moraes Serpa
  2011-11-05 22:36 ` Pieter Praet
  2 siblings, 1 reply; 16+ messages in thread
From: Jan Böcker @ 2011-11-05 21:01 UTC (permalink / raw)
  To: Marcelo de Moraes Serpa; +Cc: Org Mode

On 11/05/2011 09:03 PM, Marcelo de Moraes Serpa wrote:
> Hi list,
> 
> I just bought a scanner and started to scan important documents as a
> backup, and archiving them with meaningful metadata in orgmode files.
> Then a question came to mind - what dpi to use? I'm not really savvy
> when it comes to scanning or printing, and I want like a dpi that allows
> me to reprint the document at an acceptable quality later if necessary,
> but that also doesn't take that much space (600dpi pdfs take around 5MB).

Hi Marcelo,

I am using 300 dpi. Even the fine print on my cell phone contract is
still comfortably readable at this resolution.
I guess that about 150 dpi is sufficient for most documents, but I don't
bother thinking about that on a case-by-case basis and just scan
everything at 300 dpi.

I do scan most documents in grayscale and only enable color when required.

Said cell phone contract weighs in at 4.6 MiB for a 6-page grayscale PDF
(about 770 KiB per page).

Btw, my problem with big file sizes is not exactly disk space (which
rapidly becomes cheaper with time) but the time it takes evince to
display the document on my laptop :)

If you are interested in the shell script I use to scan to PDF files,
see
http://www.jboecker.de/2010/04/14/general-reference-filing-with-org-mode.html#sec-5

Hope this helps,
  Jan

* Re: [OT] Scanning for archiving
  2011-11-05 21:01 ` Jan Böcker
@ 2011-11-05 21:06   ` Marcelo de Moraes Serpa
  0 siblings, 0 replies; 16+ messages in thread
From: Marcelo de Moraes Serpa @ 2011-11-05 21:06 UTC (permalink / raw)
  To: Jan Böcker; +Cc: Org Mode

Hi Jan,

I was in fact just looking at that article again a few minutes ago. I
recalled that we had discussed that before briefly and that I saw it
somewhere, and then I remembered about the discussion about your archiving
system.

Thanks again!

Marcelo.

* Re: [OT] Scanning for archiving
  2011-11-05 20:03 [OT] Scanning for archiving Marcelo de Moraes Serpa
  2011-11-05 20:34 ` Achim Gratz
  2011-11-05 21:01 ` Jan Böcker
@ 2011-11-05 22:36 ` Pieter Praet
  2011-11-05 23:35   ` Samuel Wales
  2011-11-07 17:44   ` Karl Voit
  2 siblings, 2 replies; 16+ messages in thread
From: Pieter Praet @ 2011-11-05 22:36 UTC (permalink / raw)
  To: Marcelo de Moraes Serpa, Org Mode

On Sat, 5 Nov 2011 14:03:24 -0600, Marcelo de Moraes Serpa <celoserpa@gmail.com> wrote:
> Hi list,
> 
> I just bought a scanner and started to scan important documents as a
> backup, and archiving them with meaningful metadata in orgmode files. Then
> a question came to mind - what dpi to use? I'm not really savvy when it
> comes to scanning or printing, and I want like a dpi that allows me to
> reprint the document at an acceptable quality later if necessary, but that
> also doesn't take that much space (600dpi pdfs take around 5MB).
> 
> Any insights welcome,
> 
> Thanks,
> 
> Marcelo.

Using PDF for scanned documents results in *huge* files with a seriously
disappointing image quality.  Consider storing your scans in DjVu format
[1], which was developed specifically for this purpose.

I scan all docs @ 600dpi, predominantly gray-scale (only in colour when
it's *really* necessary) and store in DjVu format, all using gscan2pdf [2].

Even at that seemingly overkill resolution, single-page documents are
generally (if they aren't too "grainy") only a few 100 KiB in size.

gscan2pdf also supports a number of OCR utils, but the UI for this is
clumsy (aren't they all...), so you're better off using the CLI tools
directly.  Tesseract is recommended.

I've used this approach to "convert" piles upon piles of old bank
statements to Ledger format, with very little effort.

NOTE: When attempting something like this, a fast scanner with a *reliable*
automatic document feeder will help prevent premature hair loss ;)

Peace

-- 
Pieter

[1] http://djvu.org/resources/whatisdjvu.php
[2] http://gscan2pdf.sourceforge.net/

* Re: [OT] Scanning for archiving
  2011-11-05 22:36 ` Pieter Praet
@ 2011-11-05 23:35   ` Samuel Wales
  2011-11-06 21:59     ` Pieter Praet
  2011-11-07 17:44   ` Karl Voit
  1 sibling, 1 reply; 16+ messages in thread
From: Samuel Wales @ 2011-11-05 23:35 UTC (permalink / raw)
  To: Pieter Praet; +Cc: Org Mode, Marcelo de Moraes Serpa

I used to find that 8-bit 75dpi was legible and small.

What ADF scanners are out there for Linux that have high quality
reliable ADF, are fast, and work well with CLI tools?

Is OCR at the point where it is feasible using CLI?  Combining that
with a new feature to have the Org agenda work with indexers (I
participated in a discussion on that here a long while back) would be
interesting.

On 2011-11-05, Pieter Praet <pieter@praet.org> wrote:
> NOTE: When attempting something like this, a fast scanner with a *reliable*
> automatic document feeder will help prevent premature hair loss ;)

...

> [1] http://djvu.org/resources/whatisdjvu.php
> [2] http://gscan2pdf.sourceforge.net/

* Re: [OT] Scanning for archiving
  2011-11-05 23:35   ` Samuel Wales
@ 2011-11-06 21:59     ` Pieter Praet
  2011-11-07  6:14       ` TP
  0 siblings, 1 reply; 16+ messages in thread
From: Pieter Praet @ 2011-11-06 21:59 UTC (permalink / raw)
  To: Samuel Wales; +Cc: Org Mode, Marcelo de Moraes Serpa

On Sat, 5 Nov 2011 16:35:11 -0700, Samuel Wales <samologist@gmail.com> wrote:
> I used to find that 8-bit 75dpi was legible and small.
> 

True.

It all depends on why you're scanning them in the first place.

75dpi is fine when scanning with collaboration/quick-reference in mind,
but for archival/backup purposes (i.e. absolute peace of mind when your
whole collection of dead trees burns, drowns, or is simply disposed of)
or OCR, you'll want to go with 600dpi and beyond.

If using DjVu instead of PDF, the storage overhead will be negligible.

> What ADF scanners are out there for Linux that have high quality
> reliable ADF, [...]

I wish I knew...  If anyone on this list can think of a scanner whose
ADF doesn't require constant babysitting, I'm betting it won't have a
consumer-grade price tag.

> [...] are fast, [...]

Pretty much all of them, these days.

> and work well with CLI tools?
> 

As long as it's supported by SANE [1], rats are entirely optional.

> Is OCR at the point where it is feasible using CLI? [...]

Depends on how "fancy" the document layout is.  For most documents worth
scanning (let alone OCR'ing), it always has been.  Also see OCRopus [2].

> [...] Combining that
> with a new feature to have the Org agenda work with indexers (I
> participated in a discussion on that here a long while back) would be
> interesting.
> 

If you don't intend to create a perfect ASCII copy of the document, and
your index is restricted to word occurrence/frequency, it'll do just fine.

> On 2011-11-05, Pieter Praet <pieter@praet.org> wrote:
> > NOTE: When attempting something like this, a fast scanner with a *reliable*
> > automatic document feeder will help prevent premature hair loss ;)
> 
> ...
> 
> > [1] http://djvu.org/resources/whatisdjvu.php
> > [2] http://gscan2pdf.sourceforge.net/

Peace

-- 
Pieter

[1] http://www.sane-project.org/sane-supported-devices.html
[2] http://code.google.com/p/ocropus/

* Re: [OT] Scanning for archiving
  2011-11-06 21:59     ` Pieter Praet
@ 2011-11-07  6:14       ` TP
  2011-11-09  8:51         ` Pieter Praet
  2011-11-20 13:57         ` Matt Lundin
  0 siblings, 2 replies; 16+ messages in thread
From: TP @ 2011-11-07  6:14 UTC (permalink / raw)
  To: Org Mode

On Sun, Nov 6, 2011 at 1:59 PM, Pieter Praet <pieter@praet.org> wrote:
> On Sat, 5 Nov 2011 16:35:11 -0700, Samuel Wales <samologist@gmail.com> wrote:
>> I used to find that 8-bit 75dpi was legible and small.
>>
>
> True.
>
> It all depends on why you're scanning them in the first place.
>
> 75dpi is fine when scanning with collaboration/quick-reference in mind,
> but for archival/backup purposes (i.e. absolute peace of mind when your
> whole collection of dead trees burns, drowns, or is simply disposed of)
> or OCR, you'll want to go with 600dpi and beyond.

One common technique is to always scan 300dpi grayscale (or color) and
use clever software to upsample to 600dpi b&w (of course somehow
segmenting scans into "picture" and "text" regions first).

>> What ADF scanners are out there for Linux that have high quality
>> reliable ADF, [...]
>
> I wish I knew...  If anyone on this list can think of a scanner whose
> ADF doesn't require constant babysitting, I'm betting it won't have a
> consumer-grade price tag.

I've heard nice things about the Fujitsu ScanSnap S1500
(http://www.fujitsu.com/global/services/computing/peripheral/scanners/product/s1500/)
and S1500M (http://www.fujitsu.com/global/services/computing/peripheral/scanners/product/s1500m/).
About $450 or so from amazon. The S1300 is about half the price but
also slower.

Apparently the S1500's are supported on Linux via Sane
(http://www.sane-project.org/sane-backends.html#S-FUJITSU). Don't see
any mention of the S1300 (but it probably also works?).

* Re: [OT] Scanning for archiving
  2011-11-07  6:14       ` TP
@ 2011-11-09  8:51         ` Pieter Praet
  2011-11-20 13:57         ` Matt Lundin
  1 sibling, 0 replies; 16+ messages in thread
From: Pieter Praet @ 2011-11-09  8:51 UTC (permalink / raw)
  To: TP, Org Mode

On Sun, 6 Nov 2011 22:14:54 -0800, TP <wingusr@gmail.com> wrote:
> On Sun, Nov 6, 2011 at 1:59 PM, Pieter Praet <pieter@praet.org> wrote:
> > On Sat, 5 Nov 2011 16:35:11 -0700, Samuel Wales <samologist@gmail.com> wrote:
> >> I used to find that 8-bit 75dpi was legible and small.
> >>
> >
> > True.
> >
> > It all depends on why you're scanning them in the first place.
> >
> > 75dpi is fine when scanning with collaboration/quick-reference in mind,
> > but for archival/backup purposes (i.e. absolute peace of mind when your
> > whole collection of dead trees burns, drowns, or is simply disposed of)
> > or OCR, you'll want to go with 600dpi and beyond.
> 
> One common technique is to always scan 300dpi grayscale (or color) and
> use clever software to upsample to 600dpi b&w (of course somehow
> segmenting scans into "picture" and "text" regions first.
> 

Upsampling defies the first law of thermodynamics?

But seriously, after reading up a bit, I'm convinced :)

Quite a manual process though...  Could you recommend any (FOSS)
software that does this automatically?

> >> What ADF scanners are out there for Linux that have high quality
> >> reliable ADF, [...]
> >
> > I wish I knew...  If anyone on this list can think of a scanner whose
> > ADF doesn't require constant babysitting, I'm betting it won't have a
> > consumer-grade price tag.
> 
> I've heard nice things about the Fujitsu ScanSnap S1500
> (http://www.fujitsu.com/global/services/computing/peripheral/scanners/product/s1500/)
> and S1500M (http://www.fujitsu.com/global/services/computing/peripheral/scanners/product/s1500m/).
> About $450 or so from amazon. The S1300 is about half the price but
> also slower.
> 
> Apparently the S1500's are supported on Linux via Sane
> (http://www.sane-project.org/sane-backends.html#S-FUJITSU). Don't see
> any mention of the S1300 (but it probably also works?).
> 


Peace

-- 
Pieter

* Re: [OT] Scanning for archiving
  2011-11-07  6:14       ` TP
  2011-11-09  8:51         ` Pieter Praet
@ 2011-11-20 13:57         ` Matt Lundin
  1 sibling, 0 replies; 16+ messages in thread
From: Matt Lundin @ 2011-11-20 13:57 UTC (permalink / raw)
  To: TP; +Cc: Org Mode

TP <wingusr@gmail.com> writes:

> Apparently the S1500's are supported on Linux via Sane
> (http://www.sane-project.org/sane-backends.html#S-FUJITSU). Don't see
> any mention of the S1300 (but it probably also works?).

I can confirm that the S1300 works well with Linux.

Best,
Matt

* Re: [OT] Scanning for archiving
  2011-11-05 22:36 ` Pieter Praet
  2011-11-05 23:35   ` Samuel Wales
@ 2011-11-07 17:44   ` Karl Voit
  2011-11-09  7:40     ` Pieter Praet
  2011-11-09 14:53     ` Karl Voit
  1 sibling, 2 replies; 16+ messages in thread
From: Karl Voit @ 2011-11-07 17:44 UTC (permalink / raw)
  To: emacs-orgmode

Hi!

Inspired by «Total Recall»[3], a book by two MS Research guys, I
started life logging on my own two months ago.

For this purpose I bought an HP OfficeJet Pro 8500A Plus which costs
€ 250 and has a decent scanner. It can scan and print full duplex.
The scanner has a 30-page ADF which is quite reliable as long as the
paper has not been bent or stapled before.

* Pieter Praet <pieter@praet.org> wrote:
>
> Using PDF for scanned documents results in *huge* files with a seriously
> disappointing image quality.  

I cannot confirm that at all:

,----
| vk@gary ~2d % l 2011-11-02_13-22-45.png
| -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
| vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
| vk@gary ~2d % l 2011-11-02_13-22-45.pdf
| -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
| vk@gary ~2d %
`----

In this example, the PDF is even about 6.5% smaller than the
original PNG. PDF is only a container format.

> Consider storing your scans in DjVu format
> [1], which was developed specifically for this purpose.

PDF is a common standard whereas DjVu is something I - as an
advanced computer user - never faced before in real life. I am not
sure whether any of my computers can handle DjVu files at all.

The goals of DjVu sound great but I get everything with PDF too.
Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
mp3 again because I could not use many music devices or music
management software packages.

I stick to the format *any* computer can handle without special
software products. And I do think that I get a higher chance of
being able to read my documents twenty years from now.

For scanned images I'd prefer PNG instead but the OS X Software of
my OfficeJet offers me the ability to generate PDF files where an
OCR software adds a searchable text layer above the scanned text.
This is *very* important to me since I am able to do full text
search on the content of my archived documents.

And I plan to archive *all* of my documents. Really all of them.

Storage space does not matter (any more) to me since I have more
disk space now already than I could possibly fill with my lifetime
paper correspondence. And I do think that my disk space will continue
to grow in the future.

> I scan all docs @ 600dpi, predominantly gray-scale (only in colour when
> it's *really* necessary) and store in DjVu format, all using gscan2pdf [2].
>
> Even at that seemingly overkill resolution, single-page documents are
> generally (if they aren't too "grainy") only a few 100 KiB in size.

My HP software uses 300 dpi by default and that is fine for me too.

Funny side note: the grayscale scan setting produces slightly
larger files than the color one.

> gscan2pdf also supports a number of OCR utils, but the UI for this is
> clumsy (aren't they all...), so you're better off using the CLI tools
> directly.  Tesseract is recommended.

I played around with ocropus, tesseract, ocroscript, hocr2pdf,
exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
PDF documents (OCR text above the scanned images) on GNU/Linux.
Unfortunately none of those (very cool projects) produced reliable
results on my side. The results vary from «no error but overlay font
size is incorrect and produces loss of layout» to «library error
messages I can not read or handle». 

The HP OfficeJet, on the other hand, bundles its OS X software with
OCR from Readiris, which produces perfect results even in different
languages and comes with a usable user interface.

> NOTE: When attempting something like this, a fast scanner with a *reliable*
> automatic document feeder will help prevent premature hair loss ;)

I have found several scanner products I was interested in:

"Canon imageFORMULA P-150": very small form factor with basic Linux
support. Price tag starts with € 260. Neat form factor and very
portable. Different version "P-150m" for Mac OS X.

The authors of [3] use Fujitsu ScanSnap starting at € 400.

I ended up with the Office Jet Pro (mentioned above) at € 250
because I got flatbed scanner *and* ADF-scanner *and* a
full-duplex/full-color network printer with a very good
price-per-printed-page-ratio (better than many laser printers!). And
all of this with a cheaper price tag than any scan-only-product I
was interested in.

So far I am almost satisfied. «Almost»? Well, HP did a good job with
this printer but delivered only a 90% solution on almost all levels,
whereas 100% would have been possible with a little additional effort.
But the resulting 90% is pretty usable.

[3] http://qr.cx/sAHU
-- 
Karl Voit

* Re: [OT] Scanning for archiving
  2011-11-07 17:44   ` Karl Voit
@ 2011-11-09  7:40     ` Pieter Praet
  2011-11-09  9:06       ` Johnny
  2011-11-09 11:05       ` Karl Voit
  2011-11-09 14:53     ` Karl Voit
  1 sibling, 2 replies; 16+ messages in thread
From: Pieter Praet @ 2011-11-09  7:40 UTC (permalink / raw)
  To: news1142, emacs-orgmode

On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit <devnull@Karl-Voit.at> wrote:
> Hi!
> 
> Inspired by «Total Recall»[3], a book of two MS Research guys, I
> started life logging on my own two months ago.
> 

Dammit, that's been on my reading list for almost 2 years now, and
*still* it isn't available in ebook format.  One would think they'd walk
their talk [1], no?

> [...]
> * Pieter Praet <pieter@praet.org> wrote:
> >
> > Using PDF for scanned documents results in *huge* files with a seriously
> > disappointing image quality.  
> 
> I can not copy that at all:
> 
> ,----
> | vk@gary ~2d % l 2011-11-02_13-22-45.png
> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
> | vk@gary ~2d %
> `----
> 
> In this example, the compression of PDF is much better than the
> original PNG one. PDF is only a container format.
> 

The conversion to PDF has indeed reduced the filesize, but not for the
reasons you might think: If you don't explicitly provide ImageMagick's
`convert' with a compression level (`-quality' option), it will use a
default of 75%.  Thus I (perhaps incorrectly) infer that you've just
lost 25% of the image quality for a meager 7% reduction in filesize.

I do admit that the whole quality vs. filesize statement I made
regarding using PDF for scanned documents wasn't entirely correct:
I cut some corners.

The real issue is that most folks use their scanner software to save
directly to PDF, and for some reason, scanner software (especially the
proprietary variety) predominantly uses JPEG compression as default when
saving to PDF.

JPEG was developed for storing images with smooth transitions and a high
bit depth (i.e. photographs), not hard transitions and a low bit depth
(i.e. documents), so you're likely to suffer a noticeable degradation in
text quality, even when using 1:1 JPEG compression.
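
A rough way to check whether a given PDF carries JPEG data is to look
for the `DCTDecode' filter name, which is what PDF uses for
JPEG-compressed streams.  The following is only a heuristic sketch, and
`pdf_has_jpeg' is a made-up helper name: a match means at least one
stream is JPEG-encoded, while no match does not prove that everything
is lossless, since other filters (e.g. `JPXDecode') can be lossy too.

```sh
#!/bin/sh
# Succeeds (exit status 0) if the PDF contains at least one stream that
# declares the DCTDecode filter, i.e. JPEG-compressed image data.
# Heuristic: searches the raw file bytes for the filter name.
pdf_has_jpeg() {
    LC_ALL=C grep -q '/DCTDecode' "$1"
}

# Usage (file name is only an example):
#   pdf_has_jpeg scan.pdf && echo "scan.pdf contains JPEG-encoded images"
```

Running this on the output of the scanner software and on a PDF made
from a PNG should show which of the two cases above applies.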

You're using PNG compression though, so the whole JPEG deal doesn't apply.

So, that just leaves the neverending stream of PDF security issues :)

> > Consider storing your scans in DjVu format
> > [1], which was developed specifically for this purpose.
> 
> PDF is a common standard whereas DjVu is something I - as an
> advanced computer user - never faced before in real life. I am not
> sure whether any of my computers can handle DjVu files at all.
> 

How about the Million Book Project / Universal Digital Library [2] ?
Even though every computing device is most likely to support PDF, their
collection is only available in TIFF and DjVu format.

The list of participants and partners [3] (not to mention the magnitude
and cost of their undertaking) is reason enough (for me, at least) to
assume that DjVu is deemed to be rather future-proof.

I'm guessing ISO standardization will be only a matter of time.

> The goals of DjVu sound great but I get everything with PDF too.
> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
> mp3 again because I could not use many music devices or music
> management software packages.
> 

Ahhh, VHS vs. Betamax, over and over again...

Companies only succeed in getting everyone stuck with mediocre tools if
we allow them to.  You don't *need* all devices/software to support the
superior format.  Just get the ones that do (if there are any...), try
to enlighten the people in your monkeysphere [4], and then let the free
market do its work.  Joe Average Consumer will eventually follow (unless
pornography is at stake, apparently), and the industry will be right on
his tail.

> I stick to the format *any* computer can handle without special
> software products. [...]

Somehow this implies that *every* computer is infected with Adobe's
malware.  I find that rather disconcerting, to be honest :D

> [...] And I do think that I get a higher chance of
> being able to read my documents twenty years from now.
> 

For your sake, I hope you're right!

> For scanned images I'd prefer PNG instead but the OS X Software of
> my OfficeJet offers me the ability to generate PDF files where an
> OCR software adds a searchable text layer above the scanned text.
> This is *very* important to me since I am able to do full text
> search on the content of my archived documents.
> 

May be a bit less convenient in daily usage, but you could stick to your
preference of keeping all your scans in PNG format by keeping the OCR
output in a separate ASCII file:

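  #+begin_src sh
    # Loop over the glob directly (parsing `ls' breaks on odd filenames)
    for i in "${HOME}"/msg/paper/inbox/*.png ; do
        # tesseract appends ".txt" to the output base name by itself
        tesseract "${i}" "${i%.png}"
    done
  #+end_src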

That way you can access your data even on text-only machines,
and full-text search is only a `grep' away.
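
As a small sketch of what that `grep' could look like, assuming the OCR
output sits as *.txt files in one directory (`search_scans' is just a
name picked for this example):

```sh
#!/bin/sh
# List the OCR text files in a directory that mention a search term.
# The match is case-insensitive and treats the term as a fixed string,
# not as a regular expression.
search_scans() {
    term=$1
    dir=$2
    grep -l -i -F -- "$term" "$dir"/*.txt
}

# Usage (term and path are only examples):
#   search_scans "invoice" "${HOME}/msg/paper/inbox"
```

Each matching file name points back to the scan it was generated from.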

> And I plan to archive *all* of my documents. Really all of them.
> 

Then you'll probably be interested in Joey Hess' git-annex [5] to keep
your archive versioned and in sync across all your devices.

> Storage space does not matter (any more) to me since I have more
> disk space now already than I could possible fill with my lifetime
> paper correspondence. And I do think that my disk space continues to
> grow in future.
> 

I'd argue it still does, otherwise you'd be keeping your scans in
TIFF format.  And digitized trees surely aren't the only type of
correspondence you are (or will be) archiving.

Efficiency should always play a major role IMO, even if the available
resources are (perceived to be) infinite.  Having a hangar instead of a
garage doesn't warrant driving a schoolbus to work, even if it doesn't
guzzle a drop of gas.

> [...]
> 
> Funny side fact: grayscale scan document settings produces slightly
> larger files than colored ones.
> 

That's odd.  Probably depends on which type of compression is used.

> > gscan2pdf also supports a number of OCR utils, but the UI for this is
> > clumsy (aren't they all...), so you're better off using the CLI tools
> > directly.  Tesseract is recommended.
> 
> I played around with ocropus, tesseract, ocroscript, hocr2pdf,
> exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwitch
> PDF documents (OCR text above the scanned images) on GNU/Linux.
> Unfortunately none of those (very cool projects) produced reliable
> results on my side. The results vary from «no error but overlay font
> size is incorrect and produces loss of layout» to «library error
> messages I can not read or handle». 
> 
> Whereas the HP OfficeJet bundles its OS X software with OCR from
> Readiris which produces perfect results even in different languages
> and using a usable user interface.
> 

Sadly, I can only agree with this.  Google's involvement in Tesseract
and OCRopus does instill hope though :)

> > NOTE: When attempting something like this, a fast scanner with a *reliable*
> > automatic document feeder will help prevent premature hair loss ;)
> 
> I have found several scanner products I was interested in:
> 
> "Canon imageFORMULA P-150": very small form factor with basic Linux
> support. Price tag starts with € 260. Neat form factor and very
> portable. Different version "P-150m" for Mac OS X.
> 
> The authors of [3] use Fujitsu ScanSnap starting at € 400.
> 
> I ended up with the Office Jet Pro (mentioned above) at € 250
> because I got flatbed scanner *and* ADF-scanner *and* a
> full-duplex/full-color network printer with a very good
> price-per-printed-page-ratio (better than many laser printers!). And
> all of this with a cheaper price tag than any scan-only-product I
> was interested in.
> 
> So far I am almost satisfied. «Almost»? Well, HP did a good job with
> this printer but they made only a 90% solution on almost all levels.
> Whereas 100% would be possible with small additional effort when
> creating the printer. But those resulting 90% are pretty usable.
> 
>   3. http://qr.cx/sAHU
> -- 
> Karl Voit
> 
> 

Peace

-- 
Pieter

[1] http://www.youtube.com/watch?v=zDcq2lmw0ls
[2] http://www.ulib.org/
[3] http://www.ulib.org/ULIBAboutUs.htm
[4] http://en.wikipedia.org/wiki/Dunbar's_number
[5] http://git-annex.branchable.com/

* Re: [OT] Scanning for archiving
  2011-11-09  7:40     ` Pieter Praet
@ 2011-11-09  9:06       ` Johnny
  2011-11-09 11:05       ` Karl Voit
  1 sibling, 0 replies; 16+ messages in thread
From: Johnny @ 2011-11-09  9:06 UTC (permalink / raw)
  To: Pieter Praet; +Cc: news1142, emacs-orgmode

Apologies for top-posting, but my comment is only inspired by the
conversation and doesn't exactly build on it, so here we go.

I use predominantly pdf in scanning, for one main reason only - it
handles *metadata* nicely (with gscan2pdf). This is nice for searching
later. When playing with DjVu, I didn't find an easy way to amend
metadata - is there any good working method and tools to recommend for
adding metadata for DjVu files?

Thanks.

Pieter Praet <pieter@praet.org> writes:

> On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit <devnull@Karl-Voit.at> wrote:
>> Hi!
>> 
>> Inspired by «Total Recall»[3], a book of two MS Research guys, I
>> started life logging on my own two months ago.
>> 
>
> Dammit, that's been on my reading list for almost 2 years now, and
> *still* it isn't available in ebook format.  One would think they'd walk
> their talk [1], no?
>
>> [...]
>> * Pieter Praet <pieter@praet.org> wrote:
>> >
>> > Using PDF for scanned documents results in *huge* files with a seriously
>> > disappointing image quality.  
>> 
>> I can not copy that at all:
>> 
>> ,----
>> | vk@gary ~2d % l 2011-11-02_13-22-45.png
>> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
>> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
>> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d %
>> `----
>> 
>> In this example, the compression of PDF is much better than the
>> original PNG one. PDF is only a container format.
>> 
>
> The conversion to PDF has indeed reduced the filesize, but not for the
> reasons you might think: If you don't explicitly provide ImageMagick's
> `convert' with a compression level (`-quality' option), it will use a
> default of 75%.  Thus I (perhaps incorrectly) infer that you've just
> lost 25% of the image quality for a meager 7% reduction in filesize.
>
>
> I do admit that the whole quality vs. filesize statement I made
> regarding using PDF for scanned documents wasn't entirely correct:
> I cut some corners.
>
> The real issue is that most folks use their scanner software to save
> directly to PDF, and for some reason, scanner software (especially the
> proprietary variety) predominantly uses JPEG compression as default when
> saving to PDF.
>
> JPEG was developed for storing images with smooth transitions and a high
> bit depth (i.e. photographs), not hard transitions and a low bit depth
> (i.e. documents), so you're likely to suffer a noticeable degradation in
> text quality, even when using 1:1 JPEG compression.
>
> You're using PNG compression though, so the whole JPEG deal doesn't apply.
>
> So, that just leaves the neverending stream of PDF security issues :)
>
>> > Consider storing your scans in DjVu format
>> > [1], which was developed specifically for this purpose.
>> 
>> PDF is a common standard whereas DjVu is something I - as an
>> advanced computer user - never faced before in real life. I am not
>> sure whether any of my computers can handle DjVu files at all.
>> 
>
> How about the Million Book Project / Universal Digital Library [2] ?
> Even though every computing device is most likely to support PDF, their
> collection is only available in TIFF and DjVu format.
>
> The list of participants and partners [3] (not to mention the magnitude
> and cost of their undertaking) is reason enough (for me, at least) to
> assume that DjVu is deemed to be rather future-proof.
>
> I'm guessing ISO standardization will be only a matter of time.
>
>> The goals of DjVu sound great but I get everything with PDF too.
>> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
>> mp3 again because I could not use many music devices or music
>> management software packages.
>> 
>
> Ahhh, VHS vs. Betamax, over and over again...
>
> Companies only succeed in getting everyone stuck with mediocre tools if
> we allow them to.  You don't *need* all devices/software to support the
> superior format.  Just get the ones that do (if there are any...), try
> to enlighten the people in your monkeysphere [4], and then let the free
> market do its work.  Joe Average Consumer will eventually follow (unless
> pornography is at stake, apparently), and the industry will be right on
> his tail.
>
>> I stick to the format *any* computer can handle without special
>> software products. [...]
>
> Somehow this implies that *every* computer is infected with Adobe's
> malware.  I find that rather disconcerting, to be honest :D
>
>> [...] And I do think that I get a higher chance of
>> being able to read my documents twenty years from now.
>> 
>
> For your sake, I hope you're right!
>
>> For scanned images I'd prefer PNG instead but the OS X Software of
>> my OfficeJet offers me the ability to generate PDF files where an
>> OCR software adds a searchable text layer above the scanned text.
>> This is *very* important to me since I am able to do full text
>> search on the content of my archived documents.
>> 
>
> May be a bit less convenient in daily usage, but you could stick to your
> preference of keeping all your scans in PNG format by keeping the OCR
> output in a separate ASCII file:
>
>   #+begin_src sh
>     for i in "${HOME}"/msg/paper/inbox/*.png ; do
>         # tesseract appends ".txt" to the output base, giving ${i}.txt
>         tesseract "${i}" "${i}"
>     done
>   #+end_src
>
> That way you can access your data even on text-only machines,
> and full-text search is only a `grep' away.
>
>> And I plan to archive *all* of my documents. Really all of them.
>> 
>
> Then you'll probably be interested in Joey Hess' git-annex [5] to keep
> your archive versioned and in sync across all your devices.
>
>> Storage space does not matter (any more) to me since I have more
>> disk space now already than I could possible fill with my lifetime
>> paper correspondence. And I do think that my disk space continues to
>> grow in future.
>> 
>
> I'd argue it still does, otherwise you'd be keeping your scans in
> TIFF format.  And digitized trees surely aren't the only type of
> correspondence you are (or will be) archiving.
>
> Efficiency should always play a major role IMO, even if the available
> resources are (perceived to be) infinite.  Having a hangar instead of a
> garage doesn't warrant driving a schoolbus to work, even if it doesn't
> guzzle a drop of gas.
>
>> [...]
>> 
>> Funny side fact: grayscale scan document settings produces slightly
>> larger files than colored ones.
>> 
>
> That's odd.  Probably depends on which type of compression is used.
>
>> > gscan2pdf also supports a number of OCR utils, but the UI for this is
>> > clumsy (aren't they all...), so you're better off using the CLI tools
>> > directly.  Tesseract is recommended.
>> 
>> I played around with ocropus, tesseract, ocroscript, hocr2pdf,
>> exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
>> PDF documents (OCR text above the scanned images) on GNU/Linux.
>> Unfortunately none of those (very cool projects) produced reliable
>> results on my side. The results vary from «no error but overlay font
>> size is incorrect and produces loss of layout» to «library error
>> messages I can not read or handle». 
>> 
>> Whereas the HP OfficeJet bundles its OS X software with OCR from
>> Readiris which produces perfect results even in different languages
>> and using a usable user interface.
>> 
>
> Sadly, I can only agree with this.  Google's involvement in Tesseract
> and OCRopus does instill hope though :)
>
>> > NOTE: When attempting something like this, a fast scanner with a *reliable*
>> > automatic document feeder will help prevent premature hair loss ;)
>> 
>> I have found several scanner products I was interested in:
>> 
>> "Canon imageFORMULA P-150": very small form factor with basic Linux
>> support. Price tag starts with € 260. Neat form factor and very
>> portable. Different version "P-150m" for Mac OS X.
>> 
>> The authors of [3] use Fujitsu ScanSnap starting at € 400.
>> 
>> I ended up with the Office Jet Pro (mentioned above) at € 250
>> because I got flatbed scanner *and* ADF-scanner *and* a
>> full-duplex/full-color network printer with a very good
>> price-per-printed-page-ratio (better than many laser printers!). And
>> all of this with a cheaper price tag than any scan-only-product I
>> was interested in.
>> 
>> So far I am almost satisfied. «Almost»? Well, HP did a good job with
>> this printer but they made only a 90% solution on almost all levels.
>> Whereas 100% would be possible with small additional effort when
>> creating the printer. But those resulting 90% are pretty usable.
>> 
>>   3. http://qr.cx/sAHU
>> -- 
>> Karl Voit
>> 
>> 
>
>
> Peace

-- 
Johnny

* Re: [OT] Scanning for archiving
  2011-11-09  7:40     ` Pieter Praet
  2011-11-09  9:06       ` Johnny
@ 2011-11-09 11:05       ` Karl Voit
  1 sibling, 0 replies; 16+ messages in thread
From: Karl Voit @ 2011-11-09 11:05 UTC (permalink / raw)
  To: emacs-orgmode

(I enjoy the OT discussion here and hope that no one gets upset
because of it on this ML ...)

* Pieter Praet <pieter@praet.org> wrote:
> On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit <devnull@Karl-Voit.at> wrote:
>> Hi!
>> 
>> Inspired by «Total Recall»[3], a book of two MS Research guys, I
>> started life logging on my own two months ago.
>
> Dammit, that's been on my reading list for almost 2 years now, and
> *still* it isn't available in ebook format.  One would think they'd walk
> their talk [1], no?

I personally do not want to read a book other than on paper - for
now. Annotating, highlighting and placing different kinds of
Post-it marks still do not have the digital equivalents I would
like to see :-(

I recommend [1] mainly because of its chapters on how to start and
on best practices. The earlier chapters are future visions and
motivation, which I think we do not need (any more).

Apart from the rough-cut paper edges, which make the book horrible
to handle, it is quite easy and fast to read.

>> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
>> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
>> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
>
> The conversion to PDF has indeed reduced the filesize, but not for the
> reasons you might think: If you don't explicitly provide ImageMagick's
> `convert' with a compression level (`-quality' option), it will use a
> default of 75%.  Thus I (perhaps incorrectly) infer that you've just
> lost 25% of the image quality for a meager 7% reduction in filesize.

Ah, thanks for the clarification! This is indeed an interesting fact.

Still: I have never noticed any problem with the 75% result
:-) It is clearly readable and zoomable on screen and produces very
good results when printed out again.
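
For anyone who wants to redo this comparison, here is a minimal sketch.
It assumes ImageMagick's `convert' is installed (newer releases call it
`magick'); the helper names pct_saved and png_to_pdf_lossless are made
up for illustration. Passing `-compress Zip' asks for lossless
Zip/Deflate compression explicitly instead of relying on whatever
default applies to the input.

```sh
#!/bin/sh
# pct_saved OLD NEW: print the size reduction from OLD to NEW bytes, in percent.
pct_saved() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / a }'
}

# png_to_pdf_lossless IN.png OUT.pdf: wrap a PNG in a PDF using lossless
# Zip (Deflate) compression, so no image quality is traded for file size.
# Assumption: ImageMagick's convert is on the PATH.
png_to_pdf_lossless() {
    convert "$1" -compress Zip "$2"
}

# The two file sizes from the listing quoted above:
pct_saved 103150 96457
```

For the sizes in the quoted listing this works out to roughly 6.5
percent.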

> The real issue is that most folks use their scanner software to save
> directly to PDF, and for some reason, scanner software (especially the
> proprietary variety) predominantly uses JPEG compression as default when
> saving to PDF.

OK, this is interesting (again). So I took a closer look at the
result files my HP OfficeJet produces when I scan to PDF.

«pdfimages» and «file» show me that the PDF files contain
embedded «Netpbm PPM "rawbits" image data» image files.[6] Another
format I had not come across until now.

Seems to be a pixel-based compressed and standardized format. This
is fine to me so far. JPEG would be horrible ...
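
One caveat that may be worth checking: by default «pdfimages» (from
poppler-utils, assumed installed here) writes PBM/PPM files no matter
how the images are stored inside the PDF, so extracted PPM files alone
do not show which compression the scanner software used. Its `-list'
option prints a table whose «enc» column names the stored encoding
(for example «jpeg», or «image» for plain raster data). A minimal
sketch; the helper name pdf_encodings and the file name scan.pdf are
placeholders:

```sh
#!/bin/sh
# pdf_encodings: read `pdfimages -list` output on stdin and print the
# distinct values of the "enc" column (9th field), skipping the two
# header lines at the top of the table.
pdf_encodings() {
    awk 'NR > 2 && NF >= 9 { print $9 }' | sort -u
}

# Usage, with a placeholder file name:
#   pdfimages -list scan.pdf | pdf_encodings
```

If «jpeg» shows up in the output, the scans are stored with JPEG
compression after all.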

> JPEG was developed for storing images ...

Part of my job was to explain to first-term students the differences,
advantages and disadvantages of file formats like JPEG and
PNG. I can tell stories ...

> You're using PNG compression though, so the whole JPEG deal doesn't apply.

Oh, this was just an example of how «convert *png *pdf» reduces file
size. (Which was corrected by you.)

I usually scan directly into «searchable PDF» which the HP Scan
offers me. My OfficeJet even allows me to simply put pages onto the
ADF, press (more) buttons (than necessary) on the printer itself and
without any further interaction, the searchable PDF files are placed
into a folder of my choice, using my file name convention containing
a time stamp of the scan process. This is kinda neat :-)

So I do not even need to turn on my TFT for scanning stuff. (My Mac
Mini is on 24/7 anyway.)

> So, that just leaves the neverending stream of PDF security issues :)

I do not publish blacked out PDF files. Or do you mean something
else? 

There is no security related issue that worries me for now. I have
to protect my data anyhow from being accessed by anyone else,
independent of the file formats.

>> > Consider storing your scans in DjVu format
>> > [1], which was developed specifically for this purpose.
>> 
>> PDF is a common standard whereas DjVu is something I - as an
>> advanced computer user - never faced before in real life. I am not
>> sure whether any of my computers can handle DjVu files at all.
>
> How about the Million Book Project / Universal Digital Library [2] ?

Well, they are so big that they could even use a proprietary format
on purpose. With valid arguments. They have different
requirements than I have.

> Even though every computing device is most likely to support PDF, their
> collection is only available in TIFF and DjVu format.

TIFF is a perfectly widespread standard I would choose for
storing uncompressed raw data. The automotive industry here
uses TIFF images for many purposes outside of CAD. I would
not choose TIFF as a long-term archive format for my personal data.

> I'm guessing ISO standardization will be only a matter of time.

Hope so. Looks like a promising format.

>> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
>> mp3 again because I could not use many music devices or music
>> management software packages.
>
> Ahhh, VHS vs. Betamax, over and over again...

... and my beloved MiniDisc vs. some other formats :-)

Yes. And still: with my current knowledge, I'd choose VHS again. I
can still find VHS devices but Betamax is only available on flea
markets. There are more criteria than technical ones alone.

Please rest assured: I preach open standards and best solutions
whenever I can. I fought (too) many wars for privacy concerns,
security awareness, LaTeX usage, Ogg Vorbis, OpenPGP, GNU/Linux,
W3C, banning Flash, using Open Software products and so forth.

But for music I now prefer encoding in mp3 and for long-term
archiving of documents I prefer searchable PDF files. The first one
is a pseudo-standard widely used and interpreted even with Open
Source tools. And PDF is also an ISO standard and many free software
products are able to generate and interpret PDF files independent of
Adobe products.

That does not mean that there will not - someday - be a better choice
even for my requirements. But for now, the long-term support and broad
availability of PDF reader products urge me to use PDF. And: It
would be very awkward if I tried to generate DjVu files using my
HP OfficeJet scanner device and software :-(

When you are starting to digitize *all* of your paper, the process
of scanning has to be as smooth as possible. Even small steps in
between cost you reliability and time.

And this is why I love Org-mode so much: a really *great* feature set
and visualization techniques that work on simple text files. My life
logging framework[7] is very easy to understand because there are
just text files that have to be generated from different kinds of
information sources. That is pure beauty.

> Companies only succeed in getting everyone stuck with mediocre tools if
> we allow them to.  You don't *need* all devices/software to support the
> superior format.  

On this point, our opinions differ greatly: I want my data to be
interpreted on as many platforms as possible. I do not even want
to bind myself to a single operating system. I want to be able to
switch my productive environment whenever I think that another one
lets me work more efficiently than my current one.

I am a constant optimizer, always on the search for the best
solution out there.

Still I do have a strong feeling that I will stick to Org-mode for a
long time :-)

>> I stick to the format *any* computer can handle without special
>> software products. [...]
>
> Somehow this implies that *every* computer is infected with Adobe's
> malware.  

Fortunately not.[8]

> I find that rather disconcerting, to be honest :D

:-) I am more positive on the Adobe alternatives.

>> [...] And I do think that I get a higher chance of
>> being able to read my documents twenty years from now.
>
> For your sake, I hope you're right!

Let's meet again here in twenty years! :-) We are still in a time
frame where .+20y is within the UNIX epoch *ggg*

> May be a bit less convenient in daily usage, but you could stick to your
> preference of keeping all your scans in PNG format by keeping the OCR
> output in a separate ASCII file:

I see your point. And I too am a big fan of the power of «grep» and
such.

But I prefer to keep the text data and the image data combined. I am
using long file names with basic content description and tags. I can
not think of a usable system that does not require file name
synchronization and such.

And: I do think that the availability and reliability of desktop
search engines are getting even better. Whenever I enter a phrase
search, I want to get the original file and not the extracted txt
file where I have to search for the corresponding original file.
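
The lookup from a text hit back to the original scan could also be
scripted. A minimal sketch, assuming each scan foo.png has its OCR
text in a sidecar file foo.png.txt in the same directory (the function
name find_scans and this layout are assumptions for illustration):

```sh
#!/bin/sh
# find_scans PHRASE DIR: print the scan images in DIR whose OCR sidecar
# text contains PHRASE (case-insensitive, fixed-string match).
# Assumed layout: scan "foo.png" has its OCR text in "foo.png.txt".
find_scans() {
    phrase=$1
    dir=$2
    grep -l -i -F -- "$phrase" "$dir"/*.txt 2>/dev/null |
    while IFS= read -r txt; do
        img=${txt%.txt}          # strip the ".txt" suffix to get the image name
        if [ -f "$img" ]; then
            printf '%s\n' "$img"
        fi
    done
}

# Usage, with a placeholder directory:
#   find_scans "invoice" "$HOME/msg/paper/inbox"
```

This still leaves the file name synchronization problem: renaming a
scan means renaming its sidecar file too.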

>> And I plan to archive *all* of my documents. Really all of them.
>
> Then you'll probably be interested in Joey Hess' git-annex [5] to keep
> your archive versioned and in sync across all your devices.

Absolutely. I stumbled upon git-annex a while ago but have not taken a
look at it yet. So far, I am only using Unison to sync my three
computers and Time Machine + rsync to back up to different disks.

>> Storage space does not matter (any more) to me since I have more
>> disk space now already than I could possible fill with my lifetime
>> paper correspondence. And I do think that my disk space continues to
>> grow in future.
>
> I'd argue it still does, otherwise you'd be keeping your scans in
> TIFF format.  

I do not think so since TIFF is (mostly used) uncompressed and my
method does not cut vital parts of the information. It is good
enough for me. A classical trade-off.

> And digitized trees surely aren't the only type of
> correspondence you are (or will be) archiving.

Of course not. But still: I am not able to fill the disks I have *now*
with all of the papers I own!

That does not mean that I want to use as many GBs as possible
though. And my disk space will grow constantly for the following
decades too. So: disk space is no issue any more. It is just there.

In «Total Recall» you will also find interesting figures on this
issue. I basically knew it before but I was not aware of the
consequences.

> Efficiency should always play a major role IMO, even if the available
> resources are (perceived to be) infinite.  Having a hangar instead of a
> garage doesn't warrant driving a schoolbus to work, even if it doesn't
> guzzle a drop of gas.

Please do not use examples from the real world for this kind of
question since I can very easily falsify your analogy on multiple
levels.

You can not «copy a chair» in the real world. Think about that.

Besides: yes, *unnecessary* waste of storage is not the goal. But
when every hassle to be more efficient is a trade-off against «being
simple to generate» or «being able to be used on all major
platforms» I am willing to spend more MBs of storage to ease my
life.

>> [...]
>> 
>> Funny side fact: grayscale scan document settings produces slightly
>> larger files than colored ones.
>
> That's odd.  Probably depends on which type of compression is used.

Probably. I did not look into that one in that much detail.

>> Whereas the HP OfficeJet bundles its OS X software with OCR from
>> Readiris which produces perfect results even in different languages
>> and using a usable user interface.
>
> Sadly, I can only agree with this.  Google's involvement in Tesseract
> and OCRopus does instill hope though :)

Full Ack!

I'd *love* to see a perfectly usable free solution!

  6. http://en.wikipedia.org/wiki/Netpbm_format
  7. https://github.com/novoid/Memacs
  8. http://en.wikipedia.org/wiki/List_of_PDF_software
-- 
Karl Voit

* Re: [OT] Scanning for archiving
  2011-11-07 17:44   ` Karl Voit
  2011-11-09  7:40     ` Pieter Praet
@ 2011-11-09 14:53     ` Karl Voit
  1 sibling, 0 replies; 16+ messages in thread
From: Karl Voit @ 2011-11-09 14:53 UTC (permalink / raw)
  To: emacs-orgmode

* Karl Voit <devnull@Karl-Voit.at> wrote:
>
> Inspired by «Total Recall»[3], a book of two MS Research guys, I
> started life logging on my own two months ago.
>
> For this purpose I bought an HP OfficeJet Pro 8500A Plus which costs
> € 250 and has a decent scanner. It can scan and print full duplex.
> The scanner has a 30-page ADF which is quite reliable when the paper
> was not bent or stapled before.

Addendum: I just ordered a ScanSnap S1500M[4] which was also used by
the authors of «Total Recall». The HP OfficeJet Pro ADF is OK for
occasional scanning. For big scan jobs, neither the HP Scan software
nor the ADF reliability is sufficient, I am afraid. Approx. 15-20%
of the pages are not scanned because multiple pages are fed
at once. The HP Scan software is not very user friendly: no
display of the total number of scanned pages per job, no
auto-correction of scan skew, no auto delete of empty pages, far too
few keyboard shortcuts for basic functions, ...

Having read quite a few product reviews, I am confident that the
ScanSnap is able to provide a more reliable and easy to use scan
experience for large scan jobs.

After I have scanned all of my current papers, I will probably sell the
ScanSnap and use the scanner of the OfficeJet Pro again.

  4. http://scanners.fcpa.fujitsu.com/scansnapit/scansnap-s1500m.php
-- 
Karl Voit
