I've had this post flagged since it first appeared and I've been intending to read it and comment on it fully. Since I appear not to be getting around to it, I thought I'd make a brief post for now. I haven't read your proposal fully, but I've implemented some code that might be relevant, or at least a close cousin.

I use my cell phone's voice memo capability as one of my GTD inbox collection points. I record brief voice memos when something occurs to me, and they are saved to my phone's microSD card. Later I export the voice memos to a special "in" folder on my laptop. Then I review the memos, refile them next to whatever org-mode file they apply to, create any new headlines/TODOs if appropriate, and transcribe them into special notes on the headline.

To make all this fast and efficient, I've written a library called transcribe.el, which I've attached. I start by populating the bongo playlist buffer with all the memo files from my in folder (i f "~/in/phone/sounds/*.qcp") and playing the first file. The transcribe.el library provides a global key binding for a command that moves the currently playing file; once I've heard enough of a memo, I use that keystroke to refile it to an appropriate location. As a result, the contents of the "in" folder always represent the memos that have not yet been processed, even if I'm interrupted.

Next I create any appropriate headlines/TODOs for the memo. Then I use the org-add-transcription command, bound to "C-c v z", to add a special kind of note to the headline/TODO. The note is pre-populated with a link to the memo file and, if supported (see below), a timestamp for when the memo was taken. I transcribe the memo, seeking around in it as necessary with the global key binding for bongo-seek, "C-c v s". When I'm done, I save the note and use bongo-seek again to advance to the next memo. I repeat this "move, add headlines, transcribe note" cycle until I'm done.

With this approach I can process my voice memos while moving freely around my org-mode buffers as appropriate, without ever having to switch to a bongo buffer, doing everything from key bindings. The only context switches I have to make are the ones directly related to the contexts of the voice memos themselves. I find it works quite well for me.

The memos are *.qcp files in Qualcomm's PureVoice format. The transcribe.el library includes a bongo backend that plays the PureVoice files using Qualcomm's pvconv converter:

  http://www.qctconnect.com/products/purevoice_downloads.html

The backend converts the files to *.wav files next to the original *.qcp files and plays the *.wav files. The pvconv converter is pretty fast, but even so, long *.qcp recordings can take a couple of seconds to convert before bongo can start playing the file. If someone can work out how to convert the *.qcp files asynchronously, so that bongo can start playing the *.wav before pvconv has finished, that would be great. Because the *.qcp files are so much smaller than the converted *.wav files, the backend deletes the *.wav file once it stops playing it.

Phones that record Qualcomm PureVoice memos embed a timestamp in the memo's filename, and transcribe.el can currently extract that timestamp for use in the transcription note. I'd be interested in contributions for extracting timestamps from voice memos that store them differently.

I'd like to hear any thoughts on this, on whether it can or should be integrated with your concept of a conversation manager, or even kept independent of it. I also hope to review your proposal more thoroughly in the near future.

Ross
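P.S. To make a few of these pieces more concrete, here are some rough, untested sketches. None of this is the actual transcribe.el code (that's in the attachment), and every "my-" name and key binding below is made up purely for illustration. First, the refiling step: the real command acts on whatever file bongo is currently playing, while this sketch just takes a file and a target directory.

  (defun my-refile-memo (file dir)
    "Move FILE into DIR, keeping its basename, and return the new path."
    (interactive "fMemo file: \nDRefile into: ")
    (let ((target (expand-file-name (file-name-nondirectory file) dir)))
      ;; Third argument 1 means: ask before clobbering an existing file.
      (rename-file file target 1)
      target))

  ;; A global binding in the spirit of transcribe.el's move command
  ;; (the key transcribe.el actually uses may well differ):
  (global-set-key (kbd "C-c v m") #'my-refile-memo)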
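The timestamp extraction depends entirely on how the phone names its files. Purely for illustration, this sketch assumes a made-up "YYYYMMDD_HHMMSS.qcp" naming scheme and turns it into an org-style inactive timestamp; the regexp would need adjusting for whatever your phone actually does.

  (defun my-memo-timestamp (file)
    "Return an org timestamp string parsed from FILE's name, or nil."
    (let ((name (file-name-nondirectory file)))
      (when (string-match
             (concat "\\`\\([0-9]\\{4\\}\\)\\([0-9]\\{2\\}\\)\\([0-9]\\{2\\}\\)"
                     "_\\([0-9]\\{2\\}\\)\\([0-9]\\{2\\}\\)\\([0-9]\\{2\\}\\)")
             name)
        (format-time-string
         "[%Y-%m-%d %a %H:%M]"
         (encode-time (string-to-number (match-string 6 name))    ; seconds
                      (string-to-number (match-string 5 name))    ; minutes
                      (string-to-number (match-string 4 name))    ; hours
                      (string-to-number (match-string 3 name))    ; day
                      (string-to-number (match-string 2 name))    ; month
                      (string-to-number (match-string 1 name)))))))  ; year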
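The real org-add-transcription adds a proper note to the headline; what follows is only the "pre-populate it with a link and a timestamp" part, reusing the hypothetical helper above.

  (defun my-insert-memo-note (file)
    "Insert at point a link to FILE plus the memo's timestamp, if known."
    (interactive "fMemo file: ")
    (insert (format "- [[file:%s][%s]] %s\n  "
                    (expand-file-name file)
                    (file-name-nondirectory file)
                    (or (my-memo-timestamp file) ""))))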
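Finally, the convert-and-clean-up dance the backend does, boiled down to one synchronous function. It assumes pvconv is on your PATH and that running it on a *.qcp file writes a *.wav next to it (check your copy's usage), and it plays the file with Emacs' built-in play-sound-file rather than through bongo, so it only illustrates the shape of the thing. An asynchronous version would presumably use start-process with a sentinel, though as I said above, what I'd really like is playback that starts before the conversion finishes.

  (defun my-play-qcp (file)
    "Convert FILE with pvconv, play the resulting wav, then delete the wav."
    (interactive "fQCP file: ")
    (let* ((qcp (expand-file-name file))
           (wav (concat (file-name-sans-extension qcp) ".wav")))
      (unless (eq 0 (call-process "pvconv" nil nil nil qcp))
        (error "pvconv failed on %s" qcp))
      (unwind-protect
          ;; Needs an Emacs built with sound support; blocks until playback ends.
          (play-sound-file wav)
        ;; The wav is only a temporary artifact, so delete it again.
        (when (file-exists-p wav)
          (delete-file wav)))))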