From: Mathias Bauer
Subject: Re: Bug: text export and multi-word link descriptions with line breaks
Date: Thu, 3 Apr 2014 18:30:24 +0200
Message-ID: <20140403163024.GB27299@gmx.org>
References: <20140403142834.GA27238@gmx.org> <87ha6a4er6.fsf@gmail.com>
In-Reply-To: <87ha6a4er6.fsf@gmail.com>
List-Id: "General discussions about Org-mode."
To: emacs-orgmode@gnu.org

Hello Nicolas,

* Nicolas Goaziou wrote on 2014-04-03 at 17:25 (+0200):
> Mathias Bauer writes:
>
> > I just stumbled over Org's plain text export and how it works on
> > links with descriptions consisting of multiple words and line
> > breaks between them.  I'm running Org stable version 8.2.5h.
> >
> > Org source (spaces at the end of line 1 and 2 don't matter):
> >
> > --------------------snip--------------------
> > "OpenPGP Message Format" ([[https://tools.ietf.org/html/rfc4880][RFC
> > 4880]] which obsoletes [[https://tools.ietf.org/html/rfc1991][RFC
> > 1991]] and [[https://tools.ietf.org/html/rfc2440][RFC 2440]])...
> > ...
> > foo [[https://tools.ietf.org/html/rfc4880][RFC 4880]] bar
> > baz [[https://tools.ietf.org/html/rfc1991][RFC 1991]] foo
> > bar [[https://tools.ietf.org/html/rfc2440][RFC 2440]] baz
> > --------------------snip--------------------
> >
> > Text export result:
> >
> > --------------------snip--------------------
> > "OpenPGP Message Format" ([RFC 4880] which obsoletes [RFC 1991] and [RFC
> > 2440])... ... foo [RFC 4880] bar baz [RFC 1991] foo bar [RFC 2440] baz
> >
> >
> > [RFC 4880] https://tools.ietf.org/html/rfc4880
> >
> > [RFC 1991] https://tools.ietf.org/html/rfc1991
> >
> > [RFC 2440] https://tools.ietf.org/html/rfc2440
> >
> > [RFC 4880] https://tools.ietf.org/html/rfc4880
> >
> > [RFC 1991] https://tools.ietf.org/html/rfc1991
> > --------------------snip--------------------
> >
> > These multiple references look quite bad.  Is it possible to
> > "normalize" the descriptions in some way *before* checking
> > them for uniqueness and output them thereafter?
>
> Could you be more explicit? What does look quite bad? What did
> you expect instead? How is related to line breaks in the
> descriptions?

OK, let's go into more detail.  Look at the Org source text:

1. There are three links, and each of them appears twice.  Both
   occurrences of a link have the same target.

2. Both "[...][RFC 2440]" links fit on a single line.  The links
   "[...][RFC 4880]" and "[...][RFC 1991]" each have one
   occurrence with a newline in its description.  In the source
   they are in fact "[...][RFC\n4880]" and "[...][RFC 4880]" and,
   respectively, "[...][RFC\n1991]" and "[...][RFC 1991]".

So, now let's examine the Org text export:

The final reference part - the five links below the paragraph -
shows two links, [RFC 4880] and [RFC 1991], which appear twice,
whereas the link [RFC 2440] appears only once.  This is, at the
very least, inconsistent.

The point is that Org apparently treats "[...][RFC 4880]" and
"[...][RFC\n4880]" as two different links internally and lists
both of them in the reference part.  For this listing, the \n is
removed; this is what I called "normalization" in my first post.
A human reader, however, sees no difference between the two forms
and is surprised by the duplicates.

I would expect Org to do the following steps while parsing the
source text:

1. "Normalize" or clean the link description, i.e. replace any
   newline by a space, strip leading and trailing spaces, and
   replace any occurrence of "[ \t]+" in the interior by a single
   space.  (To be done.)

2. Check the tuple (description,target) for duplicates and drop
   them.  (Seems ok to me.)

3. Below the paragraph, list the tuples as "[description] target"
   in the order of occurrence in the original text.  (Also seems
   ok to me.)

I hope this makes the issue a little clearer.

Kind regards,
Mathias
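
For illustration, the three steps above can be sketched as follows.
This is only a sketch in Python, not Org's exporter code (which is
Emacs Lisp); the names normalize and unique_references are made up
for this example, and the link list is typed in by hand from the
Org source quoted above.  It assumes that any run of whitespace,
newlines included, should collapse to a single space.

```python
import re


def normalize(description):
    """Step 1: collapse each whitespace run (spaces, tabs, newlines)
    to a single space and strip leading/trailing whitespace."""
    return re.sub(r"\s+", " ", description).strip()


def unique_references(links):
    """Steps 2 and 3: drop duplicate (description, target) tuples
    after normalization, keeping the order of first occurrence."""
    seen = set()
    result = []
    for description, target in links:
        key = (normalize(description), target)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hand-copied from the Org source in this thread (hypothetical input
# structure; a real exporter would get this from the parser).
links = [
    ("RFC\n4880", "https://tools.ietf.org/html/rfc4880"),
    ("RFC\n1991", "https://tools.ietf.org/html/rfc1991"),
    ("RFC 2440", "https://tools.ietf.org/html/rfc2440"),
    ("RFC 4880", "https://tools.ietf.org/html/rfc4880"),
    ("RFC 1991", "https://tools.ietf.org/html/rfc1991"),
    ("RFC 2440", "https://tools.ietf.org/html/rfc2440"),
]

for description, target in unique_references(links):
    print(f"[{description}] {target}")
```

Comparing the raw tuples instead, without step 1, leaves five
distinct entries, which matches the five references in the export
quoted above; with step 1 applied first, only three remain.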