From: Sebastian Rose
Subject: Re: [bug] org-link-escape and (wrong-type-argument stringp nil)
Date: Wed, 22 Sep 2010 16:25:46 +0200
To: David Maus
Cc: Sébastien Vauban, emacs-orgmode@gnu.org

David Maus writes:
> Sebastian Rose wrote:
>>Is there a reason for this distinction between multibyte and unibyte?
>>I favour the "shotgun-approach" if not. It's bullet-proof.
>
>>The JavaScript function `encodeURIComponent()' encodes the German Umlaut
>>`ü' as `%C3%BC' regardless of the source's actual encoding. That's why
>>I wrote the two functions `org-protocol-unhex-string' and
>>`org-protocol-unhex-compound' (see org-protocol.el).
>
> Ah, yes.
> From my understanding of the RFC, %C3%BC is a valid
> representation of the "ü" character.
>
> I do not yet fully understand
> how to unescape such a representation. E.g. is %C3%BC a hex-encoded
> multibyte char or a succession of two single-byte chars?

It's a hex-encoded multibyte char. JavaScript implementations seem to
turn non-ASCII single-byte chars into multibyte chars first, then encode
the result. This means that if a page is iso-8859-1 encoded (single-byte
`ü'), JavaScript will recode the `ü' to UTF-8 before escaping it. It's
funny, but that's what I found when writing org-protocol.el.

`org-protocol-unhex-string' and `org-protocol-unhex-compound' decode
such a representation.

The trick is in the UTF-8 encoding itself. If the first byte of a
character starts with a 1 bit, more bytes will follow. The number of
leading `1's in that first byte denotes the number of bytes used for the
character; each following byte starts with `10'.

On a GNU/Linux system try

  sh$ man utf-8

Sebastian
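
For illustration, the leading-bits rule described above can be sketched in
Python. This is only a sketch under assumptions, not the Emacs Lisp code in
org-protocol.el: the function names `utf8_sequence_length',
`percent_to_bytes' and `split_characters' are made up for this example, and
the escaped bytes are assumed to be well-formed UTF-8.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes the UTF-8 sequence starting with lead_byte uses."""
    if lead_byte & 0x80 == 0:
        return 1  # 0xxxxxxx: plain ASCII, a single byte
    ones = 0
    mask = 0x80
    while lead_byte & mask:  # count the leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte; more than four 1s is not valid UTF-8
        raise ValueError("not a UTF-8 lead byte: 0x%02X" % lead_byte)
    return ones


def percent_to_bytes(escaped):
    """Turn %XX escapes into raw bytes; other characters are kept as UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == "%" and i + 3 <= len(escaped):
            raw.append(int(escaped[i + 1:i + 3], 16))
            i += 3
        else:
            raw.extend(escaped[i].encode("utf-8"))
            i += 1
    return bytes(raw)


def split_characters(raw):
    """Group raw bytes into characters using the lead byte's leading 1 bits."""
    chars = []
    i = 0
    while i < len(raw):
        n = utf8_sequence_length(raw[i])
        chars.append(raw[i:i + n].decode("utf-8"))
        i += n
    return chars


if __name__ == "__main__":
    # 0xC3 is 11000011: two leading 1s, so %C3%BC is one two-byte character.
    print(utf8_sequence_length(0xC3))
    print(split_characters(percent_to_bytes("M%C3%BCnchen")))
```

Here `%C3%BC' is read as a single character because the lead byte 0xC3
announces a two-byte sequence and 0xBC (10111100) is its continuation byte.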