From mboxrd@z Thu Jan  1 00:00:00 1970
From: Sebastian Rose <sebastian_rose@gmx.de>
Subject: Re: [bug] org-link-escape and (wrong-type-argument stringp	nil)
Date: Wed, 22 Sep 2010 16:25:46 +0200
Message-ID: <87fwx1yhw5.fsf@gmx.de>
References: <87tylkwpq0.fsf@mundaneum.com> <87mxrc1bwj.wl%dmaus@ictsoc.de>
	<87sk14rz3t.fsf@gmx.de> <87sk128cuk.wl%dmaus@ictsoc.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Return-path: <emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org>
Received: from [140.186.70.92] (port=54164 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1OyQGg-0000Ez-Ry
	for emacs-orgmode@gnu.org; Wed, 22 Sep 2010 10:26:00 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <sebastian_rose@gmx.de>) id 1OyQGf-0007Fp-2r
	for emacs-orgmode@gnu.org; Wed, 22 Sep 2010 10:25:54 -0400
Received: from mailout-de.gmx.net ([213.165.64.22]:35079 helo=mail.gmx.net)
	by eggs.gnu.org with smtp (Exim 4.69)
	(envelope-from <sebastian_rose@gmx.de>) id 1OyQGe-0007F0-NR
	for emacs-orgmode@gnu.org; Wed, 22 Sep 2010 10:25:53 -0400
In-Reply-To: <87sk128cuk.wl%dmaus@ictsoc.de> (David Maus's message of "Wed, 22
	Sep 2010 09:19:15 +0200")
List-Id: "General discussions about Org-mode." <emacs-orgmode.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-orgmode>,
	<mailto:emacs-orgmode-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-orgmode>
List-Post: <mailto:emacs-orgmode@gnu.org>
List-Help: <mailto:emacs-orgmode-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-orgmode>,
	<mailto:emacs-orgmode-request@gnu.org?subject=subscribe>
Sender: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
Errors-To: emacs-orgmode-bounces+geo-emacs-orgmode=m.gmane.org@gnu.org
To: David Maus <dmaus@ictsoc.de>
Cc: =?utf-8?Q?S=C3=A9bastien?= Vauban <wxhgmqzgwmuf@spammotel.com>, emacs-orgmode@gnu.org

David Maus <dmaus@ictsoc.de> writes:
> Sebastian Rose wrote:
>>Is there a reason for this distinction between multibyte and unibyte?
>>I favour the "shotgun-approach" if not.  It's bullet-proof.
>
>>The JavaScript function `encodeURIComponent()' encodes the German Umlaut
>>`ü' as `%C3%BC' regardless of the source's encoding, actually.  That's why
>>I wrote the two functions `org-protocol-unhex-string' and
>>`org-protocol-unhex-compound' (see org-protocol.el).
>
> Ah, yes.  From my understanding of the RFC, %C3%BC is a valid
> representation of the "ü" character.
>
> I do not yet fully understand
> how to unescape such a representation.  E.g., is %C3%BC a hexencoded
> multibyte char or a succession of two singlebyte chars?


It's a hexencoded multibyte char.

JavaScript implementations seem to turn non-ascii singlebyte chars into
multibyte chars first, then encode the result.

This means if a page is iso-8859-1 encoded (singlebyte `ü'), JavaScript
will recode the `ü' as UTF-8 before escaping it.  It's funny, but that's
what I found when writing org-protocol.el.
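
A quick way to see this in a JavaScript console or in Node (a minimal
sketch; `encodeURIComponent' and `decodeURIComponent' are the standard
functions, the variable names are only for illustration):

```javascript
// A JavaScript string holds characters, not the bytes of the page it
// came from, so the page encoding no longer matters at this point.
const umlaut = '\u00FC';                     // ü (U+00FC)

// encodeURIComponent escapes the UTF-8 bytes of the character.
const escaped = encodeURIComponent(umlaut);
console.log(escaped);                        // %C3%BC

// The reverse direction treats the two escapes as one UTF-8 sequence.
console.log(decodeURIComponent(escaped) === umlaut);   // true
```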


`org-protocol-unhex-string' and `org-protocol-unhex-compound' decode
such a representation.

The trick is in the utf-8 encoding itself.  If the first byte of a
character starts with a 1 bit, more bytes follow.  The number of leading
`1's in that first byte gives the number of bytes used for the character;
each following byte starts with `10'.  On a GNU/Linux system try

  sh$  man utf-8
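
The bit counting described above can be sketched roughly like this.  It
is written in JavaScript for illustration only and is not the code from
org-protocol.el; the names `utf8SequenceLength' and `unhexUtf8' are made
up, and the sketch assumes well-formed input consisting solely of %XX
escapes (no validation of continuation bytes or truncated sequences):

```javascript
// Length of the UTF-8 sequence introduced by `lead`, found by looking
// at the leading 1 bits of that first byte.
function utf8SequenceLength(lead) {
  if ((lead & 0x80) === 0x00) return 1;  // 0xxxxxxx: plain ASCII
  if ((lead & 0xE0) === 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) === 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) === 0xF0) return 4;  // 11110xxx
  throw new Error('not a UTF-8 lead byte: 0x' + lead.toString(16));
}

// Decode a string made up solely of %XX escapes, e.g. "%C3%BC".
function unhexUtf8(escaped) {
  const bytes = escaped.split('%').slice(1).map((h) => parseInt(h, 16));
  let out = '';
  for (let i = 0; i < bytes.length; ) {
    const len = utf8SequenceLength(bytes[i]);
    // Keep only the payload bits of the lead byte (drop the length marker).
    let cp = len === 1 ? bytes[i] : bytes[i] & (0xFF >> (len + 1));
    for (let k = 1; k < len; k++) {
      // Continuation bytes look like 10xxxxxx and carry 6 payload bits.
      cp = (cp << 6) | (bytes[i + k] & 0x3F);
    }
    out += String.fromCodePoint(cp);
    i += len;
  }
  return out;
}

console.log(unhexUtf8('%C3%BC'));   // ü
```

So %C3%BC is read as one two-byte character: 0xC3 starts with `110',
which announces two bytes, and 0xBC starts with `10', which marks it as
the continuation.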


Sebastian