From: Sebastian Rose
Subject: Re: Re: org-protocol: non-ASCII characters
Date: Mon, 08 Feb 2010 12:30:09 +0100
Message-ID: <878wb4rmm6.fsf@gmx.de>
References: <86tyu2d5xw.fsf@mn.cs.uvic.ca> <4B693FBD.9010608@jboecker.de> <86d40k9702.fsf_-_@mn.cs.uvic.ca> <4B6D73A2.8000403@jboecker.de> <4B6D7E1A.3030807@jboecker.de>
In-Reply-To: <4B6D7E1A.3030807@jboecker.de> ("Jan Böcker"'s message of "Sat, 06 Feb 2010 15:35:06 +0100")
List-Id: "General discussions about Org-mode."
To: Jan Böcker
Cc: emacs-orgmode@gnu.org, dmg@uvic.ca

Jan Böcker writes:

> On 06.02.2010 14:50, Jan Böcker wrote:
>> AFAIK, your current approach is correct.
>
> I was wrong. The attached patch fixes a bug in the encode_uri function.
> That fixes the non-ASCII characters problem in xournal for me.
>
> The gchar type is just typedef'd to char, which means it is signed.
> To get the byte value, it must be cast to unsigned int first.
>
> - Jan

Hi Jan and Daniel!

Sorry for answering after such a long delay. I read Daniel's mail last
week, but I had to think about the answer.

I'll just describe what the `org-protocol-unhex-string' functions do
here, and what they expect as arguments.

Basically, it is OK to url-encode just the bytes whose binary
representation starts with a 1 (i.e., whose value is higher than 127).

The text to be url-encoded should ideally be UTF-8. If you use
glib::ustring, it's easy to transform any iso-8859 string to utf-8.

Each byte whose binary representation starts with a 1 has to be
url-encoded, as does the `%' character [1], but you could just as well
url-encode the entire utf-8 string.

The function that does the decoding is `org-protocol-unhex-string',
which in turn uses `org-protocol-unhex-compound'.

`man utf-8` shows how org-protocol tries to decode characters.

The JavaScript function `encodeURIComponent()' returns exactly what we
need. It recodes a string to utf-8 and then encodes all characters
except digits, ASCII letters and these punctuation characters:
-_.!~*'()

See ECMA-262 Standard, Section 15.1.3
(http://bclary.com/2004/11/07/ecma-262.html#a-15.1.3 [2]):

  "The character is first transformed into a sequence of octets using
   the UTF-8 transformation..."

Again, note that the decoding mechanism relies on the fact that the
sequence to decode is url-encoded UTF-8.

Example:

The url-encoded UTF-8 representation of the German umlaut `ö' is
`%C3%B6'. Thus

  (org-protocol-unhex-string "%C3%B6")

gives you "ö".

In iso-8859-1, the url-encoded representation of the same character `ö'
is `%F6'. But

  (org-protocol-unhex-string "%F6")

gives you "" - the empty string. There is no utf-8 character with this
binary representation, since every byte starting with a 1 (i.e. bigger
than 127) belongs to a multibyte sequence (2 or more bytes), so a lone
`%F6' cannot be decoded.
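To make the difference between the two encodings concrete, here is a
minimal Python sketch. It is only an illustration: it uses Python's
urllib.parse, not org-protocol, and the helper name `decodes_as_utf8'
is made up for this example. Where org-protocol returns an empty string
for undecodable input (as described above), Python raises an error,
which the helper turns into False.

```python
from urllib.parse import quote, unquote_to_bytes

text = "ö"

# Percent-encode the same character from two different byte encodings.
# safe="" means no punctuation is exempt from encoding.
utf8_encoded = quote(text, safe="", encoding="utf-8")
latin1_encoded = quote(text, safe="", encoding="iso-8859-1")


def decodes_as_utf8(url_encoded):
    """Hypothetical helper: True if the percent-decoded bytes are valid UTF-8."""
    try:
        unquote_to_bytes(url_encoded).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_encoded, decodes_as_utf8(utf8_encoded))
print(latin1_encoded, decodes_as_utf8(latin1_encoded))
```

The first line printed is the UTF-8 form, which decodes cleanly; the
second is the iso-8859-1 form, which does not.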
But:

  (org-protocol-unhex-string "%2F%3C")

gives you, as expected, "/<", which shows that you could safely
url-encode each and every character of a utf-8 encoded string.

==
Footnotes:

[1] The percent character `%' has to be encoded if it is followed by
    [0-9A-Fa-f]{2}, because org-protocol will assume that a sequence
    matching "\\(%[0-9a-f][0-9a-f]\\)+" is an encoded character. That
    said, a `%' should always be url-encoded, since one will hardly
    ever know for sure that a `%' is never followed by
    "[0-9a-f][0-9a-f]".

[2] Get a PDF version of ECMA-262 third edition here:
    http://www.ecma-international.org/publications/standards/Ecma-262.htm
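
For the sending side, the rules above could be sketched roughly like
this in Python. This is an illustration under stated assumptions, not
the xournal patch and not part of org-protocol; the names
`encode_uri_component' and `UNRESERVED' are hypothetical. It encodes
the UTF-8 bytes of the string and leaves alone only the characters
listed above for encodeURIComponent(), so a `%' is always encoded.

```python
import string

# Characters left untouched, as listed in the mail for
# encodeURIComponent(): ASCII letters, digits and -_.!~*'()
# (stored as integer byte values)
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-_.!~*'()").encode("ascii")
)


def encode_uri_component(text):
    """Hypothetical encoder: percent-encode the UTF-8 bytes of `text`."""
    out = []
    # Iterating over bytes in Python yields integers 0..255, so the
    # signed-char pitfall from the quoted C discussion cannot occur here.
    for byte in text.encode("utf-8"):
        if byte in UNRESERVED:
            out.append(chr(byte))
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


print(encode_uri_component("ö"))
print(encode_uri_component("100%ab"))
```

Encoding every byte outside the small unreserved set is more than
strictly required, but it means the sender never has to check whether
a `%' happens to be followed by two hex digits.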