From: Mark Shoulson
Subject: Re: Smart Quotes Exporting
Date: Fri, 15 Jun 2012 16:20:43 +0000 (UTC)
List-Id: "General discussions about Org-mode."
To: emacs-orgmode@gnu.org

Nicolas Goaziou writes:

>
> Hello,
>
> Mark Shoulson writes:
>
> >> ASCII exporter also handle UTF-8. So it's good to have there too.
> >
> > Really? I would have thought ASCII meant ASCII, as in 7-bit clean
> > text.
>
> org-e-ascii.el (as old org-ascii.el) handles ASCII, Latin1 and UTF-8
> encodings.

I noticed that after writing my response. The name just threw me a
little. Yes, that exporter needs to handle it too.

> > It looked to me like your solution would essentially boil down to "do
> > string handling when there's a string, otherwise recur down and find
> > the strings," which essentially means apply it to all the
> > strings... and there were already functions out there applying things
> > to strings, so this can just ride along with them. Here, let's look
> > at your suggestion and see if we can find what I missed:
> > ....
> > So, if it's a string, use the regexps (if they can be smart enough to
> > look at beginning and end of the string, which they can--though I
> > haven't been using the :post-blank property so presumably something
> > is amiss), and if it isn't a string, recur down until you get to a
> > string... Ah, but only if it's in org-element-recursive-objects.
>
> You're missing an important part: the regexps cannot be smart enough for
> quotes at the beginning or the end of the string. There, you must look
> outside the string. Hence:

Well, wait; regexps can make some pretty darn good guesses at the
beginnings or ends of strings. Quotations don't normally end in spaces
(in the conventions used with ""; French typography is different, but if
you're using spaces around your quotes you have worse problems
(line-breaks) to worry about).
So if a string ends in space(s) followed by a quote, it's very likely
that quote is an open-quote for some stuff that comes after. Conversely,
if a string starts with a quote followed by some spaces, it's very
likely a close-quote to what went on before. This isn't quite it;
beginning-of-string followed by quote, then punctuation and then spaces
is also a close-quote, etc... There is a lot of fine-tuning. But even
what I currently have was able to handle your
Caesar said, "/Alea Jacta est./" example.

Yes, there are edge-cases which this won't catch, and it remains to be
seen how pervasive and annoying those are. It may be that repeated
tweaking of regexps will handle enough of the ordinary cases. It may be
that after a few rounds of regexp-hacking someone will finally decide
that regexp-hacking just won't handle enough of the important cases. But
I think even as it stands now we'd probably handle 80-90% of the normal
situations, which really is as much as we reasonably can hope for.

Could I trouble someone to apply my patch, try it out for themselves,
and see just how bad/good the performance is? It seems to work okay for
the cases I've been trying, but maybe my dataset isn't robust enough.
Let's give it a test and see how many actual cases in common usage it
gets wrong. Maybe see how much can be fixed by tuning regexps.

>
> > ] 1. If it has a quote as its first or last position, check for
> > ]    objects before or after the string to guess its status. An
> > ]    object never starts with a white space, but you may have to
> > ]    check :post-blank property in order to know if previous object
> > ]    had white spaces at its end.
>
> But you can only do that from the element containing the string, not
> from the string itself.

The case where a quote both sits at the edge of a string (i.e. at the
border of some element, formatting, etc.) *and* does not have whitespace
next to it, with possible punctuation, does not seem to be a normal
occurrence to me.
If I'm wrong, how common *is* it?

>
> > So the issue with the current state is that it
> > would wind up applying to too much? (it would hit code and verbatim
> > elements, for example, and that would be wrong.)
>
> No, you are not applying it too much (verbatim elements don't contain
> plain-text objects) but your function hasn't got access to enough
> information to be useful.

The on-screen version, of course, will have to be smarter and check for
the "face" formatting to make sure it doesn't happen in comments or
verbatims; I am pretty sure it does not do that yet.

> > wait, called on the top-level parsed tree object, recursively doing
> > its thing before(?) the transcoders of the individual objects get to
> > it.
>
> That's called a parse tree filter. That should be a possibility
> indeed. The function would be applied on the parse tree and would
> replace strings within elements containing plain text (that is
> paragraph, verse-block and table-row types). parse tree filters are
> applied very early in the export process.
>
> Another option would be to integrate it into
> `org-element-normalize-contents', but I think the previous way is
> better.

Maybe. I know it sounds like I'm fixated on the plain-text solution, but
I'm not convinced the envisioned problems are more than theoretical, or
that they will cause an unacceptable amount of error (keeping in mind
that some error *is* acceptable and unavoidable).

> > The on-screen one would still use the plain-string computation, as
> > you said, since the full parse isn't available.
>
> Yes.
>
> > It would also need to be tweaked not to act on verbatim/comment text,
> > etc.
>
> Yes. You may want to use `org-element-at-point' and `org-element-type'
> to tell if you're somewhere smart quotes are allowed (in table,
> table-row, paragraph, verse-block elements).

Probably.
I think I saw some other package make these decisions by peeking at the
formatting and seeing if it is set in comment-face or something, but
checking the element at point is presumably more sensible.

~mark
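
For illustration only, the string-edge guessing argued for in this
message could be sketched roughly as below. This is Python rather than
the Emacs Lisp of the actual patch, and the function name `smarten`, the
punctuation set, and the order of the rules are all assumptions made for
the sketch; none of it is taken from the patch itself. It looks only at
one plain string at a time, which is exactly the limitation under
discussion.

```python
import re

OPEN, CLOSE = "\u201c", "\u201d"  # left and right curly double quotes


def smarten(text: str) -> str:
    """Guess curly double quotes using only the string itself (hypothetical helper)."""
    # Whitespace then a quote at the very end of the string:
    # probably opens a quotation that continues in the next object.
    text = re.sub(r'(?<=\s)"\Z', OPEN, text)
    # Quote at the very start, optional punctuation, then whitespace:
    # probably closes a quotation that began in a previous object.
    text = re.sub(r'\A"(?=[.,;:!?]*\s)', CLOSE, text)
    # Quote at start of string, or after whitespace or an opening bracket,
    # and followed by a non-space character: treat as an opening quote.
    text = re.sub(r'(?:\A|(?<=[\s(\[{]))"(?=\S)', OPEN, text)
    # Whatever is left is assumed to be a closing quote.
    return text.replace('"', CLOSE)


# The three plain strings around the italic object in
#   Caesar said, "/Alea Jacta est./" example.
for piece in ['Caesar said, "', '" example.', 'He said "hi" to me']:
    print(repr(smarten(piece)))
```

A string consisting of a lone `"` falls through to the last rule and is
treated as a close-quote. That is a guess, and it is wrong when the lone
quote actually precedes an emphasized object with no whitespace between
them; that is the kind of case where looking outside the string (the
neighbouring objects and :post-blank) would be needed.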