emacs-orgmode@gnu.org archives
* A simple Lua filter for Pandoc
@ 2022-01-04 10:14 Juan Manuel Macías
  2022-01-04 11:26 ` Timothy
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Juan Manuel Macías @ 2022-01-04 10:14 UTC (permalink / raw)
  To: orgmode

Hi,

Very often I need to convert docx documents to Org. There are a series
of characters that I prefer to be passed to Org as Org entities and not
literally, so I have written this little filter in Lua for Pandoc. I
share it here in case it could be useful to someone. Of course, the
associative table can be expanded with more replacement cases:

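#+begin_src lua :tangle entities.lua
  -- Characters to replace, mapped to their Org entities.
  -- The keys are used as Lua patterns; the ones below match literally.
  local chars = {["/"] = "\\slash{}", ["*"] = "\\lowast{}", ["<"] = "\\lt{}",
          [">"] = "\\gt{}", ["†"] = "\\dagger{}", [utf8.char(0x00A0)] = "\\nbsp{}"}

  function Str (elem)
     -- Leave URLs untouched; returning nil keeps the element as it is.
     if elem.text:match 'http[^%s]' then return nil end
     local text = elem.text
     for char, entity in pairs(chars) do
        text = text:gsub(char, entity)
     end
     return pandoc.Str(text)
  end
#+end_src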

And a quick test:

#+begin_src sh :results org
str="/ † * < > http://foo.es  "
pandoc -f markdown -t org --lua-filter=entities.lua <<< "$str"
#+end_src

#+RESULTS:
#+begin_src org
\slash{} \dagger{} \lowast{} \lt{} \gt{} http://foo.es \nbsp{}
#+end_src
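
A note on expanding the table: the keys are handed to gsub as Lua
patterns, so a key such as "." or "%" would be read as a pattern rather
than as a literal character. The following is only a minimal sketch of
one way around that; escape_pattern and replace_entities are
illustrative names that are not part of the filter above, and the
no-break space is written as raw UTF-8 bytes so that the utf8 library
is not required:

```lua
-- Sketch only: escape_pattern and replace_entities are illustrative names,
-- not part of the original filter or of pandoc's API.
local chars = {
  ["/"] = "\\slash{}", ["*"] = "\\lowast{}", ["<"] = "\\lt{}",
  [">"] = "\\gt{}", ["†"] = "\\dagger{}",
  ["\194\160"] = "\\nbsp{}", -- UTF-8 bytes of U+00A0 (NO-BREAK SPACE)
}

-- Escape Lua pattern magic characters so that a key is matched literally.
local function escape_pattern(s)
  return (s:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0"))
end

-- Replace every key of `chars` in `text`; URL-like strings are returned as-is.
local function replace_entities(text)
  if text:match("http[^%s]") then return text end
  for char, entity in pairs(chars) do
    -- A function replacement avoids any special meaning of "%" in the entity.
    text = text:gsub(escape_pattern(char), function () return entity end)
  end
  return text
end

-- Pandoc filter entry point: rebuild the Str element from the new text.
function Str(elem)
  return pandoc.Str(replace_entities(elem.text))
end
```

With this variant a key like ["."] could be added to the table without
it matching every character.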

Best regards,

Juan Manuel 



* Re: A simple Lua filter for Pandoc
  2022-01-04 10:14 A simple Lua filter for Pandoc Juan Manuel Macías
@ 2022-01-04 11:26 ` Timothy
  2022-01-04 15:11   ` Juan Manuel Macías
  2022-01-04 14:05 ` Max Nikulin
  2022-01-04 16:28 ` Thomas S. Dye
  2 siblings, 1 reply; 10+ messages in thread
From: Timothy @ 2022-01-04 11:26 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: emacs-orgmode


Hi Juan,

> Very often I need to convert docx documents to Org. There are a series
> of characters that I prefer to be passed to Org as Org entities and not
> literally, so I have written this little filter in Lua for Pandoc. I
> share it here in case it could be useful to someone. Of course, the
> associative table can be expanded with more replacement cases:

I’m quite interested in this, thanks for sharing. In fact, I’ll probably add
this to <https://github.com/tecosaur/org-pandoc-import>.

All the best,
Timothy


* Re: A simple Lua filter for Pandoc
  2022-01-04 10:14 A simple Lua filter for Pandoc Juan Manuel Macías
  2022-01-04 11:26 ` Timothy
@ 2022-01-04 14:05 ` Max Nikulin
  2022-01-04 15:06   ` Juan Manuel Macías
  2022-01-04 16:28 ` Thomas S. Dye
  2 siblings, 1 reply; 10+ messages in thread
From: Max Nikulin @ 2022-01-04 14:05 UTC (permalink / raw)
  To: emacs-orgmode

On 04/01/2022 17:14, Juan Manuel Macías wrote:
> 
> Very often I need to convert docx documents to Org. ...
> 
>    local chars = {["/"] = "\\slash{}", ["*"] = "\\lowast{}", ["<"] = "\\lt{}",
> 	  [">"] = "\\gt{}", ["†"] = "\\dagger{}", [utf8.char(0x00A0)] = "\\nbsp{}"}
> ...
> pandoc -f markdown -t org --lua-filter=entities.lua <<< $str

Ideally this should be done by pandoc itself, and only where a character 
would cause incorrect parsing of Org markup. NBSP should probably be 
replaced by some exporters; I do not think it is a problem in, e.g., HTML files.





* Re: A simple Lua filter for Pandoc
  2022-01-04 14:05 ` Max Nikulin
@ 2022-01-04 15:06   ` Juan Manuel Macías
  2022-01-05 16:29     ` Max Nikulin
  0 siblings, 1 reply; 10+ messages in thread
From: Juan Manuel Macías @ 2022-01-04 15:06 UTC (permalink / raw)
  To: orgmode

Max Nikulin writes:

> Ideally this should be done by pandoc itself, and only where a character
> would cause incorrect parsing of Org markup. NBSP should probably be
> replaced by some exporters; I do not think it is a problem in, e.g., HTML files.

The reason for this filter is my own comfort. Linguistics texts contain
a lot of certain characters such as "/" or "*", and they are often
italicized or bold. So, in order not to be more confused than necessary,
I prefer that they pass as entities. In general, there are certain
characters that I am more comfortable working with as entities than as
literal characters: for example, many of the zero-width combining
diacritics that are used a lot in linguistics or epigraphy. (There are
no fonts that include the precomposed, NFC-normalized version of all
possible combinations; in fact, they are not in Unicode, and would have
to go in the private use area.) In short, I prefer that these characters
have their actual typographic representation only with LuaTeX. A very
typical example is the character U+0323 (COMBINING DOT BELOW). It is
very uncomfortable to work with /in situ/, although there are fonts that
usually render it well (with the 'mark' OpenType feature).

(Naturally, I have to make a lot of corrections to italics inside Org
afterwards, due to the bad habit that Word users have of applying direct
formatting. Interestingly, only the pandoc docx reader trims the emphasis
before exporting to Org or Markdown, so as not to produce things like
"/ foo /"; the odt reader doesn't. I don't know if I'm missing
something.)
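
On the untrimmed emphasis from the odt reader, one possible workaround
is a small Lua filter that moves leading and trailing spaces out of the
emphasis. This is only a sketch, assuming the stray spaces arrive as
Space elements at the edges of the Emph content; split_edges is an
illustrative helper name, not a pandoc function:

```lua
-- Sketch only: split_edges is an illustrative helper, not part of pandoc.
-- Return the indices of the first and last inline that is not a Space.
local function split_edges(inlines)
  local first, last = 1, #inlines
  while first <= last and inlines[first].t == "Space" do first = first + 1 end
  while last >= first and inlines[last].t == "Space" do last = last - 1 end
  return first, last
end

-- Pandoc filter entry point: turn "/ foo /" into " /foo/ ".
function Emph(el)
  local first, last = split_edges(el.content)
  -- Nothing to trim: returning nil keeps the element unchanged.
  if first == 1 and last == #el.content then return nil end
  local out = {}
  if first > 1 then out[#out + 1] = pandoc.Space() end
  if first <= last then
    local inner = {}
    for i = first, last do inner[#inner + 1] = el.content[i] end
    out[#out + 1] = pandoc.Emph(inner)
  end
  if last < #el.content then out[#out + 1] = pandoc.Space() end
  return out
end
```

The same function body could presumably be reused for Strong by
building pandoc.Strong instead of pandoc.Emph.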



* Re: A simple Lua filter for Pandoc
  2022-01-04 11:26 ` Timothy
@ 2022-01-04 15:11   ` Juan Manuel Macías
  0 siblings, 0 replies; 10+ messages in thread
From: Juan Manuel Macías @ 2022-01-04 15:11 UTC (permalink / raw)
  To: orgmode

Hi Timothy:

Timothy writes:

> I’m quite interested in this, thanks for sharing. In fact, I’ll probably add
> this to <https://github.com/tecosaur/org-pandoc-import>.

Interesting package. Until now I used a number of homemade functions to
convert docx/odt files from Dired, but I think your package will be very
useful to me ;-)

Best regards,

Juan Manuel 



* Re: A simple Lua filter for Pandoc
  2022-01-04 10:14 A simple Lua filter for Pandoc Juan Manuel Macías
  2022-01-04 11:26 ` Timothy
  2022-01-04 14:05 ` Max Nikulin
@ 2022-01-04 16:28 ` Thomas S. Dye
  2 siblings, 0 replies; 10+ messages in thread
From: Thomas S. Dye @ 2022-01-04 16:28 UTC (permalink / raw)
  To: emacs-orgmode


Juan Manuel Macías <maciaschain@posteo.net> writes:

> Hi,
>
> Very often I need to convert docx documents to Org. There are a 
> series
> of characters that I prefer to be passed to Org as Org entities 
> and not
> literally, so I have written this little filter in Lua for 
> Pandoc. I
> share it here in case it could be useful to someone. Of course, 
> the
> associative table can be expanded with more replacement cases:
>
> #+begin_src lua :tangle entities.lua
>   local chars = {["/"] = "\\slash{}", ["*"] = "\\lowast{}", 
>   ["<"] = "\\lt{}",
> 	  [">"] = "\\gt{}", ["†"] = "\\dagger{}", [utf8.char(0x00A0)] 
> = "\\nbsp{}"}
>
>   function Str (elem)
>      x = elem.text:match 'http[^%s]'
>      if not x then
> 	for i in pairs(chars) do
> 	   elem = pandoc.Str(elem.text:gsub (i, chars[i]))
> 	end
> 	return elem
>      end
>   end
> #+end_src

Neat!  Converting Word documents is no fun at all.

BTW, Babel support for Lua isn't very good AFAICT.  I poked around 
ob-lua.el recently and concluded that one problem is that the Lua 
interpreter prints pointers for some data types instead of a 
human-readable form that might be parsed.

All the best,
Tom

-- 
Thomas S. Dye
https://tsdye.online/tsdye



* Re: A simple Lua filter for Pandoc
  2022-01-04 15:06   ` Juan Manuel Macías
@ 2022-01-05 16:29     ` Max Nikulin
  2022-01-05 17:08       ` Juan Manuel Macías
  0 siblings, 1 reply; 10+ messages in thread
From: Max Nikulin @ 2022-01-05 16:29 UTC (permalink / raw)
  To: orgmode; +Cc: Tom Gillespie

On 04/01/2022 22:06, Juan Manuel Macías wrote:
> Max Nikulin writes:
> 
>> Ideally this should be done by pandoc itself, and only where a character
>> would cause incorrect parsing of Org markup. NBSP should probably be
>> replaced by some exporters; I do not think it is a problem in, e.g., HTML files.
> 
> The reason for this filter is my own comfort. Linguistics texts contain
> a lot of certain characters such as "/" or "*", and they are often
> italicized or bold. So, in order not to be more confused than necessary,
> I prefer that they pass as entities.

It seems lightweight markup is more of an annoyance than an advantage for you. 
Tom posted some thoughts on more rigorous syntax in the following message:

Tom Gillespie. Re: Org-syntax: Intra-word markup.
Sat, 4 Dec 2021 09:53:11 -0800.
https://list.orgmode.org/CA+G3_PNca3HY6TUDPMfHGt35Amj9a-y8dBNQo+ZvBOV6y3nHYw@mail.gmail.com

For C and C++ it is possible to tune some aspects of compiler behavior using

     #pragma something

directives. In some cases it might be convenient to, e.g., temporarily 
disable emphasis markers (or switch to a more verbose alternative) through

     #+some: keyword

> In general, there are certain
> characters that I am more comfortable working with as entities than as
> literal characters: for example, many of the zero-width combining
> diacritics that are used a lot in linguistics or epigraphy. (There are
> no fonts that include the precomposed, NFC-normalized version of all
> possible combinations; in fact, they are not in Unicode, and would have
> to go in the private use area.) In short, I prefer that these characters
> have their actual typographic representation only with LuaTeX. A very
> typical example is the character U+0323 (COMBINING DOT BELOW). It is
> very uncomfortable to work with /in situ/, although there are fonts that
> usually render it well (with the 'mark' OpenType feature).

I have seen warnings concerning complications due to variants of 
character representation, but I have no experience with such nuances (either 
typing and editing or processing text programmatically). The dagger and 
non-breaking space in your code snippet were much simpler cases.




* Re: A simple Lua filter for Pandoc
  2022-01-05 16:29     ` Max Nikulin
@ 2022-01-05 17:08       ` Juan Manuel Macías
  2022-01-07 14:29         ` Max Nikulin
  0 siblings, 1 reply; 10+ messages in thread
From: Juan Manuel Macías @ 2022-01-05 17:08 UTC (permalink / raw)
  To: Max Nikulin; +Cc: orgmode

Max Nikulin writes:

> It seems lightweight markup is more of an annoyance than an advantage for you. 
> Tom posted some thoughts on more rigorous syntax in the following message:

It's generally the opposite: working in Org is a pleasant journey for
me... except when there are dozens of "/" and "*" in a document, and
they are placed in 'unhappy' positions. For example, in phonetics the
"/ ... /" notation is used a lot, and there may be cases like:

#+begin_example
/foo/ /bar/ /baz/
#+end_example

In grammar the asterisk is also used a lot, to designate that a term is
not attested or to indicate that it is ungrammatical:

#+begin_example
*foo *bar *baz
#+end_example

And we can even have the combination of both:

#+begin_example
/*foo/ /*bar/ /*baz/
#+end_example

And in certain cases, they are usually expressed in italics. With these
landscapes, it's worth having a few entities rather than working from
pure LaTeX, which is more accurate, but horribly more verbose.

This is a page from a book I typeset a couple of years ago (when the
pandemic started), entirely from Org:

https://i.imgur.com/f6X7qLs.png

Best regards,

Juan Manuel 



* Re: A simple Lua filter for Pandoc
  2022-01-05 17:08       ` Juan Manuel Macías
@ 2022-01-07 14:29         ` Max Nikulin
  2022-01-07 15:14           ` Juan Manuel Macías
  0 siblings, 1 reply; 10+ messages in thread
From: Max Nikulin @ 2022-01-07 14:29 UTC (permalink / raw)
  To: emacs-orgmode

On 06/01/2022 00:08, Juan Manuel Macías wrote:
> Max Nikulin writes:
> 
>> It seems lightweight markup is more of an annoyance than an advantage for you.
>> Tom posted some thoughts on more rigorous syntax in the following message:
> 
> It's generally the opposite: working in Org is a pleasant journey for
> me... except when there are dozens of "/" and "*" in a document, and
> they are placed in 'unhappy' positions. For example, in phonetics the
> "/ ... /" notation is used a lot, and there may be cases like:
> 
> #+begin_example
> /foo/ /bar/ /baz/
> #+end_example

Unless you were seeking a lightweight markup, I would remind you 
about macros:
---- >8 ----
#+macro: ph @@x:@@/@@x:@@$1@@x:@@/@@x:@@

/{{{ph(foo)}}} {{{ph(bar)}}} {{{ph(baz)}}}/
---- 8< ----

From my point of view it is not worse than "\slash{}" entities.
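
If the macro route were preferred at import time, a pandoc Lua filter
could emit the macro call instead of an entity. A rough sketch, assuming
the phonetic form arrives as a single Str such as /foo/ with no spaces
or commas inside; to_ph_macro is a hypothetical helper, "ph" is the
macro defined above, and a raw Org inline is used so that the writer
passes the text through unchanged:

```lua
-- Sketch only: to_ph_macro is a hypothetical helper, not part of pandoc.
-- Return an Org macro call for text of the form /foo/, or nil otherwise.
local function to_ph_macro(text)
  -- Commas are excluded because they separate macro arguments in Org.
  local inner = text:match("^/([^/,]+)/$")
  if inner then
    return "{{{ph(" .. inner .. ")}}}"
  end
  return nil
end

-- Pandoc filter entry point: replace matching Str elements by raw Org text.
function Str(elem)
  local macro = to_ph_macro(elem.text)
  if macro then
    return pandoc.RawInline("org", macro)
  end
  -- Returning nothing keeps the element unchanged.
end
```

Such a filter would conflict with one that rewrites every "/" as an
entity, so only one of the two approaches should be applied to a given
document.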

LISP can be easily transformed into a domain-specific language by means 
of LISP macros (this is both its strength and its weakness). I am 
unaware whether a comparable framework exists for creating custom 
lightweight markups. In LaTeX, for your examples I would expect something 
like \phonetic{foo} commands to provide logical markup. Certainly, with some 
hints, /foo/ in a particular part of the text might be treated as "phonetic" 
rather than "italic" in the intermediate representation, keeping the source 
easily readable.

> https://i.imgur.com/f6X7qLs.png

In this example there is no need to replace "<" by an entity since it 
cannot be confused with <http://te.st/> links.





* Re: A simple Lua filter for Pandoc
  2022-01-07 14:29         ` Max Nikulin
@ 2022-01-07 15:14           ` Juan Manuel Macías
  0 siblings, 0 replies; 10+ messages in thread
From: Juan Manuel Macías @ 2022-01-07 15:14 UTC (permalink / raw)
  To: Max Nikulin; +Cc: orgmode

Max Nikulin writes:

> From my point of view it is not worse than "\slash{}" entities.

Yes, I also use macros a lot, especially for more complex constructions.
Macros, entities and other tricks have their pros and cons, but they
allow me to have a certain group of characters under control.

>> https://i.imgur.com/f6X7qLs.png
>
> In this example there is no need to replace "<" by an entity since it 
> cannot be confused with <http://te.st/> links.

As I have already said, this comes from a desire to keep a series of
characters under control, because they can be confused with the Org
marks or because I can get confused when working with the text,
especially if it is an imported text. In the case of "<" and ">", many
authors usually use them for various contexts in philology (instead of
the correct characters for some of those contexts, such as
RIGHT-POINTING ANGLE BRACKET and LEFT-POINTING ANGLE BRACKET: in that
case, I have to replace with the correct symbol). In general, I am
calmer if I have, in one way or another, all those symbols conveniently
"delimited", because even if Org is not wrong, I can get confused.

Anyway, regardless of the idiosyncrasies of my workflow, I started this
thread (with the Lua function) in case someone wants to adapt it to
their own workflow and needs to substitute some strings for others
when importing from docx or odt.

Best regards,

Juan Manuel 


