Hi,

Very often I need to convert docx documents to Org. There are a series
of characters that I prefer to be passed to Org as Org entities and not
literally, so I have written this little filter in Lua for Pandoc. I
share it here in case it could be useful to someone. Of course, the
associative table can be expanded with more replacement cases:

#+begin_src lua :tangle entities.lua
local chars = {["/"] = "\\slash{}", ["*"] = "\\lowast{}", ["<"] = "\\lt{}",
               [">"] = "\\gt{}", ["†"] = "\\dagger{}", [utf8.char(0x00A0)] = "\\nbsp{}"}

function Str (elem)
   x = elem.text:match 'http[^%s]'
   if not x then
      for i in pairs(chars) do
         elem = pandoc.Str(elem.text:gsub (i, chars[i]))
      end
      return elem
   end
end
#+end_src

And a quick test:

#+begin_src sh :results org
str="/ † * < > http://foo.es "
pandoc -f markdown -t org --lua-filter=entities.lua <<< $str
#+end_src

#+RESULTS:
#+begin_src org
\slash{} \dagger{} \lowast{} \lt{} \gt{} http://foo.es \nbsp{}
#+end_src

Best regards,

Juan Manuel
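A possible variant of the filter above, offered only as a sketch. Instead
of running one gsub per table entry, it makes a single pass over the
UTF-8 characters of each string and looks each one up in the table, so
the keys never have to be valid Lua patterns and already inserted
entities are never rescanned. The helper name to_entities and the URL
test (https?://) are assumptions of this sketch, not part of the
original filter:

#+begin_src lua
-- Same replacement table as in the original filter.
local chars = {
  ["/"] = "\\slash{}",
  ["*"] = "\\lowast{}",
  ["<"] = "\\lt{}",
  [">"] = "\\gt{}",
  ["†"] = "\\dagger{}",
  [utf8.char(0x00A0)] = "\\nbsp{}",
}

-- Hypothetical helper: replace every character that has an entry in
-- `chars`. gsub with a table as replacement looks each match up in the
-- table and keeps the match unchanged when there is no entry.
local function to_entities(text)
  return (text:gsub(utf8.charpattern, chars))
end

-- Pandoc calls this for every Str element. Strings that look like
-- URLs are left alone (returning nil keeps the element as it is).
function Str(elem)
  if elem.text:match("https?://") then
    return nil
  end
  return pandoc.Str(to_entities(elem.text))
end
#+end_src

It should be usable in the same way as the original, through
--lua-filter=entities.lua.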
Hi Juan,

> Very often I need to convert docx documents to Org. There are a series
> of characters that I prefer to be passed to Org as Org entities and not
> literally, so I have written this little filter in Lua for Pandoc. I
> share it here in case it could be useful to someone. Of course, the
> associative table can be expanded with more replacement cases:

I’m quite interested in this, thanks for sharing. In fact, I’ll probably add
this to <https://github.com/tecosaur/org-pandoc-import>.

All the best,
Timothy
On 04/01/2022 17:14, Juan Manuel Macías wrote:
>
> Very often I need to convert docx documents to Org. ...
>
> local chars = {["/"] = "\\slash{}", ["*"] = "\\lowast{}", ["<"] = "\\lt{}",
> [">"] = "\\gt{}", ["†"] = "\\dagger{}", [utf8.char(0x00A0)] = "\\nbsp{}"}
> ...
> pandoc -f markdown -t org --lua-filter=entities.lua <<< $str
Ideally this should be done by pandoc itself, and only where a character
would cause incorrect parsing of Org markup. NBSP should probably be
replaced by some exporters; I do not think it is a problem in, e.g.,
HTML files.
Max Nikulin writes:
> Ideally it should be done pandoc and only if it causes incorrect
> parsing of org markup. NBSP, probably, should be replaced by some
> exporters, I do not think, it is a problem e.g. in HTML files.
The reason for this filter is my own comfort. Linguistics texts contain
a lot of certain characters such as "/" or "*", and they are often
italicized or bold. So, in order not to be more confused than necessary,
I prefer that they pass as entities. In general, there are certain
characters that I am more comfortable working with as entities than as
literal characters: for example, a lot of zero-width combining
diacritics that are used a lot in linguistics or epigraphy (and there
are no fonts that include the NFC normalized version of all possible
combinations: in fact, they are not in Unicode, and would have to go to
the private use area). Summarizing, I prefer that these characters have
their actual typographic representation only with LuaTeX. A very typical
example is the character U+0323 (COMBINING DOT BELOW). It is very
uncomfortable to work with /in situ/, although there are fonts that
usually render it well (with the 'mark' OpenType feature).
(Naturally, I have to make a lot of corrections to italics inside Org
later, due to the bad habit that Word users have of applying direct
formatting. Interestingly, only the pandoc docx reader trims the emphasis
before exporting to Org or Markdown, so as not to produce things like
"/ foo /". But the odt reader doesn't. I don't know if I'm missing
something.)
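For the odt case, a filter along these lines might move the surrounding
spaces out of the emphasis before the Org writer sees them. This is only
a sketch: the helper name outer_space_bounds is made up here, and it
assumes that the odt reader leaves the stray spaces as Space elements at
the edges of the Emph content, which I have not checked:

#+begin_src lua
-- Hypothetical helper: return the indices of the first and last
-- elements of `inlines` that are not Space elements. If everything is
-- a Space, the first index ends up greater than the last one.
local function outer_space_bounds(inlines)
  local first, last = 1, #inlines
  while first <= last and inlines[first].t == "Space" do
    first = first + 1
  end
  while last >= first and inlines[last].t == "Space" do
    last = last - 1
  end
  return first, last
end

-- Pandoc calls this for every Emph element. Leading and trailing
-- spaces are emitted outside a new, trimmed Emph.
function Emph(el)
  local content = el.content
  local first, last = outer_space_bounds(content)
  if first == 1 and last == #content then
    return nil -- nothing to trim, keep the element unchanged
  end
  local result = {}
  for i = 1, first - 1 do
    result[#result + 1] = content[i] -- leading spaces
  end
  if first <= last then
    local inner = {}
    for i = first, last do
      inner[#inner + 1] = content[i]
    end
    result[#result + 1] = pandoc.Emph(inner)
  end
  for i = last + 1, #content do
    result[#result + 1] = content[i] -- trailing spaces
  end
  return result
end
#+end_src

The same approach would presumably work for Strong, with a second
function of that name.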
Hi Timothy:
Timothy writes:
> I’m quite interested in this, thanks for sharing. In fact, I’ll probably add
> this to <https://github.com/tecosaur/org-pandoc-import>.
Interesting package. Until now I used a number of homemade functions to
convert docx/odt files from Dired, but I think your package will be very
useful to me ;-)
Best regards,
Juan Manuel
Juan Manuel Macías <maciaschain@posteo.net> writes:

> Hi,
>
> Very often I need to convert docx documents to Org. There are a series
> of characters that I prefer to be passed to Org as Org entities and not
> literally, so I have written this little filter in Lua for Pandoc. I
> share it here in case it could be useful to someone. Of course, the
> associative table can be expanded with more replacement cases:
>
> #+begin_src lua :tangle entities.lua
> local chars = {["/"] = "\\slash{}", ["*"] = "\\lowast{}", ["<"] = "\\lt{}",
>                [">"] = "\\gt{}", ["†"] = "\\dagger{}", [utf8.char(0x00A0)] = "\\nbsp{}"}
>
> function Str (elem)
>    x = elem.text:match 'http[^%s]'
>    if not x then
>       for i in pairs(chars) do
>          elem = pandoc.Str(elem.text:gsub (i, chars[i]))
>       end
>       return elem
>    end
> end
> #+end_src

Neat! Converting Word documents is no fun at all.

BTW, Babel support for Lua isn't very good AFAICT. I poked around
ob-lua.el recently and concluded that one problem is that the Lua
interpreter prints pointers for some data types instead of a
human-readable form that might be parsed.

All the best,
Tom

--
Thomas S. Dye
https://tsdye.online/tsdye
On 04/01/2022 22:06, Juan Manuel Macías wrote:
> Max Nikulin writes:
>
>> Ideally it should be done pandoc and only if it causes incorrect
>> parsing of org markup. NBSP, probably, should be replaced by some
>> exporters, I do not think, it is a problem e.g. in HTML files.
>
> The reason for this filter is my own comfort. Linguistics texts contains
> a lot of certain characters such as "/" or "*", and they are often
> italicized or bold. So, in order not to be more confused than necessary,
> I prefer that they pass as entities.

It seems lightweight markup is more of an annoyance than an advantage
for you. Tom posted some thoughts on more rigorous syntax in the
following message:

Tom Gillespie. Re: Org-syntax: Intra-word markup. Sat, 4 Dec 2021
09:53:11 -0800.
https://list.orgmode.org/CA+G3_PNca3HY6TUDPMfHGt35Amj9a-y8dBNQo+ZvBOV6y3nHYw@mail.gmail.com

For C and C++ it is possible to tune some aspects of compiler behavior
using
#pragma something
directives. In some cases it might be convenient to, e.g., temporarily
disable emphasis markers (or switch to a more verbose alternative)
through a
#+some: keyword

> In general, there are certain
> characters that I am more comfortable working with as entities than as
> literal characters (for example, a lot of zero-width combining
> diacritics that are used a lot in linguistics or epigraphy (and there
> are no fonts that include the NFC normalized version of all possible
> combinations: in fact, they are not in Unicode, and would have to go to
> the private use area). Summarizing, I prefer that these characters have
> their actual typographic representation only with LuaTeX. A very typical
> example is the character U+0323 (COMBINING DOT BELOW). It is very
> uncomfortable to work /in situ/, although there are fonts that usually
> render it well (with the 'mark' otf tag).

I have seen warnings concerning complications due to variants of
character representation, but I have no experience with such nuances
(either typing and editing, or processing text programmatically). The
dagger and the non-breaking space in your code snippet are much simpler
cases.
Max Nikulin writes:

> It seems, lightweight markup is more annoyance than advantage for you.
> Tom posted some thoughts on more rigorous syntax in the following message:

It's generally the opposite: working in Org is a pleasant journey for
me... except when there are dozens of "/" and "*" in a document, and
they are placed in 'unhappy' positions. For example, in phonetics the
"/ ... /" notation is used a lot, and there may be cases like:

#+begin_example
/foo/ /bar/ /baz/
#+end_example

In grammar the asterisk is also used a lot, to designate that a term is
not attested or to indicate that it is ungrammatical:

#+begin_example
*foo *bar *baz
#+end_example

And we can even have the combination of both:

#+begin_example
/*foo/ /*bar/ /*baz/
#+end_example

And in certain cases, they are usually expressed in italics. With these
landscapes, it's worth having a few entities rather than working from
pure LaTeX, which is more accurate, but horribly more verbose.

This is a page from a book I typeset a couple of years ago (when the
pandemic started), entirely from Org:

https://i.imgur.com/f6X7qLs.png

Best regards,

Juan Manuel
On 06/01/2022 00:08, Juan Manuel Macías wrote:
> Max Nikulin writes:
>
>> It seems, lightweight markup is more annoyance than advantage for you.
>> Tom posted some thoughts on more rigorous syntax in the following message:
>
> It's generally the opposite: working in Org is a pleasant journey for
> me... except when there are dozens of "/" and "*" in a document, and
> they placed in 'unhappy' positions. For example, in phonetics the
> "/ ... /" notation is used a lot, and there may be cases like:
>
> #+begin_example
> /foo/ /bar/ /baz/
> #+end_example

Unless you are specifically looking for lightweight markup, I would
remind you about macros:

---- >8 ----
#+macro: ph @@x:@@/@@x:@@$1@@x:@@/@@x:@@

/{{{ph(foo)}}} {{{ph(bar)}}} {{{ph(baz)}}}/
---- 8< ----

From my point of view it is not worse than "\slash{}" entities.

LISP can be easily transformed into a domain-specific language by means
of LISP macros (this is its strong and its weak side at the same time).
I am unaware whether a comparable framework exists for creating custom
lightweight markups.

In LaTeX, for your examples I would expect something like \phonetic{foo}
commands, to have logical markup. Certainly, with some hints, /foo/ in a
particular part of the text might be treated as "phonetic" rather than
"italic" in the intermediate representation, keeping the source easily
readable.

> https://i.imgur.com/f6X7qLs.png

In this example there is no need to replace "<" by an entity, since it
cannot be confused with <http://te.st/> links.
Max Nikulin writes:

> Form my point of view it is not worse than "\slash{}" entities.

Yes, I also use macros a lot, especially for more complex constructions.
Macros, entities and other tricks have their pros and cons, but they
allow me to have a certain group of characters under control.

>> https://i.imgur.com/f6X7qLs.png
>
> In this example there is no need to replace "<" by entity since it can
> not be confused with <http://te.st/> links.

As I have already said, this comes from a desire to keep a series of
characters under control, because they can be confused with the Org
marks or because I can get confused when working with the text,
especially if it is an imported text. In the case of "<" and ">", many
authors usually use them for various contexts in philology (instead of
the correct characters for some of those contexts, such as
RIGHT-POINTING ANGLE BRACKET and LEFT-POINTING ANGLE BRACKET: in that
case, I have to replace with the correct symbol). In general, I am
calmer if I have, in one way or another, all those symbols conveniently
"delimited", because even if Org is not wrong, I can get confused.

Anyway, regardless of the idiosyncrasy of my workflow, the origin of
this thread (the function in Lua) was in case someone wants to adapt it
to their own workflow, and needs to substitute some strings for others
when importing from docx or odt.

Best regards,

Juan Manuel