emacs-orgmode@gnu.org archives
* Fallback fonts in LaTeX export for non latin scripts
@ 2023-08-30  8:25 Juan Manuel Macías
  2023-08-31  8:17 ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-08-30  8:25 UTC (permalink / raw)
  To: orgmode

The Unicode TeX engines, LuaTeX and XeTeX, have certain features to
apply fonts to scripts (Greek, Cyrillic, Arabic, etc.), without the need
to switch fonts explicitly. But LaTeX does not include any functionality
for loading 'fallback fonts' out of the box. From the TeX and LaTeX
point of view this is understandable: since LaTeX is a typographic tool, the
user has the responsibility of choosing the fonts and knowing which
fonts to use. But from the Org side things may look different, as the
average user (who may not be interested in typographical or font
complexities) is looking for immediate readability of their texts when
exporting to any format. We know that, when exporting to LaTeX, this
does not always happen if the text includes non-Latin scripts.

These days I'm working on some experimental code to try to provide Org
with some sort of fallback fonts on LaTeX export. The functionality
would (for now) be tied to LuaTeX + the babel package, since XeTeX,
although it has the ucharclasses package, is more limited.

The idea is to start from a defcustom that is an alist where each element
has the structure (script font). There would also be a default script +
font, for example ("latin" "Linux Libertine"). At the moment it would
only work for the default roman font, but it can be extended to default
sans serif, mono, etc.

The functionality would not be activated by default. When activated, it
also enables LuaTeX as the default LaTeX engine, and on each export a
list of non-latin scripts in the buffer is extracted. Perhaps with
some code like this, which checks for any non-latin characters:

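(let (scripts)
  (save-excursion
    (goto-char (point-min))
    ;; Find every character outside Basic Latin, Latin-1 Supplement and
    ;; Latin Extended-A (U+0000..U+017F).
    (while (re-search-forward "[^\u0000-\u017F]" nil t)
      (let ((script (aref char-script-table
                          (string-to-char (match-string 0)))))
        ;; SCRIPTS is let-bound, so collect with `push', not `add-to-list'.
        (when (and script (not (memq script scripts)))
          (push script scripts)))))
  scripts)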

?

Once the list has been extracted, an ad hoc preamble would be formatted
assigning each script the chosen font.
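
As a rough illustration of that last step, here is a minimal sketch of how a list of detected scripts might be turned into preamble lines. The names `my/script-font-alist' and `my/script-preamble' are hypothetical, the table uses symbols (as `char-script-table' returns them) rather than the strings mentioned above, and mapping a script to a single babel language (Cyrillic to Russian, say) is a simplifying assumption. The \babelprovide and \babelfont lines follow the usual babel form, nothing more.

```elisp
;; Hypothetical table, not an existing Org option: each entry is
;; (SCRIPT BABEL-LANGUAGE FONT).  SCRIPT is a symbol as returned by
;; `char-script-table'.
(defvar my/script-font-alist
  '((greek    "greek"   "Linux Libertine O")
    (cyrillic "russian" "Old Standard")
    (arabic   "arabic"  "FreeSerif"))
  "Illustrative mapping from script symbols to a babel language and font.")

(defun my/script-preamble (scripts)
  "Return babel declarations for SCRIPTS, a list of script symbols.
Scripts without an entry in `my/script-font-alist' are skipped."
  (mapconcat
   #'identity
   (delq nil
         (mapcar
          (lambda (script)
            (let ((entry (assq script my/script-font-alist)))
              (when entry
                (format "\\babelprovide[onchar=ids fonts]{%s}\n\\babelfont[%s]{rm}{%s}"
                        (nth 1 entry) (nth 1 entry) (nth 2 entry)))))
          scripts))
   "\n"))
```

Calling (my/script-preamble '(latin cyrillic)) would yield the two lines for Russian only, since the hypothetical table has no entry for `latin'.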

WDYT? Do you think this would be a viable path? I think that in a few
days I can offer something usable for discussion.

Best regards,

Juan Manuel

--
Juan Manuel Macías



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-08-30  8:25 Fallback fonts in LaTeX export for non latin scripts Juan Manuel Macías
@ 2023-08-31  8:17 ` Ihor Radchenko
  2023-08-31 11:42   ` Juan Manuel Macías
  0 siblings, 1 reply; 21+ messages in thread
From: Ihor Radchenko @ 2023-08-31  8:17 UTC (permalink / raw)
  To: Juan Manuel Macías, Timothy; +Cc: orgmode

Juan Manuel Macías <maciaschain@posteo.net> writes:

> These days I'm working on some experimental code to try to provide Org
> with some sort of fallback fonts on LaTeX export. The functionality
> would (for now) be linked to LuaTeX + babel package, since XeTeX,
> although it has the ucharclasses package, is more limited.

Thanks! That would be a welcome addition.

> The idea is to start from a defcustom that is an alist where each element
> has the structure (script font). There would also be a default script +
> font, for example ("latin" "Linux Libertine"). At the moment it would
> only work for the default roman font, but it can be extended to default
> sans serif, mono, etc.

Are the fonts you have in mind shipped with LuaTeX distribution?

> The functionality would not be activated by default. When activated, it
> also enables LuaTeX as the default LaTeX engine, and on each export a
> list of non-latin scripts in the buffer is extracted. Perhaps with
> some code like this, which checks for any non-latin characters:
>
> (let (scripts)
>   (save-excursion
>     (goto-char (point-min))
>     ;; Find every character outside Basic Latin, Latin-1 Supplement and
>     ;; Latin Extended-A (U+0000..U+017F).
>     (while (re-search-forward "[^\u0000-\u017F]" nil t)
>       (let ((script (aref char-script-table
>                           (string-to-char (match-string 0)))))
>         ;; SCRIPTS is let-bound, so collect with `push', not `add-to-list'.
>         (when (and script (not (memq script scripts)))
>           (push script scripts)))))
>   scripts)
>
> ?
>
> Once the list has been extracted, an ad hoc preamble would be formatted
> assigning each script the chosen font.
>
> WDYT? Do you think this would be a viable path? I think that in a few
> days I can offer something usable for discussion.

Adding Timothy to CC. His WIP conditional preamble branch looks suitable
to add the proposed functionality.

What will happen if LuaTeX is not installed on the system?

Also, just to double check, is LuaTeX fully compatible to LaTeX? That
is, if we have an existing org file using LaTeX-specific commands and
packages, will it work with LuaTeX?

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-08-31  8:17 ` Ihor Radchenko
@ 2023-08-31 11:42   ` Juan Manuel Macías
  2023-09-01  9:18     ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-08-31 11:42 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Timothy, orgmode

Ihor Radchenko writes:

> Juan Manuel Macías <maciaschain@posteo.net> writes:

>> The idea is to start from a defcustom that is an alist where each element
>> has the structure (script font). There would also be a default script +
>> font, for example ("latin" "Linux Libertine"). At the moment it would
>> only work for the default roman font, but it can be extended to default
>> sans serif, mono, etc.
>
> Are the fonts you have in mind shipped with LuaTeX distribution?

Yes, in fact the complete installation of TeX Live includes a wide
catalog of free OpenType fonts with good coverage for non-Latin scripts.
In addition, more free (as in freedom), easily accessible fonts can be
recommended. Many GNU/Linux distros even include them already. In any
case, the fonts issue is the most delicate part. What default fonts to
add to the list? Here the user's taste and preferences come into play.
It must also be taken into account that if one has typographical
scruples, not all fonts match each other. For design purposes, I mean.
The Computer Modern, which is a modern style font (similar to the Didot
or Bodoni), does not usually pair well with (for example) a Garamond,
which is in the Renaissance style. That's why I think the best solution
would be to offer a basic defcustom, based on the purely utilitarian,
and let the user modify or extend it according to their taste,
preferences or convenience.

Another thing to keep in mind is the following. Offering basic
readability based on the unicode scripts means that we rely on scripts
and not languages. For example, the Cyrillic script covers several
languages, as you well know: Russian, Bulgarian, etc. The Latin script
is used for languages as diverse as English or Vietnamese. The choice of
font based on the script is a low-level LuaTeX functionality, that is,
it does not add features specific to each language, such as hyphenation
patterns. This means that long texts in (for example) Cyrillic or Greek
are not justified well because LaTeX does not know how to hyphenate them:

https://i.imgur.com/PSja3x2.png

However, this may be sufficient for documents containing words or small
texts in non latin scripts, rather than long texts.

There is another possibility that I am working on in parallel: relying
on languages instead of scripts. This would add both readability and
support for each particular language. There could be two options for the
user: a basic one (the low level one, based on scripts: ensures
readability but the document may not look pretty) and an advanced one,
based on language support. Something like this occurred to me:

#+LaTeX_Header: % !enable-fonts-for ancientgreek russian:Old Standard arabic

This means: enable default fonts for ancient Greek and Arabic
(associated with Greek and Arabic scripts). For Russian, enable the Old
Standard font (included in TeX live). And in the case of Arabic, enable
'bidi' (bidirectional text). If the user added that line it would be
enough to do the magic. I hope :-)

>> The functionality would not be activated by default. When activated, it
>> also enables LuaTeX as the default LaTeX engine, and on each export a
>> list of non-latin scripts in the buffer is extracted. Perhaps with
>> some code like this, which checks for any non-latin characters:
>>
>> (let (scripts)
>>   (save-excursion
>>     (goto-char (point-min))
>>     ;; Find every character outside Basic Latin, Latin-1 Supplement and
>>     ;; Latin Extended-A (U+0000..U+017F).
>>     (while (re-search-forward "[^\u0000-\u017F]" nil t)
>>       (let ((script (aref char-script-table
>>                           (string-to-char (match-string 0)))))
>>         ;; SCRIPTS is let-bound, so collect with `push', not `add-to-list'.
>>         (when (and script (not (memq script scripts)))
>>           (push script scripts)))))
>>   scripts)
>>
>> ?
>>
>> Once the list has been extracted, an ad hoc preamble would be formatted
>> assigning each script the chosen font.
>>
>> WDYT? Do you think this would be a viable path? I think that in a few
>> days I can offer something usable for discussion.
>
> Adding Timothy to CC. His WIP conditional preamble branch looks suitable
> to add the proposed functionality.

Great!

> What will happen if LuaTeX is not installed on the system?

Yes, there should be some kind of warning. Also it's not just LuaTeX,
but certain packages for fonts and multilingual support. The problem is
that the different versions of TeX Live packaged by the distros
usually name these packages differently. This is another added problem...
Arch or Gentoo offer a more vanilla TeX Live.
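
For the warning itself, a minimal sketch could look like the following. The function names are hypothetical, and the check only looks for a `lualatex' binary with `executable-find'; it says nothing about whether babel, fontspec or the fonts are installed, which is the harder part described above.

```elisp
(defun my/lualatex-available-p ()
  "Return t if a `lualatex' executable is found in `exec-path'."
  (and (executable-find "lualatex") t))

(defun my/warn-unless-lualatex ()
  "Display a warning when `lualatex' cannot be found."
  (unless (my/lualatex-available-p)
    (display-warning
     'org-export
     "lualatex not found; fonts for non-Latin scripts will not be set up")))
```

Such a check could run once when the feature is enabled, rather than on every export.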

> Also, just to double check, is LuaTeX fully compatible to LaTeX? That
> is, if we have an existing org file using LaTeX-specific commands and
> packages, will it work with LuaTeX?

Yes, it is fully compatible, except that LuaLaTeX does not need to load
the fontenc or inputenc packages. LuaTeX is intended to be the natural
replacement for pdfTeX. The latest edition of The LaTeX Companion is
already very focused on LuaTeX. And 90% of the new LaTeX packages that
are uploaded to CTAN only work in LuaLaTeX. One of the essential
advantages of LuaTeX is that TeX now (finally!) has a simple scripting
language. With a little Lua you can achieve very low level things in TeX
that were horribly complicated in 'pure TeX'.

-- 
Juan Manuel Macías

https://juanmanuelmacias.com

https://lunotipia.juanmanuelmacias.com

https://gnutas.juanmanuelmacias.com



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-08-31 11:42   ` Juan Manuel Macías
@ 2023-09-01  9:18     ` Ihor Radchenko
  2023-09-02 21:39       ` Juan Manuel Macías
  0 siblings, 1 reply; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-01  9:18 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: Timothy, orgmode

Juan Manuel Macías <maciaschain@posteo.net> writes:

> ...In any
> case, the fonts issue is the most delicate part. What default fonts to
> add to the list? Here the user's taste and preferences come into play.

Commonly available libre fonts look like a good candidate.

> It must also be taken into account that if one has typographical
> scruples, not all fonts match each other. For design purposes, I mean.
> The Computer Modern, which is a modern style font (similar to the Didot
> or Bodoni), does not usually pair well with (for example) a Garamond,
> which is in the Renaissance style. That's why I think the best solution
> would be to offer a basic defcustom, based on the purely utilitarian,
> and let the user modify or extend it according to their taste,
> preferences or convenience.

+1.

> Another thing to keep in mind is the following. Offering basic
> readability based on the unicode scripts means that we rely on scripts
> and not languages. For example, the Cyrillic script covers several
> languages, as you well know: Russian, Bulgarian, etc. The Latin script
> is used for languages as diverse as English or Vietnamese. The choice of
> font based on the script is a low-level LuaTeX functionality, that is,
> it does not add features specific to each language, such as hyphenation
> patterns. This means that long texts in (for example) Cyrillic or Greek
> are not justified well because LaTeX does not know how to hyphenate them:
> ...
> There is another possibility that I am working on in parallel: relying
> on languages instead of scripts. This would add both readability and
> support for each particular language. There could be two options for the
> user: a basic one (the low level one, based on scripts: ensures
> readability but the document may not look pretty) and an advanced one,
> based on language support. Something like this occurred to me:
>
> #+LaTeX_Header: % !enable-fonts-for ancientgreek russian:Old Standard arabic

We already have #+language keyword and
`org-latex-guess-babel-language'/`org-latex-guess-polyglossia-language'.
May as well have default fonts for a given language.

As for multiple languages, do we actually support this?

>> What will happen if LuaTeX is not installed on the system?
>
> Yes, there should be some kind of warning. Also it's not just LuaTeX,
> but certain packages for fonts and multilingual support. The problem is
> that the different versions of TeX live cooked in the distros 
> usually name these packages differently. This is another added problem...
> Arch or Gentoo offer a more vanilla TeX live.

We might use `org-latex-known-warnings'.

>> Also, just to double check, is LuaTeX fully compatible to LaTeX? That
>> is, if we have an existing org file using LaTeX-specific commands and
>> packages, will it work with LuaTeX?
>
> Yes, it is fully compatible, except that LuaLaTeX does not need to load
> the fontenc or inputenc packages. LuaTeX is intended to be the natural
> replacement for pdfTeX. The latest edition of The LaTeX Companion is
> already very focused on LuaTeX. And 90% of the new LaTeX packages that
> are uploaded to CTAN only work in LuaLaTeX. One of the essential
> advantages of LuaTeX is that TeX now (finally!) has a simple scripting
> language. With a little Lua you can achieve very low level things in TeX
> that were horribly complicated in 'pure TeX'.

Then, we might even consider LuaTeX as the new default for
`org-latex-compiler'.
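
Until the default changes, a user can already opt in. A minimal example, assuming a TeX distribution that provides lualatex; `org-latex-compiler' is the existing option, and the `#+LATEX_COMPILER:' keyword is its per-file counterpart:

```elisp
;; In the init file: use LuaLaTeX for all LaTeX exports.
(setq org-latex-compiler "lualatex")

;; Or per document, at the top of the Org file:
;;   #+LATEX_COMPILER: lualatex
```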

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-01  9:18     ` Ihor Radchenko
@ 2023-09-02 21:39       ` Juan Manuel Macías
  2023-09-03  7:22         ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-09-02 21:39 UTC (permalink / raw)
  To: orgmode; +Cc: Ihor Radchenko, Timothy

[-- Attachment #1: Type: text/plain, Size: 1551 bytes --]

Finally I can upload some usable code here, in this case to be able to
load and manage fonts for languages with non-Latin scripts, through
babel and fontspec (in LuaLaTeX). It is an attempt to simplify from Org
the multiform syntax of babel + fontspec. Of course, it is more limited,
but for regular use I think it may be enough.

Since this code is mostly a proof of concept and the names of many
things (and the things themselves) are still tentative, I thought it
would be more useful to attach it in an *.el file, rather than a regular
patch. Loading that file everything should work fine. I also attach an
org document with some examples of use. In any case, there are more
explanations inside the .el file.

One of the big problems I have encountered when trying to create a
"(LaTeX) Babel interface in Org" is the *horrible* multiplicity that
Babel has for language names. That is the reason for the :babel-alt
property in 'org-latex-language-alist', which collects the names that
babel supports for \babelprovide, which are not always the same as the
'classic' babel syntax.

Finally, I find this way more useful (that is, loading fonts with
language support), instead of a fallback font system based only on the
Unicode scripts. It is less 'automatic', but more precise, and it also
does not require much 'specialized' intervention on the part of the
user.

Best regards,

-- 
Juan Manuel Macías

https://juanmanuelmacias.com

https://lunotipia.juanmanuelmacias.com

https://gnutas.juanmanuelmacias.com


[-- Attachment #2: test-lang.org --]
[-- Type: application/vnd.lotus-organizer, Size: 2866 bytes --]

[-- Attachment #3: unicode-font-support.el --]
[-- Type: text/plain, Size: 8721 bytes --]

;; -*- lexical-binding: t; -*-

;; A proof of concept for Unicode font support in LaTeX export, using
;; babel and fontspec, with luatex as the default compiler.

;; Use example:

;; It is not necessary to load languages with non-Latin alphabet in babel options:
;; #+LaTeX_Header: \usepackage[AUTO]{babel}

;; Languages and fonts (there may be multiple lines):

;; #+LaTeX_Header: % !enable-fonts-for ancientgreek:Linux Libertine O(Scale=MatchLowercase)
;; #+LaTeX_Header: % !enable-fonts-for russian:FreeSerif(Numbers=Lowercase,Color=blue) :: arabic

;; Explanation:

;; - lang = enable default font for lang
;; - lang:font = enable font for lang in current document
;; - lang:font(options) = enable font for lang in this document with options
;; - :: = separator


;; code

;; This is supposed to be a defcustom.

(defvar org-latex-uc-fonts-support t
  "Non-nil means add Unicode font support to the LaTeX preamble.")

;; A mini version of `org-latex-language-alist', for this proof of
;; concept. Babel uses various names for languages. The ones that
;; interest us here are those collected in `:babel-alt', which is
;; always a list. The names sometimes match the `classic' babel name
;; and other times they don't. And in the case of "el-polyton" there
;; are two possible names. For a list of these names see:
;; [[https://CTAN/macros/latex/required/babel/base/babel.pdf]],
;; p. 22.

(defconst org-latex-language-alist
  '(("en"  :babel "american" :babel-alt ("english-unitedstates") :polyglossia "english" :polyglossia-variant "usmax" :lang-name "English" :script "latin" :code "latn")
    ("ar" :babel "arabic" :babel-alt ("arabic") :polyglossia "arabic" :lang-name "Arabic" :script "arabic" :code "arab")
    ("el"  :babel "greek" :babel-alt ("greek") :polyglossia "greek" :lang-name "Greek" :script "greek" :code "grek")
    ("el-polyton" :babel "polutonikogreek" :babel-alt ("ancientgreek" "polytonicgreek") :polyglossia "greek" :polyglossia-variant "polytonic" :lang-name "Polytonic Greek" :script "greek" :code "grek")
    ("ru"  :babel "russian" :babel-alt ("russian") :polyglossia "russian" :lang-name "Russian" :script "cyrillic" :code "cyrl"))
  "TODO")

;; This is supposed to be a defcustom for the main fonts. `default'
;; means 'use the main default fonts'. Otherwise, the value must be
;; a plist. Valid props. are:

;; - :main = roman font
;; - :sans = sans font
;; - :mono = mono font
;; - :math = math font (\setmathfont is provided by unicode-math, not fontspec)
;; - :...-options = font options

;; For the font options and the fontspec package syntax, see
;; [[https://CTAN/macros/unicodetex/latex/fontspec/fontspec.pdf]]

(defvar org-latex-uc-fonts-support-default-main-fonts
  '(:main "FreeSerif" :mono "inconsolatan" :mono-options "Scale=0.95")
  "Main fonts as a plist, or the symbol `default'.")

;; This is supposed to be a defcustom. Each element has the structure:
;; script - font - (optional) font options

(defvar org-latex-uc-fonts-support-default-scripts-fonts
  '(("greek" "Linux Libertine")
    ("cyrillic" "Old Standard")
    ("arabic" "FreeSerif"))
  "Default font for each script: (SCRIPT FONT [FONT-OPTIONS]).")

;; Get main fonts (declared in
;; `org-latex-uc-fonts-support-default-main-fonts')

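(defun org-latex-uc-fonts-support-get-main-fonts (plist prop)
  "Return the fontspec command that sets the PROP font from PLIST.
PROP is one of :main, :sans, :mono or :math.  Return nil when PLIST
has no PROP."
  (when (plist-member plist prop)
    (let* ((value (plist-get plist prop))
           (prop-name
            (replace-regexp-in-string ":" "" (symbol-name prop)))
           (options (plist-get
                     plist
                     (intern (format ":%s-options" prop-name)))))
      ;; Backslashes are doubled because the result ends up in the
      ;; replacement argument of `replace-regexp-in-string'.
      (format "\\\\set%sfont{%s}[%s]"
              prop-name value (or options "")))))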

;; get non latin fonts explicitly added

(defun org-latex-uc-fonts-support-get-fonts-other-languages (header)
  "Return babel font declarations for each `!enable-fonts-for' line in HEADER."
  (let (lines)
    (with-temp-buffer
      (insert header)
      (goto-char (point-min))
      (while (re-search-forward "%[ \t]+!enable-fonts-for[ \t]+\\(.+\\)" nil t)
        ;; LINES is a lexical variable, so use `push' rather than
        ;; `add-to-list', which needs a dynamically bound symbol.
        (push (match-string 1) lines)))
    (mapconcat
     #'org-latex-uc-fonts-support-format-font-for-language
     ;; One trimmed spec per "::"-separated field, in document order.
     (mapcan (lambda (line) (split-string line "::" t "[ \t]+"))
             (nreverse lines))
     "\n\n")))

;; format each lang/font

(defun org-latex-uc-fonts-support-format-font-for-language (lang)
  (let* ((regexp "\\([^:]+\\):*\\([^()]*\\)(*\\([^()]*\\))*")
	 (lang-name (when (string-match regexp lang)
		      (match-string 1 lang)))
	 (lang-explicit-font (when (string-match regexp lang)
			       (match-string 2 lang)))
	 (lang-explicit-font-opts (when (string-match regexp lang)
				    (match-string 3 lang)))
	 (lang-alias (let ((candidato))
		       (mapc (lambda (x)
			       (when (member :babel-alt x)
				 (let* ((plist (cdr x))
					(babel-alt (plist-get plist :babel-alt)))
				   (when (member lang-name babel-alt)
				     (setq candidato (car x))))))
			     org-latex-language-alist)
		       candidato))
	 (plist (cdr (assoc lang-alias org-latex-language-alist)))
	 (script (plist-get plist :script))
	 (default-script-font (assoc script org-latex-uc-fonts-support-default-scripts-fonts))
	 (default-font (nth 1 default-script-font))
	 (default-font-options (nth 2 default-script-font))
	 (default-font-options? (if default-font-options
				    default-font-options
				  "")))
    (format
     "\\\\babelprovide[onchar=ids fonts]{%s}\n
    \\\\babelfont[%s]{rm}[%s]{%s}\n"
     lang-name
     lang-name
     (if (not (equal lang-explicit-font-opts "")) lang-explicit-font-opts default-font-options?)
     (if (not (equal lang-explicit-font "")) lang-explicit-font default-font))))

;; make preamble definitions. This is supposed to be part of
;; `org-latex-guess-babel-language', as in the modified version below

(defun org-latex-uc-fonts-support-make-preamble (header)
  "Return the font setup to insert after the babel line of HEADER."
  (let ((main-fonts
         ;; Compare the value of the variable, not the quoted symbol,
         ;; otherwise the test against `default' can never succeed.
         (unless (eq org-latex-uc-fonts-support-default-main-fonts 'default)
           (mapconcat
            #'identity
            (delq nil
                  (mapcar
                   (lambda (prop)
                     (org-latex-uc-fonts-support-get-main-fonts
                      org-latex-uc-fonts-support-default-main-fonts
                      prop))
                   '(:main :sans :mono :math)))
            "\n")))
        (other-fonts-per-language
         (org-latex-uc-fonts-support-get-fonts-other-languages header)))
    ;; `concat' treats nil as an empty sequence.
    (concat "\n\n" main-fonts "\n\n" other-fonts-per-language)))

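(defun org-latex-guess-babel-language (header info)
  "Modified version for this proof of concept.
Return HEADER with the babel language for INFO filled in and, when
`org-latex-uc-fonts-support' is non-nil, the font setup added."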
  (let* ((language-code (plist-get info :language))
	 (plist (cdr
		 (assoc language-code org-latex-language-alist)))
	 (language (plist-get plist :babel))
	 (language-ini-only (plist-get plist :babel-ini-only))
	 ;; If no language is set, or Babel package is not loaded, or
	 ;; LANGUAGE keyword value is a language served by Babel
	 ;; exclusively through ini files, return HEADER as-is.
	 (header (if (or language-ini-only
			 (not (stringp language-code))
			 (not (string-match "\\\\usepackage\\[\\(.*\\)\\]{babel}" header)))
		     header
		   (let ((options (save-match-data
				    (org-split-string (match-string 1 header) ",[ \t]*"))))
		     ;; If LANGUAGE is already loaded, return header
		     ;; without AUTO.  Otherwise, replace AUTO with language or
		     ;; append language if AUTO is not present.  Languages that are
		     ;; served in Babel exclusively through ini files are not added
		     ;; to the babel argument, and must be loaded using
		     ;; `\babelprovide'.
		     (replace-match
		      (mapconcat (lambda (option) (if (equal "AUTO" option) language option))
				 (cond ((member language options) (delete "AUTO" options))
				       ((member "AUTO" options) options)
				       (t (append options (list language))))
				 ", ")
		      t nil header 1)))))
    ;;; addition:
    (when org-latex-uc-fonts-support
      (setq header (let ((form (org-latex-uc-fonts-support-make-preamble header)))
		     (replace-regexp-in-string
		      "\\(\\\\usepackage\\[?.*\\]?{babel}\\)"
		      (format "\n\\\\usepackage{fontspec}\n\n\\1\n%s" form)
		      header))))
    ;;;
    ;; If `\babelprovide[args]{AUTO}' is present, AUTO is
    ;; replaced by LANGUAGE.
    (if (not (string-match "\\\\babelprovide\\[.*\\]{\\(.+\\)}" header))
	header
      (let ((prov (match-string 1 header)))
	(if (equal "AUTO" prov)
	    (replace-regexp-in-string (format
				       "\\(\\\\babelprovide\\[.*\\]\\)\\({\\)%s}" prov)
				      (format "\\1\\2%s}"
					      (or language language-ini-only))
				      header t)
	  header)))))


* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-02 21:39       ` Juan Manuel Macías
@ 2023-09-03  7:22         ` Ihor Radchenko
  2023-09-03 11:05           ` Juan Manuel Macías
  0 siblings, 1 reply; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-03  7:22 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: orgmode, Timothy

Juan Manuel Macías <maciaschain@posteo.net> writes:

> Finally I can upload some usable code here, in this case to be able to
> load and manage fonts for languages with non-Latin scripts, through
> babel and fontspec (in LuaLaTeX). It is an attempt to simplify from Org
> the multiform syntax of babel + fontspec. Of course, it is more limited,
> but for regular use I think it may be enough.

I can see that you did not add defaults for Chinese, which is one of the
problematic scripts for LaTeX. Can you add it?

> ;; #+LaTeX_Header: % !enable-fonts-for ancientgreek:Linux Libertine O(Scale=MatchLowercase)
> ;; #+LaTeX_Header: % !enable-fonts-for russian:FreeSerif(Numbers=Lowercase,Color=blue) :: arabic

I do not like this approach.
Would be more consistent to allow multiple languages in #+language +
#+LATEX_FONT keyword to optionally specify per-language font:

#+LANGUAGE: <main language> <other languages...>
#+LATEX_FONT[lang]: font

#+language: ancientgreek russian arabic
#+latex_font[ancientgreek]: "Linux Libertine O" Scale=MatchLowercase
#+latex_font[russian]: "FreeSerif" Numbers=Lowercase,Color=blue

Also, I think that it may still make sense to have some kind of fallback
font if the specified fonts are not sufficient. For example, when using
emoji symbols, which do not correspond to any language.
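
To make the proposed value syntax concrete, here is a minimal sketch of how the part after the colon (a quoted font name followed by optional fontspec options) might be parsed. The `#+latex_font' keyword does not exist in Org; both it and the function name `my/parse-latex-font-value' are only illustrations of the proposal.

```elisp
(defun my/parse-latex-font-value (value)
  "Parse VALUE, a quoted font name optionally followed by font options.
Return (FONT . OPTIONS), with OPTIONS nil when none are given, or nil
when VALUE does not start with a quoted font name."
  (when (string-match
         "\\`[ \t]*\"\\([^\"]+\\)\"[ \t]*\\(.*?\\)[ \t]*\\'" value)
    (let ((font (match-string 1 value))
          (options (match-string 2 value)))
      (cons font (unless (string= options "") options)))))
```

Quoting the font name keeps names with spaces, such as "Linux Libertine O", unambiguous without any extra separator syntax.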

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-03  7:22         ` Ihor Radchenko
@ 2023-09-03 11:05           ` Juan Manuel Macías
  2023-09-04  8:09             ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-09-03 11:05 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: orgmode, Timothy

Thanks for your comments!

Ihor Radchenko writes:

> Juan Manuel Macías <maciaschain@posteo.net> writes:
>
>> Finally I can upload some usable code here, in this case to be able to
>> load and manage fonts for languages with non-Latin scripts, through
>> babel and fontspec (in LuaLaTeX). It is an attempt to simplify from Org
>> the multiform syntax of babel + fontspec. Of course, it is more limited,
>> but for regular use I think it may be enough.
>
> I can see that you did not add defaults for Chinese, which is one of the
> problematic scripts for LaTeX. Can you add it?

In that first proof of concept I only put a few scripts, less
problematic, simply to show the functionality. In CJK languages things
are a little more complicated, but it can be done too. The idea is to
cover all scripts. In the next code I submit, when I redo the current
one, I will try to introduce the case of CJK scripts.

>> ;; #+LaTeX_Header: % !enable-fonts-for ancientgreek:Linux Libertine O(Scale=MatchLowercase)
>> ;; #+LaTeX_Header: % !enable-fonts-for russian:FreeSerif(Numbers=Lowercase,Color=blue) :: arabic
>
> I do not like this approach.

I'm not a big fan of doing it like that either. I chose this option
because I didn't have to define a new keyword and it was less "intrusive"
with the actual code. But on the other hand it adds a new syntax. So
I'll drop it in favor of the idea you mention below.

> Would be more consistent to allow multiple languages in #+language +
> #+LATEX_FONT keyword to optionally specify per-language font:

> #+language: ancientgreek russian arabic

Of course, this syntax would be the most appropriate and consistent
within Org. The problem is LaTeX, specifically babel, and that certain
inconsistencies would be created with the rest of the backends. At first
some pitfalls come to mind:

- The keyword #+language accepts for now only language codes (es, en,
  el, ar, ru, etc.). Consistency with other backends should
  be maintained in this regard: ancientgreek is not a valid language
  code, but a name that only babel understands. If we put something
  like (a valid language code):

  #+language: el-polyton

  this could be translated in babel as polutonikogreek (in the classic
  syntax, that is, the languages that are loaded in the options of
  \usepackage[options]{babel}), or, in the new syntax, ancientgreek and
  polytonicgreek, which are actually two different languages: the first
  is ancient polytonic Greek and the second modern polytonic Greek. To
  add more confusion to the matter, in classic babel syntax
  greek.ancient and greek.polytonic are also supported. But none of
  this can be deduced from el-polyton alone without breaking
  consistency with the other backends.

- Added to this is that Babel has two ways to load languages: the
  classic syntax and the \babelprovide command, which is the one we are
  interested in here for languages with non-Latin scripts, because the
  onchar=ids fonts property must be added here. And what happens if the
  user has already defined several languages with babel, using the
  current procedure: \usepackage[french, english, AUTO]{babel}?

Therefore, the least complicated thing, in my opinion, is to leave the
syntax of the keyword #+language as it is. It is not necessary for the
user to explicitly define secondary non-latin languages. The idea is
that Org is responsible for generating the necessary babel code when
the user simply gives a command like "enable fonts for language X". What we are
talking about here is ensuring readability using a series of fonts that
LaTeX does not load by default, not even LuaLaTeX. And, after all, Org
is monolingual: it does not have multilingual support at the moment;
that is, there is nothing in Org to switch languages in the middle of
the document. What happens is that here we take advantage of the
functionality that Babel has to automatically apply a font for a
non-Latin language/script, also loading its properties (hyphen rules,
captions, etc.).

A new keyword #+latex_language could be created, which would understand
the babel names, but I think it is unnecessary and would add more
complexity. As I said before, defining the necessary fonts would be
enough, since my idea in this is a basic practicality to ensure the
readability of the documents. And anyone looking for more advanced
functions would have to enter LaTeX code explicitly.

> #+latex_font[ancientgreek]: "Linux Libertine O" Scale=MatchLowercase
>
> #+latex_font[russian]: "FreeSerif" Numbers=Lowercase,Color=blue

I like this idea, but with the exception that in the two examples you
give the user is declaring two fonts for both languages. In my example
there was also Arabic, where the default font for the Arabic script is
used. Note that each script would have default fonts, which the user can
change or not change in their document. A user could simply put
something like "enable the default fonts for ancientgreek, russian,
malayalam, georgian, chinese". And nothing more. Or choose some other
font with or without options for a specific lang.

Could be:

#+latex_font: ancientgreek, russian, malayalam, sanskrit-devanagari

beside:

#+latex_font[arabic]: "FreeSerif" Numbers=Lowercase,Color=blue

This last syntax would also be valid to modify the main default fonts:

#+latex_font[main]: "FreeSerif" Numbers=Lowercase
#+latex_font[sans]: "some font"
#+latex_font[mono]: "some font"
#+latex_font[math]: "some font"

A practical use case. Suppose a user has a document in Spanish, which
includes passages in Greek and Russian. It would be enough to use the
Old Standard font (included in TeX live) for the entire document,
ensuring consistency:

#+latex_header: \usepackage[AUTO]{babel}
#+language:es
#+latex_font[main,greek,russian]: Old Standard

> Also, I think that it may still make sense to have some kind of fallback
> font if the specified fonts are not sufficient. For example, when using
> emoji symbols, which do not correspond to any language.

Yes I agree. That could also be included in the generated preamble.

--
Juan Manuel Macías

https://juanmanuelmacias.com

https://lunotipia.juanmanuelmacias.com

https://gnutas.juanmanuelmacias.com



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-03 11:05           ` Juan Manuel Macías
@ 2023-09-04  8:09             ` Ihor Radchenko
  2023-09-04 22:22               ` Juan Manuel Macías
  0 siblings, 1 reply; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-04  8:09 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: orgmode, Timothy

Juan Manuel Macías <maciaschain@posteo.net> writes:

>> #+language: ancientgreek russian arabic
>
> Of course, this syntax would be the most appropriate and consistent
> within Org. The problem is LaTeX, specifically babel, and that certain
> inconsistencies would be created with the rest of the backends. At first
> some pitfalls come to mind:
>
> - The keyword #+language accepts for now only language codes (es, en,
>   el, ar, ru, etc.). Consistency with other backends should
>   be maintained in this regard: ancientgreek is not a valid language
>   code, but a name that only babel understands. If we put something
>   like (a valid language code):
>
>   #+language: el-polyton
>
>   this could be translated in babel as polutonikogreek (in the classic
>   syntax, that is, the languages that are loaded in the options of
>   \usepackage[options]{babel}), or, in the new syntax, ancientgreek and
>   polytonicgreek, which are actually two different languages: the first
>   is ancient polytonic Greek and the second modern polytonic Greek. To
>   add more confusion to the matter, in classical babel syntax
>   greek.ancient and greek.polytonic are also supported. But neither of
>   these things can be deduced by simply putting el-polyton, unless
>   breaking the consistency with the other backends.

I am now working on unifying the Org translation system as discussed in
https://orgmode.org/list/87o7iw8yem.fsf@bzg.fr
As a part of the effort, I plan to introduce a new constant that will
unify language abbreviations across Org and also associate them with
more human-readable names.

(defconst org-language-abbrevs
  '(("am" . "Amharic")
    ("ar" . "Arabic")
    ("ast" . "Asturian")
    ("bg" . "Bulgarian")
    ("bn" . "Bengali")
    ...))

The idea is to allow
#+language: Austrian German, Greek
as a valid specifier, in addition to
#+language: de-at, el

Then, across Org, we will make use of the standardized language
abbreviations.
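
A minimal sketch of the lookup this would involve, assuming a small
excerpt of such a constant. The names `my-language-abbrevs' and
`my-normalize-language' and the three entries are hypothetical, used
only for illustration; they are not existing Org API:

```elisp
(require 'seq)

;; Hypothetical excerpt: the real constant is only a proposal so far.
(defconst my-language-abbrevs
  '(("de-at" . "Austrian German")
    ("el" . "Greek")
    ("ru" . "Russian"))
  "Alist mapping language abbreviations to human-readable names.")

(defun my-normalize-language (value)
  "Return the abbreviations for the languages listed in VALUE.
VALUE is the comma-separated string given to #+language.  Each item
may be an abbreviation or a human-readable name; matching ignores
case.  Items that are not recognized are returned unchanged."
  (mapcar
   (lambda (item)
     (let ((key (downcase item)))
       (cond
        ;; ITEM is already an abbreviation.
        ((assoc key my-language-abbrevs) key)
        ;; ITEM is a human-readable name: return its abbreviation.
        ((car (seq-find (lambda (pair) (string= (downcase (cdr pair)) key))
                        my-language-abbrevs)))
        ;; Unknown item: leave it for the caller to deal with.
        (t item))))
   ;; Split on commas, dropping empty items and surrounding blanks.
   (split-string value "," t "[ \t]+")))
```

With this, both spellings of the keyword value would resolve to the
same list of abbreviations, which the backends could then consume.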

> - Added to this is that Babel has two ways to load languages: the
>   classic syntax and the \babelprovide command, which is the one we are
>   interested in here for languages with non-Latin scripts, because the
>   onchar=ids fonts property must be added here. And what happens if the
>   user has already defined several languages with babel, using the
>   current procedure: \usepackage[french, english, AUTO]{babel}?

For LaTeX specifically, `org-latex-language-alist' will be reused to
map whatever is allowed in the #+language keyword to its name in
babel/polyglossia.

Does it make sense?

> Therefore, the least complicated thing, in my opinion, is to leave the
> syntax of the keyword #+language as it is. It is not necessary for the
> user to explicitly define secondary non-latin languages. The idea is
> that Org is responsible for generating the necessary babel code by
> simply giving a command like enable font for X language. What we are
> talking about here is ensuring readability using a series of fonts that
> LaTeX does not load by default, not even LuaLaTeX. And, after all, Org
> is monolingual: it does not have multilingual support at the moment;
> that is, there is nothing in Org to switch languages in the middle of
> the document. What happens is that here we take advantage of the
> functionality that Babel has to automatically apply a font for a
> non-Latin language/script, also loading its properties (hyphen rules,
> captions, etc.).
>
> A new keyword #+latex_language could be created, which would understand
> the babel names, but I think it is unnecessary and would add more
> complexity. As I said before, defining the necessary fonts would be
> enough, since my idea in this is a basic practicality to ensure the
> readability of the documents. And anyone looking for more advanced
> functions would have to enter LaTeX code explicitly.

I think that we should move towards multi-language support.
Such support would practically simplify WORG and orgmode.org translation
process, and may also be used as a basis to allow translating the
Org manual.

My rough idea is to allow specifying the language as an affiliated
keyword and, in the future, allow selective export to a given target
language.

Multi-language documents are another potential target to support.

>> #+latex_font[ancientgreek]: "Linux Libertine O" Scale=MatchLowercase
>>
>> #+latex_font[russian]: "FreeSerif" Numbers=Lowercase,Color=blue
>
> I like this idea, but with the exception that in the two examples you
> give the user is declaring two fonts for both languages. In my example
> there was also Arabic, where the default font for the Arabic script is
> used.

My idea was that

#+language: ancientgreek russian arabic

implies "use default font for arabic", unless #+latex_font is specified.

> #+latex_font[arabic]: "FreeSerif" Numbers=Lowercase,Color=blue
>
> This last syntax would also be valid to modify the main default fonts:
>
> #+latex_font[main]: "FreeSerif" Numbers=Lowercase
> #+latex_font[sans]: "some font"
> #+latex_font[mono]: "some font"
> #+latex_font[math]: "some font"
>
> A practical use case. Suppose a user has a document in Spanish, which
> includes passages in Greek and Russian. It would be enough to use the
> Old Standard font (included in TeX live) for the entire document,
> ensuring consistency:
>
> #+latex_header: \usepackage[AUTO]{babel}
> #+language:es
> #+latex_font[main,greek,russian]: Old Standard

Looks reasonable.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-04  8:09             ` Ihor Radchenko
@ 2023-09-04 22:22               ` Juan Manuel Macías
  2023-09-05 10:44                 ` Ihor Radchenko
  2023-09-05 16:42                 ` Max Nikulin
  0 siblings, 2 replies; 21+ messages in thread
From: Juan Manuel Macías @ 2023-09-04 22:22 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: orgmode, Timothy

Ihor Radchenko writes:

> Juan Manuel Macías <maciaschain@posteo.net> writes:
>
>>> #+language: ancientgreek russian arabic
>>
>> Of course, this syntax would be the most appropriate and consistent
>> within Org. The problem is LaTeX, specifically babel, and that certain
>> inconsistencies would be created with the rest of the backends. At first
>> some pitfalls come to mind:
>>
>> - The keyword #+language accepts for now only language codes (es, en,
>>   el, ar, ru, etc.). Consistency with other backends should
>>   be maintained in this regard: ancientgreek is not a valid language
>>   code, but a name that only babel understands. If we put something
>>   like (a valid language code):
>>
>>   #+language: el-polyton
>>
>>   this could be translated in babel as polutonikogreek (in the classic
>>   syntax, that is, the languages that are loaded in the options of
>>   \usepackage[options]{babel}), or, in the new syntax, ancientgreek and
>>   polytonicgreek, which are actually two different languages: the first
>>   is ancient polytonic Greek and the second modern polytonic Greek. To
>>   add more confusion to the matter, in classical babel syntax
>>   greek.ancient and greek.polytonic are also supported. But neither of
>>   these things can be deduced by simply putting el-polyton, unless
>>   breaking the consistency with the other backends.
>
> I am now working on unifying Org translation system as discussed in
> https://orgmode.org/list/87o7iw8yem.fsf@bzg.fr
> As a part of the effort, I plan to introduce a new constant that will
> unify language abbreviations across Org and also associate them with
> more human-readable names.
>
> (defconst org-language-abbrevs
>   '(("am" . "Amharic")
>     ("ar" . "Arabic")
>     ("ast" . "Asturian")
>     ("bg" . "Bulgarian")
>     ("bn" . "Bengali")
>     ...))
>
> The idea is to allow
>
> #+language: Austrian German, Greek
> as a valid specifier, in addition to
>
> #+language: de-at, el
>
> Then, across Org, we will make use of the standardized language
> abbreviations.

Great! I think this is great news. Yes, I agree with what you say below. I
think Org should move towards multilingual support that is 100% native
to Org. That is, Org would have its own "selectlanguage" mechanism, able
to delimit text segments in other languages and control them,
both within Org and when exporting to the different backends. That
scenario seems very desirable to me, and I would like to contribute my
help to the best of my ability (and time).

In LaTeX, as I mentioned, things are complicated. There is Babel and
Polyglossia, and there is LuaTeX and XeTeX. In addition, there is also
pdfTeX, which is still the default engine and (to be honest) is the
engine used by a high percentage of LaTeX users, although perhaps things
will change soon in favor of LuaTeX. Both babel and polyglossia
could be supported, but that means more work, more code, and more
complications. And we cannot be sure that polyglossia will continue to be
maintained. After all, babel is the official LaTeX package for language
support, and polyglossia appeared at a time when babel had no support
for the new unicode engines. Now Babel supports all of that and is much
more powerful, but its interface has also grown in complexity. There is
the problem of the double syntax for loading languages: the old one,
which loads traditional ldf files, and the modern one (\babelprovide),
which loads languages using ini files. It is more powerful, with more
options, but has added more verbosity to babel. I have taken advantage
of \babelprovide, specifically its onchar=ids fonts property, to
automatically apply fonts to non-Latin scripts.

>> I like this idea, but with the exception that in the two examples you
>> give the user is declaring two fonts for both languages. In my example
>> there was also Arabic, where the default font for the Arabic script is
>> used.
>
> My idea was that
>
> #+language: ancientgreek russian arabic
>
> implies "use default font for arabic", unless #+latex_font is specified.

This seems the most consistent to me for Org, but, as I mentioned in the
other email, I have some concerns. Currently, what we are talking about
is simply font support for non-Latin languages. If, in the
current state of things, #+language is allowed to accept a list of language
names, we may give the user a wrong impression. That is:
multilingual support that does not exist as such. It is more like font
support for non-Latin languages. And only in LaTeX, and specifically in
LuaLaTeX. Furthermore, the user could mix languages that in Babel are
loaded through ldf and others through ini files. For example, something
like this:

#+language: spanish, english, french, russian

in Babel it would be:

\usepackage[english,french,spanish]{babel}

and here we need \babelprovide for the font (and to load Russian via an
ini file):

\babelprovide[onchar=ids fonts, import]{russian}
\babelfont[russian]{rm}[options]{somefont}

Org would have to discern which name refers to a non-Latin language
(which wouldn't be complicated with the functionality you're working on)
and then apply the default font by adding a line with \babelprovide.
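
A rough sketch of that discrimination step, assuming the main language
uses the Latin script and comes first in the list. The constant
`my-non-latin-languages', the function `my-babel-preamble' and the font
alist are hypothetical, made up for this example only:

```elisp
(require 'seq)

;; Hypothetical list of babel names that need \babelprovide with onchar.
(defconst my-non-latin-languages '("russian" "greek" "arabic"))

(defun my-babel-preamble (languages fonts)
  "Return babel preamble code for LANGUAGES, a list of babel names.
The first element is the main language and is assumed to use the
Latin script.  FONTS is an alist of (LANGUAGE . FONT-NAME) for the
non-Latin languages; languages without an entry get no \\babelfont
line."
  (let ((latin (seq-remove (lambda (l) (member l my-non-latin-languages))
                           languages))
        (non-latin (seq-filter (lambda (l) (member l my-non-latin-languages))
                               languages)))
    (concat
     ;; Classic syntax: babel takes the last option as the main language.
     (format "\\usepackage[%s]{babel}\n"
             (mapconcat #'identity
                        (append (cdr latin) (list (car latin)))
                        ","))
     ;; One \babelprovide (plus optional \babelfont) per non-Latin language.
     (mapconcat
      (lambda (lang)
        (let ((font (cdr (assoc lang fonts))))
          (concat
           (format "\\babelprovide[import,onchar=ids fonts]{%s}\n" lang)
           (if font
               (format "\\babelfont[%s]{rm}{%s}\n" lang font)
             ""))))
      non-latin ""))))

;; Example call for the #+language line above:
(my-babel-preamble '("spanish" "english" "french" "russian")
                   '(("russian" . "FreeSerif")))
```

The sketch deliberately ignores pdfTeX and a non-Latin main language,
which are exactly the cases that complicate the picture.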

Of course, English, French and Spanish can also be loaded via ini files:

\babelprovide[main,import]{spanish}
\babelprovide[import]{french}
\babelprovide[import]{english}

Even babel also supports:

\usepackage[english,french,spanish,provide*=*]{babel}

but in that line we cannot put Russian with onchar, etc. And then there
is pdfTeX, where only the classic babel syntax is allowed, without any
"*provide".

In short, I find everything very confusing. I am not opposed to doing it
as you propose (in fact, it is the option I like the most, especially
when org is polyglot in the future), but I also want to warn of possible
complications.

Therefore, since we are, for now, dealing only with fonts for non-Latin languages, I
think it should be made clear that the keyword is about fonts (and about
LuaLaTeX). Maybe through two keywords:

#+lualatex_fonts_for: language(s)
#+lualatex_fonts[language(s)]: "font" options

?

I think it's ugly, but I can't think of anything else.

By the way, and as a side note, is it currently possible in Org to
define a keyword within :options-alist of the style #+foo[anything] or
would something like org-collect-keywords have to be modified?

-- 
Juan Manuel Macías

https://juanmanuelmacias.com

https://lunotipia.juanmanuelmacias.com

https://gnutas.juanmanuelmacias.com



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-04 22:22               ` Juan Manuel Macías
@ 2023-09-05 10:44                 ` Ihor Radchenko
  2023-09-20 14:03                   ` Juan Manuel Macías
  2023-09-05 16:42                 ` Max Nikulin
  1 sibling, 1 reply; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-05 10:44 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: orgmode, Timothy

Juan Manuel Macías <maciaschain@posteo.net> writes:

>> The idea is to allow
>>
>> #+language: Austrian German, Greek
>> as a valid specifier, in addition to
>>
>> #+language: de-at, el
>>
>> Then, across Org, we will make use of the standardized language
>> abbreviations.
>
> In LaTeX, as I mentioned, things are complicated. There is Babel and
> Polyglossia, and there is LuaTeX and XeTeX. In addition, there is also
> pdfTeX, which is still the default engine and (to be honest) is the
> engine used by a high percentage of LaTeX users. Although perhaps things
> will change soon to the detriment of LuaTeX. Both babel and polyglossia
> could be supported, but that means more work, more code, and more
> complications. And we are not sure that polyglossia is no longer
> maintained. After all, babel is the official LaTeX package for language
> support, and polyglossia appeared at a time when babel had no support
> for the new unicode engines. Now Babel supports all of that and is much
> more powerful, but its interface has also grown in complexity. There is
> the problem of the double syntax for loading languages: the old one,
> which loads traditional ldf files, and the modern one (\babelprovide),
> which loads languages using ini files. It is more powerful, with more
> options, but has added more verbosity to babel. I have taken advantage
> of \babelprovide, specifically its onchar=ids fonts property, to
> automatically apply fonts to non-Latin scripts.

> ...
> multilingual support that does not exist as such. It is more like font
> support for non-Latin languages. And only in LaTeX, and specifically in
> LuaLaTeX. Furthermore, the user could mix languages that in Babel are
> loaded through ldf and others through ini files. For example, something
> like this:
>
> #+language: spanish, english, french, russian
>
> in Babel it would be:
>
> \usepackage[english,french,spanish]{babel}
>
> and here we need babelprovide for the font (and load Russian via ini
> file):
>
> \babelprovide[onchar=ids fonts, import]{russian}
> \babelfont[russian]{rm}[options]{somefont}
>
> Org would have to discern which name refers to a non-Latin language
> (which wouldn't be complicated with the functionality you're working on)
> and then apply the default font by adding a line with \babelprovide.
>
> Of course, English, French and Spanish can also be loaded via ini files:
>
> \babelprovide[main,import]{spanish}
> \babelprovide[import]{french}
> \babelprovide[import]{english}
>
> Even babel also supports:
>
> \usepackage[english,french,spanish,provide*=*]{babel}
>
> but in that line we cannot put Russian with onchar, etc. And then there
> is pdfTeX, where only the classic babel syntax is allowed, without any
> "*provide".

Aren't we already handling this problem in `org-latex-make-preamble'?

>> My idea was that
>>
>> #+language: ancientgreek russian arabic
>>
>> implies "use default font for arabic", unless #+latex_font is specified.
>
> This seems the most consistent to me for Org, but, as I mentioned in the
> other email, I have some concerns. Currently, what we are talking about
> is simply font support for non-Latin languages. If it is allowed, in the
> current state of things, that #+language can accept a list of language
> names, we can give the user a wrong perception of reality. That is:

 <complications with full support not being possible in all the LaTeX flavors>

> In short, I find everything very confusing. I am not opposed to doing it
> as you propose (in fact, it is the option I like the most, especially
> when org is polyglot in the future), but I also want to warn of possible
> complications.
>
> Therefore, since we are, for now, with fonts for non-Latin languages, I
> think it should be made clear that the keyword is about fonts (and about
> LuaLaTeX). Maybe through two keywords:
>
> #+lualatex_fonts_for: language(s)
> #+lualatex_fonts[language(s)]: "font" options
>
> ?
>
> I think it's ugly, but I can't think of anything else.

Maybe just

#+lualatex_fonts[language(s)]: default

to force the default.

> By the way, and as a side note, is it currently possible in Org to
> define a keyword within :options-alist of the style #+foo[anything] or
> would something like org-collect-keywords have to be modified?

We will need to add things to `org-element-dual-keywords' and make sure
that the code expects the keyword value to be a list, as returned by the
parser. AFAIU, it should be enough.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-04 22:22               ` Juan Manuel Macías
  2023-09-05 10:44                 ` Ihor Radchenko
@ 2023-09-05 16:42                 ` Max Nikulin
  2023-09-05 18:33                   ` Juan Manuel Macías
  1 sibling, 1 reply; 21+ messages in thread
From: Max Nikulin @ 2023-09-05 16:42 UTC (permalink / raw)
  To: emacs-orgmode

On 05/09/2023 05:22, Juan Manuel Macías wrote:
> \usepackage[english,french,spanish,provide*=*]{babel}
> 
> but in that line we cannot put Russian with onchar, etc.

Cyrillic letters may appear not only in Russian just as French and 
Spanish use Latin script. So language detection based on symbol code 
points works only for distinct enough languages. Explicit markup may 
still be necessary to switch hyphenation rules, dash styles, etc.
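
To illustrate the limitation, here is a minimal sketch based on Emacs's
built-in `char-script-table'. The helper `my-scripts-in-string' is a
hypothetical name, not part of Org or Emacs:

```elisp
(defun my-scripts-in-string (string)
  "Return the list of scripts used by the characters of STRING.
Scripts are the symbols Emacs stores in `char-script-table', listed
in order of first appearance and without duplicates."
  (let (scripts)
    (dolist (ch (string-to-list string))
      (let ((script (aref char-script-table ch)))
        (when (and script (not (memq script scripts)))
          (push script scripts))))
    (nreverse scripts)))

;; A Russian and a Bulgarian word map to the same script, so the
;; code points alone cannot tell the two languages apart.
(equal (my-scripts-in-string "привет")    ; Russian
       (my-scripts-in-string "здравей"))  ; Bulgarian
```

Such a function can say which scripts need a font, but not which
hyphenation rules to apply.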

I have a couple of bookmarks for language detection libraries (not for 
Emacs), but I am unsure if they may work for texts containing fragments 
written in different languages.




* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-05 16:42                 ` Max Nikulin
@ 2023-09-05 18:33                   ` Juan Manuel Macías
  2023-09-06  9:29                     ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-09-05 18:33 UTC (permalink / raw)
  To: Max Nikulin; +Cc: emacs-orgmode, Ihor Radchenko

Max Nikulin writes:

> Cyrillic letters may appear not only in Russian just as French and
> Spanish use Latin script. So language detection based on symbol code
> points works only for distinct enough languages. Explicit markup may
> still be necessary to switch hyphenation rules, dash styles, etc.

True. Thanks for pointing it out. Indeed, \babelprovide with the
onchar=ids fonts option only makes sense when 1 foreign language = 1
script. For example, different variants of Greek cannot be combined
without an explicit switch.

And something like this wouldn't work either:

\babelprovide[import,onchar=ids fonts]{russian}
\babelprovide[import,onchar=ids fonts]{bulgarian}
\babelfont[russian]{rm}[Color=blue]{Old Standard}
\babelfont[bulgarian]{rm}[Color=green]{FreeSerif}

because bulgarian overwrites russian.

Well, another added complication :-(.

-- 
Juan Manuel Macías 

https://juanmanuelmacias.com

https://lunotipia.juanmanuelmacias.com

https://gnutas.juanmanuelmacias.com





* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-05 18:33                   ` Juan Manuel Macías
@ 2023-09-06  9:29                     ` Ihor Radchenko
  2023-09-06 14:58                       ` Juan Manuel Macías
  0 siblings, 1 reply; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-06  9:29 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: Max Nikulin, emacs-orgmode, Ihor Radchenko

Juan Manuel Macías <maciaschain@posteo.net> writes:

> True. Thanks for pointing it out. Indeed, \babelprovide with the
> onchar=ids fonts option only makes sense when 1 foreign language = 1
> script. For example, different variants of Greek cannot be combined
> without an explicit switch.
>
> And something like this wouldn't work either:
>
> \babelprovide[import,onchar=ids fonts]{russian}
> \babelprovide[import,onchar=ids fonts]{bulgarian}
> \babelfont[russian]{rm}[Color=blue]{Old Standard}
> \babelfont[bulgarian]{rm}[Color=green]{FreeSerif}
>
> because bulgarian overwrites russian.
>
> Well, another added complication :-(.

AFAIU, there is simply no way to solve this unless the user manually
indicates the intended language.

Do I understand correctly that onchar=ids will not break anything if text
is correctly marked with \selectlanguage{<lang>}?

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-06  9:29                     ` Ihor Radchenko
@ 2023-09-06 14:58                       ` Juan Manuel Macías
  2023-09-07 10:22                         ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-09-06 14:58 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Max Nikulin, emacs-orgmode, Ihor Radchenko

Ihor Radchenko writes:

> Do I understand correctly that onchar=ids will not break anything if text
> is correctly marked with \selectlanguage{<lang>}?

To load language features (hyphen rules, captions, etc.) there is no
problem. But to load a font associated with a language, the font of the
last declared language will always be loaded. Well, it is not a problem,
because if in a document there are texts in Russian and Bulgarian, for
example, the natural thing is that they go in the same font, since both
languages share the Cyrillic script. But there may be cases when the
author needs different fonts. In such a case, the user should not use
the onchar = etc property:

https://i.imgur.com/vmsCNkP.png

In any case (to organize myself mentally) I thought that it could be
done on two levels:

- Level 0: The fonts associated with each script are loaded (from a
  defcustom list) if luatex is the current engine. And low-level code is
  generated in Lua with the luaotfload.add_fallback function. That code
  can be in a Lua file or directly within the preamble, enclosed in the
  \directlua primitive (mode=harf means that HarfBuzz is used as the
  OpenType renderer):

   \directlua
   {luaotfload.add_fallback("orgfallback",
   {
   "oldstandard:mode=harf;script=grek;",
   "oldstandard:mode=harf;script=cyrl;",
   "freeserif:mode=harf;script=arab;",
   "freeserif:mode=harf;script=dev2;",
   etc., etc.
   })
   }

  And, to load the fallback fonts:

  \setmainfont{latinmodernroman}[RawFeature={fallback=orgfallback}]

 At this level per-language properties are not loaded, but at least
 readability is ensured. The user cannot modify the fonts associated
 with each script within the document, but can modify, of course, the
 defcustom.

- Level 1: The user can load language properties and associate fonts
  with each language using Babel's high-level code (via keywords in Org,
  as we have commented in previous messages). Here you can also modify
  the default fonts (also, as we mentioned before): main, mono, sans and
  math. If the language is declared with an asterisk (for example:
  russian*) the onchar=etc property will be included in the preamble,
  and it would not be necessary to switch to russian explicitly. It is
  assumed that in this scenario the only language with Cyrillic script
  would be Russian. For language switching, in the rest of the cases,
  some babel command would have to be used using @@latex:@@, special
  blocks, etc. When Org already has its own language switching
  mechanism, this would be used instead. Wdyt?
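
As a rough illustration of Level 0, the Lua code could be generated
from the defcustom along these lines. The variable `my-fallback-fonts',
the function `my-fallback-preamble' and the font choices are
hypothetical; only `luaotfload.add_fallback' and the fallback
RawFeature come from the example above:

```elisp
;; Hypothetical stand-in for the defcustom discussed above.
(defvar my-fallback-fonts
  '(("grek" . "oldstandard")
    ("cyrl" . "oldstandard")
    ("arab" . "freeserif"))
  "Alist of (OPENTYPE-SCRIPT-TAG . FONT-NAME) used as fallbacks.")

(defun my-fallback-preamble (name fonts main-font)
  "Return LuaLaTeX preamble code declaring the fallback list NAME.
FONTS is an alist like `my-fallback-fonts'.  MAIN-FONT is the main
roman font, to which the fallback list is attached.  The result
assumes that fontspec is loaded."
  (concat
   ;; Register the fallback list with luaotfload.
   "\\directlua{luaotfload.add_fallback(\"" name "\", {\n"
   (mapconcat (lambda (pair)
                (format "  \"%s:mode=harf;script=%s;\""
                        (cdr pair) (car pair)))
              fonts ",\n")
   "\n})}\n"
   ;; Attach the list to the main font.
   (format "\\setmainfont{%s}[RawFeature={fallback=%s}]\n"
           main-font name)))

;; Example call:
(my-fallback-preamble "orgfallback" my-fallback-fonts "latinmodernroman")
```

The returned string would simply be appended to the generated preamble
when LuaTeX is the current engine.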

-- 

Juan Manuel Macías 

https://juanmanuelmacias.com

https://lunotipia.juanmanuelmacias.com

https://gnutas.juanmanuelmacias.com





* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-06 14:58                       ` Juan Manuel Macías
@ 2023-09-07 10:22                         ` Ihor Radchenko
  2023-09-07 12:04                           ` Juan Manuel Macías
  0 siblings, 1 reply; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-07 10:22 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: Max Nikulin, emacs-orgmode, Ihor Radchenko

Juan Manuel Macías <maciaschain@posteo.net> writes:

>> Do I understand correctly that onchar=ids will not break anything if text
>> is correctly marked with \selectlanguage{<lang>}?
>
> To load language features (hyphen rules, captions, etc.) there is no
> problem. But to load a font associated with a language, the font of the
> last declared language will always be loaded.

May we explicitly set the needed font around language environments?

Something like

\setfontforrussian
\selectlanguage{russian}
....

\setfontforbulgarian
\selectlanguage{bulgarian}
....


> In any case (to organize myself mentally) I thought that it could be
> done on two levels:
>
> - Level 0: The fonts associated with each script are loaded (from a
>   defcustom list) if luatex is the current engine. And low-level code is
>   generated in Lua with the luaotfload.add_fallback function. That code
>   can be in a Lua file or directly within the preamble, enclosed in the
>   \directlua primitive (mode=harf means that HarfBuzz is used as otf
>   rendering):
> ...

Sounds reasonable.

> - Level 1: The user can load language properties and associate fonts
>   with each language using Babel's high-level code (via keywords in Org,
>   as we have commented in previous messages). Here you can also modify
>   the default fonts (also, as we mentioned before): main, mono, sans and
>   math. If the language is declared with an asterisk (for example:
>   russian*) the onchar=etc property will be included in the preamble,
>   and it would not be necessary to switch to russian explicitly. It is
>   assumed that in this scenario the only language with Cyrillic script
>   would be Russian. For language switching, in the rest of the cases,
>   some babel command would have to be used using @@latex:@@, special
>   blocks, etc. When Org already has its own language switching
>   mechanism, this would be used instead. Wdyt?

I am not sure I like the "russian*" idea. Could you explain a bit more
about how onchar works? What if the character sets of two languages
overlap without being exactly the same?

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-07 10:22                         ` Ihor Radchenko
@ 2023-09-07 12:04                           ` Juan Manuel Macías
  2023-09-08  7:42                             ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-09-07 12:04 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: Max Nikulin, emacs-orgmode

Ihor Radchenko writes:

> May we explicitly set the needed font around language environments?
>
> Something like
>
> \setfontforrussian
> \selectlanguage{russian}
> ....
>
> \setfontforbulgarian
> \selectlanguage{bulgarian}
> ....

There's no need. With \babelfont you can associate a font with a
language (declared with either the classic syntax or \babelprovide). And
when you use \selectlanguage, \foreignlanguage or any other babel
command or environment to switch languages, the associated font is
activated for that language. For example:

\babelprovide[import]{russian}
\babelprovide[import]{bulgarian}
\babelfont[russian]{rm}[]{Old Standard}
\babelfont[bulgarian]{rm}[]{FreeSerif}

and then:

\selectlanguage{russian}
...
\selectlanguage{bulgarian}
...

\babelprovide supports several properties. Adding the onchar=ids fonts
property equates language and script, so that everything in that script
is associated with the font of that language. This only makes sense when
just one language in the document uses that script, as we discussed
before. In a case like russian/bulgarian, the font of the last
\babelprovide overrides the other one wherever that script appears.
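
As a minimal, untested sketch of the single-language case (it assumes
LuaTeX, since onchar only works there, and that the fonts named below
are installed; FreeSerif as the default font is only an assumption for
the example):

#+begin_src latex
% Compile with lualatex.
\documentclass{article}
\usepackage[english]{babel}
% Russian is activated automatically whenever a Cyrillic character appears.
\babelprovide[import, onchar=ids fonts]{russian}
\babelfont{rm}{FreeSerif}               % default roman font (assumed)
\babelfont[russian]{rm}{Old Standard}   % roman font for Russian
\begin{document}
Some English text, немного русского текста, and English again.
\end{document}
#+end_src

Here the Cyrillic words should pick up Old Standard and the Russian
hyphenation without any explicit \selectlanguage or \foreignlanguage.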

>
>> In any case (to organize myself mentally) I thought that it could be
>> done on two levels:
>>
>> - Level 0: The fonts associated with each script are loaded (from a
>>   defcustom list) if luatex is the current engine. And low-level code is
>>   generated in Lua with the luaotfload.add_fallback function. That code
>>   can be in a Lua file or directly within the preamble, enclosed in the
>>   \directlua primitive (mode=harf means that HarfBuzz is used as otf
>>   rendering):
>> ...
>
> Sounds reasonable.
>
>> - Level 1: The user can load language properties and associate fonts
>>   with each language using Babel's high-level code (via keywords in Org,
>>   as we have commented in previous messages). Here you can also modify
>>   the default fonts (also, as we mentioned before): main, mono, sans and
>>   math. If the language is declared with an asterisk (for example:
>>   russian*) the onchar=etc property will be included in the preamble,
>>   and it would not be necessary to switch to russian explicitly. It is
>>   assumed that in this scenario the only language with Cyrillic script
>>   would be Russian. For language switching, in the rest of the cases,
>>   some babel command would have to be used using @@latex:@@, special
>>   blocks, etc. When Org already has its own language switching
>>   mechanism, this would be used instead. Wdyt?
>
> I am not sure I like the "russian*" idea. Could you explain a bit more
> about how onchar works? What if the character sets of two languages
> overlap without being identical?

Basically, it's like I said above. According to the Babel Manual:

#+begin_quote
onchar= ids | fonts | letters

This option is much like an ‘event’ called when a character belonging to
the script of this locale is found (as its name implies, it acts on
characters, not on spaces). There are currently two ‘actions’, which can
be used at the same time (separated by a space): with ids the \language
and the \localeid are set to the values of this locale; with fonts, the
fonts are changed to those of this locale (as set with \babelfont).
Characters can be added or modified with \babelcharproperty.

[...] Option letters restricts the ‘actions’ to letters, in the TeX
sense (i.e., with catcode 11). Digits and punctuation are then
considered part of current locale (as set by a selector). This option is
useful when the main script is non-Latin and there is a secondary one
whose script is Latin.
#+end_quote
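
To illustrate that last case, a preamble sketch (untested; it assumes
LuaTeX, a Russian main language, and installed fonts with the names
below, which are only examples):

#+begin_src latex
% Main language: Russian. English is activated on the fly,
% but only by Latin letters (catcode 11).
\usepackage[russian]{babel}
\babelprovide[import, onchar=ids fonts letters]{english}
\babelfont{rm}{Old Standard}         % default roman font (Cyrillic text)
\babelfont[english]{rm}{FreeSerif}   % roman font for the English runs
#+end_src

With letters, digits and punctuation next to the Latin words stay in
the current (Russian) locale instead of triggering a switch.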


-- 
Juan Manuel Macías 

https://juanmanuelmacias.com

https://lunotipia.juanmanuelmacias.com

https://gnutas.juanmanuelmacias.com





* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-07 12:04                           ` Juan Manuel Macías
@ 2023-09-08  7:42                             ` Ihor Radchenko
  0 siblings, 0 replies; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-08  7:42 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: Max Nikulin, emacs-orgmode

Juan Manuel Macías <maciaschain@posteo.net> writes:

>> I am not sure I like the "russian*" idea. Could you explain a bit more
>> about how onchar works? What if the character sets of two languages
>> overlap without being identical?
>
> Basically, it's like I said above. According to the Babel Manual:
>
> #+begin_quote
> onchar= ids | fonts | letters
>
> This option is much like an ‘event’ called when a character belonging to
> the script of this locale is found (as its name implies, it acts on
> characters, not on spaces). There are currently two ‘actions’, which can
> be used at the same time (separated by a space): with ids the \language
> and the \localeid are set to the values of this locale; with fonts, the
> fonts are changed to those of this locale (as set with \babelfont).
> Characters can be added or modified with \babelcharproperty.
>
> [...] Option letters restricts the ‘actions’ to letters, in the TeX
> sense (i.e., with catcode 11). Digits and punctuation are then
> considered part of current locale (as set by a selector). This option is
> useful when the main script is non-Latin and there is a secondary one
> whose script is Latin.
> #+end_quote

Thanks for the explanation!
Then, language* it is. I have no better idea.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-05 10:44                 ` Ihor Radchenko
@ 2023-09-20 14:03                   ` Juan Manuel Macías
  2023-09-21  9:00                     ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-09-20 14:03 UTC (permalink / raw)
  To: orgmode; +Cc: Ihor Radchenko, Timothy, Max Nikulin

Some new information about Babel that may be of interest to the topic of
this thread.

I have received an email from Javier Bezos (whom I know from the
Spanish-speaking TeX users' mailing list), who is the current babel
maintainer, as well as the person responsible for all the improvements
and new features of the package. Although he is not currently an
Emacs/Org user, he has been following this thread with great interest,
so I am transmitting here, with his permission, some interesting
comments from him:

#+begin_quote

[...] I am very interested in all possible improvements in babel so that
it integrates as best as possible with automatically generated files.
Among them are the possibility of using BCP47 codes or using a language
(at least basically) without the need for a prior declaration. These are
things already done, but there are others that can still be improved.

[...] any suggestion for improvement is very welcome [...]

Among the things I agree on is the naming issue. I am unifying the names
with the CLDR as much as possible, and in fact this is already well advanced:

https://latex3.github.io/babel/guides/locale-naming.html

[...]

The ini files contain information that is not actually used by babel,
but that could be useful in other packages or even external
applications. One of them is the name of the language in English and in
the vernacular form, as they are in the Unicode CLDR. As I explain in
the link I gave you, the purpose is that the babel name is based on the
CLDR name with mechanical changes. Anyway, CLDR names are also included
in the ini files, to establish correspondences more easily.

#+end_quote
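
For reference, the BCP 47 support mentioned in the quote can be tried
along these lines. This is a sketch modelled on the example in the
babel manual (not tested here); it assumes LuaTeX or XeTeX and a recent
babel:

#+begin_src latex
\documentclass{article}
\usepackage[danish]{babel}
% Let babel load locales on demand when it sees a BCP 47 tag.
\babeladjust{
  autoload.bcp47 = on,
  autoload.bcp47.options = import
}
\begin{document}
Chapter in Danish: \chaptername.

% Austrian German, never declared in the preamble.
\selectlanguage{de-AT}
\localedate{2020}{1}{30}
\end{document}
#+end_src

This is the kind of thing that would suit automatically generated
files, since the exporter would only need to know the language tag.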



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-20 14:03                   ` Juan Manuel Macías
@ 2023-09-21  9:00                     ` Ihor Radchenko
  2023-09-24 18:24                       ` Juan Manuel Macías
  0 siblings, 1 reply; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-21  9:00 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: orgmode, Timothy, Max Nikulin

Juan Manuel Macías <maciaschain@posteo.net> writes:

> [...] I am very interested in all possible improvements in babel so that
> it integrates as best as possible with automatically generated files.
> Among them are the possibility of using BCP47 codes or using a language
> (at least basically) without the need for a prior declaration. These are
> things already done, but there are others that can still be improved.

Do I understand correctly that babel, in future, may be able to
auto-detect more languages without explicitly declaring them?

> [...] any suggestion for improvement is very welcome [...]

This is a bit too out of context. Improvement of what?

> Among the things I agree on is the naming issue. I am unifying the names
> with the CLDR as much as possible, and in fact this is already well advanced:
>
> https://latex3.github.io/babel/guides/locale-naming.html

AFAIU, the relevant quote is

    They are taken from the CLDR. Wherever the CLDR doesn’t provide a name
    (eg, “Medieval Latin”), the pattern followed in practice for other names
    is applied, namely, use the ‘natural’ form in English: medievallatin.
    They should be preferably based on the description field in the IANA
    registry (eg, polytonicgreek), although some simplifications can be
    necessary, because some names are “too” descriptive. See also the
    templates for about 500 locales already available. As a secondary
    source, Glottolog is used, too. (Wikipedia articles can be taken as a
    complementary but unreliable source, and its information must be
    verified; on the other hand, internal data, like this one, is useful for
    both names and tags.)

I am not very sure about "some simplifications" referring to IANA. I
guess it is referring to language names in
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
like "Puter idiom of Romansh".

From Org perspective, verbosity is not a primary concern as long as we
provide #+language: completion support. Probably, we should favor names
that are more likely known (or can be easily found) by the language
users. IANA and https://glottolog.org/ look like good sources we can
link to.

We can also provide multiple language name variants though I don't see a
need to bother unless we get user requests to do such thing.

> The ini files contain information that is not actually used by babel,
> but that could be useful in other packages or even external
> applications. One of them is the name of the language in English and in
> the vernacular form, as they are in the Unicode CLDR. As I explain in
> the link I gave you, the purpose is that the babel name is based on the
> CLDR name with mechanical changes. Anyway, CLDR names are also included
> in the ini files, to establish correspondences more easily.

Are the "verbose" language names (name.english) changed to "simplify"
them? Or is it only done for name.babel?
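
For context, babel already exposes some of these ini fields at the
LaTeX level through \localeinfo. A sketch (untested; it only uses the
field names name.english and tag.bcp47, which the babel manual
documents, and assumes compilation with lualatex):

#+begin_src latex
\documentclass{article}
\usepackage[english]{babel}
\babelprovide[import]{russian}
\begin{document}
% \localeinfo reads fields of the *current* locale, so switch first.
\foreignlanguage{russian}{%
  \localeinfo{name.english} (\localeinfo{tag.bcp47})}
\end{document}
#+end_src

Whether name.babel and the vernacular name can be read the same way is
something I have not checked.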

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-21  9:00                     ` Ihor Radchenko
@ 2023-09-24 18:24                       ` Juan Manuel Macías
  2023-09-26 10:37                         ` Ihor Radchenko
  0 siblings, 1 reply; 21+ messages in thread
From: Juan Manuel Macías @ 2023-09-24 18:24 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: orgmode, Timothy, Max Nikulin

Sorry for the late reply.

Ihor Radchenko writes:

> Juan Manuel Macías <maciaschain@posteo.net> writes:
>
>> [...] I am very interested in all possible improvements in babel so that
>> it integrates as best as possible with automatically generated files.
>> Among them are the possibility of using BCP47 codes or using a language
>> (at least basically) without the need for a prior declaration. These are
>> things already done, but there are others that can still be improved.
>
> Do I understand correctly that babel, in future, may be able to
> auto-detect more languages without explicitly declaring them?

Correct. In fact, it is already possible to use the \foreignlanguage
command or its environment version (otherlanguage*) without declaring
the language beforehand. I would say that \foreignlanguage covers a
high percentage of use cases in multilingual documents, since it is
intended for short fragments of text and only switches the hyphenation
rules and typographical 'extras' of the language, not its captions or
dates.
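
A minimal sketch of that on-the-fly loading, adapted from the babel
manual (LuaTeX or XeTeX only; FreeSerif is just an example of an
installed font with Cyrillic coverage):

#+begin_src latex
\documentclass{article}
\usepackage[english]{babel}
% Neither russian nor spanish is declared as a babel option;
% only a font is associated with russian.
\babelfont[russian]{rm}{FreeSerif}
\begin{document}
English. \foreignlanguage{russian}{Русский}.
\foreignlanguage{spanish}{Español}.
\end{document}
#+end_src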

>> [...] any suggestion for improvement is very welcome [...]
>
> This is a bit too out of context. Improvement of what?

I think it is related to the previous paragraph: "I am very interested
in all possible improvements in babel so that it integrates as best as
possible with automatically generated files[...]"

>> Among the things I agree on is the naming issue. I am unifying the names
>> with the CLDR as much as possible, and in fact this is already well advanced:
>>
>> https://latex3.github.io/babel/guides/locale-naming.html
>
> AFAIU, the relevant quote is
>
>     They are taken from the CLDR. Wherever the CLDR doesn’t provide a name
>     (eg, “Medieval Latin”), the pattern followed in practice for other names
>     is applied, namely, use the ‘natural’ form in English: medievallatin.
>     They should be preferably based on the description field in the IANA
>     registry (eg, polytonicgreek), although some simplifications can be
>     necessary, because some names are “too” descriptive. See also the
>     templates for about 500 locales already available. As a secondary
>     source, Glottolog is used, too. (Wikipedia articles can be taken as a
>     complementary but unreliable source, and its information must be
>     verified; on the other hand, internal data, like this one, is useful for
>     both names and tags.)
>
> I am not very sure about "some simplifications" referring to IANA. I
> guess it is referring to language names in
> https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
> like "Puter idiom of Romansh".
>
> From Org perspective, verbosity is not a primary concern as long as we
> provide #+language: completion support. Probably, we should favor names
> that are more likely known (or can be easily found) by the language
> users. IANA and https://glottolog.org/ look like good sources we can
> link to.
>
> We can also provide multiple language name variants though I don't see a
> need to bother unless we get user requests to do such thing.

I agree. I even think it would be a good point to also include the
vernacular name of each language.

By the way, Javier has also told me that he is going to consider the
'onchar=ids fonts' issue related to the case of several languages that
use the same script (already discussed here in past messages).

Best regards,

Juan Manuel 

-- 
Juan Manuel Macías

https://juanmanuelmacias.com

https://lunotipia.juanmanuelmacias.com

https://gnutas.juanmanuelmacias.com



* Re: Fallback fonts in LaTeX export for non latin scripts
  2023-09-24 18:24                       ` Juan Manuel Macías
@ 2023-09-26 10:37                         ` Ihor Radchenko
  0 siblings, 0 replies; 21+ messages in thread
From: Ihor Radchenko @ 2023-09-26 10:37 UTC (permalink / raw)
  To: Juan Manuel Macías; +Cc: orgmode, Timothy, Max Nikulin

Juan Manuel Macías <maciaschain@posteo.net> writes:

>>> [...] any suggestion for improvement is very welcome [...]
>>
>> This is a bit too out of context. Improvement of what?
>
> I think it is related to the previous paragraph: "I am very interested
> in all possible improvements in babel so that it integrates as best as
> possible with automatically generated files[...]"

That's good to hear. In practical terms, if Javier gives us some contact
email, we may CC him when we think that what we discuss is related to
Babel.

>> We can also provide multiple language name variants though I don't see a
>> need to bother unless we get user requests to do such thing.
>
> I agree. I even think it would be a good point to also include the
> vernacular name of each language.

Sounds reasonable. Although, let's come back to this when we have actual
code to discuss.

> By the way, Javier has also told me that he is going to consider the
> 'onchar=ids fonts' issue related to the case of several languages that
> use the same script (already discussed here in past messages).

That would be nice, although determining the language may not be trivial.
AFAIK, automatic language detection often relies upon word frequencies
(for example, see https://pypi.org/project/langdetect/) and cannot be
reliable for very short text fragments. In the case of texts combining
multiple languages arbitrarily, the problem becomes even more difficult.
In some cases (dialects), multiple languages can be valid for the same
text fragment.

That said, a frequency-based approach can mostly work well, except in
certain edge cases. But it requires a word corpus. I am not sure how
feasible it would be to include one in a TeX distribution. (Maybe not
very hard: it is already quite large, and a few dictionary files will
not change much.)

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>



end of thread, last message ~2023-09-26 10:36 UTC

Thread overview: 21+ messages
2023-08-30  8:25 Fallback fonts in LaTeX export for non latin scripts Juan Manuel Macías
2023-08-31  8:17 ` Ihor Radchenko
2023-08-31 11:42   ` Juan Manuel Macías
2023-09-01  9:18     ` Ihor Radchenko
2023-09-02 21:39       ` Juan Manuel Macías
2023-09-03  7:22         ` Ihor Radchenko
2023-09-03 11:05           ` Juan Manuel Macías
2023-09-04  8:09             ` Ihor Radchenko
2023-09-04 22:22               ` Juan Manuel Macías
2023-09-05 10:44                 ` Ihor Radchenko
2023-09-20 14:03                   ` Juan Manuel Macías
2023-09-21  9:00                     ` Ihor Radchenko
2023-09-24 18:24                       ` Juan Manuel Macías
2023-09-26 10:37                         ` Ihor Radchenko
2023-09-05 16:42                 ` Max Nikulin
2023-09-05 18:33                   ` Juan Manuel Macías
2023-09-06  9:29                     ` Ihor Radchenko
2023-09-06 14:58                       ` Juan Manuel Macías
2023-09-07 10:22                         ` Ihor Radchenko
2023-09-07 12:04                           ` Juan Manuel Macías
2023-09-08  7:42                             ` Ihor Radchenko
