emacs-orgmode@gnu.org archives
From: Christophe Pouzat <christophe.pouzat@gmail.com>
To: emacs-orgmode@gnu.org
Subject: Re: Efficiency of Org v. LaTeX v. Word ---LOOK AT THE DATA!
Date: Sun, 28 Dec 2014 22:40:24 +0100
Message-ID: <54A078C8.90501@gmail.com>
In-Reply-To: <m2h9wi6nas.fsf@gmail.com>

Hi all,

After seeing Ken's mail:

Le 26/12/2014 23:47, Ken Mankoff a écrit :
> People here might be interested in a publication from [2014-12-19 Fri]
> available at http://dx.doi.org/10.1371/journal.pone.0115069
>
> Title: An Efficiency Comparison of Document Preparation Systems Used
> in Academic Research and Development
>
> Summary: Word users are more efficient and have less errors than even
> experienced LaTeX users.
>
> Someone here should repeat experiment and add Org into the mix, perhaps
> Org -> ODT and/or Org -> LaTeX and see if it helps or hurts. I assume
> Org would trump LaTeX, but would Org -> ODT or Org -> X -> DOCX (via
> pandoc) beat straight Word?
>
>    -k.
>
>
and some of the replies it triggered on the list, I went to check the paper.
Like many of you, I found some "results" puzzling, in particular:
1. the use of bar graphs when the data would be better displayed
directly (which immediately qualifies the paper as "low quality" for me).
2. the larger error bars observed for LaTeX when compared to Word.
3. the systematic inverse relationship between the heights of the blue
and pink bars.

So I went to figshare to download the data and looked at them. A quick 
and dirty "analysis" is attached to this mail in PDF format (generated 
with org, of course, and this awful software called LaTeX!) and the 
source org file can be found at the bottom of this mail. I used R to do 
the figures (and I'm sure the authors of the paper will then criticize 
me for not using Excel with which everyone knows errors are generated 
much more efficiently).

I managed to understand the inverse relationship in point 3 above: the
authors considered 3 types of mistakes / errors:
1. Formatting errors and typos.
2. Orthographic and grammatical errors.
3. Missing words and signs.
Clearly, following what Tom (Dye) wrote on the list and on the PLOS web
site, I would argue that formatting errors in LaTeX are bona fide bugs.
But the point I want to make is that the third source accounts for 80%
of the total errors (what is shown as pink bars in the paper), and clearly
the authors counted whatever the subjects did not have time to type as an
error of this type. Said differently, the blue and pink bars
systematically show the same thing by construction! The second type of
error is not a LaTeX issue (and in fact does not differ significantly from
the Word case) but an "environment" issue (what spell checker did the
LaTeX users have access to?).

There is another strange thing in the table copy case. In both the
expert and the novice LaTeX groups, one of the 10 subjects produced 0%
of the table but still managed to produce 22 typographic errors!

The overall worse performance of LaTeX users remains to be explained and,
as mentioned in one of the mails on the list, it does not make sense, at
least for the continuous text exercise. The method section of the paper
is too vague, but my guess is that some LaTeX users attempted to
reproduce the exact layout of the text they had to copy, something LaTeX
is definitely not designed to provide quickly.

One more point: how many of you could specify your total number of
hours of experience with LaTeX (or any other software you are currently
using)? That is what the subjects of this study had to specify...

Let me know what you think,

Christophe

My org buffer:

#+TITLE: An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development: A Re-analysis.
#+DATE: <2014-12-28 dim.>
#+AUTHOR: Christophe Pouzat
#+EMAIL: christophe.pouzat@gmail.com
#+OPTIONS: ':nil *:t -:t ::t <:t H:3 \n:nil ^:t arch:headline
#+OPTIONS: author:t c:nil creator:comment d:(not "LOGBOOK") date:t
#+OPTIONS: e:t email:nil f:t inline:t num:t p:nil pri:nil stat:t
#+OPTIONS: tags:t tasks:t tex:t timestamp:t toc:nil todo:t |:t
#+CREATOR: Emacs 24.4.1 (Org mode 8.2.10)
#+DESCRIPTION:
#+EXCLUDE_TAGS: noexport
#+KEYWORDS:
#+LANGUAGE: en
#+SELECT_TAGS: export
#+LaTeX_HEADER: \usepackage{alltt}
#+LaTeX_HEADER: \usepackage[usenames,dvipsnames]{xcolor}
#+LaTeX_HEADER: \renewenvironment{verbatim}{\begin{alltt} \scriptsize \color{Bittersweet} \vspace{0.2cm} }{\vspace{0.2cm} \end{alltt} \normalsize \color{black}}
#+LaTeX_HEADER: \definecolor{lightcolor}{gray}{.55}
#+LaTeX_HEADER: \definecolor{shadecolor}{gray}{.85}
#+LaTeX_HEADER: \usepackage{minted}
#+LaTeX_HEADER: \hypersetup{colorlinks=true}

#+NAME: org-latex-set-up
#+BEGIN_SRC emacs-lisp :results silent :exports none
(setq org-latex-listings 'minted)
(setq org-latex-minted-options
       '(("bgcolor" "shadecolor")
     ("fontsize" "\\scriptsize")))
(setq org-latex-pdf-process
      '("pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "biber %b"
        "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"
        "pdflatex -shell-escape -interaction nonstopmode -output-directory %o %f"))
#+END_SRC

* Introduction
This is a re-analysis of the data presented in
[[http://dx.doi.org/10.1371/journal.pone.0115069][An Efficiency Comparison of Document Preparation Systems Used in Academic Research and Development]].
My "interest" in this paper was triggered by a discussion on the
[[http://article.gmane.org/gmane.emacs.orgmode/93655][emacs org mode mailing list]].
Ignoring the "message" of the paper, what struck me was the systematic
use of bar graphs: a way of displaying data that *should never be used*,
since when many observations are considered a box plot does a much
better job, and when, as in the present paper, few observations (10 in
each of the 4 categories) are available, a direct display or even a
simple table does a *much better* job. Since it turns out that the data
are available both on the PLOS web site and on
[[http://figshare.com/articles/_An_Efficiency_Comparison_of_Document_Preparation_Systems_Used_in_Academic_Research_and_Development_/1275631][figshare]],
I decided to re-analyze them.

* Getting the data, etc.

We get the data with:

#+BEGIN_SRC sh
wget http://files.figshare.com/1849394/S1_Materials.xlsx
#+END_SRC

#+RESULTS:

Using, for instance, [[http://dag.wiee.rs/home-made/unoconv/][unoconv]],
we can convert the =Excel= file into a friendlier =csv= file:

#+BEGIN_SRC sh
unoconv -f csv S1_Materials.xlsx
#+END_SRC

#+RESULTS:

We then read the data with the =R= function =read.csv=:

#+NAME: data-table
#+BEGIN_SRC R :session *R* :results silent
efficiency <- read.csv("S1_Materials.csv",header=TRUE,dec=",")
#+END_SRC
The description of this table is obtained with:

#+BEGIN_SRC sh :exports both :results output
wget http://files.figshare.com/1849395/S2_Materials.txt
cat "S2_Materials.txt"
#+END_SRC

* Making some figures
We can now make a figure out of the same data as figures 4, 5 and 6 of 
the paper but showing the actual data. We start with the "continuous 
text" exercise. We represent, in each of the four categories, each of 
the 10 individuals by a number between 0 and 9. Some horizontal jitter 
has been added to avoid overlaps. Category 1 corresponds to expert 
=Word= users; 2 to novice =Word= users; 3 to expert \LaTeX{} users; 4 to 
novice \LaTeX{} users:

#+HEADER: :file continuous.png :width 1000 :height 1000
#+BEGIN_SRC R :session *R* :exports both :results output graphics
layout(matrix(1:4,ncol=2,byrow=TRUE))
par(cex=2)
plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),ylim=c(0,100),
      xlab="User category",ylab="",main="Fraction of text")
with(efficiency,
      sapply(1:4,
             function(k) points(runif(10,k-0.2,k+0.2),
                                PROZENT1[Kenntnisse==k],
                                pch = paste(0:9))))

with(efficiency,
      plot(c(1,4),c(0,100),type="n",
           xlim=c(0.5,4.5),ylim=range(FEHLERSFT),xlab="User category",
           ylab="",main="Formatting errors and typos"))
with(efficiency,
      sapply(1:4,
             function(k) points(runif(10,k-0.2,k+0.2),
                                FEHLERSFT[Kenntnisse==k],
                                pch = paste(0:9))))

with(efficiency,
      plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
           ylim=range(FEHLEROFT),xlab="User category",ylab="",
           main="Orthographic and grammatical mistakes"))
with(efficiency,
      sapply(1:4,
             function(k) points(runif(10,k-0.2,k+0.2),
                                FEHLEROFT[Kenntnisse==k],
                                pch = paste(0:9))))

with(efficiency,
      plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
           ylim=range(FEHLENDFT),xlab="User category",ylab="",
           main="Missing words and signs"))
with(efficiency,
      sapply(1:4,
             function(k) points(runif(10,k-0.2,k+0.2),
                                FEHLENDFT[Kenntnisse==k],
                                pch = paste(0:9))))
#+END_SRC


Notice that the number of "missing words and signs" exactly mirrors the
fraction of written text. We will see that this observation holds for
the two following exercises. The "missing words and signs" count is
always roughly ten times as large as the two other sources of mistakes.
This explains the inverse relationship between the blue and pink bars on
each of the 3 figures.
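
A rough numerical check of this mirroring can be sketched as follows. The
helper =mirror_check= is a name introduced here for illustration (it is
not part of the paper's material); it simply returns the Pearson
correlation between a "fraction produced" column and a "missing words
and signs" column. The small data frame =toy= is made up to show the
call and is *not* the study's data:

#+BEGIN_SRC R :session *R* :exports both
## Sketch: correlation between the fraction of text produced and the
## number of "missing words and signs" for one exercise.
## `mirror_check` is a hypothetical helper, not part of the paper's material.
mirror_check <- function(df, fraction_col, missing_col) {
    cor(df[[fraction_col]], df[[missing_col]])
}

## Made-up toy data (NOT the study's data): the missing count falls
## linearly as the fraction produced rises, so the correlation is
## close to -1.
toy <- data.frame(PROZENT1 = c(100, 80, 60, 40),
                  FEHLENDFT = c(0, 10, 20, 30))
mirror_check(toy, "PROZENT1", "FEHLENDFT")
#+END_SRC

On the real table one would call, for instance,
=mirror_check(efficiency, "PROZENT1", "FEHLENDFT")=, and likewise with
=PROZENT2= / =FEHLENDT= and =PROZENT3= / =FEHLENDFOR=. A value close to
-1 would be consistent with the two quantities measuring the same thing.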

Let's keep going with the "table exercise":

#+HEADER: :file table.png :width 1000 :height 1000
#+BEGIN_SRC R :session *R* :exports both :results output graphics
layout(matrix(1:4,ncol=2,byrow=TRUE))
par(cex=2)
plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),ylim=c(0,100),
      xlab="User category",ylab="",main="Fraction of text")
with(efficiency,sapply(1:4,
                        function(k) points(runif(10,k-0.2,k+0.2),
                                           PROZENT2[Kenntnisse==k],
                                           pch = paste(0:9))))

with(efficiency,plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
                      ylim=range(FEHLERST),xlab="User category",
                      ylab="",main="Formatting errors and typos"))
with(efficiency,sapply(1:4,
                        function(k) points(runif(10,k-0.2,k+0.2),
                                           FEHLERST[Kenntnisse==k],
                                           pch = paste(0:9))))

with(efficiency,plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
                      ylim=range(FEHLEROT),xlab="User category",
                      ylab="",main="Orthographic and grammatical mistakes"))
with(efficiency,sapply(1:4,
                        function(k) points(runif(10,k-0.2,k+0.2),
                                           FEHLEROT[Kenntnisse==k],
                                           pch = paste(0:9))))

with(efficiency,plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
                      ylim=range(FEHLENDT),xlab="User category",ylab="",
                      main="Missing words and signs"))
with(efficiency,sapply(1:4,
                        function(k) points(runif(10,k-0.2,k+0.2),
                                           FEHLENDT[Kenntnisse==k],
                                           pch = paste(0:9))))
#+END_SRC

We also see a strange thing here: in each of the expert \LaTeX{} and the
novice \LaTeX{} groups we have one individual who did not write anything
but still managed to produce 22 "formatting errors and typos" (!) but
luckily no orthographic or grammatical error...

#+BEGIN_SRC R :session *R* :exports both
with(efficiency,cbind(c(PROZENT2[Kenntnisse==3][10],
                         FEHLERST[Kenntnisse==3][10],
                         FEHLEROT[Kenntnisse==3][10],
                         FEHLENDT[Kenntnisse==3][10]),
                       c(PROZENT2[Kenntnisse==4][7],
                         FEHLERST[Kenntnisse==4][7],
                         FEHLEROT[Kenntnisse==4][7],
                         FEHLENDT[Kenntnisse==4][7])))
#+END_SRC
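
The block above picks the two subjects by hard-coded position. As a
sketch, the same kind of case can be searched for without knowing the
positions in advance. The helper name =zero_output_with_errors= is
introduced here for illustration only, and the data frame =toy= is made
up (it is *not* the study's data):

#+BEGIN_SRC R :session *R* :exports both
## Sketch: list subjects who produced nothing (0 %) in an exercise but
## were still charged with formatting errors / typos.
## `zero_output_with_errors` is a hypothetical helper name.
zero_output_with_errors <- function(df, fraction_col, error_col) {
    keep <- df[[fraction_col]] == 0 & df[[error_col]] > 0
    ## which() drops rows where the comparison is NA
    df[which(keep), c("Kenntnisse", fraction_col, error_col), drop = FALSE]
}

## Made-up toy data (NOT the study's data).
toy <- data.frame(Kenntnisse = c(3, 3, 4),
                  PROZENT2 = c(0, 55, 0),
                  FEHLERST = c(22, 3, 0))
zero_output_with_errors(toy, "PROZENT2", "FEHLERST")
#+END_SRC

With the real table the call would be
=zero_output_with_errors(efficiency, "PROZENT2", "FEHLERST")=, assuming
the column names used in the blocks above.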


Now for the "equations" exercise:

#+HEADER: :file equation.png :width 1000 :height 1000
#+BEGIN_SRC R :session *R* :exports both :results output graphics
layout(matrix(1:4,ncol=2,byrow=TRUE))
par(cex=2)
plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),ylim=c(0,100),
      xlab="User category",ylab="",main="Fraction of text")
with(efficiency,
      sapply(1:4,
             function(k) points(runif(10,k-0.2,k+0.2),
                                PROZENT3[Kenntnisse==k],
                                pch = paste(0:9))))

with(efficiency,
      plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
           ylim=range(FEHLERSFOR),xlab="User category",ylab="",
           main="Formatting errors and typos"))
with(efficiency,
      sapply(1:4,
             function(k) points(runif(10,k-0.2,k+0.2),
                                FEHLERSFOR[Kenntnisse==k],
                                pch = paste(0:9))))

with(efficiency,
      plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
           ylim=range(FEHLEROFOR),xlab="User category",ylab="",
           main="Orthographic and grammatical mistakes"))
with(efficiency,
      sapply(1:4,
             function(k) points(runif(10,k-0.2,k+0.2),
                                FEHLEROFOR[Kenntnisse==k],
                                pch = paste(0:9))))

with(efficiency,
      plot(c(1,4),c(0,100),type="n",xlim=c(0.5,4.5),
           ylim=range(FEHLENDFOR),xlab="User category",ylab="",
           main="Missing words and signs"))
with(efficiency,
      sapply(1:4,
             function(k) points(runif(10,k-0.2,k+0.2),
                                FEHLENDFOR[Kenntnisse==k],
                                pch = paste(0:9))))
#+END_SRC
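
The three figure blocks repeat the same =plot= / =points= pair twelve
times. One possible way to factor this out is sketched below; the
function name =plot_by_category= is an invention for this illustration,
and it assumes a data frame with a =Kenntnisse= column coded 1 to 4, as
in the blocks above:

#+BEGIN_SRC R :session *R* :results silent
## Sketch: one helper replacing the repeated plot()/points() pairs.
## `plot_by_category` is a hypothetical name introduced for illustration.
plot_by_category <- function(df, column, main, ylim = range(df[[column]])) {
    ## Empty frame with the same axes as in the blocks above
    plot(c(1, 4), c(0, 100), type = "n", xlim = c(0.5, 4.5), ylim = ylim,
         xlab = "User category", ylab = "", main = main)
    for (k in 1:4) {
        y <- df[[column]][df$Kenntnisse == k]
        ## Label subjects 0, 1, 2, ... within the category and add some
        ## horizontal jitter to avoid overlaps.
        points(runif(length(y), k - 0.2, k + 0.2), y,
               pch = as.character(seq_along(y) - 1))
    }
    invisible(NULL)
}
#+END_SRC

A panel of the "equations" figure would then be drawn with, for example,
=plot_by_category(efficiency, "PROZENT3", "Fraction of text", ylim = c(0, 100))=
after the usual =layout= and =par= calls. Single-character labels only
work for up to 10 subjects per category, which is the case here.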



-- 
A Master Carpenter has many tools and is expert with most of them. If you only know how to use a hammer, every problem starts to look like a nail. Stay away from that trap.

Richard B Johnson.

--

Christophe Pouzat
MAP5 - Mathématiques Appliquées à Paris 5
CNRS UMR 8145
45, rue des Saints-Pères
75006 PARIS
France

tel: +33142863828
mobile: +33662941034
web: http://xtof.disque.math.cnrs.fr


[-- Attachment #2: EfficiencyComparison.pdf --]
[-- Type: application/pdf, Size: 218418 bytes --]