Re: state of the art in org-mode tables e.g. join, etc

From: Derek Feichtinger <derek.feichtinger@psi.ch>
To: John Kitchin <jkitchin@andrew.cmu.edu>
Cc: Tim Cross <theophilusx@gmail.com>, emacs-orgmode@gnu.org
Subject: Re: state of the art in org-mode tables e.g. join, etc
Date: Mon, 22 Feb 2021 09:27:50 +0100
Message-ID: <87r1l8tsl5.fsf@psi.ch>
In-Reply-To: <CAJ51ETrQzgP3DGespaGx0Yj5gLSDciVtV-9G4gM33-xZdHxNug@mail.gmail.com>

Hi John,

I invested time some years ago in preparing babel examples, and a lot of
the description went into using tables. The most detailed documents I
wrote were for elisp and python.

To be productive, e.g. for producing all kinds of scientific graphs,
but also for doing the finances and planning for our scientific
computing section, I ended up the same way as you: mostly going to
python and leveraging Pandas. I think all of us end up using
":colnames no" as the most convenient solution.

https://github.com/dfeich/org-babel-examples/blob/master/python3/python3-babel.org

(look especially at section 10, on Pandas)

In that file I also tangle a python library "orgbabelhelper" that is
available in Conda and PyPI. I mainly use that to work with my tables.
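
As a rough illustration of the round trip such a helper performs, here
is a minimal sketch in plain Pandas. The function names
orgtable_to_df and df_to_orgtable are hypothetical and are not the
orgbabelhelper API. It assumes the table arrives with ":colnames no",
so the header is the first row, and that ob-python keeps its default
of rendering None as a horizontal rule in the result.

```python
import pandas as pd


def orgtable_to_df(table):
    """Build a DataFrame from a Babel table whose first row is the header."""
    header, *rows = table
    return pd.DataFrame(rows, columns=header)


def df_to_orgtable(df):
    """Return [header, None, row, ...] for use as a Babel table result.

    Assumption: ob-python turns the None entry into an hline.
    """
    return [list(df.columns), None] + df.values.tolist()


# What a block would receive from ":var recipe=recipe :colnames no"
recipe = [["type", "quty"], ["onion", 70], ["tomatoe", 120]]

df = orgtable_to_df(recipe)
df["quty_x2"] = df["quty"] * 2  # any Pandas processing goes here
result = df_to_orgtable(df)
print(result)
```

In a source block, returning the final list (rather than printing it)
is what would produce an Org table in the results.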

Best regards
Derek

-- 
Paul Scherrer Institut
Dr. Derek Feichtinger                   Phone:   +41 56 310 47 33
Group Head HPC and Emerging Technologies  Email: derek.feichtinger@psi.ch
Building/Room No. OHSA/D17
Forschungsstrasse 111
CH-5232 Villigen PSI 

On Sun, Feb 21 2021, John Kitchin <jkitchin@andrew.cmu.edu> wrote:

> For fun, here is the sqlite equivalent of the Pandas example using the same
> tables as before
>
>
> ** aggregation example
>
> Examples from https://github.com/tbanel/orgaggregate
>
>
> #+NAME: original
> | Day       | Color | Level | Quantity |
> |-----------+-------+-------+----------|
> | Monday    | Red   |    30 |       11 |
> | Monday    | Blue  |    25 |        3 |
> | Tuesday   | Red   |    51 |       12 |
> | Tuesday   | Red   |    45 |       15 |
> | Tuesday   | Blue  |    33 |       18 |
> | Wednesday | Red   |    27 |       23 |
> | Wednesday | Blue  |    12 |       16 |
> | Wednesday | Blue  |    15 |       15 |
> | Thursday  | Red   |    39 |       24 |
> | Thursday  | Red   |    41 |       29 |
> | Thursday  | Red   |    49 |       30 |
> | Friday    | Blue  |     7 |        5 |
> | Friday    | Blue  |     6 |        8 |
> | Friday    | Blue  |    11 |        9 |
>
>
> #+begin_src sqlite :db ":memory:" :var orgtable=original :colnames yes
> drop table if exists testtable;
> create table testtable(Day str, Color str, Level int, Quantity int);
> .mode csv testtable
> .import $orgtable testtable
> select Color, count(*) from testtable group by Color;
> #+end_src
>
> #+RESULTS:
> | Color | count(*) |
> |-------+----------|
> | Blue  |        7 |
> | Red   |        7 |
>
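
For comparison, the same aggregation can be sketched with Python's
standard sqlite3 module instead of the sqlite command-line shell. This
is only an illustration under the assumption that the table rows are
already available as tuples; the "order by" clause is an addition to
make the row order deterministic.

```python
import sqlite3

# Rows of the "original" table, header omitted
rows = [
    ("Monday", "Red", 30, 11), ("Monday", "Blue", 25, 3),
    ("Tuesday", "Red", 51, 12), ("Tuesday", "Red", 45, 15),
    ("Tuesday", "Blue", 33, 18), ("Wednesday", "Red", 27, 23),
    ("Wednesday", "Blue", 12, 16), ("Wednesday", "Blue", 15, 15),
    ("Thursday", "Red", 39, 24), ("Thursday", "Red", 41, 29),
    ("Thursday", "Red", 49, 30), ("Friday", "Blue", 7, 5),
    ("Friday", "Blue", 6, 8), ("Friday", "Blue", 11, 9),
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table testtable(Day text, Color text, Level int, Quantity int)"
)
# Parameterized insert replaces the shell's ".import" step
con.executemany("insert into testtable values (?, ?, ?, ?)", rows)

counts = con.execute(
    "select Color, count(*) from testtable group by Color order by Color"
).fetchall()
con.close()

print(counts)
```
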
> ** join example
>
> Example from https://github.com/tbanel/orgtbljoin
>
> #+name: nutrition
> | type     | Fiber | Sugar | Protein | Carb |
> |----------+-------+-------+---------+------|
> | eggplant |   2.5 |   3.2 |     0.8 |  8.6 |
> | tomatoe  |   0.6 |   2.1 |     0.8 |  3.4 |
> | onion    |   1.3 |   4.4 |     1.3 |  9.0 |
> | egg      |     0 |  18.3 |    31.9 | 18.3 |
> | rice     |   0.2 |     0 |     1.5 | 16.0 |
> | bread    |   0.7 |   0.7 |     3.3 | 16.0 |
> | orange   |   3.1 |  11.9 |     1.3 | 17.6 |
> | banana   |   2.1 |   9.9 |     0.9 | 18.5 |
> | tofu     |   0.7 |   0.5 |     6.6 |  1.4 |
> | nut      |   2.6 |   1.3 |     4.9 |  7.2 |
> | corn     |   4.7 |   1.8 |     2.8 | 21.3 |
>
>
> #+name: recipe
> | type     | quty |
> |----------+------|
> | onion    |   70 |
> | tomatoe  |  120 |
> | eggplant |  300 |
> | tofu     |  100 |
>
>
> #+begin_src sqlite :db ":memory:" :var nut=nutrition rec=recipe :colnames yes
> drop table if exists nutrition;
> drop table if exists recipe;
> create table nutrition(type str, Fiber float, Sugar float, Protein float, Carb float);
> create table recipe(type str, quty int);
>
> .mode csv nutrition
> .import $nut nutrition
>
> .mode csv recipe
> .import $rec recipe
>
> select * from recipe, nutrition where recipe.type=nutrition.type;
> #+end_src
>
> #+RESULTS:
> | type     | quty | type     | Fiber | Sugar | Protein | Carb |
> |----------+------+----------+-------+-------+---------+------|
> | onion    |   70 | onion    |   1.3 |   4.4 |     1.3 |  9.0 |
> | tomatoe  |  120 | tomatoe  |   0.6 |   2.1 |     0.8 |  3.4 |
> | eggplant |  300 | eggplant |   2.5 |   3.2 |     0.8 |  8.6 |
> | tofu     |  100 | tofu     |   0.7 |   0.5 |     6.6 |  1.4 |
>
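
The join can be sketched the same way with Python's sqlite3 module.
This variant is an assumption-laden illustration, not the query above:
it uses an explicit "join ... on" and names the output columns so the
"type" column appears only once, and it orders by the recipe table's
rowid to keep the recipe order.

```python
import sqlite3

nutrition = [
    ("eggplant", 2.5, 3.2, 0.8, 8.6), ("tomatoe", 0.6, 2.1, 0.8, 3.4),
    ("onion", 1.3, 4.4, 1.3, 9.0), ("egg", 0, 18.3, 31.9, 18.3),
    ("rice", 0.2, 0, 1.5, 16.0), ("bread", 0.7, 0.7, 3.3, 16.0),
    ("orange", 3.1, 11.9, 1.3, 17.6), ("banana", 2.1, 9.9, 0.9, 18.5),
    ("tofu", 0.7, 0.5, 6.6, 1.4), ("nut", 2.6, 1.3, 4.9, 7.2),
    ("corn", 4.7, 1.8, 2.8, 21.3),
]
recipe = [("onion", 70), ("tomatoe", 120), ("eggplant", 300), ("tofu", 100)]

con = sqlite3.connect(":memory:")
con.execute(
    "create table nutrition("
    "type text, Fiber real, Sugar real, Protein real, Carb real)"
)
con.execute("create table recipe(type text, quty int)")
con.executemany("insert into nutrition values (?, ?, ?, ?, ?)", nutrition)
con.executemany("insert into recipe values (?, ?)", recipe)

# Inner join on the shared "type" column, keeping recipe order
joined = con.execute(
    """
    select r.type, r.quty, n.Fiber, n.Sugar, n.Protein, n.Carb
    from recipe as r
    join nutrition as n on r.type = n.type
    order by r.rowid
    """
).fetchall()
con.close()

for row in joined:
    print(row)
```
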
>
> John
>
> -----------------------------------
> Professor John Kitchin
> Doherty Hall A207F
> Department of Chemical Engineering
> Carnegie Mellon University
> Pittsburgh, PA 15213
> 412-268-7803
> @johnkitchin
> http://kitchingroup.cheme.cmu.edu
>
>
>
> On Sun, Feb 21, 2021 at 10:03 AM John Kitchin <jkitchin@andrew.cmu.edu>
> wrote:
>
>> Thanks Tim and Greg. I had mostly come to the same conclusions that it is
>> probably best to outsource this. I worked out some examples from
>> the orgtbljoin and orgaggregate packages with Pandas below, in case anyone
>> is interested in seeing how it works. A key point is using the ":colnames
>> no" header argument, which passes the header row through as the first row
>> of the data so Pandas can use it for column names. It seems like a pretty
>> good approach.
>>
>> * org-mode tables with Pandas
>> ** Aggregating from a table
>>
>> Examples from https://github.com/tbanel/orgaggregate
>>
>>
>> #+NAME: original
>> | Day       | Color | Level | Quantity |
>> |-----------+-------+-------+----------|
>> | Monday    | Red   |    30 |       11 |
>> | Monday    | Blue  |    25 |        3 |
>> | Tuesday   | Red   |    51 |       12 |
>> | Tuesday   | Red   |    45 |       15 |
>> | Tuesday   | Blue  |    33 |       18 |
>> | Wednesday | Red   |    27 |       23 |
>> | Wednesday | Blue  |    12 |       16 |
>> | Wednesday | Blue  |    15 |       15 |
>> | Thursday  | Red   |    39 |       24 |
>> | Thursday  | Red   |    41 |       29 |
>> | Thursday  | Red   |    49 |       30 |
>> | Friday    | Blue  |     7 |        5 |
>> | Friday    | Blue  |     6 |        8 |
>> | Friday    | Blue  |    11 |        9 |
>>
>>
>> #+BEGIN_SRC ipython :var data=original :colnames no
>> import pandas as pd
>>
>> pd.DataFrame(data[1:], columns=data[0]).groupby('Color').size()
>> #+END_SRC
>>
>> #+RESULTS:
>> :results:
>> # Out [1]:
>> # text/plain
>> : Color
>> : Blue    7
>> : Red     7
>> : dtype: int64
>> :end:
>>
>> The categorical stuff here is just to get the days sorted the same way as
>> the example. It is otherwise not needed. I feel there should be a more
>> clever way to do this, but I didn't think of one.
>>
>> #+BEGIN_SRC ipython :var data=original :colnames no
>> df = pd.DataFrame(data[1:], columns=data[0])
>> days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
>> 'Saturday', 'Sunday']
>> df['Day'] = pd.Categorical(df['Day'], categories=days, ordered=True)
>>
>> (df
>>  .groupby('Day')
>>  .agg({'Level': 'mean',
>>        'Quantity': 'sum'})
>>  .sort_values('Day'))
>> #+END_SRC
>>
>> #+RESULTS:
>> :results:
>> # Out [2]:
>> # text/plain
>> :            Level  Quantity
>> : Day
>> : Monday      27.5        14
>> : Tuesday     43.0        45
>> : Wednesday   18.0        54
>> : Thursday    43.0        83
>> : Friday       8.0        22
>> : Saturday     NaN         0
>> : Sunday       NaN         0
>>
>>
>> [[file:/var/folders/3q/ht_2mtk52hl7ydxrcr87z2gr0000gn/T/ob-ipython-htmlMnDA9a.html]]
>> :end:
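
One possible alternative to the categorical conversion, sketched here
under the assumption that the table rows are already in weekday order
(as they are in this example): passing sort=False to groupby keeps the
groups in order of first appearance. It also avoids the empty Saturday
and Sunday rows. The "data" list below stands in for what
":var data=original :colnames no" would supply.

```python
import pandas as pd

data = [
    ["Day", "Color", "Level", "Quantity"],
    ["Monday", "Red", 30, 11], ["Monday", "Blue", 25, 3],
    ["Tuesday", "Red", 51, 12], ["Tuesday", "Red", 45, 15],
    ["Tuesday", "Blue", 33, 18], ["Wednesday", "Red", 27, 23],
    ["Wednesday", "Blue", 12, 16], ["Wednesday", "Blue", 15, 15],
    ["Thursday", "Red", 39, 24], ["Thursday", "Red", 41, 29],
    ["Thursday", "Red", 49, 30], ["Friday", "Blue", 7, 5],
    ["Friday", "Blue", 6, 8], ["Friday", "Blue", 11, 9],
]

df = pd.DataFrame(data[1:], columns=data[0])

# sort=False keeps groups in the order they first appear in the table
summary = df.groupby("Day", sort=False).agg(
    {"Level": "mean", "Quantity": "sum"}
)
print(summary)
```

If the rows were not already in weekday order, the categorical
approach above would still be needed.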
>>
>> ** Joining tables
>>
>> Example from https://github.com/tbanel/orgtbljoin
>>
>> #+name: nutrition
>> | type     | Fiber | Sugar | Protein | Carb |
>> |----------+-------+-------+---------+------|
>> | eggplant |   2.5 |   3.2 |     0.8 |  8.6 |
>> | tomatoe  |   0.6 |   2.1 |     0.8 |  3.4 |
>> | onion    |   1.3 |   4.4 |     1.3 |  9.0 |
>> | egg      |     0 |  18.3 |    31.9 | 18.3 |
>> | rice     |   0.2 |     0 |     1.5 | 16.0 |
>> | bread    |   0.7 |   0.7 |     3.3 | 16.0 |
>> | orange   |   3.1 |  11.9 |     1.3 | 17.6 |
>> | banana   |   2.1 |   9.9 |     0.9 | 18.5 |
>> | tofu     |   0.7 |   0.5 |     6.6 |  1.4 |
>> | nut      |   2.6 |   1.3 |     4.9 |  7.2 |
>> | corn     |   4.7 |   1.8 |     2.8 | 21.3 |
>>
>>
>> #+name: recipe
>> | type     | quty |
>> |----------+------|
>> | onion    |   70 |
>> | tomatoe  |  120 |
>> | eggplant |  300 |
>> | tofu     |  100 |
>>
>>
>> #+BEGIN_SRC ipython :var nut=nutrition recipe=recipe :colnames no
>> nutrition = pd.DataFrame(nut[1:], columns=nut[0])
>> rec = pd.DataFrame(recipe[1:], columns=recipe[0])
>>
>> pd.merge(rec, nutrition, on='type')
>> #+END_SRC
>>
>> #+RESULTS:
>> :results:
>> # Out [4]:
>> # text/plain
>> :        type  quty  Fiber  Sugar  Protein  Carb
>> : 0     onion    70    1.3    4.4      1.3   9.0
>> : 1   tomatoe   120    0.6    2.1      0.8   3.4
>> : 2  eggplant   300    2.5    3.2      0.8   8.6
>> : 3      tofu   100    0.7    0.5      6.6   1.4
>> :end:
>>
>>
>> John
>>
>> On Sun, Feb 21, 2021 at 1:54 AM Tim Cross <theophilusx@gmail.com> wrote:
>>
>>>
>>> Greg Minshall <minshall@umich.edu> writes:
>>>
>>> > John,
>>> >
>>> >> Is there a state of the art in using org-tables as little databases
>>> >> with joins and stuff?
>>> >
>>> > i have to admit i do all that with an R code source block.  (the dplyr
>>> > package has the relevant joins, e.g. dplyr::inner_join().)  and, in R,
>>> > ":colnames yes" as a header argument gives you header lines on results.
>>> > (maybe that's ?now? for "all" languages?)
>>> >
>>>
>>> For really complex joins and ad hoc queries, I would do similar or put
>>> the data into sqlite. For simpler ones, I just define a table which
>>> uses table formulas to extract the values from the other tables. The
>>> downside is that the tables need to have the same data ordering, or the
>>> formulas need to be somewhat complex. Provided the tables have the same
>>> number of records in the same order, table formulas are usually fairly
>>> easy.
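
The ordering constraint goes away once rows are matched by key rather
than by position. A minimal pure-Python sketch of such a keyed lookup
follows; join_on is a hypothetical helper, and it assumes each table
is a list of rows with the header first and that key values are unique
in the right-hand table.

```python
def join_on(left, right, key):
    """Inner-join two header-first tables on the column named `key`."""
    header_l, *rows_l = left
    header_r, *rows_r = right
    kl, kr = header_l.index(key), header_r.index(key)

    # Index the right table by key so row order no longer matters
    lookup = {row[kr]: row for row in rows_r}

    out = [header_l + [c for i, c in enumerate(header_r) if i != kr]]
    for row in rows_l:
        match = lookup.get(row[kl])
        if match is not None:
            out.append(list(row) + [v for i, v in enumerate(match) if i != kr])
    return out


recipe = [["type", "quty"], ["onion", 70], ["tofu", 100]]
nutrition = [
    ["type", "Fiber", "Sugar"],
    ["tofu", 0.7, 0.5],
    ["egg", 0, 18.3],
    ["onion", 1.3, 4.4],
]

joined = join_on(recipe, nutrition, "type")
for row in joined:
    print(row)
```
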
>>>
>>> I did think about writing some elisp functions to use in my table
>>> formulas to make things easier, but then decided I was just re-inventing
>>> a well-defined database solution and figured when I need it, just use
>>> sqlite. However, it has been a while since I needed this level of
>>> complexity, so perhaps things have moved on and there are better ways
>>> now.
>>>
>>> --
>>> Tim Cross
>>>
>>>

Thread overview: 12+ messages
2021-02-20 21:15 state of the art in org-mode tables e.g. join, etc John Kitchin
2021-02-21  4:40 ` Greg Minshall
2021-02-21  6:45   ` Tim Cross
2021-02-21 15:03     ` John Kitchin
2021-02-21 16:23       ` John Kitchin
2021-02-22  6:52         ` Cook, Malcolm
2021-02-22  8:12           ` Greg Minshall
2021-02-22 15:21             ` Cook, Malcolm
2021-02-22 18:41               ` Greg Minshall
2021-02-25 14:50           ` John Kitchin
2021-02-22  8:27         ` Derek Feichtinger [this message]
2021-02-24 22:21           ` John Kitchin