From: Nicolas Goaziou <firstname.lastname@example.org>
To: Utkarsh Singh <email@example.com>
Cc: firstname.lastname@example.org, email@example.com
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Tue, 20 Apr 2021 15:40:12 +0200 [thread overview]
Message-ID: <firstname.lastname@example.org> (raw)
In-Reply-To: <email@example.com> (Utkarsh Singh's message of "Mon, 19 Apr 2021 19:53:25 +0530")
Utkarsh Singh <firstname.lastname@example.org> writes:
> At first I was also reluctant in creating a new function but decided to
> do so because:
> + org-table-convert-region is currently doing two thing 'guessing the
> separator' and 'converting the region'. I thought it was a good idea to
> separate out function into it's atomic operations.
I understand, but there is sometimes a (difficult) line to draw between
"separating concerns" and "function proliferation". Anyway, that's fine
> + Current guessing technique is quite basic as it assumes that data
> (file that has to be imported) has no error/inconsistency in it. I
> would like to show you the doc string of Python's CSV library
> implementation to guess separator (region inside """):
> Looks for text enclosed between two identical quotes
> (the probable quotechar) which are preceded and followed
> by the same character (the probable delimiter).
> For example:
> ,'some text',
> The quote with the most wins, same with the delimiter.
> If there is no quotechar the delimiter can't be determined
> this way.
> And if this functions fails then we have:
> The delimiter /should/ occur the same number of times on
> each row. However, due to malformed data, it may not. We don't want
> an all or nothing approach, so we allow for small variations in this
> 1) build a table of the frequency of each character on every line.
> 2) build a table of frequencies of this frequency (meta-frequency?),
> e.g. 'x occurred 5 times in 10 rows, 6 times in 1000 rows,
> 7 times in 2 rows'
> 3) use the mode of the meta-frequency to determine the /expected/
> frequency for that character
> 4) find out how often the character actually meets that goal
> 5) the character that best meets its goal is the delimiter
> For performance reasons, the data is evaluated in chunks, so it can
> try and evaluate the smallest portion of the data possible, evaluating
> additional chunks as necessary.
For the problem we're trying to solve, this sounds like over-engineering
to me. Do we want so badly to guess a separator?
> I tried to do similar in Elisp but currently facing some issues due to
> my inexperience in functional programming. Also moving the 'guessing'
> part out the function may lead to development of even better algorithm
> than Python counterpart.
> Modified version of concerned function:
> (defun org-table-guess-separator (beg0 end0)
> "Guess separator for `org-table-convert-region' for region BEG0 to END0.
> List of preferred separator:
> comma, TAB, semicolon, colon or SPACE.
> If region contains a line which doesn't contain the required
> separator then discard the separator and search again using next
> (let* ((beg (save-excursion
> (goto-char (min beg0 end0))
> (end (save-excursion
> (goto-char (max beg0 end0))
Thinking again about it, this needs extra care, as end0 might end up on
an empty line. You tried to avoid this in your first function, but
I think this was not sufficient either. Actually, beg0 could also start
on an empty line.
This needs to be tested extensively, but as a first approximation,
I think `beg' needs to be defined as:
(goto-char (min beg0 end0))
(skip-chars-forward " \t\n")
(if (eobp) (point) (line-beginning-position)))
and `end' as
(goto-char (max beg end0))
(skip-chars-backward " \t\n" beg)
(if (= beg (point)) (point) (line-end-position)))
Then you need to bail out if beg = end.
> (sep-rexp '(("," "^[^\n,]+$")
sep-rexp -> sep-regexp
> ("\t" "^[^\n\t]+$")
> (";" "^[^\n;]+$")
> (":" "^[^\n:]+$")
> (" " "^\\([^'\"][^\n\s][^'\"]\\)+$")))
At this point, I suggest to use `rx' macro instead.
> (tmp (car sep-rexp))
> (goto-char beg)
> (while (and (not sep)
> (if (save-excursion
> (not (re-search-forward (nth 1 tmp) end t)))
> (setq sep (nth 0 tmp))
> (setq sep-rexp (cdr sep-rexp))
> (setq tmp (car sep-rexp)))))
I suggest this (yes, I like pattern-matching, `car' and `cdr' are so
(pcase-dolist (`(,sep ,regexp) sep-regexp)
(unless (re-search-forward regexp end t)
(throw :found sep))))
Again all this needs to extensively tested, as there are a lot of
dangers lurking around.
next prev parent reply other threads:[~2021-04-20 13:41 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-19 4:43 [PATCH] org-table-import: Make it more smarter for interactive use Utkarsh Singh
2021-04-19 8:19 ` Nicolas Goaziou
2021-04-19 14:23 ` Utkarsh Singh
2021-04-20 13:40 ` Nicolas Goaziou [this message]
2021-04-20 17:15 ` Utkarsh Singh
2021-04-23 4:58 ` Utkarsh Singh
2021-04-27 20:21 ` bug#47885: " Nicolas Goaziou
2021-04-28 8:37 ` Utkarsh Singh
2021-04-28 16:38 ` Maxim Nikulin
2021-05-10 18:36 ` Utkarsh Singh
2021-05-12 17:08 ` Maxim Nikulin
2021-05-14 14:54 ` Utkarsh Singh
2021-05-15 9:13 ` Bastien
2021-05-15 10:10 ` Utkarsh Singh
2021-05-15 10:30 ` Bastien
2021-05-15 11:09 ` Utkarsh Singh
2021-05-17 5:29 ` Bastien
2021-05-17 16:27 ` Utkarsh Singh
2021-06-01 16:23 ` Maxim Nikulin
2021-06-01 17:46 ` Utkarsh Singh
2021-06-02 12:06 ` Maxim Nikulin
2021-06-02 15:08 ` Utkarsh Singh
2021-06-02 16:44 ` Maxim Nikulin
2021-06-04 4:04 ` Utkarsh Singh
2021-06-05 12:40 ` Maxim Nikulin
2021-06-05 17:50 ` Utkarsh Singh
2021-06-09 12:15 ` Maxim Nikulin
2021-09-26 8:40 ` Bastien
2021-05-16 16:24 ` Maxim Nikulin
2021-05-17 16:30 ` Utkarsh Singh
2021-05-18 10:24 ` Utkarsh Singh
2021-05-18 12:31 ` Maxim Nikulin
2021-05-18 15:05 ` Utkarsh Singh
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
List information: https://www.orgmode.org/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).