From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nicolas Goaziou
To: Utkarsh Singh
Cc: 47885@debbugs.gnu.org, emacs-orgmode@gnu.org
Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use
Date: Tue, 20 Apr 2021 15:40:12 +0200
Message-ID: <87im4h9irn.fsf@nicolasgoaziou.fr>
In-Reply-To: <87k0oyfj4y.fsf@gmail.com> (Utkarsh Singh's message of "Mon, 19 Apr 2021 19:53:25 +0530")
References: <87czuq9958.fsf@gmail.com> <8735vmelfs.fsf@nicolasgoaziou.fr> <87k0oyfj4y.fsf@gmail.com>
Mail-Followup-To: Utkarsh Singh, 47885@debbugs.gnu.org, emacs-orgmode@gnu.org
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
List-Id: "General discussions about Org-mode."

Hello,

Utkarsh Singh writes:

> At first I was also reluctant in creating a new function but decided to
> do so because:
>
> + org-table-convert-region is currently doing two thing 'guessing the
> separator' and 'converting the region'. I thought it was a good idea to
> separate out function into it's atomic operations.

I understand, but there is sometimes a (difficult) line to draw between
"separating concerns" and "function proliferation". Anyway, that's fine
here.

> + Current guessing technique is quite basic as it assumes that data
> (file that has to be imported) has no error/inconsistency in it. I
> would like to show you the doc string of Python's CSV library
> implementation to guess separator (region inside """):
>
> """
> Looks for text enclosed between two identical quotes
> (the probable quotechar) which are preceded and followed
> by the same character (the probable delimiter).
> For example:
> ,'some text',
> The quote with the most wins, same with the delimiter.
> If there is no quotechar the delimiter can't be determined
> this way.
> """
>
> And if this functions fails then we have:
>
> """
> The delimiter /should/ occur the same number of times on
> each row.
> However, due to malformed data, it may not. We don't want
> an all or nothing approach, so we allow for small variations in this
> number.
> 1) build a table of the frequency of each character on every line.
> 2) build a table of frequencies of this frequency (meta-frequency?),
> e.g. 'x occurred 5 times in 10 rows, 6 times in 1000 rows,
> 7 times in 2 rows'
> 3) use the mode of the meta-frequency to determine the /expected/
> frequency for that character
> 4) find out how often the character actually meets that goal
> 5) the character that best meets its goal is the delimiter
> For performance reasons, the data is evaluated in chunks, so it can
> try and evaluate the smallest portion of the data possible, evaluating
> additional chunks as necessary.
> """

For the problem we're trying to solve, this sounds like over-engineering
to me. Do we want so badly to guess a separator?

> I tried to do similar in Elisp but currently facing some issues due to
> my inexperience in functional programming. Also moving the 'guessing'
> part out the function may lead to development of even better algorithm
> than Python counterpart.
>
> Modified version of concerned function:
>
> (defun org-table-guess-separator (beg0 end0)
>   "Guess separator for `org-table-convert-region' for region BEG0 to END0.
>
> List of preferred separator:
> comma, TAB, semicolon, colon or SPACE.
>
> If region contains a line which doesn't contain the required
> separator then discard the separator and search again using next
> separator."
>   (let* ((beg (save-excursion
>                 (goto-char (min beg0 end0))
>                 (line-beginning-position)))
>          (end (save-excursion
>                 (goto-char (max beg0 end0))
>                 (line-end-position)))

Thinking again about it, this needs extra care, as end0 might end up on
an empty line. You tried to avoid this in your first function, but
I think this was not sufficient either. Actually, beg0 could also start
on an empty line.
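
To make the concern concrete, here is a small illustration. The buffer
contents are made up, `my/blank-region-demo' is a hypothetical name, and
the bounds are computed exactly as in the quoted function. On a region
made only of empty lines, no line can lack a comma, so the search for
such a line fails and the quoted loop would accept "," as separator:

```elisp
;; Illustration only: `my/blank-region-demo' is a hypothetical helper,
;; not part of Org.
(defun my/blank-region-demo ()
  "Return (BEG END MATCH) for a region made only of empty lines."
  (with-temp-buffer
    (insert "\n\n")
    (let* ((beg0 (point-min))
           (end0 (point-max))
           ;; Bounds computed as in the quoted function.
           (beg (save-excursion
                  (goto-char (min beg0 end0))
                  (line-beginning-position)))
           (end (save-excursion
                  (goto-char (max beg0 end0))
                  (line-end-position))))
      (goto-char beg)
      ;; Look for a non-empty line without a comma.  There is none, so
      ;; the search returns nil and "," would be accepted vacuously.
      (list beg end (re-search-forward "^[^\n,]+$" end t)))))

(my/blank-region-demo) ;; => (1 3 nil)
```
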
This needs to be tested extensively, but as a first approximation,
I think `beg' needs to be defined as:

  (save-excursion
    (goto-char (min beg0 end0))
    (skip-chars-forward " \t\n")
    (if (eobp) (point) (line-beginning-position)))

and `end' as

  (save-excursion
    (goto-char (max beg end0))
    (skip-chars-backward " \t\n" beg)
    (if (= beg (point)) (point) (line-end-position)))

Then you need to bail out if beg = end.

>         (sep-rexp '(("," "^[^\n,]+$")

sep-rexp -> sep-regexp

>                     ("\t" "^[^\n\t]+$")
>                     (";" "^[^\n;]+$")
>                     (":" "^[^\n:]+$")
>                     (" " "^\\([^'\"][^\n\s][^'\"]\\)+$")))

At this point, I suggest using the `rx' macro instead.

>         (tmp (car sep-rexp))
>         sep)
>     (save-excursion
>       (goto-char beg)
>       (while (and (not sep)
>                   (if (save-excursion
>                         (not (re-search-forward (nth 1 tmp) end t)))
>                       (setq sep (nth 0 tmp))
>                     (setq sep-rexp (cdr sep-rexp))
>                     (setq tmp (car sep-rexp)))))

I suggest this (yes, I like pattern-matching, `car' and `cdr' are so
80's) instead:

  (save-excursion
    (goto-char beg)
    (catch :found
      (pcase-dolist (`(,sep ,regexp) sep-regexp)
        (save-excursion
          (unless (re-search-forward regexp end t)
            (throw :found sep))))
      nil))

Again, all this needs to be extensively tested, as there are a lot of
dangers lurking around.

Regards,
-- 
Nicolas Goaziou
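
Putting the pieces above together could look roughly like the following.
It is only a sketch under stated assumptions: the `my/' names are
hypothetical, the SPACE entry is simplified and ignores quoting, and
`end' also takes BEG0 into account so that reversed bounds work, which
is an addition to the definition given in the message. It returns nil
for a blank region or when no candidate qualifies.

```elisp
(require 'rx)
(require 'pcase)

;; Hypothetical name.  Each regexp matches a non-empty line that does
;; NOT contain the corresponding separator.
(defconst my/org-table-separator-regexps
  `(("," ,(rx bol (one-or-more (not (any "\n,"))) eol))
    ("\t" ,(rx bol (one-or-more (not (any "\n\t"))) eol))
    (";" ,(rx bol (one-or-more (not (any "\n;"))) eol))
    (":" ,(rx bol (one-or-more (not (any "\n:"))) eol))
    ;; Simplified SPACE case (assumption): quoting is not considered.
    (" " ,(rx bol (one-or-more (not (any "\n "))) eol)))
  "Candidate separators, in order of preference.")

(defun my/org-table-guess-separator (beg0 end0)
  "Guess a field separator for the region between BEG0 and END0.
Return the first candidate present on every non-empty line, or nil
when the region is blank or no candidate qualifies."
  (let* ((beg (save-excursion
                (goto-char (min beg0 end0))
                (skip-chars-forward " \t\n")
                (if (eobp) (point) (line-beginning-position))))
         (end (save-excursion
                ;; BEG0 is included here as an addition, so that
                ;; reversed bounds still cover the whole region.
                (goto-char (max beg beg0 end0))
                (skip-chars-backward " \t\n" beg)
                (if (= beg (point)) (point) (line-end-position)))))
    ;; Bail out on a blank region.
    (unless (= beg end)
      (save-excursion
        (goto-char beg)
        (catch :found
          (pcase-dolist (`(,sep ,regexp) my/org-table-separator-regexps)
            (save-excursion
              ;; No line lacks SEP: accept it.
              (unless (re-search-forward regexp end t)
                (throw :found sep))))
          nil)))))

;; Usage on made-up data surrounded by empty lines:
(with-temp-buffer
  (insert "\na;b,c\nd;e\n\n")
  (my/org-table-guess-separator (point-min) (point-max)))
;; => ";"
```

The comma is rejected because the line "d;e" has none, TAB because
neither line has one, and the semicolon is the first candidate found on
both lines.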