From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp2 ([2001:41d0:2:bcc0::]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) by ms0.migadu.com with LMTPS id QC6+FH2SfWD5TQAAgWs5BA (envelope-from ) for ; Mon, 19 Apr 2021 16:23:57 +0200 Received: from aspmx1.migadu.com ([2001:41d0:2:bcc0::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp2 with LMTPS id gKFLEH2SfWAwQAAAB5/wlQ (envelope-from ) for ; Mon, 19 Apr 2021 14:23:57 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 6C57B8ACC for ; Mon, 19 Apr 2021 16:23:56 +0200 (CEST) Received: from localhost ([::1]:57068 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1lYUod-0006v9-18 for larch@yhetil.org; Mon, 19 Apr 2021 10:23:55 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:35872) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1lYUoB-0006us-HU for emacs-orgmode@gnu.org; Mon, 19 Apr 2021 10:23:27 -0400 Received: from mail-pf1-x42e.google.com ([2607:f8b0:4864:20::42e]:46820) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1lYUo9-00052n-Jm for emacs-orgmode@gnu.org; Mon, 19 Apr 2021 10:23:27 -0400 Received: by mail-pf1-x42e.google.com with SMTP id d124so23250682pfa.13 for ; Mon, 19 Apr 2021 07:23:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:subject:references:cc:date:in-reply-to:message-id :user-agent:mime-version; bh=13IPtaGmOOwaxYsyBhyaECe/uc6zsM4sOyXxKnHgWtc=; b=IxswMqqIZMLZ4uE5f/ycRpJsi8wS2MxLhL9LEEHVtkl4elLep2TVIyPHKPfhYizAA3 yjdBCfszxXE2JGkksxHBWhvSFdq4mc9X5PvtJqZIP37hxrFsi8zwqjbrJXCkyvphKFwo u3lb1WU7ZhGLpd7EYe9/fWPcLqMfELgB82X7hLSoPVbe679q46YKw3hF16uv6jXrQBmG mCMOJ5zGlo3YrSozN0pAH3eSvbi5JpcYszDqXkuFWif6C+TS6W8W89uKKzWK35hJW+53 tgmpr+5lm1jQsDaKtKf4kn3+UTTtirH6IjNyoQJY6zO4ituIKKRF7fcpltH9DRprU7gR 2D2Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:subject:references:cc:date:in-reply-to :message-id:user-agent:mime-version; bh=13IPtaGmOOwaxYsyBhyaECe/uc6zsM4sOyXxKnHgWtc=; b=t1t8ld//hDN2PPs7Evr4Jz4v3RNx2OJJbdKjddEAPCjEbFnCol4r9T0kKwAPv5I7QH ouhDdkW8ZQsuUwkDQ3cCUIxaKjYuyrW4WRSuGjC4GEu0qx7Nb9iqLXGixPPU2qZHdcuH L4lOLaDFArkji3jrWSVILzsj4xJ+RVEmeOhPGBAv6IP6S6+TzwJSIQYGExF3Ir+CkU2G /CPZFwpI/rB7U/3vZXn6V4CAUEh0lEyzqJriA67/d9ekB81Ik/QJxUrZoKjEG7FCHbRT JACa6/UYyRdhKmN+/cipu8e3bsQeEbA4UvNzQMT9cmzmUNbgbpMOnwMzmd2Ye7e+8oJS CmNw== X-Gm-Message-State: AOAM5319yNkCts8XlhMiF78AA3JUh54mQDd4PgS/uhukjrupAZNafpdO omT49TuGGjwYtTjOKtL8WfjvEBxwEng= X-Google-Smtp-Source: ABdhPJyQBhJX/SrGLyTgwDPXMwEkzgMBXOZrqQFzRxpPqNGH6iclHZGN664s/1F9gcnSEmkeNcBekw== X-Received: by 2002:a63:f443:: with SMTP id p3mr12122015pgk.378.1618842203762; Mon, 19 Apr 2021 07:23:23 -0700 (PDT) Received: from localhost ([45.251.50.123]) by smtp.gmail.com with ESMTPSA id i66sm10538396pgd.8.2021.04.19.07.23.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 19 Apr 2021 07:23:23 -0700 (PDT) From: Utkarsh Singh To: Nicolas Goaziou Subject: Re: [PATCH] org-table-import: Make it more smarter for interactive use References: <87czuq9958.fsf@gmail.com> <8735vmelfs.fsf@nicolasgoaziou.fr> Date: Mon, 19 Apr 2021 19:53:25 +0530 In-Reply-To: <8735vmelfs.fsf@nicolasgoaziou.fr> (Nicolas Goaziou's message of "Mon, 19 Apr 2021 10:19:03 +0200") Message-ID: <87k0oyfj4y.fsf@gmail.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Received-SPF: pass client-ip=2607:f8b0:4864:20::42e; envelope-from=utkarsh190601@gmail.com; helo=mail-pf1-x42e.google.com X-Spam_score_int: 1 X-Spam_score: 0.1 X-Spam_bar: / X-Spam_report: (0.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, FREEMAIL_FROM=0.001, PDS_OTHER_BAD_TLD=1.999, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-orgmode@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "General discussions about Org-mode." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: 47885@debbugs.gnu.org, emacs-orgmode@gnu.org Errors-To: emacs-orgmode-bounces+larch=yhetil.org@gnu.org Sender: "Emacs-orgmode" X-Migadu-Flow: FLOW_IN ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1618842236; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:in-reply-to:in-reply-to: references:references:list-id:list-help:list-unsubscribe: list-subscribe:list-post:dkim-signature; bh=13IPtaGmOOwaxYsyBhyaECe/uc6zsM4sOyXxKnHgWtc=; b=pI9oWsjghYrJrL8pRZbYFqkI1OnqM47l+ovGA7Q1WY8l64y9wUOcfbHuGuiSgERVyx4pzy yKSVexmiv39UI8pt/D6kL9Q+KJrniw8Uv7nxx6KRTAlKrElc3EWq3uUpTfmHHw/7LvHmVd g4Yzsas31g2kseNeiKEYwIOEwRPdgmZqVL42L/I1OUPwvyuZpQc3h0PKsAZXbhnlM0wUqk 67wrc+3n4tmFUq7XP1UlzMIgQULnioo41xm7dOuJtcEss4F6CXLR8ZMwXW25FHMbwZYPZZ F01xphDQpE5GfwP9YhB6zVrmEvV0PRBVowEVhBnLabvqNcwPgzg6Rw8DMP8Kjg== ARC-Seal: i=1; s=key1; d=yhetil.org; t=1618842236; a=rsa-sha256; cv=none; b=DPGfCA6IdM3kqBGRlUANd0UmuR7EmlWAJx39NvqL2Ff8t/mdBk+rWItqzwcX9hqv2k1w5r Mw6Dyo8S3qufyQeLW+/2S3B4BLbiGgcP1Q9aiWbfjleJ2RQI9ezHYwuAxLu20UN++/1pco d+SKgxRWoeu8hBkGIrUZhn7dh8iGQZFHeAsnhb0Dw1KSI31z/stzAsup3nfn0KYwDbVfdq 6vzspNTNBmrC51H5C2igFMDim6tCQwWBJY7JqvfbTCC6AH7yjKDUH4Zj2mD+kdBuMEEz4h Ge/093k6MqeeqH9DcuoKGM5iPiwqJE1Ysbso3BtWJS85wFF/apEypi7IlqwEkg== ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=pass header.d=gmail.com header.s=20161025 header.b=IxswMqqI; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (aspmx1.migadu.com: domain of emacs-orgmode-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=emacs-orgmode-bounces@gnu.org X-Migadu-Spam-Score: -3.76 Authentication-Results: aspmx1.migadu.com; dkim=pass header.d=gmail.com header.s=20161025 header.b=IxswMqqI; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (aspmx1.migadu.com: domain of emacs-orgmode-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=emacs-orgmode-bounces@gnu.org X-Migadu-Queue-Id: 6C57B8ACC X-Spam-Score: -3.76 X-Migadu-Scanner: scn0.migadu.com X-TUID: cCN2KSz4JsZi On 2021-04-19, 10:19 +0200, Nicolas Goaziou wrote: >> My previous patch proposed to add support for importing file with >> arbitrary name and building upon that this patch tries to make use of it >> by making org-table-import smarter by simply adding more separators >> (delimiters). > > Good idea, thank you. Some comments follow. > >> +(defun org-table-guess-separator (beg0 end0) >> + "Guess separator for `org-table-convert-region' for region BEG0 to END0. >> + >> +List of preferred separator: >> +comma, TAB, ';', ':' or SPACE > > I suggest to use full names everywhere: comma, TAB, semicolon, colon, or > SPACE. > >> +If region contains a line which doesn't contain the required >> +separator then discard the separator and search again using next >> +separator." >> + (let ((beg (save-excursion >> + (goto-char (min beg0 end0)) >> + (beginning-of-line 1) >> + (point))) > > (beginning-of-line 1) + (point) -> (line-beginning-position) > > since you don't intent to move point. > >> + (end (save-excursion >> + (goto-char (max beg0 end0)) >> + (end-of-line 1) >> + (if (bolp) (backward-char 1) (end-of-line 1)) > > I'm not sure about what you mean above. First, the second call to > end-of-line is useless, since you're already at the end of the line. > Second, what is wrong if point is at an empty line? Why do you want to > move it back? > >> + (point)))) > > You may want to use `line-end-position'. > >> + (save-excursion >> + (goto-char beg) >> + (cond >> + ((not (re-search-forward "^[^\n,]+$" end t)) '(4)) >> + ((not (re-search-forward "^[^\n\t]+$" end t)) '(16)) >> + ((not (re-search-forward "^[^\n;]+$" end t)) ";") >> + ((not (re-search-forward "^[^\n:]+$" end t)) ":") >> + ((not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t)) " ") >> + (t nil))))) > > I think you need to wrap `save-excursion' around each > `re-search-forward' call. Otherwise each test starts at the first line > containing the separator previously tested. >> + >> ;;;###autoload >> (defun org-table-convert-region (beg0 end0 &optional separator) >> "Convert region to a table. >> @@ -862,10 +891,7 @@ org-table-convert-region >> integer When a number, use that many spaces, or a TAB, as field separator >> regexp When a regular expression, use it to match the separator >> nil When nil, the command tries to be smart and figure out the >> - separator in the following way: >> - - when each line contains a TAB, assume TAB-separated material >> - - when each line contains a comma, assume CSV material >> - - else, assume one or more SPACE characters as separator." >> + separator using `org-table-guess-seperator'." Thanks for reviewing the patch! > I wonder if creating a new function is warranted here. You could add the > news checks after those already present in the code, couldn't you? At first I was also reluctant in creating a new function but decided to do so because: + org-table-convert-region is currently doing two thing 'guessing the separator' and 'converting the region'. I thought it was a good idea to separate out function into it's atomic operations. + Current guessing technique is quite basic as it assumes that data (file that has to be imported) has no error/inconsistency in it. I would like to show you the doc string of Python's CSV library implementation to guess separator (region inside """): """ Looks for text enclosed between two identical quotes (the probable quotechar) which are preceded and followed by the same character (the probable delimiter). For example: ,'some text', The quote with the most wins, same with the delimiter. If there is no quotechar the delimiter can't be determined this way. """ And if this functions fails then we have: """ The delimiter /should/ occur the same number of times on each row. However, due to malformed data, it may not. We don't want an all or nothing approach, so we allow for small variations in this number. 1) build a table of the frequency of each character on every line. 2) build a table of frequencies of this frequency (meta-frequency?), e.g. 'x occurred 5 times in 10 rows, 6 times in 1000 rows, 7 times in 2 rows' 3) use the mode of the meta-frequency to determine the /expected/ frequency for that character 4) find out how often the character actually meets that goal 5) the character that best meets its goal is the delimiter For performance reasons, the data is evaluated in chunks, so it can try and evaluate the smallest portion of the data possible, evaluating additional chunks as necessary. """ I tried to do similar in Elisp but currently facing some issues due to my inexperience in functional programming. Also moving the 'guessing' part out the function may lead to development of even better algorithm than Python counterpart. Modified version of concerned function: (defun org-table-guess-separator (beg0 end0) "Guess separator for `org-table-convert-region' for region BEG0 to END0. List of preferred separator: comma, TAB, semicolon, colon or SPACE. If region contains a line which doesn't contain the required separator then discard the separator and search again using next separator." (let* ((beg (save-excursion (goto-char (min beg0 end0)) (line-beginning-position))) (end (save-excursion (goto-char (max beg0 end0)) (line-end-position))) (sep-rexp '(("," "^[^\n,]+$") ("\t" "^[^\n\t]+$") (";" "^[^\n;]+$") (":" "^[^\n:]+$") (" " "^\\([^'\"][^\n\s][^'\"]\\)+$"))) (tmp (car sep-rexp)) sep) (save-excursion (goto-char beg) (while (and (not sep) (if (save-excursion (not (re-search-forward (nth 1 tmp) end t))) (setq sep (nth 0 tmp)) (setq sep-rexp (cdr sep-rexp)) (setq tmp (car sep-rexp))))) sep))) Version without using iteration: (defun org-table-guess-separator (beg0 end0) "Guess separator for `org-table-convert-region' for region BEG0 to END0. List of preferred separator: COMMA, TAB, SEMICOLON, COLON or SPACE. If region contains a line which doesn't contain the required separator then discard the separator and search again using next separator." (let ((beg (save-excursion (goto-char (min beg0 end0)) (line-beginning-position))) (end (save-excursion (goto-char (max beg0 end0)) (line-end-position)))) (save-excursion (goto-char beg) (cond ((save-excursion (not (re-search-forward "^[^\n,]+$" end t))) ",") ((save-excursion (not (re-search-forward "^[^\n\t]+$" end t))) "\t") ((save-excursion (not (re-search-forward "^[^\n;]+$" end t))) ";") ((save-excursion (not (re-search-forward "^[^\n:]+$" end t))) ":") ((save-excursion (not (re-search-forward "^\\([^'\"][^\n\s][^'\"]\\)+$" end t))) " ") (t nil))))) -- Utkarsh Singh http://utkarshsingh.xyz