From: Ihor Radchenko <yantar92@gmail.com>
To: Maxim Nikulin, emacs-orgmode@gnu.org
Subject: Re: Yet another browser extension for capturing notes - LinkRemark
References: <87v9cqx91l.fsf@localhost>
Date: Sat, 26 Dec 2020 21:49:41 +0800
Message-ID: <87sg7spthm.fsf@localhost>
Maxim Nikulin writes:

> I just inspected pages on several sites using developer tools and
> added code that handles the noticed elements.

I see. I basically did the same, apart from adding some minimal
support for OpenGraph (though I stopped when I saw that even YouTube
does not follow the standard, except for the most basic fields).

> The only force to add some formal data is "share" buttons. Maybe
> some guides for web developers from social networks or search
> engines could be more useful than formal references, but I have not
> had a closer look.

That is also consistent with what I saw. <meta> fields seem to be
very common.

>> Also, org-capture-ref does not really force the user to put BibTeX
>> into the capture. Individual metadata fields are available using
>> org-capture-ref-get-bibtex-field (which extracts data from an
>> internal alist structure). It's just that I mostly had BibTeX in
>> mind (with the distant goal of supporting export to LaTeX) for my
>> use-cases.
>
> I do not have a clear vision of how to use the collected data for
> queries. Certainly I want a more human-friendly representation than
> BibTeX entries (maybe in addition to machine-parsable data) adjacent
> to my notes.

So far, I have found author, website name, publication year, title,
and resource type useful. My standard capture template for links is:

* <author> [<website>] (<year>) Title

Example:

* dash-docs-el [Github] Dash-Docs-El Helm-Dash: Browse Dash Docsets Inside Emacs

Such headlines can easily be searched later, especially when I also
add some #keywords manually.

> Personally, I would prefer to avoid http queries from Emacs.
> Sometimes it is better to have the current DOM state, not the page
> source; that is why I decided to gather data inside the browser,
> despite security fences that are placed quite strangely in some
> cases.

I completely agree. That is why I directly reuse the current DOM
state from qutebrowser in my own setup. However, a qutebrowser
extension was easy for me to write, as it can simply be a bash
script. I know nothing about Firefox/Chrome extensions, and I do not
know JavaScript.

On the other hand, the ability to fetch HTML is still useful in my
case (an Emacs package) when the capture is not done from a browser.
For example, I often capture links from elfeed, and an HTTP query
from Emacs is useful then.
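To illustrate that Emacs-side fallback, here is a rough, untested
sketch (the function name is made up, error handling is omitted, and
it needs an Emacs built with libxml2) that fetches a page and pulls
out the title and one OpenGraph field using only the built-in url.el
and dom.el:

#+begin_src emacs-lisp
(require 'url)
(require 'dom)
(require 'seq)

(defun my/fetch-page-metadata (url)
  "Fetch URL and return an alist with basic page metadata.
A sketch using only built-in libraries; requires libxml2 support."
  (with-current-buffer (url-retrieve-synchronously url t t)
    (goto-char (point-min))
    ;; Skip the HTTP headers that url.el leaves in the buffer.
    (re-search-forward "\n\n" nil 'move)
    (let ((dom (libxml-parse-html-region (point) (point-max))))
      (list
       (cons 'title (dom-text (car (dom-by-tag dom 'title))))
       (cons 'og-title
             (dom-attr
              (seq-find (lambda (m)
                          (equal (dom-attr m 'property) "og:title"))
                        (dom-by-tag dom 'meta))
              'content))))))
#+end_src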
> From my point of view, you should be happy with any of the projects
> you mentioned below. Do all of them have some problems that are
> critical for you?

They are all JavaScript, except one (unicontent), which could easily
be replaced with built-in Elisp libraries (dom.el).

>> Finally, would you be interested to join efforts on metadata
>> parsing?
>
> Could you, please, share a bit more details on your ideas?
> Technically it should be possible to push e.g. raw
> document.head.innerHTML to any external metadata parser using native
> messaging (to deal with sites requiring authorization). However, it
> could raise an alarm during review before publication of the
> extension in the browser catalogues.

That's unfortunate. Pushing raw HTML/DOM is what I had in mind when
talking about joining efforts.

Another idea would be providing a callback from Elisp to the browser
(I am not sure whether that is possible). org-capture-ref has a
mechanism to check whether a link was captured in the past. If the
link is already captured, the information about the link location and
todo-state can be messaged back to the browser. Example message (only
qutebrowser is supported now):

Bookmark not saved!
Already captured into org-capture-ref:
TODO maxnikulin [Github] linkremark: LinkRemark - page or link notes with context

> There is some room for improvement, but I do not think that the
> quality of metadata for ordinary sites could be dramatically better.
> The case that is not handled at all is scientific publications;
> unfortunately, I currently have little interest in it. The results
> should definitely be stored in some structured format such as
> BibTeX. I have seen huge <script> elements describing even all the
> references. Certainly such lists are not for general-purpose notes
> (at least not without an explicit request from the user); they
> should be handled by some bibliography software to display citation
> graphs in the local library. On the other hand, it is not a problem
> to feed such data to some tool using the native messaging protocol.
> I have no idea whether various publishers provide such data in a
> uniform way; I just hope that pressure from citation indices and
> bibliography management software has a positive influence on
> standardization.

I think https://github.com/microlinkhq/metascraper#core-rules can be
used for ideas. It has generic parsing in addition to site-specific
rules.

For scientific publications, the key point is usually getting the
DOI/ISBN. Then most of the metadata can be obtained through the
standard API of doi.org or various ISBN databases. In addition,
reference data is generally available from OpenCitations.net (they
also have all kinds of web APIs).

Also, do you pass any of the parsed metadata to org-protocol? If you
do, it would be trivial to get it into capture templates on the Elisp
(and org-capture-ref) side.

> I am not going to blow up the code with recipes for particular
> sites. However, I realize that some special cases still should be
> handled. I am not ready to adopt the user script model used by
> Greasemonkey/Violentmonkey/Tampermonkey. I believe it is better to
> create dedicated extension(s) that either add and overwrite existing
> meta elements or allow querying the gathered data through the
> sendMessage WebExtensions interface. By the way, scripts for the
> above-mentioned extensions could be used as well. That should
> alleviate cases where some site with insane metadata is important
> for a particular user.

I see. This is another point where I thought collaboration could be
worthwhile. The parser rules just need to be written once (probably
in some common format, like JSON) and can then be reused.
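As a rough illustration (this is a made-up rule format, not an
existing one), such shared rules could be plain data mapping a field
to the <meta> element that provides it, applied by a tiny generic
engine on either the extension or the Elisp side:

#+begin_src emacs-lisp
(require 'dom)
(require 'seq)

;; Hypothetical shared rule format: FIELD -> (ATTRIBUTE . VALUE) of
;; the <meta> element to match; trivially serializable to JSON.
(defvar my/metadata-rules
  '((title  . (property . "og:title"))
    (site   . (property . "og:site_name"))
    (author . (name . "author"))))

(defun my/apply-metadata-rules (dom rules)
  "Return an alist of metadata extracted from DOM according to RULES."
  (mapcar (lambda (rule)
            (pcase-let ((`(,field ,attr . ,value) rule))
              (cons field
                    ;; First <meta> whose ATTRIBUTE matches VALUE wins.
                    (seq-some (lambda (m)
                                (and (equal (dom-attr m attr) value)
                                     (dom-attr m 'content)))
                              (dom-by-tag dom 'meta)))))
          rules))
#+end_src

Site-specific overrides would then be just more entries in the same
table rather than code.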
> For some reason, I did not even try to find existing projects for
> metadata extraction. Maybe I still hope that a quite simple
> implementation can handle most of the cases.

Actually, simple parsing does a fairly good job on most websites; it
is just not ideal. For example, I tweaked the title of captured
GitHub issues to include "issue#", which helps to distinguish such
pages from bookmarks of individual repositories. I believe that such
adjustments should be available to users, which is where the
org-capture-ref code started.

Best,
Ihor