From mboxrd@z Thu Jan  1 00:00:00 1970
To: emacs-orgmode@gnu.org
From: Maxim Nikulin <manikulin@gmail.com>
Subject: Re: Yet another browser extension for capturing notes - LinkRemark
Date: Sat, 26 Dec 2020 18:49:19 +0700
Message-ID: <rs7800$838$1@ciao.gmane.io>
References: <rs4mrd$nip$1@ciao.gmane.io> <87v9cqx91l.fsf@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.10.0
In-Reply-To: <87v9cqx91l.fsf@localhost>
Content-Language: en-US
X-BeenThere: emacs-orgmode@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "General discussions about Org-mode." <emacs-orgmode.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-orgmode>,
 <mailto:emacs-orgmode-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-orgmode>
List-Post: <mailto:emacs-orgmode@gnu.org>
List-Help: <mailto:emacs-orgmode-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-orgmode>,
 <mailto:emacs-orgmode-request@gnu.org?subject=subscribe>
Errors-To: emacs-orgmode-bounces+larch=yhetil.org@gnu.org
Sender: "Emacs-orgmode" <emacs-orgmode-bounces+larch=yhetil.org@gnu.org>

On 25/12/2020, Ihor Radchenko wrote:
> 
> Reading through the code, I can see that you are familiar with metadata
> conventions. Do you know good references about what og: metadata is
> commonly used? I looked through the official OpenGraph specification,
> but popular websites appear to ignore most of the conventions.

I just inspected pages on several sites with the browser developer tools
and added code that handles the elements I noticed.

I have not tried to find any resources on metadata (OK, I once searched
for LD+JSON, and essentially the only outcome was a link to schema.org,
which I had already seen in the data). Looking at page sources, I
realized that almost nobody cares whether a site has metadata of
appropriate quality. I think search engines are advanced enough to work
without metadata and may even lower the page rank if SEO added something
suspicious. The only force that makes sites add formal data is "share"
buttons. Maybe guides for web developers from social networks or search
engines would be more useful than formal references, but I have not had
a closer look.
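
To make the two kinds of metadata mentioned above concrete, here is a
minimal sketch in Python (not the extension's actual code, which runs
inside the browser). It assumes the page head is available as a string
and collects og: properties and application/ld+json blocks; the class
and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <meta property="og:..."> values and JSON-LD script bodies."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}   # e.g. {"og:title": "..."}
        self.json_ld = []      # parsed application/ld+json objects
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content"):
                # Keep the first value; pages sometimes repeat a property.
                self.open_graph.setdefault(prop, attrs["content"])
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed JSON-LD is skipped rather than reported.


def extract_metadata(html):
    """Return og: properties and JSON-LD objects found in an HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return {"open_graph": parser.open_graph, "json_ld": parser.json_ld}
```

Malformed JSON-LD is dropped silently here because, as noted above,
metadata quality on real sites cannot be relied upon.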

> Also, org-capture-ref does not really force the user to put BiBTeX into
> the capture. Individual metadata fields are available using
> org-capture-ref-get-bibtex-field (which extracts data from internal
> alist structure). It's just that I mostly had BiBTeX in mind (with
> distant goal of supporting export to LaTeX) for my use-cases.

I do not have a clear vision of how to use the collected data for
queries. Certainly I want a more human-friendly representation than
BibTeX entries (maybe in addition to machine-parsable data) adjacent to
my notes.

Personally, I would prefer to avoid HTTP queries from Emacs. Sometimes
it is better to have the current DOM state rather than the page source,
which is why I decided to gather data inside the browser, despite
security fences that are placed quite strangely in some cases.

From my point of view, you should be happy with any of the projects you
mentioned below. Do all of them have problems that are critical for you?

Technically it should be possible to push e.g. raw
document.head.innerHTML to any external metadata parser using native
messaging (to deal with sites requiring authorization). However, it could
raise an alarm during the review that precedes publication of the
extension in the browser catalogues.
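
For illustration, the receiving side of such a native messaging channel
could look roughly like the Python sketch below. It relies only on the
documented framing (a 32-bit length in native byte order followed by
UTF-8 encoded JSON); the function names and the message fields "head"
and "url" are assumptions made for this example, not part of any
existing extension.

```python
import io
import json
import struct


def encode_message(payload):
    """Frame one message: 4-byte native-endian length, then UTF-8 JSON."""
    body = json.dumps(payload).encode("utf-8")
    return struct.pack("=I", len(body)) + body


def read_message(stream):
    """Read one framed message from a binary stream; None on clean EOF."""
    header = stream.read(4)
    if not header:
        return None
    if len(header) < 4:
        raise EOFError("truncated length prefix")
    (length,) = struct.unpack("=I", header)
    body = stream.read(length)
    if len(body) < length:
        raise EOFError("truncated message body")
    return json.loads(body.decode("utf-8"))


# Round trip through an in-memory stream; a real host would read from
# sys.stdin.buffer and write replies to sys.stdout.buffer instead.
# The "head" and "url" fields are hypothetical.
demo = io.BytesIO(encode_message({"url": "https://example.org/",
                                  "head": "<title>Example</title>"}))
received = read_message(demo)
```

An external parser would then receive the head markup in
received["head"] and could process it without making its own HTTP
request.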

> Finally, would you be interested to join efforts on metadata parsing?

Could you please share a few more details on your ideas? There is some
room for improvement, but I do not think that the quality of metadata
for ordinary sites could be dramatically better. The case that is not
handled at all is scientific publications; unfortunately, I currently
have rather little interest in them. Definitely the results should be
stored in some structured format such as BibTeX. I have seen huge <head>
elements that describe even all the references. Certainly such lists are
not for general-purpose notes (at least without an explicit request from
the user); they should be handled by bibliography software that displays
citation graphs in the local library. On the other hand, it is not a
problem to feed such data to some tool using the native messaging
protocol. I have no idea whether various publishers provide such data in
a uniform way; I just hope that pressure from citation indices and
bibliography management software has a positive influence on
standardization.
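
As a rough idea of what "stored in a structured format such as BibTeX"
could mean for an external tool, here is a small Python sketch. The
field names, the citation key and the function name are hypothetical,
and this is not the format produced by org-capture-ref or by the
extension. Values are written as is, without escaping braces or other
special characters.

```python
def to_bibtex(key, fields, entry_type="misc"):
    """Format a dict of collected fields as one BibTeX entry.

    List values (e.g. several authors) are joined with " and ", the
    separator BibTeX uses for name lists. Empty values are skipped.
    """
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields.items():
        if isinstance(value, (list, tuple)):
            value = " and ".join(value)
        if value:
            lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical fields, as they might arrive from the browser.
entry = to_bibtex(
    "example2020",
    {"title": "An Example", "author": ["Jane Doe", "John Roe"],
     "year": "2020"},
    entry_type="article",
)
```

A real implementation would also need to escape special characters and
pick the entry type from the page metadata.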

I am not going to blow up the code with recipes for particular sites.
However, I realize that some special cases should still be handled. I am
not ready to adopt the user script model used by
Greasemonkey/Violentmonkey/Tampermonkey. I believe it is better to
create dedicated extension(s) that either add and overwrite existing
meta elements or allow querying the gathered data through the
WebExtensions sendMessage interface. By the way, scripts for the
above-mentioned extensions could be used as well. That should help in
cases where a site with insane metadata is important for a particular
user.

> P.S. Some links I collected myself when working on org-capture-ref. They
> might also be of interest for you:
> 
> - https://github.com/ageitgey/node-unfluff
> - https://github.com/gabceb/node-metainspector
> - https://github.com/wikimedia/html-metadata
> - https://github.com/microlinkhq/metascraper
> - https://github.com/hboisgibault/unicontent

Thank you for the links. I should have a closer look at those projects.
E.g. I considered itemprop="author" elements but postponed
implementation of such features. For some reason I did not even try to
find existing projects for metadata extraction. Maybe I still hope that
a quite simple implementation could handle most of the cases.
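
To show how simple such a handler for itemprop="author" might be, here
is a hedged Python sketch. It is only an illustration under the
assumption that the markup is available as a string; the class name is
hypothetical. It takes either the content attribute or the text inside
the element, and ignores the rest of the microdata model.

```python
from html.parser import HTMLParser


class AuthorParser(HTMLParser):
    """Collect author names from elements marked itemprop="author"."""

    # Elements that never have a closing tag, so they must not be counted
    # when tracking nesting depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.authors = []
        self._depth = 0    # > 0 while inside an author element
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag not in self.VOID:
                self._depth += 1
            return
        if "author" in (attrs.get("itemprop") or "").split():
            if attrs.get("content"):
                self.authors.append(attrs["content"].strip())
            elif tag not in self.VOID:
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if self._depth and tag not in self.VOID:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace left over from nested markup.
                name = " ".join("".join(self._text).split())
                if name:
                    self.authors.append(name)

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def extract_authors(html):
    """Return author names found in an HTML string, in document order."""
    parser = AuthorParser()
    parser.feed(html)
    parser.close()
    return parser.authors
```

Nested markup such as an author element wrapping an itemprop="name"
child is flattened to its text, which is crude but covers the common
patterns.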