From: Maxim Nikulin <manikulin@gmail.com>
To: emacs-orgmode@gnu.org
Subject: Re: Yet another browser extension for capturing notes - LinkRemark
Date: Sat, 26 Dec 2020 18:49:19 +0700
Message-ID: <rs7800$838$1@ciao.gmane.io>
In-Reply-To: <87v9cqx91l.fsf@localhost>
On 25/12/2020, Ihor Radchenko wrote:
>
> Reading through the code, I can see that you are familiar with metadata
> conventions. Do you know good references about what og: metadata is
> commonly used? I looked through the official OpenGraph specification,
> but popular websites appear to ignore most of the conventions.
I just inspected pages on several sites using developer tools and added
code that handles the elements I noticed.
I have not tried to find any resources on metadata (OK, I once searched
for LD+JSON; essentially the outcome was a link to schema.org, which I
had already seen in the data). Looking at page sources, I realized that
almost nobody cares whether a site has metadata of appropriate quality.
I think search engines are advanced enough to work without metadata and
may even lower the page rank if something suspicious was added for SEO.
The only force pushing sites to add some formal data is "share" buttons.
Maybe guides for web developers from social networks or search engines
would be more useful than formal references, but I have not had a closer
look.
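To illustrate the two kinds of elements mentioned above (og: meta
properties and LD+JSON scripts), here is a minimal sketch in Python that
uses only the standard library. It is not the code of the extension; the
names HeadMetadataParser and extract_metadata are made up for this
example, and real pages need far more tolerance for broken markup.

```python
import json
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect og:* meta properties and JSON-LD blocks from HTML text."""

    def __init__(self):
        super().__init__()
        self.og = {}        # e.g. {"og:title": "..."}
        self.json_ld = []   # parsed <script type="application/ld+json"> bodies
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            content = attrs.get("content")
            if prop.startswith("og:") and content is not None:
                # Keep the first value; sites sometimes repeat a property.
                self.og.setdefault(prop, content)
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed JSON-LD is skipped rather than reported.


def extract_metadata(html_text):
    """Return {"og": {...}, "json_ld": [...]} for the given HTML text."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {"og": parser.og, "json_ld": parser.json_ld}
```

Calling extract_metadata() on the text of a <head> element gives a
dictionary with the og: properties and a list of decoded LD+JSON
objects; anything else in the markup is ignored.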
> Also, org-capture-ref does not really force the user to put BiBTeX into
> the capture. Individual metadata fields are available using
> org-capture-ref-get-bibtex-field (which extracts data from internal
> alist structure). It's just that I mostly had BiBTeX in mind (with
> distant goal of supporting export to LaTeX) for my use-cases.
I do not have a clear vision of how to use the collected data for
queries. Certainly I want a more human-friendly representation than
BibTeX entries (maybe in addition to machine-parsable data) adjacent to
my notes.
Personally, I would prefer to avoid HTTP queries from Emacs. Sometimes
it is better to have the current DOM state rather than the page source,
which is why I decided to gather data inside the browser, despite
security fences that are placed quite strangely in some cases.
From my point of view, you should be happy with any of the projects you
mention below. Do all of them have problems that are critical for you?
Technically it should be possible to push e.g. raw
document.head.innerHTML to any external metadata parser using native
messaging (to deal with sites requiring authorization). However, it
could raise an alarm during the review that precedes publication of the
extension in the browser catalogues.
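For reference, the native messaging wire format is simple: each message
is a 32-bit length in native byte order followed by that many bytes of
UTF-8 encoded JSON. Below is a rough sketch of the host side in Python.
The function names and the "head_html" key used in the note after the
code are assumptions made for this example, not part of any existing
tool.

```python
import io
import json
import struct


def encode_frame(value):
    """Serialize one JSON value as a native-messaging frame (bytes)."""
    payload = json.dumps(value).encode("utf-8")
    # 32-bit unsigned length in native byte order, then the payload.
    return struct.pack("=I", len(payload)) + payload


def read_message(stream):
    """Read one frame from a binary stream.

    Returns the decoded JSON value, or None at end of input.
    """
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("=I", header)
    return json.loads(stream.read(length).decode("utf-8"))


def write_message(stream, value):
    """Write one JSON value to a binary stream as a single frame."""
    stream.write(encode_frame(value))
    stream.flush()


def decode_frame(data):
    """Decode a single frame held in memory (handy for testing)."""
    return read_message(io.BytesIO(data))
```

A real host would call read_message(sys.stdin.buffer) in a loop and
answer with write_message(sys.stdout.buffer, ...). The extension could
then send something like {"head_html": "..."} and leave the parsing to
the external tool.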
> Finally, would you be interested to join efforts on metadata parsing?
Could you please share a few more details on your ideas? There is some
room for improvement, but I do not think that the quality of metadata
for ordinary sites could be dramatically better. The case that is not
handled at all is scientific publications; unfortunately, I currently
have quite little interest in it. Results should definitely be stored in
some structured format such as BibTeX. I have seen huge <head> elements
that describe even all the references. Certainly such lists are not for
general-purpose notes (at least without an explicit request from the
user); they should be handled by some bibliography software to display
citation graphs in the local library. On the other hand, it is not a
problem to feed such data to some tool using the native messaging
protocol. I have no idea whether various publishers provide such data in
a uniform way; I just hope that pressure from citation indices and
bibliography management software has a positive influence on
standardization.
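As a small illustration of storing results in a structured format,
this is how an external tool might format already extracted fields as a
BibTeX entry. It is only a sketch: the function name, the citation key
and the field names are chosen for the example, and braces inside
values are not escaped.

```python
def to_bibtex(key, entry_type, fields):
    """Format a mapping of field names to values as one BibTeX entry.

    Empty values are skipped. Values are inserted verbatim, so this is
    only suitable for simple strings without unbalanced braces.
    """
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        if value:
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)
```

For example, to_bibtex("doe2020", "article", {"title": "An Example"})
gives an @article entry with a single title field.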
I am not going to bloat the code with recipes for particular sites.
However, I realize that some special cases should still be handled. I am
not ready to adopt the user script model used by
Greasemonkey/Violentmonkey/Tampermonkey. I believe it is better to
create dedicated extension(s) that either add meta elements and
overwrite existing ones, or allow querying the gathered data through the
WebExtensions sendMessage interface. By the way, scripts for the
above-mentioned extensions could be used as well. That should help in
cases where some site with insane metadata is important for a
particular user.
> P.S. Some links I collected myself when working on org-capture-ref. They
> might also be of interest for you:
>
> - https://github.com/ageitgey/node-unfluff
> - https://github.com/gabceb/node-metainspector
> - https://github.com/wikimedia/html-metadata
> - https://github.com/microlinkhq/metascraper
> - https://github.com/hboisgibault/unicontent
Thank you for the links. I should have a closer look at those projects.
E.g. I considered itemprop="author" elements but postponed the
implementation of such features. For some reason I did not even try to
find existing projects for metadata extraction. Maybe I still hope that
a quite simple implementation could handle most of the cases.