Hi,

[Cc-ing Theo von der Malsburg]

Now that Org is getting support for Citeproc, it could be useful to add support for the CSL-JSON format for bibliographic data to Emacs. Therefore, after a friendly request from Denis Maier, I have added support for this format to the =parsebib= library.

Since =parsebib= is used by =bibtex-completion=, which in turn is used by =bibtex-actions=, =helm-bibtex=, =ivy-bibtex=, =org-ref= and =org-roam-bibtex=, this is a first step in making bibliographic data in =.json= format directly available to Org users, without the need for any BibTeX conversion.

[Boy, look at me doing the marketing speak! :D ]

Anyway, this really is just the first step. =bibtex-completion= will need to be modified in order to make use of the new functionality, and the same may be true of the packages based on it.

At this point, the new code isn't merged into =master= yet. It is available in the =wip/csl= branch of =parsebib='s GitHub repo:

https://github.com/joostkremers/parsebib/tree/wip/csl

The README has most of the details. I appreciate any and all comments, suggestions and tips.

For those maintaining packages based on =parsebib=, I have at least one question: currently, =parsebib= returns a BibTeX entry in the form of an alist of =(<field> . <value>)= pairs, where both =<field>= and =<value>= are strings. A CSL-JSON entry is also returned as an alist, but the =<field>= names are symbols, not strings.

It would be extremely impractical to return the JSON data with strings as field names, because the JSON parsing libraries in Emacs return symbols, so converting them would take time. Plus, those libraries also expect symbols when serialising Elisp data to JSON. (Which I intend to make use of in Ebib later on.)

It would be easier to modify the BibTeX output to return field names as symbols. I originally chose strings because that's what =bibtex.el= uses, making it a little easier to integrate with it.
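To make the difference concrete, here is a made-up entry in both shapes (the field values are invented, and the CSL-JSON structure is only sketched roughly):

#+begin_src emacs-lisp
;; A BibTeX entry as parsebib currently returns it: an alist with
;; string field names and string values.
'(("author" . "Smith, Jane")
  ("title"  . "A Study")
  ("year"   . "2020"))

;; The corresponding CSL-JSON entry: field names are symbols, as
;; delivered by Emacs's JSON parsers; names and dates are nested.
'((author . [((family . "Smith") (given . "Jane"))])
  (title  . "A Study")
  (issued . ((date-parts . [[2020]]))))
#+end_src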
So the question: would it be helpful to make this change to the BibTeX data, so that the data from both sources uses the same format? Or would it be better to keep things as they are, even if that means that BibTeX data and JSON data aren't compatible?

TIA

Joost

--
Joost Kremers
Life has its moments
On Fri, May 7, 2021 at 7:30 AM Joost Kremers <joostkremers@fastmail.fm> wrote:

> Now that Org is getting support for Citeproc, it could be useful to add support
> for the CSL-JSON format for bibliographic data to Emacs. Therefore, after a
> friendly request from Denis Maier, I have added support for this format to the
> =parsebib= library.

Nice!

...

> So the question: would it be helpful to make this change to the BibTeX data, so
> that the data from both sources uses the same format?

Just as a general point, this. From my perspective as =bibtex-actions= developer, it's not a problem, given that I don't have a lot of code that accesses that data directly. And I'd rather be able to support both import formats without hassle.

Titus may have other views, of course, given how much =bibtex-completion= works directly with that data.

Bruce
Hi all,
I’m the maintainer of bibtex-completion, helm-bibtex, and ivy-bibtex. My name is actually Titus, not Theo ;)
Cool to see that the ecosystem around academic writing in org mode is developing so nicely. I use org mode for this purpose every single working day and it’s amazing already. I have to confess, though, that I haven’t been keeping up with recent developments. I just saw the recent thread about the citation syntax. (Thanks to Bruce D’Arcus for pointing me to it.) Is there a good place where I can read up on the current efforts and plans regarding citations, bibliographies and so on (I mean other than reading the last couple of months of the mailing list archive)?
Regarding the symbols vs. strings issue: I don’t have a strong opinion, but I personally tend to favor a conservative solution that avoids breaking changes. First, it’s difficult to predict how switching to symbols would affect other software, including custom code written by users. Second, JSON key names can contain spaces and other weird stuff, so strings are perhaps a more natural choice anyway. (It appears that you can actually configure the JSON parser to use strings instead of symbols; see the variable `json-key-type`.) Third, as you say, it would also be nice to maintain compatibility with bibtex.el. Finally, it’s not clear that avoiding the conversion to strings saves enough CPU cycles to justify the effort. (But this may be a non-issue anyway, if the JSON parser can return strings directly.)
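For illustration, a quick sketch of the =json.el= behavior in question (the key and value below are made up):

#+begin_src emacs-lisp
(require 'json)

;; With the default settings, `json.el' returns an alist whose keys
;; are symbols:
(json-read-from-string "{\"title\": \"A Study\"}")
;; => ((title . "A Study"))

;; Binding `json-key-type' to `string' keeps the keys as strings:
(let ((json-key-type 'string))
  (json-read-from-string "{\"title\": \"A Study\"}"))
;; => (("title" . "A Study"))
#+end_src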
Having said that, I’d be happy to merge a PR that implements the switch to symbols in bibtex-completion if that’s the consensus. It would touch a substantial number of lines, but should nonetheless be relatively straightforward.
Regarding support for CSL-JSON: bibtex-completion is currently very BibTeX-oriented and uses fairly low-level parsing functions from parsebib. We could add similar support for CSL-JSON but things would become messy. (It’s already a bit ugly, I have to say, which is entirely my fault.) It might be more elegant to have a higher-level API in parsebib. This API could perhaps even abstract away from the underlying format (BibTeX, CSL-JSON, or others in the future?). This would substantially simplify matters in bibtex-completion, but would also enable many other cool uses of parsebib.
Some rough ideas for such an API (just for illustration):
- A function that returns all entries in a .bib or CSL-JSON file.
- A function that returns an entry with a specific key (or multiple entries).
- Functions for resolving strings and cross-references.
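To make the idea concrete, such an API might look roughly like this (all names and signatures are hypothetical, purely for illustration; none of this exists in parsebib at this point):

#+begin_src emacs-lisp
;; Return a hash table mapping entry keys to entry alists, regardless
;; of whether FILE is a .bib or a CSL-JSON file (format detected from
;; the extension, say).
(defun parsebib-read-file (file) ...)

;; Return the single entry with KEY from FILE, or nil if absent.
(defun parsebib-read-entry (key file) ...)

;; Return ENTRY with @string abbreviations expanded and fields
;; inherited from its cross-referenced parent entry filled in.
(defun parsebib-resolve-entry (entry entries strings) ...)
#+end_src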
So much for now.
Titus
Hi Titus,

On Fri, May 07 2021, Titus von der Malsburg wrote:

> I’m the maintainer of bibtex-completion, helm-bibtex, and ivy-bibtex. My name is
> actually Titus, not Theo ;)

:$ (I do apologise!)

> Regarding the symbols vs. string issue: I don’t have a strong opinion, but
> personally tend to favor a conservative solution that avoids breaking changes.
> First, it’s difficult to predict how switching to symbols is going to affect
> other software including custom code written by users. Second, JSON key names
> can contain spaces and other weird stuff.

Apparently, =json-parse-{buffer|string}= then gives you a symbol with a space in it...

> So strings are perhaps a more natural
> choice anyway. (It appears that you can actually configure the JSON parser to
> use strings instead of symbols. See variable `json-key-type`.)

This works for the Elisp library =json.el=, but Emacs 27 can be compiled with native JSON support, which unfortunately doesn't provide this option.

> Finally,
> it’s not necessarily clear that avoiding the conversion to strings saves
> sufficiently many CPU cycles to justify the effort.

I can simply try it out. It shouldn't be difficult to code up.

> Regarding support for CSL-JSON: bibtex-completion is currently very
> BibTeX-oriented and uses fairly low-level parsing functions from parsebib. We
> could add similar support for CSL-JSON

I'm afraid that won't be possible, because the CSL-JSON support in parsebib isn't low-level. ;-) There's basically just a single function that gives you all the entries in the buffer, and that's it.

> Some rough ideas for such an API (just for illustration):
> - A function that returns all entries in a .bib or CSL-JSON file.

Those already exist... ;-) For JSON, that's basically the only option, because the actual parsing isn't handled by parsebib. For BibTeX, such a function has existed for some time now.

> - A function that returns an entry with a specific key (or multiple entries).

That would be easy to support, but IMHO it is better handled in bibtex-completion: just parse the buffer and then call =gethash= on the resulting hash table. Or what use case do you have in mind?

> - Functions for resolving strings and cross-references.

This, too, is something that parsebib already does. parsebib has a lower-level API and a higher-level API, and the latter does essentially what you suggest here. I thought bibtex-completion was already using it...

--
Joost Kremers
Life has its moments
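As a sketch of the =gethash= approach described above (the file name and entry key are made up, and the exact signature of =parsebib-parse-bib-buffer= with its =:entries= keyword argument should be checked against the parsebib README):

#+begin_src emacs-lisp
(require 'parsebib)

;; Parse a .bib file once into a hash table, then look up individual
;; entries by key without reparsing.
(let ((entries (make-hash-table :test #'equal)))
  (with-temp-buffer
    (insert-file-contents "references.bib")
    (parsebib-parse-bib-buffer :entries entries))
  ;; Returns the entry alist for key "smith2020", or nil if absent.
  (gethash "smith2020" entries))
#+end_src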
On Fri, May 7, 2021 at 8:52 AM Titus von der Malsburg
<malsburg@posteo.de> wrote:
> It might be more elegant to have a higher-level API in parsebib. This API could perhaps even abstract away from the underlying format (BibTeX, CSL-JSON, or others in the future?). This would substantially simplify matters in bibtex-completion, but would also enable many other cool uses of parsebib.
Just wanted to +1 this!
Bruce
On 2021-05-07 Fri 14:34, Joost Kremers wrote:

> Hi Titus,
>
> On Fri, May 07 2021, Titus von der Malsburg wrote:
>> I’m the maintainer of bibtex-completion, helm-bibtex, and ivy-bibtex. My name is
>> actually Titus, not Theo ;)
>
> :$ (I do apologise!)
>
>> Regarding the symbols vs. string issue: I don’t have a strong opinion, but
>> personally tend to favor a conservative solution that avoids breaking changes.
>> First, it’s difficult to predict how switching to symbols is going to affect
>> other software including custom code written by users. Second, JSON key names
>> can contain spaces and other weird stuff.
>
> Apparently, =json-parse-{buffer|string}= then gives you a symbol with a space in it...

I now see that symbol names “can contain any characters whatever” [1]. But many characters need to be escaped (like spaces), which isn’t pretty.

>> So strings are perhaps a more natural
>> choice anyway. (It appears that you can actually configure the JSON parser to
>> use strings instead of symbols. See variable `json-key-type`.)
>
> This works for the Elisp library =json.el=, but Emacs 27 can be compiled with
> native JSON support, which, however, doesn't provide this option, unfortunately.

I see. In that case it might make sense to propose string keys as a feature for json.c. The key is a string anyway at some point during parsing, so avoiding the conversion to a symbol may actually be the best way to speed things up.

>> Finally,
>> it’s not necessarily clear that avoiding the conversion to strings saves
>> sufficiently many CPU cycles to justify the effort.
>
> I can simply try it out. Shouldn't be difficult to code up.
>
>> Regarding support for CSL-JSON: bibtex-completion is currently very
>> BibTeX-oriented and uses fairly low-level parsing functions from parsebib. We
>> could add similar support for CSL-JSON
>
> I'm afraid that won't be possible, because the CSL-JSON support in parsebib
> isn't low-level. ;-) There's basically just a single function that gives you all
> the entries in the buffer and that's it.
>
>> Some rough ideas for such an API (just for illustration):
>> - A function that returns all entries in a .bib or CSL-JSON file.
>
> Those already exist... ;-) For JSON, that's basically the only option, because
> the actual parsing isn't handled by parsebib. For BibTeX, such a function has
> existed for some time now.

Wasn’t aware. Fantastic!

>> - A function that returns an entry with a specific key (or multiple entries).
>
> That would be easy to support, but IMHO is better handled in bibtex-completion:
> just parse the buffer and then call =gethash= on the resulting hash table. Or
> what use-case do you have in mind?

One use case: bibtex-completion drops fields that aren’t needed early on, to save memory and CPU cycles. (Some people work with truly enormous bibliographies, like crypto.bib with ~60K entries.) But this means that we sometimes have to read an individual entry again if we need fields that were dropped earlier. In this case I’d like to be able to read just one entry without having to reparse the complete bibliography.

>> - Functions for resolving strings and cross-references.
>
> This, too, is something that parsebib already does.

OMG, bibtex-completion is doing this as well, but I’d be happy to get rid of this code.

> parsebib has a lower-level API and a higher-level API, and the latter does
> essentially what you suggest here. I thought bibtex-completion was already using it...

Nope. I think the high-level API didn’t exist when I wrote my code in 2014. Seems like there’s quite a bit of potential for streamlining bibtex-completion. Now I just need a week to work on it. :)

Titus

[1] https://www.gnu.org/software/emacs/manual/html_node/elisp/Symbol-Type.html
On Fri, May 07 2021, Titus von der Malsburg wrote:

>> Apparently, =json-parse-{buffer|string}= then gives you a symbol with a space
>> in it...
>
> I now see that symbol names “can contain any characters whatever” [1]. But many
> characters need to be escaped (like spaces) which isn’t pretty.

Agreed. But if you pass such a symbol to =symbol-name= or to =(format "%s")=, the escape character is removed, so when it comes to displaying those symbols to users, it shouldn't matter much.

Note, though, that the keys in CSL-JSON don't seem to contain any spaces or other weird characters. They are just lower-case a-z and dashes, that's all.

>> This works for the Elisp library =json.el=, but Emacs 27 can be compiled with
>> native JSON support, which, however, doesn't provide this option,
>> unfortunately.
>
> I see. In this case it might make sense to propose string keys as a feature for
> json.c. The key is a string anyway at some point during parsing, so avoiding the
> conversion to symbol may actually be the best way to speed things up.

True. I'll ask on emacs-devel. Personally, I'd prefer strings, too, but I'm a bit hesitant about doing the conversion myself, especially given that in Ebib all the keys would need to be converted back before I can save a file.

>> That would be easy to support, but IMHO is better handled in
>> bibtex-completion:
>> just parse the buffer and then call =gethash= on the resulting hash table. Or
>> what use-case do you have in mind?
>
> One use case: bibtex-completion drops fields that aren’t needed early on to save
> memory and CPU cycles. (Some people work with truly enormous bibliographies,
> like crypto.bib with ~60K entries.) But this means that we sometimes have to
> read an individual entry again if we need more fields that were dropped earlier.
> In this case I’d like to be able to read just one entry without having to
> reparse the complete bibliography.

Makes sense. For .bib sources, this should be fairly easy to do. For .json, I can't really say how easy it would be. It's not difficult to find the entry key in the buffer, but from there you'd have to be able to find the start of the entry in order to parse it. Currently, I don't know how to do that.

>>> - Functions for resolving strings and cross-references.
> [...]
>> parsebib has a lower-level API and a higher-level API, and the latter does
>> essentially what you suggest here. I thought bibtex-completion was already
>> using it...
>
> Nope. I think the high-level API didn’t exist when I wrote my code in 2014.

No, it didn't. I seem to remember, though, that you gave me the idea for the higher-level API, which is probably why I assumed you were using it.

So that part of =parsebib= hasn't been tested much... (Ebib doesn't use it, either.) If you do decide to start using it, please test it and report any issues you find. And let me know if I can help with testing.

--
Joost Kremers
Life has its moments
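As a quick illustration of the escaping behavior (the symbol name here is made up):

#+begin_src emacs-lisp
;; A symbol whose name contains a space needs an escape character when
;; printed in read syntax, but `symbol-name' and `format' return the
;; plain name:
(prin1-to-string (intern "container title")) ; => "container\\ title"
(symbol-name (intern "container title"))     ; => "container title"
(format "%s" (intern "container title"))     ; => "container title"
#+end_src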
On 2021-05-07 Fri 16:47, Joost Kremers wrote:

> On Fri, May 07 2021, Titus von der Malsburg wrote:
>>> Apparently, =json-parse-{buffer|string}= then gives you a symbol with a space
>>> in it...
>>
>> I now see that symbol names “can contain any characters whatever” [1]. But many
>> characters need to be escaped (like spaces) which isn’t pretty.
>
> Agreed. But if you pass such a symbol to =symbol-name= or to =(format "%s")=,
> the escape character is removed, so when it comes to displaying those symbols to
> users, it shouldn't matter much.
>
> Note, though, that the keys in CSL-JSON don't seem to contain any spaces or
> other weird characters. There are just lower case a-z and dash, that's all.

I agree that weird characters are unlikely to be an issue. Nonetheless, strings seem slightly more future-proof. Funky Unicode stuff is now appearing everywhere (I’ve seen emoji being used for variable names), and the situation could be different a couple of years down the line.

>>> This works for the Elisp library =json.el=, but Emacs 27 can be compiled with
>>> native JSON support, which, however, doesn't provide this option,
>>> unfortunately.
>>
>> I see. In this case it might make sense to propose string keys as a feature for
>> json.c. The key is a string anyway at some point during parsing, so avoiding the
>> conversion to symbol may actually be the best way to speed things up.
>
> True. I'll ask on emacs-devel. Personally, I'd prefer strings, too, but I'm a
> bit hesitant about doing the conversion myself, esp. given that in Ebib, all the
> keys would need to be converted back before I can save a file.

Sure, converting all keys in parsebib is not attractive.

>>> That would be easy to support, but IMHO is better handled in
>>> bibtex-completion:
>>> just parse the buffer and then call =gethash= on the resulting hash table. Or
>>> what use-case do you have in mind?
>>
>> One use case: bibtex-completion drops fields that aren’t needed early on to save
>> memory and CPU cycles. (Some people work with truly enormous bibliographies,
>> like crypto.bib with ~60K entries.) But this means that we sometimes have to
>> read an individual entry again if we need more fields that were dropped earlier.
>> In this case I’d like to be able to read just one entry without having to
>> reparse the complete bibliography.
>
> Makes sense. For .bib sources, this should be fairly easy to do. For .json, I
> can't really say how easy it would be. It's not difficult to find the entry key
> in the buffer, but from there you'd have to be able to find the start of the
> entry in order to parse it. Currently, I don't know how to do that.

Not a big deal. Since it’s just about individual entries and the code isn’t super central, we can easily hack something together.

>>>> - Functions for resolving strings and cross-references.
> [...]
>>> parsebib has a lower-level API and a higher-level API, and the latter does
>>> essentially what you suggest here. I thought bibtex-completion was already
>>> using it...
>>
>> Nope. I think the high-level API didn’t exist when I wrote my code in 2014.
>
> No, it didn't. I seem to remember, though, that you gave me the idea for the
> higher-level API, which is probably why I assumed you were using it.
>
> So that part of =parsebib= hasn't been tested much... (Ebib doesn't use it,
> either). If you do decide to start using it, please test it and report any
> issues you find. And let me know if I can help with testing.

The organically grown parsing code in bibtex-completion has been bugging me for a while, so I'm keen on rewriting it. But I may not get to it until the summer. I'll keep you posted when I start working on it.

Titus
Dear All,
This is just to +1 this on my part as well. Although unadvertised,
citeproc-org basically already supports CSL-JSON bibliographies, and
it would be fantastic if other components of the Emacs
citation/bibliography infrastructure also did. BTW, would CSL-JSON
support in =parsebib= mean that there is hope for having CSL-support
in Ebib too?
best regards,
András
Hi,

Well, this is what I asked Joost in the first place. Adjusting parsebib is part of the efforts to make that possible.

Denis
On Sat, May 08 2021, András Simonyi wrote:
> this is just to +1 this on my part as well. Although unadvertised,
> citeproc-org basically already supports CSL-JSON bibliographies, and
> it would be fantastic if other components of the Emacs
> citation/bibliography infrastructure also did. BTW, would CSL-JSON
> support in =parsebib= mean that there is hope for having CSL-support
> in Ebib too?
Yes, that is the plan. No promises on an ETA, but it's high on my to-do list.
--
Joost Kremers
Life has its moments