From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 23 Nov 2020 16:17:30 +0300
From: Jean Louis
To: Texas Cyberthal
Subject: Re: One vs many directories
References: <87y2ive1i4.fsf@localhost> <878sauhhv1.fsf@web.de> <875z5ygwwr.fsf@web.de> <87r1olfvh4.fsf@web.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Cc: "Dr. Arne Babenhauserheide", "emacs-orgmode@gnu.org"
List-Id: "General discussions about Org-mode."

* Texas Cyberthal [2020-11-23 12:51]:
> Hi Dr. Arne,
>
> > The only part that hits performance limits is the agenda.
>
> Well, IIRC your Org Textmind is much smaller than mine.
>
> > My current guess is that the agenda is slow because it has to parse all my 7500 clock entries, and it has to check the Todo states of around 1200 headings.
>
> Ouch. I'd rather keep a "ramble log" so I can reconstruct an exactly
> honest time accounting, with discounts for partial attention, without
> worrying about fiddly clockin/outs. At least when working from home.
> If clocking into a work site, that's different, because one can
> reasonably bill for the entire time, with minimal clock toggling.
>
> > Did you check against filesystem limits? At 10k entries in a
> > directory typical filesystems start becoming slow. That's the main
> > reason I see for adding hierarchies.

From the ext4 manual:

dir_index
    Use hashed b-trees to speed up name lookups in large directories.
    This feature is supported by ext3 and ext4 file systems, and is
    ignored by ext2 file systems.

dir_nlink
    This ext4 feature allows more than 65000 subdirectories per
    directory.

I think that file systems should be unlimited and fast in that respect. I have ~/Maildir with over 50000 subdirectories; direct access is very easy and fast, while listing takes some time. If a file system does not allow fast access, it is time to replace it with one that does. Now I wonder if HAMMER in DragonFly BSD is also slow with 50000 directories.

My PostgreSQL database is not huge; packed, it is about 50 MB. On the file system it is 810 MB. Selecting 2469 contacts that belong to a certain group, as a subset of 204048 contacts, usually gives no feeling of delay; it looks instant to a human.

My Org work is on a meta level, so my truly important headings or subtree names are in the database. Subtrees have their various properties; I can place any tags in there, like TODO, or designate the type of TODO. My work is mostly intertwined with text and Org mode, but I could use any kind of MIME type or any kind of Emacs mode. Some nodes are on the file system while some are in the database. Nodes within a subtree are hyperdocuments; they are all linkable and may or may not be on the file system. Everything is together in one tree, and that does not matter, because access to the nodes does not necessarily go through the tree.

There are 19197 nodes. Finding the 76 that are tagged with TODO gives me no visible delay, definitely not even 0.2 seconds. When I press Enter it is right there.

From the system I am using personally, I am thinking that Org mode could get its own database connection, so that headings and properties are managed fully in the database while text could be managed in files. It seems very possible. The only thing that would need to be added to Org in that case is some heading tag that uniquely designates where in the database that heading is managed.
It could be displayed very lightly on the screen and would not be exported by default. Something like:

    *** TODO Heading :ID-123:

That would be all. All other metadata belonging to the heading could be managed in the database. If a heading is deleted, it need not be deleted in the database. Text belonging to the heading could be managed in the text file, properties in the database. It can be a simple database such as GDBM, if that is fast enough.

Metadata for the heading would or could be updated automatically from time to time. The user could easily decide whether to show the properties in the Org file. It does not matter much as long as the :ID-123: tag is there.

All things like tags, properties, clock-in and clock-out, schedule, deadlines, custom_id and everything else that is heading metadata could be managed in the database. It could be copied into new headings.

Creating a heading like this:

    *** Title RET

would automatically invoke creation of heading 124 in the database, and it would appear as:

    *** Title :ID-124:

From there on the user would do everything as usual in Org mode, with the difference that properties would be displayed in their updated form and would not really be in the Org file. They would be displayed on the fly. Any properties, and a plethora of other new properties, could be included.

The system would recognize automatically, on saving or opening the Org file:

- whether headings are in the right file; if a file changed its place, that would be automatically updated in the database.

- the heading ID would always remain unique no matter what, so users linking to any heading would not need to worry about the title staying the same. The unique ID that links to a heading would basically link to the database entry. Opening the link would ask the database where the entry is located, and it would open the proper Org file at the proper location without parsing the Org file in the usual manner. The Org file would then remain much more plain text than it is now.
- all the parsing, searching and indexing would be solved automatically, and human-readable SQL queries could easily be customized by the user.

Suddenly there would be much less commotion in the work. Org files would look much more human-readable than they do now.

> 10k entries in a directory sounds inhumanely unergonomic. I guess my
> biggest flat name directory might eventually reach that size? In
> which case I could just split it in the middle of the alphabet, or
> similar solution.

Like by first letters, like ~/Maildir/a/d/a/adam@example.com

Such sorting of files would be automatic. You would need to invoke a command that sorts files that way automatically and that may also quickly access such files automatically.

I have a command that I often use, mkdatedir, that makes me a directory for the current date. If I wish to make a database note for the day, the command today-note makes sure there is:

- Year 2020 (formatted how I customize it)
- November (also formatted by custom)
- 2020-11-23

And the entry is automatically opened for the note. The system helps me quickly locate the note that relates to the day. But I can put multiple notes under the same date, and I can also have the same title for those multiple notes. This is because each note has its unique ID.

I do not know how Org handles multiple identical headings when linking to them. It does not by default:

    [[Heading][Heading]]

    * Heading
    Text here
    * Heading
    More text here.

But if I wish to link here I need to do a hack.

To me and my thinking that is not really logical. There should always be a unique ID for each heading. My mind is not comforted by the Org system in that sense. And I should not have to think of the unique ID, nor should I be writing links like [[Something][Here]] by hand, as they should be constructed automatically.

Myself, I would like to move the cursor to the second Heading and capture the link to that heading. I would kill [[Heading][Heading]] into a special memory for those links.
Then I could go to any other place in the Org file and insert it there without thinking about what the link looks like or constructing the link myself, as it already exists in front of me. Constructing links by hand is fine for those which are external.

Headings of Org files could be managed by the database in the background. Then all that distributed or sparse meta information (mess) disappears.

What people are now trying to handle with Org files is management of a database. Only the entries of that database are pretty much disconnected from each other, vague, in unknown positions, and Org algorithms then try to manage everything that is built into all SQL databases anyway. The mess grows over time.

> A 10k entry directory is getting into enterprise territory, and I'm
> sure enterprise has tech tricks that become worthwhile at that scale.

I will try those options, dir_index and dir_nlink, to see if my 50000+ directory becomes somewhat faster. Direct access to a subdirectory is always very fast. I almost never do ls there, nor do I enter any such directory manually. They store emails, so I just press one key in mutt; that key extracts the current email address, such as person@example.com, and opens ~/Maildir/person@example.com, one among 50000. It is accessed by wanting to see the previous conversation with the contact, not by knowing the directory name or email address; the computer does that. It is a simple system I have used for years and it is blazing fast.

> > There are scaling problems in every direction: Too many files per
> > directory, too large files, too much content per heading, too many
> > headings.

Listing more than 200,000 contacts does take some time, but access to the list from the database is so much faster than ls in ~/Maildir with more than 50000 entries or subdirectories. I can relate to that. And I still think that file systems should manage any number of entries.
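
To make the heading database idea from earlier in this message a bit more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: it uses SQLite from the standard library instead of PostgreSQL or GDBM, and the table layout and all function names (open_registry, create_heading, heading_line, set_property, move_file, locate, find_by_property) are hypothetical. None of this is an existing Org feature.

```python
import sqlite3

# Hypothetical schema: one row per heading, metadata kept as key/value pairs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS headings (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- IDs are never reused
    title TEXT NOT NULL,
    file  TEXT NOT NULL                       -- Org file that holds the text
);
CREATE TABLE IF NOT EXISTS properties (
    heading_id INTEGER NOT NULL REFERENCES headings(id),
    key        TEXT NOT NULL,
    value      TEXT,
    PRIMARY KEY (heading_id, key)
);
"""


def open_registry(path=":memory:"):
    """Open (or create) the heading registry database."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def create_heading(db, title, file):
    """Register a new heading and return its unique ID."""
    cur = db.execute(
        "INSERT INTO headings (title, file) VALUES (?, ?)", (title, file)
    )
    db.commit()
    return cur.lastrowid


def heading_line(db, heading_id, stars=3):
    """Render the heading as it would appear in the Org file."""
    row = db.execute(
        "SELECT title FROM headings WHERE id = ?", (heading_id,)
    ).fetchone()
    if row is None:
        raise KeyError(heading_id)
    return f"{'*' * stars} {row[0]} :ID-{heading_id}:"


def set_property(db, heading_id, key, value):
    """Store one piece of heading metadata (tag, state, deadline, ...)."""
    db.execute(
        "INSERT OR REPLACE INTO properties (heading_id, key, value) "
        "VALUES (?, ?, ?)",
        (heading_id, key, value),
    )
    db.commit()


def move_file(db, old_file, new_file):
    """Record that an Org file changed its place."""
    db.execute(
        "UPDATE headings SET file = ? WHERE file = ?", (new_file, old_file)
    )
    db.commit()


def locate(db, heading_id):
    """Return the file that holds a heading, or None if the ID is unknown."""
    row = db.execute(
        "SELECT file FROM headings WHERE id = ?", (heading_id,)
    ).fetchone()
    return row[0] if row else None


def find_by_property(db, key, value):
    """Return (id, title) for every heading with the given property value."""
    return db.execute(
        "SELECT h.id, h.title FROM headings h "
        "JOIN properties p ON p.heading_id = h.id "
        "WHERE p.key = ? AND p.value = ? ORDER BY h.id",
        (key, value),
    ).fetchall()


if __name__ == "__main__":
    db = open_registry()
    first = create_heading(db, "Heading", "notes.org")
    second = create_heading(db, "Heading", "notes.org")  # same title, new ID
    set_property(db, second, "state", "TODO")
    print(heading_line(db, second))
    print(find_by_property(db, "state", "TODO"))
```

Two headings with the same title get different IDs, so a link can point to the ID and not to the title, and finding all TODO headings becomes one indexed query without parsing any Org file.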
> There are scaling problems from too much deep tree nesting, namely too
> much fiddly ambiguous manual refiling. Solution is flat "solid name"
> directories just below feasible 10 Bins. Work fine.

I have tried your solution and could not find the mental concept to relate to my thinking. And I do agree that such a solution could help other people.

For images I have a command like `sort-images.lisp' that just sorts images by their embedded dates. Many times I even sort downloads per day.

Memacs tries to solve about the same problem.
Memacs
https://github.com/novoid/Memacs

That hyperlink I selected from among 20000 other hyperlinks. I could as well send you the notes or annotation related to the hyperlink. I did not write the hyperlink myself; all I did was open HyperScope, invoke completion, and press W on the link on screen, and it copied itself into this email. It was blazing fast, as I accessed it by thinking Memex. Not Memacs, but Memex. Memacs was just next to it.

"By thinking" would still mean that I had to enter some words that I think of. Memory is involved in that process of thinking and accessing.

You mentioned humans know many words. If we observe the process of knowing words, how do we access them? This time really by thinking. But from what I have heard, we access them mostly by association or by direct access. We see the flower and the word is just there. Do we think of a tree of knowledge first? I do not think so.

And there are memory systems that DO think of a plethora of various things and increase human memory capabilities. That is called mnemonics. Mnemonics is based mostly on associations. It becomes possible to memorize a shuffled pack of 52 cards within 20 minutes, to reliably know which card is at which position, and to reproduce the full series of cards. Mnemonic methods help humans do such feats. Everybody can do it.

Compare now the human mind systems:

- direct access by direct association, something like: I think of Memex, but I know there is something similar in Emacs; I write Memex and I get Memacs, then I give the reference to you.

- and there is also the system of thinking where I can locate in my mind a reference to Memacs even by its number or ID, because that could be my mnemonics how I think about

How humans think is nowhere defined and is vague. Humans think how they think, and there may be as many versions as there are humans.

Computers should no longer be delivered with only one built-in paradigm, such as the file system.
There should be at least several:

- file system

- meta database approach, which involves a little more curation than just making a file name

- subject or tag based approach

- Dewey decimal approach or other similar

- 10 Bins, etc.

Then the user could decide to use this or that approach.

Having file managers for decades is really boring. It does not advance computing. To say that we have hierarchical file systems by default and nothing else shows how under-developed we are.

Doug Engelbart already envisioned how files could be stored, accessed, hyperlinked and referenced, and we do not use it in that sense today, after how many years? Maybe 40 years.

Computer makers and OS makers do not really help us. There are visionaries, but we do not get file systems that help us access files by association or thinking, so we have to upgrade our tools for tasks that should be built in.

Org as a concept was already invented by Doug Engelbart decades ago, and it still does not have features that I would like it to have. For example: finely grained unique ID numbers that can also relate to paragraphs or sets of paragraphs, unique or static sorting of the file repository, wide group collaboration and sharing, and other concepts. Hyperlinking was already sophisticated back then.

Highlights of the 1968 "Mother of All Demos"
https://www.dougengelbart.org/content/view/276/000/