
Is it possible to use one backing-store for multiple atomspaces? #1855

Closed

enku-io opened this issue Aug 7, 2018 · 26 comments

@enku-io
enku-io commented Aug 7, 2018

I am working on a project that requires different atomspaces to run in memory. Is it possible to use one backing store to persist all of these atomspaces? Having one backing store, and hence one database, per atomspace would be inefficient and resource-intensive.

@linas
Member

linas commented Aug 8, 2018

Yes and no, depending on what you mean.

If you mean:
a) "can I have N different processes or servers or machines all accessing the same atomspace in the same backend, sharing their data with one-another in a coherent, bug-free way", the answer is yes. I do that all the time, there are unit tests for it.

If you mean:
b) "can I have N different processes or servers or machines all accessing different, disjoint, unrelated atomspaces that just happen to be stored in the same table", the answer is no. And mostly I cannot see any reason to do that -- and the reasons you give - "inefficient and resource intensive" would not apply to this case, anyway. There are only efficiency and resource losses from putting unrelated things into the same table; there are no gains.

@enku-io
Author

enku-io commented Aug 8, 2018

I think I need to elaborate on my situation. The project I am working on enables a user to create a project with its own dataset and knowledge base. The knowledge base is basically an atomspace. If I have N users, and each creates a project with a different knowledge base (hence a different atomspace), and all of them might access their respective atomspaces at the same time, how will I handle the persistence? If I store each atomspace in a different database, I will have to connect to each database concurrently and store the atoms. When saying "inefficient and resource intensive" I am referring to connecting to multiple databases and writing to those databases. Also, the knowledge bases (currently we have three that the user can choose from) are in most cases similar initially. Having a database for each atomspace would be resource intensive.

@linas
Member

linas commented Aug 8, 2018

If I have N users ... how will I handle the persistence?

At the command line, issue the command createdb user_foo_atoms (although this could be moved into C++ code). The current backend will initialize this database automatically, and the creation of the database could be automated as well.
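
For instance, here is a minimal sketch of what the per-user flow might look like from guile, assuming the sql-open call from the (opencog persist-sql) module; user_foo_atoms is just the example database name from above, and older versions of sql-open take separate dbname/user/password arguments instead of a URI:

(use-modules (opencog) (opencog persist) (opencog persist-sql))

; Connect to the per-user database created with createdb.
(sql-open "postgres:///user_foo_atoms")

; Persist one atom, then flush pending writes and disconnect.
(store-atom (Concept "something the user did"))
(sql-close)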

If I store each atomspace in a different database, I will have to connect to each database concurrently and store the atoms.

? Who is "I" in this sentence? Only the app has to connect to the database, not "you". There is no additional cost associated with this. The total cost depends only on the number of users: each user has to have their own connection to a database, whether or not they are sharing it.

I am referring to connecting to multiple databases and writing to those databases.

There is no additional cost to accessing multiple databases, as compared to one. The grand total number of TCP/IP sockets is exactly the same: every user has to open a socket; you cannot avoid this. The compute overhead for the postgres server is exactly the same.

Having a database for each atomspace would be resource intensive.

No, that is just not true. In fact, it's the other way around: giving each user their own database would use fewer resources, not more. Trying to put every user into the same atomspace causes the size of the indexes on the tables to bloat: the atomspace indexes are quite large. Larger indexes use up more disk space; searching larger indexes takes more CPU time and slows down response times. Making indexes even larger is a total net loss. A separate atomspace for each user will be smaller, simpler, faster, more lightweight, and easier to manage. For example:

  • To delete user N, you just say dropdb user_foo_atoms. You don't have to go fishing for atoms that might or might not belong to user N.

  • Backup and restore: it's tons easier to make backups of N distinct databases than to make one giant backup of one database.

  • Security: You don't have to worry about user A accidentally getting access to data from user B. You don't have to worry about user A accidentally corrupting someone else's data.

  • Provisioning: users connecting on the local network could be provisioned differently from users coming in over a VPN. A third class, coming over the open internet, could have a third set of security provisions. These are set per database.

  • Data loss: if you have a power loss, lightning strike, ECC RAM corruption, gamma-ray strike, vibration, humidity, dust, or a cooling event, and there is database corruption, probably only 1 or 2 out of N users are affected. If you put them all into one database, then all users suffer catastrophic data loss. Hope you had backups! Hope you paid extra so that your disk drives had batteries or supercapacitors on them!

  • Scaling: If you have N users, and they don't all fit on one server, you can move half of them to a second server. This is easy. By contrast, trying to write the required postgres rules and triggers and whatnot to have half of one table sit on one server instance, and the other half on another instance, is extremely hard and error-prone. It's just barely possible, in the very newest versions of postgres. It's way too bleeding-edge to deploy on a shoe-string operation.

There is nothing to be gained by trying to force all users into the same database. It makes no sense at all. You just end up using more disk space to store indexes, and you make response times worse, because the larger indexes take longer to search. Putting everyone into the same database requires more disk, more RAM, and more CPU; it's a total loss, and there is no gain.

@linas
Member

linas commented Aug 18, 2018

I thought some more about what you asked above, and I now think I understand what you are asking for.

The short answer is "no, it's not possible", but design-wise, it is an interesting usage scenario, and it probably should get love and attention and a good design to enable dataset, umm, read-only sharing. That is, I'm envisioning a large dataset with lots of genomic or proteomic data that would be accessible, pre-loaded in some read-only fashion, shared by everyone, with multiple users who could get access to it and perform COW (copy-on-write) modifications to it.

Having this ability seems like a high-priority design/architecture goal.

@enku-io
Author

enku-io commented Aug 22, 2018

That's great to hear. I have a couple of questions regarding what you are envisioning.

  1. Are you envisioning a scenario with multiple datasets, where each can be loaded separately?
  2. How about change tracking on the datasets once they are loaded into an atomspace? How would you implement copy-on-write: when a user modifies a specific dataset, will the whole dataset be copied along with the changed atoms, or will only the changes that the user makes be stored elsewhere?
  3. Related to question 2: if we store the changed part of the atomspace in a separate database/table, how do we load the changed dataset that has some of its atoms stored elsewhere?

@linas
Member

linas commented Aug 23, 2018

Your questions are not so much questions as requirements: you want a system that can do 1, 2, and 3 (and maybe other things that you haven't mentioned). So:

  1. Yes, the ability to load multiple datasets is a reasonable, understandable requirement.

  2. Copy-on-write and change-tracking are two unrelated concepts. Currently, an atomspace can have one or more overlay atomspaces, so that, in the overlay, all of the original atomspace is visible, but so are the atoms in the overlay. You can alter the atoms in the overlay without altering the atoms in the base space. This works today, mostly; I'm not sure we have unit tests for values, so there might be bugs.

2a. Overlay atomspaces are NOT supported in the postgres back-end. They could be, but someone has to write the code to do this. It's not hard (it is not research AI), but it's also not easy (it does require a system programmer who knows both postgres and the atomspace). Storage for overlays does NOT require copying the entire underlying database; only the changed portions would be stored.

2b. Change tracking is something else: change tracking is about user permissions (does the user have permission to read or write?), dataset meta-data (who owns the dataset? when was it created? what does it contain?) and overlay tracking (does overlay B require overlay A, which requires base dataset Z?). I understand that these are needed, but I kind-of don't want to solve them inside of the atomspace, because they are "icky". It would be best if some external module solved these problems.

3a. Yes, it should be possible to store the overlay in a different database from the base. This code has not been written ...

3b. Currently, it is not possible to store different atomspaces in different tables of one database. That is because the atomspace uses 3 tables, and half a dozen indexes, all of which have fixed names. We would need some way of making the names generic, and pass them in during login. I guess that could be done. Someone would need to write the code ...

Do you have any other requirements?

I don't have the time to design the above and to write the code. I could help someone do this. You would need to find someone who knows how to do this (or you would have to learn it yourself -- again, I could help with the design and code reviews, but this is not something that can be done easily or quickly).

@linas
Member

linas commented Aug 23, 2018

Also: currently, in the overlay atomspace, there is no way to create a second atom that has the same name as an atom in the base. That means that, currently, it is impossible to have some atom with one truth value in the base space but a different truth value in the overlay space (and likewise for any value, not just truth values). This could be fixed, but currently it's just not possible. I just checked:

(use-modules (opencog))

(Concept "foo" (stv 0.5 0.5))

(define base (cog-atomspace))
(define ov (cog-new-atomspace base))
(cog-set-atomspace! ov)

(cog-set-tv! (Concept "foo") (stv 0.8 0.8))

(cog-set-atomspace! base)

(Concept "foo") ;  TV changed, ... its no longer 0.5 ....

@linas
Member

linas commented Aug 23, 2018

One can create two atoms with the same name by creating one in the overlay first:

(use-modules (opencog))

(define base (cog-atomspace))
(define ov (cog-new-atomspace base))
(cog-set-atomspace! ov)

(Concept "foo" (stv 0.5 0.5)) ; create atom in overlay, first.
(Concept "foo")  ; check to make sure.

(cog-set-atomspace! base)

(Concept "foo")  ; this creates a new atom in the base.
(cog-set-tv! (Concept "foo") (stv 0.8 0.8)) ; change TV

(cog-set-atomspace! ov) ; go back to overlay

(Concept "foo") ;  the overlay version still has the old value.

@linas
Member

linas commented Aug 23, 2018

... so we could create two types of overlays: "transparent" ones, which allow atoms in the base space to be changed, and "opaque" ones, which would allow atoms in the base to be seen but not changed (changes would be copy-on-write).

@linas
Member

linas commented Sep 26, 2018

The different overlays are possible because values are immutable. All changes to the value of an atom MUST go through Atom::setValue(), and so this can always be intercepted. This provides a natural place where the atom could be copied into the overlay atomspace, if desired. Alternately, one could set a bit-flag in the atom, indicating that its value has changed, so that, when saving to disk, only the changed atoms are saved.

@linas
Member

linas commented Nov 4, 2018

There are several competing alternative design points/implementations:

  • Atom::setValue(), Atom::getValue() and Atom::getKeys() are overloaded methods. They check user permissions (read/write, read-only) on a per-atom basis. But where are they overloaded from?

  • The overloaded Atom::getValue(), etc. methods are provided by the overlay atomspace. When a user creates an overlay atomspace, they must supply a username/password (if different from the current username/password). This could be similar to the current sql-open mechanism, which requires a user-password pair. (More precisely, we want to use a capability model, not a permissions model, so that a user can perform an action if they have the capability (the pointer/token).)

The above design is "poor" because it imposes a performance overhead on every user, whether or not they need it. So here's an alternative:

  • Keep Atom::getValue(), etc. unchanged from current design. However, the Values themselves now have a per-user security implementation. Thus, for example, there would be a SecureFloatValue which performs the permissions checking.

@linas
Member

linas commented Nov 4, 2018

Let's look at some plausible user flows:

-- user creates base atomspace, opens a read-only database as backing store.
-- user creates an overlay atomspace, and opens a different, read-write database as its backing store.
-- user fetches some atom. If the atom is in the overlay atomspace, it is fetched, else, it is fetched from the base atomspace.
-- user alters some value on the atom. This sets a bit on the atom, marking "this value has changed"
-- User stores atom. The store will always go to the overlay backing-store, never to the base store.

Where should the "this value has changed" flag be stored? With the atom? Somewhere else? It is easiest to store it with the atom, but this bloats the atom. If we store it somewhere else, we pay a lookup overhead instead.

Implementation:

  • The Atom::setValue() etc methods are virtual.

  • The default Atom::setValue() is a trampoline. If called on an atom having a base-space backing store, it will look for overlays with an open backing store, and, if found, it will trampoline to there and install the overlay method. Thus, there is a penalty only on the first call to Atom::setValue(); subsequent calls are "cheap".

@linas
Member

linas commented Nov 4, 2018

@enku-io the above thoughts are incomplete, and would require a fair bit of thinking to make them clear, clean, and elegant enough to actually work well. Then, after a design/prototype is in place, someone needs to implement it in full detail. Is there anyone on your team who would be able to do this? How high a priority would this be? This is a fairly large, non-trivial project, and it's not clear to me if Ben or @mjsduncan has provisioned headcount for this, or if anyone at icog labs has the needed sophistication and experience to do this. ... this is not a junior-developer project. Thoughts?

@mjsduncan
Contributor

thanks for the feedback, linas! the bio team certainly doesn't have the resources to do opencog backing-store development, but we do want to do graph searches/pattern matching in atomspaces that are too big to fit in ram. here is another framework that might be useful as a model, a graph database that uses pgsql or other options as a back end: https://docs.bmeg.io/grip/docs/
it doesn't have hyper-edges, but it does allow arbitrary types and values for vertices and edges.

@mjsduncan
Contributor

what happens if you load a big atomspace from storage, then add an overlay and do stuff -- adding and deleting atoms, modifying values -- then delete the starting atomspace? is anything left in the overlay that could be saved and then reloaded on top of a new copy of the original atomspace?

@linas
Member

linas commented Nov 4, 2018

First, the easy question: "deleting" ... there is currently no way to "delete" the atomspace underneath the overlay.

@mjsduncan
Contributor

so if you switch to the original atomspace and delete all the atoms, then everything in the overlay atomspace is gone also?

@linas
Member

linas commented Nov 4, 2018

Next: "fit in RAM". So:

  1. You don't have to actually load all atoms into RAM. You can selectively load only those that you care about (and search over those; and purge them when you are done searching)

  2. No database can ever search for anything that is not in RAM. Whatever data you are searching for has got to be in RAM at the time you search for it. That said, there is a huge variety of tricks and techniques that some (not all) databases use to minimize what's held in RAM, e.g. by loading only what they need, and then throwing it away ASAP. But then the bottleneck is disk I/O, which is why DB machines have super pumped-up I/O subsystems and relatively slow CPUs.

For the atomspace, we currently have three super-minimalist "searches": one called "fetch atoms of type", another called "fetch incoming set", and a third called "fetch incoming set by type" -- these search the DB for only those atoms that you want (and load only those). That's it. It's possible to create more of these -- a fancier search that pushes more of the pattern matching down into the DB -- but there are issues. One is that postgres still needs to suck things into RAM to run the query (even if the results are never passed up to the atomspace). The other is that general searches are hard for any database -- postgres does "query optimization", as do OLAP and OLTP tools -- it's decades of high-tech work to figure out how to pull into RAM only the parts you actually need. Not to exaggerate, but people have been searching DBs that don't fit into RAM for more than five decades now. There's no magic potion; you learn to deal with it.

That said, it's hard to move forward without specific examples ... how fancy are your queries? Can you make do with clever use of "fetch by type" and "fetch incoming set"? You'd have to do this by hand. Note also: the "main loop" of the pattern matcher is a loop over an entire incoming set, and a loop over all atoms of a type. We don't currently, but we could, rewrite those loops to fetch from the DB as the search progresses ...
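
To make that concrete, here is a rough sketch of the hand-rolled selective-load flow, using the fetch calls named above; the database name and the atom names are hypothetical:

(use-modules (opencog) (opencog persist) (opencog persist-sql))

; Attach to a large dataset (hypothetical database name).
(sql-open "postgres:///genomics_data")

; Pull in just the slice needed for this search.
(load-atoms-of-type 'ConceptNode)
(fetch-incoming-set (Concept "some-gene"))

; ... run the pattern matcher over what is now in RAM ...

; Discard atoms that are no longer needed, freeing RAM.
(cog-extract (Concept "some-gene"))
(sql-close)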

@linas
Member

linas commented Nov 4, 2018

if you switch to the original atomspace and delete all the atoms, then everything in the overlay atomspace is gone also?

Um ... good question. 1) No one has ever done this before. 2) Doing this will expose one or more bugs. 3) Now is a good time to discuss what should happen. (The right answer is not fully obvious.)

@linas
Member

linas commented Nov 4, 2018

if you switch to the original atomspace and delete all the atoms, then everything in the overlay atomspace is gone also?

Um ... good question. 1) No one has ever done this before. 2) Doing this will expose one or more bugs. 3) Now is a good time to discuss what should happen. (The right answer is not fully obvious.)

False. I just tried it; it seems to work, without bugs. You cannot delete atoms that are referenced in the overlay. Here's a demo.

 (use-modules (opencog))

 ; create atoms in the base atomspace.
 (define a (Concept "a"))
 (define b (Concept "b"))
 (define c (Concept "c"))
 (cog-prt-atomspace)

 ; create an overlay
 (define base (cog-atomspace))
 (define ov (cog-new-atomspace base))
 (cog-atomspace-env ov)
 (cog-set-atomspace! ov)

 ; create a link in the overlay
 (Link a b)
 (cog-prt-atomspace)

 ; go to the base.
 (cog-set-atomspace! base)

 ; print .. notice the link is not in the base.
 (cog-prt-atomspace)

 ; delete "everything". The first two deletes will fail.
 (cog-extract a)
 (cog-extract b)
 (cog-extract c)

 ; Verify what's left.
 (cog-prt-atomspace)

 ; what happened to c? it should be #<Invalid handle>
 c
 (format #t "this is it: ~A\n" c)
 (format #t "this is it: ~A\n" a)
 (format #t "this is it: ~A\n" b)

linas added a commit to linas/atomspace that referenced this issue Nov 5, 2018
@linas
Member

linas commented Nov 5, 2018

OK @mjsduncan and @enku-io, after pull req #1895, I think you've now got 90% of what you seem to be asking for here. (I had the sudden urge to code it up in the proverbial "afternoon" -- 10 hours, actually, but whatever.) So -- two things are now possible:

You can have a base atomspace, marked read-only, and a read-write overlay. You can alter truth values in the overlay without damaging atoms in the base space. This is done by automatically copying an atom into the overlay whenever you attempt to alter its TV. The result is two copies of the atom: one in the base, untouched, with the original truth value, and its twin, in the overlay, with the altered TV. See /examples/atomspace/copy-on-write.scm for an example; a condensed sketch is below.
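
A condensed sketch of that flow, assuming the read-only marking call (cog-atomspace-ro!) added in #1895 -- see the full copy-on-write.scm example for the authoritative version:

(use-modules (opencog))

; Create an atom in the base space, then freeze the base.
(Concept "foo" (stv 0.5 0.5))
(cog-atomspace-ro!)

; Build a read-write overlay on top of the read-only base.
(define base (cog-atomspace))
(define ov (cog-new-atomspace base))
(cog-set-atomspace! ov)

; Altering the TV copies the atom into the overlay;
; the base copy keeps its original (stv 0.5 0.5).
(cog-set-tv! (Concept "foo") (stv 0.8 0.8))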

Part two of what you want -- managing datasets that don't fit in RAM -- you've now got a basic set of tools for dealing with that. You can, for example, (load-atoms-of-type 'ConceptNode), which loads only those and nothing else ... then perform a pattern match, and then cog-extract everything you don't want, freeing up RAM. (You will also need to do sql-clear-cache to reclaim yet more RAM; sql-stats prints cryptic info.) Then you can do the load again to get them back, or a different set of atoms, and so on. What's more, you can do this all in the read-only atomspace, without unmarking the read-only flag!

The idea is that the atomspace acts as an in-RAM cache for the database. The database remains read-only; you can move atoms from the database into the atomspace, you can wipe out some or all of the atomspace, and it's all still "read-only". For example, you still cannot add atoms to the atomspace if they aren't already in the database! A sketch of the load/extract cycle is below.
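
A minimal sketch of that cache cycle, assuming the load-atoms-of-type and sql-clear-cache calls mentioned above; the database name is hypothetical:

(use-modules (opencog) (opencog persist) (opencog persist-sql))

(sql-open "postgres:///big_readonly_dataset")

; Load one slice of the dataset into RAM.
(load-atoms-of-type 'ConceptNode)

; ... perform a pattern match over the loaded atoms ...

; Drop the loaded atoms again. Extraction fails harmlessly for
; any atom that still has an incoming link.
(for-each cog-extract (cog-get-atoms 'ConceptNode))

; Reclaim the RAM held by the local SQL cache.
(sql-clear-cache)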

Everything I describe above should "just work" in a bug-free fashion. However, there are some rough spots. I'm not exactly clear on what would happen if you tried to open two databases at the same time, one attached to the read-only base space and one attached to the read-write overlay. That is probably broken somehow. If (when) you find bugs or have complaints, open new bug reports.

The read-only/read-write mechanism is NOT a security/permissions mechanism. It helps you avoid corrupting the read-only data, but it won't stop you from toggling the write-enable bit back on. If you want true malicious-hacker-proof protection, you'll have to set up postgres permissions correctly.

We probably need to add handy utilities for extracting lots of atoms at the same time, instead of one at a time.

If you have detailed questions about the new example(s), open a new bug report. General discussion should stay in this issue.

@linas
Member

linas commented Nov 5, 2018

Ooops ... @amebel points out that the work in #1895 wrecks the connectivity of incoming sets, so graph traversal won't work correctly, and pattern matching will behave in unexpected ways. So although #1895 is a step in the right direction, it's incomplete.

@linas
Member

linas commented Nov 8, 2018

Ignore the remark immediately above. It all works. Extended unit tests are in #1901 and #1902. Apologies for some of the whipsawing. Please try it out; please open new bug reports if things do not work as expected.

@linas
Member

linas commented Dec 24, 2018

The best long-term solution that I can think of is described in #1967. I think it would solve everything @enku-io is asking for, as well as other problems and issues that other people have.

@linas
Member

linas commented May 10, 2022

FYI: much or most of what was discussed above got implemented in the pull reqs of April-May 2022, starting with #2925 and continuing up through about #2936.

@linas
Member

linas commented Dec 11, 2022

Closing. Everything discussed here is now available via the various ProxyNodes, which extend the earlier StorageNode interfaces. You can now have M AtomSpaces sending Atoms to/from N other AtomSpaces anywhere on the net, while controlling for read-only access, load-balancing, distributed redundant writes, mirroring, and transforming Atoms & Values on the fly. It's all there; there are even demo examples for each of these in the examples dir. A rough sketch of the flavor is below.
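
For flavor, a rough sketch of wiring a read-through proxy over two storage back-ends; the RocksStorageNode paths and the exact proxy wiring here are illustrative guesses -- consult the demos in the examples dir for the authoritative usage:

(use-modules (opencog) (opencog persist) (opencog persist-rocks))

; Two storage back-ends (paths are hypothetical).
(define sto-a (RocksStorageNode "rocks:///tmp/dataset-a"))
(define sto-b (RocksStorageNode "rocks:///tmp/dataset-b"))

; A proxy that consults each back-end in turn when fetching.
(define rd (ReadThruProxy "read-thru"))
(ProxyParameters rd (List sto-a sto-b))

(cog-open rd)
(fetch-atom (Concept "foo"))  ; looked up in sto-a, then sto-b
(cog-close rd)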

@linas linas closed this as completed Dec 11, 2022