Is it possible to use one backing-store for multiple atomspaces? #1855
Yes and no, depending on what you mean.
I think I need to elaborate on my situation. The project I am working on lets a user create a project with its own dataset and knowledge base. The knowledge base is basically an atomspace. If I have N users, and each creates a project with a different knowledge base (hence atomspace), all of which might be accessed at the same time, how do I handle persistence? If I store each atomspace in a different database, I will have to connect to each database concurrently and store the atoms. By "inefficient and resource intensive" I mean connecting to multiple databases and writing to all of them. Also, the knowledge bases (currently we offer three that the user can choose from) are in most cases similar initially. Having a database for each atomspace would be resource intensive.
At the command-line, issue the command
Who is "I" in this sentence? Only the app has to connect to the database, not "you". There is no additional cost associated with this. The total cost depends only on the number of users: each user has to have their own connection to a database, whether or not they are sharing it.
There is no additional cost to accessing multiple databases, as compared to one. The grand total number of TCP/IP sockets is exactly the same. Every user has to open a socket. You cannot avoid this. The compute overhead for the postgres server is exactly the same.
No, that is just not true. In fact, it's the other way around: giving each user their own database would use fewer resources, not more. Trying to put every user into the same atomspace causes the size of the indexes on the tables to bloat: the atomspace indexes are quite large. Larger indexes use up more disk space; searching larger indexes takes more CPU time, and slows down response times. Making indexes even larger is a total net loss. Having a separate atomspace for each user would be smaller, simpler, faster, more light-weight, and easier to manage. For example:
There is nothing to be gained by trying to force all users to use the same database. It makes no sense at all. You just end up using more disk space to store indexes, and you make the response time worse, because the large indexes take longer to search. Putting everyone into the same database requires more disk, more RAM, and more CPU; it's a total loss, and there is no gain.
I thought some more about what you asked above, and I now think I understand what you are asking for. The short answer is "no, it's not possible", but design-wise it is an interesting usage scenario, and it probably should get love and attention and a good design to enable dataset, umm, read-only sharing. That is, I'm envisioning a large dataset with lots of genomic or proteomic data that would be accessible, pre-loaded in some read-only fashion, shared by everyone, with multiple users who could get access to it and perform COW (copy-on-write) modifications to it. Having this ability seems like a high-priority design/architecture goal.
That's great to hear. I have a couple of questions regarding what you are envisioning.
Your questions are not so much questions as requirements: you want a system that can do 1, 2, and 3 (and maybe other things that you haven't mentioned). So:
2a. Overlay atomspaces are NOT supported in the postgres back-end. They could be, but someone has to write the code to do this. It's not hard (it is not research AI), but it's also not easy (it does require a system programmer who knows both postgres and also the atomspace). Storage for overlays does NOT require copying the entire underlying database. Only the changed portions would be stored.

2b. Change tracking is something else: change tracking is about user permissions (does the user have permission to read or write?), dataset meta-data (who owns the dataset? When was it created? What does it contain?) and overlay tracking (does overlay B require overlay A, which requires base dataset Z?). I understand that these are needed, but I kind-of don't want to solve them inside of the atomspace, because they are "icky". It would be best if some external module solved these problems.
3b. Currently, it is not possible to store different atomspaces in different tables in a database. That is because the atomspace uses 3 tables and half a dozen indexes, all of which have fixed names. We would need to have some way of making the names generic, and pass them in during login. I guess that could be done. Someone would need to write the code ... Do you have any other requirements? I don't have the time to design the above and to write the code. I could help someone do this. You would need to find someone who would know how to do this (or you would have to learn, yourself -- again, I could help with the design and code reviews, but this is not something that can be done easily or quickly).
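The delta-only overlay storage described in 2a can be sketched in a few lines. This is a toy Python model under assumed names (`BaseStore`, `OverlayStore`), not the actual AtomSpace postgres schema: the point is only that persisting an overlay records the changed portions, not a copy of the whole base.

```python
# Toy model of delta-only overlay storage: the overlay records only
# atoms that differ from the base, so persisting an overlay does not
# require copying the entire underlying database. All class and atom
# names here are hypothetical.

class BaseStore:
    def __init__(self, atoms):
        self.atoms = dict(atoms)   # name -> value, treated as read-only

class OverlayStore:
    def __init__(self, base):
        self.base = base
        self.delta = {}            # only the changed portions live here

    def get(self, name):
        # overlay wins; otherwise fall through to the base
        return self.delta.get(name, self.base.atoms.get(name))

    def set(self, name, value):
        self.delta[name] = value   # copy-on-write: base is untouched

base = BaseStore({"gene-A": 0.2, "gene-B": 0.7})
ov = OverlayStore(base)
ov.set("gene-A", 0.9)

assert ov.get("gene-A") == 0.9        # overlay sees the change
assert ov.get("gene-B") == 0.7        # unchanged atoms fall through
assert base.atoms["gene-A"] == 0.2    # base is untouched
assert len(ov.delta) == 1             # only the changed portion is stored
```

Storing `ov.delta` is all an overlay back-end would need to persist, regardless of how large the base dataset is.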
Also: currently, in the overlay atomspace, there is no way to create a second atom that has the same name as an atom in the base. That means that, currently, it is impossible to have some atom with one truth value in the base space but a different truth value in the overlay space (and likewise for any value, not just the truth value). This could be fixed, but currently it's just not possible. I just checked:
One can create two atoms with the same name, but creating one in the overlay, first:
So we could create two types of overlays: "transparent" ones, that allow atoms in the base space to be changed, and "opaque" ones, that would allow atoms in the base to be seen but not changed (changes would be copy-on-write).
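The transparent/opaque distinction can be illustrated with a toy model. This is hypothetical Python, not AtomSpace code; the `Overlay` class and its `opaque` flag are invented here purely to show the two write behaviours being proposed.

```python
# Toy model of the two proposed overlay flavours: a "transparent" overlay
# writes straight through to the base, an "opaque" one copies-on-write.
# Class names and the `opaque` flag are hypothetical.

class Overlay:
    def __init__(self, base, opaque):
        self.base = base       # the underlying space (a plain dict here)
        self.opaque = opaque
        self.local = {}        # COW copies live here

    def get(self, name):
        # local copy shadows the base, if one exists
        return self.local.get(name, self.base.get(name))

    def set(self, name, value):
        if self.opaque:
            self.local[name] = value   # COW: base stays pristine
        else:
            self.base[name] = value    # transparent: base is mutated

transparent = Overlay({"x": 1}, opaque=False)
opaque = Overlay({"x": 1}, opaque=True)

transparent.set("x", 2)
opaque.set("x", 2)

assert transparent.base["x"] == 2   # transparent overlay changed the base
assert opaque.base["x"] == 1        # opaque overlay left the base alone
assert opaque.get("x") == 2         # but still sees its own COW copy
```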
The different overlays are possible because values are immutable. All changes to the value of an atom MUST go through ...
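The immutability point above is what makes the overlay scheme safe: a value is never mutated in place, so the base and the overlay can hold different value objects for the "same" atom without interfering. A toy Python illustration (a frozen dataclass standing in for an AtomSpace value; not the real Value API):

```python
# Toy illustration of immutable values: "changing" a value builds a new
# object, so base and overlay can safely point at different value objects.
# TruthValue here is a hypothetical stand-in, not the AtomSpace class.

from dataclasses import dataclass

@dataclass(frozen=True)        # frozen => immutable, like AtomSpace values
class TruthValue:
    strength: float
    confidence: float

base_tv = TruthValue(0.5, 0.9)
# "setting" the TV constructs a fresh object; base_tv is untouched
overlay_tv = TruthValue(0.99, base_tv.confidence)

assert base_tv.strength == 0.5
assert overlay_tv.strength == 0.99

raised = False
try:
    base_tv.strength = 0.0     # in-place mutation is forbidden
except Exception:
    raised = True
assert raised
```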
There are several competing alternative design points/implementations:
The above design is "poor" because it imposes a performance overhead on every user, whether or not they need it. So here's an alternative:
Let's look at some plausible user flows: the user creates a base atomspace, and opens a read-only database as the backing store. Where should the "this value has changed" flag be stored? With the atom? Somewhere else? It is easiest to store it with the atom, but this bloats the atom. But if we store it somewhere else, then there is extra overhead. Implementation:
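The trade-off between the two flag placements can be made concrete with a toy sketch (hypothetical Python, invented names; not a real implementation): a per-atom flag costs one field on every atom, while a side table costs nothing for untouched atoms but needs a lookup per access.

```python
# Toy comparison of where a "this value has changed" flag could live:
# on the atom itself (bloats every atom) or in a side table (extra
# lookup per access, but zero cost for untouched atoms). Hypothetical.

class AtomWithFlag:
    __slots__ = ("name", "value", "dirty")   # flag bloats every atom
    def __init__(self, name, value):
        self.name, self.value, self.dirty = name, value, False

class DirtySideTable:
    def __init__(self):
        self.dirty = set()                   # holds only changed atoms
    def mark(self, name):
        self.dirty.add(name)
    def is_dirty(self, name):
        return name in self.dirty            # extra lookup on each check

a = AtomWithFlag("x", 1)
a.value, a.dirty = 2, True                   # change + flag, in-place

side = DirtySideTable()
side.mark("x")                               # change tracked externally

assert a.dirty
assert side.is_dirty("x")
assert not side.is_dirty("y")   # untouched atoms cost nothing here
```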
@enku-io the above thoughts are incomplete, and would require a fair bit of thinking to make them clear and clean and elegant enough to actually make them work well. Then, after a design/prototype is in place, someone needs to implement it in full detail. Is there anyone on your team who would be able to do this? How high a priority would this be? This is a fairly large, non-trivial project, and it's not clear to me if Ben or @mjsduncan has provisioned headcount for this, or if anyone at icog labs has the needed sophistication and experience to do this. ... this is not a junior-developer project. Thoughts?
thanks for the feedback, linas! the bio team certainly doesn't have resources to do opencog backing store development, but we do want to do graph searches/pattern matching in atomspaces that are too big to fit in ram. here is another framework that might be useful as a model, a graph database that uses pgsql or other options as a back end: https://docs.bmeg.io/grip/docs/
what happens if you load a big atomspace from storage, then add an overlay and do stuff, adding and deleting atoms and modifying values, and then delete the starting atomspace? is anything left in the overlay that could be saved and then reloaded on top of a new copy of the original atomspace?
First, the easy question: "deleting" .. there is currently no way to "delete" the atomspace underneath the overlay.
so if you switch to the original atomspace and delete all the atoms, then everything in the overlay atomspace is gone also?
Next: "fit in RAM". So:
For the atomspace, we currently have three super-minimalist "searches": one called "fetch atoms of type", another called "fetch incoming set", and a third called "fetch incoming set by type" -- these search the DB for only those atoms that you want (and load only those). That's it. It's possible to create more of these -- a fancier search that pushes more of the pattern matching down into the DB -- but there are issues. One is that postgres still needs to suck things into RAM to run the query (even if the results are never passed up to the atomspace). The other is that general searches are hard for any database -- postgres does "query optimization", as do OLAP and OLTP tools -- it's decades of high-tech stuff to figure out how to pull into RAM only the parts you actually need. Not to exaggerate, but people have been searching DB's that don't fit into RAM for more than five decades now. There's no magic potion, and you learn to deal with it. That said, it's hard to move forward without specific examples... how fancy are your queries? Can you make do by clever use of "fetch by type" and "fetch incoming set"? You'd have to do this by hand. Note also: the "main loop" of the pattern matcher is a loop over the entire incoming set and a loop over all atoms of a type. We don't currently, but we could rewrite those loops to fetch from the DB as the search progresses...
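The "fetch only what you need" pattern described above can be sketched as follows. The function names mimic the real calls ("fetch atoms of type", "fetch incoming set") but everything else is a toy: a plain dict stands in for postgres, and the in-RAM dict stands in for the atomspace cache.

```python
# Toy sketch of demand-driven fetching: instead of loading the whole
# database into RAM, pull in atoms by type or by incoming set, on demand.
# The DB schema here (name -> (type, outgoing links)) is hypothetical.

DB = {
    "cat":  ("Concept", ()),
    "dog":  ("Concept", ()),
    "like": ("Predicate", ()),
    "l1":   ("Link", ("cat", "dog")),
}

ram = {}  # the in-RAM "atomspace" acting as a cache over DB

def fetch_atoms_of_type(atype):
    """Load only the atoms of the given type into RAM."""
    for name, (t, out) in DB.items():
        if t == atype:
            ram[name] = (t, out)

def fetch_incoming_set(target):
    """Load only the atoms that link to `target` into RAM."""
    for name, (t, out) in DB.items():
        if target in out:
            ram[name] = (t, out)

fetch_atoms_of_type("Concept")
assert set(ram) == {"cat", "dog"}   # only Concepts were loaded

fetch_incoming_set("cat")
assert "l1" in ram                  # the link mentioning "cat" arrived
assert "like" not in ram            # nothing else was pulled into RAM
```

A pattern-match "main loop" rewritten this way would call `fetch_incoming_set` as the search progresses, rather than requiring the whole dataset in RAM up front.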
Um ... good question. 1) no one has ever done this before. 2) doing this will expose one or more bugs. 3) now is a good time to discuss what should happen. (the right answer is not fully obvious) |
False. I just tried it. It seems to work, without bugs. You cannot delete atoms that are referenced in the overlay. Here's a demo.
First step for issue opencog#1855
OK @mjsduncan and @enku-io, after pull req #1895 I think you've now got 90% of what you seem to be asking for here. (I had the sudden urge to code it up in the proverbial "afternoon" -- 10 hours, actually, but whatever). So -- two things are now possible:

You can have a base atomspace, marked read-only, and a read-write overlay. You can alter truth-values in the overlay, without damaging atoms in the base space. This is done by automatically copying an atom into the overlay whenever you attempt to alter its TV. The result is two copies of the atom: one in the base, untouched, with the original truth value, and its twin, in the overlay, with the altered TV.

Part two of what you want -- managing datasets that don't fit in RAM -- you've got a basic set of tools for dealing with that. The idea is that the atomspace acts as an in-RAM cache for the database. The database remains read-only; you can move atoms from the database into the atomspace, you can wipe out some or all of the atomspace, and it's all still "read-only". For example, you still cannot add atoms to the atomspace if they aren't already in the database!

Everything I describe above should "just work" in a bug-free fashion. However, there are some rough spots. I'm not exactly clear on what would happen if you tried to open two databases at the same time, one attached to the read-only base space and one attached to the read-write overlay. That is probably broken somehow. If (when) you find bugs, or have complaints, open new bug reports.

The read-only/read-write mechanism is NOT a security/permissions mechanism. It helps you avoid corrupting the read-only data, but it won't stop you from toggling the write-enable bit back on. If you want true malicious-hacker-proof protection, you'll have to set up postgres permissions correctly.

We probably need to add handy utilities for extracting lots of atoms at the same time, instead of one-at-a-time.
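The copy-on-write behaviour described for pull req #1895 can be modeled in a few lines of toy Python. This is an illustrative sketch only (the `Space` class and its methods are invented here); the real implementation is in C++ with Scheme bindings.

```python
# Toy model of the read-only base + read-write overlay behaviour:
# altering a truth value in the overlay auto-copies the atom, leaving
# the read-only base untouched, so the atom ends up with two copies
# carrying two different TVs. All names here are hypothetical.

class Space:
    def __init__(self, base=None, read_only=False):
        self.atoms, self.base, self.read_only = {}, base, read_only

    def get_tv(self, name):
        if name in self.atoms:
            return self.atoms[name]
        return self.base.get_tv(name) if self.base else None

    def set_tv(self, name, tv):
        if self.read_only:
            raise PermissionError("base space is read-only")
        self.atoms[name] = tv   # COW: the copy lands in this overlay

base = Space(read_only=True)
base.atoms["rainy"] = 0.3          # pre-loaded, read-only dataset

overlay = Space(base=base)
overlay.set_tv("rainy", 0.8)       # auto-copies the atom into the overlay

assert base.atoms["rainy"] == 0.3  # original TV untouched in the base
assert overlay.get_tv("rainy") == 0.8
```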
If you have detailed questions about the new example(s), open a new bug report. General discussion should stay in this issue.
Closing. Everything discussed here is now available via the various ProxyNodes, which extend the earlier StorageNode interfaces. You can now have M AtomSpaces sending Atoms to/from N other AtomSpaces anywhere on the net, while controlling for read-only, load-balancing, distributed redundant writes, mirroring, and transforming Atoms & Values on the fly. It's all there; there are even demo examples for each of these in the examples dir.
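The proxy idea in the closing comment (one interface, writes fanned out to several stores) can be sketched in miniature. This is a hypothetical Python model of the general pattern; the real ProxyNodes and their demos live in the AtomSpace examples dir.

```python
# Toy sketch of a write-through proxy: it presents the same store/fetch
# interface as a plain store, but mirrors writes to several downstream
# stores and serves reads from the first one that has the atom.
# Class names are invented for illustration.

class DictStore:
    def __init__(self):
        self.atoms = {}
    def store(self, name, val):
        self.atoms[name] = val
    def fetch(self, name):
        return self.atoms.get(name)

class WriteThroughProxy:
    def __init__(self, targets):
        self.targets = targets
    def store(self, name, val):
        for t in self.targets:      # mirrored, redundant writes
            t.store(name, val)
    def fetch(self, name):
        for t in self.targets:      # first store that answers wins
            v = t.fetch(name)
            if v is not None:
                return v
        return None

a, b = DictStore(), DictStore()
proxy = WriteThroughProxy([a, b])
proxy.store("cat", 0.7)

assert a.atoms["cat"] == 0.7 and b.atoms["cat"] == 0.7  # mirrored
assert proxy.fetch("cat") == 0.7
```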
I am working on a project that requires different atomspaces to run in memory. Is it possible to use one backing-store to persist all these atomspaces? Having one backing-store, and hence one database, per atomspace would be inefficient and resource intensive.