Load CSV tables into AtomSpace #2989

Merged
merged 57 commits into from
Aug 21, 2022

Conversation

linas
Member

@linas linas commented Aug 20, 2022

This provides the ability to load plain-text tables (comma-separated values, tab-separated values)
into the AtomSpace.

The format allows Atomese programs to act on the columns of the table (add, subtract, etc.)

This is one of the important capabilities needed by old-style as-moses.

@linas
Member Author

linas commented Aug 21, 2022

BTW, @ngeiswei @Habush @kasimebrahim @Yidnekachew @Bitseat @eman @behailu04 I'd like to bring your attention to the brand-new demo examples/atomspace/table.scm It does two things: it shows you how to load a CSV/TSV table into the atomspace -- this is now a "core function" in the atomspace. Next, it shows how to write functions that act on the table, and how to write scoring functions, in pure atomese. All this works outside of AS-MOSES.

The main difference here is that the main atomspace evaluator is used, instead of the as-moses evaluator. That means that all the functions from the AtomSpace are supported, and not just some of them. The functions look very similar to the as-moses atomese/combo trees; they're only a little bit different. There's an extra ValueOf link used to fish out the data from the columns. Otherwise, it's more or less identical.
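
For anyone who wants a feel for the column-oriented layout before opening the demo, here is a rough plain-Python analogy. It is not the AtomSpace loader and it is not Atomese: the names `load_columns`, `column_add` and `score` are made up for illustration, and the sample table is invented. The idea it sketches is only this: each column becomes one vector of floats, a function combines columns elementwise, and a scoring function compares the result against a target column.

```python
import csv
import io

def load_columns(text, delimiter=","):
    """Parse delimited text into {column name: list of floats}."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = [name.strip() for name in next(reader)]
    columns = {name: [] for name in header}
    for row in reader:
        if not row:  # skip blank lines
            continue
        for name, cell in zip(header, row):
            columns[name].append(float(cell))
    return columns

def column_add(a, b):
    """Elementwise sum of two equal-length columns."""
    return [x + y for x, y in zip(a, b)]

def score(predicted, target):
    """Sum of squared errors; lower is better."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

# Invented sample data, standing in for a small CSV file.
SAMPLE = "x,y,target\n1,2,3\n4,5,9\n6,7,14\n"

table = load_columns(SAMPLE)
guess = column_add(table["x"], table["y"])  # candidate "program": x + y
error = score(guess, table["target"])
```

Passing `delimiter="\t"` would handle tab-separated input the same way. In the real demo the columns live as Values on an Atom and the arithmetic is written in Atomese; see examples/atomspace/table.scm for the actual syntax.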

This opens the possibility for applying moses algos to non-table data, including video and audio data, or any kind of streaming data, or complex data sources. The Value system allows data to flow in from anywhere, in any way. The AS-MOSES system can then explore different kinds of mutations applied to data processing pipelines. I'm getting ready to tackle some of these data sources.

Anyway, thanks for your work in as-moses. It's not been in vain. The future is bright, methinks.

@linas linas merged commit 1e76e3a into master Aug 21, 2022
@linas linas deleted the load-csv-tables branch August 21, 2022 10:32
@Bitseat

Bitseat commented Aug 22, 2022 via email

@kasimebrahim
Contributor

That is impressive @linas! And thank you for keeping us in the loop.

@mjsduncan
Contributor

this is very cool, thanks linas! how hard would it be to expand this to import sql db dumps and reproduce the relationships of the connecting keys between tables?

@linas
Member Author

linas commented Aug 30, 2022

Hi @mjsduncan -- not hard. Not easy. It depends.

Let me start with a question. Do you want the SQL data as Values, or as Atoms? The CSV mapping puts an entire column of a table into a single vector value (because this is "natural" for moses). An alternative would have been to take each row of the table, and convert it into a (EvaluationLink (PredicateNode "my CSV table") (List (Concept "...") (DateNode "...") (NumberNode ...)))

The nice thing about using vectors is that they're fast, compact, uniform. The bad thing is they're not searchable. By contrast, you can search (pattern-match) the EvaluationLinks; but they're slower, bulkier.

Long ago, I came up with this idea, never implemented. Tell me what you think. It goes like this:

  • There would be a mapping, from an SQL table, to some AtomSpace structure. So for example SQL TABLE Foo (Name STRING, Date TIME, Location INT) would map to (EvaluationLink (PredicateNode "Foo") (List (Concept "...") (DateNode "...") (NumberNode ...))) The mapping is user-specified, so it would not have to be an EvaluationLink; it could be whatever you want.

  • A connection would be made to a running SQL server. So, instead of working from a dump, you could work with live data in a live DB. So, basically, you'd be "mirroring" the SQL data in the AtomSpace. Not only reading it, but if there are changes, updates, these changes would be written out to the DB. Doesn't even have to be SQL, could be "any" data source.
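
To make the first bullet concrete, here is a minimal sketch in plain Python, using the standard-library `sqlite3` module as a stand-in for whatever SQL server would really be used. Everything in it is an assumption made for illustration: the helper names (`atomese_string`, `foo_row_to_evaluation`, `export_table`), the sample rows, and storing the date as text. `DateNode` is copied from the example above and is not checked against the existing Atom types. The sketch only produces Atomese text; it does not talk to an AtomSpace, and it covers reading only, not the live two-way mirroring of the second bullet.

```python
import sqlite3

def atomese_string(text):
    """Quote a Python string as a Scheme/Atomese string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def foo_row_to_evaluation(row):
    """One possible user-specified mapping for a row of table Foo."""
    name, date, location = row
    return (
        '(EvaluationLink (PredicateNode "Foo") '
        f'(List (Concept {atomese_string(name)}) '
        f'(DateNode {atomese_string(date)}) '
        f'(NumberNode {location})))'
    )

def export_table(conn, query, mapping):
    """Apply a user-supplied mapping to every row returned by the query."""
    return [mapping(row) for row in conn.execute(query)]

# In-memory database with invented rows, mirroring the Foo example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (Name TEXT, Date TEXT, Location INTEGER)")
conn.executemany(
    "INSERT INTO Foo VALUES (?, ?, ?)",
    [("alice", "2022-08-21", 7), ("bob", "2022-08-30", 12)],
)
atoms = export_table(
    conn,
    "SELECT Name, Date, Location FROM Foo ORDER BY Name",
    foo_row_to_evaluation,
)
conn.close()
```

Because `export_table` takes the mapping as an argument, swapping `foo_row_to_evaluation` for a different function changes the target structure without touching the database code, which is the "user-specified" part of the idea.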

I don't know if you're interested in the second bullet or not. If you're working with biology databases, then maybe working from dumps is all you want. Maybe the live data connection isn't needed. The live data connection is trickier, harder and more fragile.

One "hard part" is coming up with a generic way of allowing the user to specify what the table-to-atomese mapping is. I've got ideas for this (See wiki page for SignatureLink...) but it would take some polishing to get it right.


Excuse me. As I write the above, I just realized there are two easy tricks... One trick is to create a TableValue that would take all the (EvaluationLink (PredicateNode "my CSV table")...) and return the corresponding FloatValue vectors, one for each column. Then there could also be the inverse: some ExportTableLink, which, given vectors, would create a brand-new EvaluationLink for each row.
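
The two directions can be sketched in a few lines of plain Python. This is only an analogy under stated assumptions: `TableValue` and `ExportTableLink` are proposals from this comment, not existing types, and the hypothetical functions `rows_to_columns` and `columns_to_rows` work on ordinary lists and tuples rather than on Atoms or Values.

```python
def rows_to_columns(rows):
    """Turn a list of equal-length numeric rows into one vector per column."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of fields")
    return [[float(row[i]) for row in rows] for i in range(width)]

def columns_to_rows(columns):
    """Inverse direction: rebuild one tuple per row from column vectors."""
    return [tuple(values) for values in zip(*columns)]

# Invented sample rows; each tuple stands in for one EvaluationLink.
rows = [(1.0, 2.0, 3.0), (4.0, 5.0, 9.0)]
columns = rows_to_columns(rows)        # one vector per column
round_trip = columns_to_rows(columns)  # back to one tuple per row
```

The round trip returns the original rows, so a pair like this would let the same data be searched row by row and processed column by column.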

@mjsduncan
Contributor

thanks for the detailed reply, linas. i'm definitely thinking of importing data as atoms, and ultimately converting it into a more compact and semantically meaningful form than the original tables, otherwise what would be the point? what i'm interested in is importing a whole database, tho i can see the value in what would be a sql interface module so info from a sql db could be imported as needed for evaluation & inference.

my question is motivated by the existence of a relational db schema and related tools that are being used to compile data on model organisms:
http://gmod.org/wiki/Overview#Chado_and_BioSQL (description of schema)
https://www.alliancegenome.org/ (meta organization with 6 model organism groups using shared infrastructure)

importing these into an atomspace would be fertile ground for developing automated biological inference systems

@linas
Member Author

linas commented Sep 7, 2022

Hi Mike,

The way to move forward is to open a new issue on github, describe the general desired features, and reference the discussion here. We should continue the discussion there.

To build this, make things concrete: pick the one or two schemas that seem to be the most important to you, and copy them into the issue. Then write down the matching AtomSpace structures that these would be converted into. Basically, provide a detailed example. This will allow me to think concretely about how to implement things.

Where's the data? Do you just want to import database dumps stored in some compressed files? Or will you set up a server somewhere, running some DB, that will hold the data? If there's some server, what is it? postgres? mariadb? redis? something else? I would need to know, in order to connect to it and interact with it.
