Load CSV tables into AtomSpace #2989

linas · 2022-08-20T22:25:40Z

This provides the ability to load plain-text tables (comma-separated values, tab-separated values)
into the AtomSpace.

The format allows Atomese programs to act on the columns of the table (add, subtract, etc.).

This is one of the important capabilities needed by old-style as-moses.

linas · 2022-08-21T09:29:06Z

BTW, @ngeiswei @Habush @kasimebrahim @Yidnekachew @Bitseat @eman @behailu04 I'd like to bring your attention to the brand-new demo examples/atomspace/table.scm. It does two things. First, it shows you how to load a CSV/TSV table into the AtomSpace; this is now a "core function" in the AtomSpace. Second, it shows how to write functions that act on the table, and how to write scoring functions, in pure Atomese. All this works outside of AS-MOSES.

The main difference here is that the main AtomSpace evaluator is used, instead of the as-moses evaluator. That means that all the functions from the AtomSpace are supported, and not just some of them. The functions look very similar to the as-moses atomese/combo trees; they're only a little bit different. There's an extra ValueOf link used to fish out the data from the columns. Otherwise, it's more or less identical.
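To make the column-wise idea concrete, here is a minimal sketch in plain Python. It is not AtomSpace or Atomese code and does not use the loader added by this pull request; the names `load_columns`, `score`, and `add_xy` are hypothetical and exist only for illustration. It stores each column of a small table as one vector, applies a function to two columns, and scores the result against a target column.

```python
import csv
import io

def load_columns(text, delimiter=","):
    """Parse delimited text with a header row into {column name: list of floats}."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            columns[name].append(float(cell))
    return columns

def score(columns, candidate, target):
    """Sum of squared errors between candidate(columns) and the target column."""
    predicted = candidate(columns)
    return sum((p - t) ** 2 for p, t in zip(predicted, columns[target]))

# A tiny in-memory table; the column names are made up for this example.
TABLE = "x,y,out\n1,2,3\n4,5,9\n10,20,30\n"
cols = load_columns(TABLE)

def add_xy(c):
    """A candidate 'program': the element-wise sum of columns x and y."""
    return [a + b for a, b in zip(c["x"], c["y"])]

error = score(cols, add_xy, "out")
```

In this sketch the dictionary lookup `c["x"]` plays roughly the role that the ValueOf link plays in the Atomese version: it fetches a whole column before the arithmetic is applied.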

This opens the possibility for applying moses algos to non-table data, including video and audio data, or any kind of streaming data, or complex data sources. The Value system allows data to flow in from anywhere, in any way. The AS-MOSES system can then explore different kinds of mutations applied to data processing pipelines. I'm getting ready to tackle some of these data sources.

Anyway, thanks for your work in as-moses. It's not been in vain. The future is bright, methinks.

Bitseat · 2022-08-22T06:47:40Z

Hi Linas, It is really great to hear the news and also great to hear from you. :) Congratulations on the big achievement and I thank you for the recognition. Kind regards, Bitseat

kasimebrahim · 2022-08-22T16:30:52Z

That is impressive @linas! And thank you for keeping us in the loop.

mjsduncan · 2022-08-30T05:23:10Z

this is very cool, thanks Linas! how hard would it be to expand this to import sql db dumps and reproduce the relationships of the connecting keys between tables?

linas · 2022-08-30T11:06:09Z

Hi @mjsduncan -- not hard. Not easy. It depends.

Let me start with a question. Do you want the SQL data as Values, or as Atoms? The CSV mapping puts an entire column of a table into a single vector value (because this is "natural" for moses). An alternative would have been to take each row of the table, and convert it into a (EvaluationLink (PredicateNode "my CSV table") (List (Concept "...") (DateNode "...") (NumberNode ...)))

The nice thing about using vectors is that they're fast, compact, uniform. The bad thing is they're not searchable. By contrast, you can search (pattern-match) the EvaluationLinks; but they're slower, bulkier.
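The trade-off can be illustrated with a loose analogy in plain Python. This is only a sketch, not AtomSpace code: tuples stand in for per-row EvaluationLinks, lists stand in for column vectors, and the names `match` and `ANY` are hypothetical. The row layout supports a pattern-style query with wildcards, while the column layout keeps each field in one compact, uniform sequence.

```python
# Column layout: one vector per column (compact and uniform).
columns = {
    "name": ["alice", "bob", "carol"],
    "city": ["Austin", "Boston", "Austin"],
    "age": [31, 45, 27],
}

# Row layout: one record per row, loosely analogous to one EvaluationLink per row.
rows = list(zip(columns["name"], columns["city"], columns["age"]))

ANY = object()  # wildcard, playing the role of a pattern variable

def match(rows, pattern):
    """Return the rows whose fields equal the pattern wherever it is not ANY."""
    return [r for r in rows
            if all(p is ANY or p == f for p, f in zip(pattern, r))]

in_austin = match(rows, (ANY, "Austin", ANY))
```

The analogy is imperfect: Python can scan a list just as easily, whereas in the AtomSpace the point is that the pattern matcher works on Atoms and not on the contents of a vector Value.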

Long ago, I came up with this idea, never implemented. Tell me what you think. It goes like this:

* There would be a mapping, from an SQL table, to some AtomSpace structure. So for example SQL TABLE Foo (Name STRING, Date TIME, Location INT) would map to (EvaluationLink (PredicateNode "Foo") (List (Concept "...") (DateNode "...") (NumberNode ...))) The mapping is user-specified, so it would not have to be an EvaluationLink; it could be whatever you want.
* A connection would be made to a running SQL server. So, instead of working from a dump, you could work with live data in a live DB. So, basically, you'd be "mirroring" the SQL data in the AtomSpace. Not only reading it, but if there are changes or updates, these changes would be written out to the DB. Doesn't even have to be SQL, could be "any" data source.

I don't know if you're interested in the second bullet or not. If you're working with biology databases, then maybe working from dumps is all you want. Maybe the live data connection isn't needed. The live data connection is trickier, harder and more fragile.

One "hard part" is coming up with a generic way of allowing the user to specify what the table-to-atomese mapping is. I've got ideas for this (See wiki page for SignatureLink...) but it would take some polishing to get it right.
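As a rough illustration of a user-specified mapping, here is a sketch in plain Python using the standard-library `sqlite3` module and an in-memory database. It is not the SignatureLink design and does not touch the AtomSpace: nested tuples merely mimic the shape of the EvaluationLink in the example, and `evaluation_mapping` and `import_table` are hypothetical names. The table Foo and its three columns follow the example given earlier in this comment; the row contents are made up.

```python
import sqlite3

def evaluation_mapping(table, row):
    """One possible mapping: a nested tuple shaped like the EvaluationLink example."""
    name, date, location = row
    return ("EvaluationLink",
            ("PredicateNode", table),
            ("List",
             ("Concept", name),
             ("DateNode", date),
             ("NumberNode", location)))

def import_table(conn, table, mapping):
    """Apply a caller-supplied mapping to every row of a table."""
    # The table name is interpolated directly; acceptable only for trusted input.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    return [mapping(table, row) for row in cursor]

# Build a throwaway database holding the example table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (Name TEXT, Date TEXT, Location INTEGER)")
conn.executemany("INSERT INTO Foo VALUES (?, ?, ?)",
                 [("alpha", "2022-08-30", 7), ("beta", "2022-09-06", 11)])

atoms = import_table(conn, "Foo", evaluation_mapping)
```

Because the mapping is just a function argument, a caller could pass a different one to produce some other structure, which is the flexibility the comment asks for. A declarative way of writing that mapping down is the part that remains open.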

Excuse me. As I write the above, I just realized there are two easy tricks. One trick is to create a TableValue that would take all (EvaluationLink (PredicateNode "my CSV table")...) and return the corresponding FloatValue vector for each column. Then there could also be the inverse: some ExportTableLink, which, given vectors, would create a brand-new EvaluationLink for each row.
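The two directions can be sketched in plain Python as a pair of inverse functions. TableValue and ExportTableLink are only proposed in this comment, so nothing here reflects an existing API; `rows_to_columns` and `columns_to_rows` are hypothetical names, tuples stand in for per-row EvaluationLinks, and lists of floats stand in for FloatValue vectors.

```python
def rows_to_columns(rows, names):
    """Gather per-row records into one vector per column (the TableValue direction)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def columns_to_rows(columns, names):
    """Emit one record per row from column vectors (the ExportTableLink direction)."""
    return list(zip(*(columns[name] for name in names)))

# Made-up data: two numeric columns, three rows.
names = ["height", "weight"]
rows = [(1.5, 50.0), (1.8, 80.0), (1.6, 55.0)]

columns = rows_to_columns(rows, names)
round_trip = columns_to_rows(columns, names)
```

The round trip returning the original rows is what makes the pair useful: data could live as searchable per-row records and be turned into column vectors only when a moses-style computation needs them.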

mjsduncan · 2022-09-06T15:22:09Z

thanks for the detailed reply, Linas. i'm definitely thinking of importing data as atoms, and ultimately converted into a more compact and semantically meaningful form than the original tables, otherwise what would be the point? what i'm interested in is importing a whole database, tho i can see the value in what would be a sql interface module so info from a sql db could be imported as needed for evaluation & inference.

my question is motivated by the existence of a relational db schema and related tools that are being used to compile data on model organisms:
http://gmod.org/wiki/Overview#Chado_and_BioSQL (description of schema)
https://www.alliancegenome.org/ (meta organization with 6 model organism groups using shared infrastructure)

importing these into an atomspace would be fertile ground for developing automated biological inference systems

linas · 2022-09-07T12:30:03Z

Hi Mike,

The way to move forward is to open a new issue on github, describe the general desired features, and reference the discussion here. We should continue the discussion there.

To build this, make things concrete: pick the one or two schemas that seem to be the most important for you, and copy them into the issue. Then write down the matching AtomSpace structures that these would be converted into. Basically, provide a detailed example. This will allow me to think concretely about how to implement things.

Where's the data? Do you just want to import database dumps stored in some compressed files? Or will you set up a server somewhere, running some DB, that will hold the data? If there's some server, what is it? postgres? mariadb? redis? something else? I would need to know, in order to connect to it and interact with it.

linas added 30 commits August 2, 2022 10:48

Start work on a CSV loader.

3c180e8

initial scaffolding for csv tables

f50a203

Copy code from asmoses

b23a69e

Cut down the original code to only the readers

0c75df4

Merge branch 'master' into load-csv-tables

31aa3f7

Add Makefile.

345de2b

Include AtomSpace

ca52b37

Convert bool and contin types to Values

7a3e3cf

Define what string_seq is

c1e7824

std namespace conversion for strings

1e311f7

More std namespace and atomese conversions

15a338e

More namespace conversions

4c5aac8

More conversions

a8e1705

White-space conversion

cf8743b

Ongoing conversion efforts

2556dbd

More conversions

4709381

Remove cruft

4064853

Whitespace rework

7f22f38

Convert and simplify table reading

277fec8

More cleanup

e45aa9f

Reorder order of teh code

2524133

Code that compiles.

c7b6ca9

Remove unused code

70b2eaa

Remove more dead code

b4789ca

Prepare columns that will be filled in.

41a43a1

Read boolean columns in the table

635a479

Handle the remaining column types

02bb9bd

Stub out or remove dead code

9c16a75

More cleanup

601a226

Start passing column names in

aab0dd5

linas added 20 commits August 20, 2022 23:10

Move documentation around

829d934

Add a README to explain what is going on

c15a912

Start work on a unit test for CSV

7c8256c

Fix typo in the name

8171c14

Bug fix, failed to pass types along

a9c5b23

nother bug fix

0a0f8a6

Another bugfix

eff2d58

Expand teh unit test some more

f9272b8

Add scheme bindings to the table loader

be92867

Add the scm side of the csv-table module

2b491d5

Bug fix cut-n-paste error

b862c41

Specify file path correctly

6bd2075

Mkae the AtoSpace an explicit argument

dbc24f0

Start work on a table demo.

eb8adb0

Announce the demo

e56a4af

Must use FloatValueOf not ValueOf

577aa9b

Update unit test to use the new API.

9e61cc0

Provide a scoring function example.

f6df994

Add explanation of the demo

1aec920

List additional modules.

54f05f4

linas mentioned this pull request Aug 21, 2022

Atomese interpreter not needed!? opencog/asmoses#101

Open

linas merged commit 1e76e3a into master Aug 21, 2022

linas deleted the load-csv-tables branch August 21, 2022 10:32
