Skip to content

An Adaptive resonance theory simulator written in C

Notifications You must be signed in to change notification settings

mathieufrh/art1simulator

Repository files navigation

  ___  ______ _____ __        _____ _                 _       _             
 / _ \ | ___ \_   _/  |      /  ___(_)               | |     | |            
/ /_\ \| |_/ / | | `| |______\ `--. _ _ __ ___  _   _| | __ _| |_ ___  _ __ 
|  _  ||    /  | |  | |______|`--. \ | '_ ` _ \| | | | |/ _` | __/ _ \| '__|
| | | || |\ \  | | _| |_     /\__/ / | | | | | | |_| | | (_| | || (_) | |   
\_| |_/\_| \_| \_/ \___/     \____/|_|_| |_| |_|\__,_|_|\__,_|\__\___/|_|   

================================================================================

TABLE OF CONTENT
================

1. REQUIREMENTS
2. INSTALLATION
3. MEMORY LEAKS
4. GENERATE YOUR BINARY PATTERNS SETS
5. RUNING THE PROGRAM
    A. EXEMPLE RUN
    B. RESULTS INTERPRETATION
6. PARAMETERS

1. REQUIREMENTS
===============

You'll need cmake in order to compile with the cmake list. On debian you can
use:

# apt-get install cmake


This code also use the C Containers Library (ccl). The library is released
under the BSD license and the code is include in this program.
See: https://code.google.com/p/ccl/

Note: It is possible to translate the program so that it dosen't use these
library containers but it is very less readable and the performances aren't 
better.

2. INSTALLATION
===============

You just need to run:

$ cmake . && make

3. MEMORY LEAKS
===============

No memory leaks should occur. I execute the program many times in different
ways and vargrind is manifest:

 All heap blocks were freed -- no leaks are possible


4. GENERATE YOUR BINARY PATTERNS SETS
=====================================

I have add a python script which will help you generate binary patterns sets
from non-binary patterns sets.
You can run it with:

./generate_binary_patterns -i input_file -o output_file -s class_pos


The input file should contains one patterns per line. Every patterns should
have the same number of attributes.
The clas_pos is the index of the class of the pattern. It can be at the 
begining like in:

class, attrib0, attrib1, attrib2

In which case the class_pos would be 1. Or it can be at the end like in:

attrib0, attrib1, attrib2, class

In which case the class_pos wouls be -1.

For instance if you want to generate a binary patterns set for the Zoo 
dataset you can use:

./generate_binary_patterns -i ./data/Zoo/zoo.data -o ./data/zoo.csv -s '-1'


The python script will scan every line of the input file and count the number
of different values for every comma separated attributes. You can see it
like:

e,t,s,h,s,s,p
a,t,s,r,s,s,j
e,b,s,h,s,a,j
e,t,s,g,s,s,p
-------------
2 2 1 3 1 2 2


This means that the third and fifth attributes are useless because they can
only hold 1 unique value. The script now knows the number of different values
each attribute can hold and it also knows these values because it has build
a list which should looks like:

[['e', 'a'],
 ['t', 'b'],
 ['s'],
 ['h', 'r', 'g'],
 ['s'],
 ['s', 'a'],
 ['p', 'j']]


Finaly the script will flatten the previous nested list and give each pattern
a 1 if it hold this value or a 0 otherwise. In our example the first pattern
will be converted to (first line preent the flatten list:

['e', 'a', 't', 'b', 's', 'h', 'r', 'g', 's', 's', 'a', 'p', 'j']
-----------------------------------------------------------------
  1    0    1    0    1    1    0    0    1    1    0    1    0


Of course, the pattern will only have a 1 per attribute part of the list.

5. RUNING THE PROGRAM
=====================

A. EXEMPLE RUN
--------------

You can run the program with something like:

$ ./runart1 -t ./data/mushrooms_train.csv -T ./data/mushrooms_test.csv -s
-n 5 -N 5 -p 5


The above command use the mushrooms dataset, limit the execution to 5 passes
and skip the first column of the dataset.
See the description of the command line options below.

The output looks like::

   _   ___ _____ _   ___ ___ __  __ _   _ _      _ _____ ___  ___ 
  /_\ | _ \_   _/ | / __|_ _|  \/  | | | | |    /_\_   _/ _ \| _ \
 / _ \|   / | | | | \__ \| || |\/| | |_| | |__ / _ \| || (_) |   /
/_/ \_\_|_\ |_| |_| |___/___|_|  |_|\___/|____/_/ \_\_| \___/|_|_\

-------------------------------------------------------

Runing ART1 algorithm using following parameters:
	Training file: ./data/mushrooms.csv
	Testing file: ./data/mushrooms.csv
	Skipping first attribute of every patterns
	Output file prefix: art
	Fluctuation percentage: 5%
	Max number of iterations: 5
	Beta parameter: 1
	Vigilance parmeter: 0.5

------------- INTERNING TRAINING PATTERNS ------------

Opening file "./data/mushrooms.csv"... OK
Reading input file "./data/mushrooms.csv"... OK

--------- CHECKING TRAINING PATTERNS VALIDITY --------

7723 patterns have been scanned
Patterns length is 117
Removing patterns containing only 0... OK
Number of network patterns: 7723

-------------------- ADDING NOISE -------------------

Adding 5% of noise to each patterns... OK
8452 pattern's bits have been switched to 0

------------------- TRAINING STAGE ------------------

Pass n° | No. reassigned | Fluctuation | No. clusters
--------+----------------+-------------+-------------
      1 |           7723 |        100% |          359
      2 |           6410 |    82.9988% |          442
      3 |           4309 |    55.7944% |          447
      4 |           3568 |    46.1997% |          447
      5 |           3466 |    44.8789% |          447

-------------- WRITING TRAINING RESULTS -------------

Writing results...
Opening file "./results/train/art_results"... OK
Opening file "./results/clusters/art_clust_0.csv"... OK
Opening file "./results/clusters/art_clust_1.csv"... OK
Opening file "./results/clusters/art_clust_2.csv"... OK
[...]
Opening file "./results/clusters/art_clust_446.csv"... OK
OK

------------- INTERNING TESTING PATTERNS ------------

Opening file "./data/test.csv"... OK
Reading input file "./data/test.csv"... OK

--------- CHECKING TESTING PATTERNS VALIDITY --------

401 patterns have been scanned
Patterns length is 117
Removing patterns containing only 0... OK
Number of network patterns: 401

-------------------- ADDING NOISE -------------------

Adding 5% of noise to each patterns... OK
454 pattern's bits have been switched to 0

------------------- TESTING STAGE ------------------

SUCCESS: 396
FAIL: 5

-------------- WRITING TESTING RESULTS -------------

nPats: 401
emptyPats: 0
Opening file "./results/test/art_results"... OK
Writing results... OK


Dataset used is mushrooms_train.csv. We ask program to exit after 5 passes
and we indicate that the first column of each non-empty and non-comment 
lines of the dataset is the class of the pattern so it must be isolated from
the pattern itself. We also add 5% of noise to the patterns, this is the
reason why there is so much clusters.

Here are the major steps of the training stage::

1. The program list the parameters of the network. It can be the parameters
   you have pass on the command line or their default values.
2. The dataset is opened and read. The patterns (every valid lines) are 
   extracted from the file and inserted in as much CSVLine structures (see 
   description below) as needed. An CSVLine structure basically hold a 
   pattern and its class.
3. The patterns length is checked: every patterns must have the exact same
   length, otherwise, the program will exit.
3. The patterns containing only 0s are removed because they are useless and
   can cause infinite loops.
4. The patterns are normalized: every bits are clamped to 0 or 1.
5. The "real" training starts:
     a. The program try to assigned every patterns to a cluster, creating new
        clusters as needed.
     b. The pass ends when every patterns have been assigned to a cluster.
     c. Then a new pass starts and the program try to reassigned patterns to
        better clusters. The fluctuation of the pass is a function of the
        number of reassigned patterns.
     d. And so on until the maximum number of passes is reached or the 
        fluctuation fall below the minimum fluctuation rate.
6. The results are written on a file + one file for each clusters, describing
   it.
7. The program also compute the classes of the clusters during the results
   output (I agree this may not be the right place to do it, bit 
   confusing...). The class of a cluster is determined by the classes of the
   patterns it contains. The program count the number of each unique class.
   The most prominent class is the class of the cluster.
8. The program can then compute, for each pattern, if it has been classified
   in the right cluster, i.e. a cluster with the same class. The 
   classification success/failure ratio is added at the bottom of the 
   result file along with the clusters classes repartition.

Here are the following steps which are carried only if a testing set is
passed to the program::

9. The program repeat the step 1-4 for the testing patterns set, if provided,
   otherwise it simply ends.
10. The testing starts:
     a. For each testing pattern the program compute its best matching
        cluster, i.e. the cluster with the nearest prototype of the pattern,
        where the pattern would be classified.
     b. Then it compare the best matching cluster class with the pattern 
        class. If they match then the classification is a success, 
        otherwise it is a failure.
     c. The success/failure ratio is computed.
11. The test results are written on a file in /results/test which allow the
    user to check wich patterns have been successfully classified and which
    have not. The ratio is also added at the end of the file.

Reguarding our example output, the five first steps are carried out then the
training starts:

 - In the first pass every patterns are assigned to a prototype so the
   fluctuation rate is 100%.
 - In the second pass about 83% of the patterns are reassigned to a different
   cluster, the fluctuation rate is about 83% (this high value is due to the
   noise we add to the atterns, without any noise it would have be something
   like 5%).
 - And so on until...
 - The program reach the sixth passes but do not enter it. Instead it break
   the loop because we exceed the maximum number of passes.

After the training stage the results are written into the results files.
The main result file (in "/results/training/") contains every information 
about the network and the training stage.
The files under "/results/clusters/" describes each cluster of the network by
showing their prototype and listing their patterns (patterns which have
contribute to its prototype).
When the program's done with the training, it starts testing with the testing
patterns set. The testing loop looks like:

 - Select the next testing pattern.
 - Find its best matching cluster.
 - Compare the best matching cluster class with the pattern class.
     - If they match, then the success variable is incremented.
     - Else, the fail variable is incremented.

And the testing loop will iterate on every testing patterns. At the end of 
the loop the ratio is printed and you can see that 396 testing patterns have
been classified in a relevant cluster, matching their classes. The 5 others
patterns have not been classified in a matching cluster.

B. RESULTS INTERPRETATION
-------------------------

The classification in the previous exemple has been willingly complicated by
adding noise to the patterns. Nonetheless, the neural network manage to reach
a very acceptable result. It has successfully recognized the edible state of
98.75% of the tested mushrooms. Among the 5 mushrooms which have been
wrongfully classified 3 have been classified as edible altough they are
poisonous (pattern 51, 134, 169) so there is less than a chance for two
hundred for you to eat a poisonous mushroom if you trust the program results.
Keep it mind that without noise the results could have been better.

6. PARAMETERS
=============

-t specifies the dataset (csv file) with the patterns to train.

-T specifies the dataset (csv file) with the patterns to test.

-f specifies the minimum fluctuation. The program will stop if the minimum
fluctuation is greater than the pass fluctuation. Default value is 5%.

-p specifies the maximum number of passes. The program will stop if the 
maximum number of passes is reached. In any case it is important to note that
the final clusters are the best clusters set (the clusters set which induce
the less fluctuation).

-n noise network parameter: percentage of noise to add to the training
patterns.

-N noise network parameter: percentage of noise to add to the testing
patterns.

-s indicate to the program that it must skip the first dataset column (it
should be the class of the pattern). This parameter must be used whenever you
call the program but the idea is to make it more flexible by passing a value
with the parameter -s indicating the columns of the classes (this is a todo).

-o specifies a prefix for the output result files. Default is "art".

-b specifies the beta parameter value of the network. It influences the 
number of cluster (with the vigilance parameter). Default is 1 (no 
influence).

-v specifies the network vigilance. It act as a pattern acceptance threshold
for the clusters. The higher the vigilance the pickier the threshold, the
higher the amount of clusters. The vigilance must be between 0 and 1. 
Default is 0.5.

TODO: save network datas in a database (clusters, patterns...) so that it 
 will be easier to train a network and test it later. Each network can be
 saved in a specific database which can be loaded later.

About

An Adaptive resonance theory simulator written in C

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published