NEAT was originally presented in the paper Efficient Evolution Of Neural Network Topologies [1], and was accompanied with an implementation of NEAT written in C++ [2]. Over time a number of additional implementations have been created by researchers and enthusiasts [3]. The precise details of the speciation method used in each implementation likely differ to some degree, but in general the method used in all implementations can be broadly described as laid out in this document.
A compatibility metric is defined that allows us to determine the compatibility between any two genomes. Compatibility is represented by a single non-negative value which by itself is effectively meaningless; its primary use is as a means of estimating relative compatibility between genomes, i.e. e.g. genome A is more compatible with B than C. The compatibility metric in Canonical NEAT (C-NEAT) pairs up connection genes with the same innovation ID (from the two genomes being compared). For matching gene pairs the distance between the genes is defined as the absolute difference in weight multiplied by a coefficient (`c_3`). Non-matching genes are categorized as either 'excess' or 'disjoint', and each of these two types contributes a fixed value towards the overall distance value, the fixed values are named `c_1` and `c_2` respectively.
The coefficients `c_1`, `c_2` and `c_3` allow scaling of the relative importance of different types of genome differences. E.g. if we consider weight differences to be irrelevant then we can set `c_3` to zero and the distance metric will become an indication of topology differences only.
N.B. In practice `c_1` and `c_2` are often set equal - no distinction is made between excess and disjoint genes.
There is a fairly significant caveat to make here, in [1] compatibility is defined as:
$$ D = \frac{{c_1}{N_e}}{N} + \frac{{c_2}{N_d}}{N} + {c_3}{W} \tag{1}$$
`N_e`, `N_d` - The number of excess and disjoint genes, respectively.
`N` - The number of genes in the largest genome.
`W` - The sum of absolute weight differences.
The contribution of disjoint and excess genes is divided by N and thus the total contribution of the these two parts of the equation is scaled to a value between 0 and 1 (there can only be N mismatching genes at most). Therefore the distance between genomes with completely different topologies is 1 (assuming c1 = c2 = 1), but the distance between genomes with the same topology and different weights is unbounded (unless c3 is set to zero). This would seem to run contrary to the motivation for a genetic distance metric and in fact the original C++ source code [2] uses a modified distance metric:
$$ D = {c_1}{N_e} + {c_2}{N_d} + {c_3}{W} \tag{2}$$
Distance is defined as the sum of the differences from the three types, with each type of difference having its contribution weighted by its own coefficient (`c_1`, `c_2`, `c_3`) as before.
Colin,
September, 2009