Numerical Optimization as a Means to Symbolic Regression Program Synthesis
Thesis posted on 15.04.2014 by Brian M. Cerny
Over the years there has been increasing interest in probabilistically oriented Evolutionary Algorithms (EAs), but only recently have these innovative methods been collectively recognized and achieved an independent status. By eliminating the traditionally employed genetic operators, these probabilistic EAs have been forced to adopt an alternative approach, and in the case of Estimation of Distribution Algorithms (EDAs), probabilistic graphical models have become the favored substitute. In the first major contribution of this thesis, the proposal is made to utilize a probabilistic model that has previously been overlooked in the EDA literature, namely Hidden Markov Models (HMMs). Preferring not to completely abandon the biologically inspired genetic operations, however, the classical learning algorithms used to train HMMs are largely ignored; instead, Differential Evolution (DE) is used to evolve the underlying constrained numerical parameters of the chosen probabilistic model. The evolved HMMs are then used to generate likely Prefix Gene Expression Programming (PGEP) chromosomes which encode candidate solutions, and thus provide feedback to guide the proposed evolutionary search process. After the algorithm is described in detail, benchmarking results on a set of Symbolic Regression (SR) problems are reported, and this novel approach is compared to the original PGEP method.

Besides investigating probabilistic model-driven representations for SR, this thesis also investigates another problem that has plagued Genetic Programming (GP) and its derivatives such as PGEP: numerical Constant Creation (CC) in SR. Given a mathematical formula expressed as a tree structure, the leaf nodes are either variables or constants. Such constants are usually unknown in SR problems, and GP, as well as many of its derivatives, lacks the ability to precisely approximate these values.
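To give a concrete flavor of the first idea, the minimal Python sketch below uses DE (the rand/1/bin variant) to evolve the parameters of a toy HMM, scoring each candidate by the forward-algorithm log-likelihood of a fixed observation sequence. This is only an illustration under stated assumptions, not the thesis's actual implementation: the state and symbol counts, the observation sequence, the softmax reparameterization used to enforce the row-stochastic constraints, and all DE settings are assumptions made here for brevity.

```python
import math
import random

random.seed(0)

N_STATES, N_SYMBOLS = 2, 3
OBS = [0, 1, 2, 1, 0, 1, 2, 2, 1, 0]  # toy observation sequence (assumption)
# Flat genome: initial distribution, transition matrix, emission matrix.
DIM = N_STATES + N_STATES * N_STATES + N_STATES * N_SYMBOLS

def decode(vec):
    """Map an unconstrained real vector to row-stochastic HMM parameters
    via a per-row softmax (one way to handle the constraints)."""
    def softrow(xs):
        es = [math.exp(x) for x in xs]
        s = sum(es)
        return [e / s for e in es]
    i = 0
    pi = softrow(vec[i:i + N_STATES]); i += N_STATES
    A = [softrow(vec[i + r * N_STATES : i + (r + 1) * N_STATES])
         for r in range(N_STATES)]
    i += N_STATES * N_STATES
    B = [softrow(vec[i + r * N_SYMBOLS : i + (r + 1) * N_SYMBOLS])
         for r in range(N_STATES)]
    return pi, A, B

def log_likelihood(vec, obs=OBS):
    """Scaled forward algorithm: fitness of a candidate HMM."""
    pi, A, B = decode(vec)
    alpha = [pi[s] * B[s][obs[0]] for s in range(N_STATES)]
    ll = 0.0
    for t in range(1, len(obs)):
        c = sum(alpha)
        ll += math.log(c)
        alpha = [a / c for a in alpha]
        alpha = [sum(alpha[q] * A[q][s] for q in range(N_STATES)) * B[s][obs[t]]
                 for s in range(N_STATES)]
    return ll + math.log(sum(alpha))

def de(pop_size=20, gens=60, F=0.7, CR=0.9):
    """DE/rand/1/bin maximizing the HMM log-likelihood."""
    pop = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(pop_size)]
    fit = [log_likelihood(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = random.sample([j for j in range(pop_size) if j != i], 3)
            jrand = random.randrange(DIM)
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if (random.random() < CR or j == jrand) else pop[i][j]
                     for j in range(DIM)]
            tf = log_likelihood(trial)
            if tf >= fit[i]:  # greedy one-to-one replacement
                pop[i], fit[i] = trial, tf
    best = max(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]

best_vec, best_ll = de()
```

In the thesis's full pipeline the evolved HMM would then sample likely PGEP chromosomes, whose fitness on the SR task feeds back into the search; that generation step is omitted here.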
This is due to the inherently discrete encoding of GP-like methods, which are better suited to combinatorial searches than to real-valued optimization tasks. Several attempts have previously been made to resolve this issue, and the dominant solutions have been either to embed a real-valued local optimizer or to develop additional numerically oriented genetic operators. In the second and final major contribution of this thesis, an entirely new and unified approach to SR with CC is proposed. Again through the adoption of the robust real-valued optimization algorithm DE, constants and PGEP programs are simultaneously evolved, so that the values of the leaf nodes are approximated while the tree structure is itself changing. Experimental results from several SR benchmarks are presented and analyzed. The results demonstrate the feasibility of the proposed algorithm and suggest that exotic or computationally expensive methods are not necessary for successful CC.
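To make the constant-creation half concrete, the sketch below applies DE to approximate the unknown leaf constants of a single fixed expression template. This deliberately shows only the numerical half of the thesis's unified approach: there the tree structure evolves simultaneously with the constants, whereas here the structure is frozen so the role of DE is easy to see. The target expression, sampling range, and DE settings are all assumptions made for illustration.

```python
import random

random.seed(1)

# Toy SR task (assumption): the candidate tree encodes c0*x + c1*x**2 with
# unknown leaf constants c0, c1; samples come from the target 2.5*x - 1.3*x**2.
XS = [i / 10 for i in range(-10, 11)]
TARGET = [2.5 * x - 1.3 * x * x for x in XS]

def mse(consts):
    """Fitness: mean squared error of the template against the samples."""
    c0, c1 = consts
    return sum((c0 * x + c1 * x * x - t) ** 2
               for x, t in zip(XS, TARGET)) / len(XS)

def de(dim=2, pop_size=15, gens=80, F=0.6, CR=0.9):
    """DE/rand/1/bin minimizing the MSE over the constant vector."""
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    fit = [mse(x) for x in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = random.sample([j for j in range(pop_size) if j != i], 3)
            jr = random.randrange(dim)
            trial = [pop[a][j] + F * (pop[b][j] - pop[c][j])
                     if (random.random() < CR or j == jr) else pop[i][j]
                     for j in range(dim)]
            tf = mse(trial)
            if tf <= fit[i]:  # keep the trial if it is no worse
                pop[i], fit[i] = trial, tf
    best = min(range(pop_size), key=lambda k: fit[k])
    return pop[best], fit[best]

(c0, c1), err = de()
```

Because the error surface here is a convex quadratic in the constants, DE recovers them without any embedded local optimizer, which is the spirit of the thesis's claim that expensive auxiliary methods are not required for CC.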