Saturday, 13 February 2016

My unconventional research career - from industry to academia and back again

I was recently asked to give a seminar at the Australian Meteorological and Oceanographic Society Early Career Researcher Forum on my research career to date. I got quite a good reception and thought that I would share my talk here, in the event that others might also gain something from my previous experiences.

Good afternoon,

Firstly I'd like to thank the organisers for giving me this opportunity to reflect back on my research career to date, and hopefully share some useful knowledge and experiences with you all.

Like the other participants I've been asked to give a talk on my career and to provide some advice to you all. Perhaps, the best advice I can give you, however, is not to listen to anyone's advice! Seriously, the most important thing you can do is to first try to figure out what it is you want out of your career and life in general. Next talk to as many people about their own careers as possible, not in search of career advice per se, but instead in search of career examples from which you'd like to replicate certain elements. The research sector globally is going through many changes at the moment, so any advice you're likely to receive may be outdated, and in any case filtered by the experiences of the people who are giving you that advice. So today I'm going to give you an overview of my somewhat non-standard research career, from industry to academia and back again. At the end of my talk I'll also provide some concrete tips on how to make the transition from academia to industry if that's what you're after.

I've worked in various sectors in industry and academia over the past 12 years, primarily in the fields of numerical simulation and data mining. My journey started with my undergraduate degrees in mechanical engineering and finance at the University of Western Australia. In 2001, at the end of my third year I undertook vacation employment with BHPBilliton in Port Headland in the north-west of Australia, working in the iron ore ship-loading division. At the end of the following year I did vacation employment with the Defence Science and Technology Organisation (DSTO) in Adelaide, undertaking computational fluid dynamics analysis of the next generation of missile propulsion systems. I found the science both challenging and interesting, and I also had an excellent mentor. After this summer vacation work period, I returned to university to complete my degree. I applied for a graduate position at DSTO, of which I was the preferred candidate, but at the last minute the funding was pulled from the position.

Following this I then applied for and attained a continuing graduate industrial research position at General Motors Holden Innovation. I worked there from 2003 to 2006, where I was essentially a liaison between production engineering and academia on developing virtual engineering solutions for the engineering design processes. For example, using numerical optimisation to minimise weight in suspension components as opposed to running many physical tests [1]. Despite the pay and conditions being good, I personally found my position not all that fulfilling, as I was an intermediary, and the “real” work was being done by our collaborators in academia and at the Victorian Partnership of Advanced Computing. I unsuccessfully tried to get back into DSTO on several occasions, and realised that completing a PhD would help me get back into that organisation.

During my time at Holden I met many academics as part of my work. I decided to do a PhD at the mechanical engineering department at the University of Melbourne (with co-supervision from Monash University) in the numerical simulation and reduced order modelling of aerofoil flows, applicable to the improved energy efficiency of wind turbines and aircraft. It was actually a double badged degree with the Université de Poitiers in France, where I worked for almost a year and a half with some fantastic people. During this project I also worked with some incredible people at Stanford University, the Universidad Politecnica de Madrid, and the University of Manchester. As an added bonus I also met my now wife at the Poitiers swimming pool, but that's another story! I completed my PhD in 2010, produced four journal papers, and also won a prize for my thesis [2].

Upon completion of my PhD I was offered 6 different post-doc positions, both here in Australia and overseas. One of the options was a 3-year contract role at the CSIRO Marine at Atmospheric Research group working on developing the next generation of climate models, to better inform policy makers on issues pertaining to climate variability and change. I was sold on the role immediately after meeting the impressive people with whom I would be working. It is interesting to note, that one of the other options was a continuing role at DSTO, the organisation that I had been trying to get back into for the previous 7 years. I worked very hard over a long period to get back into DSTO, however, at the point that the opportunity was offered to me, I didn't want it any more.

I worked at CSIRO from 2010 to 2013, during which time I learnt an immense amount from my co-workers. I solved the long-standing and important problem of resolution dependence in atmospheric and oceanic climate models, via the development of stochastic subgrid turbulence models [3]. As my contract at CSIRO was coming to an end, it became clear that even though I wanted to stay and the organisation wanted to keep me, budget cuts meant that this would not be possible unless I brought in some form of external funding. I unsuccessfully tried to get additional funding via the Discovery Early Career Researcher Award (of which CSIRO is no longer an eligible institution). After this I then gave seminars at various geophysical research groups in an attempt to identify any potential opportunities for co-funding between a partner university and the CSIRO for me to continue the research. Unfortunately all of the university groups I approached had their own funding problems and were struggling to keep hold of their own researchers at the time. I was reluctant to search for positions overseas, since my wife had already relocated for me once, made new friends here in Australia, and had also started making inroads into her own successful career as an educator.

I then approached my former PhD co-supervisor from the mechanical engineering department at Monash University, who was awarded Australian Research Council (ARC) funding to undertake the world's largest numerical simulation of an adverse pressure gradient turbulent boundary layer [4]. We organised a 3-year arrangement. I would be a Monash employee working on this extensive and challenging ARC project for half of my time. CSIRO would then pay Monash for me to continue the subgrid turbulence model development for climate simulations in the other half of my time. In addition, during the last two years of the arrangement I lectured a final year elective unit at Monash University on advanced aerodynamics and turbulence.

This arrangement was very productive, however, it took its toll. After two years of working every night and every weekend, and unsuccessfully applying for various lecturing positions, I was exhausted, burnt out, and felt that all my efforts in attaining a continuing academic role were futile. I have always prided myself on working on difficult and important problems, but this meant that my publication rate was less than some of my competitors. I have averaged two first author journal papers per year throughout my career, but as I've come to realise, to succeed in academia two papers a year is not going to cut it. What's more I realised that I'm not willing to work the way academia demands of me, in terms of many studies solving incremental problems, as opposed to fewer studies solving big, risky, and difficult questions. Another factor in my decision, was that my wife and I wanted to start a family soon. I felt that we needed some more stability, and I needed a better work/life balance. This made me search for other career options, and in doing so I realised that there are opportunities for challenging, interesting and meaningful work outside of academia.

For those of you toying with the idea of perhaps working in industry you might get something out of the steps I took over the past year. The first thing I did was to put a blog together communicating my previous research in lay terms, to serve as a portfolio of work to share with potential employers. I also created a website, youTube channel, Google+ page, twitter feed, and LinkedIn profile to build upon my online presence. I started attending a couple of after-hours industry based meetups in the field of data science, and made sure that after the seminars I got the contact details of at least one industry representative. I also started collecting online job descriptions of positions that I thought I might be interested in, and soon realised that many of the roles were looking for similar skills and experiences. Once I determined the skills I needed to develop I put two resumes together: one representing my current skill set; and another target resume that I would like to attain. I then made plans to address whatever short comings I deemed that I had. In addition, over the past year I have been doing some advising/consulting for start-up eHealth company, CurveTomorrow, using machine learning methods to build predictive models linking patient details, initial diagnoses and initial procedures to the probability of the patients returning to the hospital with future complications. This experience served as another example of my ability to quickly determine the important factors in a new field, and apply my mathematical and numerical skills to solve their problems.

As a result of these efforts I had job offers to work in the energy sector, management consulting and also for the global macro fund manager, AE Capital. I accepted the AE Capital role and have been working for them since the beginning of the year. The reason I chose AE Capital above my other options, was that I felt that I could bring some new ideas to their business, contribute in a meaningful way, and felt that I could learn much from their current team. I still publish research with my academic collaborators to contribute to the wider community where I can.

In summary, as I mentioned in the beginning, the best advice I can give you is first to figure out what you want, and talk to others in search of career examples to draw upon, as opposed to specific career advice. Above all try to work on things you love, and be good at it, so that you may have an opportunity to continue to work on the things you love in the future.

Post Script:
As of July 2017 the opportunity came up for me to return to CSIRO in a continuing position working on a long term decadal prediction project. Working in finance was interesting, I learnt a lot and met some great people. However, climate science is my passion and something I want to dedicate the rest of my life toward.

[1] Hermann, P., Ting, T. & Kitsios, V., 2005, Development of a fatigue analysis tool chain for automotive structural applications, Society of Automotive Engineering Transactions, Vol. 114, no 6, pp 522-530.[link]
[2] Kitsios, V., 2010, Recovery of fluid mechanical modes in unsteady separated flows, PhD Thesis, The University of Melbourne. [link]
[3] Kitsios, V., Frederiksen, J.S. & Zidikheri, M.J., 2015, Theoretical comparison of subgrid turbulence in the atmosphere and ocean, Nonlinear Processes in Geophysics Discussion, Vol. 2, pp 1675-1704. [link]
[4] Kitsios, V., Atkinson, C., Sillero, J.A., Borrell, G., Gungor, A.G., Jiménez, J. & Soria, J., 2016, Direct numerical simulation of a self-similar adverse pressure gradient turbulent boundary layer, International Journal of Heat and Fluid Flow, under review.

Saturday, 11 July 2015

Neural network classification of countries in the OECD

In the previous couple of posts I discussed the application of both regression (estimation of a continuous output for a given input sample) and classification (estimation of the class to which a given input sample belongs) machine learning methods with application to socio-economic data. Specifically multi-dimensional linear regression was used to estimate the average life expectancy for a particular country from a series of continuous socio-economic factors. Logistic regression was also used to classify if a country belonged to the OECD or not in a particular year. Here I will illustrate a way to build predictive classification models using artificial neural networks (ANN) machine learning algorithms.

The formulation of ANNs is inspired by the structure of the brain. They are represented as a network of interconnected neurons, and are used to learn both regression and classification functions from data samples comprising of potentially a large number of input variables. This general approach has produced state-of-the-art results in computer vision, speech recognition and natural language processing [1,2,3]. ANNs are also referred to as feed-forward neural networkmulti-layer perception, and more recently deep networks/learning. Deep learning has come to represent applications in which the amount of feature engineering is minimised, with the machine learning tasks achieved through multiple learned layers of neurons. Each neuron undertakes a weighted sum over a series of inputs then applies a nonlinear activation function producing an output value, as illustrated below. For regression problems the rectified linear unit (ReLU) is a popular activation function. For classification problems the sigmoid or hyperbolic tangent functions as used in logistic regression are the most common activations functions.

The input variables in a ANN are introduced into an input layer. These inputs are fed into a hidden layer of neurons. The outputs from the first hidden layer may then be fed into a second hidden layer. There may well be many hidden layers, with each layer learning more complex and more specialised behaviour. The penultimate hidden layer feeds its outputs to the final output layer. This typical structure of an ANN is illustrated below. Each circle represents a neuron, and the lines represent the interconnections. For a given network, the credit assignment path (CAP) is the path of nonlinear functions from an input to the output. The length of the CAP for a feed-forward neural network is the number of hidden layers plus one for the output layer. In recurrent neural networks a signal may pass through a given layer on multiple occasions, which means the length of the CAP is unbounded.

The weights in each neuron are learnt from the training data, such that they minimise a specified cost function. The cost function comprises of a loss (or error) function element that quantifies the difference between the value estimated by the ANN and the true value. For regression problems one would use a least squares error measure, whilst for classification problems a log-likelihood (or information entropy) measure is more appropriate. The cost function may also contain some additional regularisation terms which specify the relative importance put on minimising the amplitude of the model weights.

In ANNs the cost function is typically minimised using the back-propagation method. In the standard back-propagation method the weights are updated using some form of convex optimisation method, such as the gradient decent algorithm. In the gradient decent algorithm one first calculates the gradient/derivative of (change in) the cost function with respect to the weights, with the gradient averaged across all samples in the training set. The value of each weight from the previous iteration is then updated by adding to it the cost function gradient multiplied by a specified constant rate of learning. Here I adopt the mini-batch stochastic gradient descent variant, where the cost function gradient is averaged over only a small batch of samples in the training set, as opposed to all of the available samples. The larger the batch size the better the estimate of the gradient, but the more time taken to calculate the gradient. The batch size that returns the model coefficients with the least amount of computational resources is problem dependent, but typically ranges in order from 1 to 100 [4]. A very good explanation and illustration of the back-propagation method and gradient decent algorithm can be found in [5].

There are various frameworks available for creating ANN applications including:
  • Theano is a python library developed at University of Montreal, which calculates the partial derivatives required for the back-propagation algorithm using symbolic mathematics [4,6,7].
  • Torch is a deep learning framework developed using the Lua programming language, and is currently being co-developed and maintained by researchers at facebook, google deepmind, twitter and New York University.
  • DeepLearning4J is an open-source, parallel deep-learning library for Java and Scala, which can be integrated with Hadoop and Spark databases management systems.
  • Caffe is a C++ deep learning framework developed at Berkeley University with python and matlab interfaces.

As an illustration I have repeated the OECD classification problem solved in the previous blog post using logistic regression. To briefly recap the goal is to build a classification model determining if a particular country in a particular year is in the OECD on the basis of various socio-economic measures including: GDP; average life expectancy; average years spent in school; population growth; and money spent on health care amongst others. Here the classification problem is solved using a ANN with 12 input variables, one hidden layer comprising of 24 neurons, all of which feeding into one output neuron. A hyperbolic tangent activation function is used for the hidden layer neurons, and a sigmoid activation function for the output neuron. The model weights are randomly initialised using a uniform probability distribution. The code is written in python using the theano library.

The ANN is calculated over the training data set using L2 regularisation, with the regularisation parameters ranging from 0.001 and 0.1. The predictive performance measures of precision (P = [True positive] / [True positive + False positive], recall (R = [True positive] / [True positive + False negative]) and F1 score (2P*R/(P+R)) are calculated for each regularisation parameter over the training and validation data sets and illustrated in the following figures. As expected each performance measure of is greater for training data set than the validation data set across most regularisation levels.
The ANN with the greatest F1 score over the validation data set has a regularisation level of 0.001. The learning curves illustrating the F1 curve as a function of the number of samples used to train this particular model is illustrated below. The fact that the learning curves of the training and validation data sets do not converge indicate that the predictive performance could be further improved with additional data.
One final note, ANNs are biologically inspired learning algorithms, but the mathematical form of the neurons is far simpler than the behaviour of biological neurons. There are a range of methods that aim to more closely replicate the learning processes in the brain. The human brain actually consists of several levels (of which neurons are only one), each of which serve a slightly different purpose in the learning process [8]. Algorithms that attempt to replicate this process are refereed to as cortical learning algorithms.

[2] Deng, L.; Li, Xiao (2013). "Machine Learning Paradigms for Speech Recognition: An Overview". IEEE Transactions on Audio, Speech, and Language Processing.
[3] Socher, Richard (2013). "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank"
[4] LISA lab (2015), "Deep Learning Tutorial", University of Montreal.
[6] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley and Y. Bengio. “Theano: new features and speed improvements”. NIPS 2012 deep learning workshop.
[7] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley and Y. Bengio. “Theano: A CPU and GPU Math Expression Compiler”Proceedings of the Python for Scientific Computing Conference (SciPy) 2010. June 30 - July 3, Austin, TX

Sunday, 14 June 2015

Supervised machine learning – classification of countries in the OECD

To recap from the previous post supervised machine learning aims to build predictive models linking a series of inputs to a series of outputs, as opposed to unsupervised machine learning which builds descriptive models representing the data at hand. There are two types of supervised machine learning: regression, that predicts a continuous number for a given input sample; and classification that predicts to which class (or group) a particular sample belongs. Here we will be concentrating on classification.

In the previous post, I built a regression model predicting the average life expectancy of a country from various socio-economic input factors. As an example of classification, I will use this same data set to determine if a particular country in a particular year is in the OECD (that is the group of countries belonging to the Organisation for Economic Co-operation and Development) or not. The list of input factors/fields (with the associated variable name in parentheses) include: 
  • Life expectancy (lifeExp)
  • GDP per capita (gdpPC)
  • Total per capita spend on health (healthPC)
  • Government per capita spend on health (healthPCGov)
  • Births per capita (birth)
  • Deaths per capita (death)
  • Deaths of children under 5 years of age per capita (deathU5)
  • Years women spend in school (womenSchool)
  • Years men spend in school (menSchool)
  • Population growth rate (popGrowth)
  • Population (pop)
  • Immigration rate (immigration)
  • Flag stating whether a country in a particular year is the OECD (oecd).
The data was downloaded from the Gapminder and OECD websites, data wrangling was done using pandas, machine learning undertaken using scikit-learn, and visualisations developed using matplotlib and seaborn.

For clarity I have visualised the data for the key factors determined in the previous regression study against the OECD flag variable. If a country in a particular year is in the OECD the variable is given a value of oecd=1, if it is not in the OECD is has a value of oecd=0. Let us first consider the subplot below highlighted by the black box, which is the per capita death rate of children under the age of 5 (deathU5) versus the OECD flag. One can see that only countries with a low infant death rate are in the OECD, however, there are also certain countries with low infant death rates that are not in the OECD. By inspecting the remaining subplots one can also say that in general countries in the OECD have higher life expectancies, lower birth rates and spend more money on health care. The supervised machine learning task is to use these inputs fields (and the others listed above, but not visualised here) to predict whether or not a particular country in a particular year is in the OECD.
There are various classification algorithms that one could use including: logistic regression; support vector machines (also known as large margin classifiers); k-nearest neighbours; or neural networks / deep learning (the subject of the following post). In this example I will be adopting logistic regression [1], which essentially aims to fit to the data an "S" shaped curve, called the sigmoid" (or logistic) function, as opposed to the linear function used in the previous regression example. The inputs to the sigmoid function are again the feature variables and model parameters, however the output is now bounded by 0 and 1, and can be interpreted as being the probability of a particular sample belonging to the class at hand. In the present example the output is the probability of a country being in the OECD. If the probability is greater than 0.5 then it estimated to be in the OECD, if the probability is less than 0.5 it is estimated to not be in the OECD. In general, however, this threshold of 0.5 can be modified to control the predictive performance of the model.

Before building the model, each of the features are first standardised, by subtracting away the mean and then dividing by the standard deviation. To address underfitting (high bias) and overfitting (high variance) we first break the available data into a training data set (consists of 60% of the samples) to build the model, a cross validation data set (20% of the samples) to select the optimal regularisation level/hyper-parameter, and a test data set (remaining 20% of the samples) to determine the performance of the optimal model. I reduce the complexity of the logistic regression model by applying an L1 regularisation on the model parameters, which penalises candidate models for having large magnitude coefficients, with the penalty proportional to the regularisation hyper-parameter. This concept is discussed further in the previous post with respect to regression. The stronger the regularisation level the more simple the final model, and the weaker the regularisation the more complex the model becomes.

The quantification of the error in classification is not as straight forward as in regression studies, particularly for skewed data in which there may be many more negative samples (not in the OECD) as opposed to positive samples (in the OECD) or vice-versa. In regression studies the cost function is simply the summed squared error between the model prediction and the true value. In classification studies there are four types of prediction outcomes:

  • true positive - a positive result (in OECD) is predicted for a positive event (in OECD)
  • true negative - a negative result (not in OECD) is predicted for a negative event (not in OECD)
  • false positive (or a Type I error) - a positive result (in OECD) is predicted for a negative event (not in OECD)
  • false negative (or a Type II error) - a negative result (not in OECD) is predicted for a positive event (in OECD)
Associated with these outcomes are two measures: precision; and recall.

Firstly, precision (P) is the proportion of correctly predicted positive events to the total amount of events predicted as positive (True positives / [True positives + False positives]). The precision versus regularisation level is illustrated in the figure below. For all plots in this post the blue dots represent the training data set, and the red dots the cross validation data setTypical of machine learning studies, the precision (inversely proportional to the generalisation error) of the training data set is greater than the precision of the cross validation data set for all regularisation levels. As the regularisation level increases the precision of the training data set tends upwards, particularly for very small regularisation levels as the model becomes increasingly complex. This means that the countries that are predicted as being in the OECD are in the OECD the vast majority of the time. 
Recall (R) is the proportion of correctly predicted positive events to the total amount of actual positive events (True positives / [True positives + False negatives]). Typically the greater the precision the lower the recall. We can see from the figure below, for both the training and cross validation data sets as the regularisation level decreases and the model becomes more complex, the recall of the model reduces (as the precision increases). This means that while the countries that are predicted as being in the OECD are in the OECD the vast majority of the time (high precision), the model is also classifying many countries as not being in the OECD when in fact they actually are (low recall)For logistic regression one can also trade off the precision and recall measures against each other by modifying the threshold probability (set to 0.5 here) between positive and negative events. This effect is discussed in more detail in [2].
One way of combining the precision (P) and recall (R) measures of model performance is the F1 score, given by 2*P*R/(P+R). This measure is illustrated below over the same range of regularisation parameters. The optimal model is defined as the one with the highest F1 score in predicting the cross validation data set below. This occurs for a regularisation parameter of 1.3.
The next stage is to generate learning curves to determine the sensitivity of the error of the optimal model to the number of samples used to build the model. If sufficient data has been used to build the model, then the performance measures of the model over the test and cross validation sets should converge. As the number of samples used to build the model increases, the performance measures of the test data set should decrease (generalisation error increase), whilst the performance measures of the cross validation data set should increase (generalisation error decrease). This appears to be the case from the precision learning curve illustrated below. The convergence is less clear, however, when inspecting the associated plots for the recall and F1 score measures.
The dominant model coefficient in the optimal model are associated with the average life expectancy (lifeExp), government spend on health care (healthPCGov) and the infant death rate (deathU5)The ability of the optimal model is found to have an F1 score of approximately 0.9 for the training, cross validation and test data sets. To improve the predictability of the classification model (and also the regression model in the previous post) one could adopt a more complex unsupervised machine learning method, such as neural networks. This will be the subject of the following post.

[1] Cox, D.R., 1958, "The regression analysis of binary sequences", J Roy Stat Soc B, Vol. 20,  215–242.
[2] Ng, A., 2015, Course in Machine Learning, Stanford University,

Monday, 11 May 2015

Supervised machine learning - regression of life expectancy

In this post I will use machine learning to determine the key factors driving life expectancy in the countries of the world throughout history. Firstly, I will provide a definition of machine learning and an overview of the available classes of methods following the descriptions outlined in [1]. In general machine learning algorithms learn a model that either predicts or describes features of a particular system from a series of input (and potentially output) samples, by minimising some error measure (generalisation error). In this context, features are the variables (or dimensions) that describe the system, and a sample is one observation of that system. Typically features would be the columns in a database, and the samples would be the rows.

There are four main machine learning categories:
  • Supervised (predictive), in which both inputs and output features are known.
  • Semi-supervised (predictive), in which some of the data used to train the models is either noisy or missing.
  • Unsupervised (descriptive), in which only the inputs are available.
  • Reinforcement (adaptive), where feedback is applied via a series of punishments and rewards based on how well the algorithm performs in a given environment. Feedback control systems are one such example, as discussed in the final chapter of my PhD thesis.
In supervised and semi-supervised machine learning there are only two possible tasks:
  • Regression - a continuous number is estimated from the input values. Specific techniques include multi-dimensional linear regression and its variants. Recursive Bayesian Estimation, as discussed in a previous blog post, can be considered as an example of a semi-supervised regression algorithm.
  • Classification - a discrete quantity is estimated, determining the class to which a particular sample belongs. For example, determining the species of flower from the length and width of its petals, or voice to text conversion. Specific techniques include logistic regression and support vector classifiers. In semi-supervised classification, several classification functions may need to be learnt to account to different combinations of missing inputs.
In unsupervised learning a new descriptive representation of the input data is sought. There are various possible tasks including:
  • Data decomposition / dimension reduction - the data is decomposed into a series of modes comprising of different combinations of the features/variables that can be used in combination to reconstruct each of the samples. Example decompositions include Fourier, wavelet and principle component analysis as discussed in a previous blog post.
  • Clustering - define a finite set of classes (or categories) to describe the data. The classes can be determined on the basis of the feature centroids (eg: K-means), assumptions of the data distribution (eg: Gaussian Mixture Models) or on the density of the clustering of the points (eg: mean shift)
  • Density estimation – estimates the probability density function (PDF) from which the samples are drawn. Examples methods include: histograms in which the feature dimensions are discretised and samples are binned creating vertical columns; and kernel density estimation in which samples contribute to a Gaussian distribution locally producing a smoother final PDF.
  • Anomaly detection – determining which samples in the existing data set are outliers. This can be based on statistics (i.e. how many standard deviations away from the mean), or on the multi-dimensional distance between samples.
  • Imputation of missing values – Estimate the missing values in a data set. This can be undertaken by conditioning the PDF on the available values, replacing the missing values with either the mean or median of the existing values, replacing with a random selection from the range of possible values, or interpolating in time between samples for time series data.
  • Synthesis and sampling – generate new samples that are similar, but not identical, to the training data set. An example is the generation of initial conditions for ensemble numerical weather prediction.
In the current example we will adopt supervised machine learning (specifically multi-dimensional linear regression) to build a predictive model of life expectancy. Potential key socio-economic factors (with the associated variable name in parentheses) include: 
  • Life expectancy (lifeExp)
  • GDP per capita (gdpPC)
  • Total per capita spend on health (healthPC)
  • Government per capita spend on health (healthPCGov)
  • Births per capita (birth)
  • Deaths per capita (death)
  • Deaths of children under 5 years of age per capita (deathU5)
  • Years women spend in school (womenSchool)
  • Years men spend in school (menSchool)
  • Population growth rate (popGrowth)
  • Population (pop)
  • Immigration rate (immigration)
Each of these features were downloaded from the Gapminder website. The following data mining exercise is undertaken using the python eco-system. The wrangling of the input data was done using pandas, the machine learning tasks undertaken using scikit-learn, and the visualisations developed using matplotlib and seaborn.

The first step is to calculate the correlations between each of the fields to determine likely important factors. In the figure below red indicates features that are strongly positively correlated, blue strongly negatively correlated, and grey weakly correlated. We find that the life expectancy is weakly correlated with population growth, population, and immigration rate. We also find that total health spend per capita, and government health spend per capita are similarly correlated with other fields. Likewise, the number of years spent in school by women and men, are similarly correlated with the other fields.
The next step is to build a regression model linking the life expectancy to the other features. Each feature is first standardised, by subtracting away the mean and then dividing by the standard deviation. When developing any form of model from data it is important to address under and over fittingUnderfitting (high bias) of the data occurs when the learning function is not complex enough to represent the observations. Overfitting (high variance) occurs when the learning function is too complex. We address these aspects by first breaking the available data into the following three sets:
  • training data set, typically consists of 60% of the total amount of data available, which is used to first build the model for a given set of fitting parameters;
  • cross validation data set, consisting of 20% of the data, over which the error prediction is calculated, with the optimal model parameters determined as the model of minimum error; and
  • test data test, consisting of the remaining 20% of the data, which is used to determine the performance of the optimal model.
An excellent discussion on this concept can be found in [3].

I use the LASSO regression method [2] to reduce the complexity of the model as far as possible by regularising the regression coefficients. Regularisation is a means of reducing the capacity / complexity of the learning function. It can be thought of as a mathematical representation of Occam's razor, or the reductionist idea that among equally well performing candidate models, one should select the least complex. The stronger the regularisation level, the greater the tendency to remove features that do not contribute to the improved prediction of the model. The squared error versus regularisation level is illustrated in the figure below. As is typical of machine learning studies, the generalisation error of the training data set decreases as the regularisation parameter decreases, or equivalently as the model becomes more complex. The generalisation error of the cross validation data set initially decreases with increasing complexity and reduced underfitting, until it reaches a minimum error, after which the error increases due to overfitting. To address underfitting (high variance) one can either: increase the regularisation; fit to less features; or get more training dataTo address overfitting (high bias) one can either: decrease the regularisation; or fit to more features, including nonlinear combinations of features. The regularisation parameter can be considered as a hyper-parameter, in that it is a parameter that defines how the model is fit to the data, as opposed to the model parameters that define the model itself.
The next stage is to generate learning curves, that is determine the sensitivity of the error of the optimal model to the number of samples used to build the model. As illustrated below, and again typical of machine learning studies, as the number of samples used to build the model increases, the generalisation error of the training set increases as it attempts to represent more observations. At the same time the generalisation error of the cross validation set decreases as the model becomes more representative of reality. An indication that sufficient data has been used to build the model, is when the error measures calculated from the two data sets are equal. The error of the training data set is the lower limit. If this error is deemed too high, additional data will not help the future predictability of the model. In this scenario to improve accuracy one must either reduce the regularisation, add in additional input features, or use more complex regression methods (eg: multi-layer neural networks).
The ability of the model with the lowest squared error to predict the cross validation data set, is then used to predict the test data set, as illustrated below. Each green dot represents a country in a particular year. A perfect model would have all of the points lying along the black line. This model did not use the features of immigration, years women spent in school, and government health spend. The last two features were not required as they are strongly correlated with years men spent in school and total health spend, respectively. In addition the regression coefficients for the features of population, population growth, years men spent in school, GDP and death rate were also orders of magnitude smaller than the coefficients of the dominant features. 
The correlation coefficients of the remaining key features are illustrated below. This indicates that life expectancy is most strongly related to child death rate, with a strong negative correlation of -0.93. This makes sense, since the deaths of very young people would bring down the average life expectancy of the entire country. We also find that the birth rate is higher in countries with higher child death rate, with a strong positive correlation of 0.84. Finally the child death rate is lower in countries with higher health spend per capita, with a negative correlation of -0.4.
When inspecting scatter plots of the key features, the relationship between child death rate and health spend per capita is actually far stronger than the correlation coefficient of -0.4 initially suggests. In the scatter plot highlighted in the figure below you can see there is a very tight arrangement of the samples, with the child death rate dropping rapidly past a threshold health care spend. The relationship is a very strong, but not a linear one, which is why the correlation coefficient was not excessively high. To improve the predictability of the model one could perform a transformation linearising the data prior to undertaking the regression, or adopt a nonlinear regression method such as neural networks. This will be the subject of a future post. For completeness, histogram estimates of the probability density functions are illustrated in the plots along the diagonal of the figure below.
In summary the best way to improve the life expectancy of a country is to reduce the child death rates, which has a very strong relationship to the money spent on health care.

[1] Bengio, Y., Goodfellow, I. J. & Courvill, A., Deep Learning, book in preparation for MIT press,
[2] Tibshirani, R. 1996, Regression shrinkage and selection via the lasso, J. Roy. Soc. B., pp267-288.
[3] Ng, A., 2015, Course in Machine Learning, Stanford University,

Friday, 6 March 2015

Calibration of Computationally Expensive Simulations via Response Surface Optimisation

In this post I will provide a general definition of optimisation, and an overview of some popular algorithms. I will then go through an example on the use of response surface optimisation to determine the turbulence model parameters that minimise the error between a computational expensive fluid flow simulation of a submarine, and associated experimental measurements. We first presented this work in [1].

Optimisation is the means of determining the design parameters that minimise a specified cost function, under a set of constraints. For example if we want to minimise the noise of a fan, then the cost function would be the decibel level of the noise generated by the fan, and the design parameters would include the width, length, thickness and material type of the fan blades. There may also be a constraints put on the weight and sharpness of the components for occupational, health and safety reasons. Constraints limit the allowable combination of design parameters, and typically result in a lower performance design had the constraints not been in place. The cost functions can also be multi-objective. For example, in addition to minimising the noise of the fan, one may also want to minimise the time, energy and financial cost required to manufacture the components. Typically the designer must determine the relative importance of each of the objectives to combine them into one cost function.

Optimisation methods can be classed as two different types: gradient based methods; and non-gradient based methods. If the design / parameter space for a given cost function is convex (continually decreasing with only one minimum) then gradient based methods are an extremely efficient way of determining the optimal solution. In these methods one estimates how the cost function changes with the design parameters, and this information is used to select the next combination of parameters to set. However, if the cost functions are not convex and are highly non-linear, then gradient based methods have the potential to get stuck in local minima and produce sub-optimal designs. In this case non-gradient based methods are preferred.

One of the more popular non-gradient based methods is evolutionary optimisation. These methods attempt to evolve an optimal design through natural selection, or survival of the fittest. This is achieved through breeding a set of parents to produce a set of children, and keep only the fittest children (low cost function) for further breeding [2]. Each parent and child have a DNA gene sequence defined by the design input parameters. Breeding is undertaken on the basis of mutation and cross-over (gene swapping). There are two main incarnations that utilise the above approach, genetic algorithms (GA) and evolutionary algorithms (EA). GA were development first, with the parameters encoded into binary numbers, which are essentially treated as genes in the reproductive process. In the mutation process one of the digits in the binary coding is randomly changed from a 1 to a 0 or visa-versa. In the cross-over process to genes are split at a random point and interchanged. EA are philosophically the same, however, the crossover and mutation steps are performed on the design parameters directly, without any binary coding and decoding processes. In either case the breeding process is continued over multiple generations until a sufficiently good solution is found or computational resources are exhausted.

Evolutionary approaches can achieve a very good solution in a reasonable time, however, they require many concurrent evaluations of the cost function [2]. This is not reasonable for cost functions that are very computationally expensive. In this case Response Surface Models (RSM) can be utilised to build an estimate of the parameter space from a limited number of cost function evaluations. The minimum of the RSM is then obtained either numerically or analytically, and the cost function is then re-evaluated for the estimated optimal parameter set. The RSM is updated with the additional cost function evaluation, and the new minimum determined. This process is repeated until there is a sufficiently small difference between the RSM estimate of the cost function minimum and the value obtained from the cost function itself. The only subjective choice is the functional form of the RSM. Popular choices are splines and Kriging [3]. The latter is adopted in the following example. Kriging has the added feature of providing error estimates of the RSM throughout the design space.

As mentioned at the beginning of the post, the optimisation task here is to determine the turbulence model parameters that minimise the error between an expensive computational fluid dynamics (CFD) simulation of a submarine, and associated experimental measurements (lift force, drag force and yaw moment). Slices of the velocity field are illustrated with respect to the submarine body below.

For this particular turbulence model there are two key parameters βi and α. We, therefore, have a two-dimensional parameter space. Reasonable limits for these two parameters are set, with 9 evenly spaced values are selected along each dimension, producing 9x9=81 design candidates. The CFD simulation was run 81 times - once for each of the design candidates – with the lift force, drag force and yaw moments calculated from the flow field. The multi-objective cost function is the average squared error between the forces calculated from the CFD and the forced measured experimentally. The initial RSM of the simulation error is illustrated below, where blue contours are low error and red contours are high error. The symbols in the figure below represent the initial 81 evaluations of the cost function.

The minimum of this RSM is determined along with the associated values of βi and α. An additional CFD simulation is run with these particular values of βi and α. The cost function (error in the forces between the CFD simulation and the measurements) is evaluated and the RSM is updated. This process continues until the difference between the RSM of the cost function and the evaluated cost function is sufficiently small. The final RSM with the additional cost function evaluations are illustrated below.
In addition to determine the optimal solution, the RSM also provides an understanding of how the cost function changes throughout the design space, which evolutionary methods do not. RSM are also capable of answering multiple questions at once. For example in a multi-objective optimisation study if the relative importance of the objectives change, then a new RSM can be built using the existing simulation results. Further refinement of the RSM would in all likelihood be required to find the new optimal solution for the new cost function. If one was to use an evolutionary method, then the optimisation process would have to be run from scratch.

The RSM optimisation method can also be used to determine the optimal hyper-parameters used in the data modelling process, such as regularisation parameters used in regression and classification studies. 

[1] Chng, T. S., Widjaja, R., Kitsios, V. & Ooi , A., RANS Turbulence Model Optimisation based on Surrogate Management Framework , Australasian Fluid Mechanics Conference , Queensland University, Brisbane, Australia , 3-7 December, 2007 .
[2] Fogel, D., 1994, An introduction to simulated evolutionary optimization, IEEE transactions on neural networks, Vol. 5, pp 3-14.
[3] Jones, D. R., 2001, A taxonomy of global optimization methods based on response surfaces, J. Global Opt., Vol. 21, pp 345-383.