Sunday 28 September 2014

Scientific Visualisation - from 1's and 0's to pretty pictures

What is Scientific Visualisation? It is actually far more than just pretty pictures. It's a means of representing vast amounts of data in a manner that we can better appreciate and understand. I'm going to run through an example of scientific visualisation by generating an image of a three-dimensional turbulent flow over an aerofoil surface (e.g. an aircraft wing or wind turbine blade) at one instant in time. By then repeating the process at multiple instants we can generate a movie illustrating how the turbulence evolves in time. I produced the data used in this blog post during my PhD, using the Stanford University Center for Turbulence Research code, CDP [1]. You can find further detail on the data in [2,3].

The software tool I will be using to visualise this data is the open source code, Paraview. It is a graphical user interface to VTK (www.vtk.org), a C++ library (with Python and Tcl wrappers) that defines how to read and write the VTK file format and performs the visualisations described below (and much more). Paraview has also been written to use multiple central processing units (CPUs) at once to visualise massive data sets that could not ordinarily be visualised using only one CPU. If you like you can download the necessary software and data and follow the steps outlined below.

Step 1) If you do not already have it installed, download and install the freely available open source software Paraview from www.paraview.org.

Step 2) Download the required VTK files. You will need two files:
aerofoil_surface.vtk - this VTK file defines the aerofoil surface geometry.
aerofoil_data.vtk - this VTK file contains the flow field data around the aerofoil surface. It is not the complete flow field, but a small subset of the flow over just the front portion of the aerofoil. In this file you will find the velocity components (vector u) in all three dimensions, and the pressure field (scalar p). In addition, you will also find the invariants of the velocity gradient tensor P, Q, R and the discriminant D, which can be used to define the boundary of a vortex [4].
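
For reference, these quantities are usually defined (following the critical-point framework of [4]) via the characteristic equation of the velocity gradient tensor. A minimal statement of the standard definitions, which may differ slightly in normalisation from the convention used in these particular data files, is:

    A_{ij} = \partial u_i / \partial x_j, \qquad \lambda^3 + P\lambda^2 + Q\lambda + R = 0,
    P = -\mathrm{tr}(A), \qquad Q = \tfrac{1}{2}\left[ P^2 - \mathrm{tr}(A^2) \right], \qquad R = -\det(A),

with D the discriminant of this cubic. Where the discriminant indicates complex eigenvalues of A, the local streamlines spiral about the point, which is why iso-surfaces of Q or D are used to mark vortex boundaries.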

Step 3) Open Paraview, go to File->Open and select the aerofoil_surface.vtk file. You will need to click the green "Apply" button in the "Properties" tab. You should now be able to rotate and zoom into the aerofoil surface.

Step 4) Again go to File->Open and this time load the aerofoil_data.vtk file, remembering to click the "Apply" button. You should now see the subset of the flow field positioned relative to the aerofoil surface. This is a three-dimensional volume of discrete points. You can change the "Surface" drop-down menu to "Wireframe" or "Surface With Edges" to see the individual discrete points.

Step 5) Now go to Filters->Alphabetical->Contour. Just make sure that you still have the aerofoil_data.vtk file selected in the pipeline browser. In the "Properties" tab change the pressure variable "p" to the variable "Q". Also change the isosurface value below the subsection "Value Range" to 250, and click "Apply". This should generate a series of surfaces all of which have a value of Q=250. This makes it easier to visualise what is going on inside the three-dimensional volume.

Step 6) At the top of the program interface, change the colour-by drop-down menu from "Solid Color" to "vorticity". You can also play around with the colour mapping properties on the right hand side and rotate the object to get it to look just the way you want.

Paraview also has the functionality to load multiple files from consecutive instants in time all at once. It will repeat the visualisation process outlined above for each individual file (representing each instant in time), and then generate a movie automatically. The movie generated for the associated aerofoil flow can be seen below. In this movie, however, I used the entire flow field instead of the small subset provided for you to play around with.


I also generated an animation for turbulent flow inside a channel, between a top and bottom wall. Here the iso-surfaces are of D, and are coloured by the velocity in the direction of the flow. I generated the data used for this visualisation using the code developed in [5,6]. You can find many other animations generated using this process on my YouTube channel.


Once you're familiar with the steps, you in fact do not need to use the Paraview interface at all. You can write a C++, Python or Tcl script to perform all of the operations outlined above automatically.
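
As a minimal sketch of what such a script might look like, the following uses ParaView's built-in Python interface (paraview.simple, which ships with ParaView and can be run with the bundled pvpython interpreter) to reproduce Steps 3 to 6 and save an image. The array names 'Q' and 'vorticity' are assumed to match those in the VTK files described above, and the output file name is arbitrary; adjust both to suit your own data.

    # Run with: pvpython visualise_aerofoil.py
    from paraview.simple import *

    # Step 3: load the aerofoil surface geometry
    surface = LegacyVTKReader(FileNames=['aerofoil_surface.vtk'])
    view = GetActiveViewOrCreate('RenderView')
    Show(surface, view)

    # Step 4: load the flow field data around the aerofoil
    data = LegacyVTKReader(FileNames=['aerofoil_data.vtk'])

    # Step 5: iso-surfaces of the second invariant Q at a value of 250
    contour = Contour(Input=data)
    contour.ContourBy = ['POINTS', 'Q']       # assumes a point array named 'Q'
    contour.Isosurfaces = [250.0]
    contourDisplay = Show(contour, view)

    # Step 6: colour the iso-surfaces by vorticity and save an image
    ColorBy(contourDisplay, ('POINTS', 'vorticity'))  # assumes an array named 'vorticity'
    Render()
    SaveScreenshot('aerofoil_Q250.png', view)

Running the same script over a list of files (one per instant in time), saving one image per file, is all that is needed to assemble the frames of a movie.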

The data used in this blog post is, by today's standards, quite small. Each file is approximately 117 MB in size for the small domain provided, and we used 512 such files (one per instant in time) to generate the movie, for a total database size of 59 GB. In my current work one instant in time contains 100 billion degrees of freedom, with each data file 400 GB in size. To make a movie with the same number of fields as outlined above, the total size of the database would be 204,800 GB = 200 TB = 0.2 PB.

Finally, the scientific visualisation methods described above are not only applicable to turbulence, but to many fields including medical imaging, socio-economic geo-spatial data, genetics, and in fact any data set that varies in time and space. So if you have time-varying spatial data (like the data presented here), you can generate movies automatically without having to do any frame-by-frame work as you would in traditional animation.

There is another field of visualisation that aims to present data that depends on more than 3 dimensions, as is the case in design applications. This, however, will be the subject of a future post.

References:
[1] Mahesh, K., Constantinescu, G. & Moin, P., 2004, A numerical method for large eddy simulations in complex geometries, J. Comput. Phys. 197, 215–240.
[2] Kitsios, V., Cordier, L., Bonnet, J.-P., Ooi, A. & Soria, J., 2011, On the coherent structures and stability properties of a leading-edge separated aerofoil with turbulent recirculation, J. Fluid Mech., 683, 395–416.
[3] Kitsios, V., 2010, Recovery of fluid mechanical modes in unsteady separated flows, PhD thesis, The University of Melbourne & Universite de Poitiers.
[4] Perry, A. E. & Chong, M. S., 1987, A description of eddying motions and flow patterns using critical-point concepts, Ann. Rev. Fluid Mech., 19, 125–155.
[5] Kim, J., Moin, P. & Moser, R., 1987, Turbulence statistics in fully developed channel flow at low Reynolds number, J. Fluid Mech., 177, 133–166.
[6] Del Alamo, J. C. & Jimenez, J., 2003, Spectra of the very large anisotropic scales in turbulent channels, Phys. Fluids, 15, L41–L44.

Sunday 29 June 2014

Inequality of Energy Consumption between Developing and Developed Nations

In a previous post I discussed the relationship between the increase in the population of the world, and the associated increases in energy consumption and carbon dioxide (CO2) emissions since the industrial revolution. In the current post I'll break down the contributions to population, energy use and CO2 from the developed and developing world over the past 20-year period. I've defined the group of developed nations as those currently in the Organisation for Economic Co-operation and Development (OECD). The current OECD countries are: Australia; Austria; Belgium; Canada; Chile; Czech Republic; Denmark; Estonia; Finland; France; Germany; Greece; Hungary; Iceland; Ireland; Israel; Italy; Japan; Korea; Luxembourg; Mexico; Netherlands; New Zealand; Norway; Poland; Portugal; Slovak Republic; Slovenia; Spain; Sweden; Switzerland; Turkey; United Kingdom; and the United States. The remaining nations are denoted as non-OECD. The data presented below was downloaded from the gapminder website [1], which is a collection of socio-economic data from various sources.

The first distinction between the developed (OECD) and developing (non-OECD) worlds is made on the basis of population. In the figure below the OECD population is illustrated by the orange line, the non-OECD world by the yellow, and the sum of the two, giving the total world population, by the blue line. The OECD population has remained relatively constant at 1 billion people. Over this same period the non-OECD world has increased from approximately 5 to 6.5 billion people. The majority of people live in the non-OECD world, and the recent population growth has also come from these nations.
There are approximately 4 times as many people in the developing world as in the developed world. However, as illustrated in the figure below, up until 2005 the developed world actually consumed more total energy. In all the figures presented in this post energy is quantified as billions of tonnes of oil equivalent, regardless of where the energy has come from. The OECD energy consumption has been relatively constant over the past 20 years, with the increase in the world's energy consumption coming from the non-OECD countries. This is due both to the increased standards of living in the non-OECD countries, and to the manufacturing of products for markets in OECD countries.


A measure of a nation's standard of living is the average energy consumed per person per year. This measure has remained relatively constant over the past 20 years for the OECD countries, with a slight increase in the non-OECD countries. The key observation from this figure is that the energy consumed per capita (or standard of living) is 4 times greater in the developed world than in the developing world.
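
For anyone who wants to reproduce this kind of split, the sketch below shows one way the aggregation and the per-capita measure could be computed. The file names (population.csv, energy.csv) and their layout, one row per country and one column per year, are assumptions for illustration and not the actual files used here.

    import pandas as pd

    OECD = {
        "Australia", "Austria", "Belgium", "Canada", "Chile", "Czech Republic",
        "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary",
        "Iceland", "Ireland", "Israel", "Italy", "Japan", "Korea", "Luxembourg",
        "Mexico", "Netherlands", "New Zealand", "Norway", "Poland", "Portugal",
        "Slovak Republic", "Slovenia", "Spain", "Sweden", "Switzerland", "Turkey",
        "United Kingdom", "United States",
    }

    # Hypothetical layout: one row per country, one column per year
    population = pd.read_csv("population.csv", index_col="country")
    energy = pd.read_csv("energy.csv", index_col="country")  # tonnes of oil equivalent

    is_oecd = population.index.isin(OECD)

    # Aggregate each year into OECD and non-OECD totals
    pop_groups = pd.DataFrame({"OECD": population[is_oecd].sum(),
                               "non-OECD": population[~is_oecd].sum()})
    energy_groups = pd.DataFrame({"OECD": energy[energy.index.isin(OECD)].sum(),
                                  "non-OECD": energy[~energy.index.isin(OECD)].sum()})

    # Energy consumed per person per year (the standard-of-living measure above)
    per_capita = energy_groups / pop_groups
    print(per_capita.tail())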


Considering only countries with more than 10 million people, the 13 nations with the highest level of energy use per capita in 2010 are illustrated in the figure below. The only two non-OECD nations in this list are Saudi Arabia and Russia, both of which have significant oil reserves. This list is in fact an indication of the countries with the highest standards of living in the world. It is not obvious at this point, however, how carbon intensive the consumed energy is, and hence what impact this high standard of living has on the environment.

For this list of countries presented in the same order, the CO2 emissions per person per year are illustrated in the figure below. Belgium and France are now reduced relative to the other countries, due to the higher percentage contribution of nuclear and wind power to their supplied energy.


In summary, 4 times as many people live in the developing world, but people in the developed world use 4 times more energy per person. It is worth noting that the impact this high standard of living has on the environment depends upon the source of the energy supply (i.e. fossil fuels, nuclear power, renewables).

References:
[1] www.gapminder.org

Sunday 25 May 2014

Accounting for Model Uncertainty - A Case Study in Climate Change

When developing mathematical models of reality, at the forefront of your mind must be the question you are aiming to answer. The model must be sufficiently complex to respond to this question. As an example we will look at the question of climate change attribution, that is:
  • Did human activities cause the global temperature increase over the past 100 years?
As I discussed in the previous post, over the past 100 years the human population has increased from 2 billion to 7 billion people, and our total energy consumption has increased over 10 fold. The burning of fossil fuels to meet this energy need has rapidly increased the carbon dioxide concentration in the atmosphere to a level higher than anything observed over the past 700,000 years. More carbon dioxide in the atmosphere enhances the greenhouse effect, which traps in more heat from the sun, and increases the temperature of the planet. The question remains, however: is the observed increase in temperature over the past 100 years due to man made carbon dioxide emissions, or would the Earth have heated up anyway? We do not have another Earth without man made carbon dioxide emissions to directly compare to, so to help us answer this question we rely on numerical simulations of a virtual Earth.

The diagram below outlines the general steps in developing a mathematical representation of any system. It is a simplified version of that presented in reference [1] below. One starts with "reality" in all its complexity and splendour. In order to answer the desired question it is not necessary to retain all of this complexity, and we can simplify our view of the world. For example we need to include the effect of heating from the sun, and the greenhouse effect resulting from carbon dioxide emissions from both natural processes and human activities. On the other hand we do not need to model the combustion process generating the emissions from each individual manufacturing plant, house and car. It is sufficient to use the measured carbon dioxide concentration in the atmosphere as an input. This simplified view is what is called the "conceptual model".
The next step is to represent our idealised view of the world with mathematics. This "mathematical model" is a set of equations that governs how the winds, rain and temperature of the Earth change in time and space. These are very complicated equations and do not have a simple analytical solution that one can write down on paper. We, therefore, need to develop a "numerical model", which breaks the Earth down into a series of discrete positions in space and time. For example we can break the Earth down into a series of horizontal boxes of size 100km, and in time solve the equations once every day or so. The greater the resolution, that is the smaller the boxes and the shorter the time step, the more accurate the "numerical model". The "numerical model" is essentially software that solves the discrete equations on a computer. In this particular application, the "numerical model" is referred to as a general circulation model, which was specifically discussed in my second post. The resulting "numerical solution" is the final data generated by the "numerical model", which we can later analyse and data mine.
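
To make the idea of a "numerical model" concrete, here is a deliberately tiny sketch: a one-dimensional diffusion equation discretised onto a row of grid boxes and stepped forward in time with an explicit finite-difference scheme. It is in no way a general circulation model; it simply illustrates how continuous equations become discrete operations on a grid at discrete times.

    import numpy as np

    # Toy 1D diffusion: dT/dt = kappa * d2T/dx2, on a periodic row of grid boxes
    nx, nt = 100, 500            # number of grid boxes and number of time steps
    dx, dt, kappa = 1.0, 0.1, 1.0

    T = np.zeros(nx)
    T[nx // 2] = 1.0             # initial temperature anomaly in the middle box

    for _ in range(nt):
        # approximate the second derivative from the neighbouring boxes
        d2T = (np.roll(T, -1) - 2.0 * T + np.roll(T, 1)) / dx**2
        T = T + dt * kappa * d2T  # explicit forward-Euler time step

    print(T.max(), T.sum())      # the anomaly spreads out; total heat is conserved

Halving the box size dx or the time step dt gives a more accurate (and more expensive) solution, which is exactly the resolution trade-off described above.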

At each step we make assumptions in order to represent "reality" by a "numerical solution". The potential error associated with each of these assumptions must also be assessed to determine whether one can respond to the original question given the amount of uncertainty. One way of doing this is to test the sensitivity of the final result to the assumptions. An example of quantifying this uncertainty is presented below for the climate attribution problem. The following graphs were generated by the Intergovernmental Panel on Climate Change (IPCC) effort published in reference [2] listed at the end of this post.

The black line in the figure below represents the observed increase in temperature of the planet from 1900. Each thin orange line represents the temperature resulting from one of 58 "numerical solutions" with different resolutions in space and time, generated by 14 different "numerical models" solving different specific equations using different numerical techniques. The thick red line is the average temperature of all of the cases. These simulations include the carbon dioxide originating from both natural processes and also from human activities (i.e. burning fossil fuels). As you can see there is quite a bit of variability between each of the 58 simulations, but on average they follow the observed changes quite well. Interestingly, the vertical grey lines identify major volcanic eruptions, which coincide with a temperature decrease in both the observations and the simulations. This is because the volcanoes send up so much dust, dirt and other particles into the atmosphere that they essentially block out the Sun for extended periods of time, which temporarily reduces the temperature.

In contrast, the figure below illustrates the temperature generated from simulations that do not include carbon dioxide from human activities. Each light blue line represents one of 19 "numerical solutions" from 5 different "numerical models", with the thick blue line the average. As you can see the agreement with the observations is not very good, with the general trend not following the observations. Whilst there is again a significant amount of variability between each of the different cases, not one of the simulations is able to reproduce the observed warming. So after accounting for the uncertainty, the only way to reproduce the observed warming over the past century is to include man made carbon dioxide emissions. In other words, the Earth would not have heated up without human involvement. This is an example of the numerical solutions having sufficiently high fidelity and sufficiently low uncertainty to answer the original question.

Globally averaged temperature is quite a robust signal. However, there is more model variability / uncertainty when asking whether a particular place will be hotter, colder, wetter or drier. For example, under a warming world there will be more evaporation, which means more water vapour in the atmosphere, which also means more rain. However, it is not clear if a particular place will receive more or less rain, as the patterns of rainfall are also likely to change. So it may well be raining more around the world on average, but it might be raining more over the ocean and less in certain places over land. This is precisely what is happening in the southern hemisphere. The weather patterns are changing such that the rain that was previously falling over the south-west of Australia has shifted further south, and it is now raining more over the ocean - see reference [3] below. It is very difficult to predict how this will change in the future. Each of the global circulation models may suggest similar increases in total global rainfall, but the rainfall patterns are very different. This means less agreement and more uncertainty in the predicted rainfall for a particular place in the world in the future.

References:
[1] B. H. Thacker, S. W. Doebling, F. M. Hemez, M. C. Anderson, J. E. Pepin, E. A. Rodriguez, 2004, "Concepts of Model Verification and Validation", Los Alamos Laboratory Technical Report.
[2] Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, 2007, Solomon, S., D. Qin, M. Manning, Z. Chen, M. Marquis, K.B. Averyt, M. Tignor and H.L. Miller (eds.) Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
[3]  J. S. Frederiksen and C. S. Frederiksen, “Decadal changes in Southern Hemisphere winter cyclogenesis,” in CSIRO Marine and Atmospheric Research, p. 29, Aspendale, Victoria, Australia, 2005

Sunday 11 May 2014

History of Human Population Growth, Energy Consumption and Carbon Dioxide Concentration

Individually we humans are intelligent, emotional, illogical, unpredictable and incredibly complicated. However, our behaviour as a group is in fact far simpler. In my previous post I used a simple mathematical model to illustrate that in biological systems, when there is more food available, animals breed more and infant survival rates increase. In this respect humans are no different to any other form of life. Below I will outline the key factors leading to human population growth and the associated increases in global energy consumption and atmospheric carbon dioxide concentration. Note that in this blog post I have generated the following plots from raw data, so no numerical models were required. References to the raw data can be found at the end of this post.

The population of the world did not reach 1 billion people until the 1800s. Prior to this point food was produced by manual farming methods. The discovery of fossil fuels was the driver for the incredible population growth to follow. The industrial revolution took place between 1760 and 1840. During this time machines were developed to improve the efficiency of crop planting and harvesting, which had previously been done by hand. This, coupled with the development of enhanced fertilisers, enabled global food production to increase significantly. The increased availability of food meant that population centres that were previously in famine were able to adequately feed their people. Consistent with typical biological systems, as more food became available the population increased. In fact it took all of human history prior to the 1800s for the population to reach 1 billion people, and it then doubled to 2 billion in less than 150 years. This global population data was downloaded from reference [1].

The end of the second world war in 1945 marked the beginning of a second major population increase associated with the baby boomer generation. This population increase was perhaps more due to people feeling better about bringing a new child into a perceived safer world. The population doubled again, this time in less than 40 years. It could be argued that increased breeding rates in perceived safer environments is also a biological trait. The two images below are visualisations of the population of the cities of the world in 1950 and 2010 respectively, with the size of each yellow circle proportional to the population. Comparing these images highlights the population growth and the arrival of new cities over this period. The city based population data was obtained from reference [2].

Population of the cities of the world in 1950

Population of the cities of the world in 2010

Total global energy consumption is given by the amount of energy used per person multiplied by the total number of people. The increase in total global energy consumption illustrated below largely follows the increase in population. The population increase over the past century has been mainly in the developing world countries. However, living standards, and hence energy used per person, have in fact been significantly increasing in the developed world countries. These two effects can be more clearly illustrated by decomposing the energy usage into that associated with the developed and developing world countries. This will be the subject of future posts. The energy data presented below was downloaded from reference [3].
The vast majority of the energy consumed has come from the burning of fossil fuels. The global carbon dioxide (CO2) concentration in the atmosphere in parts per million (ppm) is illustrated below, which increases in line with the energy consumption and human population growth. The CO2 data presented below has been obtained from sources [4] to [7] listed at the end of this post.
The attribution of the carbon dioxide emissions to human energy consumption is perhaps more starkly illustrated when zooming out in time. It is clear from the image below that the carbon dioxide concentration was relatively stable over the past 2000 years, and then increased sharply, coinciding with the discovery of fossil fuels and the industrial revolution.
Zooming out in time again illustrates the natural variability of the carbon dioxide concentration over the past 700,000 years. The apparent cycles in the concentration are associated with the Milankovitch cycles, which are changes in the orbit of the Earth around the sun. These changes give rise to ice ages in periods of low concentration and warm periods with high concentrations. It is clear that over the past 700,000 years the Earth has not been subjected to carbon dioxide concentrations as high as what we have today.
I put together the following animation to illustrate the increase in the population of the cities of the world overlaid with the associated total population growth and increase in the atmospheric carbon dioxide concentration.


References:
[1] www.un.org
[2] nordpil.com/go/resources/world-database-of-large-cities
[3] www.theoildrum.com/node/8936
[4] Etheridge, D.M., L.P. Steele, R.L. Langenfelds, R.J. Francey, J.-M. Barnola, and V.I. Morgan. 1996. Natural and anthropogenic changes in atmospheric CO2 over the last 1000 years from air in Antarctic ice and firn. Journal of Geophysical Research 101:4115-4128.
[5] U. Siegenthaler, T. F. Stocker, E. Monnin, D. Lüthi, J. Schwander, B. Stauffer, D. Raynaud, J.-M. Barnola, H. Fischer, V. Masson-Delmotte, J. Jouzel.  2005. Stable Carbon Cycle-Climate Relationship During the Late Pleistocene. Science, v. 310 , pp. 1313-1317, 25 November 2005.
[6] www.esrl.noaa.gov/gmd/ccgg/trends/
[7] Barnola, J.-M., D. Raynaud, A. Neftel, and H. Oeschger. 1983. Comparison of CO2 measurements by two laboratories on air from bubbles in polar ice. Nature 303:410-13.

Tuesday 22 April 2014

Koalas vs Possums - Modelling Biological Growth using Recursive Bayesian Estimation

In this week's post I will be discussing how to track biological population growth using a simple numerical model coupled together with incomplete noisy measurements. Firstly I will provide an overview of the numerical population model, then I will illustrate how to incorporate the measurements to ensure the numerical model is more representative of the real world.

The example I am using to illustrate this is the predator-prey problem. In the classic predator-prey problem, there is one type of predator and one type of prey, for example foxes and rabbits. Here, however, we will be using two predators and one prey to achieve the required level of complexity. The selected elements of the model have a somewhat Australian flavour. The two types of predators are koalas and possums. You may not think that these two animals are the most vicious of predators, however, you might think otherwise if you were the chosen prey ... gum leaves! The numerical model calculates the population of the koalas, possums and gum leaves, based on certain assumptions. We need to provide the rate at which an average koala or possum eats gum leaves. For both the koalas and possums, we also need to provide a threshold amount of gum leaves below which they die and above which they breed. We can estimate these rates and thresholds from the data itself, but for the moment we'll assume that we know what they are. Details on the numerical model can be found in reference [1] detailed at the end of this post.
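
As a rough sketch of what such a numerical model can look like, the snippet below integrates a generic two-predator / one-prey system of Lotka-Volterra type. It is not the exact model of reference [1] (which uses the feeding thresholds described above); the equations, parameter values and variable names are illustrative assumptions only.

    import numpy as np
    from scipy.integrate import solve_ivp

    def two_predators_one_prey(t, y, r, K, a1, a2, e1, e2, d1, d2):
        """Generic two-predator (koalas, possums) / one-prey (gum leaves) model."""
        leaves, koalas, possums = y
        d_leaves = r * leaves * (1 - leaves / K) - a1 * leaves * koalas - a2 * leaves * possums
        d_koalas = e1 * a1 * leaves * koalas - d1 * koalas    # breed when food is plentiful
        d_possums = e2 * a2 * leaves * possums - d2 * possums
        return [d_leaves, d_koalas, d_possums]

    params = (1.0, 10.0, 0.3, 0.25, 0.5, 0.5, 0.4, 0.35)      # illustrative values only
    sol = solve_ivp(two_predators_one_prey, (0.0, 200.0), [5.0, 1.0, 1.0],
                    args=params, dense_output=True, max_step=0.1)

    t = np.linspace(0.0, 200.0, 2000)
    leaves, koalas, possums = sol.sol(t)                      # populations versus time

Changing one of the feeding-rate parameters (a2, say) by a small amount and re-running is enough to see the kind of sensitivity discussed below.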

For a particular selection of rates and thresholds we calculate the evolution of the koala, possum and gum leaf populations versus time, as illustrated in the plot below. The specific numbers do not mean much; it is just intended to be an illustration of the approach. Here you can see a very regular and periodic pattern. The number of gum leaves (green line) initially increases, which is then followed by an increase in the population of both the koalas and possums as there is now more food available. When there are so many koalas and possums that they are eating the gum leaves faster than the gum leaves can grow, the number of gum leaves decreases. When the gum leaves fall below a certain threshold value the animals start to die because there is not enough food available. With there now being fewer koalas and possums, the gum leaves begin to grow back. When there is enough food the animals begin to breed again, and so the cycle continues.
However, if we slightly change the rate at which the possums eat the gum leaves then we get a very different evolution of all three populations, as you can see from the plot below. It is clear that the system is now more complicated and less periodic / regular. This is one illustration of the butterfly effect, where a very small change to the system can produce very different results. The overall pattern of the population growth, however, is the same, with the growth in the animal population following a growth in the number of gum leaves, and the decay in the animal population following a decay in the number of gum leaves. The predator-prey model has no random components in it at all, yet it produces changes in the population which on the surface appear to be random. They are in fact not random, but chaotic. This type of chaotic behaviour can only be produced by a system with at least three degrees of freedom (koala, possum and gum leaves in this case). This comparison has illustrated how important it is to get the input parameters of the model right. A small change to the rate at which possums eat gum leaves produced qualitatively similar behaviour, but quantitatively different results.
We will now augment the simple model used above with measurements of the true system using Recursive Bayesian Estimation. Recursive Bayesian Estimation is the general term given to a variety of methods used to incorporate measurements into a numerical model as soon as these measurements become available. In this application, we'll be using the Ensemble Kalman Filter. Interested readers can find the mathematical details on wikipedia. It essentially involves running an ensemble of many numerical models at the same time, each starting with slightly different conditions (e.g. different populations), and in some cases slightly different model parameters (e.g. possum feeding rates). A comparison of the instances of the numerical model gives an indication of the natural variability (or error) in the system. When a measurement becomes available it is compared to what the numerical model suggests the measurement should be, and the model is then pushed in the direction of the measurement. If the measurement has no noise in it at all, then the model state is set to the measurement exactly. If there is some error in the measurement, then the numerical model is pushed towards the measurement, with the extent of the shift determined by a comparison of the natural variability and the measurement error.
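
A bare-bones sketch of the analysis (update) step of a stochastic Ensemble Kalman Filter, in its standard perturbed-observation form, is given below. Between measurements each ensemble member would be stepped forward with the numerical model (that forecast step is omitted here). The observation operator H simply selects which state variables are measured, in this case the koala and possum populations but not the gum leaves.

    import numpy as np

    rng = np.random.default_rng(0)

    def enkf_update(ensemble, obs, H, obs_var):
        """Perturbed-observation EnKF analysis step.

        ensemble : (n_members, n_state) array of model states
        obs      : (n_obs,) measurement vector
        H        : (n_obs, n_state) observation operator (selects measured variables)
        obs_var  : scalar measurement error variance
        """
        n_members, n_state = ensemble.shape
        X = ensemble - ensemble.mean(axis=0)              # ensemble anomalies
        P = X.T @ X / (n_members - 1)                     # sample state covariance
        R = obs_var * np.eye(len(obs))                    # measurement error covariance
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)      # Kalman gain

        updated = np.empty_like(ensemble)
        for i, member in enumerate(ensemble):
            # perturb the observation for each member, then nudge towards the data
            perturbed_obs = obs + rng.normal(0.0, np.sqrt(obs_var), size=len(obs))
            updated[i] = member + K @ (perturbed_obs - H @ member)
        return updated

    # 3-variable state (leaves, koalas, possums); only the animals are measured
    H = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    ensemble = rng.normal([5.0, 1.0, 1.0], 0.2, size=(50, 3))
    analysis = enkf_update(ensemble, obs=np.array([1.1, 0.9]), H=H, obs_var=0.05)

The size of the Kalman gain K, and hence how far the ensemble is pulled towards the data, is set by exactly the comparison described above: the ensemble (natural) variability P against the measurement error R.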

Here the Ensemble Kalman Filter is used to estimate the population growth properties from a partial, noisy measurement of the "true" system. The "true" system is the second case discussed above, but here we are just focussing on the first 100 time units of it. The measurement is "partial" in that we have only measured the koala and possum populations, and do not have a measurement of the number of leaves. It is "noisy" in that I have added a random number to the measurement of the "true" system.

In the plot below the "true" Koala population is represented by the red line, and the noisy measurements of it are represented by the small black boxes. You can see that the black boxes in some cases are shifted away from the red line, due to the noise in the measurement. In between each measurement, the numerical model calculates the population until a new measurement comes along to correct it. The numerical model estimate in each of the following plots is given by the black line.
Likewise, as can be seen from the plot below, the numerical model estimates the possum population, which is corrected when the measurements become available.
The Ensemble Kalman Filter uses the measurements of the koala and possum populations to not only correct the koala and possum population estimates from the numerical model, but also to correct the gum leaf population. This illustrates a key feature of the approach: one can make an estimate of something that you cannot measure (the gum leaf population), through its relationship (the numerical model) to something that you can measure (the koala and possum populations).
In addition to estimating the population of the gum leaves from the measurements of the animal populations, you can also estimate the model parameters from the data itself. For example we can estimate the rate at which the possums eat the gum leaves, which we identified earlier to significantly affect the evolution of the populations. In the plot below you can see that the estimate of the possum feeding rate approaches the true value as more measurements of the animal population become available. Each of the stair-step changes in the estimate coincides with a new measurement. The more parameters you wish to estimate from the data, the more numerical simulations are required to be run simultaneously in the ensemble.
This fun example of Recursive Bayesian Estimation has illustrated a powerful approach for fusing together numerical models and real world data. It has implications for any application in which a numerical model is required to simulate reality, particularly when the numerical model parameters are not well known. This is particularly the case in socio-economic and complex biological systems.

References:
[1] Eirola, T., Osipov, A. V. & Söderbacka, G., 1996, Chaotic regimes in a dynamical system of the type many predators one prey, Helsinki University of Technology, Institute of Mathematics, Research Reports A368.

Sunday 6 April 2014

Flow over Aircraft and Wind Turbines - Capturing Increasing Complexity using Principal Components

In my previous post I discussed the important role that turbulence plays in the evolution of the atmosphere and ocean. Here I'll discuss turbulence in the context of the flow over aircraft wings and wind turbines. Specifically I'll show how the complexity of the flow increases as the object moves faster. I'll also show how to represent this complexity in a simple way using the data mining technique of "principal component analysis", also known as "proper orthogonal decomposition" in engineering, "singular value decomposition" in mathematics, or "empirical orthogonal functions" in geophysics.

Arguably the most important quantity in the field of fluid dynamics is the Reynolds number, which is essentially the ratio of the momentum force to the viscous force. It is high for objects moving fast in a fluid of low viscosity (e.g. air), and low for objects moving slowly in a very viscous fluid (e.g. honey). Flows with higher Reynolds numbers are more complex and have a greater range of scales, that is, the largest vortex in the flow is significantly bigger than the smallest vortex in the flow. I'll illustrate below how the complexity of the flow over an aerofoil (representative of an idealised aircraft wing or wind turbine blade) increases with Reynolds number.

The configuration that I am looking at is an aerofoil at an angle of 18 degrees, with the flow moving from left to right in the movies below. All of the movies illustrated below are generated by post-processing the data resulting from computational fluid dynamics simulations, using conceptually the same approach as that discussed in my previous post on the simulations of the atmosphere and ocean. The volume of fluid surrounding the aerofoil is broken down into a series of grid boxes and the Navier-Stokes equations are solved at each position to determine the velocity and pressure throughout the fluid volume.

In fact the flow around an aerofoil at very low Reynolds numbers (or very slow moving aerofoils) does not change with time. The fluid is moving, but its velocity and pressure at each position is not changing. It is not until the aerofoil reaches a certain critical Reynolds number (or speed) that the flow begins to change with time, as illustrated in the movie below. Here the flow is changing in time and two-dimensional. It is coloured by vorticity, which is a measure of the rotation of the fluid. Red is rotating in the counter-clockwise direction and blue is rotating in the clockwise direction.


I will now use principal component analysis to break down the flow into a series of "modes", which can be added together with varying weights to reconstruct each instant in time. One can also think of the method as a form of information compression; it has also been used for facial feature detection. For this particular flow 94% of the energy can be represented by the first two modes (or "facial features") illustrated below. Further details of this flow and its stability properties can be found in my Journal of Computational Physics paper in reference [1] and in my PhD thesis in reference [2] listed below.

mode 1
mode 2
As the Reynolds number increases the flow transitions from an unsteady two-dimensional flow to an unsteady three-dimensional flow, illustrated in the movie below. This movie shows three-dimensional surfaces defining the boundaries of the complex vortex structures. They are coloured by rotation in the flow direction. The total data set is 17.5 GB.


Here the first two modes represent the two-dimensional aspects of the flow and capture only 65% of the energy.

mode 1
mode 2
The next two modes capture the three-dimensional aspects of flow. Further details of this flow and the associated modes can be found in my PhD thesis.
mode 3
mode 4

As the Reynolds number is increased further the flow becomes even more complex, with many smaller vortices being generated, as illustrated in the movie below. This total data set is 35 GB.


Here the first two modes now represent only 45% of the total energy. What is interesting here is that the flow is so complex that the large scale vortices are hidden amongst the forest of small scale vortices. The principal components, however, are able to extract the large scale features from the data. Further details on this flow and the modal decomposition can be found in my Journal of Fluid Mechanics paper in reference [3].
mode 1
mode 2
It is clear that as the Reynolds number increases and the flow becomes more complex, the first two modes represent less and less of the total energy. This also means that more modes are required to represent a given percentage of the total energy.
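
As a concrete sketch of the decomposition itself, the snippet below computes the principal components (POD modes) of a set of snapshots via the singular value decomposition, and reports the fraction of the total energy captured by the first two modes. The snapshot matrix here is random data standing in for real velocity fields; with actual data, each column would be one instant in time of the mean-subtracted flow field.

    import numpy as np

    rng = np.random.default_rng(1)

    # Snapshot matrix: each column is one instant in time of the flattened flow field
    n_points, n_snapshots = 2000, 100
    snapshots = rng.standard_normal((n_points, n_snapshots))  # stand-in for real data

    mean_flow = snapshots.mean(axis=1, keepdims=True)
    fluctuations = snapshots - mean_flow

    # Thin SVD: columns of U are the spatial modes, S**2 is proportional to modal energy
    U, S, Vt = np.linalg.svd(fluctuations, full_matrices=False)

    energy_fraction = S**2 / np.sum(S**2)
    print("energy in first two modes:", energy_fraction[:2].sum())

    # Reconstruct the first snapshot using only the first two modes
    coeffs = U[:, :2].T @ fluctuations[:, 0]
    reconstruction = mean_flow[:, 0] + U[:, :2] @ coeffs

For the lowest Reynolds number case above the first two modes capture roughly 94% of the energy; for the highest Reynolds number case only around 45%, so many more modes are needed for an accurate reconstruction.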

Principal Component Analysis has many applications and can be used to extract key features from any data set, be it physical, biological or socio-economic.

References:
[1] Kitsios, V., Rodríguez, D., Theofilis, V., Ooi, A. & Soria, J., 2009, BiGlobal stability analysis in curvilinear coordinates for massively separated lifting bodies, Journal of Computational Physics, Vol. 228, pp 7181-7196. [link]
[2] Kitsios, V., 2010, Recovery of fluid mechanical modes in unsteady separated flows, PhD Thesis, The University of Melbourne. [PDF]
[3] Kitsios, V., Cordier, L., Bonnet, J.-P., Ooi, A. & Soria, J., 2011, On the coherent structures and stability properties of a leading edge separated aerofoil with turbulent recirculation, Journal of Fluid Mechanics, Vol. 683, pp 395-416. [link]