I immediately ran into a minor problem when trying to install v1.4 of PyTorch. I use pip (rather than conda) as my Python package manager, and I prefer to install my Python packages manually from their .whl files. The pytorch.org Web page used to give links to individual .whl files, but the latest version of the page gives only a pip install command for the current build, which I didn’t want to use.

After a bit of searching, I located the individual .whl files at https://download.pytorch.org/whl/torch_stable.html.

Most of my dev machines run Windows and have either a CPU only (no GPU) or an older GPU that doesn’t support the latest builds of GPU PyTorch. I am currently using Python version 3.6.5 (via Anaconda version 5.2.0). So, I downloaded this file: torch-1.4.0+cpu-cp36-cp36m-win_amd64.whl to my local machine. As I write this, I’m reminded that versioning compatibility in the Python world is still a huge issue, even for experienced people, but especially for people new to Python and PyTorch.

I uninstalled PyTorch v1.2 using the shell command “pip uninstall torch”. Then I installed v1.4 using the command “pip install (the-whl-file)”. I got an error message of “distributed 1.21.8 requires msgpack, which is not installed”, which I ignored. I assume this has something to do with Anaconda.

In all my previous PyTorch program investigations, I simply ignored the “device” issue. My programs just magically worked. I decided I’d explicitly specify the device for each Tensor and Module object. This is a big topic but briefly, when you create a Tensor object, the fundamental data type of PyTorch, you can specify whether it should be processed by a CPU or a GPU. For example:

import torch as T
device = T.device("cpu")
. . .
X = T.Tensor(data_x[i].reshape((1,n))).to(device)

So I went through my old Iris example script and added explicit to(device) directives. Unfortunately, there were a lot of statements to modify and even if I missed some, my script would still work. The only way to know would be to change the device to GPU and run the script on a machine with a GPU (which I don’t have right now).
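In case it’s useful, here’s the device-agnostic pattern I have in mind — a minimal sketch that selects a GPU when one is available and otherwise falls back to the CPU (the Linear layer and tensor values here are just placeholders, not from the Iris script):

```python
import torch as T

# pick the GPU if one is available, otherwise fall back to the CPU
device = T.device("cuda" if T.cuda.is_available() else "cpu")

# every Tensor and Module gets an explicit .to(device)
x = T.tensor([[5.1, 3.5, 1.4, 0.2]], dtype=T.float32).to(device)
net = T.nn.Linear(4, 3).to(device)
oupt = net(x)  # both operands now live on the same device
```

With this pattern, a script written and tested on a CPU-only machine should run unchanged on a GPU machine — provided no to(device) call was missed.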

Anyway, the moral of the story is that working with PyTorch is very difficult. PyTorch knowledge isn’t something like knowledge of batch files where you can pick it up easily as needed. Working with PyTorch is essentially a full-time job.

Here’s my (possibly buggy on a GPU) Iris program:

# iris_nn.py
# PyTorch 1.4.0 Anaconda3 5.2.0 (Python 3.6.5)
# CPU, Windows, no dropout

import numpy as np
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

# -----------------------------------------------------------

class Batcher:
  def __init__(self, num_items, batch_size, seed=0):
    self.indices = np.arange(num_items)
    self.num_items = num_items
    self.batch_size = batch_size
    self.rnd = np.random.RandomState(seed)
    self.rnd.shuffle(self.indices)
    self.ptr = 0

  def __iter__(self):
    return self

  def __next__(self):
    if self.ptr + self.batch_size > self.num_items:
      self.rnd.shuffle(self.indices)
      self.ptr = 0
      raise StopIteration  # ugly
    else:
      result = self.indices[self.ptr:self.ptr+self.batch_size]
      self.ptr += self.batch_size
      return result

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 7)  # 4-7-3
    self.oupt = T.nn.Linear(7, 3)
    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)  # no softmax. see CrossEntropyLoss()
    return z

# -----------------------------------------------------------

def accuracy(model, data_x, data_y):
  # data_x and data_y are numpy nd arrays
  N = len(data_x)     # number data items
  n = len(data_x[0])  # number features
  n_correct = 0; n_wrong = 0
  for i in range(N):
    X = T.Tensor(data_x[i].reshape((1,n))).to(device)
    Y = T.LongTensor(data_y[i].reshape((1,1))).to(device)
    oupt = model(X)
    (big_val, big_idx) = T.max(oupt, dim=1)
    if big_idx.item() == data_y[i]:
      n_correct += 1
    else:
      n_wrong += 1
  return (n_correct * 100.0) / (n_correct + n_wrong)

def main():
  # 0. get started
  print("\nBegin Iris Dataset using PyTorch demo \n")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. load data
  print("Loading Iris data into memory \n")
  train_file = ".\\Data\\iris_train.txt"
  test_file = ".\\Data\\iris_test.txt"

  # data looks like:
  # 5.1, 3.5, 1.4, 0.2, 0
  # 6.0, 3.0, 4.8, 1.8, 2

  train_x = np.loadtxt(train_file, usecols=range(0,4),
    delimiter=",", skiprows=0, dtype=np.float32)
  train_y = np.loadtxt(train_file, usecols=[4],
    delimiter=",", skiprows=0, dtype=np.float32)
  test_x = np.loadtxt(test_file, usecols=range(0,4),
    delimiter=",", skiprows=0, dtype=np.float32)
  test_y = np.loadtxt(test_file, usecols=[4],
    delimiter=",", skiprows=0, dtype=np.float32)

  # 2. create network
  net = Net().to(device)

  # 3. train model
  lrn_rate = 0.05
  loss_func = T.nn.CrossEntropyLoss()  # applies softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
  max_epochs = 100
  N = len(train_x)
  bat_size = 16
  batcher = Batcher(N, bat_size)

  print("Starting training")
  for epoch in range(0, max_epochs):
    for curr_bat in batcher:
      X = T.Tensor(train_x[curr_bat]).to(device)
      Y = T.LongTensor(train_y[curr_bat]).to(device)
      optimizer.zero_grad()
      oupt = net(X)
      loss_obj = loss_func(oupt, Y)
      loss_obj.backward()
      optimizer.step()
    if epoch % (max_epochs/10) == 0:
      print("epoch = %6d" % epoch, end="")
      print("  prev batch loss = %7.4f" % loss_obj.item(), end="")
      acc = accuracy(net, train_x, train_y)
      print("  accuracy = %0.2f%%" % acc)
  print("Training complete \n")

  # 4. evaluate model
  # net = net.eval()
  acc = accuracy(net, test_x, test_y)
  print("Accuracy on test data = %0.2f%%" % acc)

  # 5. save model
  print("Saving trained model \n")
  path = ".\\Models\\iris_model.pth"
  T.save(net.state_dict(), path)

  # 6. make a prediction
  unk_np = np.array([[6.1, 3.1, 5.1, 1.1]], dtype=np.float32)
  unk_pt = T.tensor(unk_np, dtype=T.float32).to(device)
  logits = net(unk_pt).to(device)  # do not sum to 1.0
  probs_pt = T.softmax(logits, dim=1).to(device)
  probs_np = probs_pt.detach().numpy()

  print("Predicting species for [6.1, 3.1, 5.1, 1.1]: ")
  np.set_printoptions(precision=4)
  print(probs_np)

  print("\n\nEnd Iris demo")

if __name__ == "__main__":
  main()

*Images from an Internet search for “python clothes”. Left: A python pattern dress and shoes. Center: A man’s python pattern jacket. Right: A brightly-colored python pattern dress. I find these designs oddly attractive.*

*Left: College student Tessa Majors was murdered by three teenage boys. Right: One of the boys who confessed to the murder.*

Whenever I see or think about some phenomenon, I wonder if machine learning can be applied in some way. In the case of teenage murderers, I’m stumped.

I’m generally not too interested in sociology, but from what little I’ve read, it seems as if most teenage murderers fit the same template: male, poorly educated, low intelligence, raised by a single mother who is often dependent on public assistance, and so on.

But this is correlation, not causation. Knowing a particular template provides information about which types of teenagers are more likely to commit murder, but that information doesn’t explain why such teens commit murders or suggest ways to prevent them from doing so.

The bottom line is that I have no suggestions about how machine learning could be used to reduce the number of murders committed by teenagers. But I hope I’ll never become so numbed to such problems that I stop wondering about such things and thinking about how machine learning can be used for good purposes.

*Images from a Google search for “teens arrested murder”. There were hundreds of results like these. Sad.*

A radial basis function (RBF) network is a software system that is similar to a single hidden layer neural network. In my article I explain how to design an RBF network and describe how an RBF network computes its output. I use the C# language but it shouldn’t be difficult to refactor the demo code to another programming language.

I explained RBF networks using a demo program. The demo sets up a 3-4-2 RBF network. There are three input nodes, four hidden nodes, and two output nodes. You can imagine that the RBF network corresponds to a problem where the goal is to predict if a person is male or female based on their age, annual income, and years of education.

The demo program sets dummy values for the RBF network’s centroids, widths, weights, and biases. The demo sets up a normalized input vector of (1.0, -2.0, 3.0) and sends it to the RBF network. The final computed output values are (0.0079, 0.9921). If the output nodes correspond to (0, 1) = male and (1, 0) = female, then you’d conclude that the person is male.

Each hidden node also has a single width value. The width values are sometimes called standard deviations, and are often given the symbol Greek lower case sigma or lower case English s. In the diagram, s0 is 2.22, s1 is 3.33 and so on.

Each hidden node has a value which is determined by the input node values, and the hidden node’s centroid values and the node’s width value. In the diagram, the value of hidden node [0] is 0.0014, the value of hidden node [1] is 0.2921 and so on.
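To make that concrete, the usual choice of radial basis function is a Gaussian, so a hidden node’s value depends only on the distance from the input vector to the node’s centroid, scaled by the node’s width. A minimal sketch (the centroid and width values here are made up for illustration — they’re not the ones in my diagram):

```python
import numpy as np

def hidden_node_value(x, centroid, width):
  # Gaussian RBF: exp( -||x - c||^2 / (2 * s^2) )
  d_sq = np.sum((x - centroid) ** 2)  # squared Euclidean distance to centroid
  return np.exp(-d_sq / (2.0 * width * width))

x = np.array([1.0, -2.0, 3.0])  # the normalized demo input
c = np.array([0.0, 0.0, 0.0])   # made-up centroid
s = 2.22                        # made-up width
print(hidden_node_value(x, c, s))  # some value in (0, 1]
```

An input that lands exactly on a centroid gives a hidden node value of 1.0, and the value falls off toward 0 as the input moves away — which is why the width is sometimes called a standard deviation.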

It is common to place a bell-shaped curve icon next to each hidden node in an RBF network diagram to indicate that the nodes are computed using a radial basis function with centroids and widths rather than using input-to-hidden weights as computed by single hidden layer neural networks.

There is a weight value associated with each hidden-to-output connection. The demo 3-4-2 RBF network has 4 * 2 = 8 weights. In the diagram, w00 is the weight from hidden [0] to output [0] and has value 5.0. Weight w01 is from hidden [0] to output [1] and has value -5.1 and so on.

There is a bias value associated with each output node. The bias associated with output [0] is 7.0 and the bias associated with output [1] is 7.1.

The two output node values of the demo RBF network are (0.0079, 0.9921). Notice the final output node values sum to 1.0 so that they can be interpreted as probabilities. Internally, the RBF network computes preliminary output values of (4.6535, 9.4926). These preliminary output values are then scaled so that they sum to 1.0 using the softmax function.
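A quick numpy check that softmax-scaling the preliminary values reproduces the final outputs:

```python
import numpy as np

prelim = np.array([4.6535, 9.4926])  # preliminary output values from the demo
exps = np.exp(prelim)                # exp() of each preliminary value
probs = exps / np.sum(exps)          # scale so the results sum to 1.0
print(np.round(probs, 4))  # [0.0079 0.9921]
```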

*Three creepy prehistoric animals that have (mostly) radial symmetry. Left: Sollasina cthulhu, a sea cucumber that lived 430 million years ago. Center: Wiwaxia, a marine slug-like creature that lived in the Cambrian Period. Right: Maotianoascus and Ctenrhabdotus, ancient predecessors to jellyfish. Ugh. These are the kind of creatures that give me nightmares.*

Counterfactuals are best explained by example. Suppose a loan company has a trained ML model that is used to approve or decline customers’ loan applications. The predictor variables (often called features in ML terminology) are things like annual income, debt, sex, savings, and so on. A customer submits a loan application. Their income is $45,000 with debt = $11,000 and their age is 29 and their savings is $6,000. The application is declined.

A counterfactual is a change to one or more predictor values that results in the opposite result. For example, one possible counterfactual could be stated in words as, “If your income was increased to $60,000 then your application would have been approved.”

In general, there will be many possible counterfactuals for a given ML model and set of inputs. Two other counterfactuals might be, “If your income was increased by $50,000 and debt was decreased to $9,000 then your application would have been approved” and, “If your income was increased to $48,000 and your age was changed to 36 then your application would have been approved.” The image below illustrates three such counterfactuals for a loan scenario.

Some Microsoft counterfactuals research is detailed in a paper titled “Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations” by Ramaravind K. Mothilal (Microsoft), Amit Sharma (Microsoft), and Chenhao Tan (University of Colorado). The project generated an open source code library called Diverse Counterfactual Explanations (DiCE) which is available at: https://github.com/microsoft/DiCE.

The library is implemented in Python and currently supports Keras / TensorFlow models, and support for PyTorch models is being added. The researchers applied the DiCE library to the well-known benchmark Adult Data Set where the goal is to predict if a person makes less than $50,000 or more than $50,000 annually based on predictor variables such as education level, occupation type, and race.

A partial code snippet that illustrates what using the DiCE library looks like is:

import dice_ml

d = dice_ml.Data(. .)    # load dataset
m = dice_ml.Model(. .)   # load trained model
ex = dice_ml.Dice(d, m)  # create DiCE "explanation"

q = {'age': 22, 'race': 'White', . .}  # model input

# now generate 4 counterfactuals
cfs = ex.generate_counterfactuals(q, 4, . .)

The model prediction using the original input values is that the person’s income is less than $50,000. Here are the four resulting counterfactuals:

The four counterfactuals all generate a prediction that the similar person has income of greater than $50,000. For example, the first counterfactual changes the values of four predictor variables: education changes from HS-grad to Masters; age changes from 22 to 65; marital status changes from Single to Married; and sex changes from Female to Male.

I am not a fan of recursion and I rarely use it except when working with tree data structures, and even then I avoid recursion when possible. So my recursive implementation of Determinant() was mostly for mental exercise.

If M =

3  6
2  7

then Det(M) = (3 * 7) – (6 * 2) = 9. If M =

1  2  3
4  5  6
7  8  9

Det(M) = (+1) * 1 * Det  5  6
                         8  9

       + (-1) * 2 * Det  4  6
                         7  9

       + (+1) * 3 * Det  4  5
                         7  8

and so on.

Anyway, after thrashing around for a few minutes, I came up with the following C# implementation of a recursive function to compute the determinant of a matrix:

static double Det(double[][] m)
{
  double sum = 0.0;
  int sign;  // -1 or +1

  if (m.Length == 1)
    return m[0][0];
  else if (m.Length == 2)
    return (m[0][0] * m[1][1]) - (m[0][1] * m[1][0]);

  for (int j = 0; j < m.Length; ++j)  // each col of m
  {
    double[][] small = new double[m.Length-1][];  // n-1 x n-1
    for (int i = 0; i < small.Length; ++i)
      small[i] = new double[m.Length-1];

    for (int r = 1; r < m.Length; ++r)  // start row [1]
    {
      for (int c = 0; c < m.Length; ++c)
      {
        if (c < j)
          small[r - 1][c] = m[r][c];
        else if (c > j)
          small[r - 1][c - 1] = m[r][c];
        else  // if (c == j)
          ;  // skip this col
      } // c
    } // r

    if (j % 2 == 0) sign = +1; else sign = -1;
    sum += sign * m[0][j] * Det(small);  // recursive call
  } // j

  return sum;
}

My personal matrix library has a non-recursive Determinant() function that uses Crout’s decomposition technique.
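I won’t reproduce my C# Crout’s code here, but the same non-recursive idea — reduce the matrix to triangular form, then multiply the diagonal elements — is easy to sketch. This Python version uses plain Gaussian elimination with partial pivoting rather than Crout’s exact decomposition:

```python
import numpy as np

def det(m):
  # determinant via Gaussian elimination with partial pivoting
  a = np.array(m, dtype=np.float64)
  n = a.shape[0]
  result = 1.0
  for j in range(n):
    p = j + np.argmax(np.abs(a[j:, j]))  # pivot row for column j
    if a[p, j] == 0.0:
      return 0.0  # exactly zero pivot: singular matrix
    if p != j:
      a[[j, p]] = a[[p, j]]  # row swap flips the determinant's sign
      result = -result
    result *= a[j, j]
    for r in range(j+1, n):  # zero out below the pivot
      a[r, j:] -= (a[r, j] / a[j, j]) * a[j, j:]
  return result

print(det([[3, 6], [2, 7]]))  # approximately 9.0
```

Unlike the recursive cofactor expansion, which is O(n!), this approach is O(n^3), so it scales to the matrix sizes that actually come up in machine learning.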

Moral of the story: working with matrices is rather tricky but it’s an essential skill for machine learning.

*Three determined puppies. Left: This is “Llama” and she is determined to get her owner’s attention. Center: This is my dog “Riley”. I was taking a nap and when I woke up, Riley was determined to get praise for chewing up my “Chess Life” magazine and three random socks. Right: This determined puppy walks softly and carries a big stick.*

Let

v1 = (2.0, 5.0, 3.0)
v2 = (1.0, 7.0, 0.0)

The difference of two vectors is just a vector made from the difference of their components:

v1 - v2 = (2-1, 5-7, 3-0) = (1.0, -2.0, 3.0)

The norm of a vector is the square root of the sum of the squared components:

|| v1 || = sqrt(2^2 + 5^2 + 3^2) = sqrt(4 + 25 + 9) = sqrt(38) = 6.16
|| v2 || = sqrt(1^2 + 7^2 + 0^2) = sqrt(1 + 49 + 0) = sqrt(50) = 7.07

The Euclidean distance between two vectors is the square root of the sum of the squared differences between components:

dist(v1, v2) = sqrt( (2-1)^2 + (5-7)^2 + (3-0)^2 ) = sqrt( 1 + 4 + 9 ) = sqrt(14) = 3.74

It is possible, and common, to express the Euclidean distance between two vectors as the norm of their difference:

|| v1 - v2 || = || (2, 5, 3) - (1, 7, 0) ||
              = || (1, -2, 3) ||
              = sqrt( 1^2 + (-2)^2 + 3^2 )
              = sqrt( 1 + 4 + 9 )
              = sqrt(14) = 3.74

In other words

dist(v1, v2) = || v1 - v2 ||
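A quick numpy verification of the identity:

```python
import numpy as np

v1 = np.array([2.0, 5.0, 3.0])
v2 = np.array([1.0, 7.0, 0.0])

dist = np.sqrt(np.sum((v1 - v2) ** 2))  # Euclidean distance, by definition
norm_of_diff = np.linalg.norm(v1 - v2)  # norm of the difference vector

print(dist, norm_of_diff)  # both are sqrt(14) = 3.7417
```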

The relationship between the norm of a vector and the Euclidean distance between two vectors appears in several machine learning scenarios.

I was talking to a colleague recently. He wants to create a roadmap for software developers who want to gain machine learning knowledge and skills. This leads to the question of exactly what math background, if any, is necessary.

Knowing the roughly 100 basic math techniques for ML like the one described here is useful, but is it necessary? On the one hand, norm vs. distance is not a difficult idea and anyone can learn it on the fly. But on the other hand, if you need to pick some math knowledge up while you’re in the middle of an ML topic that uses the knowledge, it makes learning ML much more difficult.

*Three paintings by artist Stanislaw Krupp. Sort of a modern Art Nouveau style. I don’t think an artist can pick up new art techniques on the fly while he’s in the middle of creating a painting, but I’m not an artist so I could be wrong.*

The terms library and framework just mean code modules that have been pre-written by you or someone else. However, it’s sometimes useful to think about how “library-ish” or how “framework-ish” some code is.

Most of my colleagues, and I too, generally think of library-ish code as low-level modules that are mostly independent of each other, where you usually edit/modify the code and connect different library-ish modules with custom code.

We usually think about framework-ish code as being high level modules that are highly dependent on each other, where you rarely modify the code, and often use only the framework-ish code with little or no custom connecting code.

Functions in framework-ish code often have a large number of parameters because framework-ish functions aren’t easy to modify. Functions in library-ish code usually don’t have a lot of parameters — just the essentials, because you can add additional parameters and modify the code in library-ish functions relatively easily.

Here’s an example of what I’d call C# library-ish code, in a machine learning context:

double[][] trainX = MatLoad(".\\testData.txt",
  new int[] { 0, 2, 4 }, '\t');
. . .
static double[][] MatLoad(string fn, int[] cols, char sep)
{
  // custom code to load data into an array-of-arrays matrix
}

And here’s an example of what I’d call framework-ish code from ML.NET that does roughly the same thing:

using Microsoft.ML;
using Microsoft.ML.Data;
using EmpClassifier.Model.DataModels;
. . .
IDataView trainDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
  path: TRAIN_DATA_FILEPATH,
  hasHeader: true,
  separatorChar: '\t',
  allowQuoting: true,
  allowSparse: false);
. . .
namespace EmpClassifier.Model.DataModels
{
  public class ModelInput
  {
    [ColumnName("hourly"), LoadColumn(0)]
    public bool Hourly { get; set; }
    . . .
  }
}

Library-ish code uses mostly primitive data types such as int[] and double[][] while framework-ish code usually has many custom class and interface definitions. The ML.NET LoadFromTextFile() code has a hasHeader parameter. If you wanted to add such a parameter to the library-ish MatLoad() function you could do so.

Side effects of the complexity and high level of abstraction of the framework-ish approach are that framework-ish modules are often difficult or impossible to modify, and therefore framework-ish modules often force you into architecting a system in one particular way.

With library-ish code, you must have a greater knowledge of algorithms and greater coding skills.

The distinction between library-ish code and framework-ish code isn’t Boolean. Most pre-written code has various degrees of the factors I’ve described.

So, when I read or hear the question, “What’s the difference between a library and a framework?” I’m pretty sure the person who asked the question is an inexperienced developer.

The purpose of rigidly defining terms in computer science or in any field is to clarify communications. Slapping labels on concepts does not increase knowledge, and people who do so and proclaim themselves as experts are almost always egoists trying to make money. A good example of this is the computer science so-called SOLID principles — absolute meaningless nonsense.

In business, most of Six Sigma is nothing more than hilariously obscure terminology and acronyms like DMAIC which are designed to make people believe that Six Sigma is something more than just a few useful concepts that can be explained and learned in two hours. And “Agile” programming is just a few common-sense ideas but is typically surrounded by massive amounts of lame terminology intended to enable bogus training. (Wow! When did I become the cranky old programmer guy?)


To compute softmax, you first calculate the exp() of each of the input values, and then sum those values. Then the softmax for each of the original values is the exp() of the value, divided by the sum.

exp(3) = 20.0855
exp(5) = 148.4132
exp(2) = 7.3891

sum = 175.8878

softmax(3) = 20.0855 / 175.8878 = 0.1142
softmax(5) = 148.4132 / 175.8878 = 0.8438
softmax(2) = 7.3891 / 175.8878 = 0.0420
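The same arithmetic in a few lines of Python:

```python
import numpy as np

x = np.array([3.0, 5.0, 2.0])
exps = np.exp(x)             # exp() of each input value
probs = exps / np.sum(exps)  # each exp() divided by the sum
print(probs)  # approximately [0.1142, 0.8438, 0.0420]
```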

A naive implementation of softmax() could blow up because the value of exp(x) can be astronomically large for even moderate values of x. One way to reduce the likelihood of an arithmetic exception is to use the so-called max trick. And you can refine the max trick by adding the log trick to reduce the possibility of exception further.

Here’s a C# implementation that uses just the max trick:

static double[] Softmax(double[] vec)
{
  int n = vec.Length;
  double[] result = new double[n];
  double m = VectorMax(vec);
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    result[i] = Math.Exp(vec[i] - m);
    sum += result[i];
  }
  for (int i = 0; i < n; ++i)
    result[i] /= sum;
  return result;
}

static double VectorMax(double[] vec)
{
  double result = vec[0];
  for (int i = 0; i < vec.Length; ++i) {
    if (vec[i] > result)
      result = vec[i];
  }
  return result;
}

The max trick is usually pretty robust but there’s still a chance of things going wrong in the division if sum is 0 or an extreme value. The log trick reduces the chance of an exception due to the division. Here’s an implementation that adds the somewhat more complicated log trick:

public static double[] Softmax(double[] vec)
{
  int n = vec.Length;
  double mx = VectorMax(vec);
  double[] result = new double[n];
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    result[i] = vec[i] - mx;
    sum += Math.Exp(result[i]);
  }
  double logsum = Math.Log(sum);
  for (int i = 0; i < n; ++i)
    result[i] = Math.Exp(result[i] - logsum);
  return result;
}

One weekend afternoon, just for hoots, I decided to derive the softmax function with max and log tricks. Here it is:

Equations (1), (2) and (3) show the properties involved. When I do my proofs involving a series, I like to be concrete and use three terms, and then generalize later. Equation (4) is the basic definition of the softmax() function. In equation (5) I divide top and bottom by exp(max) and then use property (1) to get equation (6) which is the max trick. Easy.

In equations (7) to (11) I use properties (2) and (3) to get the value of the first softmax probability, p0. Then in equation (12) I generalize using fancy math notation.
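For readers who don’t want to trace the numbered equations, here is my one-line summary of the two tricks in plain notation (my own notation, not the equations from the derivation image):

```latex
% the max trick, with m = \max_j x_j:
p_i = \frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}

% folding in the log trick:
p_i = \exp\Big( (x_i - m) - \log \sum_j e^{x_j - m} \Big)
```

The second form is exactly what the C# implementation above computes: subtract the max, take the log of the sum of the shifted exp() values, then exponentiate the difference.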

There are several morals to the story here. If you’re making a code library, you have to try to make sure your code is as error-resistant as possible, which bloats your code by at least a factor of two or three. Even the two implementations above would need a lot more error checking if they were intended for a public code library.

Personal code libraries can be very lean because you can add just the error checking you need.

*I love to look at model trains. Here are two beautiful HO scale sawmills, complete with realistic looking logs.*

A binary classification problem is one where the goal is to predict a variable that can be one of just two possible values. An example is predicting the sex of a person (male = 0, female = 1) based on predictors/features such as age, annual income, height, political leaning (conservative, moderate, liberal), and so on.

Some techniques can use only numeric predictors (such as age), some techniques can use only categorical predictors (such as political leaning), and some can handle mixed numeric and categorical predictors.

Logistic regression is somewhat unusual in that there are many different training/optimization algorithms that can be used.

Some of these 12 techniques naturally extend to handle multi-class classification, such as predicting political leaning from age, income, and so on. Some techniques can only handle multi-class classification using a hack called one-versus-all (OVA), which is generally a poor approach.

Some of the common techniques for training logistic regression models include stochastic gradient descent (SGD), iterated Newton-Raphson, L-BFGS optimization, simplex optimization, evolutionary optimization, and stochastic dual coordinate ascent (SDCA).
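For concreteness, here’s a minimal sketch of the first of those, logistic regression trained with plain stochastic gradient descent. The tiny dataset and the hyperparameter values are invented for illustration only:

```python
import numpy as np

def train_logreg_sgd(X, y, lr=0.1, max_epochs=1000, seed=0):
  # logistic regression trained with stochastic gradient descent
  rnd = np.random.RandomState(seed)
  w = np.zeros(X.shape[1]); b = 0.0
  for _ in range(max_epochs):
    for i in rnd.permutation(len(X)):  # visit items in random order
      p = 1.0 / (1.0 + np.exp(-(np.dot(X[i], w) + b)))  # sigmoid output
      g = p - y[i]          # gradient of the log loss for one item
      w -= lr * g * X[i]    # update weights
      b -= lr * g           # update bias
  return w, b

# made-up data: two features, class 1 when feature 0 is large
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1]])
y = np.array([0, 0, 1, 1])
w, b = train_logreg_sgd(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]
```

The other training techniques mentioned all optimize the same log loss; they just search for the weight values in different ways.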

*Three more or less random images from an Internet search for “binary classification art”. I classify all three as “good”.*

create a population of solutions
loop many times
  pick two good solutions
  breed goods to make a child
  mutate child slightly
  replace a bad soln with child
  generate a new random soln
  replace a bad soln with random soln
end-loop
return best solution found

A few blog posts ago, I experimented with the Adam (adaptive moment estimation) algorithm. I used the Rosenbrock function as my test case. The Rosenbrock function looks very simple but it’s a difficult problem for most optimization algorithms. Rosenbrock’s function is defined as:

f(x, y) = 100(y – x^2)^2 + (1 – x)^2

The function has the solution x = 1, y = 1, where f(x, y) = 0. Deceptively simple.
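Here’s a minimal Python sketch of the kind of evolutionary algorithm the pseudocode above describes, applied to Rosenbrock’s function. To be clear, this is my simplified interpretation, not the exact program from my demo, and the population size, mutation scale, and generation count are arbitrary:

```python
import numpy as np

def rosenbrock(p):
  x, y = p
  return 100.0 * (y - x * x) ** 2 + (1.0 - x) ** 2

def evolve(pop_size=20, max_gen=5000, seed=0):
  rnd = np.random.RandomState(seed)
  pop = rnd.uniform(-2.0, 2.0, (pop_size, 2))  # population of solutions
  errs = np.array([rosenbrock(p) for p in pop])
  for _ in range(max_gen):
    order = np.argsort(errs)
    p1, p2 = pop[order[0]], pop[order[1]]   # pick two good solutions
    child = (p1 + p2) / 2.0                 # breed (average crossover)
    child += rnd.normal(0.0, 0.01, size=2)  # mutate child slightly
    worst = order[-1]                       # replace a bad soln with child
    pop[worst] = child; errs[worst] = rosenbrock(child)
    rand_soln = rnd.uniform(-2.0, 2.0, size=2)  # generate a new random soln
    worst = int(np.argmax(errs))                # replace a bad soln with it
    pop[worst] = rand_soln; errs[worst] = rosenbrock(rand_soln)
  best = int(np.argmin(errs))
  return pop[best], errs[best]

best, err = evolve()
print(best, err)  # hopefully somewhere near (1, 1) with small error
```

Notice there are no gradients anywhere — just evaluation, selection, crossover, and mutation — which is exactly why tuning the hyperparameters matters so much.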

*Rosenbrock’s function from the Wikipedia article on the topic*

Anyway, one morning before work, I coded up an Evolutionary Algorithm and targeted Rosenbrock’s function. My results were both good and bad. My demo quickly found a pretty good solution of (1.004683, 1.009691) but I was hoping for something a bit closer to (1.0, 1.0) and furthermore, my algorithm required more hyperparameter tuning than I had expected.

I suspect that years from now (I’m not sure when) evolutionary algorithms will become more important than they are now. Most deep neural optimization techniques require Calculus gradients. Evolutionary algorithms are simple and do not require gradients, but they require far longer processing time than gradient-based techniques. If biologically plausible neural systems (such as neuromorphic networks) improve, and general computing power improves, then Evolutionary Algorithms will become useful because most biologically plausible systems do not have Calculus gradients.

While I was writing this blog post, I came across a research paper explaining an optimization algorithm that I hadn’t heard of before called “Adaptive Coordinate Descent”. See http://www.loshchilov.com/publications/GECCO2011_AdaptiveCoordinateDescent.pdf for the paper. I gave the paper a quick scan but, possibly because the authors are French, found the paper very difficult to read. I’ll give the paper another look when I’m stuck on an airplane or something.

*Fractals are essentially evolutionary art. Here are three computer-generated fractals that look biologically plausible.*