Practical Biology for software developers, part 1

“The older I get, the more I believe that the only way to become a better programmer is by not programming.”

Jeff Atwood, How to become a better programmer by not programming

Copying is innovation and innovation is copying. Engineering disciplines often take a cue from traits and patterns found in life.

(Photos: Hiromi Okano/Corbis; West Japan Railway Co. via Bloomberg)

Eiji Nakatsu, an engineer at the JR-West rail company, redesigned the nose of the train after the beak of the kingfisher to reduce drag and noise. It’s an engineering feat that would otherwise have taken numerous simulations and incremental optimisations. While it’s false to think that nature always has the best answers to man-made problems, it has simply had more time to figure out what does and doesn’t work. Now you might say: “okay, a guy really liked birds and got lucky, so what?”, and you might be right. This series of short articles isn’t about clear-cut advice, but rather a collection of observations about what makes living things tick.

An organism is, depending on who you ask, a system of autonomous units called cells which in some way promote reproduction or some other means of survival. The key thing is that they form a semi-closed system: they’re not ignorant of what’s going on outside and they can interact with each other, but they know their boundaries. Compare this to a function with state (a closure), or an object in computer languages. These tiny objects¹ can differentiate (and dedifferentiate) depending on the environment and signalling. This is a fancy way to say they can be refitted for various tasks, just like objects in prototype-based languages such as JavaScript or Lua.

This is probably the most important pattern, found even on the macroscopic level, and it gives organisms resiliency and an aptitude to survive. It shows how organisms deal with failures and end-of-life states - a malfunctioning cell triggers a “programmed cell death” and takes one for the team. The real MVP. Even though there are many safeguards in cell replication, a failure is not treated as something impossible, but as a normal part of life.

“In an increasingly multi-core world, the ability to isolate the processes as well as shield each open tab from other misbehaving pages alone proves that Chrome has a significant performance edge over the competition. In fact, it is important to note that most other browsers have followed suit, or are in the process of migrating to similar architecture.”

Ilya Grigorik, High Performance Networking in Chrome, The Many Facets of Performance

The Google Chrome design team copied this pattern in the era of threaded² browsers, and it paid off. In fact it goes even further with the zygote process. A zygote is a cell containing complete genetic information, formed from two reproductive cells. Similarly, the “zygote process” opens all files and performs some initialization, and then it can be fork()-ed as needed, with all the information already in place.
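To make the fork() idea a bit more tangible, here is a minimal sketch in Python. It is not how Chrome does it; the function name run_in_child and the contents of state are made up for illustration, and it assumes a POSIX system where os.fork() is available.

```python
import os


def run_in_child(state, task):
    """Fork the current (already initialised) process and run task(state) in the child.

    Returns whatever the child wrote back, as a string. POSIX only.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: `state` is already in memory, nothing is re-initialised.
        os.close(read_fd)
        try:
            os.write(write_fd, str(task(state)).encode())
        finally:
            os._exit(0)  # leave without running the parent's cleanup handlers
    # Parent (the "zygote") only collects the result and stays pristine.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        result = pipe.read()
    os.waitpid(pid, 0)
    return result


if __name__ == "__main__":
    # Hypothetical expensive initialisation, done once in the zygote.
    state = {"fonts": ["serif", "mono"], "locale": "en"}
    print(run_in_child(state, lambda s: len(s["fonts"])))  # prints 2
    print(run_in_child(state, lambda s: s["locale"]))      # prints en
```

Whatever a child does to its copy of state, including crashing, stays in the child. The parent keeps the clean, initialised copy for the next fork.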

Disassembling our code

Producing “code” is a discipline that requires critical thinking combined with language skills and plain old engineering. The language part sets it apart from other engineering disciplines like construction or electrical engineering.

We often say: “I’m just a C guy” or “He’s a Clojure ninja”, but why is the dialect we speak so important at all? Computer languages, just like natural languages, differ in expressiveness, number of speakers, idioms and trends.

“The big picture is lost in the process”

But all the languages have one thing in common - they are, in the end, translated to a language that computers can understand: machine code. As this is essentially a one-way process, computer languages do not have to encode colour, emotions, secondary meanings or the “big picture”. They’re confusing for us because their semantics are for computers, but their syntax is for humans.

Since the advent of DNA sequencing and the central dogma, we’ve been curious about what our own machine code looks like. The Human Genome Project was declared complete more than a decade ago now, but that’s just the first step. Genetic information is stored in the nucleus, the core, of our cells. It’s very tiny of course, and also packed very tightly to fit, which makes it impractical to read. Fortunately it also encodes the machinery to unravel it, clone it, fill it and cut it. A toolkit of sorts. Using it enabled all kinds of sequencing methods, notably shotgun sequencing. Instead of tedious clone-by-clone mapping³, it breaks the DNA sequence randomly into many small pieces and reassembles the sequence from overlapping fragments. In other words, it accelerates the process by sheer parallelization, another ubiquitous pattern.
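The reassembly step can be sketched as a greedy merge of the two fragments with the longest overlap, repeated until nothing overlaps any more. This is a toy under generous assumptions (error-free reads, no repeats, a single strand). Real assemblers are far more involved, and the names overlap and assemble are mine.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are treated as coincidence and ignored.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(fragments, min_len=3):
    """Greedily merge the pair with the longest overlap until none is left."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # the remaining pieces do not overlap; leave them as contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
    print(assemble(reads))  # prints ['ATGGCGTACGTTAGC']
```

The order of the reads doesn’t matter here, which is the point: the fragments can be read in parallel, and only the final puzzle-solving is sequential.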

Quick facts that we know - where binary code uses 2 symbols, genetic information is encoded with 4 types of bases (counted in base pairs, bp). Each “triplet” of bases then encodes an amino acid (AA). Three is the shortest length that works, since 4² = 16 combinations can’t cover the 20 amino acids while 4³ = 64 can, so this would have been the most economical choice if it weren’t for the fact that it’s redundant: a single AA can be encoded by a multitude of triplets. CPU instructions, on the other hand, are for the most part encoded by one unique sequence of bits. Nature isn’t irked at all about storage efficiency.
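A few lines of Python make the redundancy visible. The table below is only a small excerpt of the standard genetic code, written in DNA letters, and the helper names are my own.

```python
from itertools import product

BASES = "ACGT"
TRIPLETS = ["".join(t) for t in product(BASES, repeat=3)]  # every possible triplet

# Excerpt of the standard genetic code (DNA notation); the full table has 64 entries.
CODONS = {
    "Leu": ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"],
    "Met": ["ATG"],
    "Trp": ["TGG"],
    "Stop": ["TAA", "TAG", "TGA"],
}
TRIPLET_TO_AA = {t: aa for aa, triplets in CODONS.items() for t in triplets}


def shortest_codon_length(symbols=4, meanings=21):
    """Smallest k such that symbols**k labels cover 20 amino acids plus a stop."""
    k = 1
    while symbols ** k < meanings:
        k += 1
    return k


def translate(dna):
    """Translate a DNA string triplet by triplet, using the excerpt above."""
    return [TRIPLET_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)]


if __name__ == "__main__":
    print(len(TRIPLETS))             # prints 64
    print(shortest_codon_length())   # prints 3
    # Two different "machine codes", one and the same product:
    print(translate("ATGCTTTGG"))    # prints ['Met', 'Leu', 'Trp']
    print(translate("ATGTTATGG"))    # prints ['Met', 'Leu', 'Trp']
```

Leucine alone takes six of the 64 slots, while methionine and tryptophan get one each.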

The genetic code contains not only structural elements, but also many types of control elements with variable width and sensing mechanisms. For example, a simple motif in computer languages is branching:

-- run the effect only when the switch is on
if enabled then
	effect()
end

Nature has similar motifs that enable or disable the pieces of code that follow (promoters, e.g. the Goldberg-Hogness box). But very often (especially in more complex organisms) you can find a sum/product of a series of conditions called “enhancers” and “repressors”, very much like a compound condition.

-- enhancers and repressors combine with the switch into one condition
if enabled and enhancer_a and not repressor_b then
	effect()
end
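With a whole series of such conditions, the “sum/product” reads naturally as any/all. The following is a deliberately crude boolean toy, not a model of real gene regulation. The function name, the site names and the rule itself (at least one enhancer bound, no repressor bound) are assumptions made for the sake of the example.

```python
def is_expressed(promoter_enabled, enhancers, repressors):
    """Toy rule: promoter on AND any enhancer bound AND no repressor bound.

    `enhancers` and `repressors` map a (hypothetical) site name to True
    when something is bound to that site.
    """
    return (
        promoter_enabled
        and any(enhancers.values())        # the "sum" (OR) over enhancers
        and not any(repressors.values())   # a single repressor vetoes the lot
    )


if __name__ == "__main__":
    enhancers = {"enhancer_a": True, "enhancer_c": False}
    print(is_expressed(True, enhancers, {"repressor_b": False}))  # prints True
    print(is_expressed(True, enhancers, {"repressor_b": True}))   # prints False
    print(is_expressed(False, enhancers, {"repressor_b": False})) # prints False
```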

Now what we don’t have is a “CPU reference manual” to decipher the semantics of the code. For example, we can look in the x86 manual and see that the ADD instruction performs addition. This makes disassembly (and decompilation to some degree) possible, but we have no such knowledge about many motifs found in the genome. In order to make any sense of them, we need to learn to recognize patterns first.

The waggle dance

Swarming and flocking animals use much less verbose methods of communication, and so they provide viable models in the age of anycast network routing, containers, clusters and IoT. For instance, the problem of foraging bees is strikingly similar to the man-made problem of cost and routing in computer networks. Through a specific dance, the foragers communicate not only the distance and orientation of the food source, but also its quality. This motivates other foragers to switch to the current best food source, and to dance in return to motivate even more workers.

Now swap the words “food source” with “destination”, and “forager” with “message”. The ZigBee protocol borrowed from the bees (Apis) not only its name, but also the behavior of the repeater radios in an electronic ZigBee network. Granted, it hasn’t been overly successful yet.

On the macroscopic level, the behaviour of the hive is useful for making rule-of-thumb decisions, the bread and butter of programmers. When the hive is foraging, it has two choices - either stay with the current food source or scout for a better one, which may or may not pay off. The waggle dance of the returning forager communicates the direction to the food source and an integrated travel length (cost). Depending on the net gain, the dance may recruit more foragers, subject to several other constraints:

Foragers stay true to their food source of choice until its quality worsens.
More than 3/4 of the foragers prefer to stay within a comfortable distance (1/12 - 1/4 of the range), and only a few experienced ones fly farther.
The amount of foraging activity is limited by the environment.

The same choice paralysis exists in computer networking software, where instead of food we’re choosing servers for the next hop. The question is always - did we pick the best performing server, and what if something has changed in the meantime? What if our hero fails? A good rule of thumb is to stick with the best performing server (within the range) with at least 3/4 probability (fidelity). Otherwise, try a different server with probability proportional to its known quality (scouting). Back off if all servers are consistently bad (weather).
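Put into code, the rule of thumb might look roughly like this. It’s a sketch of the three rules as stated above, not the code of any real resolver. The function name, the quality scores (higher is better) and the min_quality cut-off are assumptions.

```python
import random


def pick_server(quality, fidelity=0.75, min_quality=0.0, rng=random):
    """Pick the next server from `quality`, a dict of server -> known score.

    Returns None when every server is at or below `min_quality`.
    """
    usable = {s: q for s, q in quality.items() if q > min_quality}
    if not usable:
        return None  # "bad weather": back off instead of hammering anyone
    best = max(usable, key=usable.get)
    others = [s for s in usable if s != best]
    if not others or rng.random() < fidelity:
        return best  # fidelity: stay with the current best most of the time
    # Scouting: try somebody else, favouring servers with better known quality.
    weights = [usable[s] for s in others]
    return rng.choices(others, weights=weights, k=1)[0]


if __name__ == "__main__":
    quality = {"a": 0.9, "b": 0.5, "c": 0.1}
    picks = [pick_server(quality) for _ in range(10000)]
    for server in quality:
        print(server, picks.count(server) / len(picks))
```

Over many picks, roughly three quarters go to the best server, and the rest are spread over the others by quality. The occasional scouting trip is what lets a recovered server get noticed again, provided the scores are updated from the measured results (not shown here).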

This simple rule-of-thumb algorithm shows both fast response to changes and congestion control, compared to traditional approaches like exponential decay, where the round-trip time (cost) of an inactive server decays over time, so the application has an incentive to retry it. In fact, I’ve implemented name-server selection code based on this algorithm in the Knot Resolver at CZ.NIC.

Patterns revisited

I feel that I’m sometimes too burdened with the need to make everything right on the first try, and yet it rarely ever works. If it did, there would be no need for versioning or patches. I’m ashamed when the coding standard is not good enough, of the printf("swearword") debugging strategy, of copypasta code. I’m a programmer, maybe you are as well, and it hurts our pride. Yet even an imperfect model can reveal a lot of important clues about how to make the final version suck less. That’s why builders build models and scientists experiment.

Looking at the successful patterns in life can help us understand why approaches live or die, from the lowest level of cellular life, through sustainability mechanisms, to behaviour analysis of animals. Now I turn it over to you, readers: if there’s some part that interests you more - fire away. I’d like to write more about redundancy, and more “rules of thumb”.

¹ These cells are called “stem cells”. They’re special because they have the potential to become anything. You might have heard about them because of stem cell therapy, a transplantation of these cells.
² Threads are finicky to work with and overused, but more about that later.
³ Figuring out where to place the sequence fragment on the chromosome.

Written on June 19, 2015