Dan Stroot

Building software to last forever


Hacker News discussion: https://news.ycombinator.com/item?id=32997102

New York subway trains, nicknamed the Brightliners for their shiny unpainted exteriors, were introduced in 1964 and built to last 35 years. When they were finally retired from service in January 2022, they had been riding the rails of New York City for 58 years and were the oldest operating subway cars in the world. That’s 23 unplanned, extra years of use.

Replacement parts were impossible to find; the firms that made them — the Budd Company, Pullman-Standard, and Westinghouse — were long gone. With components well past their design life, mechanics plucked similar parts from retired vehicles or engineered substitutes. Maintenance became an improvisational process relying on the know-how that long-time employees had built up over many years. In a way, it was a small miracle that the cars lasted so long, an “anomaly,” as one mechanic put it. Except it wasn’t.

Back in the 1980s, MTA president David L. Gunn kicked off an ambitious overhaul of the crumbling, graffiti-covered subway. In 1990, this approach became the "scheduled maintenance system": every two to three months, more than 7,000 railcars are taken in for inspection, with the goal of catching problems before they happen. That’s the difference between maintenance and repair. Repair is fixing something that’s already broken. Maintenance is about making something last.

Software can serve well past its design life too: SABRE, the airline reservation system that American Airlines and IBM first built in the 1960s, is still in service. Notably, in 2020 Sabre signed a ten-year deal with Google to migrate SABRE to Google's cloud infrastructure.

Maintenance mostly happens out of sight, mysteriously. If we notice it, it’s a nuisance. When road crews block off sections of highway to fix cracks or potholes, we treat it as an obstruction, not a vital and necessary process. When your refrigerator needs maintenance, you defer it, hoping you can just buy a new one later. If you wait long enough and the refrigerator breaks down, you get upset at the repair bill, not at yourself for deferring the maintenance.

This is especially true in the public sector: it’s almost impossible to get governmental action on, or voter interest in, spending on preventive maintenance, yet governments make seemingly unlimited funds available once we have a disaster. We are okay spending a massive amount of money to fix a problem, but consistently resist spending a much smaller amount to prevent it; as a strategy, this makes no sense.

Many organizations say they want to be "data-driven," but that requires critical thinking skills as well. In some places I have seen "being data-driven" lead to poor decisions, because data can be (and is) manipulated to support a desired outcome.

Many times I have found the technology team and the business users arguing over "bug" vs. "enhancement". When I ask why, the answer is always the same: it is a way of assigning blame. "Bug" means it's engineering's fault; "enhancement" means it was a requirement the business missed. I always tell everyone that both are just maintenance, and the only important decision is which to fix first.

Organizational Design

Conway’s law tells us that a system’s technical architecture and the organization’s communication structure mirror each other, so organizational design matters. Nothing signals that you’re serious about accomplishing something more effectively than changing people's scenery, but don't assume that organizational change is necessary out of the gate.

Reorgs are incredibly disruptive. They are demoralizing. They send the message to rank-and-file engineers that something is wrong: they built the wrong thing, the product they built doesn’t work, or the company is struggling. Reorgs increase workplace anxiety and decrease productivity. The fact that they almost always end with a few odd people out who are subsequently let go exacerbates the anxiety.

What you don’t want to do is draw a new organization chart based on your vision for how teams will be arranged with the new system. You don’t want to do this for the same reason that you don’t want to start product development with everything designed up front. Your concept of what the new system will look like will be wrong in some minor ways you can’t possibly foresee. You don’t want to lock in your team to a structure that will not fit their needs.

The only way to design communication pathways is to give people something to communicate about. In each case, we allow the vision for the new organization to reveal itself by designing structures that encourage new communication pathways to form in response to our modernization challenges. As the work continues, those communication pathways begin to solidify, and we can begin documenting and formalizing new teams or roles. In this way, we sidestep the anxiety of reorganizing. The workers determine where they belong based on how they adapt to problems; workers who would typically be left out are given time and space to learn new skills or prove themselves in different roles; and by the time the new organization structure is ratified by leadership, everyone has already been working that way for a couple of months.

Who needs to communicate with whom may not be clear when you get started. This is an exercise I use to help reveal where the communication pathways are or should be. I give everyone a piece of paper with a circle drawn on it. The instructions are to write down the names of the people whose work they are dependent on inside the circle (in other words, “If this person fell behind schedule, would you be blocked?”) and the names of people who give them advice outside the circle. If there’s no one specific person, they can write a group or team name or a specific role, like frontend engineer, instead. Then I compare the results across each team. In theory, those inside the circle are people with whom the engineer needs to collaborate closely. Each result should resemble that engineer’s actual team with perhaps a few additions or deletions based on current issues playing out.

Outside the circle should be all the other teams. Experts not on the team should be seen as interchangeable with other experts in the same field. Small variations will exist from person to person, but if the visualizations that people produce don’t look like their current teams, you know your existing structure does not meet your communication needs.
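A quick way to tabulate the results is to transcribe each engineer's inside-the-circle names into sets and diff them against the current org chart. The sketch below is a minimal illustration; the function and the sample data are hypothetical, not part of the original exercise.

```python
# A minimal sketch of tabulating the circle exercise, assuming the results
# have been transcribed into simple name sets. All data here is hypothetical.
def compare_circles(circles: dict[str, set[str]],
                    teams: dict[str, set[str]]) -> None:
    """For each engineer, diff their inside-the-circle names
    (people whose work they depend on) against their actual team."""
    for engineer, inside in circles.items():
        team = teams.get(engineer, set())
        outside_deps = inside - team  # depends on people not on their team
        idle_links = team - inside    # teammates they don't depend on
        if outside_deps or idle_links:
            print(f"{engineer}: depends on non-teammates {sorted(outside_deps)}; "
                  f"teammates not depended on: {sorted(idle_links)}")

# Hypothetical results for a three-person team
circles = {"ana": {"ben", "cho"}, "ben": {"ana"}, "cho": {"ana", "dev"}}
teams   = {"ana": {"ben", "cho"}, "ben": {"ana", "cho"}, "cho": {"ana", "ben"}}
compare_circles(circles, teams)
```

Names that keep showing up as non-teammate dependencies across many circles are a strong hint about where a team boundary actually belongs.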

Some of Conway's other observations include the following: individual incentives have a role in design choices. People will make design decisions based on how a specific choice — using a shiny new tool or process — will shape their future. Minor adjustments and rework are unflattering. They make the organization and its future look uncertain and highlight mistakes. To save face, reorgs and full rewrites become preferable solutions, even though they are more expensive and often less effective.

An organization’s size affects the flexibility and tolerance of its communication structure. When a manager’s prestige is determined by the number of people reporting up to her and the size of her budget, the manager will be incentivized to subdivide design tasks, which in turn will be reflected in the efficiency of the technical design — or as Conway put it: “The greatest single common factor behind many poorly designed systems now in existence has been the availability of a design organization in need of work.” Conway’s observations are even more important in maintaining existing systems than in building new ones.

The Open Mainframe Project estimates that there are about 250 billion lines of COBOL code running today in the world economy, and nearly all COBOL code contains critical business logic.

Building Software to Last Forever

Stewart Brand, the editor of the Whole Earth Catalog, described the concept that buildings are composed of architectural "layers" that change at different rates:

  • (INFRA)STRUCTURE: foundation and load-bearing elements; very difficult to change
  • SKIN: exterior surfaces; changes every 20 years or so (more air-tight, better insulated)
  • SERVICES: plumbing, electrical, etc.; changes every 7-15 years
  • SPACE PLAN: interior layout (walls, ceilings, floors, doors); changes every 3-10 years

This concept can be applied to software systems in a similar way.
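As a rough sketch of the analogy, a typical web application can be layered the same way. The mapping and change cadences below are illustrative assumptions, not Brand's:

```python
# A speculative mapping of Brand's building layers onto a web application.
# The examples and change cadences are illustrative assumptions.
SOFTWARE_LAYERS = {
    "STRUCTURE":  ("data model, core domain logic",     "years; very difficult to change"),
    "SKIN":       ("public APIs and wire formats",      "every few years, versioned carefully"),
    "SERVICES":   ("frameworks, databases, middleware", "every 1-3 years"),
    "SPACE PLAN": ("UI layout, feature flags, config",  "weeks to months"),
}

for layer, (example, cadence) in SOFTWARE_LAYERS.items():
    print(f"{layer:<11} {example:<35} changes: {cadence}")
```

The design rule the analogy suggests: let fast-changing layers depend on slow-changing ones, never the reverse, so that rearranging the space plan never requires touching the foundation.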

In his 1984 book Normal Accidents, Charles Perrow coined the term "normal accidents" to describe systems so prone to failure that no amount of safety procedures can eliminate accidents entirely. According to Perrow, "normal accidents" are not the product of bad technology or incompetent staff. Rather, systems that experience normal accidents display two important characteristics:

  1. They are tightly coupled. When separate components depend on each other, they are said to be coupled. In tightly coupled systems, there’s a high probability that a change in one component will affect the others. Tightly coupled systems produce cascading effects.

  2. They are complex. Signs of complexity in software include the number of direct dependencies and the depth of the dependency tree. Computer systems naturally grow more complex as they age, because we tend to add more and more features to them over time.

Tightly coupled and complex systems are prone to failure because the coupling produces cascading effects, and the complexity makes the direction and course of those failure cascades impossible to predict.

For a simple example of coupling, imagine software that tightly links three components, each 99% reliable. The overall reliability of the system is 0.99 to the third power, or 0.99 x 0.99 x 0.99 = 0.97 (97% reliable). Now imagine ten tightly linked components instead of three. The system's reliability would be 0.99 to the tenth power, or only about 90%. It would be out of service roughly 10% of the time, despite being made of highly reliable components.
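The arithmetic is easy to verify. Here is a minimal sketch using the hypothetical figures from the example above:

```python
# Serial reliability: a tightly coupled system works only when every
# component works, so per-component reliabilities multiply.
def serial_reliability(component_reliability: float, n_components: int) -> float:
    return component_reliability ** n_components

print(round(serial_reliability(0.99, 3), 3))   # 0.97  -> 97% available
print(round(serial_reliability(0.99, 10), 3))  # 0.904 -> ~90% available
```

Every component added in series multiplies into the failure math; decoupling, so that one component's failure cannot take the others down, is what breaks the chain.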

Resilient design is decoupled and as simple as possible.

In short, small organizations build monoliths because small organizations are monoliths.

Culturally, the engineering organization was flat, with teams formed on an ad hoc basis. Opportunities to work on interesting technical challenges were awarded based on personal relationships, so the organization’s regular hack days became critical networking events. Engineers wanted to build difficult and complex solutions to advertise their skills to the lead engineers who were assembling teams.

Well-designed, high-functioning software that is easy to understand usually blends in. Simple solutions do not do much to enhance one’s personal brand. They are rarely worth talking about. Therefore, when an organization provides limited pathways to promotion for software engineers, those engineers are incentivized to make technical decisions that emphasize their individual contribution over integrating well into an existing system.

From the Hacker News discussion linked above:

The planning isn't the issue, it's the conviction, resolve, determination...whatever you want to call it.

COBOL modernization is quite expensive and fraught with major risk. You're talking about multi-year efforts where the key players who initiate the effort are often not around to see it through. Limited legacy resources, poor-to-absent documentation, mission critical systems...it's a mess no one internally wants to touch or take the fall for. MicroFocus, Blu Age (now part of AWS), TSRI, and plenty of other companies are happy to take your money and tell you it can be done, and it certainly can be done technically.

The reality is the cost benefit is often years down the line, so keeping the right people around to sustain the effort and see it through financially and technically is the tallest of enterprise orders. So it's the old two-step: 1) Pay IBM for another three years of support and licensing and 2) Go through the motions of migration planning then repeat Step 1.


Resilient Design

Software that is resilient is generally infrastructure-agnostic. Some guidelines for resilient design:

  1. Use open standards, or at least widely adopted standards.
  2. Decouple as much as possible (see the circuit-breaker sketch after this list).
  3. Be boring.
  4. "Bake in" observability and testing. When both are lacking on your legacy system, observability comes first. Tests only tell you what won’t fail; monitoring and observability tell you what is failing.

Basically, a system is resilient if it continues to carry out its mission in the face of adversity (i.e., if it provides required capabilities despite excessive stresses that can cause disruptions). Being resilient is important because no matter how well a system is engineered, reality will sooner or later conspire to disrupt the system. Residual defects in the software or hardware will eventually cause the system to fail to correctly perform a required function or cause it to fail to meet one or more of its quality requirements (e.g., availability, capacity, interoperability, performance, reliability, robustness, safety, security, and usability). The lack or failure of a safeguard will enable an accident to occur. An unknown or uncorrected security vulnerability will enable an attacker to compromise the system. An external environmental condition (e.g., loss of electrical supply or excessive temperature) will disrupt service.

Adverse events are events that, due to the stresses they impose, can disrupt critical capabilities by causing harm to associated assets. These adverse events (with their associated quality attributes) include the occurrence of the following:

  • Adverse environmental events, such as loss of system-external electrical power, as well as natural disasters such as earthquakes or wildfires (Robustness, specifically environmental tolerance)
  • Input errors, such as operator or user error (Robustness, specifically error tolerance)
  • Externally-visible failures to meet requirements (Robustness, specifically failure tolerance)
  • Accidents and near misses (Safety)
  • Cybersecurity/tampering attacks (Cybersecurity and Anti-Tamper)
  • Physical attacks by terrorists or adversarial military forces (Survivability)
  • Load spikes and failures due to excessive loads (Capacity)
  • Failures caused by excessive age and wear (Longevity)
  • Loss of communications (Interoperability)
