History of DevOps at $CLIENT

Or, why it’s not all unicorns and rainbows and docker clusters.

Created by Laurence J MacGuire a.k.a. 刘建明 a.k.a. Liu Jian Ming. With some wording/ideas stolen/paraphrased from internal documents.

ThoughtWorks Xi’An, 2017/02/10

About Me

Started running Linux in ~2000.
Had a server room in my closet. Started programming.
Did freelancing & then startups in Montreal
At ThoughtWorks since 2013, all of it in China

What I do

Helping people. ~100 devs in my office.
Very AWS heavy.
~~Slowly~~ Incrementally moving into Docker.

First

This talk is almost all non-technical.

But! Every slide is an invitation to questions. There are technical reasons that explain our behaviour.

The Organisation

Australia’s Largest $SOMETHING Website

Started in 199X, as an ISP
2000s: transition to a domain-specific website
Today: a tech-driven media company
Around 300 devs (for Australia only)
Expanding in Asia

Pre 2011

Two Data-Centers (Sydney & Amsterdam)
Varying amounts of automation
Still a lot of manual deployments

~2011

VMware / vSphere
NetApp / NetScaler
Puppet
Associated (custom) tooling

All supported by one central team

Dev & Ops Dev vs Ops

Simba!

Dev & Ops Dev vs Ops

Office Space Printer!

AWS Comes in

it will be cheap, they said

Turns out, it’s not cheap. But it’s arguably worth the cost.

Milestone

The first China-based team is one of the first to get all dev done on AWS.

Circa 2011

“Gandalf”

Major effort to have one AWS-based dev/test environment.

One AWS Account
One VPC
Devs can each get a stack
And include which company components they need

Circa 2012

Milestone

Our main monolithic app can now be no-downtime-deployed w/ a few clicks.

Circa 2013

TMI

Team Managed Infrastructure

A team should 100% own their project. App code & infrastructure

Circa 2014

Silos

With Opsy type people embedded in each.

Systems/Staff/Size Skyrockets

hockey sticking

Ambitious People

Not Invented Here Syndrome

A couple of mismanaged changes
Hurtful political fights
Reading doco takes too long
Learn by re-inventing
“I can do it better”

Temptation

Diverging

Real Life Examples

Omniture Sucks. I’ll roll my own.
Splunk Sucks. I’ll roll out ELK.
Docker Docker Docker.
Kubernetes does all this cool stuff.
Bamboo sucks. Let’s use Concourse CI.
Rails sucks. Scala/Go/Elm/Elixir/PHP/Crystal is better.

In Human Terms

This old thing isn’t new and shiny anymore.

It doesn’t mean you’re wrong. Or the new thing isn’t better.

Central Team Overburdened

Lots of responsibility
Lack of authority & resources
- Good ideas on how to solve problems
- But little people-time to lead by example

Two Approaches

Carrot vs Stick

Long Term Danger

Train Yard

Lots of Machinery

Currently about 130 AWS accounts.
And still two Data-Centers.

Circa Now

Milestone

OpenID-based auth to our AWS accounts (and many other systems). For The Win.

Circa 2014

For better & for worse

Most teams have/had their own way of doing things.

Everybody learned A LOT

Product Lifecycle

When a product is “feature complete”, who owns it?

Who owns it?

Sadly, usually Opsy people

Back To This

Office Space Printer!

High Leverage Team

“ABCDEFG”

Merging Tracks

History

Been around for ~4 years (in one form or another)
But our broader influence is more recent

Team Members

5 Australia-based members
1 China-based member

(A ThoughtWorks/Client first)

4x Devs whose domain is Ops
2x Ops who can (kind of) code

Our Infrastructure

Stuff we directly support

Docker Registry

Authenticated Registry
Available on the Internet
No IP whitelisting
Thousands of images / 5-6TB worth
SLA 24/7

Docker Registry Auth Provider

Supply Docker Creds to IAM entities through S3
Can be Read/Write or Read Only
SLA 24/7
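
A minimal sketch of how a consumer of this service might turn supplied credentials into a Docker client config. Only the `auths` layout of Docker's `config.json` is standard; the S3 key naming, the access-level names, and both function names are assumptions made for illustration, not the real provider's interface.

```python
import base64
import json


def docker_config(registry: str, username: str, password: str) -> str:
    """Build the contents of a Docker client config.json for one registry."""
    # Docker stores basic-auth credentials as base64("username:password").
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": token}}}, indent=2)


def credentials_key(access: str) -> str:
    """Map an access level to a hypothetical S3 object key.

    The key layout is an assumption; the idea is that S3 permissions
    decide which IAM entities can read which credential object.
    """
    if access not in ("read-only", "read-write"):
        raise ValueError(f"unknown access level: {access}")
    return f"docker-creds/{access}.json"
```

An IAM entity with read access to the key from `credentials_key("read-only")` would fetch that object, then write the output of `docker_config(...)` to `~/.docker/config.json` before pulling.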

Buildkite Agent Deployment

Give you a clean and easy tool to deploy agents.
And best practices/stencils on how to use it.
We don’t maintain agents (not really, anyway)

A Few Legacy Things

Nexus (Artefact hosting/management)
Internal Rubygems
A few Bamboo (CI) instances
A few more things from being in a shared support pool/roster.

SLAs between 9-5 and 24/7

Company-Wide Deployment Tool

Simple YAML interface.
What?
- Docker images + config
Where?
- Which AWS Account/Network
Others
- Logging
- Monitoring
- Alerting
- AutoScaling
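
A rough sketch of the shape such a descriptor might take once parsed, following the What / Where / Others split above. Every field name here is hypothetical; the real tool's YAML schema is not shown in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Hypothetical in-memory form of one deployment descriptor."""

    # What? A Docker image plus its configuration.
    image: str
    config: dict = field(default_factory=dict)
    # Where? The target AWS account and network.
    aws_account: str = ""
    network: str = ""
    # Others: cross-cutting features the tool would wire up.
    logging: bool = True
    monitoring: bool = True
    alerting: bool = True
    min_instances: int = 1
    max_instances: int = 1

    def problems(self) -> list:
        """Return human-readable validation errors (empty if valid)."""
        errors = []
        if not self.image:
            errors.append("image is required")
        if not self.aws_account:
            errors.append("aws_account is required")
        if not self.network:
            errors.append("network is required")
        if self.min_instances < 1:
            errors.append("min_instances must be at least 1")
        if self.max_instances < self.min_instances:
            errors.append("max_instances must be >= min_instances")
        return errors
```

Keeping the descriptor this small is the point: teams state what runs and where, and the tool supplies sensible defaults for everything else.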

Adoption Curve

Tenets: Open

We build internal Open-Source software. We invite contributions from all our users.

Tenets: Components

Build things in a modular fashion, so we can easily replace components.

Tenets: Conventions

Borrowing over inventing. It will make it easier to integrate existing tools later.

Tenets: Interfaces

For tools to work together well, the interfaces need to be simple and stable.

Aphorism

All the boats rise with the tide

Approach

Carrot vs Stick

Different ratios

How We Work

Very Present on Slack (Chat/Comms tool)
Open schedules/plans (90 days / 180 days. All visible)
Working plans/schedules open to input from others
Open to re-prioritising (based on needs/input)

How We Work

Perhaps more importantly

Not afraid of saying
- “No, we’d rather not merge this PR (because…)”
- “I don’t know (…let me find out)”

(Even more) Remote Friendly

Document (in writing) many design decisions.
Document (debate) PRs and Feature Requests.

Work we don’t want

Fix your builds
Deploy your software
Be on your pager roster

Not that we won’t. But!
Give a man a fish … teach a man how to fish …

Daily Work / Responsibilities

Identify good patterns
Create stencils that embody them
Consult with various teams

Key Take-Aways

Innovation Is Great
Change Management is really hard
Developer happiness is important, but it can be costly
Offering good support is key
Evaluate the cost of new things

Questions?

Comments? Insults?

History

2011, started seeing a lot of automation work, a lot of Puppet.
2011, home ideas. First product developed in China, and one of the very first to use EC2 for a dev/test environment
2012, Gandalf environment. Set of Ruby CLI tools and a complete EC2/VPC setup.
2013, made the main app click-to-deploy
2014, first deploy of the main web app by a contractor
2014, move towards TMI
2014, introduced IDP-based auth for our AWS accounts
2015, trying to get 90% of our stuff in AWS
2016, TMI is harder than we thought; Docker adoption
2016, “all in” for AWS, but a lot of legacy stuff is still in the DC
2017, cost optimisation vs growth
2018, year of the Docker cluster, maybe