History of DevOps at $CLIENT

Or, why it’s not all unicorns and rainbows and docker clusters.

Created by Laurence J MacGuire, a.k.a. 刘建明, a.k.a. Liu Jian Ming. With some wording/ideas stolen/paraphrased from internal documents.

ThoughtWorks Xi’an, 2017/02/10

Creative Commons License

About Me

  • Started running Linux in ~2000.
  • Had a server room in my closet. Started programming.
  • Did freelancing & then startups in Montreal
  • At ThoughtWorks since 2013, all of it in China

What I do

  • Helping people. ~100 devs in my office.
  • Very AWS heavy.
  • Slowly, incrementally moving to Docker.

First

This talk is almost all non-technical.

But! Every slide is an invitation to questions. There are technical reasons that explain our behaviour.

The Organisation

Australia’s Largest $SOMETHING Website

  • Started in 199X, as an ISP
  • 2000s: transition to a domain-specific website
  • Today: a tech-driven media company
  • Around 300 devs (for Australia only)
  • Expanding in Asia

Pre 2011

  • Two Data-Centers (Sydney & Amsterdam)
  • Varying amounts of automation
  • Still a lot of manual deployments

~2011

  • VMware / vSphere
  • NetApp / NetScaler
  • Puppet
  • Associated (custom) tooling

All supported by one central team

Dev & Ops Dev vs Ops

Simba!

Dev & Ops Dev vs Ops

Office Space Printer!

AWS Comes in

it will be cheap, they said

Turns out, it’s not cheap. But it’s arguably worth the cost.

Milestone

The first China-based team is one of the first to get all dev done on AWS.

Circa 2011

“Gandalf”

Major effort to have one AWS-based dev/test environment.

  • One AWS Account
  • One VPC
  • Devs can each get a stack
  • And choose which company components to include in it

Circa 2012

Milestone

Our main monolithic app can now be no-downtime-deployed w/ a few clicks.

Circa 2013

TMI

Team Managed Infrastructure

A team should 100% own their project: app code & infrastructure.

Circa 2014

Silos

With Opsy type people embedded in each.

Systems/Staff/Size Skyrockets

hockey sticking

Ambitious People

Not Invented Here Syndrome

  • A couple of mismanaged changes
  • Hurtful political fights
  • Reading doco takes too long
  • Learn by re-inventing
  • “I can do it better”

Temptation

Diverging

Real Life Examples

  • Omniture sucks. I’ll roll my own.
  • Splunk sucks. I’ll roll out ELK.
  • Docker Docker Docker.
  • Kubernetes does all this cool stuff.
  • Bamboo sucks. Let’s use Concourse CI.
  • Rails sucks. Scala/Go/Elm/Elixir/PHP/Crystal is better.

In Human Terms

This old thing isn’t new and shiny anymore.

It doesn’t mean you’re wrong. Or the new thing isn’t better.

Central Team Overburdened

  • Lots of responsibility
  • Lack of authority & resources
    • Good ideas on how to solve problems
    • But little people-time to lead by example

Two Approaches

Carrot vs Stick

Long Term Danger

Train Yard

Lots of Machinery

  • Currently about 130 AWS accounts.
  • And still two Data-Centers.

Circa Now

Milestone

OpenID based auth to our AWS accounts (and many other systems). For The Win.

Circa 2014

For better & for worse

Most teams have/had their own way of doing things.

Everybody learned A LOT

Product Lifecycle

When a product is “feature complete”, who owns it?

Who owns it?

Sadly, usually Opsy people

Back To This

Office Space Printer!

High Leverage Team

“ABCDEFG”

Merging Tracks

History

  • Been around for ~4 years (in one form or another)
  • But our broader influence is more recent

Team Members

  • 5 Australia-based members
  • 1 China-based member

(A ThoughtWorks/Client first)

  • 4x Devs whose domain is Ops
  • 2x Ops who can (kind of) code

Our Infrastructure

Stuff we directly support

Docker Registry

  • Authenticated Registry
  • Available on the Internet
  • No IP whitelisting
  • Thousands of images / 5-6TB worth
  • SLA 24/7

Docker Registry Auth Provider

  • Supplies Docker credentials to IAM entities through S3
  • Can be Read/Write or Read Only
  • SLA 24/7
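
The slides don't say how the credentials are laid out in S3, so the following is only a sketch of one plausible arrangement: one credentials object per access level, with a bucket policy that lets each IAM principal read the object matching its level. The bucket name, key layout, function names and role ARNs are all hypothetical.

```python
import json
from typing import Dict

# Hypothetical bucket name, used for illustration only.
CREDS_BUCKET = "example-registry-creds"


def creds_key(access: str) -> str:
    """Return the (assumed) S3 key holding credentials for an access level."""
    if access not in ("read-only", "read-write"):
        raise ValueError(f"unknown access level: {access}")
    return f"docker/{access}/config.json"


def bucket_policy(grants: Dict[str, str]) -> dict:
    """Build an S3 bucket policy from {IAM principal ARN: access level}."""
    statements = []
    for principal_arn, access in sorted(grants.items()):
        statements.append({
            "Effect": "Allow",
            "Principal": {"AWS": principal_arn},
            # Principals only ever read the credentials object.
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{CREDS_BUCKET}/{creds_key(access)}",
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    policy = bucket_policy({
        "arn:aws:iam::111111111111:role/ci-agent": "read-write",
        "arn:aws:iam::222222222222:role/app-host": "read-only",
    })
    print(json.dumps(policy, indent=2))
```

With this shape, a CI agent that pushes images would be granted the read-write object, while application hosts that only pull would get the read-only one.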

Buildkite Agent Deployment

  • Gives you a clean and easy tool to deploy agents.
  • And best practices/stencils on how to use it.
  • We don’t maintain agents (not really, anyway)

A Few Legacy Things

  • Nexus (Artefact hosting/management)
  • Internal Rubygems
  • A few Bamboo (CI) instances
  • A few more things from being in a shared support pool/roster.

SLAs range from 9-5 to 24/7

Company-Wide Deployment Tool

  • Simple YAML interface.
  • What?
    • Docker images + config
  • Where?
    • Which AWS Account/Network
  • Others
    • Logging
    • Monitoring
    • Alerting
    • AutoScaling

Adoption Curve

Tenets: Open

We build internal Open-Source software. We invite contributions from all our users.

Tenets: Components

Build things in a modular fashion, so we can easily replace components.

Tenets: Conventions

Borrow rather than invent. It makes it easier to integrate existing tools later.

Tenets: Interfaces

For tools to work together well, the interfaces need to be simple and stable.

Aphorism

All the boats rise with the tide

Approach

Carrot vs Stick

Different ratios

How We Work

  • Very Present on Slack (Chat/Comms tool)
  • Open schedules/plans (90 days / 180 days. All visible)
  • Working plans/schedules open to input from others
  • Open to re-prioritising (based on needs/input)

How We Work

Perhaps more importantly

  • Not afraid of saying
    • “No, we’d rather not merge this PR (because…)”
    • “I don’t know (…let me find out)”

(Even more) Remote Friendly

  • Document (in writing) many design decisions.
  • Document (debate) PRs and Feature Requests.

Work we don’t want

  • Fix your builds
  • Deploy your software
  • Be on your pager roster

Not that we won’t. But!
Give a man a fish … teach a man how to fish …

Daily Work / Responsibilities

  • Identify good patterns
  • Create stencils that embody them
  • Consult in various teams

Key Take-Aways

  • Innovation Is Great
  • Change Management is really hard
  • Developer happiness is important, but it can be costly
  • Offering good support is key
  • Evaluate the cost of new things

Questions?

Comments? Insults?

History

  • 2011, started seeing a lot of automation work, a lot of Puppet.
  • 2011, home ideas. First product developed in China, and one of the very first to use EC2 for a dev/test environment.
  • 2012, Gandalf environment. A set of Ruby CLI tools and a complete EC2/VPC setup.
  • 2013, made the main app click-to-deploy
  • 2014, first deploy of the main web app by a contractor
  • 2014, move towards TMI
  • 2014, introduced IDP-based auth for our AWS accounts
  • 2015, trying to get 90% of our stuff in AWS
  • 2016, TMI is harder than we thought; Docker adoption
  • 2016, “all in” for AWS, but a lot of legacy stuff still in the DC
  • 2017, cost optimisation vs growth
  • 2018, year of the Docker cluster, maybe