Structured Logging

A Working Example

Created by Laurence J. MacGuire, a.k.a. Liu Jian Ming, a.k.a. 刘建明

ThoughtWorks Xi'An, 2015/05/20

Creative Commons License

The “5 W’s”

  • Who?
  • What?
  • Where?
  • When?
  • Why?
  • and how?

What are logs for?

  • Debugging
  • Auditing
  • Profiling

Distributed Logging

  • Horizontal vs Vertical scaling
ssh user@system-01.xyz.com tail -f /var/log/my-application.log
ssh user@system-02.xyz.com tail -f /var/log/my-application.log
ssh user@system-XX.xyz.com tail -f /var/log/my-application.log

Nope! Nope! Nope!

Centralised Access

“Distribute data and processing. Centralise access.”

– Me

  • Rsyslog
  • Graylog
  • Lumberjack
  • S3, SFTP, Whatever

File Based Logs

127.0.0.1 - - [01/Feb/2015:12:41:18 +0800] "GET /optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck HTTP/1.1" 404 506 "http://www.theguardian.com/us-news/2015/jan/31/detroit-aiyana-stanley-jones-police-officer-cleared" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/39.0.2171.65 Chrome/39.0.2171.65 Safari/537.36"
$> cat access.log | awk '{print $1}' | sort -n | uniq -c | sort -nr
  29877 93.xxx.xxx.xxx
  17538 80.xxx.xxx.xxx
   5895 198.xxx.xxx.xxx
   3042 37.xxx.xxx.xxx
   2956 208.xxx.xxx.xxx
   2613 94.xxx.xxx.xxx
   ...

Cool! But inaccessible.

It Still Fails

Some formats are really hard to parse.

Started GET "/meta/processed_dates.json" for 127.0.0.1 at 2015-01-25 18:24:49 +0800
Processing by Meta::ProcessedDatesController#all as JSON
  Rendered text template (0.0ms)
Completed 404 Not Found in 8.4ms (Views: 4.9ms | ActiveRecord: 0.5ms)

But there’s really cool information in there!

Better Logging Formats

host=8.8.8.8 level=info ip=86.86.86.86 msg="hello world" ...
Why not JSON?
{
  "host":"8.8.8.8",
  "level":"info",
  "ip":"86.86.86.86",
  "msg":"hello world"
}
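Emitting such events from application code is a one-liner per event. A minimal sketch in plain Ruby (`format_event` is a hypothetical helper, not a library API; field names mirror the example above):

```ruby
require 'json'
require 'time'

# Hypothetical helper: build one structured event as a single JSON line.
# @timestamp is added so downstream tooling (e.g. Logstash) can sort events.
def format_event(level, msg, fields = {})
  { "@timestamp" => Time.now.utc.iso8601,
    "level"      => level,
    "msg"        => msg }.merge(fields).to_json
end

# One event per line on stdout; a log shipper picks the lines up from there.
puts format_event("info", "hello world", "ip" => "86.86.86.86")
```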

Enter: Logstash

About logs. But not really about logs.

  • Written in Ruby/JVM
  • Actually, just a generic event processor
  • With logging specific tooling
  • OpenSource

Logstash Workflow

[Diagram: the Logstash workflow — inputs → filters → outputs]

Config - Inputs

Events go in, somehow

  • Files
  • TCP/IP
    • Rsyslog
    • Graylog
    • Lumberjack
    • Stomp
  • Write your own (it’s Ruby)

Config - Filters

Do stuff to them

  • Date: standardize formats
  • Grok: a large collection of reusable regular expressions
  • JSON: Parse JSON
  • GeoIP: IP to Geo information
  • Metrics: event statistics aggregation
  • Write your own (it’s Ruby)
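For example, a filter block that parses classic access-log lines with Grok and then normalises the timestamp with the date filter might look like this (a sketch against the Logstash 1.x config syntax):

```
filter {
  grok {
    # COMBINEDAPACHELOG is a stock pattern shipped with Logstash
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
  date {
    # Move the parsed access-log timestamp into @timestamp
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
```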

Config - Outputs

Put them somewhere

  • Elasticsearch
  • Email
  • S3
  • File
  • Graphite
  • Write your own (it’s Ruby)

Elasticsearch

More or less

  • JSON/HTTP based API
  • A transaction log
  • Distributed storage and retrieval
  • Lucene search engine
  • Good cloud support
  • OpenSource

Example Configs

Nginx

log_format logstash_json '{ "@timestamp": "$time_iso8601", '
 '"clientip": "$remote_addr", '
 '"time_spent.total": $request_time, '
 '"forwarded_for": "$http_x_forwarded_for", '
 '"response.cache_status": "$upstream_cache_status", '
 '"response.mime_type": "$sent_http_content_type", '
 '"response.status": $status, '
 '"response.size": $body_bytes_sent, '
 '"request.path": "$request_uri", '
 '"request.verb": "$request_method", '
 '"request.referrer": "$http_referer", '
 '"request.host": "$host", '
 '"request.agent": "$http_user_agent" }';
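With this format, each request is emitted as a single JSON document per line, which the Logstash json codec can ingest directly. An illustrative line (values made up, some fields elided):

```json
{ "@timestamp": "2015-01-25T18:24:49+08:00", "clientip": "203.0.113.7",
  "time_spent.total": 0.012, "response.status": 404, "response.size": 506,
  "request.path": "/meta/processed_dates.json", "request.verb": "GET",
  "request.host": "example.com", "request.agent": "Mozilla/5.0 (...)" }
```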

Example Configs

Logstash

input {
	lumberjack {
		port => 443
		ssl_certificate => "/opt/logstash/lumberjack.crt"
		ssl_key => "/opt/logstash/lumberjack.key"
		codec => "json"
	}
}

output {
	elasticsearch {
		host => "my-elasticsearch-cluster.example.com"
		cluster => "my-elasticsearch-cluster-name"
		protocol => "http"
	}
}

Now What?

All your events are in a distributed, easily queryable datastore

curl -XPOST http://my-elasticsearch-cluster.example.com/_search -d '{"query":{"bool":{"must":[{"term":{"http_access_logs.response.status":"404"}}],"must_not":[],"should":[]}},"from":0,"size":10,"sort":[],"facets":{}}'
{
	"took": 1357,
	"timed_out": false,
	"_shards": {
		"total": 4,
		"successful": 4,
		"failed": 0
	},
	"hits": {
		"total": 184229,
		"max_score": 1,
		"hits": [
			{
				"_index": "logstash-2015.01.25",
				"_type": "http_access_logs",
				"_id": "AUsfjnQ7mXbLjGbOXHyr",
				"_score": 1,
				"_source": {
					"@timestamp": "2015-01-25T05:20:10.182Z",
					"type": "http_access_logs",
					"timestamp": "2015-01-25T05:20:10.182641Z",
					"request": {
						"scheme": "http",
						"host": "some-home.domain.com",
						"port": 80,
						"path": "/request/blah/sdfsdf",
						"query": null,
						"fragment": null,
						"complete": "http://some-home.domain.com/request/blah/sdfsdf",
						"verb": "GET",
						"http_version":"HTTP/1.1",
						"size": 0
					},
					"response": {
						"status": 404,
						"size": 21
					},
					"time_spent": {
						"request_processing": 0.000049,
						"backend": 0.012,
						"response_processing": 0.000042,
						"total": 0.012505
					},
					"source": {
						"lob": "mad",
						"project": "my-project",
						"application": "my-application",
						"stack": "b",
						"tier": "elb"
					}
				}
			}
		]
	}
}

Well, kind of easy.

Kibana: Making it easier

  • 100% HTML + JS + CSS (version 3)
  • Talks to Elasticsearch directly
  • Beautiful graphs!
  • OpenSource

Demo Time!

http://URL-REDACTED/

What should you log?

Everything*

*Adding an asterisk, just because

Seriously

System logs

  • kernel boot logs
  • security logs
  • Linux subsystem logs

More

Application logs

  • Backend App logs
  • Especially app start-up
  • Also: Timing information

“If your logs don’t give you a really, really good clue what happened, they’re useless.”

– Me

Moar

Application Business logs

Try logging all business/domain relevant events

  • Users: create, update, delete, promote
  • Posts: create, reply, delete
  • Capture metrics!

MOAR!

  • Access logs
    • Unicorn/Whatever
    • Nginx
    • ELB/HA Proxy
    • Cloudfront/CDN

Also (really) important

  • Application auto-scaling events
  • Instance/Application boot up
  • Deployment events (deploy, switch, revert)

I might actually be insane

  • Browser side application logs?
  • Browser side user interaction events?
  • I’ve done it for fun. Why not?

Shifting mindset

“Logs aren’t files full of stacktraces you should never read. They’re an event stream that should be treated and processed like a first-class citizen.”

– Me

Tip: Apps are in control

Developers are aware of the business context. Let them log appropriately.

Tip: Typecast events

{"id":"12414"}

Nope! We’re shoving this into a database; numeric IDs should be logged as numbers, not strings. Treat your events with respect.
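In practice that means casting fields to their proper types before serialising. A sketch in plain Ruby (`typecast` and the field list are hypothetical, illustrating the idea):

```ruby
require 'json'

# Illustrative list of fields the datastore should index as integers.
INTEGER_FIELDS = %w[id response.status response.size].freeze

# Hypothetical helper: coerce known numeric fields so range queries
# (e.g. id > 12000) work in the datastore.
def typecast(event)
  event.merge(event.slice(*INTEGER_FIELDS).transform_values(&:to_i))
end

# "id" is emitted as the number 12414, not the string "12414".
puts typecast({ "id" => "12414", "msg" => "user created" }).to_json
```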

Tip: Clearly Label Event Sources

{
  "source": {
    "environment": "production",
    "project": "my-project",
    "application": "my-application",
    "tier": "nginx",
    "host": "ip-10-13.24.52",
    ...
  }
}
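One way to do this is to resolve a static source block once at boot and attach it to every event at emit time. A sketch (plain Ruby; the field values are deployment-specific assumptions):

```ruby
require 'json'
require 'socket'

# Static per-process metadata, resolved once at boot. Everything except
# the hostname would normally come from deploy-time configuration.
SOURCE = {
  "environment" => ENV.fetch("APP_ENV", "development"),
  "project"     => "my-project",
  "application" => "my-application",
  "tier"        => "app",
  "host"        => Socket.gethostname
}.freeze

# Hypothetical helper: stamp the source block onto an event.
def with_source(event)
  event.merge("source" => SOURCE)
end

puts with_source({ "msg" => "user created" }).to_json
```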

Tip: Re-use Event Type Information

Unicorn, Nginx, HAProxy … they all log:

request: {
	"verb": "GET",
	"path": "/somewhere/file.html",
	"http_version": "1.1",
	"host": "virtual-host.domain.com",
	"referer": "http://www.some-other-domain/page.html",
	"user-agent": "Mozilla Compatible (blah blah)"
},
response: {
	"status": 200,
	"size": 123123,
	"etc": "asdasda"
}

These apps all log (more or less) the same information. Categorise accordingly.

Tip: Include run-time information

Which user made this post?

{
  "board.id":123,
  "board.name":"Self-help",
  "sender.id":3462,
  "sender.username": "lmac",
  "post.subject": "Lol, wut?"
}

Tip: Include run-time information

To and from which email address was this email sent?

{
  "from.address":"bob@example.com",
  "to.address":"alice@example.com",
  "sender.username":"bob_awesome_saas",
  "message.id": "eeaa4f14-c720-4c9f-b2f8-b5b859e49e46",
  "message.subject": "password reminder",
  "message.status": "relayed"
}

Tip: Include run-time information

Which build number of this app is running?

{
  "source": {
    "project": "my-project",
    "application": "my-application",
    "build-number": 1234
  }
}

Tip: Include run-time information

Why was this decision taken?

{
  "entity": "email",
  "status": "marked-spam",
  "reason": "source ip not authorized by SPF record"
}

Tip: Send IDs upstream

Cloudfront: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46
Load Balancer: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46
Local Proxy: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46
Unicorn: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46
Database: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46

CloudFront -> ELB -> Nginx -> Unicorn -> ActiveRecord = 345ms
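At the application tier this is typically a one-screen middleware. A Rack-style sketch in plain Ruby (no gems required; the header name follows the example above):

```ruby
require 'securerandom'

# Rack-style middleware: ensure every request carries an X-Request-Id,
# generating one if no upstream proxy supplied it, and echo it on the
# response so every tier can log the same ID.
class RequestId
  def initialize(app)
    @app = app
  end

  def call(env)
    id = env['HTTP_X_REQUEST_ID'] || SecureRandom.uuid
    env['HTTP_X_REQUEST_ID'] = id
    status, headers, body = @app.call(env)
    [status, headers.merge('X-Request-Id' => id), body]
  end
end
```

The same ID then goes into every log event the request touches, which is what makes the end-to-end timing line above possible.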

Tip: Human readability

Even if you’re doing structured logging, always include a human-readable version in a field such as “message”.

Tip: Watch out for security

logger.debug(personal_information.to_json)
{
	"name": {"first":"JianMing","last":"Liu"},
	"birth": {"date":"8888/88/88", "location":"Somewhere"},
	"job": {"company":"ThoughtWorks", "salary":"888888888888888888888888888888"},
	"email": "ljm@thoughtworks.com",
	"address": {"city":"XiAn","state":"ShanXi","road":"JinYeYiLu","number":34},
	"credit_card": {"number":"88888888", "exp":"2020/12/12", "cvv":"123"},
	"security_questions": { "mothers_name": "Mary", "first_pet": "dog" }
}

Please, No.

This can get us in very serious trouble.
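One defence is a scrubber that strips known-sensitive keys before anything reaches the logger. A sketch (the blocklist is illustrative and must match your own schema):

```ruby
require 'json'

# Illustrative blocklist: drop these keys wherever they appear, recursively.
SENSITIVE_KEYS = %w[salary credit_card security_questions password].freeze

# Hypothetical helper: walk nested hashes/arrays, removing sensitive keys.
def scrub(value)
  case value
  when Hash
    value.reject { |k, _| SENSITIVE_KEYS.include?(k.to_s) }
         .transform_values { |v| scrub(v) }
  when Array
    value.map { |v| scrub(v) }
  else
    value
  end
end

personal_information = {
  "name"        => { "first" => "JianMing", "last" => "Liu" },
  "credit_card" => { "number" => "88888888" }
}
# Only the safe fields survive.
puts scrub(personal_information).to_json
```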

Tip: De-normalise data

You’ll need to de-normalise your data a lot.

Most event oriented databases are NOT relational.

Storage is cheap.

Homework

Pipe your app logs somewhere and get useful data.

As usual, I don’t really care how you do it.

Your App

Collection

rsyslog, files, directly over the wire, etc

Forwarding

rsyslog, Kinesis, Kafka, Lumberjack, Graylog, etc

Logging backends

Processing

Logstash, fluentd, rsyslog, scripts, Splunk, etc

Storage

Elasticsearch, InfluxDB, OpenTSDB, MongoDB, etc

The Point

Get Useful Data

Ideas

  • Collect: syslog
  • Transport: syslog or logstash-forwarder
  • Ingest: Logstash
  • Storage: Elasticsearch
  • Query: Kibana

How we’re doing it

  • Collect: syslog + docker logging plugins
  • Transport: Kinesis
  • Ingestion: Logstash
  • Storage: Elasticsearch + S3 archiving
  • Query: Kibana (3)

  • Getting ALL system logs
  • Slowly moving apps in parallel