Structured Logging

A Working Example

Created by Laurence J. MacGuire, a.k.a. Liu Jian Ming, a.k.a. 刘建明

ThoughtWorks Xi'An, 2015/05/20

Creative Commons License

The “5 W’s”

  • Who?
  • What?
  • Where?
  • When?
  • Why?
  • and how?

What are logs for?

  • Debugging
  • Auditing
  • Profiling

Distributed Logging

  • Horizontal vs Vertical scaling
ssh user@system-01.xyz.com tail -f /var/log/my-application.log
ssh user@system-02.xyz.com tail -f /var/log/my-application.log
ssh user@system-XX.xyz.com tail -f /var/log/my-application.log

Nope! Nope! Nope!

Centralised Access

“Distribute data and processing. Centralise access.”

– Me

  • Rsyslog
  • Graylog
  • Lumberjack
  • S3, SFTP, Whatever

File Based Logs

127.0.0.1 - - [01/Feb/2015:12:41:18 +0800] "GET /optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck HTTP/1.1" 404 506 "http://www.theguardian.com/us-news/2015/jan/31/detroit-aiyana-stanley-jones-police-officer-cleared" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/39.0.2171.65 Chrome/39.0.2171.65 Safari/537.36"
$> cat access.log | awk '{print $1}' | sort -n | uniq -c | sort -nr
  29877 93.xxx.xxx.xxx
  17538 80.xxx.xxx.xxx
   5895 198.xxx.xxx.xxx
   3042 37.xxx.xxx.xxx
   2956 208.xxx.xxx.xxx
   2613 94.xxx.xxx.xxx
   ...

Cool! But inaccessible.

It Still Fails

Some formats are really hard to parse.

Started GET "/meta/processed_dates.json" for 127.0.0.1 at 2015-01-25 18:24:49 +0800
Processing by Meta::ProcessedDatesController#all as JSON
  Rendered text template (0.0ms)
Completed 404 Not Found in 8.4ms (Views: 4.9ms | ActiveRecord: 0.5ms)

But there’s really cool information in there!

Better Logging Formats

host=8.8.8.8 level=info ip=86.86.86.86 msg="hello world" ...
Why not JSON?
{
  "host":"8.8.8.8",
  "level":"info",
  "ip":"86.86.86.86",
  "msg":"hello world"
}
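Emitting such events from application code is a one-liner per event. A minimal sketch in plain Ruby (`format_event` is a hypothetical helper, not a library API; field names mirror the example above):

```ruby
require 'json'
require 'time'

# Hypothetical helper: build one structured event as a single JSON line.
# @timestamp is added so downstream tooling (e.g. Logstash) can sort events.
def format_event(level, msg, fields = {})
  { "@timestamp" => Time.now.utc.iso8601,
    "level"      => level,
    "msg"        => msg }.merge(fields).to_json
end

# One event per line on stdout; a log shipper picks the lines up from there.
puts format_event("info", "hello world", "ip" => "86.86.86.86")
```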

Enter: Logstash

About logs. But not really about logs.

  • Written in Ruby/JVM
  • Actually, just a generic event processor
  • With logging specific tooling
  • OpenSource

Logstash Workflow

[Diagram: the Logstash workflow — inputs → filters → outputs]

Config - Inputs

Events go in, somehow

  • Files
  • TCP/IP
    • Rsyslog
    • Graylog
    • Lumberjack
    • Stomp
  • Write your own (it’s Ruby)

Config - Filters

Do stuff to them

  • Date: standardize formats
  • Grok: a large collection of reusable regular expressions
  • JSON: Parse JSON
  • GeoIP: IP to Geo information
  • Metrics: event statistics aggregation
  • Write your own (it’s Ruby)
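For example, a filter block that parses classic access-log lines with Grok and then normalises the timestamp with the date filter might look like this (a sketch against the Logstash 1.x config syntax):

```
filter {
  grok {
    # COMBINEDAPACHELOG is a stock pattern shipped with Logstash
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
  date {
    # Move the parsed access-log timestamp into @timestamp
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
```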

Config - Outputs

Put them somewhere

  • Elasticsearch
  • Email
  • S3
  • File
  • Graphite
  • Write your own (it’s Ruby)

Elasticsearch

More or less

  • JSON/HTTP based API
  • A transaction log
  • Distributed storage and retrieval
  • Lucene search engine
  • Good cloud support
  • OpenSource

Example Configs

Nginx

log_format logstash_json '{ "@timestamp": "$time_iso8601", '
 '"clientip": "$remote_addr", '
 '"time_spent.total": $request_time, '
 '"forwarded_for": "$http_x_forwarded_for", '
 '"response.cache_status": "$upstream_cache_status", '
 '"response.mime_type": "$sent_http_content_type", '
 '"response.status": $status, '
 '"response.size": $body_bytes_sent, '
 '"request.path": "$request_uri", '
 '"request.verb": "$request_method", '
 '"request.referrer": "$http_referer", '
 '"request.host": "$host", '
 '"request.agent": "$http_user_agent" }';
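With this format, each request is emitted as a single JSON document per line, which the Logstash json codec can ingest directly. An illustrative line (values made up, some fields elided):

```json
{ "@timestamp": "2015-01-25T18:24:49+08:00", "clientip": "203.0.113.7",
  "time_spent.total": 0.012, "response.status": 404, "response.size": 506,
  "request.path": "/meta/processed_dates.json", "request.verb": "GET",
  "request.host": "example.com", "request.agent": "Mozilla/5.0 (...)" }
```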

Example Configs

Logstash

input {
	lumberjack {
		port => 443
		ssl_certificate => "/opt/logstash/lumberjack.crt"
		ssl_key => "/opt/logstash/lumberjack.key"
		codec => "json"
	}
}

output {
	elasticsearch {
		host => "my-elasticsearch-cluster.example.com"
		cluster => "my-elasticsearch-cluster-name"
		protocol => "http"
	}
}

Now What?

All your events are in a distributed, easily queryable datastore

curl -XPOST http://my-elasticsearch-cluster.example.com/_search -d '{"query":{"bool":{"must":[{"term":{"http_access_logs.response.status":"404"}}],"must_not":[],"should":[]}},"from":0,"size":10,"sort":[],"facets":{}}'
{
	"took": 1357,
	"timed_out": false,
	"_shards": {
		"total": 4,
		"successful": 4,
		"failed": 0
	},
	"hits": {
		"total": 184229,
		"max_score": 1,
		"hits": [
			{
				"_index": "logstash-2015.01.25",
				"_type": "http_access_logs",
				"_id": "AUsfjnQ7mXbLjGbOXHyr",
				"_score": 1,
				"_source": {
					"@timestamp": "2015-01-25T05:20:10.182Z",
					"type": "http_access_logs",
					"timestamp": "2015-01-25T05:20:10.182641Z",
					"request": {
						"scheme": "http",
						"host": "some-home.domain.com",
						"port": 80,
						"path": "/request/blah/sdfsdf",
						"query": null,
						"fragment": null,
						"complete": "http://some-home.domain.com/request/blah/sdfsdf",
						"verb": "GET",
						"http_version":"HTTP/1.1",
						"size": 0
					},
					"response": {
						"status": 404,
						"size": 21
					},
					"time_spent": {
						"request_processing": 0.000049,
						"backend": 0.012,
						"response_processing": 0.000042,
						"total": 0.012505
					},
					"source": {
						"lob": "mad",
						"project": "my-project",
						"application": "my-application",
						"stack": "b",
						"tier": "elb"
					}
				}
			}
		]
	}
}

Well, kind of easy.

Kibana: Making it easier

  • 100% HTML + JS + CSS (version 3)
  • Talks to Elasticsearch directly
  • Beautiful graphs!
  • OpenSource

Demo Time!

http://URL-REDACTED/

What should you log?

Everything*

*Adding an asterisk, just because

Seriously

System logs

  • kernel boot logs
  • security logs
  • Linux subsystem logs

More

Application logs

  • Backend App logs
  • Especially app start-up
  • Also: Timing information

“If your logs don’t give you a really, really good clue what happened, they’re useless.”

– Me

Moar

Application Business logs

Try logging all business/domain relevant events

  • Users: create, update, delete, promote
  • Posts: create, reply, delete
  • Capture metrics!

MOAR!

  • Access logs
    • Unicorn/Whatever
    • Nginx
    • ELB/HA Proxy
    • Cloudfront/CDN

Also (really) important

  • Application auto-scaling events
  • Instance/Application boot up
  • Deployment events (deploy, switch, revert)

I might actually be insane

  • Browser side application logs?
  • Browser side user interaction events?
  • I’ve done it for fun. Why not?

Shifting mindset

“Logs aren’t files full of stacktraces you should never read. They’re an event stream that should be treated and processed like a first-class citizen.”

– Me

Tip: Apps are in control

Developers are aware of the business context. Let them log appropriately.

Tip: Typecast events

{"id":"12414"}

Nope! We’re shoving this into a database; numeric IDs should be logged as numbers, not strings. Treat your events with respect.
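In practice that means casting fields to their proper types before serialising. A sketch in plain Ruby (`typecast` and the field list are hypothetical, illustrating the idea):

```ruby
require 'json'

# Illustrative list of fields the datastore should index as integers.
INTEGER_FIELDS = %w[id response.status response.size].freeze

# Hypothetical helper: coerce known numeric fields so range queries
# (e.g. id > 12000) work in the datastore.
def typecast(event)
  event.merge(event.slice(*INTEGER_FIELDS).transform_values(&:to_i))
end

# "id" is emitted as the number 12414, not the string "12414".
puts typecast({ "id" => "12414", "msg" => "user created" }).to_json
```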

Tip: Clearly Label Event Sources

{
  "source": {
    "environment": "production",
    "project": "my-project",
    "application": "my-application",
    "tier": "nginx",
    "host": "ip-10-13.24.52",
    ...
  }
}
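One way to do this is to resolve a static source block once at boot and attach it to every event at emit time. A sketch (plain Ruby; the field values are deployment-specific assumptions):

```ruby
require 'json'
require 'socket'

# Static per-process metadata, resolved once at boot. Everything except
# the hostname would normally come from deploy-time configuration.
SOURCE = {
  "environment" => ENV.fetch("APP_ENV", "development"),
  "project"     => "my-project",
  "application" => "my-application",
  "tier"        => "app",
  "host"        => Socket.gethostname
}.freeze

# Hypothetical helper: stamp the source block onto an event.
def with_source(event)
  event.merge("source" => SOURCE)
end

puts with_source({ "msg" => "user created" }).to_json
```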

Tip: Re-use Event Type Information

Unicorn, Nginx, HAProxy … they all log:

request: {
	"verb": "GET",
	"path": "/somewhere/file.html",
	"http_version": "1.1",
	"host": "virtual-host.domain.com",
	"referer": "http://www.some-other-domain/page.html",
	"user-agent": "Mozilla Compatible (blah blah)"
},
response: {
	"status": 200,
	"size": 123123,
	"etc": "asdasda"
}

These apps all log (more or less) the same information. Categorise accordingly.

Tip: Include run-time information

Which user made this post?

{
  "board.id":123,
  "board.name":"Self-help",
  "sender.id":3462,
  "sender.username": "lmac",
  "post.subject": "Lol, wut?"
}

Tip: Include run-time information

To and from which email address was this email sent?

{
  "from.address":"bob@example.com",
  "to.address":"alice@example.com",
  "sender.username":"bob_awesome_saas",
  "message.id": "eeaa4f14-c720-4c9f-b2f8-b5b859e49e46",
  "message.subject": "password reminder",
  "message.status": "relayed"
}

Tip: Include run-time information

Which build number of this app is running?

{
  "source": {
    "project": "my-project",
    "application": "my-application",
    "build-number": 1234
  }
}

Tip: Include run-time information

Why was this decision taken?

{
  "entity": "email",
  "status": "marked-spam",
  "reason": "source ip not authorized by SPF record"
}

Tip: Send IDs upstream

Cloudfront: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46
Load Balancer: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46
Local Proxy: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46
Unicorn: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46
Database: X-Request-Id: eeaa4f14-c720-4c9f-b2f8-b5b859e49e46

CloudFront -> ELB -> Nginx -> Unicorn -> ActiveRecord = 345ms
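At the application tier this is typically a one-screen middleware. A Rack-style sketch in plain Ruby (no gems required; the header name follows the example above):

```ruby
require 'securerandom'

# Rack-style middleware: ensure every request carries an X-Request-Id,
# generating one if no upstream proxy supplied it, and echo it on the
# response so every tier can log the same ID.
class RequestId
  def initialize(app)
    @app = app
  end

  def call(env)
    id = env['HTTP_X_REQUEST_ID'] || SecureRandom.uuid
    env['HTTP_X_REQUEST_ID'] = id
    status, headers, body = @app.call(env)
    [status, headers.merge('X-Request-Id' => id), body]
  end
end
```

The same ID then goes into every log event the request touches, which is what makes the end-to-end timing line above possible.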

Tip: Human readability

Even if you’re doing structured logging, always include a human-readable version in a field such as “message”.

Tip: Watch out for security

logger.debug(personal_information.to_json)
{
	"name": {"first":"JianMing","last":"Liu"},
	"birth": {"date":"8888/88/88", "location":"Somewhere"},
	"job": {"company":"ThoughtWorks", "salary":"888888888888888888888888888888"},
	"email": "ljm@thoughtworks.com",
	"address": {"city":"XiAn","state":"ShanXi","road":"JinYeYiLu","number":34},
	"credit_card": {"number":"88888888", "exp":"2020/12/12", "cvv":"123"},
	"security_questions": { "mothers_name": "Mary", "first_pet": "dog" }
}

Please, No.

This can get us in very serious trouble.
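One defence is a scrubber that strips known-sensitive keys before anything reaches the logger. A sketch (the blocklist is illustrative and must match your own schema):

```ruby
require 'json'

# Illustrative blocklist: drop these keys wherever they appear, recursively.
SENSITIVE_KEYS = %w[salary credit_card security_questions password].freeze

# Hypothetical helper: walk nested hashes/arrays, removing sensitive keys.
def scrub(value)
  case value
  when Hash
    value.reject { |k, _| SENSITIVE_KEYS.include?(k.to_s) }
         .transform_values { |v| scrub(v) }
  when Array
    value.map { |v| scrub(v) }
  else
    value
  end
end

personal_information = {
  "name"        => { "first" => "JianMing", "last" => "Liu" },
  "credit_card" => { "number" => "88888888" }
}
# Only the safe fields survive.
puts scrub(personal_information).to_json
```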

Tip: De-normalise data

You’ll need to de-normalise your data a lot.

Most event oriented databases are NOT relational.

Storage is cheap.

Homework

Pipe your app logs somewhere and get useful data.

As usual, I don’t really care how you do it.

Your App

Collection

rsyslog, files, directly over the wire, etc

Forwarding

rsyslog, Kinesis, Kafka, Lumberjack, Graylog, etc

Logging backends

Processing

Logstash, fluentd, rsyslog, scripts, Splunk, etc

Storage

Elasticsearch, InfluxDB, OpenTSDB, MongoDB, etc

The Point

Get Useful Data

Ideas

  • Collect: syslog
  • Transport: syslog or logstash-forwarder
  • Ingest: Logstash
  • Storage: Elasticsearch
  • Query: Kibana

How we’re doing it

  • Collect: syslog + docker logging plugins
  • Transport: Kinesis
  • Ingestion: Logstash
  • Storage: Elasticsearch + S3 archiving
  • Query: Kibana (3)

  • Getting ALL system logs
  • Slowly moving apps in parallel