Nov 9th, 2016

Notes from my journey at Velocity conf 2016 (Europe)

conference, velocity, Amsterdam, people

I had the chance to attend Velocity conf 2016 in Amsterdam thanks to my current employer Trainline (formerly Capitaine Train). This article is a minimal transcript of the notes I took there during the talk sessions.

Big venue, big conference: better than expected

First of all, this was my first "sponsored big tech conference" experience. I personally like independent ones - which usually try to be as inclusive as possible - but I have to say I was really happy to see the diverse lineup of speakers at Velocity.

The venue in the south of Amsterdam was really impressive (even though some attendees told me it was 4 times smaller than the same conf in California). I have to say it is a good way to share talks and experiences with many different attendees.

Even though I know it takes a lot of planning and a lot of money to organize, I felt bad that this conference has such a high entry price. I wish it were more accessible and that people without a company behind them could afford to attend such a conference. This is why I decided to publish my notes here. Beware though: they are my personal notes, so they might not be as clear as I wish they were. Feel free to contact me if anything is unclear. The videos of the talks should be published in a couple of weeks if you want to watch them.

Talks

I especially appreciated the talks of Avleen Vig talking about burnout in our industry, Adam Surak about server reliability, George Sudarkoff about ops teams, Astrid Atkinson about building great teams, Guy Podjarny about open source security, Amanda Folson about how oncall sucks and finally (clearly not least) Mathias Meyer about distributed teams.

Enough of my thoughts, let's jump into the notes.

First day

Steven Shorrock

@stevenshorrock

In the context of your work, there are usually two different visions of the actual work:

There is usually a small overlap but the operating zone is pretty different.

E.g. with aircraft pilots: when ECAM (a monitoring tool) fails, pilots have judgment. Yes, machines can process data quicker than us, but they can't exercise judgment.

E.g. 2: NHS UK "target-driven": nurses in hospitals had 4 hours to have a patient sent home. Nurses found a way to bypass this "imagined work" by registering a patient in different zones when needed.

"Targets always result in gaming" Simon Caulkin

Solutions?

Yoav Weiss

Akamai

@yoavweiss

W3C Web Performance Working Group (browser APIs)

What's a WG?

What have they been doing? APIs to measure. You can't improve what you can't measure. E.g. performance data from users.

Hints

Next?

How to influence the WG?

James Duncan

UK Government Digital Service

@jamesaduncan

"Any sufficiently advanced technology is indistinguishable from magic"

Jason Grigsby

Cloud Four

@grigs

Building progressive webapps

He is so happy about PWAs. Go out and build PWAs now.

Avleen Vig

Etsy

@avleen

slides

Burnout in Engineering

Occupational burnout. What is it not?

So what is it? 3 concurrent symptoms (all together):

90% of occupational burnout → Similar to clinical burnout

What should we take a look at?

Easy solution:

Common steps towards burnout:

  1. Compulsion to prove yourself. Stay late one night a week, then two, then three.. Then start to ignore priorities.
  2. Working harder
  3. Ignoring priorities
  4. Start to blame others
  5. There can be no failure
  6. Start blaming your workload
  7. Start dreaming about your job (already a few months in)
  8. Others start to notice some changes in your behavior. But you don't believe them
  9. You can't get any work done
  10. Emotional emptiness (depression starts here, and professional help is sometimes needed at this point)

Personal burnout of Avleen: (June 2013)

First vacation in 5 years!

It can take months to recover, but it was a start.

Identify areas that are disconnected and how to reconnect. Rebuild self-confidence.

Your greatest asset:

Adam Surak

Algolia

@adamsurak

slides

Algolia: 15 regions (50 physical data centers)

Basic principle, new parallel redundancy:

98% reliable

--| 99% |---| 99% |--

99.99% reliable

   .- | 99% | -.
---             ---
   '- | 99% | -'
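
The arithmetic behind the two diagrams above can be sketched as follows. This is a minimal sketch that assumes failures are independent: components in series multiply their availabilities, while a redundant (parallel) setup is only down when every branch is down at once. The function names are mine, not from the talk.

```ruby
# Availability of components chained in series: every one of them must be up.
def series_availability(availabilities)
  availabilities.reduce(1.0) { |acc, a| acc * a }
end

# Availability of redundant components in parallel: the system is down only
# when all of them are down at the same time (assumes independent failures).
def parallel_availability(availabilities)
  1.0 - availabilities.reduce(1.0) { |acc, a| acc * (1.0 - a) }
end

puts format('series:   %.2f%%', series_availability([0.99, 0.99]) * 100)   # 98.01%
puts format('parallel: %.2f%%', parallel_availability([0.99, 0.99]) * 100) # 99.99%
```

In practice failures are rarely fully independent (shared power, network, provider), which is what the "underestimated dependencies" part of the talk is about.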

SLA (basically up time)

per year
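
To put SLA percentages into yearly terms, here is a small sketch that converts an uptime percentage into the downtime it allows per year. It assumes a 365-day year; the helper names are hypothetical and the list of percentages is only illustrative.

```ruby
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Downtime (in seconds) that an uptime SLA tolerates over one year.
def allowed_downtime_seconds(sla_percent)
  SECONDS_PER_YEAR * (1.0 - sla_percent / 100.0)
end

# Render a duration in seconds as hours, minutes and seconds.
def humanize(seconds)
  hours, rest = seconds.round.divmod(3600)
  minutes, secs = rest.divmod(60)
  format('%dh %02dm %02ds', hours, minutes, secs)
end

[99.0, 99.9, 99.99, 99.999].each do |sla|
  puts format('%-8s %s', "#{sla}%", humanize(allowed_downtime_seconds(sla)))
end
```

Each extra "nine" divides the yearly downtime budget by ten: roughly 87.6 hours at 99%, under 9 hours at 99.9%, under an hour at 99.99%.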

SLA tricks from marketing

You need independent monitoring. It might make sense to build it yourself in certain cases (Algolia wanted a 10s frequency, which is very expensive with an external service, so they did it themselves).

Who can safely reboot one machine at any time without impacting customers? A rack at any time? A data center at any time?

Underestimated dependencies

A/C Power

Network maintenance issues

AWS outages

GCP outages

Azure has outages too

Salesforce outages

You name it… Whatever your provider (cloud, self-hosted, VMs or bare metal), you will have outages or unexpected performance. What can you do about it?

You can deploy multi-cloud!

Other outages examples:

→ TCP proxy becomes your friend

DNS… Default timeouts are usually high in libs.

Have two DNS providers, as easy as that.
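
As an illustration of short timeouts plus a second provider, here is a minimal sketch of client-side fallback. It is not what Algolia runs: the function name, the per-try deadline and the wiring shown in the comments are my assumptions. Only `Resolv::DNS`, its `timeouts=` setter and `Timeout.timeout` come from Ruby's standard library.

```ruby
require 'resolv'
require 'timeout'

# Try each resolver in order, giving each one a short deadline.
# `resolvers` is a list of callables taking a hostname and returning an address.
def resolve_with_fallback(hostname, resolvers, per_try_timeout: 1.0)
  resolvers.each do |resolver|
    begin
      return Timeout.timeout(per_try_timeout) { resolver.call(hostname) }
    rescue Timeout::Error, Resolv::ResolvError
      next # this provider is slow or failing: move on to the next one
    end
  end
  nil # every provider failed
end

# Hypothetical wiring with two providers (addresses are documentation ranges):
#   primary   = Resolv::DNS.new(nameserver: ['192.0.2.53'])
#   secondary = Resolv::DNS.new(nameserver: ['198.51.100.53'])
#   [primary, secondary].each { |dns| dns.timeouts = 1 }
#   lookups = [primary, secondary].map { |dns| ->(h) { dns.getaddress(h).to_s } }
#   resolve_with_fallback('example.org', lookups)
```

The point of the low per-try deadline is that a hanging provider costs one second rather than whatever the library default happens to be.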

Software design

Software ops

Chaos monkey at DNS level: try to iptables-reject a small range of ports from DNS… You might make some people mad with that.

What about your team, the people?

Radu and Rafal

Sematext

Shard your indices. Day by day is a good example. (Graylog does it automatically per amount of docs btw)
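
A day-by-day scheme usually comes down to putting the date in the index name. The sketch below assumes a `prefix-YYYY.MM.DD` naming pattern (the convention Logstash uses by default); the `logs` prefix and both helper names are hypothetical.

```ruby
# Name of the daily index a document with this timestamp belongs to (UTC day).
def daily_index(prefix, time)
  "#{prefix}-#{time.getutc.strftime('%Y.%m.%d')}"
end

# Daily indices older than `days` days, e.g. candidates to move or drop.
# Works by plain string comparison because the date part sorts lexicographically.
def indices_older_than(index_names, prefix, days, now)
  cutoff = daily_index(prefix, now - days * 86_400)
  index_names.select { |name| name.start_with?("#{prefix}-") && name < cutoff }
end

puts daily_index('logs', Time.utc(2016, 11, 9, 23, 59)) # logs-2016.11.09
```

With one index per day, retention becomes a matter of selecting whole indices by name instead of deleting documents inside a big index.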

Hot/Cold architecture

Keep indices balanced

In AWS

UDP → increase network buffers on destination

ES has built-in metrics about time to flush, time to search, time to fetch indices, etc. If you want to tweak your ES cluster, you need to monitor these metrics.

George Sudarkoff

SurveyMonkey

@sudarkoff

Distributed Ops analogy with unpredictable military operation.

Give control to skilled workers.

Distribute responsibility within teams. Give independence (technological, processes) to each team. Anyone should be able to build their service as they want (Go, Python, C, Ruby..)

This is great but you need to standardize to keep it understandable. And to help knowledge sharing and team mixing.

What to standardize

Automation?

Automation

Automation!

Closing

Björn Rabenstein

SoundCloud

beorn

slides

Kubernetes + Prometheus = ♥

Both hosted by the CNCF (Cloud Native Computing Foundation).

Both projects are quite independent.

GIFEE: Google Infrastructure for Everyone Else.

Google production systems pyramid /\ :

→ Monitoring is important in your infra.

Monitor all the things

Now you add between App and Host

TL;DR too many tools.

All this is replaced by Prometheus with a Grafana UI (and shippers: node exporter for hosts, SNMP exporter, cAdvisor for containers, built into the kubelet)

Labels from Kubernetes (env, app, zone…) vs classic hierarchy
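
To make the contrast concrete, here is a small sketch that renders the same measurement both ways: as a label-based sample in the Prometheus text exposition format, and as a dotted hierarchical path of the kind older tools use. The metric name, label values and helper names are illustrative assumptions, not taken from the talk.

```ruby
# Escape a label value as the Prometheus text format expects
# (backslash, double quote and newline are escaped).
def escape_label_value(value)
  value.to_s.gsub('\\') { '\\\\' }.gsub('"') { '\\"' }.gsub("\n") { '\\n' }
end

# One sample line: metric_name{label="value",...} value
def prometheus_sample(name, labels, value)
  rendered = labels.sort_by { |key, _| key.to_s }
                   .map { |key, val| %(#{key}="#{escape_label_value(val)}") }
                   .join(',')
  "#{name}{#{rendered}} #{value}"
end

# Classic hierarchy: the dimensions are baked into the metric path, in a fixed order.
def hierarchical_path(parts, name)
  (parts + [name]).join('.')
end

labels = { env: 'prod', app: 'api', zone: 'eu-west-1a' }
puts prometheus_sample('http_requests_total', labels, 1027)
puts hierarchical_path(%w[prod eu-west-1a api], 'http_requests_total')
```

With labels, a query can slice by any dimension (all zones for one app, all apps in one env) without depending on the order in which the hierarchy was laid out.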

Service discovery

Prometheus talks to all of them. Poll based system.

Prometheus is pretty opinionated. Tells you quite basic things, histograms, "best practices"

App rides with Sidecars (ops detail)

Walking to catch a bus to RAI conference center

Second day

Astrid Atkinson

Google

"Come with me if you want to live." Terminator

Are people born for coding? How to make a great team? Can't just try to find individual rock stars.. What if greatness was everywhere?

"Never doubt that a small group of thoughtful, committed citizens can change the world." Margaret Mead

2004-2012: the production team was not that big at Google. Engineering team inside ops separated in 3 shards. Very high pressure. Need to observe systems. Listen to suggestions.

Helping to make people amazing and build "rock star" teams

TL;DR Look for people that can challenge you. Let people speak up. If you want an extraordinary team, be kind.

Eric McNulty

How to think differently abt leadership. "Artful leadership"

Artful is not about Art - as good as it can be -, it's about

E.g. how people dealt with a horrible event: the 2013 Boston Marathon bombing.

That's abt resilience

What went so well? Everything started 10 yrs earlier: preparation and planning, because it was close to 9/11. Live exercises where we test people, protocols and technology. Work with the scenario.

→ Creating high trust env.

TL;DR Make a difference in leadership with adaptive capacity, resilience and trust. People want to stay when they feel they can make a difference. Hope you use your superpower for good.

Phil Stanhope

Dyn

A Zombie Apocalypse. Talk ironically presented 5 days before the Dyn DDoS attack of 21/10/2016.

Masked TCP/UDP traffic over port 53.

Already 10 yrs ago it was easy to get a server connected to the network and start hacking around. Now with more and more devices it gets even easier.

Scrubbing Tech (cleaning IPs you don't want). BGP hijack.

IOT → more problems. E.g. Lightbulbs becoming a router.

Human infection vs Machine infection

Botnets have a "speed of light" infection rate. More and more infected hosts. Queue delaying.

How to survive this problem?

OSI layers

Are we ready for 2 million rq/sec?

Fish in a barrel (you can shoot them, they always come back). From a sample of 3K IPs (out of 4 million unique A records in dynamic DNS data), 10% of the devices were impacted. Coming from XiongMai tech DVRs and cameras. E.g. AvTech IP cameras.

What to really do?

TL;DR Dealing with the incident in real time.

Guy Podjarny

Snyk.io

@guypod

Who owns open source security?

Background of Guy,

Heartbleed

→ "Open source sec done right" corp discovery, corp patches, responsible disclosure to OSS project.

However there are still open questions:

A Linux kernel vuln takes an avg of ~5 yrs to be discovered

Shortly after: ShellShock

Why is Heartbleed such a big deal? Because open source is everywhere (especially OpenSSL).

Estimate 25% of the (HTTPS) web after heartbleed:

93% patched, but only 13% of servers swapped their cert (/!\ University of Maryland estimate). 74% of Fortune 2000 companies didn't patch all servers. Constant stream of active exploits. A lot of publicity for Heartbleed.

ImageTragick

Timeline (2016)

Months without a full fix! And it's an extremely severe vuln.

Again lots of open questions. (Even if it was handled properly tatata because of the process)

A community building a lib without any money (in the case of ImageMagick): why should they fix this?

Similar: Marked

59% of reported vul in maven packages remain unfixed (old stat from ~2yrs ago). Mean time to fix the 41%: 390 days, CVSS 10 vulns: 265 days

Hawk: request validation package (npm). Built within "hapi", the second most popular JS web framework.

Attackers

Can develop exploit locally. One vuln, MANY VICTIMS. Human component on top of all that. Often slow to fix & slower to update

TL;DR

→ Well NO, we are not. BUT it IS a big issue. As a community we NEED to help fix this.

If security matters, don't use open source then?

Core challenges then?

Who should help?

Org backed vs independent

Free for OSS. OSS projects can't pay, so be kind and help them

Github specific wish list

Who owns OSS sec? ALL of us.

TL;DR OSS security is really hard. And also really bad because attackers love it. We as a community need to try to make it better. More process, more awareness, more involvement in sec by everyone.

Q/A

Q. How do we as a consumer fix this?

A. No silver bullet. But accept your part of the responsibility. There will be overlap. It's not realistic to complain abt authors, for example. Their role is more to provide a way to share and discuss security issues more easily.

Mandi Walls

Chef

@lnxchk

slides

Building sec into your workflow w/InSpec

Chef

Their products

How do people do security, especially on today's continuous delivery platforms?

Mentality of security reviews as a barrier to production. "No we can't do that", "No not this protocol", "Nope"..

→ Chef wants to speed up this process.

We actually have a communication problem

We wanted this common language between these 3 fields being:

InSpec!

E.g. ssh

Remediation

Lifecycle

v1.0 of InSpec. Find it at inspec.io, open source, "spec".

Describe a resource. E.g.

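control "sshd-protocol" do  # control ID is illustrative
  impact 1.0

  title "SSH version 2"

  desc <<-TEXT
       blabla compliance text
  TEXT

  # The resource under test goes in a describe block inside the control
  describe sshd_config do
    its("Protocol") { should cmp 2 }
  end
end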

Resources

Run it

Test any target. inspec exec test.rb

Failures

Test kitchen

Profiles

Extending InSpec

Over Time (hopefully)

We hope people that have the knowledge with compliance can contribute more thanks to that. Faster, more participation and better outcomes. Approachable by more people.

More info?

TL;DR Presenting Chef's new tool "InSpec". Trying to fill the gap between security compliance papers and systems. By defining "specs" on specific resources of our systems.

Q/A

Q. Where should InSpec be used? More in the dev part or infra?

A. We would like people to include these tests wherever they need them. E.g. a normal web app can use profiles that are more app related. Some profiles are more machine related, so they can be part of your Chef rules. So it's geared towards infrastructure, but it needs to be integrated into the build process.

Q. Does InSpec help with external services, such as AWS LBs for instance?

A. Thinking about it but only on design for now. So for now no.

Cynthia Mai

Amazon

dev @ amazon (UI team for amazon.com retail websites)

How we applied best practices and the challenges we had implementing them. Best practices are usually engineering paradigms that have been proven to be the best way to do something.

Over 300 million active customers

Distributed development ecosystem - example on a single product page on amazon → 500+ features

In reality best practices are not always best fit for you. You usually need to adapt/customize.

  1. Preloading
     * search result → product detail.
     * 10 yrs ago: <iframe>s.
     * Custom Amazon preload JS utility: preload('http://whateverassets')
     * But cross-browser compat of course… Worst case, no CSS at all. Best case, no preloading.
     * E.g. they had issues with the Firefox CORS caching system.
     * Custom JS utility → common support in browsers. E.g. rel="prefetch".

     → The standard way may not work at huge scale. Their case: 500 features in one page, each one wanting to preload things… → Back to a custom solution.

  2. Responsive. One DOM: how does it work at large scale? They had three teams: desktop codebase, webapp codebase and mobile native apps.
     * Inconsistent UX
     * Long launch times
     * Duplicate work
     * Supporting new devices
     * Long release cycle for mobile → slower innovation.

     They moved to one code base, one UX, different layouts → media queries. Challenges:
     * unused bytes down the wire (lots of media queries)
     * big maintenance cost
     * user experience not as good as expected.

     Thus they built a custom framework to adapt the UX to each device:
     * optimized views per device
     * shared backend + business logic code
     * CSS + device-specific CSS + segments by browser

     Balance between best practices and resilience. Iterate to find balance.

  3. Sprites (perf vs perceived perf): one image with all image assets.
     * Country-specific sprites
     * E.g. their Indian sprite without some images
     * They got a better full page load time BUT the Above the Fold (ATF) time didn't change
     * Improvement not visible to the users
     * Not worth the extra dev time for these specific sprites.

TL;DR Prioritize better perceived perf by following best practices, but tweak them for your use case.

Amanda Folson

GitLab

@ambassadorAwsum

"On-calliday: Un-sucking your on-call experience"

Why oncall sucks?

It can be better!

Intro about burnout, basically what was said by Avleen Vig yesterday.

You need preparation, basics (chargers, cables, small snacks) to reduce stress.

Forgetting you're on shift

How to make it easier from employers?

Vacation, track them.

Select the right shift length. Common lengths: 8 hours, 12 hours, 24 hours and 1 week

So what should I use?

E.g. She likes:

Iterate on that; scheduling overrides have to be done.

Mature monitoring is the key. Make your data count! Do postmortems, notice patterns and FIX IT when you find a strange pattern.

Tooling matters

Keep staff on their toes.

Incident response. Incident commander: see blackrock3.com for details. Note taking. Record what's going on.

You write code, you wear it.

TL;DR Oncall sucks. Have an adaptive process that fits with your monitoring and your team.

Q. Should you give incentive to people on call?

A. Depends on the company, but I believe you should. It's important: if your on-call people miss a dinner or family time, you should give incentives.

Q. How to deal with devs that don't know that much abt ops?

A. I've already seen shadowing for that (an ops guy shadowing a dev). DOCUMENTATION is important. Or simply TRAIN.

Q. Docs, postmortems. Do you have recommendations?

A. @gitlab we use GitLab issues. Some companies use internal wikis. I personally prefer wikis.

Q. Thoughts on small teams? 1-2 people w/skill and knowledge of oncall

A. Either train someone else. If you can't, take a look at the latest oncall issues and do something abt it. Look at your patterns (hm hm someone said disk space? :D)

Mathias Meyer

CEO of Travis CI

@roidrage

Used to be dev @Travis, now CEO.

Business rooted in Germany. HQ in Berlin. He now lives there.

Focus in this talk on culture differences. They are at the core of what a distributed team is. We've noticed big differences between US and EU.

History of Travis-CI company: distributed build server for the Ruby community. "Bob the builder", bob has a tractor called "Travis". Started at 5. Now we are 40 builders. Across 9 time zones.

Time zones

Occasionally we experience this: NZST with HAST (23h diff). That's difficult to handle within a company.

People liked to come to Berlin to work with us. But it's not the best thing for our business because we have a large customer base in the US. So we pushed for remote AND for diversity.

10 native languages. 16 Home Countries.

<include> <div></div> </include> quoting Jan Lehnardt

52% remote now, even though we have an office in Berlin. Not everyone wants to come into the office.

52% women in engineering. But it doesn't mean the "diversity" label was achieved. We still want diversity.

Challenges: COMMUNICATION

Deploy via slack command - so anyone can deploy. /deploy

Infra Heroku based.

Handled in #incident-response slack channel

We reworked our oncall to take time zones into account. They have 12-hour shifts. All devs are oncall. No real ops team. :clap:

Blameless culture.

Github all the things.

Everyone does customer support

Book allowance. You can buy any book you like.

Office issues. Someone needs to water the plants.

Everything in github issues. It creates visibility.

→ ⚠ But it can make a mess

Distributed teams need to accept a TAX on communication.

INCLUSIVE Communication

"less oral com means more accidental documentation"

Travis CI Team

Can lead to multiple github projects, issues, wikis… We had this pb. So we started a "Builder"'s manual.

Distributed teams need CENTRALIZED documentation

YAGR yet another github repo: how-we-work

Inclusive decision making

Having these inclusive and open discussions creates a CI for your company culture.

E.g. :freedom: (a US flag waving emoji added by me in Slack)

One of the builders turned :freedom: into a closed black fist.

We added gender pronouns to profiles.

Empower everyone

With time we renamed #panic to #incident-response.

Scaling distributed culture requires DOCUMENTATION. Why a certain thing is there?

Distributed Facetime

Friday lightning talks

Company hangouts occasionally. Cultural diversity.

"Distributed teams increase the chances of diversity due to having people from diff cultures and backgrounds"

Travis CI team

German, say no. No means no. "That was an 'ok' talk but we need to review 3 things".

US, much more enthusiastic abt things. \o/ awesome.

from US: great, good, needs improvement

from German: great, good, needs improvement, okay, no.

Tone and nuance get lost in writing

"Never attribute to malice or stupidity that which is adequately explained by a missing perspective"

Travis CI Team

English is ambiguous

It's ok to ask questions

Work culture

Care about the humans.

"Sending people offline for a vacation is like Chaos Monkey for orgs"

Courtney Nash

Salaries and Cost of Living

We are not remote-first.

Team lunch once a week.

Your Own Mileage May Vary. We do things like that, but we are constantly changing.

Thanking his team. Thanking audience.