Introduction to DevOps

Thinking DevOps

General guidance

User stories: Persona; Action; Benefit
Avoid monorepos
Create a new branch for every issue
Work in small batches
A minimum viable product is an experiment
Consider test-driven development
CI/CD requires automated testing
Consider behaviour-driven development
- BDD ensures that you're building the right thing; TDD ensures that you're building the thing right
- Consider Gherkin syntax
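
As a rough illustration of how the two ideas combine, the sketch below writes a unit test in a Given/When/Then shape using Python's standard `unittest` module. The `apply_discount` function, the `SAVE10` code, and the 10% figure are hypothetical and exist only for this example; in a test-first workflow the test class would be written before the function.

```python
import unittest


def apply_discount(total, code):
    """Hypothetical function under test: 10% off for the code 'SAVE10'."""
    if code == "SAVE10":
        return round(total * 0.9, 2)
    return total


class ApplyDiscountTest(unittest.TestCase):
    def test_valid_code_reduces_total(self):
        # Given a basket worth 50.00
        total = 50.00
        # When the shopper applies a valid discount code
        discounted = apply_discount(total, "SAVE10")
        # Then the total is reduced by 10%
        self.assertEqual(discounted, 45.00)

    def test_unknown_code_leaves_total_unchanged(self):
        # Given the same basket, when the code is not recognised,
        # then the total is unchanged
        self.assertEqual(apply_discount(50.00, "BOGUS"), 50.00)
```

Saved in a file, the tests can be run with `python -m unittest`, which is the kind of command a CI system would invoke on every push.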

Cloud-native microservices

"Stateless" microservices still have state: the running instances hold none themselves, but each service maintains its own database
The microservices can be scaled independently
Failing instances are killed and respawned

Designing for failure

Plan to be throttled
Plan to retry (with exponential backoff)
Cache where appropriate
Circuit-breaker pattern
- If failure rates pass a threshold, the circuit breaker is tripped from closed to open and further requests to the service are blocked, preventing further strain
- After a timeout, the circuit breaker moves to half open, allowing a small number of requests through
- It will then transition to either closed or open depending on whether these requests succeed
Bulkhead pattern
- Segment the application into isolated components (bulkheads) so that it remains functional even if a component fails
- Can increase complexity
Chaos engineering (or monkey testing)
- Deliberately kill services to check for robustness
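
The circuit-breaker transitions listed above (closed, open, half open) can be sketched as a small state machine. This is a minimal sketch under simplifying assumptions, not a production implementation: it counts consecutive failures rather than measuring a failure rate, it lets a single trial request through in the half-open state, and the class name, parameter names, and default values are all made up for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: block the request without touching the service.
                raise CircuitOpenError("circuit is open; request blocked")
            # Timeout elapsed: move to half open and allow a trial request.
            self.state = "half_open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, (re)opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _record_success(self):
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "closed"
```

A caller would wrap each request to the downstream service in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back to a cached or default response.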

Working DevOps

Infrastructure as code

Executable text format
Stored in configuration-management systems
Use version control
Treat servers as cattle, not pets (they should be identical and replaceable, not individually maintained)
Allows identical environments to be run in parallel
Infrastructure should be ephemeral
Applications are packaged in containers
- Contains dependencies
- Limits side effects
- Changes are always made to the image, not to running containers

An example was given of Knight Capital, which lost roughly $440 million in under an hour in 2012 and needed a rescue to survive, after a manual deployment failed to update one of eight servers.

Continuous integration and continuous delivery

Continuous integration
- This means continuously building, testing, and merging to master
- Work in small batches
- Commit regularly
- Pull requests should be automatically built and tested
  - Can be triggered by CI systems that monitor the version-control system for changes
Continuous delivery
- This means continuously deploying to a production-like environment
- This ensures that the changes could be deployed to production
Continuous deployment
- This involves deploying to production rather than just a production-like environment
CI/CD pipeline components
- Code repository
- Build server
- Integration server
- Artifact repository (for binaries)
- Automatic configuration and deployment
Summary pipeline
- Continuous integration
  - Push to version control
  - Automated build and testing
- Continuous delivery
  - Release automation (store any artifacts)
  - Delivery automation (deploy binaries to a given environment)
- Continuous deployment
  - Production automation (promote deployment to production)
Feature flags can decouple deployment from activation
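
A minimal sketch of a feature flag, assuming flags are read from environment variables named `FEATURE_<NAME>` (a convention invented for this example; real systems often use a flag service or configuration store). The checkout functions are hypothetical. Both code paths are deployed together, and the flag decides which one is active.

```python
import os


def flag_enabled(name, environ=os.environ):
    """Return True if the hypothetical FEATURE_<NAME> variable is truthy."""
    value = environ.get(f"FEATURE_{name.upper()}", "")
    return value.strip().lower() in {"1", "true", "on", "yes"}


def legacy_checkout(total):
    return f"legacy checkout: {total:.2f}"


def new_checkout(total):
    return f"new checkout: {total:.2f}"


def checkout(total, environ=os.environ):
    # The new path ships with the release but stays dormant until the
    # flag is switched on, so activation needs no redeployment.
    if flag_enabled("new_checkout", environ):
        return new_checkout(total)
    return legacy_checkout(total)
```

With no flag set, `checkout(10)` takes the legacy path; setting `FEATURE_NEW_CHECKOUT=true` in the environment switches to the new path, and unsetting it rolls the feature back.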

Organizing for DevOps

Agile teams should be cross-functional, self-organizing, and organized around business domains
Teammates, not tickets
Conway's law: Complex systems tend to become shaped like the organizational (communication) structures from which they emerge
- For example, if a software project involves multiple teams, it may end up with a module per team
There shouldn't be a separate DevOps team (in the same way you wouldn't have a specific Agile team); the whole point is to get rid of silos
Introducing a separate QA team can reduce quality as developers no longer feel that it is their responsibility to check that the code works; it's important not to separate people from the consequences of their actions

The course also recommends keeping teams small, but the range given was 5-10 people. This is the same range given in Leading teams, which recommended large teams, so "small" and "large" are clearly relative terms.

Measuring DevOps

DevOps metrics

Metrics should emphasize the social aspect (are other people using your code?) rather than the competitive aspect
Prefer mean time to recovery (MTTR) to mean time to failure (MTTF): the end user sees uptime, not whether any particular container is still running
Actionable metrics
- Time to market
- Overall availability
- Time to deploy
- Proportion of defects detected before production
- Efficient use of infrastructure
- Timeliness of performance feedback
These contrast with vanity metrics (e.g., number of website hits) where it isn't clear what action should be taken
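
To make two of these metrics concrete, the sketch below computes MTTR and overall availability from a list of incidents. It assumes each incident is recorded as a `(started, recovered)` pair of datetimes and that incidents do not overlap; the function names and the sample data are illustrative only.

```python
from datetime import datetime, timedelta


def total_downtime(incidents):
    """Sum of outage durations over (started, recovered) datetime pairs."""
    return sum((end - start for start, end in incidents), timedelta(0))


def mean_time_to_recovery(incidents):
    """Average outage duration; zero if there were no incidents."""
    if not incidents:
        return timedelta(0)
    return total_downtime(incidents) / len(incidents)


def availability(incidents, window):
    """Fraction of the observation window during which the service was up."""
    return 1 - total_downtime(incidents) / window


# Illustrative data: a 10-minute and a 20-minute outage in a 30-day window.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 10)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 20)),
]
mttr = mean_time_to_recovery(incidents)           # 15 minutes
uptime = availability(incidents, timedelta(days=30))
```

Both numbers point to an action: a rising MTTR suggests investing in faster detection and rollback, which is the sense in which these metrics are actionable.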

DevOps vs. Site Reliability Engineering

SRE is "what happens when a software engineer is tasked with what used to be called operations"
Tenets of SRE
- Only hire software engineers
- SRE teams are separate from development teams
- Development teams can deploy straight to production provided that error rates are within the error budget
- Developers rotate through operations
SRE maintains the infrastructure; DevOps uses the infrastructure
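
The error-budget tenet above can be illustrated with a little arithmetic. The sketch assumes the usual definition, in which the budget is whatever failure allowance remains under an availability target (a service-level objective, or SLO); the 99.9% target and the request counts are made-up numbers, and the function names are hypothetical.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests the SLO tolerates over the period."""
    return (1 - slo_target) * total_requests


def can_deploy(slo_target, total_requests, failed_requests):
    """True while observed failures are still within the error budget."""
    return failed_requests <= error_budget(slo_target, total_requests)


# Illustrative numbers: a 99.9% SLO over one million requests tolerates
# roughly 1,000 failures in the period.
within_budget = can_deploy(0.999, 1_000_000, failed_requests=400)
over_budget = can_deploy(0.999, 1_000_000, failed_requests=1_500)
```

Under this scheme, a team that has used up its budget pauses feature releases and works on reliability until the error rate recovers.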

Case studies

The course concluded with a set of quizzes that involved applying the ideas discussed to a set of case studies.
