Machine learning in production
This well-structured course gives plenty of sensible suggestions without being overly long. I intend to use these notes as a checklist for future projects. There are also multiple (optional) practical exercises.
Deployment
Key challenges:
Concept/data drift
Data drift (or covariate shift): Changes in the input distribution
Concept drift: Changes in the relationship between inputs and outputs
Can be gradual or sudden (e.g., purchase patterns during COVID)
Enterprise data often changes faster than user data
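One way to watch for data drift/covariate shift is to compare the live input distribution against a reference sample from training time. A minimal sketch, assuming a single numeric feature and using a two-sample Kolmogorov–Smirnov test; the feature values and significance threshold are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift if the live distribution of a feature differs from the
    reference (training-time) distribution under a two-sample KS test."""
    _statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Illustrative usage: compare recent production inputs against training data
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=2_000)  # mean has shifted
print(feature_has_drifted(train_feature, live_feature))    # True -> investigate
```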
Software engineering
Need to consider:
Real-time vs. batch
Cloud vs. edge/browser
Compute resources
Latency and throughput (queries per second)
Logging
Security and privacy
You generally want a gradual deployment (with monitoring) and the ability to roll back changes.
Deployment patterns:
Shadow mode: The ML system runs in parallel with humans but is not used to make decisions
Canary: Roll out to a small fraction of traffic initially and ramp up gradually
Blue-green: Two servers are maintained but only one handles requests at any given time
While you don't get gradual rollout with blue-green deployment, rolling back is easy.
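A minimal sketch of canary routing, assuming two callable model versions behind one prediction function; the traffic fraction and interfaces are hypothetical.

```python
import random

CANARY_FRACTION = 0.05  # start small; ramp up while monitoring stays healthy

def predict(request, current_model, canary_model):
    """Route a small random fraction of traffic to the new (canary) model,
    returning the prediction and which version served it (for monitoring)."""
    if random.random() < CANARY_FRACTION:
        return canary_model(request), "canary"
    return current_model(request), "current"

# Illustrative usage with stand-in models
print(predict({"text": "hello"}, current_model=lambda r: 0, canary_model=lambda r: 1))
```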
Degrees of automation:
Human only
Shadow mode
Human in the loop
AI assistance
Partial automation (edge cases are referred to humans)
Full automation
Try to brainstorm everything that could go wrong and create a monitoring dashboard. You can start with a lot of metrics and remove any that prove not to be useful over time.
Potential metrics:
Software (memory, compute, latency, throughput, server load, etc.)
Input metrics (to check for data drift/covariate shift)
Output metrics (e.g., how often the model returns null, or how often the user retries)
Monitoring can prompt either manual or automatic retraining.
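As an illustration of one output metric, a small sketch that tracks how often recent predictions come back null and raises a flag above a threshold; the class, window size, and threshold are made up.

```python
from collections import deque

class NullRateMonitor:
    """Track how often recent predictions come back null and flag when the
    rate exceeds a threshold (window size and threshold are illustrative)."""

    def __init__(self, window: int = 1000, max_null_rate: float = 0.05):
        self.recent = deque(maxlen=window)
        self.max_null_rate = max_null_rate

    def record(self, prediction) -> None:
        self.recent.append(prediction is None)

    def should_alert(self) -> bool:
        if not self.recent:
            return False
        return sum(self.recent) / len(self.recent) > self.max_null_rate

# Illustrative usage
monitor = NullRateMonitor(window=10, max_null_rate=0.2)
for prediction in ["cat", None, "dog", None, None]:
    monitor.record(prediction)
print(monitor.should_alert())  # True: 3/5 of recent predictions were null
```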
When you have a pipeline of models (such as a voice activity detection model chained with a speech recognition model), consider how changes to models earlier in the pipeline affect the performance of subsequent models.
Modelling
Selecting and training a model
Model-centric AI development focuses on the choice of model and architecture
Data-centric AI development focuses more on improving the training data
This often works better than model-centric development in practice
Data quality is important, not just quantity
Good performance on the test set may not be enough
Performance may be more important in some cases than in others
For example, a search engine user will expect the obvious result to be first on navigational queries (e.g., "Reddit"), but will be less worried about the exact ordering for informational/transactional queries (e.g., "apple pie recipe")
There may be regulatory or ethical concerns about including certain characteristics as predictors (e.g., gender, race)
Rare classes (e.g., disease prediction, anomaly detection)
Remember to establish a baseline
Human performance
Particularly useful for unstructured data
If your model performs similarly to humans on some cases but not on others, that may suggest where the model can be improved
Simple models (e.g., just predict the average; see the sketch after this list)
Particularly useful for structured data
Older system
Literature search
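A sketch of the "just predict the average" style of baseline using scikit-learn's dummy estimators on synthetic stand-in data; any real model should clearly beat these scores.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_reg = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
y_clf = (y_reg > 0).astype(int)

# "Just predict the average" baseline for regression (R^2 is ~0 by construction)
print(DummyRegressor(strategy="mean").fit(X, y_reg).score(X, y_reg))
# "Always predict the majority class" baseline for classification
print(DummyClassifier(strategy="most_frequent").fit(X, y_clf).score(X, y_clf))
```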
Getting started
Literature search
It's important to find something reasonable, but it doesn't necessarily need to be cutting-edge
Deployment constraints are important for the final model, but less important if you're looking to establish a baseline
Sanity-check the code and results (e.g., by overfitting a small dataset)
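A sketch of the overfitting sanity check: a correctly wired model and training setup should be able to drive training error to (near) zero on a handful of examples. The model, data, and hyperparameters here are stand-ins.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Tiny dataset with arbitrary labels: a working pipeline should memorize it
rng = np.random.default_rng(0)
X_small = rng.normal(size=(20, 10))
y_small = rng.integers(0, 2, size=20)

model = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=0.01,
                      max_iter=10_000, tol=1e-6, random_state=0)
model.fit(X_small, y_small)
assert model.score(X_small, y_small) > 0.9, "Cannot overfit 20 examples - check the code"
```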
Error analysis and performance auditing
Look for common factors in misclassified/poorly predicted examples (e.g., car noise in speech recognition)
Tools may be available to help with this (e.g., LandingLens)
Prioritization factors
Room for improvement (e.g., distance to human baseline)
Ease of improvement
Frequency and importance of particular cases
Auditing performance
Brainstorm ways the system might go wrong
Performance on specific subsets (e.g., specific ethnicities)
Performance on rare classes
Different error types (e.g., false positives vs. false negatives)
Mislabelled data (e.g., "GAN" may have been transcribed as a more common word like "gang" or "gun")
Costly types of errors (e.g., a speech-transcription model mistakenly outputting an offensive term)
Establish metrics for slices of data (see the sketch after this list)
MLOps tools may be able to help
Business/product owner buy-in
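A sketch of establishing metrics for slices of data, assuming a DataFrame with hypothetical label, prediction, and device columns; accuracy is used as the metric here, but any per-slice metric works the same way.

```python
import pandas as pd

def accuracy_by_slice(df: pd.DataFrame, slice_col: str) -> pd.Series:
    """Accuracy per slice (e.g., per device type or demographic group).
    Assumes hypothetical 'label' and 'prediction' columns."""
    correct = (df["label"] == df["prediction"]).rename("accuracy")
    return correct.groupby(df[slice_col]).mean()

# Illustrative usage
df = pd.DataFrame({
    "label":      [1, 0, 1, 1, 0, 1],
    "prediction": [1, 0, 0, 1, 0, 1],
    "device":     ["ios", "ios", "ios", "android", "android", "android"],
})
print(accuracy_by_slice(df, "device"))
```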
Data iteration
The general loop is to update the data while keeping the model fixed
Consider data augmentation (see the noise-mixing sketch after this list)
Adding accurately-labelled data to a large model trained on unstructured data may not always help performance, but it rarely harms it
Counterexample: If a model that identifies house numbers from photos is given more examples that contain letters (e.g., 42I) then this may decrease performance on numbers with digits that look like letters (e.g., 421)
Data augmentation is difficult when working with structured data, but you can add additional features (e.g., whether someone seeking restaurant recommendations is vegetarian)
Compared to collaborative filtering (recommendations based on similar users), content-based filtering (recommendations based on item features) suffers less from the cold-start problem
Error analysis is also harder for structured data due to the lack of a human baseline
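A sketch of one augmentation from the unstructured-data setting: mixing background noise (e.g., the car noise mentioned earlier) into clean speech at a chosen signal-to-noise ratio. Waveforms are assumed to be NumPy arrays; the SNR value is illustrative.

```python
import numpy as np

def add_background_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix a background-noise clip (e.g., car noise) into clean speech at a
    target signal-to-noise ratio. Both inputs are raw waveform arrays."""
    noise = np.resize(noise, clean.shape)  # loop/trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Illustrative usage with synthetic waveforms
rng = np.random.default_rng(0)
speech, car_noise = rng.normal(size=16_000), rng.normal(size=4_000)
augmented = add_background_noise(speech, car_noise, snr_db=10.0)
```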
Experiment tracking
Algorithm and code versioning
Dataset used
Hyperparameters
Results
Metrics
Trained model (ideally)
Consider experiment-tracking systems (e.g., Weights & Biases, Comet, MLflow, SageMaker Studio)
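A sketch of what tracking the items above might look like with MLflow (one of the options listed); the run name, tags, and values are placeholders.

```python
import mlflow

# Illustrative run: names, tags, and values are made up
with mlflow.start_run(run_name="baseline-model"):
    mlflow.set_tag("git_commit", "abc1234")          # algorithm/code version
    mlflow.log_param("dataset", "training_data_v3")  # dataset used
    mlflow.log_param("learning_rate", 0.01)          # hyperparameters
    mlflow.log_metric("val_accuracy", 0.87)          # results/metrics
    # mlflow.sklearn.log_model(model, "model")       # trained model (ideally)
```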
Features of good data
Coverage
Consistent definitions
Up to date
Appropriately sized
Data
Define data and establish baseline
For small datasets, labelling accuracy can be critical, so it might be worth checking all the labels manually
Even for large datasets, some events may be rare, so some of the same ideas may apply
Improving labelling consistency
Have multiple people label the same example
Discuss cases of disagreement
Consider regathering data if there isn't enough information (for example, it might not be possible to spot defects in images of phones taken in poor lighting)
Consider merging classes if people struggle to distinguish between them
Consider tags for ambiguous cases (e.g., "borderline", "[unintelligible]")
You can use voting, but this shouldn't be an excuse for not trying to reduce noise first
Measuring human-level performance can help estimate Bayes error/irreducible error to help with error analysis and prioritization
Be cautious about claiming to have beaten human performance
If there are multiple correct ways to label something, an algorithm that chooses the most popular one will show higher agreement with humans than humans show with each other, even if humans always select a valid label
The cases where the algorithm beats humans may be less important than the cases where it falls short (for example, it might be better at suggesting obscure breeds of dog, but also more likely to mislabel a dog as a cat)
Try digging into cases where different labellers disagree with each other to see if you can raise the human baseline
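A sketch of quantifying labeller disagreement, assuming two annotators have labelled the same examples: Cohen's kappa gives a chance-corrected agreement score, and the disagreement indices point to cases worth discussing. The labels are made up.

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same examples (values are illustrative)
annotator_a = ["scratch", "ok", "dent", "ok", "scratch", "ok"]
annotator_b = ["scratch", "ok", "scratch", "ok", "dent", "ok"]

# Chance-corrected agreement between the two labellers
print(cohen_kappa_score(annotator_a, annotator_b))

# Indices of examples worth discussing with the labelling team
print([i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b])
```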
Label and organize data
Try to minimize initial data collection time so that you can spend more time iterating
Account for both the monetary and time costs of different data sources
Doing some of the labelling yourself can give you a feel for the data
Try to scale the dataset size gradually (no more than ten times the size of the previous iteration) to avoid overinvesting in data collection
Data provenance: Source of the data
Data lineage: Steps taken in transforming the data
Consider stratified sampling when applying holdout cross-validation to small datasets
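A sketch of a stratified holdout split with scikit-learn, where a small synthetic dataset with a rare class keeps the same class proportions in both partitions.

```python
from sklearn.model_selection import train_test_split

X = list(range(20))
y = [1] * 4 + [0] * 16  # small dataset with a rare positive class

# stratify=y keeps the 20% positive rate in both partitions, which a purely
# random split over 20 examples can easily fail to do
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(sum(y_train) / len(y_train), sum(y_val) / len(y_val))  # 0.2 and 0.2
```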
Scoping
This was an optional section of the course. Mostly common sense, but the points are worth bearing in mind.
What is the business goal?
Is it feasible?
External benchmarks
Literature
Other businesses
Human-level performance
Rate of progress of previous approaches
Availability of relevant predictors
What metrics should we optimize in order to maximize value?