Machine learning in production
This well-structured course gives plenty of sensible suggestions without being overly long. I intend to use these notes as a checklist for future projects. There are also multiple (optional) practical exercises.
Deployment
Key challenges:
Concept/data drift
Data drift (or covariate shift): Changes in the input distribution
Concept drift: Changes in the relationship between inputs and outputs
Can be gradual or sudden (e.g., purchase patterns during COVID)
Enterprise data often changes faster than user data
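One way to watch for data drift/covariate shift is to compare the live input distribution against a reference sample from training time. A minimal sketch, assuming a single numeric feature and using a two-sample Kolmogorov–Smirnov test; the feature values and significance threshold are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift if the live distribution of a feature differs from the
    reference (training-time) distribution under a two-sample KS test."""
    _statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Illustrative usage: compare recent production inputs against training data
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=2_000)  # mean has shifted
print(feature_has_drifted(train_feature, live_feature))    # True -> investigate
```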
Software engineering
Need to consider:
Real-time vs. batch
Cloud vs. edge/browser
Compute resources
Latency and throughput (queries per second)
Logging
Security and privacy
You generally want a gradual deployment (with monitoring) and the ability to roll back changes.
Deployment patterns:
Shadow mode: The ML system runs in parallel with humans but is not used to make decisions
Canary: Roll out to a small fraction of traffic initially and ramp up gradually
Blue-green: Two servers are maintained but only one handles requests at any given time
While you don't get gradual rollout with blue-green deployment, rolling back is easy.
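A minimal sketch of canary routing, assuming two callable model versions behind one prediction function; the traffic fraction and interfaces are hypothetical.

```python
import random

CANARY_FRACTION = 0.05  # start small; ramp up while monitoring stays healthy

def predict(request, current_model, canary_model):
    """Route a small random fraction of traffic to the new (canary) model,
    returning the prediction and which version served it (for monitoring)."""
    if random.random() < CANARY_FRACTION:
        return canary_model(request), "canary"
    return current_model(request), "current"

# Illustrative usage with stand-in models
print(predict({"text": "hello"}, current_model=lambda r: 0, canary_model=lambda r: 1))
```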
Degrees of automation:
Human only
Shadow mode
Human in the loop
AI assistance
Partial automation (edge cases are referred to humans)
Full automation
Try to brainstorm everything that could go wrong and create a monitoring dashboard. You can start with a lot of metrics and remove any that prove not to be useful over time.
Potential metrics:
Software (memory, compute, latency, throughput, server load, etc.)
Input metrics (to check for data drift/covariate shift)
Output metrics (e.g., how often the model returns null, or how often the user retries)
Monitoring can prompt either manual or automatic retraining.
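As an illustration of one output metric, a small sketch that tracks how often recent predictions come back null and raises a flag above a threshold; the class, window size, and threshold are made up.

```python
from collections import deque

class NullRateMonitor:
    """Track how often recent predictions come back null and flag when the
    rate exceeds a threshold (window size and threshold are illustrative)."""

    def __init__(self, window: int = 1000, max_null_rate: float = 0.05):
        self.recent = deque(maxlen=window)
        self.max_null_rate = max_null_rate

    def record(self, prediction) -> None:
        self.recent.append(prediction is None)

    def should_alert(self) -> bool:
        if not self.recent:
            return False
        return sum(self.recent) / len(self.recent) > self.max_null_rate

# Illustrative usage
monitor = NullRateMonitor(window=10, max_null_rate=0.2)
for prediction in ["cat", None, "dog", None, None]:
    monitor.record(prediction)
print(monitor.should_alert())  # True: 3/5 of recent predictions were null
```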
When you have a pipeline of models (such as a voice activity detection model chained with a speech recognition model), consider how changes to models earlier in the pipeline affect the performance of subsequent models.
Modelling
Selecting and training a model
Model-centric AI development focuses on the choice of model and architecture
Data-centric AI development focuses more on improving the training data
This often works better than model-centric development in practice
Data quality is important, not just quantity
Good performance on the test set may not be enough
Performance may be more important in some cases than in others
For example, a search engine user will expect the obvious result to be first on navigational queries (e.g., "Reddit"), but will be less worried about the exact ordering for informational/transactional queries (e.g., "apple pie recipe")
There may be regulatory or ethical concerns about including certain characteristics as predictors (e.g., gender, race)
Rare classes (e.g., disease prediction, anomaly detection)
Remember to establish a baseline
Human performance
Particularly useful for unstructured data
If your model performs similarly to humans on some cases but not on others, that may suggest where the model can be improved
Simple models (e.g., just predict the average; see the sketch after this list)
Particularly useful for structured data
Older system
Literature search
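A sketch of the "just predict the average" style of baseline using scikit-learn's dummy estimators on synthetic stand-in data; any real model should clearly beat these scores.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_reg = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
y_clf = (y_reg > 0).astype(int)

# "Just predict the average" baseline for regression (R^2 is ~0 by construction)
print(DummyRegressor(strategy="mean").fit(X, y_reg).score(X, y_reg))
# "Always predict the majority class" baseline for classification
print(DummyClassifier(strategy="most_frequent").fit(X, y_clf).score(X, y_clf))
```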
Getting started
Literature search
It's important to find something reasonable, but it doesn't necessarily need to be cutting-edge
Deployment constraints are important for the final model, but less important if you're looking to establish a baseline
Sanity-check the code and results (e.g., by overfitting a small dataset)
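A sketch of the overfitting sanity check: a correctly wired model and training setup should be able to drive training error to (near) zero on a handful of examples. The model, data, and hyperparameters here are stand-ins.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Tiny dataset with arbitrary labels: a working pipeline should memorize it
rng = np.random.default_rng(0)
X_small = rng.normal(size=(20, 10))
y_small = rng.integers(0, 2, size=20)

model = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=0.01,
                      max_iter=10_000, tol=1e-6, random_state=0)
model.fit(X_small, y_small)
assert model.score(X_small, y_small) > 0.9, "Cannot overfit 20 examples - check the code"
```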
Error analysis and performance auditing
Look for common factors in misclassified/poorly predicted examples (e.g., car noise in speech recognition)
Tools may be available to help with this (e.g., LandingLens)
Prioritization factors
Room for improvement (e.g., distance to human baseline)
Ease of improvement
Frequency and importance of particular cases
Auditing performance
Brainstorm ways the system might go wrong
Performance on specific subsets (e.g., specific ethnicities)
Performance on rare classes
Different error types (e.g., false positives vs. false negatives)
Mislabelled data (e.g., "GAN" may have been transcribed as a more common word like "gang" or "gun")
Costly types of errors (e.g., a speech-transcription model mistakenly outputting an offensive term)
Establish metrics for slices of data (see the sketch after this list)
MLOps tools may be able to help
Business/product owner buy-in
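A sketch of establishing metrics for slices of data, assuming a DataFrame with hypothetical label, prediction, and device columns; accuracy is used as the metric here, but any per-slice metric works the same way.

```python
import pandas as pd

def accuracy_by_slice(df: pd.DataFrame, slice_col: str) -> pd.Series:
    """Accuracy per slice (e.g., per device type or demographic group).
    Assumes hypothetical 'label' and 'prediction' columns."""
    correct = (df["label"] == df["prediction"]).rename("accuracy")
    return correct.groupby(df[slice_col]).mean()

# Illustrative usage
df = pd.DataFrame({
    "label":      [1, 0, 1, 1, 0, 1],
    "prediction": [1, 0, 0, 1, 0, 1],
    "device":     ["ios", "ios", "ios", "android", "android", "android"],
})
print(accuracy_by_slice(df, "device"))
```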
Data iteration
The general loop is to update the data while keeping the model fixed
Consider data augmentation (see the noise-mixing sketch after this list)
Adding accurately-labelled data to a large model trained on unstructured data may not always help performance, but it rarely harms it
Counterexample: If a model that identifies house numbers from photos is given more examples that contain letters (e.g., 42I) then this may decrease performance on numbers with digits that look like letters (e.g., 421)
Data augmentation is difficult when working with structured data, but you can add additional features (e.g., whether someone seeking restaurant recommendations is vegetarian)
Compared to collaborative filtering (recommendations based on similar users), content-based filtering (recommendations based on item features) suffers less from the cold-start problem
Error analysis is also harder for structured data due to the lack of a human baseline
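A sketch of one augmentation from the unstructured-data setting: mixing background noise (e.g., the car noise mentioned earlier) into clean speech at a chosen signal-to-noise ratio. Waveforms are assumed to be NumPy arrays; the SNR value is illustrative.

```python
import numpy as np

def add_background_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix a background-noise clip (e.g., car noise) into clean speech at a
    target signal-to-noise ratio. Both inputs are raw waveform arrays."""
    noise = np.resize(noise, clean.shape)  # loop/trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Illustrative usage with synthetic waveforms
rng = np.random.default_rng(0)
speech, car_noise = rng.normal(size=16_000), rng.normal(size=4_000)
augmented = add_background_noise(speech, car_noise, snr_db=10.0)
```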
Experiment tracking
Algorithm and code versioning
Dataset used
Hyperparameters
Results
Metrics
Trained model (ideally)
Consider experiment-tracking systems (e.g., Weights & Biases, Comet, MLflow, SageMaker Studio)
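A sketch of what tracking the items above might look like with MLflow (one of the options listed); the run name, tags, and values are placeholders.

```python
import mlflow

# Illustrative run: names, tags, and values are made up
with mlflow.start_run(run_name="baseline-model"):
    mlflow.set_tag("git_commit", "abc1234")          # algorithm/code version
    mlflow.log_param("dataset", "training_data_v3")  # dataset used
    mlflow.log_param("learning_rate", 0.01)          # hyperparameters
    mlflow.log_metric("val_accuracy", 0.87)          # results/metrics
    # mlflow.sklearn.log_model(model, "model")       # trained model (ideally)
```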
Features of good data
Coverage
Consistent definitions
Up to date
Appropriately sized
Data
Define data and establish baseline
For small datasets, labelling accuracy can be critical, so it might be worth checking all the labels manually
Even for large datasets, some events may be rare, so some of the same ideas may apply
Improving labelling consistency
Have multiple people label the same example
Discuss cases of disagreement
Consider regathering data if there isn't enough information (for example, it might not be possible to spot defects in images of phones taken in poor lighting)
Consider merging classes if people struggle to distinguish between them
Consider tags for ambiguous cases (e.g., "borderline", "[unintelligible]")
You can use voting, but this shouldn't be an excuse for not trying to reduce noise first
Measuring human-level performance can help estimate Bayes error/irreducible error to help with error analysis and prioritization
Be cautious about claiming to have beaten human performance
If there are multiple correct ways to label something, an algorithm that chooses the most popular one will show higher agreement with humans than humans show with each other, even if humans always select a valid label
The cases where the algorithm beats humans may be less important than the cases where it falls short (for example, it might be better at suggesting obscure breeds of dog, but also more likely to mislabel a dog as a cat)
Try digging into cases where different labellers disagree with each other to see if you can raise the human baseline
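A sketch of quantifying labeller disagreement, assuming two annotators have labelled the same examples: Cohen's kappa gives a chance-corrected agreement score, and the disagreement indices point to cases worth discussing. The labels are made up.

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same examples (values are illustrative)
annotator_a = ["scratch", "ok", "dent", "ok", "scratch", "ok"]
annotator_b = ["scratch", "ok", "scratch", "ok", "dent", "ok"]

# Chance-corrected agreement between the two labellers
print(cohen_kappa_score(annotator_a, annotator_b))

# Indices of examples worth discussing with the labelling team
print([i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b])
```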
Label and organize data
Try to minimize initial data collection time so that you can spend more time iterating
Account for both the monetary and time costs of different data sources
Doing some of the labelling yourself can give you a feel for the data
Try to scale the dataset size gradually (no more than ten times the size of the previous iteration) to avoid overinvesting in data collection
Data provenance: Source of the data
Data lineage: Steps taken in transforming the data
Consider stratified sampling when applying holdout cross-validation to small datasets
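A sketch of a stratified holdout split with scikit-learn, where a small synthetic dataset with a rare class keeps the same class proportions in both partitions.

```python
from sklearn.model_selection import train_test_split

X = list(range(20))
y = [1] * 4 + [0] * 16  # small dataset with a rare positive class

# stratify=y keeps the 20% positive rate in both partitions, which a purely
# random split over 20 examples can easily fail to do
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(sum(y_train) / len(y_train), sum(y_val) / len(y_val))  # 0.2 and 0.2
```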
Scoping
This was an optional section of the course. Mostly common sense, but the points are worth bearing in mind.
What is the business goal?
Is it feasible?
External benchmarks
Literature
Other businesses
Human-level performance
Rate of progress of previous approaches
Availability of relevant predictors
What metrics should we optimize in order to maximize value?