Machine learning in production

This well-structured course gives plenty of sensible suggestions without being overly long. I intend to use these notes as a checklist for any future projects. There are also multiple (optional) practical exercises.

Deployment

Key challenges:

  • Concept/data drift

    • Data drift (or covariate shift): Changes in the input distribution

    • Concept drift: Changes in the relationship between inputs and outputs

    • Can be gradual or sudden (e.g., purchase patterns during COVID)

      • Enterprise data often changes faster than user data

  • Software engineering

    • Need to consider:

      • Realtime vs. batch

      • Cloud vs. edge/browser

      • Compute resources

      • Latency and throughput (queries per second)

      • Logging

      • Security and privacy

You generally want a gradual deployment (with monitoring) and the ability to roll back changes.

Deployment patterns:

  • Shadow mode: The ML system runs in parallel with humans but is not used to make decisions

  • Canary: Roll out to a small fraction of traffic initially and ramp up gradually

  • Blue-green: Two servers are maintained but only one handles requests at any given time

While you don't get gradual rollout with blue-green deployment, rolling back is easy.
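
As an illustration of the canary pattern, here is a minimal sketch of request routing; the names (`route_request`, `CANARY_FRACTION`, the two `predict_*` stand-ins) are hypothetical rather than anything from the course:

```python
import random

CANARY_FRACTION = 0.05  # ramp up gradually while monitoring; set to 0.0 to roll back


def predict_old(features):
    return 0.0  # stand-in for the existing production model


def predict_new(features):
    return 1.0  # stand-in for the candidate model being rolled out


def route_request(features):
    """Send a small, adjustable fraction of traffic to the new model."""
    use_new = random.random() < CANARY_FRACTION
    prediction = predict_new(features) if use_new else predict_old(features)
    # Record which version served the request so the two can be compared later.
    print({"version": "new" if use_new else "old", "prediction": prediction})
    return prediction


route_request({"text": "example request"})
```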

Degrees of automation:

  • Human only

  • Shadow mode

  • Human in the loop

    • AI assistance

    • Partial automation (edge cases are referred to humans)

  • Full automation

Try to brainstorm everything that could go wrong and create a monitoring dashboard. You can start with a lot of metrics and remove any that prove not to be useful over time.

Potential metrics:

  • Software (memory, compute, latency, throughput, server load, etc.)

  • Input metrics (to check for data drift/covariate shift)

  • Output metrics (e.g., how often it returns null or how often the user retries)

Monitoring can prompt either manual or automatic retraining.
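
As a sketch of what an input-metric check might look like, the snippet below flags possible drift when the recent mean of a feature moves too far from its training-time distribution; the feature, windows, and threshold are hypothetical, and a production system would use whatever statistics and alerting its monitoring stack provides:

```python
import statistics


def drift_alert(reference, recent, threshold=3.0):
    """Flag possible data drift if the recent mean of an input feature has
    moved more than `threshold` reference standard deviations away."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference) or 1e-9  # guard against zero variance
    shift = abs(statistics.mean(recent) - ref_mean) / ref_std
    return shift > threshold


# Example: input lengths seen at training time vs. in the last hour of traffic.
training_lengths = [98, 102, 101, 99, 100, 103, 97]
recent_lengths = [150, 148, 152, 149, 151]
if drift_alert(training_lengths, recent_lengths):
    print("Possible data drift: consider inspecting inputs or retraining.")
```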

When you have a pipeline of models (such as a voice activity detection model chained with a speech recognition model), you have to consider how changes to models earlier in the pipeline affect the performance of subsequent models.

Modelling

Selecting and training a model

  • Model-centric AI development focuses on the choice of model and architecture

  • Data-centric AI development focuses more on improving the training data

    • This often works better than model-centric development in practice

    • Data quality is important, not just quantity

  • Good performance on the test set may not be enough

    • Performance may be more important in some cases than in others

      • For example, a search engine user will expect the obvious result to be first on navigational queries (e.g., "Reddit"), but will be less worried about the exact ordering for informational/transactional queries (e.g., "apple pie recipe")

    • There may be regulatory or ethical concerns about including certain characteristics as predictors (e.g., gender, race)

    • Rare classes (e.g., disease prediction, anomaly detection)

  • Remember to establish a baseline

    • Human performance

      • Particularly useful for unstructured data

      • If your model performs similarly to humans on some cases but not on others, that may suggest where the model can be improved

    • Simple models (e.g., just predict the average)

      • Particularly useful for structured data

    • Older system

    • Literature search

  • Getting started

    • Literature search

      • It's important to find something reasonable, but it doesn't necessarily need to be cutting-edge

    • Deployment constraints are important for the final model, but less important if you're looking to establish a baseline

    • Sanity-check the code and results (e.g., by overfitting a small dataset)
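
A minimal sketch of two of the points above, comparing against a trivial baseline (just predict the average) and sanity-checking that a flexible model can overfit a tiny dataset; it assumes scikit-learn is available and uses synthetic data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor

# Tiny synthetic dataset purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

# Baseline: always predict the training mean.
baseline = DummyRegressor(strategy="mean").fit(X, y)
print("baseline R^2:", baseline.score(X, y))  # 0 by construction

# Sanity check: a flexible model should be able to (over)fit this tiny set
# almost perfectly; if it cannot, suspect a bug in the data or training code.
model = DecisionTreeRegressor().fit(X, y)
print("overfit R^2:", model.score(X, y))  # should be close to 1
```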

Error analysis and performance auditing

  • Look for common factors in misclassified/poorly predicted examples (e.g., car noise in speech recognition)

    • Tools may be available to help with this (e.g., LandingLens)

  • Prioritization factors

    • Room for improvement (e.g., distance to human baseline)

    • Ease of improvement

    • Frequency and importance of particular cases

  • Auditing performance

    • Brainstorm ways the system might go wrong

      • Performance on specific subsets (e.g., specific ethnicities)

      • Performance on rare classes

      • Different error types (e.g., false positives vs. false negatives)

      • Mislabelled data (e.g., "GAN" may have been transcribed as a more common word like "gang" or "gun")

      • Costly types of errors (e.g., a speech-transcription model mistakenly outputting an offensive term)

    • Establish metrics for slices of data (see the sketch after this list)

      • MLOps tools may be able to help

    • Business/product owner buy-in
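
A sketch of the per-slice metrics mentioned above, computing accuracy for each (hypothetical) device-type slice from evaluation records:

```python
from collections import defaultdict

# Hypothetical evaluation records: (slice key, true label, predicted label).
records = [
    ("android", 1, 1),
    ("android", 0, 0),
    ("ios", 1, 0),
    ("ios", 1, 1),
    ("web", 0, 1),
]

totals, correct = defaultdict(int), defaultdict(int)
for slice_key, label, prediction in records:
    totals[slice_key] += 1
    correct[slice_key] += int(label == prediction)

# Overall accuracy can hide a slice that performs much worse than the rest.
for slice_key in totals:
    print(slice_key, correct[slice_key] / totals[slice_key])
```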

Data iteration

  • The general loop is to update the data while keeping the model fixed

  • Consider data augmentation

  • Adding accurately labelled data to a large model for an unstructured-data problem may not always help performance, but it rarely harms it

    • Counterexample: if a model that identifies house numbers from photos is given more examples containing letters (e.g., 42I), its performance may decrease on numbers with digits that look like letters (e.g., 421)

  • Data augmentation is difficult when working with structured data, but you can add additional features (e.g., whether someone seeking restaurant recommendations is vegetarian)

    • Compared to collaborative filtering (recommendations based on similar users), content-based filtering (recommendations based on item features) suffers less from the cold-start problem

  • Error analysis is also harder for structured data due to the lack of a human baseline

  • Experiment tracking

    • Algorithm and code versioning

    • Dataset used

    • Hyperparameters

    • Results

      • Metrics

      • Trained model (ideally)

  • Consider experiment-tracking systems (e.g., Weights & Biases, Comet, MLflow, SageMaker Studio); a minimal tracking sketch follows this list

  • Features of good data

    • Coverage

    • Consistent definitions

    • Up to date

    • Appropriately sized
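
As a minimal sketch of the experiment-tracking idea above without any particular tool, the snippet below appends one record per run to a JSON-lines file; the field names are my own choice, and dedicated systems add dashboards, artefact storage, and collaboration on top of this:

```python
import json
import time


def log_experiment(path, *, git_commit, dataset, hyperparameters, metrics):
    """Append one experiment record to a JSON-lines file."""
    record = {
        "timestamp": time.time(),
        "git_commit": git_commit,          # algorithm/code version
        "dataset": dataset,                # which data was used
        "hyperparameters": hyperparameters,
        "metrics": metrics,                # results; store the trained model separately
    }
    with open(path, "a") as handle:
        handle.write(json.dumps(record) + "\n")


log_experiment(
    "experiments.jsonl",
    git_commit="abc123",
    dataset="reviews_v3",
    hyperparameters={"learning_rate": 1e-3, "epochs": 5},
    metrics={"val_accuracy": 0.91},
)
```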

Data

Define data and establish baseline

  • For small datasets, labelling accuracy can be critical, so it might be worth checking all the labels manually

  • Even for large datasets, some events may be rare, so some of the same ideas may apply

  • Improving labelling consistency

    • Have multiple people label the same example

    • Discuss cases of disagreement

    • Consider regathering data if there isn't enough information (for example, it might not be possible to spot defects in images of phones taken in poor lighting)

    • Consider merging classes if people struggle to distinguish between them

    • Consider tags for ambiguous cases (e.g., "borderline", "[unintelligible]")

    • You can use voting, but this shouldn't be an excuse for not trying to reduce noise first

  • Measuring human-level performance can help estimate Bayes (irreducible) error, which aids error analysis and prioritization

  • Be cautious about claiming to beat human performance

    • If there are multiple correct ways to label something, an algorithm that chooses the most popular one will show higher agreement with humans than humans do, even if humans always select a valid label

    • The cases where the algorithm beats humans may be less important than the cases where it falls short (for example, it might be better at suggesting obscure breeds of dog, but also more likely to mislabel a dog as a cat)

    • Try digging into cases where different labellers disagree with each other to see if you can raise the human baseline
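
A small sketch of checking labeller agreement and falling back to a majority vote, using made-up labels; in practice you might also compute chance-corrected measures such as Cohen's kappa:

```python
from collections import Counter
from itertools import combinations

# Hypothetical: three labellers annotating the same five examples.
labels = {
    "labeller_a": ["cat", "dog", "cat", "borderline", "dog"],
    "labeller_b": ["cat", "dog", "dog", "borderline", "dog"],
    "labeller_c": ["cat", "dog", "cat", "cat", "dog"],
}

# Pairwise agreement highlights which examples (or labellers) to discuss.
for (name_a, a), (name_b, b) in combinations(labels.items(), 2):
    agreement = sum(x == y for x, y in zip(a, b)) / len(a)
    print(name_a, name_b, round(agreement, 2))

# Majority vote as a fallback once obvious inconsistencies have been resolved.
voted = [Counter(votes).most_common(1)[0][0] for votes in zip(*labels.values())]
print(voted)
```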

Label and organize data

  • Try to minimize initial data collection time so that you can spend more time iterating

  • Account for both the monetary and time costs of different data sources

  • Doing some of the labelling yourself can give you a feel for the data

  • Try to scale the dataset size gradually (no more than ten times the size of the previous iteration) to avoid overinvesting in data collection

  • Data provenance: Source of the data

  • Data lineage: Steps taken in transforming the data

  • Consider stratified sampling when applying holdout cross-validation to small datasets
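
For the last point, scikit-learn's train_test_split supports stratification directly; a sketch on a small, imbalanced synthetic dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small, imbalanced synthetic dataset: 16 negatives, 8 positives.
X = np.arange(24).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 8)

# stratify=y keeps the class ratio the same in both splits, which matters
# when the dataset is small or a class is rare.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_train.mean(), y_holdout.mean())  # both 1/3
```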

Scoping

This was an optional section of the course. Mostly common sense, but the points are worth bearing in mind.

  • What is the business goal?

  • Is it feasible?

    • External benchmarks

      • Literature

      • Other businesses

    • Human-level performance

    • Rate of progress of previous approaches

    • Availability of relevant predictors

  • What metrics should we optimize in order to maximize value?
