Data engineering with Alteryx

The obvious question that this book has to answer is "Why would you use Alteryx for data engineering?" It gives three main reasons:

  • Speed of development

  • Iterative workflow development

  • Self-documentation

Code-based approaches are also self-documenting, and I see no reason why they can't be developed iteratively, but in my experience the no-code approach, with its visual canvas and the ability to inspect the data at any point in the workflow, does lead to significantly faster development than code-based solutions. This may change over time as LLMs improve (see Complete Cursor and Generative AI for software development).

To those points, I would also add the ease of picking it up and the convenience of having almost everything that you need available in a single program (particularly beneficial in organizations where it is difficult to get IT approval for new programs or packages). For many companies, its advantages will outweigh its disadvantages (such as uninformative diffs in version control).

This isn't a bad book overall, but its general focus is on step-by-step instructions for how to do things in Alteryx (accompanied by a lot of screenshots), with very little time spent on how you can use Alteryx in accordance with general data engineering/DataOps principles and best practices. Given that I can just check the documentation when I want to know how to do something, I would have preferred a 20-page book that stuck to what you should do and why rather than one that spreads 20 pages' worth of information over more than 300 pages.

Part 1: Introduction

Tools:

  • Alteryx Designer for data transformation

  • Alteryx Server for automation and scheduling

  • Alteryx Connect for cataloguing and discovery

Suggested best practices for workflows:

  • Supplement the automatic annotations

  • Use tool containers to group logic

  • Use comment and explorer boxes

This chapter defines data engineering and gives some examples of using Designer, Server, and Connect. Nothing particularly stood out as worthy of mentioning here.

DataOps is basically DevOps applied to analytics workflows. According to the book, it promises:

  • Faster cycle times

  • Faster access to actionable insights

  • Improved robustness of data processes (e.g., through statistical process control; see the sketch after this list)

  • The ability to see the entire data flow in a workflow

  • Strong security and confidence
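
On the statistical process control point, here is a minimal sketch of the idea; it is my own illustration rather than anything from the book: compare a pipeline metric such as the daily row count against control limits derived from its history, and stop the workflow when it falls outside them.

```python
# Hypothetical example (not from the book): a simple statistical process
# control check on daily row counts, flagging values outside 3-sigma
# control limits so the pipeline fails fast instead of loading bad data.
from statistics import mean, stdev

def row_count_in_control(history: list[int], today: int, sigmas: float = 3.0) -> bool:
    """Return True if today's row count is within the control limits
    implied by the historical counts."""
    centre = mean(history)
    spread = stdev(history)
    lower, upper = centre - sigmas * spread, centre + sigmas * spread
    return lower <= today <= upper

history = [10_120, 9_980, 10_340, 10_050, 10_210]
if not row_count_in_control(history, today=4_500):
    raise ValueError("Row count outside control limits; halting the workflow")
```

In Alteryx itself, the equivalent would be something like a Test tool comparing the incoming record count against stored control limits.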

The book also discusses the principles from https://dataopsmanifesto.org/en/. There are quite a few of them, so see either the link or the list below.

Pillars of DataOps
  • Continually Satisfy Your Customer

    • Highest priority is customer satisfaction through early and continuous delivery of valuable analytic insights

    • Delivery timelines range from minutes to weeks

  • Value Working Analytics

    • Primary performance measure is the delivery of insightful analytics

    • This includes accurate data, robust frameworks, and systems

  • Embrace Change

    • Welcome and embrace evolving customer needs for competitive advantage

    • Face-to-face communication is the most efficient and effective method

  • It's a Team Sport

    • Diverse teams with varied roles, skills, tools, and backgrounds enhance innovation and productivity

  • Daily Interactions

    • Customers, analytic teams, and operations must collaborate daily

  • Self-Organize

    • Best results emerge from self-organizing teams

  • Reduce Heroism

    • Strive for sustainable and scalable teams and processes to avoid over-reliance on individuals

  • Reflect

    • Regularly review feedback from customers, team members, and operational statistics to improve performance

  • Analytics is Code

    • Tools generate code and configuration that describe data manipulation for insights.

  • Orchestrate

    • End-to-end orchestration of data, tools, code, environments, and teams is crucial for success

  • Make it Reproducible

    • Version everything (data, hardware, software, code, configuration) for reproducibility

  • Disposable Environments

    • Provide easy-to-create, isolated, and disposable environments for experimentation

  • Simplicity

    • Focus on technical excellence, good design, and simplicity (maximizing work not done) for enhanced agility

  • Analytics is Manufacturing

    • Treat analytic pipelines like lean manufacturing lines, focusing on process thinking for continuous efficiency

  • Quality is Paramount

    • Implement automated abnormality and security detection (jidoka) and continuous feedback mechanisms (poka yoke)

  • Monitor Quality and Performance

    • Continuously monitor performance, security, and quality for unexpected variations and operational statistics

  • Reuse

    • Avoid repeating previous work for increased efficiency

  • Improve Cycle Times

    • Minimize time and effort from customer need to production-ready analytic process and refactoring/reuse

  • Start With Your Data Journey

    • Data trust is crucial for adoption

    • Understand and observe the data journey in your production environment for error reduction.

Part 2: Functional steps in DataOps

Much of this section covers basic functionality, which will be useful for new Alteryx users but will already be familiar to people who have been using the software for a while. It does suggest some best practices, though:

  • Use the Field Info tool to check for changes in file structure

  • Consider Dynamic Select and Dynamic Rename over Select

  • Consider whether blanks should actually be nulls

  • Use comments, containers, and annotations to make the workflow easier to follow

  • Remember to profile the data and use exploratory data analysis

  • Consider whether to impute missing values

  • Field types can be extracted to a .yxft file (useful for replacing Auto Field with Select)

  • Use the Message and Test tools to flag errors

  • Use relative paths or Universal Naming Convention paths rather than absolute paths

  • Check the Workflow Dependencies window, which not only lists dependencies, but also allows you to test them and switch them between absolute, relative, and UNC

  • Use a secrets file or environment variables
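
On that last point, credentials should not be hard-coded in the workflow itself; for the scripting tools that often sit inside these workflows, a minimal sketch of the environment-variable approach (mine, not the book's, with hypothetical variable names) looks like this:

```python
# Minimal sketch (not from the book): pull credentials from environment
# variables so they never appear in the workflow or in version control.
import os

def get_required_env(name: str) -> str:
    """Fail loudly if a required secret is missing rather than
    silently connecting with blank credentials."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

db_user = get_required_env("WAREHOUSE_USER")          # hypothetical variable names
db_password = get_required_env("WAREHOUSE_PASSWORD")
```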

Part 3: Governance of DataOps

The book suggests using the community-developed CReW macros to complement the Message and Test tools for testing, though for some reason it capitalizes the name in two different ways, neither of which is correct.

You can use GitHub Actions to run test scripts that do things like check for missing metadata, as well as to push updates to Alteryx Server using the Alteryx Server API. No suggestions are offered on how to use them to check that your workflows actually work.
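
The metadata check itself is easy to sketch because Alteryx workflows (.yxmd files) are stored as XML. The script below is my own illustration of the kind of check a GitHub Actions job could run, not something from the book, and the element names are assumptions that may need adjusting to the real file structure:

```python
# Hypothetical CI check (not from the book): fail the build if any tool in a
# workflow lacks an annotation. Alteryx workflows (.yxmd) are XML; the element
# names below are assumptions for illustration and may need adjusting.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

def tools_missing_annotations(workflow: Path) -> list[str]:
    """Return the IDs of tools in the workflow that have no annotation text."""
    root = ET.parse(workflow).getroot()
    missing = []
    for node in root.iter("Node"):
        text = node.findtext(".//AnnotationText", default="").strip()
        if not text:
            missing.append(node.get("ToolID", "unknown"))
    return missing

if __name__ == "__main__":
    failures = {wf: tools_missing_annotations(wf) for wf in Path(".").rglob("*.yxmd")}
    failures = {wf: ids for wf, ids in failures.items() if ids}
    for wf, ids in failures.items():
        print(f"{wf}: tools without annotations: {ids}")
    sys.exit(1 if failures else 0)
```

A workflow job would run this on every push and only proceed to the Alteryx Server upload step if it exits cleanly; I haven't reproduced the Server API call here.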

The book concludes with chapters on security/permissions and data cataloguing/discovery with Alteryx Connect. Like the rest of the book, there are a lot of screenshots and material that you would expect to find in the documentation, but not much in the way of general principles.
