Data engineering with Alteryx

The obvious question that this book has to answer is "Why would you use Alteryx for data engineering?" It gives three main reasons:

  • Speed of development

  • Iterative workflow development

  • Self-documentation

Code-based approaches are also self-documenting, and I see no reason why they can't be developed iteratively, but in my experience the no-code approach, with its visual canvas and the ability to inspect the data at any point in the workflow, does lead to significantly faster development than code-based solutions. This may change over time as LLMs improve (see Complete Cursor and Generative AI for software development).

To those points, I would also add the ease of picking it up and the convenience of having almost everything that you need available in a single program (particularly beneficial in organizations where it is difficult to get IT approval for new programs or packages). For many companies, its advantages will outweigh its disadvantages (such as uninformative diffs in version control).

This isn't a bad book overall, but its general focus is on step-by-step instructions for how to do things in Alteryx (accompanied by a lot of screenshots), with very little time spent on how you can use Alteryx in accordance with general data engineering/DataOps principles and best practices. Given that I can just check the documentation when I want to know how to do something, I would have preferred a 20-page book that stuck to what you should do and why rather than one that spreads 20 pages' worth of information over more than 300 pages.

Part 1: Introduction

Tools:

  • Alteryx Designer for data transformation

  • Alteryx Server for automation and scheduling

  • Alteryx Connect for cataloguing and discovery

Suggested best practices for workflows:

  • Supplement the automatic annotations

  • Use tool containers to group logic

  • Use comment and explorer boxes

This chapter defines data engineering and gives some examples of using Designer, Server, and Connect. Nothing particularly stood out as worthy of mentioning here.

DataOps is basically DevOps applied to analytics workflows. According to the book, it promises:

  • Faster cycle times

  • Faster access to actionable insights

  • Improved robustness of data processes (e.g., through statistical process control; see the sketch after this list)

  • The ability to see the entire data flow in a workflow

  • Strong security and confidence
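
On the statistical process control point, here is a minimal sketch of the idea; it is my own illustration rather than anything from the book: compare a pipeline metric such as the daily row count against control limits derived from its history, and stop the workflow when it falls outside them.

```python
# Hypothetical example (not from the book): a simple statistical process
# control check on daily row counts, flagging values outside 3-sigma
# control limits so the pipeline fails fast instead of loading bad data.
from statistics import mean, stdev

def row_count_in_control(history: list[int], today: int, sigmas: float = 3.0) -> bool:
    """Return True if today's row count is within the control limits
    implied by the historical counts."""
    centre = mean(history)
    spread = stdev(history)
    lower, upper = centre - sigmas * spread, centre + sigmas * spread
    return lower <= today <= upper

history = [10_120, 9_980, 10_340, 10_050, 10_210]
if not row_count_in_control(history, today=4_500):
    raise ValueError("Row count outside control limits; halting the workflow")
```

In Alteryx itself, the equivalent would be something like a Test tool comparing the incoming record count against stored control limits.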

The book also discusses the principles from https://dataopsmanifesto.org/en/. There are quite a few of them, so see either the link or the list below.

Pillars of DataOps
  • Continually Satisfy Your Customer

    • Highest priority is customer satisfaction through early and continuous delivery of valuable analytic insights

    • Delivery timelines range from minutes to weeks

  • Value Working Analytics

    • Primary performance measure is the delivery of insightful analytics

    • This includes accurate data, robust frameworks, and systems

  • Embrace Change

    • Welcome and embrace evolving customer needs for competitive advantage

    • Face-to-face communication is the most efficient and effective method

  • It's a Team Sport

    • Diverse teams with varied roles, skills, tools, and backgrounds enhance innovation and productivity

  • Daily Interactions

    • Customers, analytic teams, and operations must collaborate daily

  • Self-Organize

    • Best results emerge from self-organizing teams

  • Reduce Heroism

    • Strive for sustainable and scalable teams and processes to avoid over-reliance on individuals

  • Reflect

    • Regularly review feedback from customers, team members, and operational statistics to improve performance

  • Analytics is Code

    • Tools generate code and configuration that describe data manipulation for insights.

  • Orchestrate

    • End-to-end orchestration of data, tools, code, environments, and teams is crucial for success

  • Make it Reproducible

    • Version everything (data, hardware, software, code, configuration) for reproducibility

  • Disposable Environments

    • Provide easy-to-create, isolated, and disposable environments for experimentation

  • Simplicity

    • Focus on technical excellence, good design, and simplicity (maximizing work not done) for enhanced agility

  • Analytics is Manufacturing

    • Treat analytic pipelines like lean manufacturing lines, focusing on process thinking for continuous efficiency

  • Quality is Paramount

    • Implement automated abnormality and security detection (jidoka) and continuous feedback mechanisms (poka yoke)

  • Monitor Quality and Performance

    • Continuously monitor performance, security, and quality for unexpected variations and operational statistics

  • Reuse

    • Avoid repeating previous work for increased efficiency

  • Improve Cycle Times

    • Minimize time and effort from customer need to production-ready analytic process and refactoring/reuse

  • Start With Your Data Journey

    • Data trust is crucial for adoption

    • Understand and observe the data journey in your production environment for error reduction.

Part 2: Functional steps in DataOps

Much of this section covers basic functionality, which will be useful for new Alteryx users but will already be familiar to people who have been using the software for a while. It does suggest some best practices, though:

  • Use the Field Info tool to check for changes in file structure

  • Consider Dynamic Select and Dynamic Rename over Select

  • Consider whether blanks should actually be nulls

  • Use comments, containers, and annotations to make the workflow easier to follow

  • Remember to profile the data and use exploratory data analysis

  • Consider whether to impute missing values

  • Field types can be extracted to a .yxft file (useful for replacing Auto Field with Select)

  • Use the Message and Test tools to flag errors

  • Use relative paths or Universal Naming Convention paths rather than absolute paths

  • Check the Workflow Dependencies window, which not only lists dependencies, but also allows you to test them and switch them between absolute, relative, and UNC

  • Use a secrets file or environment variables
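
On that last point, credentials should not be hard-coded in the workflow itself; for the scripting tools that often sit inside these workflows, a minimal sketch of the environment-variable approach (mine, not the book's, with hypothetical variable names) looks like this:

```python
# Minimal sketch (not from the book): pull credentials from environment
# variables so they never appear in the workflow or in version control.
import os

def get_required_env(name: str) -> str:
    """Fail loudly if a required secret is missing rather than
    silently connecting with blank credentials."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

db_user = get_required_env("WAREHOUSE_USER")          # hypothetical variable names
db_password = get_required_env("WAREHOUSE_PASSWORD")
```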

Part 3: Governance of DataOps

The book suggests using the community-developed CReW macros to complement the Message and Test tools for testing, though for some reason it capitalizes the name in two different ways, neither of which is correct.

You can use GitHub Actions to run test scripts that do things like check for missing metadata, as well as to push updates to Alteryx Server using the Alteryx Server API. No suggestions are offered on how to use them to check that your workflows actually work.
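
The metadata check itself is easy to sketch because Alteryx workflows (.yxmd files) are stored as XML. The script below is my own illustration of the kind of check a GitHub Actions job could run, not something from the book, and the element names are assumptions that may need adjusting to the real file structure:

```python
# Hypothetical CI check (not from the book): fail the build if any tool in a
# workflow lacks an annotation. Alteryx workflows (.yxmd) are XML; the element
# names below are assumptions for illustration and may need adjusting.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

def tools_missing_annotations(workflow: Path) -> list[str]:
    """Return the IDs of tools in the workflow that have no annotation text."""
    root = ET.parse(workflow).getroot()
    missing = []
    for node in root.iter("Node"):
        text = node.findtext(".//AnnotationText", default="").strip()
        if not text:
            missing.append(node.get("ToolID", "unknown"))
    return missing

if __name__ == "__main__":
    failures = {wf: tools_missing_annotations(wf) for wf in Path(".").rglob("*.yxmd")}
    failures = {wf: ids for wf, ids in failures.items() if ids}
    for wf, ids in failures.items():
        print(f"{wf}: tools without annotations: {ids}")
    sys.exit(1 if failures else 0)
```

A workflow job would run this on every push and only proceed to the Alteryx Server upload step if it exits cleanly; I haven't reproduced the Server API call here.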

The book concludes with chapters on security/permissions and data cataloguing/discovery with Alteryx Connect. Like the rest of the book, there are a lot of screenshots and material that you would expect to find in the documentation, but not much in the way of general principles.
