Data engineering with Alteryx
The obvious question that this book has to answer is "Why would you use Alteryx for data engineering?" It gives three main reasons:
Speed of development
Iterative workflow development
Self-documentation
Code-based approaches are also self-documenting, and I see no reason why they can't be developed iteratively either, but my experience is that the no-code approach with a visual canvas and the ability to easily inspect the data at any point in the workflow does lead to significantly faster development than code-based solutions. This may change over time as LLMs improve (see Complete Cursor and Generative AI for software development).
To those points, I would also add the ease of picking it up and the convenience of having almost everything that you need available in a single program (particularly beneficial in organizations where it is difficult to get IT approval for new programs or packages). For many companies, its advantages will outweigh its disadvantages (such as uninformative diffs in version control).
This isn't a bad book overall, but its general focus is on step-by-step instructions for how to do things in Alteryx (accompanied by a lot of screenshots), with very little time spent on how you can use Alteryx in accordance with general data engineering/DataOps principles and best practices. Given that I can just check the documentation when I want to know how to do something, I would have preferred a 20-page book that stuck to what you should do and why rather than one that spreads 20 pages' worth of information over more than 300 pages.
Part 1: Introduction
Tools:
Alteryx Designer for data transformation
Alteryx Server for automation and scheduling
Alteryx Connect for cataloguing and discovery
Suggested best practices for workflows:
Supplement the automatic annotations
Use tool containers to group logic
Use comment and explorer boxes
This chapter defines data engineering and gives some examples of using Designer, Server, and Connect. Nothing particularly stood out as worthy of mentioning here.
DataOps is basically DevOps applied to analytics workflows. According to the book, it promises:
Faster cycle times
Faster access to actionable insights
Improved robustness of data processes (e.g., through statistical process control; see the sketch after this list)
The ability to see the entire data flow in a workflow
Strong security and confidence
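Statistical process control in this context means treating pipeline metrics (row counts, null rates, run times) as a monitored process and flagging runs that fall outside expected variation. A minimal sketch of that idea in Python, assuming you log a row count per run; the function names and the three-sigma threshold are illustrative choices, not something prescribed by the book:

```python
import statistics

def control_limits(history, sigmas=3):
    """Compute mean +/- N-sigma control limits from past run metrics."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return mean - sigmas * stdev, mean + sigmas * stdev

def check_run(row_count, history):
    """Flag a run whose row count falls outside the control limits."""
    lower, upper = control_limits(history)
    if not (lower <= row_count <= upper):
        raise ValueError(
            f"Row count {row_count} outside control limits "
            f"[{lower:.0f}, {upper:.0f}] - investigate before publishing."
        )

# Example: row counts from previous runs, then today's run
previous_runs = [10_120, 9_980, 10_340, 10_050, 10_210]
check_run(10_185, previous_runs)   # passes
# check_run(4_200, previous_runs)  # would raise: out-of-control run
```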
The book also discusses the principles from https://dataopsmanifesto.org/en/. There are quite a few of them, so see either the link or the expandable box below.
Pillars of DataOps
Continually Satisfy Your Customer
Highest priority is customer satisfaction through early and continuous delivery of valuable analytic insights
Delivery timelines range from minutes to weeks
Value Working Analytics
Primary performance measure is the delivery of insightful analytics
This includes accurate data, robust frameworks, and systems
Embrace Change
Welcome and embrace evolving customer needs for competitive advantage
Face-to-face communication is the most efficient and effective method
It's a Team Sport
Diverse teams with varied roles, skills, tools, and backgrounds enhance innovation and productivity
Daily Interactions
Customers, analytic teams, and operations must collaborate daily
Self-Organize
Best results emerge from self-organizing teams
Reduce Heroism
Strive for sustainable and scalable teams and processes to avoid over-reliance on individuals
Reflect
Regularly review feedback from customers, team members, and operational statistics to improve performance
Analytics is Code
Tools generate code and configuration that describe data manipulation for insights
Orchestrate
End-to-end orchestration of data, tools, code, environments, and teams is crucial for success
Make it Reproducible
Version everything (data, hardware, software, code, configuration) for reproducibility
Disposable Environments
Provide easy-to-create, isolated, and disposable environments for experimentation
Simplicity
Focus on technical excellence, good design, and simplicity (maximizing the amount of work not done) for enhanced agility
Analytics is Manufacturing
Treat analytic pipelines like lean manufacturing lines, focusing on process thinking for continuous efficiency
Quality is Paramount
Implement automated abnormality and security detection (jidoka) and continuous feedback mechanisms (poka yoke)
Monitor Quality and Performance
Continuously monitor performance, security, and quality for unexpected variations and operational statistics
Reuse
Avoid repeating previous work for increased efficiency
Improve Cycle Times
Minimize time and effort from customer need to production-ready analytic process and refactoring/reuse
Start With Your Data Journey
Data trust is crucial for adoption
Understand and observe the data journey in your production environment for error reduction
Part 2: Functional steps in DataOps
Much of this section covers basic functionality, which will be useful for new Alteryx users but will already be familiar to people who have been using the software for a while. It does suggest some best practices, though:
Use the Field Info tool to check for changes in file structure
Consider Dynamic Select and Dynamic Rename over Select
Consider whether blanks should actually be nulls
Use comments, containers, and annotations to make the workflow easier to follow
Remember to profile the data and use exploratory data analysis
Consider whether to impute missing values
Field types can be extracted to a .yxft file (useful for replacing Auto Field with Select)
Use the Message and Test tools to flag errors
Use relative paths or Universal Naming Convention (UNC) paths rather than absolute paths
Check the Workflow Dependencies window, which not only lists dependencies, but also allows you to test them and switch them between absolute, relative, and UNC
Use a secrets file or environment variables rather than hardcoding credentials (see the sketch after this list)
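On that last point, the same idea applies inside an Alteryx Python tool or any companion script: read credentials from the environment (or a secrets file kept out of version control) instead of embedding them in the workflow. A minimal sketch, where the DB_USER/DB_PASSWORD variable names and the my_database DSN are hypothetical, not taken from the book:

```python
import os

def get_secret(name):
    """Read a credential from the environment; fail loudly if it's missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name!r}; set it on the worker "
            "or store it in a secrets file kept out of version control."
        )
    return value

if __name__ == "__main__":
    # Hypothetical usage: build a connection string without embedding the password
    connection = (
        f"DSN=my_database;UID={get_secret('DB_USER')};PWD={get_secret('DB_PASSWORD')}"
    )
```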
Part 3: Governance of DataOps
The book suggests using the community-developed CReW macros to complement the Message and Test tools for testing, though for some reason it capitalizes the name in two different ways, neither of which is correct.
You can use GitHub Actions to run test scripts to do things like check for missing metadata, as well as to push the updates to Alteryx Server using the Alteryx Server API. No suggestions are offered on how to use them to check that your workflows actually work.
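Since Alteryx workflows (.yxmd) are XML files, one way to implement such a check is a small script that a GitHub Actions step runs against the repository and that fails the build when required metadata is missing. A rough sketch in Python; the workflows/ folder layout and the MetaInfo/Description element path are assumptions to verify against your own workflow files:

```python
import glob
import sys
import xml.etree.ElementTree as ET

def missing_description(path):
    """Return True if the workflow has no non-empty description in its metadata."""
    root = ET.parse(path).getroot()
    # Assumed location of workflow-level metadata; check against your .yxmd files
    node = root.find(".//MetaInfo/Description")
    return node is None or not (node.text or "").strip()

def main():
    failures = [p for p in glob.glob("workflows/**/*.yxmd", recursive=True)
                if missing_description(p)]
    for path in failures:
        print(f"Missing description: {path}")
    sys.exit(1 if failures else 0)

if __name__ == "__main__":
    main()
```

A GitHub Actions step would simply run this script (e.g., `python check_metadata.py`) on every push, with the non-zero exit code failing the check.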
The book concludes with chapters on security/permissions and data cataloguing/discovery with Alteryx Connect. Like the rest of the book, these chapters contain a lot of screenshots and material that you would expect to find in the documentation, but not much in the way of general principles.