Tensor Labs
Stop Writing Complex Code: Build Smarter Workflows with AWS Step Functions


Building AI and data pipelines doesn’t have to mean tangled code and fragile workflows. In this article, we explore how AWS Step Functions can simplify orchestration, improve reliability, and help you build scalable, serverless workflows, without the headache of managing complex logic yourself.

Why AWS Step Functions Might Be the Smartest Tool You’re Ignoring

Hi there, fellow AI enthusiast! It’s been a while since I last shared an article, and yes, I know — I’ve been slacking a lot! But I hope you’re doing amazing and thriving in your journey. If you’ve stumbled upon this piece, chances are you (like me, not too long ago) are building incredible backends to support your AI or data pipelines and searching for ways to make your workflows smoother, more efficient, and less of a headache — or maybe you’re just here to uncover one of AWS’s most underrated gems. Either way, you’re in the right place.

Something I’ve learned over time while creating backend applications for AI and data science products is that it’s not just about writing the code — it’s about stitching everything together in a way that makes sense, scales gracefully, and doesn’t turn into a spaghetti mess when requirements inevitably change. That’s where orchestration comes in, and today, we’ll be diving into AWS Step Functions, a service that simplifies all of the above like a pro.

While Lambda, S3, and DynamoDB often steal the spotlight, Step Functions works quietly behind the scenes, waiting to transform how you design and manage workflows. So, without further ado, let’s explore why learning Step Functions might just be the smartest move you’ll make for your projects.

“Step Functions is a workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.”

Why Serverless?

Let’s take a quick detour and talk about why we’re even considering serverless in the first place. Imagine this: you’ve built this amazing AI model that can do wonders with image recognition or stock price prediction. Now you need to deploy it, but wait — do you really want to manage servers, worry about scaling, and wake up at 3 AM because something crashed? Yeah, me neither!

This is where serverless comes in, and AWS Lambda is the perfect example. You just write your backend application, deploy it, and poof — AWS handles all the heavy lifting. No servers to manage, no scaling headaches, and you only pay for what you actually use. It’s like having a personal infrastructure butler who knows exactly what you need before you even ask!
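To ground that a bit: the Lambda programming model itself is tiny — you write a handler that receives an event and returns a response, and AWS takes care of the rest. Here's a minimal sketch (the function and field names are just illustrative):

```python
import json

def handler(event, context):
    """Minimal AWS Lambda-style handler: receives a JSON event, returns a response."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Locally you can call it directly; on AWS, the Lambda runtime does this for you.
result = handler({"name": "Tensor Labs"}, None)
```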

But here’s the catch (isn’t there always one?): while Lambda functions are amazing for individual tasks, things can get messy when you’re trying to coordinate multiple functions in a complex workflow. Imagine trying to orchestrate a data pipeline where you need to:

1. Download and preprocess data
2. Run model inference
3. Post-process results
4. Update your database
5. Send notifications

Each step could be a separate Lambda function, but managing their execution order, handling errors, and maintaining state between them… well, that’s where things get interesting!
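To make that concrete, here's a hedged sketch of what hand-rolled orchestration tends to look like: a driver script that invokes each step in order, threads state through by hand, and retries failures itself. (The step functions below are pure-Python stand-ins for Lambda invocations, not real AWS calls.)

```python
import time

def preprocess(state):       # stand-in for a "preprocess" Lambda
    return {**state, "processed": True}

def run_inference(state):    # stand-in for an "inference" Lambda
    return {**state, "prediction": 0.87}

def notify(state):           # stand-in for a "notify" Lambda
    return {**state, "notified": True}

def run_pipeline(initial_state, steps, max_attempts=3):
    """Hand-rolled orchestration: execution order, state passing, and
    retries are all our problem -- exactly the glue code that Step
    Functions lets you declare instead of write."""
    state = initial_state
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                state = step(state)
                break
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(2 ** attempt)  # crude exponential backoff

    return state

final = run_pipeline({"input": "s3://bucket/raw.csv"},
                     [preprocess, run_inference, notify])
```

Every new step or error case makes this driver grow, and the state-passing logic is invisible until you read the code end to end — which is the mess Step Functions is designed to replace.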

Why Step Functions?

Enter AWS Step Functions — the superhero we didn’t know we needed! Think of Step Functions as the conductor of your serverless orchestra, making sure every component plays its part at exactly the right moment. But what makes it so special? Let me break down why I fell in love with Step Functions (and why you might too):

First off, you get a visual flowchart that actually works. You can literally draw out your workflow using Amazon States Language (ASL), and Step Functions will execute it exactly as you’ve designed it. No more wondering “wait, what happens after this function finishes?”

But that’s just the beginning. Here’s why Step Functions is a game-changer:

- Error Handling Made Easy: Remember those times when one failed Lambda would bring down your entire pipeline? Step Functions lets you define retry policies and error handling paths right in your workflow. It’s like having a safety net that actually catches things!
- Built-in State Management: No more passing state between functions through databases or shared storage. Step Functions maintains the state of your execution for you, passing data seamlessly between steps. It’s like having a really efficient personal assistant who never forgets anything!
- Visual Debugging: The visual console shows you exactly where your workflow is, what data was passed between states, and where things might have gone wrong. Trust me, your future self will thank you during those debugging sessions!
- Long-Running Workflows: Unlike Lambda’s 15-minute limitation, Step Functions can run workflows for up to a year! Perfect for those AI training pipelines that seem to run forever.
- Native Integration: It plays nicely with pretty much every AWS service you can think of. Whether you’re triggering Lambda functions, starting ECS tasks, or interacting with SageMaker, Step Functions has got your back.
- Parallel Processing: Need to run multiple tasks at once? Step Functions can handle parallel executions like a pro, making it perfect for data processing pipelines where you want to maximize throughput.

Let’s Consider a Use Case

To get hands-on with Step Functions, let’s consider a common scenario in ML deployment: building an API that handles model retraining requests. Our workflow needs to:

1. Accept new training data
2. Preprocess this data
3. Retrain the model
4. Evaluate the results
5. Save the model weights to S3 if the performance improves
6. Update the model endpoints

Building the FastAPI Backend

First, let’s set up our FastAPI endpoints that will handle each step of the process:

from fastapi import FastAPI, HTTPException
from mangum import Mangum
from pydantic import BaseModel
import pandas as pd
import boto3

app = FastAPI()
handler = Mangum(app)  # This makes our app Lambda-compatible

class TrainingData(BaseModel):
    data_path: str
    model_params: dict

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.post("/preprocess")
async def preprocess_data(request: TrainingData):
    try:
        # Load and preprocess data
        # (preprocess_pipeline is a placeholder for your own feature engineering)
        data = pd.read_csv(request.data_path)
        processed_data = preprocess_pipeline(data)
        
        return {
            "status": "success",
            "processed_data_path": "s3://bucket/processed_data.csv",
            "model_params": request.model_params
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/train")
async def train(data: dict):
    try:
        # Load processed data and train
        # (train_model is a placeholder for your own training routine -- note
        # the handler is named `train` so it doesn't shadow this helper)
        processed_data = pd.read_csv(data["processed_data_path"])
        model = train_model(processed_data, data["model_params"])
        
        return {
            "status": "success",
            "model_path": "s3://bucket/temp_model.pkl",
            "metrics": {
                "accuracy": 0.92,
                "f1_score": 0.91
            }
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/evaluate")
async def evaluate_model(data: dict):
    try:
        # Compare with the existing model
        # (get_current_model_metrics is a placeholder for loading the
        # production model's stored metrics)
        new_metrics = data["metrics"]
        current_metrics = get_current_model_metrics()
        
        should_update = new_metrics["accuracy"] > current_metrics["accuracy"]
        
        return {
            "status": "success",
            "should_update": should_update,
            "model_path": data["model_path"]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/update-model")
async def update_model(data: dict):
    if data["should_update"]:
        s3 = boto3.client('s3')
        # Upload to S3 and update endpoints
        s3.upload_file(
            data["model_path"],
            "production-models",
            "latest_model.pkl"
        )
        
    return {"status": "success"}

And the Dockerfile for our backend would look something like this:

FROM public.ecr.aws/lambda/python:3.9
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt
COPY app/ ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]

Deploying to ECR and Lambda

Let’s get our container up to ECR and connected to Lambda. Here are the steps:

1. Build and Push to ECR

# Login to ECR
aws ecr get-login-password --region your-region | docker login --username AWS --password-stdin your-account.dkr.ecr.your-region.amazonaws.com

# Create ECR repository
aws ecr create-repository --repository-name ml-retraining-api

# Build the image
docker build -t ml-retraining-api .

# Tag the image
docker tag ml-retraining-api:latest your-account.dkr.ecr.your-region.amazonaws.com/ml-retraining-api:latest

# Push to ECR
docker push your-account.dkr.ecr.your-region.amazonaws.com/ml-retraining-api:latest

2. Create Lambda Function

- Go to the AWS Lambda console
- Click “Create function”
- Choose “Container image”
- Select your pushed ECR image
- Configure memory (recommendation: at least 1024 MB) and timeout (e.g., 5 minutes)
- Configure environment variables if needed

3. Configure API Gateway

- Create a new REST API in API Gateway
- Create resources matching your FastAPI routes
- Set up Lambda proxy integration with your function

Until now, we’ve built a backend that’s fully dockerized and hosted in a serverless manner on Lambda. You might be wondering, with all these different endpoints scattered across the app, how are they going to be connected? Well, that’s where the real magic starts.

Orchestrating with Step Functions

Now comes the fun part — connecting everything together! Let’s walk through creating our Step Functions workflow using the AWS Console:

1. Navigate to the AWS Console and search for “Step Functions”
2. Click on “State machines” in the left sidebar
3. Hit the “Create state machine” button
4. Choose “Write your workflow in code” (we’ll be using Amazon States Language)
5. You’ll see a visual editor on the right and a code editor on the left

Here’s our workflow definition with detailed comments explaining each part (note: ASL is plain JSON, so strip the // comments before actually using this definition):

{
  // General description of what this state machine does
  "Comment": "ML Model Retraining Pipeline",
  
  // Specify which state to begin with
  "StartAt": "PreprocessData",
  
  // Define all possible states in our workflow
  "States": {
    // First state: Preprocess our training data
    "PreprocessData": {
      "Type": "Task",  // This is a task that needs to be executed
      "Resource": "arn:aws:states:::lambda:invoke",  // We're invoking a Lambda
      "Parameters": {
        // Specify which Lambda function to call
        "FunctionName": "ml-retraining-api",
        // What data to send to the Lambda
        "Payload": {
          "path": "/preprocess",  // Which endpoint to hit
          "body.$": "$"  // Pass all input data to the function
        }
      },
      "Next": "TrainModel",  // Where to go after this state
      // Retry logic if something fails
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 3,
          "MaxAttempts": 2
        }
      ]
    },

    // Second state: Train our model
    "TrainModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "ml-retraining-api",
        "Payload": {
          "path": "/train",
          // Use the output from previous state
          "body.$": "$.Payload"
        }
      },
      "Next": "EvaluateModel"
    },

    // Third state: Evaluate model performance
    "EvaluateModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "ml-retraining-api",
        "Payload": {
          "path": "/evaluate",
          "body.$": "$.Payload"
        }
      },
      "Next": "ShouldUpdateModel"
    },

    // Decision state: Should we update the model?
    "ShouldUpdateModel": {
      "Type": "Choice",  // This state makes a decision
      "Choices": [
        {
          // Check if should_update is true
          "Variable": "$.Payload.should_update",
          "BooleanEquals": true,
          "Next": "UpdateModel"  // If true, update the model
        }
      ],
      "Default": "Success"  // If false, skip to success
    },

    // Final task state: Update the model if needed
    "UpdateModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "ml-retraining-api",
        "Payload": {
          "path": "/update-model",
          "body.$": "$.Payload"
        }
      },
      "Next": "Success"
    },

    // End state: Mark as successful
    "Success": {
      "Type": "Succeed"  // This is a terminal state
    }
  }
}

The path in each of your states would actually be the API Gateway path or the function URL of your Lambda, followed by the endpoint — for example: "https://xyz123abc.lambda-url.us-east-1.on.aws/preprocess"
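Since ASL is plain JSON and JSON doesn’t allow comments, the // annotations in the definition above must be removed before the console will accept it. Here’s a quick hedged helper that strips them and sanity-checks the result (it naively assumes “//” never appears inside a JSON string — URLs in string values would break it):

```python
import json
import re

def strip_line_comments(asl_text: str) -> str:
    """Remove //-style comments so an annotated ASL snippet becomes valid JSON.
    Naive by design: assumes '//' never occurs inside a JSON string value."""
    return re.sub(r"//[^\n]*", "", asl_text)

# A tiny annotated snippet in the same style as the workflow above.
annotated = """
{
  // which state to begin with
  "StartAt": "PreprocessData",
  "States": {
    "PreprocessData": {"Type": "Succeed"}  // placeholder state
  }
}
"""

# json.loads doubles as a syntax check before pasting into the console.
definition = json.loads(strip_line_comments(annotated))
```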

Important Items to Check

Once you’ve made your state machine, we still have to verify a few things:

1. Click “Refresh” in the visual editor to make sure your workflow looks correct
2. Use the “Check for errors” button to validate your syntax
3. Make sure all your Lambda function ARNs are correct

Pro tip: You can test your workflow right from the console! After creating the state machine:

1. Click on “Start execution”
2. Provide a sample JSON input
3. Watch your workflow execute in real-time with the visual tracker

The visual interface will show you exactly which state is currently executing, what data is being passed between states, and if any errors occur. Think of it as a live control panel for your entire workflow, where you can watch your data flow through each step in real-time!

And with that, we’re done. Now you can trigger your state machine pipeline via an endpoint (if needed), an automated cron job, or an Amazon EventBridge (formerly CloudWatch Events) rule that runs it at regular intervals.

Before we wrap up, there are two important sections I want you to go through before reaching for Step Functions.

When Should You Use Step Functions?

Step Functions shines when you’re dealing with:

- Long-running, multi-step processes that need coordination (like ML training pipelines)
- Workflows requiring error handling and retries across different services
- Complex data processing pipelines where state management is crucial
- Operations that need to be audited or tracked in detail
- Workflows that combine both synchronous and asynchronous tasks
- Tasks that need to pause and resume based on external events
- Processes where different teams handle different steps and need clear boundaries

When is Step Functions Overkill?

Skip Step Functions when you have:

- Simple, single-step processes that could be handled by a single Lambda
- Quick synchronous operations that don’t need state management
- Workflows where all steps must execute within 15 minutes (Lambda’s limit might be enough)
- Small-scale applications where manual coordination is manageable
- Processes where the overhead of setting up and maintaining Step Functions outweighs its benefits
- Real-time processing needs where the state transitions add unnecessary latency
- Simple CRUD operations that could be handled by straightforward API calls
- Workflows where cost optimization is crucial and simpler alternatives would suffice

Conclusion

So, in a nutshell, Step Functions is like having an autopilot for your AWS services where you set the course once, and it handles all the complex navigation. Building ML pipelines used to mean juggling multiple services and praying nothing breaks mid-process. But Step Functions changes the game, orchestrating everything from data preprocessing to model deployment with precision and reliability.

Remember: Step Functions is a powerful tool, but like any AWS service, it comes with its own pricing model. Sometimes, a simple Lambda function or a well-structured API might be all you need!

Here’s to building ML pipelines that don’t keep us up at night! 🚀 Until next time 👋