Tensor Labs
Stop Writing Complex Code: Build Smarter Workflows with AWS Step Functions


Building AI and data pipelines doesn’t have to mean tangled code and fragile workflows. In this article, we explore how AWS Step Functions can simplify orchestration, improve reliability, and help you build scalable, serverless workflows, without the headache of managing complex logic yourself.

Why AWS Step Functions Might Be the Smartest Tool You’re Ignoring

Hi there, fellow AI enthusiast! It’s been a while since I last shared an article, and yes, I know — I’ve been slacking a lot! But I hope you’re doing amazing and thriving in your journey. If you’ve stumbled upon this piece, chances are you (like me, not too long ago) are building incredible backends to support your AI or data pipelines and searching for ways to make your workflows smoother, more efficient, and less of a headache — or maybe you’re just here to uncover one of AWS’s most underrated gems. Either way, you’re in the right place.

Something I’ve learned over time while creating backend applications for AI and data science products is that it’s not just about writing the code — it’s about stitching everything together in a way that makes sense, scales gracefully, and doesn’t turn into a spaghetti mess when requirements inevitably change. That’s where orchestration comes in, and today, we’ll be diving into AWS Step Functions, a service that simplifies all of the above like a pro.

While Lambda, S3, and DynamoDB often steal the spotlight, Step Functions works quietly behind the scenes, waiting to transform how you design and manage workflows. So, without further ado, let’s explore why learning Step Functions might just be the smartest move you’ll make for your projects.

“Step Functions is a workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.”

Why Serverless?

Let’s take a quick detour and talk about why we’re even considering serverless in the first place. Imagine this: you’ve built this amazing AI model that can do wonders with image recognition or stock price prediction. Now you need to deploy it, but wait — do you really want to manage servers, worry about scaling, and wake up at 3 AM because something crashed? Yeah, me neither!

This is where serverless comes in, and AWS Lambda is the perfect example. You just write your backend application, deploy it, and poof — AWS handles all the heavy lifting. No servers to manage, no scaling headaches, and you only pay for what you actually use. It’s like having a personal infrastructure butler who knows exactly what you need before you even ask!
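To ground that a bit: the Lambda programming model itself is tiny — you write a handler that receives an event and returns a response, and AWS takes care of the rest. Here's a minimal sketch (the function and field names are just illustrative):

```python
import json

def handler(event, context):
    """Minimal AWS Lambda-style handler: receives a JSON event, returns a response."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Locally you can call it directly; on AWS, the Lambda runtime does this for you.
result = handler({"name": "Tensor Labs"}, None)
```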

But here’s the catch (isn’t there always one?): while Lambda functions are amazing for individual tasks, things can get messy when you’re trying to coordinate multiple functions in a complex workflow. Imagine trying to orchestrate a data pipeline where you need to:

1. Download and preprocess data
2. Run model inference
3. Post-process results
4. Update your database
5. Send notifications

Each step could be a separate Lambda function, but managing their execution order, handling errors, and maintaining state between them… well, that’s where things get interesting!
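To make that concrete, here's a hedged sketch of what hand-rolled orchestration tends to look like: a driver script that invokes each step in order, threads state through by hand, and retries failures itself. (The step functions below are pure-Python stand-ins for Lambda invocations, not real AWS calls.)

```python
import time

def preprocess(state):       # stand-in for a "preprocess" Lambda
    return {**state, "processed": True}

def run_inference(state):    # stand-in for an "inference" Lambda
    return {**state, "prediction": 0.87}

def notify(state):           # stand-in for a "notify" Lambda
    return {**state, "notified": True}

def run_pipeline(initial_state, steps, max_attempts=3):
    """Hand-rolled orchestration: execution order, state passing, and
    retries are all our problem -- exactly the glue code that Step
    Functions lets you declare instead of write."""
    state = initial_state
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                state = step(state)
                break
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(2 ** attempt)  # crude exponential backoff

    return state

final = run_pipeline({"input": "s3://bucket/raw.csv"},
                     [preprocess, run_inference, notify])
```

Every new step or error case makes this driver grow, and the state-passing logic is invisible until you read the code end to end — which is the mess Step Functions is designed to replace.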

Why Step Functions?

Enter AWS Step Functions — the superhero we didn’t know we needed! Think of Step Functions as the conductor of your serverless orchestra, making sure every component plays its part at exactly the right moment. But what makes it so special? Let me break down why I fell in love with Step Functions (and why you might too):

First off, you get a visual flowchart that actually works. You can literally draw out your workflow using Amazon States Language (ASL), and Step Functions will execute it exactly as you’ve designed it. No more wondering “wait, what happens after this function finishes?”

But that’s just the beginning. Here’s why Step Functions is a game-changer:

- Error Handling Made Easy: Remember those times when one failed Lambda would bring down your entire pipeline? Step Functions lets you define retry policies and error handling paths right in your workflow. It’s like having a safety net that actually catches things!
- Built-in State Management: No more passing state between functions through databases or shared storage. Step Functions maintains the state of your execution for you, passing data seamlessly between steps. It’s like having a really efficient personal assistant who never forgets anything!
- Visual Debugging: The visual console shows you exactly where your workflow is, what data was passed between states, and where things might have gone wrong. Trust me, your future self will thank you during those debugging sessions!
- Long-Running Workflows: Unlike Lambda’s 15-minute limitation, Step Functions can run workflows for up to a year! Perfect for those AI training pipelines that seem to run forever.
- Native Integration: It plays nicely with pretty much every AWS service you can think of. Whether you’re triggering Lambda functions, starting ECS tasks, or interacting with SageMaker, Step Functions has got your back.
- Parallel Processing: Need to run multiple tasks at once? Step Functions can handle parallel executions like a pro, making it perfect for data processing pipelines where you want to maximize throughput.

Let’s Consider a Use Case

To get hands-on with Step Functions, let’s consider a common scenario in ML deployment: building an API that handles model retraining requests. Our workflow needs to:

1. Accept new training data
2. Preprocess this data
3. Retrain the model
4. Evaluate the results
5. Save the model weights to S3 if the performance improves
6. Update the model endpoints

Building the FastAPI Backend

First, let’s set up our FastAPI endpoints that will handle each step of the process:

from fastapi import FastAPI, HTTPException
from mangum import Mangum
from pydantic import BaseModel
import pandas as pd
import boto3

app = FastAPI()
handler = Mangum(app)  # This makes our app Lambda-compatible

class TrainingData(BaseModel):
    data_path: str
    model_params: dict

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.post("/preprocess")
async def preprocess_data(request: TrainingData):
    try:
        # Load and preprocess data
        # (preprocess_pipeline is a placeholder for your own feature engineering)
        data = pd.read_csv(request.data_path)
        processed_data = preprocess_pipeline(data)
        
        return {
            "status": "success",
            "processed_data_path": "s3://bucket/processed_data.csv",
            "model_params": request.model_params
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/train")
async def train(data: dict):
    try:
        # Load processed data and train
        # (train_model is a placeholder for your own training routine -- note
        # the handler is named `train` so it doesn't shadow this helper)
        processed_data = pd.read_csv(data["processed_data_path"])
        model = train_model(processed_data, data["model_params"])
        
        return {
            "status": "success",
            "model_path": "s3://bucket/temp_model.pkl",
            "metrics": {
                "accuracy": 0.92,
                "f1_score": 0.91
            }
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/evaluate")
async def evaluate_model(data: dict):
    try:
        # Compare with the existing model
        # (get_current_model_metrics is a placeholder for loading the
        # production model's stored metrics)
        new_metrics = data["metrics"]
        current_metrics = get_current_model_metrics()
        
        should_update = new_metrics["accuracy"] > current_metrics["accuracy"]
        
        return {
            "status": "success",
            "should_update": should_update,
            "model_path": data["model_path"]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/update-model")
async def update_model(data: dict):
    if data["should_update"]:
        s3 = boto3.client('s3')
        # Upload to S3 and update endpoints
        s3.upload_file(
            data["model_path"],
            "production-models",
            "latest_model.pkl"
        )
        
    return {"status": "success"}

And the Dockerfile for our backend would look something like this:

FROM public.ecr.aws/lambda/python:3.9
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN pip install -r requirements.txt
COPY app/ ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]

Deploying to ECR and Lambda

Let’s get our container up to ECR and connected to Lambda. Here are the steps:

1. Build and Push to ECR

# Login to ECR
aws ecr get-login-password --region your-region | docker login --username AWS --password-stdin your-account.dkr.ecr.your-region.amazonaws.com

# Create ECR repository
aws ecr create-repository --repository-name ml-retraining-api

# Build the image
docker build -t ml-retraining-api .

# Tag the image
docker tag ml-retraining-api:latest your-account.dkr.ecr.your-region.amazonaws.com/ml-retraining-api:latest

# Push to ECR
docker push your-account.dkr.ecr.your-region.amazonaws.com/ml-retraining-api:latest

2. Create Lambda Function

- Go to the AWS Lambda console
- Click “Create function”
- Choose “Container image”
- Select your pushed ECR image
- Configure memory (recommendation: at least 1024 MB) and timeout (e.g., 5 minutes)
- Configure environment variables if needed

3. Configure API Gateway

- Create a new REST API in API Gateway
- Create resources matching your FastAPI routes
- Set up Lambda proxy integration with your function

Until now, we’ve built a backend that’s fully dockerized and hosted in a serverless manner on Lambda. You might be wondering, with all these different endpoints scattered across the app, how are they going to be connected? Well, that’s where the real magic starts.

Orchestrating with Step Functions

Now comes the fun part — connecting everything together! Let’s walk through creating our Step Functions workflow using the AWS Console:

1. Navigate to the AWS Console and search for “Step Functions”
2. Click on “State machines” in the left sidebar
3. Hit the “Create state machine” button
4. Choose “Write your workflow in code” (we’ll be using Amazon States Language)
5. You’ll see a visual editor on the right and a code editor on the left

Here’s our workflow definition with detailed comments explaining each part (note: ASL is plain JSON, so strip the // comments before actually using this definition):

{
  // General description of what this state machine does
  "Comment": "ML Model Retraining Pipeline",
  
  // Specify which state to begin with
  "StartAt": "PreprocessData",
  
  // Define all possible states in our workflow
  "States": {
    // First state: Preprocess our training data
    "PreprocessData": {
      "Type": "Task",  // This is a task that needs to be executed
      "Resource": "arn:aws:states:::lambda:invoke",  // We're invoking a Lambda
      "Parameters": {
        // Specify which Lambda function to call
        "FunctionName": "ml-retraining-api",
        // What data to send to the Lambda
        "Payload": {
          "path": "/preprocess",  // Which endpoint to hit
          "body.$": "$"  // Pass all input data to the function
        }
      },
      "Next": "TrainModel",  // Where to go after this state
      // Retry logic if something fails
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 3,
          "MaxAttempts": 2
        }
      ]
    },

    // Second state: Train our model
    "TrainModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "ml-retraining-api",
        "Payload": {
          "path": "/train",
          // Use the output from previous state
          "body.$": "$.Payload"
        }
      },
      "Next": "EvaluateModel"
    },

    // Third state: Evaluate model performance
    "EvaluateModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "ml-retraining-api",
        "Payload": {
          "path": "/evaluate",
          "body.$": "$.Payload"
        }
      },
      "Next": "ShouldUpdateModel"
    },

    // Decision state: Should we update the model?
    "ShouldUpdateModel": {
      "Type": "Choice",  // This state makes a decision
      "Choices": [
        {
          // Check if should_update is true
          "Variable": "$.Payload.should_update",
          "BooleanEquals": true,
          "Next": "UpdateModel"  // If true, update the model
        }
      ],
      "Default": "Success"  // If false, skip to success
    },

    // Final task state: Update the model if needed
    "UpdateModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "ml-retraining-api",
        "Payload": {
          "path": "/update-model",
          "body.$": "$.Payload"
        }
      },
      "Next": "Success"
    },

    // End state: Mark as successful
    "Success": {
      "Type": "Succeed"  // This is a terminal state
    }
  }
}

The path in each of your states would actually be the API Gateway path or the function URL of your Lambda, followed by the endpoint — for example: "https://xyz123abc.lambda-url.us-east-1.on.aws/preprocess"
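Since ASL is plain JSON and JSON doesn’t allow comments, the // annotations in the definition above must be removed before the console will accept it. Here’s a quick hedged helper that strips them and sanity-checks the result (it naively assumes “//” never appears inside a JSON string — URLs in string values would break it):

```python
import json
import re

def strip_line_comments(asl_text: str) -> str:
    """Remove //-style comments so an annotated ASL snippet becomes valid JSON.
    Naive by design: assumes '//' never occurs inside a JSON string value."""
    return re.sub(r"//[^\n]*", "", asl_text)

# A tiny annotated snippet in the same style as the workflow above.
annotated = """
{
  // which state to begin with
  "StartAt": "PreprocessData",
  "States": {
    "PreprocessData": {"Type": "Succeed"}  // placeholder state
  }
}
"""

# json.loads doubles as a syntax check before pasting into the console.
definition = json.loads(strip_line_comments(annotated))
```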

Important Items to Check

Once you’ve made your state machine, we still have to verify a few things:

1. Click “Refresh” in the visual editor to make sure your workflow looks correct
2. Use the “Check for errors” button to validate your syntax
3. Make sure all your Lambda function ARNs are correct

Pro tip: You can test your workflow right from the console! After creating the state machine:

1. Click on “Start execution”
2. Provide a sample JSON input
3. Watch your workflow execute in real-time with the visual tracker

The visual interface will show you exactly which state is currently executing, what data is being passed between states, and if any errors occur. Think of it as a live control panel for your entire workflow, where you can watch your data flow through each step in real-time!

And with that, we’re done. Now you can trigger your state machine pipeline via an endpoint (if needed), an automated cron job, or an Amazon EventBridge (formerly CloudWatch Events) rule that runs it at regular intervals.

Before we wrap up, there are two important sections I want you to go through before reaching for Step Functions.

When Should You Use Step Functions?

Step Functions shines when you’re dealing with:

- Long-running, multi-step processes that need coordination (like ML training pipelines)
- Workflows requiring error handling and retries across different services
- Complex data processing pipelines where state management is crucial
- Operations that need to be audited or tracked in detail
- Workflows that combine both synchronous and asynchronous tasks
- Tasks that need to pause and resume based on external events
- Processes where different teams handle different steps and need clear boundaries

When is Step Functions Overkill?

Skip Step Functions when you have:

- Simple, single-step processes that could be handled by a single Lambda
- Quick synchronous operations that don’t need state management
- Workflows where all steps must execute within 15 minutes (Lambda’s limit might be enough)
- Small-scale applications where manual coordination is manageable
- Processes where the overhead of setting up and maintaining Step Functions outweighs its benefits
- Real-time processing needs where the state transitions add unnecessary latency
- Simple CRUD operations that could be handled by straightforward API calls
- Workflows where cost optimization is crucial and simpler alternatives would suffice

Conclusion

So, in a nutshell, Step Functions is like having an autopilot for your AWS services where you set the course once, and it handles all the complex navigation. Building ML pipelines used to mean juggling multiple services and praying nothing breaks mid-process. But Step Functions changes the game, orchestrating everything from data preprocessing to model deployment with precision and reliability.

Remember: Step Functions is a powerful tool, but like any AWS service, it comes with its own pricing model. Sometimes, a simple Lambda function or a well-structured API might be all you need!

Here’s to building ML pipelines that don’t keep us up at night! 🚀 Until next time 👋