AWS Step Functions
AWS Step Functions is a serverless function orchestrator that uses a state machine to tie together Lambda functions and other supported AWS services to build complex serverless workflows. Step Functions can integrate directly with many AWS services such as Lambda, DynamoDB, SQS, SNS, API Gateway, SageMaker and Athena. A state machine can be deployed, started and monitored through the console or an AWS SDK.
The state machine definition is written in a JSON-based structured language called Amazon States Language. Each step within a workflow is called a state. There are several built-in states that can be used for managing the workflow.
- Choice - choose execution branch based on defined conditions
- Wait - delay for a specified time
- Parallel - run parallel execution branches simultaneously
- Map - dynamically iterate over states (also supports concurrency)
- Task - a single unit of work run by a Lambda function or another supported AWS service
- Fail - stops an execution and marks it as a failure
- Succeed - stops an execution successfully
- Pass - passes its input to its output
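To make the state types above concrete, here is a minimal sketch of a definition built as a Python dictionary and serialised to Amazon States Language JSON. The workflow, the state names and the Lambda ARN are all hypothetical placeholders, and the undefined_targets helper is only an illustrative local sanity check, not part of any AWS tooling.

```python
import json

# Hypothetical order workflow; the Lambda ARN below is a placeholder.
definition = {
    "Comment": "Illustrative workflow using Task, Choice, Wait, Succeed and Fail",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:validate-order",
            "Next": "IsOrderValid",
        },
        "IsOrderValid": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.valid", "BooleanEquals": True, "Next": "CoolingOffPeriod"}
            ],
            "Default": "RejectOrder",
        },
        "CoolingOffPeriod": {"Type": "Wait", "Seconds": 30, "Next": "OrderAccepted"},
        "OrderAccepted": {"Type": "Succeed"},
        "RejectOrder": {
            "Type": "Fail",
            "Error": "InvalidOrder",
            "Cause": "The order failed validation",
        },
    },
}


def undefined_targets(defn):
    """Return state names that are referenced but never defined."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
    return sorted(targets - set(states))


# The JSON string is what would be supplied as the state machine definition.
asl_json = json.dumps(definition, indent=2)
```

Building the definition as data rather than a hand-written string makes it easy to run simple checks, such as the one above, before deploying.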
AWS Step Functions is priced on a pay-for-what-you-use model. For Standard workflows you pay for every state transition rather than for execution time; Express workflows are instead billed by the number of requests and by their duration and memory use. Any other AWS service invoked by Step Functions (such as a Lambda function) is charged separately at its own rate.
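As a rough illustration of per-transition pricing, the sketch below estimates the cost of a workload from its number of state transitions. The price constant is an assumption for illustration only (check the current AWS pricing page for your region), the function name is hypothetical, and any free tier is ignored.

```python
from decimal import Decimal

# ASSUMPTION: illustrative price per state transition in USD, not an authoritative figure.
ASSUMED_PRICE_PER_TRANSITION = Decimal("0.000025")


def estimate_transition_cost(transitions_per_execution, executions,
                             price=ASSUMED_PRICE_PER_TRANSITION):
    """Estimate the state transition charge for a batch of executions."""
    return Decimal(transitions_per_execution) * Decimal(executions) * price


# 8 transitions per execution, 100,000 executions in a month
monthly_cost = estimate_transition_cost(8, 100_000)
```

With the assumed price, 800,000 transitions come to 20 USD. The point is that the number of states in a workflow drives cost directly, so collapsing trivial states can matter at volume.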
Deployment options
The AWS Console provides a simple way to experiment with Step Functions. It has sample projects to help you get going quickly, and Workflow Studio provides a visual editing experience for defining and deploying workflows.
For production, an Infrastructure-as-Code technology such as AWS CloudFormation, AWS SAM, or a third-party tool like Terraform should be used to define and deploy Step Functions. Terraform's support is very good, as it is possible to leverage the template_file data source, resulting in a neat and readable definition of the state machine. A state machine definition can be written in JSON or YAML. AWS also has extensions for your preferred IDE (called AWS Toolkit) providing a better developer experience with features such as graph visualisation of your workflow, state language code snippets and code completion/validation.
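As one possible shape for an Infrastructure-as-Code definition, the sketch below generates a small CloudFormation template (in its JSON form) containing an AWS::StepFunctions::StateMachine resource. The function name, the logical ID "StateMachine", the role ARN and the one-state definition are hypothetical; a real template would usually also define the IAM role.

```python
import json


def state_machine_template(name, definition, role_arn, workflow_type="STANDARD"):
    """Build a minimal CloudFormation template for one state machine."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "StateMachine": {  # hypothetical logical ID
                "Type": "AWS::StepFunctions::StateMachine",
                "Properties": {
                    "StateMachineName": name,
                    "StateMachineType": workflow_type,  # STANDARD or EXPRESS
                    "DefinitionString": json.dumps(definition),
                    "RoleArn": role_arn,
                },
            }
        },
    }


# Placeholder definition and role ARN, for illustration only.
template = state_machine_template(
    name="demo-workflow",
    definition={"StartAt": "Done", "States": {"Done": {"Type": "Succeed"}}},
    role_arn="arn:aws:iam::123456789012:role/demo-step-functions-role",
)
template_json = json.dumps(template, indent=2)
```

Keeping the state machine definition in its own structure and injecting it into the template mirrors the approach described above for Terraform, where the definition lives in a separate, readable file.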
Which workflow?
When starting with Step Functions, the first decision you have to make is choosing between a Standard and an Express workflow.
Express Workflow
Capable of processing up to 100,000 invocations per second, this workflow is best suited to use-cases involving high volumes of short-lived events. This model guarantees "at least once" execution for asynchronous invocations and "at most once" execution for synchronous invocations. Workflows must complete within 5 minutes. If a workflow fails it must be rerun from the beginning. Express workflows can be invoked either synchronously or asynchronously. The activity and callback patterns are not supported by Express workflows.
Standard Workflow
Can process up to 2,000 invocations per second. This is ideal for when tasks are expensive or harmful if run more than once. The standard model guarantees "exactly once" invocation of each step and its execution can last up to 1 year. Each invocation is tracked in the Step Functions console and each state input and output can be inspected during and after execution. Standard Workflows support all service integrations, activities, and design patterns.
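The difference between asynchronous and synchronous invocation can be sketched as follows. The helper assumes that sfn is a boto3 Step Functions client (for example boto3.client("stepfunctions")) and uses its start_execution and start_sync_execution calls; the function name run_workflow and its return shape are hypothetical.

```python
import json


def run_workflow(sfn, state_machine_arn, payload, wait_for_result=False):
    """Start an execution, optionally waiting for the result.

    ASSUMPTION: `sfn` behaves like a boto3 Step Functions client.
    """
    body = json.dumps(payload)
    if wait_for_result:
        # Synchronous start is only available for Express workflows.
        response = sfn.start_sync_execution(
            stateMachineArn=state_machine_arn, input=body
        )
        return {"status": response["status"], "output": response.get("output")}
    # Asynchronous start returns immediately with the execution ARN.
    response = sfn.start_execution(stateMachineArn=state_machine_arn, input=body)
    return {"executionArn": response["executionArn"]}
```

Passing the client in as an argument keeps the helper easy to exercise with a stub in unit tests, without touching AWS.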
When to consider?
- Step Functions should be strongly considered when using a combination of serverless services. This is because Step Functions will not only help with the orchestration of a workflow but also present a visual flow of events to aid debugging or troubleshooting.
- Step Functions is useful when polling long-running tasks which may exceed the maximum lifetime of a Lambda function. This could be ideal for orchestrating a long-running batch job in a serverless manner.
- Due to the long-running nature of Standard workflows, Step Functions should be considered when some manual intervention or human input is required, e.g. a customer confirming that they have received their parcel.
- You should consider Step Functions if you require dynamic parallelism for batch workflows as well as some visual representation and monitoring.
- If a complex workflow is required, nested workflows can be handy for abstracting away complexity while still allowing someone to drill in if more detail is needed.
- CodePipeline can trigger Step Functions, allowing for complex custom deployments in an environment.
- Step Functions is useful for scheduling a workflow that contains an Athena query. Built-in integration with Athena means a query result can be returned asynchronously, avoiding having to poll and check if your query execution finished successfully.
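The polling use-case from the list above can be sketched as a Wait and Choice loop. The state names, both Lambda ARNs and the "$.status" field are hypothetical, and the route helper only mimics the Choice state locally for illustration; it is not how Step Functions itself evaluates rules.

```python
import json

# Placeholder ARNs for two hypothetical Lambda functions.
SUBMIT_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:submit-job"
CHECK_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:check-job"

polling_definition = {
    "StartAt": "SubmitJob",
    "States": {
        "SubmitJob": {"Type": "Task", "Resource": SUBMIT_ARN, "Next": "WaitBeforePolling"},
        "WaitBeforePolling": {"Type": "Wait", "Seconds": 60, "Next": "CheckJobStatus"},
        "CheckJobStatus": {"Type": "Task", "Resource": CHECK_ARN, "Next": "IsJobFinished"},
        "IsJobFinished": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "SUCCEEDED", "Next": "JobSucceeded"},
                {"Variable": "$.status", "StringEquals": "FAILED", "Next": "JobFailed"},
            ],
            # Any other status loops back to the Wait state.
            "Default": "WaitBeforePolling",
        },
        "JobSucceeded": {"Type": "Succeed"},
        "JobFailed": {"Type": "Fail", "Error": "JobFailed", "Cause": "The batch job failed"},
    },
}


def route(choice_state, status):
    """Local stand-in for the Choice state: pick the next state for a status."""
    for rule in choice_state["Choices"]:
        if rule["StringEquals"] == status:
            return rule["Next"]
    return choice_state["Default"]


polling_json = json.dumps(polling_definition, indent=2)
```

Because the waiting happens inside the state machine rather than inside a function, no Lambda invocation has to stay alive for the duration of the job.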
Integration with other services
There are several "patterns" that can be adopted when integrating Step Functions with other services. Some examples are:
Request a response: Step Functions will progress to the next state as soon as an HTTP response is received.
Run a job: This pattern allows integrated services such as AWS Batch and Amazon ECS to complete a job before Step Functions moves to the next state.
Activities: These are used to associate applications hosted anywhere with a task in a state machine. The only requirement is that the application (worker) can make an HTTP request to Step Functions. Workers poll for work by issuing a GetActivityTask request. When work is available, Step Functions returns a response that includes the input for the job and a Task Token, which is used to correlate the application's job result.
Callbacks: Step Functions invokes a service, passing a Task Token. The execution of the state machine then pauses until another service calls back Step Functions with the Task Token. A waiting task can be configured with a heartbeat timeout. The Task Token is returned with one of the following API actions: SendTaskSuccess, SendTaskFailure, or SendTaskHeartbeat.
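The activity protocol described above can be sketched as a single polling step. This assumes sfn is a boto3 Step Functions client exposing get_activity_task, send_task_success and send_task_failure; the function name, the worker name and the handler callable are hypothetical.

```python
import json


def process_one_task(sfn, activity_arn, handler, worker_name="example-worker"):
    """Poll once for activity work and report the outcome with the Task Token.

    ASSUMPTION: `sfn` behaves like a boto3 Step Functions client.
    Returns True if a task was handled, False if the poll returned no work.
    """
    task = sfn.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False  # the long poll ended without any work being available

    try:
        result = handler(json.loads(task["input"]))
    except Exception as exc:
        # Report the failure so the state machine can retry or fail the state.
        sfn.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

A worker would typically call this in a loop. The same SendTaskSuccess and SendTaskFailure calls complete a callback task, the difference being that the token arrives in the invoked service's payload rather than from a poll.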
Gotchas
- The maximum size of the state passed between steps is 256 KB. If the state is bigger, it must be saved elsewhere (for example to S3) and a reference passed instead. This limitation is similar to the message size limit in SQS.
- Workflow versioning was originally a manual task: one could redeploy using a different name, or append a version number to the workflow name. Step Functions now supports publishing state machine versions and aliases, so check whether these cover your needs before building a manual scheme. AWS Lambda, by comparison, has long allowed versions to be published while previous versions remain accessible.
- Updates to a state machine only apply to new executions; they won't impact executions that are already in progress.
- Step Functions is not suitable for synchronous processing of a request where the client requires an immediate response.
- The 'at least once' guarantee of an asynchronous Express Workflow means a step may run more than once, so each step must be idempotent.
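- Historically there was no resume from failure out of the box, so this had to be designed and implemented into your workflow. Standard workflows can now be redriven from the point of failure, but Express workflows still have to be rerun from the beginning.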
- Step Functions can be tested locally in a Docker container; however, it may require a fair amount of configuration to set this up correctly. This article may provide a good starting point.
- State transition information for a Standard workflow execution can be viewed in the console or retrieved with the GetExecutionHistory API, but Express workflows do not record execution history in Step Functions and rely on CloudWatch Logs instead. If an application requires state transition information to be persisted, this has to be designed into the workflow.
- Step Functions can be nested, which may create complex workflows that are difficult to debug and cause unexpected side effects. When designing a workflow, aim to keep it as simple as possible.
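For the state size limit in the first gotcha, one common workaround is to store large payloads in S3 and pass only a pointer between states. The sketch below assumes s3 is a boto3 S3 client exposing put_object; the function name, the bucket and key arguments and the "payloadLocation" field are hypothetical conventions, not anything Step Functions defines.

```python
import json

MAX_STATE_BYTES = 256 * 1024  # the state size limit quoted in the list above


def prepare_state(payload, s3, bucket, key):
    """Return the payload itself if it fits, otherwise an S3 pointer to it.

    ASSUMPTION: `s3` behaves like a boto3 S3 client.
    """
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= MAX_STATE_BYTES:
        return payload
    # Too large to pass between states: store it and hand over a reference.
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return {"payloadLocation": {"bucket": bucket, "key": key}}
```

Downstream steps then need to recognise the pointer and fetch the object themselves, so the convention has to be agreed across every task in the workflow.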