10 PRINT "I AM COOL"

Exposing an Azure Function using API Management

Andy Blackledge — Tue, 09 Apr 2024 10:59:51 GMT

This is the third post in my series where I explore the world of serverless Azure. I am doing this by building a webhook proxy application using only the serverless Azure services. In the first two posts, I built and deployed an Azure function that receives the webhooks and validates the request body against the corresponding schema. In this post, I look at using Azure API Management (APIM) in front of that function.

Why use API Management?

As we saw in the previous post, the Azure function is accessible from the public internet provided you know the appropriate API key. So why would you need a service such as API Management? A few reasons are listed below.

Rate limiting, ensuring fair usage among consumers
Subscription-level control, such as key rotation
Advanced security, such as OAuth 2.0

API Management has many more features that we won't explore in this post, including the following:

Centralized management of APIs
Customizable API facades
API Documentation and Developer Portals
API Analytics and Insights
Caching Mechanisms
Versioning and Revision Control

I find that so much in software design is down to positioning. By using a service such as API Management, we position ourselves to provide a production-level API.

Are there any alternatives?

As with other cloud providers, Azure offers services that somewhat overlap in what they offer. In this case, ChatGPT was able to offer the following options amongst some others:

Azure Application Gateway with Web Application Firewall (WAF)

Use Case: If you're primarily looking for API gateway capabilities with security features like a Web Application Firewall, SSL termination, and URL-based routing.

Why Use Over APIM: Offers Layer 7 load balancing with built-in WAF for security-focused scenarios, especially where protection against common web vulnerabilities and exploits is a priority.

Azure Functions Proxies

Use Case: For lightweight API orchestration or when you need a simple facade in front of multiple Azure Functions.

Why Use Over APIM: It's a simple solution to create a single API surface for multiple microservices, particularly when these services are implemented using Azure Functions. However, it's less feature-rich compared to APIM.

Azure Front Door

Use Case: For global routing and load balancing needs, offering capabilities like URL-based routing, SSL termination, and global load balancing.

Why Use Over APIM: It is more focused on content delivery, global routing, and ensuring high availability and performance for your web applications and APIs.

Given the API-focused nature of the application I am building, and the cost, API Management seems to be a good fit.

Creating the API Management instance

Using the Expose serverless APIs from HTTP endpoints using Azure API Management article as a guide, I opened up the Azure portal and navigated to my function app and selected the API Management blade.

This brought up the following option to create a new API Management instance.

Clicking on 'Create new' brought up the following UI. The Region and Resource name were defaulted and I filled in the other details. To save money, I chose the 'Developer' pricing tier noting that there is no Service Level Agreement (SLA) for this tier.

The next step gave the option to link the new instance with Application Insights and utilise 'Defender for Could' (sic). I enabled the former, but there wasn't an option to enable the latter. This may have been due to using the cheapest tier.

The final step was to choose the network connectivity. As this is to be a public endpoint, I chose 'None'.

After clicking 'Create' and waiting a while, the new API Management instance was ready to be connected to the function and expose the API.

Exposing the Azure function as an API

Now when I selected the API Management blade from the function app, it reported that it was now linked with the new instance. However, it had not imported anything automatically and presented me with the option to create a new API.

There was nothing promising in the API dropdown, so I selected 'Create New' and clicked on 'Link API'. I was then presented with the following list of Azure Functions.

Now this was looking more promising. The wizard appears to have recognised my Azure Function. I selected it and tried to progress. However, rather confusingly, I was then prompted to 'Define a new API' by selecting from a list of potential sources. This wasn't what I was expecting, but I selected 'Function App' and clicked to progress.

When I was presented with the next screen, I felt that I had somewhat gone round in a circle. It looked very similar to one earlier when my Azure Function was listed. However, I decided to click 'Select' and continue.

I was then presented with a list of Function Apps from which to import functions. I selected my Function App and clicked to progress.

The next step defaulted in a set of values, which all seemed reasonable to me. So I simply clicked 'Create'.

Once the creation was complete, I was able to view the 'Design' page for the new API and see the integration with the Azure Function in the backend. I could see how the process had recognised the request parameters and I also noticed the 'Inbound processing' box mentioned modifying the request. Also in the 'Inbound processing' box was the option to add policies. These, it turns out, are where you can do things such as filtering by IP address or rate limiting by key.
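To make 'rate limiting by key' a little more concrete, here is a conceptual sketch of the idea: count calls per key and refuse any beyond a limit within a time window. This is only an illustration under my own assumptions (a fixed window, a limit of 2, and made-up tenant keys). It is not how API Management implements its policies, which are configured declaratively on the API rather than written as code.

```csharp
using System;
using System.Collections.Generic;

// Conceptual fixed-window limiter: at most `limit` calls per key in each window.
// Illustration only; the limit, window and key names are assumptions.
const int limit = 2;
var window = TimeSpan.FromSeconds(60);
var windows = new Dictionary<string, (DateTimeOffset Start, int Count)>();

var now = DateTimeOffset.UtcNow;
Console.WriteLine(TryAcquire("tenant-a", now));                // True
Console.WriteLine(TryAcquire("tenant-a", now));                // True
Console.WriteLine(TryAcquire("tenant-a", now));                // False: over the limit
Console.WriteLine(TryAcquire("tenant-b", now));                // True: a separate key
Console.WriteLine(TryAcquire("tenant-a", now.AddSeconds(61))); // True: a new window

bool TryAcquire(string key, DateTimeOffset at)
{
    // Start a fresh window on the first call for a key, or once the old window has expired.
    if (!windows.TryGetValue(key, out var state) || at - state.Start >= window)
    {
        state = (at, 0);
    }

    if (state.Count >= limit)
    {
        return false; // a gateway would typically answer with HTTP 429 here
    }

    windows[key] = (state.Start, state.Count + 1);
    return true;
}
```

The key could be anything that identifies a caller, such as a subscription key or a tenant identifier.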

My eye was also caught by the 'Test' tab, so I clicked on it and gave it a go.

The response below showed that my Azure Function had been successfully called through the new API Management instance. So after quite a bit of clicking, it looks like I had managed to achieve my first aim.

I was a bit curious as to how API Management was authorised to access my Azure Function. A bit of searching found this in the Authorization section of a Microsoft article:

Import of an Azure Function App automatically generates:
Host key inside the Function App with the name apim-{your Azure API Management service instance name},
Named value inside the Azure API Management instance with the name {your Azure Function App instance name}-key, which contains the created host key.

Sure enough, when I looked in the portal, I could see the following App key in my Function App.

And in the API Management instance, I found the same value as a 'Named value' with the expected name.

Ideally, I would like to use a managed identity and grant this identity access to the Azure Function. This method is more secure and manageable than function keys, but is only available to Premium Tier instances. So, for now, we will go with function keys.

Creating tenant-specific subscriptions

The Azure portal test had shown that the function was successfully integrated with API Management. The next test was to invoke it from outside of the Azure portal. As it happens, the Azure portal test UI has a feature that makes this easy. There is an option to copy the test HTTP request, which is highlighted below.

Using VS Code and the REST Client extension, you can paste the copied value into an .http file and you get the following.

So all I needed to do was find a suitable subscription key. I went into the API Management UI and the Subscriptions blade. This listed three subscription keys that appeared to have been created by default.

I tried the first two primary keys, but neither worked. So, with some reluctance, I tried to select the 'all-access' key. The result was this warning.

I decided to persevere and was able to successfully use the all-access key to make calls from VS Code. However, this was far from ideal and I wanted to work out how I would be able to issue and maintain separate keys for separate callers. So I clicked the option to create a new subscription.

This was more like it. I could restrict the access of the subscription to a specific API. There was also an option to trace, but for now I did not explore it. Once created, the new subscription appeared in the list as shown below.

I copied the primary key locally and re-ran my test from VS Code. As hoped, this was successful and now I had a way of issuing and rotating keys that were scoped to just a specific API.
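For callers that are not using an .http file, the same request can be sketched in C#. The gateway URL and the environment variable name below are placeholders of my own, so substitute your own values. The header name, Ocp-Apim-Subscription-Key, is the default one that API Management looks for.

```csharp
using System;
using System.Net.Http;

// Hypothetical gateway URL: replace it with your own API Management endpoint.
const string apiUrl = "https://example-apim.azure-api.net/example-api/HttpExample";

// The environment variable name is an assumption; it keeps the key out of source control.
var subscriptionKey = Environment.GetEnvironmentVariable("APIM_SUBSCRIPTION_KEY");

if (string.IsNullOrEmpty(subscriptionKey))
{
    Console.WriteLine("Set APIM_SUBSCRIPTION_KEY before running.");
}
else
{
    using var client = new HttpClient();
    using var request = BuildRequest(apiUrl, subscriptionKey);
    using var response = await client.SendAsync(request);
    Console.WriteLine($"{(int)response.StatusCode} {response.ReasonPhrase}");
}

// Builds a GET request carrying the key in API Management's default header.
static HttpRequestMessage BuildRequest(string url, string key)
{
    var request = new HttpRequestMessage(HttpMethod.Get, url);
    request.Headers.Add("Ocp-Apim-Subscription-Key", key);
    return request;
}
```

Rotating a key then only means changing the value supplied to the caller, with no change to the calling code.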

There was one more thing that was bugging me though. It was still possible to call the Azure Function directly from the public internet. Although this would require the function key to be leaked, it was far from ideal. So I started to look into how I could restrict access and remove this possibility.

Disabling public access to the function

I went into the Networking blade for the Function App and clicked on the link next to 'Public network access'. This brought up the options below with the 'Enabled from all networks' selected. My first thought was to disable all public access. Maybe access from API Management doesn't count as public.

However, a quick test showed that this wasn't the case. Sure, direct access had been stopped, but so had access through API Management.

Engaging with ChatGPT, there was mention of virtual networks and managed identities. The former I didn't want to get into at that point in time and the latter wasn't available at my pricing tier.

I asked ChatGPT "Is it possible to add an access rule to an Azure Function that restricts access only from Azure?" The response came back:

Yes, it is possible to restrict access to an Azure Function so that it's accessible only from within Azure services. This can be done by configuring network access restrictions in your Azure Function settings. However, it's important to note that "from Azure" can encompass a wide range of sources, including other Azure services, VMs hosted in Azure, or services running in Azure regions.

The suggestion from ChatGPT was to use an Azure Service tag in the network access restrictions. As it put it:

Azure Service Tags represent a group of IP address prefixes from a given Azure service, which are used to help minimize complexity for security rule creation. You can leverage these in your network security rules.

Given this, I selected the option to enable access from selected IP addresses.

This brought up a list of rules and I clicked on the option to add a new one. Following ChatGPT's advice, I added a rule for the AzureCloud service tag.

Once added, my new rule took pride of place at the top of the list. The portal defaults the unmatched rule action to 'Allow'. This isn't what I wanted, so I changed it to 'Deny' which resulted in the rule at the bottom.

Again I tested. This time I was able to access the function through API Management, but not directly. So I now had the behaviour that I wanted. However, ChatGPT did highlight these valid considerations for using the AzureCloud service tag:

Broad Access: The AzureCloud service tag is quite broad and includes all of Azure's public IP addresses. It doesn't restrict access to only your Azure services but allows access from any Azure-hosted service, which can include Azure services used by others.
Other Azure Services: If your intention is to allow access only from specific Azure services (like Azure Logic Apps, Azure VMs, etc.), you might need a more granular approach. You can specify the IP addresses or ranges of those specific services or use relevant service tags if available.
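The way the rule list behaves can be sketched in a few lines of C#: rules are checked in priority order, lowest number first, the first match decides, and anything unmatched gets the unmatched rule action. The prefix below is a documentation-only range that I am using as a stand-in. The real AzureCloud service tag covers a far larger, published set of prefixes. The sketch assumes .NET 8 for System.Net.IPNetwork.

```csharp
using System;
using System.Linq;
using System.Net;

// Hypothetical rule list: one allow rule standing in for the AzureCloud service tag.
// 203.0.113.0/24 is a documentation range, not a real Azure prefix.
(string Name, int Priority, IPNetwork Network, bool Allow)[] rules =
{
    ("AllowAzureCloud", 100, IPNetwork.Parse("203.0.113.0/24"), true),
};

// The unmatched rule action, changed from 'Allow' to 'Deny'.
const bool unmatchedAllow = false;

Console.WriteLine(IsAllowed(IPAddress.Parse("203.0.113.25"))); // True: inside the allowed prefix
Console.WriteLine(IsAllowed(IPAddress.Parse("198.51.100.7"))); // False: falls through to Deny

bool IsAllowed(IPAddress caller)
{
    // Lowest priority number is evaluated first; the first matching rule wins.
    foreach (var rule in rules.OrderBy(r => r.Priority))
    {
        if (rule.Network.Contains(caller))
        {
            return rule.Allow;
        }
    }

    return unmatchedAllow;
}
```

This also shows why the broad-access caveat matters: any caller whose address falls inside the allowed prefixes passes, whoever owns it.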

Summary

API Management appears to be a powerful tool to expose and manage external APIs. I barely scratched the surface of its capabilities, as I was satisfied in just knowing that my Azure Function was now behind a suitable service. There was an awful lot of clicking and it makes me wonder about how all this would be done through infrastructure as code.

Comparing this experience to the one I have had with AWS is interesting. With Azure, I had to find ways to stop my Azure Function from being exposed. With AWS, you have to find ways to expose your Lambda functions. With Azure, you have to pay for managed identity functionality to integrate API Management with Azure Functions. With AWS, you have to use Identity and Access Management (IAM) for everything and it is completely free to use.

However, I now have the front of my application in a place I want. So the next step is to look at extending the back-end functionality, which will mean integrating with Blob storage.

Deploying and Debugging my First Azure Function

Andy Blackledge — Mon, 01 Apr 2024 08:49:01 GMT

In this series, I am learning by doing and documenting my experience on the way. I am aiming to build a serverless webhook proxy, to provide a resilient interface for downstream systems. In the first part, I used the local development tools to build a basic function to validate and store inbound requests. In this part, I deploy and debug the function in the cloud, using the tools in Visual Studio and the Azure portal.

It's all gone south

Ultimately, I want to deploy the final application using infrastructure as code. However, first I thought I would try the ClickOps approach. This is done by right-clicking on the project and selecting 'Publish'.

The first time through it did successfully publish the function to Azure. However, I did notice something odd in the wizard. At one point it asks you to create a new Functions instance. The odd thing is that it only gave me one option for the storage, and that option was on the other side of the world.

I thought that perhaps it was a user interface issue. Surely, it wouldn't create resources 12,000 miles apart. I clicked to continue and then looked at the created resources.

Unfortunately, it was very much the case that resources could not have been more geographically disparate if you tried.

Back to the Portal

Try as I might, I could not get the Visual Studio wizard to create a set of geographically-sensible resources. Instead, I went into the Azure portal and selected the option to create a Function App. This did allow me full control over the location of any created resources. In particular, the Application Insights could now be located in the UK.

Once the Function App had been created via the portal, I could select it in the Visual Studio wizard.

Once this wizard had completed, and generated a few interesting files, I was presented with the 'Publish' button below.

Pressing this started the deployment and, before long, I had my function deployed into the cloud and ready to be tested.

Debugging the deployment

Full of excitement, I fired off a request to the function and got the following response.

HTTP/1.1 500 Internal Server Error

Now, I could test the function locally to recreate and debug the issue. However, it struck me that it would be an opportunity to look at what diagnostics are available in the cloud. With this in mind, I opened the Azure portal and went into the function app overview. Here, I was presented with a list of functions.

After selecting the function, I was then given the following developer options.

Out of curiosity, I selected the 'Integration' option and got a diagram showing the triggers and inputs for the function and the outputs from it. There was also a warning that I could not edit my function in the portal, as I had chosen to use the isolated worker model. If such editing is important to you, then this is perhaps a reason not to choose that model. For me, I would rather not have the option.

Selecting 'Monitor' resulted in a promising list of function invocations. At the top of the list was my failure.

Clicking on the hyperlinked date brought up the details that I was looking for.

Here was the exception stack trace that clearly showed that the error lay in the code I had written. Well, the code I had copied from ChatGPT. Clearly, I had not been as diligent as I should have been with my testing.

There was also an option to run a query in Application Insights. Clicking this caused the following query to run and return all the relevant entries. Note how the query uses the 'union' operator to combine data from both the traces and exceptions tables.

For completeness, I opened the 'Logs' page. This appeared to be some sort of realtime view of the logs. I ran my faulty function again and saw the following entries appear.

By default, it doesn't appear to log the actual exception. This seems to rather limit its usage, as the other views capture the full details. However, it might have some uses that are not apparent to me at the moment. It is good to know it is there though.

Now I knew what the problem was, I could go and fix it. But before that, there was one more thing I wanted to try and that was remote debugging.

Remote debugging (eventually)

To tell the truth, I could have saved myself quite a bit of frustration if I had read the remote debugging section of the Microsoft Develop Azure Functions using Visual Studio guide. However, here is the tale of my more circuitous route to success.

My first attempt was the most obvious option. That is, to use the option in the Publish page to attach a debugger.

This then indicated some activity, but ultimately no attachment occurred. Undeterred, I searched the internet and found the blog post How to remote debug an HTTP trigger Azure Functions in Visual Studio 2022. One of the steps mentioned was to enable remote debugging in the Azure portal. However, when I looked I found this was already enabled.

With hindsight, what I suspect had happened was that the 'Attach Debugger' operation had enabled this. The Microsoft article advises the following:

After you're done remote debugging your code, you should disable remote debugging in the Azure portal. Remote debugging is automatically disabled after 48 hours, in case you forget.

The next thing I tried was to update the publish settings. The configuration was set to 'Release', so I changed it to 'Debug'.

Following the instructions in the blog post, I tried manually attaching to the remote process and was prompted for credentials to connect.

The credentials required had to be downloaded from the Azure portal, via a publish profile.

In the downloaded file, I found the details required in the 'Zip Deploy' element.

After a few tries, I was finally able to see the processes. So I followed the blog post and tried attaching to the w3wp.exe process.

However, Visual Studio still reported that my breakpoints were not active.

I wondered for a short while if remote debugging was not supported for the isolated worker model. Then it dawned on me that it wasn't the w3wp.exe process that I should be attaching to, it was the isolated dotnet.exe process instead.

Once I had done this, everything fell into place. My breakpoint was hit and I could step through my function remotely.

As mentioned, I could have avoided this, as the Microsoft article clearly states:

Check Show process from all users and then choose dotnet.exe and select Attach.

Summary

My experience highlighted the upsides and downsides of high-level wizards. When they work, they can be very productive. However, when they don't, it can be very difficult to understand what is going on and how to fix it. I also fell foul of the changing technology, as I was reading articles for the older function model whilst using the newer model.

However, I got there in the end and the option of remote debugging is an interesting one. It is one that I have never really felt I needed with AWS Lambda functions, but perhaps I will find a use now it is available. The option does expose the internals of the isolated worker model, as you can see how there is a separate process being called from the usual w3wp.exe process. If you step through an unhandled exception, you can even see how gRPC is used to communicate. All quite different from the black box of AWS Lambda functions.

Observations on Developing My First Azure Function

Andy Blackledge — Sun, 24 Mar 2024 11:08:02 GMT

Recently, my focus at work has shifted from AWS development to Azure development. To help myself get properly acquainted with the technology, I have decided to set myself an Azure-based serverless challenge: to build a multi-tenant webhook proxy, first using ClickOps and then using Infrastructure as Code (IaC). In this post, I will first look at writing a single Azure Function and see what that brings.

The Webhook Proxy Application

The ultimate goal is to create a serverless application that can be placed in front of internal systems to robustly handle webhook callbacks. Its functionality will cover:

Validating the content of the callback
Storing the content of the callback
Forwarding the content to a downstream system
Automatically retrying if the downstream system is offline
Forwarding to a dead letter queue if not able to forward

In AWS, I would probably build the application using API Gateway calling a Lambda function that stores the request in S3. I would then handle the resulting S3 events using a combination of Lambda functions and EventBridge to deliver the request.

The appealing aspect of this task is that it is a real-world need, and that it covers API management, functions as a service, serverless storage, and events. This means I will need to tangle with the Azure technologies for each of these.

My first step on this journey is to create, test, and deploy an Azure Function that validates the contents of an HTTP request according to a schema specified as part of the path.

Choosing an Azure Function model

Things rarely turn out to be as straightforward as you think, and straight away I was forced to make a choice as to the Azure Function model to use. Since I last explored them, there is a new 'isolated worker model' for running Azure Functions. The Microsoft article on the differences between the isolated worker model and the in-process model explains that there are two execution models for .NET functions:

Execution model	Description
Isolated worker model	Your function code runs in a separate .NET worker process. Use with supported versions of .NET and .NET Framework.
In-process model	Your function code runs in the same process as the Functions host process. Supports only Long Term Support (LTS) versions of .NET.

The Guide for running C# Azure Functions in an isolated worker process explains the benefits for the newer model:

Fewer conflicts: Because your functions run in a separate process, assemblies used in your app don't conflict with different versions of the same assemblies used by the host process.
Full control of the process: You control the start-up of the app, which means that you can manage the configurations used and the middleware started.
Standard dependency injection: Because you have full control of the process, you can use current .NET behaviors for dependency injection and incorporating middleware into your function app.

In my experience, when Microsoft develop a new model then it is better to adopt it if you can. The older models do not seem to get quite the same love. So given that, it was the isolated worker model for me.

The 'Out of the Box' Experience

The Microsoft article Create your first C# function in Azure using Visual Studio was my guide to getting started. The experience compared to Lambda functions in AWS is quite stark. Visual Studio is a one-stop development environment for .NET, whereas AWS and VS Code feel like an 'assemble your own' adventure.

As long as you remember to select the Azure development workload when installing Visual Studio, it only takes a few steps after selecting the 'Azure Functions' template before you can hit F5 and have the following code executing:

    [Function("HttpExample")]
    public IActionResult Run(
        [HttpTrigger(AuthorizationLevel.Anonymous, "get", "post")] HttpRequest req)
    {
        return new OkObjectResult("Welcome to Azure Functions!");
    }

Visual Studio had seamlessly started a local hosting process at http://localhost:7166/api/HttpExample, as can be seen below:

So straight away, I could set break points and benefit from a super-quick inner development loop. However, it turned out that the generated code was not the only way to go.

The Built-in HTTP model

As the HTTP trigger documentation states:

HTTP triggers allow a function to be invoked by an HTTP request. There are two different approaches that can be used:
An ASP.NET Core integration model that uses concepts familiar to ASP.NET Core developers
A built-in model, which doesn't require extra dependencies and uses custom types for HTTP requests and responses.

I consulted with ChatGPT, which had the opinion:

The choice between the built-in Azure Functions HTTP trigger model and ASP.NET Core integration depends on the complexity of your application, the need for control over the HTTP pipeline, and your familiarity with ASP.NET Core. For simpler, serverless applications, the built-in model is often the best choice due to its simplicity and tight integration with Azure. For more complex applications requiring the full feature set of ASP.NET Core, integrating with ASP.NET Core would be more appropriate.

Probing ChatGPT a bit more as to the limitations of the Built-in HTTP model, I got the following scenario where the more complex model might be needed:

While the built-in model of Azure Functions is powerful for many use cases, especially those that fit well within a serverless paradigm, it has its limitations in scenarios requiring advanced control over HTTP pipeline processing, complex authentication and authorization, and other sophisticated web API features. In these cases, integrating with a more feature-rich framework like ASP.NET Core would be more appropriate.

Given that I intend to front the function with Azure API Management to handle the authorisation aspects, I decided to go with the simplicity of the built-in model. This meant that the original boilerplate code became the following:

    [Function("HttpExample")]
    public HttpResponseData Run(
        [HttpTrigger(AuthorizationLevel.Anonymous, "get")] HttpRequestData req)
    {
        var response = req.CreateResponse(HttpStatusCode.OK);
        response.WriteString("Welcome to Azure Functions!");
        return response;
    }

So a little more verbose, but I could get rid of a dependency which was nice. I did have to amend Program.cs to use the defaults for a worker process:

    var host = new HostBuilder()
        // Was ConfigureFunctionsWebApplication()
        .ConfigureFunctionsWorkerDefaults()
        .Build();

Dependency Injection and Logging

One significant difference that struck me between AWS Lambda functions and Azure Functions, was that the latter starts you off down the dependency injection (DI) route. The boilerplate code uses DI out of the box to inject an ILoggerFactory implementation that can then be used to obtain a logger:

    public class SimpleHttpFunction(ILoggerFactory loggerFactory)
    {
        private readonly ILogger _logger =
            loggerFactory.CreateLogger<SimpleHttpFunction>();
        [Function("HttpExample")]
        public HttpResponseData Run(
            [HttpTrigger(AuthorizationLevel.Anonymous, "get")] HttpRequestData req)
        {
            _logger.LogInformation("C# HTTP trigger function processed a request.");
            // ...
        }
    }

As with a 'normal' application, the logging level is controlled within the Program.cs file. Here I want to have a very verbose output in development, but leave the level at its default in higher environments.

    var host = new HostBuilder()
        // ...
        .ConfigureLogging((context, loggingConfig) =>
        {
            var env = context.HostingEnvironment;
            if (env.IsDevelopment())
                loggingConfig.SetMinimumLevel(LogLevel.Trace);
        })
        // ...
        .Build();

I decided to take advantage of the dependency injection and adopt a little hexagonal architecture. I created a couple of services to isolate my function from the details of how a request is validated and how the request content is stored.

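    var host = new HostBuilder()
        // ...
        .ConfigureServices(services =>
        {
            // The implementation type names here are illustrative placeholders
            services.AddSingleton<IRequestValidator, RequestValidator>();
            services.AddSingleton<IRequestStore, RequestStore>();
        })
        .Build();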
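As a rough sketch of what the two services might look like as ports and adapters, here is a minimal, self-contained version. The member names, parameters, and both implementations are assumptions for illustration only. The real services would validate against a schema and write to durable storage.

```csharp
using System;
using System.Collections.Generic;

// Usage: the calling code only ever talks to the two interfaces.
IRequestValidator validator = new NonEmptyBodyValidator();
IRequestStore store = new InMemoryRequestStore();

var body = "{\"event\":\"created\"}";
if (validator.IsValid("contract-1", body))
{
    var id = store.Store("tenant-1", body);
    Console.WriteLine($"Stored request {id}");
}

// Hypothetical port: decides whether a request body is acceptable for a contract.
public interface IRequestValidator
{
    bool IsValid(string contractId, string body);
}

// Hypothetical port: persists a request body and returns an identifier for it.
public interface IRequestStore
{
    string Store(string tenantId, string body);
}

// Stand-in adapter: accepts any non-empty body (a real one would check a schema).
public sealed class NonEmptyBodyValidator : IRequestValidator
{
    public bool IsValid(string contractId, string body) => !string.IsNullOrWhiteSpace(body);
}

// Stand-in adapter: keeps requests in memory (a real one would use durable storage).
public sealed class InMemoryRequestStore : IRequestStore
{
    private readonly Dictionary<string, string> _requests = new();

    public string Store(string tenantId, string body)
    {
        var id = $"{tenantId}/{Guid.NewGuid():N}";
        _requests[id] = body;
        return id;
    }
}
```

Because the function depends only on the interfaces, swapping the in-memory store for a cloud-backed one should not require any change to the function itself.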

With this in place, I could then use a primary constructor to have the implementations injected at runtime. This would also set me nicely to do some unit testing later on.

    public class ValidateAndStoreFunction(
        ILoggerFactory loggerFactory,
        IRequestValidator requestValidator,
        IRequestStore requestStore)
    {
        private readonly ILogger _logger =
            loggerFactory.CreateLogger<ValidateAndStoreFunction>();
        private readonly IRequestValidator _requestValidator = requestValidator;
        private readonly IRequestStore _requestStore = requestStore;
        // ...
    }

As mentioned earlier, this all feels quite different to working with AWS Lambda functions. In AWS, it felt like the onus was on keeping everything as light as possible. I have never used any dependency injection frameworks with AWS, so this felt much more like 'normal' application development. My concern, as with all serverless functions, was that this would add to any cold start times. However, for my learning application, I am not too concerned. For those that are, Mikhail Shilkov has written this excellent article on Cold Starts in Azure Functions.

Unit testing my function

Whilst it was great to be able to run my function and invoke it from cURL or Postman, I like to write pure unit tests that can be run from anywhere. This turned out to be quite straightforward, at least in the case of my function.

The first task was to instantiate an instance of ValidateAndStoreFunction. The primary constructor for this is as follows:

    ValidateAndStoreFunction(
        ILoggerFactory loggerFactory,
        IRequestValidator requestValidator,
        IRequestStore requestStore)

Using the Moq mocking framework, I was able to supply mocks for these and setup appropriate return values.

    _mockLoggerFactory = new Mock<ILoggerFactory>();
    _mockRequestValidator = new Mock<IRequestValidator>();
    _mockRequestStore = new Mock<IRequestStore>();

The first real snag was the signature of the Run method. It requires a HttpRequestData as input:

    HttpResponseData Run(
        HttpRequestData req,
        string contractId,
        string senderId,
        string tenantId)

It turns out that HttpRequestData is an abstract class, so I tried to subclass it. This hit a couple of issues. The first was that HttpRequestData has a constructor that requires a FunctionContext instance. FunctionContext is itself abstract, so I tried using Moq to provide one.

    class MockHttpRequestData()
        : HttpRequestData(new Mock<FunctionContext>().Object)

HttpRequestData also needs to be able to create an HttpResponseData instance. The solution was again to subclass, this time HttpResponseData, and return an instance of my new class.

public override HttpResponseData CreateResponse() => new MockHttpResponseData();

The final step was to be able to pass in an object to be returned as a JSON stream.

    class MockHttpRequestData(object bodyObject)
        : HttpRequestData(new Mock<FunctionContext>().Object)
    {
        private readonly string _bodyJson = JsonConvert.SerializeObject(bodyObject);
        public override Stream Body => GetStringAsStream(_bodyJson);
        // ...
    }
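The GetStringAsStream helper is not shown above. One plausible way to write it, purely as an assumption about its shape, is to wrap the UTF-8 bytes of the string in a MemoryStream:

```csharp
using System;
using System.IO;
using System.Text;

// Assumed implementation of the helper: expose a string as a readable UTF-8 stream.
static Stream GetStringAsStream(string value) =>
    new MemoryStream(Encoding.UTF8.GetBytes(value));

// Usage: read the stream back to confirm the round trip.
using var reader = new StreamReader(GetStringAsStream("{\"id\":1}"));
Console.WriteLine(reader.ReadToEnd()); // prints {"id":1}
```

Written this way, each access to Body returns a fresh stream positioned at the start, so the function under test can read it without the test needing to rewind anything.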

Now with my mocks in place, I could create some nice simple unit tests that would run anywhere.

    // Arrange
    var validateAndStoreSUT = new ValidateAndStoreFunction(
        _mockLoggerFactory.Object, _mockRequestValidator.Object, _mockRequestStore.Object);
    // Act
    var response =
        validateAndStoreSUT.Run(
            new MockHttpRequestData(new { }),
            ExpectedContractId, ExpectedSenderId, ExpectedTenantId);
    // Assert
    response.Should().NotBeNull();
    response.StatusCode.Should().Be(HttpStatusCode.Created);

I don't know if other triggers would be more difficult to mock out, but the ease of mocking HttpRequestData gives me cause for optimism.

Deploying to Azure

My original intention was to finish this post with deploying to Azure from Visual Studio. However, this did not prove as straightforward as I had thought, so I will defer the trials and tribulations to the next post.

Summary

Here are my observations on my first real experience with Azure Functions:

Azure Functions feel more like a traditional application than AWS Lambda functions, with dependency injection and a local development experience.
The two available models, in-process and isolated, complicate initial decisions.
Strong support for the local (F5) experience. This makes it easy to get going and there is support for remote debugging too that I didn't have time to investigate.
The middleware limitations stopped me in my tracks. I was hoping to be able to extend my functions with middleware that would short-circuit the pipeline, but this turned out not to be possible.
Unit testing was straightforward, with support for dependency injection and the framework classes making it easy to mock the inputs to the function.

My First AI-Assisted Project

Andy Blackledge — Mon, 08 Jan 2024 19:12:28 GMT

In this post, I share my experience of learning how to use ChatGPT to help me develop a tool to solve a real-world problem, in a domain where I have limited knowledge.

Every February, under the guise of Andy Balham, I take part in FAWM (February is Album Writing Month). This is an online challenge to write and record 14 songs in 28 days (14.5 in a leap year!). The problem I had was that the resulting MP3 files need to be hosted, but my usual option is not available this year.

So I set myself the challenge of finding a hosting alternative and, to provide a bit more of a challenge, develop a simple HTML interface on which to view and play the resulting songs.

What are my options?

I first started asking ChatGPT about hosting static websites in AWS, as I have experience in that area. However, when I asked ChatGPT the following question I got an interesting suggestion:

How can I create a simple static website that looks good?

The part that caught my eye was 'GitHub Pages' (emphasis mine):

Deploy Your Website:
Once you're satisfied with the look and feel:
Host it on platforms like GitHub Pages, Netlify, Vercel, or any web hosting provider.

I hadn't heard of this, so I prompted ChatGPT for more information:

Can you tell me more about GitHub pages please?

ChatGPT duly responded:

GitHub Pages is a free web hosting service offered by GitHub that allows you to easily publish static websites directly from your GitHub repositories. It's a convenient way to showcase your projects, blogs, documentation, or personal websites to the world. Here are some key points about GitHub Pages:
Free Hosting: GitHub Pages provides free hosting for static websites, which includes HTML, CSS, JavaScript, and other static files. This makes it a cost-effective solution for hosting personal or project websites.
GitHub Integration: GitHub Pages is tightly integrated with GitHub repositories. You can create a dedicated branch (usually named gh-pages or main for user/organization pages) or use the docs folder in your repository to store your website files. When you push changes to this branch or folder, your website is automatically updated.

This sounded ideal, as I already had a GitHub account and the price was right in my sweet spot (free 🙂).

The original chat can be found here.

How can I test it?

Like all good software engineers, my first thought was how to test my static website as I developed it. So I leant on my assistant again:

I want to develop a static website. Whilst I am doing development on my PC, how can I host the website so that I can see instant results when I save changes?

As a regular user of VS Code, the following suggestion jumped out:

Integrated Development Environment (IDE) Servers:
Most modern Integrated Development Environments (IDEs) or code editors come with built-in tools to preview websites. For instance:
Visual Studio Code: Has an extension called Live Server which provides a live preview of your site with hot-reload functionality.

A quick search for the Live Server extension in VS Code and I soon had a way to instantly view rendered HTML. Now I was all set to go into the unfamiliar territory of website development.

The original chat can be found here.

But I'm not an HTML expert

Although I have been in the IT business for quite a while, my focus has pretty much always been on the back end. I have a rudimentary knowledge of HTML, but not much beyond that. What I needed was an expert to help me, so using SudoLang I prompted ChatGPT with the following:

# An expert in writing HTML

Role-play as a world-class HTML writer who is proficient in using the Bootstrap framework to produce attractive and responsive pages.

HTMLExpert {
  /write - Take a description of a page and output HTML to match it
}

With the role set, I asked the following:

/write a page that contains links to an album of MP3 files. The page should contain a placeholder for a description of the album. Along with links to download the MP3 files, there should be the name of each song and a link to a page containing the lyrics for that song.

The result was an HTML page (source) that admirably met my specification:

ChatGPT also added the following, providing helpful assistance in how to use the result:

To use this template:
Replace "Album Name" with the name of your album.
Replace the placeholder text under the ID albumDescription with your album's description.
For each song in your album:
Duplicate the element.
Replace "Song 1" with the song name.
Replace "path/to/song1.mp3" with the correct path or URL to the MP3 file.
Replace "path/to/song1_lyrics.html" with the correct path or URL to the song's lyrics page.

This looked like a great start, so I wondered if it could be improved:

That is great. What would be even better would be a music player so that each song could be played directly from the page.

To which I got the response:

Of course! Let's integrate an HTML5 audio element with controls for each song so that users can play them directly from the page.
Here's an enhanced version of your album page with the music player functionality:

And sure enough, the page (source) now had a music player on it. I supplied an MP3 in the appropriate location and could play it from my browser.

I then decided to see if the process would benefit from a more precise specification. So, as an experiment, I tried using Markdown as a specification format and prompted ChatGPT as follows:

/write

# Album Name

A link to an image for the album cover

A list of songs on the album and for each song:
- The title of the song
- A link to download the MP3
- A toggle button to show and hide the notes for the song, the text for the button changes on toggle
- A toggle button to show and hide the lyrics for the song, the text for the button changes on toggle
- An audio player to allow the MP3 to played from the page

The songs are in a sub-folder call mp3s
The images are in a sub-folder called images
The footer should say that all material is copyrighted to Andy Balham

The result, once again, was pretty good (source):

However, testing the page I found that the toggling didn't quite work. Using my programming knowledge I prompted ChatGPT:

The toggling doesn't work. I think targetElement.classList.contains('show') should be btn.textContent.includes('Hide')).

To which ChatGPT replied:

You're right; the toggle behaviour I provided previously was based on Bootstrap's show class state on the target element, which might not reflect the exact moment when the toggle happens. Let's adjust the toggle behaviour by updating the script section.
Here's the updated script:

And sure enough, an amended script was provided which fixed the issue (source).

I then spent some time experimenting with refining the SudoLang program to style the page in different ways, before I decided that including the lyrics in the main page would be a bit onerous.

So I asked ChatGPT to regenerate the page with the lyrics details of the page definition updated as follows:

For each song:
- A toggle button 'Lyrics' to show and hide the lyrics for the song

Folder information:
- The lyrics are in a sub-folder called lyrics, one file per song

The result was an amendment to the collapsible lyrics section:

<div class="collapse mt-3" id="lyricsSample">
  <div class="card card-body">
    <a href="lyrics/song-title-lyrics.txt">View Lyrics</a>
  </div>
</div>

A link wasn't what I wanted, so I asked for alternatives:

Is there any way the text from the lyrics files could be included without using an href?

Sure enough ChatGPT responded with some options:

To include the lyrics text directly on the page without using a link, you have a couple of options, but each comes with its own considerations and limitations:
Embedding Lyrics Directly in HTML:
Using JavaScript to Load Lyrics:
Server-Side Inclusion:

The JavaScript option looked like what I wanted, so I asked for that:

Please update the HTML with the JavaScript version

As you can see from below, the result (source) was a partial success. The lyrics were loaded on demand, but the rendering left a little to be desired:

So I asked ChatGPT to sort that out.

If the lyrics are in plain text format, can you amend the JavaScript to insert suitable HTML markup to make the lyrics break across lines

The response was an amended JavaScript function, which I then copied and pasted into the page (source). The results are shown below:

The result is not perfect, as blank lines have not been preserved, but it will suffice for my needs. Overall, I was impressed with how I was able to go from nothing to a page that contains the functionality that I wanted. It won't win any design awards, but it would have taken me significantly longer to develop without the AI assistance.
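For illustration, here is a minimal TypeScript sketch of this kind of plain-text-to-HTML conversion. It is not the function ChatGPT produced, and the names lyricsToHtml and escapeHtml are hypothetical. It assumes the lyrics arrive as a single string, turns every line break into a br tag so that blank lines survive, and escapes the characters that HTML would otherwise interpret.

```typescript
// Escape characters that have special meaning in HTML so lyrics are shown literally.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: convert plain-text lyrics into HTML with one <br> per line break.
// Because every line break becomes a <br>, a blank line in the file stays blank on the page.
function lyricsToHtml(lyrics: string): string {
  return lyrics
    .split(/\r?\n/) // handle both Unix and Windows line endings
    .map((line) => escapeHtml(line))
    .join('<br>\n');
}

const sampleLyrics = 'Verse one\nStill verse one\n\nChorus & more';
console.log(lyricsToHtml(sampleLyrics));
```

The returned string could then be assigned to the innerHTML of the collapsible lyrics element once the lyrics file has been loaded.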

The whole chat can be found via this link.

How could I repeat it?

As a final step, I thought I would lean on ChatGPT once more and ask it to write a JavaScript program to merge the HTML template it had created with a YAML file in the following format:

albumName: Aye-Aye?
songs:
  - title: Guinness Greed
    mp3Filename: 01_guinness_greed
    notes: "Prompt: /genre greed /subject Guinness /write must include reference to John"
  - title: Flood's Domain
    mp3Filename: 02_floods_domain
    notes: "Prompt: /genre fear /subject floods /write"

My prompt for this was simply the following:

You are an expert in write Node.js programs. I would like you to write a program that receives a YAML file as input, merges the details of that file with the an HTML template, and then outputs an HTML file containing the merged result.
Here is an example YAML file:
Here is the HTML template:

The output from the resulting program was almost perfect:

The only thing that wasn't right was that the H1 element hadn't been replaced with the album name. ChatGPT had used the replace method and that seemed to only replace the first match, so I asked the following:

How can the program be changed to make replace('Album Name', yamlData.albumName) replace all instances of 'Album Name' and not just the first instance?

To which I got the following response and an updated mergeYAMLWithTemplate function:

To replace all instances of a specific string in JavaScript, you can use a regular expression with the global (g) flag. This way, the replace method will replace all occurrences of the string in the text, not just the first one.
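The difference can be seen in a small TypeScript sketch. The template fragment and the function names replaceFirst and replaceEvery are made up for the example, and this is not the mergeYAMLWithTemplate function from the chat.

```typescript
// Hypothetical template fragment in which the placeholder appears twice.
const templateFragment = '<title>Album Name</title><h1>Album Name</h1>';

// With a string pattern, replace only changes the first match.
function replaceFirst(template: string, albumName: string): string {
  return template.replace('Album Name', albumName);
}

// With a regular expression and the global (g) flag, replace changes every match.
// A replacer function is used so that any '$' in the album name is inserted literally.
function replaceEvery(template: string, albumName: string): string {
  return template.replace(/Album Name/g, () => albumName);
}

console.log(replaceFirst(templateFragment, 'Aye-Aye?'));
// <title>Aye-Aye?</title><h1>Album Name</h1>
console.log(replaceEvery(templateFragment, 'Aye-Aye?'));
// <title>Aye-Aye?</title><h1>Aye-Aye?</h1>
```

On runtimes that support it, String.prototype.replaceAll achieves the same result without a regular expression.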

I pasted in the updated function, ran the program and was presented with the following expected result:

The whole chat can be found via this link and the JavaScript program here.

Summary

It was very interesting to try a new process for developing a solution to a problem. At each stage, it was a significant advantage to be able to ask for options on how to solve the next step. The interaction felt very natural, asking for further details or asking for small changes. The process allowed me to explore very quickly and, although I could have written the final program, it was much quicker to get it generated for me. Interestingly, my solution probably would have been at risk of gold-plating. Taking the simple solution has its advantages.

There is no doubt in my mind that these tools are a significant productivity boost. In the software realm, we can test the results (you do test don't you? 😉). This means we can easily check the output that we are given, which is not the case in other domains.

The resulting website can be found here.

Step Function Errors Should Be Errors

Andy Blackledge — Wed, 29 Nov 2023 18:47:34 GMT

In this post, we shall see the consequences and limitations of using step function errors as flow control.

See the accompanying GitHub repo for working code examples.

TL;DR

Be wary of using errors for flow control in step functions
After throwing and catching an error, only the inputs and the basic error details are available

Our inheritance

We inherited a codebase that contained a step function which was asynchronously invoked from an API Gateway request. The step function performed validation on the request, before proceeding to process the request if the validation was successful. When the validation failed, either due to schema errors or due to the content of the request, an error was thrown by the Lambda function. These errors were caught by the step function and, depending on the name of the error, the step function invoked a DynamoDB integration task to update a table with the result. Users of the API could then use another API call to query the state of the request.

A simplified version is shown below:

Improving things for our users

Users don't always get things right, so it was common for requests to fail. In particular, they would fail for one of the validation reasons. Whilst the underlying reasons for the failures were logged to CloudWatch, it quickly became a pain to keep looking them up whenever a user of the API suffered from a failure.

To provide better feedback for our users, we decided to add the validation errors to the result table. The users could then query this table via the API to understand why their request had failed, and all without bothering us.

The only problem was, at the point the table was being updated, we had no way of accessing the validation failures.

Post-error step function context

The problem was that after an error has been handled by a step function, only the following is available:

The inputs to the step function
The error type

This makes sense, as there can be no guarantee of the state once an error occurs. The inputs never change, so it is safe to access their values, but it is not safe to access anything else apart from the type of error.

This means that it isn't possible to pass structured information into the error handlers, which happened to be exactly what we needed to do if we were to extend the DynamoDB integrations to store the extra details.

Are these really errors?

It is somewhat a matter of opinion, but I would argue that neither validation failure is a true error. I say this, as we would expect both scenarios to occur in normal operations. The use of throwing and catching errors in this case looked to me like it was done as a convenience.

To resolve the issue, the validation Lambda function was updated to add the errors directly to the table. This took advantage of the fact that the relevant item was already present and that the DynamoDB integration would not overwrite any added errors. This allowed us to avoid changing the flow of the step function, but came at the cost of increasing the responsibility of the validation function.

A better approach

I would argue that the validation step should produce a validation result.

interface ValidationResult {
  user?: User;
  formatErrors?: ZodIssue[]; // The errors from using the zod npm package
  contentErrors?: string[];
}

function validateUser(event: any): ValidationResult {
  // ...
}

export const handler = async (
  event: Record<string, any>
): Promise<ValidationResult> => {
  return validateUser(event);
};

With this, we can then rework the step function with an additional choice step, as shown below:

Now the DynamoDB integrations can access the relevant details. For example, for format errors:

{
  "Key": {
    "key": {
      "S.$": "$$.Execution.Input.requestId"
    }
  },
  "TableName": "",
  "ExpressionAttributeNames": {
    "#status": "status",
    "#formatErrors": "formatErrors"
  },
  "ExpressionAttributeValues": {
    ":status": {
      "S": "InvalidFormat"
    },
    ":formatErrors": {
      "S.$": "States.JsonToString($.validationResult.formatErrors)"
    }
  },
  "UpdateExpression": "SET #status = :status, #formatErrors = :formatErrors"
}

With this approach, the validation step is pure validation and not given a secondary responsibility of storing any validation errors.
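As a rough sketch of the additional choice step, the definition might look something like the following, written here as a TypeScript object that serialises to Amazon States Language. The state names (StoreFormatErrors, StoreContentErrors, ProcessRequest) are hypothetical, and it assumes the validation output is stored under $.validationResult, as the DynamoDB parameters above suggest.

```typescript
// Minimal types covering only the parts of a Choice state used in this sketch.
interface ChoiceRule {
  Variable: string;
  IsPresent: boolean;
  Next: string;
}

interface ChoiceState {
  Type: 'Choice';
  Choices: ChoiceRule[];
  Default: string;
}

// Route on which optional property of the validation result is present.
// All state names here are hypothetical.
const checkValidationResult: ChoiceState = {
  Type: 'Choice',
  Choices: [
    {
      Variable: '$.validationResult.formatErrors',
      IsPresent: true,
      Next: 'StoreFormatErrors',
    },
    {
      Variable: '$.validationResult.contentErrors',
      IsPresent: true,
      Next: 'StoreContentErrors',
    },
  ],
  // No errors present, so carry on with normal processing.
  Default: 'ProcessRequest',
};

console.log(JSON.stringify(checkValidationResult, null, 2));
```

Because the flow branches on data rather than on a caught error, the full validation result is still in the state when the DynamoDB tasks run.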

General discussion

Consider the following C# method:

MyEntity FindMyEntity(string id);

What should the method return if there is no such instance with a matching id? Should it return null or throw an exception?

I would say it depends on the caller. If the caller cannot continue without the entity, throw an exception. If it can continue, return null. This could be catered for with the following:

MyEntity FindMyEntity(string id, bool throwExceptionIfNotFound = false);

Now the caller can control the behaviour as they need.

You could go further and return a result object.

FindResult FindMyEntity(string id);

You can then explicitly check for the 'not found' scenario.

if (entityFindResult.NotFound)
    // ...
else
    // Use entityFindResult.Value
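To illustrate the result object idea, here is a minimal sketch in TypeScript rather than C#. The MyEntity fields, the in-memory entities map, and the describeEntity function are all assumptions made for the example.

```typescript
interface MyEntity {
  id: string;
  name: string;
}

// A result is either "found" with a value, or "not found" with no value to misuse.
type FindResult<T> = { found: true; value: T } | { found: false };

// Hypothetical in-memory store standing in for a real repository.
const entities = new Map<string, MyEntity>([
  ['42', { id: '42', name: 'Example' }],
]);

function findMyEntity(id: string): FindResult<MyEntity> {
  const entity = entities.get(id);
  return entity ? { found: true, value: entity } : { found: false };
}

function describeEntity(id: string): string {
  const result = findMyEntity(id);
  // The compiler only allows access to result.value once 'found' has been checked.
  return result.found
    ? `Found ${result.value.name}`
    : `No entity with id ${id}`;
}

console.log(describeEntity('42')); // Found Example
console.log(describeEntity('99')); // No entity with id 99
```

The caller has to handle the 'not found' case explicitly, without any exception being thrown as part of normal processing.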

Summary

I would advise against using errors as a replacement for step function flow control using choice steps. Once the error is thrown, the current state is lost and only the error type is then available downstream.

In general, I would advise against throwing errors/exceptions as part of 'normal' processing. That is, unless the idioms of your preferred language recommend it. I have found that there are usually alternatives that make your code cleaner and more extensible.

My first steps with the future of programming

Andy Blackledge — Sun, 22 Oct 2023 08:57:45 GMT

40 years ago I saw someone try to program with natural language. Now that has become a reality with large language models (LLMs) and SudoLang. My first question was, can we use these tools to generate comedy gold?

40 years ago

It was 1982 and I was attending the school computer club. We were taking turns to program the 1K ZX81. One member took their turn at the 'keyboard' and attempted to type something along the lines of...

10 IF THE PLAYER PRESSES LEFT THEN MOVE THE SPACESHIP LEFT

The result was the following...

10 INPUT F THE PLAYER PRESSES LEFT THEN MOVE THE SPACESHIP LEFT

Apart from the interesting fact that variables on the ZX81 could have spaces in them, the computer was clearly not going to do what was being asked. How foolish they were, my 13yo self thought smugly to himself.

Fast-forward 40 years and this sort of programming is now possible, thanks to LLMs such as ChatGPT and the insight of Eric Elliott and his work developing SudoLang.

Encountering SudoLang

I was listening to the excellent .NET Rocks! podcast, when they mentioned SudoLang. When I mentioned to my co-workers that SudoLang was being developed by Eric Elliott, I was told in no uncertain terms what a very smart fellow he was. So, with my interest sufficiently piqued, I got a ChatGPT Plus subscription, followed the steps in the AI Programming for Absolute Beginners article, and my mind was blown.

To quote from the SudoLang GitHub repo:

SudoLang is a programming language designed to collaborate with AI language models including ChatGPT, Bing Chat, Anthropic Claude, and Google Bard. It is designed to be easy to learn and use. It is also very expressive and powerful.
All sufficiently advanced language models understand it without any special prompting. You do not need to paste the SudoLang specification before using SudoLang with your favorite AI.

Programming with SudoLang

The question was what sort of program I should write. How about one that generates some comedy names in the spirit of Key and Peele and their East vs West sketches?

SudoLang programs are written in Markdown and start with the role that provides the context. Think of it like briefing an actor. In our case, an actor that does the following.

# East vs West Player Generator

Generate a plausible, but comedic sounding name for an American football player.

We then specify the state of the application. The LLM understands that the curly braces delimit the various components, giving them scope. In SudoLang you lean heavily on what the LLM can infer, so we don't need to specify types where they can be inferred. Where they can't, we provide a hint in a comment such as East|West.

NameGenerator {
  State {
    Name
    CollegeOrSchoolOrPrison
    Team // East|West
    Position
  }
}

A key concept in SudoLang is programming by constraints, so now we add some to our program.

NameGenerator {
  State {
    ...
  }
  Constraints {
    The name must consist of at least a first name and a surname.
    The name must not be the name of a real player.
    The name must sound amusing, but must be able to be pronounced.
  }
}

Like the pupil 40 years ago, all we do is express our intention in natural language. In SudoLang, natural language is preferable, as that is what LLMs understand.

Finally, we add a couple of commands and a function to greet the user. Again, we use a convention for commands that the LLM will be familiar with (i.e. /command).

NameGenerator {
  State {
    ...
  }
  Constraints {
    ...
  }
  /next Generate the next name
  /help Provide help
}

welcome("Welcome to the name generator. Type /next to get the next name")

And now our program is ready to run or, if you prefer, our actor is ready to perform.

Pasting it into ChatGPT GPT-4, we get the following response:

Welcome to the name generator. Type /next to get the next name. If you need assistance or further information, type /help. Enjoy!

Note here that it has given us the greeting we programmed, but our actor has gone a little further and added that assistance is also available.

Let's try a name, so typing /name returns:

I'm sorry, the provided command "/name" is not recognized. Please use one of the supported commands such as "/next" or "/help". Would you like to proceed with "/next" to generate a name?

Oops, I forgot my own commands. However, the program gave a helpful response without us having to explicitly program for that eventuality. Typing /next returns:

I apologize for the confusion. Let me generate a name for you now.

NameGenerator:
  State:
    Name: "Dexter 'Dizzy Dukes' Duckworth"
    CollegeOrSchoolOrPrison: "Puddle Jumpers Academy for Overachievers"
    Team: "East"
    Position: "Running Back"

Our actor is almost British here, apologising for others' mistakes. However, it has successfully done what we wanted and generated a mildly amusing name, plus a manufactured college of sorts. Not exactly comedy gold, but not too bad.

Refining the program

To refine our generator, let's add another line to the context to try to get names more like Key and Peele.

# East vs West Player Generator

...
The name can have a middle name or initial and can have a suffix, such as III.

We'll also add a height and weight to the state. We will leave it to the LLM to work out suitable units.

State {
  ...
  Weight
  Height
}

The colleges in the Key and Peele sketch are real ones, so let's add a constraint to ensure that ours are too. While we are at it, we add a constraint to make sure our offensive linemen are suitably hefty.

Constraints {
  ...
  The college, school, or prison must be real.
  The weight and height must be consistent with their position in the team
}

Finally, we add a /refine command and amend the /next to output YAML.

/next Generate the next name and output the result in YAML
...
/refine Generate another name based on the last, but with more exaggeration

OK. When we paste in the program we get:

Welcome to the name generator. Type /next to get the next name. For assistance or further options, type /help. Let's create some memorable players!

Once more, our actor has taken some liberties with the script, but the gist is spot on. Let's try out the /next command.

NameGenerator:
  State:
    Name: "Thaddeus T. 'Thunder Thighs' Thompson III"
    CollegeOrSchoolOrPrison: "University of Alabama"
    Team: "West"
    Position: "Defensive Lineman"
    Weight: "310 lbs"
    Height: "6'4''"

Success. We have our output in YAML, the college is real, and the height and weight are position-appropriate. Let's try /refine.

NameGenerator:
  State:
    Name: "Thaddeus T. 'Titanium Tackle' Thompson IV"
    CollegeOrSchoolOrPrison: "University of Alabama"
    Team: "West"
    Position: "Defensive Lineman"
    Weight: "325 lbs"
    Height: "6'5''"

OK. Not the greatest exaggeration perhaps, but the program did do what we asked.

Let's try one final revision of our program. First let's give each player a catchphrase.

# East vs West Player Generator

...
Each player has a catchphrase based on their background, position,
  and physical stature.

NameGenerator {
  State {
    ...
    Catchphrase
  }
  Constraints {
    ...
    The weight and height must be consistent with their position in the team,
      but can be exaggerated for comic purposes.
    The catchphrase must not be based on any stereotypes.
  }
  ...
  /opposite Generate another name based on the last,
    but with opposite characteristics
}

overTheTopWelcomeInTheStyleOfAFootballCoach()

We have also loosened the height and weight constraint to try to get more extreme values and added a new command to generate an opposite character. The welcome was good before, but it could be more in keeping with the subject matter. So let's see if the amended version can gee us up a bit.

Alright, team! Gather 'round! 🏈 This ain't just any old name generator! We're here to scout out the funniest, quirkiest, and downright unforgettable players! You want comedy? We got it! You want star power? We got it! Strap on those helmets, get in formation, and type "/next" to discover the next legendary player! Remember, we play hard, we play fair, but most importantly, we play for laughs! LET'S GO TEAM! 📣🎉

Yes, that certainly had an effect. Let's do what coach says and type /next.

NameGenerator:
  State:
    Name: "Fitzgerald F. 'Flashy Footwork' Fiddlesticks"
    CollegeOrSchoolOrPrison: "Harvard University"
    Team: "East"
    Position: "Wide Receiver"
    Weight: "185 lbs"
    Height: "6'2''"
    Catchphrase: "You can't catch what you can't see!"

Not bad, let's try the /opposite.

NameGenerator:
  State:
    Name: "Barnaby B. 'Boulder Boot' Bumblebee"
    CollegeOrSchoolOrPrison: "Stanford University"
    Team: "West"
    Position: "Defensive Tackle"
    Weight: "320 lbs"
    Height: "5'10''"
    Catchphrase: "I'm the mountain you can't move!"

Now let's go off-piste and just ask Please generate a punter from a prison.

NameGenerator:
  State:
    Name: "Percival P. 'Penitentiary Punt' Peterson"
    CollegeOrSchoolOrPrison: "San Quentin State Prison"
    Team: "West"
    Position: "Punter"
    Weight: "195 lbs"
    Height: "6'0''"
    Catchphrase: "From the yard to the 50-yard line!"

This highlights that our programs can do more than we explicitly specify. So let's push that a bit further.

Using the LLM to improve our program

Now that the LLM is very familiar with the context we are dealing with, let's see if it can suggest ways we can improve the program. Typing Please suggest some possible commands to help refine the comedy potential of the characters gets us the following.

Certainly! Here are some potential commands to help refine the comedy potential:
/exaggerate: Amp up the comedy by making a character's feature or trait more extreme.
/mashup: Combine elements of two or more previously generated characters.
/nickname: Generate a unique and funny nickname based on the character's existing traits.
/backstory: Provide a short, humorous backstory for the character that ties into their attributes or catchphrase.
/random_trait: Add a random and unexpected trait or hobby that contrasts with the character's football position or personality for comedic effect.
/swap: Swap one trait or attribute for its comedic opposite.
...and another six
These commands would help add layers of comedy and depth to each character, making them more memorable and entertaining.

Not bad suggestions, let's add them to our program and try out /backstory for 'Bubba "Biscuit" McGillicuddy IV'.

Bubba "Biscuit" McGillicuddy IV hails from the small town of Butterville, a place famous for its annual Butter Festival. Born into the McGillicuddy family, owners of the renowned "McGillicuddy's Marvelous Muffins & More" bakery, Bubba was always the center of attention.
Growing up, while other kids played with toys, young Bubba played with dough. In fact, his nickname "Biscuit" was coined when, at the age of seven, he mistakenly wore a biscuit mold as a helmet and ran head-first into a wall, leaving a biscuit-shaped indentation. The name stuck, and so did his love for the sport.
...and so on.

This got some genuine laughs when I read it out, so I consider our quest for humour successful.

Summary

SudoLang has changed the way I look at interacting with LLMs such as ChatGPT. The genius of SudoLang is that it takes advantage of the conventions that the LLMs have learnt, such as /commands, interfaces, and { scoping }. I now approach each interaction with ChatGPT as if I am approaching an expert who I need to brief on what I want, what I don't want, and how I want it. SudoLang gives me a way to express this in a structured way. Once the expert has some experience of what I am after, I can even ask the expert how I can improve on what I am asking for.

I feel the same excitement of possibilities as I did when I got my own ZX81 40 years ago. But now, of course, I can use natural language.

Adventures with AWS X-Ray and CDK - Part 3

Andy Blackledge — Sun, 24 Sep 2023 18:38:36 GMT

In this post I continue my adventures with X-Ray and try my hand at observing a whole application. In the previous posts in the series, I looked at using X-Ray in a small context. Here we will see what happens when an end-to-end process is traced and logged.

You can download the accompanying code and try it yourself from the GitHub repo.

The example application

The case study we looked at in the series on implementing Enterprise Integration patterns is an application that acts as a loan broker. The application in question receives a request containing the details of the loan required via an API, and then returns the best rate to a webhook.

The following diagram shows how we use a central EventBridge event bus to implement this.

View the service map

In the last post, I walked through how I added X-Ray to the whole application. Now when I run some requests through the API, we see the following service map.

What is quite clear from this picture is that events are at the heart of this application.

Now, by clicking on the client, we can trace a request all the way through the application and out to the webhook.

However, when I tried this, I found something that was hindering the observability. Ironically, it was observability that I added in the post on Domain Observability.

Removing observability from tracing

In that post, I added business-level observability by hooking a Lambda function up to the domain events being raised.

However, as I had enabled tracing for this Lambda function, the trace included numerous entries for the observers which clouded the picture of the process.

When viewing the log, this was further apparent.

The solution is to specify Tracing.DISABLED for the observability Lambda functions. However, as I still wanted the traces when testing the Lambda functions, I added an isTestMode property to the observability CDK stack as follows.

const loggerFunction = new NodejsFunction(
  this,
  'Logger',
  getNodejsFunctionProps({
    // We don't want the observing in the trace for production
    tracing: props.isTestMode ? Tracing.ACTIVE : Tracing.DISABLED,
    // ...
  })
);

Now production traces are clean, but we can also take advantage of X-Ray when testing the functionality.

Adding custom subsegments

The article Generating custom subsegments with the X-Ray SDK for Node.js describes subsegments as follows.

Subsegments extend a trace's segment with details about work done in order to serve a request. Each time you make a call with an instrumented client, the X-Ray SDK records the information generated in a subsegment. You can create additional subsegments to group other subsegments, to measure the performance of a section of code, or to record annotations and metadata.

In our application, we have Lambda functions that simulate responses from lender systems. At the moment, this is just an algorithm, but in practice it would be a call that takes time and is prone to error. This would be an ideal call to surround with a custom subsegment.

With this in mind, I added the following code to allow the lender configuration to control the delay in responding and whether an error occurred.

const simulateExternalCallAsync = async (
  lenderConfig: LenderConfig
): Promise<void> => {
  const randomPercentage = randomInt(100);
  const errorPercentage = lenderConfig.errorPercentage ?? 0;
  const throwError = randomPercentage <= errorPercentage;
  const delayMillis = lenderConfig.minDelayMillis ?? 1000 + randomInt(1000);
  await new Promise((resolve) => setTimeout(resolve, delayMillis));
  if (throwError) {
    const errorMessage = `Simulated error (${randomPercentage} <= ${errorPercentage})`;
    throw new Error(errorMessage);
  }
};
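
One detail of the delayMillis expression is worth spelling out: in JavaScript and TypeScript, the ?? operator binds less tightly than +, so the random component is only added when minDelayMillis is not set. The following minimal sketch shows the two possible readings side by side. The function names are hypothetical and are not part of the example application.

```typescript
// Hypothetical helpers for illustration only; `jitter` stands in for randomInt(1000).

// Same shape as the expression above: parsed as minDelayMillis ?? (1000 + jitter)
const delayAsWritten = (
  minDelayMillis: number | undefined,
  jitter: number
): number => minDelayMillis ?? 1000 + jitter;

// Alternative reading: default the minimum first, then always add the jitter
const delayWithJitterAlways = (
  minDelayMillis: number | undefined,
  jitter: number
): number => (minDelayMillis ?? 1000) + jitter;
```

With a configured minimum of 500 and a jitter of 250, the first form returns 500 and the second returns 750. Which behaviour is wanted depends on whether a configured delay should be fixed or just a lower bound.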

With this in place, I added the code below around the call to simulateExternalCallAsync to add and close the subsegment.

import * as AWSXRay from 'aws-xray-sdk';
// ...
const segment = AWSXRay.getSegment();
const subsegment = segment?.addNewSubsegment('External Call');
try {
  // Simple values that are indexed for filter expressions
  subsegment?.addAnnotation('callType', 'Lender');
  subsegment?.addAnnotation('lenderId', lenderConfig.lenderId);
  // Related data for debugging purposes
  subsegment?.addMetadata('lenderDetails', {
    lenderId: lenderConfig.lenderId,
    lenderName: lenderConfig.lenderName,
    lenderUrl: `https://${lenderConfig.lenderId}.com`,
  });
  await simulateExternalCallAsync(lenderConfig);
} catch (error) {
  if (error instanceof Error) {
    // Add error to the subsegment
    subsegment?.addError(error);
  }
  throw error;
} finally {
  // Ensure the subsegment is closed
  subsegment?.close();
}
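
If several calls needed the same treatment, the annotate, try, record error, and close pattern above could be pulled into a small helper. The following is only a sketch under assumptions, not code from the repo: SubsegmentLike and withSubsegmentAsync are hypothetical names, and the interface mirrors just the subsegment methods used above so that the example runs without the X-Ray SDK.

```typescript
// Hypothetical minimal shape, mirroring only the subsegment methods used above
interface SubsegmentLike {
  addAnnotation(key: string, value: string | number | boolean): void;
  addError(error: Error): void;
  close(): void;
}

// Runs `work` inside the given subsegment, recording any error and always closing it
const withSubsegmentAsync = async <T>(
  subsegment: SubsegmentLike | undefined,
  annotations: Record<string, string | number | boolean>,
  work: () => Promise<T>
): Promise<T> => {
  try {
    for (const [key, value] of Object.entries(annotations)) {
      subsegment?.addAnnotation(key, value);
    }
    return await work();
  } catch (error) {
    if (error instanceof Error) {
      // Add error to the subsegment
      subsegment?.addError(error);
    }
    throw error;
  } finally {
    // Ensure the subsegment is closed, whether or not the work succeeded
    subsegment?.close();
  }
};
```

Passing undefined as the subsegment keeps the helper usable when there is no active segment, which matches the optional chaining used in the original code.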

I redeployed the lenders, with the configuration set to introduce delays but not throw any errors. After running a request, I could see the following in the trace.

So we can now see the time taken for our 'external call'. What we can also see is the annotation and metadata that we added to the subsegment. Annotations are key-value pairs with simple data (strings, numbers, or booleans) that are indexed for use with filter expressions, whilst metadata can be any related data you'd like to store that's not indexed.

Clicking on the 'Annotations' tab of the segment details, we can see what type of call it was and which lender was called.

For more data, clicking on 'Metadata' shows the lender name and the URL being 'called'.
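
The distinction between annotations and metadata described above can be expressed in code. The sketch below is a hypothetical helper, not something from the X-Ray SDK or the repo. It sorts a set of details into simple values that qualify as annotations (strings, numbers, or booleans) and everything else, which would have to be stored as metadata.

```typescript
// Annotations only accept simple values
type AnnotationValue = string | number | boolean;

const isAnnotationValue = (value: unknown): value is AnnotationValue =>
  typeof value === 'string' ||
  typeof value === 'number' ||
  typeof value === 'boolean';

// Hypothetical helper: simple values become annotations, the rest becomes metadata
const splitForTracing = (
  details: Record<string, unknown>
): {
  annotations: Record<string, AnnotationValue>;
  metadata: Record<string, unknown>;
} => {
  const annotations: Record<string, AnnotationValue> = {};
  const metadata: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(details)) {
    if (isAnnotationValue(value)) {
      annotations[key] = value;
    } else {
      metadata[key] = value;
    }
  }
  return { annotations, metadata };
};
```

In practice you would still choose deliberately which simple values to index, as only annotations can be used in filter expressions.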

Forcing some errors

After having a look at what a happy system looks like, I decided to introduce some errors. I updated the configuration for one of the lenders so that it would always throw an error and redeployed.

After running a request through the system, the service map clearly showed that the Lambda function for the lender has errors.

Looking at the trace, I could see that the asynchronous request from the step function had timed out, as the timeout was set to 12 seconds.

The trace also contained the custom subsegment, which clearly shows that there was an error making the external call.

As before, the 'Annotations' and 'Metadata' tabs showed the call details. However, as we added the error to the subsegment, the 'Exceptions' tab also shows us the error details, including the stack trace.

This shows the power of using subsegments to instrument key parts of the application.

Running a workload

The final part of my adventure with X-Ray involved configuring the Lambda functions for the lenders to randomly error and then to put multiple requests through the application.

Looking at the resulting service map, I could see some errors I was expecting and some I didn't.

In particular, the step function was showing that it had errored. So I selected it and filtered the traces to see what was going on. On inspection, the Lambda function that sends the response to the webhook was erroring. Clicking on the 'Exceptions' tab clearly indicated to me that the issue was that there was no response to send in this case, but the code didn't cater for it.

Looking at another error, I saw that the Lambda function that looks up the lenders from the parameter store is throwing an error. Again, the 'Exceptions' tab shows the underlying reason. In this case, there is a rate limit on accessing the parameters in the parameter store. This indicates to me that perhaps we need the application to implement some sort of cache in front of the raw access.
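
As a rough idea of what such a cache might look like, here is a minimal sketch of a time-based cache in front of an asynchronous loader. It is an assumption about one possible approach, not code from the application: cacheAsync is a hypothetical name, and the loader would be whatever function reads the lender parameters.

```typescript
// Hypothetical helper: caches the result of an async loader for `ttlMillis`
const cacheAsync = <T>(
  loadAsync: () => Promise<T>,
  ttlMillis: number,
  now: () => number = Date.now // injectable clock, to make testing easier
): (() => Promise<T>) => {
  let cached: { value: T; expiresAt: number } | undefined;
  return async () => {
    // Serve from the cache while the entry is still fresh
    if (cached !== undefined && now() < cached.expiresAt) {
      return cached.value;
    }
    // Otherwise go to the underlying store and remember the result
    const value = await loadAsync();
    cached = { value, expiresAt: now() + ttlMillis };
    return value;
  };
};
```

Declared outside the Lambda handler, a cache like this would survive between warm invocations. Note that this simple version does not coalesce concurrent calls, so two simultaneous callers could both hit the underlying store.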

Summary

In this post, I documented my experience using X-Ray with an application. I found the ability to view individual traces through the application, along with the associated logs, very valuable indeed. The ability to see errors and then drill down to the causes was also very valuable and allowed me to see some behaviour that otherwise would have been tricky to spot. Overall, I was very impressed as to what X-Ray has to offer.

Of course, there are also numerous excellent third-party offerings, such as the following:

So, although you get X-Ray out of the box with AWS, please consider these as well.

Adventures with AWS X-Ray and CDK - Part 2

Andy Blackledge — Tue, 29 Aug 2023 07:01:19 GMT

In the first part of this series, I added AWS X-Ray to a set of examples for my CDK Cloud Test Kit. In this part, I look at adding it to an example application that I put together for my series on implementing Enterprise Integration patterns. Let's see what adventures I have.

The example application

The following diagram shows how we use a central EventBridge event bus to implement this.

Adding X-Ray

Adding X-Ray to the application involved the following steps.

Enable tracing via CDK
- Lambda functions
- Step functions
Wrap SDK clients
- EventBridge

Enabling tracing on the Lambda functions was as straightforward as adding the following line to the default properties for all Lambda functions in the application.

export const NODE_DEFAULT_PROPS = {
  // ...
  tracing: Tracing.ACTIVE,
};

The only place I chose to override this default behaviour was in the API handler. Here I used Tracing.PASS_THROUGH, so that it would adhere to the upstream sampling set in the API. See the StackOverflow post What is Active tracing mean in lambda with Xray? for a good explanation of what the tracing levels mean.

The application only uses one step function and so it was amended directly as follows.

this.stateMachine = new StateMachine(this, 'StateMachine', {
  tracingEnabled: true,
  // ...
});

The final step was to wrap all the SDK clients that make requests via passive services, such as EventBridge or SQS.

As was shown in the diagram above, all communication in the application is done through EventBridge. In fact, all Lambda functions use the same putDomainEventAsync method to send domain events.

export const putDomainEventAsync = async <T extends Record<string, any>>({
  eventBusName,
  domainEvent,
}: {
  eventBusName?: string;
  domainEvent: DomainEvent<T>;
}): Promise<void> => {
  // ...
};

The upshot of this is that there was only one place to wrap the EventBridge SDK client:

const eventBridge = AWSXRay.captureAWSv3Client(new EventBridge({}));

And with this, I had added X-Ray to the whole application.

Step Functions and EventBridge

One thing that I was aware of was that the step function uses the EventBridgePutEvents direct integration, as highlighted below.

I was asking myself if this call would be traced by X-Ray. To test if this is the case, I decided to run one of the unit tests that executes the step function.

The unit test

The unit test differs from a more traditional unit test in that it exercises the step function in the cloud as part of an ephemeral, test-specific CDK stack. This approach allows the Lambda functions used by the step function to be swapped for test-specific implementations. This allows us to script responses for those functions and so test all the routes through the step function. For an in-depth look at the approach, please see my post Step Function integration testing with CDK.

The first test I tried initiates the step function by publishing a quoteSubmitted event. It then waits for a Lambda function, acting as a test observer, to receive a quoteProcessedEvent.

// Act
await putDomainEventAsync({
  eventBusName: loanBrokerEventBus.eventBusArn,
  domainEvent: quoteSubmitted,
});
// Await
const { observations: quoteProcessedObservations, timedOut } =
  await testClient.pollTestAsync({
    filterById: LoanBrokerTestStack.QuoteProcessedObserverId,
    until: async (o) => o.length > 0,
    timeoutSeconds: 30,
  });
// Assert
expect(timedOut).toBeFalsy();

However, when I ran the test, I found that it was timing out waiting for the expected event.

Why is my test failing?

As I was now using X-Ray, I had a look at the service map.

On the up side, I could see that the EventBridgePutEvents step function task does allow events to be traced through EventBridge. On the down side, I could see that there were failures in three places, the step function and two Lambda functions.

Looking at the trace for the step function, I could see that the 'RequestCreditReport' task was failing just after 6 seconds. The timeout for this task was set to 6 seconds, so this looked like the task was probably timing out. That would certainly explain why the expected event was not being published.

I could also see that the Lambda function that provides mock credit references was failing. This would explain why the step function task was timing out, as the step function task was never receiving an event back.

The invocation duration of 2.99s also looked like a timeout, as the timeout for Lambda functions was set to the default of 3 seconds. The console allowed me to quickly dive into the logs and confirm that that was the case.

Looking at the trace for the other failure, I could see that three attempts were made to the Lambda function that handles callbacks to the step function.

Again, I was easily able to navigate to the logs and see the reason.

I could see that the Lambda function is being invoked as the result of an EventBridge rule. The Lambda function is then trying to restart the step function, but the step function to be restarted has already finished due to the task timeout. The result is an error, which then causes EventBridge to retry.

The solution

The solution to the timeouts was quite simply to double the memory of the highlighted Lambda functions to 256 MB and double the timeout to 6 seconds. With these changes in place and deployed, the unit test ran successfully and the resulting service map reflected the clean run.

This service map clearly shows how EventBridge is at the heart of our application. When I selected a successful trace, I could see all the relevant logs in one place.

Summary

Although I started this post with the intention of diving into the traces of a full application, that will have to wait until the next post.

This short exercise with the unit test shows how you can use the service map and traces to see where issues are. The integration with the logs then allows you to drill down and see the underlying reasons. In particular, X-Ray gave visibility to the asynchronous event-driven behaviour and the behaviour under failure conditions.

I am looking forward to seeing what tracing through the application will bring.

Addendum

The unit tests used a wrapped SDK client to publish events. This caused the following error to appear in the console.

console.error
  2023-08-19 14:28:41.709 +01:00 [ERROR] Error: Failed to get the current sub/segment from the context.
      at Object.contextMissingLogError [as contextMissing] (D:\Users\andyb\Documents\github\blog-enterprise-integration\node_modules\aws-xray-sdk-core\dist\lib\context_utils.js:22:27)
      at Object.<anonymous> (D:\Users\andyb\Documents\github\blog-enterprise-integration\tests\loan-broker\loan-broker.test.ts:178:5)

As mentioned in Configuring the X-Ray SDK for Node.js, the solution was to add AWS_XRAY_CONTEXT_MISSING=IGNORE_ERROR to my .env file.

Adventures with AWS X-Ray and CDK - Part 1

Andy Blackledge — Tue, 08 Aug 2023 18:13:42 GMT

One of the big challenges of distributed systems, such as serverless, is observability. In previous posts, I had a look at structured logging, metrics, and alarms. All these play an important part in building a maintainable system. One thing that I haven't covered is distributed tracing, as I had had no experience with it. So this is a record of my first steps using X-Ray, the AWS offering you get included with your AWS account.

Please be aware there are numerous excellent third-party offerings, such as the following:

In order to dip my toe into the world of X-Ray, I decided to instrument the examples I created for my CDK Cloud Test Kit npm package. These examples use a variety of services, e.g. SNS, SQS, EventBridge, etc., so they provide a good starting point to learn.

API Gateway, Lambda, and EventBridge

This example consists of an API Gateway backed by a Lambda function. The Lambda function does the following:

Generates a unique reference
Stores the body of the request in S3, under the reference generated
Creates a pre-signed URL, allowing access to the S3 object created
Publishes an event to EventBridge containing the reference and the pre-signed URL

This gives us a nice example to see if we can trace requests from API Gateway, through a Lambda function, through EventBridge, to another Lambda function.

Adding X-Ray to our CDK code

The RequestApi construct contains the RestApi instance and the Lambda function it calls.

For the RestApi, we set the deployOptions property tracingEnabled to be true:

this.api = new RestApi(this, 'RequestApi', {
  // ...
  deployOptions: {
    tracingEnabled: true,
    // ...
  },
});

In the corresponding Lambda function, we set tracing to be PASS_THROUGH.

const eventPublisherFunction = new NodejsFunction(this, 'EventPublisher', {
  // ...
  tracing: Tracing.PASS_THROUGH,
});

The options for tracing are:

ACTIVE: Lambda will respect any tracing header it receives from an upstream service. If no tracing header is received, Lambda will sample the request based on a fixed rate.
PASS_THROUGH: Lambda will only trace the request from an upstream service if it contains a tracing header with "sampled=1"
DISABLED

To quote the StackOverflow post What is Active tracing mean in lambda with Xray?:

AWS Lambda supports both active and passive instrumentation. So basically you use passive instrumentation if your function handles requests that have been sampled by some other service (e.g. API gateway). In contrast, if your function gets "raw" un-sampled requests, you should use active instrumentation, so that the sampling takes place.

In our case, as the Lambda function is called from API Gateway, we can set the value to PASS_THROUGH.
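
To make the difference between the modes concrete, here is a simplified model of the decision as it is described above. It is a sketch for illustration, not AWS's implementation: the function names are hypothetical, the header value follows the X-Amzn-Trace-Id format, and sampleLocally stands in for whatever sampling Lambda applies when no upstream decision exists.

```typescript
type TracingMode = 'ACTIVE' | 'PASS_THROUGH' | 'DISABLED';

// Extracts the Sampled flag from a trace header value, if there is one
const parseSampled = (traceHeader?: string): boolean | undefined => {
  const match = traceHeader?.match(/(?:^|;)\s*Sampled=([01])/i);
  return match ? match[1] === '1' : undefined;
};

// Simplified model of whether a request ends up being traced
const shouldTrace = (
  mode: TracingMode,
  traceHeader: string | undefined,
  sampleLocally: () => boolean
): boolean => {
  if (mode === 'DISABLED') {
    return false;
  }
  // Both remaining modes respect an upstream sampling decision
  const sampled = parseSampled(traceHeader);
  if (sampled !== undefined) {
    return sampled;
  }
  // Only ACTIVE makes its own decision when nothing upstream has sampled
  return mode === 'ACTIVE' ? sampleLocally() : false;
};
```

Under this model, a function behind API Gateway can use PASS_THROUGH because the header arrives already sampled, whilst a function that receives 'raw' requests needs ACTIVE to be traced at all.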

However, in the RequestApiTestStack stack, we have a Lambda function that is triggered by EventBridge. Although EventBridge integrates with X-Ray, it does so passively. This means that the Lambda function receiving the event needs to have its tracing set to ACTIVE.

this.addTestFunction(
  new NodejsFunction(this, RequestApiTestStack.EventObserverId, {
    // ...
    tracing: Tracing.ACTIVE,
  })
);

After deploying to AWS and running a few requests, the X-Ray service map view shows the following:

We can see API Gateway invoking the associated Lambda function, and we can also see the observer Lambda function being invoked. However, we are clearly missing something that links them all together.

Adding X-Ray to our Lambda code

To get our trace linked up, we need to wrap the EventBridgeClient instance with the appropriate middleware to inject the context into the calls to publish events. This is done in the RequestApi.EventPublisher function as follows:

import { EventBridgeClient } from '@aws-sdk/client-eventbridge';
import * as AWSXRay from 'aws-xray-sdk';

const eventBridgeClient = AWSXRay.captureAWSv3Client(new EventBridgeClient({}));

Now when we view X-Ray, we can see the trace from API Gateway to the handler Lambda function, through EventBridge, and finally to our observer Lambda function.

You may notice that there are two circles in the map for each Lambda function. In the video How to do Distributed tracing in AWS? | AWS X-ray and Cloudwatch Service Lens, Marcia Villalba explains that there is one for the Lambda runtime and one for the handler code.

Viewing the traces

Now we have everything joined up, we can start looking at some traces to give ourselves an idea of what X-Ray can do for us. Looking at the trace list, one entry jumps out as being considerably slower than the rest.

Clicking on this, we can see the following trace. We can also see the reason why it is slower, as what we can see is a cold start in action.

If we look at a quicker trace, we can see a warmed-up Lambda function.

This small example gives a flavour of the insight that these traces can provide.

The CDK Cloud Test Kit also contains a couple of examples using SNS and SQS. This gives us a chance to add X-Ray to those and see what happens.

For the SNS example, we wrap the SNSClient in the X-Ray middleware.

const sns = AWSXRay.captureAWSv3Client(new SNSClient({}));

Running the tests, we see the following service map. This clearly shows the structure of the application, where one Lambda function publishes events to one of two topics and two Lambda functions subscribe as observers.

What I also noted was that the relative sizes indicate the weight of traffic through the system. This could be useful to see if the flow within your application is as you would expect.

For SQS, we again wrap the client as we have done before.

const sqs = AWSXRay.captureAWSv3Client(new SQSClient({}));

Now when we run the tests, we see the following in X-Ray.

The interesting thing here is that the view includes metrics along with the service map. Two of the circles indicate that error metrics were recorded. Now if we select them, we get the option to view filtered traces.

Clicking on this we get a list of traces where errors occurred.

Selecting one, we go straight to the logs and we can see the error.

Hopefully, this gives you some idea of how X-Ray can help bring together traces, metrics, and logs, allowing you to identify errors and get to the relevant logs in order to debug issues quickly.

Step Functions

The final example that we will instrument with X-Ray is one that contains a step function. The step function implements a process that obtains a credit rating and then decides whether to accept or decline a loan. Where errors occur, a message is placed on an SQS queue. The step function definition is shown below. We can see that it includes integrations with SNS and SQS.

As well as setting the appropriate property on each Lambda function, we also need to set the tracingEnabled property to true on our step function.

export interface StateMachineProps {
    // ...
    /**
     * Specifies whether Amazon X-Ray tracing is enabled for this state machine.
     *
     * @default false
     */
    readonly tracingEnabled?: boolean;
    // ...
}

With this in place, we can deploy the updated example and run our tests that exercise all the routes through the step function. The result in X-Ray is shown below.

Here we can see the step function integrations with Lambda, SNS, SQS, and DynamoDB. We can see the traces go through Lambda, SNS, and SQS, but stop at DynamoDB. Although our example observes DynamoDB events, and we can see the Lambda function elsewhere in the service map, the context is lost as soon as the record is written.

Again, we can see the metrics overlaid on the service map. The tests include some error scenarios, and the resulting metrics are reflected in the highlighting of the services. Drilling into the step function service, we can view the error trace and we see the following.

Here we can clearly see the retry behaviour occurring, before the step function errors.

Summary

In this post, we looked at how we can use X-Ray to instrument a variety of services. Whilst it was simple to do, it was invasive. As in, we had to change both the CDK code and the Lambda function code. Although I haven't tried them, I believe that some of the third-party offerings avoid such changes. We also need to be aware of the cost of using X-Ray, the pricing of which can be found at AWS X-Ray pricing.

Although I feel I have only just scratched the surface, I can see how powerful X-Ray can be in the way it combines the service map, traces, metrics, alarms, and logs. In the second part of the series, I will take what I have found so far and try instrumenting my Loan Broker example application.

Encapsulation Not Always Desirable

Andy Blackledge — Sun, 02 Jul 2023 16:06:15 GMT

From my earliest experience with object-oriented programming, I learnt that one of the biggest benefits of the paradigm was the concept of encapsulation. Lately, my experience with developing in JavaScript and creating libraries has led me to the conclusion that it is not always desirable.

What is encapsulation

The following links provide an admirable overview of the concept:

The following in particular sums up the idea:

Under the definition that encapsulation "can be used to hide data members and member functions", the internal representation of an object is generally hidden from view outside of the object's definition. Typically, only the object's own methods can directly inspect or manipulate its fields. Hiding the internals of the object protects its integrity by preventing users from setting the internal data of the component into an invalid or inconsistent state. A supposed benefit of encapsulation is that it can reduce system complexity, and thus increase robustness, by allowing the developer to limit the interdependencies between software components

Encapsulation is the default

The majority of my object-oriented programming experience is in Java and C#. In both these languages, the default visibility for class members is non-public. That is, they cannot be accessed by any class without a special relationship with that class, such as being a sub-class or a class in the same packaging unit. This implicitly pushes the developer to conceal as much as possible.

When I encountered TypeScript, I was surprised to find that the opposite was true. I had to explicitly make any members private. I had to consciously make the decision to hide them from potential users of the class.

When encapsulation gets in the way

This subtle change made me question my default thinking. This was coupled with my experience of using a third-party component in a way similar to the following hypothetical code:

const myClient = new ServiceClient({ region: 'eu-west-2'});
// ...
myClient.region; // Not accessible now, although it was provided earlier

The component was hiding information that I had provided. What was the point of doing this? How was I going to misuse this information? Now I had to pass around the information that existed inside another parameter that was passed with it.

I also had the experience of trying to subclass a class I had published as part of an npm package. I had diligently hidden everything deemed 'not essential', but now had cut off the ability to extend it.

All this caused me to further question how I thought about member visibility.

Don't throw the baby out with the bathwater

Should we just make everything public? Of course not. Anything public forms part of the contract of your class and you should be committed to honouring that as best you can. Once you have published, then breaking that contract could result in very unhappy clients. If you are using Domain-driven Design, then you will also need to implement some business rules to keep your domain objects consistent. Again, encapsulation plays a very valuable role here.

What I am thinking about here are primarily read-only properties that have been hidden without thought. Perhaps because the language made that the default.
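
As a sketch of that middle ground in TypeScript, the hypothetical class below exposes what was passed in on construction as public read-only properties, whilst keeping its internal structure private. The class and member names are made up for illustration.

```typescript
class QueueClientSketch {
  // Provided by the caller, so readable by the caller, but not changeable
  public readonly region: string;
  public readonly queueName: string;

  // Internal structure, hidden by conscious choice
  private readonly pending: string[] = [];

  constructor(region: string, queueName: string) {
    this.region = region;
    this.queueName = queueName;
  }

  // Returns the number of messages waiting to be sent
  enqueue(message: string): number {
    this.pending.push(message);
    return this.pending.length;
  }
}

const queueClient = new QueueClientSketch('eu-west-2', 'orders');
const sameRegionAgain = queueClient.region; // Accessible, no need to pass it around separately
```

Attempting to assign to queueClient.region is rejected by the compiler, so the information is exposed without widening what callers can change.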

Sometimes privacy is just a facade

As it turns out, sometimes privacy is just a facade. Take C# for example, where you can define this seemingly well-encapsulated class.

class MyEncapsulatedClass
{
    private int PrivateProperty { get; set; }

    public MyEncapsulatedClass(int myPropertyValue)
    {
        PrivateProperty = myPropertyValue;
    }
}

However, using reflection you can still access the supposedly-private value.

var myEncapsulatedInstance = new MyEncapsulatedClass(666);
var myEncapsulatedClass = typeof(MyEncapsulatedClass);
var privateProperty =
    myEncapsulatedClass.GetProperty(
        "PrivateProperty", BindingFlags.NonPublic | BindingFlags.Instance);
var privatePropertyValue =
    privateProperty.GetValue(myEncapsulatedInstance, null);
Console.WriteLine("privatePropertyValue=" + privatePropertyValue);

I concede that you have to do some work here to get the value. The point I am trying to make is that you might want to check your language before relying on encapsulation for anything security-related.
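
TypeScript has a similar caveat. Its private modifier is enforced by the compiler only, so the value is still an ordinary property on the object at runtime. The following sketch, using a made-up class, shows how a cast is enough to read it.

```typescript
class EncapsulatedSketch {
  // `private` is checked at compile time only
  private secretValue: number;

  constructor(secretValue: number) {
    this.secretValue = secretValue;
  }

  isEven(): boolean {
    return this.secretValue % 2 === 0;
  }
}

const encapsulatedInstance = new EncapsulatedSketch(666);

// A cast sidesteps the compile-time check; at runtime it is a plain property
const leakedValue = (
  encapsulatedInstance as unknown as { secretValue: number }
).secretValue;

console.log('leakedValue=' + leakedValue);
```

If you need privacy that holds at runtime, ECMAScript private fields (names starting with #) are enforced by the JavaScript engine rather than by the type checker.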

Summary

My experience with TypeScript's public-by-default approach led me to question my default position of hiding everything possible in a class. This was combined with my experience of being frustrated with using libraries that hid information unnecessarily.

Now for utility libraries or packages, I try to keep things as open as possible. Where properties are passed in on construction, I try to make them public and read-only. Internal structures are made private by conscious choice, only if exposing them would make for a fragile contract.

When implementing Domain-driven Design business rules, encapsulation becomes necessary in order to enforce those rules and keep the domain model consistent. Again, this is a conscious choice.

What I also learned, is that trying different programming languages can make you think differently and challenge your own assumptions and habits. This can only be a good thing.

Updating to Node.js 18 and AWS JavaScript SDK v3 - Part 2

Andy Blackledge — Tue, 06 Jun 2023 19:36:40 GMT

In this part of the series, we continue looking at upgrading a TypeScript codebase from AWS SDK V2 to AWS SDK V3. Here we concentrate on DynamoDB and also look at S3, Step Functions, and Lambda functions.

DynamoDB clients

As we saw in the first part of this series, converting from V2 to V3 can be as straightforward as changing the type of service client, then using that client to send a command rather than invoking a method.

So the following:

readonly sns: AWS.SNS;
...
const publishInput: PublishInput = {
  Message: JSON.stringify(message),
  TopicArn: this.topicArn,
  MessageAttributes: messageAttributes,
};
await this.sns.publish(publishInput).promise();

Becomes:

readonly sns: SNSClient;
...
const publishInput: PublishInput = {
  Message: JSON.stringify(message),
  TopicArn: this.topicArn,
  MessageAttributes: messageAttributes,
};
await this.sns.send(new PublishCommand(publishInput));

The same applies to DynamoDB, where there is a DynamoDBClient class in the @aws-sdk/client-dynamodb package. As expected, this can be used to send commands to put and get items. However, unlike the AWS.DynamoDB.DocumentClient class in the V2 SDK, the DynamoDBClient class only deals with objects in the low-level DynamoDB format. As you can see from the example below, this is a JSON format that uses typed objects for properties:

{
  "Item": {
    "Age": { "N": "8" },
    "Name": { "S": "Rover" },
    "Breed": { "S": "Beagle" },
    "AnimalType": { "S": "Dog" }
  }
}
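
To illustrate what converting from this format involves, here is a deliberately simplified sketch that handles only string, number, and boolean attributes. It is not the SDK's implementation, and the type and function names are hypothetical. The real format has more attribute types, such as lists, maps, and sets.

```typescript
// Simplified subset of the low-level format: strings, numbers, and booleans only
type SimpleAttributeValue = { S: string } | { N: string } | { BOOL: boolean };

type PlainValue = string | number | boolean;

const toPlainValue = (attribute: SimpleAttributeValue): PlainValue => {
  if ('S' in attribute) {
    return attribute.S;
  }
  if ('N' in attribute) {
    // Numbers are carried as strings in the low-level format
    return Number(attribute.N);
  }
  return attribute.BOOL;
};

const toPlainObject = (
  item: Record<string, SimpleAttributeValue>
): Record<string, PlainValue> => {
  const plain: Record<string, PlainValue> = {};
  for (const [key, attribute] of Object.entries(item)) {
    plain[key] = toPlainValue(attribute);
  }
  return plain;
};

const rover = toPlainObject({
  Age: { N: '8' },
  Name: { S: 'Rover' },
  Breed: { S: 'Beagle' },
  AnimalType: { S: 'Dog' },
});
```

Here rover ends up as a plain object with Age as the number 8 and the other properties as strings.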

I decided to do the modern thing and ask ChatGPT if there was a way to convert the output of DynamoDBClient. To its credit it did point me in one of the possible directions.

If you are using the query method of the DynamoDBClient object to query a DynamoDB table in JavaScript, you will need to unmarshall the output to convert the raw DynamoDB item format into a more usable JavaScript format.

The output of the query method is returned in the native DynamoDB item format, which is a complex JSON structure that includes data types and other metadata. To work with the data in a more convenient way, you will need to unmarshall the output using the unmarshall function provided by the @aws-sdk/util-dynamodb module.

Here's an example code snippet that includes unmarshalling of the query results:

// Import the required modules from the AWS SDK for JavaScript V3
import { DynamoDBClient, QueryCommand } from "@aws-sdk/client-dynamodb";
import { unmarshall } from "@aws-sdk/util-dynamodb";
// Set the AWS region and create a new DynamoDB client object
const REGION = "us-east-1";
const dynamodbClient = new DynamoDBClient({ region: REGION });
// Set the parameters for the query
const params = {
  TableName: "my-table",
  KeyConditionExpression: "partitionKey = :pk",
  ExpressionAttributeValues: {
    ":pk": { S: "my-partition-key" },
  },
};
// Create a new QueryCommand object and execute the query
const command = new QueryCommand(params);
const response = await dynamodbClient.send(command);
// Unmarshall the results
const items = response.Items.map((item) => unmarshall(item));
// Output the results to the console
console.log(items);

Although this is correct, and the unmarshall function will come in useful later, this isn't the route I ended up taking. The V3 SDK uses a middleware-based approach and the DynamoDBClient can be wrapped with middleware to do the marshalling as part of the pipeline.

This approach is explained in Using the DynamoDB Document Client. This shows how the @aws-sdk/lib-dynamodb package can be used as follows.

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";
...
// Wrap a DynamoDBClient instance
const documentClient = DynamoDBDocumentClient.from(new DynamoDBClient({ region }));
...
const queryOutput = await this.documentClient.send(
  new QueryCommand(queryParams)
);
// Return unmarshalled objects
return queryOutput.Items;

One thing to be aware of is that DynamoDBDocumentClient does not support all the same commands as DynamoDBClient. So you might need to have an instance of the latter available as well as the wrapped version.

In part of the codebase, a DynamoDB stream event is used to retrieve the corresponding item from the table. One thing I found was that the stream AttributeValue type no longer appears to be compatible with the AttributeValue type in the DynamoDB client package. To get round this, I had to add an explicit cast. It was here that the unmarshall function came in useful, as the key is returned in the low-level JSON format.

import { AttributeValue, DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { unmarshall } from '@aws-sdk/util-dynamodb';
import { AttributeValue as StreamAttributeValue } from 'aws-lambda/trigger/dynamodb-stream';

async getItemByEventKeyAsync(
    eventKey: { [key: string]: StreamAttributeValue } | undefined
  ): Promise<T | undefined> {
    // ...
    if (eventKey === undefined) {
      return undefined;
    }
    // Cast to prevent: 'AWSLambda.AttributeValue' is not assignable to type 'DynamoDB.AttributeValue'
    const key = unmarshall(eventKey as Record<string, AttributeValue>);
    return getItem(this.region, this.tableName, key) as unknown as T;
  }

S3

One of the examples in the codebase being converted used pre-signed URLs to pass data. It turns out that pre-signing has changed with the V3 SDK. There is now a separate package (@aws-sdk/s3-request-presigner) that you need to reference to produce a URL for a V3 command.

import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
...
const s3Params = {
  Bucket: bucketName,
  Key: key,
};
...
const signedCommand = new GetObjectCommand(s3Params);
const signedUrl = await getSignedUrl(s3, signedCommand, {
  expiresIn: expirySeconds ?? 60,
});

Lists now can return `undefined`

Another thing that I noticed as part of the conversion process was that lists returned by the APIs can now be undefined. Below is an example where step function executions are being listed.

const { executions } = await stepFunctions.listExecutions(opts).promise();
if (executions.length > 0) {
  const newestRunning = executions[0];

When converting, I had to add an extra test to cater for the possibility of undefined.

const { executions } = await sfnClient.send(new ListExecutionsCommand(opts)); // Can be undefined
if (executions && executions.length > 0) {
  const newestRunning = executions[0];
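An alternative to the extra test is to default the list to an empty array, so that the code that follows never sees undefined. The sketch below uses hypothetical ExecutionSummary and ListOutput types of my own to stand in for the SDK output, so treat it as an illustration of the pattern rather than the real types.

```typescript
// Hypothetical shapes mirroring an SDK v3 output where the list is optional
interface ExecutionSummary {
  name: string;
}
interface ListOutput {
  executions?: ExecutionSummary[];
}

function newestExecution(output: ListOutput): ExecutionSummary | undefined {
  // Default to an empty list so the remaining code can ignore undefined
  const executions = output.executions ?? [];
  return executions.length > 0 ? executions[0] : undefined;
}
```

Whether this reads better than the explicit test is a matter of taste, but it does keep the undefined handling in a single place.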

Invoking Lambda functions

Another small quirk that emerged from my conversion was that I needed to encode/decode the payloads when invoking a Lambda function. The Payload is now returned as a Uint8Array, so we need to use a TextEncoder to convert from and to JSON objects.

Here we encode the stringify-ed JSON object:

const encoder = new TextEncoder();
const lambdaPayload = request
  ? { Payload: encoder.encode(JSON.stringify(request)) }
  : {};

And here we decode it before parsing:

const decoder = new TextDecoder();
return JSON.parse(decoder.decode(Payload));
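Putting the two halves together, the round trip can be sketched as a pair of helpers. The names encodePayload and decodePayload are my own, and I am assuming the payload may be missing or empty when the function returns nothing.

```typescript
const payloadEncoder = new TextEncoder();
const payloadDecoder = new TextDecoder();

// Convert a request object to the bytes expected as the invocation payload
function encodePayload(request: unknown): Uint8Array {
  return payloadEncoder.encode(JSON.stringify(request));
}

// Convert the returned bytes back to an object (assumption: an absent or
// empty payload is treated as 'no response')
function decodePayload<T>(payload: Uint8Array | undefined): T | undefined {
  if (payload === undefined || payload.length === 0) {
    return undefined;
  }
  return JSON.parse(payloadDecoder.decode(payload)) as T;
}

// Example round trip with made-up values
const roundTripped = decodePayload<{ values: number[] }>(
  encodePayload({ values: [1, 2, 3] })
);
```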

Discoverability thoughts

As part of the conversion process, I encountered the following code that I had in place to reuse connections in Node.js.

const documentClient = new DocumentClient({
  httpOptions: {
    agent,
  },
});

My thought was to navigate to the definition of the new options and look for something similar. However, I quickly found myself lost.

constructor(configuration: DynamoDBClientConfig);

Led to...

export interface DynamoDBClientConfig extends DynamoDBClientConfigType {}

Which led to...

type DynamoDBClientConfigType = Partial<__SmithyConfiguration<__HttpHandlerOptions>> & ClientDefaults & RegionInputConfig & EndpointInputConfig & RetryInputConfig & HostHeaderInputConfig & AwsAuthInputConfig & UserAgentInputConfig & EndpointDiscoveryInputConfig & ClientInputEndpointParameters;

At which point I stopped and searched for 'aws sdk V3 keep-alive' and found the article Reusing connections with keep-alive in Node.js.

This allowed me to rewrite the original as follows:

const documentClient = DynamoDBDocumentClient.from(
  new DynamoDBClient({
    requestHandler: new NodeHttpHandler({
      httpAgent: agent,
    }),
  })
);

I appreciate there is a good reason for how the options are now defined, but I do feel it has affected discoverability via the definition. I just need to remember to fall back on search and AI chatbots.

As it turns out, this 'keep alive' code is not needed any more. See HTTP keep-alive is on by default in modular AWS SDK for JavaScript.

The middleware-based approach

As touched on in the DynamoDB section, the V3 SDK uses a middleware-based approach. We saw it when we wrapped a DynamoDBClient instance in a DynamoDBDocumentClient instance.

const documentClient = DynamoDBDocumentClient.from(new DynamoDBClient({ region }));

The article What's the AWS SDK for JavaScript? explains how you can create your own customisations.

In V3, you can use a new middleware stack to control the lifecycle of an operation call. Each middleware stage in the stack calls the next middleware stage after making any changes to the request object.

It goes on to give the following example of adding a custom header to an Amazon DynamoDB client.

dbClient.middlewareStack.add(
  (next, context) => args => {
    args.request.headers["Custom-Header"] = "value";
    return next(args);
  },
  {
    step: "build"
  }
);
dbClient.send(new PutObjectCommand(params));

This approach, coupled with the ability to have a smaller bundle size, helped me understand the change in approach in the V3 SDK. On the surface, the changes looked a bit like unnecessary complication.
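To convince myself how 'each stage calls the next' hangs together, I find a toy version helpful. The sketch below is not the SDK's middleware stack: TinyMiddlewareStack and the Sketch types are hypothetical, and the handlers are synchronous for brevity, whereas the real ones are asynchronous.

```typescript
// Hypothetical, simplified types; the real SDK handlers are asynchronous
type SketchRequest = { headers: Record<string, string> };
type SketchHandler = (request: SketchRequest) => string;
type SketchMiddleware = (next: SketchHandler) => SketchHandler;

class TinyMiddlewareStack {
  private readonly middlewares: SketchMiddleware[] = [];

  add(middleware: SketchMiddleware): void {
    this.middlewares.push(middleware);
  }

  // Wrap the final handler so that the first middleware added runs first
  resolve(finalHandler: SketchHandler): SketchHandler {
    return this.middlewares.reduceRight<SketchHandler>(
      (next, middleware) => middleware(next),
      finalHandler
    );
  }
}

const tinyStack = new TinyMiddlewareStack();
tinyStack.add((next) => (request) => {
  // Change the request, then hand it to the next stage
  request.headers['Custom-Header'] = 'value';
  return next(request);
});

// The final handler stands in for actually sending the request
const sendSketch = tinyStack.resolve((request) => JSON.stringify(request.headers));
const sentHeaders = sendSketch({ headers: {} });
```

By the time the final handler runs, the header added by the middleware is already on the request, which is the same idea as the custom header example above.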

Summary

In this post, we looked at the challenges that I had when converting code for DynamoDB, Step Functions, S3, and Lambda functions and how I solved them. Hopefully, my experience can help others. In the main, the process was quite painless. However, my codebase was small and I had integration tests to verify the changes in the cloud.

If you have many unit tests that mock the older SDK, then your challenges may be greater than mine. Personally, I would try to avoid mocking at that level in the first place, but that might be a subject for another post.

It is well worth knowing that there is much improved documentation in Developer Preview. This documentation is searchable and goes beyond the original auto-generated version and includes code samples.

For those that like to look at code, here are the links to the resulting pull requests from my upgrading:

Updating to Node.js 18 and AWS JavaScript SDK v3 - Part 1

Andy Blackledge — Mon, 29 May 2023 17:27:04 GMT

With the announcement of the Node.js 18.x runtime being available in AWS Lambda, AWS also changed the included version of the AWS SDK for JavaScript.

Up until Node.js 16, Lambda's Node.js runtimes have included the AWS SDK for JavaScript version 2. This has since been superseded by the AWS SDK for JavaScript version 3, which was released in December 2020. With this release, Lambda has upgraded the version of the AWS SDK for JavaScript included with the runtime from v2 to v3.

I also noted that when doing development, I was being nagged as follows.

(node:16616) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023.
Please migrate your code to use AWS SDK for JavaScript (v3).
For more information, check the migration guide at https://a.co/7PzMCcy

The announcement and the nag were enough to motivate me to look at my CDK Cloud Test Kit and make the leap from SDK v2 to v3, whilst documenting my experience along the way.

TL;DR

Use the service clients for smaller bundle size and future-proofing
aws-sdk-js-codemod works OK, but the result may be deprecated
CDK hotswap doesn't update the Node.js version

Upgrade approach

My first thought was to question how I should approach the process of upgrading. Should I uninstall the aws-sdk package, see what breaks, then fix it all up? Or should I take a more step-by-step approach? Ultimately, I will need to uninstall the aws-sdk package to be sure I have amended all references, but to keep things manageable I decided to tackle the functionality service by service.

When identifying what needed to change, I noted that my codebase was not consistently explicit in the Node.js version being used. The reason for this was that the code used the NodejsFunction CDK construct and the default value for runtime is NODEJS_14_X.

export interface NodejsFunctionProps extends FunctionOptions {
  /**
   * @default Runtime.NODEJS_14_X
   */
  readonly runtime?: lambda.Runtime;
}

With hindsight, in future I would favour being explicit with the runtime version. I think defaults have their place, but I feel such a key dependency deserves to have full visibility.

As CodeWhisperer has just been released for personal use, I decided to give it a try with the prompt 'Publish a message to the selected output topic using the javascript sdk v3' and got the following result.

// Publish a message to the selected output topic using the javascript sdk v3
const command = new PublishCommand({
  TopicArn: outputTopicArn,
  Message: JSON.stringify(numbersEvent),
});
const result = await sns.send(command);
console.log(JSON.stringify({ result }, null, 2));

However, when I ran my cloud-based tests, I got the failure below:

  SimpleEventRouter Test Suitespec.js:54    1) Routes positive sums      0 passing (1m)base.js:379  5 failing

Looking in CloudWatch, I could see that the @aws-sdk/client-sns package could not be found.

2023-05-07T07:23:27.329Z  undefined ERROR Uncaught Exception{    "errorType": "Runtime.ImportModuleError",    "errorMessage": "Error: Cannot find module '@aws-sdk/client-sns'\nRequire stack:\n- /var/task/index.js\n- /var/runtime/UserFunction.js\n- /var/runtime/Runtime.js\n- /var/runtime/index.js",    "stack": [        "Runtime.ImportModuleError: Error: Cannot find module '@aws-sdk/client-sns'",        "Require stack:",            ]}

The reason for this turned out to be that I was using the --hotswap option with cdk deploy. This updates the Lambda function code, but not the runtime. As @aws-sdk/client-sns is not bundled with the Node.js 14 runtime, we get the error above.

When a full deployment was done, we got the happy sight of all the tests passing. As these are cloud-based, I have high confidence in a successful migration.

  SimpleEventRouter Test Suitespec.js:54     Routes positive sums (4175ms)      5 passing (13s)

SQS next with `aws-sdk-js-codemod`

The AWS documentation on upgrading mentions a package called aws-sdk-js-codemod. To quote the README, 'This repository contains a collection of codemod scripts for use with JSCodeshift that help update AWS SDK for JavaScript APIs.' This sounded promising, so I decided to give it a go.

I followed the instructions and ran the following, pointing at a TypeScript file with SQS references.

npx aws-sdk-js-codemod -t v2-to-v3 PATH...

The results can be seen below.

This all looks reasonable. I needed to install the @aws-sdk/client-sqs package and to add an empty configuration to the SQS constructor (new SQS({});), but after that I was able to deploy and test.

  SimpleMessageRouter Test Suitespec.js:54     Routes as expected: {"values":[],"isExpectedPositive":true} (4178ms)      6 passing (28s)

So success, but the style looks a little different from the SNS code.

Why does `codemod` SQS code differ from the SNS code?

I decided to ask CodeWhisperer how to send an SQS message using the v3 SDK and got another way.

// Send an SQS message using v3 sdk
const sendMessageRequest: AWS_SQS.SendMessageRequest = {
  QueueUrl: outputQueueUrl,
  MessageBody: JSON.stringify(numbersMessage),
};
await sqs.sendMessage(sendMessageRequest);

I was getting, if not confused, a little intrigued by these alternatives. Reading the AWS documentation, I could see that it points you down the SQSClient route as shown below.

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs"; // ES Modules import
const client = new SQSClient(config);
const input = {
  // SendMessageRequest
  QueueUrl: "STRING_VALUE", // required
  MessageBody: "STRING_VALUE", // required
};
const command = new SendMessageCommand(input);
const response = await client.send(command);

I tried this approach and, unsurprisingly, this worked as well. So now, we have three possible ways:

Use SQS.sendMessage()
- With SendMessageCommandInput
- With SendMessageRequest
Use SQSClient.send() with SendMessageCommand

SendMessageCommandInput turns out to be an interface that extends SendMessageRequest, as documented here.

So, which to use?

I dug a little further into the documentation and found this on the 'v2 compatible style' (highlighting my own).

The client can also send requests using v2 compatible style. However, it results in a bigger bundle size and may be dropped in next major version. More details in the blog post on modular packages in AWS SDK for JavaScript

The key takeaway for me here is that if you take the easiest approach now with your codebase, then you may face another round of updates if your application sees out the support lifetime of SDK v3. I noted that this is the approach that the aws-sdk-js-codemod defaults to, so that is something to bear in mind.

Summary

So far, the process of updating has been pretty painless. Admittedly, I have been tackling a small codebase and only two AWS services. On a larger codebase that has not wrapped the AWS services in more domain-level abstractions, this could be quite a task, especially if it is not easy to exercise the code thoroughly in the cloud.

In the next post, I will move on to updating the rest of the AWS services being used, including DynamoDB and the marshalling challenge.

Creating Overloaded Methods in TypeScript

Andy Blackledge — Sun, 16 Apr 2023 08:13:04 GMT

For the majority of my journey in software development, I have used languages that have supported method overloading. I have found this to be useful, so I have been wondering how to implement something similar in TypeScript. How can we have a method that can be declared multiple times with different parameters in a language which does not support method overloading in the traditional way?

After posting, it was brought to my attention that there is another way. I have added an update at the end to cover this approach.

The Problem

I wanted to create a class to build queries for DynamoDB. DynamoDB is a NoSQL database that indexes each item by two keys: a partition key and a sort key. When querying a DynamoDB table you always supply a partition key, and you optionally supply a sort key along with an operator such as 'greater than'. Another option is to supply two sort key values to provide a range.

The C# Solution

In C#, we could define as follows:

enum SortKeyOperator{    EQUALS,    LESS_THAN,    LESS_THAN_OR_EQUAL,    GREATER_THAN_OR_EQUAL,    GREATER_THAN,    BEGINS_WITH,}class QueryBuilder{    public void Build(string partitionKeyValue)    {    }    public void Build(        string partitionKeyValue,        string sortKeyValue)    {    }    public void Build(        string partitionKeyValue,        SortKeyOperator sortKeyOperator,        string sortKeyValue)    {    }    public void Build(        string partitionKeyValue,        string sortKeyFromValue,        string sortKeyToValue)    {    }}

This is allowed, as the combination of parameters means that each method signature is unique, even though the name of the method is not. When using the Visual Studio IDE the intellisense prompts as follows.

This allows the user to scroll through the various overloaded versions of the method.

By adding documentation to the methods, you can clearly communicate the intended use of each overload.

TypeScript Attempt No.1 - Separate Methods

The simplest way I could think of to try to replicate method overloading is to have separate methods that share a common prefix. In this case, buildWith. The result is as follows:

class QueryBuilder {
  buildWithPartitionKeyOnly(partitionKeyValue: string) {...}
  buildWithSortKey(partitionKeyValue: string, sortKeyValue: string) {...}
  buildWithComparison(
    partitionKeyValue: string,
    sortKeyOperator: SortKeyOperator,
    sortKeyValue: string
  ) {...}
  buildWithRange(
    partitionKeyValue: string,
    sortKeyFromValue: string,
    sortKeyToValue: string
  ) {...}
}

This would result in the following prompt when using VS Code:

I actually think this approach has some merit. The explicit naming provides some level of self-documentation. A downside is that the underlying implementation might need either some duplication in the separate methods, or some common code outside them.

TypeScript Attempt No.2 - Optional Parameters

Another approach is to use optional parameters and destructured parameters. We can define a single method with a single object parameter, and we can make the sort key parameters all optional. The result is as follows:

build({
  partitionKeyValue,
  sortKeyValue,
  sortKeyComparison,
  sortKeyRange,
}: {
  partitionKeyValue: string;
  sortKeyValue?: string;
  sortKeyComparison?: {
    operator: SortKeyOperator;
    value: string;
  };
  sortKeyRange?: {
    fromValue: string;
    toValue: string;
  };
}) {
  if (sortKeyValue) {
    // Handle case where we match by value equality
  } else if (sortKeyComparison) {
    // Handle case where we match by comparison
  } else if (sortKeyRange) {
    // Handle case where we match by range
  } else {
    // Handle case where we match just by primary key
  }
}

Whilst this works, it isn't obvious to the caller what combination of parameters should be used to get the various outcomes. For example, can sortKeyRange be used with sortKeyValue? The only way to know this, is to look inside the method. Not ideal. Can we do better?

TypeScript Attempt No.3 - Naive Discriminated Types

TypeScript allows you to define that a value can be one of a set of types, for example:

var v: number | string;

Can we take advantage of this to give the callers of the method a set of exclusive choices, so that they do not need to look inside the method to work out how to use it?

Below was my first effort:

build({
  partitionKeyValue,
  sortKeyCriteria,
}: {
  partitionKeyValue: string;
  sortKeyCriteria?:
    | {
        value: string;
      }
    | {
        comparison: {
          operator: SortKeyOperator;
          value: string;
        };
      }
    | {
        range: {
          fromValue: string;
          toValue: string;
        };
      };
}) {
  if (sortKeyCriteria) {
    if ('value' in sortKeyCriteria) {
      // Handle case where we match by value equality
    } else if ('comparison' in sortKeyCriteria) {
      // Handle case where we match by comparison
    } else if ('range' in sortKeyCriteria) {
      // Handle case where we match by range
    } else {
    }
  } else {
    // Handle case where we match just by primary key
  }
}

Here we use the in operator to work out which of the types has been specified. This all seemed to be working as I expected until I tried the following:

queryBuilder.build({  partitionKeyValue: "pk",  sortKeyCriteria: {    value: "sortKeyValue",    range: {      fromValue: "sortKeyValue1",      toValue: "sortKeyValue2",    },    comparison: {      operator: SortKeyOperator.GREATER_THAN,      value: "sortKeyValue",    },  },});

I was expecting a compiler error, as I had specified all three options. However, TypeScript clearly does not work that way. What had I done wrong?

TypeScript Attempt No.4 - Discriminated Types Done Properly

The solution came from an example in the TypeScript playground. What I needed to do was define a value that would discriminate the types. The result is as follows:

build({
  partitionKeyValue,
  sortKeyCriteria,
}: {
  partitionKeyValue: string;
  sortKeyCriteria?:
    | {
        type: 'value';
        value: string;
      }
    | {
        type: 'comparison';
        operator: SortKeyOperator;
        value: string;
      }
    | {
        type: 'range';
        fromValue: string;
        toValue: string;
      };
}) {
  if (sortKeyCriteria?.type === 'value') {
    // Handle case where we match by value equality
  } else if (sortKeyCriteria?.type === 'comparison') {
    // Handle case where we match by comparison
  } else if (sortKeyCriteria?.type === 'range') {
    // Handle case where we match by range
  } else {
    // Handle case where we match just by primary key
  }
}

Now when using the class in VS Code, when you select the type you get the corresponding options. For example, when using range you get prompted for the relevant fromValue and toValue values.

Now we have a single method that presents a set of mutually exclusive choices. The caller cannot get the parameters wrong and doesn't need to look inside the method.
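One refinement that I think is worth considering is to name the union and switch on the discriminator, with a never check in the default branch so that the compiler complains if a new option is added but not handled. The sketch below is a standalone illustration: SortKeyCriteria and describeCriteria are my own names, and I have assumed string literal operators in place of the SortKeyOperator enum to keep it self-contained.

```typescript
// Named version of the discriminated union (operators are an assumption)
type SortKeyCriteria =
  | { type: 'value'; value: string }
  | { type: 'comparison'; operator: '<' | '<=' | '>' | '>='; value: string }
  | { type: 'range'; fromValue: string; toValue: string };

// Returns a description of the match rather than building a real query
function describeCriteria(criteria?: SortKeyCriteria): string {
  if (criteria === undefined) {
    return 'partition key only';
  }
  switch (criteria.type) {
    case 'value':
      return `sk = ${criteria.value}`;
    case 'comparison':
      return `sk ${criteria.operator} ${criteria.value}`;
    case 'range':
      return `sk BETWEEN ${criteria.fromValue} AND ${criteria.toValue}`;
    default: {
      // Compile-time check that every member of the union is handled
      const unhandled: never = criteria;
      throw new Error(`Unhandled criteria: ${JSON.stringify(unhandled)}`);
    }
  }
}
```

Within each case, TypeScript narrows the type, so criteria.fromValue is only accessible in the range branch.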

Documentation

One downside to using destructured parameters is that I could not find a way to document them well using JSDoc. The best I could come up with was the following.

/**
 * Builds a query based on the key criteria supplied.
 * @param param0 Key criteria
 */
build({ partitionKeyValue, sortKeyCriteria }: {...}) {...}

This resulted in the following prompt in VS Code, which does give some indication of the options via the type values.

I may have been missing something, but for this reason I found myself quite liking the solution with separate methods. That approach was easy to document and also somewhat documented itself with the verbose names.

Here we can see that the individual parameters can be documented in the same way that they can be in the C# example, as shown below.

Summary

In this post we looked at various ways that we can implement some form of method overloading in TypeScript and compared these with an equivalent in C#. My feeling is that for internal libraries, I would favour the discriminated type approach. However, for external libraries, I feel that the ability to fully document means that the simple, multi-named approach would be better. Behind the scenes, these methods may map onto a single discriminated type method, in order to keep functionality together.
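To illustrate that last point, here is a minimal sketch of documented, explicitly-named methods that delegate to a single private method taking a discriminated type. The names NamedQueryBuilder and KeyCriteria are hypothetical, only two of the sort key options are shown, and the methods return a description string rather than a real DynamoDB query.

```typescript
// Hypothetical criteria type, following the discriminated approach
type KeyCriteria =
  | { type: 'value'; value: string }
  | { type: 'range'; fromValue: string; toValue: string };

class NamedQueryBuilder {
  /** Matches all items with the given partition key. */
  buildWithPartitionKeyOnly(partitionKeyValue: string): string {
    return this.build(partitionKeyValue);
  }

  /** Matches the item with the given partition key and sort key. */
  buildWithSortKey(partitionKeyValue: string, sortKeyValue: string): string {
    return this.build(partitionKeyValue, { type: 'value', value: sortKeyValue });
  }

  /** Matches items whose sort key lies between the two values. */
  buildWithRange(
    partitionKeyValue: string,
    sortKeyFromValue: string,
    sortKeyToValue: string
  ): string {
    return this.build(partitionKeyValue, {
      type: 'range',
      fromValue: sortKeyFromValue,
      toValue: sortKeyToValue,
    });
  }

  // Single implementation that keeps the functionality together
  private build(partitionKeyValue: string, criteria?: KeyCriteria): string {
    if (criteria === undefined) {
      return `pk = ${partitionKeyValue}`;
    }
    if (criteria.type === 'value') {
      return `pk = ${partitionKeyValue} AND sk = ${criteria.value}`;
    }
    return `pk = ${partitionKeyValue} AND sk BETWEEN ${criteria.fromValue} AND ${criteria.toValue}`;
  }
}
```

Each public method can carry its own JSDoc, whilst the private method remains the one place where the query logic lives.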

Update

After posting this, it was pointed out to me that there is another way that we can implement method overloading in TypeScript.

The way we can do it is by defining an args parameter with the JavaScript rest parameter syntax, typed as a union of labelled tuples that list the possible parameter combinations. In our example this would be the following:

build(
  ...args:
    | [partitionKeyValue: string]
    | [partitionKeyValue: string, sortKeyValue: string]
    | [
        partitionKeyValue: string,
        sortKeyOperator: NumericSortKeyOperator,
        sortKeyValue: string
      ]
    | [
        partitionKeyValue: string,
        sortKeyFromValue: string,
        sortKeyToValue: string
      ]
) {...}

In VS Code, this gives an experience very similar to the one we had for C#:

We still have the problem of how to know which overload is being called. I solved this by building up a signature string from the argument types and switching on the result.

const signature = args
  .map((arg) => typeof arg)
  .reduce((accumulator, argType) => `${accumulator}${argType}|`, '|');
switch (signature) {
  case '|string|':
    // Handle case where we match by partition key only
    break;
  case '|string|string|':
    // Handle case where we match by compound key
    break;
  case '|string|string|string|':
    // Handle case where we match by range
    break;
  case '|string|number|string|':
    // Handle case where we match by comparison
    break;
  default:
    throw new Error(`Unhandled signature`);
}

One thing I did have to change was the type of enum. Originally, it was a set of strings, but this would cause a clash of signatures. I changed it for a set of integers and this avoided the issue.
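The clash is easy to demonstrate in isolation. In the sketch below, signatureOf is my own wrapper around the same map and reduce, and the two operator objects are hypothetical stand-ins for the string-based and integer-based enums.

```typescript
// Build the signature string from the runtime types of the arguments
function signatureOf(args: unknown[]): string {
  return args
    .map((arg) => typeof arg)
    .reduce<string>((accumulator, argType) => `${accumulator}${argType}|`, '|');
}

// Hypothetical stand-ins for the two flavours of enum
const StringOperator = { GREATER_THAN: 'GREATER_THAN' } as const;
const NumericOperator = { GREATER_THAN: 4 } as const;

// A range call and a string-operator comparison call look identical...
const rangeSignature = signatureOf(['pk', 'a', 'b']);
const stringOperatorSignature = signatureOf(['pk', StringOperator.GREATER_THAN, 'a']);

// ...whereas a numeric operator gives the comparison its own signature
const numericOperatorSignature = signatureOf(['pk', NumericOperator.GREATER_THAN, 'a']);
```

With string values, both calls produce '|string|string|string|', so the switch cannot tell them apart. With numeric values, the comparison becomes '|string|number|string|'.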

I used array destructuring to access the values as follows:

case '|string|number|string|':
  // Handle case where we match by comparison
  {
    const [partitionKeyValue, sortKeyOperator, sortKeyValue] = args;
    // Call the method implementation
  }
  break;

Whilst TypeScript does infer the types, it does not discriminate. So it can only assert that some values are one of a set:

This approach suffers from the documentation issue that other advanced approaches do. My feeling overall is that, although it gives a similar intellisense experience, it falls down when implementing the underlying functionality and I would still be tempted to go down the explicit naming route with discriminated types underneath.

Enterprise Integration Patterns - Domain Observability

Andy Blackledge — Wed, 29 Mar 2023 19:55:34 GMT

In the first two parts in this series, we first looked at choosing a messaging technology and then looked at how we can design the domain events that flow through the application. In this part, we will look at how we can use those domain events to implement an observability stack. This stack will output a range of business metrics that can be used to provide visibility of the system performance and to alert us when this is not as desired.

The application in question acts as a loan broker. It receives a request containing the details of the loan required via an API, and then returns the best rate to a webhook.

The following diagram shows how we use a central EventBridge event bus to implement this application using an event-driven design.

The code for this post can be found in the accompanying GitHub repo.

Business metrics vs. System metrics

By default, AWS outputs a large number of metrics that you can use to visualise and monitor the health of your application. For example, the Working with Lambda function metrics AWS documentation page goes into the detail of what is outputted by default by Lambda functions. You get invocation metrics, such as the number of times that your function code is invoked, performance metrics, such as the amount of time that your function code spends processing an event, and concurrency metrics, such as the number of function instances that are processing events.

These provide an invaluable insight into the health of your application, but don't necessarily answer business-level questions such as, "How many quotes are we receiving per hour?", "How fast are we processing them?", or "How many rates are we receiving from lender X?". Whilst AWS provides system metrics, to answer these business-level questions we need business metrics.

This is where we can take advantage of our event-driven architecture. The application is already producing events such as the following:

QuoteSubmitted
CreditReportFailed
LenderRateReceived

What we can do is subscribe to these events and translate them into custom business metrics. We can then build dashboards, alarms, and whatever else we want on top of those metrics.

Decoupling observability

In the first part of the series, we considered using SQS as part of our messaging technology. One limitation of SQS is that each message can only be processed by one consumer. Our use of EventBridge has the advantage that we can subscribe to any business events without affecting any existing processing. This means we can add an observability stack entirely independently of the existing application. This demonstrates the high level of decoupling that can be achieved and the extensibility you get with an event-driven architecture.

The diagram below shows how we are going to add business observability to our architecture. As you can see, it simply plugs into the event bus.

Simple logging and Logs Insights

One simple way to turn the business events into a searchable resource is to log our business events in a structured way. This can be done via the following single-line Lambda function.

export const handler = async (
  event: EventBridgeEvent<'DomainEventBase', DomainEventBase>
): Promise<void> => {
  console.log({ ...event.detail.metadata, data: event.detail.data });
};

This will result in log entries such as the following.

{  "eventType": "QuoteSubmitted",  "eventVersion": "1.0",  "correlationId": "ac702216-ac35-48cb-be3c-26beb523897e",  "requestId": "35fb604e-c56b-457b-97a8-fbdae9fd0644",  "eventId": "8e71aecb-839f-4477-a9db-5353e1f23d04",  "domain": "LoanBroker",  "service": "RequestApi",  "timestamp": "2023-01-28T15:05:16.237Z",  "data": {    "quoteReference": "2023-01-28-RHJ9YUL71",    "quoteRequestDataUrl": "https://requestapistack-requestapibucket...-Amz-SignedHeaders=host"  }}

We have flattened the metadata about the event, such as the correlation id and event type, and also included the actual data for the event, such as the quote reference. In the second part of the series, we designed our domain events to be self-contained and to be made up from metadata about the event and the event data itself. We are taking advantage of the consistency here to output an easily searchable entry.
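The flattening itself can be sketched outside of Lambda as follows. DomainEventSketch and toLogEntry are hypothetical names, the metadata only includes a subset of the fields shown above, and the values are examples.

```typescript
// Assumed, cut-down shape of a domain event: metadata plus data
interface DomainEventSketch<TData> {
  metadata: { eventType: string; correlationId: string; service: string };
  data: TData;
}

// Spread the metadata to the top level and keep the data nested
function toLogEntry<TData>(event: DomainEventSketch<TData>) {
  return { ...event.metadata, data: event.data };
}

const logEntry = toLogEntry({
  metadata: {
    eventType: 'QuoteSubmitted',
    correlationId: 'example-correlation-id',
    service: 'RequestApi',
  },
  data: { quoteReference: '2023-01-28-RHJ9YUL71' },
});
```

Because every event shares the same metadata shape, the top-level fields are the same for every entry, which is what makes the entries straightforward to search.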

We also made another choice when designing our domain events. That was to pass large or sensitive data as time-limited presigned URLs. This choice means that we are free to log the events without the risk of logging sensitive information by accident.

Now we have our structured logging Lambda function, we can look at how we hook it up to monitor our application. To do this we create a CDK construct as shown below.

export interface ObserverProps {
  loanBrokerEventBus: EventBus;
}
export default class Observer extends Construct {
  constructor(scope: Construct, id: string, props: ObserverProps) {
    super(scope, id);
    const loggerFunction = new NodejsFunction(this, 'Logger');
    const domainEventRule = new Rule(this, id, {
      eventBus: props.loanBrokerEventBus,
      eventPattern: {
        detail: {
          metadata: {
            domain: [EventDomain.LoanBroker],
          },
        },
      },
    });
    domainEventRule.addTarget(new LambdaFunction(loggerFunction));
  }
}

The construct properties allow us to pass in the event bus to subscribe to. We create our Lambda function, along with a rule that will listen to all domain events in the application. Finally, we add our Lambda function as the target for the rule.

The final step is to define a stack and add it to the application.

export interface ObservabilityStackProps extends StackProps {  loanBrokerEventBus: EventBus;}export default class ObservabilityStack extends Stack {  constructor(scope: Construct, id: string, props: ObservabilityStackProps) {    super(scope, id, props);    new Observer(this, 'Observer', {      loanBrokerEventBus: props.loanBrokerEventBus,    });  }}new ObservabilityStack(app, 'ObservabilityStack', {  loanBrokerEventBus: messagingStack.loanBrokerEventBus,});

With this Lambda function in place, we can now use CloudWatch logs to see all the domain events in a single place.

We can take advantage of log filtering to provide a more focussed view of the events.

However, we can do even better by using CloudWatch Logs Insights. To quote the article:

CloudWatch Logs Insights enables you to interactively search and analyze your log data in Amazon CloudWatch Logs. You can perform queries to help you more efficiently and effectively respond to operational issues. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes.

This allows us to run a query such as the following and get a picture of the domain events flowing through the system.

As we log correlation ids, we can use these when we want to focus in on a particular request by adding criteria to our queries. The query below shows how this is done.

Logging business metrics

As well as outputting structured logs, we can also use this subscription to publish custom Amazon CloudWatch metrics. By publishing custom metrics we can create dashboards and alarms.

To make our lives easier, we are going to use the AWS Lambda Powertools for TypeScript Metrics npm package. This will allow us to create custom metrics asynchronously by logging metrics to standard output following Amazon CloudWatch Embedded Metric Format (EMF).
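To give a feel for what ends up in the log, below is a rough sketch of building an EMF log line by hand for a single count metric. This is only an illustration of the format as I understand it. The toEmfLogLine function is hypothetical, and in practice the Powertools package takes care of all this for us.

```typescript
// Build a single Embedded Metric Format log line for one count metric.
// The timestamp is passed in (milliseconds since the epoch) to keep the
// function deterministic.
function toEmfLogLine(
  namespace: string,
  service: string,
  metricName: string,
  value: number,
  timestamp: number
): string {
  return JSON.stringify({
    _aws: {
      Timestamp: timestamp,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          // One dimension set, keyed on the 'service' property below
          Dimensions: [['service']],
          Metrics: [{ Name: metricName, Unit: 'Count' }],
        },
      ],
    },
    service,
    [metricName]: value,
  });
}

// Example values only
const emfLine = toEmfLogLine(
  'LoanBroker',
  'observer',
  'creditReportFailed',
  1,
  Date.parse('2023-01-28T15:05:16.237Z')
);
```

The _aws section describes which of the top-level properties are metrics and which are dimensions, so CloudWatch can extract the metric from the log entry without a separate API call.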

The metric we are going to capture is a count of how many times the call to the credit report service has failed.

We start by declaring a Metrics instance outside of the Lambda function handler. We assign it a namespace and service name using exported constants, which we will use later for setting up an alarm.

import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
export const METRICS_NAMESPACE = 'LoanBroker';
export const METRICS_SERVICE_NAME = 'observer';
const metrics = new Metrics({
  namespace: METRICS_NAMESPACE,
  serviceName: METRICS_SERVICE_NAME,
});

The next step is to create the handler to publish the metric. First, we create a function to publish a count of all the failure events. Along with adding one to the overall count, we also publish metadata about the metric. This includes as much contextual information as possible, so that the resulting log entry can aid us in debugging what failed.

export const CREDIT_REPORT_FAILED_METRIC = 'creditReportFailed';
const publishCreditReportFailedMetrics = (
  creditReportFailed: CreditReportFailedV1
): void => {
  metrics.addMetric(CREDIT_REPORT_FAILED_METRIC, MetricUnits.Count, 1);
  addMetadata(creditReportFailed, {
    quoteReference: creditReportFailed.data.quoteReference,
    error: creditReportFailed.data.error,
    executionId: creditReportFailed.data.executionId,
    executionStartTime: creditReportFailed.data.executionStartTime,
    stateMachineId: creditReportFailed.data.stateMachineId,
  });
  metrics.publishStoredMetrics();
};

Now we have our function to publish the metric, we create a handler with a switch statement to route the event to our function.

export const handler = async (
  event: EventBridgeEvent<'DomainEventBase', DomainEventBase>
): Promise<void> => {
  switch (event.detail.metadata.eventType) {
    case EventType.CreditReportFailed:
      publishCreditReportFailedMetrics(event.detail as CreditReportFailedV1);
      break;
    default:
      break;
  }
};

Finally, we add a subscription using the domainEventRule we created earlier.

const measurerFunction = new NodejsFunction(
  this,
  'Measurer',
  getNodejsFunctionProps()
);
domainEventRule.addTarget(new LambdaFunction(measurerFunction));

Now that we have a metric for our failures, we can create an alarm to notify us when failures occur. For the purpose of this post, we create a simple alarm that triggers if there is at least one failure in a five-minute period.

const creditReportFailedCount = new Metric({
  namespace: METRICS_NAMESPACE,
  metricName: CREDIT_REPORT_FAILED_METRIC,
  dimensionsMap: {
    service: METRICS_SERVICE_NAME,
  },
}).with({
  statistic: 'sum',
  period: Duration.minutes(5),
});
creditReportFailedCount.createAlarm(this, 'CreditReportFailedAlarm', {
  evaluationPeriods: 1,
  comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
  threshold: 0,
  treatMissingData: TreatMissingData.NOT_BREACHING,
});

We can repeat this process for any of our domain events, creating metrics and alarms that allow us to observe the business performance of the application.

Deriving business metrics

Some business metrics directly correlate with individual business events, such as CreditReportFailed as we saw previously. However, there are some business metrics that do not. One example is the length of time it takes to process a quote. This process is asynchronous and, as a result, there is no single place within the application that could measure this duration.

One solution is to derive such a metric by a combination of events, in this case QuoteSubmitted and QuoteProcessed. By storing the events in a DynamoDB table, indexed by the request id, we can use this table to retrieve the corresponding event and derive the metric.

The first step is to extend the Lambda function that logs the event. In addition to the logging, we now write the event to a DynamoDB table.

export const handler = async (
  event: EventBridgeEvent<'DomainEventBase', DomainEventBase>
): Promise<void> => {
  console.log({ ...event.detail.metadata, data: event.detail.data });
  // Record the event in DynamoDB
  await requestEventTableClient.putEventAsync(event.detail);
};

An example for a single request is shown below.

The primary key is the requestId and the sort key is a combination of the time the event was received by the logger and the eventId. This allows us to have a chronological view of the events for a particular request. In this example, we can see the requests for rates and their corresponding responses.
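As a minimal sketch of how such a key could be composed, consider the following. The attribute name receivedAtEventId, the '#' separator, and the helper buildRequestEventKey are assumptions for illustration and are not taken from the accompanying repo.

```typescript
// Hypothetical key shape; the real table's attribute names may differ.
interface RequestEventKey {
  readonly requestId: string; // partition key
  readonly receivedAtEventId: string; // sort key
}

// ISO-8601 UTC timestamps sort lexicographically in chronological order,
// so leading the sort key with one keeps a request's events in time order.
// Appending the event id keeps the key unique if two events share a timestamp.
const buildRequestEventKey = (
  requestId: string,
  receivedAt: Date,
  eventId: string
): RequestEventKey => ({
  requestId,
  receivedAtEventId: `${receivedAt.toISOString()}#${eventId}`,
});

const earlierKey = buildRequestEventKey(
  'req-1',
  new Date('2023-01-18T10:00:00.000Z'),
  'evt-b'
);
const laterKey = buildRequestEventKey(
  'req-1',
  new Date('2023-01-18T10:00:01.500Z'),
  'evt-a'
);
```

Because the timestamp leads the sort key, a query on the requestId returns the events in the order they were received, regardless of the event ids.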

Now that we have a searchable log of the events, we can create a Lambda function that will use this to derive our metric. The code for this is shown below.

const publishQuoteProcessedMetricsAsync = async (
  quoteProcessed: QuoteProcessedV1
): Promise<void> => {
  // Retrieve the corresponding event for when the quote was submitted
  const [quoteSubmitted] = await requestEventTableClient.getEventsByType(
    quoteProcessed.metadata.requestId,
    EventType.QuoteSubmitted
  );
  // Calculate the duration
  const quoteSubmittedMillis = DateTime.fromISO(
    quoteSubmitted.metadata.timestamp
  ).toMillis();
  const quoteProcessedMillis = DateTime.fromISO(
    quoteProcessed.metadata.timestamp
  ).toMillis();
  const durationMillis = quoteProcessedMillis - quoteSubmittedMillis;
  // Publish the metric
  metrics.addMetric(
    QUOTE_PROCESSED_DURATION_METRIC,
    MetricUnits.Milliseconds,
    durationMillis
  );
  addMetadata(quoteProcessed, {
    quoteReference: quoteProcessed.data.quoteReference,
  });
  metrics.publishStoredMetrics();
};

By hooking this Lambda function up to the QuoteProcessed event, it will start emitting metrics for the duration of processing quotes. We could extend this to derive and emit metrics for how long each lender takes to respond, or any other metric that requires correlating multiple events together.
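To illustrate the lender extension, here is a rough sketch that pairs each LenderRateRequested event with its LenderRateReceived event. The simplified LenderEvent shape, the lenderId property, and the getLenderDurationsMillis helper are assumptions for this example; the real events may be structured differently.

```typescript
// Simplified, hypothetical event shape: just enough to correlate by lender.
interface LenderEvent {
  readonly eventType: 'LenderRateRequested' | 'LenderRateReceived';
  readonly lenderId: string;
  readonly timestamp: string; // ISO-8601
}

// Pair each request with its response and return the elapsed milliseconds
// per lender. Lenders that have not responded yet are omitted.
const getLenderDurationsMillis = (
  events: LenderEvent[]
): Map<string, number> => {
  const requestedAt = new Map<string, number>();
  const durations = new Map<string, number>();

  // Process in chronological order so a response always follows its request
  const sorted = [...events].sort(
    (a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp)
  );

  for (const event of sorted) {
    const millis = Date.parse(event.timestamp);
    if (event.eventType === 'LenderRateRequested') {
      requestedAt.set(event.lenderId, millis);
    } else {
      const start = requestedAt.get(event.lenderId);
      if (start !== undefined) {
        durations.set(event.lenderId, millis - start);
      }
    }
  }
  return durations;
};

const lenderDurations = getLenderDurationsMillis([
  { eventType: 'LenderRateRequested', lenderId: 'LenderA', timestamp: '2023-01-18T10:00:00.000Z' },
  { eventType: 'LenderRateRequested', lenderId: 'LenderB', timestamp: '2023-01-18T10:00:00.000Z' },
  { eventType: 'LenderRateReceived', lenderId: 'LenderA', timestamp: '2023-01-18T10:00:02.500Z' },
]);
```

Each entry in the resulting map could then be published as its own metric, for example with the lender id as a dimension.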

Summary

In this post, we have seen how we can add business-level observability to our event-driven application. By taking advantage of the decoupled nature of an event-driven architecture, we were able to do this without touching the core application code at all.

Enterprise Integration Patterns - Domain Event Design

Andy Blackledge — Wed, 18 Jan 2023 18:19:45 GMT

In the first post in the series, we took a case study from Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions and looked at how we could implement it using modern serverless technologies. We considered how we could use SQS and SNS, but decided to use EventBridge and a central event bus.

In this post, we look at how we can go about identifying and designing the events that are raised and handled by the application. We consider the structure of the events, how they might evolve, and how we can handle payloads that could be potentially large and could contain sensitive information.

Full working code for this post can be found on the accompanying GitHub repo.

Case study recap

The case study we looked at is an application that acts as a loan broker. The application receives a request containing the details of the loan required via an API, and then returns the best rate to a webhook.

The following diagram shows how we use a central EventBridge event bus to implement this.

The processing of each API request is as follows:

The API handler publishes a QuoteSubmitted event
The QuoteSubmitted event is handled and initiates a step function
The step function publishes a CreditReportRequested event and pauses
The CreditReportRequested event is handled, and then a CreditReportReceived event is published
The CreditReportReceived event is handled and the step function continues
For each registered lender, a LenderRateRequested event is published and the step function pauses
Each LenderRateRequested event is handled, and a LenderRateReceived event is published
When all lenders have responded, the step function continues
The best rate is selected and a QuoteProcessed event is published with the result
The QuoteProcessed event is handled and the webhook is called with the best rate

Event identification

Central to an event-driven architecture like this are the events themselves. In our example, these include QuoteSubmitted, LenderRateReceived, and so on. They are what I would call domain events, in that they relate purely to the business domain and not the implementation platform.

The identification of events can come out of walking through the process being implemented, or from a more formal process. These processes could be domain-driven design (DDD) or event storming.

The key is that all events describe something that happened in the past, not something that should happen in the future. The latter is a request or command, not an event. To paraphrase the Wikipedia event storming page, an actor executes a command that results in the creation of a domain event, written in past tense.

Basic event structure

Once we have our events, we need to think about how we structure them. This part of the post was very much inspired by the Amazon EventBridge: Event Payload Standards post by David Boyne. I would very much recommend reading that post.

In that post, the following example is given of a standard EventBridge event:

{
  "version": "0",
  "id": "0d079340-135a-c8c6-95c2-41fb8f496c53",
  "detail-type": "OrderCreated",
  "source": "myapp.orders",
  "account": "123451235123",
  "time": "2022-02-01T18:41:53Z",
  "region": "us-west-1",
  "detail": {...} // whatever we like
}

To quote the post:

The version, account, time and region are all properties that AWS handles for us. That leaves core properties detail, detail-type and source to be defined by us.

We can populate detail-type with the event type, e.g. QuoteSubmitted or LenderRateReceived in our example, and source with a string indicating the origin of the event, e.g. LoanBroker.CreditBureau. We could populate detail with just the data for the event. However, there is an advantage to doing something slightly different.

David's post was itself influenced by the post The power of Amazon EventBridge is in its detail by Sheen Brisals. In it, Sheen shares a pattern of introducing metadata within our detail object.

"detail": {
  "metadata": {
    ...
  },
  "data": {
    ...
  }
}

As the post points out:

Implementing these kinds of standards within our events can provide us with some benefits:
Better filtering of events (we can filter on metadata as well as the event payload)
Easier downstream processing based on metadata
Opens the doors to more observability patterns and debugging options

Given this, let us define a TypeScript interface from which we can derive all our domain events.

export interface DomainEvent<T extends Record<string, any>> {
  readonly metadata: DomainEventMetadata;
  readonly data: T;
}

Here we take advantage of TypeScript's support for generics. This allows us to define the structure of all our events, without tying us to any specific data type. All we ask is that the data type extends Record<string, any>, which we ensure with the T extends Record<string, any> constraint.

Another TypeScript feature we take advantage of here is having readonly properties on our interface. As the post TypeScript - Interfaces with Read-Only Properties explains, having these as read-only means that the TypeScript compiler will help us treat the resulting events as immutable. This is important, as events are a record of what has happened and - as we all know - we cannot change the past.
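As a small illustration of both features together, the sketch below derives a concrete event type from the generic interface. The simplified metadata and the QuoteSubmittedData payload are stand-ins invented for this example, not the definitions from the repo.

```typescript
// Simplified stand-in for the metadata interface, which is built up later.
interface DomainEventMetadata {
  readonly eventType: string;
}

interface DomainEvent<T extends Record<string, any>> {
  readonly metadata: DomainEventMetadata;
  readonly data: T;
}

// Hypothetical payload for illustration; the real QuoteSubmitted data may differ.
interface QuoteSubmittedData {
  readonly quoteReference: string;
}

// A concrete event is just the generic interface bound to its data type
type QuoteSubmitted = DomainEvent<QuoteSubmittedData>;

const quoteSubmitted: QuoteSubmitted = {
  metadata: { eventType: 'QuoteSubmitted' },
  data: { quoteReference: 'Q-123' },
};

// The compiler rejects attempts to rewrite history. Uncommenting the
// line below results in a compile error, as data is a read-only property:
// quoteSubmitted.data = { quoteReference: 'Q-456' };
```

Note that readonly is a compile-time check only. Nothing prevents mutation at runtime from untyped code, so this is a safety net rather than a guarantee.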

Metadata structure

Now that we have our basic event structure, we can start to think about the metadata that we want with each event.

The first class of information relates to where the event originated. In this case, we split the information into the service that raised it and the domain which the service is part of. In our case study we have a single domain, LoanBroker, but several services, with the CreditBureau being one. We group this using a TypeScript interface as follows.

export interface EventOrigin {
  readonly domain: EventDomain; // E.g. LoanBroker
  readonly service: EventService; // E.g. CreditBureau
}

Why would we want to include this information? One reason is to enhance observability when we log such events. It becomes clear where the information has come from. Another reason, as we will see in a later post, is that we can add listeners to all events from either a particular service or a particular domain. Again, this can help with observability.

On the subject of observability, one of the challenges of event-driven systems is building up a picture of the flow of a request through the system. One way to do this is to use correlation and request ids.

Every call into our application will pass both a correlation id and a request id in each event. The correlation id can be externally specified, but the request id will be generated for each call. Using a correlation id in this way allows our application to be tracked as part of longer-running sagas. For example, if a call was retried, then it may use the same correlation id. This would allow us to piece together that the two requests were related.

The final id in our context is an event id. This id is unique to each individual event. EventBridge delivers events 'at least once', which means that some events will be received more than once. To handle this scenario, the system needs to be able to behave in an idempotent manner. The event id allows event handlers to do so, by using the event id to recognise duplicate events.
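A minimal sketch of recognising duplicates by event id is shown below. The makeIdempotent wrapper is a hypothetical helper, and the in-memory Set is an assumption made to keep the example self-contained. A real Lambda handler would need a durable store that is shared across invocations.

```typescript
// Wrap a handler so that each event id is only processed once.
// Returns true if the event was handled and false if it was a duplicate.
const makeIdempotent = <E extends { eventId: string }>(
  handle: (event: E) => void
): ((event: E) => boolean) => {
  // In-memory only: this does not survive across Lambda execution environments
  const seenEventIds = new Set<string>();

  return (event: E): boolean => {
    if (seenEventIds.has(event.eventId)) {
      return false; // Duplicate delivery, so skip it
    }
    handle(event);
    // Only record the id once handling succeeded, so failures can be retried
    seenEventIds.add(event.eventId);
    return true;
  };
};

let handledCount = 0;
const handleOnce = makeIdempotent<{ eventId: string }>(() => {
  handledCount += 1;
});

const firstDelivery = handleOnce({ eventId: 'evt-1' });
const secondDelivery = handleOnce({ eventId: 'evt-1' });
```

Here the second delivery of evt-1 is recognised and skipped, so the underlying handler only runs once.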

With all this in mind, we create an EventContext interface with our ids.

export interface EventContext {
  readonly correlationId: string; // Can be externally provided
  readonly requestId: string; // Always internally generated
  readonly eventId: string; // Internally generated and unique to the event
}

Now we put these interfaces together, along with a timestamp. An EventBridge event does automatically get a timestamp, but we include one here to make our metadata self-contained. If we use another transport for the event detail, then we will still have this very useful information.

export interface DomainEventMetadata
  extends EventOrigin, EventContext {
  readonly timestamp: Date; // Keep metadata self-contained
}

Now the detail for each of our domain events allows us to see when it was raised, where it came from, and the context under which it was raised. Although we will be using EventBridge, we are not relying on EventBridge to provide any of the metadata. We could raise the same events through another messaging technology if that was desirable.
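To show how the ids might be populated for the first event of a request, here is a sketch of a factory function. The newRequestMetadata helper is hypothetical, the domain and service are simplified to plain strings, and the id generator and clock are injected as parameters purely to keep the example free of dependencies.

```typescript
interface EventOrigin {
  readonly domain: string; // Simplified to a string for this sketch
  readonly service: string;
}

interface EventContext {
  readonly correlationId: string;
  readonly requestId: string;
  readonly eventId: string;
}

interface DomainEventMetadata extends EventOrigin, EventContext {
  readonly timestamp: Date;
}

// Hypothetical helper for the first event raised for an incoming call
const newRequestMetadata = (
  origin: EventOrigin,
  newId: () => string,
  now: () => Date,
  externalCorrelationId?: string
): DomainEventMetadata => ({
  ...origin,
  correlationId: externalCorrelationId ?? newId(), // Reuse the caller's id if given
  requestId: newId(), // Always internally generated
  eventId: newId(), // Unique to this event
  timestamp: now(),
});

// Deterministic stand-ins so that the result is predictable
let idCounter = 0;
const sequentialId = (): string => `id-${++idCounter}`;
const fixedNow = (): Date => new Date('2023-01-18T18:19:45.000Z');

const retriedCallMetadata = newRequestMetadata(
  { domain: 'LoanBroker', service: 'CreditBureau' },
  sequentialId,
  fixedNow,
  'corr-ext'
);
```

In practice, the generator would be something like a UUID function. Subsequent events for the same request would copy the correlation and request ids, but generate a fresh event id and timestamp.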

Evolving events with versioning

If there is one constant, it is change. Systems evolve over time, so it is important to bear this in mind when building them.

In the case of events, we may want to add information to them over time. In general, this will be a safe thing to do. However, this is only true if we know that all downstream systems accept new properties. This puts the emphasis on us to write event consumers to be as forgiving as possible.
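One way to write a forgiving consumer is to pick out only the properties it needs and ignore the rest. The sketch below assumes a hypothetical RateView with lenderId and rate properties. These names are for illustration only.

```typescript
// Hypothetical slice of a payload: the consumer only declares what it needs.
interface RateView {
  readonly lenderId: string;
  readonly rate: number;
}

// Pick out the known properties and ignore everything else, so that
// producers can add new properties without breaking this consumer.
const toRateView = (data: Record<string, unknown>): RateView | undefined => {
  const { lenderId, rate } = data;
  if (typeof lenderId !== 'string' || typeof rate !== 'number') {
    return undefined; // Required properties are missing or of the wrong type
  }
  return { lenderId, rate };
};

// A newer producer has added a 'currency' property, which is simply ignored
const tolerated = toRateView({ lenderId: 'LenderA', rate: 4.2, currency: 'GBP' });

// A payload without the required properties is rejected explicitly
const rejected = toRateView({ lenderId: 'LenderA' });
```

The opposite approach, rejecting any payload with unrecognised properties, would turn every additive change into a breaking one.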

However, it may be the case that at some point we need to fundamentally change the structure of an event. How can we do this without breaking something? With a distributed system, we are not able to stop everything. We might have old events in-flight awaiting processing as well. So what can we do?

The solution I am proposing here was inspired by listening to the following podcast: Real-World Serverless: Event-driven architecture at PostNL with Luc van Donkersgoed

If you search for 'an interesting question about versioning' in the transcript, then you will be taken to the discussion of how versioned events can help with this scenario. The approach is to support multiple versions of the same event for a period of time. The event producer raises both event versions and event consumers match on the version to handle the appropriate version.

To do this, we extend the event metadata to include the event type and the event version as follows.

export interface EventSchema {
  readonly eventType: EventType; // E.g. QuoteSubmitted
  readonly eventVersion: string; // E.g. 1.0
}

export interface DomainEventMetadata
  extends EventOrigin, EventContext, EventSchema {
  readonly timestamp: Date;
}

This allows us to match on event version as shown below. This way we can support consumers for both the old and new versions.

export const QUOTE_PROCESSED_PATTERN_V1 = {
  detail: {
    metadata: {
      eventType: [EventType.QuoteProcessed],
      eventVersion: [{ prefix: '1.' }],
    },
  },
};
export const QUOTE_PROCESSED_PATTERN_V2 = {
  detail: {
    metadata: {
      eventType: [EventType.QuoteProcessed],
      eventVersion: [{ prefix: '2.' }],
    },
  },
};

Now we can have the event producer raise both the old and new versions of the event. This might be a temporary solution until we remove consumers of the old event version, or it could be a permanent state of affairs. With event versioning, we have the choice.
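As a rough sketch of what raising both versions could look like, consider the following. The V1 and V2 payload shapes are invented for this example, as are the buildQuoteProcessedEvents and matchesMajorVersion helpers. The latter only mimics the prefix matching used in the patterns, so that the idea can be tried locally.

```typescript
interface VersionedEvent<T> {
  readonly metadata: {
    readonly eventType: string;
    readonly eventVersion: string;
  };
  readonly data: T;
}

interface LenderRate {
  readonly lenderId: string;
  readonly rate: number;
}

// Hypothetical payloads: V2 restructures the best rate into an object
interface QuoteProcessedDataV1 {
  readonly bestRate: number;
}
interface QuoteProcessedDataV2 {
  readonly bestLenderRate: LenderRate;
}

// The producer builds both versions from the same source data
const buildQuoteProcessedEvents = (
  best: LenderRate
): [VersionedEvent<QuoteProcessedDataV1>, VersionedEvent<QuoteProcessedDataV2>] => [
  {
    metadata: { eventType: 'QuoteProcessed', eventVersion: '1.0' },
    data: { bestRate: best.rate },
  },
  {
    metadata: { eventType: 'QuoteProcessed', eventVersion: '2.0' },
    data: { bestLenderRate: best },
  },
];

// Local stand-in for the { prefix: '1.' } style of matching
const matchesMajorVersion = (
  event: VersionedEvent<unknown>,
  major: number
): boolean => event.metadata.eventVersion.startsWith(`${major}.`);

const [quoteProcessedV1, quoteProcessedV2] = buildQuoteProcessedEvents({
  lenderId: 'LenderA',
  rate: 4.2,
});
```

Matching on the major version prefix means that a consumer of version 1 events would also receive 1.1 or 1.2, which fits with the idea that additive changes should be safe.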

Passing large and sensitive payloads

Another consideration with our events is the size of the payload. Although in the example code the request is small, in reality such requests can be much larger in size. As the AWS article Calculating Amazon EventBridge PutEvents event entry size states, the total entry size must be less than 256KB. The solution is also mentioned:

If the entry size is larger than 256KB, we recommend uploading the event to an Amazon S3 bucket and including the Object URL in the PutEvents entry.
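A rough sketch of how we might decide when to offload the data is shown below. It only measures the serialised detail, whereas the AWS calculation also counts other entry fields such as the source and detail type. The headroomBytes allowance and the helper names are assumptions for illustration.

```typescript
const MAX_ENTRY_BYTES = 256 * 1024;

// Size in bytes of the detail when serialised as UTF-8 JSON
const getDetailSizeBytes = (detail: unknown): number =>
  new TextEncoder().encode(JSON.stringify(detail)).length;

// Hypothetical check: leave some headroom for the other entry fields,
// as this sketch does not measure them.
const needsClaimCheck = (detail: unknown, headroomBytes = 1024): boolean =>
  getDetailSizeBytes(detail) + headroomBytes >= MAX_ENTRY_BYTES;

const smallDetailNeedsClaimCheck = needsClaimCheck({ quoteReference: 'Q-123' });
const largeDetailNeedsClaimCheck = needsClaimCheck({
  document: 'x'.repeat(300_000),
});

// Multi-byte characters count by their encoded size, not by character count
const accentedDetailSizeBytes = getDetailSizeBytes({ a: 'é' });
```

Measuring bytes rather than string length matters here, as any non-ASCII characters in the payload take up more than one byte each.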

Of course, each downstream component will need access to the data, and so they will need access to the S3 bucket. However, this introduces a form of coupling. If we decided to change the bucket location, then we would have to find all the downstream components and change those too.

The solution is to use presigned URLs. These allow us to implement the Claim Check pattern, where we generate a short-lived URL that only allows read access to the event data. We do this as follows:

await s3
  .putObject({
    Bucket: bucketName,
    Key: key,
    ACL: 'bucket-owner-full-control',
    Body: data,
  } as PutObjectRequest)
  .promise();
// Generate a short-lived, read-only URL for the object we just stored
const dataUrl = await s3.getSignedUrlPromise('getObject', {
  Bucket: bucketName,
  Key: key,
  Expires: expirySeconds ?? 60,
});

We can then pass the dataUrl value in our event and use it to get the data. In our case, we created a function, fetchFromUrlAsync to do this.

import fetch from 'node-fetch';
export const fetchFromUrlAsync = async <T>(url: string): Promise<T> => {
  const fetchResponse = await fetch(url);
  return (await fetchResponse.json()) as T;
};

This approach has a secondary benefit. Our events are passing PII (personally identifiable information), which needs very careful management. It is very easy for this information to make its way into logs, where it can leak out with very serious consequences. By using the approach outlined here, the events only ever contain a URL which can safely be logged by any component.

Summary

In this post we looked at how we can identify and structure our events. Key to this is having separate sections for the metadata and data. We can then build on this by including context, correlation, and version information in the metadata. Finally, we looked at how the Claim Check pattern can allow us to pass large payloads and also avoid logging sensitive data.

Enterprise Integration Patterns with Serverless and CDK

Andy Blackledge — Tue, 06 Dec 2022 18:46:09 GMT

If you are interested in Event-Driven Architecture (EDA) then I would highly recommend reading Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Although first published in 2003, this book contains a catalogue of sixty-five messaging patterns still relevant today. Maybe even more so, given the ease of building with such patterns today. It also explores in detail the advantages and limitations of asynchronous messaging architectures.

The book also includes some case studies and their implementation. However, this is where the age of the book shows, as this snippet from the Amazon brief indicates:

The authors also include examples covering a variety of different integration technologies, such as JMS, MSMQ, TIBCO ActiveEnterprise, Microsoft BizTalk, SOAP, and XSL.

In this series of blog posts, we will look at one of the example case studies from the book and implement it using AWS serverless services and CDK.

Full working code for this post can be found on the accompanying GitHub repo.

The Loan Broker case study

The case study we will look at is an application that acts as a loan broker. The application receives a request containing the details of the loan required, along with details of the individual wanting the loan and returns the best rate from a set of lenders.

The application does this by calling a credit bureau to obtain a credit report for the individual. The loan details plus the credit report are then sent to multiple lenders, who each submit their best rates. The application then selects the best rate and publishes the result.

As the diagram above shows, all the interactions are asynchronous and message-based. The diagram also refers to a number of the patterns listed in the book. The 'Recipient List' pattern indicates the use of a list of registered lenders when sending the rate requests. The 'Aggregator' pattern indicates that the multiple responses are aggregated into one. Finally, the 'Translator' pattern indicates that the aggregated response is to be transformed into a different representation for the outside world.

Next we will consider a couple of alternative implementations using AWS serverless services. First up, using SQS and SNS.

SQS and SNS are two complementary messaging services that together can be used to create complex event-driven architectures.

SQS is a point-to-point messaging technology and SNS a publish-and-subscribe event technology. These feel like a natural fit to implement the patterns above, with SQS providing the queues and SNS providing the fan-out to the lenders. This is shown in the diagram below, which is a close replication of the original Loan Broker application diagram.

The sequence of events is as follows:

The API puts a message to be processed on the SQS request queue
A Lambda function consumes the request message from the queue and initiates a step function
The step function places a message on a credit report SQS request queue and pauses
The credit bureau places a message on the credit report SQS response queue
A Lambda function consumes the response message and continues the step function
For each registered lender, an SNS message is published to a rate request topic and the step function pauses
Each lender has an SQS queue subscribed to the rate request topic, filtered by their lender identifier
The lender consumes the rate request and places a message on the rate response SQS queue
When all lenders have responded, the step function continues
The best rate is selected and a message is placed on the SQS response queue
A Lambda function consumes the response message and calls the webhook with the best rate

EventBridge implementation

AWS describes EventBridge as follows:

Amazon EventBridge is a serverless event bus that lets you receive, filter, transform, route, and deliver events.

We can take the previous architecture and replace both SQS and SNS with a single EventBridge event bus. The resulting architecture is shown below:

The sequence of events is very similar to before, but all communication is through events sent and received through the central event bus:

The API publishes a QuoteSubmitted event
A Lambda function receives the QuoteSubmitted event and initiates a step function
The step function publishes a CreditReportRequested event and pauses
The credit bureau receives the event, obtains the report, and then publishes a CreditReportReceived event
A Lambda function receives the CreditReportReceived event and continues the step function
For each registered lender, a LenderRateRequested event is published and the step function pauses
Each lender receives a LenderRateRequested event, and then replies by publishing a LenderRateReceived event with their rate
When all lenders have responded, the step function continues
The best rate is selected and a QuoteProcessed event is published with the result
A Lambda function receives the QuoteProcessed event and calls the webhook with the best rate

Comparing the two approaches

Let me start by saying that both approaches have the potential to be good solutions to the problem. Both approaches decouple all the Lambda functions by using messaging. This can help with both testing and scaling.

However, there are a number of key differences between the technologies:

SQS and SNS support ordered events (using FIFO queues and topics), EventBridge does not
SQS FIFO queues provide 'exactly once' processing, whilst standard SQS queues, SNS, and EventBridge deliver 'at least once'
~~EventBridge supports content-based filtering, SNS only supports attribute-based filtering~~

UPDATE: It was announced at re:Invent 2022 that Amazon SNS adds support for payload-based message filtering. This does make SNS a lot more attractive as a choice for event-driven architectures. There are still many reasons to choose EventBridge though.

For more information on the difference, I recommend the following articles:

What strikes me about the two architecture diagrams is that the EventBridge approach puts the event bus at the centre of the architecture. With the SQS/SNS approach, the Loan Broker appears as the centre. I also noted that when explaining the sequence, the EventBridge approach centred around the publishing and receiving of domain events, whilst with the SQS/SNS approach, the explanation had to refer to specific queues and topics.

Out of the box, I can see observability advantages with the SQS/SNS approach. That is, we could use the AWS console to see how many messages were in flight at any time. This could allow us to see if things were getting backed up at any point. We would have some potentially useful control points too. We could also pause any part of the system by throttling the number of Lambda function executions. This would result in messages queuing up, but not being lost. The 'exactly once' processing offered by SQS FIFO queues would also help to avoid duplicates, with little additional effort on our part.

So given all these positives, why am I going to choose the EventBridge approach to implement? It is partly that I wanted to get some hands-on experience with implementing domain events with EventBridge (see note below). It is also that the domain event driven approach allows us to extend the system without impacting the current behaviour. This is because new components can subscribe to and process the events independently.

The result can be downloaded from the accompanying GitHub repo.

CV Driven Design (CDD) is the approach of building systems based on technology that the architects want on their CVs. Please do not do this in commercial software. Please challenge the use of any technology and ensure that its use is justified, given the available alternatives.

Summary

In this post, we looked at an example case study from Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions and how we can use AWS serverless technologies to implement it. We considered using a combination of SQS and SNS, and then compared that solution with one that used EventBridge. We decided upon implementing it using EventBridge.

In the next part, we will look at the process of designing our events and how we pass data between the components that make up our application.

Handling Step Function Task Token failures with CDK

Andy Blackledge — Thu, 06 Oct 2022 19:01:25 GMT

In the previous post in the series we looked at how to implement the 'Wait for a Callback' Service Integration Pattern using task tokens and the CDK.

However, we only considered what happens if everything goes to plan. As you might have heard somewhere, everything fails all the time and step functions and task tokens are no different.

This post covers the various ways our previous application could fail, and then how we might handle those scenarios. All the code can be downloaded and run by cloning the companion repo.

TL;DR

Specify a value for timeout on asynchronous tasks
Catch TaskTimedOut errors
Use sendTaskFailure to call back with an error
Use the context $$ to log debug information with errors

Exploring the failure modes

Below is a diagram that shows the step in our state machine that makes an external service call and then waits for a task token to continue. When the service calls back via a webhook, a task token is retrieved and then used to restart the state machine.

As with any piece of software, we need to consider the ways in which things could fail. It is all too easy to just consider the happy path and then be surprised when something goes wrong. Especially if you are left scratching your head, as you don't have the information to understand and fix it.

With this in mind, let us list out some ways the integration with the valuation service could fail:

It could fail to respond.
It could return a response that indicates it couldn't fulfil the request.
It could return a reference that we do not expect.

Now we have our failure modes, let us consider how we can handle them in such a way that we can easily identify what the problem was.

Timeouts and heartbeats

The first failure we will consider is where the valuation service fails to respond. If we do nothing, then our step function will stay stuck on the same step. According to the step function quota documentation, it could stay there for up to the maximum execution time of one year.

Clearly, this is not what we want. Thankfully, the solution is quite straightforward. What we need to do is to add a timeout to the asynchronous step, as shown below.

const requestValuationTask = new LambdaInvoke(this, 'RequestValuation', {
  lambdaFunction: valuationRequestFunction,
  integrationPattern: IntegrationPattern.WAIT_FOR_TASK_TOKEN,
  payload: TaskInput.fromObject({
    taskToken: JsonPath.taskToken,
    'loanApplication.$': '$',
  }),
  timeout: Duration.seconds(30), // Don't wait forever for a reply
});

At first I thought that it was necessary to also set a value for heartbeat. It was only when writing this post that I consulted the 'Task state timeouts and heartbeat intervals' documentation and found that this wasn't the case for our example. The heartbeat setting is only required if the task is sending heartbeat notifications to indicate it is still progressing.

Our example includes a mock valuation service. To test the timeout, we update this mock service so that, if it receives a certain request, then it returns without initiating the callback.

if (valuationRequest.property.nameOrNumber === 'No callback') {
  return {
    statusCode: 201,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(valuationRequestResponse),
  };
}

With this code in place, we can write a unit test to send a request with 'No callback' as the nameOrNumber and then see what happens. The result is shown below.

Now we know that our step function will not wait for a year before failing. However, it does raise another question. What happens if our step function has timed out just before the response comes back from the valuation service?

Handling late responses

To find out what happens, we turn again to our mock valuation service. This service uses an Express Workflow to add a delay before sending the mock response as shown below.

What we can do is vary the wait time depending on the valuation request. We do this by passing in a delay into the state machine as follows.

const stateMachineData: ValuationStateMachineData = {
  ...valuationRequest,
  valuationReference,
  delaySeconds:
    valuationRequest.property.nameOrNumber === 'Late callback' ? 60 : 6,
};

We then bind the Wait step to the delay value.

"Wait": {
  "Type": "Wait",
  "SecondsPath": "$.delaySeconds",
  "Next": "SendResponse"
}

With this in place, we can write another unit test to send such a request with 'Late callback' as the nameOrNumber and then run it to see what happens. What we find is the following error being thrown by the Lambda function when it tries to restart the step function.

{
  "errorType": "TaskTimedOut",
  "errorMessage": "Task Timed Out: 'Provided task does not exist anymore'",
  "code": "TaskTimedOut",
  "message": "Task Timed Out: 'Provided task does not exist anymore'",
  "stack": [
    "TaskTimedOut: Task Timed Out: 'Provided task does not exist anymore'",
    "    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)",
    "    at ",
    "    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18)"
  ]
}

So now we know what to expect when the service either doesn't reply or doesn't reply in time. What about the scenario when we get called more than once with the same task token?

Handling duplicate responses

A duplicate response could occur when we use any service that promises an 'at least once' delivery, such as EventBridge. This means that we could receive the same message more than once. Given this, let us look at what happens when the same task token is used more than once.

To do this, we amend the mock valuation service to send two responses when we send a request with 'Duplicate response' in the payload. By running another unit test, we then find that we get another TaskTimedOut error.

{
  "errorType": "TaskTimedOut",
  "errorMessage": "Task Timed Out: 'Provided task does not exist anymore'",
  "code": "TaskTimedOut",
  "message": "Task Timed Out: 'Provided task does not exist anymore'",
  "stack": [
    "TaskTimedOut: Task Timed Out: 'Provided task does not exist anymore'",
    "    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)",
    "    ",
    "    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18)"
  ]
}

This means that we cannot use the error to tell the difference between the duplicate scenario and the late scenario, as both result in the same TaskTimedOut error. If it is important to us to know the difference, say we want to ignore such duplicates, then we could extend our DynamoDB table that holds the task tokens. We could add a property to record if the token has been used, then check that before trying to use it.
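The idea of recording whether a token has been used can be sketched as follows. An in-memory Map stands in for the DynamoDB table, and the TaskTokenStore class, its method names, and keying the tokens by valuation reference are all assumptions made for illustration.

```typescript
type TokenUseResult = 'firstUse' | 'duplicate' | 'unknown';

// In-memory stand-in for the DynamoDB table that holds the task tokens
class TaskTokenStore {
  private readonly items = new Map<
    string,
    { taskToken: string; used: boolean }
  >();

  put(valuationReference: string, taskToken: string): void {
    this.items.set(valuationReference, { taskToken, used: false });
  }

  // Mark the token as used and report whether this was its first use
  claim(valuationReference: string): TokenUseResult {
    const item = this.items.get(valuationReference);
    if (item === undefined) {
      return 'unknown'; // A reference that we do not expect
    }
    if (item.used) {
      return 'duplicate'; // Already used, so this response can be ignored
    }
    item.used = true;
    return 'firstUse';
  }
}

const tokenStore = new TaskTokenStore();
tokenStore.put('val-1', 'token-abc');

const firstClaim = tokenStore.claim('val-1');
const secondClaim = tokenStore.claim('val-1');
const unexpectedClaim = tokenStore.claim('val-unknown');
```

With a real table, the check and the update would need to happen atomically, for example by using a conditional write. Otherwise, two concurrent duplicates could both see the token as unused.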

Now that we have investigated various failure scenarios, let us look at how we might handle them.

Notifying ourselves of failure

In my blog post 'Better logging through technology', I ask developers to see logging through the eyes of support. That is, put yourself in the position where something has gone wrong and you need to work out what.

In our case, we know that the valuation service step throws a States.Timeout error when the timeout is exceeded. What we will do is amend the step function to publish an SNS message to an error topic. This will give us the flexibility to subscribe to this topic and take a range of actions, such as sending an email.

When things do go wrong, we want to be as helpful to those on support as we can. This means sending information that will allow someone to go directly to the thing that failed. For this, we are using the Step Function Context object.

By using the $$ prefix, we can publish useful information along with our message. In this case, the ids of the state machine and the execution along with the start time of the execution. This information can then be used to identify exactly what has failed and when. Then the investigations can begin.
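To illustrate, here is a sketch of what the message template for the publish step might look like, written as a plain object. The three paths are fields of the Step Functions context object. The property names on the left and the `source` and `description` values are only illustrative.

```typescript
// Illustrative message template for the SNS publish state. Keys ending in
// '.$' are resolved by Step Functions at runtime; the '$$' prefix reads from
// the context object rather than from the state input.
const timeoutErrorMessage = {
  source: 'LoanProcessor.RequestValuation', // hypothetical label
  description: 'Valuation service timed out', // hypothetical text
  'stateMachineId.$': '$$.StateMachine.Id',
  'executionId.$': '$$.Execution.Id',
  'executionStartTime.$': '$$.Execution.StartTime',
};
```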

To handle the scenario where the service doesn't reply in time, we can add a try/catch in the Lambda function and have that publish the error to our error topic. In this case, we use the AWS Lambda context object in Node.js to get hold of the function ARN.

} catch (error: any) {  const publishError: PublishInput = {    Message: JSON.stringify({      source: context.invokedFunctionArn,      description: error.message,      eventRequestId: event.requestContext.requestId,      eventBody: event.body,    }),    TopicArn: errorTopicArn,  };  await sns.publish(publishError).promise();}

Here we try to keep to a convention of having a source and description, along with error-specific values. A bit of consistency is never a bad thing.

Handling failed requests

There is another task token failure scenario, and that is when we want to restart the step function with a failure. In our example, the valuation service can call back with a response that indicates that it failed. In this case, we want our step function to exit and inform us that it failed.

The way we do this is by using the task token, but using the sendTaskFailure method. With this, we can restart the step function with an error as follows.

const taskFailureOutput = await stepFunctions  .sendTaskFailure({    taskToken: taskTokenItem.taskToken,    error: 'ValuationFailed',  })  .promise();

We then add another catch to the step function, to handle the ValuationFailed error and publish an SNS message to inform us.
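The following sketch shows roughly how the two catches might sit together on the task, expressed as a plain object in Amazon States Language form. The state names and the timeout value are hypothetical. The error name matches the `error` value passed to sendTaskFailure.

```typescript
// Sketch of the relevant part of the task state. State names and the
// timeout value are hypothetical and chosen only for illustration.
const requestValuationState = {
  Type: 'Task',
  Resource: 'arn:aws:states:::lambda:invoke.waitForTaskToken',
  TimeoutSeconds: 30,
  Catch: [
    // Raised by Step Functions when no callback arrives in time
    { ErrorEquals: ['States.Timeout'], Next: 'PublishTimeoutError' },
    // Raised when the callback handler calls sendTaskFailure
    { ErrorEquals: ['ValuationFailed'], Next: 'PublishValuationFailedError' },
  ],
  Next: 'ValuationReceived',
};
```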

Summary

In this post, we have seen a variety of ways that we can experience failures when dealing with task tokens and step functions. We saw how we can add a timeout to prevent a task from waiting a year to fail. We also saw how we can pass back a failure state to the step function and how we can handle these errors in the flow. Finally, we saw how we can notify ourselves with context information that could help us to diagnose the source of any errors.

Using Step Function Task Tokens with CDK

Andy Blackledge — Sat, 10 Sep 2022 08:22:15 GMT

In this post, we will see how to implement the 'Wait for a Callback' Service Integration Pattern using task tokens and the CDK.

The pattern is described in the AWS documentation as follows (slight paraphrasing):

Callback tasks provide a way to pause a workflow until a task token is returned. A task might need to wait for a human approval, integrate with a third party, or call legacy systems. For tasks like these, you can pause Step Functions indefinitely, and wait for an external process or workflow to complete. For these situations Step Functions allows you to pass a task token to the service. The task will pause until it receives that task token back.

In our example, we will have the step function call an API endpoint and then wait for a webhook to be called, before restarting the step function.

Clone the companion repo to run the code for yourself.

TL;DR

integrationPattern needs to be set to IntegrationPattern.WAIT_FOR_TASK_TOKEN
payload must be specified and contain a property set to JsonPath.taskToken
payloadResponseOnly must not be set to true
Use the sendTaskSuccess method to restart the step function

Application overview

Below is an overview of our application. On the left we have the step function that simulates part of a mortgage loan processing system. One step of this process is to call an external Valuation Service. This service is asynchronous and sends its response via a webhook specified in the valuation request.

We are going to implement a mock Valuation Service that uses a step function to implement a six second delay, before it makes a call back to the loan processor via a webhook.

Requesting a valuation

Our step function consists of a single task that invokes a Lambda function. The definition is shown below.

const requestValuationTask = new LambdaInvoke(this, 'RequestValuation', {  lambdaFunction: valuationRequestFunction,  integrationPattern: IntegrationPattern.WAIT_FOR_TASK_TOKEN,  payload: TaskInput.fromObject({    taskToken: JsonPath.taskToken, // NOT "$$.Task.Token" as in some examples    'loanApplication.$': '$',  }),  // NOT payloadResponseOnly: true,});

Things to note are:

integrationPattern needs to be set to IntegrationPattern.WAIT_FOR_TASK_TOKEN.
payload must be specified and contain a property set to JsonPath.taskToken
If you specify 'taskToken.$': '$$.Task.Token', then you get the following error at synth time:
Error: Task Token is required in payload for callback. Use JsonPath.taskToken to set the token.
If you specify 'taskToken.$': JsonPath.taskToken, then you get the error at runtime:
The Parameters '~snip~' could not be used to start the Task: [The value for the field 'taskToken.$' must be a valid JSONPath or a valid intrinsic function call]
payloadResponseOnly must not be set to true, otherwise you get the following error:
Error: The 'payloadResponseOnly' property cannot be used if 'integrationPattern', 'invocationType', 'clientContext', or 'qualifier' are specified.

The valuation service is a third-party service and the shape of a request is shown below.

export interface ValuationRequest {  property: {    nameOrNumber: string;    postcode: string;  };  callbackUrl: string;}

The details of the property to be valued are specified, along with a URL to be called with the actual valuation.

Below is a snippet from the Lambda function that makes the call to the service. It uses the property details passed in from the step function, along with the callback URL obtained from the environment, to make a simple call using the axios library.

const valuationRequest: ValuationRequest = {  property: event.loanApplication.property,  callbackUrl,};const response = await axios.post(valuationServiceUrl, valuationRequest);

Waiting for the callback

The next stage of the process is to wait for the callback from the valuation service. We will need the task token when this happens, but the valuation service is not aware of the task token nor should it be.

What the valuation service does provide when we make the request is a valuationReference. What we can do is store the task token in a DynamoDB table, using the valuationReference as the key.

const valuationRequestResponse = response.data as ValuationRequestResponse;await taskTokenStore.putAsync({  keyReference: valuationRequestResponse.valuationReference,  taskToken: event.taskToken,});

Processing the callback

When the valuation callback is received, the response contains the following information:

export interface ValuationResponse {  valuationReference: string;  propertyValue: number;}

We use the valuationReference to look up the task token that we stored earlier. We then use the sendTaskSuccess method to restart the step function where we left off, passing in the valuation response as the output property.

const valuationResponse = JSON.parse(event.body) as ValuationResponse;const taskTokenItem = await taskTokenStore.getAsync(  valuationResponse.valuationReference);await stepFunctions  .sendTaskSuccess({    taskToken: taskTokenItem.taskToken,    output: JSON.stringify(valuationResponse),  })  .promise();
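The whole round trip can be sketched without any AWS services. In the following, a map stands in for the DynamoDB table and an array records what would be passed to sendTaskSuccess. The function and variable names are hypothetical and only illustrate how the valuationReference correlates the callback with the waiting task.

```typescript
interface ValuationResponse {
  valuationReference: string;
  propertyValue: number;
}

// Stand-in for the DynamoDB table, keyed by valuationReference.
const taskTokensByReference = new Map<string, string>();

// Stand-in for the Step Functions client; records what would be sent.
const sentSuccesses: { taskToken: string; output: string }[] = [];

// Called once the valuation service has accepted the request.
function rememberTaskToken(valuationReference: string, taskToken: string): void {
  taskTokensByReference.set(valuationReference, taskToken);
}

// Called when the webhook receives the valuation. Returns false if the
// reference is not one that we are waiting on.
function handleCallback(body: string): boolean {
  const response = JSON.parse(body) as ValuationResponse;
  const taskToken = taskTokensByReference.get(response.valuationReference);
  if (taskToken === undefined) return false;
  sentSuccesses.push({ taskToken, output: JSON.stringify(response) });
  return true;
}
```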

That is all there is to getting the basic functionality working. One other thing to note is that the Lambda function that restarts the step function requires the appropriate IAM permission to do so. This is done via the grantTaskResponse method, as shown below.

this.stateMachine.grantTaskResponse(valuationCallbackFunction);

Testing

Once deployed, we can test either via the AWS console or by running the unit test in the companion repo.

In the AWS console, we can submit the following request:

{  "applicationReference": "app-ref",  "property": {    "nameOrNumber": "999",    "postcode": "PO1 1CE"  }}

In the graph inspector, we then see the task go blue as it waits for the callback. Then, after the six second delay, it goes green as the callback is received and processed.

Looking at the event history, we can clearly see the delay in the execution time. We can also see that we have passed the response from the valuation service back into the step function.

Summary

In this post we saw how we can use the CDK to define a step function that implements the 'Wait for a Callback' Service Integration Pattern. We saw how we need to be careful in how we define the task and how we might need to store the task token if calling external services.

What we didn't consider was what happens if things go wrong. For example, what if the valuation service never called us back? What if the valuation service returned an error or a reference we didn't understand? These are all questions for the next post 🙂

Adventures with Lambda layers and CDK

Andy Blackledge — Thu, 18 Aug 2022 18:03:12 GMT

In this post, I will recount my experience of trying out Lambda layers with a small project. It proved to be a good way to get an understanding of what Lambda layers are, how they are used, what their limitations might be, and when they could be useful.

Clone the companion repo to run the code for yourself.

TL;DR

esbuild will bundle layer code if you are not careful
Use paths in tsconfig to compile locally
Lambda layers are immutable and functions reference a specific version
SSM parameters can be used to deploy updates with no rebuild
Conclusion: npm is better for most cases of reuse

The starting point

In an earlier post, I created a Lambda function that used a hexagonal architecture approach. The following diagram shows how we abstracted the implementation of the data stores from the domain logic.

This separation made me wonder whether it might be an interesting thing to try to package these data stores as a Lambda layer. The intended result is shown below.

Packaging and deploying the layer

The folder structure I chose is shown below. The layer contents are in the layer\nodejs folder. The *Store.ts files contain the data store classes which are in turn exported by the data-access.ts file. There is a convention at play here, as for Node.js layers to work, the code must be in a nodejs folder.

The DataAccessLayer.ts file contains the CDK construct that will be used to package and deploy the layer.

The DataAccessLayer construct is shown in full below.

export default class DataAccessLayer extends Construct {  //  static readonly LAYER_ARN_SSM_PARAMETER = '/layer-arn/data-access';  readonly layer: ILayerVersion;  constructor(scope: Construct, id: string) {    super(scope, id);    this.layer = new LayerVersion(this, 'DataAccessLayer', {      compatibleRuntimes: [Runtime.NODEJS_12_X, Runtime.NODEJS_14_X],      code: Code.fromAsset(        path.join(__dirname, `/../../dist/src/data-access/layer`)      ),      description: 'Provides data access clients',    });    new StringParameter(this, 'DataAccessLayerArnSsmParameter', {      parameterName: DataAccessLayer.LAYER_ARN_SSM_PARAMETER,      stringValue: this.layer.layerVersionArn,      description: 'The ARN of the latest Data Access layer',      type: ParameterType.STRING,      tier: ParameterTier.STANDARD,    });  }}

The LayerVersion construct uses the Code.fromAsset method to point to the output from the TypeScript compiler (the dist subfolders). Note that it points to the parent of the nodejs folder.

Each time a Lambda layer is deployed, a new version is created with an incremented version number. As part of this construct we create an SSM Parameter to store the ARN of the latest version. We will use this later on when deploying dependent components.
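The version number is the final segment of the layer version ARN. As a small illustration, the following hypothetical helper (not part of the companion repo) splits such an ARN into its layer name and version.

```typescript
// Splits a layer version ARN of the form
// arn:aws:lambda:<region>:<account>:layer:<name>:<version>
function parseLayerVersionArn(arn: string): {
  layerName: string;
  version: number;
} {
  const parts = arn.split(':');
  if (parts.length !== 8 || parts[5] !== 'layer') {
    throw new Error(`Not a layer version ARN: ${arn}`);
  }
  return { layerName: parts[6], version: Number(parts[7]) };
}
```

Something like this could be handy when logging which version of the layer a deployment has picked up from the parameter.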

Now that we have our construct, we can create a stack to deploy it.

export default class DataAccessStack extends Stack {  //  constructor(scope: Construct, id: string, props?: DataAccessStackProps) {    super(scope, id, props);    const dataAccessLayer = new DataAccessLayer(this, 'DataAccessLayer');    new CfnOutput(this, 'DataAccessLayerArn', {      value: dataAccessLayer.layer.layerVersionArn,    });  }}

Once deployed, we can see the layer in the console.

Using the Lambda layer

Now that we have our Lambda layer, we need to work out how to use it. The first trick is how we need to reference the layer so that it will work at runtime. This involves referencing an absolute path starting with /opt/nodejs, as shown below.

import { AccountDetailStore, CustomerStore } from '/opt/nodejs/data-access';

This gives us the first major problem, which is how do we compile our TypeScript code against it. Although we have the source code for the layer, it isn't at the location that our import is pointing to.

The solution is to use the paths compiler option in tsconfig.json to point the compiler at the actual location of the layer source code.

{  "compilerOptions": {    /* Snip */    "paths": {      "/opt/nodejs/data-access": ["src/data-access/layer/nodejs/data-access"]    }  }  /* Snip */}

With this in place, we can happily build our Lambda function. However, it raises the question of how you would do this if the layer source code was not available locally. You can download the layers via the console, and possibly by the SDK, but this is not ideal. This is one of the challenges that seem to make working with layers more complicated than working with npm packages.

Packaging the Lambda function

Our next challenge is how we can package up our Lambda function. The trick here is not to package up the layer code with the Lambda function code. The first example I looked at was using the NodejsFunction construct. This construct uses esbuild behind the scenes to bundle the code into a single file. This is usually very convenient, but had the result of including the layer code directly. When I updated the example layer code, the behaviour of the Lambda function using the layer didn't change. This was because it wasn't actually using the layer code.

You can specify BundlingOptions for the NodejsFunction construct, which has an externalModules property that is documented as follows:

A list of modules that should be considered as externals (already available in the runtime).

However, for our project, we can avoid this bundling issue by using the Function construct and the AssetCode class. Using these, we can point at the TypeScript output folder dist as shown below.

export interface CustomerUpdatedProps {  dataAccessLayerArn: string;  customerUpdatedTopic: ITopic;  customerTableName: string;  accountDetailTableName: string;}export default class CustomerUpdatedHandler extends Construct {  constructor(scope: Construct, id: string, props: CustomerUpdatedProps) {    super(scope, id);    // Snip - getting tables    const dataAccessLayer = LayerVersion.fromLayerVersionArn(      this,      'DataAccessLayer',      props.dataAccessLayerArn    );    const customerUpdatedHandlerFunction = new Function(      scope,      'CustomerUpdatedHandlerFunction',      {        runtime: Runtime.NODEJS_14_X,        handler: 'CustomerUpdatedHandlerFunction.handler',        code: new AssetCode(          path.join(            __dirname,            `/../../dist/src/application/customer-updated-handler-function`          )        ),        environment: {          [ENV_VAR_CUSTOMER_TABLE_NAME]: props.customerTableName,          [ENV_VAR_ACCOUNT_DETAIL_TABLE_NAME]: props.accountDetailTableName,        },        layers: [dataAccessLayer],      }    );    // Snip - adding subscriptions and permissions  }}

We use the layers property to make the layer available to the function. We obtain the reference to the layer by using the LayerVersion.fromLayerVersionArn method and the ARN of the layer, which we pass in via the construct props.

Now we have our function construct we can use it in a stack. Here we obtain the layer ARN from the SSM Parameter that we created when we deployed the layer. This approach means that every time we deploy the function it picks up the latest version of the layer.

export default class ApplicationStack extends Stack {  constructor(scope: Construct, id: string, props?: ApplicationStackProps) {    super(scope, id, props);    // Snip - creating topic and getting table name parameters    const dataAccessLayerArnSsmParameter =      StringParameter.fromStringParameterName(        this,        'DataAccessLayerArnSsmParameter',        DataAccessLayer.LAYER_ARN_SSM_PARAMETER      );    new CustomerUpdatedHandler(this, 'CustomerUpdatedHandler', {      dataAccessLayerArn: dataAccessLayerArnSsmParameter.stringValue,      customerUpdatedTopic,      customerTableName: customerTableNameParameter.stringValue,      accountDetailTableName: accountDetailTableNameParameter.stringValue,    });  }}

I am not recommending this approach, but it was an interesting concept to play with. I had wondered if functions could reference the latest layer and then if you could update the latest layer underneath it. However, functions have to reference a specific layer version. So the approach above is as close as I could manage, where you can update a function without a code change.

After we deploy the Lambda function, we can see in the console that the layer is referenced as expected.

Summary

It was interesting to have a play with Lambda layers and it was a bit of a challenge to get it all working. Given this, I suspect that I will not be reaching for them out of my toolbox any time soon.

In this post, we covered deploying and building, but we didn't cover testing. The companion repo includes some integration tests, but local testing is tricky. I prefer integration testing in AWS, so this isn't so much of a negative for me. However, it may be a deal-breaker for some.

All in all, I think I agree with the conclusions in the fine article "AWS Lambda Use Cases: When to use Lambda layers" by Yan Cui, one of which is to prefer npm as the default reuse approach.

That said, it was interesting to actually kick the tyres on layers. If you are curious, then clone the repo and have a play yourself.

Hexagonal Architecture with CDK, Lambda, and TypeScript

Andy Blackledge — Sun, 17 Jul 2022 06:49:36 GMT

In this post we look at how we can adopt a hexagonal architecture approach when developing Lambda functions. I am not proposing that this approach is the one true way, but I think it is useful to be aware of the concept and the advantages that it can convey. Even if you do not embrace the approach wholesale, adopting some facets of it can be useful in itself.

Hexagonal Architecture in a nutshell

Hexagonal architecture is an approach to writing software, where the essence of the problem - the domain - is separated from the underlying implementation details.

For example, the problem might involve responding to a customer order by initiating multiple downstream processes. This response might also involve some business rules to determine the parameters passed to those processes.

In practice the process may be handling an EventBridge event, reading and writing to DynamoDB tables, and then sending SQS messages. With a hexagonal architecture, these implementation details are hidden behind abstractions. This enables the business logic to be expressed in purely business terms.

In our case, we are going to be implementing a Lambda function that hosts the business logic as follows:

The Lambda function will take care of mapping the AWS-specific inputs to the business domain logic. It will also provide AWS-specific services that use interfaces to isolate the logic from the implementation details.

For a more in-depth explanation, please see 'Hexagonal Architecture, there are always two sides to every story' and 'Hexagonal (Ports & Adapters) Architecture'.

Is this really necessary?

You would be quite right at this point to ask the question of whether this level of abstraction is justified. It can be argued that abstracting too early is a trap that many have fallen into. The result being code that has clumsy abstractions or is hard to follow with indirection after indirection.

In fact, there is a current movement to replace 'classical' coding with 'function-less' coding. This approach uses direct integrations, such as VTL templates in API Gateway or AWS SDK integrations in Step Functions. These are undoubtedly very efficient and have their place. However, such a low-level approach has downsides such as reduced readability and portability.

The post 'The trade-offs with functionless integration patterns in serverless architectures' covers this topic very well.

After considering the cons, let us now look at our example and see how we can use hexagonal architecture principles. We shall see how it affects the code we write and how we can test it.

The business problem

Within our business domain we have the concept of a customer entity, each of which has a single address. Each customer can have multiple accounts, which are separate entities. Each of these accounts has a correspondence address and a billing address.

When the address on a customer is updated, an event is raised and the correspondence address on the accounts must be updated in line. The event also contains a flag indicating whether the customer wanted the new address to also update their billing addresses.

The domain objects

A hexagonal approach relies on business-level abstractions. So the first thing we will do is define the entities, events, and services in our business domain.

The main two objects in our business domain are the customer and their account details. In domain-driven design terminology, these are both entities in that they have an identity and a lifecycle, i.e. they can change over time.

export class Customer {  customerId: string;  name: string;  address: Address;}export class AccountDetail {  accountDetailId: string;  customerId: string;  correspondenceAddress: Address;  billingAddress: Address;}

The address object, on the other hand, is a value object. That is, it has no identity of its own and never changes.

export class Address {  lines: string[];  postalCode: string;}

The event simply contains the id of the customer updated and whether or not they requested that their billing addresses be updated.

export class CustomerUpdatedEvent {  customerId: string;  billingUpdateRequested: boolean;}

Finally, we define the services that our hexagonal code will use. In this case, these comprise two data stores. One for the customer data and one for the account detail data. These differ from the previous objects in that they are abstract. That is, they do not include any references to how we are going to implement them.

export interface ICustomerStore {  retrieveCustomerAsync(customerId: string): Promise<Customer | undefined>;  upsertCustomerAsync(customer: Customer): Promise<void>;}export interface IAccountDetailStore {  listAccountDetailsByCustomerIdAsync(    customerId: string  ): Promise<AccountDetail[]>;  upsertAccountDetailAsync(accountDetail: AccountDetail): Promise<void>;}

Implementing the business logic

Now that we have the domain objects defined, we can move on to implementing the business logic in a handler class.

We start by specifying in the constructor that we require two data stores. We use the interface definitions to isolate our handler from the underlying implementation details.

export default class CustomerUpdatedHandler {  constructor(    private customerStore: ICustomerStore,    private accountDetailsStore: IAccountDetailStore  ) {}}

Next we define the handleAsync method that will handle the event.

async handleAsync(event: CustomerUpdatedEvent): Promise<void> {}

The first thing the handleAsync method needs to do is to retrieve the customer. Here we use the ICustomerStore that was passed in to the constructor.

const customer = await this.customerStore.retrieveCustomerAsync(  event.customerId);if (!customer) {  throw new Error(`No customer found for id: ${event.customerId}`);}

Next we retrieve all the account details for the customer and build up an array of promises containing the updates required. Note how the code is able to express the logic in purely business terms.

const accountDetails =  await this.accountDetailsStore.listAccountDetailsByCustomerIdAsync(    event.customerId  );const updateAccountDetailPromises = accountDetails.map((ad) => {  const updatedAccountDetail = {    ...ad,    correspondenceAddress: customer.address,  };  if (event.billingUpdateRequested) {    updatedAccountDetail.billingAddress = customer.address;  }  return this.accountDetailsStore.upsertAccountDetailAsync(    updatedAccountDetail  );});

Finally, we use the Promise.allSettled method to perform the updates and we check the results in case any failed. If so, we throw an error to ensure these do not go unnoticed.

const updateAccountDetailResults = await Promise.allSettled(  updateAccountDetailPromises);const rejectedReasons = updateAccountDetailResults  .filter((r) => r.status === "rejected")  .map((r) => (r as PromiseRejectedResult).reason as string);if (rejectedReasons.length > 0) {  throw new Error(    `One or more updates were not processed: ${rejectedReasons.join(", ")}`  );}
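The reason for choosing Promise.allSettled over Promise.all is that it waits for every update and reports every failure, rather than rejecting on the first one. The following standalone sketch shows the same pattern in isolation. The `settleAll` name is hypothetical, and a type guard is used in place of the cast.

```typescript
// Waits for all updates and returns the reason for each one that failed.
async function settleAll(updates: Promise<void>[]): Promise<string[]> {
  const results = await Promise.allSettled(updates);
  return results
    .filter((r): r is PromiseRejectedResult => r.status === 'rejected')
    .map((r) => String(r.reason));
}
```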

Testing the business logic

One of the advantages of adopting a hexagonal approach is the ease of testing business logic. We can use our favourite mocking tool to supply mocks for the services and avoid having to mock AWS services or provide local simulated services.

In our case, we are using the Jest testing framework and its in-built mocking. Before each test we provide a default mock implementation of the two store interfaces.

let customerStoreMock: ICustomerStore;let accountDetailStoreMock: IAccountDetailStore;beforeEach(() => {  customerStoreMock = {    retrieveCustomerAsync: jest.fn(),    upsertCustomerAsync: jest.fn(),  };  accountDetailStoreMock = {    listAccountDetailsByCustomerIdAsync: jest.fn(),    upsertAccountDetailAsync: jest.fn(),  };});

Now we have our base mocks, we can create the boilerplate for our first test scenario.

it('handles no accounts', async () => {  // Arrange  // Act  // Assert});

Our 'arrange' step involves creating our test data, mocking the store methods, and then creating the handler passing in the mock implementations.

const testCustomerId = 'TestCustomerId';const testCustomer: Customer = {  customerId: testCustomerId,  name: 'Test Customer',  address: {    lines: ['Line1', 'Line2'],    postalCode: 'PostalCode',  },};customerStoreMock.retrieveCustomerAsync = jest  .fn()  .mockResolvedValue(testCustomer);accountDetailStoreMock.listAccountDetailsByCustomerIdAsync = jest  .fn()  .mockResolvedValue([]);const accountUpdaterFunction = new CustomerUpdatedHandler(  customerStoreMock,  accountDetailStoreMock);

Now that we have our handler, we can call the handleAsync method in our 'act' step.

await accountUpdaterFunction.handleAsync({  customerId: testCustomerId,  billingUpdateRequested: false,});

Our 'assert' step is similarly simple, using the Jest expectations to verify the calls made.

expect(customerStoreMock.retrieveCustomerAsync).toBeCalledWith(  testCustomerId);expect(  accountDetailStoreMock.listAccountDetailsByCustomerIdAsync).toBeCalledWith(testCustomerId);expect(accountDetailStoreMock.upsertAccountDetailAsync).toBeCalledTimes(0);

This simple example shows how the separation of implementation from business logic can make testing the latter easier. The accompanying GitHub repo contains more examples of business logic tests.

Implementing the Lambda function

Now that we have verified the business logic, we can look at how we can use it in the context of a Lambda function. The first thing to do is to instantiate it with concrete implementations for the ICustomerStore and IAccountDetailStore.

import DomainHandler from "../domain-handlers/CustomerUpdatedHandler";const domainHandler = new DomainHandler(  new CustomerStore(process.env["CUSTOMER_TABLE_NAME"]),  new AccountDetailStore(process.env["CUSTOMER_TABLE_NAME"]));

I won't go into the actual implementation of CustomerStore and AccountDetailStore here, but they can be found in the accompanying repo. The repo also contains examples of how the service implementations themselves can be tested independently.

With the instance of the domain handler, we write the Lambda function handler. The function takes care of translating the SNSEvent objects into domain events, which are then despatched to the domain handler handleAsync method.

export const handler = async (event: SNSEvent): Promise<void> => {  const accountUpdaterFunctionPromises = event.Records.map((r) => {    const customerUpdatedEvent = JSON.parse(      r.Sns.Message    ) as CustomerUpdatedEvent;    return domainHandler.handleAsync(customerUpdatedEvent);  });  const accountUpdaterFunctionResults = await Promise.allSettled(    accountUpdaterFunctionPromises  );  const rejectedReasons = accountUpdaterFunctionResults    .filter((r) => r.status === "rejected")    .map((r) => (r as PromiseRejectedResult).reason as string);  if (rejectedReasons.length > 0) {    throw new Error(      `One or more updates were not processed: ${rejectedReasons.join(", ")}`    );  }};

Here we can see that there is a clear separation of responsibilities, with the Lambda function handler hiding the AWS service details from the domain handler. We can also see potential for making such a handler generic, as the business logic is hidden from it.
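As a rough sketch of what such a generic handler might look like, the following factory wraps any domain handler that exposes a handleAsync method. The `createSnsHandler`, `DomainHandler`, and `SnsEventLike` names are hypothetical. `SnsEventLike` is a minimal local stand-in for the SNSEvent type, included so that the example is self-contained.

```typescript
// Minimal local stand-in for the parts of SNSEvent that we use.
interface SnsEventLike {
  Records: { Sns: { Message: string } }[];
}

// Any domain handler that can process a single domain event.
interface DomainHandler<TEvent> {
  handleAsync(event: TEvent): Promise<void>;
}

// Builds a Lambda-style handler that maps SNS records to domain events.
function createSnsHandler<TEvent>(domainHandler: DomainHandler<TEvent>) {
  return async (event: SnsEventLike): Promise<void> => {
    const results = await Promise.allSettled(
      event.Records.map((r) =>
        domainHandler.handleAsync(JSON.parse(r.Sns.Message) as TEvent)
      )
    );
    const rejectedReasons = results
      .filter((r): r is PromiseRejectedResult => r.status === 'rejected')
      .map((r) => String(r.reason));
    if (rejectedReasons.length > 0) {
      throw new Error(
        `One or more events were not processed: ${rejectedReasons.join(', ')}`
      );
    }
  };
}
```

With something like this in place, each Lambda function would only need to construct its domain handler and pass it to the factory.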

Summary

In this post, we saw how we can structure our TypeScript Lambda functions using hexagonal architecture principles to isolate the business logic from the AWS service details. This gives advantages in testability and portability, amongst others.

However, this approach does come at the expense of additional levels of abstraction, which many might find unnecessary for their scale of application. In my experience, I have found that the bigger the application, the more need for structure. The trick is knowing how big that is, and it is surprisingly easy to reach that limit.

Creating composable CDK constructs

Andy Blackledge — Mon, 20 Jun 2022 17:46:08 GMT

In this post we demonstrate the power of composable CDK constructs. We do this by building a generic construct to add retry functionality to idempotent state machines.

The code for this post is ready to be cloned, built, and deployed from the companion repo.

TL;DR

Think about the interface first
Keep coupling one-way

The requirement

Our starting point is a state machine that makes a sequence of HTTP API calls and sends the result to an 'Output' SQS queue. If any of the calls fail, then the state is sent to a 'Failure' SQS queue. See an earlier post for a full description of how this was built.

Our challenge is to add retry functionality. That is, if one API is erroring, we can wait until it is fixed and then push the requests back through the state machine.

In addition to this basic requirement, we add one of our own. We want to make this additional functionality generic, so that it can be easily reused for other similar state machines.

The solution

Our solution is to create a Retrier construct, consisting of two Lambda functions and an SQS queue, and attach it to the state machine construct.

The first Lambda function subscribes to the 'Failure' SQS queue and sends messages to its own 'Retry' SQS queue. A second Lambda function then replays the requests from this queue, back into the state machine.

There is a big danger here, which is that we now have a loop. If we are not careful, when the state machine errors we will end up in a tight loop, spinning up Lambda functions and incurring the corresponding costs.

To avoid this, we have two 'taps'. These taps are the 'Enabled' settings on the Lambda function triggers. In normal operation, tap 1 is enabled and failures are sent to the 'Retry' SQS queue. When failures have occurred, but we believe it is good to retry, tap 1 is disabled and tap 2 is enabled. The requests are then replayed and, hopefully, processed successfully. Tap 2 is then disabled, before tap 1 is enabled again.

As long as both taps are not enabled at once, we avoid any possibility of runaway Lambda functions.
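The ordering matters more than the mechanism, so here is a minimal sketch of just that ordering. The names `Tap`, `SetTapEnabled`, and `switchToTap` are hypothetical and not from the companion repo, and the callback stands in for whatever actually flips the 'Enabled' setting on a trigger.

```typescript
// Hypothetical names throughout; this models the ordering of the taps only.
type Tap = 'queueRetries' | 'retry';

// Stand-in for a call that flips the 'Enabled' setting on a Lambda trigger.
type SetTapEnabled = (tap: Tap, enabled: boolean) => void;

// Always close the tap that should be off BEFORE opening the other one,
// so there is never a moment when both taps are enabled.
function switchToTap(target: Tap, setTapEnabled: SetTapEnabled): void {
  const other: Tap = target === 'retry' ? 'queueRetries' : 'retry';
  setTapEnabled(other, false);
  setTapEnabled(target, true);
}

// Usage: record the calls so that the ordering can be inspected.
const tapCalls: string[] = [];
switchToTap('retry', (tap, enabled) => {
  tapCalls.push(`${tap}=${enabled}`);
});
// tapCalls is now ['queueRetries=false', 'retry=true']
```

Closing before opening means that a failure part-way through leaves both taps off, which is the safe state.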

Note that this approach relies on the state machine being idempotent. That is, requests can safely be retried multiple times with no unwanted side-effects. For a good explanation of this topic, please see the article 'What Is Idempotence?'.

In our case, each step of the state machine checks the current state to see if it has already run. If it has, then it simply passes execution to the next step. With this approach, we can safely retry any failed request, regardless of which step failed.
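As an illustration of that check, the following is a simplified, synchronous sketch of an idempotent step. The names `StepState` and `stepOne` are made up for this example, and the counter stands in for the external API call that a real step would make asynchronously.

```typescript
// Hypothetical state shape: the result property is only present once the step has run.
type StepState = { input: string; stepOneResult?: string };

// Counts how many times the (simulated) external call is actually made.
let stepOneCallCount = 0;

function stepOne(state: StepState): StepState {
  // Already ran on a previous attempt: pass the state straight through.
  if (state.stepOneResult !== undefined) {
    return state;
  }
  stepOneCallCount += 1; // stands in for the external API call
  return { ...state, stepOneResult: `checked:${state.input}` };
}

// Usage: replaying the same request does not repeat the call.
const firstAttempt = stepOne({ input: 'abc' });
const replayed = stepOne(firstAttempt);
// stepOneCallCount is 1 and replayed.stepOneResult is 'checked:abc'
```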

Implementing the `Retrier` construct

As with all components, the first thing to do is think about the interface. In our case, this has two inputs and no outputs.

The inputs are supplied using the standard CDK props pattern. They are the SQS queue that will receive failed requests and the Lambda function that is to be used to retry requests.

export interface RetrierProps {
  failureQueue: sqs.IQueue;
  retryFunction: lambda.IFunction;
}

export default class Retrier extends cdk.Construct {
  constructor(scope: cdk.Construct, id: string, props: RetrierProps) {
    super(scope, id);
  }
}

Now we have our interface, we can start to define the internal components. First, we define the 'Retry' queue. This will hold failed requests until we are ready to replay them to the state machine.

const retryQueue = new sqs.Queue(this, 'RetryQueue', {
  receiveMessageWaitTime: cdk.Duration.seconds(20),
  visibilityTimeout: cdk.Duration.seconds(3),
});

The first function consumes messages from the 'Failure' queue passed in via the properties. It then writes them to the 'Retry' queue to await replaying.

The implementation of the function can be found in the companion GitHub repo.

const queueRetriesFunction = new lambdaNodejs.NodejsFunction(scope, 'QueueRetriesFunction', {
  description: 'Queues up the requests to be retried',
  environment: {
    'RETRY_QUEUE_URL_ENV_VAR': retryQueue.queueUrl,
  },
});

props.failureQueue.grantConsumeMessages(queueRetriesFunction);
queueRetriesFunction.addEventSource(
  new lambdaEventSources.SqsEventSource(props.failureQueue)
);

retryQueue.grantSendMessages(queueRetriesFunction);
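The handler behind this function lives in the repo and is not shown here. As a rough sketch of the part that matters, the following assumes that each message on the 'Failure' queue is a Lambda destination record whose `requestPayload` property holds the original request. The type and function names are hypothetical, and the real implementation may differ.

```typescript
// Minimal local types; only the fields that this sketch reads.
interface SqsRecordLike {
  body: string;
}

// Pull the original request out of each failure message so that it can be
// queued for a later retry. Messages without a requestPayload are skipped.
function extractRetryPayloads(records: SqsRecordLike[]): unknown[] {
  const payloads: unknown[] = [];
  for (const record of records) {
    const parsed: unknown = JSON.parse(record.body);
    if (typeof parsed === 'object' && parsed !== null && 'requestPayload' in parsed) {
      payloads.push((parsed as { requestPayload: unknown }).requestPayload);
    }
  }
  return payloads;
}

// Usage with made-up message bodies.
const samplePayloads = extractRetryPayloads([
  { body: JSON.stringify({ requestPayload: { input: { firstName: 'Trevor' } } }) },
  { body: JSON.stringify({ unrelated: true }) },
]);
// samplePayloads contains only the first message's requestPayload
```

Each extracted payload would then be sent to the 'Retry' queue, whose URL is supplied through the environment variable shown above.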

The 'Retry' function is equally simple. It consumes messages from the 'Retry' queue and invokes a Lambda function to retry the request.

The implementation of the function can be found in the GitHub repo.

const retryFunction = new lambdaNodejs.NodejsFunction(scope, 'RetryFunction', {
  description: 'Retries the queued requests',
  environment: {
    'INPUT_FUNCTION_NAME_ENV_VAR': props.retryFunction.functionName,
  },
});

retryQueue.grantConsumeMessages(retryFunction);
retryFunction.addEventSource(
  new lambdaEventSources.SqsEventSource(retryQueue, {
    enabled: false,
  })
);

props.retryFunction.grantInvoke(retryFunction);

For this Lambda function, we have set enabled to false. This means that, by default, the construct will consume failed requests, but not retry them until we decide.

Note how we have created a construct that makes very few assumptions about the state machine. It assumes only that the requests are handled in an idempotent manner and that the state can fit within the size limit of an SQS message. See 'How do I configure the maximum message size for Amazon SQS?'.

Composing the constructs

Now we have our constructs, it is time to put them together. This is a straightforward process. We simply instantiate a LoanProcessor, then wire up the inputs of the Retrier to the properties of the LoanProcessor.

const loanProcessor = new LoanProcessor(this, 'LoanProcessor', {});

new Retrier(this, 'Retrier', {
  failureQueue: loanProcessor.failureQueue,
  retryFunction: loanProcessor.inputFunction,
});

Now the power of composable constructs becomes apparent, as we can see how easily we can add this retry functionality to any state machine.

Testing

To test our construct, we first configure the mock APIs to always fail. We then invoke the first Lambda function in the state machine and view the SQS queues in the AWS Console. As expected, the QueueRetriesFunction consumed a message from the failure queue and forwarded it on to the retry queue.

The next step is to fix the problem that caused the error. In our case, this is as simple as reconfiguring the mock API. For real applications, this might not be so straightforward 😉

With the problem fixed, we can look at retrying. The first step is to disable the SQS trigger on the QueueRetriesFunction to ensure that when we retry we don't end up looping.

Now that it is safe to retry, we enable the SQS trigger on the RetryFunction. This should retry the original request.

Looking at the SQS queues, we can see the output queue now has a message in it. Success! 🍾

The next step in the real world would be to disable the RetryFunction trigger, re-enable the QueueRetriesFunction trigger, and wait for further failures.

Summary

In this post, we built a construct that can be used to provide generic retry functionality to idempotent state machines. We saw how straightforward it is to compose higher-level functionality using constructs as building blocks.

Resources

Building a state machine with Lambda Destinations and CDK

Andy Blackledge — Tue, 26 Apr 2022 17:45:27 GMT

In this post we will look at how we can use Lambda destinations and CDK to create a simple state machine. AWS announced Lambda destinations in November 2019, so perhaps I am a little late to the party, but I hadn't yet used them and I wanted to try them out.

The code for this blog post is ready to be cloned, deployed, and run from the accompanying GitHub repo.

TL;DR

Destinations can be used to loosely couple Lambda functions together
Destinations are not used when a Lambda function is invoked synchronously
You can't inspect how a Lambda function is invoked

Introduction to Lambda destinations

The blog post Introducing AWS Lambda Destinations provides a thorough introduction to destinations, but we will cover the basics here.

To paraphrase the article above, Destinations routes the response from a Lambda invocation as follows:

On Success - When a function is invoked successfully, Lambda routes the record to the destination resource for every successful invocation.
On Failure - When a function invocation fails, Destinations routes the record to the destination resource for every failed invocation for further investigation or processing.

A destination resource can be one of the following targets: an SQS queue, an SNS topic, a Lambda function, or an EventBridge event bus.

Now, as the AWS documentation Asynchronous invocation says:

When you invoke a function asynchronously, you don't wait for a response from the function code. You can configure how Lambda handles errors, and can send invocation records to a downstream resource to chain together components of your application.

We are going to take advantage of this ability to chain components together to create a simple state machine.

The state machine

The state machine we are going to build is shown below. It takes an input state, then makes a call to an identity check service and a credit check service, before outputting the result to a 'success' SQS queue. If either call fails, the error and state are sent to a 'failure' SQS queue. We are going to wrap all this in a CDK construct.

The following interface shows the structure of the data that is passed through the state machine. The state machine is called with the input property populated, then the Lambda functions add the identityCheck and creditReference values. The final result is then sent to the 'success' SQS queue for further processing.

export interface LoanProcessorState {
  input: {
    firstName: string;
    lastName: string;
    postcode: string;
  };
  identityCheck?: {
    electoralRole: boolean;
    bankAccount: boolean;
  };
  creditReference?: {
    creditReferenceRating: 'Good' | 'Bad' | 'Poor';
  };
}

Yes, Step Functions would work as well

At this point, it is worth mentioning that Step Functions would be a good solution for a problem such as this. One advantage of the Destinations approach for simple chains is that it incurs no additional charge. However, since our example is not long-running, Express Workflows would address that concern. One definite advantage that Step Functions would have is that, as the service calls are independent, they could be performed in parallel.

The Lambda functions

Both Lambda functions follow the same pattern:

Return the current state if it already contains the API response
Call the API and store the response in the state
Return the updated state

Note that neither Lambda function has any knowledge of the other; they only have a dependency on the state. We will use Destinations to link them together.

The code for the credit reference Lambda function is shown below:

export const handler = async (state: LoanProcessorState): Promise<LoanProcessorState> => {
  // Skip the API call if this step has already run
  if (state.creditReference) {
    return state;
  }

  const request: CreditReferenceRequest = {
    firstName: state.input.firstName,
    lastName: state.input.lastName,
    postcode: state.input.postcode,
  };

  const httpResponse = await callEndpointAsync(request);

  // Store the API response in the state
  state.creditReference = {
    creditReferenceRating: httpResponse.data.rating,
  };

  return state;
};

Assembling the construct

As for any construct, we first define the interface. That is, what we need to pass in and what we need to expose. In this case, we don't need to pass anything in, but we do need to expose the function to call and the two queues where the result or the error will be sent.

export default class LoanProcessor extends cdk.Construct {
  readonly inputFunction: lambda.IFunction;
  readonly outputQueue: sqs.IQueue;
  readonly failureQueue: sqs.IQueue;

  constructor(scope: cdk.Construct, id: string) {
    super(scope, id);
  }
}

Next we define the queues, using long polling:

this.outputQueue = new sqs.Queue(this, 'OutputQueue', {
  receiveMessageWaitTime: cdk.Duration.seconds(20),
});

this.failureQueue = new sqs.Queue(this, 'FailureQueue', {
  receiveMessageWaitTime: cdk.Duration.seconds(20),
});

Then we define our functions in reverse order.

First the last in the chain:

const creditReferenceProxyFunction = new lambdaNodejs.NodejsFunction(
  scope,
  'CreditReferenceProxyFunction',
  {
    onSuccess: new lambdaDestinations.SqsDestination(this.outputQueue),
    onFailure: new lambdaDestinations.SqsDestination(this.failureQueue),
  }
);

Then the first in the chain:

const identityCheckProxyFunction = new lambdaNodejs.NodejsFunction(
  scope,
  'IdentityCheckProxyFunction',
  {
    onSuccess: new lambdaDestinations.LambdaDestination(creditReferenceProxyFunction, {
      responseOnly: true, // Don't wrap the output
    }),
    onFailure: new lambdaDestinations.SqsDestination(this.failureQueue),
  }
);

Note that we are specifying true for the responseOnly property. To quote the documentation:

When set to true and used as onSuccess destination, the destination function will be invoked with the payload returned by the source function.

This will ensure that just the LoanProcessorState structure will be passed between our functions. If we do not set this, then it will be wrapped as follows:

{
  "version": "1.0",
  "timestamp": "2019-11-24T23:08:25.651Z",
  "requestContext": {
    // Snip
  },
  "requestPayload": {
    "Success": true
  },
  "responseContext": {
    "statusCode": 200,
    "executedVersion": "$LATEST"
  },
  "responsePayload": ""
}
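If a downstream function ever did receive the wrapped form, it would need to dig the state out of `responsePayload`. The following is a minimal sketch of such a guard, which is not part of the original code. It assumes that a wrapped record can be recognised by having both `version` and `responsePayload` properties, and the names `isDestinationRecord` and `unwrapPayload` are hypothetical.

```typescript
// Treat anything with both 'version' and 'responsePayload' as a wrapped record.
// This is an assumption for the sketch, not an official detection rule.
function isDestinationRecord(value: unknown): value is { responsePayload: unknown } {
  return (
    typeof value === 'object' &&
    value !== null &&
    'version' in value &&
    'responsePayload' in value
  );
}

// Return the inner payload for a wrapped record, or the event itself otherwise.
function unwrapPayload<T>(event: unknown): T {
  if (isDestinationRecord(event)) {
    return event.responsePayload as T;
  }
  return event as T;
}

// Usage: both forms yield the same inner shape.
const fromWrapped = unwrapPayload<{ ok: boolean }>({
  version: '1.0',
  responsePayload: { ok: true },
});
const fromRaw = unwrapPayload<{ ok: boolean }>({ ok: true });
```

Setting responseOnly to true avoids the need for this kind of guard, which is why it is used here.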

Finally, we expose the input function so that our state machine can be called:

this.inputFunction = identityCheckProxyFunction;

Testing the happy path

To test our state machine, we deploy the construct as part of an Integration Test Stack and create a unit test to invoke it asynchronously.

If we invoke the Lambda function synchronously, then we will get a 200 - Success response. However, the 'success' Destination will not be invoked and our state machine will not run. I wondered whether we could use the AWS Lambda context object to check, from within a Lambda function, whether it had been invoked synchronously. However, as far as I could tell, this is not currently possible. So if a Lambda function is intended to be called only asynchronously, we need to take care to always invoke it that way, as we cannot assert how it is being called from within the function itself.

Looking at the AWS Console we can see that one of our queues contains a message:
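The difference between the two kinds of invocation comes down to the InvocationType on the Invoke request: 'Event' is asynchronous and 'RequestResponse' is synchronous. The following sketch builds such a request as a plain object. The interface is a local stand-in rather than an SDK type, the function name is made up, and depending on the SDK version the payload may need to be encoded as bytes rather than passed as a string.

```typescript
// Local stand-in for the shape of an Invoke request; not imported from the SDK.
interface InvokeParamsLike {
  FunctionName: string;
  InvocationType: 'Event' | 'RequestResponse';
  Payload: string;
}

function buildAsyncInvokeParams(functionName: string, state: object): InvokeParamsLike {
  return {
    FunctionName: functionName,
    InvocationType: 'Event', // asynchronous, so Destinations are used
    Payload: JSON.stringify(state),
  };
}

// Usage with a hypothetical function name.
const invokeParams = buildAsyncInvokeParams('ExampleInputFunction', {
  input: { firstName: 'Trevor', lastName: 'Potato', postcode: 'MK3 9SE' },
});
```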

When we look at the message body we see the following:

{
  "version": "1.0",
  "timestamp": "2022-03-28T18:21:24.430Z",
  "requestContext": {
    "requestId": "e87961f5-ca55-450b-87fe-8a29c9c41646",
    "functionArn": "arn:aws:lambda:eu-west-2:xxxxxxxxx:function:LoanProcessorTestStack-CreditReferenceProxyFunctio-XXXXXXXXX:$LATEST",
    "condition": "Success",
    "approximateInvokeCount": 1
  },
  "requestPayload": {
    "input": {
      "firstName": "Trevor",
      "lastName": "Potato",
      "postcode": "MK3 9SE"
    },
    "retryCount": 0,
    "identityCheck": {
      "bankAccount": true,
      "electoralRole": false
    }
  },
  "responseContext": {
    "statusCode": 200,
    "executedVersion": "$LATEST"
  },
  "responsePayload": {
    "input": {
      "firstName": "Trevor",
      "lastName": "Potato",
      "postcode": "MK3 9SE"
    },
    "identityCheck": {
      "bankAccount": true,
      "electoralRole": false
    },
    "creditReference": {
      "creditReferenceRating": "Good"
    }
  }
}

We can see the responsePayload property contains the output from both Lambda functions, so our state machine ran as expected and we successfully chained them together.

Note that because we are invoking the Lambda functions asynchronously, the execution time does not compound. If the first Lambda function called the second synchronously, then its execution time would include the time for the second to respond, so you would be charged twice for that time. Never a good thing.
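A little arithmetic makes the point concrete. This is a deliberately simplified sketch that only adds up execution time and ignores everything else that affects a bill. The function names and durations are made up.

```typescript
// Durations are in milliseconds and are purely illustrative.

// Synchronous chaining: the first function keeps running, and being billed,
// while it waits for the second function to respond.
function billedSynchronously(firstMs: number, secondMs: number): number {
  const firstFunctionBilled = firstMs + secondMs;
  const secondFunctionBilled = secondMs;
  return firstFunctionBilled + secondFunctionBilled;
}

// Asynchronous chaining via a destination: each function is billed only for itself.
function billedAsynchronously(firstMs: number, secondMs: number): number {
  return firstMs + secondMs;
}

const synchronousTotal = billedSynchronously(100, 300); // 700
const asynchronousTotal = billedAsynchronously(100, 300); // 400
```

The difference between the two totals is exactly the duration of the second function, which is the time that would be paid for twice.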

When things go wrong

To test what happens when things go wrong, we configure one of our mock API endpoints to always error. In this case, the credit reference endpoint. Now when we run our unit test, we can see the failure queue has a message in it:

When we look at the message body we see the following:

{
  "version": "1.0",
  "timestamp": "2022-03-28T18:42:06.919Z",
  "requestContext": {
    "requestId": "296ebcc2-c642-4b2e-9956-9d0f581c40cc",
    "functionArn": "arn:aws:lambda:eu-west-2:XXXXXXX:function:LoanProcessorTestStack-CreditReferenceProxyFunctio-XXXXXXX:$LATEST",
    "condition": "RetriesExhausted",
    "approximateInvokeCount": 1
  },
  "requestPayload": {
    "input": {
      "firstName": "Trevor",
      "lastName": "Potato",
      "postcode": "MK3 9SE"
    },
    "identityCheck": {
      "bankAccount": true,
      "electoralRole": true
    }
  },
  "responseContext": {
    "statusCode": 200,
    "executedVersion": "$LATEST",
    "functionError": "Unhandled"
  },
  "responsePayload": {
    "errorType": "Error",
    "errorMessage": "Request failed with status code 500",
    "trace": [
      "Error: Request failed with status code 500",
      "    at createError (/var/task/index.js:335:19)",
      "    at settle (/var/task/index.js:351:16)",
      "    at IncomingMessage.handleStreamEnd (/var/task/index.js:2091:15)",
      "    at IncomingMessage.emit (events.js:412:35)",
      "    at IncomingMessage.emit (domain.js:475:12)",
      "    at endReadableNT (internal/streams/readable.js:1334:12)",
      "    at processTicksAndRejections (internal/process/task_queues.js:82:21)"
    ]
  }
}

In this case, we can see that the responsePayload contains details of the error and the requestContext tells us which Lambda function failed.
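A consumer of the failure queue could condense such a record into a single line for alerting or logging. The following is a small sketch of that idea, using only fields that appear in the message above. The interface and function names are hypothetical, and the ARN in the usage example is invented.

```typescript
// Only the fields that this sketch reads from a failure record.
interface FailureDestinationRecord {
  requestContext: { functionArn: string; condition: string };
  responsePayload: { errorType?: string; errorMessage?: string };
}

function summariseFailure(record: FailureDestinationRecord): string {
  // ARN layout: arn:aws:lambda:<region>:<account>:function:<name>[:<qualifier>]
  const arnParts = record.requestContext.functionArn.split(':');
  const functionName = arnParts[6] ?? 'unknown';
  const message = record.responsePayload.errorMessage ?? 'no error message';
  return `${functionName} (${record.requestContext.condition}): ${message}`;
}

// Usage with an invented ARN.
const failureSummary = summariseFailure({
  requestContext: {
    functionArn: 'arn:aws:lambda:eu-west-2:000000000000:function:ExampleFunction:$LATEST',
    condition: 'RetriesExhausted',
  },
  responsePayload: {
    errorType: 'Error',
    errorMessage: 'Request failed with status code 500',
  },
});
// 'ExampleFunction (RetriesExhausted): Request failed with status code 500'
```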

Summary

In this post we saw how we can use Destinations to chain Lambda functions together to form a basic state machine. The functions were combined in a loosely-coupled way that avoided the execution time compounding. We also saw how we can add error handling to capture what went wrong when an asynchronous invocation fails.

Custom metric properties and CloudWatch Insights

Andy Blackledge — Mon, 04 Apr 2022 18:33:11 GMT

In the first post of the series we saw how we can use the aws-embedded-metrics npm package to easily output custom metrics from our Lambda functions. In the second post, we saw how we could use custom metrics to trigger alarms when our application went wrong. In this final post of the series, we will see how we can add properties to our custom metrics to help us diagnose what has gone wrong.

The code for this post can be found in the accompanying GitHub repo ready to be cloned and run.

TL;DR

Custom metrics can have arbitrary additional properties
These properties can be queried using CloudWatch Insights
You can't delete metrics, they are retained for 15 months

The application

Our application is a step function that consists of two Lambda functions that make calls to external APIs, followed by a Lambda function that persists the results. If a call to either of the external APIs fails, then the error is caught and handled by a Lambda function that records a custom metric indicating that an error occurred.

We already have an alarm set up to alert us when either of the API calls fails, but what if we could easily find out exactly which request failed? Well, we can do just that by adding properties to our metrics.

Correlation IDs

To quote the first article below:

A Correlation ID is a unique identifier that is added to the very first interaction (incoming request) to identify the context and is passed to all components that are involved in the transaction flow. Correlation ID becomes the glue that binds the transaction together and helps to draw an overall picture of events.

In our application, each request has the following structure, where correlationId contains a unique identifier:

{
  "correlationId": "gMQdTVVTA6jM6GOgheblB",
  "firstName": "Trevor",
  "lastName": "Potato",
  "postcode": "MK3 9SE"
}

So by logging the correlation ID with each of our metrics, we will be able to tie a sequence of the log entries together.
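To show what tying entries together means in practice, here is a minimal sketch that groups log entries by correlation ID. The entry shape and the function name are assumptions made for this example. In the application itself this grouping is done by querying the logs, not in code.

```typescript
// Hypothetical shape of a log entry that carries a correlation ID.
interface LogEntryLike {
  correlationId: string;
  message: string;
}

// Collect the messages for each correlation ID, preserving their order.
function groupByCorrelationId(entries: LogEntryLike[]): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const entry of entries) {
    const messages = grouped.get(entry.correlationId) ?? [];
    messages.push(entry.message);
    grouped.set(entry.correlationId, messages);
  }
  return grouped;
}

// Usage: entries from two interleaved requests.
const groupedEntries = groupByCorrelationId([
  { correlationId: 'id-1', message: 'identity check called' },
  { correlationId: 'id-2', message: 'identity check called' },
  { correlationId: 'id-1', message: 'credit reference failed' },
]);
// groupedEntries.get('id-1') holds both entries for the first request
```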

Adding properties

Adding properties to our metrics is a straightforward process. We just use the setProperty method and specify a name and value.

First up is our error handler Lambda function:

export const handler = metricScope(
  (metrics) =>
    async (event: any): Promise<void> => {
      metrics
        .setNamespace('EmbeddedMetricsExample')
        .setDimensions({ ProcessName: 'LoanProcessor' })
        .putMetric('ErrorCount', 1, Unit.Count)
        // Set our properties
        .setProperty('CorrelationId', event.correlationId)
        .setProperty('StateMachineName', event.stateMachineName)
        .setProperty('FailedStateName', event.failedStateName)
        .setProperty('Cause', event.cause);
    }
);

The next step is to update the corresponding state in the step function definition. Here we use the Context Object to access the name of the state machine and the correlation ID from the input. We also use the States.StringToJson utility function to convert the error cause into an object, which will make reading the resulting log entry much easier.

"HandleIdentityCheckFailure": {
  "Next": "IdentityCheckFail",
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-2:XXXXXXXX:function:LoanProcessorTestStack-LoanProcessorErrorHandlerFu-XXXXXXXX",
  "Parameters": {
    "failedStateName": "IdentityCheckGateway",
    "stateMachineName.$": "$$.StateMachine.Name",
    "correlationId.$": "$$.Execution.Input.correlationId",
    "cause.$": "States.StringToJson($.Cause)"
  }
}

Next we look at the proxy Lambda functions. In both cases, we want to record the URL being called and the status code returned. More importantly, we also want to record the API request ID and the correlation ID that we were passed in the original request.

metrics
  .setNamespace('EmbeddedMetricsExample')
  .setDimensions({ GatewayName: gatewayName })
  .putMetric('ResponseTime', responseTime, Unit.Milliseconds)
  // Set our properties
  .setProperty('StatusCode', response.status)
  .setProperty('GatewayUrl', url)
  .setProperty('CorrelationId', request.correlationId)
  .setProperty('RequestId', request.requestId);

Again, we need to update the corresponding state in the step function definition to ensure we have the appropriate values:

"IdentityCheckGateway": {
  "Next": "CreditReferenceGateway",
  // Snip
  "Type": "Task",
  "ResultPath": "$.identityCheck",
  "Resource": "arn:aws:lambda:eu-west-2:XXXXXXXX:function:LoanProcessorTestStack-LoanProcessorIdentityCheckP-XXXXXXX",
  "Parameters": {
    "correlationId.$": "$$.Execution.Input.correlationId",
    "firstName.$": "$$.Execution.Input.firstName",
    "lastName.$": "$$.Execution.Input.lastName",
    "postcode.$": "$$.Execution.Input.postcode"
  }
}

Generating some errors

The next thing to do is to run our application and wait for errors. This is straightforward, as we can configure the APIs being called to error a certain percentage of the time. With these set to 10%, we can run a unit test to call the step function every few seconds and wait.

Sure enough, before long we see that our alarm has triggered:

And in the logs we see the following entry for the error handler Lambda function:

Investigating the errors

However, AWS provides us with an alternative to manually searching through the logs. By using the aws-embedded-metrics package, we have used structured logging. This means that we can query the logs using CloudWatch Insights and the properties that we specified.

As mentioned in an earlier post, AWS logs many metrics automatically and for free. These can be queried via CloudWatch Insights, see the following post for many useful examples:

Serverless Amazon CloudWatch Logs Insights Examples

CloudWatch Insights lets you query up to 20 log groups at the same time. Depending on your needs, this may not be very convenient. If this is the case, then there are plenty of third-party log aggregation tools that might be a better fit for you.

Our query is only going to span one log group, so this limit will not affect us. What we need to do is select the log group for the error handling Lambda function and write our query.

The query above filters the log entries using the ErrorCount metric and the ProcessName property to identify entries where an error has occurred. It then selects several fields, the most important to us being the custom metric property CorrelationId.
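For readers who want something to start from, a query matching that description could be assembled as follows. This is a sketch and not necessarily the exact query used for this post. It assumes the metric, dimension, and property names from the error handler code earlier, and the helper name `buildErrorQuery` is made up.

```typescript
// Builds a CloudWatch Logs Insights query string for entries that recorded an error.
// The field names assume the properties set by the error handler shown earlier.
function buildErrorQuery(processName: string, maxResults: number): string {
  return [
    'fields @timestamp, CorrelationId, FailedStateName, StateMachineName',
    `| filter ErrorCount > 0 and ProcessName = '${processName}'`,
    '| sort @timestamp desc',
    `| limit ${maxResults}`,
  ].join('\n');
}

// Usage: the resulting text can be pasted into the Logs Insights query editor.
const errorQueryText = buildErrorQuery('LoanProcessor', 20);
```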

For a full guide to the CloudWatch Logs Insights query syntax, please see the AWS documentation.

When we run the query, we get results similar to the following:

Here we can see that we can easily retrieve the correlation ids for those instances that have failed. The next thing for us to do is to take one of those correlation ids and run a query to get more details about the failure.

Here we have selected two log groups, one for the identity check API and one for the credit reference API. We then filter the records by the correlation id and select the properties that we logged.

When we run the query, we get results similar to the following:

Here we can see that the credit reference API failed and we can also see the request id that failed. If this was a third-party API, then we could contact the third party and provide them with this id for further investigation. We can also see the URL that was called, which could also be useful in diagnosing the underlying problem.

Hopefully, you can see that by adding properties to your custom metrics, you can help yourself when faced with working out why something has gone wrong.

Do as I say, not as I do

Please note that the example code is very naive in its treatment of personally identifiable information (PII). The step function request contains a name and a postal code, which are passed around between states and logged at various points. This is a very bad idea.

A much better approach would be to store the request in something like a DynamoDB table, with a short Time to Live (TTL) to ensure it is deleted when no longer needed. A reference to this can then be passed around and logged, and the actual information only accessed when needed.
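The idea can be sketched with an in-memory store standing in for the table. Everything here is hypothetical, including the class name and the way expiry is checked on read. A real DynamoDB table with TTL typically removes expired items some time after they expire rather than instantly, so checking the expiry time on read is still worthwhile.

```typescript
interface StoredRequest {
  firstName: string;
  lastName: string;
  postcode: string;
}

// In-memory stand-in for a table with a time-to-live attribute.
class ExpiringRequestStore {
  private readonly items = new Map<string, { value: StoredRequest; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Stores the request and returns the reference to pass around and log.
  put(reference: string, value: StoredRequest, nowMs: number): string {
    this.items.set(reference, { value, expiresAt: nowMs + this.ttlMs });
    return reference;
  }

  // Returns the request only while it has not expired.
  get(reference: string, nowMs: number): StoredRequest | undefined {
    const item = this.items.get(reference);
    return item !== undefined && nowMs < item.expiresAt ? item.value : undefined;
  }
}

// Usage: only 'ref-1' would appear in state and logs, never the name or postcode.
const requestStore = new ExpiringRequestStore(1000);
const requestReference = requestStore.put(
  'ref-1',
  { firstName: 'Trevor', lastName: 'Potato', postcode: 'MK3 9SE' },
  0
);
```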

Can I delete metrics?

Throughout this series, you may have been wondering if you can delete metrics. Well, once created, a metric cannot be explicitly deleted. As explained by the Amazon CloudWatch FAQs:

Q: Can I delete any metrics?
CloudWatch does not support metric deletion.

Metrics are retained for 15 months, so I wondered about whether I would be charged for them for 15 months. However, the following StackOverflow question answered my query: AWS CloudWatch unused custom metrics retention and pricing

The pricing page states that "All custom metrics charges are prorated by the hour and metered only when you send metrics to CloudWatch". By this, they mean that they charge 1/720 of the monthly costs for the custom metrics for every hour in which you push at least one data point to the custom metric (counting 30 * 24 = 720 hours in a month).
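The proration described in that answer is simple to work through. In the sketch below the monthly price is a made-up figure for illustration, not a quoted AWS price, and the function name is hypothetical.

```typescript
// 30 * 24, as in the explanation above.
const HOURS_PER_BILLING_MONTH = 720;

// Cost for one custom metric, given the number of hours in which at least
// one data point was pushed to it.
function proratedMetricCost(activeHours: number, monthlyPricePerMetric: number): number {
  return (activeHours / HOURS_PER_BILLING_MONTH) * monthlyPricePerMetric;
}

// A metric that received data in 36 separate hours, at an assumed price of
// 0.30 per metric per month, costs 36/720 of that price.
const exampleMetricCost = proratedMetricCost(36, 0.3);
```

So a metric that stops receiving data stops costing anything, even though it remains visible until it ages out.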

Summary

In this post, we saw how we can add extra information to our custom metrics using properties. These properties can then be queried using CloudWatch Insights, allowing us to do such things as investigate system errors.

Creating custom metric alarms with CDK

Andy Blackledge — Wed, 16 Mar 2022 18:26:33 GMT

Overview

In the first post in this series, we looked at how we can use the aws-embedded-metrics npm package to easily generate custom metrics from our Lambda functions. In this post, we will look at how we use these custom metrics to add alerting to our applications. We will do this through Infrastructure as Code (IaC) using the AWS Cloud Development Kit (CDK), so that we ensure the appropriate alarms are in place whenever our application is deployed.

The code for this post is ready to be cloned, deployed, and run from the accompanying GitHub repo.

TL;DR

Use the Metric class to reference metrics in CDK code
Use Metric instances to create alarms
Use setDimensions to only log the dimensions you specify

The application

Our challenge is to use CDK to add a CloudWatch alarm to notify us when an error has occurred.

Outputting the metric

The first step is to get our code to output the custom metric that will trigger the alarm.

We do this via the error handling Lambda function, which consists of just the following code:

export const handler = metricScope(
  (metrics) =>
    async (event: any): Promise<void> => {
      metrics
        .setNamespace('EmbeddedMetricsExample')
        .putDimensions({ ProcessName: 'LoanProcessor' })
        .putMetric('ErrorCount', 1, Unit.Count)
    }
);

Here we use the setNamespace method to explicitly set a namespace for our application, otherwise our metrics will be grouped under the default value of aws-embedded-metrics. We then specify the dimension ProcessName for our metric with the value LoanProcessor. We do this because we want to be able to set an alarm for this specific process. Finally, we specify the actual metric, which in our case is an error count of 1. In our alarm, we will sum up these counts to detect when something has gone wrong.

For further information on using dimensions, you can read the AWS documentation.

Testing the metric

To test the metric, we configure the Identity Check API to always error and run a unit test to call our step function. The result is that we can now see our namespace within the AWS Console:

Clicking on the namespace we can now see our ErrorCount metric:

Setting up the alarm

In our application, we have encapsulated the step function and related Lambda functions in a CDK construct called LoanProcessor. We want to extend this construct so that it can publish to an SNS topic when an error occurs. The first step to do this is to extend the construct props with the SNS topic. Doing it this way decouples the LoanProcessor construct from how error notifications are handled.

export interface LoanProcessorProps {
  // Snip
  alarmTopic: sns.ITopic;
}

export default class LoanProcessor extends cdk.Construct {
  constructor(scope: cdk.Construct, id: string, props: LoanProcessorProps) {
    // Snip
  }
}

The next step is to reference our custom metric. This is done via the CDK Metric class. The documentation points out the following:

This class does not represent a resource, so hence is not a construct. Instead, Metric is an abstraction that makes it easy to specify metrics for use in both alarms and graphs.

So we need to be aware that the following code does not create the metric. The metric is only created when the first value is logged.

const loanProcessorErrorCount = new cw.Metric({
  namespace: 'EmbeddedMetricsExample',
  metricName: 'ErrorCount',
  dimensionsMap: {
    ProcessName: 'LoanProcessor',
  },
})

With a reference to the metric, we can now create an alarm. In our case, we want to trigger it when we get at least one error. Since we don't publish any metrics when we succeed, we set treatMissingData such that it will not trigger the alarm.

const loanProcessorErrorCountAlarm = loanProcessorErrorCount.createAlarm(
  this,
  'LoanProcessorErrorCountAlarm',
  {
    evaluationPeriods: 1,
    comparisonOperator: cw.ComparisonOperator.GREATER_THAN_THRESHOLD,
    threshold: 0,
    treatMissingData: cw.TreatMissingData.NOT_BREACHING,
  }
);

Now, I was expecting that we would need to specify the period length and the aggregation function as part of the alarm. However, it turns out that we need to specify these using the with method on the metric.

const loanProcessorErrorCount = new cw.Metric({
  // Snip
}).with({
  statistic: 'sum',
  period: cdk.Duration.minutes(5),
});

The AWS documentation Using Amazon CloudWatch alarms provides a comprehensive guide to setting up alarms, covering in detail how alarms are evaluated. If you are setting up alarms, I would recommend that you read and understand it.
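As a rough mental model of the settings chosen above, the following sketch evaluates a single period. It is a deliberate simplification of how CloudWatch really evaluates alarms, and the function and type names are made up for this example.

```typescript
type AlarmState = 'ALARM' | 'OK';

// periodValues: the ErrorCount values logged within one evaluation period.
function evaluatePeriod(periodValues: number[], threshold: number): AlarmState {
  if (periodValues.length === 0) {
    return 'OK'; // mirrors treatMissingData: NOT_BREACHING
  }
  // mirrors statistic: 'sum'
  const sum = periodValues.reduce((total, value) => total + value, 0);
  // mirrors GREATER_THAN_THRESHOLD
  return sum > threshold ? 'ALARM' : 'OK';
}

// Usage with a threshold of 0, as in the alarm above.
const quietPeriod = evaluatePeriod([], 0); // 'OK'
const failingPeriod = evaluatePeriod([1, 1], 0); // 'ALARM'
```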

We now have an alarm that should trigger when any 5-minute period contains at least one error. The final step is to add an action to our alarm, which in our case is to publish to the SNS topic specified via the construct props.

loanProcessorErrorCountAlarm.addAlarmAction(new cwActions.SnsAction(props.alarmTopic));

After deploying to AWS, we can now see our alarm in the AWS Console:

Testing the alarm

As before, with the Identity Check API configured to always error, we run the unit test to start the step function. Then we wait to see if our alarm triggers.

And we wait, and we wait.

Our alarm is not triggering and the reason turns out to be that we have not matched the dimensions of the metric.

Overriding the default dimensions

If we look again at the metric in the AWS Console, we notice the following:

We can see that, along with our dimension (ProcessName), the metric is being logged with the following three dimensions:

LogGroup
ServiceName
ServiceType

So, to get our alarm to trigger, we either need to add these dimensions to it or find a way to override the default dimensions. The latter is straightforward to do. We just need to use the setDimensions method when logging the metric rather than the putDimensions method.
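The underlying rule is that an alarm only sees a metric whose dimensions match exactly, not a metric that merely includes the alarm's dimensions. The following sketch models that comparison. The helper name and the dimension values are invented for illustration.

```typescript
type MetricDimensions = { [dimensionName: string]: string };

// Two dimension sets match only if they have the same names and the same values.
function sameDimensions(a: MetricDimensions, b: MetricDimensions): boolean {
  const aNames = Object.keys(a).sort();
  const bNames = Object.keys(b).sort();
  return (
    aNames.length === bNames.length &&
    aNames.every((name, index) => name === bNames[index] && a[name] === b[name])
  );
}

const alarmDimensions: MetricDimensions = { ProcessName: 'LoanProcessor' };

// What was logged when the default dimensions were included (values invented).
const loggedWithDefaults: MetricDimensions = {
  ProcessName: 'LoanProcessor',
  LogGroup: 'example-log-group',
  ServiceName: 'example-function',
  ServiceType: 'AWS::Lambda::Function',
};

// What is logged once the defaults are overridden.
const loggedWithOverride: MetricDimensions = { ProcessName: 'LoanProcessor' };

const matchesWithDefaults = sameDimensions(alarmDimensions, loggedWithDefaults); // false
const matchesWithOverride = sameDimensions(alarmDimensions, loggedWithOverride); // true
```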

Our metric logging code becomes the following:

export const handler = metricScope(
  (metrics) =>
    async (event: any): Promise<void> => {
      metrics
        .setNamespace('EmbeddedMetricsExample')
        .setDimensions({ ProcessName: 'LoanProcessor' }) // <== Overrides defaults
        .putMetric('ErrorCount', 1, Unit.Count)
    }
);

Now when we call the step function, we can see in the AWS Console that the alarm has triggered:

Looking at the logged event from the SNS error topic, we can see the following details are being sent:

{
  "AlarmName": "LoanProcessorTestStack-LoanProcessorLoanProcessorErrorCountAlarmXXXXXX-XXXXXXXX",
  "AlarmConfigurationUpdatedTimestamp": "2022-03-13T09:06:59.103+0000",
  "NewStateValue": "ALARM",
  "NewStateReason": "Threshold Crossed: 1 datapoint [1.0 (13/03/22 16:09:00)] was greater than the threshold (0.0).",
  "StateChangeTime": "2022-03-13T16:14:03.383+0000",
  "OldStateValue": "OK",
  "OKActions": [],
  "AlarmActions": [
    "arn:aws:sns:eu-west-2:99999999:LoanProcessorTestStack-LoanProcessorAlarmTopicXXXXX-XXXXXXX"
  ],
  "InsufficientDataActions": [],
  "Trigger": {
    "MetricName": "ErrorCount",
    "Namespace": "EmbeddedMetricsExample",
    "StatisticType": "Statistic",
    "Statistic": "SUM",
    "Unit": null,
    "Dimensions": [
      {
        "value": "LoanProcessor",
        "name": "ProcessName"
      }
    ],
    "Period": 300,
    "EvaluationPeriods": 1,
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 0,
    "TreatMissingData": "notBreaching",
    "EvaluateLowSampleCountPercentile": ""
  }
}

Summary

In this post, we saw how we can use the CDK to set up alarms based on our custom metrics. We saw how to specify the details of the alarm, such as the statistic to use and how to treat missing data. We also saw how we can override the default dimensions so that our alarm triggers as expected.

Adding Lambda Custom Metrics the Easy Way

Andy Blackledge — Mon, 28 Feb 2022 18:44:49 GMT

Overview

Custom metrics are a powerful way of instrumenting your applications. This allows you to observe the health and performance in near real time. This post explores custom metrics and how you can use the aws-embedded-metrics npm package to easily add them to your Lambda functions.

TL;DR

Custom metrics allow you to instrument your application
The aws-embedded-metrics npm package makes it straightforward to add them
A knowledge of dimensions, metrics, and properties is essential

What are CloudWatch metrics?

The AWS documentation (Using Amazon CloudWatch metrics) describes metrics as follows:

Metrics are data about the performance of your systems. By default, many services provide free metrics for resources. You can also enable detailed monitoring for some resources or publish your own application metrics. Amazon CloudWatch can load all the metrics in your account (both AWS resource metrics and application metrics that you provide) for search, graphing, and alarms.

For Lambda functions, AWS is already collecting the following metrics and more automatically for you:

Invocations
Duration
Errors

As mentioned in the documentation, these metrics can be searched, graphed, and alerted on.

For example, here is a graph showing average duration:

These metrics can be very useful to set alarms on. For example, it is important to know if your application is suffering from excessive errors or throttling. These can be alerted on with the built-in metrics.

Introducing custom metrics

These default metrics are very useful to understand and monitor the health of your application, but they can only go so far. What if you wanted to monitor the health or performance of an external API? What if you also wanted to be able to query the performance for a particular request or particular response? Well, with custom metrics, you can do all of these things.

The AWS user guide (Publishing custom metrics) says the following:

You can publish your own metrics to CloudWatch using the AWS CLI or an API. You can view statistical graphs of your published metrics with the AWS Management Console.

This all sounds great, but how can you create custom metrics for a Lambda function?

If you are developing in Node.js, then one answer is the Amazon CloudWatch Embedded Metric Format Client Library (awslabs/aws-embedded-metrics-node). This npm package takes advantage of the embedded metric format to generate CloudWatch Metrics via structured log events.

The AWS documentation on the embedded metric format describes it as follows:

Embedded metric format helps you to generate actionable custom metrics from ephemeral resources such as Lambda functions and containers. By using the embedded metric format to send logs from these ephemeral resources, you can now easily create custom metrics without having to instrument or maintain separate code, while gaining powerful analytical capabilities on your log data.

Essentially, by logging in a specific format, you get CloudWatch to extract the custom metrics for you automatically. The awslabs/aws-embedded-metrics-node package makes this straightforward to do.

An overview of our example

Our example consists of a mock API endpoint and a Lambda function that calls it. The mock API endpoint is an API Gateway backed by a Lambda function, and the calling Lambda function uses the axios npm package to make its requests.

The Lambda function behind the mock API has a set of environment variables that allow us to configure the response time and the error rate. This will enable us to get some interesting metrics when we test.

When it comes to metrics, the first question we need to answer is what information do we want. In this case, we want to be able to graph the average duration of the HTTP calls to the mock API endpoint. To do this we need just one metric, the response time.

Instrumenting our example

Below is the code starting point for our example. It simply uses the axios post method to get a response, and then logs the response time.

const callEndpointAsync = async (
  request: CreditReferenceRequest
): Promise<AxiosResponse<CreditReferenceResponse>> => {
  const startTime = Date.now();
  const response = await axios.post<
    CreditReferenceResponse,
    AxiosResponse<CreditReferenceResponse>,
    CreditReferenceRequest
  >(`${endpointUrl}request`, request);
  const responseTime = Date.now() - startTime;
  console.log(JSON.stringify({ responseTime }, null, 2));
  return response;
};

The aws-embedded-metrics library allows a number of different usages. In this case, we are going to wrap our function in a metricScope so that the metrics will get flushed automatically for us. The documentation has the following snippet.

const { metricScope } = require("aws-embedded-metrics");
const myFunc = metricScope((metrics) => async () => {
  // ...
});
exports.handler = myFunc;

So our example becomes:

const callEndpointAsync = metricScope(
  (metrics) =>
    async (
      request: CreditReferenceRequest
    ): Promise<AxiosResponse<CreditReferenceResponse>> => {
      // ...as before...
    }
);

We now have a metrics instance to use to publish our custom metric. We do this as shown below, calling the putDimensions, putMetric, and setProperty methods:

const response = await axios.post<
  CreditReferenceResponse,
  AxiosResponse<CreditReferenceResponse>,
  CreditReferenceRequest
>(`${endpointUrl}request`, request);
const responseTime = Date.now() - startTime;
// Record our metric
metrics.putDimensions({ Service: "CreditReferenceGateway" });
metrics.putMetric("ResponseTime", responseTime, Unit.Milliseconds);

You can see here that, at their most basic, custom metrics are made up of two types of values: dimensions and metrics. Knowing what these types are is key to getting the results you want and avoiding unintended costs.

Dimensions, metrics, and properties

A metric is a measurable quantity; that is, it has to be expressed as a numerical value. Examples are a duration, a count, a percentage, or a rate.

Dimensions are what the metric is recorded for. In our example, we are recording the response time for a credit reference gateway. So we have used a dimension we have named Service with a value CreditReferenceGateway. We can specify multiple dimensions, but this has a cost.

Each combination of dimension values creates a separate metric, and each metric has a cost. For example, imagine if we did the following:

metrics.putDimensions({
  Service: "CreditReferenceGateway",
  RequestId: request.requestId,
});
metrics.putMetric("ResponseTime", responseTime, Unit.Milliseconds);

We would then be creating a custom metric for each request. This would be both meaningless and potentially expensive, because AWS charges by the number of custom metrics used. If you publish a custom metric, then you are charged an hour's usage. The first 10 custom metrics are free, but the next 10,000 are not. For an excellent explanation of the cost of metrics, please read CloudWatch Metrics Pricing Explained in Plain English.
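To illustrate how quickly a high-cardinality dimension multiplies the number of metrics, here is a small sketch that counts the distinct dimension combinations in a batch of datapoints. The countDistinctMetrics helper and the request ids are hypothetical and exist only for this illustration.

```typescript
interface Datapoint {
  dimensions: Record<string, string>;
}

// Hypothetical helper: each distinct combination of dimension names and
// values counts as one separate custom metric.
function countDistinctMetrics(datapoints: Datapoint[]): number {
  const seen: Record<string, boolean> = {};
  for (const { dimensions } of datapoints) {
    const key = Object.keys(dimensions)
      .sort()
      .map((name) => `${name}=${dimensions[name]}`)
      .join(',');
    seen[key] = true;
  }
  return Object.keys(seen).length;
}

const exampleRequestIds = ['r1', 'r2', 'r3', 'r4', 'r5'];

// One dimension with a fixed value: every datapoint lands in the same metric.
const byService = exampleRequestIds.map(() => ({
  dimensions: { Service: 'CreditReferenceGateway' },
}));

// Adding RequestId as a dimension: every request creates its own metric.
const byRequest = exampleRequestIds.map((requestId) => ({
  dimensions: { Service: 'CreditReferenceGateway', RequestId: requestId },
}));

console.log(countDistinctMetrics(byService)); // 1
console.log(countDistinctMetrics(byRequest)); // 5
```

Five requests are harmless, but the same pattern applied to production traffic would create one custom metric per request.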

A property is a piece of information that is associated with the metric instance. This value is not submitted to CloudWatch Metrics but is searchable by CloudWatch Logs Insights. This is useful for contextual and potentially high-cardinality data that is not appropriate for CloudWatch Metrics dimensions. One example is the requestId seen above.
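Putting the three concepts together, the sketch below hand-builds a log line in the shape that I understand the embedded metric format to take: an _aws metadata object that names the namespace, dimensions, and metrics, with the actual values as top-level fields. The buildEmfLogLine function, the namespace, and the request id are hypothetical, and in practice the aws-embedded-metrics package produces this output for you.

```typescript
interface EmfMetric {
  name: string;
  unit: string;
  value: number;
}

// Hypothetical helper that assembles a single embedded-metric-format log line.
function buildEmfLogLine(
  namespace: string,
  dimensions: Record<string, string>,
  metric: EmfMetric,
  properties: Record<string, string>,
  timestamp: number
): string {
  return JSON.stringify({
    _aws: {
      Timestamp: timestamp,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          // Only the dimension names are listed in the metadata.
          Dimensions: [Object.keys(dimensions)],
          Metrics: [{ Name: metric.name, Unit: metric.unit }],
        },
      ],
    },
    // Dimension values, properties, and the metric value are plain fields.
    ...dimensions,
    ...properties,
    [metric.name]: metric.value,
  });
}

const emfLine = buildEmfLogLine(
  'ExampleNamespace',
  { Service: 'CreditReferenceGateway' },
  { name: 'ResponseTime', unit: 'Milliseconds', value: 123 },
  { RequestId: 'example-request-id' },
  Date.now()
);
console.log(emfLine);
```

Note that RequestId appears in the log line, so it can be queried in the logs, but it is not listed under Dimensions, so it does not create additional metrics.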

Generating and viewing metrics

To generate some metrics for us to view, a simple unit test was created to call the instrumented Lambda function. This test runs for several minutes to provide metrics over a viewable range.

Here is an example of a minute run, with the mock API configured to respond more slowly over time.

Summary

In this post, we looked at how we can use the aws-embedded-metrics npm package to add a custom metric to a Lambda function. We also looked at the concepts of dimensions, metrics, and properties and why they are important to getting the results you want and avoiding unintended costs.

There is a lot more to custom metrics and I intend to cover some of that in the next post. However, given the basics above, you should be able to start adding custom metrics to your Lambda functions and improving your observability.

Inter-Stack Parameter Passing with CDK

Andy Blackledge — Sun, 30 Jan 2022 10:59:27 GMT

What do you do when you have one stack that depends on deployment details from another? Here we look at how the AWS Parameter Store and CDK can provide a solution to this problem.

TL;DR

One stack creates a parameter in the Parameter Store
Another stack accesses the parameter, either at deployment or runtime

The problem

In preparation for an upcoming blog post, I wanted to create a set of mock API endpoints in one stack and then have a set of Lambda functions to call them from a separate stack.

To do this, each Lambda function would need to know the base URL of the corresponding API endpoint, which could change on each deployment of the API stack.

I also wanted to avoid having to deploy the Lambda function stack each time I deployed the mock API stack.

The plan

AWS describes the Parameter Store as follows:

Parameter Store, a capability of AWS Systems Manager, provides secure, hierarchical storage for configuration data management and secrets management.

Values within the Parameter Store are accessed via keys, which are just string values.

Given this, it struck me that if both stacks have shared knowledge of a statically-defined key string, then they should be able to use this key to share dynamically-generated values, with one stack storing the value and the other retrieving it.

The Mock API stack

The first thing for me to do was create a stack that contained a mock API. In this case, it was to be a mock Credit Reference service. As the parameter name is to be shared across stacks, we set up the stack so that it can be passed in via a properties object.

export interface MockApiStackProps {
  creditReferenceUrlParameterName: string;
}
export default class MockApiStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props: MockApiStackProps) {
    super(scope, id);
  }
}

The next step was to define the HttpApi and a parameter that contains the deployment-time value of the corresponding url property.

const httpApi = new HttpApi(this, 'CreditReferenceHttpApi', {
  description: 'Credit Reference API',
});
new ssm.StringParameter(this, 'CreditReferenceApiUrlParameter', {
  parameterName: props.creditReferenceUrlParameterName,
  stringValue: httpApi.url ?? '',
  description: 'The base URL for the credit reference API',
  type: ssm.ParameterType.STRING,
  tier: ssm.ParameterTier.STANDARD,
});

I have omitted the Lambda integration here; the full code for the stack can be found on my GitHub repo.

To deploy the stack, I created the following cdk app:

const app = new cdk.App();
const creditReferenceUrlParameterName = '/mock-apis/credit-reference-api/base-url';
new MockApiStack(app, 'MockApiStack', { creditReferenceUrlParameterName });

Once the stack was deployed to AWS, I went into the AWS Console and confirmed that the parameter had been created as expected.

The Lambda stack

The next step was to create the stack that would contain the Lambda functions that call the mock APIs. As with the mock API stack, this took a properties object to allow the name of the parameter to be passed in.

export interface LoanProcessorTestStackProps {
  creditReferenceUrlParameterName: string;
}
export default class LoanProcessorTestStack extends IntegrationTestStack {
  static readonly StackId = 'LoanProcessorTestStack';
  constructor(scope: cdk.Construct, id: string, props: LoanProcessorTestStackProps) {
    super(scope, id, {
      testStackId: LoanProcessorTestStack.StackId,
    });
  }
}

My first thought was to pass the mock API URL in as an environment variable to the calling function. The question was how to obtain the value to pass in. The answer was to use the fromStringParameterName method on the StringParameter class.

const creditReferenceApiUrlParameter = ssm.StringParameter.fromStringParameterName(
  this,
  'CreditReferenceApiUrlParameter',
  props.creditReferenceUrlParameterName
);
this.creditReferenceProxyFunction = new lambdaNodejs.NodejsFunction(
  this,
  'CreditReferenceProxyFunction',
  {
    environment: {
      CREDIT_REFERENCE_URL: creditReferenceApiUrlParameter.stringValue,
    },
  }
);

The code for the Lambda function was straightforward, using the Axios npm package to make the call.

export const handler = async (event: any): Promise<any> => {
  const creditReferenceUrl = process.env['CREDIT_REFERENCE_URL'];
  console.log(JSON.stringify({ creditReferenceUrl }, null, 2));
  if (creditReferenceUrl === undefined)
    throw new Error('creditReferenceUrl === undefined');
  const request: CreditReferenceRequest = {
    firstName: 'Trevor',
    lastName: 'Potato',
    postcode: 'MK3 9SE',
  };
  try {
    const res = await axios.post(`${creditReferenceUrl}request`, request);
    console.log(res.data);
  } catch (err) {
    console.error(err);
  }
};

To deploy the stack, I added it to the CDK app and passed in the common parameter name.

const app = new cdk.App();
const creditReferenceUrlParameterName = '/mock-apis/credit-reference-api/base-url';
new MockApiStack(app, 'MockApiStack', { creditReferenceUrlParameterName });
new LoanProcessorTestStack(app, 'LoanProcessorTestStack', { creditReferenceUrlParameterName });

Once deployed, I checked the Lambda function environment variables and confirmed that the value was being passed in.

A quick test through the AWS Console confirmed that the function was working as expected:

2022-01-29T09:08:24.808Z    10179872-38a0-4687-a8b4-382b814696cf    INFO    {
  "creditReferenceUrl": "https://o8z7mzryt0.execute-api.eu-west-2.amazonaws.com/"
}
2022-01-29T09:08:24.964Z    10179872-38a0-4687-a8b4-382b814696cf    INFO    { reference: 'CR1234', rating: 'Ugly' }

Decoupling the stacks

One drawback with the approach taken so far was that if a deployment of the mock API caused the URLs to change, then the Lambda stack would also need to be deployed to pick up the new values. Could we decouple this dependency by resolving the URLs at runtime rather than deployment time?

The solution was to pass in the name of the parameter as an environment variable, rather than the value. Then the Lambda function could use the AWS SDK to look up the value directly from the Parameter Store.

First I updated the stack so that the parameter name was passed to the Lambda function as an environment variable. I then added a call to grantRead on the parameter, so that the Lambda function would have access to the parameter. Without this, we would get an 'access denied' error at runtime.

this.creditReferenceProxyFunction = new lambdaNodejs.NodejsFunction(
  this,
  'CreditReferenceProxyFunction',
  {
    environment: {
      CREDIT_REFERENCE_URL_PARAMETER_NAME: props.creditReferenceUrlParameterName,
    },
  }
);
creditReferenceApiUrlParameter.grantRead(this.creditReferenceProxyFunction);

With the stack updated, I turned my attention to the Lambda function and added code to use the new environment variable to get the parameter from the Parameter Store.

const ssm = new AWS.SSM();
const creditReferenceUrlParameterName =
  process.env[CREDIT_REFERENCE_URL_PARAMETER_NAME_ENV_VAR];
if (creditReferenceUrlParameterName === undefined)
  throw new Error('creditReferenceUrlParameterName === undefined');
const creditReferenceUrlParameter = await ssm
  .getParameter({
    Name: creditReferenceUrlParameterName,
    WithDecryption: true,
  })
  .promise();
const creditReferenceUrl = creditReferenceUrlParameter.Parameter?.Value;

To test the changes, I first redeployed the Lambda stack. I then destroyed the mock API stack and redeployed it, before testing the Lambda function again. The result was that the URL was successfully retrieved at runtime.

2022-01-29T16:28:15.864Z    febe6e31-5cfd-43bb-8fcf-4febd062c247    INFO    {
  "creditReferenceUrl": "https://ec7smjoixe.execute-api.eu-west-2.amazonaws.com/"
}
2022-01-29T16:28:16.344Z    febe6e31-5cfd-43bb-8fcf-4febd062c247    INFO    { reference: 'CR1234', rating: 'Ugly' }

Do as I say, not as I do

The approach described above was perfectly adequate for example code for a blog post. However, I would not suggest such an approach for production code.

The first thing is that the following code should be outside the Lambda handler method. It is not necessary to initialise this on each call and there is an overhead to do so.

const ssm = new AWS.SSM();

The second thing is that there is both an overhead and a limit to accessing the Parameter Store. As this excellent article on parameters and stacks points out:

A way could be to retrieve parameters directly from code using the AWS SDK, but I'll encourage you to take a look at lambda power tools or ssm cache that expose also the capability to cache parameters. Pay attention to the fact that Parameter Store API has a default throughput limit of 40 transactions per second. This limit can be increased up to 1000 transactions per second but you will incur additional costs.

One solution to this would be to cache the value outside the handler function and refresh it if the call to the endpoint receives a 404 response. This exercise is left for the reader 😉
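For anyone who wants a starting point, one possible shape for that exercise is sketched below. It is not code from the example repo: createCachedValue, the injected loader, and the call callback are all hypothetical names, and the loader stands in for the getParameter call shown earlier so that the sketch is self-contained.

```typescript
// Hypothetical loader type: in the Lambda function this would wrap the
// Parameter Store lookup; here it is injected to keep the sketch standalone.
type Loader = () => Promise<string>;

function createCachedValue(load: Loader) {
  let cached: string | undefined;

  // Returns the cached value, loading it on first use or when forced.
  const get = async (forceRefresh = false): Promise<string> => {
    let value = cached;
    if (value === undefined || forceRefresh) {
      value = await load();
      cached = value;
    }
    return value;
  };

  // Runs `call` with the cached value. If it resolves to a 404 status,
  // the value is reloaded once and the call is retried.
  const callWithRefresh = async (
    call: (value: string) => Promise<number>
  ): Promise<number> => {
    const firstStatus = await call(await get());
    if (firstStatus !== 404) return firstStatus;
    return call(await get(true));
  };

  return { get, callWithRefresh };
}

// Declared outside the handler so that warm invocations share the cache.
const creditReferenceUrlCache = createCachedValue(async () => {
  // Placeholder: the real loader would read the value from the Parameter Store.
  return 'https://example.invalid/';
});
```

The handler would then call creditReferenceUrlCache.callWithRefresh, passing a callback that posts to the URL and returns the HTTP status code. Only a cold start or a 404 would result in a call to the Parameter Store.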

Another option here to consider is the AWS Secrets Manager. For a comparison of the two services, see this article on AWS Parameter Store vs. AWS Secrets Manager. If you are using the excellent middy middleware, then you can use the SSM middleware package.

Summary

We can use the AWS Parameter Store to share information between stacks. This information can be provided at deployment time or, with some consideration, at runtime.

Lambda Event Filtering With CDK

Andy Blackledge — Sat, 08 Jan 2022 18:23:49 GMT

AWS recently introduced functionality to apply EventBridge-style filtering on Lambda event sources. In this post, we go through a worked example implemented using AWS CDK that takes advantage of this new capability.

The full code for this post can be found on my GitHub repo.

TL;DR

There is no first-class support for event source filtering in CDK currently
You can add event source filtering by manipulating the CloudFormation directly
To filter SNS events, you need to use a subscribed SQS queue

Setting the scene

We are working for Potato Finance, a company that provides loans. We have been tasked with hooking in new functionality when new loan applications are received either for high-value loans or for loans for prestigious locations. Thankfully, Potato Finance has embraced event-driven architecture and has already created an SNS topic that raises events when loan applications are created, updated, or deleted.

For example, when a new loan application is created, an SNS event record like the following is published:

{
  "EventSource": "aws:sns",
  "Sns": {
    "Type": "Notification",
    "Message": "{\"eventType\":\"Created\",\"loanAmount\":266000,\"postcode\":\"JE1 9TE\",\"applicationId\":\"21546845\"}",
    "MessageAttributes": {
      "eventType": {
        "Type": "String",
        "Value": "Created"
      }
    }
  }
}

We have been asked to invoke specific high-value processing when the loanAmount is greater than 500,000, and to invoke specific postcode processing when the postcode starts with 'MK' or 'PR'. Prior to event source filtering for Lambda functions, we would have had to hook up our Lambda functions to the SNS topic and implement the filtering in code. This would mean that those functions would be invoked and billed when they had nothing meaningful to do. With event source filtering, we can externalise that filtering, simplify our code, and avoid being billed unnecessarily.

Our approach

Ideally, we would have liked to filter the events directly from the SNS topic. However, the AWS announcement in November 2021 said:

Today, AWS announces the ability to filter messages before the invocation of a Lambda function. Filtering is supported for the following event sources: Amazon Kinesis Data Streams, Amazon DynamoDB Streams, and Amazon SQS. This helps reduce requests made to your Lambda functions, may simplify code, and can reduce overall cost.

Given this, our solution will have to use SQS queues hooked up to the SNS topic. We will then attach Lambda functions to the SQS queues and apply filters to the event sources. Our aim is to create a CDK construct that encapsulates this functionality:

Creating our construct

As with any CDK construct, we start with the input properties. In our case, this is the application event topic that we will subscribe to, and the two Lambda functions that we will invoke when the filtering matches.

export interface ApplicationCreatedFilterProps {
  applicationEventTopic: sns.ITopic;
  highValueFunction: lambda.IFunction;
  postcodeFunction: lambda.IFunction;
}

For the construct, we start by hooking up two SQS queues to the application event SNS topic. The topic has a message attribute for the eventType, so we can use this to apply a preliminary filter. As we are only interested in new loan applications, we add a filter policy so that we only get 'Created' events sent to our SQS queues.

export default class ApplicationCreatedFilter extends cdk.Construct {
  // ...
  constructor(scope: cdk.Construct, id: string, props: ApplicationCreatedFilterProps) {
    super(scope, id);
    const createdEventFilterPolicy = {
      eventType: sns.SubscriptionFilter.stringFilter({ allowlist: ['Created'] }),
    };
    const highValueQueue = new sqs.Queue(this, 'HighValueQueue');
    props.applicationEventTopic.addSubscription(
      new snsSubs.SqsSubscription(highValueQueue, {
        rawMessageDelivery: true, // Do not wrap message
        filterPolicy: createdEventFilterPolicy,
      })
    );
    const postcodeQueue = new sqs.Queue(this, 'PostcodeQueue');
    props.applicationEventTopic.addSubscription(
      new snsSubs.SqsSubscription(postcodeQueue, {
        rawMessageDelivery: true, // Do not wrap message
        filterPolicy: createdEventFilterPolicy,
      })
    );
  }
}

Note that we are setting rawMessageDelivery to true for the subscriptions. Doing so ensures that the message to the queue is the same as it was sent to the topic. Otherwise, the message will be wrapped in an SNS envelope and this would make filtering much more difficult. With this set to true we should receive SQS messages like the following:

{
  "eventType": "Created",
  "loanAmount": 266000,
  "postcode": "JE1 9TE",
  "applicationId": "21546845"
}
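To see why this matters for filtering, compare the two shapes that the SQS message body can take. The sketch below uses a trimmed-down, hand-written envelope (a real one carries more fields) and a hypothetical extractEvent helper.

```typescript
interface ApplicationEvent {
  eventType: string;
  loanAmount: number;
  postcode: string;
  applicationId: string;
}

const createdEvent: ApplicationEvent = {
  eventType: 'Created',
  loanAmount: 266000,
  postcode: 'JE1 9TE',
  applicationId: '21546845',
};

// rawMessageDelivery: true -> the body is the published message itself.
const rawBody = JSON.stringify(createdEvent);

// rawMessageDelivery: false -> the body is an SNS envelope, and the published
// message is a JSON string inside its Message field (envelope trimmed here).
const wrappedBody = JSON.stringify({
  Type: 'Notification',
  Message: JSON.stringify(createdEvent),
});

// Hypothetical helper that copes with either shape.
function extractEvent(body: string): ApplicationEvent {
  const parsed = JSON.parse(body);
  return typeof parsed.Message === 'string' ? JSON.parse(parsed.Message) : parsed;
}

console.log(JSON.parse(rawBody).loanAmount); // 266000
console.log(JSON.parse(wrappedBody).loanAmount); // undefined
console.log(extractEvent(wrappedBody).loanAmount); // 266000
```

With the envelope in place, loanAmount is buried inside a string, so a filter on the top-level fields of the body has nothing to match against.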

Adding filtering

The next step is for us to allow our functions to consume messages from the SQS queues and to add event sources, so that they will poll for messages from the queue. For our high value queue, this is done as follows:

highValueQueue.grantConsumeMessages(props.highValueFunction);
props.highValueFunction.addEventSource(
  new lambdaEventSources.SqsEventSource(highValueQueue)
);

Now, the logical place for us to specify the filter would be via the SqsEventSourceProps passed to the SqsEventSource constructor. However, at the time of writing, there is no such option. There is a GitHub issue raised for this, and that issue links to an excellent blog post that describes an interim solution for CDK.

The solution described involves creating the EventSourceMapping explicitly, rather than using addEventSource, and then manipulating the resulting CloudFormation to add a Filters property. With this in mind, the original code for the high value queue becomes the following:

highValueQueue.grantConsumeMessages(props.highValueFunction);
const highValueEventSourceMapping = new lambda.EventSourceMapping(
  this,
  `${highValueQueue.node.id + props.highValueFunction.node.id}Mapping`,
  {
    target: props.highValueFunction,
    eventSourceArn: highValueQueue.queueArn,
  }
);

The resulting CloudFormation generated from cdk synth is the following:

SUTHighValueQueueTestFunctionHighValueConsumerFunctionMapping9A5748BD:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    FunctionName:
      Ref: TestFunctionHighValueConsumerFunction40C73279
    EventSourceArn:
      Fn::GetAtt:
        - SUTHighValueQueue3EDFCDAC
        - Arn

Looking at the example given in the announcement, we can see that we need to add a FilterCriteria property similar to the example shown below:

Properties:
  # ...snip...
  FilterCriteria:
    Filters:
      - Pattern: '{"data": {"tire_pressure": [{"numeric": ["<", 32]}]}}'

To do this, we need to access the underlying CloudFormation and use the addPropertyOverride to add our filter. We do this as follows:

const highValueCfnEventSourceMapping = highValueEventSourceMapping.node
  .defaultChild as lambda.CfnEventSourceMapping;
highValueCfnEventSourceMapping.addPropertyOverride('FilterCriteria', {
  Filters: [
    {
      Pattern: JSON.stringify({
        body: { loanAmount: [{ numeric: ['>', 500000] }] },
      }),
    },
  ],
});

Note that when working with SQS, you filter the payload under the body attribute.

Running cdk synth, we can see that the following CloudFormation was generated:

SUTHighValueQueueTestFunctionHighValueConsumerFunctionMapping9A5748BD:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    FunctionName:
      Ref: TestFunctionHighValueConsumerFunction40C73279
    EventSourceArn:
      Fn::GetAtt:
        - SUTHighValueQueue3EDFCDAC
        - Arn
    FilterCriteria:
      Filters:
        - Pattern: '{"body":{"loanAmount":[{"numeric":[">",500000]}]}}'

This looks correct, so we repeat the approach for the postcode queue as follows:

const postcodeCfnEventSourceMapping = postcodeEventSourceMapping.node
  .defaultChild as lambda.CfnEventSourceMapping;
postcodeCfnEventSourceMapping.addPropertyOverride('FilterCriteria', {
  Filters: [
    {
      Pattern: JSON.stringify({
        body: { postcode: [{ prefix: 'MK' }, { prefix: 'PR' }] },
      }),
    },
  ],
});

Here we take advantage of the fact that now we have the full power of EventBridge pattern matching. In this case, we match on multiple prefixes in order to identify two prestigious locations in the UK that Potato Finance are particularly interested in.
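As a rough mental model of what these two patterns accept, the sketch below re-implements just the numeric '>' and prefix operators locally. It is a deliberately simplified approximation for illustration, not the matching logic that AWS uses, and matchesField and the Condition type are hypothetical names.

```typescript
type Condition = { numeric: [string, number] } | { prefix: string };

// Returns true if the value satisfies at least one of the conditions,
// mirroring how the conditions listed for a field act as alternatives.
function matchesField(value: unknown, conditions: Condition[]): boolean {
  return conditions.some((condition) => {
    if ('numeric' in condition) {
      const [operator, bound] = condition.numeric;
      // Only the '>' operator is modelled in this sketch.
      return typeof value === 'number' && operator === '>' && value > bound;
    }
    return (
      typeof value === 'string' &&
      value.slice(0, condition.prefix.length) === condition.prefix
    );
  });
}

const highValueConditions: Condition[] = [{ numeric: ['>', 500000] }];
const postcodeConditions: Condition[] = [{ prefix: 'MK' }, { prefix: 'PR' }];

console.log(matchesField(266000, highValueConditions)); // false
console.log(matchesField(750000, highValueConditions)); // true
console.log(matchesField('MK3 9SE', postcodeConditions)); // true
console.log(matchesField('JE1 9TE', postcodeConditions)); // false
```

A local model like this can be handy for reasoning about which test messages should reach each function, but the deployed filter remains the source of truth.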

The full code for the construct can be found on the GitHub repo.

Testing the construct

To test the construct, we use the Serverless Testing Toolkit to create a test stack. The test stack allows us to deploy an isolated instance of the construct to AWS. With this deployed, we can run a set of unit tests against it and verify that events are filtered and routed as expected.

Testing the filtering can be a bit of a challenge. This is in part because a message that does not meet the filter criteria is simply thrown away. It is worth knowing about the following from the AWS documentation:

When you write event patterns to match events, you can use the TestEventPattern API or the test-event-pattern CLI command to test that your pattern matches the correct events. For more information, see TestEventPattern.

Summary

In this post, we have seen how we can use the new event source filtering functionality to route messages based on their content. In our case, we were able to filter events from an existing SNS topic without having to change the code that generated those events. Previously, we would either have had to amend the publishing code to add message attributes, or we would have had to code the filtering into our consuming Lambda functions and have unnecessary invocations.

Deployment Issues Addendum

From time-to-time when developing the code for this blog post, I encountered a CREATE_FAILED error when updating the event source mapping. The message was of the following format:

Resource handler returned message:"An event source mapping with SQS arn (" arn:aws:sqs:eu-west-2:{account}:{queueName} ")   and function (" {functionName} ") already exists.Please update or delete the existing mapping with UUID {mappingUUID}...

My solution was not ideal, but involved using the AWS console to manually remove the SQS trigger from the Lambda function before redeploying. A similar issue had been raised a while back on GitHub. Perhaps this will go away when there is first-class support in the CDK for event source filtering.

Mindful Coding

Andy Blackledge — Mon, 20 Dec 2021 21:24:39 GMT

Mindful eating is an approach that is all about paying attention to the way we eat. Research has shown eating mindfully helps us enjoy our food much more. This caused me to wonder whether we could take a similar approach to coding, mindful coding if you will.

Practising mindful eating simply means being present when we eat - paying attention to it and being aware of it.

It struck me that when coding, at times it is possible to operate automatically and almost without thinking. That is, with a lack of awareness. Greedily producing code if you will. Perhaps by being a little more present, there are some aspects that we might benefit from being more aware of.

Be aware of inefficiency and repetition

Coding can still involve plenty of inefficiency and repetition. Whether it is clicking the same menu options or writing the same boilerplate code over and over again. Being aware of these and taking action can help save time in the long run and make the process more enjoyable.

I recently had to learn a new IDE, one with no common keyboard shortcuts with previous IDEs I had used. I started by clicking the icons to run, debug, and so on, but I was aware that this was not the most efficient way to do it. I made myself seek out the keyboard shortcuts, and then forced myself to use them. Initially, this was slower than using the mouse. However, it quickly became second nature and made using the IDE much more pleasant.

Most IDEs will have some kind of code snippet support. That is, a way of defining templates for common coding patterns into which you insert placeholders. Be aware of your own patterns and take advantage of the snippet functionality to cater for these. I typically like to create a custom TODO template containing the date and my initials. I have to admit that I am guilty of not creating snippets as early as I should. I need to be aware that when I think "I will create a snippet next time", then I should create the snippet right there and then. I know that once I have the snippets, they are a delight to use.

Be aware of learning opportunities

As you navigate any code base, there will be opportunities to learn. It is often a lower-level detail, such as a newly-introduced keyword you haven't seen before, but it could also be a higher-level pattern too. I have found that it is worthwhile to be aware of these moments and take the time to explore what is being done and why. It is easy to get stuck in your own coding patterns and these moments can provide you with new and better ways of doing things. I am not advocating that you take a huge diversion when you have a deadline to meet, but it can be worth spending a little time, even if it is just noting down something to follow-up later.

Be aware of your own habits

This is one that can be quite hard, as it requires a degree of self-awareness and sometimes requires outside input. For example, I have come to realise that one of my habits is to start to create abstractions earlier than is necessary. I only really became aware of this when listening to a podcast in which one of the participants admitted to the same habit. Now that I am aware of it, I find it much easier to pull myself up when I start to build abstractions too early in the process.

Another habit that I required outside input to become aware of was 'expanding pull request syndrome'. When doing one piece of work, I would notice other work that would be a benefit and I would include that work in the pull request. This made the pull request harder to review and riskier to merge. Thankfully, the team I was working in gave great feedback and brought my attention to it. Now, my awareness is heightened and, when I start to be tempted by extraneous improvements, I am able to stop myself and raise a separate piece of work if necessary.

Be aware of impact

This is about preventing your code from, if not quite dying by a thousand cuts, getting muddied by a thousand commits. For example, as code is added to a method or a class, at some point that method or class should probably be refactored into smaller, more focussed components. Since the start of computing, divide and conquer has helped manage complexity.

Given this, I try to be aware of the impact of my changes:

On length - The temptation to just add another few lines of code to a method, or just another method to a class must have occurred to us all. It might feel less risky, it might even be less risky, but you might also be adding to a problem. Being aware of this can help you make the right call and help position the code for future maintainability and expandability.
On complexity - Another common temptation is to add a new flag to an existing method. This seems innocuous enough with the first flag, but then another comes along. Inside the method, the if statements start springing up and the tests get more convoluted. By being aware of the potential impact of your changes on complexity, you can help shape the code in ways to minimise that complexity. For example, the need for the flag may be the indicator that there is a higher-level concept that could be passed into the method in question.
On coupling - Coupling can drag a piece of software down, making testing almost impossible and changes riskier and riskier. Of all the things to be aware of, coupling is the most important for me. It is a big decision to bring in a dependency on another component, even if it is internal to the package it is in. If it is in another package, then it is an even bigger decision. If it is an external package, doubly so. I try to ask myself if the new dependency is absolutely necessary and, if so, how the coupling can be mitigated. For example, can the dependency be put behind an internal interface?
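To make the last point concrete, here is a minimal TypeScript sketch of putting a dependency behind an internal interface. All of the names (MessagePublisher, OrderNotifier, InMemoryPublisher) are hypothetical and only there for illustration.

```typescript
// Hypothetical internal interface: the only thing our own code depends on.
interface MessagePublisher {
  publish(topic: string, body: string): Promise<void>;
}

// Business logic is coupled to the interface, not to any particular package.
class OrderNotifier {
  private readonly publisher: MessagePublisher;

  constructor(publisher: MessagePublisher) {
    this.publisher = publisher;
  }

  async orderPlaced(orderId: string): Promise<void> {
    const body = JSON.stringify({ orderId, status: 'placed' });
    await this.publisher.publish('orders', body);
  }
}

// In-memory implementation, handy for tests. A production adapter would wrap
// the external package and be the single place that imports it.
class InMemoryPublisher implements MessagePublisher {
  readonly published: Array<{ topic: string; body: string }> = [];

  async publish(topic: string, body: string): Promise<void> {
    this.published.push({ topic, body });
  }
}
```

With this shape, swapping the external package only touches the adapter, and the business logic can be tested without it.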

Summary

In our hectic coding lives, it is all too easy to almost 'code on automatic'. By being a little more present and aware, perhaps we can help ourselves be more productive, more knowledgeable, and enjoy coding even more.

👀 Look mum no Lambdas!

Andy Blackledge — Sun, 21 Nov 2021 17:28:21 GMT

There was a recent announcement that AWS added support for over 200 AWS Services to Step Functions. Here we look at how this impressive feat was achieved and we go through a worked example of how we can take advantage of it when using the AWS CDK.

All code in this post can be found on my GitHub repo.

TL;DR

Use the CallAwsService task to call any AWS service via the SDK
The properties of CallAwsService are not strongly-typed, so you need to do your research

How did they do that?

Before the announcement, Step Functions already had support for a number of AWS services. These included SNS, SQS, and DynamoDB, and enabled these services to be invoked directly from a Step Function without the need for any 'Lambda glue'. However, there were many services missing, e.g. S3, and the operations you could perform on the services were limited, e.g. just publishing to SNS.

For example, in CDK we could define tasks of type DynamoPutItem as follows:

const putObjectIndex = new sfnTasks.DynamoPutItem(sfnScope, 'PutObjectIndex', {
  table: props.indexTable,
  item: { // snip

The Step Functions team must have been faced with the prospect of creating more and more of these specific types of task. This would have been onerous and would also add an ongoing burden to the team. Given this, their solution was to create a new type of task, CallAwsService, that allows Step Functions to call any AWS service via the AWS SDK. As the announcement says:

Now, with the AWS SDK integration, its even simpler to build on AWS. SaaS developers can take data stored in Amazon S3, augment it with information stored in Amazon DynamoDB, then process with AWS machine learning services such as Amazon Textract or Amazon Comprehend to add new capabilities for their users.

For example, we can now invoke the S3 listObjectsV2 method to get the objects in a particular bucket as follows:

const listObjects = new sfnTasks.CallAwsService(sfnScope, 'ListObjects', {
  service: 's3',
  action: 'listObjectsV2',
  parameters: {
    Bucket: props.sourceBucket.bucketName,
  }, // snip

The upside of this approach is that we can now avoid 'Lambda glue' for these 200-plus services, but one downside is that in CDK we don't get strongly-typed task types. Maybe that is up to the community to provide, so over to you 😉

Let's start building!

As mentioned earlier, S3 was one of the services that previously could not be called directly from Step Functions. Given that, and the simplicity of the service, it seems a good candidate for us to try out the new functionality. With this in mind, we intend to create a CDK construct that satisfies the following requirement.

GIVEN an S3 bucket and a DynamoDB table
WHEN the Step Function is invoked
THEN for each object in the bucket:
- An item is added to the table containing metadata of the object

With this requirement in mind, let's start creating our construct. We will start with a basic version that has the inputs we require and a simple state machine that lists the objects in the source bucket.

export interface BucketIndexerProps {
  sourceBucket: s3.Bucket;
  indexTable: dynamodb.Table;
}

export default class BucketIndexer extends cdk.Construct {
  readonly stateMachine: sfn.StateMachine;

  constructor(scope: cdk.Construct, id: string, props: BucketIndexerProps) {
    super(scope, id);

    const listObjects = new sfnTasks.CallAwsService(this, 'ListObjects', {
      service: 's3',
      action: 'listObjectsV2',
      parameters: {
        Bucket: props.sourceBucket.bucketName,
      },
      iamResources: [props.sourceBucket.arnForObjects('*')],
    });

    this.stateMachine = new sfn.StateMachine(this, 'BucketIndexerStateMachine', {
      definition: sfn.Chain.start(listObjects),
    });

    props.sourceBucket.grantRead(this.stateMachine);
    props.indexTable.grantWriteData(this.stateMachine);
  }
}

In order to run this, we need to deploy our construct. To do this we create an integration test stack. This approach to serverless testing is covered in my series Serverless integration testing with the AWS CDK.

Let's have a look at the properties for CallAwsService:

service: The AWS service to call (full list).
action: The API action to call (use camelCase).
parameters: Parameters for the API action call (use PascalCase for the parameter names).
iamResources: The resources for the IAM statement that will be added to the state machine role's policy to allow the state machine to make the API call (by default the action for this IAM statement will be service:action).

As mentioned earlier, there is no strong-typing for parameters, so you will not get prompted automatically for any and you will probably need a bit of trial and error to get them right.
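One way to claw back some type safety is to wrap the loosely-typed values in a small typed helper of your own. The following is only a sketch: SdkCallOptions, ListObjectsV2Parameters and s3ListObjectsV2 are hypothetical names that are not part of the CDK, and the parameter type is hand-written to cover just a handful of the listObjectsV2 request fields.

```typescript
// Shape of the loosely-typed values discussed above (other properties omitted).
type SdkCallOptions = {
  service: string;
  action: string; // camelCase
  parameters: Record<string, unknown>; // PascalCase keys
  iamResources: string[];
};

// Hypothetical hand-written parameter type for S3 listObjectsV2.
type ListObjectsV2Parameters = {
  Bucket: string;
  Prefix?: string;
  MaxKeys?: number;
  ContinuationToken?: string;
};

// Hypothetical helper: the compiler now checks the parameter names and types.
function s3ListObjectsV2(
  parameters: ListObjectsV2Parameters,
  iamResources: string[]
): SdkCallOptions {
  return {
    service: 's3',
    action: 'listObjectsV2',
    parameters: { ...parameters },
    iamResources,
  };
}

// A misspelt key such as 'bucket' or 'MaxKey' would now be a compile-time error.
const listObjectsOptions = s3ListObjectsV2(
  { Bucket: 'my-bucket', MaxKeys: 100 },
  ['arn:aws:s3:::my-bucket/*']
);
```

The resulting object has the same four properties that are passed to CallAwsService above, so it could be spread into the task props.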

It wasn't clear initially what to supply for iamResources. After finding an example for S3 on GitHub, which I sadly can no longer find to credit, it looks like this is the ARN for the resource or resources being accessed. In our case, that is the S3 objects in the bucket.

To test, we put an object in the bucket and run the state machine. The following output confirms that the call is being made successfully.

{
  "Contents": [
    {
      "ETag": "\"99914b932bd37a50b983c5e7c90ae93b\"",
      "Key": "MyKey",
      "LastModified": "2021-11-21T10:02:14Z",
      "Size": 2,
      "StorageClass": "STANDARD"
    }
  ],
  "IsTruncated": false,
  "KeyCount": 1,
  "MaxKeys": 1000,
  "Name": "bucketindexerteststack-testsourcebucketc9809ad6-xtvk751nyhls6",
  "Prefix": ""
}

Iterating the results

The next step is for us to process the results and extract extra information for each object.

To do this we define a Map state to iterate over the Contents array. Here we can take advantage of the maxConcurrency property and do this in parallel.

const forEachObject = new sfn.Map(this, 'ForEachObject', {
  itemsPath: '$.Contents',
  parameters: {
    'Content.$': '$$.Map.Item.Value',
    'BucketName.$': '$.Name',
  },
  maxConcurrency: 6,
});
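If the JSONPath expressions look a little opaque, the following plain TypeScript sketch mimics what the Map state does with itemsPath and parameters. It is an illustration only, not how Step Functions is implemented, and the type and function names are hypothetical.

```typescript
type S3ObjectSummary = { Key: string; Size: number };
type ListObjectsOutput = { Name: string; Contents: S3ObjectSummary[] };
type IterationInput = { Content: S3ObjectSummary; BucketName: string };

// Builds the input that each iteration of the Map state would receive.
function buildIterationInputs(stateInput: ListObjectsOutput): IterationInput[] {
  // itemsPath: '$.Contents' selects the array to iterate over
  return stateInput.Contents.map((item) => ({
    Content: item, // 'Content.$': '$$.Map.Item.Value'
    BucketName: stateInput.Name, // 'BucketName.$': '$.Name'
  }));
}

const iterationInputs = buildIterationInputs({
  Name: 'my-bucket',
  Contents: [
    { Key: 'MyKey', Size: 2 },
    { Key: 'MyOtherKey', Size: 5 },
  ],
});
// iterationInputs[0] is { Content: { Key: 'MyKey', Size: 2 }, BucketName: 'my-bucket' }
```

Each iteration therefore sees both the individual object and the name of the bucket it came from.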

To extract information about the individual S3 objects, we need another CallAwsService task state. This task invokes the headObject SDK method and stores the results alongside the existing data.

const headObject = new sfnTasks.CallAwsService(this, 'HeadObject', {
  service: 's3',
  action: 'headObject',
  parameters: {
    'Bucket.$': '$.BucketName',
    'Key.$': '$.Content.Key',
  },
  iamResources: [props.sourceBucket.arnForObjects('*')],
  resultPath: '$.Head',
});

We then update the definition to include this new functionality.

this.stateMachine = new sfn.StateMachine(this, 'BucketIndexerStateMachine', {
  definition: sfn.Chain.start(listObjects).next(
    forEachObject.iterator(sfn.Chain.start(headObject))
  ),
});

Deploying the stack and running the state machine again, we see the following output for each object:

{
  "BucketName": "bucketindexerteststack-testsourcebucketc9809ad6-xtvk751nyhls6",
  "Content": {
    "ETag": "\"99914b932bd37a50b983c5e7c90ae93b\"",
    "Key": "MyKey",
    "LastModified": "2021-11-21T10:02:14Z",
    "Size": 2,
    "StorageClass": "STANDARD"
  },
  "Head": {
    "AcceptRanges": "bytes",
    "ContentLength": 2,
    "ContentType": "application/octet-stream",
    "ETag": "\"99914b932bd37a50b983c5e7c90ae93b\"",
    "LastModified": "2021-11-21T10:02:14Z",
    "Metadata": {}
  }
}
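The Head property appears alongside the existing data because of the resultPath setting on the headObject task. As a rough illustration, the sketch below mimics that merge for the simple case of a single top-level key. The applyResultPath function is hypothetical and deliberately ignores nested paths.

```typescript
// Mimics resultPath for a path of the form '$.Key' only.
function applyResultPath(
  input: Record<string, unknown>,
  resultPath: string,
  result: unknown
): Record<string, unknown> {
  const key = /^\$\.([A-Za-z0-9_]+)$/.exec(resultPath)?.[1];
  if (key === undefined) {
    throw new Error(`Unsupported resultPath: ${resultPath}`);
  }
  // The task result is added to a copy of the state input under the given key.
  return { ...input, [key]: result };
}

const taskOutput = applyResultPath(
  { BucketName: 'my-bucket', Content: { Key: 'MyKey' } },
  '$.Head',
  { ContentType: 'application/octet-stream' }
);
// taskOutput has the BucketName, Content and Head properties
```

Without a resultPath, the task result would replace the state input, and the bucket name and key would no longer be available to the states that follow.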

Indexing the objects

The final piece in our puzzle is to create an index of the objects in the DynamoDB table. This we will do using the DynamoPutItem task, which gives us a chance to see how it differs from using CallAwsService. The DynamoPutItem task has strongly-typed properties and helper classes (e.g. DynamoAttributeValue). This means that we are prompted to specify values of the correct types for the table and item properties and we are assisted in populating them.

const dynamoAttributeStringAt = (jsonPath: string): sfnTasks.DynamoAttributeValue =>
  sfnTasks.DynamoAttributeValue.fromString(JsonPath.stringAt(jsonPath));

const putObjectIndex = new sfnTasks.DynamoPutItem(this, 'PutObjectIndex', {
  table: props.indexTable,
  item: {
    bucketName: dynamoAttributeStringAt('$.BucketName'),
    key: dynamoAttributeStringAt('$.Content.Key'),
    metadata: sfnTasks.DynamoAttributeValue.fromMap({
      lastModified: dynamoAttributeStringAt('$.Content.LastModified'),
      contentType: dynamoAttributeStringAt('$.Head.ContentType'),
    }),
  },
});

With this state defined, we add it to the iterator in the state machine definition.

this.stateMachine = new sfn.StateMachine(this, 'BucketIndexerStateMachine', {
  definition: sfn.Chain.start(listObjects).next(
    forEachObject.iterator(sfn.Chain.start(headObject).next(putObjectIndex))
  ),
});

Now when we run our test, we can use the AWS console to see that our step function is now putting items into the DynamoDB table as expected. No Lambdas involved! 🎆

Summary

In this post, we built a step function that, before the SDK integration, would have required us to create, test, and maintain two Lambda functions. Having no code is good, as code always comes with a cost. In building our construct, we saw how to define SDK calls in a step function using the CDK, and how these are loosely-typed.

So now you have 200+ toys to play with, so what are you waiting for?

The accompanying GitHub repo contains all the code in this post as part of a working project, and also includes a more sophisticated version which takes into account continuation tokens.
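For anyone unfamiliar with continuation tokens, the idea is that a truncated listing returns a token, which is passed to the next call to fetch the following page. The sketch below shows that loop in plain TypeScript against a fake, in-memory page source. It is not the code from the repo, where the loop is built from states rather than a function, and all of the names are hypothetical.

```typescript
type Page = { keys: string[]; nextToken?: string };
type FetchPage = (token?: string) => Page;

// Keeps requesting pages until a page arrives without a continuation token.
function listAllKeys(fetchPage: FetchPage): string[] {
  const keys: string[] = [];
  let token: string | undefined = undefined;
  do {
    const page: Page = fetchPage(token);
    keys.push(...page.keys);
    token = page.nextToken;
  } while (token !== undefined);
  return keys;
}

// Fake page source standing in for repeated listObjectsV2 calls.
const fakePages: Record<string, Page> = {
  start: { keys: ['a', 'b'], nextToken: 't1' },
  t1: { keys: ['c'], nextToken: 't2' },
  t2: { keys: ['d'] },
};
const fetchFakePage: FetchPage = (token) => fakePages[token ?? 'start'] ?? { keys: [] };

const allKeys = listAllKeys(fetchFakePage); // ['a', 'b', 'c', 'd']
```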

Faster Lambda development with hotswap and tail

Andy Blackledge — Sun, 14 Nov 2021 09:42:42 GMT

In this post we use a worked example to look at two ways we can help speed up development of Lambda functions. One is to use the SAM CLI to tail function logs, and the other is to use the new --hotswap flag when deploying with the AWS CDK.

All the code for this post is available on my GitHub repo.

TL;DR

Use the --hotswap flag if you are in a development environment to speed up Lambda development 💨
Don't ever, ever use --hotswap if you are in a production environment
You can tail logs with the SAM CLI, even if you are using CDK 📃

Setting the scene

Our story starts with the development of a CDK construct that subscribes to an SNS topic, processes the messages with a Lambda function and, depending on the message attributes, routes the messages to one of two SQS queues.

To test the construct, we have also created an integration test stack. This is used to deploy the construct, so we can test it in the cloud. For further details on this approach, see my earlier series on Serverless Testing with the AWS CDK.

The diagram below shows the results of our efforts.

Debugging the slow way

The construct routes events based on an SNS message attribute called Deadline. Using my Serverless Testing Toolkit we have set up four tests for the following scenarios:

No attribute specified, expect normal priority
Invalid date specified, expect normal priority
Date specified more than 3 days from today, expect normal priority
Date specified less than 3 days from today, expect high priority

However, when we run the tests we see the following.

DeadlineRouter Test Suite
  Routes as expected: {"isExpectedHigh":false} (4181ms)
  Routes as expected: {"deadline":"","isExpectedHigh":false} (2134ms)
  Routes as expected: {"deadline":"2021-11-09T13:40:38.056+00:00","isExpectedHigh":false} (2114ms)
  1) Routes as expected: {"deadline":"2021-11-11T13:40:38.065+00:00","isExpectedHigh":true}

The first thing to do is to do a quick visual inspection of the code.

export const handler = async (event: SNSEvent): Promise<void> => {
  for await (const record of event.Records) {
    const deadlineString =
      record.Sns.MessageAttributes.Readline?.Value as string;
      //                           ^^^^^^^^
    const isHighPriority = getIsHighPriority(deadlineString);

It seems that we must have had text processing on our minds when we wrote our code. No problem, we can fix the typo and do a cdk deploy. A bit tedious, but not the end of the world. For the purposes of comparison later on, we use a stopwatch to time how long the deployment takes...

64 seconds later

With our fix in place, we re-run our tests and see the following.

DeadlineRouter Test Suite
  Routes as expected: {"isExpectedHigh":false} (4200ms)
  Routes as expected: {"deadline":"","isExpectedHigh":false} (2682ms)
  1) Routes as expected: {"deadline":"2021-11-09T13:54:05.798+00:00","isExpectedHigh":false}
  Routes as expected: {"deadline":"2021-11-11T13:54:05.806+00:00","isExpectedHigh":true} (2165ms)

So things have definitely changed, but there is clearly still some work to do. Time to fall back on our old standby of sprinkling some console.log dust on our code.

function getIsHighPriority(deadlineString: string): boolean {
  console.log(JSON.stringify({ deadlineString }, null, 2));
  if (!deadlineString) {
    return false;
  }
  const deadlineDate = DateTime.fromISO(deadlineString);
  console.log(JSON.stringify({ deadlineDate }, null, 2));
  if (!deadlineDate.isValid) {
    return false;
  }
  const durationLeftDays = deadlineDate.diff(DateTime.now(), 'days').days;
  console.log(JSON.stringify({ durationLeftDays }, null, 2));
  const highPriorityThresholdDays =
    parseInt(process.env[HIGH_PRIORITY_THRESHOLD_DAYS] ?? '0', 10);
  console.log(JSON.stringify({ highPriorityThresholdDays }, null, 2));
  return durationLeftDays <= highPriorityThresholdDays;
}

Now, this time we decide to use the new --hotswap flag with cdk deploy. The PR for this says:

It adds a (boolean) --hotswap flag to the deploy command that attempts to perform a short-circuit deployment, updating the resource directly, and skipping CloudFormation.
If we detect that the current change cannot be short-circuited (because it contains an infrastructure change to the CDK code, most likely), we fall back on performing a full CloudFormation deployment, same as if cdk deploy was called without the --hotswap flag.

12 seconds later

Yes, it only took 12 seconds this time. However, there were a few caveats thrown by the CLI.

The --hotswap flag deliberately introduces CloudFormation drift to speed up deployments
It should only be used for development - never use it for your production Stacks!

This point cannot be stressed enough. However, in our case we have an ephemeral deployment purely for the purposes of testing. Given that, this option is ideal for what we want to do.

Now to run the tests and see what we find.

A tale of two CLIs

To quote AWS:

The AWS Serverless Application Model (SAM) is an open-source framework for building serverless applications. It provides shorthand syntax to express functions, APIs, databases, and event source mappings.

Now although we are using CDK and not SAM for our infrastructure, SAM has a CLI that has some useful features. One of those is the ability to tail logs without having to go into the AWS console.

We install the SAM CLI and read the following in the logging guide.

You can fetch logs by using the function's name: sam logs -n mystack-HelloWorldFunction-1FJ8PD

We use the AWS Toolkit to identify the name of our function and issue the following command.

sam logs --name DeadlineRouterTestStack-SUTRouterFunction11A6E8DD-g0upBJpeuFqK --tail

Unfortunately, we get the following unfriendly error and stack trace.

  File "runpy.py", line 194, in _run_module_as_main
  File "runpy.py", line 87, in _run_code
  File "C:\Program Files\Amazon\AWSSAMCLI\runtime\lib\site-packages\samcli\__main__.py", line 12, in <module>
    cli(prog_name="sam")
  ...
ValueError: Required parameter name not set

Maybe we have done something wrong, maybe this feature is not ready yet. Our googling and other efforts turn up nothing, so perhaps we can try a different method. The documentation also says:

When your function is a part of an AWS CloudFormation stack, you can fetch logs by using the function's logical ID: sam logs -n HelloWorldFunction --stack-name mystack

What we need here is the logical ID. We can get it from the synthesized CloudFormation:

    "SUTRouterFunction11A6E8DD": {      "Type": "AWS::Lambda::Function",

Another alternative is to derive it manually from the deployed name. In our case, this was SUTRouterFunction11A6E8DD, so we try the following in a command window.

sam logs --stack-name DeadlineRouterTestStack -n SUTRouterFunction11A6E8DD --tail
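Deriving the logical ID from the deployed name can be sketched as a small function. This assumes the deployed name has the form stack name, logical ID, then a generated suffix, separated by hyphens, as it does in our example. That will not hold for functions given an explicit name and may not hold for very long names, so treat it as an illustration only. The function name is hypothetical.

```typescript
// Assumes a name of the form '<stackName>-<logicalId>-<generatedSuffix>'.
function logicalIdFromDeployedName(
  deployedName: string,
  stackName: string
): string | undefined {
  const prefix = `${stackName}-`;
  if (!deployedName.startsWith(prefix)) {
    return undefined;
  }
  const rest = deployedName.slice(prefix.length);
  const lastHyphen = rest.lastIndexOf('-');
  // Everything before the final hyphen is taken to be the logical ID.
  return lastHyphen > 0 ? rest.slice(0, lastHyphen) : undefined;
}

const logicalId = logicalIdFromDeployedName(
  'DeadlineRouterTestStack-SUTRouterFunction11A6E8DD-g0upBJpeuFqK',
  'DeadlineRouterTestStack'
);
// logicalId is 'SUTRouterFunction11A6E8DD'
```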

Running our tests, we then start to see the logs arrive in our console window. No more searching around in the AWS console 🎆 What is more, we can see the following values being logged for the durationLeftDays value.

It doesn't look likely that this value will ever be greater than 3. Looking at the code, we spot that we had the dates the wrong way round.

  const durationLeftDays = deadlineDate.diff(DateTime.now(), 'days').days;
  //                       ^^^^^^^^^^^^      ^^^^^^^^^^^^^^
  console.log(JSON.stringify({ durationLeftDays }, null, 2));
}
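To see why the order of the two dates matters, here is a dependency-free sketch of the same calculation. It uses the built-in Date rather than luxon, takes the current time as a parameter so that it is repeatable, and the function names are hypothetical.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Positive when the deadline is in the future, negative when it has passed.
function daysLeft(deadline: Date, now: Date): number {
  return (deadline.getTime() - now.getTime()) / MS_PER_DAY;
}

function isHighPriority(
  deadlineIso: string | undefined,
  thresholdDays: number,
  now: Date
): boolean {
  if (!deadlineIso) {
    return false;
  }
  const deadline = new Date(deadlineIso);
  if (Number.isNaN(deadline.getTime())) {
    return false; // Not a valid date
  }
  return daysLeft(deadline, now) <= thresholdDays;
}

const fixedNow = new Date('2021-11-06T12:00:00Z');
const farDeadline = new Date('2021-11-12T12:00:00Z');

const correctOrder = daysLeft(farDeadline, fixedNow); // 6
const wrongOrder = daysLeft(fixedNow, farDeadline); // -6
```

With the dates the wrong way round, every future deadline gives a negative number of days, which is always below the threshold, and so every message with a valid future deadline is routed as high priority.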

A quick code change and cdk deploy --hotswap later, in less than a minute we see the following that confirms the construct is working as expected.

DeadlineRouter Test Suite
  Routes as expected: {"isExpectedHigh":false} (4201ms)
  Routes as expected: {"deadline":"","isExpectedHigh":false} (2175ms)
  Routes as expected: {"deadline":"2021-11-09T14:37:20.141+00:00","isExpectedHigh":false} (2211ms)
  Routes as expected: {"deadline":"2021-11-11T14:37:20.149+00:00","isExpectedHigh":true} (2163ms)

Summary

We have seen how we can speed up Lambda development by combining the use of the new --hotswap CDK option with the ability of the SAM CLI to tail logs directly. With these two tools in our belt, we can speed up the inner development loop and iterate much more quickly on our functions whilst still having the confidence that testing in the cloud brings.

Easy Node.js Lambda bundling with CDK

Andy Blackledge — Mon, 01 Nov 2021 21:29:09 GMT

In this post, we go through a worked example showing the various ways that Node.js Lambda functions can be bundled using CDK. From doing it the hard way, to using code by convention.

Full code for this post can be found on the accompanying GitHub repo.

TL;DR

Use the NodejsFunction construct to make your life easy

Introducing the Priority Router construct

The function we are going to bundle is part of a CDK construct that routes SNS events to one of two SQS queues based on a Deadline attribute.

Below is a diagram showing the construct and the components that make it up:

The routing function

At the heart of the construct is the function that does the routing and the code for it is shown below.

export const HIGH_PRIORITY_THRESHOLD_DAYS = 'HIGH_PRIORITY_THRESHOLD_DAYS';
export const HIGH_PRIORITY_QUEUE_URL = 'HIGH_PRIORITY_QUEUE_URL';
export const NORMAL_PRIORITY_QUEUE_URL = 'NORMAL_PRIORITY_QUEUE_URL';

const sqs = new SQS();

export const handler = async (event: SNSEvent): Promise<void> => {
  for await (const record of event.Records) {
    const isHighPriority = false; // Route everything as normal for now
    const outputQueueUrl = isHighPriority
      ? process.env[HIGH_PRIORITY_QUEUE_URL]
      : process.env[NORMAL_PRIORITY_QUEUE_URL];
    if (outputQueueUrl === undefined)
      throw new Error('outputQueueUrl === undefined');
    const outputMessageRequest: SendMessageRequest = {
      QueueUrl: outputQueueUrl,
      MessageBody: record.Sns.Message,
    };
    const outputMessageResult =
      await sqs.sendMessage(outputMessageRequest).promise();
  }
};

As you can see, the initial implementation is going to default to send all messages to the normal priority queue. We will add the real processing later.

The `PriorityRouter` construct

The full code for the PriorityRouter construct can be found in the accompanying repo. It takes an SNS topic as input and creates two SQS queues as output, one for high priority messages and one for normal messages.

For the function, we use the Function construct and Code.fromAsset to point it to the outDir as defined in tsconfig.json.

const priorityRouterFunction = new lambda.Function(this, 'PriorityRouterFunction', {
  code:
    lambda.Code.fromAsset('dist/src/event-router'), // outDir from tsconfig.json
  handler: 'priorityRouterFunction.handler',
  runtime: lambda.Runtime.NODEJS_14_X,
  environment: {
    [HIGH_PRIORITY_QUEUE_URL]: this.highPriorityQueue.queueUrl,
    [NORMAL_PRIORITY_QUEUE_URL]: this.normalPriorityQueue.queueUrl,
    [HIGH_PRIORITY_THRESHOLD_DAYS]: props.highPriorityThresholdDays.toString(),
  },
});

To test the construct, a test stack was created to act as a test harness so that the construct could be tested in isolation. See the repo for details of the test stack and unit tests.

This approach was covered in my series on Serverless Testing with CDK and uses my Serverless Testing Toolkit npm package.

The test stack was deployed using cdk deploy and the tests run.

Add routing implementation

With the default behaviour verified, the next step was to add the processing of the Deadline attribute on the SNS event. To do this, the luxon npm package was chosen and the code updated as follows.

import { DateTime } from 'luxon';

export const handler = async (event: SNSEvent): Promise<void> => {
    // snip
    const deadlineString = record.Sns.MessageAttributes.Deadline?.Value as string;
    const isHighPriority = getIsHighPriority(deadlineString);
    // snip
};

function getIsHighPriority(deadlineString: string): boolean {
  if (!deadlineString) {
    return false;
  }
  const deadlineDate = DateTime.fromISO(deadlineString);
  if (!deadlineDate.isValid) {
    return false;
  }
  const durationLeftDays = DateTime.now().diff(deadlineDate, 'days').days;
  const highPriorityThresholdDays =
    parseInt(process.env[HIGH_PRIORITY_THRESHOLD_DAYS] ?? '0', 10);
  return durationLeftDays <= highPriorityThresholdDays;
}

The test stack was redeployed, the tests re-run and promptly failed 😢

The problem with dependencies

When the logs were checked, they revealed the following error.

2021-10-31T12:35:14.698+00:00    2021-10-31T12:35:14.698Z      undefined    ERROR    Uncaught Exception       {    "errorType":"Runtime.ImportModuleError",    "errorMessage":"Error: Cannot find module 'luxon'   }

The problem is that dependencies are not bundled with the function by default. You don't need to bundle aws-sdk, as it is automatically available, but you do need to bundle any other dependencies. The solution is to manually bundle the code yourself or use a bundling tool such as webpack or esbuild.

As well as solving this problem, bundling with one of these tools will speed up cold starts. It does this by reducing the overall package size and reducing the file access, as all code ends up in a single file.

Bundling with `esbuild`

After installing esbuild, the following script was added to package.json as bundle-lambda.

./node_modules/.bin/esbuild src/event-router/priorityRouterFunction.ts  --bundle --platform=node --target=node14 --external:aws-sdk   --outfile=esbuild-output/priorityRouterFunction.js

Running npm run bundle-lambda resulted in the function and all its dependencies being written to the esbuild-output directory.

With the code now bundled with its dependencies, the construct was updated to point to the new location.

const priorityRouterFunction = new lambda.Function(this, 'PriorityRouterFunction', {
  code: lambda.Code.fromAsset('esbuild-output'), // Output from esbuild
  handler: 'priorityRouterFunction.handler',
  // snip
});

After running cdk deploy, the tests were re-run and all passed. It seemed the bundling had worked, but we can make it even easier.

`NodejsFunction` for the win

The NodejsFunction construct simplifies the bundling by allowing you to supply the .js or .ts file for your function handler and it will use esbuild behind the scenes to automatically bundle it for you. With this, the code became even simpler and removed the need to use the bundle-lambda script.

const priorityRouterFunction = new lambdaNodejs.NodejsFunction(this, 'PriorityRouterFunction', {
  entry: path.join(__dirname, 'priorityRouterFunction.ts'), // Current directory
  handler: 'handler',
  // snip
});

If you are creating an npm package for a construct that contains a function such as this, then the package may not contain the .ts file. In this case, it may be necessary to make the construct code probe for the existence of the .ts file first and fall back to the .js file if not found.

The final option takes the simplification one step further using coding by convention. The convention in this case is that if you do not specify a value for entry, then NodejsFunction will look for a .ts or .js file based on the name of the construct file and the function id.
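A minimal sketch of such a probe is shown below. The resolveEntry function is hypothetical, and the existence check is passed in so that the logic can be exercised without touching the file system. In a construct, fs.existsSync would be the natural thing to supply.

```typescript
// Prefers the .ts file and falls back to the .js file if it is not found.
function resolveEntry(
  pathWithoutExtension: string,
  exists: (path: string) => boolean
): string {
  const tsPath = `${pathWithoutExtension}.ts`;
  return exists(tsPath) ? tsPath : `${pathWithoutExtension}.js`;
}

// Simulates a published package that only contains the compiled output.
const packagedFiles = new Set(['lib/priorityRouterFunction.js']);
const packagedEntry = resolveEntry('lib/priorityRouterFunction', (path) =>
  packagedFiles.has(path)
);
// packagedEntry is 'lib/priorityRouterFunction.js'
```

The result could then be passed as the entry value for the NodejsFunction.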

const priorityRouterFunction = new lambdaNodejs.NodejsFunction(this, 'PriorityRouterFunction', {
  handler: 'handler',
  // snip
});

For example, if the above code was in a construct file PriorityRouter then NodejsFunction would look for a file called PriorityRouter.PriorityRouterFunction.ts or PriorityRouter.PriorityRouterFunction.js that contains an exported function called handler.
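The naming convention as described above can be expressed as a small function. This is just the rule from the paragraph written out in TypeScript, not the lookup code from the CDK itself, and the function name is hypothetical.

```typescript
// Returns the file names that the convention would look for, in order.
function conventionalEntryCandidates(
  constructFileName: string,
  functionId: string
): string[] {
  // Drop the extension of the file that defines the construct, if it has one.
  const base = constructFileName.replace(/\.(ts|js)$/, '');
  return [`${base}.${functionId}.ts`, `${base}.${functionId}.js`];
}

const entryCandidates = conventionalEntryCandidates(
  'PriorityRouter.ts',
  'PriorityRouterFunction'
);
// ['PriorityRouter.PriorityRouterFunction.ts', 'PriorityRouter.PriorityRouterFunction.js']
```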

Summary

CDK gives us a simple way to bundle our Node.js functions. Why do it the hard way? 😜

Low Cost Step Functions with CDK

Andy Blackledge — Mon, 27 Sep 2021 19:41:38 GMT

Step Functions are great. They let you orchestrate your Lambda functions in a declarative manner, allowing you to combine those functions without directly chaining them together (and thus compounding your costs). However, they are expensive. The first 4,000 transitions are free, but the rest are $0.025 per 1,000 state transitions. You could use Express Workflows instead, but they can only run for five minutes of wall-clock time. So how can we have a cheap, long-running way of easily orchestrating Lambda functions? Perhaps CDK can help us build such a thing.
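To get a feel for how those numbers add up, here is a small sketch of the arithmetic using only the figures quoted above. The function and constant names are hypothetical, and current pricing should be checked before relying on the result.

```typescript
const FREE_TRANSITIONS = 4000;
const PRICE_PER_1000_TRANSITIONS_USD = 0.025;

// Cost of a number of state transitions, after the free allowance.
function transitionCostUsd(transitions: number): number {
  const billable = Math.max(0, transitions - FREE_TRANSITIONS);
  return (billable / 1000) * PRICE_PER_1000_TRANSITIONS_USD;
}

const smallWorkloadCost = transitionCostUsd(3000); // 0, as it is within the free allowance
const largeWorkloadCost = transitionCostUsd(1000000); // roughly 24.9
```

So a workflow with ten states that runs 100,000 times makes a million transitions and costs in the region of $25, before the cost of the Lambda functions themselves.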

Note, Step Functions also have error-handling, retries, parallel processing and more very useful functionality that we won't be trying to replicate here. Well, at least not yet 😉

TL;DR

Using CDK, Lambda functions, SNS, and DynamoDB, it is possible to build a simple analog of Step Functions. See the GitHub repo for the full code and working examples.

The Aim

The aim is to have a single orchestrator Lambda function that uses SNS topics to send asynchronous requests to Lambda functions that perform the various tasks. The orchestrator function then subscribes to a response topic in order to process the output from those tasks. A DynamoDB table is to be used to hold the state of the orchestration between the asynchronous calls. The resulting architecture should look something like the following.

CDK Best Practices

The following is taken from Best practices for developing and deploying cloud infrastructure with the AWS CDK and will inform how we build the solution. I would recommend anyone interested in CDK to read the whole thing.

Infrastructure and runtime code live in the same package
A construct that is self-contained, in other words that completely describes a piece of functionality including its infrastructure and logic, makes it easy to evolve the two kinds of code together, test them in isolation, share and reuse the code across projects, and version all the code in sync.
Model your app through constructs, not stacks
When breaking down your application into logical units, represent each unit as a descendant of Construct and not of Stack. Stacks are a unit of deployment, and so tend to be oriented to specific applications. By using constructs instead of stacks, you give yourself and your users the flexibility to build stacks in the way that makes the most sense for each deployment scenario.

Thinking In Constructs

With this advice in mind, the components are to be organised as follows.

An orchestration is to be composed of a single orchestrator construct and one or more task constructs. The orchestrator and task constructs are each made up of a Lambda function and an SNS topic. The topics are to be used for the request/response communication between the orchestrator and the tasks. The trick is going to be how we make wiring up these interdependent constructs as straightforward as possible.
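Before looking at the constructs, it may help to see the request/response flow in miniature. The sketch below is an in-memory stand-in written in plain TypeScript: a simple class plays the part of the SNS topics, a Map plays the part of the DynamoDB table, and delivery is synchronous rather than asynchronous. All of the names are hypothetical and none of this is the code from the repo.

```typescript
type TaskRequest = { executionId: string; input: number };
type TaskResponse = { executionId: string; output: number };
type ExecutionState = { remainingTasks: string[]; value: number; done: boolean };

// Stand-in for an SNS topic: delivers each message to every subscriber.
class InMemoryTopic<T> {
  private readonly handlers: Array<(message: T) => void> = [];
  subscribe(handler: (message: T) => void): void {
    this.handlers.push(handler);
  }
  publish(message: T): void {
    this.handlers.forEach((handler) => handler(message));
  }
}

// Stand-in for the table that holds state between the asynchronous calls.
const executionTable = new Map<string, ExecutionState>();
const requestTopics = new Map<string, InMemoryTopic<TaskRequest>>();
const responseTopic = new InMemoryTopic<TaskResponse>();

// A task listens on its own request topic and replies on the response topic.
function registerTask(taskName: string, work: (input: number) => number): void {
  const topic = new InMemoryTopic<TaskRequest>();
  topic.subscribe((request) =>
    responseTopic.publish({ executionId: request.executionId, output: work(request.input) })
  );
  requestTopics.set(taskName, topic);
}

// The orchestrator requests the next task, or marks the execution as done.
function dispatchNext(executionId: string): void {
  const state = executionTable.get(executionId);
  if (state === undefined) throw new Error(`Unknown execution: ${executionId}`);
  const nextTask = state.remainingTasks.shift();
  if (nextTask === undefined) {
    state.done = true;
    return;
  }
  const topic = requestTopics.get(nextTask);
  if (topic === undefined) throw new Error(`Unknown task: ${nextTask}`);
  topic.publish({ executionId, input: state.value });
}

// The orchestrator subscribes to the response topic to process task output.
responseTopic.subscribe((response) => {
  const state = executionTable.get(response.executionId);
  if (state === undefined) return;
  state.value = response.output;
  dispatchNext(response.executionId);
});

function startExecution(executionId: string, tasks: string[], input: number): void {
  executionTable.set(executionId, { remainingTasks: [...tasks], value: input, done: false });
  dispatchNext(executionId);
}

registerTask('addOne', (n) => n + 1);
registerTask('double', (n) => n * 2);
startExecution('exec-1', ['addOne', 'double'], 3);
// executionTable.get('exec-1') now holds { remainingTasks: [], value: 8, done: true }
```

The point to notice is that the orchestrator never calls a task directly. It only publishes requests and reacts to responses, with everything it needs to carry on held in the state store.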

Note that the orchestration state is not part of these constructs. This is in line with the following guidance from the best practices.

Separate your application into multiple stacks as dictated by deployment requirements
Consider keeping stateful resources (like databases) in a separate stack from stateless resources. You can then turn on termination protection on the stateful stack, and can freely destroy or create multiple copies of the stateless stack without risk of data loss.

If we have long-running orchestrations, then we may have state that needs to persist between deployments of the orchestration implementation. Perhaps there was a bug-fix that required a patch release. We want to be careful that such state is not deleted in such scenarios. Given this, the decision is to keep the state external.

The Orchestrator Construct

The Orchestrator construct is an abstract class that provides the base functionality for Orchestrators. The first thing to consider with the Orchestrator construct is the inputs and the outputs. For constructs, the inputs are passed in as a props object and the outputs are properties exposed by the construct itself. For the Orchestrator construct, these are as follows.

export interface OrchestratorProps {  executionTable: dynamodb.ITable;  handlerFunction: lambda.Function;}export default abstract class Orchestrator extends cdk.Construct {  readonly responseTopic: sns.ITopic;  readonly handlerFunction: lambda.Function;}

Inputs:

executionTable is a reference to the DynamoDB table that will be used to store the orchestration state. The construct could create this itself, but as we saw from the best practices, it can be wise to keep stateful resources external.
handlerFunction is a reference to the Lambda function that will do the orchestration. This resource will be instantiated by the concrete subclass, as it will provide functionality specific to the concrete orchestration.

Outputs:

responseTopic is the SNS topic that tasks use in order to publish their responses back to the orchestrator.
handlerFunction is the same function as passed in via the inputs. We expose a reference to it, as it is needed in order to interact with the orchestration.

With the inputs and outputs defined, we move on to the constructor where we create and wire up the resources.

constructor(scope: cdk.Construct, id: string, props: OrchestratorProps) {  super(scope, id);  this.handlerFunction = props.handlerFunction;  props.executionTable.grantReadWriteData(props.handlerFunction);  props.handlerFunction.addEnvironment(    OrchestratorEnvVars.EXECUTION_TABLE_NAME,    props.executionTable.tableName  );  this.responseTopic = new sns.Topic(this, `ResponseTopic`);  this.responseTopic.addSubscription(    new snsSubs.LambdaSubscription(props.handlerFunction)  );}

Here the handlerFunction is exposed. It is then granted access to the state table and an environment variable is added to provide it with the name. The response topic is then created and the handlerFunction subscribed to it to receive the response messages.

The Task Construct

As with the orchestrator construct, the first things to define are the inputs and outputs.

export interface AsyncTaskProps {
  handlerType: new () => AsyncTaskHandler;
  handlerFunction: lambda.Function;
}

export default abstract class AsyncTask<TReq, TRes> extends cdk.Construct {
  readonly requestTopic: sns.ITopic;
}

Inputs:

handlerType is a parameterless constructor function that is used to retrieve the name of the concrete implementation, see props.handlerType.name.
handlerFunction is a reference to the Lambda function that will handle the task requests. This function will delegate the handling to a subclass of AsyncTaskHandler.

Outputs:

requestTopic is the SNS topic created that the orchestration will use to send requests to the task function.

constructor(orchestrator: Orchestrator, id: string, props: AsyncTaskProps) {
  super(orchestrator, id);

  // Requests flow from the orchestrator function to the task function
  this.requestTopic = new sns.Topic(this, 'RequestTopic');
  this.requestTopic.addSubscription(
    new snsSubs.LambdaSubscription(props.handlerFunction)
  );
  this.requestTopic.grantPublish(orchestrator.handlerFunction);
  orchestrator.handlerFunction.addEnvironment(
    `${props.handlerType.name.toUpperCase()}_REQUEST_TOPIC_ARN`,
    this.requestTopic.topicArn
  );

  // Responses flow from the task function back to the orchestrator
  orchestrator.responseTopic.grantPublish(props.handlerFunction);
  props.handlerFunction.addEnvironment(
    AsyncTaskEnvVars.RESPONSE_TOPIC_ARN,
    orchestrator.responseTopic.topicArn
  );
}

The constructor first creates the requestTopic and subscribes the task handlerFunction to it to receive requests.

Next it uses the orchestrator parameter to access the underlying handlerFunction. It grants this function access to publish requests to the task, then it adds an environment variable to the function. The environment variable is named following a convention based on the name of the handlerType. The orchestrator function will follow the same convention in order to derive the SNS topic ARN for a particular handlerType. Finally, the task handlerFunction is granted access to the orchestrator response topic and an environment variable is added with the SNS topic ARN.
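The naming convention can be sketched as a pair of helpers. This is a minimal sketch rather than the code from the repo: requestTopicEnvVarName and getRequestTopicArn are hypothetical names, and the environment is passed in as a plain object, where a Lambda function would presumably read it from process.env.

```typescript
// Hypothetical helpers; only the `${NAME}_REQUEST_TOPIC_ARN` convention
// itself comes from the construct above.
type HandlerType = new () => object;

const requestTopicEnvVarName = (handlerType: HandlerType): string =>
  `${handlerType.name.toUpperCase()}_REQUEST_TOPIC_ARN`;

// The environment is a parameter so the lookup can be exercised outside Lambda.
function getRequestTopicArn(
  handlerType: HandlerType,
  env: Record<string, string | undefined>
): string {
  const name = requestTopicEnvVarName(handlerType);
  const topicArn = env[name];
  if (topicArn === undefined) {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return topicArn;
}

// Stand-in for the real handler class; only its name matters here.
class AddTwoNumbersHandler {}

// Made-up ARN for illustration.
const exampleEnv = {
  ADDTWONUMBERSHANDLER_REQUEST_TOPIC_ARN:
    'arn:aws:sns:eu-west-2:000000000000:example-request-topic',
};

console.log(requestTopicEnvVarName(AddTwoNumbersHandler));
console.log(getRequestTopicArn(AddTwoNumbersHandler, exampleEnv));
```

One thing to bear in mind is that the convention relies on the class name being the same at synthesis time and at runtime. I believe that if the bundler were configured to minify identifiers, the two could diverge.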

The Constructs In Action

Overview

To demonstrate the constructs in action, we are going to build a simple orchestration that takes three numbers and adds them together. It is going to do this using a sequence of two tasks, each adding two numbers together.

Take the inputs x, y, and z and store them
Set the running total to 0
Call a task to add x and y together
Store the result as the running total
Call a task to add z and the running total together
Store the result as the running total
Return the running total as the output
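The steps above can be traced in plain TypeScript before any infrastructure is involved. In this sketch, addTwoNumbers and runSimpleSequence are illustrative names, and the task is a local function rather than an asynchronous call over SNS.

```typescript
// A local stand-in for the task; in the orchestration this would be an
// asynchronous request/response exchange over SNS.
const addTwoNumbers = (value1: number, value2: number): number =>
  value1 + value2;

function runSimpleSequence(x: number, y: number, z: number): number {
  let total = 0; // set the running total to 0
  total = addTwoNumbers(x, y); // first task call, result stored as the total
  total = addTwoNumbers(z, total); // second task call, result stored as the total
  return total; // the running total is the output
}

console.log(runSimpleSequence(1, 2, 3)); // prints 6
```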

Add Two Numbers Task

The first thing to do is define the request and response for the task. This is done by creating two interfaces as follows.

export interface AddTwoNumbersRequest {  value1: number;  value2: number;}export interface AddTwoNumbersResponse {  total: number;}

Next, a subclass of AsyncTaskHandler is created to handle the request and return the response. AsyncTaskHandler is doing the heavy lifting of handling SNS events and turning them into AddTwoNumbersRequest instances, then taking AddTwoNumbersResponse and publishing them to the orchestrator response topic.
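To give a feel for that heavy lifting, here is a rough sketch of what such a base class might do. It is not the implementation from the repo: SketchTaskHandler and the injected publish function are hypothetical, and details such as correlating a response with its orchestration execution are left out. The event shape mirrors the Records and Sns.Message fields of the SNS events that Lambda delivers.

```typescript
// Minimal shape of the SNS event passed to a subscribed Lambda function.
interface SnsEventLike {
  Records: { Sns: { Message: string } }[];
}

// Hypothetical publisher; a real handler would presumably call the SNS client.
type Publish = (message: string) => Promise<void>;

abstract class SketchTaskHandler<TReq, TRes> {
  private readonly publish: Publish;

  constructor(publish: Publish) {
    this.publish = publish;
  }

  // Implemented by each concrete task.
  abstract handleRequestAsync(request: TReq): Promise<TRes>;

  // Turns each SNS record into a request, handles it, publishes the response.
  async handleAsync(event: SnsEventLike): Promise<void> {
    for (const record of event.Records) {
      const request = JSON.parse(record.Sns.Message) as TReq;
      const response = await this.handleRequestAsync(request);
      await this.publish(JSON.stringify(response));
    }
  }
}

class SketchAddHandler extends SketchTaskHandler<
  { value1: number; value2: number },
  { total: number }
> {
  async handleRequestAsync(request: {
    value1: number;
    value2: number;
  }): Promise<{ total: number }> {
    return { total: request.value1 + request.value2 };
  }
}

// Usage: collect published messages in memory instead of sending to SNS.
const published: string[] = [];
const sketchHandled = new SketchAddHandler(async (message) => {
  published.push(message);
}).handleAsync({
  Records: [{ Sns: { Message: JSON.stringify({ value1: 2, value2: 3 }) } }],
});
```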

export class AddTwoNumbersHandler extends AsyncTaskHandler<
  AddTwoNumbersRequest,
  AddTwoNumbersResponse
> {
  async handleRequestAsync(
    request: AddTwoNumbersRequest
  ): Promise<AddTwoNumbersResponse> {
    return {
      total: request.value1 + request.value2,
    };
  }
}

Finally, a handler function is exported. This simply despatches the incoming event to the handleAsync method on the AsyncTaskHandler base class.

export const handler = async (event: any): Promise<void> =>  new AddTwoNumbersHandler().handleAsync(event);

The CDK best practices guide mentions the following:

The AWS CDK not only generates AWS CloudFormation templates for deploying infrastructure, it also bundles runtime assets like Lambda functions and deploys them alongside your infrastructure.

We can take advantage of this by using the NodejsFunction construct and noting the following convention.

If the NodejsFunction is defined in stack.ts with my-handler as id (new NodejsFunction(this, 'my-handler')), the construct will look at stack.my-handler.ts and stack.my-handler.js.

So if we put the following code in AddTwoNumbers.ts, then the CDK will look in AddTwoNumbers.AddTwoNumbersHandler.ts for a handler function.

export default class AddTwoNumbers extends AsyncTask<  AddTwoNumbersRequest,  AddTwoNumbersResponse> {  constructor(orchestrator: Orchestrator, id: string) {    super(orchestrator, id, {      handlerType: AddTwoNumbersHandler,      handlerFunction: new lambdaNodejs.NodejsFunction(        orchestrator,        AddTwoNumbersHandler.name      ),    });  }}

Here we use the AddTwoNumbersHandler constructor twice: once to pass to the base class, and once as the conventional id for the Lambda function. This means that if we structure the code into two files, then the CDK will bundle the code using esbuild, which is one proven way to minimise cold starts.

The resulting files are AddTwoNumbers.ts, which contains the construct, and AddTwoNumbers.AddTwoNumbersHandler.ts, which contains the handler function.

Simple Sequence Orchestration

Now we have a task to call, we come to defining the orchestration itself. First up, we need to define the inputs, the outputs, and the structure of the data the orchestration works upon. These are all defined as interfaces as follows.

export interface SimpleSequenceInput {  x: number;  y: number;  z: number;}export interface SimpleSequenceOutput {  total: number;}export interface SimpleSequenceData {  x: number;  y: number;  z: number;  total: number;}

Next we need to define how to get the initial data, based on the inputs, and how we get the output based on the data. This is done as follows, providing a getData function for the former and a getOutput function for the latter.

const orchestrationProps: OrchestrationBuilderProps<  SimpleSequenceInput,  SimpleSequenceOutput,  SimpleSequenceData> = {  getData: (input): SimpleSequenceData => ({    ...input,    total: 0,  }),  getOutput: (data): SimpleSequenceOutput => ({ total: data.total }),};

The next step, no pun intended, is to define the steps of our orchestration. This is done using the fluent builder pattern and the OrchestrationBuilder class. Each step has a unique id, a reference to the type of handler, and two functions. The getRequest function returns a request instance based on the current data. This is the request that is sent to the task handler. The updateData function takes the response returned by the task and updates the data. In contrast to Step Functions, this approach has some level of type safety thanks to TypeScript.

const orchestration = new OrchestrationBuilder<  SimpleSequenceInput,  SimpleSequenceOutput,  SimpleSequenceData>(orchestrationProps)  .invokeAsync({    stepId: 'AddX&Y',    HandlerType: AddTwoNumbersHandler,    getRequest: (data) => ({      value1: data.x,      value2: data.y,    }),    updateData: (data, response) => {      data.total = response.total;    },  })  .invokeAsync({    stepId: 'AddZ&Total',    HandlerType: AddTwoNumbersHandler,    getRequest: (data) => ({      value1: data.z,      value2: data.total,    }),    updateData: (data, response) => {      data.total = response.total;    },  })  .build();

Now we have our orchestration defined, we need to subclass OrchestratorHandler as follows and export a handler function to despatch events to it.

export class SimpleSequenceHandler extends OrchestratorHandler<  SimpleSequenceInput,  SimpleSequenceOutput,  SimpleSequenceData> {  constructor() {    super(orchestration);  }}export const handler = async (event: any): Promise<any> =>  new SimpleSequenceHandler().handleAsync(event);

OrchestratorHandler is doing a lot of heavy lifting here behind the scenes. It handles the events to start the orchestration and it steps through the orchestration, pausing when an asynchronous task is called. When a response event is received, it then resumes stepping through.
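The pause-and-resume stepping can be illustrated with a small sketch. This is my simplification rather than the code from the repo: the Step, ExecutionState, and Progress types and the advance function are hypothetical, and the state is held in memory where the real handler would persist it between invocations.

```typescript
// Hypothetical, simplified step definition.
interface Step<TData> {
  stepId: string;
  getRequest: (data: TData) => unknown;
  updateData: (data: TData, response: any) => void;
}

interface ExecutionState<TData> {
  data: TData;
  nextStepIndex: number; // the step to send next, or the one awaiting a response
}

type Progress =
  | { status: 'waiting'; stepId: string; request: unknown }
  | { status: 'complete' };

// Called once to start and then once for each task response received.
function advance<TData>(
  steps: Step<TData>[],
  state: ExecutionState<TData>,
  response?: unknown
): Progress {
  if (response !== undefined) {
    // Resume: apply the response to the step we paused on, then move on.
    steps[state.nextStepIndex].updateData(state.data, response);
    state.nextStepIndex += 1;
  }
  if (state.nextStepIndex >= steps.length) {
    return { status: 'complete' };
  }
  // Pause: hand back the request to send and wait for the response event.
  const step = steps[state.nextStepIndex];
  return {
    status: 'waiting',
    stepId: step.stepId,
    request: step.getRequest(state.data),
  };
}

interface SumData {
  x: number;
  y: number;
  z: number;
  total: number;
}

const sumSteps: Step<SumData>[] = [
  {
    stepId: 'AddX&Y',
    getRequest: (data) => ({ value1: data.x, value2: data.y }),
    updateData: (data, response) => {
      data.total = response.total;
    },
  },
  {
    stepId: 'AddZ&Total',
    getRequest: (data) => ({ value1: data.z, value2: data.total }),
    updateData: (data, response) => {
      data.total = response.total;
    },
  },
];

const sumState: ExecutionState<SumData> = {
  data: { x: 1, y: 2, z: 3, total: 0 },
  nextStepIndex: 0,
};

const firstProgress = advance(sumSteps, sumState); // pauses on 'AddX&Y'
const secondProgress = advance(sumSteps, sumState, { total: 3 }); // pauses on 'AddZ&Total'
const lastProgress = advance(sumSteps, sumState, { total: 6 }); // completes
console.log(firstProgress.status, secondProgress.status, lastProgress.status);
```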

The final piece in the puzzle is the orchestrator construct as follows.

export default class SimpleSequence extends Orchestrator {
  constructor(scope: cdk.Construct, id: string, props: SimpleSequenceProps) {
    super(scope, id, {
      ...props,
      handlerFunction: new lambdaNodejs.NodejsFunction(
        scope,
        SimpleSequenceHandler.name
      ),
    });

    // Wire the task up to this orchestrator
    new AddTwoNumbers(this, AddTwoNumbersHandler.name);
  }
}

Again, we use the NodejsFunction construct and the convention to wire it up to the appropriate handler. We also wire up the AddTwoNumbers to the orchestrator with one line of code. I hope you can see how the AddTwoNumbers code could easily be packaged and reused across orchestrations. This might be useful if a task held its own state, perhaps a call to an external service with a circuit breaker.

Summary

We have seen that we can create a framework for creating serverless orchestrations without Step Functions. By using the CDK, we can take advantage of its compositional abilities and how it can combine the code and the infrastructure. Admittedly, the result lacks several features, such as error-handling, that you would need for production. However, this shows what is possible and the full code can be found in the GitHub repo, along with a set of working examples and unit tests.

Software in 3 steps: Make it run, make it right, make it fast

Andy Blackledge — Tue, 07 Sep 2021 18:12:04 GMT

Here is my guide to the three steps to software development. Please take it with a pinch of salt and let it be some food for thought.

TL;DR

Understand the problem, explore the solution space (Make it run 🏃)
Now you know the solution, express it as well as possible (Make it right)
Now it is expressed well, how can it be optimised? (Make it fast 🚀)

Depending on the context, 2 and 3 may not be necessary.

Make it run 🏃

This step is all about having some working code that meets the acceptance criteria. This, of course, assumes you have some acceptance criteria. If you don't, then I would suggest you find out what they are before starting. You will also need to know how to work out if your solution meets those criteria. This could be an ad-hoc manual test or, hopefully, something more structured and repeatable. Either way, before you start you should have a plan to know when you can stop.

The make it run step is all about gaining an understanding of the problem in hand and exploring the solution space. I have found that writing tests or just thinking about a test approach often pays great dividends in gaining this understanding. I have never done formal Test-Driven Development (TDD), but I have written enough tests before code to get a feel for how it can be a powerful technique.

The code at this step is somewhat equivalent to a Minimum Viable Product (MVP). Corners can be cut with regard to structures and abstractions. Exploration and understanding are the key here. I find that if I fall into the trap of creating class hierarchies, interfaces, and such items too early, then I run the danger of feeling tied to those artefacts and I can lose agility in reaching a solution. It can be hard to give up your code.

Once the code is meeting the acceptance criteria, then I consider moving on to the next step. However, depending on the context, this may not be necessary. Is the code for a one-off prototype or destined for a short-lived application? I am not advocating wilful technical debt here, just pointing out that just running may be good enough and it needs to be considered.

Make it right

Making it right is all about taking the learning from making it run and, now that you know how to solve the problem, expressing the solution well and laying the foundation for the future. Keeping the investment in the code light up to this point should make it easier to start again from scratch if need be. Fred Brooks said, "Plan to throw one away; you will, anyhow." I appreciate that this might not be a luxury that some folk have, but if you do then you may well be surprised at how much better you can make that second version.

The key here is to lay the foundation for the future of the software. To make it maintainable, understandable, and sustainable. This is where careful naming, well thought out abstractions, separation of concerns, and all that clean code jazz play their part. You can apply all that you have learnt about what makes quality software. This is where you can make your code delight the reader and make it a pleasure to maintain. OK, that may be a step too far, but you can at least try 👍

By the time you have got it running and you have got it right, then you have probably got a lot further than most. Taking the next step may well not be necessary at all.

Make it fast 🚀

The final step is to make it fast. The main reason to keep it separate is to try to prevent premature optimisation, which a wiser person than me said was the root of all evil. However, I am not afraid to admit that I have fallen into the trap myself and had to live with the consequences for many years.

I expect that there are many fine guides out there on how to profile and optimise your code. What I do know is that if code is well-structured, i.e. 'right', then optimising becomes much more straightforward. With concerns logically separated, there are often suitable seams in which to put caches and other typical optimisations. If your code isn't 'right', then optimising can just pour fuel on the fire of technical debt.

Summary

So there you have it, there are only three steps and you might not need them all. As I like to say, "What could possibly go wrong?" 😜

AWS EventBridge test automation with CDK

Andy Blackledge — Fri, 27 Aug 2021 09:38:14 GMT

Listening to podcasts and reading articles, it seems AWS EventBridge is getting quite a bit of attention. Given this, I thought I would kick its tyres myself and see if I could automate testing it in the process. This article documents that journey and what I found on the way.

All the code can be found as part of the GitHub repo for the Serverless Testing Toolkit package I am developing.

TL;DR

You can't tag event buses, despite what the documentation says.
Invalid event patterns fail the whole deployment.
You can't do a partial match on source.
The testEventPattern SDK method requires different inputs than the CDK methods.

Putting the wheels on the event bus

The first thing I wanted to do with EventBridge was create an event bus and put some events on it. The first part is straightforward enough using the AWS CDK. I wrapped an EventBus instance in a CDK Construct and exposed it as a property.

export default class NotificationHub extends cdk.Construct {  static readonly NotificationHubEventBusId = 'NotificationHubEventBus';  readonly eventBus: events.EventBus;  constructor(scope: cdk.Construct, id: string) {    super(scope, id);    this.eventBus = new events.EventBus(this, NotificationHub.NotificationHubEventBusId);  }}

Now, I am currently developing an npm package called Serverless Testing Toolkit. This toolkit has a base CDK Stack that can be extended to provide a hosting environment for the construct under test. I wrote about this approach in my series Serverless integration testing with the AWS CDK. This testing approach relies on tagging resources, such as Lambda functions or SQS queues, so that they can be located and invoked. I hoped to use this approach to put events on an EventBridge event bus. The AWS documentation certainly gave me reason to believe.

In EventBridge, rules and event buses can be tagged.

With this good news, I extended the IntegrationTestStack from the Serverless Testing Toolkit and tagged the EventBus exposed by our system under test (SUT).

export default class NotificationHubTestStack extends IntegrationTestStack {
  //
  static readonly Id = `NotificationHubTestStack`;
  static readonly BusObserverFunctionId = 'BusObserverFunction';
  static readonly TestLenderId = 'test-lender-id';

  constructor(scope: cdk.Construct, id: string) {
    super(scope, id, {
      testStackId: NotificationHubTestStack.Id,
      testFunctionIds: [NotificationHubTestStack.BusObserverFunctionId],
    });

    // SUT
    const sut = new NotificationHub(this, 'SUT');
    this.addTestResourceTag(sut.eventBus, NotificationHub.NotificationHubEventBusId);

    // Rule and target observer
    const sourceRule = new events.Rule(this, 'SourceRule', {
      eventBus: sut.eventBus,
      eventPattern: {
        source: [`lender.${NotificationHubTestStack.TestLenderId}`],
      },
    });
    sourceRule.addTarget(
      new eventsTargets.LambdaFunction(
        this.testFunctions[NotificationHubTestStack.BusObserverFunctionId]
      )
    );
  }
}

Supplying one or more values for testFunctionIds causes the IntegrationTestStack base class to create Lambda functions. These functions record all events received as observations in a DynamoDB table (also created automatically). This table can then be queried to verify that the system is working as expected.

My intention was to hook such a function up as the target for an EventBridge rule. To do this, I first created a Rule on the SUT event bus and specified a pattern based on a specific source value. I then added a Target to the rule, pointing at the test Lambda function created by IntegrationTestStack. If all went to plan, the function would write the event 'as-is' to the DynamoDB table for verification. All that was left was to deploy the Stack and put some events on the bus.

Driving the event bus

Deploying the Stack proved straightforward. The problems started when I created a client class to put events on the bus. As mentioned earlier, for other types of resources I had used the AWS Resource Groups Tagging API to locate resources by tags. However, when I tried the same approach with an event bus, there was no sign of it in the returned resources.

¯\_(ツ)_/¯

I looked at the event bus in the AWS console, but still no joy. It looks like currently (26 August 2021), there is no way to tag an event bus. This was a bit of a kick in the teeth, but I had a fallback plan. This involved using the EventBridge listEventBuses method, then using pattern matching on the name to resolve to an ARN. This wouldn't be as robust as using tags, but would have to suffice until support was added for event bus tags.
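As a rough illustration of that fallback, the matching could look something like the sketch below. The resolveEventBusArn function is hypothetical, as is the assumption that the generated bus name contains both the test stack id and the construct id. The input mirrors the Name and Arn fields returned for each bus by listEventBuses, and the sample names and ARNs are made up.

```typescript
// Mirrors the Name and Arn fields of the buses returned by listEventBuses.
interface EventBusSummary {
  Name?: string;
  Arn?: string;
}

// Hypothetical resolver: assumes the bus name contains both ids.
function resolveEventBusArn(
  eventBuses: EventBusSummary[],
  testStackId: string,
  eventBusId: string
): string | undefined {
  const matches = eventBuses.filter(
    (bus) =>
      bus.Name !== undefined &&
      bus.Name.indexOf(testStackId) !== -1 &&
      bus.Name.indexOf(eventBusId) !== -1
  );
  // An ambiguous match is treated as no match rather than guessing.
  return matches.length === 1 ? matches[0].Arn : undefined;
}

// Made-up sample data for illustration.
const sampleBuses: EventBusSummary[] = [
  { Name: 'default', Arn: 'arn:aws:events:eu-west-2:000000000000:event-bus/default' },
  {
    Name: 'NotificationHubTestStackSUTNotificationHubEventBus1A2B3C4D',
    Arn: 'arn:aws:events:eu-west-2:000000000000:event-bus/example',
  },
];

console.log(
  resolveEventBusArn(sampleBuses, 'NotificationHubTestStack', 'NotificationHubEventBus')
);
```

Returning undefined for an ambiguous match is a deliberate choice here, as putting test events on the wrong bus would be worse than failing fast.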

Armed with the knowledge of how to resolve the ARN for an event bus, I extended the Serverless Testing Toolkit IntegrationTestClient class. I added a new getEventBridgeTestClient method that returned an EventBridgeTestClient for a given id. The implementation of EventBridgeTestClient can be seen below.

export default class EventBridgeTestClient {
  readonly eventBridge: AWS.EventBridge;

  constructor(region: string, public eventBusArn: string) {
    this.eventBridge = new AWS.EventBridge({ region });
  }

  async putEventAsync(entry: PutEventsRequestEntry): Promise<PutEventsResponse> {
    const response = await this.putEventsAsync([entry]);
    return response;
  }

  async putEventsAsync(entries: PutEventsRequestEntry[]): Promise<PutEventsResponse> {
    // Target every entry at the event bus under test
    const request: PutEventsRequest = {
      Entries: entries.map((e) => ({
        ...e,
        EventBusName: this.eventBusArn,
      })),
    };
    const response = await this.eventBridge.putEvents(request).promise();
    return response;
  }
}

I now had all the pieces in place to be able to write my test, but before that I needed to implement some setup. This setup involved creating an IntegrationTestClient instance for the test stack, initialising it, and using it to obtain an EventBridgeTestClient instance for interacting with the event bus. Before each test, there also needs to be a call to initialiseTestAsync to clear down results from previous tests.

const testClient = new IntegrationTestClient({  testStackId: NotificationHubTestStack.Id,});let notificationHubEventBus: EventBridgeTestClient;before(async () => {  await testClient.initialiseClientAsync();  notificationHubEventBus = testClient.getEventBridgeTestClient(    NotificationHub.NotificationHubEventBusId  );});beforeEach(async () => {  await testClient.initialiseTestAsync();});

For the test, I used my extension of the Arrange, Act, Assert approach as described in the Serverless integration testing with the AWS CDK series. This involves an Await step that polls the observations recorded by test functions. In this case, the test simply had to look for one or more such observations and verify that the detail was as originally specified.
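The Await step boils down to polling until a condition is met or a timeout expires. The following is a minimal sketch of that idea, not the toolkit's implementation: pollUntilAsync, its parameters, and the default timings are all assumptions, and the observations come from a supplied function, where the toolkit reads them from its DynamoDB table.

```typescript
interface PollResult<T> {
  observations: T[];
  timedOut: boolean;
}

// Hypothetical polling helper: fetch, test, wait, repeat.
async function pollUntilAsync<T>(
  getObservations: () => Promise<T[]>,
  until: (observations: T[]) => Promise<boolean>,
  timeoutMs = 1000,
  intervalMs = 50
): Promise<PollResult<T>> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const observations = await getObservations();
    if (await until(observations)) {
      return { observations, timedOut: false };
    }
    if (Date.now() >= deadline) {
      return { observations, timedOut: true };
    }
    // Wait before polling again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: the first two polls see nothing, the third sees one observation.
let pollCount = 0;
const eventualResult = pollUntilAsync(
  async (): Promise<string[]> => {
    pollCount += 1;
    return pollCount < 3 ? [] : ['event'];
  },
  async (o) => o.length > 0,
  1000,
  1
);

// Usage: nothing ever arrives, so the poll times out.
const timedOutResult = pollUntilAsync(
  async (): Promise<string[]> => [],
  async (o) => o.length > 0,
  20,
  5
);
```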

it('handles events published directly to event bus', async () => {
  // Arrange
  const caseEvent: CaseStatusUpdatedEvent = {
    eventType: CaseEventType.CaseStatusUpdated,
    lenderId: NotificationHubTestStack.TestLenderId,
    caseId: 'C1234',
  };
  const eventRequest: PutEventsRequestEntry = {
    Source: `lender.${caseEvent.lenderId}`,
    DetailType: caseEvent.eventType,
    Detail: JSON.stringify(caseEvent),
  };

  // Act
  await notificationHubEventBus.putEventAsync(eventRequest);

  // Await
  const { observations, timedOut } = await testClient.pollTestAsync({
    until: async (o) => o.length > 0,
  });

  // Assert
  expect(timedOut, 'timedOut').to.be.false;
  const busEvent = observations[0].data;
  expect(busEvent.detail).to.deep.equal(caseEvent);
});

To my genuine surprise, the test passed first time. This doesn't make for much of an anecdote, but does show that EventBridge is quite straightforward. With the simple case up and running, my thoughts turned to testing more complicated event routing.

Becoming an event bus route master

I decided to carry on with the test function approach and define some more rules on the event bus. First, I defined a pattern to match the lenderId value in the detail property.

static readonly TestEventPattern = {  source: ['test.event-pattern'],};static readonly EqualTestEventPattern = {  ...NotificationHubTestStack.TestEventPattern,  detail: {    lenderId: 'LenderA',  },};

With the pattern defined, I added a rule and targeted the test function.

this.addEventBridgeRuleTargetFunction(  this.addEventBridgePatternRule(    'EqualRule',    sut.eventBus,    NotificationHubTestStack.EqualTestEventPattern  ),  NotificationHubTestStack.BusObserverFunctionId,  events.RuleTargetInput.fromText('EQUAL'));

For convenience, I added a couple of methods to IntegrationTestStack, one to create a pattern-based rule and one to wire up a test function to observe the events. In order to differentiate the observed events, I hardcoded the data sent to the test function for the rule to be EQUAL.

All looked good until I tried to deploy.

0/3 |18:01:02 | UPDATE_FAILED        | AWS::Events::Rule       | EqualRule (EqualRuleA4D32458) Event pattern is not valid. Reason: "lenderId" must be an object or an array at [Source: (String)"{"source":["test.event-pattern"],"detail":{"lenderId":"LenderA"}}"; line: 1, column: 56] (Service: AmazonCloudWatchEvents; Status Code: 400; Error Code: InvalidEventPatternException; Request ID: 7ec85bb1-59fe-421b-92d5-f8ed1827d4fc; Proxy: null)

I was impressed at the detail of the error message, clearly pinpointing the error of my ways. However, I felt that deployment was a little late in the day to find out about such an error. I would rather have caught such a thing earlier. Thankfully, it turns out there is a way to do this that I will go into later.

At the time, however, I amended the pattern as follows and that allowed the deployment to succeed.

static readonly EqualTestEventPattern = {  ...NotificationHubTestStack.TestEventPattern,  detail: {    lenderId: ['LenderA'], // <-- Array!  },};

With this in place, I created a unit test that followed the pattern of the previous unit test. In the test, I put through a set of events and asserted that the observations were as expected for each. I extended this to cover all the pattern options as described by the AWS documentation. The resulting test can be found in the GitHub repo.

One thing I noticed during this experimentation was that it wasn't possible to do prefix matching on the source of an event. The recommendation is for Java package-style naming, e.g. com.mycompany.myapp, but you cannot match on all com.mycompany entries.

I was now fairly confident in using CDK to create Rule instances and then to wire them up to a Lambda function Target. I was not overly happy that it required a deployment to validate and test the pattern matching. However, I had noticed that the AWS SDK had a testEventPattern method that promised to do both without deployment.

Early warning testing with `testEventPattern`

Consulting the documentation, the theory seemed simple enough. Pass in an event and an event pattern and testEventPattern would validate the pattern and return whether it matched. With this in mind, I created the following test.

// Arrange
const caseEvent = {
  eventType: CaseEventType.CaseStatusUpdated,
  lenderId: 'LenderA',
  caseId: 'C1234',
};
const putEventsRequest: PutEventsRequestEntry = {
  Source: `test.event-pattern`,
  DetailType: caseEvent.eventType,
  Detail: JSON.stringify(caseEvent),
};

// Act
const testRequest: TestEventPatternRequest = {
  Event: JSON.stringify(putEventsRequest),
  EventPattern: JSON.stringify(NotificationHubTestStack.EqualTestEventPattern),
};
const isEqualMatch = (
  await IntegrationTestClient.eventBridge.testEventPattern(testRequest).promise()
).Result;

// Assert
expect(isEqualMatch).to.be.true;

However, when running the test, I got the following error:

ValidationException: Parameter Event is not valid.

Drilling into the documentation for TestEventPatternRequest, I found the following for Event:

The event, in JSON format, to test against the event pattern. The JSON must follow the format specified in Amazon Web Services Events, and the following fields are mandatory: id, account, source, time, region, resources, detail-type

So it seemed that testEventPattern requires a fully-formed event to work. At this point, I decided that I would create a method isEventPatternMatchAsync on IntegrationTestClient to encapsulate this:

static async isEventPatternMatchAsync({
  eventPattern,
  putEventsRequest,
}: {
  eventPattern: cdkEvents.EventPattern;
  putEventsRequest: PutEventsRequestEntry;
}): Promise<boolean> {
  //
  const mappedEvent: any = {
    id: '6a7e8feb-b491-4cf7-a9f1-bf3703467718',
    'detail-type': putEventsRequest.DetailType ? putEventsRequest.DetailType : 'detail-type',
    source: putEventsRequest.Source ? putEventsRequest.Source : 'source',
    account: '0000000000',
    time: putEventsRequest.Time ? putEventsRequest.Time : '2017-12-22T18:43:48Z',
    region: 'us-west-1',
    resources: putEventsRequest.Resources ? putEventsRequest.Resources : [],
    detail: putEventsRequest.Detail ? JSON.parse(putEventsRequest.Detail) : undefined,
  };

  const request: TestEventPatternRequest = {
    Event: JSON.stringify(mappedEvent),
    EventPattern: JSON.stringify(eventPattern),
  };

  const response = await this.eventBridge.testEventPattern(request).promise();
  return response.Result ?? false;
}

I re-ran the tests and got a number of failures. All the failed tests involved matching on the detail type. After comparing the deployed patterns with the JSON for the CDK patterns, I noticed that the deployed patterns had detail-type as a property. With this in mind, I amended the isEventPatternMatchAsync method to map the event pattern as follows.

const mappedEventPattern = {  ...eventPattern,  'detail-type': eventPattern.detailType,  detailType: undefined,};const request: TestEventPatternRequest = {  Event: JSON.stringify(mappedEvent),  EventPattern: JSON.stringify(mappedEventPattern),};

This did the trick. Now I had a way of validating and testing event patterns before deployment and then using the exact same patterns in my CDK constructs. For future projects, this should make development much easier. The resulting test can be found in the GitHub repo.

Conclusion

I found EventBridge fairly straightforward to use, certainly within the limitations of my experiment. There were some oddities, such as no tags and differing structures, but on the whole I found it nice to use.

If you want, please check out my Serverless Testing Toolkit. You can either use it as a package, or you can take whatever code you like from the repo. All feedback is welcomed.

Better logging through technology

Andy Blackledge — Tue, 17 Aug 2021 18:42:23 GMT

Logging is a subject close to my heart. Good quality logging will get you out of a tight spot, poor quality logging will have you pulling your hair out (metaphorically in my case 😊). Here, I try to share the benefit of my experience and thinking in this area.

TL;DR

See through the eyes of support
Don't burn the evidence
Context is everything
Keep the noise down

`ERROR`

My first real foray into logging came when developing a Visual Basic WinForms application. In those days, an unhandled error in a form event, e.g. from clicking a button, would result in the whole application unceremoniously crashing in a flaming heap. Not a great user experience, and no wreckage to sift through to prevent it happening again.

To counter this, I created a Visual Studio add-in that would add some standard error handling to each event and some standard logging too. The result was a much more robust application, and when the inevitable errors occurred, we could ask the user to send us the resulting log file. This became the key for us to understand and address issues, and logging became an integral part of our application.

This experience was foundational in forming my thinking about ERROR level logging. What I believe you need to do is see the log through the eyes of support, where all you have is the ERROR level output to guide you. Get this logging wrong and you are up a certain creek without a certain paddle.

So, what do I think you need to concentrate on?

Log all aspects of the error
- In C#, this is the message, the stack trace, and any inner exception details too.
Log all contextual information that you have available
- At the point of an error, you are in a unique position to log the state at the time it occurred. This opportunity shouldn't be squandered, and all salient information should be logged (but no credit card details folks!).
Ensure that the resulting logging is meaningful
- If at all possible, force an error to occur and view the resulting logging. Put yourself in the position of production support and ask yourself if this information would lead you to the cause and the solution. If not, then think how it could be improved.
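To make the three points above concrete, here is a minimal TypeScript sketch of capturing everything about an error in one loggable object. The `describeError` helper and the `ErrorReport` shape are hypothetical names made up for illustration, and the sketch assumes inner errors are exposed through a `cause` property.

```typescript
// Hypothetical shape for everything we want support to see.
interface ErrorReport {
  message: string;
  stack: string | undefined;
  causes: string[];
  context: Record<string, unknown>;
}

// Hypothetical helper: flattens an error, its chain of inner errors, and any
// caller-supplied context into a single loggable object.
function describeError(error: unknown, context: Record<string, unknown> = {}): ErrorReport {
  const causes: string[] = [];
  let inner = (error as { cause?: unknown } | null | undefined)?.cause;
  let depth = 0;
  // Walk the chain of inner errors, with a depth limit to guard against cycles.
  while (inner !== undefined && inner !== null && depth < 10) {
    causes.push(inner instanceof Error ? inner.message : String(inner));
    inner = (inner as { cause?: unknown }).cause;
    depth += 1;
  }
  return {
    message: error instanceof Error ? error.message : String(error),
    stack: error instanceof Error ? error.stack : undefined,
    causes,
    context, // salient state only; never secrets or card details
  };
}
```

Forcing an error and reading the resulting report is then a quick way to check whether it would lead support to the cause.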

Not strictly to do with logging, but related to it, is making sure that stack traces are preserved. All too often, I have seen this in C# code:

public void MyDodgyMethod() {
  try {
    DoSomethingDodgy();
  }
  catch (Exception ex) {
    LogError(ex.Message);
    throw ex; // DON'T DO THIS PLEEEEEEEEEASE!
  }
}

This has the effect of throwing away the stack trace and starting a new, misleading one. I would urge you to know your tools and know how to get the most out of them in error situations. In C#, it can be a good thing to throw a new exception with local contextual data but include the current exception as an inner one. However, care must be taken not to do more harm than good. In short, don't burn the evidence.
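The "wrap, don't replace" idea carries over to other languages. As a rough TypeScript sketch, with `ContextualError` and `loadOrder` being hypothetical names invented for this example, the new error carries local context while the original error, and therefore its stack trace, travels along untouched:

```typescript
// Hypothetical wrapper type: adds local context but keeps the original error.
class ContextualError extends Error {
  constructor(message: string, public readonly inner: unknown) {
    super(message);
    this.name = 'ContextualError';
  }
}

// Hypothetical caller: `read` stands in for any operation that might throw.
function loadOrder(orderId: string, read: (id: string) => string): string {
  try {
    return read(orderId);
  } catch (error) {
    // Wrap rather than replace: the original error survives as `inner`,
    // so its own message and stack are still there for whoever logs it.
    throw new ContextualError(`Failed to load order ${orderId}`, error);
  }
}
```

Whether the extra layer helps depends on whether the added message says something the inner error does not.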

`WARN`

I have an admission to make in that I haven't used the WARN level much at all in my time. I guess you could use them to indicate some non-functional degradation, or an error condition that was recoverable. However, much like compiler warnings, I suspect these just get ignored and end up becoming noise.

`INFO`

My approach to INFO level logging is to use it to give a picture of execution path through an application. Points at which I would look to add INFO level logging would be at the boundary of services and at key decision points within those services.

I have worked on systems that have been made up of many separate services, some of which I have been responsible for. These services were subject to orchestration and, inevitably, failures. To cover my own behind, I was rigorous in adding INFO level logging to all the boundary points to those services I developed. I developed a set of standard tools to log the inbound call and the outbound response, along with ERROR level logging for exceptions.
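A boundary logger of this kind can be quite small. The following TypeScript sketch is not the tooling I used, just an illustration under assumed names (`withBoundaryLogging`, `LogEntry`), with the correlation id passed in by the caller:

```typescript
type LogEntry = { level: 'INFO' | 'ERROR'; message: string; correlationId: string };

// Hypothetical wrapper: logs the inbound call and the outcome of any handler.
function withBoundaryLogging<TRequest, TResponse>(
  serviceName: string,
  handler: (request: TRequest) => TResponse,
  log: (entry: LogEntry) => void
): (request: TRequest, correlationId: string) => TResponse {
  return (request, correlationId) => {
    log({ level: 'INFO', message: `${serviceName}: request received`, correlationId });
    try {
      const response = handler(request);
      log({ level: 'INFO', message: `${serviceName}: response returned`, correlationId });
      return response;
    } catch (error) {
      const detail = error instanceof Error ? error.message : String(error);
      log({ level: 'ERROR', message: `${serviceName}: call failed: ${detail}`, correlationId });
      throw error; // rethrow the same error object so nothing is lost
    }
  };
}
```

Because the wrapper owns the logging, every service that uses it traces its boundary in the same way.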

The result of this approach made it straightforward for me to identify whether a failure had occurred within a particular service. Ideal for passing the buck 😉 Even when the failure was within the service, I knew I had meaningful information to work with. So, it wasn't all bad.

When I developed a process orchestration framework, I had the framework itself log out INFO level entries to automatically trace the flow. This freed up the users of the framework to concentrate on the functionality, knowing that the tracing aspect was already done. If possible, I would heartily encourage such an aspect-oriented approach as it brings great dividends in the consistency and quality of logging.

However, none of this tracing is much use without context. By context, I mean something to identify what the tracing statement was related to. It could be a correlation id, an id of a domain entity, or a thread id, but without any of these the statement loses much of its meaning and usefulness.

As systems get more distributed, this gets more important. There are plenty of tools out there to help, e.g. AWS X-Ray, but I can't say I have used them yet. The key, whatever you use, is to make your logs meaningful. Again, look at them through the eyes of support, and think how they could be improved.

`DEBUG`

For me, DEBUG level logging is there to be switched on when you need to see the nitty-gritty of what is going on in an application. However, all too often, I have seen DEBUG level logging be the dumping ground for what I call "I'm here!" logging. For example:

2009-03-18 19:56:28,901 DEBUG - Process lists generated.

What generated the process lists? Is it noteworthy that the process lists have been generated? What does the process list contain? There is the other issue that if there are multiple calls executing simultaneously then how do I know which call this relates to?

2009-03-18 17:06:35,566 DEBUG - ProcessEngine: d865e14b-971b-4b2d-b138-d6217afdd41d : Decision 1.0 - ExecuteRules_Decision : C8046691354, Overall rule results [OverallRuleStatus=Decline]

This tells us the component that did the logging (ProcessEngine), an id that can be used to trace a single call (d865e14b- etc.), the module that was executing (Decision 1.0 - ExecuteRules_Decision), the core entity that was being processed (C8046691354) and the result of the module's processing (Overall rule results [OverallRuleStatus=Decline]).

To paraphrase what I was told as a child, "If you haven't anything useful to say, don't say anything". There should be a clear purpose to all log entries at all levels. Of course, there is a need for quick DEBUG entries as part of development. However, you should be rigorous enough to ensure that they are removed before committing the code. The more noise that the logs receive, the less useful they become.
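One way to get entries like this consistently is to make the context a required argument. The TypeScript sketch below uses hypothetical names (`TraceContext`, `formatTrace`) and simply mirrors the layout of the entry above; it does not format the timestamp or level, which a logging framework would normally add.

```typescript
// Hypothetical context that every trace statement must supply.
interface TraceContext {
  component: string;
  correlationId: string;
  module: string;
  entityId: string;
}

// Builds the message part of the entry, so no statement can omit its context.
function formatTrace(context: TraceContext, detail: string): string {
  return `${context.component}: ${context.correlationId} : ${context.module} : ${context.entityId}, ${detail}`;
}
```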

One example where DEBUG level logging proved invaluable in my career, was in the development and support of a rules engine. The rules engine allowed users to define rules using a graphical, non-code way. The engine would then evaluate these rules using an interpreter. This meant you could not debug the rules in a standard way using an IDE.

As these rules could be complex, it was imperative to be able to understand what was going on 'under the hood'. The solution was to build logging into the evaluation engine, so each part of the rule could be logged. Each 'AND', 'Greater than', and so on. Where necessary, this verbose logging could be switched on and the results pored over to ascertain exactly how the output from the rule was reached.
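To give a flavour of the idea, here is a deliberately tiny TypeScript sketch of an interpreter that traces each part of a rule as it is evaluated. It is not the engine described above; the `Rule` type, the `evaluate` function, and the trace format are all invented for illustration.

```typescript
// A toy rule model with just two node types.
type Rule =
  | { kind: 'greaterThan'; field: string; value: number }
  | { kind: 'and'; rules: Rule[] };

// Evaluates a rule against some facts, appending one trace line per node.
function evaluate(rule: Rule, facts: Record<string, number>, trace: string[]): boolean {
  if (rule.kind === 'greaterThan') {
    const actual = facts[rule.field];
    const result = actual !== undefined && actual > rule.value;
    trace.push(`GreaterThan(${rule.field}=${actual} > ${rule.value}) => ${result}`);
    return result;
  }
  // Evaluate every child (no short-circuit) so the trace shows the whole rule.
  const results = rule.rules.map((child) => evaluate(child, facts, trace));
  const result = results.every(Boolean);
  trace.push(`And(${results.join(', ')}) => ${result}`);
  return result;
}
```

In a real engine the trace would go to the DEBUG level of the logger, so it costs little when switched off.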

`ALL`

IMHO logging should be treated as an integral part of any system you are developing. Ad-hoc logging, or logging as an afterthought, will probably not be half as useful as a support-focussed, context-laden, consistent approach.

I should also mention structured logging here, as it is the way forward where it is available. Better people than me have written about it, so here is a guide to What Is Structured Logging and Why Developers Need It. Log aggregation deserves a mention as well. Having all your logs together in one place makes all the difference. Coupled with structured logging, the results can be transformative.

Happy and fruitful logging!

Lessons from 15 years living with my code

Andy Blackledge — Thu, 12 Aug 2021 16:40:15 GMT

At the time of writing, I am about to change jobs after over 15 years in my current role. This, I suspect, is quite unusual for the IT business. Given this, here are some thoughts on my experience of writing software and then having to be responsible for it for a long, long time.

Log, log, and log some more

One lesson from supporting the same software for over a decade is that quality logging will be your friend. In my time, I inherited more than one area of tooling where logging was almost non-existent. The consequences of which were that delivery using that tooling was difficult and error prone. When redeveloping these areas, I made a point of ensuring that the tools explained what they were doing at varying levels of detail. By using logging frameworks, these levels could be switched on or off. For production, there would be minimal logging for reasons of speed. For development, full disclosure to give a detailed view into why a certain result occurred. The key was to treat the logging as an integral part of the overall product and not as an afterthought.
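The switchable levels come for free with most logging frameworks, but the mechanism is simple enough to sketch. This TypeScript example is an illustration only; `createLogger` and the level ordering are assumptions of the sketch, not any particular framework's API.

```typescript
// Levels in increasing order of severity.
const levels = ['DEBUG', 'INFO', 'WARN', 'ERROR'] as const;
type Level = (typeof levels)[number];

// Returns a log function that drops anything below the configured minimum level.
function createLogger(minimum: Level, sink: (line: string) => void) {
  const threshold = levels.indexOf(minimum);
  return (level: Level, message: string): void => {
    if (levels.indexOf(level) >= threshold) {
      sink(`${level} - ${message}`);
    }
  };
}
```

A production configuration might pass 'ERROR' as the minimum, while development passes 'DEBUG' for full disclosure.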

Where individual services were developed, I paid attention to ensuring that logging occurred at the service boundary. That all inbound calls could be logged along with the overall success or failure of the call. I developed standard components to make this as simple and consistent as possible. The result was that it was always straightforward to see if an error was emanating from a service that I was responsible for and where that error had occurred. I was always surprised to see how many services had no such blame deflection capability.

Pay attention to errors

Paying attention to errors most definitely paid dividends over the years. My experience is that people rarely come round to tell you that your component is working flawlessly, they are much more likely to come round when something has gone wrong. By paying attention to errors, I am talking about both throwing meaningful errors and handling them effectively too.

For me, a meaningful error is one that has a well-worded message and contains as much key information as is feasible at the point it is raised. For example, 'Unexpected input' or 'Error 4235' are not what I would consider well-worded. If viewed in isolation, which errors often can be, they are meaningless. The mindset I would suggest, is to put yourself in the position where all you have to work with is the message. Another tip here is to always have static text at the start of the message, to make it easier to locate in a codebase.
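As a small illustration of that last tip, the following TypeScript sketch builds a message from static text followed by the key details. The `buildErrorMessage` name and the `key=value` layout are just one possible convention, assumed for this example.

```typescript
// Static text first (easy to search for in the codebase), then the details.
function buildErrorMessage(staticText: string, details: Record<string, string | number>): string {
  const pairs = Object.keys(details).map((key) => `${key}=${details[key]}`);
  return pairs.length > 0 ? `${staticText}: ${pairs.join(', ')}` : staticText;
}
```

Searching the codebase for the static text then leads straight to the line that raised the error.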

When it comes to handling errors, my experience is that unless you can add value then you do nothing. Too often I have seen errors being handled such that vital information is lost. A classic C# mistake is to throw ex and throw away the stack trace in the process. You need to understand how errors bubble up in your chosen tech. My approach was to test that any errors raised manifested themselves in a meaningful way, either in a response or a log. The result was that when these occurred in the wild, and I was dragged in to help, I had clear clues to guide me. Don't leave yourself in the dark.

BYOM - Bring Your Own Model

One of the projects I was tasked with was to create a process orchestration framework with a visual editor. The framework was to use Windows Workflow Foundation v1 and the editor to use Visio 2007 as an embedded control. Workflows in Workflow Foundation were defined by a XAML file, so the task could be solved by building a UI that walked the nodes in the Visio diagram object model and outputted the appropriate XAML for the runtime. However, I decided to bring my own process model to the party.

What were my reasons for doing so?

To isolate the editor from runtime. I didn't want the Windows Form editor to have references to the Workflow Foundation components. With an intermediate model this could be avoided.
Workflow Foundation XAML V1 was a pure hierarchical model, so required some complex generation to model non-hierarchical processes.
Workflow Foundation XAML was V1 and so was a prime candidate to change (and it did!)

What we found when we started to use it in anger was that Workflow Foundation V1 was slow and that it had a threading model not compatible with our components. The solution? Write our own process runtime using, you guessed it, our own model.

Owning your model gives you a degree of independence. There is a good argument that you ain't going to need it, that it adds a level of unnecessary indirection. You will need to balance this against the advantages. I found that having your own model makes you really think about what you need in it and what you don't. The upshot being a deeper understanding of the domain you are dealing with.

BTW Eventually we got rid of Visio too, replacing it with - no surprise - our own lightweight, custom-for-purpose model.

KISS - Keep It Simple Stupid

For one project, we had to build a back-end service that received files and queued them up to be spooled to the printer. I was under pressure from a senior member of staff to use a relational database as an index for the jobs. In the end I went with the simplest solution I could think of, and made it all file-based. The result, simple to deploy and maintain. Need to retry a job? Drag it from the failure folder to the input folder. Need to test the printing? Drop a file into the input folder. Simplicity really does help in the long run, especially if you have to support it.

This goes for algorithms, regular expressions, and data structures too. If you are going to be coming back to software you haven't seen for 5 years, you want to make it easy for yourself (not to mention others).

Embrace testing

In my 15 years, one of the changes I made was to understand and embrace testing. This was primarily unit testing, with attention being paid to how to create meaningful tests that were neither brittle nor flaky. Roy Osherove's book The Art of Unit Testing was a particular inspiration, making me think deeply about how to do this. Some of the tests I wrote were 'classic' unit tests and some were snapshot tests (also known as approval tests).

For the core infrastructure that I built and supported, I paid close attention to unit testing possible error conditions and ensuring that they were reported in a meaningful way. This could actually form the bulk of the testing, as things have many more ways to fail than succeed. However, this thoroughness paid dividends in support, allowing me to quickly understand the underlying cause of issues.

'A bug is a test that has yet to be written' quipped someone wiser than myself. I took this on board, and tried to be rigorous in creating a test to recreate an issue before diving into the code to fix it. Sometimes diving straight in is tempting, as you get that feeling that you know exactly what needs to be fixed. However, sometimes you are wrong. Yes, even you. I found that taking the time to fashion a test recreating the issue often yields insights that you would never get from just patching the code.

Attention to clean code really does make a difference

No surprises here, but thoughtful naming, reasonable method size, and all the other factors that result in clean code make a huge difference in the long run. I was always happy to return to my own code, not because it was cleverer than anyone else's, but because I always paid attention to those principles as best I could. I tried to never let my standards drop and I was always very grateful for that on my return. Code is read many, many more times than it is written. Especially over the course of 15 years.

Step Function integration testing with CDK

Andy Blackledge — Wed, 21 Jul 2021 19:15:44 GMT

In the previous parts in the series, I explored how we can leverage CDK to construct serverless applications out of components that can be individually deployed and tested in the cloud. In this part, I look at how we can treat step functions in a similar way, and how we can add a mocking capability to the testing components to make this easier.

AWS describe step functions as follows:

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.

The State Machine

Let us get straight to it and look at the workflow that we wish to test, as rendered by the brand new Workflow Studio.

This part of the application is concerned with combining affordability configurations, e.g. use 3 times basic salary, with affordability scenarios, e.g. basic salary of 40K. It then calculates, or recalculates, the result.

One happy path goes something like this:

A file change event triggers the step function.
The ReadInputFileHeader lambda loads the file from S3 and extracts the file header.
The SwitchOnFileType choice looks at the file header type and sees Configuration.
The ReadScenarioHeaders lambda loads all Scenario headers.
The CombineHeaders lambda outputs an array of all Configuration and Scenario tuples.
The CalculateResults map iterates over the array of Configuration and Scenario tuples
For each Configuration and Scenario tuple, the CalculateResult lambda calculates the result and stores it in S3.

In addition, there are two unhappy paths. One where the initial read fails (after a maximum of 2 retries), and another where the file type is neither Configuration nor Scenario.

The Testing Challenge

In my experience, one of the challenges of step function development is getting the input and output processing right. The following diagram is from the AWS documentation and shows that the process has quite a few moving parts.

In fact, AWS have a tool just for debugging this process called the Data flow simulator.

IMHO, having to provide such a tool indicates that the approach taken to state management may not be as intuitive as it could have been.

Given this, one challenge is how to test these mappings. That is, how do you test that the data flows through the step function as expected?

Another challenge is how to test another key feature of step functions, error handling. Step functions are great in that you can define retry policies and error handlers, thereby making your processing more robust. The trick is how do you test that when these conditions arise, the step function behaves as you expect.

The Testing Approach

The approach I am going to take here is to concentrate on testing the flow and mappings of the step function. I am going to assume that the lambda functions adhere to known contracts and wire the step function to mock and observer versions of those functions. These versions will be provided by the testing infrastructure. The resulting test architecture looks like the following.

As you will see further on, for each test, the mock functions are given an array of expectations and responses. These are asserted and replayed respectively. In addition, a mock function can be made to error and this allows us to test the error handling scenarios. On the other hand, observer functions simply write the received event to a table for later assertion by the unit test.
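To illustrate the replay idea without any AWS plumbing, here is an in-memory TypeScript sketch. The real mocks and observers are lambda functions backed by a table, so `createMock`, `createObserver`, and `MockExchange` are simplified stand-ins invented for this example.

```typescript
// One entry per expected call: either a response to replay or an error to throw.
type MockExchange = { response?: unknown; error?: string };

// In-memory stand-in for a mock function: replays the exchanges in order.
function createMock(exchanges: MockExchange[]): (event: unknown) => unknown {
  let callCount = 0;
  return () => {
    const exchange = exchanges[callCount];
    callCount += 1;
    if (exchange === undefined) throw new Error(`Unexpected call number ${callCount}`);
    if (exchange.error !== undefined) throw new Error(exchange.error);
    return exchange.response;
  };
}

// In-memory stand-in for an observer: records each event for later assertion.
function createObserver(observations: unknown[]): (event: unknown) => void {
  return (event) => {
    observations.push(event);
  };
}
```

The mock controls what the step function sees from its dependencies, while the observer captures what the step function sends out.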

The Testing Constructs

The System Under Test

The first construct to create is the step function. Since we want the functions and the topic to be passed in for testing purposes, we need to follow the CDK pattern and create a Props interface along with our subclass of StateMachine.

export interface ResultCalculatorStateMachineProps
  extends Omit<sfn.StateMachineProps, 'definition'> {
  fileHeaderReaderFunction: lambda.IFunction;
  fileHeaderIndexReaderFunction: lambda.IFunction;
  combineHeadersFunction: lambda.IFunction;
  calculateResultFunction: lambda.IFunction;
  errorTopic: sns.ITopic;
}

export default class ResultCalculatorStateMachine extends sfn.StateMachine {
  constructor(scope: cdk.Construct, id: string, props: ResultCalculatorStateMachineProps) {
    super(scope, id, {
      ...props,
      definition:
        StateMachineBuilder.new()
          // ...snip...
    });
  }
}

The code for the full definition can be found on the GitHub repo here.

Note that I use my own State Machine Builder npm package. I don't like the AWS fluent interface, so I built my own using the builder pattern (🎺 <- my own trumpet being blown 😊). I also blogged about it here.

The Testing Stack

Now we have our step function, we can host it in a testing stack. In the last part, we started using a base stack with reusable functionality, like the ability to tag resources for easy location and the ability to deploy observer functions. In this part, we continue this process, taking advantage of functionality to deploy mocks as well as observers.

We start by sub-classing IntegrationTestStack, to get access to all the functionality within it.

export default class ResultCalculatorStateMachineTestStack extends IntegrationTestStack {

Next we declare a set of static values that will be used by the unit tests to locate resources, define mock responses, and retrieve observations.

static readonly ResourceTagKey = 'ResultCalculatorStateMachineTestStack';
static readonly StateMachineId = 'ResultCalculatorStateMachine';
static readonly FileHeaderReaderMockId = 'FileHeaderReaderMock';
static readonly FileHeaderIndexReaderMockId = 'FileHeaderIndexReaderMock';
static readonly CombineHeadersMockId = 'CombineHeadersMock';
static readonly ResultCalculatorObserverId = 'ResultCalculatorObserver';
static readonly ErrorTopicObserverId = 'ErrorTopicObserver';

The observers and mocks are defined simply by specifying values for observerFunctionIds and mockFunctionIds in the stack properties.

super(scope, id, {
  testResourceTagKey: ResultCalculatorStateMachineTestStack.ResourceTagKey,
  observerFunctionIds: [
    ResultCalculatorStateMachineTestStack.ResultCalculatorObserverId,
    ResultCalculatorStateMachineTestStack.ErrorTopicObserverId,
  ],
  mockFunctionIds: [
    ResultCalculatorStateMachineTestStack.FileHeaderReaderMockId,
    ResultCalculatorStateMachineTestStack.FileHeaderIndexReaderMockId,
    ResultCalculatorStateMachineTestStack.CombineHeadersMockId,
  ],
});

Behind the scenes, IntegrationTestStack will deploy observer and mock functions configured with the ids specified. IntegrationTestStack will also deploy a DynamoDB table that the functions rely upon. This table holds the mock responses, the mock states, and the observations.

The step function also requires a topic to which to publish errors.

const testErrorTopic = new sns.Topic(this, 'TestErrorTopic');
testErrorTopic.addSubscription(
  new snsSubs.LambdaSubscription(
    this.observerFunctions[ResultCalculatorStateMachineTestStack.ErrorTopicObserverId]
  )
);

Here we are wiring up an observer function to subscribe to the topic, so we can assert if error messages were published. The IntegrationTestStack exposes the generated observer functions via the observerFunctions property. This property is then indexed by the observer id to obtain a reference to the function.

Finally we define the system under test, our state machine construct.

const sut = new ResultCalculatorStateMachine(
  this,
  ResultCalculatorStateMachineTestStack.StateMachineId,
  {
    fileHeaderReaderFunction:
      this.mockFunctions[ResultCalculatorStateMachineTestStack.FileHeaderReaderMockId],
    fileHeaderIndexReaderFunction:
      this.mockFunctions[ResultCalculatorStateMachineTestStack.FileHeaderIndexReaderMockId],
    combineHeadersFunction:
      this.mockFunctions[ResultCalculatorStateMachineTestStack.CombineHeadersMockId],
    calculateResultFunction:
      this.observerFunctions[ResultCalculatorStateMachineTestStack.ResultCalculatorObserverId],
    errorTopic: testErrorTopic,
  }
);

this.addTestResourceTag(sut, ResultCalculatorStateMachineTestStack.StateMachineId);

As with observer functions, the IntegrationTestStack exposes mock functions via the mockFunctions property, indexed by the mock ids. Here we use this property to wire up the generated functions to the state machine. We also tag the state machine, so that we can interact with it in the unit tests.

The Unit Tests

For the unit test we will be using the Mocha testing framework and the Chai assertion library. The approach doesn't use anything specific to these, so it should be still viable if other frameworks and libraries are used.

As described in Part 3, we need to create and initialise an instance of UnitTestClient before each test run. This instance will be used to interact with the cloud-based resources.

describe('FileEventPublisher Tests', () => {
  const testClient = new UnitTestClient({
    testResourceTagKey: ResultCalculatorStateMachineTestStack.ResourceTagKey,
  });

  before(async () => {
    await testClient.initialiseClientAsync();
  });

  it('New scenario created', async () => {
    // Our test goes here
  });
});

Happy Path First

In this test, we want to ensure that the flow and mappings for the happy path are as expected. So first we specify the assertions and responses for the mock functions.

// Arrange
// .. snip test object creation...
await testClient.initialiseTestAsync({
  testId: 'New scenario created',
  mocks: {
    [TestStack.FileHeaderReaderMockId]: [
      { assert: { requiredProperties: ['s3Key'] }, response: scenarioFileHeader },
    ],
    [TestStack.FileHeaderIndexReaderMockId]: [
      {
        assert: { requiredProperties: ['fileType'] },
        response: configurationFileHeaderIndexes,
      },
    ],
    [TestStack.CombineHeadersMockId]: [
      { assert: { requiredProperties: ['configurations'] }, response: combinedHeaders },
    ],
  },
});

For each mock function, we have specified a set of assertions and responses. The assertions simply say that the event that triggers the mock function must contain the properties specified. If not, an error is thrown by the mock function. If the event matches, or if no assertion is specified, then the mock function returns the specified response.
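The assertion rule described above is simple enough to sketch in a few lines of TypeScript. This is an illustration of the behaviour rather than the actual mock implementation; `missingProperties`, `respond`, and the error text are names and wording assumed for the example.

```typescript
// Shape of one mock expectation, mirroring the test configuration above.
interface Expectation<T> {
  assert?: { requiredProperties: string[] };
  response: T;
}

// Lists the required properties that the event does not contain.
function missingProperties(event: Record<string, unknown>, requiredProperties: string[] = []): string[] {
  return requiredProperties.filter((name) => !(name in event));
}

// Throws if the event fails the assertion, otherwise returns the configured response.
function respond<T>(event: Record<string, unknown>, expectation: Expectation<T>): T {
  const missing = missingProperties(event, expectation.assert?.requiredProperties);
  if (missing.length > 0) {
    throw new Error(`Mock assertion failed, missing properties: ${missing.join(', ')}`);
  }
  return expectation.response;
}
```

Because a failed assertion surfaces as a function error, a bad mapping shows up as a failed execution in the test.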

For brevity, I have snipped the code that creates the fileEvent, scenarioFileHeader, configurationFileHeaderIndexes, and combinedHeaders test objects. For those with a curious bent, you can find the full code for the tests here.

// Act
const sutClient = testClient.getStepFunctionClient(
  ResultCalculatorStateMachineTestStack.StateMachineId
);
await sutClient.startExecutionAsync({ fileEvent });

The Act step is quite straightforward. We use the testClient instance to get a step function client to interact with the system under test. We use StateMachineId defined on ResultCalculatorStateMachineTestStack to do this. We then call the client to start the step function with the test event.

We introduced the Await step in the last part of the series. This step is required in asynchronous event-driven testing, as things take time to work their way through queues, topics, streams and so forth. Here we wait until the step function state indicates that it has finished.

// Await
const { timedOut, outputs } = await testClient.pollOutputsAsync<ObserverOutput<any>>({
  until: async () => sutClient.isExecutionFinishedAsync(),
  intervalSeconds: 2,
  timeoutSeconds: 12,
});
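For anyone curious about what sits behind a call like this, the core of such polling can be sketched as follows. This is not the implementation of pollOutputsAsync; `pollUntil` and `PollResult` are hypothetical names, and the sketch works in milliseconds and does not collect any outputs.

```typescript
interface PollResult {
  timedOut: boolean;
  attempts: number;
}

// Repeatedly evaluates `until`, sleeping between attempts, until it returns
// true or the timeout elapses.
async function pollUntil(
  until: () => Promise<boolean>,
  intervalMs: number,
  timeoutMs: number
): Promise<PollResult> {
  const deadline = Date.now() + timeoutMs;
  let attempts = 0;
  while (Date.now() <= deadline) {
    attempts += 1;
    if (await until()) {
      return { timedOut: false, attempts };
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return { timedOut: true, attempts };
}
```

Returning a flag instead of throwing on timeout leaves it to the test to decide how to report the failure.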

At this point in the code, the test has either timed out or the execution of the step function has finished. So, it is time to assert that things are as expected.

// Assert
expect(timedOut, 'Timed out').to.equal(false);

const status = await sutClient.getStatusAsync();
expect(status).to.equal('SUCCEEDED');

const resultCalculatorOutputs = outputs.filter(
  (o) => o.observerId === TestStack.ResultCalculatorObserverId
);
expect(resultCalculatorOutputs.length).to.equal(configurationCount);

As well as the assertions that the test did not time out and that the step function succeeded, we also have an assertion that the ResultCalculator function was called the expected number of times. In fact, we could go even further here and inspect the content of the events that triggered the functions.

The Unhappy Paths

The first unhappy path that we consider caters for the scenario where the step function is incorrectly triggered. I.e., it is triggered by a change to neither a configuration nor a scenario file. In this case, we want to assert that the execution fails and that an error event is raised.

If we have our system configured correctly, then this makes testing this scenario at that level quite tricky as it should really never happen. However, as we are able to control the responses from our mock functions, then it becomes quite simple.

In the Arrange step of the test, we configure the mock function to return a file header with a file type of Result.

const unhandledFileHeader = { fileType: FileType.Result, name: `name:${nanoid()}` };

await testClient.initialiseTestAsync({
  testId: 'Unhandled file type',
  mocks: {
    [TestStack.FileHeaderReaderMockId]: [{ response: unhandledFileHeader }],
  },
});

Next we Act just as we did before.

const sutClient = testClient.getStepFunctionClient(
  ResultCalculatorStateMachineTestStack.StateMachineId
);
await sutClient.startExecutionAsync({ fileEvent });

The Await step is where things are a little different. In addition to waiting for the step function to finish, we also wait until our error topic observer has outputted at least one event.

const getErrorOutputs = (outputs: ObserverOutput<any>[]): ObserverOutput<any>[] =>
  outputs.filter((o) => o.observerId === TestStack.ErrorTopicObserverId);

const { outputs, timedOut } = await testClient.pollOutputsAsync<ObserverOutput<any>>({
  // Await the finished check; an un-awaited promise would always be truthy
  until: async (o) =>
    (await sutClient.isExecutionFinishedAsync()) && getErrorOutputs(o).length > 0,
  intervalSeconds: 2,
  timeoutSeconds: 12,
});

Finally, we can assert that the test had the expected outcome.

expect(timedOut, 'Timed out').to.equal(false);

const status = await sutClient.getStatusAsync();
expect(status).to.equal('FAILED');

const lastEvent = await sutClient.getLastEventAsync();
expect(lastEvent).to.not.equal(undefined);
expect(lastEvent?.executionFailedEventDetails?.cause).to.equal('Unhandled FileType');

Here we are using the getLastEventAsync method to retrieve the last event emitted by the step function. With this, we can then assert that it contained the expected error cause. This gives us confidence that the flow is as we expect.

We can also assert that the expected error event was published.

const errorEventRecords = getErrorOutputs(outputs)
  .map((o) => (o.event as SNSEvent).Records)
  .reduce((all, r) => all.concat(r), []);

const errorEvent = JSON.parse(errorEventRecords[0].Sns.Message);
expect(errorEvent.error).to.equal('Unhandled FileType');

The other unhappy path to consider is when the first function errors when trying to read the input file header. To make matters more interesting, we have also configured the state to retry a couple of times.

This is where our mock functions again come to our aid. By specifying a value for error instead of response, we can instruct them to throw an error with the text specified.

await testClient.initialiseTestAsync({
  testId: 'File reader retries and fails',
  mocks: {
    [TestStack.FileHeaderReaderMockId]: [
      { error: 'Test error 1' },
      { error: 'Test error 2' },
      { error: 'Test error 3' },
    ],
  },
});

In this case, we don't supply any other mock responses, as we do not expect any other functions to be called.
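The reason for supplying three errors is that a maximum of 2 retries means up to three attempts in total: the initial call plus two retries. The TypeScript sketch below models just that counting; `runWithRetries` is a hypothetical helper, and it deliberately ignores the intervals and back-off that a real retry policy would apply.

```typescript
interface RetryOutcome<T> {
  attempts: number;
  result?: T;
  error?: string;
}

// Runs the task once, then retries up to `maxRetries` more times on failure.
function runWithRetries<T>(task: () => T, maxRetries: number): RetryOutcome<T> {
  let attempts = 0;
  let lastError = '';
  while (attempts <= maxRetries) {
    attempts += 1;
    try {
      return { attempts, result: task() };
    } catch (error) {
      lastError = error instanceof Error ? error.message : String(error);
    }
  }
  // Every attempt failed; report the last error seen.
  return { attempts, error: lastError };
}
```

With only two mock errors configured, the third attempt would instead reach whatever the mock does when it runs out of responses, and the test would no longer be exercising the retry limit.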

The rest of the test is the same as the first unhappy path, until we get to the Assert.

expect(lastEvent).to.not.equal(undefined);
expect(lastEvent?.executionFailedEventDetails?.cause).to.equal('Failed to read the input file');

Here we assert that the failure is due to the expected cause.

Summary

In this post, I went through how it is possible to unit test the flow and mapping of a step function by the use of mock and observer functions. I did gloss over how these are implemented, but if you are interested then the whole code can be found in the GitHub repo. My intention here was to explore the idea and show the possibilities of using CDK in this way.

It was interesting to develop a step function like this. I was decoupled from implementing the functions, I just concentrated on the contracts. I was able to deploy the step function and work on it interactively, amending the mappings and re-running the tests in the AWS console. I would then take the updated mappings, update the code definition, deploy the test stack and run the tests.

During this series of posts, I have ended up creating the start of what I hope will be an npm package that will enable anyone to start testing in this way. That is the intention, let us see if it comes to fruition. Happy testing!

Serverless integration testing with CDK - Part 3

Andy Blackledge — Sun, 11 Jul 2021 16:36:55 GMT

In Part 1 and Part 2 of this series, we looked at how we could take a serverless application and group it into a set of testable CDK constructs. In this part, I look into how we can refine the approach to make it as simple as possible to write and run tests in this way.

Quick recap

We started with our event-driven application architecture, where files added to an S3 bucket cause a ripple of events. The challenge being how to test that the system is working as expected.

We then grouped the application into a set of CDK constructs, as shown by the shaded boxes below. The idea being that these can be deployed and tested in isolation and then combined to produce the whole application.

We took the Event publisher construct first and got to a point where we could repeatedly run a unit test to drive the processing and then observe the results.

Observations from the first two parts

Calling the AWS CLI is clunky.
- The approach relied on invoking the AWS CLI via the child_process package. Looking at the code for AWS Testing Library, I could see that it is possible to call the AWS services directly using the AWS SDK and use the credentials loaded from the Shared Credentials File as described here.
This approach to testing will require similar scaffolding infrastructure for each construct being tested.
- For example, a function to act as an observer of events and a table to record the observable outputs from the test.
The unit tests need to know the names of the deployed resources.
- The approach relied on the deployment outputting the resource names for use in the tests, either by hardcoding or by passing them in by some other mechanism. As we are following best practice and not specifying resource names, this adds an extra step each time they change as a result of a deployment.

Moving forward

With these observations in mind, enter the Integration test stack and the Unit test client.

The Integration test stack is an abstract CDK Stack class that provides the following functionality:

Deployment of common test resource scaffolding, such as an observer function and a test output table.
A method to tag the deployed resources so that they can be discovered without needing to know the resource names or ARNs.

The Unit test client is a class that works in conjunction with the Integration test stack and provides the following:

A set of methods to locate and interact with test resources using the AWS SDK. For example, upload an object to an S3 bucket.
A method that encapsulates the polling of the test outputs, to make writing unit tests simple and clear.

Using these two classes our test architecture becomes the following, with a generic Test observer function and a generic Integration test table being provided by the Integration test stack and our unit tests using the Unit test client.

Creating the test stack

Now let's look at some code. To take advantage of the base functionality, our new test stack needs to extend IntegrationTestStack. For the moment, we will concentrate on how to use this class, but if you are interested in the inner workings of IntegrationTestStack, then you can find the code for it and everything else in the GitHub repo.

export default class FileEventPublisherTestStack extends IntegrationTestStack {
  static readonly ResourceTagKey = 'FileEventPublisherTestStackResource';
  static readonly TestBucketId = 'TestBucket';

  constructor(scope: cdk.Construct, id: string) {
    super(scope, id, {
      testResourceTagKey: FileEventPublisherTestStack.ResourceTagKey,
      deployIntegrationTestTable: true,
      deployTestObserverFunction: true,
    });

We first declare a couple of constants that we will need later for the Unit test client to work its magic, then we turn our attention to the meat of a CDK Stack, the constructor.

Here we supply three properties to super:

testResourceTagKey: This value is used as the key to tag any resources that our tests need to interact with.
deployIntegrationTestTable: This boolean property specifies whether or not we want a table to be created to hold the observable outputs of the tests.
deployTestObserverFunction: This boolean property specifies whether or not we want a function to be created that writes all observed events received to the output table.

Next, we need a test bucket to upload test files to.

const testBucket = new s3.Bucket(this, FileEventPublisherTestStack.TestBucketId, {
  removalPolicy: cdk.RemovalPolicy.DESTROY,
  autoDeleteObjects: true,
});
this.addTestResourceTag(testBucket, FileEventPublisherTestStack.TestBucketId);

Here we specify a test bucket that will be cleaned up automatically (the CDK construct creates a function to do this). We also have a call to addTestResourceTag, which tags it for later discovery and interaction.

Onto the system under test.

const sut = new FileEventPublisher(this, 'SUT', {
  fileBucket: testBucket,
  deploymentTarget: 'TEST',
});

Here we specify an instance of our construct FileEventPublisher. We wire the fileBucket property up to the test bucket we created above. We also set the deploymentTarget to TEST. Setting this value changes the removal policy for the underlying resources to one more suitable for a test environment. We do this as we want to avoid having test resources hanging around causing confusion and potentially costing money.

Finally, we add something to observe the results of our tests.

sut.fileEventTopic.addSubscription(
  new subscriptions.LambdaSubscription(this.testObserverFunction)
);

Earlier, we specified true for the deployTestObserverFunction property. This will cause the IntegrationTestStack to create a simple function that writes all events received to the test output table. In this case, we want to observe SNS events, so we wire it up to the topic of the system under test.

Now we have our test stack ready to go. We can use cdk synth to verify it, and then use cdk deploy to deploy it. Once done, it's time to turn our focus to the unit tests.

Creating the unit test

The first thing we need to do is create and initialise an instance of UnitTestClient.

describe('FileEventPublisher Tests', () => {
  const testClient = new UnitTestClient({
    testResourceTagKey: FileEventPublisherTestStack.ResourceTagKey,
  });

  before(async () => {
    await testClient.initialiseClientAsync();
  });

  it('New file upload', async () => {
    // Our test goes here
  });
});

The constructor requires a value for testResourceTagKey. This is quite literally the key for the UnitTestClient to locate resources by their tag. When initialiseClientAsync is called, UnitTestClient uses the getResources method from the ResourceGroupsTaggingAPI SDK to locate all resources that have a tag with a key of testResourceTagKey. This method returns an array of resource ARNs and their tags, an example of which is shown below.

[
  {
    "ResourceARN": "arn:aws:dynamodb:eu-west-2:728023361653:table/FileEventPublisherTestStack-IntegrationTestTableD3B8E277-OSEUU2XIBRK",
    "Tags": [
      {
        "Key": "FileEventPublisherTestStackResource",
        "Value": "IntegrationTestTable"
      }
    ]
  }
]
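To make the lookup concrete, here is a minimal sketch of how a response like the one above could be turned into a resource name. It is not the UnitTestClient implementation (that is in the repo); findArnByTag and tableNameFromArn are hypothetical helpers of my own, and the sketch assumes the tag value is the identifier that was passed when the resource was tagged.

```typescript
// Shape of one entry in the tagging API response shown above.
interface ResourceTagMapping {
  ResourceARN: string;
  Tags: { Key: string; Value: string }[];
}

// Hypothetical helper: find the ARN of the resource whose tag `tagKey`
// has the value `resourceId`, or undefined if nothing matches.
function findArnByTag(
  mappings: ResourceTagMapping[],
  tagKey: string,
  resourceId: string
): string | undefined {
  const match = mappings.find((m) =>
    m.Tags.some((t) => t.Key === tagKey && t.Value === resourceId)
  );
  return match?.ResourceARN;
}

// Hypothetical helper: derive the table name from a DynamoDB table ARN,
// which has the form arn:aws:dynamodb:<region>:<account>:table/<name>.
function tableNameFromArn(arn: string): string | undefined {
  const marker = ':table/';
  const index = arn.indexOf(marker);
  return index < 0 ? undefined : arn.slice(index + marker.length);
}

const mappings: ResourceTagMapping[] = [
  {
    ResourceARN:
      'arn:aws:dynamodb:eu-west-2:728023361653:table/FileEventPublisherTestStack-IntegrationTestTableD3B8E277-OSEUU2XIBRK',
    Tags: [{ Key: 'FileEventPublisherTestStackResource', Value: 'IntegrationTestTable' }],
  },
];

const tableArn = findArnByTag(
  mappings,
  'FileEventPublisherTestStackResource',
  'IntegrationTestTable'
);
const tableName = tableArn === undefined ? undefined : tableNameFromArn(tableArn);

// Logs the deployed table name without the test ever hardcoding it.
console.log(tableName);
```

The point of the design is that the test only ever refers to the stable tag value, while the generated part of the name is free to change with each deployment.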

One thing to note is that all the SDK calls require the AWS region. The way the code is currently set up, this value is specified via a .env file containing an entry of the following format.

AWS_REGION=eu-west-2

Before we can start on the usual parts of a unit test, we first need to make a call to initialiseTestAsync, passing in a test identifier. Behind the scenes, this value is used to record outputs from the test. The call to initialiseTestAsync uses the identifier to clear down all data previously recorded against it.

await testClient.initialiseTestAsync('New file upload');

Arrange

The Arrange step is next. In this case, we create the object to be uploaded and the S3 key to upload it to.

const configurationFile = newConfigurationFile();
const configurationFileS3Key = configurationFile.header.name;

Act

The Act step is more interesting.

await testClient.uploadObjectToBucketAsync(
  FileEventPublisherTestStack.TestBucketId,
  configurationFileS3Key,
  configurationFile
);

Here we use the uploadObjectToBucketAsync method on the UnitTestClient instance to upload our test object. We pass in the statically-known value FileEventPublisherTestStack.TestBucketId that was used to tag the resource. Underneath, the UnitTestClient uses this to resolve it to the deployed bucket name before making the AWS SDK call to do the upload. This approach keeps our unit test clean and decouples it from the details of a specific deployment.

Await

Next, we move to the additional step in an event-driven serverless test, the Await step.

const { outputs, timedOut } = await testClient.pollOutputsAsync({
  until: async (o) => getFileEvents(o).length === 2,
  intervalSeconds: 2,
  timeoutSeconds: 12,
});

There is quite a bit going on in this single statement. It tells the UnitTestClient to poll the Integration test table every 2 seconds until either the supplied until function returns true or 12 seconds have elapsed. The return value from pollOutputsAsync is destructured into outputs, an array of the test outputs, and timedOut, a boolean indicating whether or not the polling timed out before the until function returned true.

What is happening is that the implementation of pollOutputsAsync is locating the Integration test table by its tag. It then queries the table for all outputs associated with the test identifier and periodically calls the until function with the outputs found so far. In this case, the Test observer function is hooked up to an SNS topic, so it is writing SNSEvent instances to the table.
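As an illustration of the polling idea, the following is a minimal sketch rather than the actual pollOutputsAsync code. The pollAsync function and its fetchOutputs parameter are my own assumptions; fetchOutputs stands in for the query against the Integration test table so that the sketch needs no AWS access.

```typescript
interface PollOptions<T> {
  until: (outputs: T[]) => Promise<boolean>;
  intervalSeconds: number;
  timeoutSeconds: number;
}

interface PollResult<T> {
  outputs: T[];
  timedOut: boolean;
}

// Hypothetical stand-in for pollOutputsAsync. `fetchOutputs` is an assumed
// parameter that replaces the real query for the current test's outputs.
async function pollAsync<T>(
  fetchOutputs: () => Promise<T[]>,
  { until, intervalSeconds, timeoutSeconds }: PollOptions<T>
): Promise<PollResult<T>> {
  const deadline = Date.now() + timeoutSeconds * 1000;
  let outputs: T[] = [];

  while (Date.now() < deadline) {
    // Read everything recorded so far and let the caller decide if it is enough.
    outputs = await fetchOutputs();
    if (await until(outputs)) {
      return { outputs, timedOut: false };
    }
    // Not there yet, so wait for one interval before looking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalSeconds * 1000));
  }

  // The deadline passed without `until` being satisfied.
  return { outputs, timedOut: true };
}
```

Compared with a fixed wait, a loop like this returns as soon as the expected outputs appear and only consumes the full timeout when something has gone wrong.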

In our test, we are looking for two events that will be fired in close succession. Now, SNS can batch up events into a single SNSEvent instance that contains multiple records. Our events could well be batched together, so what we need to do is create a utility function to extract and return the underlying FileEvent instances.

function getFileEvents(outputs: SNSEvent[]): FileEvent[] {
  return outputs
    .map((o) => o.Records.map((r) => JSON.parse(r.Sns.Message)))
    .reduce((allEvents, events) => allEvents.concat(events), []);
}
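To see why counting the raw outputs would not be reliable, here is a small self-contained example. The SNSEvent and FileEvent interfaces below are cut-down local stand-ins of my own (the real code uses the full types), and the sample values are made up for illustration.

```typescript
// Minimal local stand-ins declaring only the fields used in this example.
interface FileEvent {
  s3Key: string;
  eventType: string;
  sectionType: string;
}

interface SNSEvent {
  Records: { Sns: { Message: string } }[];
}

function getFileEvents(outputs: SNSEvent[]): FileEvent[] {
  return outputs
    .map((o) => o.Records.map((r) => JSON.parse(r.Sns.Message) as FileEvent))
    .reduce((allEvents, events) => allEvents.concat(events), [] as FileEvent[]);
}

const toRecord = (e: FileEvent) => ({ Sns: { Message: JSON.stringify(e) } });

const headerEvent: FileEvent = { s3Key: 'File1', eventType: 'Created', sectionType: 'Header' };
const bodyEvent: FileEvent = { s3Key: 'File1', eventType: 'Created', sectionType: 'Body' };

// Both events delivered together in a single SNSEvent...
const batched: SNSEvent[] = [{ Records: [toRecord(headerEvent), toRecord(bodyEvent)] }];

// ...and the same events delivered as two separate SNSEvent instances.
const separate: SNSEvent[] = [
  { Records: [toRecord(headerEvent)] },
  { Records: [toRecord(bodyEvent)] },
];

// The number of outputs differs (1 versus 2), but the number of file events does not.
console.log(batched.length, getFileEvents(batched).length); // prints: 1 2
console.log(separate.length, getFileEvents(separate).length); // prints: 2 2
```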

Assert

Now we are into standard unit test territory with the Assert step. We check that the polling did not time out and then that our action resulted in the expected events.

expect(timedOut).to.equal(false);

const fileEvents = getFileEvents(outputs);

expect(fileEvents.length).to.equal(2);

expect(
  fileEvents.some(
    (e) =>
      e.s3Key === configurationFileS3Key &&
      e.eventType === FileEventType.Created &&
      e.sectionType === FileSectionType.Header
  )
).to.equal(true);

expect(
  fileEvents.some(
    (e) =>
      e.s3Key === configurationFileS3Key &&
      e.eventType === FileEventType.Created &&
      e.sectionType === FileSectionType.Body
  )
).to.equal(true);

We now have a unit test that fully exercises the construct. In the GitHub repo, you can see how I created other tests for the construct. For example, testing the scenarios where an existing file was updated or partially updated. This was straightforward to do and gives me a high level of confidence in the system. In addition, anyone can clone the repo, set their region in the .env file, deploy the test stack and they are ready to run the tests.

Testing the approach

To further test the approach, I then looked at the Header index construct. This construct is driven by SNS events, reads files from an S3 bucket, stores the headers in a table, and makes them queryable via a function. The resulting test architecture looked as follows.

In this case, the Integration test stack does not need to create an observer or a table for the outputs. However, it still has a role in tagging the resources that our unit tests need to interact with. With these tags in place, the UnitTestClient instance can be used to publish messages to a topic using the publishMessageToTopicAsync method.

await testClient.publishMessageToTopicAsync(
  FileHeaderIndexTestStack.TestFileEventTopicId,
  fileEvent,
  fileEvent.messageAttributes
);

And also to invoke a function using the invokeFunctionAsync method.

await testClient.invokeFunctionAsync(
  FileHeaderIndexer.ReaderFunctionId,
  { fileType: file.header.fileType }
);

Overall, I felt the approach worked very well. I was able to deploy iteratively and test each part in turn. For example, starting with a simple function implementation that logs the event. Deployments were not as quick as I would like, but I felt they were not so long as to invalidate the approach. For complicated business logic, I would recommend local unit tests anyway.

You can find the resulting test stack and unit test in the GitHub repo. I have to say that the level of confidence I had once the tests were passing was very pleasing. The generic components are also shaping up well to be published as an npm package.

For the next part, I will look into how the approach might be applied to unit testing Step Functions.

Serverless integration testing with CDK - Part 2

Andy Blackledge — Thu, 24 Jun 2021 18:27:55 GMT

In the first part of this series, I looked at how we could use CDK to compose our serverless applications from constructs that can be tested in isolation. In this part, I look at how we can automate the testing of these constructs to get repeatable results and the confidence that comes with them.

The application

The application we are working on is event-driven, where files added to an S3 bucket cause a ripple of events. The application does a simplified affordability calculation for loan application scenarios. For example, when a scenario is added or updated, we want the affordability for that scenario to automatically be recalculated and the results stored. The architecture is shown below.

In the first part of the series, we went through how we can compose the application from three CDK constructs. These CDK constructs can be deployed and tested in isolation and are shown by the shaded boxes below.

For this post, we are going to concentrate on the Event publisher construct. We had got to a point where we could deploy a test stack containing the Event publisher construct, along with a test bucket and a test subscriber that logged all received events to CloudWatch. By using the AWS Toolkit, we were able to upload test files and view the resulting events.

Utilising the AWS CLI

This is a pretty good start, as we can verify that we have wired up all the AWS resources in such a way that we get the effect we want. This wouldn't be possible just from local testing. What isn't so good is the fact that our testing is ad-hoc and not repeatable.

With this in mind, my first thought is how can we write a unit test that can upload files to S3? In fact, how can we upload files to S3 at all? One way is through the use of the AWS CLI and the s3 cp command, e.g.:

aws s3 cp test.txt s3://mybucket/test2.txt

Note, I am aware that there might be better ways than using the AWS CLI for this (e.g. using the SDK), but for the purposes of this post this approach is sufficient (and works 😊).

Two further questions spring to mind:

How do we know the name of the bucket?
How do we run a command line tool from Node.js?

The first question is easy to answer, as we can add an output to our test stack FileEventPublisherTestStack as follows:

new cdk.CfnOutput(this, `TestBucketName`, {
  value: testBucket.bucketName,
});

Now when we deploy the stack we see the following:

Outputs:
FileEventPublisherTestStack.TestBucketName = fileeventpublisherteststack-testbucketb80bc560-hghxtm1zahbc

The second question took a bit of googling, but the end result was the following function:

import * as child from 'child_process';

async function execCommand(command: string): Promise<string> {
  return new Promise((resolve, reject) => {
    child.exec(command, (error, stdout) => {
      if (error) {
        reject(error);
      } else {
        resolve(stdout);
      }
    });
  });
}

Armed with these, I can now start writing a unit test to verify that uploading a new File results in two events, one for the header and one for the body. We will give each file a unique name, so that we can correlate the events generated with our input actions.

Arranging

The first step is to create a unique file and save it locally for uploading.

// Arrange
const configurationFile: File = {
  header: {
    fileType: FileType.Configuration,
    name: `Configuration_${nanoid(10)}`,
  },
  body: {
    incomeMultiplier: 0,
  },
};
const configurationFileName = `${configurationFile.header.name}.json`;
fs.writeFileSync(configurationFileName, JSON.stringify(configurationFile));

Note, I am using the excellent nanoid package to generate some short unique ids.

Acting

The next step is to upload the file to S3. This is where we can use the output from deploying the test stack.

// Act
try {
  const testBucketName = 'fileeventpublisherteststack-testbucketb80bc560-hghxtm1zahbc';
  const uploadTestFileCommand = `aws s3 cp ${configurationFileName} s3://${testBucketName}`;
  console.log(await execCommand(uploadTestFileCommand));
} finally {
  fs.unlinkSync(configurationFileName);
}

Now when we run the unit test, we see the following, which confirms that we are successfully uploading the file.

Completed 103 Bytes/103 Bytes (382 Bytes/s) with 1 file(s) remaining
upload: .\Configuration_x5RvXtJGFl.json to s3://fileeventpublisherteststack-testbucketb80bc560-hghxtm1zahbc/Configuration_x5RvXtJGFl.json

Asserting

We now have the Arrange and Act parts of the Arrange/Act/Assert testing pattern in place. The next question is how to assert. We could try to read the CloudWatch logs, but that feels clunky and imprecise. If the system under test wrote its output to a persistent store of some kind then we could query that, but in this case it raises SNS events that are not stored anywhere.

The answer is to extend the test scaffolding so that the test subscriber stores the received messages in an easily queryable form. Thankfully, in DynamoDB, AWS has the perfect tool for this job. The result will look like the following.

The TestOutputsTable is keyed by the s3Key and SNS messageId, which will enable us to get back all outputs that are related to a particular test file. By including the FileEvent as part of the item, we will be able to assert that our tests have the expected effects.

const testOutputsTable = new dynamodb.Table(this, 'TestOutputsTable', {
  partitionKey: { name: 's3Key', type: dynamodb.AttributeType.STRING },
  sortKey: { name: 'messageId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
});

const fileEventTestSubscriberFunction = newNodejsFunction(
  this,
  'FileEventTestSubscriberFunction',
  'fileEventTestSubscriber',
  {
    TEST_RESULTS_TABLE_NAME: testOutputsTable.tableName,
  }
);

sut.fileEventTopic.addSubscription(
  new subscriptions.LambdaSubscription(fileEventTestSubscriberFunction)
);

testOutputsTable.grantWriteData(fileEventTestSubscriberFunction);

new cdk.CfnOutput(this, `TestOutputsTableName`, {
  value: testOutputsTable.tableName,
});
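The test subscriber function itself is in the repo, but the mapping it has to perform can be sketched roughly as follows. This is an assumption-laden illustration rather than the real handler: toTestOutputItems, SnsRecord and TestOutputItem are hypothetical names of mine, and the write to DynamoDB is left out so that the sketch has no dependencies.

```typescript
// The notification format published to the File events topic.
interface FileEvent {
  eventType: string;
  s3Key: string;
  fileType: string;
  contentType: string;
}

// Cut-down stand-in for an SNS event record; only the fields used are declared.
interface SnsRecord {
  Sns: { MessageId: string; Message: string };
}

// One item in the TestOutputsTable: keyed by s3Key and messageId,
// with the received FileEvent stored alongside for later assertions.
interface TestOutputItem {
  s3Key: string;
  messageId: string;
  message: FileEvent;
}

// Hypothetical helper: turn the records of a received SNS event into table items.
function toTestOutputItems(records: SnsRecord[]): TestOutputItem[] {
  return records.map((record) => {
    const message = JSON.parse(record.Sns.Message) as FileEvent;
    return { s3Key: message.s3Key, messageId: record.Sns.MessageId, message };
  });
}

const sampleRecords: SnsRecord[] = [
  {
    Sns: {
      MessageId: 'message-1',
      Message: JSON.stringify({
        eventType: 'Create',
        s3Key: 'Configuration_x5RvXtJGFl.json',
        fileType: 'Configuration',
        contentType: 'Header',
      }),
    },
  },
];

console.log(toTestOutputItems(sampleRecords));
```

Using the s3Key as the partition key is what later allows a single query to return every output for a given test file.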

Again, we have an output and so on deployment we see the following:

Outputs:
...
FileEventPublisherTestStack.TestOutputsTableName = FileEventPublisherTestStack-TestOutputsTable8A620419-21HCDG1KTVQKN

Now we have a way of querying the results of our test, we can build the appropriate AWS CLI command and run it to get the results of our test. We can then parse the results into an array of the FileEvent instances raised and assert our expectations.

// Assert
const testOutputsTableName = 'FileEventPublisherTestStack-TestOutputsTable8A620419-21HCDG1KTVQKN';
const queryTestOutputsCommand = `aws dynamodb query \
  --table-name ${testOutputsTableName} \
  --key-condition-expression "s3Key = :s3Key" \
  --expression-attribute-values "{ \\":s3Key\\": { \\"S\\": \\"${configurationFileName}\\" } }"`;
const queryResult = JSON.parse(await execCommand(queryTestOutputsCommand)) as QueryOutput;
const fileEvents = queryResult.Items?.map(
  (item) => AWS.DynamoDB.Converter.unmarshall(item).message as FileEvent
);
expect(fileEvents?.length).to.equal(2);

What could possibly go wrong?

AssertionError: expected 0 to equal 2
+ expected - actual

-0
+2

What have we missed? ¯\_(ツ)_/¯

Patience

Well, we have missed a key aspect of our application. That aspect is that it is asynchronous. It takes a finite amount of time for the S3 event to be raised and processed, for the DynamoDB event to be raised and processed, and finally for the output to be written to the outputs table, ready to be read. We need to be patient.

// Wait
await new Promise((resolve) => setTimeout(resolve, 6 * 1000));

// Assert

Whilst this is a bit of a sledgehammer to crack a nut, when we run our test we see the following:

1 passing (5s)

Success! 🍾

Summary

We now have a semi-repeatable way to test at least one aspect of one part of our application. I say 'semi-repeatable', as there is no guarantee that the time we gave the test will always be sufficient. We have a good start, but clearly there are ways to improve. That will be the topic of the next post.

All the code above is available on GitHub here.

I am aware that others are doing work in this space, so please check out the following if you are interested:
Theodo's promising sls-test-tools that provide "a range of utilities, setup, teardown and assertions to make it easier to write effective and high quality integration tests for Serverless Architectures on AWS."
The AWS Testing Library which allows you to assert the presence of items within AWS resources
How to test your EventBridge integrations, an article describing an approach to serverless testing.

Serverless integration testing with CDK - Part 1

Andy Blackledge — Tue, 08 Jun 2021 19:57:19 GMT

From my own experience and from what I have read of others', one of the biggest challenges of serverless is how to test. Beyond unit testing, do you try to replicate cloud infrastructure locally or do you rely on high-level end-to-end tests? With multiple resources interacting asynchronously, how can you develop repeatable, meaningful tests? Here I ponder how we might take advantage of the AWS CDK to help, using it to package our serverless applications into units that can be independently deployed, tested, and then torn down.

All the code for this post is available on GitHub.

The system under test

For this post, let us consider a system that does a simplified affordability calculation for a loan application. The system contains a number of configurations and a number of scenarios. A configuration contains a set of values that are used in the affordability model, such as how much of any overtime income is to be used. A scenario contains the details supplied by the loan applicants, such as a breakdown of the applicants' income. The system automatically calculates the results for each combination of configuration and scenario whenever a new one is added or an existing one is amended.

The system revolves around a bucket that contains JSON files with the following structure:

{
  "header": {
    "fileType": "Configuration|Scenario|Result",
    "name": "E.g. High Risk Scenario"
  },
  "body": {
  }
}

The system only recalculates when the body contents of Configuration or Scenario files are updated. Changing the header details does not cause any recalculation. Note, one assumption here is that the fileType is never changed once set.

The system design is as follows:

The diagrams for this post were created using the excellent Miro tool.

When a file is added to or updated in the Files bucket, an event is raised. The Hash writer Lambda function handles this event and calculates hashes for the header and body of the file. It then writes these to the Hashes DynamoDB table. The Hashes table raises change events that are then handled by the File event publisher Lambda function. The File event publisher function processes these events and sends notifications of the following format to the File events SNS topic.

{
  "eventType": "Create|Update",
  "s3Key": "E.g. Configuration_7Jk0Sf5JsDPZt5skWFyNR.json",
  "fileType": "Configuration|Scenario|Result",
  "contentType": "Header|Body"
}

Note, for the purposes of simplicity, Delete events are not being considered in this example.
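The File event publisher implementation is in the repo, but the decision at its heart can be sketched along the following lines. This is only an illustration under my own assumptions: HashItem (including its hash and fileType attributes) and toNotification are hypothetical, and the old and new items stand in for the images carried by a change event from the Hashes table.

```typescript
type FileType = 'Configuration' | 'Scenario' | 'Result';
type ContentType = 'Header' | 'Body';

// Assumed shape of an item in the Hashes table: one per file section.
interface HashItem {
  s3Key: string;
  sectionType: ContentType;
  fileType: FileType;
  hash: string;
}

interface FileEventNotification {
  eventType: 'Create' | 'Update';
  s3Key: string;
  fileType: FileType;
  contentType: ContentType;
}

// Hypothetical helper: decide which notification, if any, a change to a
// hash item should produce. No old item means the section is new; a
// different hash means it was updated; an identical hash means no event.
function toNotification(
  oldItem: HashItem | undefined,
  newItem: HashItem
): FileEventNotification | undefined {
  if (oldItem !== undefined && oldItem.hash === newItem.hash) {
    return undefined;
  }
  return {
    eventType: oldItem === undefined ? 'Create' : 'Update',
    s3Key: newItem.s3Key,
    fileType: newItem.fileType,
    contentType: newItem.sectionType,
  };
}

const bodyBefore: HashItem = {
  s3Key: 'Scenario_1.json',
  sectionType: 'Body',
  fileType: 'Scenario',
  hash: 'hash-a',
};
const bodyAfter: HashItem = { ...bodyBefore, hash: 'hash-b' };

console.log(toNotification(undefined, bodyBefore)); // a Create notification
console.log(toNotification(bodyBefore, bodyAfter)); // an Update notification
```

Because the header and the body are hashed separately, a change to only the header produces only a Header notification, which is what allows the calculator to ignore it.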

The Header updates SQS queue subscribes to the File events SNS topic for events with a contentType of Header. The Header writer Lambda function processes messages from the Header updates queue and retrieves the header from the Files bucket. The Header writer function then writes an entry of the following format to the Headers DynamoDB table.

{
  "fileType": "Configuration|Scenario|Result",
  "s3Key": "E.g. Configuration_7Jk0Sf5JsDPZt5skWFyNR.json",
  "name": "E.g. High Risk Scenario"
}

The Headers table is configured with fileType as the partition key and s3Key as the sort key. The Header reader Lambda function encapsulates access to the Headers table. It takes requests of the following format:

{
  "fileType": "Configuration|Scenario|Result"
}

And returns responses as follows:

{
  "headers": [
    {
      "fileType": "Configuration|Scenario|Result",
      "s3Key": "E.g. Configuration_7Jk0Sf5JsDPZt5skWFyNR.json",
      "name": "E.g. High Risk Scenario"
    }
  ]
}

The final part of the system is the calculator. The Body updates SQS queue subscribes to the File events SNS topic for events with a contentType of Body. The Calculation initiator Lambda function processes messages from the Body updates queue and retrieves the associated file from the Files bucket. The header is then passed to the Calculator step function, which uses the Header reader function to work out the combinations to calculate, before performing each calculation and putting the results in the Files bucket.
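To illustrate what "working out the combinations" might involve, here is a minimal sketch. It is my own guess at the logic rather than the step function's actual definition: getCombinations and the Combination shape are hypothetical, and the headers argument stands in for the responses returned by the Header reader function.

```typescript
type FileType = 'Configuration' | 'Scenario' | 'Result';

// The header index entry format returned by the Header reader function.
interface FileHeaderIndex {
  fileType: FileType;
  s3Key: string;
  name: string;
}

interface Combination {
  configurationS3Key: string;
  scenarioS3Key: string;
}

// Hypothetical helper: a changed configuration is paired with every
// scenario, a changed scenario with every configuration. Changes to
// result files never trigger a calculation.
function getCombinations(changed: FileHeaderIndex, headers: FileHeaderIndex[]): Combination[] {
  switch (changed.fileType) {
    case 'Configuration':
      return headers
        .filter((h) => h.fileType === 'Scenario')
        .map((s) => ({ configurationS3Key: changed.s3Key, scenarioS3Key: s.s3Key }));
    case 'Scenario':
      return headers
        .filter((h) => h.fileType === 'Configuration')
        .map((c) => ({ configurationS3Key: c.s3Key, scenarioS3Key: changed.s3Key }));
    default:
      return [];
  }
}

const knownHeaders: FileHeaderIndex[] = [
  { fileType: 'Configuration', s3Key: 'Configuration_A.json', name: 'Cautious' },
  { fileType: 'Configuration', s3Key: 'Configuration_B.json', name: 'Generous' },
  { fileType: 'Scenario', s3Key: 'Scenario_1.json', name: 'High Risk Scenario' },
];

// A changed scenario needs recalculating against both configurations.
console.log(getCombinations(knownHeaders[2], knownHeaders));
```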

Note, in a production system we would want to add appropriate dead letter queues and other error handling. These have been left out of the example for simplicity.

Testing

Not one part of the system we have designed is particularly complicated. In fact, the Lambda functions are going to be very simple indeed. So simple, in fact, that we might question the value of building and maintaining unit tests for them. As with many systems of this type, the functionality emerges from the interaction between the various simple resources. Given this, it seems reasonable to target our testing on verifying that those resources work together as expected.

One way to approach this is to break the system down as follows:

Event publisher: subscribes to events from an S3 bucket, reads the file contents, and raises change events to an SNS topic
Header index: subscribes to change events from an SNS topic, reads an S3 bucket, and exposes an API for listing the file headers
Result calculator: subscribes to change events from an SNS topic, uses an API to list the file headers, reads files from an S3 bucket, calculates the results and puts them in the S3 bucket

With the system broken down like this, we can create CDK constructs for each part and then create individual test stacks to deploy them for testing in isolation.

Let us first consider the Event publisher construct. To the outside world, it takes in an S3 bucket and exposes an SNS topic. With this in mind, we can create the following minimal implementation. In CDK, the pattern is to provide inputs as properties on the props passed into the constructor, and to expose outputs as public properties on the construct itself.

export interface FileEventPublisherProps {
  fileBucket: s3.Bucket;
}

export default class FileEventPublisher extends cdk.Construct {
  readonly fileEventTopic: sns.Topic;

  constructor(scope: cdk.Construct, id: string, props: FileEventPublisherProps) {
    super(scope, id);

    this.fileEventTopic = new sns.Topic(this, `${id}FileEventTopic`, {
      displayName: `File event topic for ${props.fileBucket.bucketName}`,
    });
  }
}

With this construct, we can now create a test stack that will wire up the inputs and outputs of the construct to test resources. In this case, an S3 bucket and a Lambda function.

For the Lambda function, we create one based on inline code that simply logs out the event for inspection. For the S3 bucket, we use a handy CDK property called autoDeleteObjects. Setting this to true creates a Lambda function that is triggered when the bucket is removed from the stack or when the stack is deleted. This function deletes all objects in the bucket. Having this on test buckets allows us to better clean up after ourselves.

export default class FileEventPublisherTestStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string) {
    super(scope, id);

    const testBucket = new s3.Bucket(this, 'TestBucket', {
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
    });

    const testSubscriber = new lambda.Function(this, 'TestSubscriber', {
      handler: 'index.handler',
      runtime: lambda.Runtime.NODEJS_12_X,
      code: lambda.Code.fromInline(
        `exports.handler = (event) => { console.log(JSON.stringify(event, null, 2)) }`
      ),
    });

    const sut = new FileEventPublisher(this, 'SUT', {
      fileBucket: testBucket,
    });

    sut.fileEventTopic.addSubscription(new subscriptions.LambdaSubscription(testSubscriber));
  }
}

The stack can then be deployed as part of a CDK App as follows.

const app = new cdk.App();
new FileEventPublisherTestStack(app, 'FileEventPublisherTestStack');

We now have an S3 bucket to drop test files into, and we have a Lambda function that outputs the resulting events for us to verify. If we are using VS Code and the AWS Toolkit, then we can do both of these things without leaving our editor whilst we develop and test the functionality. Of course, at the moment, dropping things into the bucket has no effect.

I have left out the implementation of the fileHashWriter and fileEventPublisher, but if you are interested then they are available in the GitHub repo. With the functions defined, I created a utility function for including these as part of a construct. It uses the NodejsFunction construct, which creates a Node.js Lambda function bundled using esbuild. As well as bundling the function, it also defaults to setting the AWS_NODEJS_CONNECTION_REUSE_ENABLED environment variable to 1, to automatically reuse TCP connections when working with the AWS SDK for JavaScript.

private newFunction(
  functionId: string,
  functionModule: string,
  environment: Record<string, any>
): lambda.Function {
  const functionEntryBase = path.join(__dirname, '..', '..', '..', 'src', 'functions');

  return new lambdaNodejs.NodejsFunction(this, functionId, {
    runtime: lambda.Runtime.NODEJS_12_X,
    entry: path.join(functionEntryBase, `${functionModule}.ts`),
    handler: 'handler',
    environment,
  });
}

With this utility function in place, we can add the other parts to the FileEventPublisher construct. I didn't do this all in one step. First, I created an inline version of each function and tested that it was wired up correctly. I did this by uploading to the bucket using the AWS Toolkit and then viewing the CloudWatch logs, again via the AWS Toolkit. Although deploying wasn't exactly speedy, it wasn't too bad and it was feasible to develop and test without recourse to unit tests.

Note: I still think unit tests have a big part to play in serverless development. However, they have their limits and sometimes they are so simple that covering them via integration tests would be sufficient IMHO.

The final version of the FileEventPublisher was as follows.

export default class FileEventPublisher extends cdk.Construct {
  readonly fileEventTopic: sns.Topic;

  constructor(scope: cdk.Construct, id: string, props: FileEventPublisherProps) {
    super(scope, id);

    this.fileEventTopic = new sns.Topic(this, `${id}FileEventTopic`, {
      displayName: `File event topic for ${props.fileBucket.bucketName}`,
    });

    // The table to hold the hashes of the files sections
    const fileHashesTable = new dynamodb.Table(this, 'FileHashesTable', {
      partitionKey: { name: 's3Key', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'sectionType', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
    });

    // The function that is notified by the bucket and writes the hashes to the table
    const hashWriterFunction = this.newFunction('FileHashWriterFunction', 'fileHashWriter', {
      FILE_HASHES_TABLE_NAME: fileHashesTable.tableName,
    });

    props.fileBucket.grantRead(hashWriterFunction);
    props.fileBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.LambdaDestination(hashWriterFunction)
    );
    fileHashesTable.grantWriteData(hashWriterFunction);

    // The function to receive stream events from the hashes table and publish event to the topic
    const fileEventPublisherFunction = this.newFunction(
      'FileEventPublisherFunction',
      'fileEventPublisher',
      {
        FILE_EVENT_TOPIC_ARN: this.fileEventTopic.topicArn,
      }
    );

    fileEventPublisherFunction.addEventSource(
      new lambdaEvents.DynamoEventSource(fileHashesTable, {
        startingPosition: lambda.StartingPosition.TRIM_HORIZON,
      })
    );
    this.fileEventTopic.grantPublish(fileEventPublisherFunction);
  }
}

Now, manually testing the deployed functionality is fine, but what we should be striving for are some repeatable tests that can be automated. That will be the subject of Part 2 in this series.

Designing a CDK State Machine Builder - Part 3

Andy Blackledge — Sat, 22 May 2021 12:46:32 GMT

In my previous post, I started to implement my alternative API for defining state machines in CDK. In this post, I continue this and, after a few bumps in the road, get a usable version finished and tested.

The code for this post can be found in the GitHub repo here.

Edit: The resulting code is now available as an npm package.

As we left it in Part 2, we still had the following functionality to implement.

Map states
Parallel states
Error handlers

Unbeknownst to me at this point, there would also be other challenges when I started to consider some of the more involved state machine scenarios.

If you recall, in our design, Map states are defined as follows.

.map('Map1', {
  itemsPath: '$.Items1',
  iterator: new StateMachineBuilder()
    .perform(state1)
    .perform(state2)
    .perform(state3)
    .perform(state4),
})

With Parallel states defined like this.

.parallel('Parallel2', {
  branches: [
    new StateMachineBuilder().perform(state5).perform(state6),
    new StateMachineBuilder().perform(state7).perform(state8),
  ],
})

In both cases, the idea is that the 'sub-states' are built by defining separate StateMachineBuilder instances. When build is called, we should be able to call build on the 'sub-states' and obtain an appropriate IChainable instance.

As with the perform and choice steps in Part 2, I needed new BuilderStep classes to capture and hold the details of the states to create.

class MapStep implements BuilderStep {
  // Snip
}
class ParallelStep implements BuilderStep {
  // Snip
}

With these in place, I could amend the following methods to store the steps for later use by the build method.

map(id: string, props: BuilderMapProps): StateMachineBuilder {
  this.steps.push(new MapStep(id, props));
  return this;
}
parallel(id: string, props: BuilderParallelProps): StateMachineBuilder {
  this.steps.push(new ParallelStep(id, props));
  return this;
}

With the new step types being added, I needed to extend the switch in getStepChain to call new methods that return an appropriate IChainable instance.

case StepType.Map:
  stepChain = this.getMapStepChain(scope, stepIndex);
  break;
case StepType.Parallel:
  stepChain = this.getParallelStepChain(scope, stepIndex);
  break;

Implementing getMapStepChain and getParallelStepChain required a very similar approach. In both cases, the build method is called on the nested StateMachineBuilder instances, and the steps are as follows.

Create the state
Create the sub-state by invoking build with the scope
Add the sub-state to the current state
Wire up the next state, if there is one

The difference between the two being that the Parallel state can have multiple sub-states.

private getMapStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
  const step = this.steps[stepIndex] as MapStep;
  const map = new sfn.Map(scope, step.id, step.props);
  map.iterator(step.props.iterator.build(scope));
  const stepChain = this.hasNextStep(stepIndex)
    ? map.next(this.getStepChain(scope, stepIndex + 1))
    : map;
  return stepChain;
}
private getParallelStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
  const step = this.steps[stepIndex] as ParallelStep;
  const parallel = new sfn.Parallel(scope, step.id, step.props);
  step.props.branches.forEach((branch) => {
    parallel.branch(branch.build(scope));
  });
  const stepChain = this.hasNextStep(stepIndex)
    ? parallel.next(this.getStepChain(scope, stepIndex + 1))
    : parallel;
  return stepChain;
}
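The four steps listed above can be tried out without CDK. The following is only a sketch under stated assumptions: SketchNode, SketchStep and NestedBuilder are hypothetical stand-ins for the CDK states and the real builder, and states are identified by plain strings. It also assumes at least one step has been added before build is called.

```typescript
// Stand-in node; the real code creates sfn.Parallel and other CDK states instead.
interface SketchNode {
  id: string;
  branches: SketchNode[]; // first node of each sub-state machine
  next?: SketchNode;
}

type SketchStep =
  | { kind: 'perform'; id: string }
  | { kind: 'parallel'; id: string; branches: NestedBuilder[] };

class NestedBuilder {
  private readonly steps: SketchStep[] = [];

  perform(id: string): NestedBuilder {
    this.steps.push({ kind: 'perform', id });
    return this;
  }

  parallel(id: string, branches: NestedBuilder[]): NestedBuilder {
    this.steps.push({ kind: 'parallel', id, branches });
    return this;
  }

  build(): SketchNode {
    return this.getNode(0);
  }

  private getNode(stepIndex: number): SketchNode {
    const step = this.steps[stepIndex];
    // 1. Create the state
    const node: SketchNode = { id: step.id, branches: [] };
    if (step.kind === 'parallel') {
      // 2 and 3. Build each sub-state machine and add it to the current state
      node.branches = step.branches.map((branch) => branch.build());
    }
    // 4. Wire up the next state, if there is one
    if (stepIndex < this.steps.length - 1) {
      node.next = this.getNode(stepIndex + 1);
    }
    return node;
  }
}

const parallelNode = new NestedBuilder()
  .parallel('Parallel2', [
    new NestedBuilder().perform('State5').perform('State6'),
    new NestedBuilder().perform('State7').perform('State8'),
  ])
  .perform('State9')
  .build();
```

Each nested builder only knows about its own steps, so the outer builder never needs to look inside a branch. It just calls build and attaches whatever comes back.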

Using the examples created in Part 1 and the testing approach in Part 2, I was able to verify that StateMachineBuilder was behaving as expected and outputting the equivalent definition.

The final part in the puzzle, or so I thought, was to implement error handlers. The API design for defining these was for a catches array on the appropriate props passed in to each method, an example of which is shown below.

.perform(function1, {
  catches: [
    { errors: ['States.Timeout'], handler: 'Catch1' },
    { errors: ['States.All'], handler: 'Catch2' },
  ],
})
.map('Map1', {
  itemsPath: '$.Items1',
  iterator: new StateMachineBuilder().perform(state1).perform(state2),
  catches: [{ handler: 'Catch5' }],
})
.parallel('Parallel1', {
  branches: [
    new StateMachineBuilder().perform(state3),
    new StateMachineBuilder().perform(state4),
  ],
  catches: [{ handler: 'Catch6' }],
})

This all looked pretty straightforward, I thought. All I needed to do was iterate over the catches and invoke the addCatch method with the IChainable for the handler state.

private getPerformStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
  const step = this.steps[stepIndex] as PerformStep;
  const stepState = (step as PerformStep).state;
  step.props.catches.forEach((catchProps) => {
    const handlerStepIndex = this.getStepIndexById(catchProps.handler);
    const handlerChainable = this.getStepChain(scope, handlerStepIndex);
    stepState.addCatch(handlerChainable, catchProps);
  });
  const stepChain = this.hasNextStep(stepIndex)
    ? stepState.next(this.getStepChain(scope, stepIndex + 1))
    : stepState;
  return stepChain;
}

However, there was one snag as shown below.

Property 'addCatch' does not exist on type 'INextableState'. Did you mean '_addCatch'?

It turned out that addCatch is only on TaskStateBase. My solution was to remove PerformProps from the perform method and rename it to TryPerformProps. I then created a new method called tryPerform that takes a TaskStateBase instead.

tryPerform(state: sfn.TaskStateBase, props: BuilderTryPerformProps): StateMachineBuilder {
  this.steps.push(new TryPerformStep(state, props));
  return this;
}

I could then use my original approach for a new getTryPerformStepChain method, and was able to call the addCatch method as intended. The same pattern could then be repeated for map and parallel, but taking into account the fact that the catches are optional in these cases.

export default class StateMachineBuilder {
  // Snip
  private getMapStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
    // Snip
    if (step.props?.catches) {
      step.props.catches.forEach((catchProps) => {
        const handlerStepIndex = this.getStepIndexById(catchProps.handler);
        const handlerChainable = this.getStepChain(scope, handlerStepIndex);
        map.addCatch(handlerChainable, catchProps);
      });
    }
    // Snip
  }
  private getParallelStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
    // Snip
    if (step.props?.catches) {
      step.props.catches.forEach((catchProps) => {
        const handlerStepIndex = this.getStepIndexById(catchProps.handler);
        const handlerChainable = this.getStepChain(scope, handlerStepIndex);
        parallel.addCatch(handlerChainable, catchProps);
      });
    }
    // Snip
  }
}

All was looking promising, but running the unit tests resulted in the following failure.

+ expected - actual
         "expressionAttributeValues": {
           "$.Var1.$": "$.Var1"
         }
       }
-      "Resource": "${Token[TOKEN.241]}"
+      "Resource": "${Token[TOKEN.157]}"
       "Type": "Task"
     }

As far as I understand it, the issue here is down to the way that CDK generates placeholders in the definition to link to resources later on. For our purposes, we do not care what resource this will point to. Given this, I wrote the following method to replace all token references, such as [TOKEN.241], with the generic value [TOKEN.n].

function getComparableGraph(builderStateMachine: StateMachineWithGraph) {
  const graphJson = builderStateMachine.graphJson;
  const comparableGraphJson = graphJson.replace(/\[TOKEN\.[0-9]+\]/g, '[TOKEN.n]');
  return JSON.parse(comparableGraphJson);
}

In the unit tests, I amended the comparison to use the new method to compare the results.

expect(getComparableGraph(builderStateMachine)).to.deep.equal(
  getComparableGraph(cdkStateMachine)
);
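To see the effect of the normalisation in isolation, here is a small sketch that applies the same regular expression to two hand-written fragments. The normaliseTokens helper and the sample strings are illustrative assumptions, not code from the repo.

```typescript
// Hypothetical helper: the same regex as getComparableGraph, applied to a raw string.
function normaliseTokens(graphJson: string): string {
  return graphJson.replace(/\[TOKEN\.[0-9]+\]/g, '[TOKEN.n]');
}

// Two made-up fragments that differ only in the token number.
const builderFragment = '{"Resource":"${Token[TOKEN.241]}","Type":"Task"}';
const cdkFragment = '{"Resource":"${Token[TOKEN.157]}","Type":"Task"}';

// After normalisation, both read "${Token[TOKEN.n]}" and so compare as equal.
const fragmentsMatch = normaliseTokens(builderFragment) === normaliseTokens(cdkFragment);
```

Anything other than the token number is left untouched, so a genuine difference in the definition would still fail the comparison.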

With these changes, all the unit tests were passing and I felt pretty good. However, I thought about other state machine scenarios and, in particular, the scenario where there is a common downstream state as shown below.

This scenario is simple enough to define using our API.

const definition = new StateMachineBuilder()
  .choice('Choice1', {
    choices: [{ when: sfn.Condition.booleanEquals('$.var1', true), next: 'State2' }],
    otherwise: 'Choice2',
  })
  .choice('Choice2', {
    choices: [{ when: sfn.Condition.booleanEquals('$.var2', true), next: 'State2' }],
    otherwise: 'State1',
  })
  .perform(state1)
  .end()
  .perform(state2)
  .perform(state3)
  .end()

However, when testing I got the following error.

Error: State 'State2' already has a next state
    at Pass.makeNext (node_modules\@aws-cdk\aws-stepfunctions\lib\states\state.ts:287:13)
    at Pass.next (node_modules\@aws-cdk\aws-stepfunctions\lib\states\pass.ts:137:11)
    at StateMachineBuilder.getPerformStepChain (src\constructs\StateMachineBuilder-v1.ts:196:19)
    at StateMachineBuilder.getStepChain (src\constructs\StateMachineBuilder-v1.ts:163:26)

Thinking about it, this made sense. The code would have already traversed one path to State2 through a branch of Choice1, and then invoked the next method. Given this, my thought was to cache the IChainable values for all visited steps. We could then return the cached instance and avoid multiple calls to next.

export default class StateMachineBuilder {
  // Snip
  private readonly stepChainByIndex = new Map<number, sfn.IChainable>();
  // Snip
  private getStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
    const visitedStepChain = this.stepChainByIndex.get(stepIndex);
    if (visitedStepChain !== undefined) {
      return visitedStepChain;
    }
    // Snip
    this.stepChainByIndex.set(stepIndex, stepChain);
    return stepChain;
  }
  // Snip
}
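The effect of the cache can be reproduced without CDK. In the sketch below, FakeState is a hypothetical stand-in that, like the real states, refuses a second call to next. The getChain function and the two-state list are likewise assumptions for illustration.

```typescript
// Stand-in for a CDK state: it throws if next is called twice.
class FakeState {
  nextId?: string;
  constructor(readonly id: string) {}
  next(target: FakeState): FakeState {
    if (this.nextId !== undefined) {
      throw new Error(`State '${this.id}' already has a next state`);
    }
    this.nextId = target.id;
    return this;
  }
}

// Steps in definition order; each one flows to the following step unless it is last.
const states = [new FakeState('State2'), new FakeState('State3')];
const chainByIndex = new Map<number, FakeState>();

function getChain(index: number): FakeState {
  const visited = chainByIndex.get(index);
  if (visited !== undefined) {
    return visited; // Already wired up, so next is not called again
  }
  const state = states[index];
  const chain = index < states.length - 1 ? state.next(getChain(index + 1)) : state;
  chainByIndex.set(index, chain);
  return chain;
}

// Two choice branches both resolve State2; the second call is served from the cache.
const viaChoice1 = getChain(0);
const viaChoice2 = getChain(0);
```

Without the cache lookup at the top of getChain, the second call would reach state.next again and fail with the same error as in the stack trace above.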

Re-running the unit tests resulted in smiles all round, as the fix had done its job. However, another scenario came to mind, that of loops. I quickly put together another unit test, this time featuring a Choice state that looped back to higher up the state machine.

Again, it was quite simple to replicate the state machine using our API.

const definition = new StateMachineBuilder()
  .perform(state1)
  .choice('Choice1', {
    choices: [{ when: sfn.Condition.booleanEquals('$.var1', true), next: 'State1' }],
    otherwise: 'State2',
  })
  .perform(state2)

What was not quite so simple, was the following error.

Error: There is already a Construct with name 'Choice1' in Stack [BackwardsLoop-Builder]
    at Node.addChild (node_modules\constructs\src\construct.ts:534:13)
    ...snip...
    at new Choice (node_modules\@aws-cdk\aws-stepfunctions\lib\states\choice.ts:50:5)
    at StateMachineBuilder.getChoiceStepChain (src\constructs\StateMachineBuilder-v2.ts:258:23)
    at StateMachineBuilder.getStepChain (src\constructs\StateMachineBuilder-v2.ts:180:26)

This had me stumped for a bit. The solution came from looking at the CDK version.

const definition = sfn.Chain.start(
  state1.next(
    new sfn.Choice(definitionScope, 'Choice1')
      .when(sfn.Condition.booleanEquals('$.var1', true), state1)
      .otherwise(state2)
  )
);

Here, the when is being invoked on the Choice before the next is invoked on state1. In StateMachineBuilder, the when was being invoked after the next. The underlying code must be traversing the next link and trying to add Choice1 state for a second time.

The solution was to store the State instances in a lookup before recursively calling getStepChain. With this lookup in place, the getStepChain method could resolve a step to a visited state, even before that state had been wired up to any others.

export default class StateMachineBuilder {
  // Snip
  private readonly stepStateByIndex = new Map<number, sfn.State>();
  private getStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
    const visitedStepState = this.stepStateByIndex.get(stepIndex);
    if (visitedStepState !== undefined) {
      return visitedStepState;
    }
    // Snip
  }
  private getPerformStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
    const step = this.steps[stepIndex] as PerformStep;
    const stepState = (step as PerformStep).state;
    this.stepStateByIndex.set(stepIndex, stepState);
    const stepChain = this.hasNextStep(stepIndex)
      ? stepState.next(this.getStepChain(scope, stepIndex + 1))
      : stepState;
    return stepChain;
  }
  // Snip
}
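The register-before-recurse idea can be shown on a tiny looping graph. This is only a sketch: FakeNode, stepLinks and getNode are hypothetical names, and links are recorded as plain ids rather than CDK states.

```typescript
// Stand-in node: records its outgoing links by id, purely for illustration.
interface FakeNode {
  id: string;
  targets: string[];
}

// Hypothetical step list with a backwards loop: State1 -> Choice1 -> (State1 | State2)
const stepLinks: { id: string; nextIndexes: number[] }[] = [
  { id: 'State1', nextIndexes: [1] },
  { id: 'Choice1', nextIndexes: [0, 2] },
  { id: 'State2', nextIndexes: [] },
];

const nodeByIndex = new Map<number, FakeNode>();
let constructedCount = 0;

function getNode(index: number): FakeNode {
  const visited = nodeByIndex.get(index);
  if (visited !== undefined) {
    return visited; // May not be fully wired yet, but it is safe to link to
  }
  const node: FakeNode = { id: stepLinks[index].id, targets: [] };
  constructedCount += 1;
  nodeByIndex.set(index, node); // Register BEFORE recursing, so loops terminate
  for (const nextIndex of stepLinks[index].nextIndexes) {
    node.targets.push(getNode(nextIndex).id);
  }
  return node;
}

const loopStart = getNode(0);
```

If the registration happened after the loop over nextIndexes instead, the recursion from Choice1 back to State1 would construct State1 a second time, which is the sketch's analogue of the duplicate construct error above.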

This pattern was then repeated in the other methods and the tests were re-run. The result was success, but I was not quite convinced. I added a few more tests, but all seemed well. It appeared we had a working API. One other thing did bug me though, as I was aware of quite a bit of repeated code. As a result, I did a bit of refactoring whilst continuing to run the unit tests. The result can be found here.

Edit: The StateMachineBuilder component is now available on npm.

Designing a CDK State Machine Builder - Part 2

Andy Blackledge — Sat, 15 May 2021 07:54:56 GMT

In Part 1 of this series, I went through my process of designing an alternative API for defining state machines using CDK. In this part, I document my trials and tribulations of implementing that API.

All the code for this post can be found in this GitHub repo.

Edit: The resulting code is now available as an npm package.

One question I always ask myself before writing any code is how am I going to test it. That is, how can I have any confidence that the code is running as I expect? I might not take a full-blown Test-driven development (TDD) approach, but I need to have some sort of plan of how I am going to execute the code and verify the results. Ideally, this plan involves a straightforward way to both create and repeat those tests.

In Part 1, I created a set of examples covering different aspects of defining state machines using CDK. In Visualising a CDK State Machine using a custom Construct, I created a custom construct that outputs the graph JSON for such definitions. It seemed logical to me to combine the two; use the examples as test cases and compare the graph objects from the two implementations to verify. The result is shown below.

it('renders simple chain', async () => {
  const cdkStateMachine = new StateMachineWithGraph(new cdk.Stack(), 'SimpleChain-CDK', {
    getDefinition: (definitionScope): sfn.IChainable => {
      const state1 = new sfn.Pass(definitionScope, 'State1');
      // Define other states...
      const definition = sfn.Chain.start(
        state1.next(state2.next(state3.next(state4.next(state5.next(state6)))))
      );
      return definition;
    },
  });
  const builderStateMachine = new StateMachineWithGraph(new cdk.Stack(), 'SimpleChain-Builder', {
    getDefinition: (definitionScope): sfn.IChainable => {
      const state1 = new sfn.Pass(definitionScope, 'State1');
      // Define other states...
      const definition = new StateMachineBuilder()
        .perform(state1)
        // Perform other states...
        .build(definitionScope);
      return definition;
    },
  });
  const cdkGraph = JSON.parse(cdkStateMachine.graphJson);
  const builderGraph = JSON.parse(builderStateMachine.graphJson);
  expect(builderGraph).to.deep.equal(cdkGraph);
});

Running this test resulted in the following expected, but informative, failure.

AssertionError: expected { Object (StartAt, States) } to deeply equal { Object (StartAt, States) }
    at ...
 + expected - actual
 {
-  "StartAt": "TODO"
+  "StartAt": "State1"
   "States": {
-    "TODO": {
+    "State1": {
+      "Next": "State2"
+      "Type": "Pass"
+    }

With my testing strategy in place, I turned my attention to getting the test to pass.

The first method to implement was perform, where we supply a state to be added to the definition when we call the build method. To do this, we need to capture the details for the build method to use. I knew we would need to capture details for other methods, such as choice, so I created an interface and class to capture these. As TypeScript types are erased at runtime, I included a type enumeration to make introspection easy.

enum StepType {
  Perform = 'Perform',
}
interface BuilderStep {
  type: StepType;
  id: string;
}
class PerformStep implements BuilderStep {
  constructor(public state: sfn.State) {
    this.type = StepType.Perform;
    this.id = state.id;
  }
  type: StepType;
  id: string;
}
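As an aside, a discriminated union is another way to get the same runtime introspection, with the compiler narrowing the type inside each case. The following is a sketch of that alternative, not how the post's code is structured. All the names are hypothetical, and plain strings stand in for the CDK state.

```typescript
// Hypothetical step shapes: the literal 'type' field discriminates the union.
interface SketchPerformStep {
  type: 'Perform';
  id: string;
  stateName: string;
}
interface SketchChoiceStep {
  type: 'Choice';
  id: string;
  otherwise: string;
}
type SketchBuilderStep = SketchPerformStep | SketchChoiceStep;

function describeStep(step: SketchBuilderStep): string {
  switch (step.type) {
    case 'Perform':
      // Narrowed to SketchPerformStep, so stateName is available without a cast
      return `perform ${step.stateName}`;
    case 'Choice':
      // Narrowed to SketchChoiceStep
      return `choice ${step.id}, otherwise ${step.otherwise}`;
  }
}

const performDescription = describeStep({ type: 'Perform', id: 'State1', stateName: 'State1' });
const choiceDescription = describeStep({ type: 'Choice', id: 'Choice1', otherwise: 'State2' });
```

The trade-off is that the steps become plain objects rather than classes, which may or may not suit the rest of the builder.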

With this in place, I added a class-level array to StateMachineBuilder to hold the steps and updated the perform method to capture the details to perform.

private readonly steps = new Array();
perform(state: sfn.State): StateMachineBuilder {
  this.steps.push(new PerformStep(state));
  return this;
}

With this in place, I started to look at the build method and how we could use these details to build a CDK state machine definition. The build method takes a scope parameter and returns an instance that implements IChainable. In my mind, I could see that we would need to recurse through the steps to replicate the CDK approach. It seemed logical to me that the build method should just initiate the recursion by returning the IChainable for the first step. The getStepChain method would then recursively call itself to build the structure.

build(scope: cdk.Construct): sfn.IChainable {
  return this.getStepChain(scope, 0);
}
private getStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
  // TODO: Recursively call getStepChain
}

I could see that the getStepChain method would need to handle the various step types, so I added a switch and deferred the processing to a specific handler method.

private getStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
  const step = this.steps[stepIndex];
  let stepChain: sfn.IChainable;
  switch (step.type) {
    case StepType.Perform:
      stepChain = this.getPerformStepChain(scope, stepIndex);
      break;
    default:
      throw new Error(`Unhandled step type: ${JSON.stringify(step)}`);
  }
  return stepChain;
}

The getPerformStepChain method is the place where the real work was to be done. I.e., the place where the states would be wired together to build the state machine. The logic I had in mind was as follows.

Get the state for the current step
If there is a next step:
- Invoke the next method on the current step state, passing in the chain for the next step
Else
- Return the current step state

This was implemented as below.

private getPerformStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
  const step = this.steps[stepIndex] as PerformStep;
  const stepState = (step as PerformStep).state;
  const stepChain = stepIndex < this.steps.length - 1
    ? stepState.next(this.getStepChain(scope, stepIndex + 1))
    : stepState;
  return stepChain;
}

All looked fine, but there was a problem. I could see the following error.

Property 'next' does not exist on type 'State'. Did you mean '_next'?

I had assumed that the State class had a next method. However, by looking at the definition for the Pass state, I could see the following.

export declare class Pass extends State implements INextable

It turned out that the next method lives on a separate interface. What I wanted was something that encapsulated a State with a next method, so I created my own INextableState interface.

interface INextableState extends sfn.State, sfn.INextable {}

With this, I could replace the references to State and the problem with next went away. When I re-ran the unit test, all was good. We now had an alternative way of defining state machines in CDK. The only caveat being, they can only consist of a sequence of states. Good, but not that useful, so the next thing to look at was implementing choices.

In Part 1, we designed the API to define a choice as follows.

.choice('Choice1', {  choices: [{ when: sfn.Condition.booleanEquals('$.var1', true), next: 'Choice2' }],  otherwise: 'Choice3',})

As with perform, we need to capture these details in the choice method. To do this, I extended the StepType enumeration, created a ChoiceStep class, and amended the choice method to store a ChoiceStep instance containing the captured values.

enum StepType {
  // Snip
  Choice = 'Choice',
}
class ChoiceStep implements BuilderStep {
  constructor(public id: string, public props: BuilderChoiceProps) {
    this.type = StepType.Choice;
  }
  type: StepType;
}
export default class StateMachineBuilder {
  // Snip
  choice(id: string, props: BuilderChoiceProps): StateMachineBuilder {
    this.steps.push(new ChoiceStep(id, props));
    return this;
  }
}

With this in place, I could extend the getStepChain method to handle the Choice step type and call a new getChoiceStepChain method.

switch (step.type) {
  // Snip
  case StepType.Choice:
    stepChain = this.getChoiceStepChain(scope, stepIndex);
    break;

The implementation of the getChoiceStepChain required a slightly different approach, as it needed to instantiate the State as well as invoking the appropriate methods on it. It was for this reason that we added the scope parameter to the build method.

To build the resulting Choice state, I needed to invoke the when and otherwise methods with IChainable values. However, the choice method only captures the string id values. The solution was straightforward: create a getStepIndexById method to convert one to the other. I went with a simple linear lookup for now, but if performance was paramount, then an indexed lookup could be implemented.

private getStepIndexById(id: string): number {
  const stepIndex = this.steps.findIndex((s) => s.id === id);
  if (stepIndex === -1) {
    throw new Error(`Could not find index for id: ${id}`);
  }
  return stepIndex;
}
private getChoiceStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
  const step = this.steps[stepIndex] as ChoiceStep;
  const stepChain = new sfn.Choice(scope, step.id, step.props);
  step.props.choices.forEach((choice) => {
    const nextStepIndex = this.getStepIndexById(choice.next);
    const nextStepChain = this.getStepChain(scope, nextStepIndex);
    stepChain.when(choice.when, nextStepChain);
  });
  const otherwiseStepIndex = this.getStepIndexById(step.props.otherwise);
  const otherwiseStepChain = this.getStepChain(scope, otherwiseStepIndex);
  stepChain.otherwise(otherwiseStepChain);
  return stepChain;
}
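For comparison, the indexed lookup mentioned above might look something like the sketch below. The helper names and the StepWithId shape are assumptions for illustration. The first step with a given id wins, which mirrors the behaviour of findIndex.

```typescript
// Minimal stand-in for the steps captured by the builder.
interface StepWithId {
  id: string;
}

// Hypothetical helper: build the id -> index lookup once, rather than scanning on every call.
function buildStepIndexById(steps: StepWithId[]): Map<string, number> {
  const indexById = new Map<string, number>();
  steps.forEach((step, index) => {
    if (!indexById.has(step.id)) {
      indexById.set(step.id, index); // Keep the first match, as findIndex would
    }
  });
  return indexById;
}

// Same contract as getStepIndexById: return the index or throw an informative error.
function getIndexedStepIndexById(indexById: Map<string, number>, id: string): number {
  const stepIndex = indexById.get(id);
  if (stepIndex === undefined) {
    throw new Error(`Could not find index for id: ${id}`);
  }
  return stepIndex;
}

const sampleIndex = buildStepIndexById([{ id: 'Choice1' }, { id: 'State1' }, { id: 'State2' }]);
const state2Index = getIndexedStepIndexById(sampleIndex, 'State2');
```

For the handful of steps in a typical state machine, the linear scan is unlikely to matter, so this is more of a note for later than a necessary change.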

In the getStepIndexById method, I made sure to shout loudly and clearly when the id could not be found. In my experience, you will thank yourself if you throw informative errors when an unhandled value is encountered.

We were nearly there, but there was still one more piece of the choices puzzle. To separate the various end states of the state machine, we have calls to the end method as follows.

.perform(state1).end().perform(state2).end()

Our intention here was to tell the build method to stop recursing and so make the previous state an 'end' state. To do this, I needed to create a new BuilderStep and amend the end method to add an instance to the captured steps.

enum StepType {
  // Snip
  End = 'End',
}
class EndStep implements BuilderStep {
  constructor() {
    this.type = StepType.End;
  }
  id: string;
  type: StepType;
}
export default class StateMachineBuilder {
  // Snip
  end(): StateMachineBuilder {
    this.steps.push(new EndStep());
    return this;
  }
}

In getPerformStepChain, we had a test for whether we should continue and recursively add a 'next' state. This test relied on the final state being the last entry in the steps array. With the existence of the end steps, this assumption no longer held. To cater for this, I extended the test to look ahead for an 'end' step and encapsulated it in a hasNextStep method.

export default class StateMachineBuilder {
  // Snip
  private hasNextStep(stepIndex: number): boolean {
    const isLastStep = stepIndex === this.steps.length - 1;
    const isNextStepEnd = !isLastStep && this.steps[stepIndex + 1].type === StepType.End;
    const hasNextStep = !(isLastStep || isNextStepEnd);
    return hasNextStep;
  }
  private getPerformStepChain(scope: cdk.Construct, stepIndex: number): sfn.IChainable {
    // Snip
    const stepChain = this.hasNextStep(stepIndex)
      ? stepState.next(this.getStepChain(scope, stepIndex + 1))
      : stepState;
    return stepChain;
  }
}
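To make the look-ahead concrete, the sketch below lifts the same logic into a free function and runs it over a short step list. The hasNextStepIn name, the string step types and the sample list are assumptions for illustration only.

```typescript
type SketchStepType = 'Perform' | 'End';

// Same logic as hasNextStep, taking the steps as a parameter for illustration.
function hasNextStepIn(steps: { type: SketchStepType }[], stepIndex: number): boolean {
  const isLastStep = stepIndex === steps.length - 1;
  const isNextStepEnd = !isLastStep && steps[stepIndex + 1].type === 'End';
  return !(isLastStep || isNextStepEnd);
}

// Equivalent to: .perform(state1).perform(state2).end().perform(state3)
const sketchSteps: { type: SketchStepType }[] = [
  { type: 'Perform' },
  { type: 'Perform' },
  { type: 'End' },
  { type: 'Perform' },
];

// Ask the question for each perform step (indexes 0, 1 and 3).
const flowsOn = [0, 1, 3].map((index) => hasNextStepIn(sketchSteps, index));
// flowsOn is [true, false, false]: state1 flows on, state2 is followed by an end, state3 is last
```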

With this in place, I re-ran the unit tests and was met with unalloyed success. In the next part, I look to implement the map and parallel methods, and to implement functionality to add error handlers too.

Edit: The final StateMachineBuilder component is now available on npm.

Designing a CDK State Machine Builder - Part 1

Andy Blackledge — Wed, 05 May 2021 18:23:23 GMT

In my previous post, I was a little critical of the CDK approach to defining state machines. In this post, I attempt to design an alternative and share my approach to developing such an API. Whether the result is better or worse than the original is up to you.

The code for this post can be found on GitHub here.

Edit: The resulting code is now available as an npm package.

My main criticisms of the CDK approach are as follows:

Readability
Maintainability
Interaction with Prettier

Let us take a simple example where we have a state machine consisting of six states in series. Using the CDK approach, we would chain the states together as follows:

const definition = sfn.Chain.start(
  state1.next(state2.next(state3.next(state4.next(state5.next(state6)))))
);

For me, it doesn't score highly on the readability front. The syntax highlighting helps pick out the states, but the required nesting and resulting accumulation of brackets jars my eye. On maintainability, adding another state wouldn't be too bad, but it may not jump out in the pull request. As for Prettier, it doesn't have much to say in this example.

Being a critic is easy, but having a go yourself is another thing. The way I like to start is to just write the code as I would ideally like to express the problem. In this case, I have experimented in the problem area before and had the Fluent Builder Pattern as a possible solution. With this in mind, I wrote the following code:

const definition = new StateMachineBuilder()
  .perform(state1)
  .perform(state2)
  .perform(state3)
  .perform(state4)
  .perform(state5)
  .perform(state6)
  .build();

The idea here is that we add states sequentially to the instantiated StateMachineBuilder, and when complete we call the build method to return a definition. The advantage is that we can easily see the states and their order, we can easily reorder them, and that Prettier will format nicely for us.

To get the code to compile, I created a skeleton implementation for StateMachineBuilder.

export default class StateMachineBuilder {
  perform(state: sfn.State): StateMachineBuilder {
    return this;
  }
  build(): sfn.IChainable {
    throw new Error('build not implemented yet');
  }
}

At this point in the development, I was not overly concerned about fleshing out the implementation. For me, the danger of adding implementation at this stage is that you make it harder to rework the API as you explore the problem space. This runs the risk of tying yourself into abstractions and syntax that you have to live with forever, and that may well have benefitted from refinement.

At this stage, I was content if I could envisage how the implementation would work. In this case, I envisaged the StateMachineBuilder accumulating the states, and then wiring them up when build is called. Given this, I was happy to proceed to the next example, choices.
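The accumulate-then-wire idea can be sketched in a few lines without CDK. This is only a rough illustration of the shape of the approach: SketchState and SketchBuilder are hypothetical stand-ins, and the eventual implementation may differ.

```typescript
// Stand-in for a chainable state; not the CDK type.
class SketchState {
  nextState?: SketchState;
  constructor(readonly id: string) {}
}

class SketchBuilder {
  private readonly states: SketchState[] = [];

  // Accumulate only; nothing is wired up yet.
  perform(state: SketchState): SketchBuilder {
    this.states.push(state);
    return this;
  }

  // Wire each state to its successor and return the first one.
  build(): SketchState {
    if (this.states.length === 0) {
      throw new Error('No states to build');
    }
    this.states.forEach((state, index) => {
      state.nextState = this.states[index + 1]; // undefined for the last state
    });
    return this.states[0];
  }
}

const firstState = new SketchBuilder()
  .perform(new SketchState('State1'))
  .perform(new SketchState('State2'))
  .perform(new SketchState('State3'))
  .build();
```

Because nothing is linked until build is called, the order of the perform calls can be changed freely while the API is still being explored.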

For the choices example, I created a CDK definition for the following flow:

The result was as follows:

const definition = sfn.Chain.start(
  new sfn.Choice(definitionScope, 'Choice1')
    .when(
      sfn.Condition.booleanEquals('$.var1', true),
      new sfn.Choice(definitionScope, 'Choice2')
        .when(sfn.Condition.booleanEquals('$.var2', true), state1)
        .otherwise(state2)
    )
    .otherwise(
      new sfn.Choice(definitionScope, 'Choice3')
        .when(sfn.Condition.booleanEquals('$.var2', true), state3)
        .otherwise(state4)
    )
);

On the readability front, I found the distance between the when and otherwise for Choice1 to be less than ideal. However, at least Prettier had done a decent job with providing meaningful indentation in this case.

Using the fluent builder approach, I played around with various syntaxes and settled on the following approach:

const definition = new StateMachineBuilder()
  .choice('Choice1', {
    choices: [{ when: sfn.Condition.booleanEquals('$.var1', true), next: 'Choice2' }],
    otherwise: 'Choice3',
  })
  .choice('Choice2', {
    choices: [{ when: sfn.Condition.booleanEquals('$.var2', true), next: 'State1' }],
    otherwise: 'State2',
  })
  .choice('Choice3', {
    choices: [{ when: sfn.Condition.booleanEquals('$.var3', true), next: 'State3' }],
    otherwise: 'State4',
  })
  .perform(state1)
  .end()
  .perform(state2)
  .end()
  .perform(state3)
  .end()
  .perform(state4)
  .end()
  .build(definitionScope);

The first big choice (no pun intended) was to have the builder instantiate the Choice objects itself. A knock-on effect of this, is that the build method now needs to take the scope under which the Choice objects are created. This does have the benefit of not having to specify the scope with every choice.

The next choice was how to do the branching. In contrast to the CDK approach, I decided on using references to the id values of the states. Whilst this could lead to errors, I could envisage the build method doing validation and picking these up. Given this, I was happy to go with this approach, as it avoids the problem of ever-increasing indentation as the branches get more and more nested.
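The kind of validation envisaged here could be as simple as checking every referenced id against the ids that were actually declared. The sketch below is one hypothetical way to do it. ReferencingStep and findUnknownReferences are illustrative names, not part of the builder.

```typescript
// Minimal stand-in: a step may have an id and may reference other steps by id.
interface ReferencingStep {
  id?: string;
  references: string[];
}

// Hypothetical validation pass: collect every referenced id that no step declares.
function findUnknownReferences(steps: ReferencingStep[]): string[] {
  const knownIds = steps.map((step) => step.id);
  const unknown: string[] = [];
  for (const step of steps) {
    for (const reference of step.references) {
      if (knownIds.indexOf(reference) === -1 && unknown.indexOf(reference) === -1) {
        unknown.push(reference);
      }
    }
  }
  return unknown;
}

const unknownReferences = findUnknownReferences([
  { id: 'Choice1', references: ['State1', 'Stat2'] }, // 'Stat2' is a deliberate typo
  { id: 'State1', references: [] },
  { references: [] }, // a step without an id, such as an end marker
  { id: 'State2', references: [] },
]);
// unknownReferences contains only 'Stat2'
```

Running a check like this at the start of build would let all the bad references be reported in one go, rather than failing on the first one encountered during traversal.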

I originally went with the approach that the choice would drop through to the next step in the flow. However, after experimenting with writing a few mock examples, I felt that having an explicit otherwise made the code more readable, whilst also having the benefit of matching the terminology of CDK. This reworking was very straightforward, as I had yet to write any implementation. All I needed to write, was just enough code to make the examples compile.

To facilitate this, I needed to extend the CDK ChoiceProps to allow the choice to be defined with the core CDK properties, along with an array of choices and the alternative.

interface BuilderChoice {
  when: sfn.Condition;
  next: string;
}
interface BuilderChoiceProps extends sfn.ChoiceProps {
  choices: BuilderChoice[];
  otherwise: string;
}

This example brought to light the need for an end method. This indicates to the StateMachineBuilder that the previous state is a terminal one.

Another state machine control structure is the Map state. This state selects an array of items from the input and invokes an inner state machine for each item. I considered the following example, where two Map states each invokes inner state machines with four sequential steps.

Using raw CDK, this is defined as:

const definition = sfn.Chain.start(
  new sfn.Map(definitionScope, 'Map1', {
    itemsPath: '$.Items1',
  })
    .iterator(state1.next(state2.next(state3.next(state4))))
    .next(
      new sfn.Map(definitionScope, 'Map2', {
        itemsPath: '$.Items2',
      }).iterator(state5.next(state6.next(state7.next(state8))))
    )
);

In this case, I felt that Prettier was not helping and the result was difficult to read. For me, it was hard to see that the iterator was part of the Map, and that there were two Map states chained together.

A bit of experimentation with various syntaxes later, I settled on defining this scenario as follows:

const definition = new StateMachineBuilder()
  .map('Map1', {
    itemsPath: '$.Items1',
    iterator: new StateMachineBuilder()
      .perform(state1)
      .perform(state2)
      .perform(state3)
      .perform(state4),
  })
  .map('Map2', {
    itemsPath: '$.Items2',
    iterator: new StateMachineBuilder()
      .perform(state5)
      .perform(state6)
      .perform(state7)
      .perform(state8),
  })
  .build(definitionScope);

In this case, Prettier has done a splendid job for us and, IMHO, it is very clear as to what is going on. If we needed to reorder the steps, then that would be very straightforward indeed. The key to this was supplying the map method with a BuilderMapProps instance.

interface BuilderMapProps extends sfn.MapProps {
  iterator: StateMachineBuilder;
}

The question that I had to consider was whether I could make the implementation work. I could envisage the outer build method traversing the states and, for Map states, invoking the build method on any iterator values with the scope passed to it. This would give us the definition to supply to the CDK iterator method. Confident this would probably work, I moved onto how to implement Parallel states.
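The traversal I had in mind can be sketched without the CDK at all. Everything below is a hypothetical stand-in: the SketchBuilder class, its step types, and the use of a plain string for the scope are assumptions made so that the example runs on its own. A real implementation would create CDK states in the supplied scope rather than strings.

```typescript
// Hypothetical, CDK-free model of the builder. All names are illustrative only.
type Step =
  | { kind: 'perform'; id: string }
  | { kind: 'map'; id: string; iterator: SketchBuilder };

class SketchBuilder {
  private readonly steps: Step[] = [];

  perform(id: string): this {
    this.steps.push({ kind: 'perform', id });
    return this;
  }

  map(id: string, props: { iterator: SketchBuilder }): this {
    this.steps.push({ kind: 'map', id, iterator: props.iterator });
    return this;
  }

  // Walks the recorded steps in order. For a map step, the nested builder is
  // built with the same scope, mirroring the idea of passing the scope down.
  build(scope: string): string {
    return this.steps
      .map((step) =>
        step.kind === 'map'
          ? `${scope}/${step.id}[${step.iterator.build(scope)}]`
          : `${scope}/${step.id}`
      )
      .join(' -> ');
  }
}

const sketch = new SketchBuilder()
  .map('Map1', {
    iterator: new SketchBuilder().perform('State1').perform('State2'),
  })
  .perform('State3')
  .build('Scope');
```

The point of the sketch is that nothing is created until build is called, so the nested builder can be handed whatever scope the outer one receives.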

One of the very nice features of step functions is that you can easily set up tasks to be performed in parallel, with the infrastructure taking care of the heavy lifting for you. Consider the following state machine, where we have two Parallel states that each have two branches to be executed in parallel.

In CDK, and formatted by Prettier, the resulting definition is as follows:

const definition = sfn.Chain.start(
  new sfn.Parallel(definitionScope, 'Parallel1')
    .branch(state1.next(state2))
    .branch(state3.next(state4))
    .next(
      new sfn.Parallel(definitionScope, 'Parallel2')
        .branch(state5.next(state6))
        .branch(state7.next(state8))
    )
);

As with the Map example, Prettier doesn't do a great job with readability. For the syntax, I took inspiration from the approach for map and came up with the following.

const definition = new StateMachineBuilder()
  .parallel('Parallel1', {
    branches: [
      new StateMachineBuilder().perform(state1).perform(state2),
      new StateMachineBuilder().perform(state3).perform(state4),
    ],
  })
  .parallel('Parallel2', {
    branches: [
      new StateMachineBuilder().perform(state5).perform(state6),
      new StateMachineBuilder().perform(state7).perform(state8),
    ],
  })
  .build(definitionScope);

This approach resulted in the branches being clearly nested within the parent parallel methods. As with map, I needed to provide an extended properties instance to define them.

interface BuilderParallelProps extends sfn.ParallelProps {
  branches: StateMachineBuilder[];
}

The final step function control structure to tackle was catch blocks. These allow errors generated by states to be caught, and then the flow of the state machine routed to recovery processing. Consider the following flow where all the states have catch blocks to recover from a variety of errors.

This is expressed in CDK as follows:

const definition = sfn.Chain.start(
  function1
    .addCatch(catch1, { errors: ['States.Timeout'] })
    .addCatch(catch2, { errors: ['States.All'] })
    .next(
      function2
        .addCatch(catch3, { errors: ['States.Timeout'] })
        .addCatch(catch4, { errors: ['States.All'] })
        .next(
          new sfn.Map(definitionScope, 'Map1', {
            itemsPath: '$.Items1',
          })
            .iterator(state1.next(state2))
            .addCatch(catch5)
            .next(
              new sfn.Parallel(definitionScope, 'Parallel1')
                .branch(state3, state4)
                .addCatch(catch6)
            )
        )
    )
);

As with some of the other examples, I felt that Prettier had resulted in the essence of the flow being lost. It certainly wasn't clear to me that, at the top level, there were four sequential states, each with error handlers. I imagined trying to re-order them and hoping to get the brackets right.

It made sense to me to have the handlers as properties of the state. With this in mind, I created BuilderCatchProps and added it to the properties for the perform, map, and parallel methods.

interface BuilderCatchProps extends sfn.CatchProps {
  handler: string;
}

interface BuilderPerformProps {
  catches?: BuilderCatchProps[];
}

interface BuilderParallelProps extends sfn.ParallelProps {
  branches: StateMachineBuilder[];
  catches?: BuilderCatchProps[];
}

interface BuilderMapProps extends sfn.MapProps {
  iterator: StateMachineBuilder;
  catches?: BuilderCatchProps[];
}

Now we could rewrite the CDK version as the following.

const definition = new StateMachineBuilder()
  .perform(function1, {
    catches: [
      { errors: ['States.Timeout'], handler: 'Catch1' },
      { errors: ['States.All'], handler: 'Catch2' },
    ],
  })
  .perform(function2, {
    catches: [
      { errors: ['States.Timeout'], handler: 'Catch3' },
      { errors: ['States.All'], handler: 'Catch4' },
    ],
  })
  .map('Map1', {
    iterator: new StateMachineBuilder().perform(state1).perform(state2),
    catches: [{ handler: 'Catch5' }],
  })
  .parallel('Parallel1', {
    branches: [
      new StateMachineBuilder().perform(state3),
      new StateMachineBuilder().perform(state4),
    ],
    catches: [{ handler: 'Catch6' }],
  })
  .end()
  .perform(catch1)
  .end()
  .perform(catch2)
  .end()
  .perform(catch3)
  .end()
  .perform(catch4)
  .end()
  .perform(catch5)
  .end()
  .perform(catch6)
  .end()
  .build(definitionScope);

Whilst the definition is considerably lengthier than the CDK version, I felt that the essence of the flow was well-separated from the exception handling. It also led me to consider a performAndEnd method, with the aim of making the definition a bit briefer. However, at this stage I felt that keeping the syntax simple was the way to go.

At this point, I had an alternative way of defining state machines in CDK. At least in theory. It is more verbose, that is for sure, but - IMHO - it is more readable, more maintainable, and plays nicer with Prettier. This API was developed with the implementation in mind, but without committing to one. This allowed me to iterate very quickly over different ways of expressing the problem with code, until I found one that I felt was as good as I could make it.

In the next post, we shall see how I get on with implementing the theory.

Edit: The resulting StateMachineBuilder component is now available on npm.

Visualising a CDK State Machine using a custom Construct

Andy Blackledge — Sun, 25 Apr 2021 15:28:28 GMT

Previously, I went through the process of converting a JSON-based step function using SAM, to a code-based step function using CDK. One of the challenges I faced was visualising the final result. My workaround was to use the AWS Toolkit to download the deployed definition and render that to a graph. In this post, I go through the process of creating a CDK construct that allows us to have the definition generated locally.

Edit: The resulting code is now available as an npm package.

Given that CDK works by generating CloudFormation, it stood to reason that the CDK must have functionality to render the state machine definitions to a format that could be rendered by the AWS Toolkit. A search of the documentation revealed that the StateGraph class has a toGraphJson method which promises to:

Return the Amazon States Language JSON for this graph.

All very promising, so armed with this information, I created my first attempt:

export default class StateMachineWithGraph extends sfn.StateMachine {  //  readonly graphJson: string;  constructor(scope: cdk.Construct, id: string, props: sfn.StateMachineProps) {    //    super(scope, id, props);    const stateGraph = new sfn.StateGraph(      props.definition.startState,      'Temporary graph to render to JSON'    );    this.graphJson = (stateGraph.toGraphJson() as unknown) as string;  }}

I had to create a StateGraph instance, as I could not see how to access this internal class in the base StateMachine. Still, it all looked very plausible. I amended my Stack to use the construct (a drop-in replacement) and added a call to a new writeGraphJson method to write out the result to a file.

private static writeGraphJson(stateMachine: StateMachineWithGraph): void {  //  const stateMachinePath = path.join(__dirname, 'stateMachines');  if (!fs.existsSync(stateMachinePath)) fs.mkdirSync(stateMachinePath);  fs.writeFileSync(    path.join(stateMachinePath, `${stateMachine.node.id}.asl.json`),    stateMachine.graphJson  );}

My hopes were high, so I ran cdk synth to exercise the code. The result, as often happens in software, was failure:

Error: Trying to use state 'Pass' in Temporary graph to render to JSON (Pass), but is already in State Machine Test definition (Pass). Every state can only be used in one graph.

The problem is, as the error explains, that a state can only be part of a single graph and we are adding our states to multiple ones for the purposes of rendering. My solution was to replace the definition property with the following function property, which allows the states to be generated multiple times.

export interface StateMachineWithGraphProps extends Omit<sfn.StateMachineProps, 'definition'> {
  getDefinition: (scope: cdk.Construct) => sfn.IChainable;
}

Here I am using the Omit helper type to selectively replace the definition property. With this interface, I amended the construct to call the new function twice.
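As a standalone illustration of the pattern, here is a minimal sketch that compiles without the CDK. Omit takes the source type and the key to drop. The MachineProps, ChainLike, and MachineWithGraphProps types are hypothetical stand-ins for the CDK types, and a plain string stands in for the scope.

```typescript
// Hypothetical stand-ins for the CDK types, so the example compiles on its own.
interface ChainLike {
  startId: string;
}

interface MachineProps {
  definition: ChainLike;
  stateMachineName?: string;
}

// Same shape as MachineProps, except 'definition' is swapped for a factory.
interface MachineWithGraphProps extends Omit<MachineProps, 'definition'> {
  getDefinition: (scope: string) => ChainLike;
}

const graphProps: MachineWithGraphProps = {
  stateMachineName: 'Example',
  getDefinition: (scope) => ({ startId: `${scope}/Start` }),
};

// The factory can be called once per scope, yielding independent definitions.
const first = graphProps.getDefinition('ScopeA');
const second = graphProps.getDefinition('ScopeB');
```

Because each call returns a fresh object, the two definitions never share state, which is the property needed to render the graph separately from the deployed state machine.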

The first call is with the construct scope, and provides the definition for the base construct. For the second call, we supply a new Stack as a separate scope.

export default class StateMachineWithGraph extends sfn.StateMachine {  //  readonly graphJson: string;  constructor(scope: cdk.Construct, id: string, props: StateMachineWithGraphProps) {    //    super(scope, id, {      ...props,      definition: props.getDefinition(scope),    });    const stateGraph = new sfn.StateGraph(      props.getDefinition(new cdk.Stack()).startState,      'Temporary graph to render to JSON'    );    this.graphJson = (stateGraph.toGraphJson() as unknown) as string;  }}

This did mean a change to the original Stack, as we now need to supply a function that generates a definition, rather than the definition itself.

const processApplicationStateMachine = new StateMachineWithGraph(  this,  'ProcessApplicationStateMachine',  {    getDefinition: (definitionScope): sfn.IChainable =>      sfn.Chain.start(        performIdentityChecks          .next(aggregateIdentityResults)          .next(            new sfn.Choice(definitionScope, 'EvaluateIdentityResults')              .when(overallIdentityResultIsFalse, performDeclineTasks)              .otherwise(                performAffordabilityCheck.next(                  new sfn.Choice(definitionScope, 'EvaluateAffordabilityResult')                    .when(affordabilityResultIsBad, performDeclineTasks)                    .when(affordabilityResultIsPoor, performReferTasks)                    .otherwise(performAcceptTasks)                )              )          )      ),  });

Once more, I tried cdk synth and once more found failure:

Error: SingletonFunction at 'AggregateIdentityResultsExpression/EvalFunction' should be created in the scope of a Stack, but no Stack found

I was pretty sure it was being created in the scope of a Stack, but it did strike me that it wasn't being created in the same scope as the definition. Given this, I moved it into the definition function.

const processApplicationStateMachine = new StateMachineWithGraph(  this,  'ProcessApplicationStateMachine',  {    getDefinition: (definitionScope): sfn.IChainable => {      //      const aggregateIdentityResults = new sfnTasks.EvaluateExpression(        definitionScope,        'AggregateIdentityResultsExpression',        {          expression: '($.identityResults).every((r) => r.success)',          resultPath: '$.overallIdentityResult',        }      );      return sfn.Chain.start(        performIdentityChecks          .next(aggregateIdentityResults)          ...

Success, at least in terms of a new error message:

Error: State 'PerformIdentityChecks' already has a next state

I now kicked myself, as I should have guessed that all the states need to be created in the same scope as the definition. The logical outcome of this was to create a new method that took a scope, created the states, and then returned the definition:

private getProcessApplicationDefinition(scope: cdk.Construct): sfn.IChainable

This resulted in the rather clean result below:

const processApplicationStateMachine = new StateMachineWithGraph(  this,  'ProcessApplicationStateMachine',  {    getDefinition: (definitionScope): sfn.IChainable =>      this.getProcessApplicationDefinition(definitionScope),  });

Surely success would be ours now. A quick cdk synth revealed that this was not the case:

TypeError [ERR_INVALID_ARG_TYPE]: The "data" argument must be of type string or an instance of Buffer, TypedArray, or DataView. Received an instance of Object    at Object.writeFileSync (fs.js:1429:5)

The reason for this is that, rather annoyingly, the toGraphJson method does not actually return JSON. To get the JSON, we need to do the following:

this.graphJson = JSON.stringify(stateGraph.toGraphJson(), null, 2);
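The difference between the cast in my earlier attempts and the call above can be seen in a small standalone sketch. The graphObject value is a hypothetical stand-in for whatever toGraphJson returns; the only assumption is that it is a plain object, as the error message indicated.

```typescript
// Hypothetical stand-in for the object returned by toGraphJson().
const graphObject: object = {
  StartAt: 'PerformIdentityChecks',
  States: { PerformIdentityChecks: { Type: 'Pass', End: true } },
};

// A type assertion changes nothing at runtime: the value is still an object.
const assertedOnly = (graphObject as unknown) as string;

// JSON.stringify produces a real string; the third argument indents by 2 spaces.
const graphJsonText = JSON.stringify(graphObject, null, 2);
```

The double cast only silences the compiler, which is why the failure surfaced later in writeFileSync rather than at compile time.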

A quick spin of cdk synth and - lo and behold - we have the following renderable result:

{  "StartAt": "PerformIdentityChecks",  "States": {    "PerformIdentityChecks": {      "Type": "Map",      "ResultPath": "$.identityResults",      "Next": "AggregateIdentityResultsExpression",      "InputPath": "$.application",      "Iterator": {        "StartAt": "PerformIdentityCheck",        "States": {          "PerformIdentityCheck": {            "End": true,            "Retry": [...

Now with this construct, we can automatically generate the graph JSON whenever the stack is synthesised or deployed. The code for this post can be found on GitHub here.

As I mentioned previously, I am not overly keen on the readability of state machine definitions in CDK. Having a visual graph certainly helps in this, but I have a much more radical idea to improve things. That is a subject for another post.

Edit: The StateMachineWithGraph component is now available on npm.

Converting an AWS Step Function to use CDK - Part 2

Andy Blackledge — Sun, 18 Apr 2021 21:11:58 GMT

In Part 1, we started to convert a state machine from an ASL definition used by SAM, to a fluent definition written in CDK. The full graph of the state machine is shown below:

So far, I managed, thanks to the payloadResponseOnly property, to convert the initial states that perform and aggregate identity checks.

const processApplicationStateMachine = new sfn.StateMachine(  this,  'ProcessApplicationStateMachine',  {    definition: sfn.Chain.start(      new sfn.Map(this, 'PerformIdentityChecks', {      inputPath: '$.application',      itemsPath: '$.applicants',      resultPath: '$.identityResults',      })      .iterator(        new sfnTasks.LambdaInvoke(this, 'PerformIdentityCheck', {        lambdaFunction: performIdentityCheckFunction,        payloadResponseOnly: true,        })      )      .next(        new sfnTasks.LambdaInvoke(this, 'AggregateIdentityResults', {        lambdaFunction: aggregateIdentityResultsFunction,        payloadResponseOnly: true,        inputPath: '$.identityResults',        resultPath: '$.overallIdentityResult',        })      )    ),  });

The next step was to add a Choice state to decide whether to continue with the application or to decline it and perform the associated tasks. To start, I put in some placeholder Pass states and did a test deployment to check I had the syntax correct.

.next(  new sfn.Choice(this, 'EvaluateIdentityResults')    .when(      sfn.Condition.booleanEquals('$.overallIdentityResult', false),      new sfn.Pass(this, 'PerformDeclineTasks')    )    .otherwise(new sfn.Pass(this, 'PerformAffordabilityCheck')))

Now I needed to replace the PerformDeclineTasks Pass state with the real one. Looking at the original graph, the PerformDeclineTasks state needs to be referenced from two different states in the graph.

This meant that I couldn't declare the state in line with the rest of the definition, unless I wanted to duplicate the states. To be honest, the definition was getting quite unwieldy anyway, so I started refactoring and creating constants to hold the states. For example, the Map state to perform the identity checks became the following.

const performIdentityChecks = new sfn.Map(this, 'PerformIdentityChecks', {  inputPath: '$.application',  itemsPath: '$.applicants',  resultPath: '$.identityResults',}).iterator(  new sfnTasks.LambdaInvoke(this, 'PerformIdentityCheck', {    lambdaFunction: performIdentityCheckFunction,    payloadResponseOnly: true,  }));

In addition to creating constants for the states, we can create constants for the conditions as well. This gives us the added benefit of being able to create meaningful names for the conditions too, e.g.:

const overallIdentityResultIsFalse = sfn.Condition.booleanEquals(  '$.overallIdentityResult',  false);

The overall result was a much more succinct definition:

const processApplicationStateMachine = new sfn.StateMachine(  this,  'ProcessApplicationStateMachine',  {    definition: sfn.Chain.start(      performIdentityChecks        .next(aggregateIdentityResults)        .next(          new sfn.Choice(this, 'EvaluateIdentityResults')            .when(overallIdentityResultIsFalse, performDeclineTasks)            .otherwise(new sfn.Pass(this, 'PerformAffordabilityCheck')) // Placeholder        )    ),  });

One thing that caused a stumble was that the original definition for the PerformDeclineTasks state used the Parameters property for the SendDeclineEmail state to pass in a combination of static and dynamic values.

PerformDeclineTasks:
  Type: Parallel
  End: true
  Branches:
  - StartAt: SendDeclineEmail
    States:
      SendDeclineEmail:
        Type: Task
        Resource: "${SendEmailFunctionArn}"
        Parameters:
          emailType: Decline
          application.$: "$.application"
        End: true

It wasn't immediately obvious to me how to do this with CDK, as there was no Parameters property on LambdaInvokeProps. A bit of digging led me to the following AWS article: Task parameters from the state JSON

This pointed me towards the payload property and the TaskInput class. Using these I could replicate what was being achieved in the original flow, as shown below:

const performDeclineTasks = new sfn.Parallel(this, 'PerformDeclineTasks').branch(  new sfnTasks.LambdaInvoke(this, 'SendDeclineEmail', {    lambdaFunction: sendEmailFunction,    payloadResponseOnly: true,    payload: sfn.TaskInput.fromObject({      emailType: 'Decline',      'application.$': '$.application',    }),  }));
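To make the mix of static and dynamic values concrete, here is a rough sketch of how keys ending in .$ are resolved against the state input. This is not the Step Functions implementation: the resolveParameters function is hypothetical and only handles single-level paths such as $.application.

```typescript
type JsonObject = { [key: string]: unknown };

// Simplified resolver: keys ending in '.$' hold a single-level '$.key' path
// that is looked up in the state input; other keys pass through unchanged.
function resolveParameters(parameters: JsonObject, stateInput: JsonObject): JsonObject {
  const resolved: JsonObject = {};
  for (const key of Object.keys(parameters)) {
    const value = parameters[key];
    if (key.slice(-2) === '.$' && typeof value === 'string') {
      // Drop the '.$' suffix from the key and the '$.' prefix from the path.
      resolved[key.slice(0, -2)] = stateInput[value.slice(2)];
    } else {
      resolved[key] = value;
    }
  }
  return resolved;
}

const emailPayload = resolveParameters(
  { emailType: 'Decline', 'application.$': '$.application' },
  { application: { applicants: ['A'] }, identityResults: [] }
);
```

In this sketch the function receives emailType as the literal 'Decline' and application as the value taken from the state input.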

After refactoring all states and conditions, I ended up with the following state machine definition. The full source code can be found here.

const processApplicationStateMachine = new sfn.StateMachine(  this,  'ProcessApplicationStateMachine',  {    definition: sfn.Chain.start(      performIdentityChecks        .next(aggregateIdentityResults)        .next(          new sfn.Choice(this, 'EvaluateIdentityResults')            .when(overallIdentityResultIsFalse, performDeclineTasks)            .otherwise(              performAffordabilityCheck.next(                new sfn.Choice(this, 'EvaluateAffordabilityResult')                  .when(affordabilityResultIsBad, performDeclineTasks)                  .when(affordabilityResultIsPoor, performReferTasks)                  .otherwise(performAcceptTasks)              )            )        )    ),  });

Here are my thoughts on the end result:

The nesting is getting quite deep, and that is with only a couple of decisions.
I don't find it particularly readable to my eye, despite my refactoring attempts.
The hierarchical nature doesn't lend itself to easy editing. It is easy to cut and paste the wrong part and get lost with all the brackets.
There is no local visualisation. I had to deploy to AWS and then use the AWS Toolkit to get the definition.

Overall, I was not as impressed as I had hoped with using this approach for step functions. The fluent syntax promised something, but didn't quite deliver for me. I use the Prettier extension for VS Code for formatting, and this syntax didn't seem to play well with it.

However, I have some ideas to help address my concerns. First up, will be a CDK Construct to generate the ASL for local visualisation.

P.S. Evaluate Expression

On this journey, I stumbled across Evaluate Expression tasks, described by the AWS documentation as follows:

Use the EvaluateExpression to perform simple operations referencing state paths. The expression referenced in the task will be evaluated in a Lambda function (eval()). This allows you to not have to write Lambda code for simple operations.

Armed with this knowledge, I was able to replace the aggregateIdentityResults Lambda function with the following:

const aggregateIdentityResults = new sfnTasks.EvaluateExpression(  this,  'AggregateIdentityResultsExpression',  {    expression: '($.identityResults).every((r) => r.success)',    resultPath: '$.overallIdentityResult',  });

Note the brackets around $.identityResults in the expression. Without these, the engine tries to replace a placeholder called $.identityResults.every and gets very upset indeed.
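The effect of the brackets can be illustrated with a greedy pattern match. The pattern below is an assumption made for illustration and is not taken from the CDK source; the real placeholder matching may differ in detail.

```typescript
// Assumed, simplified placeholder pattern: '$.' followed by word characters and dots.
// The real CDK pattern may differ; this only illustrates the greedy matching.
const placeholderPattern = /\$\.[A-Za-z0-9_.]+/g;

function findPlaceholders(expression: string): string[] {
  return expression.match(placeholderPattern) ?? [];
}

// Without brackets, the match runs on through '.every'.
const withoutBrackets = findPlaceholders('$.identityResults.every((r) => r.success)');

// With brackets, the closing ')' ends the match at the intended path.
const withBrackets = findPlaceholders('($.identityResults).every((r) => r.success)');
```

Under this assumed pattern, the first call finds $.identityResults.every and the second finds $.identityResults.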

Converting an AWS Step Function to use CDK - Part 1

Andy Blackledge — Thu, 15 Apr 2021 20:13:22 GMT

If you like fluent coding, then AWS Cloud Development Kit step function definitions look right up your street. However, things are not as straightforward as you might think.

In my previous post Easier Step Functions with the AWS Toolkit for VS Code, I extolled the virtues of using the AWS Toolkit in conjunction with the AWS Serverless Application Model (SAM). Now, I will propose a completely different way of doing things.

This post assumes that you have some familiarity with the CDK. If you do not, then the Getting started with the AWS CDK guide is the best place to start. The following quote from the guide does a good job of providing an overview of the key concept behind CDK.

"An AWS CDK app is an application written in TypeScript, JavaScript, Python, Java, or C# that uses the AWS CDK to define AWS infrastructure. An app defines one or more stacks. Stacks (equivalent to AWS CloudFormation stacks) contain constructs, each of which defines one or more concrete AWS resources, such as Amazon S3 buckets, Lambda functions, Amazon DynamoDB tables, and so on."

For this post I took a copy of the original SAM-based repo, and then amended it to use CDK. The result can be found here.

Step functions in CDK require references to the functions they invoke. In the demo project all the functions are in the same file and follow a naming convention. This enabled me to create the following method in the Stack class:

  private addFunction(functionName: string): lambda.Function {    return new lambdaNodejs.NodejsFunction(this, `${functionName}Function`, {      entry: path.join(__dirname, '..', 'src', 'functions', 'index.ts'),      handler: `handle${functionName}`,    });  }

With this in place, I was able to declare all the required functions as follows:

    const performIdentityCheckFunction = this.addFunction('PerformIdentityCheck');    const aggregateIdentityResultsFunction = this.addFunction('AggregateIdentityResults');    const performAffordabilityCheckFunction = this.addFunction('PerformAffordabilityCheck');    const sendEmailFunction = this.addFunction('SendEmail');    const notifyUnderwriterFunction = this.addFunction('NotifyUnderwriter');

I could now turn my attention to converting the step function itself. The step function is a simplified flow that processes a loan application. The first steps run an identity check for each applicant and then aggregate the results.

With SAM, this was defined with the following YAML:

  PerformIdentityChecks:
    Type: Map
    InputPath: "$.application"
    ItemsPath: "$.applicants"
    ResultPath: "$.identityResults"
    Iterator:
      StartAt: PerformIdentityCheck
      States:
        PerformIdentityCheck:
          Type: Task
          Resource: "${PerformIdentityCheckFunctionArn}"
          End: true
    Next: AggregateIdentityResults
  AggregateIdentityResults:
    Type: Task
    Resource: "${AggregateIdentityResultsFunctionArn}"
    InputPath: "$.identityResults"
    ResultPath: "$.overallIdentityResult"
    Next: EvaluateIdentityResults

CDK uses a fluent syntax with properties matching those above. So easy, I thought, just replicate the same logic in TypeScript:

const processApplicationStateMachine = new sfn.StateMachine(  this,  'ProcessApplicationStateMachine',  {    definition: sfn.Chain.start(      new sfn.Map(this, 'PerformIdentityChecks', {        inputPath: '$.application',        itemsPath: '$.applicants',        resultPath: '$.identityResults',      })        .iterator(          new sfnTasks.LambdaInvoke(this, 'PerformIdentityCheck', {            lambdaFunction: performIdentityCheckFunction,          })        )        .next(          new sfnTasks.LambdaInvoke(this, 'AggregateIdentityResults', {            lambdaFunction: aggregateIdentityResultsFunction,            inputPath: '$.identityResults',            resultPath: '$.overallIdentityResult',          })        )    ),  });

The next step was for me to test it. Using the AWS Toolkit, I right-clicked on the step function and used one of the JSON test files in the project.

I then went into the AWS console and was heartened to see it all green.

However... looking at the step output for PerformIdentityCheck, I saw the following output:

{  "ExecutedVersion": "$LATEST",  "Payload": {    "success": false  },  "SdkHttpMetadata": {        },  "SdkResponseMetadata": {    "RequestId": "76e41976-672d-4be0-a4d2-a5b80e7f9afe"  },  "StatusCode": 200}

This was not quite what I was expecting, but surely this is easily solved by using the outputPath property on the functions to select just the Payload. E.g.:

new sfnTasks.LambdaInvoke(this, 'PerformIdentityCheck', {  lambdaFunction: performIdentityCheckFunction,  outputPath: '$.Payload',})

With this change in place, I deployed again, and fired off my test. The result... abject failure.

I checked the output of the PerformIdentityCheck step, and all was as expected. I checked the output of the map step, and again all was as expected.

The problem was with the AggregateIdentityResults. It had executed as expected, outputting the following.

{  "resourceType": "lambda",  "resource": "invoke",  "output": {    "ExecutedVersion": "$LATEST",    "Payload": false,        "StatusCode": 200  },  "outputDetails": {    "truncated": false  }}

However, an Invalid path '$.Payload' : No results for path: $['Payload'] error was being thrown after it had executed.

{  "error": "States.Runtime",  "cause": "An error occurred while executing the state 'AggregateIdentityResults' (entered at the event id #13). Invalid path '$.Payload' : No results for path: $['Payload']"}

Cue a lost hour trying to work out why $.Payload worked for one function task, but not for another. I did eventually get to the bottom of this (see the end of the post), but my investigations led me to the following issue from April 2020: RunLambdaTask with outputPath not working

This pointed me in the direction of a solution. This was to use the payloadResponseOnly property, defined by the docs as follows:

'Invoke the Lambda in a way that only returns the payload response without additional metadata.'

E.g.:

new sfnTasks.LambdaInvoke(this, 'PerformIdentityCheck', {  lambdaFunction: performIdentityCheckFunction,  payloadResponseOnly: true,})

With this in place, I re-ran the test, and checked the result in AWS.

{  "application": {      },  "identityResults": [    {      "success": false    }  ],  "overallIdentityResult": false}

Hurrah! This was exactly as expected, with no extraneous data being returned by the function invocations.

What I did notice, by looking at the generated definitions, is that the generated ASL for invoking the functions differs depending on the value of payloadResponseOnly. Without payloadResponseOnly: true, the definition is generated as follows:

    "AggregateIdentityResults": {            "Resource": "arn:aws:states:::lambda:invoke",      "Parameters": {        "FunctionName": "arn:aws:lambda:eu-west-2:361728023653:function:ProcessApplicationStack-AggregateIdentityResultsFu-B7MG7QWC1VLN",        "Payload.$": "$"      }    }

Whilst with payloadResponseOnly: true, we get the following that matches the original SAM-based definition:

    "AggregateIdentityResults": {            "Resource": "arn:aws:lambda:eu-west-2:361728023653:function:ProcessApplicationStack-AggregateIdentityResultsFu-B7MG7QWC1VLN"    }

This difference must be the reason for the difference in the response, but why that should be I don't know. However, I can now get the results I want, so I shall move on.

That concludes this part. I had anticipated getting further, but that outcome is pretty standard for software development. In Part 2, I will continue converting the step function to CDK and record the challenges I encounter on the way.

Edit: The reason for the Payload error was due to my misunderstanding of how the paths are processed. The key bit I was missing is below:

"The OutputPath is computed after applying ResultPath. All service integrations return metadata as part of their response. When using ResultPath, it's not possible to merge a subset of the task output to the input."

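The ordering described in that quote can be modelled in a few lines. This is a rough sketch under stated assumptions, not the Step Functions engine: the selectPath and processOutput functions are hypothetical, they only support $ and single-level $.key paths, and the invoke result is abbreviated to two fields.

```typescript
type Json = { [key: string]: unknown };

// Simplified path handling: supports only '$' and single-level '$.key' paths.
function selectPath(doc: Json, path: string): unknown {
  if (path === '$') return doc;
  const key = path.slice(2);
  if (!(key in doc)) throw new Error(`Invalid path '${path}'`);
  return doc[key];
}

// Applies ResultPath first (merging the result into the input), then OutputPath.
function processOutput(input: Json, result: Json, resultPath: string, outputPath: string): unknown {
  const merged: Json = resultPath === '$' ? result : { ...input, [resultPath.slice(2)]: result };
  return selectPath(merged, outputPath);
}

// Shape of a lambda:invoke service integration result (metadata abbreviated).
const invokeResult: Json = { Payload: false, StatusCode: 200 };

// PerformIdentityCheck: default ResultPath '$', so '$.Payload' resolves.
const checkOutput = processOutput({ applicant: 'A' }, invokeResult, '$', '$.Payload');

// AggregateIdentityResults: the result lands under '$.overallIdentityResult',
// so the merged document has no top-level Payload and OutputPath fails.
let aggregateError = '';
try {
  processOutput({ identityResults: [] }, invokeResult, '$.overallIdentityResult', '$.Payload');
} catch (e) {
  aggregateError = (e as Error).message;
}
```

In this model, the same outputPath succeeds for the task without a ResultPath and fails for the task with one, which matches the behaviour I saw.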
Easier Step Functions with the AWS Toolkit for VS Code

Andy Blackledge — Sun, 11 Apr 2021 20:22:13 GMT

If you are developing AWS Step Functions and you are using VS Code, then the AWS Toolkit for Visual Studio Code makes your life so, so much easier.

The AWS guide Working with AWS Step Functions provides a comprehensive guide to installing and using the extension. This is my write-up of the features that I have found to be the most useful so far. These are:

State Machine Templates
Code snippets
Code completion and validation
State machine graph visualization

Note that the extension also has functionality to download definitions from AWS, create state machines in AWS, and to update the definition for an existing state machine. I have found these less useful, as I prefer to deploy everything using a tool such as SAM. However, your mileage may vary.

The key to using the features above is to save your definition file with the extension .asl.json or .asl.yaml (more on this later). This is when the magic of the extension kicks in. If you use the 'AWS: Create a new Step Functions state machine' option from the VS Code Command Palette, then this will be done automatically when you save the created file.

The State Machine Templates are accessed via the 'AWS: Create a new Step Functions state machine' option. This presents you with a list of starting points and provides a nice way to create your initial definition.

To build up your definition, you will need to add states. This is where the code snippets come in very handy. They provide a guide to creating the different types of states, saving you from having to remember the specifics.

For me the best feature of all is the code completion and validation. The code completion adapts to the type of state and prompts you for properties specific to that type. In addition, when entering values for Next, StartAt, or Default properties, you will be prompted for state names. The code validation highlights the following errors:

Missing properties
Incorrect values
No terminal state
Non-existent states that are pointed to
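To give a feel for what such validation involves, here is a minimal sketch of two of these checks: references to non-existent states and the absence of a terminal state. It is not how the toolkit is implemented. The SketchState and SketchDefinition types and the validateDefinition function are hypothetical and cover only a small subset of ASL.

```typescript
interface SketchState {
  Type: string;
  Next?: string;
  End?: boolean;
}

interface SketchDefinition {
  StartAt: string;
  States: { [name: string]: SketchState };
}

// Returns human-readable problems; covers only two of the checks listed above.
function validateDefinition(definition: SketchDefinition): string[] {
  const problems: string[] = [];
  const names = Object.keys(definition.States);

  if (names.indexOf(definition.StartAt) === -1) {
    problems.push(`StartAt points to missing state '${definition.StartAt}'`);
  }
  for (const name of names) {
    const next = definition.States[name].Next;
    if (next !== undefined && names.indexOf(next) === -1) {
      problems.push(`'${name}' points to missing state '${next}'`);
    }
  }
  // Succeed and Fail states are terminal by type; others need End: true.
  const hasTerminal = names.some((name) => {
    const state = definition.States[name];
    return state.End === true || state.Type === 'Succeed' || state.Type === 'Fail';
  });
  if (!hasTerminal) problems.push('No terminal state');
  return problems;
}

const typoProblems = validateDefinition({
  StartAt: 'CheckIdentity',
  States: {
    CheckIdentity: { Type: 'Task', Next: 'SendEmial' }, // deliberate typo
    SendEmail: { Type: 'Task', End: true },
  },
});
```

Here the misspelt Next value is reported as a reference to a missing state, which is exactly the kind of typo that would otherwise only surface after a deployment.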

This feature makes a world of difference in reducing the sort of simple typo errors that can eat up your time deploying to and testing in AWS. Yes, there are ways of testing locally that could help, but seeing and fixing the errors in the editor is always going to be more efficient.

Finally, there is the ability to render a graph of your state machine. I find the flow of the state machine much easier to follow and validate when done by eye. The graph is rendered when you select the 'Render graph' option and appears alongside your definition, as shown below:

A very recent addition to the toolkit, announced in March this year, is tooling support for YAML in AWS Step Functions. This means that instead of the bracket-heavy definitions of yore, e.g.:

We can now express ourselves more cleanly and - hey - maybe add some comments if we are feeling louche. For example, the snippet above becomes:

For my investigations into the toolkit, I have created a working project that emulates a basic loan processing flow. The complete code for this can be found on GitHub here. My next challenge is to take this project and convert it to use the AWS Cloud Development Kit (CDK), and see how that compares with the SAM approach.
