Adventures with AWS X-Ray and CDK - Part 3

Instrumenting a whole application

In this post I continue my adventures with X-Ray and try my hand at observing a whole application. In the previous posts in the series, I looked at using X-Ray in a small context. Here we will see what happens when an end-to-end process is traced and logged.

You can download the accompanying code and try it yourself from the GitHub repo.

The example application

The case study we looked at in the series on implementing Enterprise Integration patterns is an application that acts as a loan broker. The application in question receives a request containing the details of the loan required via an API, and then returns the best rate to a webhook.

The following diagram shows how we use a central EventBridge event bus to implement this.

Architecture diagram using EventBridge
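
As a rough reminder of the wiring, the central bus and a rule for one of the consumers can be declared in CDK along the lines of the sketch below. The construct names, the event pattern, and the target function are illustrative rather than taken from the repo.

import { EventBus, Rule } from 'aws-cdk-lib/aws-events';
import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';

// Central bus that all of the domain events flow through
const loanBrokerBus = new EventBus(this, 'LoanBrokerEventBus');

// Illustrative rule: route quote-submitted events to an interested consumer
new Rule(this, 'QuoteSubmittedRule', {
  eventBus: loanBrokerBus,
  eventPattern: { detailType: ['QuoteSubmitted'] },
}).addTarget(new LambdaFunction(quoteProcessorFunction));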

View the service map

In the last post, I walked through how I added X-Ray to the whole application. Now when I run some requests through the API, we see the following service map.

Service map showing end-to-end application

What is quite clear from this picture is that events are at the heart of this application.

Focus on events at the heart of the service map

Now, by clicking on the client, we can trace a request all the way through the application and out to the webhook.

Service map showing how to view traces from the client

However, when I tried this, I found something that was hindering observability. Ironically, it was the observability that I added in the post on Domain Observability.

Removing observability from tracing

In that post, I added business-level observability by hooking a Lambda function up to the domain events being raised.

Architecture diagram showing the observability Lambda function

However, as I had enabled tracing for this Lambda function, the trace included numerous entries for the observers, which clouded the picture of the process.

Trace showing how observability is being noisy

When viewing the log, this was even more apparent.

Log showing how observability is being noisy

The solution is to specify Tracing.DISABLED for the observability Lambda functions. However, as I still wanted the traces when testing the Lambda functions, I added an isTestMode property to the observability CDK stack as follows.

import { Tracing } from 'aws-cdk-lib/aws-lambda';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

// <snip>

const loggerFunction = new NodejsFunction(
  this,
  'Logger',
  getNodejsFunctionProps({
    // We don't want the observer functions in the production traces
    tracing: props.isTestMode ? Tracing.ACTIVE : Tracing.DISABLED,
    // <snip>
  })
);

Now production traces are clean, but we can also take advantage of X-Ray when testing the functionality.
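
For completeness, the flag is just an optional property on the stack props, along the lines of the sketch below. The property interface and stack names here are illustrative rather than copied from the repo.

import { StackProps } from 'aws-cdk-lib';

export interface ObservabilityProps extends StackProps {
  // When true, the observer functions are traced so they can be tested with X-Ray
  isTestMode?: boolean;
}

// In a test harness, tracing stays active for the observer functions
new ObservabilityStack(app, 'ObservabilityTestStack', { isTestMode: true });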

Adding custom subsegments

The article Generating custom subsegments with the X-Ray SDK for Node.js describes subsegments as follows.

Subsegments extend a trace's segment with details about work done in order to serve a request. Each time you make a call with an instrumented client, the X-Ray SDK records the information generated in a subsegment. You can create additional subsegments to group other subsegments, to measure the performance of a section of code, or to record annotations and metadata.

In our application, we have Lambda functions that simulate responses from lender systems. At the moment, this is just an algorithm, but in practice it would be a call that would take time and be prone to error. This would be an ideal call to surround with a custom subsegment.

With this in mind, I added the following code to allow the lender configuration to control the delay in responding and whether an error occurred.

import { randomInt } from 'crypto';

// Simulates a call to an external lender system, with a configurable delay
// and a configurable percentage chance of failure
const simulateExternalCallAsync = async (
  lenderConfig: LenderConfig
): Promise<void> => {
  const randomPercentage = randomInt(100);
  const errorPercentage = lenderConfig.errorPercentage ?? 0;
  const throwError = randomPercentage < errorPercentage;

  // Wait for at least the configured minimum delay, plus up to a second of jitter
  const delayMillis = (lenderConfig.minDelayMillis ?? 1000) + randomInt(1000);
  await new Promise((resolve) => setTimeout(resolve, delayMillis));

  if (throwError) {
    const errorMessage = `Simulated error (${randomPercentage} < ${errorPercentage})`;
    throw new Error(errorMessage);
  }
};
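
For context, the LenderConfig fields used in these snippets can be sketched as the interface below. This is inferred from the code above rather than copied from the repo.

// Inferred from the fields used in the snippets; the real type lives in the repo
interface LenderConfig {
  lenderId: string;
  lenderName: string;
  // Percentage chance (0-100) that the simulated call throws an error
  errorPercentage?: number;
  // Minimum time in milliseconds that the simulated call takes to respond
  minDelayMillis?: number;
}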

With this in place, I added the code below around the call to simulateExternalCallAsync to add and close the subsegment.

import * as AWSXRay from 'aws-xray-sdk';

// <snip>

const segment = AWSXRay.getSegment();
const subsegment = segment?.addNewSubsegment('External Call');

try {
  // Simple values that are indexed for filter expressions
  subsegment?.addAnnotation('callType', 'Lender');
  subsegment?.addAnnotation('lenderId', lenderConfig.lenderId);
  // Related data for debugging purposes
  subsegment?.addMetadata('lenderDetails', {
    lenderId: lenderConfig.lenderId,
    lenderName: lenderConfig.lenderName,
    lenderUrl: `https://${lenderConfig.lenderId}.com`,
  });

  await simulateExternalCallAsync(lenderConfig);

} catch (error) {
  if (error instanceof Error) {
    // Add error to the subsegment
    subsegment?.addError(error);
  }
  throw error;
} finally {
  // Ensure the subsegment is closed
  subsegment?.close();
}

I redeployed the lenders, with the configuration set to introduce delays but not throw any errors. After running a request, I could see the following in the trace.

Trace showing custom segment

So we can now see the time taken for our 'external call'. What we can also see is the annotation and metadata that we added to the subsegment. Annotations are key-value pairs with simple data (strings, numbers, or booleans) that are indexed for use with filter expressions, whilst metadata can be any related data you'd like to store that's not indexed.
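
Because annotations are indexed, they can be used in filter expressions, either in the console's trace search or programmatically. The sketch below, using the X-Ray client from the AWS SDK for JavaScript v3, finds recent traces for a single lender; the time window and the lender id are illustrative.

import { GetTraceSummariesCommand, XRayClient } from '@aws-sdk/client-xray';

const xray = new XRayClient({});

// Find traces from the last hour annotated with callType 'Lender'
// and an illustrative lender id of 'Lender1'
const { TraceSummaries } = await xray.send(
  new GetTraceSummariesCommand({
    StartTime: new Date(Date.now() - 60 * 60 * 1000),
    EndTime: new Date(),
    FilterExpression:
      'annotation.callType = "Lender" AND annotation.lenderId = "Lender1"',
  })
);

console.log(TraceSummaries?.map((summary) => summary.Id));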

Clicking on the 'Annotations' tab of the segment details, we can see what type of call it was and which lender was called.

Trace showing custom segment annotations

For more data, clicking on 'Metadata' shows the lender name and the URL being 'called'.

Trace showing custom segment metadata

Forcing some errors

After having a look at what a happy system looks like, I decided to introduce some errors. I updated the configuration for one of the lenders so that it would always throw an error and redeployed.

After running a request through the system, the service map clearly showed that the Lambda function for the lender had errors.

Service map showing Lambda function error

Looking at the trace, I could see that the asynchronous request from the step function had timed out, as the timeout was set to 12 seconds.

Trace showing the step function timeout

The trace also contained the custom subsegment, which clearly shows that there was an error making the external call.

Trace showing custom segment failure

As before, the 'Annotations' and 'Metadata' tabs showed the call details. However, as we added the error to the subsegment, the 'Exceptions' tab also showed the error details, including the stack trace.

Trace showing custom segment exception

This shows the power of using subsegments to instrument key parts of the application.

Running a workload

The final part of my adventure with X-Ray involved configuring the Lambda functions for the lenders to randomly error and then to put multiple requests through the application.

Looking at the resulting service map, I could see some errors that I was expecting and some that I wasn't.

Service map showing a workload running with errors

In particular, the step function was showing that it had errored. So I selected it and filtered the traces to see what was going on. On inspection, the Lambda function that sends the response to the webhook was erroring. Clicking on the 'Exceptions' tab made it clear that, in this case, there was no response to send and the code didn't cater for that possibility.

Trace showing an error due to bad logic
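
The kind of guard that was missing might look roughly like the sketch below. The function name, payload shape, and statuses are purely illustrative; the point is simply to handle the case where no lender responded rather than letting the function throw.

// Purely illustrative: handle the case where no rate was received
const sendResultAsync = async (
  webhookUrl: string,
  bestLenderRate?: { lenderId: string; rate: number }
): Promise<void> => {
  const body = bestLenderRate
    ? { status: 'QUOTE_RECEIVED', bestLenderRate }
    : { status: 'NO_QUOTES_RECEIVED' };

  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
};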

Looking at another error, I saw that the Lambda function that looks up the lenders from the parameter store was throwing an error. Again, the 'Exceptions' tab showed the underlying reason. In this case, there is a rate limit on accessing the parameters in the parameter store. This suggests that the application needs some sort of cache in front of the raw parameter access.

Trace showing an error due to rate limiting
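
A simple way to take the pressure off the parameter store would be to cache the lender configuration in the Lambda function's module scope with a short TTL, along the lines of the sketch below; the parameter path and TTL are illustrative. The AWS Parameters and Secrets Lambda Extension would be another option.

import {
  GetParametersByPathCommand,
  Parameter,
  SSMClient,
} from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

// Module-scope cache so that warm invocations don't call the parameter store at all
let cachedLenders: { parameters: Parameter[]; expires: number } | undefined;
const CACHE_TTL_MILLIS = 60 * 1000; // Illustrative TTL

export const getLenderParametersAsync = async (): Promise<Parameter[]> => {
  if (cachedLenders && cachedLenders.expires > Date.now()) {
    return cachedLenders.parameters;
  }

  // Illustrative parameter path, not the one used in the repo
  const { Parameters } = await ssm.send(
    new GetParametersByPathCommand({ Path: '/loan-broker/lenders/' })
  );

  cachedLenders = {
    parameters: Parameters ?? [],
    expires: Date.now() + CACHE_TTL_MILLIS,
  };

  return cachedLenders.parameters;
};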

Summary

In this post, I documented my experience using X-Ray with a whole application. I found the ability to view individual traces through the application, along with the associated logs, very valuable. The ability to see errors and then drill down to their causes allowed me to spot behaviour that would otherwise have been tricky to find. Overall, I was very impressed by what X-Ray has to offer.

Of course, there are also numerous excellent third-party observability offerings. So, although you get X-Ray out of the box with AWS, please consider those as well.