Structured logging & message correlation

Are you writing log messages like this?

// Call comes in...

Log.WriteInfo("Processing Transaction Id {0}, Tenant Id {1}, Amount {2}, Extra data: {3}.", request.TransactionId, request.TenantId, request.Amount, request.ExtraData);

// ...a bunch of logic happens...

Log.WriteInfo("Transaction processed with {0} in {1}.", response.IsApproved ? "Approval" : "Decline", response.EndTime - request.StartTime);

return response;

I’m going to refer to that as “traditional logging” here, because we all used to write messages like that. They are great for debugging! If you’re not logging at all yet, this is a great place to start. Basically, when we have an issue, we pull our log files and read through them top-to-bottom, trying to piece together what happened for some flow through the code. You can get a lot of mileage out of these logs.

But what if we wanted to know, for example, the average response time for a given tenant that happens to pass ExtraData set to “UseFallbackEncryption=true”, and whoops, we don’t store that in the database? With traditional logging, good luck with that. Let’s say we have 100 million messages across 30 servers to mine. Just pulling the data together becomes a task. And if you pull it off, then what? You could try to write a parser, but there’s no defined structure to the data, so the task becomes basically impossible. I’ve been there, it sucks.

What if I told you answering this type of question from your logs can be easy? Ridiculously easy.

Structured Logging

The first step is making the logs consumable by a machine. We need to give them structure! This involves choosing a logging framework that supports writing structured logs, and then making sure our messages attach data instead of embedding it into the message strings. That latter part can be quite a task. In our payment gateway product, I probably fixed up two thousand log lines, maybe more (over years).

If you want a great logger, I recommend Serilog. It’s fast. I cannot recommend NLog or Log4Net. I refuse to even link them! They are slow. To be honest, I don’t have a lot of experience with either, but I have benchmarked them against other frameworks multiple times. Update: here’s a link to NLog after all, because I owe it an apology! Rolf Kristensen gave me some pointers over on GitHub about how to configure it. So if you’re willing to dial it in for your use cases, it is a perfectly fine logger!

What do I use? I have my own logger: Macross.Logging.Files.

I am filled with much shame writing that statement. Trust me, I know: the world does not need another logger. I really tried not to write one. But ultimately I needed a couple of things I couldn’t get elsewhere. I wanted my log output to look like what is displayed in the debugging UI, and I wanted the files to be aware of the message group and use that to generate file names.

A lot of these logging frameworks are really feature-rich. Mine isn’t. It is great at one or two things, and it’s fast. Some logging frameworks let you plug in different sinks and write your messages to the DB, or the cloud, or a queue, or send them to the moon. I’ve never needed that, and it just slows things down. I write my logs to disk and let an external process pick them up from there. Right now, that process is Filebeat. More on that next time!

The Macross lib is a .NET Core ILogger reimagining of the logger we had in the payment gateway, which was .NET Framework. Same ideas, different code. Over the years we tested the old logger against many popular logging frameworks, and it always outperformed them, much to our amazement. Logging was one thing we were really good at. Survival necessity, I suppose.

Anyway, pick a logger you are comfortable with!

Here’s an example of how you write structured messages…

Log.WriteInfo(
   new
   {
       request.TransactionId,
       request.TenantId,
       request.Amount,
       request.ExtraData
   },
   "Transaction Processed.");

That might look kind of strange if you are used to the .NET Core logging mechanics. ILogger.WriteInfo is an extension method from Macross.Logging.Abstractions. It is equivalent to:

Log.LogInformation(
   "Transaction Processed.{Data}",
   new
   {
       request.TransactionId,
       request.TenantId,
       request.Amount,
       request.ExtraData
   }
);
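
If you’re curious how that trick works, here is a minimal sketch of such an extension method, based purely on the equivalence above (the real Macross.Logging.Abstractions implementation is surely more careful about things like allocations and template caching):

using Microsoft.Extensions.Logging;

public static class LoggerExtensions
{
	// Appends a {Data} hole to the message and passes the state object as its value.
	public static void WriteInfo(this ILogger logger, object data, string message)
		=> logger.LogInformation(message + "{Data}", data);
}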

The structured logging examples you see more commonly around the web will be more like this:

Log.LogInformation(
   "Processing {TransactionId}, {TenantId}, {Amount}, {Extra data}.", 
   request.TransactionId,
   request.TenantId,
   request.Amount,
   request.ExtraData);

That’s technically semantic logging, I think, but your logger will probably pull the values out into some kind of structure. Nothing wrong there; it works great for simple values. If you want to push a whole object through, though, support varies: some loggers will just ToString() it. Here are some comparisons for those three cases:

// Example DTO...
public class Request
{
	public int TransactionId { get; set; } = 100;

	public int TenantId { get; set; } = 1818;

	public decimal Amount { get; set; } = 1.99M;

	public ExtraData ExtraData { get; set; } = new ExtraData();
}

public class ExtraData
{
	public string PassThrough { get; set; } = "Pass_Through";

	public bool UseFallbackEncryption { get; set; } = false;

	public IDictionary<string, object> ExtensionData { get; set; } = new Dictionary<string, object> { ["key1"] = 10 };
}

Request request = new Request();

// Logged different ways...
Log.WriteInfo(
	new
	{
		request.TransactionId,
		request.TenantId,
		request.Amount,
		request.ExtraData
	},
	"Transaction Processed.");

Log.LogInformation(
	"Transaction Processed.{Data}",
	new
	{
		request.TransactionId,
		request.TenantId,
		request.Amount,
		request.ExtraData
	}
);

Log.LogInformation(
	"Processing {TransactionId}, {TenantId}, {Amount}, {Extra data}.", 
	request.TransactionId,
	request.TenantId,
	request.Amount,
	request.ExtraData);

Logging output:

// Macross.Logging.Files
{"TimestampUtc":"2020-05-07T05:36:40.2368432Z","ThreadId":1,"LogLevel":"Information","GroupName":"Main","CategoryName":"TestConsoleApp.Program","Content":"Transaction Processed.","TransactionId":100,"TenantId":1818,"Amount":1.99,"ExtraData":{"PassThrough":"Pass_Through","UseFallbackEncryption":false,"ExtensionData":{"key1":10}}}
{"TimestampUtc":"2020-05-07T05:36:40.2868537Z","ThreadId":1,"LogLevel":"Information","GroupName":"Main","CategoryName":"TestConsoleApp.Program","Content":"Transaction Processed.{ TransactionId = 100, TenantId = 1818, Amount = 1.99, ExtraData = TestConsoleApp.Program\u002BExtraData }","Data":{"TransactionId":100,"TenantId":1818,"Amount":1.99,"ExtraData":{"PassThrough":"Pass_Through","UseFallbackEncryption":false,"ExtensionData":{"key1":10}}}}
{"TimestampUtc":"2020-05-07T05:36:40.2891003Z","ThreadId":1,"LogLevel":"Information","GroupName":"Main","CategoryName":"TestConsoleApp.Program","Content":"Processing 100, 1818, 1.99, TestConsoleApp.Program\u002BExtraData.","TransactionId":100,"TenantId":1818,"Amount":1.99,"Extra data":{"PassThrough":"Pass_Through","UseFallbackEncryption":false,"ExtensionData":{"key1":10}}}

// Serilog
{"Timestamp":"2020-05-06T22:36:40.3197531-07:00","Level":"Information","MessageTemplate":"Transaction Processed.","Properties":{"{Data}":"{ TransactionId = 100, TenantId = 1818, Amount = 1.99, ExtraData = TestConsoleApp.Program+ExtraData }","SourceContext":"TestConsoleApp.Program"}}
{"Timestamp":"2020-05-06T22:36:40.3764703-07:00","Level":"Information","MessageTemplate":"Transaction Processed.{Data}","Properties":{"Data":"{ TransactionId = 100, TenantId = 1818, Amount = 1.99, ExtraData = TestConsoleApp.Program+ExtraData }","SourceContext":"TestConsoleApp.Program"}}
{"Timestamp":"2020-05-06T22:36:40.3766900-07:00","Level":"Information","MessageTemplate":"Processing {TransactionId}, {TenantId}, {Amount}, {Extra data}.","Properties":{"TransactionId":100,"TenantId":1818,"Amount":1.99,"Extra data":"TestConsoleApp.Program+ExtraData","SourceContext":"TestConsoleApp.Program"}}

The differences are:

  • Macross.Logging.Files understands the GroupName concept. Depending on config, that can also be part of the file name.
  • Macross.Logging.Files handled the ExtraData complex object right out of the box. With Serilog, I think you can prefix a hole with @ (e.g. {@Data}) to have it destructure nested objects; a sketch of that follows below.
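
For completeness, here’s roughly what that would look like with the {Data} example from above. I haven’t re-run the outputs with this, so treat it as a sketch of Serilog’s destructuring operator rather than verified output:

// The @ asks Serilog to destructure the argument into structured
// properties instead of rendering it with ToString().
Log.LogInformation(
	"Transaction Processed. {@Data}",
	new
	{
		request.TransactionId,
		request.TenantId,
		request.Amount,
		request.ExtraData
	});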

The important part of all of this though is that our logs now have an easy-to-parse structure! That will come into play big time in the next episode.

Log Message Correlation

There is one more really important thing we need to do: we have to tie together all the messages written in the same logical operation. This is critical. In the payment gateway, while we’re processing a request, all the log messages we write should be tagged as related, somehow. The same goes for a website processing a user request, or maybe a gaming session. We want to be able to see everything that happened for a given activity. If you want to get out ahead of the game, check out: W3C Trace Context. I’ll get more into that when we dive into OpenTelemetry, but it’s an open standard for passing this context through our own code and all of our dependent services and products.
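
To give you a quick taste, the centerpiece of that standard is a traceparent HTTP header carrying a version, a trace id, a parent span id, and trace flags. The example value below is the one used in the spec itself:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01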

Why is this so important?

Imagine a very busy service processing many requests at a time. Your log could look like this:

{"Message":"Processing request.", "TransactionId": 10}
{"Message":"Processing request.", "TransactionId": 11}
{"Message":"Processing request.", "TransactionId": 12}
{"Message":"Bank returned an unknown error code."}
{"Message":"Processing request.", "TransactionId": 13}
{"Message":"Processing request.", "TransactionId": 14}

How do we know which transaction failed? The developer didn’t add TransactionId to that message!

Let’s introduce CorrelationId:

{"CorrelationId":1, "Message":"Processing request.", "TransactionId": 10}
{"CorrelationId":2, "Message":"Processing request.", "TransactionId": 11}
{"CorrelationId":3, "Message":"Processing request.", "TransactionId": 12}
{"CorrelationId":2, "Message":"Bank returned an unknown error code."}
{"CorrelationId":4, "Message":"Processing request.", "TransactionId": 13}
{"CorrelationId":5, "Message":"Processing request.", "TransactionId": 14}

Much clearer now!

Where this gets really powerful is when the CorrelationId follows the transaction across applications.

Application1.ServerA.log:
{"CorrelationId":1, "Message":"Processing request.", "TransactionId": 1}
{"CorrelationId":2, "Message":"Processing request.", "TransactionId": 2}
{"CorrelationId":3, "Message":"Processing request.", "TransactionId": 3}

Application2.ServerB.log:
{"CorrelationId":3, "Message":"Malformed message received. Rejecting."}

Now we can follow our logic through the system. The developer doesn’t need to worry about tagging every message with repetitive data; as we’ll see next time, we can just query everything via the CorrelationId.

How do you pull this off? In .NET Framework it’s going to depend heavily on the framework you are using. The payment gateway had a class for storing the log state that would get passed along everywhere and then serialized over service hops. There were a lot of push/pop operations going on as logic was executed.

The .NET Core logging framework has a much more elegant mechanism called log scopes. You open a log scope, and it is applied to all messages written under that scope. It does this via AsyncLocal&lt;T&gt;, a magical thing in the framework used to flow data from one thread to another as we async/await our operations.
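
If you’ve never bumped into AsyncLocal&lt;T&gt; before, here’s a tiny self-contained sketch of the flowing behavior. This isn’t how any particular logger implements scopes, it just demonstrates the primitive they build on:

using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncLocalDemo
{
	private static readonly AsyncLocal<string> s_CorrelationId = new AsyncLocal<string>();

	public static async Task Main()
	{
		s_CorrelationId.Value = "abc-123";

		// The continuation below may resume on a different thread pool thread...
		await Task.Delay(10).ConfigureAwait(false);

		// ...but the value still flows with the async execution context.
		Console.WriteLine(s_CorrelationId.Value); // prints: abc-123
	}
}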

When you start your logical operation, open a scope. Then make sure your logging framework writes that scope on the messages. ASP.NET Core does this automatically for us, by the way. All your requests are automatically processed under a parent scope.
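
If you ever want to lay your own CorrelationId on top of that, for example to honor an id sent by the caller, a hypothetical middleware could open the scope for the rest of the pipeline. The header name and type names below are my own invention, not a framework convention:

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;

public class CorrelationMiddleware
{
	private readonly RequestDelegate _next;
	private readonly ILogger<CorrelationMiddleware> _logger;

	public CorrelationMiddleware(RequestDelegate next, ILogger<CorrelationMiddleware> logger)
	{
		_next = next;
		_logger = logger;
	}

	public async Task InvokeAsync(HttpContext context)
	{
		// Reuse the caller's correlation id if one was sent, otherwise mint a new one.
		string correlationId = context.Request.Headers["X-Correlation-Id"].ToString();
		if (string.IsNullOrEmpty(correlationId))
			correlationId = Guid.NewGuid().ToString();

		// Every message logged below this point in the pipeline gets tagged with it.
		using IDisposable scope = _logger.BeginScope(new { CorrelationId = correlationId });

		await _next(context).ConfigureAwait(false);
	}
}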

If we modify our test from above like this:

using IDisposable Scope = Log.BeginScope(new { CorrelationId = Guid.NewGuid() });

Log.WriteInfo(
	new
	{
		request.TransactionId,
		request.TenantId,
		request.Amount,
		request.ExtraData
	},
	"Transaction Processed.");

Log.LogInformation(
	"Transaction Processed.{Data}",
	new
	{
		request.TransactionId,
		request.TenantId,
		request.Amount,
		request.ExtraData
	}
);

Log.LogInformation(
	"Processing {TransactionId}, {TenantId}, {Amount}, {Extra data}.", 
	request.TransactionId,
	request.TenantId,
	request.Amount,
	request.ExtraData);

We end up with log messages written like this:

{"TimestampUtc":"2020-05-07T06:30:19.0649622Z","ThreadId":1,"LogLevel":"Information","GroupName":"TestConsoleApp.Program","CategoryName":"TestConsoleApp.Program","Content":"Transaction Processed.","CorrelationId":"d08d6643-0d14-4069-b8ad-2ad5b410c9e9","TransactionId":100,"TenantId":1818,"Amount":1.99,"ExtraData":{"PassThrough":"Pass_Through","UseFallbackEncryption":false,"ExtensionData":{"key1":10}}}
{"TimestampUtc":"2020-05-07T06:30:19.1098374Z","ThreadId":1,"LogLevel":"Information","GroupName":"TestConsoleApp.Program","CategoryName":"TestConsoleApp.Program","Content":"Transaction Processed.{ TransactionId = 100, TenantId = 1818, Amount = 1.99, ExtraData = TestConsoleApp.Program\u002BExtraData }","CorrelationId":"d08d6643-0d14-4069-b8ad-2ad5b410c9e9","Data":{"TransactionId":100,"TenantId":1818,"Amount":1.99,"ExtraData":{"PassThrough":"Pass_Through","UseFallbackEncryption":false,"ExtensionData":{"key1":10}}}}
{"TimestampUtc":"2020-05-07T06:30:19.1116783Z","ThreadId":1,"LogLevel":"Information","GroupName":"TestConsoleApp.Program","CategoryName":"TestConsoleApp.Program","Content":"Processing 100, 1818, 1.99, TestConsoleApp.Program\u002BExtraData.","CorrelationId":"d08d6643-0d14-4069-b8ad-2ad5b410c9e9","TransactionId":100,"TenantId":1818,"Amount":1.99,"Extra data":{"PassThrough":"Pass_Through","UseFallbackEncryption":false,"ExtensionData":{"key1":10}}}

See how all three lines now have the same CorrelationId?

Check out this blog for details about how ASP.NET Core does all of this automatically for us, including transmitting & reestablishing our CorrelationId in our dependent services. I’ll get more into that when we look at OpenTelemetry, but all the info is there.

Tune in next time to see how we work with all this amazing data!
