[Service Fabric] Auto-scaling your Service Fabric cluster–Part II

In Part I of this article, I demonstrated how to set up auto-scaling on the Service Fabric cluster's VM scale set based on a metric that belongs to the scale set itself (Percentage CPU). That setting doesn't have much to do with the applications running in your cluster; it's purely hardware-level scaling that may be triggered by your services' CPU consumption or by anything else consuming CPU.

There was a recent addition to the auto-scaling capability of a Service Fabric cluster that allows you to use an Application Insights metric, reported by your service, to control cluster scaling. This gives you much finer-grained control: you choose not only when to auto-scale, but also which metric, reported from which service, supplies the values that drive the scaling.

Creating the Application Insights (AI) resource

Creating the AI resource from within the cluster's resource group in the Azure portal can behave a little strangely. To make sure you build the resource correctly, follow these steps.

1. Click on the name of the resource group where your cluster is located.

2. Click on the +Add menu item at the top of the resource group blade.

3. Choose to create a new AI resource. In my case, I created a resource with the General application type and placed it in a different resource group where I keep all of my AI resources.

[Screenshot: creating the Application Insights resource in the Azure portal]

4. Once the AI resource has been created, you will need to retrieve and copy the Instrumentation Key. To do this, click on the name of your newly created AI resource and expand the Essentials section; you should see your instrumentation key there:

[Screenshot: the Instrumentation Key shown in the Essentials section of the AI resource]

You'll need the instrumentation key for your Service Fabric application, so paste it into Notepad or somewhere else you can access it easily.

Modify your Service Fabric application for Application Insights

In Part I of this article, I used a simple stateless service app (https://github.com/larrywa/blogpostings/tree/master/SimpleStateless). Inside that code, I had commented out the pieces intended for use in this blog post.

I left my old AI instrumentation key in place so you can see where to place yours.

1. If you already have the solution from the previous article (download it if you don't), it already has the Microsoft.ApplicationInsights.AspNetCore NuGet package added. If you are using a different project, add this NuGet package to your own service.

2. Open appsettings.json and you will see, down at the bottom of the file, where I have added my instrumentation key. Pay careful attention to where you put this information in the file; a misplaced bracket or colon can really get you messed up here. A sketch of the relevant section is shown below.
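For a frame of reference, here is a minimal sketch of roughly what that part of appsettings.json looks like. The key value is just a placeholder (paste in your own instrumentation key), and any settings already in the file stay as they are:

```json
{
  "ApplicationInsights": {
    "InstrumentationKey": "00000000-0000-0000-0000-000000000000"
  }
}
```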

3. In the WebAPI.cs file, there are a couple of things to note:

a. The addition of the ‘using Microsoft.ApplicationInsights;’ statement.

b. Line 31 – Uncomment the line of code for our counter.

c. Line 34 – Uncomment the line of code that creates a new TelemetryClient.

d. Lines 43 – 59 – Uncomment the code that creates the custom metric named ‘CountNumServices’. Using myClient.TrackMetric, I can create a custom metric with any name I wish. In this example, I count up to 1200 (20 minutes of time) and then count back down to 300, 1 second at a time. The goal is to give the cluster ample time to react appropriately for both the scale-out and the scale-in. A sketch of the pattern follows.
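To show the shape of that code, here is a minimal, illustrative sketch of the counter-plus-TrackMetric pattern. It is not the exact code from the sample repo; the class is trimmed down (no listeners), and the telemetry configuration is assumed to be wired up elsewhere:

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.ServiceFabric.Services.Runtime;

// Illustrative sketch only -- not the exact code in WebAPI.cs.
internal sealed class WebAPI : StatelessService
{
    public WebAPI(StatelessServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        // The parameterless constructor picks up the active telemetry configuration;
        // you can also set myClient.InstrumentationKey explicitly if you prefer.
        TelemetryClient myClient = new TelemetryClient();

        int counter = 0;
        bool countingUp = true;

        while (!cancellationToken.IsCancellationRequested)
        {
            // Report the current value as the custom metric 'CountNumServices'.
            myClient.TrackMetric("CountNumServices", counter);

            // Count up to 1200 (roughly 20 minutes), then back down to 300,
            // one step per second.
            if (countingUp)
            {
                if (++counter >= 1200) countingUp = false;
            }
            else
            {
                if (--counter <= 300) countingUp = true;
            }

            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
    }
}
```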

4. Rebuild the solution.

5. Deploy your solution to the cluster.

Here is the catch. To be able to set up your VM scale set to scale based on an AI custom metric, you first have to deploy the code and start generating some of the custom metric data; otherwise the metric won't show up as a selection. But if you currently have a scaling rule set, as in Part I, the nodes may still be reacting to those rules and not your new ones. That's OK; we will change that once the app is deployed and running in the cluster. Also, you need to let the app run for about 10 minutes to start generating some of these metrics.

Confirming custom metric data in Application Insights

Before testing out the new custom metric, you need to make sure the data is actually being generated.

1. Open the AI resource you recently created (the one whose instrumentation key you placed in the appsettings.json file).

2. Make sure you are on the Overview blade, then select the Metrics Explorer menu item.

3. On either of the charts, click on the very small Edit link at the upper right-hand corner of the chart.

4. In the Chart details blade, down toward the bottom, you should see a new ‘Custom’ section. Select CountNumServices.

[Screenshot: the Custom section of the Chart details blade with CountNumServices selected]

5. Close the Chart details blade. It is handy to have this chart so you can see what the current count is and predict whether the cluster should be scaling out or in.

[Screenshot: Metrics Explorer chart showing the CountNumServices custom metric]

Modifying your auto-scaling rule set

1. Make sure you are on the Scaling blade for your VM scale set.

2. In the current rule set, either click on the existing ‘Scale based on a metric’ scale-out rule or click on the +Add rule link.

3. In my case, I clicked on Scale based on a metric, so we’ll go with that option. Select +Add a rule.

4. Here is the rule I created:

a. Metric source: Application Insights

b. Resource type: Application Insights

c. Resource: aiServiceFabricCustomMeric ~ this is my recently created AI resource

d. Time aggregation: Average

e. Metric name: CountNumServices ~ remember, if you don’t see it when you try to scroll and find the metric to pick from, you may not have waited long enough for data to appear.

f. Time grain: 1 minute

g. Time grain statistic: Average

h. Operator: Greater than

i. Threshold: 600

j. Duration: 5 minutes

k. Operation: Increase count by

l. Instance count: 2

m. Cool down: 1 minute

Basically, after the count reaches 600, increase by two nodes, then wait 1 minute before scaling out any further (note: in real production, this cool down should be at least 5 minutes). In practice the raw count will be above 600 when the rule fires, because the rule evaluates the average of the metric over the duration window rather than the instantaneous value.

5. Review your settings in the Scale rule options blade.

6. Select the Add button.

7. To create a scale in rule, I followed the same process but with slightly different settings:

a. Metric source: Application Insights

b. Resource type: Application Insights

c. Resource: aiServiceFabricCustomMeric ~ this is my recently created AI resource

d. Time aggregation: Average

e. Metric name: CountNumServices ~ remember, if you don’t see it when you try to scroll and find the metric to pick from, you may not have waited long enough for data to appear.

f. Time grain: 1 minute

g. Time grain statistic: Average

h. Operator: Less than

i. Threshold: 500

j. Duration: 5 minutes

k. Operation: Decrease count by

l. Instance count: 1

m. Cool down: 1 minute

Notice that I scale in at a lower rate than I scale out. This is just to make sure we don't make any dramatic changes that could hurt performance.

8. Save the new rule set.

9. If you need to add or change who gets notified for any scale action, go to the Notify tab, make your change and save it.

10. At this point, you can either wait for an email confirmation of a scale action or go into the AI resource in the Azure portal and watch the metric values grow. Remember, though, that what you see in the chart are averages, not the raw incremented count, and the scaling rules evaluate averages as well.

In case you are curious what these rules would look like in an ARM template, it looks something like this:

[Screenshot: the auto-scale rules as they appear in an ARM template]
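If the screenshot is hard to read, here is a hand-written, trimmed-down sketch of a scale-out rule driven by an Application Insights metric in a microsoft.insights/autoscalesettings resource. The resource name, parameter name, and capacity values are placeholders, and the scale-in rule has the same shape with the operator and direction reversed:

```json
{
  "type": "microsoft.insights/autoscalesettings",
  "apiVersion": "2015-04-01",
  "name": "sfCustomMetricAutoscale",
  "location": "[resourceGroup().location]",
  "properties": {
    "name": "sfCustomMetricAutoscale",
    "enabled": true,
    "targetResourceUri": "[resourceId('Microsoft.Compute/virtualMachineScaleSets', parameters('vmNodeType0Name'))]",
    "profiles": [
      {
        "name": "DefaultProfile",
        "capacity": { "minimum": "5", "maximum": "10", "default": "5" },
        "rules": [
          {
            "metricTrigger": {
              "metricName": "CountNumServices",
              "metricResourceUri": "[resourceId('Microsoft.Insights/components', 'aiServiceFabricCustomMeric')]",
              "timeGrain": "PT1M",
              "statistic": "Average",
              "timeWindow": "PT5M",
              "timeAggregation": "Average",
              "operator": "GreaterThan",
              "threshold": 600
            },
            "scaleAction": {
              "direction": "Increase",
              "type": "ChangeCount",
              "value": "2",
              "cooldown": "PT1M"
            }
          }
        ]
      }
    ]
  }
}
```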

I'll give the same warning here that I did in Part I. Attempting to scale out or in too rapidly will cause instability in your cluster. When scaling out, you might consider increasing the number of nodes added at one time so you don't have to wait as long for the performance improvement (if that is what you are after).

[Service Fabric] Auto-scaling your Service Fabric cluster–Part I

Like most people, whenever I need to build an ARM template to do something with Service Fabric, I'm browsing around GitHub or wherever else I can find the bits and pieces of JSON that I need.

I was recently working on a project where they needed 3 things:

1. They wanted their Service Fabric cluster to use managed disks instead of Azure Storage accounts.

2. They wanted to have auto-scaling setup for their virtual machine scale set (VMSS). In this case, we started out using the Percentage CPU rule, which is based on the VM scale set hardware and nothing application specific. In Part II, I’ll talk about auto-scaling via custom metrics and Application Insights.

3. They had a stateless service where they wanted to register their service event source to output logging information into the WADETWEventTable.

This template includes those 3 implementations. You can find the files here: https://github.com/larrywa/blogpostings/tree/master/VMSSScaleTemplate. This is a secure cluster implementation, so you'll need to have your primary certificate in Azure Key Vault, etc.

Using managed disks

Some of the newer templates you'll find on GitHub already use managed disks for the Service Fabric cluster, but in case the one you are using doesn't, find the Microsoft.Compute/virtualMachineScaleSets resource in your JSON file and make the following modifications.

[Screenshot: managed disk settings in the Microsoft.Compute/virtualMachineScaleSets resource]
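If your existing template still defines VHDs in storage accounts, the change amounts to something like the following sketch of the scale set's storageProfile (the image parameters are the usual placeholders from the sample templates, not necessarily the names in your file): the vhdContainers/storage account references go away and the osDisk gets a managedDisk element.

```json
"storageProfile": {
  "imageReference": {
    "publisher": "[parameters('vmImagePublisher')]",
    "offer": "[parameters('vmImageOffer')]",
    "sku": "[parameters('vmImageSku')]",
    "version": "[parameters('vmImageVersion')]"
  },
  "osDisk": {
    "caching": "ReadOnly",
    "createOption": "FromImage",
    "managedDisk": {
      "storageAccountType": "Standard_LRS"
    }
  }
}
```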

Another important setting you need to make sure you have is overProvision = false (here placed in a variable):

[Screenshot: the overProvision variable in the template]

This variable is actually used in the Microsoft.Compute/virtualMachineScaleSets resource provider properties:

[Screenshot: the overProvision variable used in the virtualMachineScaleSets properties]
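Pieced together, the two parts look roughly like this (a trimmed sketch showing only the relevant lines). First, the variable:

```json
"variables": {
  "overProvision": "false"
}
```

And then, inside the scale set's properties:

```json
"properties": {
  "overprovision": "[variables('overProvision')]"
}
```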

More information about overprovisioning can be found here: https://docs.microsoft.com/en-us/rest/api/compute/virtualmachinescalesets/create-or-update-a-set. If this setting is missing or set to true, you may see more than the requested number of machines and nodes created at deployment, with the ones that are not needed then turned off. This will cause errors to appear in Service Fabric Explorer. Service Fabric will eventually clean up after itself, but when you first see the errors, you'll think you did something wrong.

Setting Up Auto-scale on your VMSS

At first my idea was to go to my existing cluster, turn on auto-scaling inside the VMSS settings, and then export the template from the Azure portal. I then discovered that my subscription did not have permission to use the microsoft.insights resource provider. I'm not sure you'll run into this, but if you do, you can enable it in the portal under Subscriptions -> your subscription -> Resource providers -> Microsoft.Insights.

The ‘microsoft.insights/autoscalesettings’ resource is placed at the same level in the JSON file as the other major resources such as the cluster and the virtual machine scale set. Although auto-scale shows up in the portal as a setting on the scale set, it is not a sub-section of the VMSS resource; it is actually a separate resource, as shown in the next screenshot.

In the Json outline editor, the auto-scale resource will look like this:

[Screenshot: the microsoft.insights/autoscalesettings resource in the JSON outline editor]

There is a rule with conditions for scaling out and scaling in based on the Percentage CPU metric. Before deploying, go in and set your own desired levels in the capacity area, line 838 in this sample. The Percentage CPU values I have set in this sample are ridiculously low just so you can see something happen, and if you're impatient, you can make them even lower just to watch the scaling take place. A trimmed sketch of the resource appears below.
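Trimmed way down, the parts of that resource you are most likely to touch are the capacity block and the Percentage CPU metric trigger. This is a sketch with placeholder values, not the exact contents of the sample template; the scale-in rule mirrors it with LessThan and Decrease:

```json
"capacity": {
  "minimum": "5",
  "maximum": "10",
  "default": "5"
},
"rules": [
  {
    "metricTrigger": {
      "metricName": "Percentage CPU",
      "metricResourceUri": "[resourceId('Microsoft.Compute/virtualMachineScaleSets', parameters('vmNodeType0Name'))]",
      "timeGrain": "PT1M",
      "statistic": "Average",
      "timeWindow": "PT5M",
      "timeAggregation": "Average",
      "operator": "GreaterThan",
      "threshold": 10
    },
    "scaleAction": {
      "direction": "Increase",
      "type": "ChangeCount",
      "value": "1",
      "cooldown": "PT5M"
    }
  }
]
```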

Registering your Event Source provider

To register an event source provider, you need to add your event source provider to the Microsoft.Compute/virtualMachineScaleSets resource provider in the Microsoft.Azure.Diagnostics section.

[Screenshot: the event source provider entry in the Microsoft.Azure.Diagnostics section]
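As a rough sketch (the provider name below is a placeholder, and the rest of the WadCfg settings already in the template are omitted), the entry looks something like this:

```json
"EtwProviders": {
  "EtwEventSourceProviderConfiguration": [
    {
      "provider": "MyCompany-SimpleStateless-WebAPI",
      "DefaultEvents": {
        "eventDestination": "WADETWEventTable"
      }
    }
  ]
}
```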

You will of course need to make sure that the ‘provider’ listed above matches the EventSource name in your ServiceEventSource.cs file of your service. This is what I have for my stateless service that I’ll deploy to the cluster once it’s up and running.

[Screenshot: the EventSource name in ServiceEventSource.cs]
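In other words, the name passed to the EventSource attribute is what the ‘provider’ field has to match. Here is a minimal sketch using the same placeholder name as above, not the actual name from the sample repo:

```csharp
using System.Diagnostics.Tracing;

// The Name here must match the 'provider' value in the diagnostics extension settings.
[EventSource(Name = "MyCompany-SimpleStateless-WebAPI")]
internal sealed class ServiceEventSource : EventSource
{
    public static readonly ServiceEventSource Current = new ServiceEventSource();

    [Event(2, Level = EventLevel.Informational, Message = "{0}")]
    public void Message(string message)
    {
        WriteEvent(2, message);
    }
}
```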

Template Deployment

For deployment, there is a deploy.ps1 PowerShell script included to create the Service Fabric cluster, but you can really deploy the ARM template with whatever script you normally use for your other ARM templates. Make sure you go through the process of creating your cluster certificate and putting it in Azure Key Vault first, though; you will need that information for any secure cluster.

Note that in this example on GitHub, the parameters file is named ’empty-sfmanageddisk.parameters.json’; to get deploy.ps1 to work, you need to remove the ’empty-’ part of the name.

Once your cluster is up and running….confirm your auto-scale settings

Within your cluster resource group, click on the name of your VM scale set. Then click on the Scaling menu item. What you should see is something like this:

[Screenshot: the Scaling blade of the VM scale set showing the scale out and scale in rules]

If you want to make changes to the levels for Scale out and Scale in, just click on the ‘When’ rules and an edit blade will appear where you can make those changes. Then click the Update button.

Next, to get notified of an auto-scale event, click on the Notify tab, enter an email address and then click the Save button.

[Screenshot: the Notify tab with an email address entered]

Before moving on to the next step, make sure that your cluster nodes are up and running by clicking on the name of your cluster in your resource group and looking for a visual confirmation of available nodes:

[Screenshot: the cluster overview in the portal showing the available nodes]

Publish your Service Fabric application

For this example, I have a simple stateless ASP.NET Core 2.0 sample that uses the Kestrel communications listener. You can either choose a dynamic port or specify a specific port number in your ServiceManifest.xml file (see the sketch below). Publish the service to the cluster using Visual Studio 2017.
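For reference, here is a sketch of the endpoint declaration in ServiceManifest.xml; the endpoint name and port are placeholders, and leaving the Port attribute off lets Service Fabric assign a dynamic port:

```xml
<Resources>
  <Endpoints>
    <!-- Remove the Port attribute to have Service Fabric pick a dynamic port -->
    <Endpoint Name="ServiceEndpoint" Protocol="http" Port="8080" />
  </Endpoints>
</Resources>
```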

In the WebAPI.cs file, you will notice some code that is commented out. I’ll use this in part II of this article when I discuss scaling via custom metrics.

You can find this code sample at https://github.com/larrywa/blogpostings/tree/master/SimpleStateless


Email from Azure during auto-scale event

Each time an auto-scale event happens, you will receive an email similar to the one below:

[Screenshot: example auto-scale notification email from Azure]

WARNING: The values that I have set for my auto-scale ranges are extremely aggressive. This is just so that I can see something happening right away. In an actual production environment, you cannot expect rapid scale-up and scale-down of Service Fabric nodes. Remember, for each new node the VM is created first, then the Service Fabric bits are installed on it, and then all of that has to be spun up and registered, plus the rest of the Service Fabric magic. You are looking at several minutes for this to take place. If your scale-up and scale-down parameters are too aggressive, you could actually get the cluster into a situation where it is trying to spin up a node while taking down the same node. The node will then go into an unknown state, which you will have to correct manually by turning off auto-scaling until things settle down.