C# semantic classification with Roslyn


A while ago, I blogged about using Roslyn’s completion service. In today’s post, I wanted to continue looking at some of the excellent compiler features that can be utilized to build IDE-like features in your projects. This time, we will look at how to do semantic classification of the code using Roslyn.

Using the classifier

Roslyn exposes a static Classifier service, which can be used to ask the compiler to semantically classify the spans contained in a given document or in a semantic model (or part of it). The API has existed since Roslyn 1.0 and is part of the workspace layer of Roslyn - the Microsoft.CodeAnalysis.CSharp.Workspaces NuGet package. Under the hood, it is underpinned by an internal language service, ISyntaxClassificationService.

Classifier exposes two public methods, which, as mentioned briefly, operate at the document level or at the semantic model level. In either case, you need to initialize a Roslyn workspace (most often an MSBuild-based workspace) to be able to work with the API. This holds even if you want to classify a standalone, loose piece of C#; in that case a dummy workspace is necessary.

public static class Classifier
{
    public static async Task<IEnumerable<ClassifiedSpan>> GetClassifiedSpansAsync(
        Document document,
        TextSpan textSpan,
        CancellationToken cancellationToken = default);

    public static IEnumerable<ClassifiedSpan> GetClassifiedSpans(
        SemanticModel semanticModel,
        TextSpan textSpan,
        Workspace workspace,
        CancellationToken cancellationToken = default);
}

Looking at the API, you’d likely wonder why one method is async but the other isn’t. The document-based method is async because it first needs to obtain the semantic model from the document, which is itself an async operation. Once the semantic model is available, there is no async work left to do, hence the second method doesn’t need to be asynchronous.
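To make that relationship concrete, here is a minimal sketch of how the document-based call could be expressed in terms of the semantic model-based one. ClassifierSketch and ClassifyViaSemanticModelAsync are hypothetical names of my own, not part of Roslyn, and this is not a copy of Roslyn's actual implementation; it only assumes a Roslyn version where the Workspace-based overload shown above is available.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Classification;
using Microsoft.CodeAnalysis.Text;

public static class ClassifierSketch
{
    // Hypothetical helper (not a Roslyn API): illustrates where the async
    // work sits when you start from a Document.
    public static async Task<IEnumerable<ClassifiedSpan>> ClassifyViaSemanticModelAsync(
        Document document,
        TextSpan textSpan,
        CancellationToken cancellationToken = default)
    {
        // The only asynchronous step: obtaining the semantic model.
        var semanticModel = await document.GetSemanticModelAsync(cancellationToken);
        if (semanticModel == null)
        {
            // No semantic model available for this document, nothing to classify.
            return Enumerable.Empty<ClassifiedSpan>();
        }

        // From here on, everything is synchronous.
        return Classifier.GetClassifiedSpans(
            semanticModel,
            textSpan,
            document.Project.Solution.Workspace,
            cancellationToken);
    }
}
```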

How do we use the Classifier? Let’s imagine we’d like to classify the following simple piece of code:

using System;

public class MyClass
{
    public static void MyMethod(int value)
    {
    }
}

We have already mentioned that a workspace is necessary to use the classifier, and the quickest way to get one is to use AdhocWorkspace and the default MefHostServices. They will contain the necessary internal compiler services that the classifier requires. For simplicity of the demo, we will hardcode our input code into a local variable - in normal use cases you’d be reading from disk or from some client/user request. If you are dealing with a full C# solution instead of standalone C# code, the more appropriate choice over AdhocWorkspace would be MSBuildWorkspace.

var code = @"using System;

public class MyClass
{
    public static void MyMethod(int value)
    {
    }
}";

var host = MefHostServices.Create(MefHostServices.DefaultAssemblies);
var workspace = new AdhocWorkspace(host);

Once you have the workspace, you need to produce a Document or a SemanticModel representing the piece of code to classify. Let’s first look at the semantic model approach, as it’s - in my opinion - a bit less work.

// Parse the code and build a minimal compilation that references the core library
var sourceText = SourceText.From(code);
var syntaxTree = CSharpSyntaxTree.ParseText(sourceText);
var compilation = CSharpCompilation.Create("Dummy")
    .AddReferences(MetadataReference.CreateFromFile(typeof(object).Assembly.Location))
    .AddSyntaxTrees(syntaxTree);
var semanticModel = compilation.GetSemanticModel(syntaxTree);

// Classify the entire text
var classifiedSpans = Classifier.GetClassifiedSpans(semanticModel, new TextSpan(0, code.Length), workspace);

// Print each span as: text - classification type - start:end (line,column)
foreach (var classifiedSpan in classifiedSpans)
{
    var position = sourceText.Lines.GetLinePositionSpan(classifiedSpan.TextSpan);
    Console.WriteLine($"{sourceText.ToString(classifiedSpan.TextSpan)} - {classifiedSpan.ClassificationType} - {position.Start}:{position.End}");
}

The first thing to do is to create the SourceText representing our string-based C# code. SourceText can then be fed into the syntax tree parser, producing a C# syntax tree. At this point we are halfway there, but we still need to initialize the compilation, as the semantic model is a product of the compilation pipeline. When you do that, you need to make sure all the metadata references needed for the code to compile are available - in our case only the core library is needed (typeof(object).Assembly, which is mscorlib on .NET Framework). Finally, we can get the semantic model for a syntax tree by querying the newly created compilation.

Next, we can call the classifier, and pass in our semantic model, and the text span corresponding to the piece of code we want to classify. We use new TextSpan(0, code.Length), which simply means the entire code will be classified; however it is also possible to tweak the TextSpan so that position is offset and length is shorter, and thus only part of the code would be submitted for classification - it all depends on the use cases.
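As a rough illustration of classifying only part of the code, the sketch below restricts classification to a single line by passing that line's span instead of the whole text. PartialClassification and ClassifyLine are hypothetical names introduced just for this example; the setup is otherwise the same as in the sample above.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Classification;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Host.Mef;
using Microsoft.CodeAnalysis.Text;

public static class PartialClassification
{
    // Hypothetical helper: classifies a single (zero-based) line of the code
    // and returns "text - classification type" strings for it.
    public static string[] ClassifyLine(string code, int lineIndex)
    {
        var host = MefHostServices.Create(MefHostServices.DefaultAssemblies);
        using var workspace = new AdhocWorkspace(host);

        var sourceText = SourceText.From(code);
        var syntaxTree = CSharpSyntaxTree.ParseText(sourceText);
        var compilation = CSharpCompilation.Create("Dummy")
            .AddReferences(MetadataReference.CreateFromFile(typeof(object).Assembly.Location))
            .AddSyntaxTrees(syntaxTree);
        var semanticModel = compilation.GetSemanticModel(syntaxTree);

        // TextLine.Span covers the line's text without the trailing line break.
        TextSpan lineSpan = sourceText.Lines[lineIndex].Span;

        return Classifier.GetClassifiedSpans(semanticModel, lineSpan, workspace)
            .Select(s => $"{sourceText.ToString(s.TextSpan)} - {s.ClassificationType}")
            .ToArray();
    }
}
```

Calling PartialClassification.ClassifyLine(code, 2) on our sample should report only the tokens of the class declaration line. This can be handy when an editor only needs to re-highlight the visible or recently changed region.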

At the end we print all the results, which should show us a nice set of classification info:

using - keyword - 0,0:0,5
System - namespace name - 0,6:0,12
; - punctuation - 0,12:0,13
public - keyword - 2,12:2,18
class - keyword - 2,19:2,24
MyClass - class name - 2,25:2,32
{ - punctuation - 3,12:3,13
public - keyword - 4,16:4,22
static - keyword - 4,23:4,29
void - keyword - 4,30:4,34
MyMethod - method name - 4,35:4,43
MyMethod - static symbol - 4,35:4,43
( - punctuation - 4,43:4,44
int - keyword - 4,44:4,47
value - parameter name - 4,48:4,53
) - punctuation - 4,53:4,54
{ - punctuation - 5,16:5,17
} - punctuation - 6,16:6,17
} - punctuation - 7,12:7,13

For the sake of completeness, let’s also show what the code would look like if we were to use the document-based API. In order to add a document to a workspace, we also need to create a project that would hold that document. Overall, there are several ways of achieving that - one example is shown below. All the rest of the code (creating the source text or displaying the classified spans) is the same as before.

var sourceText = SourceText.From(code);

// A document has to live in a project, so create a minimal one first
var projectInfo = ProjectInfo.Create(ProjectId.CreateNewId(), VersionStamp.Create(), "MyProject", "MyProject", LanguageNames.CSharp)
    .WithMetadataReferences(new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) });
var project = workspace.AddProject(projectInfo);
var document = workspace.AddDocument(project.Id, "MyFile.cs", sourceText);

// The compilation and semantic model are created implicitly from the project
var classifiedSpans = await Classifier.GetClassifiedSpansAsync(document, new TextSpan(0, code.Length));
foreach (var classifiedSpan in classifiedSpans)
{
    var position = sourceText.Lines.GetLinePositionSpan(classifiedSpan.TextSpan);
    Console.WriteLine($"{sourceText.ToString(classifiedSpan.TextSpan)} - {classifiedSpan.ClassificationType} - {position.Start}:{position.End}");
}

In this approach, we do not need to manually create a Compilation, because it is created implicitly for us based on the Project we set up. Overall, there is very little difference between the two APIs. When working with structured solutions and MSBuildWorkspace, you’d typically be dealing with documents anyway, so the code from the second sample would be more natural to use. For classifying standalone C# code with AdhocWorkspace, the first example is probably less tedious.

Why do you need the classification?

The most obvious use case is to provide syntax highlighting. Using the semantic classifier and the power of the compiler provides an extremely reliable and advanced way of highlighting the code, taking all the aspects and language features into account - especially when a typical alternative would be static and regular expression based. This approach is now used in the highlighting features of OmniSharp.
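To give a feel for how classified spans translate into highlighting, here is a minimal console-based sketch. ConsoleHighlighter, Segment, Write and the colour mapping are all hypothetical choices made for this example; this is not how OmniSharp or Visual Studio implement it. The sketch fills the gaps between spans (mostly whitespace, which the classifier does not report) and ignores additive classifications.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis.Classification;
using Microsoft.CodeAnalysis.Text;

public static class ConsoleHighlighter
{
    // Hypothetical colour scheme, purely for illustration.
    private static ConsoleColor ColorFor(string classificationType) => classificationType switch
    {
        ClassificationTypeNames.Keyword => ConsoleColor.Blue,
        ClassificationTypeNames.ClassName => ConsoleColor.Cyan,
        ClassificationTypeNames.MethodName => ConsoleColor.Yellow,
        ClassificationTypeNames.NamespaceName => ConsoleColor.White,
        _ => ConsoleColor.Gray,
    };

    // Splits the text into consecutive segments. Classification is null for
    // gaps that were not covered by any classified span.
    public static IEnumerable<(string Text, string Classification)> Segment(
        SourceText text, IEnumerable<ClassifiedSpan> spans)
    {
        var position = 0;
        var primary = spans
            .Where(s => !ClassificationTypeNames.AdditiveTypeNames.Contains(s.ClassificationType))
            .OrderBy(s => s.TextSpan.Start);

        foreach (var span in primary)
        {
            // Skip spans overlapping text that was already emitted.
            if (span.TextSpan.Start < position) continue;

            if (span.TextSpan.Start > position)
                yield return (text.ToString(TextSpan.FromBounds(position, span.TextSpan.Start)), null);

            yield return (text.ToString(span.TextSpan), span.ClassificationType);
            position = span.TextSpan.End;
        }

        if (position < text.Length)
            yield return (text.ToString(TextSpan.FromBounds(position, text.Length)), null);
    }

    // Writes the text to the console, colouring each classified segment.
    public static void Write(SourceText text, IEnumerable<ClassifiedSpan> spans)
    {
        foreach (var (segment, classification) in Segment(text, spans))
        {
            if (classification != null) Console.ForegroundColor = ColorFor(classification);
            Console.Write(segment);
            Console.ResetColor();
        }
    }
}
```

With the variables from the earlier samples, ConsoleHighlighter.Write(sourceText, classifiedSpans) would print the code with colours applied. Keeping Segment separate from Write makes it straightforward to target something other than the console, such as HTML.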

One final thing about classification is that if you look closely at the results we produced, there is one strange thing going on. MyMethod at positions 4,35:4,43 is actually classified twice:

MyMethod - method name - 4,35:4,43
MyMethod - static symbol - 4,35:4,43

once as method name and once as static symbol. The second classification is the so-called “additive classification”. At the moment, Roslyn only uses static symbols for this additive classification, but that might change in the future. This information allows, for example, additional highlighting to be applied to static symbols (e.g. making them bold).
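If you would rather keep that information than discard it, one option is to fold the additive classifications into the primary one for the same span. The sketch below shows one way to do it; AdditiveClassifications, Merge and the tuple shape are hypothetical, made up for this post, and it assumes that additive spans share the exact TextSpan of the primary span, as in the output above.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis.Classification;
using Microsoft.CodeAnalysis.Text;

public static class AdditiveClassifications
{
    // Hypothetical helper: returns one entry per text span, with additive
    // classifications (e.g. "static symbol") collected as modifiers.
    public static IReadOnlyList<(TextSpan Span, string Type, IReadOnlyList<string> Modifiers)> Merge(
        IEnumerable<ClassifiedSpan> spans)
    {
        var merged = new List<(TextSpan, string, IReadOnlyList<string>)>();

        // GroupBy keeps the groups in order of first appearance.
        foreach (var group in spans.GroupBy(s => s.TextSpan))
        {
            var types = group.Select(s => s.ClassificationType).Distinct().ToList();

            var primary = types.FirstOrDefault(
                t => !ClassificationTypeNames.AdditiveTypeNames.Contains(t));

            // A span with only additive classifications has nothing to attach to.
            if (primary == null) continue;

            var modifiers = types
                .Where(t => ClassificationTypeNames.AdditiveTypeNames.Contains(t))
                .ToList();

            merged.Add((group.Key, primary, modifiers));
        }

        return merged;
    }
}
```

A highlighter could then pick the colour from Type and, for instance, switch to a bold font whenever Modifiers contains ClassificationTypeNames.StaticSymbol.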

You can always exclude it from the result set too, by querying the ClassificationTypeNames.AdditiveTypeNames collection:

var filteredClassifiedSpans = classifiedSpans.Where(s =>
    !ClassificationTypeNames.AdditiveTypeNames.Contains(s.ClassificationType));

In fact, this is what we do in OmniSharp, and it is what Visual Studio does too.

You can find the source code for this blog post on GitHub.

About


Hi! I'm Filip W., a cloud architect from Zürich 🇨🇭. I like Toronto Maple Leafs 🇨🇦, Rancid and quantum computing. Oh, and I love the Lowlands 🏴󠁧󠁢󠁳󠁣󠁴󠁿.
