Get AST from .Net assembly without source code (IL code)

I'd like to analyze .NET assemblies in a way that is independent of the source language (C#, VB.NET, or whatever).
I know about Roslyn and NRefactory, but they seem to work only at the C# source code level.
There is also the "Common Compiler Infrastructure: Code Model and AST API" project on CodePlex, which claims that it "supports a hierarchical object model that represents code blocks in a language-independent structured form", which sounds like exactly what I am looking for.
However, I am unable to find any useful documentation or code that actually does this.
Any advice on how to achieve this?
Could Mono.Cecil perhaps do something like this?

There are 4 answers

Piedone On

You can do this, and there is a (tiny) example of it in the ILSpy source.

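// Assumes Mono.Cecil plus the older ICSharpCode.Decompiler API (namespace ICSharpCode.Decompiler.Ast).
var assembly = AssemblyDefinition.ReadAssembly("path/to/assembly.dll");
var astBuilder = new AstBuilder(new DecompilerContext(assembly.MainModule));
// Decompile the whole assembly into the builder's syntax tree.
astBuilder.AddAssembly(assembly);
astBuilder.SyntaxTree...
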
Ira Baxter On

If you treat the .NET binary file as a stream of bytes, you ought to be able to "parse" it just fine.

You simply write a grammar whose tokens are essentially bytes. You can certainly build a classical lexer/parser with almost any set of lexer/parser tools by defining the lexer to read single bytes as tokens.

You can then build the AST using standard AST-building machinery for the parsing engine (on your own for YACC, automatically with ANTLR4).

What you will discover, of course, is that "parsing" isn't enough; you'll still need to build symbol tables, and carry out control and data flow analyses if you are going to do serious analysis of the corresponding code. See my essay on LifeAfterParsing.

You will also likely have to take into account "distinguished" functions that provide key runtime facilities to the particular programming languages that actually generated the CIL code. And these will make your analyzers language-dependent. Yes, you still get to share the part of the analysis that works on generic CIL.
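
To make the "bytes as tokens" idea concrete, here is a minimal hand-written sketch (not a generated lexer/parser) that recognizes only the very first structure in a .NET binary: the "MZ" marker of the DOS header and the "PE\0\0" signature it points to. The class and method names are made up for illustration; a real grammar would have to continue through the CLI header, the metadata tables and the method bodies.

```csharp
using System;
using System.IO;

static class PeSignatureSketch
{
    // Returns the offset of the "PE\0\0" signature, or -1 if the bytes
    // do not look like a PE image.
    public static int FindPeSignature(byte[] image)
    {
        // The DOS header is 64 bytes long and starts with the tokens 'M' 'Z'.
        if (image.Length < 0x40 || image[0] != (byte)'M' || image[1] != (byte)'Z')
            return -1;

        // Bytes 0x3C..0x3F hold the little-endian offset of the PE header.
        int peOffset = image[0x3C] | (image[0x3D] << 8)
                     | (image[0x3E] << 16) | (image[0x3F] << 24);
        if (peOffset < 0 || peOffset > image.Length - 4)
            return -1;

        // The PE header starts with the four tokens 'P' 'E' 0 0.
        bool match = image[peOffset] == (byte)'P'
                  && image[peOffset + 1] == (byte)'E'
                  && image[peOffset + 2] == 0
                  && image[peOffset + 3] == 0;
        return match ? peOffset : -1;
    }

    static void Main(string[] args)
    {
        // Usage: pass the path of an assembly as the first argument.
        byte[] image = File.ReadAllBytes(args[0]);
        Console.WriteLine("PE signature offset: " + FindPeSignature(image));
    }
}
```

Each `if` plays the role of one grammar rule over byte tokens; a parser generator would produce the same kind of checks from a declarative grammar.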

AIVoid On

As far as I know, it's not possible to build an AST from a binary (without sources), since the AST itself is generated by the parser from the sources as part of the compilation process. Mono.Cecil won't help because you can only modify opcodes/metadata with it, not analyze the assembly.

But since it's .NET, you can dump the IL code from the DLL with the help of ildasm. Then you can pass the generated source to any parser with a CIL grammar hooked up and get the AST from that parser. The problem is that, as far as I know, there is only one publicly available CIL grammar, so you don't really have a choice. And ECMA-335 is big enough that writing your own grammar is a bad idea. So I can suggest only one solution:

  1. Pass the assembly to ildasm.exe to get the CIL.
  2. Then pass the CIL to an ANTLR v3 parser with this CIL grammar wired up (note that it's a little outdated: the grammar was created in 2004 and the latest CIL specification is from 2006, but CIL hasn't really changed much).
  3. After that, you can freely access the AST generated by ANTLR.
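
Step 1 could be scripted roughly as follows. This is only a sketch: it assumes ildasm.exe (from the Windows SDK) is on the PATH, and the class and method names are made up for illustration. The `/out=` option tells ildasm to write the disassembly to a file instead of opening its GUI.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class IldasmStep
{
    // Builds the command line "ildasm.exe <assembly> /out=<il file>".
    public static ProcessStartInfo BuildIldasmCommand(string assemblyPath, string ilPath)
    {
        return new ProcessStartInfo
        {
            FileName = "ildasm.exe", // assumption: resolvable via PATH
            Arguments = $"\"{assemblyPath}\" /out=\"{ilPath}\"",
            UseShellExecute = false,
        };
    }

    static int Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: IldasmStep <assembly>");
            return 1;
        }

        // Write the CIL text next to the assembly, e.g. Foo.dll -> Foo.il.
        string ilPath = Path.ChangeExtension(args[0], ".il");
        using (var process = Process.Start(BuildIldasmCommand(args[0], ilPath)))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

The resulting .il file is the text that would be fed to the ANTLR parser in step 2.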

Note that you will need ANTLR v3, not v4, since the grammar was written for version 3, and porting it to v4 is hardly possible without good knowledge of ANTLR syntax.

You could also look into the sources of Microsoft's new RyuJIT compiler on GitHub (part of CoreCLR). I'm not sure it helps, but in theory it must contain CIL grammar and parser implementations, since it works with CIL code. However, it's written in C++, has an enormous code base, and lacks documentation since it's in active development, so it may be easier to stick with ANTLR.

svick On

The CCI Code Model is somewhere between an IL disassembler and a full C# decompiler: it gives your code some structure (e.g. if statements and expressions), but it also contains some low-level stack operations like push and pop.

CCI contains a sample that shows this: PeToText.

For example, to get the Code Model for the first method of the Program type (in the global namespace), you could use code like this:

string fileName = "whatever.exe";

using (var host = new PeReader.DefaultHost())
{
    var module = (IModule)host.LoadUnitFrom(fileName);
    var type = (ITypeDefinition)module.UnitNamespaceRoot.Members
        .Single(m => m.Name.Value == "Program");
    var method = (IMethodDefinition)type.Members.First();
    var methodBody = new SourceMethodBody(method.Body, host, null, null);
}

To demonstrate, if you decompile the above code and show it using PeToText, you're going to get:

Microsoft.Cci.ITypeDefinition local_3;
Microsoft.Cci.ILToCodeModel.SourceMethodBody local_5;
string local_0 = "C:\\code\\tmp\\nuget tmp 2015\\bin\\Debug\\nuget tmp 2015.exe";
Microsoft.Cci.PeReader.DefaultHost local_1 = new Microsoft.Cci.PeReader.DefaultHost();
try
{
    push (Microsoft.Cci.IModule)local_1.LoadUnitFrom(local_0).UnitNamespaceRoot.Members;
    push Program.<>c.<>9__0_0;
    if (dup == default(System.Func<Microsoft.Cci.INamespaceMember, bool>))
    {
        pop;
        push Program.<>c.<>9.<Main0>b__0_0;
        Program.<>c.<>9__0_0 = dup;
    }
    local_3 = (Microsoft.Cci.ITypeDefinition)System.Linq.Enumerable.Single<Microsoft.Cci.INamespaceMember>(pop, pop);
    local_5 = new Microsoft.Cci.ILToCodeModel.SourceMethodBody((Microsoft.Cci.IMethodDefinition)System.Linq.Enumerable.First<Microsoft.Cci.ITypeDefinitionMember>(local_3.Members).Body, local_1, (Microsoft.Cci.ISourceLocationProvider)null, (Microsoft.Cci.ILocalScopeProvider)null, 0);
}
finally
{
    if (local_1 != default(Microsoft.Cci.PeReader.DefaultHost))
    {
        local_1.Dispose();
    }
}

Of note are all those push, pop and dup statements and the lambda caching condition.