
An example of parsing C++ code using libclang in Python

In a personal C++ project, I needed to get information about object types at run time. C++ has a built-in Run-Time Type Information (RTTI) mechanism, and my first thought was of course to use it. In the end I decided to write my own implementation: I needed only a small part of the built-in mechanism's functionality and did not want to pull in all of it. I also wanted to practice the new C++17 features, which I was not particularly familiar with.


In this post I will show an example of working with the libclang parser from Python.


I will omit the details of my RTTI implementation. What matters for us here is how the classes that use it are declared.



Example:


    #pragma once

    #include <string>
    #include "RTTI.h"

    struct BaseNode : public IRttiTypeIdProvider
    {
        virtual ~BaseNode() = default;

        bool bypass = false;
    };

    struct SourceNode : public BaseNode
    {
        RTTI_HAS_TYPE_ID

        std::string inputFilePath;
    };

    struct DestinationNode : public BaseNode
    {
        RTTI_HAS_TYPE_ID

        bool includeDebugInfo = false;
        std::string outputFilePath;
    };

    struct MultiplierNode : public BaseNode
    {
        RTTI_HAS_TYPE_ID

        double multiplier;
    };

    struct InverterNode : public BaseNode
    {
        RTTI_HAS_TYPE_ID
    };

This was already usable, but after a while I needed information about the fields of these classes: each field's name, offset and size. To implement this, a structure describing every field of each class of interest has to be written by hand somewhere in a .cpp file. After I wrote several macros, the description of a type and its fields looked like this:


    RTTI_PROVIDER_BEGIN_TYPE(SourceNode)
    (
        RTTI_DEFINE_FIELD(SourceNode, bypass)
        RTTI_DEFINE_FIELD(SourceNode, inputFilePath)
    )
    RTTI_PROVIDER_END_TYPE()

    RTTI_PROVIDER_BEGIN_TYPE(DestinationNode)
    (
        RTTI_DEFINE_FIELD(DestinationNode, bypass)
        RTTI_DEFINE_FIELD(DestinationNode, includeDebugInfo)
        RTTI_DEFINE_FIELD(DestinationNode, outputFilePath)
    )
    RTTI_PROVIDER_END_TYPE()

    RTTI_PROVIDER_BEGIN_TYPE(MultiplierNode)
    (
        RTTI_DEFINE_FIELD(MultiplierNode, bypass)
        RTTI_DEFINE_FIELD(MultiplierNode, multiplier)
    )
    RTTI_PROVIDER_END_TYPE()

    RTTI_PROVIDER_BEGIN_TYPE(InverterNode)
    (
        RTTI_DEFINE_FIELD(InverterNode, bypass)
    )
    RTTI_PROVIDER_END_TYPE()

And this is only for 4 classes. What problems can be identified?


  1. When copying blocks of code by hand, it is easy to miss a class name in a field definition (say, you copied the SourceNode block for DestinationNode but forgot to change SourceNode to DestinationNode in one of the fields). The compiler will accept it and the application may not even crash, but the field information will be wrong. And if you read or write data based on that information, everything will blow up (or so they say; I would rather not check for myself).
  2. If you add a field to the base class, you need to update ALL the entries.
  3. If you change the name or the order of fields in a class, you must remember to update the name and order in this pile of code as well.

But the main problem is that all of this has to be written by hand. When it comes to such monotonous code, I get very lazy and look for a way to generate it automatically, even if that takes more time and effort than writing it manually.


Python helps me here: I write scripts in it for tasks like this. But this time we are dealing not with plain template text but with text derived from C++ source code. We need a tool that can extract information from C++ code, and libclang will help us with that.


libclang is a high-level C interface to Clang. It provides an API for tools that parse source code into an abstract syntax tree (AST), load already parsed ASTs, traverse the AST, map physical source locations to elements within the AST, and support other tools from the Clang suite.

As the description says, libclang provides a C interface, so working with it from Python requires a binding library. At the time of writing there is no official library of this kind for Python, but there is an unofficial one: https://github.com/ethanhs/clang .


Install it through the package manager:


 pip install clang 

The library's source code is commented. But to understand how libclang itself works, you need to read the libclang documentation. There are few examples of using the library, and no comments explaining why things work the way they do. Those with prior libclang experience will have fewer questions, but I had none, so I had to dig through the code a fair amount and poke around in the debugger.


Let's start with a simple example:


    import clang.cindex

    index = clang.cindex.Index.create()
    translation_unit = index.parse('my_source.cpp', args=['-std=c++17'])

    for i in translation_unit.get_tokens(extent=translation_unit.cursor.extent):
        print(i.kind)

This creates an object of type Index, which can parse a file with C++ code. The parse method returns an object of type TranslationUnit, that is, a translation unit. Its cursor property is the root AST node, and every AST node stores its position in the source code (extent). We loop over all the tokens in the translation unit and print the kind of each token (the kind property).


For example, take the following C ++ code:


    class X {};

    class Y {};

    class Z : public X {};

Script execution result:

    TokenKind.KEYWORD
    TokenKind.IDENTIFIER
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.KEYWORD
    TokenKind.IDENTIFIER
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.KEYWORD
    TokenKind.IDENTIFIER
    TokenKind.PUNCTUATION
    TokenKind.KEYWORD
    TokenKind.IDENTIFIER
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION
    TokenKind.PUNCTUATION

Now let's deal with the AST. Before writing Python code, let's see what to expect from the clang parser. Run clang in AST dump mode:


 clang++ -cc1 -ast-dump my_source.cpp 

Result of the command:

    TranslationUnitDecl 0xaaaa9b9fa8 <<invalid sloc>> <invalid sloc>
    |-TypedefDecl 0xaaaa9ba880 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
    | `-BuiltinType 0xaaaa9ba540 '__int128'
    |-TypedefDecl 0xaaaa9ba8e8 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
    | `-BuiltinType 0xaaaa9ba560 'unsigned __int128'
    |-TypedefDecl 0xaaaa9bac48 <<invalid sloc>> <invalid sloc> implicit __NSConstantString '__NSConstantString_tag'
    | `-RecordType 0xaaaa9ba9d0 '__NSConstantString_tag'
    |   `-CXXRecord 0xaaaa9ba938 '__NSConstantString_tag'
    |-TypedefDecl 0xaaaa9e6570 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
    | `-PointerType 0xaaaa9e6530 'char *'
    |   `-BuiltinType 0xaaaa9ba040 'char'
    |-TypedefDecl 0xaaaa9e65d8 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'char *'
    | `-PointerType 0xaaaa9e6530 'char *'
    |   `-BuiltinType 0xaaaa9ba040 'char'
    |-CXXRecordDecl 0xaaaa9e6628 <my_source.cpp:1:1, col:10> col:7 referenced class X definition
    | |-DefinitionData pass_in_registers empty aggregate standard_layout trivially_copyable pod trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
    | | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
    | | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
    | | |-MoveConstructor exists simple trivial needs_implicit
    | | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
    | | |-MoveAssignment exists simple trivial needs_implicit
    | | `-Destructor simple irrelevant trivial needs_implicit
    | `-CXXRecordDecl 0xaaaa9e6748 <col:1, col:7> col:7 implicit class X
    |-CXXRecordDecl 0xaaaa9e6800 <line:3:1, col:10> col:7 class Y definition
    | |-DefinitionData pass_in_registers empty aggregate standard_layout trivially_copyable pod trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
    | | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
    | | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
    | | |-MoveConstructor exists simple trivial needs_implicit
    | | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
    | | |-MoveAssignment exists simple trivial needs_implicit
    | | `-Destructor simple irrelevant trivial needs_implicit
    | `-CXXRecordDecl 0xaaaa9e6928 <col:1, col:7> col:7 implicit class Y
    `-CXXRecordDecl 0xaaaa9e69e0 <line:5:1, col:21> col:7 class Z definition
      |-DefinitionData pass_in_registers empty standard_layout trivially_copyable trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
      | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
      | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
      | |-MoveConstructor exists simple trivial needs_implicit
      | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
      | |-MoveAssignment exists simple trivial needs_implicit
      | `-Destructor simple irrelevant trivial needs_implicit
      |-public 'X'
      `-CXXRecordDecl 0xaaaa9e6b48 <col:1, col:7> col:7 implicit class Z

Here CXXRecordDecl is the type of node that represents a class declaration. You may notice that there are more such nodes than classes in the source file. The extra ones are nested inside each class and marked implicit: Clang inserts the class's own name into the class scope (the injected class name) and represents it with the same node type. The base class of Z is shown separately, as the public 'X' line. When walking this tree, the extra nodes can be recognized by the implicit flag.


Now we will write a script that lists the classes in the source file:


    import clang.cindex
    import typing

    index = clang.cindex.Index.create()
    translation_unit = index.parse('my_source.cpp', args=['-std=c++17'])


    def filter_node_list_by_node_kind(
        nodes: typing.Iterable[clang.cindex.Cursor],
        kinds: list
    ) -> typing.Iterable[clang.cindex.Cursor]:
        result = []
        for i in nodes:
            if i.kind in kinds:
                result.append(i)
        return result


    all_classes = filter_node_list_by_node_kind(
        translation_unit.cursor.get_children(),
        [clang.cindex.CursorKind.CLASS_DECL, clang.cindex.CursorKind.STRUCT_DECL])

    for i in all_classes:
        print(i.spelling)

The class name is stored in the spelling property. For different types of nodes, the spelling value may contain some type modifiers, but for a class or structure declaration it contains the name without modifiers.


Output:

    X
    Y
    Z
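The script above looks only at the direct children of the translation unit, so a class declared inside a namespace would not be listed. A recursive walk is one way around that. The sketch below is my own assumption rather than part of the original script: the helper names walk and format_tree are hypothetical, and the code relies only on the kind, spelling and get_children() members of a cursor, the same ones used above.

```python
from typing import Any, Iterator, Tuple


def walk(cursor: Any, depth: int = 0) -> Iterator[Tuple[int, Any]]:
    """Yield (depth, cursor) for the cursor and all its descendants, depth first."""
    yield depth, cursor
    for child in cursor.get_children():
        yield from walk(child, depth + 1)


def format_tree(root: Any) -> str:
    """Render one line per node, indented by two spaces per nesting level."""
    return "\n".join(
        "%s%s %s" % ("  " * depth, node.kind, node.spelling)
        for depth, node in walk(root)
    )
```

With libclang installed, print(format_tree(translation_unit.cursor)) should print the tree of the parsed file, and filtering the pairs from walk by kind would also pick up nested classes. The bindings also offer Cursor.walk_preorder() for a flat traversal without depth information.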

When building the AST, clang also parses the files included via #include. Try adding #include <string> to the source and the dump grows to 84 thousand lines, which is clearly too much for our task.


To view the AST dump of such files from the command line, it is better to remove all #include directives first. Put them back once you have studied the AST and have an idea of the hierarchy and node types in the file of interest.


In the script, to keep only the AST nodes that belong to the source file itself rather than to files pulled in via #include, you can add the following function that filters by file:


    def filter_node_list_by_file(
        nodes: typing.Iterable[clang.cindex.Cursor],
        file_name: str
    ) -> typing.Iterable[clang.cindex.Cursor]:
        result = []
        for i in nodes:
            # location.file can be None for nodes that have no source file
            if i.location.file is not None and i.location.file.name == file_name:
                result.append(i)
        return result

    ...

    filtered_ast = filter_node_list_by_file(
        translation_unit.cursor.get_children(),
        translation_unit.spelling)

Now we can extract the fields. Below is the full code that builds the list of fields, taking inheritance into account, and generates text from a template. There is little clang-specific left to explain in it, so I will not comment on it further.


Full script code:

    import clang.cindex
    import typing

    index = clang.cindex.Index.create()
    # '-x c++' makes clang treat the .h file as C++ rather than C
    translation_unit = index.parse('Input.h', args=['-x', 'c++', '-std=c++17'])


    def filter_node_list_by_file(
        nodes: typing.Iterable[clang.cindex.Cursor],
        file_name: str
    ) -> typing.Iterable[clang.cindex.Cursor]:
        result = []
        for i in nodes:
            # location.file can be None for nodes that have no source file
            if i.location.file is not None and i.location.file.name == file_name:
                result.append(i)
        return result


    def filter_node_list_by_node_kind(
        nodes: typing.Iterable[clang.cindex.Cursor],
        kinds: list
    ) -> typing.Iterable[clang.cindex.Cursor]:
        result = []
        for i in nodes:
            if i.kind in kinds:
                result.append(i)
        return result


    def is_exposed_field(node):
        return node.access_specifier == clang.cindex.AccessSpecifier.PUBLIC


    def find_all_exposed_fields(
        cursor: clang.cindex.Cursor
    ):
        result = []
        field_declarations = filter_node_list_by_node_kind(
            cursor.get_children(), [clang.cindex.CursorKind.FIELD_DECL])
        for i in field_declarations:
            if not is_exposed_field(i):
                continue
            result.append(i.displayname)
        return result


    source_nodes = filter_node_list_by_file(
        translation_unit.cursor.get_children(), translation_unit.spelling)
    all_classes = filter_node_list_by_node_kind(
        source_nodes,
        [clang.cindex.CursorKind.CLASS_DECL, clang.cindex.CursorKind.STRUCT_DECL])

    class_inheritance_map = {}
    class_field_map = {}

    # Collect the base classes of every class
    for i in all_classes:
        bases = []
        for node in i.get_children():
            if node.kind == clang.cindex.CursorKind.CXX_BASE_SPECIFIER:
                bases.append(node.referenced)
        class_inheritance_map[i.spelling] = bases

    # Collect the public fields declared directly in every class
    for i in all_classes:
        fields = find_all_exposed_fields(i)
        class_field_map[i.spelling] = fields


    def populate_field_list_recursively(class_name: str):
        field_list = class_field_map.get(class_name)
        if field_list is None:
            # The class is declared in another file, so its fields are unknown
            return []
        baseClasses = class_inheritance_map[class_name]
        for i in baseClasses:
            field_list = populate_field_list_recursively(i.spelling) + field_list
        return field_list


    rtti_map = {}
    for class_name, class_list in class_inheritance_map.items():
        rtti_map[class_name] = populate_field_list_recursively(class_name)

    for class_name, field_list in rtti_map.items():
        wrapper_template = """\
    RTTI_PROVIDER_BEGIN_TYPE(%s)
    (
    %s
    )
    RTTI_PROVIDER_END_TYPE()
    """
        rendered_fields = []
        for f in field_list:
            rendered_fields.append("    RTTI_DEFINE_FIELD(%s, %s)" % (class_name, f))
        print(wrapper_template % (class_name, ",\n".join(rendered_fields)))
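The part of the script that does not depend on clang, merging inherited fields and rendering the macro block, can be tried without libclang. The sketch below is an illustration under assumptions: the two dictionaries are hand-written stand-ins for the maps the script builds from the AST, the names collect_fields and render_type are hypothetical, and the field lines are joined without a separator, as in the handwritten example earlier (whether the macros expect a separator depends on their definition, which is not shown here).

```python
from typing import Dict, List

# Hypothetical stand-ins for the maps built from the AST.
class_field_map: Dict[str, List[str]] = {
    "BaseNode": ["bypass"],
    "SourceNode": ["inputFilePath"],
    "DestinationNode": ["includeDebugInfo", "outputFilePath"],
}
class_inheritance_map: Dict[str, List[str]] = {
    "BaseNode": ["IRttiTypeIdProvider"],  # declared elsewhere, so it has no field entry
    "SourceNode": ["BaseNode"],
    "DestinationNode": ["BaseNode"],
}


def collect_fields(class_name: str) -> List[str]:
    """Return inherited fields (bases in declaration order) followed by own fields."""
    own = class_field_map.get(class_name)
    if own is None:
        return []  # unknown class: nothing to contribute
    inherited: List[str] = []
    for base in class_inheritance_map.get(class_name, []):
        inherited += collect_fields(base)
    return inherited + own


def render_type(class_name: str, fields: List[str]) -> str:
    """Render one RTTI description block for a class."""
    lines = ["RTTI_PROVIDER_BEGIN_TYPE(%s)" % class_name, "("]
    lines += ["    RTTI_DEFINE_FIELD(%s, %s)" % (class_name, f) for f in fields]
    lines += [")", "RTTI_PROVIDER_END_TYPE()"]
    return "\n".join(lines)


print(render_type("DestinationNode", collect_fields("DestinationNode")))
```

Unlike the full script, which prepends each base in turn, this sketch keeps multiple bases in declaration order; for single inheritance the result is the same.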

This script does not check whether a class has RTTI. So after getting the result you will have to remove the blocks for classes without RTTI by hand. But that is a minor thing.
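The manual cleanup could probably be automated by checking whether a class body mentions the RTTI_HAS_TYPE_ID macro. Here is a minimal sketch, assuming the token spellings of each class have already been collected (with libclang they could come from cursor.get_tokens() and token.spelling). The helper names and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List

RTTI_MARKER = "RTTI_HAS_TYPE_ID"  # macro name taken from the example header


def has_rtti_marker(token_spellings: Iterable[str]) -> bool:
    """True if the marker macro appears among the tokens of a class."""
    return any(spelling == RTTI_MARKER for spelling in token_spellings)


def keep_rtti_classes(class_tokens: Dict[str, List[str]]) -> List[str]:
    """Return the names of classes whose tokens contain the marker."""
    return [name for name, tokens in class_tokens.items() if has_rtti_marker(tokens)]


# Hand-written sample instead of real libclang output.
sample = {
    "BaseNode": ["struct", "BaseNode", "{", "bool", "bypass", ";", "}"],
    "SourceNode": ["struct", "SourceNode", "{", "RTTI_HAS_TYPE_ID", "}"],
}
print(keep_rtti_classes(sample))
```

Only the classes returned by keep_rtti_classes would then be passed on to the template rendering step.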


I hope this will be useful to someone and save some time. All the code is posted on GitHub.



Source: https://habr.com/ru/post/439270/