NYNAEVE

Adventures in Windows debugging and reverse engineering.

NWSCRIPT JIT ENGINE: WRAP-UP (FOR NOW)

August 24th, 2010

Yesterday, I provided a brief performance overview of the MSIL JIT backend versus my implementation of an interpretive VM for various workloads. Today, I’ll mostly pontificate on conclusions from the JIT project.

It has certainly been an interesting foray into .NET, program analysis, and code generation; the JIT engine is actually my first non-trivial .NET project. I have to admit that .NET turned out not to be as bad as I thought it would be (as much as I thought I would never say that); that being said, I don’t see myself abandoning C++ anytime soon.

Looking back, I do think that it was worth going with MSIL (.NET) as the first JIT backend. Even though I was picking up .NET Reflection for the first time, aside from some initial frustrations with referencing /clr mixed types from emitted code, things turned out relatively smoothly. I suspect that writing the JIT against another backend, such as LLVM, would likely have taken much more time to reach a fully functional state, especially with full support for cleaning up lingering state if the script program aborted at any point in time.

Justin is working on an LLVM JIT backend for the JIT system, though, so we’ll have to see how it turns out. I do suspect that LLVM may offer slightly better performance in the end, due to more flexibility in cutting out otherwise extraneous bits in the JIT’d native code that .NET insists on (such as the P/Invoke wrapper code, thin as it may be).

That being said, the .NET JIT didn’t take an inordinate amount of time to write, and it fully supports turning IL into optimized x86, amd64, and ia64 code (Andrew Rogers’s 8-year-old Itanium workstation migrated to my office at work, and I used it to try the JIT engine out on ia64 over the weekend; the JIT system _did_ actually function correctly, without any additional development work necessary, which makes me happy). There was virtually no architecture-specific code that I had to write to make that happen, which in many respects says something impressive about using MSIL as a code generation backend.

MSIL was easy to work with as a target language for the JIT system, and the fact that the JIT optimizes the output freed me from many of the complexities that would be involved had I attempted to target x86 or amd64 machine code directly. While there’s still some (thin) overhead introduced by P/Invoke stubs and the like in the actual machine code emitted by the .NET JIT, the code quality is good enough that it performs quite well at the end of the day.

Oh, and if you’re curious, you can check out an example NWScript assembly and its associated IL. Note that this is the 64-bit version of the assembly, as you can see from the action service handler call stubs. For fun, I’ve heard that you can even turn it into C# using Reflector (though without scopes defined, it will probably be a bit of a pain to wade through).

All in all, the JIT engine was a fun vacation project to work on. Next steps might be to work on patching the JIT backend into the stock NWN2 server (currently it operates in my ground-up server implementation), but that’s a topic for another day.

Tags: NWN2

Posted in Programming | Comments Off on NWScript JIT engine: Wrap-up (for now)

NWSCRIPT JIT ENGINE: PERFORMANCE CONSIDERATIONS

August 23rd, 2010

Last time, we learned how SAVE_STATEs are supported by the MSIL JIT backend. This time, we’ll touch on everybody’s favorite topic: performance. After all, the whole point of the JIT project is to improve performance of scripts; there wouldn’t be much point in using it over the interpretive VM if it wasn’t faster.

So, just how much faster is the MSIL JIT backend than my reference interpretive NWScriptVM? Let’s find out (when using the “direct fast” action service call mechanism)…

The answer, as it so often turns out, is that it depends. Some workloads yield significantly greater performance, while other workloads yield comparatively similar performance.

_Computational workloads_

Scripts that are computationally heavy in NWScript are where the JIT system really excels. For example, consider the following script program fragment:

int g_randseed = 0;

int rand()
{
    return g_randseed = (g_randseed * 214013 + 2531101) >> 16;
}

// StartingConditional is the entry point.
int StartingConditional(
    int i,
    object o,
    string s)
{
    for (i = 0; i < 1000000; i += 1)
        i += rand( ) ^ 0xabcdef / (rand( ) | 0x1);

    return i;
}

Here, I compared 1000000 iterations of invoking this script's entry point, once via the JIT engine's C API, and once via the NWScriptVM's API.

When using the interpretive VM, this test took a whopping _five minutes_ or more to complete on my test system; ouch! Using the MSIL JIT on .NET 4.0 on the same system yields an execution time on the order of just fourteen seconds by comparison, which works out to roughly 21.42 times faster execution than the interpretive VM.

_Action service-bound workloads (non-string-based)_

While that is an impressive-looking result, most scripts are not exclusively computationally bound, but rather make heavy use of action service handlers exported by the script host. For example, consider a second test program, structured along the lines of this:

vector v;

v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );
v = Vector( 1.0, 2.0, 3.0 );

In this context, Vector is an action service handler. With the interpretive VM in use, 1000000 iterations of this program consume on the order of thirty seconds. By comparison, the MSIL JIT backend clocks in at approximately ten seconds.

That's still a significant improvement, but not quite as earth-shattering as over 21 times faster execution speed. The reduction here stems from the fact that most of the work is offloaded to the script host and not the JIT'd code; in effect, the only gain we get is a reduction in make-work overhead related to the stack-based VM execution environment, rather than any boost to raw computational performance.

_Action service-bound workloads (string-based with one argument)_

It is possible to construct a "worst case" script program that receives almost no benefit from the JIT system. This can be done by writing a script program that spends almost all of its time passing strings to action service handlers, and receiving strings back from action service handlers. Consider a program along the lines of this:

StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );
StringToInt( IntToString( i ) + s );

When executed with the interpretive script VM, this program took approximately 70 seconds to complete the 1000000 iterations that I've been using as a benchmark. The MSIL JIT backend actually clocks in as just a smidgeon _slower_, at roughly 75-76 seconds on average (on my test machine).

Why is the JIT'd code (ever) slower than the interpretive VM? Well, this turns out to relate to the fact that I used System.String to represent a string in the JIT engine. While convenient, this does have some drawbacks, because a conversion is required in order to map between the std::string objects used by action service handlers (and the VM stack object) and the System.String objects used by the JIT'd code.
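To make the conversion cost concrete, here is a minimal sketch in standard C++ (not C++/CLI) of what each such call has to pay. Everything in it is an assumption made for illustration: `ManagedString` is a `std::u16string` standing in for System.String (which stores UTF-16), the text is assumed to be single-byte, and `HostIntToString`, `HostStringToInt`, and `JitStyleCall` are hypothetical names, not the engine's actual functions.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for System.String: .NET strings are UTF-16, so a
// std::u16string is used here. The real JIT engine uses C++/CLI types.
using ManagedString = std::u16string;

// Widen a narrow host string into the managed-style representation.
// Assumes single-byte text: each byte becomes one UTF-16 code unit.
ManagedString ToManaged(const std::string& s) {
    ManagedString out;
    out.reserve(s.size());
    for (unsigned char c : s) out.push_back(static_cast<char16_t>(c));
    return out;
}

// Narrow a managed-style string back for the host.
// Assumes every code unit fits in one byte.
std::string ToNative(const ManagedString& s) {
    std::string out;
    out.reserve(s.size());
    for (char16_t c : s) {
        assert(c <= 0xFF);
        out.push_back(static_cast<char>(c));
    }
    return out;
}

// Hypothetical host-side action service handlers that work on std::string.
std::string HostIntToString(int i) { return std::to_string(i); }
int HostStringToInt(const std::string& s) { return std::atoi(s.c_str()); }

// Models one "StringToInt( IntToString( i ) + s )" statement as seen from
// JIT'd code: the result of the first handler is converted to the managed
// form, concatenated, then converted back for the second handler.
int JitStyleCall(int i, const ManagedString& s) {
    ManagedString joined = ToManaged(HostIntToString(i)) + s;  // copy #1
    return HostStringToInt(ToNative(joined));                  // copy #2
}
```

Each statement performs two extra allocate-and-copy passes that an interpreter working directly on std::string would not need, which is the kind of overhead that can cancel out the gains from native code when a script does little else.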

If a script program spends most of its time interfacing exclusively with action service calls that take and return strings, performance suffers due to the marshalling conversions involved.

_Action service-bound workloads (string-based with more than one argument)_

Not all action service calls related to strings are created equal, however. The more parameters passed to the action service call, the better the JIT'd code does in comparison to the script VM. The StringToInt / IntToString conversion case is an extreme example; even a minor change to use GetSubString calls shows a significant change in results, for example:

s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );
s = GetSubString( s, 1, 1 );

In this test, the interpretive VM clocks in at approximately 30 seconds, whereas the JIT'd code finishes in nearly half the time, at around 15.5 seconds on average.

_Performance conclusions_

While the actual performance characteristics will vary significantly depending on the workload, most scripts will see a noticeable performance increase. Except for worst-case scenarios involving single-string action service handlers, it's reasonable to postulate that most scripts have a good chance of running twice as fast under the JIT as under the VM if they are exclusively action service handler-bound.

Furthermore, any non-trivial, non-action-service-call instructions in a script will tend to heavily tip the scales in favor of the JIT engine; for general purpose data processing (including general flow control related logic such as if statements and loops), the interpretive VM simply can't keep up with the execution speed benefits offered by native code execution.

Now, it's important to note that in the case of NWN1 and NWN2, not all performance problems are caused by scripts; naturally, replacing the script VM with a JIT system will do nothing to alleviate those issues. However, for modules that are heavy on script execution, the JIT system offers significant benefits (and equally importantly, creates significant headroom to enable even more complex scripting without compromising server performance).
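For reference, the speedup ratios implied by the timings quoted in this post can be worked out with a few lines of C++. The figures below are read off the text and are approximations: 300 seconds is only a lower bound for "over five minutes", and 75.5 seconds is an assumed midpoint of the "75-76 seconds" range. The type and function names are made up for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// One benchmark row: time under the interpretive VM and under the MSIL JIT.
struct Timing {
    const char* workload;
    double vmSeconds;
    double jitSeconds;
};

// Approximate figures taken from the prose (see the caveats above).
constexpr Timing kTimings[] = {
    {"computational (rand loop)",        300.0, 14.0},
    {"action calls, non-string (Vector)", 30.0, 10.0},
    {"action calls, one string argument", 70.0, 75.5},
    {"action calls, GetSubString",        30.0, 15.5},
};

// Speedup > 1 means the JIT is faster; < 1 means it is slower than the VM.
constexpr double Speedup(const Timing& t) { return t.vmSeconds / t.jitSeconds; }

// Print one line per workload with its VM-to-JIT time ratio.
void PrintSpeedups() {
    for (const Timing& t : kTimings)
        std::printf("%-36s %6.2fx\n", t.workload, Speedup(t));
}
```

Calling `PrintSpeedups()` lists the four ratios side by side; only the single-string-argument case falls below 1.0, while the GetSubString case lands just under 2.0, which matches the "twice as fast" rule of thumb for handler-bound scripts.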

Tags: NWN2

Posted in Programming | 1 Comment »

NWSCRIPT JIT ENGINE: MSIL BACKEND SUPPORT FOR SAVE_STATE

August 22nd, 2010

Yesterday, I described how the fast action call mechanism improves action call performance for JIT’d programs. For today’s NWScript adventure, let’s dig into how SAVE_STATE operations (script situations) are supported in the MSIL JIT backend.

As you may recall, SAVE_STATE operations (codified by I_SAVE_STATE in the IR instruction set and OP_STORE_STATE/OP_STORE_STATEALL in the NWScript instruction set) are used to allow a copy of the script execution environment’s current execution context to be “forked off” for later use. This is easy to implement in the interpretive script VM environment, but something more elaborate is required for

the JIT backend.

The NWScript analyzer promotes resume labels for SAVE_STATE operations into first class subroutines; in the MSIL backend, these subroutines are then emitted as IL-level subroutines. When a SAVE_STATE instruction is encountered, the following steps are taken:

* The backend emits IL instructions to save the state of all local variables shared with the resume subroutine. This is performed by boxing copies of these locals into an array< Object ^ >.

* The backend sets up a call to a method on the main script class (ScriptProgram), CloneScriptProgram. This method allocates a new ScriptProgram instance derived from the current ScriptProgram object and prepares it for use as a saved state clone. This entails duplicating the contents of all global variables in the parent ScriptProgram object and resetting the various runtime guard counters (such as the recursion depth) to their default, zero values.

* The backend sets up a call to a JIT intrinsic, Intrinsic_StoreState. This intrinsic takes the boxed local variable array, the cloned ScriptProgram object, and a “resume method id”. All of these values are stored into a new _NWScriptSavedState_ object that is hung off of the overarching NWScriptProgram object.

Once these steps have been taken, a future action service handler will call an API to receive the last saved state. This API will return the most recently constructed NWScriptSavedState object.

Eventually, the script host may opt to execute the saved state. This is known as executing a “script situation”; to accomplish this, the script host passes the NWScriptSavedState object to the NWScriptProgram object (indirected through a C-level API), asking the NWScriptProgram object to call the resume label with the saved state.

For performance reasons, the NWScriptProgram object does not attempt to call the resume label via a Reflection invocation attempt. Instead, a dispatcher method on the INWScriptGeneratedProgram interface implemented by the ScriptProgram type, _ExecuteScriptSituation_, is invoked. (Here, the ScriptProgram instance that was created by the CloneScriptProgram call earlier is used, ensuring that a copy of the current global variables is referenced.) As you’ll recall, ExecuteScriptSituation has a signature looking something like this:

//
// Execute a script situation (resume label).
//
void
ExecuteScriptSituation(
    __in UInt32 ScriptSituationId,
    __in array< Object ^ > ^ Locals
    );

Internally, ExecuteScriptSituation is implemented as essentially a large “switch” block that switches on the mysterious ScriptSituationId parameter (corresponding to the “resume method id” that was passed to Intrinsic_StoreState). This parameter identifies which resume subroutine in the script program should be executed. (When emitting IL code for subroutines, the first resume subroutine is assigned resume method id 0; the next is assigned resume method id 1, and so forth.)

If the ScriptSituationId matches a legal case branch that was emitted into ExecuteScriptSituation, additional code to unbox the Locals array contents into parameters follows. These parameters are simply passed to the resume subroutine for that case statement.

At this point, the resume globals are set to their correct values (by virtue of the fact that the ‘this’ pointer is set to the cloned ScriptProgram instance), and the resume locals are, similarly, set up correctly as subroutine parameters. The rest, as they say, is history; the resume label continues on as normal, executing whatever operations it wishes.
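To make the dispatcher idea concrete, here is a minimal sketch in plain C++ rather than emitted IL. Everything in it is a stand-in chosen for illustration: std::any plays the role of a boxed Object ^, the two resume subroutines and their parameter lists are hypothetical, and throwing on an unknown id is an assumption (the text above does not say what the real dispatcher does in that case).

```cpp
#include <any>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a generated (and cloned) ScriptProgram instance.
class ScriptProgram {
public:
    int m_Global0 = 0;   // stand-in for a duplicated script global
    std::string m_Log;   // records what the resume subroutines did

    // Dispatcher: switch on the resume method id, unbox the saved locals,
    // and pass them to the matching resume subroutine as parameters.
    void ExecuteScriptSituation(std::uint32_t ScriptSituationId,
                                const std::vector<std::any>& Locals) {
        switch (ScriptSituationId) {
        case 0:
            ResumeSubroutine0(std::any_cast<int>(Locals.at(0)));
            break;
        case 1:
            ResumeSubroutine1(std::any_cast<std::string>(Locals.at(0)),
                              std::any_cast<float>(Locals.at(1)));
            break;
        default:
            // Assumption: an illegal id is treated as an error.
            throw std::invalid_argument("Illegal script situation id.");
        }
    }

private:
    // The resume subroutines see the clone's globals through 'this'.
    void ResumeSubroutine0(int n) {
        m_Log += "r0:" + std::to_string(n + m_Global0);
    }
    void ResumeSubroutine1(const std::string& s, float f) {
        m_Log += "r1:" + s + (f > 0.0f ? "+" : "-");
    }
};
```

The point of the switch is that the call to each resume subroutine is an ordinary, statically typed call; no reflection-style lookup is needed at resume time.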

Tags: NWN2

Posted in Programming | 1 Comment »

NWSCRIPT JIT ENGINE: JIT INTRINSICS, AND JIT’D ACTION SERVICE HANDLER CALLS, PART 4: DIRECT FAST ACTION CALLS

August 21st, 2010

Previously, I explained how the ‘fast’ action service call interface worked, and why it doesn’t always live up to its name. This time, we’ll examine the no-holds-barred, non-verifiable direct fast action call path. This action service call mechanism is designed for maximum performance at the expense of type-safe, verifiable IL; as you’ll see, no punches are pulled in the name of performance here.

The direct fast action call mechanism operates on a similar principle to the regular fast action call mechanism that we saw previously. However, instead of doing the work to package up parameters into a boxed array and performing the final conversion to native types in a generic fashion at runtime, the direct fast action call system takes a different approach: deal with these tasks at compile time, using static typing.

In both cases, we’ll end up calling through the OnExecuteActionFromJITFast C++ virtual interface function on the INWScriptActions interface, but how we get there is quite different with the direct fast call interface.

Now, recall again that the OnExecuteActionFromJITFast interface is essentially structured in such a way to combine every VM stack manipulation operation _and_ the actual call to the action service handler into a single call to native code. This is accomplished by passing two arrays to OnExecuteActionFromJITFast: a “command” (ULONG) array, describing the underlying operations to perform, and a “command parameter” (uintptr_t) array, describing data to perform the operations upon.

Where the direct fast action service call mechanism differs from the (normal) fast action call service mechanism is in how these two arrays are built. In the direct fast mechanism, the JIT’d code actually packages parameters up itself without relying on the intrinsic; no more boxing or array allocations.

In order to accomplish this, the direct call interface creates a custom value type for each action service call. This value type, named something like NWScript.JITCode.<ScriptName>.DirectActionServiceCmdDescriptors.ExecuteActionService_, accomplishes a dual purpose. It represents both the “command” _and_ “command parameter” arrays that will be used to call OnExecuteActionFromJITFast. Conversely, each of the individual fields in the value type needs to remain strongly typed so that it can be accessed by generated code without involving boxing or other low-performance constructs.

Essentially, the value type is constructed so that it can be accessed using strongly typed individual fields in .NET, but accessed as two arrays (one of ULONGs, and one of uintptr_ts) in native code.
Let’s look at an example: Say we have an action that we would like to call, with the following source-level prototype in NWScript:

string IntToString(int nInteger);

The command and parameter arrays that we’ll want to set up for a call to OnExecuteActionFromJITFast would be as follows:

Fast action commands

Cmds (NWFASTACTION_CMD)    CmdParams (uintptr_t)    Description
NWFASTACTION_PUSHINT       (nInteger value)         Push nInteger on the stack
NWFASTACTION_CALL          (None)                   Invoke OnAction_IntToString
NWFASTACTION_POPSTRING     &ReturnString            Pop return value string from the stack

Both Cmds and CmdParams represent parallel arrays from the point of view of the native code in OnExecuteActionFromJITFast. The data structure that the direct fast action call mechanism uses to represent these two arrays would thus be akin to the following:

value struct CmdDesc

{
    // &Cmd_0 represents the
    // "Cmds" array:

    // NWFASTACTION_PUSHINT
    System::UInt32 Cmd_0;
    // NWFASTACTION_CALL
    System::UInt32 Cmd_1;
    // NWFASTACTION_POPSTRING
    System::UInt32 Cmd_2;

    // Padding for alignment. If
    // there were an odd number of
    // commands, we must introduce
    // an alignment field here on
    // 64-bit platforms.
#ifdef _WIN64
    System::UInt32 CmdPadding_Tail;
#endif

    // &CmdParam_0 represents
    // the "CmdParams" array:

    // nInteger
    System::UInt64 CmdParam_0;
    // ReturnString
    NeutralString * CmdParam_Ret_1;

    // Floating point fields are
    // represented as a System::Single
    // with an optional System::Int32
    // padding field on 64-bit systems.

    // Remaining fields are storage
    // for strings if we had any.
    // CmdParam_Ret_1 points to
    // StringStorage_0.
    NeutralString StringStorage_0;
};

The _NeutralString_ type represents the data format for a string that is passed cross-module to and from the script host; internally, it is simply a pair of (char * String, size_t Length), allocated from the process heap. A set of JIT intrinsics is used to allocate and delete NeutralStrings should they be referenced for an action service call.

From a .NET perspective, the following wrapper suffices for the NeutralString (layout-compatible with the actual C++ structure):

public value struct NeutralString
{
    System::IntPtr StrPtr;
    System::IntPtr Length;
};

With this structure layout in place, the backend generates IL instructions to load the appropriate constants into each of the Cmd_ fields. Then, the CmdParam_ fields are set up, followed by the CmdParam_Ret_ fields. (If a NeutralString is referenced, intrinsic calls to translate to and from System::String ^’s are made as necessary.)

Finally, the backend generates a call to OnExecuteActionFromJITFast. One interesting optimization that is performed here is a _de-virtualization_ of the function call. Normally, OnExecuteActionFromJITFast involves loading a _this_ pointer from a storage location, then loading a virtual function table entry for the target function. However, the backend takes advantage of the fact that the INWScriptActions object associated with a particular script cannot go away while the script’s code can be used. Instead of making a normal virtual function call, the _this_ pointer and the address of the OnExecuteActionFromJITFast virtual function are hardwired into the emitted IL as immediate constant operands. (This does make the generated assembly specific to the process that it executes within; the resultant assembly can still be disassembled for debugging purposes, however.)

After the OnExecuteActionFromJITFast call returns, IL is generated to check if the action call failed. If so, then an exception is raised. (Unlike the standard action call interface, the script abort flag on the NWScriptProgram is not tested for performance purposes. Instead, OnExecuteActionFromJITFast must return false to abort the script.)

IL code is then emitted to move any return value data from its storage locations in the value structure to the appropriate IL local variable(s), if any.

Finally, if any strings were involved in the action parameter or return values, the emitted IL code is wrapped in an exception handler that releases any allocated native strings (then rethrowing the exception upwards).
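The dual view of the descriptor (typed fields for the caller, two flat arrays for the native side) can be sketched in standard C++. This is only an illustration under stated assumptions: the numeric command codes are invented, the padding field is always present for simplicity, ExecuteFast is a hypothetical stand-in for OnExecuteActionFromJITFast, and the arrays are copied out with memcpy to keep the sketch well-defined, whereas the mechanism described above passes the addresses of the first fields directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical command codes; the real NWFASTACTION_* values are not given.
enum : std::uint32_t { CMD_PUSHINT = 0, CMD_CALL = 1, CMD_POPSTRING = 2 };

struct NeutralString { char* StrPtr; std::size_t Length; };

// Strongly typed descriptor for a call to IntToString.
struct CmdDesc {
    std::uint32_t  Cmd_0, Cmd_1, Cmd_2;  // viewed natively as the "Cmds" array
    std::uint32_t  CmdPadding_Tail;      // keeps the parameters pointer-aligned
    std::uintptr_t CmdParam_0;           // nInteger
    std::uintptr_t CmdParam_Ret_1;       // address of StringStorage_0
    NeutralString  StringStorage_0;
};
static_assert(offsetof(CmdDesc, Cmd_2) == 8, "commands must be contiguous");
static_assert(offsetof(CmdDesc, CmdParam_Ret_1) ==
              offsetof(CmdDesc, CmdParam_0) + sizeof(std::uintptr_t),
              "parameters must be contiguous");

// Stand-in for the native side: walks the commands, consuming a parameter
// only for commands that need one (an assumption about the real protocol).
bool ExecuteFast(const std::uint32_t* Cmds, std::size_t NumCmds,
                 const std::uintptr_t* CmdParams) {
    std::int32_t pushed = 0;
    std::string result;
    std::size_t p = 0;
    for (std::size_t i = 0; i < NumCmds; ++i) {
        switch (Cmds[i]) {
        case CMD_PUSHINT:
            pushed = static_cast<std::int32_t>(
                static_cast<std::uint32_t>(CmdParams[p++]));
            break;
        case CMD_CALL:  // plays the role of OnAction_IntToString
            result = std::to_string(pushed);
            break;
        case CMD_POPSTRING: {
            auto* out = reinterpret_cast<NeutralString*>(CmdParams[p++]);
            out->StrPtr = new char[result.size() + 1];
            std::memcpy(out->StrPtr, result.c_str(), result.size() + 1);
            out->Length = result.size();
            break;
        }
        default:
            return false;  // unknown command: fail the call
        }
    }
    return true;
}

// What the generated stub does, expressed with typed field accesses.
std::string CallIntToString(int nInteger) {
    CmdDesc d{};
    d.Cmd_0 = CMD_PUSHINT; d.Cmd_1 = CMD_CALL; d.Cmd_2 = CMD_POPSTRING;
    d.CmdParam_0 = static_cast<std::uint32_t>(nInteger);
    d.CmdParam_Ret_1 = reinterpret_cast<std::uintptr_t>(&d.StringStorage_0);

    std::uint32_t cmds[3];
    std::uintptr_t params[2];
    auto* base = reinterpret_cast<const unsigned char*>(&d);
    std::memcpy(cmds, base + offsetof(CmdDesc, Cmd_0), sizeof cmds);
    std::memcpy(params, base + offsetof(CmdDesc, CmdParam_0), sizeof params);

    if (!ExecuteFast(cmds, 3, params))
        throw std::runtime_error("Action service call failed.");
    std::string s(d.StringStorage_0.StrPtr, d.StringStorage_0.Length);
    delete[] d.StringStorage_0.StrPtr;  // release the native string
    return s;
}
```

Nothing on the caller's side is boxed or heap-allocated to describe the call; the only allocation left is the returned string itself.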

Due to the amount of code generated for a direct fast action service call, all of the logic I have outlined is placed into a stub routine (similar to how one might see a system call stub for a conventional operating system). Calls to the stub are then made whenever an I_ACTION instruction is encountered, assuming that the call does not involve any engine structures.

Overall, the direct fast action call interface provides superior performance to the other two action call mechanisms; even in worst-case scenarios, such as repeated action service calls involving a small number of string parameters, profiling has shown execution times on the order of 79% as compared to a script assembly emitted with the standard action service call system. In most cases, the performance improvement is even greater.

Tags: NWN2

Posted in Programming | 1 Comment »

NWSCRIPT JIT ENGINE: JIT INTRINSICS, AND JIT’D ACTION SERVICE HANDLER CALLS, PART 3: FAST ACTION CALLS

August 20th, 2010

Yesterday, we learned how the standard action service call path operates in the MSIL JIT backend for the NWScript JIT engine. This time, we’ll examine the ‘fast’ action service call path.

As I alluded to last time, the fast action service call path attempts to cut down on the overhead of making multiple managed/native transitions for each action service handler call. While a standard path action service call may need to make multiple managed/native transitions depending on the count of arguments to a particular action service call, a fast action service call makes only one managed/native transition.

The fast action service call interface has two components:

* An extension, _INWScriptActions::OnExecuteActionFromJITFast_, to the C++-level interface that NWNScriptJIT.dll (and the interpretive NWScriptVM) use to communicate with the script host. This extension comes in the form of a new interface API that takes an action service ordinal to invoke, a count of source-level arguments to the action, and a list of commands and parameters. The commands and parameters describe a set of push or pop operations to perform on the VM stack in order to set up a call/return pair to the action service handler. These operations all happen entirely in native code, embedded in the script host.

* A new JIT intrinsic on the INWScriptProgram interface, _Intrinsic_ExecuteActionServiceFast_, which returns the action service handler’s return value (boxed), if any, and takes an array of (boxed) arguments to pass to the action service handler.

It’s important to note that the current version of the fast action service call interface isn’t quite as fast as one would hope, due in no small part to the fact that it sticks to verifiable IL. In fact, it’s not always faster than the standard path, which is why it’s currently only used if there are six or more VMStackPush/Pop intrinsic calls that would be needed in addition to the ExecuteActionService intrinsic.

Internally, Intrinsic_ExecuteActionServiceFast essentially looks at a set of data tables provided by the script host which describe the effective prototype of each action handler. Based on this information, it translates the managed parameter array into a command and parameter array to pass to the C++-level INWScriptActions::OnExecuteActionFromJITFast API and calls the script host.

Next, the script host then does all of the associated operations (pushing items onto the VM stack, calling the action service handler, and popping the return value, if any, off the VM stack) “locally”. Finally, Intrinsic_ExecuteActionServiceFast repackages any return value into its managed equivalent and returns back to the JIT’d program code.
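The translation step performed by the intrinsic can be sketched as follows. This is a rough illustration, not the engine's code: std::variant stands in for a boxed argument, the ActionPrototype table entry, the command codes, and the FastCall holder are all hypothetical, only int and string parameters are handled, and pushing parameters right to left is assumed to mirror the standard path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical command codes and prototype description.
enum Cmd : std::uint32_t { PUSHINT, PUSHSTRING, CALL, POPINT };
enum class Type { Int, String, Void };
struct ActionPrototype { std::vector<Type> Params; Type Return; };

using Boxed = std::variant<int, std::string>;  // stand-in for Object ^

struct FastCall {
    std::vector<std::uint32_t>  Cmds;       // operations to perform
    std::vector<std::uintptr_t> CmdParams;  // data for those operations
    int ReturnInt = 0;                      // storage for an int return value
};

// Translate a boxed argument list into command/parameter arrays, guided by
// the action's prototype. std::get "unboxes" and throws on a type mismatch.
void BuildFastCall(const ActionPrototype& proto,
                   const std::vector<Boxed>& args, FastCall& out) {
    if (args.size() != proto.Params.size())
        throw std::invalid_argument("Argument count mismatch.");
    for (std::size_t i = args.size(); i-- > 0; ) {  // right to left
        switch (proto.Params[i]) {
        case Type::Int:
            out.Cmds.push_back(PUSHINT);
            out.CmdParams.push_back(
                static_cast<std::uint32_t>(std::get<int>(args[i])));
            break;
        case Type::String:
            out.Cmds.push_back(PUSHSTRING);
            out.CmdParams.push_back(reinterpret_cast<std::uintptr_t>(
                std::get<std::string>(args[i]).c_str()));
            break;
        default:
            throw std::invalid_argument("Unsupported parameter type.");
        }
    }
    out.Cmds.push_back(CALL);
    if (proto.Return == Type::Int) {
        out.Cmds.push_back(POPINT);
        out.CmdParams.push_back(
            reinterpret_cast<std::uintptr_t>(&out.ReturnInt));
    }
}
```

Every argument here passes through a boxed container and a runtime type check before it reaches the command arrays, which is the kind of per-call overhead the fast path has to pay.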

If all of that sounded like a mouthful, it certainly was; there is extra overhead here, and the fast action service mechanism is competing with the overhead of managed/native code. Before we continue, let’s look at how this all plays out in the underlying IL. Here’s the same “Hello, world” subroutine we had before:

void PrintHello()

{

PrintString( "Hello, world (from NWScript)." );

}

If I were to override the cost/benefit heuristics in the JIT engine and force it to always use the fast action service handler call interface, we would see the following IL emitted:

IL_0025: ldstr "Hello, world (from NWScript)."
IL_002a: stloc.1
IL_002b: ldarg.0
IL_002c: ldfld m_ProgramJITIntrinsics
IL_0031: ldc.i4 0x1
IL_0036: conv.u4
IL_0037: ldc.i4 0x1
IL_003c: conv.u4
IL_003d: ldc.i4 0x1
IL_0042: newarr System.Object
IL_0047: stloc.2
IL_0048: ldloc.2
IL_0049: ldc.i4 0x0
IL_004e: ldloc.1
IL_004f: stelem.ref
IL_0050: ldloc.2
IL_0051: callvirt instance object Intrinsic_ExecuteActionServiceFast(uint32,
                                                                      uint32,
                                                                      object)
IL_0056: ldnull
IL_0057: stloc.2
IL_0058: pop

We have the following operations going on here:

String ^ s = "Hello, world (from NWScript).";
array< Object ^ > ^ a = gcnew array< Object ^ >{ s };
m_ProgramJITIntrinsics->ExecuteActionServiceFast( 1, 1, a );

Clearly, the fast action service path as it is implemented today is a tradeoff. When there are a large number of parameters and return values (this isn’t as uncommon as you think when you consider that NWScript passes and returns structures, such as ‘vector’ (3 floats), by value), the overhead of the fast action service call mechanism appears to be less than that of many managed/native switches (at least under .NET 4.0 on amd64). However, when fewer intrinsic calls (leading to managed/native switches) are involved, then the standard path ends up winning out.

Now, there are some improvements that could be made here on the JIT side of things, above and beyond the fast action call mechanism. If we look at the generated logic and examine it under the profiler, the bulk of the overhead involved in the fast action service call interface as it’s implemented in its prototype stage today comes from the need to allocate an array of object GC pointers, box arguments up to place them into the array, unbox the array contents when copying them to create the command table for OnExecuteActionFromJITFast, and box/unbox the return value from Intrinsic_ExecuteActionServiceFast.

All of these are limitations of the JIT (intrinsic) interface and not the C++-level interface; furthermore, essentially all of these steps could be eliminated if the JIT backend could avoid the usage of the object GC pointer array in the variadic intrinsic call. While I was unable to find a clean way to do this in verifiable IL (without interposing a large amount of automatically generated C++/CLI code emitted by some other generation program), it _is_ possible to circumvent much of this overhead, if we are willing to emit non-verifiable IL.

This leads us to the next topic, _direct fast action service handler calls_, which we’ll discuss in detail in the next post.

Tags: NWN2

Posted in Programming | 1 Comment »

NWSCRIPT JIT ENGINE: JIT INTRINSICS, AND JIT’D ACTION SERVICE HANDLER CALLS, PART 2: STANDARD ACTION CALLS

August 19th, 2010

Last time, I outlined the general usage of the JIT intrinsics emitted by the MSIL backend for the NWScript JIT engine, and how they relate to action service calls. Today, let’s take a closer look at how an action service handler is actually called in NWScript in the wild.

The MSIL backend currently supports three action call mechanisms (the ‘standard’ intrinsic, the ‘fast’ intrinsic, and the (mostly) intrinsic-less ‘direct fast’ system); we’ll take a look at the ‘standard’ path first.

The standard action service path involves the operation of at least one, but most probably several different intrinsics. In the standard path, the generated MSIL code is responsible for performing each fundamental step of the action service call operation distinctly; that is, the MSIL code pushes each parameter onto the VM stack in right to left order, making a call to the appropriate Intrinsic_VMStackPush function for each parameter type. Internally, these intrinsics place data on the ‘dummy’ VM stack object that will be passed to an action service handler.

Once all of the parameters are pushed on the stack, a call is made to Intrinsic_ExecuteActionService, which makes the transition to the action service handler itself. (Actually, it calls a dispatcher routine, which then calls the handler based on an index supplied, but we can ignore that part for now.)

Finally, if the action service handler had any return values, the generated MSIL code again invokes intrinsics to remove the return values from the VM stack and transfer them into MSIL locals so that they can be acted on.

Thus, the standard action service handler path is very much a direct translation into MSIL of the underlying steps the NWScript VM would take when interpreting the instructions leading up to an action call. If we look at the actual IL for an action call, we can see this in action (pardon the pun). Consider the following NWScript source text:

void PrintHello()

{

PrintString( "Hello, world (from NWScript)." );

}

The generated IL for this subroutine’s call to PrintString looks something like this (for NWN2):

.method private instance void NWScriptSubroutine_PrintHello() cil managed
{
// Code size 93 (0x5d)
.maxstack 6
.locals init (uint32 V_0,
              string V_1)
// ...
IL_0025: ldstr "Hello, world (from NWScript)."
IL_002a: stloc.1
IL_002b: ldarg.0
IL_002c: ldfld m_ProgramJITIntrinsics
IL_0031: ldarg.0
IL_0032: ldfld m_ProgramJITIntrinsics
IL_0037: ldloc.1
IL_0038: callvirt instance void Intrinsic_VMStackPushString(string)
IL_003d: ldc.i4 0x1
IL_0042: conv.u4
IL_0043: ldc.i4 0x1
IL_0048: conv.u4
IL_0049: callvirt instance void Intrinsic_ExecuteActionService(uint32,
                                                                uint32)

In essence, the generated code makes the following calls:

String ^ s = "Hello, world (from NWScript).";
m_ProgramJITIntrinsics->VMStackPushString( s );
// PrintString is action ordinal 1,
// and takes one source-level argument.
m_ProgramJITIntrinsics->ExecuteActionService( 1, 1 );

If PrintString happened to return a value, we would have seen a call to VMStackPop* here (or potentially several calls, if several return values were placed on the VM stack).

While the standard call path is functional, it does have its downsides. Internally, each of the intrinsics actually goes through several levels of indirection:

* First the JIT code calls the .NET interface INWScriptProgram intrinsic method.

* The INWScriptProgram intrinsic’s ultimate implementation in the JIT core module, NWNScriptJIT.dll, calls into a C++-level interface, _INWScriptStack_ or _INWScriptActions_, depending on the intrinsic. This indirection takes us cross-module from NWNScriptJIT.dll to the script host, such as NWNScriptConsole.exe or NWN2Server.exe.

* Finally, the implementation of INWScriptStack or INWScriptActions performs the requested operation as normal.

Most of these indirection levels are fairly thin, but they involve a managed/native transition, which involves marshalling and some additional C++/CLI interop expense (particularly when NWScript strings are involved).

The fast action service handler interface, which we’ll discuss next time, attempts to address the repeated managed/native transitions by combining the various steps of an action service call into one transacted managed/native transition.
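The shape of a standard-path call (push each parameter right to left, invoke the handler, pop the return value) can be sketched in plain C++. The VMStack class below is a hypothetical miniature of the ‘dummy’ VM stack, and OnAction_GetSubString is a stand-in handler written for this illustration; neither reflects the real INWScriptStack or script host code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical miniature VM stack holding only ints and strings.
class VMStack {
public:
    void PushInt(int v) { m_Items.emplace_back(v); }
    void PushString(std::string v) { m_Items.emplace_back(std::move(v)); }
    int PopInt() { return Pop<int>(); }
    std::string PopString() { return Pop<std::string>(); }
    bool Empty() const { return m_Items.empty(); }

private:
    template <typename T>
    T Pop() {
        // at() throws if the stack is empty; std::get throws on a type mismatch.
        T v = std::get<T>(m_Items.at(m_Items.size() - 1));
        m_Items.pop_back();
        return v;
    }
    std::vector<std::variant<int, std::string>> m_Items;
};

// Stand-in action service handler: pops its parameters (first parameter on
// top, because the caller pushed right to left) and pushes its return value.
void OnAction_GetSubString(VMStack& stack) {
    std::string s = stack.PopString();
    int start = stack.PopInt();
    int count = stack.PopInt();
    stack.PushString(s.substr(static_cast<std::size_t>(start),
                              static_cast<std::size_t>(count)));
}

// What the generated code does for: GetSubString( s, start, count )
std::string CallGetSubString(const std::string& s, int start, int count) {
    VMStack stack;                  // 'dummy' stack used for this one call
    stack.PushInt(count);           // one intrinsic call per parameter,
    stack.PushInt(start);           // pushed in right to left order
    stack.PushString(s);
    OnAction_GetSubString(stack);   // the action call itself
    return stack.PopString();       // one more intrinsic call per return value
}
```

In the real engine, each of the five stack and call operations in CallGetSubString would be a separate intrinsic call, and therefore a separate managed/native transition.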

Tags: NWN2

Posted in Programming | 1 Comment »

NWSCRIPT JIT ENGINE: JIT INTRINSICS, AND JIT’D ACTION SERVICE HANDLER CALLS, PART 1

August 18th, 2010

Previously, I demonstrated how a simple NWScript subroutine could be translated into MSIL, and then to native instructions by the CLR JIT. We still have a large piece of functionality to cover, however, which is calling action service handlers (extension points) from JIT’d code.

In order to understand how action service handlers work, we need to delve into a side-topic first: JIT intrinsics. In certain circumstances, the MSIL backend for the NWScript JIT engine utilizes a series of _NWScript JIT intrinsics_ in IL that it generates when producing the IL-level representation of a script program. Simply put, these JIT intrinsics facilitate operations that must either invoke native code or that are too complex or unwieldy to desirably inline in the form of IL instructions in the generated instruction stream.

The bulk of the JIT intrinsics deal with interfacing with action service handlers, which as you recall, are the main I/O extension points used by the script program to communicate with the code running in the “outside world” (or at least the script host itself). In order to understand why these intrinsics are useful, however, we need to understand more about how action service handlers are called.

Using the NWScript VM that I wrote as a reference, an action service handler simply receives a pointer to a C++ object representing the current VM stack. The action service handler then pops any parameter values off of the VM stack, and pushes the return values of the action back, in accordance with the standard action calling convention defined by the NWScript ACTION opcode.

Now, were the action handler to be called by the NWScript VM, it would be passed the actual execution stack in use by the VM as the program’s main data store, and that would be that. Recall, however, that the NWScript JIT engine is designed to be a drop-in replacement for the interpretive NWScript VM. That means that it must ultimately use the same VM-stack calling convention for action service handler calls. This is advantageous as there are a great number of action service calls exposed by the NWN2 API (over a thousand), and rewriting these to use a new calling convention would be a painful undertaking. Furthermore, reusing the same calling convention allows each action service handler call to be used by _both_ the JIT and the VM in the same program, which allows for possibilities such as background JIT with cutover, or simply a defense against the JIT having a bug or simply not being available (perhaps .NET 4.0 isn’t installed — the core server itself does not require it). Thus, in order to make an action service handler call, the MSIL JIT backend needs to call various C++ functions to place items on a VM stack object that can be passed to an action service handler’s real implementation. (In the case of the JIT system, I simply create a ‘dummy’ VM stack that only ever contains parameters and return values for the current action service handler.) However, the IL code emitted by the NWScript JIT cannot easily directly interface with the VM stack object (which is written in native C++). The solution I selected for this problem was to create the set of JIT intrinsics that I made reference to previously; these JIT intrinsics, implemented in C++/CLI code, expose the mechanisms necessary to invoke an action service handler to NWScript in the form of a safe/verifiable .NET interface. (Actually, the reality is a little bit more complex than that, but this is a close approximation.) 
For performance reasons (recall that action service calls are to NWScript as system calls are to a native program), the NWScript JIT backend exposes three distinct mechanisms to call into an action service handler. Most of these mechanisms heavily rely on various special-purpose JIT intrinsics, as we’ll see shortly:

* A “standard” action service call mechanism, consisting of a series of intrinsics for each VM stack operation (i.e. push a value on the VM stack, pop a value off the VM stack, call the action service handler). The standard action service call mechanism is invoked when an action service call has five or fewer combined parameters and return values, or if the action service call involves an engine structure.

* A “fast” action service call mechanism, consisting of a single unified intrinsic that combines pushing parameters onto the VM stack, calling the action service handler, and popping any return values off the stack. If verifiable IL is desired, the fast action service call mechanism is invoked when an action service call has six or more combined parameters and return values and does not involve any engine structures.

* A “direct fast” action service call mechanism, which generates direct, devirtualized calls to the raw C++-level interface used by the NWScript host to expose action service handlers. The direct fast action service call mechanism is the fastest action call mechanism by a large margin, but the emitted IL is non-verifiable (and in fact specific and customized to the instance of the NWScript host process). Like the ordinary fast action service call mechanism, the direct fast action service call does not support action service calls that involve engine structures. If non-verifiable IL is acceptable, the direct fast action service call mechanism is always used unless an engine structure is involved.

Why the distinction at six combined parameters and return values with respect to the “fast” action service call mechanism? Well, profiling determined that the fast mechanism is actually only faster than the standard mechanism (in the current implementation) if there are seven or more intrinsics being called at once (six parameter or return value VM stack operations, plus the actual action call intrinsic). We’ll get into more details as to why this is the case next time. All three action service handler invocation mechanisms achieve the same effect at the end of the day, however.

For the most part, the .NET-level interface exposed by the JIT intrinsics system is relatively simple. There is an interface class (INWScriptProgram) that exposes a set of APIs along the lines of these:

//

// Push an integer value onto the VM stack (for an action call).

//

void

Intrinsic_VMStackPushInt(

__in Int32 i

);

//

// Pop an integer value off of the VM stack (for an action call).

//

Int32

Intrinsic_VMStackPopInt(

);

// ...

//

// Execute a call to the script host's action service handler.

//

void
Intrinsic_ExecuteActionService(
    __in UInt32 ActionId,
    __in UInt32 NumArguments
    );

// ...

//

// Execute a fast call to the script host's action service handler.

//

Object ^
Intrinsic_ExecuteActionServiceFast(
    __in UInt32 ActionId,
    __in UInt32 NumArguments,
    __in ... array< Object ^ > ^ Arguments
    );

When a piece of generated code needs to access some extended functionality present in a JIT intrinsic, all that needs to be done is to set up a call to the appropriate JIT intrinsic interface method on the JIT intrinsics interface instance that is handed to each main script program class. This allows complex functionality to be written in C++/CLI versus directly implemented as raw, emitted IL.

Aside from logic to support action service handler invocation, there are several additional pieces of functionality exposed as JIT intrinsics. Specifically, comparison and creation logic for engine structures is offloaded to JIT intrinsics, as well as a portion of the code to set up a saved state object for an I_SAVE_STATE instruction.

On that note, next time we’ll dig in deeper as to what actually goes on for a JIT’d action service handler call under the hood, including how the above JIT intrinsics work and how they are used.
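The rules given earlier in this post for choosing among the three call mechanisms can be condensed into a small decision function. The sketch below is only a restatement of those rules in C++; the enum, the function name, and its parameters are hypothetical and do not come from the JIT engine's source.

```cpp
#include <cassert>

enum class CallMechanism { Standard, Fast, DirectFast };

// Restates the selection rules from the text:
//  - engine structures always force the standard path;
//  - otherwise, if non-verifiable IL is acceptable, direct fast is used;
//  - otherwise, fast is used at six or more combined parameters and
//    return values, and standard at five or fewer.
CallMechanism SelectCallMechanism(unsigned combinedParamsAndReturns,
                                  bool involvesEngineStructure,
                                  bool allowNonVerifiableIL) {
    if (involvesEngineStructure)
        return CallMechanism::Standard;
    if (allowNonVerifiableIL)
        return CallMechanism::DirectFast;
    return combinedParamsAndReturns >= 6 ? CallMechanism::Fast
                                         : CallMechanism::Standard;
}
```

For example, a call with one string parameter and no return value stays on the standard path when verifiable IL is required, while the same call uses the direct fast path once non-verifiable IL is allowed.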

Tags: NWN2

Posted in Programming | 1 Comment »

NWSCRIPT JIT ENGINE: UNDER THE HOOD OF A GENERATED MSIL SUBROUTINE

August 17th, 2010

Yesterday, I expounded on the basics of how assemblies for scripts are structured, and how variables, subroutines, and IR instructions are managed throughout this process. Nothing beats a good concrete example, though, so let’s examine a sample subroutine, both in NWScript source text form, and then again in MSIL form, and finally in JIT’d amd64 form.

_Example subroutine_

For the purposes of this example, we’ll take the following simple NWScript subroutine:

int g_randseed = 0;

int rand()
{
    return g_randseed = (g_randseed * 214013 + 2531101) >> 16;
}

Here, we have a global variable, _g_randseed_, that is used by our random number generator. Because this is a global variable, it will be stored as an instance variable on the main program class of the script program, as we’ll see when we crack open the underlying IL for this subroutine:

_MSIL version_

.method private instance int32 NWScriptSubroutine_rand() cil managed

{

// Code size 110 (0x6e)

.maxstack 8

.locals init (int32 V_0,

uint32 V_1,

int32 V_2,

int32 V_3,

int32 V_4)

IL_0000: ldarg.0

IL_0001: ldarg.0

IL_0002: ldfld uint32 m_CallDepth

IL_0007: ldc.i4.1

IL_0008: add

IL_0009: dup

IL_000a: stloc.1

IL_000b: stfld uint32 m_CallDepth

IL_0010: ldloc.1

IL_0011: ldc.i4 0x80

IL_0016: clt.un

IL_0018: brtrue.s IL_0025

IL_001a: ldstr "Maximum call depth exceeded."

IL_001f: newobj instance void System.Exception::.ctor(string)

IL_0024: throw

IL_0025: ldarg.0

IL_0026: ldfld int32 m__NWScriptGlobal4

IL_002b: stloc.2

IL_002c: ldc.i4 0x343fd

IL_0031: stloc.3

IL_0032: ldloc.2

IL_0033: ldloc.3

IL_0034: mul

IL_0035: stloc.s V_4

IL_0037: ldc.i4 0x269f1d

IL_003c: stloc.2

IL_003d: ldloc.s V_4

IL_003f: ldloc.2

IL_0040: add

IL_0041: stloc.3

IL_0042: ldc.i4 0x10 IL_0047: stloc.s V_4

IL_0049: ldloc.3

IL_004a: ldloc.s V_4

IL_004c: shr

IL_004d: stloc.2

IL_004e: ldloc.2

IL_004f: stloc.3

IL_0050: ldarg.0

IL_0051: ldloc.3

IL_0052: stfld int32 m__NWScriptGlobal4

IL_0057: ldloc.2

IL_0058: stloc.0

IL_0059: br IL_005e

IL_005e: ldarg.0

IL_005f: ldarg.0

IL_0060: ldfld uint32 m_CallDepth IL_0065: ldc.i4.m1

IL_0066: add

IL_0067: stfld uint32 m_CallDepth

IL_006c: ldloc.0

IL_006d: ret

} // end of method ScriptProgram::NWScriptSubroutine_rand

That’s a lot of code! (Actually, it turns out to be not that much when the IL is JIT’d, as we’ll see.) Right away, you’ll probably notice some additional instrumentation in the generated subroutine: it reads and writes m_CallDepth, an instance variable on the main program class. This is part of the best-effort instrumentation that the JIT backend inserts into JIT’d programs so as to catch obvious programming mistakes before they take down the script host completely. In this particular case, the JIT’d code uses m_CallDepth to keep track of the current call depth. Should the current call depth exceed a maximum limit (which, incidentally, is the same limit that the interpretive VM imposes), then a System.Exception is raised to abort the script program.
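The recursion guard is easier to follow outside of IL. Below is a minimal Python sketch of the same bookkeeping, written for illustration only: ScriptAbort, guarded_call, and call_depth are hypothetical names standing in for System.Exception, the generated prologue and epilogue, and m_CallDepth. As in the listing, the depth is incremented before it is compared against 0x80, and it is decremented only on the normal return path.

```python
MAX_CALL_DEPTH = 0x80  # the constant compared against in the IL listing


class ScriptAbort(Exception):
    """Stands in for the System.Exception thrown by the generated code."""


class ScriptProgram:
    def __init__(self):
        self.call_depth = 0  # plays the role of m_CallDepth

    def guarded_call(self, body):
        # Prologue: increment and store the depth, then check it.
        self.call_depth += 1
        if not (self.call_depth < MAX_CALL_DEPTH):
            raise ScriptAbort("Maximum call depth exceeded.")
        result = body()
        # Epilogue: reached only on a normal return, as in the IL.
        self.call_depth -= 1
        return result


if __name__ == "__main__":
    program = ScriptProgram()

    def runaway():
        # A subroutine that recurses without a base case.
        return program.guarded_call(runaway)

    try:
        runaway()
    except ScriptAbort as abort:
        print("script aborted:", abort)
```

Note that the sketch, like the IL, makes no attempt to restore the depth counter when the exception is thrown; the program is being abandoned at that point anyway.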

This brings up a notable point, in that the generated IL code is designed to be safely aborted at any time by raising a System.Exception. An exception handler wrapping the entry point catches the exception, and the default return code for the script is returned up to the caller if a script is aborted in this way.

Looking back to the generated code, we can see that the basic operations that we would expect are all there; there is code to load the current value of g_randseed (m__NWScriptGlobal4 in this case), multiply it with a fixed constant (0x343fd, or 214013 as we see in the NWScript source text), then perform the addition and right shift, before finally storing the result back to g_randseed (m__NWScriptGlobal4 again) and returning. (Whew, that’s it!)

Even though there are a lot of loads and stores here still, most of these actually disappear once the CLR JIT compiles the MSIL to native code. To see this in action, let’s look at the same code, now translated into amd64 instructions by the CLR JIT. Here, I used !sos.u from the sos.dll debugger extensions (the instructions are colored using the same coloring scheme as I used above):

0:007> !u 000007ff`001cbac0
Normal JIT generated code
NWScriptSubroutine_rand()
Begin 000007ff001cbac0, size 7e

push rbx
push rdi
sub rsp,28h
mov rdx,rcx
mov eax,dword ptr
lea ecx,
mov dword ptr ,ecx
xor eax,eax
cmp ecx,80h
setb al
test eax,eax
je 000007ff`001cbb07
mov eax,dword ptr
imul eax,eax,343FDh
lea ecx,
sar ecx,10h
mov dword ptr ,ecx
mov eax,dword ptr
dec eax
mov dword ptr ,eax
mov eax,ecx
add rsp,28h
pop rdi
pop rbx
ret
lea rdx,
mov ecx,70000005h
call clr!JIT_StrCns
mov rbx,rax
lea rcx,
call clr!JIT_TrialAllocSFastMP_InlineGetThread
mov rdi,rax
mov rdx,rbx
mov rcx,rdi
call mscorlib_ni+0x376e20 (System.Exception..ctor(System.String)
mov rcx,rdi
call clr!IL_Throw
nop

(If you’re curious, this was generated with the .NET 4 JIT.) Essentially each and every one of the fundamental operations was turned into just a single amd64 instruction by the JIT compiler; not bad at all! (The rest of the code you see here is the recursion guard.)
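To sanity-check the arithmetic that both the IL and the native code perform, here is a small Python model of one call to rand(). It is a sketch for illustration and not part of the JIT backend: to_int32 and rand_step are hypothetical helper names, and the model assumes 32-bit signed wraparound for the multiply and add, plus an arithmetic (sign-preserving) right shift, matching the int32 locals and the shr instruction in the listing.

```python
def to_int32(value):
    """Wrap an arbitrary Python int to a signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def rand_step(seed):
    """Compute the new g_randseed (and return value) for one rand() call."""
    product = to_int32(seed * 0x343FD)    # mul at IL_0034
    total = to_int32(product + 0x269F1D)  # add at IL_0040
    return total >> 16                    # shr at IL_004c (arithmetic shift)


if __name__ == "__main__":
    # The hex constants in the IL are the decimal ones from the source text.
    assert 0x343FD == 214013
    assert 0x269F1D == 2531101

    seed = 0  # g_randseed starts at 0
    for _ in range(3):
        seed = rand_step(seed)
        print(seed)
```

Starting from a seed of 0, the first call yields 38, since 2531101 >> 16 is 38. Python's >> on a negative int is also an arithmetic shift, which is why no extra handling is needed for negative intermediate values.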

Tags: NWN2

Posted in Programming | 1 Comment »

NWSCRIPT JIT ENGINE: GENERATING A .NET ASSEMBLY FOR A JIT’D SCRIPT

August 16th, 2010

Last time, I outlined the MSIL JIT backend from a high level, and described, in part, how its external interface functions. While knowing how the MSIL JIT backend works from the outside is all well and good, most of the interesting parts are in the internals. This time, let’s dig in deeper and see how the MSIL code generation process in the JIT backend functions (and what a generated script assembly might look like).

_Script assemblies_

As I mentioned, the backend generates a new .NET assembly for each script passed to NWScriptGenerateCode. This API creates a new _NWScriptProgram_ object, which represents an execution environment for the JIT’d script program. When an NWScriptProgram object is created, it consumes an IR representation for a script program and begins to create the MSIL version of that script, contained within a single .NET assembly tied to that NWScriptProgram instance. Each script assembly contains a single module; that module then contains a series of classes used in the MSIL representation of the script. The NWScriptProgram object internally maintains references to the script assembly and exposes an API to allow the script to then be invoked by the user.

_Main program class_

Each generated NWScript program contains a main class, with a name of the form NWScript.JITCode.
