This project is dedicated to the memory of William Morris (aka Frags), who was the main contributor to the bounty but was unable to see the final result.

Sunday, March 27, 2011

Rainy Day with Copper-rendered Rainbows

The sky is cloudy, sometimes sudden showers are soaking the insanely green grass and I am forced into the shabby unit we rented half a year ago. Time to move - popping into my mind quite often these days, but that is a completely different matter.
There are only a few things distracting me from working on E-UAE, I tried to concentrate working on it with a great strength. (Apage Satanas: Facebook! I clearly miss the times when I was developing Flamingo. I was sitting in my curtained flat all day, no internet, no phone, no tv: the best environment for creating your own little pet creature and take over the world, MUHAHAH...)

I was trying to compile E-UAE since my last post, but I am clearly not really fond of the multi-platform environment. Configure script is a bad dog: never behaves. (Linux developers must have an insane amount of time and patience.)
Finally, after running into the never-ending circle of deleting all, copy back, adjusting, configuring and do the make for the hundredth time: I was able to compile E-UAE sources, and the binary is even working. YAY! \o/

I am happy that I can start on tinkering with the JIT implementation, but I have to admit that my Amiga developer skills became a bit rusty in the last three years, since I hardly had time to spend on development.
I fought big time with the path of include files, still getting flooded with complaints about missing prototypes for the AOS functions, but at least that is only a warning. (Not that I can live with compile-time warnings for my code, but I really don't have enough strength to find what minor thing needs fine tuning in the compiling to get rid of these warnings safely.)
It took me an hour today to find out that TimeDelay function is part of the (obsolete) libamiga and this is why the linking failed every time. There was an easy fix for that: just remove the check for AOS4 from the sources and let it fall back to the DOS/Delay() function. Not a proper solution, because Delay() is badly inaccurate, but at least I was able to compile and fire it up.

The compiling on the uA1 is awesomely slow and eating up all the memory, so I had to turn on the swapping. I seriously considering the cross-compiling from my windows laptop, and let the A1 do the testing functionality only. It was a very long time since I tried cross-compiling for the OS kernel, back in the days when it was impossible to compile it on AOS4. It was painful, I had to use Cygwin and a number of tricks to get it working. Then copied the compiled binary to my Amiga 4000 on a floppy disk for testing... *Sigh* I was soo patient, I am amazed now. ;)
Speaking of which: I still need net connection for the uA1, that would be essential for the cross-compiling.

Anyway, let's go back to experimenting. The first test will be implementing a very-very simple form of the poor-man's JIT, which will do nothing more than "compiling" the executed code into series of calls to the interpretive instruction implementation functions. The outcome will be an environment, where I still have all the already implemented instructions, but I can replace them one-by-one with the real JIT instructions. Something what I needed when developed Petunia, but there was no interpretive emulator to back up the missing instructions.

Sunday, March 6, 2011

Blueprints

It is time to celebrate: finally my Amiga configuration is complete. It took more than a half year after moving to the opposite side of the globe (from Hungary to New Zealand). I dragged my poor, old A1-XE with me in a suitcase; unfortunately it didn't survive the flight. :(
Stephen “Cobra” Fellner lent a uA1 to me in a compact case, I couldn't be grateful enough for it. (Not to mention the tons of helps they gave us to start our new life here from scratch...)
Piece-by-piece the machine was completed; the final item was an old Philips monitor from TradeMe.

It is also time to reorganize priorities in my (rather limited) free time. Cut back on beaches, bushwalks and especially on Facebook. :P
But enough from whining, this is a technical blog after all and the autumn is coming rapidly with lotsa rain...

Let's scheme!

A JIT compiler similar to a programming language compiler, the main difference is the source: while a programming language source is human readable text (or that would be the goal, at least), the source in this case is machine code from a different processor.
The upside is the machine code has very strict rules, easier to interpret. The downside is it must be precisely known in every minor detail and there are undocumented features and side-effects that have to be implemented correctly. (How to find out these: good question; usually with countless hours of debugging crashing applications.)

Why JIT called just-in-time compiling: the program code translated to natively executable code while it is emulated. It means every time the execution reaches a point in the program that wasn’t reached before then the compiler takes a chunk from the emulated code and translates it to native code then the execution flow goes on. (Some JIT compilers are able to translate the whole executable right after loading, but that is only possible in special cases.)
The compilation process can be either fairly simple or overly complicated, depending on the actual method. The final result must be directly executable on the host system; in other case it would be inefficient and probably useless. (It would require an interpreter to execute, rarely makes any sense.)
There are different approaches to the compilation process, which one is the best choice depends on multiple factors.

Poor man’s JIT

For example the most simplistic method produces a series of jump instructions for execution of source data fetching, operation execution, result store. Everything else is done by a library of functions in the emulation environment. This solution gets rid of the interpretation of every instruction code at every (re-)execution, but it is next to impossible to do any optimization which would involve the source- and destination handling together with the operation; not to mention a wider span of instructions.
Why anybody would try such an inefficient solution: if the interpreter is already implemented then requires significantly less work to turn it into a JIT compiler by this way. Also the translated code needs (lots of) memory, such implementation is lightweight on the memory usage.

Stamping Lil’ Roses and Rainbows

Slightly better approach is creating templates for certain addressing modes and operations that can be copied into the translated code (almost) directly. Not every instruction incarnation needs its own template; it is possible filling up some gaps regarding the specifics to the actual instruction, such as the involved registers.
This is the most common approach, flexible and efficient, if implemented properly. With some tweaking it is even possible to adjust the templates to handle special cases, such as optimizing arithmetic flag emulation away when consecutive instructions would overwrite it anyway. This is how Petunia works and as I found out recently, something similar that WinUAE implements.
However, templates are a bit rigid sometimes and still not to easy (or even impossible) to join together more, than two instructions in a specific optimization, which would make use of certain aspect of the target processor. Creating truly flexible templates could result a big mess in translation functions.

Big Planz I haz it

I had a couple years after finishing (the never-ever finished) Petunia playing around with scenarios in my mind where emulated code builds up this or that way. Finally, I came to the conclusion that there is a possible better way to implement JIT compiling, and that is similar to the microcode, that used often in CISC processors.
Microcode is for reducing the complexity of the machine code instruction by implementing it as a series of very simple “wired” instructions and an “interpreter” executes the simple instructions one-by-one at each clock cycle.
Combining this technique with the VLIW approach, when the simple instructions are executed out-of-order, or even can be eliminated completely the result might be lot more optimal on the generated code.
How do I intend to implement… Similarly to the templates, each emulated instruction will be prepared as a series of virtual instructions. The compiler in the first round collects the virtual instructions for each emulated instructions into a buffer for a defined chunk of the emulated code.
In the second round an optimizer runs trough the buffer while trying to apply modifications on the virtual instructions according to predefined rules. At this level the virtual instructions and the original (emulated) instructions have no connection at all anymore, each virtual instruction can be handled, reorganized, eliminated on its own.
The third round is the code emitter: turns the virtual instructions into natively executable code using actual code templates.

I am sure it wasn’t me who thought on this solution for the first time, but never read about similar approach before in the case of the JIT compiling. Programming language compilers do similar code translation to maintain the portability of the compiler between the different processor architectures. (Code emitter has to be adapted, the rest of the compiler needs no modification.)

Predicted Roadblocks

Once I heard: if you were not able to summarize the problem then you won't be able to find the solution either. Let's find out the possible problems with JIT compiling for the complete machine emulation then.

1. Memory access emulation
Unless the 68k emulation in OS4 the UAE is a complete machine emulation. While an application is running it reads and writes memory (no news to anybody, I guess). If the accessed memory is plain data then there is not much to do with it: the application can do whatever it was planned for.
Unfortunately there are two types of memory access that needs special care: accessing hardware registers and writing into the executed code area. For the latter see below at self-modifying code, the former is lot more easily to handle.
Solution: basically what needed is incorporate the functions from UAE that are called for each memory accesses into the translated code.

2. Self-modifying code
The 68k processors don't make difference between data and code, although the AmigaOS itself is able to recognize what part of the loaded executable contains actual code. The difference was never enforced to the developers with all of its up- and downsides, they found it out by the hard way what can of worms is hidden there, when the processors with cache (like 68020) appeared in Amigas. Several self-modifying game and demo fails running on the cached code memory, because the code cache and the data cache is separated.
In AmigaOS4 I was able to get information from DOS.library regarding the loaded and removed code segments. By using this information I could tell which memory areas are cleared, I simply dropped the translated code for those.
With UAE the situation is completely different: any byte in the memory is potential target for modifying. It means the translated code must be dropped and retranslated; otherwise it would conserve the previously translated state.
Solution: probably I must extend the memory access checking and drop the translated code when it gets modified. I have to revisit this topic later on; checking for translated code at every memory access might be too slow.

3. Translated code lookup
When the execution jumps, branches to a new memory address the emulator has to know whether there is an already translated code or the new address was never hit before, new translation is needed. The translated parts are sometimes following each other, but often the programs are wondering around in the memory with no logic to follow.
Solution: Petunia had the same problem, I created a two level look-up table for the translated code segments for each address in the address space; a bit memory-hungry, but very quick for finding the address of the translated code.

As Michael Jackson would say: This is it. These are the initial plans, more details must follow, but first thing first: let’s try to compile E-UAE… :)