Here are some helpful hints to get optimal performance:
The EMMS call takes a lot of time, so try to separate floating point and MMX operations.
Use MMX only in low level routines because the compiler saves all used MMX registers
when calling a subroutine.
The NOT operator isn't supported natively by MMX, so the compiler has to generate
a workaround, which makes this operation inefficient.
Simple assignments of floating point numbers don't access floating point registers, so
they need no call to the EMMS procedure. You need to call the EMMS procedure only
before doing floating point arithmetic.
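
As a rough illustration of the hints above, the following sketch groups the MMX operations together and calls EMMS once, just before the first floating point arithmetic. It assumes an i386 target and the MMX unit (which supplies tmmxword, is_mmx_cpu and emms); the program name, variable names and values are made up for the example.

```pascal
program MMXHints;
{ Sketch only: assumes an i386 target, where the MMX support applies. }
uses
  mmx;  { supplies tmmxword, is_mmx_cpu and emms }

const
  { illustrative values, not taken from the text above }
  w1 : tmmxword = (1, 2, 3, 4);
  w2 : tmmxword = (10, 20, 30, 40);

var
  w3, w4 : tmmxword;
  d : double;
  i : longint;

begin
  d := 1.5;  { plain assignment, no floating point arithmetic yet }
  if is_mmx_cpu then
    begin
{$mmx+}
      { keep the MMX operations together ... }
      w3 := w1 + w2;
      w4 := w3 + w1;
{$mmx-}
      emms;  { ... and clear the MMX state once, before any FPU arithmetic }
    end
  else
    for i := 0 to 3 do
      begin
        { fallback without MMX }
        w3[i] := w1[i] + w2[i];
        w4[i] := w3[i] + w1[i];
      end;
  d := d * 2.0;  { floating point arithmetic, done after emms }
  writeln(w4[0], ' ', d:0:1);
end.
```

Calling emms after every single MMX statement would also be correct, but it would pay the cost of the call repeatedly, which is what the first hint warns against.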