The AMD 64-bit architecture
This architecture was invented by AMD, and was later adopted by Intel when their own Itanium 64-bit architecture was not received with enthusiasm. It is also widely known as x86-64 (Intel's own name for it is Intel 64). It is the basis of most modern PCs, and is targeted by FTN95 when the /64 switch is used.
The AMD 64-bit architecture has 16 general purpose integer registers:
RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, R8, R9, R10, R11, R12, R13, R14, R15.
The first eight registers correspond to the 32-bit register set, and retain some of the same functionality. Thus RSP is the stack pointer and descends as the stack expands, RCX, RSI and RDI are used for string operations just as they are in 32-bit mode, and RAX is used by convention to return integer function values. RBP does not correspond in function to EBP. However, it is given a special function in Silverfrost code (explained later), and should not be modified in normal circumstances.
All these registers hold 64 bits (8 bytes) and can therefore hold a pointer to anywhere in the 64-bit address space.
64-bit programs can access two different sets of floating point registers: the old floating point stack of eight 80-bit registers, and a set of registers designated XMM0 - XMM15, known as the SSE registers. The SSE registers are 16 bytes wide and can hold multiple values simultaneously: four REAL*4 floating point values, or two REAL*8 values. They can also hold integer values. These registers do not 'know' what data they contain, so it is up to the programmer to keep track. In particular, if you load a REAL*8 value into an XMM register and wish to store it as a REAL*4, you must first use the appropriate conversion instruction (CVTSD2SS in the x86 instruction set).
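The idea that an XMM register is just 16 untyped bytes can be modelled outside the assembler. The following is a minimal sketch in Python, used purely for illustration; the names reg, as_real8, as_real4 and converted are hypothetical and have nothing to do with FTN95 itself. It contrasts reinterpreting the bits of two REAL*8 values as four REAL*4 values with genuinely converting one value.

```python
import struct

# Model a 16-byte XMM register holding two REAL*8 values (little-endian).
reg = struct.pack("<2d", 3.0, 4.0)

# Read the 16 bytes back as two REAL*8 values: the original numbers.
as_real8 = struct.unpack("<2d", reg)

# Read the same 16 bytes as four REAL*4 values. The bits are reinterpreted,
# not converted, so the results bear no relation to 3.0 and 4.0.
as_real4 = struct.unpack("<4f", reg)

# A genuine conversion rounds the REAL*8 value to REAL*4 precision first.
# In real code this is the job of a conversion instruction.
converted = struct.unpack("<f", struct.pack("<f", as_real8[0]))[0]

print(as_real8)    # the two REAL*8 values
print(as_real4)    # four unrelated REAL*4 values
print(converted)   # 3.0 survives conversion exactly
```

Storing the low 4 bytes of the register without converting would correspond to taking the first element of as_real4, which is not the value that was loaded.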
Strangely, the old coprocessor stack instructions do offer some functionality that is not present in the newer SSE instruction set. For example, SIN and COS can each be evaluated in one instruction.
Silverfrost CODE/EDOC conventions
Let us start with a simple
executable example of a 64-bit CODE/EDOC sequence that simply sums a vector of
REAL*8 values. It is not meant to be optimal because it does not use the
parallel execution facilities of the SSE registers.
      REAL*8 vec(3),ans
      DATA vec/3.0d0,4.0d0,5.0d0/
      CALL sum(vec,3,ans)
      PRINT*,ans
      END

      SUBROUTINE sum(vec,n,ans)
      INTEGER n
      REAL*8 vec,ans
      CODE
      MOV_Q RDX,=VEC     ! The '=' denotes a (non-immediate) constant or, as in this case, the address of an argument
      MOV_Q R14,=N       ! Remember all addresses are 64-bit - hence the use of MOV_Q
      MOVSX_Q R14,[R14]  ! Instructions and register names are case insensitive
                         ! N is only a 32-bit integer, so it is sign extended to 64 bits
      XORPD XMM0,XMM0    ! This is one way to zero an XMM register: a bitwise exclusive OR with itself
    1 ADDSD XMM0,[RDX]
      ADD_Q RDX,8        ! This uses an immediate constant
      DEC_Q R14
      JNE $1             ! Labels are denoted by a '$'
      MOV_Q RCX,=ans
      MOVSD [RCX],XMM0   ! Store the accumulated answer in the argument ANS
                         ! (8 bytes only: a 16-byte store such as MOVDQU would overwrite the 8 bytes after ANS)
      EDOC
      END
This illustrates a variety of points:
1) The instructions that operate on the integer registers can operate on 1, 2, 4, or 8 byte operands. These are distinguished by a suffix, thus the MOV instruction takes the forms MOV_B, MOV_H, MOV, MOV_Q.
2) Unlike the 32-bit CODE/EDOC, the register name does not change when the operation operates on a smaller number of bytes.
3) Operations that work on 4 bytes of a register (MOV, ADD, etc) also clear the upper 4 bytes of the register, whereas 2-byte and 1-byte instructions do not change the other bytes of the register. This is a feature of the hardware, not a Silverfrost convention.
4) Labels are prefixed by a '$' when used, just as is the case in 32-bit mode.
5) When accessing a Fortran argument, you need to first access its address (an 8-byte quantity). The notation =N is used to access the address of argument N. The '=' notation can also be used to address a constant in memory, for example:
      MOVSD XMM3,=2.0d0
6) The MOVSX_Q instruction sign extends a 32-bit integer to 64 bits. In situations where a number is known to be non-negative, this extension can be obtained for free using point 3 above, because a 4-byte operation clears the upper 4 bytes of the register.
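Points 3 and 6 can be checked with a small model. The sketch below is Python, used only to illustrate the arithmetic; the helper names zero_extend_32, sign_extend_32 and as_signed_64 are hypothetical. It shows that clearing the upper 4 bytes and sign extending give the same 64-bit value for a non-negative number, but different values for a negative one.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def zero_extend_32(value):
    """Model a 4-byte register write: the upper 4 bytes become zero."""
    return value & MASK32

def sign_extend_32(value):
    """Model sign extension of a 32-bit integer to 64 bits."""
    value &= MASK32
    if value & 0x80000000:           # sign bit set: fill the upper half with ones
        value |= MASK64 ^ MASK32
    return value

def as_signed_64(bits):
    """Interpret a 64-bit pattern as a two's complement integer."""
    return bits - (1 << 64) if bits & (1 << 63) else bits

n = 3                                # non-negative: both extensions agree
print(as_signed_64(zero_extend_32(n)), as_signed_64(sign_extend_32(n)))

m = -3 & MASK32                      # the 32-bit pattern of -3
print(as_signed_64(zero_extend_32(m)), as_signed_64(sign_extend_32(m)))
```

For the negative case the zero-extended register reads as 4294967293 rather than -3, which is why the example subroutine sign extends N explicitly.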
In general a good way to learn to write instructions inside CODE/EDOC is to compile simple code samples with the /EXPLIST option, which will display the instructions generated by the compiler line by line in essentially the same format that you will use.
Referencing COMMON, MODULE, and ALLOCATE'd variables
Because most COMMON blocks are allocated as the program starts up (as are large arrays in MODULEs), the simplest way to access these objects, as well as explicitly ALLOCATE'd arrays, is to take their address before entering the CODE/EDOC. For example:
      COMMON/FRED/alpha,beta(100),gamma
      INTEGER*8 alpha,beta,gamma
      INTEGER*8 addressof_beta
      addressof_beta=loc(beta)
      CODE
      MOV_Q R10,addressof_beta
      MOV_Q [R10+8],42   ! This sets beta(2) to the value 42
      EDOC
The 64-bit address space
The 32-bit address space provided a theoretical maximum
2^32 (4 x 10^9) addressable bytes. Correspondingly, the 64-bit address space
offers a theoretical maximum 2^64 (1.8 x 10^19) addressable bytes. This means
that, rather like in the early days of the 32-bit architecture, when a typical
computer might have vastly less than 2^32 bytes (4 GB) of memory, the virtual
address space is only very sparsely populated. Indeed, the 64-bit virtual
address space is so large that it isn't possible to provide page tables to cover
the address space. This means that the amount of virtual address space available
to a program is determined in a way that depends on the version of Windows in
use, and the total amount of main memory on the computer (say 16 GB). This
number is still extremely large. However, it is relevant if you use calls to
VirtualAlloc to access high memory addresses in an absolute way.
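To put these figures in proportion, the arithmetic can be worked through directly. This is a short Python sketch for illustration only; the variable names are hypothetical and the 16 GB figure is simply the example used above.

```python
GIB = 2**30

space_32 = 2**32          # bytes addressable with 32-bit pointers
space_64 = 2**64          # bytes addressable with 64-bit pointers
ram = 16 * GIB            # the 16 GB machine used as an example in the text

print(f"{space_32:.1e}")  # roughly 4 x 10^9
print(f"{space_64:.1e}")  # roughly 1.8 x 10^19

# How many machines' worth of 16 GB memory would fill the 64-bit space.
print(space_64 // ram)
```

The last figure is 2^30, so 16 GB of main memory covers about one billionth of the theoretical 64-bit address space.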
Using the SSE registers for parallel computation
Instructions like MOVDQA will load a pair of REAL*8 numbers into an XMM register. Since these numbers are just bits, the instruction can also be used to move four REAL*4 numbers into an XMM register. However, this instruction will fault if the data is not 16-byte aligned. This is problematic because REAL*4 and REAL*8 numbers are aligned wherever possible (EQUIVALENCE can prevent alignment) to 4 and 8 bytes respectively, not to 16. In practice, it turns out that MOVDQU (which is reputed to be slower than MOVDQA) seems to run at the same speed for aligned data, and only somewhat slower for non-aligned data, but generates no alignment faults. It is also worth reading this discussion about alignment issues:
http://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
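The alignment problem can be seen by working out element addresses. The following Python sketch is illustrative only: the helper names and the base address 0x1008 are hypothetical, chosen to represent a REAL*8 array that is 8-byte aligned but not 16-byte aligned.

```python
def is_aligned(address, boundary):
    """True if address is a multiple of boundary."""
    return address % boundary == 0

def pair_load_alignment(base, count, elem_size=8):
    """For each element, report whether a 16-byte load starting there is aligned."""
    return [is_aligned(base + i * elem_size, 16) for i in range(count)]

base = 0x1008   # hypothetical address: 8-byte aligned, not 16-byte aligned
print(is_aligned(base, 8), is_aligned(base, 16))
print(pair_load_alignment(base, 4))
```

With 8-byte alignment only every other element starts on a 16-byte boundary, so a loop that loads pairs starting from the first element of such an array would fault with an aligned load but not with MOVDQU.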