Overview
FTN95 now includes an in-built profiling tool called TimingAnalysis. TimingAnalysis accurately measures CPU time spent in each subroutine of a Fortran program. It allows program developers to discover the most CPU intensive areas of a program so that these can be improved. Frequently, it will be found that over 90% of the program CPU time is in less than 10% of the code.
TimingAnalysis produces a report giving statistics on CPU time, number of calls to and from routines, and the number of page faults per routine. CPU time statistics are collected on both an exclusive (only the single routine) basis and an inclusive basis (the routine and everything called below it). In addition to the reports, a file suitable for importing into a spread sheet is produced.
To use the TimingAnalysis all that is required is to compile with the /TIMING option. This can be done within Visual Studio .NET by selecting the Project Properties and navigating to Compiler Options->Debugging->Plant timing information and set this option to Yes. If compiling from the command line then the /TIMING option should be used.
TimingAnalysis creates output files containing summary information for the program executed. If using TimingAnalysis from Visual Studio .NET then these files will be placed in the project output directory and will have extensions of .TMR and .TMO. The .TMR file contains summary analysis of the executed program and the .TMO file contains a (by default) tab seperated file ready for importing into a spreadsheet program such as Microsoft Excel.
From the command line only .TMR files are produced by default. To specify a .TMO file from the command line an environment variable called TIMINGOPTS should be created and set as follows:
set TIMINGOPTS=/TMO /DLM tab
where /TMO indicates that a .TMO file should be produced and DLM specifes the
delimeter character to be used in the .TMO file. The default is tab and
can be any character other than '-'
or '/'
.
How Timing Analysis Works
TimingAnalysis uses the RDTSC instruction to access the Pentium cycle counter. This counter gives the number of cycles that have passed since the computer started up. This allows direct measurement of the time short segments of code take to execute. The times labeled as CPU time in TimingAnalysis reports are actually elapsed time spent in a routine, to a very high resolution. This time will include time waiting for I/O to the disk (including paging) or waiting at a prompt to the screen. With this in mind, the reports can be very useful.
The /TIMING compile option causes the compiler to produce code to plant code to do a call to the timing routine when a routine is called. TimingAnalysis stores the CPU timer and begins the work of collecting information. Any overhead introduced by using TimingAnalysis is removed when the reports are produced. The work of collecting and saving the timing information can take significant time and can double the run time of a program which does a lot of calls. This time spent is never counted in that of the timed program's subroutines and does not need to be compensated for.
After the subroutine entry information is stored, TimingAnalysis again stores the CPU timer and returns to the called subroutine. When TimingAnalysis is called at return time, it performs the subroutine exit measurements similar to the way it did at routine entry and then returns to the caller of the subroutine just timed. When the program reaches termination TimingAnlaysis produces reports and the program exits.
Understanding TimingAnalysis Reports
The timing analysis report is produced twice, once sorted on exclusive CPU secand again sorted on inclusive CPU sec. The second part of the report continues for one page or until the inclusive Cpu% drops below 4.0%. The report header indicates the sort key used for the report.There are two groups of information on the page, the exclusive information which applies only the routine itself and the inclusive information which applies to the routine and everything it calls and below. The following decribes the columns in the .TMR file produced by TimingAnalysis.
Exclusive Statistics
Called - This is the number of times this routine is
called by other routines.
Calls - This is the number of calls made to other routines directly by
this routine.
Page Flts - This is the number of times a page had to be brought in
from disk during execution of this routine.
P/F% - The percentage of page faults encountered in this routine.
Cpu sec -This is the total CPU
time in seconds spent in this routine.
Avg CPU -This is the total CPU
time divided by the number of calls to this routine. This number is scaled to allow good significance to be displayed in a
small space. It is followed by an indicator of its units which can be
s for seconds, m for
milliseconds, or u for microseconds.
Cpu% -This is the percent of
the CPU time spent in this routine.
CumCpu% -This is a running sum
of the column to its left and provides an easy way to determine how many
routines comprise what percent of the total CPU time.
Inclusive Statistics
Called - This is the number of calls to
this routine and all calls made by routines called by this routine. For
the main program, this is the total of all calls made during the program
execution.
Calls - This is the total of the
calls made by this routine and the routines called by this routine. It
is equal to inclusive Called minus exclusive Called and was included on the
report mainly for symmetry.
Cpu sec - This is the total CPU
time spent in this routine and all routines it calls. This
is in effect viewing this routine as the start of a subsystem of the whole
program.
Cpu% - This is the percentage of
the CPU time that this routine and all called below it comprise.
Spreadsheet Output
When the /TMO option is given in the TIMINGOPTS environment variable or when the timing option is used within Visual Studio .NET, a file suitable for importing to a spread sheet or other utility is produced. This file has the same name as the timing report file, with the .TMO suffix. If the /DLM option is given a delimiter character can be given which can be: "comma", "tab", or any punctuation character. This character will then be used as a field separator. The .TMO file contains the basic data present in the timing report. The first line contains names of the quantities. The remainder of the file contains one line per called routine with that routine's data.
Using TimingAnalysis Reports
There are many ways to use TimingAnalysis reports to improve a program's performance. First of all, look at how many times each routine was called. Do these coincide with your understanding of the program? Could any be called in a loop they should not be in? When a routine is called many times but has a very short average execution time, there may be much time wasted in the call alone. It may be a good idea to find a way to perform the function in-line or compute a group of results in advance and simply access it where the call was.
The paging information can be very useful as well. In routines with high paging, look for multi-dimensional arrays which are accessed primarily with the right-most subscripts varying fastest.
The CPU time is the most useful statistic for improving program performance. The TimingAnalysis report is arranged so that you can easily determine a set of routines which comprise any particular percent of the whole run time. These are the routines which should be studied for possible improvement.
It is important to note what the time assigned to a routine includes. All routines which are compiled with the /TIMING option have their time assigned to them. Any routines not compiled with this option will have their time assigned to the routine which called them. All FORTRAN library routines such as SIN, ATAN, LOG, etc. have their CPU time included in the routines which call them. FORTRAN I/O statements can use a lot of CPU time and this is also assigned to the routines containing the I/O statements. The CPU time used to call another timed routine and transfer arguments is assigned to the calling program. The CPU time used in the return from the called program is assigned to the called program.
Restrictions
The maximum call depth within TimingAnalysis is 2048.
Recursion (a routine calling itself), a Fortran extension, will cause the inclusive statistics to be incorrect for the recursively called routines.
Alternate subroutine returns are not supported and will lead to incorrect timing results.