| Rogue Wave Standard C++ Library 1.2.1 and Tools.h++ 7.0.6 |
|---|
For both 32-bit and 64-bit libraries:
|
// create a mutex and initialize it
pthread_mutex_t the_mutex;
#ifdef _PTHREADS_DRAFT4
// for user threads
pthread_mutex_init(&the_mutex, pthread_mutexattr_default);
#else
// for kernel threads
pthread_mutex_init(&the_mutex, (pthread_mutexattr_t *)NULL);
#endif
pthread_mutex_lock(&the_mutex);
cout << "something" ... ;
pthread_mutex_unlock(&the_mutex);
Note that conditional compilation may be necessary to accommodate both the user threads and the kernel threads interfaces, as in the above example. An alternative might be to compose a buffer with an ostrstream and output with one write. The following example could be used with the cfront compatible libstream.
ostrstream ostr;
ostr << "something" /*...*/ ;
ostr << " or another" /*...*/ << endl;
cout.write(ostr.str(), ostr.pcount());
ostr.rdbuf()->freeze(0);
Note that the above example also works with the new library, although ostrstream is deprecated there.
Or something similar can be done with the Rogue Wave Standard C++ Library 2.x (libstd_v2) with standard ostringstream, as in the following example:
ostringstream ostr;
ostr << "something" /*...*/ ;
ostr << " or another" /*...*/ << endl;
cout.write(ostr.str().c_str(), ostr.str().length());
Note that cout.flush() may be needed if sharing the file with stdio.
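As a fuller illustration, the following sketch combines the two techniques described above: each thread composes its line in a private ostringstream and then emits it with a single write while holding a mutex. It assumes a modern standard library (<sstream> rather than the cfront compatible headers) and POSIX threads. The names g_out_mutex, WriterArg, writer, and run_writers are hypothetical and are not part of the HP aC++ libraries.

```cpp
#include <pthread.h>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical names for illustration only.
static pthread_mutex_t g_out_mutex = PTHREAD_MUTEX_INITIALIZER;

struct WriterArg {
    std::ostream *out;  // stream shared by all threads
    int id;             // identifies the writing thread
};

// Thread entry point: build the whole line privately, then emit it with one
// write while the mutex is held, so lines from different threads cannot mix.
extern "C" void *writer(void *p) {
    WriterArg *arg = static_cast<WriterArg *>(p);
    std::ostringstream line;
    line << "thread " << arg->id << " says something" << '\n';
    const std::string text = line.str();
    pthread_mutex_lock(&g_out_mutex);
    arg->out->write(text.data(), static_cast<std::streamsize>(text.size()));
    pthread_mutex_unlock(&g_out_mutex);
    return 0;
}

// Starts n writers on one shared stream, waits for all of them, and returns
// everything that was written.
std::string run_writers(int n) {
    assert(n > 0);
    std::ostringstream shared;
    std::vector<pthread_t> tids(n);
    std::vector<WriterArg> args(n);
    for (int i = 0; i < n; ++i) {
        args[i].out = &shared;
        args[i].id = i;
        int rc = pthread_create(&tids[i], 0, writer, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < n; ++i)
        pthread_join(tids[i], 0);
    return shared.str();
}
```

Calling run_writers(4) returns four complete lines. Their order is unspecified, but no line is split by output from another thread.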
| Rogue Wave Standard C++ Library 1.2.1 and Tools.h++ 7.0.6 |
|---|
For both 32-bit and 64-bit libraries:
|
WARNING: If you do not specify these options as described in the above table, a run-time error will be generated or multi-thread behavior will be incorrect.
void f(ostream &out, int x, int y) {
out << setw(3) << x << setw(10) << y;
}
This function would not be thread safe if called from multiple
threads with the same object, since the "width" in the shared object
could be changed at any time. Therefore, such objects are not
protected from interactions between multiple threads, and the
result of sharing such an object between threads is undefined.
If the same object is shared between threads, a runtime crash, abort, or intermingled output may occur. With the Rogue Wave Standard C++ Library 2.x, output may be intermingled but no aborts will occur.
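One way to keep the width setting out of the shared object is to do the formatting in a stream that only the calling thread can see. The following is a minimal sketch under that assumption. format_pair is a hypothetical helper, not part of any library described here, and it uses the standard <sstream> and <iomanip> headers.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats x and y the same way f() does, but into a
// local stream, so the width state is never visible to another thread.
std::string format_pair(int x, int y) {
    std::ostringstream local;  // private to this call
    local << std::setw(3) << x << std::setw(10) << y;
    return local.str();        // can now be written to a shared stream in one piece
}
```

The returned string still has to be written to the shared stream under a lock, or with a single write, if the output of different threads must not be intermingled.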
-D_THREAD_SAFE with the cfront Compatible libstream
For the objects cout, cin, cerr, and clog,
you can specify the -D_THREAD_SAFE compile time flag for
any file that includes <iostream.h>.
In this case, a new instance of the object is transparently created for
each thread that uses it. All instances share the same file descriptor.
The f() function in the above example will now work, because it receives
one new "out" object per thread. However, the results of two
simultaneous executions of f() will be mixed in any order in the output.
Using -D_THREAD_SAFE with the global scope operator is not supported for
cout, cin, cerr, and clog.
For example, the following code would generate an error:
::cout << endl;
Note, if you use locks, you need not use the -D_THREAD_SAFE compile time flag since you are now responsible for ensuring thread safety.
Visible differences between standard iostreams and the cfront compatible libstream would be as follows. In the case of standard iostreams, there is intermingling of each component being inserted. With cfront compatible iostreams, there is intermingling of complete buffers (depending on when endl or flush is called).
-D__HPACC_THREAD_SAFE_RB_TREE
The Rogue Wave Standard C++ Library 1.2.1 and Tools.h++ 7.0.6 are not thread safe when the rb_tree class is involved.
In other words, if the tree header file
(which includes tree.cc) under /opt/aCC/include/ is used,
these libraries are not thread safe.
Most likely, it is indirectly referenced by including the standard C++ library
container class map or set headers,
or by including a RogueWave tools.h++ header like tvset.h,
tpmset.h, tvmset.h, tpmap.h, tpmmap.h, tvmap.h, or tvmmap.h.
Since changing the rb_tree implementation to
make it thread safe would break binary compatibility,
the preprocessing macro, __HPACC_THREAD_SAFE_RB_TREE, must be defined.
Whether or not this macro is defined
when compiling a file that includes the tree header, its use must
be consistent. For example, a new object file compiled with the macro
defined should not be linked with older ones that were compiled without
the macro defined. Library providers whose library is built with the
the macro defined may need to notify their users to also compile their source
with the macro defined when the tree header is included.
The following example illustrates that you cannot catch an object which has been thrown in a different thread. To do so will result in a runtime abort since HP aC++ finds no available catch handler and terminate is called.
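#include <pthread.h>
void foo() {
    int i = 10;
    throw i;  // thrown in the newly created thread
}
int main() {
    pthread_t tid;
    int ret;
    try {
        // foo() runs, and throws, in the new thread, not in this one
        ret = pthread_create(&tid, 0, (void*(*)(void*))foo, 0);
    }
    catch(int n) {}  // never reached: this handler belongs to the main thread
}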
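By contrast, an exception can be handled if the try block and the throw are in the same thread. The following is a minimal sketch of that arrangement, assuming POSIX threads. The names ThrowResult, foo_checked, thread_main, and run_and_collect are hypothetical and are used only for illustration.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical record used to hand the outcome back to the creating thread.
struct ThrowResult {
    bool caught;  // true if the thread caught an exception
    int value;    // the int that was thrown
};

static void foo_checked() {
    int i = 10;
    throw i;
}

// Thread entry point: the handler lives in the same thread as the throw,
// so the exception is caught instead of reaching terminate.
extern "C" void *thread_main(void *p) {
    ThrowResult *r = static_cast<ThrowResult *>(p);
    try {
        foo_checked();
    } catch (int n) {
        r->caught = true;
        r->value = n;
    }
    return 0;
}

// Creates the thread, waits for it, and returns what it recorded.
ThrowResult run_and_collect() {
    ThrowResult r = { false, 0 };
    pthread_t tid;
    int rc = pthread_create(&tid, 0, thread_main, &r);
    assert(rc == 0);
    (void)rc;
    pthread_join(tid, 0);  // join before reading r
    return r;
}
```

The creating thread learns about the exception only through the ThrowResult record, not through its own catch handler.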
Choose from the following for introductory information. For in-depth information about code parallelization, refer to the Parallel Programming Guide for HP-UX Systems.
Here are some basic tasks to help you get started with parallelizing HP aC++ programs.
The +Oparallel option causes the compiler to transform eligible loops for parallel execution on multiprocessor machines.
The following command lines compile (without linking) three source files: x.c, y.c, and z.c. The files x.c and y.c are compiled for parallel execution. The file z.c is compiled for serial execution, even though its object file will be linked with x.o and y.o.
The following command line links the three object files, producing the executable file para_prog:
As this command line implies, if you link and compile separately, you must use aCC, not ld. The command line to link must also include the +Oparallel and +O3 options in order to link in the right startup files and runtime support.
Use the MP_NUMBER_OF_THREADS environment variable to set
the number of processors that are to execute your program
in parallel. If you do not set
this variable, it defaults to the number of
processors on the executing machine.
From the C shell, the following command sets MP_NUMBER_OF_THREADS
to indicate that programs compiled for parallel execution
can execute on two processors:
If you use the Korn shell, the command is:
Use the MP_IDLE_THREADS_WAIT environment variable to determine how threads wait. Idle threads can be suspended or can spin-wait.
This variable takes an integer value n. For n less than 0, the threads spin-wait. For n equal to or greater than 0, the threads spin-wait for n milliseconds before being suspended.
By default, idle threads spin-wait briefly after creating a join. They then suspend themselves if they receive no work.
Pthreads (POSIX threads) refers to the Pthreads library of thread-management routines. For information on Pthread routines, see the pthread(3t) man page. To use the Pthread routines, your program must include the <pthread.h> header file, and the Pthreads library must be explicitly linked to your program. For example:
aCC -D_POSIX_C_SOURCE=199506L prog.c -lpthread -D_REENTRANT
The -D_POSIX_C_SOURCE=199506L string specifies the appropriate
POSIX revision level. In this case, the level is 199506L.
Profiling a program that has been compiled for parallel execution is performed in much the same way as it is for non-parallel programs:
The differences are:
routine_name##pr_line_0123
where routine_name is the name of the routine containing the loop, pr (parallel region) indicates that the loop was parallelized, and 0123 is the line number of the beginning of the loop or loops that are parallelized.
To ensure the best performance from a parallel program, do not run more than one parallel program on a multiprocessor machine at the same time. Running two or more parallel programs simultaneously, or running one parallel program on a heavily loaded system, slows performance.
You should run a parallel-executing program at a higher priority than any other user program; see rtprio(1) for information about setting real-time priorities.
The following sections describe conditions that can inhibit parallelization.
The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following:
If the compiler cannot predict what the runtime loop iteration count is before the loop executes, it does not parallelize the loop. The reason for this precaution is that the runtime code must know the iteration count in order to know how many iterations to distribute to the different processors for execution.
The following conditions can prevent a runtime count:
When a loop is parallelized, the iterations are executed independently on different processors, and the order of execution differs from the serial order that occurs on a single processor. This effect of parallelization is not a problem if the iterations can be executed in any order with no effect on the results. Consider the following loop:
In this example, the array a would always end up with the same data regardless of whether the order of execution were 0-1-2-3-4, 4-3-2-1-0, 3-1-4-0-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallelization.
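The independence property can be checked with a small sketch that runs the iterations in a caller-supplied order. The loop body a[i] = a[i] * b[i], the array contents, and the function name scale_in_order are assumptions chosen for illustration. Only the iteration orders are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: executes the five iterations in the given order.
// Each iteration reads and writes only element i, so the order cannot matter.
std::vector<int> scale_in_order(const std::vector<int> &order) {
    std::vector<int> a = {1, 2, 3, 4, 5};
    const std::vector<int> b = {10, 20, 30, 40, 50};
    for (std::size_t n = 0; n < order.size(); ++n) {
        const int i = order[n];
        assert(i >= 0 && i < 5);
        a[i] = a[i] * b[i];  // touches element i only
    }
    return a;
}
```

Calling scale_in_order with {0, 1, 2, 3, 4}, {4, 3, 2, 1, 0}, or {3, 1, 4, 0, 2} returns the same vector each time.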
Such is not the case in the following:
In this loop, the order of execution does matter. The data used in iteration i is dependent upon the data that was produced in the previous iteration [i-1]. a would end up with very different data if the order of execution were any other than 1-2-3-4. The data dependence in this loop thus makes it ineligible for parallelization.
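A loop-carried dependence of this kind can be demonstrated with a similar sketch. The body a[i] = a[i-1] + b[i], the initial data, and the name prefix_in_order are assumptions made for illustration. The text specifies only that iteration i uses the value produced by iteration i-1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: iteration i reads a[i-1], which iteration i-1 writes,
// so the result depends on the order in which iterations run.
std::vector<int> prefix_in_order(const std::vector<int> &order) {
    std::vector<int> a = {1, 0, 0, 0, 0};
    const std::vector<int> b = {0, 1, 1, 1, 1};
    for (std::size_t n = 0; n < order.size(); ++n) {
        const int i = order[n];
        assert(i >= 1 && i < 5);
        a[i] = a[i - 1] + b[i];  // depends on the previous element
    }
    return a;
}
```

With the serial order {1, 2, 3, 4} the function returns {1, 2, 3, 4, 5}. With the reversed order {4, 3, 2, 1} it returns {1, 2, 1, 1, 1}, because the later elements are computed before the values they depend on.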
Not all data dependences must inhibit parallelization. The following paragraphs discuss some of the exceptions.
Nested Loops and Matrices
Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:
The data dependence in this nested loop occurs in the inner [j] loop: Each row access of a[i][j] depends upon the preceding row [j-1] having been assigned in the previous iteration. If the iterations of the [j] loop were to execute in any other order than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.
But no such data dependence appears in the outer loop: Each column access is independent of every other column access. Consequently, the compiler can safely distribute entire columns of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the columns are executed, so long as each executes in serial order.
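The nested-loop case can be sketched as follows. The body a[i][j] = a[i][j-1] + 1, the matrix dimensions, and the name fill_matrix are assumptions made for illustration. The sketch varies only the order of the outer [i] iterations and always runs the inner [j] loop serially, which is the distribution the text describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<int> > Matrix;

// Hypothetical example: the outer iterations run in the given order, which
// stands in for distributing them across processors. The inner loop carries a
// dependence on a[i][j-1] and therefore always runs in serial order.
Matrix fill_matrix(const std::vector<int> &outer_order) {
    const int n = 3, m = 4;
    Matrix a(n, std::vector<int>(m, 0));
    for (int i = 0; i < n; ++i)
        a[i][0] = 10 * i;  // seed the first element of each outer slice
    for (std::size_t k = 0; k < outer_order.size(); ++k) {
        const int i = outer_order[k];
        assert(i >= 0 && i < n);
        for (int j = 1; j < m; ++j)
            a[i][j] = a[i][j - 1] + 1;  // depends on the previous j only
    }
    return a;
}
```

fill_matrix({0, 1, 2}) and fill_matrix({2, 0, 1}) return identical matrices, since no outer iteration reads data written by another.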
When analyzing a loop, the compiler errs on the safe side: it assumes that what looks like a data dependence really is one, and so it does not parallelize the loop. Consider the following:
The compiler assumes that a data dependence exists in this loop because it appears that data that has been defined in a previous iteration is being used in a later iteration. However, if the value of k is 100, the dependence is assumed rather than real because a[i-k] is defined outside the loop.
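The difference between an assumed and a real dependence can be sketched as follows. The loop bounds (i from 100 to 199), the array size, and the name shift_copy are assumptions made for illustration. The text specifies only the reference a[i-k] and the value 100 for k.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: runs a[i] = a[i - k] for i = 100..199, either in
// ascending order or in descending order, and returns the resulting array.
std::vector<int> shift_copy(int k, bool descending) {
    assert(k >= 1 && k <= 100);
    std::vector<int> a(200);
    for (int i = 0; i < 200; ++i)
        a[i] = i;  // initial data defined before the loop
    for (int n = 0; n < 100; ++n) {
        const int i = descending ? 199 - n : 100 + n;
        a[i] = a[i - k];  // with k == 100 this reads only a[0..99]
    }
    return a;
}
```

With k equal to 100, every value read lies in a[0..99], which the loop never writes, so both orders give the same result. With k equal to 1, each iteration reads an element that another iteration writes, and the two orders give different results.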
In order to use memory classes in C++ programs, you must
include the header file /usr/include/spp_prog_model.h. Memory
classes are described in the
Parallel Programming Guide for HP-UX Systems.
In C++, the general form for assigning memory is:
#include <spp_prog_model.h>
. . .
[storage_class_specifier] memory_class_name type_specifier namelist
where:
memory_class_name is thread_private or node_private
type_specifier is a data type (for example, int or float)
Memory classes can be assigned only to data objects that have static
storage duration. If the object is declared within a function,
it must have the storage class extern or static. Data
objects declared at file scope and assigned a memory
class need not specify a storage class.
A hypernode is a set of processors and physical memory organized as a symmetric multiprocessor (SMP) running a single image of the operating system microkernel.
The node_private storage class specifier causes the variables and arrays specified in namelist to be replicated in the physical memory of each hypernode on which the process is executing. While each data object has a single image in virtual memory, it maps to a different physical location on each hypernode. The threads of a process within a hypernode all share access to the copy on their hypernode and cannot access the copies on other hypernodes.
The thread_private storage class specifier causes the variables and arrays specified in namelist to be treated as being thread_private. These data objects map to unique node_private addresses for each thread of a process. Refer to the Parallel Programming Guide for HP-UX Systems for more information.
HP aC++ provides functions that can be used with pragmas to achieve synchronization.
Gates allow you to restrict execution of a block of code to a single thread. They can be allocated, locked, unlocked or deallocated. Or, they can be used with the ordered or critical section pragmas, which automate the locking and unlocking functions.
Barriers block further execution until all executing threads reach the barrier.
You declare gates and barriers by using the following type definitions:
gate_t namelist
barrier_t namelist
Gates and barriers should only appear in definition and declaration statements, and as formal and actual arguments.
These functions allocate memory for a gate or barrier. When memory is first allocated, gate variables are unlocked.
int alloc_gate(gate_t *gate_p);
int alloc_barrier(barrier_t *barrier_p);
gate_p and barrier_p are pointers of the indicated type, which have been previously declared as described above.
These functions free the memory assigned to the specified gate or barrier variable. These functions have the following declarations:
int free_gate(gate_t *gate_p);
int free_barrier(barrier_t *barrier_p);
where gate_p and barrier_p are pointers of the indicated type. Always free gates and barriers when you are done using them.
These functions acquire a gate for exclusive access.
If the gate cannot be immediately acquired, the calling
thread waits for it. The conditional locking functions,
which are prefixed with COND_ or cond_, acquire a gate
if doing so does not require a wait. If the gate is acquired,
the functions return 0; if not, they return -1.
The functions have the following declarations:
int lock_gate(gate_t *gate_p);
int cond_lock_gate(gate_t *gate_p);
where gate_p is a pointer of the indicated type.
This function releases a gate from exclusive access. Gates are typically released by the thread that locks them, unless a gate was locked by thread 0 in serial code. In that case, it might be unlocked by a single different thread in a parallel construct.
The function has the following declaration:
int unlock_gate(gate_t *gate_p);
where gate_p is a pointer of the indicated type.
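The gate life cycle described above (allocate, lock or conditionally lock, unlock, free) resembles that of a POSIX mutex. The following sketch mimics it with pthread_mutex_t so that the call sequence can be tried on a system without the HP gate functions. It is not the HP implementation. The sketch_ names are hypothetical, and every return convention other than the 0 and -1 results of the conditional lock is an assumption.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical stand-in for gate_t, built on a POSIX mutex.
struct sketch_gate {
    pthread_mutex_t m;
};

// Allocates the gate; like a newly allocated gate, the mutex starts unlocked.
int sketch_alloc_gate(sketch_gate *g) {
    return pthread_mutex_init(&g->m, 0) == 0 ? 0 : -1;
}

// Acquires the gate, waiting if another thread holds it.
int sketch_lock_gate(sketch_gate *g) {
    return pthread_mutex_lock(&g->m) == 0 ? 0 : -1;
}

// Acquires the gate only if no wait is needed: 0 if acquired, -1 if not.
int sketch_cond_lock_gate(sketch_gate *g) {
    return pthread_mutex_trylock(&g->m) == 0 ? 0 : -1;
}

// Releases the gate from exclusive access.
int sketch_unlock_gate(sketch_gate *g) {
    return pthread_mutex_unlock(&g->m) == 0 ? 0 : -1;
}

// Frees the gate once it is no longer needed.
int sketch_free_gate(sketch_gate *g) {
    return pthread_mutex_destroy(&g->m) == 0 ? 0 : -1;
}
```

A typical sequence is sketch_alloc_gate, then sketch_lock_gate or sketch_cond_lock_gate around the protected code, then sketch_unlock_gate, and finally sketch_free_gate.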
This function uses a barrier to cause the calling thread to wait until the specified number of threads call the function, at which point all threads are released from the function simultaneously. The function has the following declaration:
int wait_barrier(barrier_t *barrier_p, const int *nthr);
where barrier_p is a pointer of the indicated type and nthr is a pointer referencing the number of threads calling the routine.
You can use a barrier variable in multiple calls to the
wait_barrier() function, as long as you ensure that two
barriers are not active at the same time. Also, check
that nthr reflects the correct number of threads.