How to do the worm:

1) Install HDF5 with streaming capabilities on the site and locate the
   installation.

2) Compile Cactus (here single processor, without MPI):

   %> gmake worm HDF5=yes HDF5_DIR=

   Compile a Cactus executable with the following thorns in the
   ActiveThorns list:

   # arrangement/thorn            # implements (inherits) [friend] {shares}
   #
   CactusBase/Boundary            # boundary ( ) [ ] { }
   CactusBase/CartGrid3D          # grid ( ) [ ] { }
   CactusBase/IOASCII             # IOASCII ( ) [ ] { }
   CactusBase/IOBasic             # IOBasic (IO) [ ] { }
   CactusBase/IOUtil              # IO (Cactus) [ ] { }
   CactusBase/Time                # time ( ) [ ] { }
   CactusConnect/HTTPD            # httpd ( ) [ ] { }
   CactusConnect/HTTPDExtra       # http_utils (httpd) [ ] { }
   CactusPUGH/PUGH                # driver (Cactus) [ ] { }
   CactusPUGH/PUGHSlab            # Hyperslab ( ) [ ] { }
   CactusWave/IDScalarWaveC       # idscalarwave (wavetoy,grid) [ ] { }
   CactusWave/WaveBinarySource    # binarysource (wavetoy,grid,idscalarwave) [ ] { }
   CactusPUGHIO/IOHDF5
   CactusWave/WaveToyC            # wavetoy ( ) [ ] { }
   AlphaThorns/SimpleWorm         # SimpleWorm (httpd) [ ] { }
   #AlphaThorns/thorn_MDS
   BetaThorns/IOHDF5Util
   BetaThorns/IOStreamedHDF5
   BetaThorns/Socket
   external/TCPXX
   external/jpeg6b

   (no need for thorn_MDS at the moment)

3) Log into a machine (origin.aei.mpg.de) under your login (lanfer) and
   obtain a grid proxy for the sc2000 account:

   lanfer%> grid-proxy-init

   The server process needs this proxy to copy files from site to site.
   You must be able to run the following against every site the worm
   is supposed to visit:

   lanfer%> gsissh -p 2222 paramount.uni-paderborn.de 'pwd'
   lanfer%> gsissh -p 2222 origin.aei.mpg.de 'pwd'
   etc.

4) Start the server

   Use the parameter file ./SimpleWorm/par/swormserver.par
   The server needs to be started before the client.
   Set httpd::port = 7100

   lanfer%> cactus_worm swormserver.par

5) Start the client

   Copy ./SimpleWorm/par/EGRID_cpgsi.par to the Cactus directory:

   lanfer%> ./cactus_worm EGRID_cpgsi.par

   First, carefully check the parameter file for consistency. Here is a
   step-by-step explanation of the parameters as they are used in
   EGRID_cpgsi.par:

   # This Cactus will be run as a client
   SimpleWorm::master = "no"

   # The Cactus server (wserver.par) will be listening on this host at
   # this port. Make sure that the master has set httpd::port = 7100!
   SimpleWorm::server = "origin.aei.mpg.de:7100"

   #
   # Each client will perform 1500 iterations (the original
   # itlast is obsolete)
   SimpleWorm::nextiterations = 1500

   # The client will contact the server every 50 iterations to send
   # timing info; it will keep a stdout/stderr log
   SimpleWorm::contact_every = 50
   SimpleWorm::worm_log = "yes"

   # Total number of sites in this parfile
   SimpleWorm::worm_machines = 3

   # Profile for site 1: the sites must have different hostnames
   # (that's how the data is stored (currently)). You cannot have multiple
   # entries for origin.aei.mpg.de, for example
   SimpleWorm::worm_machine1 = "origin.aei.mpg.de"

   # username, working directory
   # they have to exist, otherwise the thing will fail, check this!
   SimpleWorm::worm_profile1 = "sc2000 /data/sc2000/AGerd/WORM"

   # The location of the executable on this client
   SimpleWorm::worm_exefile1 = "/data/sc2000/AGerd/Cactus/exe/cactus_worm"

   # Same for the other machine names...
   SimpleWorm::worm_machine2
   SimpleWorm::worm_machine3

   #
   # How to detect the next machine: currently you can cycle over the
   # number of machines (worm_machines needs to be set!). This seems to
   # be buggy, need to check.
   SimpleWorm::mf_method = "cycle"

   # How to transfer the checkpoint files: can be copy or stream
   SimpleWorm::wtransfer = "copy"

   # How to access the client machines: can be gsi or ssh
   # If you want to use ssh, see the general readme on how to set up
   # ssh access.
   SimpleWorm::waccess = "gsi"

   # Checkpointing: checkpoint on terminate.
   IO::verbose = "yes"
   IO::out3d_mode = "onefile"
   IO::checkpoint_on_terminate = "yes"
   IOHDF5::checkpoint = "yes"


General remarks on how the worm works:

The client knows everything, the server knows nothing; that's probably
the key statement. The client tells the server what to do.

When you start the client, the program flow for a copy-worm (a worm
that transfers itself by copy) is the following:

1) start the client
2) connect to the master (e.g. origin:7100)
3) during the evolution loop: connect to the master for timing info
4) increase the generation counter
5) checkpoint the data to a local file
6) find out about the next machine (SimpleMF.c)
7) prepare the new parameter file

   This parameter file is minimal: it only needs to allow the new client
   on the next host to restart from the checkpoint, and it sets a new
   itlast value. The name of this new parameter file has the iteration
   number appended: _iteration
   It looks like this:

   ActiveThorns = ""
   IO::recover = "auto"
   IO::recover_and_remove = "yes"
   cactus::cctk_itlast = 7500

8) The client tells the server to copy the parfile and the checkpoint
   file from the old client (c1) to the new client (c2).

   Specifically, the copy procedure

   gsiscp c1:foo.par_100 c2:foo.par_100

   is broken up into:

   1) gsiscp c1:foo.par_100 TMP
   2) gsiscp TMP c2:foo.par_100_tmp
   3) gsissh c2 'mv foo.par_100_tmp foo.par_100'

   Steps 1 and 2 are separate because a direct copy sometimes didn't
   work for me; the mv is needed in case both clients operate on the
   same filesystem.

9) The client (c1) tells the server to start a new client on c2.

If things don't work, look at the logfiles. The client writes out the
commands it gives to the server. Try to execute these commands from the
server side. Very often (for me) directory inconsistencies in the
parfile caused trouble.

good luck
gerd
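
For debugging, it can help to reproduce by hand what the client asks the server to do in steps 6 to 8 of the program flow. The following is a minimal Python sketch of those steps, not the SimpleWorm code itself. The function names (next_machine, restart_parfile, copy_commands) are hypothetical, the "cycle" method is assumed to mean a simple round-robin over the machine list, and the final mv is assumed to run on the destination host, where the _tmp file lives. The commands are only built as argument lists, never executed.

```python
from typing import List


def next_machine(machines: List[str], current: str) -> str:
    """Pick the next host, assuming "cycle" means round-robin (an assumption)."""
    i = machines.index(current)
    return machines[(i + 1) % len(machines)]


def restart_parfile(itlast: int) -> str:
    """Text of the minimal restart parameter file described in step 7."""
    lines = [
        'ActiveThorns = ""',  # thorn list left empty, as in the README
        'IO::recover = "auto"',
        'IO::recover_and_remove = "yes"',
        f"cactus::cctk_itlast = {itlast}",
    ]
    return "\n".join(lines) + "\n"


def copy_commands(src: str, dst: str, filename: str, tmp: str = "TMP") -> List[List[str]]:
    """The three-step copy from step 8, as argument lists (dry run only)."""
    return [
        # 1) pull the file from the old client to a local temporary file
        ["gsiscp", f"{src}:{filename}", tmp],
        # 2) push it to the new client under a temporary name
        ["gsiscp", tmp, f"{dst}:{filename}_tmp"],
        # 3) rename on the new client (assumed to be the host holding the _tmp file)
        ["gsissh", dst, f"mv {filename}_tmp {filename}"],
    ]


if __name__ == "__main__":
    # Hypothetical two-site setup using the hostnames from step 3.
    sites = ["origin.aei.mpg.de", "paramount.uni-paderborn.de"]
    here = sites[0]
    there = next_machine(sites, here)
    print(restart_parfile(7500), end="")
    for cmd in copy_commands(here, there, "foo.par_100"):
        print(" ".join(cmd))
```

Running the printed commands one by one from the server side, with a valid grid proxy, is one way to see which of the three copy steps fails.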