The implementation of Non-linear (Preconditioned) Conjugate Gradient, LBFGS, online learning, and adaptive online learning works on clusters (both Hadoop and otherwise) now.

To build the code, run make.

At a high level, the code operates by repeatedly executing something equivalent to the MPI AllReduce function---adding up floats from all nodes and then broadcasting the sums back to each individual node. In order to do this, a spanning tree over the nodes must be created. This is done using the helper daemon 'allreduce_master'.

***********************************************************************

To run the code on Hadoop clusters:

Connect to the master node for the Hadoop cluster and start the daemon:

    ./allreduce_master <#mappers> <#trees>

where <#mappers> is the number of mappers you are going to launch and <#trees> is the number of spanning trees it should set up.

Start the map-reduce job using Hadoop streaming:

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.job.map.memory.mb=2500 -input <input> -output <output> -file vw -file runvw.sh -mapper 'runvw.sh <output> <gateway>' -reducer NONE

where <output> is the directory on HDFS where you want the trained model to be saved and <gateway> is the hostname of the gateway where allreduce_master runs. The trained model is saved to the file <output>/model on HDFS and can be retrieved with hadoop fs -get.

To modify the arguments to VW, edit the script runvw.sh. Arguments to hadoop can be added directly in the hadoop streaming command. See 'mapscript.sh', which uses 'runvw.sh', for an advanced example of running VW in a Hadoop environment.

***********************************************************************

To run the code on non-Hadoop clusters:

Start the master job on one of the cluster nodes:

    ./allreduce_master <#workers>

where <#workers> is the number of workers you are going to launch.

Launch vw on each of the worker nodes:

    ./vw --cache_file temp.cache -b 24 --conjugate_gradient --passes 4 -q up -q ua -q us -q pa -q ps -q as --loss_function=logistic --regularization=1 --master_location <master> -d <data> -f model

where <master> is the hostname of the master node and <data> is a different input file on each of the workers. The trained model is saved as the file model on the worker nodes.

************************************************************************

The files you need to know about:

runvw.sh: This is the mapper code. It takes as arguments:
    1. The output directory. The trained model from the first mapper is stored as the file "model" in the output directory.
    2. The hostname of the cluster gateway, so that the mappers can connect to the gateway.
All the other standard VW options are currently hardcoded in the script; feel free to mess around with them.

#########################################################################

allreduce_master.cc: This is the master code which runs on the gateway. You start it before the call to hadoop. The master takes as input the number of mappers it should expect to hear from, so you need to know this ahead of time based on the number of input files you have. The master backgrounds itself after starting and listens for incoming connections. The role of the master is to set up the topology on the mappers and then let them communicate amongst themselves.

#########################################################################

allreduce.h: This is the header file for the workers. It provides a struct socks that contains a socket to the parent and sockets for up to two children. It also provides the following function:

void all_reduce(char* buffer, int n, string master_location): Performs AllReduce on the entries of buffer, which is n bytes long, with the parent and children specified in the socks object. It is implicitly assumed that the buffer contains floats, so that addition is defined on chunks of 4 bytes. The header also defines a constant const int buf_size, which is the buffer size (in bytes) used for communication.
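For a concrete picture of how a worker might call this, here is a minimal sketch based only on the signature above; the function name sum_floats, the variable names, and the master hostname are illustrative placeholders, not code from this repository:

    #include <string>
    #include <vector>
    #include "allreduce.h"

    // Sum a local float vector across all nodes. After the call every node
    // holds the same element-wise sums: values are added up the spanning
    // tree and then broadcast back down. 'values' must be non-empty.
    void sum_floats(std::vector<float>& values, const std::string& master_location)
    {
      all_reduce((char*)&values[0], (int)(values.size() * sizeof(float)), master_location);
    }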
#########################################################################

allreduce.cc: This is the code for the workers. It implements the routines described above. all_reduce is implemented as a combination of reduce and broadcast routines. reduce reads data from the children, adds it to the local data, and passes it up to the parent with a call to pass_up. broadcast receives data from the parent and passes it down to the children with a call to pass_down.

#########################################################################

cg.cc, gd.cc, lbfgs.cc: The learning algorithms, which use all_reduce whenever communication is needed. They use the routines accumulate and accumulate_scalar to reduce vectors and scalars respectively.
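As an illustration of that pattern (not the actual implementation in those files), vector and scalar reductions built on all_reduce could look roughly like the following; the wrapper names and parameters are hypothetical:

    #include <string>
    #include "allreduce.h"

    // Hypothetical sketch: sum a weight vector across all nodes, in place.
    void accumulate_sketch(float* weights, size_t length, const std::string& master_location)
    {
      all_reduce((char*)weights, (int)(length * sizeof(float)), master_location);
    }

    // Hypothetical sketch: sum a single scalar (e.g. a local loss) across all nodes.
    float accumulate_scalar_sketch(float local_value, const std::string& master_location)
    {
      all_reduce((char*)&local_value, (int)sizeof(float), master_location);
      return local_value;
    }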