Parallella Problems – Initial Investigations

Monday, 14 Jul 2014 - 11:03 +0100 | Tags: Parallella, hardware, Linux, code, C

In my last two posts, Building a Parallella Mini-Cluster[1] and Bringing Up a Parallella Mini-Cluster – Trials and Tribulations[2], I described building a small cluster of four Parallella boards[3] into a case with power and cooling, and my initial steps in getting the cluster up and running. I finished at the point where I knew of the following problems:

  • One board has serious problems when using the Epiphany III chip[4], tending to lock up the whole system, or at least the network (as I connect over ssh, a locked-up or crashed network looks the same as a locked-up system).

  • All boards (including the one mentioned above, on the odd occasion it decides to function for a while when using the Epiphany chip) show a less serious problem: executions occasionally get stuck when repeatedly running the matmul-16[5] matrix multiplication example application for the Epiphany chip. These could be recovered from by terminating the stuck execution with Ctrl-C.

The only checks I had performed were a quick check that the power supply voltage to the Parallella boards was in range (it was 5.02V plus or minus 0.01V), and a check of the temperature of the Zynq FPGA chips, which turned out to run really hot, requiring heat sinks and fan cooling. They were all in the range of around 46°C to 55°C – a bit warm certainly, but not excessively so.

My next step was to get the matmul-16 application to time out rather than get stuck, reporting a timeout distinctly from a failure due to bad results from the Epiphany, and to get the run4ever.sh script used to repeatedly execute the matmul-16 program to display an indication of why an execution failed.

First I created a modified version of run4ever.sh that repeatedly ran the run-once run.sh script, which displays all the progress output from the matmul-16 host program. I then added additional trace output to the matmul_host.c host program C source to help me pinpoint where the sticking point was. Rebuilding and running the modified version of run4ever.sh revealed, as I had suspected from a read of the matmul_host.c source, that the program was getting stuck waiting for the Epiphany to signal task completion in the following loop in the matmul_go function:

/* Busy-wait, polling the go flag in shared DRAM until the Epiphany clears it */
while (Mailbox.core.go != 0)
    e_read(pDRAM, 0, 0, addr, &Mailbox.core.go, sizeof(Mailbox.core.go));
printf( "Done...\n");

So I added a countdown, initialised to a million, to the while-loop so that the loop would terminate when the count reached zero, followed by a check to see whether an error return is required:

unsigned int countdown = 1000000;

...

/* Give up once the countdown reaches zero rather than spinning forever */
while (Mailbox.core.go != 0 && --countdown)
    e_read(pDRAM, 0, 0, addr, &Mailbox.core.go, sizeof(Mailbox.core.go));
if (!countdown)
{
    printf( "### FAIL: TIMEOUT ###.\n");
    return M_ERR_TIMEOUT;
}
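
The million-iteration limit is arbitrary; it just needs to be comfortably longer than any healthy run takes. For comparison, a wall-clock version of the same guard might look like the following sketch – pDRAM, addr, Mailbox and M_ERR_TIMEOUT are as in the surrounding matmul_host.c code, and the five-second budget is a number I have plucked from the air:

#include <time.h>

struct timespec start, now;
clock_gettime(CLOCK_MONOTONIC, &start);   /* monotonic clock: immune to time-of-day changes */
do
{
    e_read(pDRAM, 0, 0, addr, &Mailbox.core.go, sizeof(Mailbox.core.go));
    clock_gettime(CLOCK_MONOTONIC, &now);
} while (Mailbox.core.go != 0 && now.tv_sec - start.tv_sec < 5);

if (Mailbox.core.go != 0)
    return M_ERR_TIMEOUT;                 /* still not done after ~5 seconds */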

The code at the matmul_go call site was modified to check for an error return from matmul_go, print an error message to stdout and exit – and yes, it uses a goto to reach the requisite clean-up point in main; that’s C for you:

if (matmul_go(pDRAM) != 0)
{
    printf( "\n\nERROR: Time-out waiting for Epiphany to complete calculations!!\n");
    printf( "*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***\n");
    printf( "BAD: CHIP FAILED!!!\n");
    retval = 2;
    goto Finish;
}
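
For anyone unfamiliar with the goto-based clean-up idiom, a minimal self-contained illustration follows; the resources here are invented for the example and have nothing to do with matmul_host.c:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int retval = 0;
    char *buffer = malloc(1024);            /* stand-in resources */
    FILE *log = fopen("example.log", "w");

    if (buffer == NULL || log == NULL)
    {
        retval = 1;
        goto Finish;                        /* every error path funnels through one exit */
    }

    /* ... the real work would go here ... */

Finish:
    if (log != NULL)
        fclose(log);
    free(buffer);                           /* free(NULL) is safe */
    return retval;
}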

I then modified the original run4ever.sh and test.sh scripts to:

  • Pass the execution iteration number in the COUNTER variable of run4ever.sh to test.sh, together with the name of an all-executions-wide error log file.

  • Write failure information for each failing execution to the all-executions-wide log file, including:

    • The execution iteration number.

    • Lines containing “ERROR” extracted from the per-execution log file (currently each error only outputs one such line).

    • The full contents of the per-execution log file (i.e. all output from the failing execution).

This setup, combined with my clash Python script (see the brief description in my previous post[2]) plus watch, grep and sort, allowed me to execute matmul-16 repeatedly on all 4 nodes with a summary of failing executions displayed for each node – at least until node 3 ‘hard’ failed (i.e. required a reset or power cycle to clear), whereupon I could either wait for the timeout to occur, after which node 3 would always report as unreachable, or restart watching for failures from just nodes 0, 1 and 2. A similar use of clash, watch and sort with the ztemp.sh script[6] (which I named zynqtemp locally) allowed me to keep an eye on all nodes’ temperatures (at least until node 3 ‘hard’ failed!).
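
As an aside, those temperature readings come from the Zynq’s on-chip XADC. I have not dug into exactly what ztemp.sh reads, but under the mainline Zynq XADC IIO driver a minimal C equivalent might look like the sketch below – the sysfs paths and the conversion formula are assumptions based on that driver’s conventions, not taken from the script:

#include <stdio.h>

/* Read a single numeric value from a sysfs attribute file */
static double read_attr(const char *path)
{
    double value = 0.0;
    FILE *f = fopen(path, "r");
    if (f)
    {
        fscanf(f, "%lf", &value);
        fclose(f);
    }
    return value;
}

int main(void)
{
    /* Assumed XADC locations under the IIO sysfs interface */
    double raw    = read_attr("/sys/bus/iio/devices/iio:device0/in_temp0_raw");
    double offset = read_attr("/sys/bus/iio/devices/iio:device0/in_temp0_offset");
    double scale  = read_attr("/sys/bus/iio/devices/iio:device0/in_temp0_scale");

    /* IIO convention: millidegrees C = (raw + offset) * scale */
    printf("Zynq die temperature: %.1f C\n", (raw + offset) * scale / 1000.0);
    return 0;
}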

The conclusions I reached were that:

  • The temperature of the Zynq chips does not increase significantly when running matmul-16 repeatedly.

  • All failures I received were timeout failures.

  • Such failures seem to occur at random intervals: after tens, a few hundred, or many hundreds of iterations.

  • Occasionally node 3 would play ball for a while and not ‘hard’ fail immediately, running with the occasional timeout before eventually ‘hard’ failing.

  • No obvious factor dictated whether the intervals between timeouts would be long or short.

My next line of enquiry, having found my way around the host part of the matmul-16 example, was to look at the source code of the Epiphany side in matmul_main.c. A careful reading led me to spot what seemed like a few oddities, but nothing that would account for the observed failing behaviour:

  • The shared data structure seemed originally to use two flags for host-Epiphany signalling:

    • ready, for the Epiphany to signal that it is ready to go or has results ready to read

    • go, for the host to signal the Epiphany to start work

  • It seemed that in the end the go flag predominated, used both by the Epiphany to signal readiness to the host and by the host to signal the Epiphany to start work.

  • The shared go flag seemed not to be set by the host initially.

  • The Epiphany-side data_copy function played safe with a redundant initial call to e_dma_wait.

  • The bigmatmul function contained an optimisation that would never be used, due to what amounts to an off-by-one mistake: less than or equal was used where just less than was needed.

To check that the missing go flag initialisation was not causing a problem, I reworked the matmul_host.c matmul_go function’s synchronisation with the Epiphany to use the ready flag. This was OK as the Epiphany-side code consistently sets the ready flag as well as the go flag – although the values are inverted. I also added checks that the ready and go flag states were consistent with each other. Next I removed the redundant call to e_dma_wait in the matmul_main.c data_copy function. Finally, I corrected the optimisation checks to use less than rather than less than or equal.
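
For illustration, the reworked wait ended up shaped roughly like this – a from-memory sketch rather than a verbatim extract; addr_ready and addr_go stand for the shared-DRAM offsets of the respective Mailbox fields, and the warning message is my own:

unsigned int countdown = 1000000;

/* Poll ready rather than go: the Epiphany sets ready to 1 (and go to 0)
   when it finishes, so ready works whether or not the host initialised go */
while (Mailbox.core.ready == 0 && --countdown)
{
    e_read(pDRAM, 0, 0, addr_ready, &Mailbox.core.ready, sizeof(Mailbox.core.ready));
    e_read(pDRAM, 0, 0, addr_go, &Mailbox.core.go, sizeof(Mailbox.core.go));
}
if (!countdown)
    return M_ERR_TIMEOUT;

/* The two flags carry inverted values, so matching values mean the
   shared state is inconsistent */
if (Mailbox.core.ready == Mailbox.core.go)
    printf( "### WARN: ready/go flags inconsistent ###\n");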

The results after each change were the same as before: mostly passes, with occasional timeout failures.

Having now written up my Parallella experiences so far, it seems I shall have to get less ad hoc about tracking down the problems I am having.