Bringing Up a Parallella Mini-Cluster – Trials and Tribulations

Saturday, 12 Jul 2014 - 16:35 +0100 | Tags: Parallella, hardware, Linux

My previous post Building a Parallella Mini-Cluster[1] dealt with physically building the cluster and ended at the point it could be switched on and off. The next stage is bringing the cluster up to a stable and usable state, and I am sorry to report that it has not been a trouble-free ride; tracking down the problems is ongoing as of mid-July 2014.

One thing I should mention is that before turning on the power for the first time I installed the provided heat sinks on the Zynq FPGA chips.

The next thing to do was to create SD card images for each of the 4 Parallella boards. Four micro SD cards were included in the mini-cluster reward kit, each in a standard-sized SD card adapter, so they were ready to be used. Following the instructions on the Parallella web site[2] I downloaded the then-current Ubuntu Linux for Parallella archive[3] as well as kernel archives for both the HDMI and headless configurations on Parallella-16s with the Zynq 7020[4][5].

I deviated from the Windows installation instructions, extracting each kernel file set into its own directory, both of which were copied to the boot partition; the HDMI configuration files were also copied into the boot partition root directory as the initial active configuration. Having both sets of boot files to hand on each card would make switching between them convenient.
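
For reference, the copying went something like the sketch below, with the card’s FAT boot partition mounted at /mnt/boot; the mount point and the extracted directory names are placeholders for illustration rather than the exact names from the downloaded archives.

# Copy both kernel file sets onto the boot partition (placeholder names).
mkdir -p /mnt/boot/hdmi /mnt/boot/headless
cp hdmi-boot-files/*     /mnt/boot/hdmi/
cp headless-boot-files/* /mnt/boot/headless/

# Make the HDMI configuration the initially active one.
cp /mnt/boot/hdmi/* /mnt/boot/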

Using one micro SD card I checked that each Parallella board booted. In this initial case I connected a display via the micro HDMI socket to each board in turn. I did not have the correct cables/adapters to connect a USB keyboard to the Parallella boards, so I just made sure the desktop login screen came up over HDMI.

The next order of business was to sort out the network: hostnames, gateway, DNS, IP addresses. I run a local DNS server and find static IP addresses easier to deal with. Choosing a contiguous group of IP addresses on my local network, I set up the various items for each card/board configuration – with the basic setup done with each micro SD card mounted on another Linux box.
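
On the Ubuntu image of that era the wired interface is configured via /etc/network/interfaces, so with each card’s root partition mounted the edit was along these lines; the addresses below are placeholders rather than my real ones, and the interface name is assumed to be eth0.

# <sd-card-root>/etc/network/interfaces: one address per board
auto eth0
iface eth0 inet static
    address 192.168.1.100          # e.g. .100 to .103 for paraclust0 to paraclust3
    netmask 255.255.255.0
    gateway 192.168.1.1
    dns-nameservers 192.168.1.1

Each card also gets its hostname set in /etc/hostname, with a matching entry in /etc/hosts.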

Now that I could log in via ssh to each board, I installed a Python script called clash (sort of for CLuster SHell) that I had written for a cluster of 5 Raspberry Pis a year or so ago. It broadcasts single commands via ssh to any or all targets in the configured cluster, then collects and displays the returned lines of text from each targeted node, prefixed with the node’s number. I nominated the board running as node 0 (called paraclust0) as the main command node, then used ssh-keygen and ssh-copy-id to set up passwordless login via ssh from node 0. Using clash makes tasks such as shutting down all boards, or package updates and upgrades for all nodes, much more convenient.
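
The passwordless login setup was the standard ssh key recipe run as the linaro user on node 0; the paraclust0 to paraclust3 hostnames here are assumed from my naming of node 0.

# On paraclust0, as the linaro user: generate a key once, then push it to every node.
ssh-keygen -t rsa
for n in 0 1 2 3; do
    ssh-copy-id linaro@paraclust$n
done

With the keys in place clash can fire its per-node ssh commands at all the boards without any password prompts.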

So far so good. The only niggle was that sometimes not all boards would boot to the point where the network was up and running. Next, I checked the temperature of the Zynq chips. I copied across the ztemp.sh script code[6] and started running it regularly, and determined that the Zynq chip temperatures varied between about 46°C and 55°C – one board, node 2, was always 3 or 4 degrees warmer than the rest. While I would like the values to be somewhat lower, the temperatures are not excessive.
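
Checking all four boards from node 0 needs only a loop of the same shape clash uses; this is a minimal sketch, assuming ztemp.sh has been copied into the linaro home directory on each node.

# Report each node's Zynq temperature, prefixed with the node name.
for n in 0 1 2 3; do
    echo -n "paraclust$n: "
    ssh linaro@paraclust$n 'sh ztemp.sh'
done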

Now to see if the Epiphany III chip[7] on each board worked. I found the test build and run scripts in the handily installed epiphany-examples directory sub-tree[8]. Initially I built and ran all the tests just on node 0, adding the export of the EPIPHANY_TESTS environment variable to the node 0 linaro user’s .bashrc file. All well and good.
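
Each example directory carries its own build and run scripts, so the sweep on node 0 boils down to a loop of the following shape; treat it as a sketch, since the exact depth and layout of the examples tree varies between releases.

# Build and run every example that provides the usual pair of scripts.
export EPIPHANY_TESTS=/home/linaro/epiphany-examples
for d in ${EPIPHANY_TESTS}/*/*/; do
    if [ -x "${d}build.sh" ] && [ -x "${d}run.sh" ]; then
        ( cd "$d" && ./build.sh && ./run.sh )
    fi
done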

Next I copied the updated .bashrc file to the other nodes and used my clash script to build and run the tests on all nodes concurrently. Oh dear – none of the nodes knew about the Epiphany environment when commands were initiated remotely in the form:

ssh user@host command-line

Which of course is the form used by clash.

After some poking around I determined that using ssh to run a single command logs in to the remote system non-interactively, and the .bashrc script run by all the linaro users at login bails out very early for non-interactive shells:

# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac
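
The detection hinges on $-, which holds the shell’s option flags and only contains i for an interactive shell, so a single remote command never gets past the case statement. The flag strings shown are typical examples rather than guaranteed output.

# In an interactive shell on a node the flags include 'i'.
echo $-                            # e.g. himBHs

# As a single remote command there is no 'i', so .bashrc returns immediately.
ssh linaro@paraclust1 'echo $-'    # e.g. hBc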

The solution was of course to set up the Epiphany development environment in all cases, moving the requisite lines from the end of the script to the beginning:

# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

export EPIPHANY_HOME=/opt/adapteva/esdk
. ${EPIPHANY_HOME}/setup.sh
export PATH=/usr/local/browndeer/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/browndeer/lib:/usr/local/lib:$LD_LIBRARY_PATH
export EPIPHANY_TESTS=/home/linaro/epiphany-examples

# If not running interactively, don't do anything more
case $- in
    *i*) ;;
      *) return;;
esac

Interestingly, the comments at the top of the file claim .bashrc is only executed for non-login shells, but it always seemed to get executed when logging in!

Having ensured that the Epiphany development environment and the EPIPHANY_TESTS environment variable I added were getting set, I could fire off test runs on all nodes simultaneously – hoorah!
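
A quick check of the following form from node 0 confirms that the variables now survive a non-interactive login; paraclust1 is just one of the nodes.

# Should print the ESDK and tests paths rather than blank lines.
ssh linaro@paraclust1 'echo $EPIPHANY_HOME; echo $EPIPHANY_TESTS'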

Not so fast. While 3 of the boards were happy to run through the tests (nodes 0, 1 and 2 as it happens), node 3 was not. It generally failed to complete all the tests, and if it did complete them then it would fail a repeat. Its best effort to date is to run through completely twice, followed by failing the first reset test. By fails I mean it locks up the whole system – or at least locks up the network sufficiently to prevent ssh, ping and the like from responding. A check of the voltage supplied to the board yielded a good value of 5.02V, with a variance of about 0.01V, so power seemed not to be the issue.

Up to this point I had not added heat sinks to the Epiphany chips. Thinking maybe node 3’s Epiphany was running a little hot (although checks with a finger did not seem to indicate it was excessively hot – that is, I did not get a burnt finger!), I ordered some heat sinks and double-sided adhesive thermally conductive foil pads, and attached one to the failing board’s Epiphany chip. The heat sink certainly got warm, but no improvement in stability was seen – harrumph.

I decided to try something different: continuously running the matmul-16 matrix multiplication Epiphany example[9]. To my dismay, at random points the 3 ‘good’ boards would get stuck. However, unlike on bad board node 3, in these cases the fault was recoverable by issuing a ctrl-C to quit the current execution and move on to the next, which would pass, and so the executions would continue until the next sticking point. I reverted to running tests on each board in a separate ssh session.
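
Continuously running the example needs nothing more than a loop around its run script, along the lines of this sketch; the path reflects how I recall the examples tree being laid out, so adjust to suit.

# Repeatedly run matmul-16, numbering each pass so the sticking
# point shows up clearly in the output.
cd ${EPIPHANY_TESTS}/apps/matmul-16
for i in $(seq 1 100); do
    echo "=== pass $i ==="
    ./run.sh
done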

It is interesting that the failures I encountered are not the sort that the tests report as failures. The tests either pass, get stuck or make the system unresponsive (or at least the network). Recovering from the failure – either softly with a ctrl-C or more firmly with a reset or power cycle to reboot – will return the system to a state where it will again pass, get stuck or become unresponsive. If a test completes it always passes.