In my previous post Parallella Problems – Initial Investigations[1] I detailed the problems I was having with the 4 Parallella boards[2] in my mini-cluster and what I had done to investigate them. Since then I have continued investigating, and it looks like I might have found a cause of, and a workaround for, the more serious problem: the board that suffers complete system lock-ups when running Epiphany chip[3] tests and examples.
Noticing from posts in the Parallella forum[4] that many problems were down to power issues, I started the week by looking into the 5V supply to the Parallellas. It seemed OK when checked with a multimeter, but to make sure enough juice was always available I tried connecting only one Parallella board at a time, so the 5V rail was only powering a single Parallella and the network switch. The small PC PSU I was using had a 5V rail rated at 12A so this should be much more than enough.
The first board I tried this with was the one that locked up the whole system when it failed. To my disappointment it showed no signs of behaving in a more stable manner. I next tried with one of the other boards that fail to complete the matmul-16 Epiphany example[5] at random intervals. Again the results were the same.
In order to process and observe the matmul-16 timeout failures I ran the modified matmul-16 host program that uses a counter to detect timeouts, along with the modified run4ever.sh and test.sh scripts I wrote about previously[1]. I further modified the host program to skip the host-side calculation and, as I was not interested in the timings, to use usleep[6] to sleep for ~0.17 seconds between checks for completion, reducing the timeout counter to 40. This allowed considerably faster executions and reduced ARM processor utilisation.
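The polling scheme described above can be sketched roughly as follows. This is only an illustration in Python: the real host program is C and uses usleep, and the names wait_for_completion and is_done are hypothetical stand-ins for whatever the host program actually does to check the Epiphany for completion. Only the ~0.17 second interval and the counter limit of 40 come from the text.

```python
import time


def wait_for_completion(is_done, poll_interval_s=0.17, max_polls=40,
                        sleep=time.sleep):
    """Poll is_done() until it returns True or max_polls checks have been made.

    Returns the number of checks used on success, or None on timeout.
    is_done is a hypothetical callable standing in for the real completion check.
    """
    for poll in range(1, max_polls + 1):
        if is_done():
            return poll
        # Sleep between checks instead of busy-waiting, which keeps the
        # ARM cores mostly idle while the Epiphany is working.
        sleep(poll_interval_s)
    return None
```

In this sketch a run that never completes is abandoned after 40 checks, which is roughly 40 × 0.17 s ≈ 6.8 s of waiting.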
Having had no improvements so far I next checked the 5V supply for ripple, noise and the like. I have an old CRT-based analogue 20MHz oscilloscope and used this to observe the AC component on the 5V supply. It was quite noisy. The noise was around 50mV but there were some regular dips of around 10mV lasting about 1ms every 5ms to 10ms. I had to take the time base off its calibrated setting to get these artefacts to appear to be moving slowly enough to see easily. After a while it dawned on me that there seemed to be two separate sequences of dips. After a while longer it occurred to me that their timing changed in step with the beats of the two cooling fans as the fans changed their speeds slightly and independently.
It also seemed that the Parallella board itself adds quite a lot of noise to the supply, especially when the Epiphany is in use. For the matmul-16 example this showed up as positive and negative spikes of 20mV to 30mV above and below the general noise of around 50mV, each about 50ns wide and occurring every 300ns or so. The pattern persists momentarily when all is well but, when things go wrong, persists until the timeout kicks in and shuts down the execution.
To check the Parallella was not picking up the fan noise I disconnected all Parallellas from the 5V supply. With just the network switch powered not a lot of fan noise was visible. So I connected a dummy resistive load consisting of two 10W 1 Ohm resistors in series. The fan noise was still there, so it seemed the source was the fans, cabling and PSU, not the Parallellas.
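As a quick sanity check on that dummy load, the arithmetic can be worked through as below. This is a small sketch assuming a nominal 5.0V on the rail (the measured value will differ slightly); the function name series_load is just for illustration.

```python
def series_load(supply_v, resistances_ohm):
    """Return (current in A, per-resistor power in W) for resistors in series."""
    total_r = sum(resistances_ohm)
    current_a = supply_v / total_r                            # Ohm's law: I = V / R
    per_resistor_w = [current_a ** 2 * r for r in resistances_ohm]  # P = I^2 * R
    return current_a, per_resistor_w


# Assumed nominal 5.0 V rail with two 1 Ohm resistors in series.
current, powers = series_load(5.0, [1.0, 1.0])
print(f"{current:.2f} A, {sum(powers):.2f} W total, {max(powers):.2f} W per resistor")
# prints: 2.50 A, 12.50 W total, 6.25 W per resistor
```

So under that assumption the load draws about 2.5A, with each resistor dissipating about 6.25W, inside its 10W rating.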
To remove the fan noise from the equation I re-purposed the 5V supply of a cluster of 5 Raspberry Pi model Bs plus an 8 port network switch. This was a 5V 10A TDK Lambda[7] affair and it certainly reduced the noise. It also had a handy voltage adjustment potentiometer…
With the same board connected as before I tried a slightly higher voltage of 5.2V, with no change. Returning the voltage to 5.01V (about the same as what the PC PSU 5V line supplied when loaded with all 4 Parallellas and the network switch) I next tried to increase the airflow over the Epiphany chip. I got one or two promising runs with no failures up to 3000+ (would have left the good run for longer but accidentally moved something that crashed the board!) or around 2500 before the first timeout. Thinking maybe I had inadvertently restricted the airflow I tried restricting the airflow on purpose, which gave a similar promising run or two before reverting to timeouts before reaching 200 iterations.
Friday rolled round and I still had no fixes: every time I thought I had a configuration that worked better, a later run would disabuse me of that conclusion! I was getting depressed so I changed tack and returned to seeing if I could make any progress with the board that locks up the whole system when it fails. With the TDK Lambda supply connected to this board I tried various voltages. There was not much difference between 4.76V and 5.21V; the lower voltages looked promising at one point but then did the usual U-turn.
Clutching at straws I wondered if the mountings were stressing the board in some bad way so I removed the mounting screws from all but one mounting post which I left loose (otherwise the board might fall into the fan below!). As a precaution I placed tape over the now unused mounting posts. Again a promising few runs of all the Epiphany tests listed in the linaro user’s epiphany-examples/scripts/LIST.E16 file followed by reverting to badness.
Early on Saturday afternoon I finally acted on the vague observation that many, if not all, of the better runs had happened in the afternoon of hot (for London UK) days. So having run out of other ideas I disconnected the bottom fan that was nearest the bad board under test, with the intention of greatly reducing the airflow over and around the board and thereby increasing its temperature. And indeed it did: firing the board up, logging in over ssh and running watch with the ztemp.sh script[8] (named zynqtemp locally) showed the Zynq 7020 chip to be running with reported temperatures in the 60°C to 63°C range.
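For anyone wanting to log the temperature rather than just watch it, a reading along these lines should be possible. This is not a reproduction of ztemp.sh: it is a sketch that assumes the Zynq XADC temperature channel is exposed through the Linux IIO sysfs interface, and the directory and file names below (iio:device0, in_temp0_raw, in_temp0_offset, in_temp0_scale) are assumptions that may differ on your kernel. The conversion itself follows the usual IIO convention that (raw + offset) × scale gives millidegrees Celsius.

```python
from pathlib import Path

# Assumed sysfs location of the XADC temperature channel; check your own board.
IIO_DIR = Path("/sys/bus/iio/devices/iio:device0")


def iio_temp_celsius(raw, offset, scale):
    """Convert IIO raw/offset/scale values to degrees Celsius.

    By IIO convention (raw + offset) * scale is in millidegrees Celsius.
    """
    return (raw + offset) * scale / 1000.0


def read_zynq_temp(iio_dir=IIO_DIR):
    """Read the three assumed sysfs files and return the temperature in Celsius."""
    def read(name):
        return float((Path(iio_dir) / name).read_text().strip())

    return iio_temp_celsius(read("in_temp0_raw"),
                            read("in_temp0_offset"),
                            read("in_temp0_scale"))
```

Calling read_zynq_temp() in a loop with a timestamp would give a simple temperature log to line up against the test failures.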
To my surprise the board then proceeded to run all the Epiphany III tests several times all the way through. I created a quick script to continuously iterate the test runs, showing the iteration number each time, and lo and behold it ran through them fifty-odd times without any problems. I then ran the unmodified matmul-16 example application for a while via its run4ever.sh script and then my quick-running modified version for over 12000 iterations without the board locking up but, like the other boards, with occasional timeout failures (the unmodified matmul-16 application of course just stopped every once in a while, requiring a manual Ctrl-C to interrupt and fail that run and continue).
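A loop of the kind described, repeating a test run and showing the iteration number each time, might look something like this. It is a sketch, not the script I actually used: the function name run_forever and the max_iterations parameter are hypothetical, and the command passed in is whatever launches one pass of the tests on your setup.

```python
import subprocess


def run_forever(command, max_iterations=None):
    """Run command repeatedly, printing the iteration number before each run.

    Stops at the first non-zero exit code and returns that iteration number.
    Returns None if max_iterations runs all succeed (None means no limit).
    """
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        # flush=True so the number is visible over ssh before the run starts
        print(f"=== iteration {iteration} ===", flush=True)
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"iteration {iteration} failed with exit code {result.returncode}")
            return iteration
    return None
```

For example, run_forever(["./test.sh"]) would keep going until a run fails, with ./test.sh here being a placeholder path. A complete system lock-up cannot be reported by the script itself, but the last iteration number printed to the terminal still shows how far it got.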
As I write this on Sunday I am running tests on the board in the same better-working setup as yesterday. Earlier today it ran the Epiphany III tests through 77 times with full passes and then ran over 8000 iterations of the modified quick-run matmul-16 application before the board locked up. I power cycled the board and it has currently run over 16000 iterations of the quick-run modified matmul-16 application. I am hoping the lock-up was just because it is a little cooler today, especially earlier. Certainly the reported Zynq chip temperature was in the 56°C to 59°C range earlier and is now in the 58.5°C to 60.5°C range.
So it seems Parallella boards might just have a bit of a Goldilocks complex: they are too hot if not cooled enough, but too cold if cooled too much; they like their temperature just right, which seems like it may be when 58°C < Zynq 7020 temperature ≤ 70°C…at least for one specific board. Further investigation would seem to be required.