Despite the hardships I described in the previous post I’ve managed to produce something. Something rather cool! While I’d like to present it in video form, I’m just not feeling up for it. Blog texts are more my medium, anyway. Or maybe it’s that I have more than ten years of experience with this; can’t say the same about video :d
Anyway. A
brief summary before embarking on this adventure: I made an internet connected
timer display! On a microcontroller! With little previous experience. With
.NET! Started with a dirty “MVP”, and then focused on improving the stability
until I was satisfied. Next step would be to add more features on this rather
solid foundation and improve the form-factor. See a short clip
about it! There's also a clip about an early version.
This is a relatively
small project, but I have a lot to tell! Writing this all must have taken at least
a dozen hours or so. Maybe even more :o
Background
To prepare
for this adventure, let’s first take a step back and start from the beginning with
some background, as usual. Feel free to skip to the next section at your
leisure, or even the one following it.
Eons ago I
implemented a JavaScript-based countdown timer. It started as a pizza timer and
initially saw most use at small LAN parties, and due to the convenience I’ve
also used it for other foods too :p But it could have been even more convenient!
So, I added a shortcut for it to my Runner, which is a Win-R replacement with
strong customization. After this I could launch it just by pressing Win-Q and
typing cd 12, and this would open the timer page in browser and set it to count
down from 12 minutes with the query string. How easy is that!
But
sometimes there’s a need to also count to a specific point in time. So, I added
a command for it. at 12:10 would open the timer, and set the remaining time in
such a way that it would trigger at that given time. This opened up a lot more
possibilities for use, and it often was the case that I had several timers
running at once. And more than once some of those timers were set for longer
times, and I happened to restart my PC during them. Let’s just say that a dumb
browser-based countdown isn’t quite compatible with that concept. Not to
mention the times when Chrome updates prevented the timer from accessing audio
due to low user engagement. Thank god there was a group policy to fix that. It
was also a bit of a bother to either keep the tab active, or constantly take a
peek to see how much time there was remaining. I could have moved the timer to
a second monitor, but it would have required extra effort. If I even had a
second monitor, that is. A single ultrawide is more of my thing.
A plan forms
I needed
something better. Something which provided that durability and reliability, with
the “ergonomics” being a still-important secondary feature. And with the alarms tied to specific instants in time instead of arbitrary durations. I considered my
options, and saw two clear ones. Either a Windows application with automatic
start and a screen overlay, or something that I could run on a smaller embedded
device like the Raspberry Pi. This second option would then need some kind of
API to interface with it from the desktop and extra hardware to enable sound
and the display.
The second
option was superior in the sense that it would function independently of the
main desktop, and the platform would likely have fewer interruptions due to
reboots etc. But that extra hardware was quite an issue. But in this I also saw
a third option. A true embedded device. Something where I wouldn’t even have to
concern myself with OS level stuff like getting my app to start automatically
and staying running. And I was already in possession of suitable hardware, and now
software, too.
This third
option is of course the Meadow F7 device from Wilderness Labs, of which I owned
two (now four). Some time ago it had received an update which enabled the
built-in Wi-Fi hardware allowing it to be easily connected, and I already had the
other extra hardware that was required from the accompanying Founder’s edition “Hack
Kit” and an Adafruit order, namely the displays and a buzzer.
The
microcontroller form-factor also offers advantages with size and (perceived)
reliability, with restarts happening quickly. At least once there’s AOT… And
most importantly, I wanted to do things on a microcontroller. Maybe I could
even have the device battery powered some day?
And boy,
did I build a fine thing in the end ^^
Concept validation
A few years
back I had already played around a bit with the Meadow, and built for example a code breaking game
with almost the same exact hardware, so I had a pretty good idea how to
approach this particular problem. The functionality itself was rather simple,
both logic- and hardware-wise. Initially.
Another goal
I had was that the device itself wouldn’t have any human interface. Everything
would still be driven through the Runner for superior usability, and as such
the device would have to be connected to the control server either via IP or a
serial connection. I already had some experience with the serial connections,
but IP is always so much cooler, and also more standalone in this case.
And as a
bonus, I wanted to have the ability to have alarms displayed on multiple
devices at once. That way I could have one on my desk, another in kitchen etc.
So, I
started by proving that the device would indeed be able to communicate over IP
as advertised, and that it would also retain that ability over longer periods.
I took the provided Wi-Fi
sample code as my starting point and got to work.
I was quite
pleased that the sample worked unchanged (not counting network configs). Could
it really be this simple? Encouraged by my success, I moved on to the next
phase, and quickly implemented a relatively simple .NET 6 based backend for all
my timing needs, accompanied by an HTTP wrapper for making requests to the
backend.
But I still
had scepticism regarding the networking in Meadow. On the other hand, I’ve often
struggled with trying to build things that are too perfect too soon, so this
time I settled for the “bare” minimum and just rolled with it. I didn’t bother
to implement persistence, opting to just store things in memory. This wasn’t a
lot better than just having the things in a browser, but at least it was
magnitudes better than keeping the stuff only inside Meadow’s memory. I could
always add the persistence layer when the more uncertain things were less
uncertain.
Unfortunately,
my initial scepticism was confirmed. When I got even slightly more serious
about my use of the network, the device just hung. Sometimes this happened
within a minute of boot, and sometimes took more than an hour. That wasn’t
great. Not great at all. But hey, the device is still in beta, and the people
working on the device assured me that stability improvements were actively
being worked on. And the issues were later fixed!
As
explained, getting the thing to work wasn’t the end goal. Getting it to work
reliably was. On any other week I would have probably been quite devastated when
something this elementary wasn’t working as expected. But now I embraced the
challenge presented. I even
had a secret new tool at my disposal now. One I had been itching to apply
somewhere.
Implementation
A while ago
the hardware watchdog in the device was exposed for use. While not exactly
graceful, it was perfectly effective in getting the device to recover. Challenge
overcome. How unexpectedly anticlimactic. Additionally, a later firmware update greatly improved network stability.
Now I had
more time to focus on the application domain, then. I needed two things. First,
I’d obviously have to get the timers to the device. And second, a closely
related mandatory reliability feature is getting the device to recover those
timers on booting. Something that would happen quite often for the foreseeable
future.
Luckily this
was something I had anticipated with the initial architecture, and almost the
sole reason the server component even exists. While I didn’t set out to build
perfect right away, it had to be better than just persisting the timers in the
device’s RAM. Especially this early in development I assessed that the server would
have a lot fewer restarts than the device, and it wouldn’t make sense to try
to persist anything important on the device.
Or at least
I didn’t know enough about embedded hardware to know how reliable it is. All I
know is that SSDs on PCs are quite reliable. And a magnitude more reliable yet
if the server is clustered over many physical computers and the writes go to
multiple independent storage devices. But that’s another adventure altogether, best
embarked on some other year.
MVP
Let’s talk
about the APIs first. Respecting the pledge I made to myself earlier, I started
by building something less desirable, but what would work with minimal work.
And what’s easier than polling over HTTP?
No state-keeping,
no events, just a dumb endpoint that returned the next countdown timer. And as I
wasn’t familiar with the characteristics of the device’s clock and its
accuracy, I made the endpoint also return the current time. The device could
then compute the current time by adding the elapsed time of a stopwatch, which was
reset each time the endpoint was polled, to the server time from the latest response.
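The idea can be sketched roughly like this. This is a minimal illustration and not the actual code of the project; the class and member names are made up for the example.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: estimates the server's current time between polls.
public sealed class ServerClock
{
    private readonly Stopwatch _sinceLastPoll = new Stopwatch();
    private DateTimeOffset _serverTimeAtPoll;

    // Call whenever a poll response arrives carrying the server's current time.
    public void OnPollResponse(DateTimeOffset serverNow)
    {
        _serverTimeAtPoll = serverNow;
        _sinceLastPoll.Restart();
    }

    // Last reported server time plus the locally measured elapsed time.
    public DateTimeOffset Now => _serverTimeAtPoll + _sinceLastPoll.Elapsed;

    // Time left until the given alarm instant, as seen from the server's clock.
    public TimeSpan Remaining(DateTimeOffset alarmAt) => alarmAt - Now;
}
```

This way the device never has to trust its own wall clock, only its ability to measure short intervals.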
And there I
had it! Thanks to the code breaker project, it didn’t take long to have the
remaining time visible on a 7-segment display, and the end of the timer
visualized by flashing a Charlieplex led matrix. A fully functional MVP
already.
Ending up with
a viable end result is rather unheard of, as far as my things go. As
hinted, my projects usually run very long because I aim for perfection. Sure, I’ve
learned a lot doing them, but I’ve only rarely managed to produce an artifact,
let alone one of any real use. Now that I had something again, it felt really good! I had
really missed the feeling.
Further improvements
Now that I
had a minimum viable product, I could have just ended it there. But it’s just
not in my nature. I had already put a week of work into it. What if I put in
another? I could do so many nice incremental improvements, and all the time
have a working thing. Even if I quit, I’m left with something worthwhile. Plus,
I was feeling good. I really wanted to keep working on it. Even if my fascination
got a bit unhealthy towards the end of the first week. Surprised myself by
taking a short break, and was again energized.
And there’s
a lot I ended up improving. I’m not sure how I should best present everything,
so here comes something. It doesn’t have to be perfect, right?
Networking
The first
obvious improvement was how the device interfaced with the server. If you
recall, the first implementation was simple HTTP polling. Polling has high
latency, and this was something that needed instant feedback in order to feel
reliable. If I set a timer, I want to immediately see that it got set and move
on to do more important things.
I could
have upgraded to long polling and called it a day, but publish/subscribe is a lot
cooler. Plus, it’s more efficient and scales better, not that it was an actual
concern. While I’ve tried to make NATS my go-to in this regard, I decided to go
with another of my favorites: Redis. It’s a mature codebase, and the wire
protocol is dead simple, so it’s going to perform extremely well for my scenario.
Except it
didn’t. I tried using the de facto StackExchange.Redis package, but it turned out
to have too many features. Meadow executes code in an interpreted mode with
some rather primitive JIT, and all those features with a complex handshake
meant that the initial connection took a long time, enough to blow past
about every conceivable timeout. Even five minutes wasn’t enough to complete
the whole handshake. That was just too much.
I could
still have tried NATS, but decided to play it safe and go for the nearly polar
opposite. And have a chance at doing something I had been missing a long time.
Pure UDP. Minimal framing. Dead-simple connectionless protocol with timers.
Handcrafted packets. Oh, how I had missed that world; it has been so long since
I had worked with Tracker.
And third
time’s the charm. Performance was awesome and things just worked. And should
they happen not to work, the timers would soon rectify the situation. I was
happy.
There are
just two packet types. The device sends discovery packets at an interval to the
server, and the server sends status packets at an interval to all devices which
have been discovered. And if there’s a new alarm, the status packet is sent
immediately, allowing the device to pick up the countdown without delay. Sure,
it was excessively chatty when there were no updates, but it was also
excessively simple and reliable.
The
discovery packet is just a simple sequence of “magic” bytes, and that’s it. The
status packet is more sophisticated. Mirroring how the HTTP polling endpoint worked,
it contains a sequence of the next few upcoming countdowns which the device
hasn’t yet finished. Additionally, the packet starts with a hash of the data it
represents. The data doesn’t change until the old alarm passes, or a new one
gets inserted before it. This means that the client can simply check those
initial bytes of the packet for the hash, and stop parsing if it equals the
old hash. Only if the hash differs is it required to continue parsing and
possibly allocating memory. So fast! Other than that, there’s really nothing
extra. Not even a header or a real checksum to differentiate the status packets
from garbage :s There probably should be…
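A rough sketch of that early-exit check follows. The wire layout here is purely an assumption for illustration (a 4-byte little-endian hash, a count byte, then 16-byte GUID plus 8-byte due time per entry); the real packet format isn't spelled out here, and the type names are made up.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical reader: skips parsing when the leading hash is unchanged.
public sealed class StatusPacketReader
{
    private const int HeaderSize = 4 + 1;  // assumed: hash + entry count
    private const int EntrySize = 16 + 8;  // assumed: GUID + due time (Unix ms)
    private int? _lastHash;

    public IReadOnlyList<(Guid Id, long DueUnixMs)> Countdowns { get; private set; }
        = new List<(Guid Id, long DueUnixMs)>();

    // Returns true only if the packet carried new data and was parsed.
    public bool TryUpdate(ReadOnlySpan<byte> packet)
    {
        if (packet.Length < HeaderSize) return false;

        int hash = BinaryPrimitives.ReadInt32LittleEndian(packet);
        if (hash == _lastHash) return false; // same data as before: no parsing, no allocation

        int count = packet[4];
        if (packet.Length < HeaderSize + count * EntrySize) return false;

        var list = new List<(Guid Id, long DueUnixMs)>(count);
        for (int i = 0; i < count; i++)
        {
            var entry = packet.Slice(HeaderSize + i * EntrySize, EntrySize);
            var id = new Guid(entry.Slice(0, 16));
            long due = BinaryPrimitives.ReadInt64LittleEndian(entry.Slice(16));
            list.Add((id, due));
        }

        Countdowns = list;
        _lastHash = hash;
        return true;
    }
}
```

As most status packets are repeats, the common path is a length check and a single integer comparison.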
But even
with an implementation of this level things work really well.
Part of the
equation is that the device tells the server that it has received a countdown, or
that it has started/finished alerting it. This still happens via HTTP. While it
would be nice that there was only one communication channel, one could also ask
why? Right tool for the job, and it already worked well. And the device is plenty
powerful to contain code for both. Everything doesn’t have to be absolutely perfect.
That’s what I actively try to tell myself, and I’m slowly starting to perhaps
even believe it.
There’s
also a mechanism for keeping the device’s time closely matching the server’s. Initially,
I thought I’d implement NTP. But I don’t really understand it, and I could not find
a good implementation I could run on the device. So, I rolled my own (:
When the
device boots, it simply does an HTTP call and uses that time. And afterwards
every 15 minutes it asks for the time again. If the call takes less than a
threshold, the device’s time is updated, after tweaking the time value by half
of the request latency. Because why not. It’s likely mostly symmetric on a LAN,
right?
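In code, the adjustment boils down to something like the following sketch. The names and the threshold parameter are made up for illustration; it assumes the round trip is measured locally with a stopwatch around the HTTP call.

```csharp
using System;

// Hypothetical helper for the latency compensation described above.
public static class TimeSync
{
    // Returns the time to set on the device, or null if the request
    // was too slow for the result to be trusted.
    public static DateTimeOffset? Adjust(
        DateTimeOffset serverTime, TimeSpan roundTrip, TimeSpan maxRoundTrip)
    {
        if (roundTrip >= maxRoundTrip) return null;

        // Assume a symmetric path: the response spent half of the
        // round trip on its way back to the device.
        return serverTime + TimeSpan.FromTicks(roundTrip.Ticks / 2);
    }
}
```

With an 80 ms round trip the reported server time would be moved forward by 40 ms before being applied.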
It works
really well. If I wanted to improve this, I would eliminate the explicit updates
completely, or at least implement them in UDP, so that there’s always only one
round-trip required, improving latency. Not that the current 60-100ms is too
bad. It should be using keep-alives, anyway, so there’s not too many extra
packets. Elimination of these updates could be achieved if the server immediately
replied to the discovery packets with status and time. And perhaps have some
soft nudging so that the device’s time changes by only a few milliseconds at a
time if the difference isn’t too large. That way the remaining time on the
countdown display decrements as expected even under close observation.
Local persistence
Now that
the protocol was at a satisfactory level, I could continue to improve other
things related to reliability. Which still happened to be related to
networking, too. As explained, the way the device protocol works is that the server
sends the “next” events to the device. For the server to know what these next events
are, it needs to know if the device has already alerted them: the device needs
to tell the server this. But the network can be unreliable, and I don’t want to
bother the user with duplicated alarms in the case where an alarm happens but the server
can’t be reached before the device boots.
So
obviously the device needs to be able to locally persist these states and then
ignore them if the server disagrees, and resend the state update. But how to
store this data? While Meadow does have onboard flash storage which is accessible
to user code, I’m concerned with write endurance. State updates can happen
relatively often, so it might wear out the device at a surprising rate.
But there’s
alternatives! I’ve been fascinated by write endurance before, and happened to
stumble upon a type of memory which is persistent without power, yet has a
superior write endurance compared to flash, while being relatively affordable and
usable in embedded devices. As part of an Adafruit order I got a couple of FRAM (Ferroelectric
RAM) modules for unrelated purposes. These particular modules have write
endurance of about 10e12 per byte. While still finite, it’s practically infinite
in this application. How cool is that!
There was
no ready-made library for using the modules with Meadow, so I ended up writing
my own based on Adafruit’s Arduino code. Things went quite smoothly – after I
learned that the chip select pin can’t be released between sending a command
and reading the result. Oh, and there was also another thing. This particular
device requires sending a separate write enable command before the actual write
command. Adafruit’s library insists that the write enable command needs to be
sent once, and then multiple writes can be issued. But according to the datasheet,
the write enable latch is reset each time chip select is released, and a new
write command can’t be issued without releasing the pin. Was a bit frustrating
to figure that out. Or at least I couldn’t figure out how it was supposed to
happen. This was my first time interfacing with an SPI device, after all.
Now that I
had the storage device, I could get to writing things to it. I ended up with something
relatively straightforward. The persisted data consists of “packets” of
constant size, and each having a static header, countdown GUID, the latest
status enum value, a serial and then a hash. Each time a new update is written,
it’s written after the one before it. This way the memory gets worn relatively
evenly, not that the write endurance really was a problem. But why not.
For the
hash I wanted to use just a simple CRC32, but as it happens, .NET Standard 2.1
doesn’t have an implementation for it, and I didn’t want an extra library just
for it. But what I have is MD5. And as a bonus, it is hardware accelerated,
too! As the full 16-byte hash is rather excessive, I simply XOR its four 32-bit words together to shorten it.
Span<byte> hash = stackalloc byte[16];
// MD5.Create().TryComputeHash(…, hash, …)
var ints = MemoryMarshal.Cast<byte, int>(hash);
var smallHash = ints[0] ^ ints[1] ^ ints[2] ^ ints[3];
Beautiful,
isn’t it.
Now with
the states persisted, the device can read through the memory and try parsing a
packet from each packet-sized offset. If the header and the hash match, it is
assumed to be valid persisted state. If multiple updates are found for a single
event, the “newest” one is selected first based on the serial (a version
number) and the state. Afterwards, all those states are bulk sent to the server
during the startup sequence, which now once again sends relevant status updates
which the device doesn’t have to ignore. This also saves a bit of network bandwidth.
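To make the record format and the recovery scan a bit more concrete, here's a minimal sketch operating on a plain byte buffer instead of real FRAM. The header bytes, field sizes and names are assumptions for illustration, and the tie-breaking is simplified to the serial only.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Security.Cryptography;

// Hypothetical fixed-size state records: header, GUID, status, serial, hash.
public static class StateLog
{
    public const int RecordSize = 2 + 16 + 1 + 4 + 4;
    private const int HashOffset = RecordSize - 4;
    private const byte Magic0 = 0xC0, Magic1 = 0xDE; // made-up header bytes

    // MD5 of the data, XOR-folded down to 32 bits.
    private static int Fold(ReadOnlySpan<byte> data)
    {
        Span<byte> md5 = stackalloc byte[16];
        using (var hasher = MD5.Create())
            hasher.TryComputeHash(data, md5, out _);
        var ints = MemoryMarshal.Cast<byte, int>(md5);
        return ints[0] ^ ints[1] ^ ints[2] ^ ints[3];
    }

    // Fills one record-sized slot.
    public static void Write(Span<byte> slot, Guid id, byte status, int serial)
    {
        slot[0] = Magic0;
        slot[1] = Magic1;
        id.TryWriteBytes(slot.Slice(2, 16));
        slot[18] = status;
        BinaryPrimitives.WriteInt32LittleEndian(slot.Slice(19, 4), serial);
        BinaryPrimitives.WriteInt32LittleEndian(
            slot.Slice(HashOffset, 4), Fold(slot.Slice(0, HashOffset)));
    }

    // Scans every record-sized offset and keeps the newest valid state per countdown.
    public static Dictionary<Guid, (byte Status, int Serial)> Recover(ReadOnlySpan<byte> memory)
    {
        var latest = new Dictionary<Guid, (byte Status, int Serial)>();
        for (int offset = 0; offset + RecordSize <= memory.Length; offset += RecordSize)
        {
            var slot = memory.Slice(offset, RecordSize);
            if (slot[0] != Magic0 || slot[1] != Magic1) continue; // no header
            int stored = BinaryPrimitives.ReadInt32LittleEndian(slot.Slice(HashOffset, 4));
            if (stored != Fold(slot.Slice(0, HashOffset))) continue; // torn or garbage

            var id = new Guid(slot.Slice(2, 16));
            int serial = BinaryPrimitives.ReadInt32LittleEndian(slot.Slice(19, 4));
            if (!latest.TryGetValue(id, out var seen) || serial > seen.Serial)
                latest[id] = (slot[18], serial);
        }
        return latest;
    }
}
```

A half-written record simply fails the hash check and gets skipped, so the previous state for that countdown still wins.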
Watchdog
Continuing
with the reliability improvements, my next focus was improving the watchdog.
The initial implementation guarded well against complete device hangs, but wasn’t
much more sophisticated than that. As the application now consisted of the
time updates, the discovery stuff, the actual timing code and lastly an asynchronous
bonus layer for the state updates (more about it later), it made sense to start
monitoring all of them. But there’s only a single watchdog. How to watch for so
many different things?
What I
ended up with was a collection of timestamps which record when each of those components
was last healthy (=reached a checkpoint), and then a task to periodically compare
those timestamps against specific timeout values. If any of the components is deemed
unhealthy, an error is printed and the watchdog is not reset. This leads to the
device restarting, and usually things start to work again. As a bonus, as the
timeouts are computed in “user-space”, they can be a lot longer than the default
short.MaxValue milliseconds the Meadow’s watchdog makes possible. Mostly useful
for the time updater.
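The bookkeeping part of such a scheme could look roughly like this. It's a sketch with made-up names, not the actual implementation, and it leaves out the hardware watchdog call itself. It tracks uptime with a stopwatch rather than the wall clock, on the assumption that the time updates would otherwise make the timestamps jump.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

// Hypothetical health tracker: many software checkpoints, one hardware watchdog.
public sealed class HealthMonitor
{
    private static readonly Stopwatch Uptime = Stopwatch.StartNew();

    private readonly ConcurrentDictionary<string, (TimeSpan LastHealthy, TimeSpan Timeout)> _components
        = new ConcurrentDictionary<string, (TimeSpan LastHealthy, TimeSpan Timeout)>();
    private readonly Func<TimeSpan> _uptime;

    // The uptime source is injectable to make the logic testable on a desktop.
    public HealthMonitor(Func<TimeSpan> uptime = null) =>
        _uptime = uptime ?? (() => Uptime.Elapsed);

    public void Register(string name, TimeSpan timeout) =>
        _components[name] = (_uptime(), timeout);

    // Called by a component each time it reaches its checkpoint.
    public void Checkpoint(string name)
    {
        if (_components.TryGetValue(name, out var entry))
            _components[name] = (_uptime(), entry.Timeout);
    }

    // True if every component has checked in within its own timeout.
    // Only then should the periodic task reset the hardware watchdog.
    public bool AllHealthy(out string firstStale)
    {
        var now = _uptime();
        foreach (var pair in _components)
        {
            if (now - pair.Value.LastHealthy > pair.Value.Timeout)
            {
                firstStale = pair.Key;
                return false;
            }
        }
        firstStale = null;
        return true;
    }
}
```

The periodic task would then call AllHealthy, print the stale component's name on failure, and simply skip the watchdog reset so that the hardware takes care of the reboot.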
I’ve
spotted the device restarting a couple of times due to the above, but I don’t have
the specifics on why. There’s some kind of traceback visible on the small debug
display I have attached to it, but it’s too small to display it in full. I’m
considering writing the tracebacks to the flash memory as they
happen, and then sending them to the server after reboot, or in the background. Or
maybe order a lot larger display just to display the longer tracebacks :D
Usability
As I built
the timer on top of the hardware I had in the code breaker, I already had a
7-segment display and a bright “lamp”. The 7-segment display was obviously for showing
the time remaining, and blinking the lamp for alarming. In this case the lamp
is an addressable Charlieplex led matrix display (had to make a driver for it
myself, again). It’s a total overkill, as I’m just filling all the pixels with
a single brightness value. But it’s easy. And really bright.
But what’s
an alarm without auditory output, too. In the Hack Kit there also was a piezo
speaker which was perfect for alarms when attached to a PWM port. I immediately
hated how it sounded. It was perfect.
But I could
do better.
I added a
small beep when a new alarm was detected so that there was more feedback for
entering one. A tiny thing, but a really nice one.
I figured
out I could also improve the 7-segment display. This is probably a bit controversial,
but this is my thing, and I can do stuff just the way I want :) The purpose of the
display is to show the remaining time, but only when I’m interested in it. I
found it a bit obnoxious that the seconds kept updating every second even when
they didn’t really have any relevance. So, I made it so that the display only
shows the seconds if there’s less than 10 minutes left. If there’s more, only
the minutes are shown, with the digits reserved for seconds remaining
completely dark.
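As a sketch, the rule could be expressed like this. It assumes a four-digit display (two digits for minutes, two for seconds) where a space means a dark digit; the names and the clamping of long countdowns to 99 minutes are made up for the example.

```csharp
using System;

// Hypothetical formatter for a four-digit 7-segment display.
public static class CountdownFormat
{
    public static string ToDigits(TimeSpan remaining)
    {
        if (remaining < TimeSpan.Zero) remaining = TimeSpan.Zero;

        // Two digits for minutes, right-aligned; clamped as an assumption.
        int minutes = Math.Min((int)remaining.TotalMinutes, 99);
        string mm = minutes.ToString().PadLeft(2);

        // Seconds only matter near the end; otherwise leave those digits dark.
        return remaining < TimeSpan.FromMinutes(10)
            ? mm + remaining.Seconds.ToString("00")
            : mm + "  ";
    }
}
```

With this, a countdown with 12 minutes and 34 seconds left shows just the 12, and the seconds appear once the remaining time drops below ten minutes.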
The seconds
remaining dark actually serves a dual purpose. The way 7-segment displays are
typically driven is via a matrix, with only a single led receiving power at a
time. This leads to flickering, which is especially apparent at lower brightness
settings. The fewer illuminated areas there are on the display, the less
surface area there is for the flickering to manifest on. During use I found that I’m highly sensitive
to the flickering. The display is a small object, and if I moved my head around
it felt as if the display was moving. That was not a nice feeling. The less
flickering, the better.
Also, as
the display is for indication and not illumination, I had used it at the lowest
possible brightness setting. This helps to reduce visual fatigue when the
display remains in my field of view. But as I mentioned, this was at odds with
the flickering. So, as a workaround I bumped up to the highest brightness and covered
the display with a dimming film. Not very elegant or flexible, but it felt like
it helped, and my camera seemed to agree. It’s still nowhere near perfect, but
it’s usable at least.
As there
doesn’t really seem to be any 7-segment displays which don’t flicker, my other
options seem to be making my own (unfeasible), or using another display type.
OLEDs would be great, but they might end up with burn-in. Not sure if that’s
really a problem. There’s also TFTs, but I’m not sure how readable they are
with their reduced brightness. I do have one TFT display (the debug one), but
haven’t yet tested rendering the timer on it.
Accuracy
And lastly,
I focused on accuracy. As I hinted, the system supports alarms on
multiple devices at once. I wanted to make sure that different devices would
display the same time, and start alerting at the exact same time.
I already
had the time updates with latency compensation, so most of the work was already
done. What was left was to make sure that the time calculation logic was
accurate, and that the code which was executed when an alarm started executes
in roughly the same time on different devices. The biggest hurdle was the
status updates. On a desktop it happened practically instantly, but the Meadow
on Wi-Fi took considerably longer to execute the update.
I solved
this by making the status updates asynchronous. The update is written to FRAM
instantly, but afterwards it goes to a background queue and takes however long
it takes. With automatic retries.
After these
the alarms trigger about as closely as possible, even on drastically different
hardware :) See the video in the intro.
About the development experience
Before (finally)
concluding, I’d like to talk briefly about the developer experience. I’ve grown
to be a big .NET fan, and I was ecstatic that I had the ability to stay on the
platform even when targeting a microcontroller. Likely wouldn’t have targeted one
if that wasn’t the case. At least not without a prototype in .NET.
And what’s even
better is that Meadow has support for the full .netstandard2.1 profile,
so not just some exotic device-specific framework. While I’d love to have the
full .NET 5+ support I’ve heard fables of, that profile has almost all
the features I need. What this enables is the ability to write .NET library
code as usual, and have that work on the device without modifications.
Including networking and async/await. The only thing I needed to add extra
support for was the application-specific hardware, like the displays and the
FRAM chip, but that was handled via a “device interface” with just a few
methods.
All this meant
that I could write the application logic in a reusable library, and then host
that application on different targets with minimal code. In this case one
target was obviously the Meadow, and another was LinqPad for running the code
on PC. This also meant that for testing most changes I didn’t even have to
deploy the code to the device (a task which takes a few minutes when also
counting the startup time), and could instead locally test them on a desktop PC,
taking only a few seconds to get results (including the time it took to compile
the application). After testing I could finally deploy the app on the device,
and it just worked. It was glorious.
Of course,
testing more device-centric things wasn’t this easy, but there were only a few of
those things.
What’s next?
After all those
improvements the device is at a very good place already! The core is now very
stable and I feel really confident that I will be able to rely on the device
side of things.
What’s
still missing is the server-side improvements. Things still are not persisted to
a database on that end, and I’m running the server on the desktop, so as a
whole the stability isn’t that much better than it used to be. But after I improve
that aspect, things will be really good all around!
The next
real improvement is probably either the form-factor, or features. The device is
built on a solderless breadboard with a lot of jumper wires. It takes quite a
bit more space than it could. I plan to move the FRAM chip to a backpack,
and replace the led matrix with just a led or two. The piezo should be good
enough to get my attention. It’s a bit sad if I have to keep the dimming film
on the display. I could have used the lower brightness in normal use, and then
blink it at full power when the alarm ends, subverting the need to have separate
LEDs.
Anyway.
After this I won’t need the breadboard, and the whole thing then fits on a three-layer
feather form factor, taking considerably less space on my desk, enabling better
positioning. I’ll make a post about it when/if I get around to implementing it.
I’m also
flirting with the idea of introducing strong cryptography, especially on the
UDP layer as it’s stateless. While it won’t help with confidentiality
considering the timing aspect of the system, it will greatly help with
integrity and authenticity. It’s like an “easy” solution for ignoring garbage
packets. If a packet doesn’t pass crypto, it can be ignored. And if it does, it’s
probably valid application data! HTTP on the other hand is stateful and has a “strict”
structure, so there’s not a realistic chance for garbage.
And maybe some new features, too. Like customizing the alerts, by allowing some countdowns to be silent, or with a different (=less annoying) tone.
And as the things happen over a rather simple API, I can further customize the functionalities by writing orchestration code on another system. For example, the at command in Runner is implemented by having it perform calculations, and then adding a countdown to a specific time. I'm also going to write a new command which cancels an existing alarm of the same type before starting a new one with the given time. Couldn't do that with the old browser-based approach, but now I can :)
I really like how this new system turned out. And while this might perhaps not sound like a lot, it really made a difference. After using this new thing for just a few days, I felt really handicapped when I had to use the old alarms for a while. The difference in usability was really astonishing!