Retrospective: How We Could Have Failed Faster

For the last several months, I’ve been working on a project that involves reading data from a device via Bluetooth and then sending that data to another machine via RS232 serial port. At least, that’s my piece of the puzzle. It’s proved to be a challenging, but not insurmountable, task for someone who is traditionally a CRUD programmer.

Yesterday, we discovered that we built the wrong software. I built what I was asked to build, but it isn’t what the client needs. The sensor device that was chosen doesn’t provide appropriate data for the use case. I spent the day salvaging whatever code I could for reuse in the end product, but the application itself (and several months of development) will be thrown away. This has prompted me to have a personal retrospective on this project to see if there were things we could have done to fail faster.

Project Background

The Goal

Read data from a sensor, and send it to an RS232 serial port on a Matlab xPC Target computer so some researchers can do researchy things with it.

The Team and Process

Theoretically, there’s a team of three: one engineer, one developer, and a product owner (also an engineer). In reality, I’ve spoken to the engineer once. However, the PO and I meet regularly and often to discuss the software. Also, we’ve pulled in temporary team members as needed. Need help setting up a LAN? Grab someone from infrastructure for a day. Need help understanding serial ports? Grab the embedded development expert for an afternoon.

The team doesn’t have a regular standup, but the PO and I speak every day and have a demo/planning session every other week. No story points and no estimation of any kind beyond, “historically, I can complete 3 stories every 2 weeks”. We just have a prioritized and well-groomed backlog, pulling the next most important piece of work as others get completed. We don’t have retrospectives, hence my need to put this one on paper.

The Stack

The device itself is a commercial device with an SDK and accompanying .NET libraries. Since they’re .NET libraries, we chose to write the application in C#. It’s a language I’m fluent in, so cool. Easy enough, just need to learn a new API; the library itself will take care of all the Bluetooth stuff for me. The device and SDK require a minimum of Windows 8 to leverage the Windows Bluetooth Low Energy (BTLE) libraries introduced in that version. Windows 8 was awful, so we went with Win10. There’s a wish to deploy the application to a mobile device, so we went with the Universal Windows Platform (UWP). Like I mentioned earlier, we’re sending the data over a serial port to an xPC Target machine, where we read it out again via a Matlab Simulink model.
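Since this is UWP, serial I/O goes through the Windows.Devices.SerialCommunication API rather than the classic System.IO.Ports.SerialPort. A minimal sketch of writing a frame out the port might look like the following; the settings and names here (SerialSender, payload, the 115200 baud rate) are illustrative, not the project’s actual configuration:

```csharp
using System;
using System.Threading.Tasks;
using Windows.Devices.Enumeration;
using Windows.Devices.SerialCommunication;
using Windows.Storage.Streams;

public static class SerialSender
{
    // Illustrative settings only -- the real baud rate and framing depend on the xPC Target model.
    public static async Task SendAsync(byte[] payload)
    {
        // Find the first available serial port (a real app would let the user pick one).
        var selector = SerialDevice.GetDeviceSelector();
        var devices = await DeviceInformation.FindAllAsync(selector);
        if (devices.Count == 0) throw new InvalidOperationException("No serial ports found.");

        using (var port = await SerialDevice.FromIdAsync(devices[0].Id))
        {
            port.BaudRate = 115200;
            port.DataBits = 8;
            port.StopBits = SerialStopBitCount.One;
            port.Parity = SerialParity.None;

            // DataWriter buffers the bytes; StoreAsync pushes them out the port.
            using (var writer = new DataWriter(port.OutputStream))
            {
                writer.WriteBytes(payload);
                await writer.StoreAsync();
                writer.DetachStream();
            }
        }
    }
}
```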

Of all these technologies, I have experience in exactly one (and a half). I know C# and can get around the XAML in a WPF application, but the markup for UWP is just different enough for it to be a small but painful learning curve. When we started this project, I knew absolutely nothing about serial ports, network protocols, Matlab, Simulink, or xPC Target.

In Retrospect

Now that we have the background, let’s ask the classic three questions.

What Went Well?

A number of things have actually gone surprisingly well. Off the top of my head, these are the things we’ve been doing very right.

  • Spikes
  • Good Test Coverage & Architecture
  • Demos
  • Lean planning meetings
  • Learned a lot about many new (to us) technologies

Spikes

This project has been a learning experience every step of the way. What do we do when we need to learn something? We drive a spike. I’ve used three approaches to my spikes over the life of the project: characterization testing, the (more traditional) “throw it away” spike, and a “Spike and Stabilize”(1) approach.

Early in the project, I had no idea how feasible any of this was going to be, so I did my best to build the simplest thing that could possibly work. Very early on, this took the form of writing characterization tests for the libraries I needed to call. That code was quickly trash-binned, but it did allow me to quickly learn the APIs I needed to learn. From there, I started building a simple console application. Again, the purpose was not to build production code, but to prove that the approach could work and to learn, as quickly as I could, the things I would eventually need to know in order to write production code.
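The vendor’s SDK isn’t something I can reproduce here, so as a stand-in, here’s the same style of test written against a BCL call of the kind you lean on when decoding sensor bytes. The point is the shape of the test, not the target: you assert what you observed the library doing, so a wrong assumption shows up as a red test. (MSTest attributes, since that’s what the UWP unit test template uses.)

```csharp
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

// Characterization test: instead of asserting what a library *should* do,
// pin down what you observed it doing, so any surprise shows up as a failure.
// BitConverter stands in here for the unfamiliar vendor SDK.
[TestClass]
public class ByteDecodingCharacterizationTests
{
    [TestMethod]
    public void ToInt16_ReadsTwoByteSamplesLittleEndianOnOurTargets()
    {
        // 0x34 then 0x12 in the buffer...
        var buffer = new byte[] { 0x34, 0x12 };

        var value = BitConverter.ToInt16(buffer, 0);

        // ...comes back as 0x1234 on the little-endian machines we deploy to.
        Assert.IsTrue(BitConverter.IsLittleEndian);
        Assert.AreEqual((short)0x1234, value);
    }
}
```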

It took a bit of time to discover what architecture was going to work, but the process of trying the simplest thing that might work, and throwing it away when it didn’t, worked very well. There was so little sunk cost that it wasn’t a big deal to throw anything away and try something different.

Once we had a design that we knew would work, I moved on to production code and my approach to spiking changed a bit. I was no longer learning whether or not something would work, but rather how to make some particular piece of functionality work in UWP. Here I took the “Spike & Stabilize” approach. I would relax my TDD practice a bit and do whatever ugly thing I needed to do in order to figure out how to make that piece of functionality work. Once I had a basic knowledge of it, I would go back, write any missing tests for the spike code, and resume the red-green-refactor cycle. It took a lot of discipline, but it worked very well for me. I was able to learn quickly, stabilize the new code with some tests, and then transition back into a more rigorous testing approach without getting bogged down worrying about TDD while I was trying to learn.

Good Test Coverage & Architecture

Tests are essential if you need to move quickly. Since this was a greenfield project, I was able to ensure that the hardware and 3rd-party libraries were decoupled from the rest of the software. Having the tests as a safety net made it easy and safe to change the software as we learned what it should be. This has all been said before and in many places, though.

What is important to note is the level of abstraction the tests were written at. I paid very close attention to make sure I was testing only through the public API of the system, never implementation details. We made many changes to the software very rapidly, so if I had been testing implementation details, I would have gotten bogged down updating tests instead of making the needed changes.

(I’m well aware that I’m patting myself on the back here, but it took me a long time to get this right. Learning how to write automated tests is easy. Learning how to write good tests that aren’t brittle in the face of change is really frakkin’ hard. I earned that back patting.)
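To make that decoupling a little more concrete, here’s a thumbnail sketch of the shape of it. The interface and type names below are invented for the example; in the real application the seams wrap the vendor’s BTLE SDK on one side and the serial port on the other, and the tests only ever see the public API plus a fake on either end.

```csharp
using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;

// Hypothetical seams: the production implementations wrap the BTLE SDK and the
// serial port; the tests never touch either.
public interface ISensorSource { IEnumerable<double> ReadSamples(); }
public interface ISerialSink { void Write(byte[] frame); }

// The system under test, exercised only through its public API.
public class SensorRelay
{
    private readonly ISensorSource _source;
    private readonly ISerialSink _sink;

    public SensorRelay(ISensorSource source, ISerialSink sink)
    {
        _source = source;
        _sink = sink;
    }

    public void RelayOnce()
    {
        foreach (var sample in _source.ReadSamples())
            _sink.Write(System.BitConverter.GetBytes(sample));
    }
}

[TestClass]
public class SensorRelayTests
{
    [TestMethod]
    public void RelayOnce_WritesOneFramePerSample()
    {
        var source = new FakeSensorSource(1.0, 2.0, 3.0);
        var sink = new RecordingSerialSink();

        new SensorRelay(source, sink).RelayOnce();

        // Assert on observable behavior, not on how RelayOnce is implemented.
        Assert.AreEqual(3, sink.Frames.Count);
    }

    private class FakeSensorSource : ISensorSource
    {
        private readonly double[] _samples;
        public FakeSensorSource(params double[] samples) { _samples = samples; }
        public IEnumerable<double> ReadSamples() { return _samples; }
    }

    private class RecordingSerialSink : ISerialSink
    {
        public List<byte[]> Frames { get; } = new List<byte[]>();
        public void Write(byte[] frame) { Frames.Add(frame); }
    }
}
```

Tests written at this level survive the code being reshaped underneath them, which is what made all that rapid change cheap.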

Demos

Demos are a great practice in my opinion. Not only did they provide me great feedback about what was good and what wasn’t, but they also provided everyone else with a very real status of the software. They also had the benefit of looping in some other developers who aren’t officially on the project. Since I’m the only developer on this one, it gives me some peace of mind to know that there are other people at least somewhat familiar with it. We need to do more of this.

Lean planning meetings

Directly after the demo, the Product Owner and I would take a few minutes to discuss what needed to be done next. We’d enter any new items into the backlog and then prioritize them. As I mentioned earlier, we’re not using Story Points or hourly estimations so these planning sessions go very fast, about 15 minutes at most.

I do keep an Excel workbook to track past performance, though. It queries TFS for Story start and end dates, then performs some calculations to output historic Cycle Times and Throughput. Using this data, I can very quickly let the Product Owner know how many (and which) stories he can expect to see implemented in the next demo. I love this method and plan to blog more about it in the future. There’s little that’s more boring and wasteful than putting a bunch of developers in a room for hours on end to guess at how long things will take.
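The workbook itself is just formulas over a TFS query, but the arithmetic is simple enough to sketch in code. The Story class below is a stand-in for whatever shape your work item query returns: cycle time is finish date minus start date per story, and throughput is how many stories finished inside a given window.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for a row from the TFS query: when a story started and when it finished.
public class Story
{
    public DateTime Started { get; set; }
    public DateTime Finished { get; set; }
}

public static class FlowMetrics
{
    // Cycle time: how long each story took from start to finish.
    public static IEnumerable<TimeSpan> CycleTimes(IEnumerable<Story> stories)
    {
        return stories.Select(s => s.Finished - s.Started);
    }

    // Throughput: how many stories finished inside a window (e.g. the last two weeks).
    public static int Throughput(IEnumerable<Story> stories, DateTime windowStart, DateTime windowEnd)
    {
        return stories.Count(s => s.Finished >= windowStart && s.Finished < windowEnd);
    }
}
```

Feed a few months of history through that and “historically, I can complete 3 stories every 2 weeks” stops being a guess and starts being a measurement.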

Learned a lot about many new (to us) technologies

I’m not sure that I can say this is something we did well, but it is certainly a very Good Thing™. As software engineers, we trade in knowledge. What we do is knowledge work, so learning new things makes us more valuable. It wasn’t just me who learned new things, though. By proxy, so did my company.

What Didn’t?

So, it’s all sunshine and roses, right?

Obviously not, or we wouldn’t have thrown away several months of development. There have been a number of issues along the way, as there are with any project. So what hasn’t gone well?

  • Had to learn many new (to us) technologies.
  • The UWP Test Runner is buggy at best, hindering at worst.
  • There’s not been much of a team.
    • Having a single developer as a single point of failure is a huge risk.
  • Discovered very late in the project that the hardware wasn’t viable.

Had to learn many new (to us) technologies

While it’s great from a long term “invest in the company’s collective knowledge” perspective, it hurts during the short life of a project to have this many unknowns. It’s riskier and slower than using tried and proven technology and techniques.

How to Improve:

Keep the number of unknown technologies to a minimum. Maybe one new technology per project, two at most.

The UWP Test Runner is Buggy at Best, Hindering at Worst

Unit testing seems to be a second-class citizen for UWP applications. The test runner doesn’t fail a test when an exception escapes; it simply blows up and doesn’t run your tests at all. I cannot imagine anything more painful than making a change and, instead of getting a red test or two, having an entire suite of tests simply refuse to run, leaving you with no idea which test actually failed.

The test runner is also very slow. For some reason, UWP tests have a UI that must be launched in order to run the tests, and it slows things down. It’s not an unacceptable slowdown, but it’s enough to hurt and break your train of thought.

Oh, and did I mention that the Assert.Throws<TException>() method doesn’t work right with async/await code?
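One framework-agnostic way around that is the boring one: await the call yourself inside a try/catch and assert on the exception you caught, rather than handing an async lambda to the assertion method. A sketch, with a made-up ThrowingOperationAsync standing in for the real call under test:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class AsyncExceptionAssertionExample
{
    [TestMethod]
    public async Task AwaitedCallThatThrows_CanBeAssertedWithTryCatch()
    {
        Exception caught = null;
        try
        {
            // Stand-in for the real async call under test.
            await ThrowingOperationAsync();
        }
        catch (Exception ex)
        {
            caught = ex;
        }

        // Assert on the exception we actually awaited and caught.
        Assert.IsInstanceOfType(caught, typeof(InvalidOperationException));
    }

    // Hypothetical async operation, only here to make the example self-contained.
    private static async Task ThrowingOperationAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("boom");
    }
}
```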

I don’t want to rant, but if Microsoft wants developers to migrate to UWP, they need to put some effort into giving us the tools we need to properly test our code. As it is today, it’s too broken to be seriously productive with.

How to Improve:

Keep as much code as possible in Portable Class Libraries. PCLs use the standard unit test project, so they don’t suffer from the flaws of the UWP test runner.

There’s Not Much of a Team

While the PO and I are very much a team, and he has the necessary engineering skills, there is a team member who has been totally absent. That’s not his fault; he just hasn’t been directed to be involved, whether or not he has actions to take at the moment. Still, can you really say he’s part of the team if he comes to none of the meetings and has spoken to the software developer only once during the lifetime of the project?

Speaking of software developers, having a single dev on the team is incredibly risky. Having any single point of failure on a team is risky. There needs to be some level of redundancy in case someone gets sick, or simply would like to go on vacation. It also means that no one is reviewing the developer’s work. It’s far too easy for a lone developer to do something (unintentionally) stupid when no one is performing code reviews.

How to Improve:

In the future, there can be no fewer than two developers responsible for a project. Period. If there isn’t enough manpower to dedicate one person full time, and another at least half time, then we don’t have the bandwidth to take on the project, and it must wait until there is. (Yes, I know this is easier said than done, but it’s a good rule.)

Also, demos/planning meetings are not optional for team members. Managers and other stakeholders can be optional, but all the team members need to show up.

Discovered very late in the project that the hardware wasn’t viable

This one… this one is a project killer and I’m trying very hard to be objective about it. After several months of development, we discovered that the sensor isn’t viable for the client’s needs. I was able to salvage about half the code, and a lot of knowledge, but it doesn’t change the fact that we built the wrong software. We built good software, software that works and worked a little better every day than it did the day before. We built software that does one thing and does it well. The problem is that one thing was to talk to a device that we cannot use.

As a software engineer, I had no indication that the device wasn’t viable. I assumed that the device had been thoroughly evaluated and tested before we started building software. That was a disastrous assumption on my part. Instead of using the device’s SDK to test the sensors, the team waited until I had our custom software nearly finished. The Product Owner wasn’t happy with the data that was coming out of my application. I proved to him that it was indeed the correct information, and only then did he do the necessary testing to discover that while it was the correct information, it was not the right information.

How to Improve:

Make no assumptions. When a team member from another part of the business hands you something, ask them about it (politely, of course). If they’re choosing hardware, ask them what kind of evaluation was done. A simple question could bring to light that a team member did not do their due diligence. Just maybe I could have prevented the wasted time and effort if I had only asked about the hardware team’s device evaluation process.

Likewise, be sure to have done your own. Is that framework or library really the right choice, or did you choose it because it’s what you’re most comfortable with? Did you do any research on alternatives? Did you write down why you chose particular technologies? Can you justify your own choices to others? If you can’t, then you’re putting the project at risk.

I’m not saying that we need to make all of these decisions before a project begins, but there’s no reason the hardware portion of this project couldn’t have been treated with the same Spike & Stabilize approach we took to building the software. A few hours of testing the sensors was all it took to reveal that they weren’t going to work. That testing should have been done much earlier in the project. Then we would have realized it early enough to try something else. And something else. And something else, until we found hardware that was going to work for our client.

Failing fast isn’t just for software developers, it’s for the entire team.

Ultimately though, I blame myself for not asking more questions. I blame myself for not doing more to expound the virtues of failing fast, of quickly learning what won’t work. I’m going to improve by talking about those things to anyone willing to listen, as early and often as possible.

Footnotes:

(1): I was using the technique prior to hearing the term. Credit for coining it goes to Dan North and his GOTO 2016 presentation, “Software, Faster”.
