Sometimes, rather than aiming for the grand plan, it pays to make something simple and fun first to show that the idea works.

What Do I Mean?

Anyone starting a data science project is often excited about the potential and will reach for the stars. Before you know it, you have a horrendously ambitious and complicated project and you don’t know where to start. The result is that you never start it, because you never get enough of the “hooks” done.

Note: If you want to get straight to the machine learning project then feel free to skip ahead.


Just as you need a fish to bite the hook to be a successful fisher, the hooks are the things you need to get a bite on before you believe you can make a success of something. So, for example, if you were to make a security camera that only turns on when a person walks past (and not a vehicle, pet, leaf or wildlife), then after a while you will have a list like the following:

  • A night vision camera (because it can’t see with the security light off)
  • An IR light to illuminate the area so the camera can see it
  • A security light
  • A portable device or computer that can process the frames sufficiently fast
  • A special mains voltage rated switch that can be turned on and off using a GPIO pin from the portable device
  • Fixings for it outside, including a waterproof box and power access for the processing device
  • Lots of data recorded on your camera that you can train your model on
  • Tagging all the data so it knows what a person is for training your model

Phew! That’s a long list, and that’s just what I came up with in a few minutes. We have not even discussed which machine learning solution we would use, or the output or control system needed to set it up.

You can see now why someone may never start this, and similar thinking is often what stops companies from investing in data science. Give them a huge estimate of the work needed to deliver their grand vision and they may look sick as a dog and ask you to leave.

Let this sort of thinking carry on too long and you may fall into the Toolbox Fallacy (see below) and never complete any data science projects of your own (at most you copy or follow what someone else has done). The cause of this is:

You aim for too much at once. Too many things depend on other things being ready, so it never happens.

Break your project into the key concept you want to test and start there. Everything else is secondary.

Toolbox Fallacy

This is a bit of a side diversion, but it is very relevant here (and often a cause of never starting anything). I came across it recently, and I must say it rings true of the reasons I gave up many hobbies or never completed the things I wanted to.

In essence, it is the belief that you cannot do something until other things have happened. For example, you need X before you can do Y. There’s a really good video (posted here) that explains it well.

Therefore, if you want to be a data scientist you need to do data science and keep at it, or soon you “were” a data scientist.

Believe it or not, I bound books as a hobby for many years, but I have not bound one in years. Can I still consider myself a bookbinder? Maybe not.

Go Back to the Core and Keep it Simple

To go back to our security light example, break it down into the core concept: “Can I control something by what I see on a camera?” To test this core concept you only need some video data (or still pictures) from an open dataset and a vision-based system that classifies objects and, based on where they are on screen, decides whether to turn an LED on or not. You could make it even simpler and just have your algorithm flag in a “.csv” file that for picture X, or at time X in a video, it would have turned something on.
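As a rough illustration of the “.csv” idea, here is a minimal sketch in Python. It assumes some classifier has already produced one detection per frame; the function name `flag_frames`, the dictionary keys (`frame`, `label`, `x`) and the "trigger zone" in the middle of the picture are all hypothetical choices of mine, not part of any particular library.

```python
import csv
import io


def flag_frames(detections, trigger_label="person", zone=(0.25, 0.75)):
    """Return CSV text recording, per detection, whether the light would switch on.

    `detections` is a list of dicts with hypothetical keys:
    frame (int), label (str) and x (horizontal centre of the object, 0.0 to 1.0).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["frame", "label", "x", "would_switch_on"])
    for det in detections:
        # Only react to the chosen label when it is inside the trigger zone.
        in_zone = zone[0] <= det["x"] <= zone[1]
        switch_on = det["label"] == trigger_label and in_zone
        writer.writerow([det["frame"], det["label"], det["x"], int(switch_on)])
    return buffer.getvalue()


# Made-up classifier output, purely for illustration.
example = [
    {"frame": 1, "label": "person", "x": 0.5},
    {"frame": 2, "label": "cat", "x": 0.5},
    {"frame": 3, "label": "person", "x": 0.9},
]
print(flag_frames(example))
```

Only frame 1 is flagged here: the cat has the wrong label and the second person is outside the zone. Swapping the CSV row for a GPIO call later is then a small step rather than a prerequisite.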

This only requires:

  • A free set of data
  • A computer (your main data science one will do)
  • The core algorithm (you could build upon others’ work or research, or build your own)
  • Maybe an LED to turn on or off

This is a lot more workable and easier to plan and segment into work. Try from this point and soon…

Voila! You’ve done some data science!

Sure, it won’t work at night or survive weather outside, but you’ve made the core part and the other parts are bits you can add later. Without this core piece of functionality the other parts are pointless.

Congratulations, you’ve made your first “Proof of Concept” (PoC).

Now let’s talk about my mini project looking at fan whispering (as in being good at listening to fans to determine what ails them). It’s a nice little project that covers the core question of “is it possible?”, keeps things simple and cost me very little. I’m now excited to expand it and fill it out to cover the areas I know it is weak in.

Fan Whispering?

Ever complained about a broken domestic appliance and had the customer service department ask you to hold the device against the phone so they can listen? Or has your car ever been playing up and the mechanic could diagnose what was going wrong without lifting the bonnet? In both these cases they are listening for characteristic noises that relate back to a certain fault.

The largest source of failures in motors and fans is an issue with the bearings (e.g. lack of lubrication). This often causes the sound profile of the motor or fan to change (often a characteristic whine or grind). Detecting this by listening is actually something many organisations are interested in (here and here).

But why is listening useful? Well, in the case of the first example, the manufacturer can quickly diagnose certain faults without the device needing to come in or a service engineer needing to go out. This saves cost, as call-outs can be expensive in time and money. The other advantage is that if you know a motor on a domestic appliance has failed or is going to fail soon, you can make sure the service engineer is carrying the right part (very important for manufacturers with large and diverse product portfolios), again saving costs.

The Core Project Question

For a person it can take many years to be able to recognise what noises from machinery mean, but how easily can you train machine learning to do it?

This is the guiding question that I want to answer with my PoC, so I will focus on diagnosing faults in an electric fan, as one is easy to acquire and to induce faults in safely.

I purchased a fan online and then subjected it to several (reversible, I’m not made of money) faults, which were:

  • Normal: The fan running in its normal state with no fault (when I refer to faults or classes I do include this in there)
  • Plastic Film Obstruction: A piece of plastic was introduced into the path of the rotating fins (think of the card you put through the spokes of your bicycle as a child)
  • Badly Weighted: A weight was attached to one of the fins to unbalance it
  • Catching Sides: A small extension was added to one of the fins, so it starts to rub on the inside of the fan housing

Now, all these faults sound trivial, but a hard failure is often preceded by a small annoyance that worsens until the failure point. This should illustrate that if a minor inconvenience can be detected, a much more disruptive one can possibly be avoided completely by early intervention.

I then set about recording ten-second clips of these faults at the three available fan speeds (slow, medium and fast) to add variability to the noise levels and profiles. I used my phone to approximate the quality of a recording device similar to my first example (a customer holding the device to the phone so customer services can listen to it). Each combination of fault class and fan speed was recorded five times, yielding a sample size of 60 (4 classes × 3 speeds × 5 repeats). For each repeat of a failure mode I would remove the fault and re-apply it, so there would be a little variability introduced to the recordings.

I chose five repeats so I could train on three of them and the remaining two would be for the test set (36 train and 24 test files).
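The arithmetic behind that split can be checked with a few lines of Python. This is only a sketch: which three repeats go into the training set is my assumption (the first three here), and the tuple layout is just for illustration.

```python
from itertools import product

FAULTS = ["Normal", "Plastic Film Obstruction", "Badly Weighted", "Catching Sides"]
SPEEDS = ["slow", "medium", "fast"]
REPEATS = [1, 2, 3, 4, 5]
TRAIN_REPEATS = {1, 2, 3}  # assumption: repeats 1-3 train, 4-5 test

# One (fault, speed, repeat) tuple per recorded clip.
recordings = list(product(FAULTS, SPEEDS, REPEATS))
train = [r for r in recordings if r[2] in TRAIN_REPEATS]
test = [r for r in recordings if r[2] not in TRAIN_REPEATS]

print(len(recordings), len(train), len(test))  # 60 36 24
```

Splitting by repeat (rather than at random) means every fault and speed combination appears in both the training and the test set.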

It was then a case of:

  • Performing a Fast Fourier Transform to convert the sound into a frequency spectrum a model can be trained on
  • Smoothing the resulting spectra to reduce noise by using a rolling average about 0.1 kHz wide
  • Taking only the data points from 1 to 18 kHz in 0.01 kHz steps, to standardise the dataset and reduce its size
  • Training a Random Forest Classifier to see if it could distinguish between the four classes (remember I was also varying the fan speed, so hopefully the classifier won’t be affected by that)
  • Running the test set through the model to see how well it does on unseen data
  • Displaying the predictions and actual values in a confusion matrix, which plots actual (true) values against the predicted values. If the model is perfect, all the values lie on the diagonal; any deviation shows which classes it is mislabelling and which group they should be in. This is often a very powerful debugging tool in classification problems
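The data preparation steps above could be sketched roughly as follows with NumPy. This is not my original script. The function name `spectrum_features` and its parameters are made up for illustration, the rolling average is a plain boxcar, and I assume the recording's sample rate is at least 36 kHz so the spectrum actually reaches 18 kHz. A synthetic 5 kHz tone stands in for a real recording.

```python
import numpy as np


def spectrum_features(samples, sample_rate, f_lo=1_000.0, f_hi=18_000.0,
                      step=10.0, smooth_hz=100.0):
    """FFT magnitude, smoothed and resampled onto a fixed frequency grid (Hz)."""
    samples = np.asarray(samples, dtype=float)
    magnitude = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(samples.size, d=1.0 / sample_rate)

    # Rolling average roughly `smooth_hz` wide (0.1 kHz by default).
    bin_width = freqs[1] - freqs[0]
    window = max(1, int(round(smooth_hz / bin_width)))
    smoothed = np.convolve(magnitude, np.ones(window) / window, mode="same")

    # Keep 1 to 18 kHz in 0.01 kHz (10 Hz) steps so every clip has equal length.
    grid = np.arange(f_lo, f_hi + step / 2, step)
    return grid, np.interp(grid, freqs, smoothed)


# Stand-in signal: one second of a pure 5 kHz tone at 44.1 kHz.
sample_rate = 44_100
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 5_000 * t)

grid, features = spectrum_features(tone, sample_rate)
print(features.shape)              # (1701,)
print(grid[np.argmax(features)])   # close to 5000 Hz
```

Every clip then becomes a row of 1701 numbers, whatever its original length, which is the form a classifier needs.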

Ta Da!

Looking at the results below, they are not too bad, considering how basic the model is and that we aren’t doing anything sophisticated with the noise (such as background noise subtraction).
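For anyone wanting to try the train-and-evaluate step themselves, here is a minimal sketch using scikit-learn's `RandomForestClassifier` and `confusion_matrix`. The data is synthetic: `fake_spectra` is a hypothetical helper that gives each class a bump in its own band, so the numbers it produces say nothing about my real recordings, and the hyperparameters are assumptions rather than the ones I used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

CLASSES = ["Normal", "Plastic Film Obstruction", "Badly Weighted", "Catching Sides"]
rng = np.random.default_rng(0)


def fake_spectra(n_per_class, n_features=50):
    """Synthetic stand-in spectra: noise plus a class-specific bump."""
    X, y = [], []
    for idx, name in enumerate(CLASSES):
        spectra = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
        spectra[:, 10 * idx + 5: 10 * idx + 9] += 5.0
        X.append(spectra)
        y += [name] * n_per_class
    return np.vstack(X), np.array(y)


X_train, y_train = fake_spectra(9)  # 4 classes x 9 = 36 training spectra
X_test, y_test = fake_spectra(6)    # 4 classes x 6 = 24 test spectra

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns the predicted ones.
cm = confusion_matrix(y_test, y_pred, labels=CLASSES)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(f"accuracy = {accuracy:.3f}")
```

The accuracy is simply the diagonal of the matrix divided by the total, so with 24 test clips a single mislabelled clip gives 23/24, or about 95.8%.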

But wait, there is more!

One of the interesting things about a random forest is the ability to query it about which features it is using to make its decisions. For us these are the frequencies that the model is using to assign classes, which is doubly interesting because you can identify which frequencies matter more for identifying some classes than others. (Note: This feature importance is a bit roughly done but conveys the main point.)
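In scikit-learn this query is the `feature_importances_` attribute of a fitted forest. The sketch below uses synthetic two-class data on a coarse, hypothetical frequency grid, where I deliberately plant the signal at 3 kHz and 11 kHz. It shows the overall importances only, not a per-class breakdown, and it is not the exact procedure I used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
freqs_khz = np.linspace(1.0, 18.0, 35)  # hypothetical grid in 0.5 kHz steps
informative = [4, 20]                    # columns for 3.0 kHz and 11.0 kHz

# Pure noise everywhere, then shift class 1 at the two planted frequencies.
X = rng.normal(0.0, 1.0, size=(80, freqs_khz.size))
y = np.repeat([0, 1], 40)
for col in informative:
    X[y == 1, col] += 4.0

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One importance per feature (frequency); they sum to 1.
importances = model.feature_importances_
top = np.argsort(importances)[::-1][:2]
for i in top:
    print(f"{freqs_khz[i]:.1f} kHz  importance {importances[i]:.3f}")
```

The two planted frequencies should come out on top, which is the same idea as the dotted lines marking important frequencies on a spectrum plot.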

To display these importances, I plotted a graph containing an example of each class and drew a vertical dotted line at each frequency that was classed as important. This gave:

You can now see why the model could easily tell the classes apart: their spectra have quite different shapes and powers. You can also see which frequencies are important for each class (in the figure above):

  • “Catching Sides” (red): Lower Frequencies
  • “Normal” (blue): Middle Frequencies
  • “Badly Weighted” (green): Middle and High Frequencies
  • “Plastic Film Obstruction” (yellow): Frequencies spanning the whole available range

Take Away

Using simple techniques, it is possible not only to train a machine learning algorithm to distinguish each fault class almost perfectly (95.8% accuracy), but also to determine which frequencies are important for classifying each fault.

The weaknesses of this example are that only one fan was used, the background noise was minimised and kept consistent, and the analysis/data preparation was basic. A more rigorous model would cover these areas and could be developed into a fun little demonstrator (maybe when I have more spare time).

I hope you enjoyed a short foray into using machine learning to solve practical problems!