Machine Learning For Small Data

Machine learning has risen to prominence in part because of the rise of big data.  With more data available, techniques like deep learning have made tremendous progress.  But what about small data sets?  There are many more small data sets in the world than large ones, and humans seem particularly good at learning from them.  In fact, large data sets tend to overwhelm us.  So the question I want to briefly address in this post is: will there be a movement toward machine learning for small data, and if so, what can we expect from it?

I had this conversation recently with the former head of machine learning for Youtube, who pointed out that part of the reason for the lack of progress in this area is that the companies doing most of the practical machine learning research have no incentive to investigate small data machine learning.  Their large data sets are an advantage, and good business means playing to your strengths.  The only thing Google, Facebook, and Amazon would gain from solving small data problems is a bunch of competitors.

I recently got a new Android phone with a great fingerprint unlock function.  To activate it, I had to "train" the program by repeatedly applying my finger to the sensor.  The program actively encouraged me to move my finger slightly left, right, up, and down on each press.  I don't know the algorithm behind this process, but I imagine it was some variation of a machine learning algorithm.  The interesting thing is that it was learning from a single data set, unique to me, that I was quickly generating.  The whole process took less than one minute.
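I don't know what my phone actually runs under the hood, but the enrollment flow resembles simple template matching: average a handful of samples into a template, then accept anything close enough to it.  Here is a toy sketch of that idea (the feature vectors, dimensions, and threshold are all invented for illustration; real fingerprint systems use far more sophisticated minutiae matching):

```python
import math

def enroll(samples):
    """Average a few enrollment samples into a single template vector."""
    n, dim = len(samples), len(samples[0])
    return [sum(s[i] for s in samples) / n for i in range(dim)]

def matches(template, sample, threshold=1.0):
    """Accept the sample if it lies within `threshold` of the template."""
    return math.dist(template, sample) <= threshold

# Five noisy "presses" of the same (hypothetical) finger:
presses = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.05, 2.0], [0.95, 1.95]]
template = enroll(presses)

print(matches(template, [1.0, 2.0]))   # genuine attempt -> True
print(matches(template, [5.0, 5.0]))   # impostor attempt -> False
```

Five samples are enough here precisely because the task is so narrow: the model only ever has to recognize one finger, not generalize across a population.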

While not much has been written about machine learning for small data, I can imagine cases where it is useful: most of them related to the activity of a single user.  For example, if the goal of a program is to anticipate my morning routine, it should be able to learn after just a few days of observing that routine.  It doesn't need 1000 days.  It doesn't need the morning routines of others.  But such a program would have no good way to predict the rare occasions when I deviate significantly from my routine, or the reasons for those deviations.
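To make this concrete, a few days of observations can already support a useful predictor if the model is as simple as per-step frequency counts.  A minimal sketch (the actions and day logs are hypothetical):

```python
from collections import Counter, defaultdict

# Three observed mornings, each a sequence of actions:
days = [
    ["alarm", "coffee", "shower", "commute"],
    ["alarm", "coffee", "shower", "commute"],
    ["alarm", "shower", "coffee", "commute"],
]

# Count which action occurred at each step of the morning.
counts = defaultdict(Counter)
for day in days:
    for step, action in enumerate(day):
        counts[step][action] += 1

def predict(step):
    """Most frequent action observed at this step of the morning."""
    return counts[step].most_common(1)[0][0]

print([predict(s) for s in range(4)])
# -> ['alarm', 'coffee', 'shower', 'commute']
```

Note what this model cannot do: with no data on deviations, it has nothing to say about the mornings that break the pattern, which is exactly the limitation described above.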

So what can we expect from small data machine learning?  I expect it to arise from the areas of A.I. that interact most with humans.  Such products will most likely be targeted at a single human user, and will therefore need to learn more about that user.  The data set of the owner's actions, which will be small by the standards of machine learning data sets, will actually be the most useful data set for the job.  The impact this will have on machine learning algorithms is a move away from probabilistic methods toward greater situational specificity.  We will see algorithms that are essentially collections of specific rules, built on decision-tree-like structures, but highly adaptive.
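One way to picture such an "adaptive collection of specific rules" is a rule list keyed on observed context, where a single contradicting example is enough to install or overwrite a rule.  This sketch is my own interpretation of the idea, with invented contexts and actions:

```python
class RuleList:
    """A collection of context-specific rules that adapts on every mistake."""

    def __init__(self, default="unknown"):
        self.rules = {}          # context -> action
        self.default = default

    def predict(self, context):
        return self.rules.get(context, self.default)

    def update(self, context, actual):
        # Adapt immediately: one example is enough to install a rule.
        if self.predict(context) != actual:
            self.rules[context] = actual

model = RuleList()
stream = [("monday 7am", "coffee"),
          ("monday 7am", "coffee"),
          ("saturday 9am", "pancakes")]
for context, action in stream:
    model.update(context, action)

print(model.predict("monday 7am"))    # -> 'coffee'
print(model.predict("saturday 9am"))  # -> 'pancakes'
```

Unlike a probabilistic model, this makes no attempt to smooth or generalize; every rule is situationally specific, which is precisely the trade-off the paragraph above describes.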

With small data, opportunities for automatic feature engineering from tools like deep learning won't be available, so what will be?  I believe we will see a resurgence of the old idea of analogy making as key to learning.  Symbolic ontologies that can deal with multiple data types will be used to make educated guesses about what kind of structure might work for a given small data set.
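A very rough sketch of what analogy-driven structure guessing could look like: describe a new small data set by a coarse type signature, and borrow the structure that worked for the most similar previously seen data set.  Everything here, the signatures, the archetype table, and the model names, is invented purely to illustrate the idea:

```python
# Hypothetical table of previously seen data-set signatures and the
# model structure that worked for each.
known = {
    ("numeric", "numeric", "label"): "decision tree",
    ("text", "label"):               "naive Bayes",
    ("timestamp", "category"):       "sequence rules",
}

def guess_structure(columns):
    """Guess a model structure for a data set by analogy to known ones."""
    sig = tuple(columns)
    if sig in known:
        return known[sig]
    # Fall back to the archetype sharing the most column types
    # (the "analogy" step).
    best = max(known, key=lambda k: len(set(k) & set(sig)))
    return known[best]

print(guess_structure(["numeric", "numeric", "label"]))  # -> 'decision tree'
print(guess_structure(["text", "timestamp", "label"]))   # -> 'naive Bayes'
```

The interesting part is the fallback: an exact match isn't required, only a sufficiently similar prior case, which is the essence of reasoning by analogy.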

I haven't seen much work done in this space, so if you have ideas, or have seen projects or papers on machine learning for small data sets, please send them my way.