When marginal returns on data collapse, and when they do not

In my previous blogpost on the Rumpelstiltskin AI fallacy, I explained that "it's stats 101 that how much data you need depends on the amount of variation in the population you study", and that "the principal utility of large amounts of data on humans is surveillance." This could be interpreted as implying that a merger between corporations holding datasets of very different sizes does not increase the larger company's market power, at least from an AI-capability perspective. But that interpretation would be wrong.

Data might display diminishing returns from an "improving your AI algorithm" perspective, while displaying stable or increasing marginal returns from a surveillance perspective.

This is true whether the surveillance is political or economic (as in surveillance capitalism), and whether it is benevolent (for the benefit of the individual surveilled) or exploitative / autocratic. Just to be confusing, some applications or organisations could easily be pursuing any or all of these goals simultaneously. For example, a company or government might mostly use a set of data to provide a service that helps people keep healthy, but then also use it to recognise when people illegally cross a border.

One-time uses of new data sources

If you get even a small pile of new data, it might give you entirely new insights into your old data. It may allow you to refine, or even just test, your models from another perspective. The process of using multiple diverse sources of data about the same thing is called "triangulation", as in navigation. The analogy is that if you want to know exactly how far away something is without getting closer to it, the further you move sideways, the more information you get for triangulating the distant object's true location.

So say you have a lot of information about what people "like" on Facebook. You could get a relatively small amount of personality-survey data for just some of those people, and then use machine learning (or some other standard statistical approach) to figure out how to infer the same information (personality profiles) from the data you already had.

Once you have a new model for interpreting your existing big data, the value of the small data you built the model with collapses: you can throw it away. But the value of your bigger data is permanently enhanced. You now know something new you can derive from all of your data.
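To make that concrete, here is a minimal sketch of the "small labelled sample, then apply the model to the big data" pattern. Everything in it is hypothetical and simulated – the variable names, the sizes, and the choice of ridge regression are mine, not any particular company's pipeline – it is only meant to show the shape of the workflow.

```python
# Hypothetical sketch: learn from a small labelled sample, apply to big data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# The big data you already hold: "likes" for many users (simulated here).
n_users, n_likes = 50_000, 200
likes = rng.integers(0, 2, size=(n_users, n_likes)).astype(float)

# The small, expensive data: personality scores for a few thousand of those
# users, e.g. from a survey. (Simulated here as a noisy function of likes.)
true_weights = rng.normal(size=n_likes)
survey_idx = rng.choice(n_users, size=2_000, replace=False)
personality = likes[survey_idx] @ true_weights + rng.normal(scale=3.0, size=2_000)

# Fit a simple model on the small labelled subset...
model = Ridge(alpha=1.0).fit(likes[survey_idx], personality)
print("cross-validated R^2:", cross_val_score(model, likes[survey_idx], personality).mean())

# ...then apply it to everyone. The survey data can now be discarded; the new
# capability lives in the model plus the big dataset you already had.
predicted_personality = model.predict(likes)
```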

Let's take another example. Say you already have search, mapping, and email data on a large number of people. Of course, you may not have data from all of these sources on all of the people you know about – you may only have heard of some people indirectly, for example as addressees in other people's email. What if you now acquired fitness and activity data on some of them?

That new data is probably, in some sense, redundant with the data you already have, but you couldn't have been sure of that before you got it. Now that you have fitness-specific data, you can test your theories about how well your old data predicted the wellbeing, fitness, compliance with lockdowns, and all kinds of other things, of your existing customers.

There may be only a few customers who appear in both your new and your old data sets, but if they turn out to be a reasonably representative sample, then you have now massively improved your understanding of all of your data, for all the people you already knew about, whether or not they were in the new data set.
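As a rough illustration (again with made-up names and simulated numbers), the overlap between the two datasets lets you measure how well a quantity you can already derive for everyone predicts the newly acquired measurement, and the fitted relationship can then be applied to everyone in the original dataset:

```python
# Hypothetical sketch: a small overlap calibrates the whole big dataset.
import numpy as np

rng = np.random.default_rng(1)

n_all = 200_000                         # everyone in your existing datasets
old_score = rng.normal(size=n_all)      # e.g. an activity proxy from search/maps data

# Only a small overlapping subset also appears in the new fitness data.
overlap = rng.choice(n_all, size=1_500, replace=False)
fitness = 2.0 * old_score[overlap] + rng.normal(scale=1.0, size=1_500)  # simulated

# How well does the old data predict the new measurement, on the overlap?
r = np.corrcoef(old_score[overlap], fitness)[0, 1]
slope, intercept = np.polyfit(old_score[overlap], fitness, 1)
print(f"correlation on overlap: {r:.2f}")

# If the overlap is representative, the fitted relationship extends to
# everyone you already knew about, not just the people in the new data.
estimated_fitness_all = slope * old_score + intercept
```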

So the new data has tremendous value when you first acquire it, not only because you may be able to use it directly, but more importantly because it enormously increases the value of your original, larger dataset by improving your understanding of what that data represents. The value of the newly acquired dataset may then collapse, because you can now mine the same information from your original dataset. Or, if the new data stream keeps updating, it may continue to hold some marginal value, to the extent that the correlations you discovered between it and your more established models drift over time. In that case, being able to keep updating your understanding has ongoing value.

So is there really a Rumpelstiltskin fallacy, or is more data always better?

The Rumpelstiltskin AI fallacy describes a pernicious belief I've heard falsely promulgated by some leading machine learning experts: that you need to keep all data forever, or someone else with more data will build better AI than you can. This assumes that machine learning is a Rumpelstiltskin-like process: the bigger the stack of straw you hand him, the bigger the pile of gold he gives you back the next day.

If we're talking about a metaphor here, where the gold is meant to represent something like AGI, then this is a fallacy. You won't get something more human-like by training it on a billion people rather than a million, or 10,000, if what you are looking for is predicting things like where the eyes are on a face or what colours hair comes in. For any of those sample sizes, you can get a good or a bad model of those kinds of things, depending on the diversity of the people you drew for your sample population. Quite a lot of useful science is done with just a handful of people, each pushing buttons in quite a lot of contexts for quite some time.
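A crude way to see the stats-101 point about diminishing returns: for estimating a population characteristic with standard deviation σ, the sampling error of a representative sample of size n shrinks only with the square root of n. The numbers below are just an illustration of that scaling, not an empirical claim about any particular trait.

```latex
% Standard error of a sample mean for a trait with standard deviation \sigma:
\[
  \mathrm{SE}(\bar{x}) \;=\; \frac{\sigma}{\sqrt{n}},
  \qquad
  \frac{\mathrm{SE}_{\,n=10^{9}}}{\mathrm{SE}_{\,n=10^{4}}}
  \;=\; \sqrt{\frac{10^{4}}{10^{9}}}
  \;\approx\; \frac{1}{316}.
\]
% Going from ten thousand people to a billion buys roughly a 300-fold
% reduction in sampling error -- useful, but nothing qualitatively new,
% and easily swamped by how representative (diverse) the sample actually is.
```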

But if we are talking about literal gold, well, you can make a lot of money through surveillance capitalism, and the more people you subject to it, or the more you know about them, the better you can sell them things. Except, of course, if you sell them things that they don't really need or that are ecologically unsustainable – in that case you might create an unequal or otherwise unstable society that might one day collapse, taking all your money with it. You should think about that if what you want is a healthy economy.

Or you could use data to conduct political surveillance, and either jail people who disagree with you, or limit the earning, political, or other life potential of those who might prove politically problematic to you in the future. You might even be able to convince a lot of people that they want you to do these things, and be able to realise total warfare – a fully mobilised society. I doubt it – when people are stressed they tend to polarise, so society would probably fracture out from under you at some point – but you might be able to maintain totalitarian control of large amounts of gold for the rest of your (natural) lifespan. A few people have managed this. Very few.

Summary

  1. For controlling or exploiting populations, more data is always better.
  2. For scientific discovery, innovating new processes, double-checking understanding, creating new forms of machine intelligence, etc., very small amounts of data are often all you need. That data can generally be disposed of after the innovation or discovery. In many cases, even if you did (for example) want to track changes in society, you could just grab that small (or even very large) amount of data again, do the recalibration, then throw the data away again, protecting privacy.
  3. Small amounts of data used as per 2 can substantially increase the value of big data used as per 1, which includes potentially making that big data much more dangerous to hold (see figure).

Photo from the article: "Fitness tracking app Strava gives away location of secret US army bases", by Alex Hern, The Guardian, 28 January 2018



