"Big data" is not a win: the Rumpelstiltskin AI (Rumpelstilzchen KI) fallacy and manifesto

  1. As scientists, we want to communicate that privacy is essential to security, innovation, dignity, and flourishing. To do that, we must stop celebrating how big our data is.
  2. The Rumpelstiltskin (originally, Rumpelstilzchen) theory of AI is just wrong. You do not automatically get more or better intelligence in proportion to the amount of data you use. Even where you use machine learning to build your AI (which is certainly not always the case), it is basic statistics 101 that how much data you need depends on the variation in the population you are studying (see the sketch after this list).
  3. The principal utility of large amounts of data on humans is surveillance. If you want to manipulate a population politically or economically (to police or to upsell), then the marginal returns on data remain stable rather than diminishing.
  4. Even for applications where a lot of data might be useful, storing it is still a hazard. It may be abused by present or future owners, or by hackers.
  5. Data should not be routinely retained and stored without good reason. Where there is good reason, it must be stored with the highest standards of cybersecurity.
  6. We need both proactive and responsive systems for detecting and prosecuting the use of inappropriately retained data.
  7. We need to stop calling for projects like "Big data and [policy problem X]" and start calling for projects like "Data-led validation of [policy solution X]", so that we stop communicating to politicians that indiscriminately gathering and retaining data is ever a good thing.
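To make point 2 concrete, here is a minimal sketch (our own illustration with made-up numbers, not something from the manifesto itself) of the statistics-101 sample-size calculation: to estimate a population mean to within a margin of error E at roughly 95% confidence, the number of observations you need is driven by the population's variation (sigma), not by how many records you could hoard.

```python
# Sketch: how many observations are needed to estimate a population mean
# to within margin_of_error at ~95% confidence. The answer depends on the
# population's spread (sigma), not on the size of any hoarded dataset.
import math

def required_sample_size(sigma: float, margin_of_error: float, z: float = 1.96) -> int:
    """Classic formula n = (z * sigma / E)^2, rounded up."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

# A low-variation population needs only a few dozen observations...
print(required_sample_size(sigma=2.0, margin_of_error=0.5))   # 62
# ...a high-variation one needs more, but still nothing like "all the data".
print(required_sample_size(sigma=20.0, margin_of_error=0.5))  # 6147
```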
AI is neither a fairy tale nor alchemy. In this fallacious metaphor, the data is the straw, the AI is the gold, and machine learning is Rumpelstilzchen (even though ML is itself actually a kind of AI; sorry if that's confusing). There is also no princess or king here.


Update May 2020: If you want a less brief version of the above, Will Lowe on Twitter directed us to Xiao-Li Meng, "Statistical paradises and paradoxes in big data". Basically, if you cannot be careful about how you subsample data, you need essentially ALL the data to get the right answer. That is seldom possible, so it is better to be careful about how you subsample.
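The toy simulation below is our own hedged illustration of Meng's point, not code from the paper; the numbers and the self-selection mechanism are invented for the example. A dataset that covers roughly half of a simulated population, but whose inclusion is slightly correlated with the value being measured, estimates the population mean worse than a simple random sample of 1,000 people.

```python
# Toy simulation (illustrative assumptions throughout): a huge but biased
# sample versus a small random sample, both estimating a population mean.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)

# Biased "big data": people with higher values are slightly more likely
# to end up in the dataset (a mild self-selection effect).
inclusion_prob = np.clip(0.5 + 0.01 * (population - population.mean()), 0.0, 1.0)
big_biased = population[rng.random(population.size) < inclusion_prob]

# Careful "small data": a simple random sample of 1,000 people.
small_random = rng.choice(population, size=1_000, replace=False)

print(f"true mean:               {population.mean():.2f}")
print(f"biased sample of {big_biased.size:,}: {big_biased.mean():.2f}")
print(f"random sample of 1,000:  {small_random.mean():.2f}")
```

The qualitative picture does not depend on the seed: the biased estimate stays off by roughly the same amount no matter how large the dataset grows, while the small random sample's error shrinks as 1/sqrt(n).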

Update March 2021: fixed points 3 & 4 to clarify why you may still need to worry about data mergers. See also this new explainer: when marginal returns on data collapse, and when they do not.
