Grokking data

  • When you have been exploring a dataset for a while, studying its distribution, its composition, its quirks, its innards and, in the end, its “essence”.
  • When you know the answer to a query even before performing it.
  • When you have a strict hierarchical organization of the folders for plots.
  • When you know by heart the number of unique URLs and usernames in the dataset.
  • When your methodology for naming files according to their schema has become more complex than the schemas themselves.
  • When you have five different versions of the dataset, but you forgot the reason behind four of them.
  • When the size of the scripts to analyze the dataset begins to rival the dataset itself.
  • Ultimately, when the mere thought of putting your hands on that data again gives you urticaria.
That’s when you have grokked the data.
Now imagine doing that on hundreds of gigabytes…

