tl;dr: with daily aggregated data you only get one new row per day. How many days did you need before you learned something cool?
One obvious challenge of quantified self analysis is deriving insights from relatively small quantities of data. In my day job, I regularly mine patterns (using, for example, the Apriori algorithm: https://en.wikipedia.org/wiki/Apriori_algorithm) from datasets consisting of millions of rows. Needless to say, I do not have millions of rows of personal data. In particular, a lot of data sources aggregate at the daily level, so for these I effectively get one new row per day. I’ve been interested in quantified self for a bit over a year, so I have roughly 500 rows of daily aggregated data. One simple question: is this enough?
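To make that concrete, here is a toy version of the kind of pattern search I mean. It is a brute-force stand-in for real Apriori, and the behaviors, column names, and numbers are all invented for illustration:

```python
# Brute-force search over behavior combinations on a daily log.
# All data here is made up; in practice you'd load your own export.
from itertools import combinations
import pandas as pd

# Hypothetical log: one row per day, 0/1 behavior flags plus an outcome.
df = pd.DataFrame({
    "late_shower":  [1, 0, 1, 1, 0] * 100,
    "evening_walk": [0, 1, 1, 0, 1] * 100,
    "hours_slept":  [6.5, 7.8, 7.2, 6.1, 8.0] * 100,
})
behaviors = ["late_shower", "evening_walk"]

# For each combination of behaviors, check how often it occurs (support)
# and the average of the quantity of interest on those days.
for k in (1, 2):
    for combo in combinations(behaviors, k):
        mask = df[list(combo)].all(axis=1)
        if mask.mean() >= 0.2:  # keep only "frequent" patterns, Apriori-style
            print(combo, f"support={mask.mean():.2f}",
                  f"mean_sleep={df.loc[mask, 'hours_slept'].mean():.2f}")
```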
Let’s say that our goal is to find “patterns” - individual behaviors, sequences of behaviors, or combinations of behaviors - that correlate with a “quantity of interest”. Let’s define an “insight” as the discovery of a pattern (like showering at night vs. in the morning) that robustly predicts a quantity of interest (like hours of sleep). A well-known rule of thumb in logistic regression says that you need at least ten events per candidate predictor - the “one in ten rule”: https://en.wikipedia.org/wiki/One_in_ten_rule. At the same time, more recent research suggests that the number of events required depends in a detailed way on the data being studied: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6519266/. Either way, I am curious how many data points people needed in practice before they found something truly useful.
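As a back-of-envelope example of what the one-in-ten rule implies for my situation (the split of good vs. bad nights below is invented):

```python
# One-in-ten rule applied to ~500 daily rows with a binary outcome
# like "slept under 7 hours". The 150 bad nights are a made-up number.
n_days = 500
n_bad_nights = 150                    # "events" = the rarer outcome class
max_predictors = n_bad_nights // 10   # rule of thumb: 10 events per predictor
print(max_predictors)                 # -> 15 candidate patterns, at most
```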
I can imagine a few routes forward given this small-data problem.
- Patience. Just wait until you have more daily rows, or see if there are ways of dredging up more of your own historical data. This seems reasonable in many ways, and is really the only option for objectives like increasing nightly sleep, for which there is naturally only one observation per day.
- More Granular Time Scales. Instead of collecting data once per day on the connection between, say, your diet and your productivity, you could gather that data multiple times per day. This has the appeal that you could potentially run more experiments and interventions during the day to address challenges as they arise (see the first sketch after this list).
- Anonymized comparisons to similar people. It might be possible to cluster people into cohorts based on their behavioral patterns and effectively add their rows to your own for analysis purposes (a toy version of the clustering step appears below). The obvious downside of this approach is that it requires large-scale coordination and engineering effort. Another is that you might actually differ from your cohort in some unrecorded ways.
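For the more granular route, here is the rough row-count arithmetic, plus how intra-day logs could be rolled back up to daily aggregates with pandas. The logging frequencies and the tiny example log are assumptions, not real data:

```python
import pandas as pd

# How many rows a year of logging yields at different granularities
# (assuming you can actually sustain the logging habit, which is the catch).
for label, per_day in [("daily", 1), ("3x per day", 3), ("hourly (waking)", 16)]:
    print(f"{label:>16}: {365 * per_day} rows/year")

# Rolling intra-day logs back up to daily aggregates:
log = pd.DataFrame(
    {"productivity": [3, 4, 2, 5]},
    index=pd.to_datetime(["2024-01-01 09:00", "2024-01-01 13:00",
                          "2024-01-01 17:00", "2024-01-02 10:00"]),
)
print(log.resample("D")["productivity"].mean())  # back to one row per day
```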
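And for the cohort route, a toy sketch of what the clustering step might look like using scikit-learn’s KMeans. Every feature and number here is hypothetical:

```python
# Toy cohort clustering: each row is one person, summarized by a few
# behavioral averages; people in your cluster would "lend" you their rows.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical features: [avg bedtime hour, avg steps / 1000, coffees per day]
people = rng.normal(loc=[23.0, 7.0, 2.0], scale=[1.5, 3.0, 1.0], size=(200, 3))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(people)
me = np.array([[23.5, 6.0, 3.0]])          # my own behavioral summary
my_cohort = kmeans.predict(me)[0]
print(f"cohort {my_cohort} has {(kmeans.labels_ == my_cohort).sum()} similar people")
```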
This topic is a bit funny, because we usually focus on the smallness of quantified self data in the sense of n=1, but that smallness in turn means that we often have relatively few rows. Another way of thinking about this is that companies often do analysis that applies only to that one company, so it’s n=1 in another sense. But at the same time, those companies often have terabytes of data, so they can make statements that, while unique to that company, are still statistically powerful. One of the ironies of quantified self seems to be that the small-scale analysis we do as individuals on our small personal data is actually harder than what happens at larger companies with more data and more resources.
If you are using quantified self data more for dashboarding and overall tracking, then you don’t really face these challenges. I wonder offhand what fraction of the community is focused on dashboarding vs. correlation vs. prediction.
Thanks for reading.