Statistics: How many rows did you need to get a robust pattern?


tl;dr: you only get one new row per day for daily aggregated data, so how many days did you need to learn something cool?

One obvious challenge of quantified self analysis is deriving insights from relatively small quantities of data. In my day job, I find patterns in datasets that regularly consist of millions of rows. Needless to say, I do not have millions of rows of personal data. In particular, a lot of data sources aggregate at the daily level, so for these I effectively get one new row per day. I’ve been interested in quantified self for a bit over a year, so I have roughly 500 rows of daily aggregated data. One simple question: is this enough?

Let’s say that our goal is to find “patterns” - “individual behaviors”, “sequences of behaviors” or “combinations of behaviors” that correlate with a “quantity of interest”. Let’s define an “insight” as the discovery of a pattern (like showering at night vs. morning) that robustly predicts a quantity of interest (like hours of sleep). A well-known rule of thumb in logistic regression analysis claims that you need at least 10 events per pattern instance. At the same time, more recent research has suggested that the number of events required depends in a detailed way on the data being studied. Either way, I am curious about how many data points people needed in practice before they found something that was truly useful.
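To make the rule of thumb concrete, here is a minimal sketch of the arithmetic. It assumes the rule is applied to the rarer of the two outcome classes, which is how it is usually stated for logistic regression. The function name and the example counts (100 poor-sleep nights out of 500 days) are made up for illustration, not taken from anyone's data.

```python
def max_patterns_supported(n_events, n_non_events, events_per_pattern=10):
    """Rough upper bound on how many patterns a dataset can support
    under a "10 events per pattern" style rule of thumb.

    n_events / n_non_events are counts of days with and without the
    outcome of interest (hypothetical example: poor-sleep nights).
    """
    # The limiting quantity is the rarer outcome, not the total row count.
    limiting = min(n_events, n_non_events)
    return limiting // events_per_pattern


# Hypothetical example: 500 daily rows, 100 of them poor-sleep nights.
print(max_patterns_supported(100, 400))
```

The point of the sketch is that 500 rows can shrink quickly: if the outcome you care about is rare, the number of patterns you can test at once is set by how often it happened, not by how long you have been tracking.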

I imagine a few routes forward given this small data problem.

  1. Patience. Just wait until you have more daily rows, or see if there are any ways of dredging up more of your own historical data. This seems reasonable in many ways, and is really the only solution for objectives like increasing nightly sleep, for which there is naturally only one observation per day.
  2. More Granular Time Scales. Instead of collecting data once per day on the connection between your diet and productivity, you could instead gather that data multiple times per day. This has the appeal that the user can potentially apply more experiments and interventions during the day to address challenges as they arise.
  3. Anonymized comparisons to similar people. It might be possible to cluster people together into cohorts based on their behavioral patterns and effectively add their rows to your own for analysis purposes. The obvious downside of this approach is that it requires large scale coordination and engineering effort. Another downside is that you might actually be distinct in some unrecorded ways.

This topic is a bit funny, because we usually focus on the smallness of quantified self data in the sense of n=1, but that smallness in turn means that we often only have relatively few rows. Another way of thinking about this is that companies often do analysis that applies only to that one company, so it’s n=1 in another sense. But at the same time, those companies often have terabytes of data, so they can make statements that, while unique to that company, are still statistically powerful. One of the ironies of quantified self seems to be that the small scale analysis that we do as individuals on our small personal data is actually harder than what happens at larger companies with more data and more resources.

If you are using quantified self data more for dashboarding and overall tracking, then you don’t really face these challenges. I wonder offhand what fraction of the community is focused on dashboarding vs. correlation vs. prediction.

Thanks for reading.

This is an interesting set of questions - thanks for posting. I also have experience of analysing large datasets as part of my day job as an academic sociologist.
As you imply, in these types of QS analyses we don’t really have an N of 1 if we consider N to be the number of time points (usually days) over which we collect our data. We can increase N if it makes sense to collect data on some topics more than once a day.
Another thought is that you mention logistic regression as an approach to analysis - I agree this is the most obvious multivariate way of looking for patterns and predictions within the data collected.
However, people may also be interested in the work of Charles Ragin (e.g. see his book Redesigning Social Inquiry: Fuzzy Sets and Beyond). In this he explains an alternative way of understanding relations between variables that is based on Boolean algebra and is more appropriate to understanding the links between things in life - I can write more on this if you are interested, but hope it is a useful resource for other members of the QS community.


Thanks Jane, I appreciate the reference to Charles Ragin.

I am particularly interested in applying the techniques of causal inference to my personal data.
In some sense I am a domain expert for my personal data, and can therefore clearly set up a framework ruling out some correlations as meaningless, while focusing on others. One tool for this kind of causal inference analysis is Microsoft’s new dowhy Python package - I’ll be experimenting with it this month, and I’m really excited to see what I learn.

I’m enjoying this discussion and hope it continues. One thing worth considering when planning an analysis is the extra information that is available to you as a SELF tracker. Much of the analytic firepower available from academic and biomedical research experience is needed to get a signal from data whose context is relatively sparse. But when we’re doing self-research, we can adjust observational practice intentionally so that the data is more meaningful. So, for instance, understanding emotion in the context of sleep, for group/biomedical research, probably requires a lot of data to securely establish the sleep condition, as well as a lot of data to secure the emotion rating. (If this is even possible; I’ve become skeptical that emotion ratings are reliable at the group level for evaluating the influence of experiences and activities.) However, in SELF research we can fiddle with the definition of the phenomenon; for instance, we can decide we don’t care as much about biological sleep as we do about “time of getting in bed.” With this change, many fewer rows may be necessary.


If it predicts, it’s enough; cross validation solves your problem. However, if you measure the same state 100 times a second, that is not actually 100 measurements; it’s only one. EDIT: for all causes of variance except instrumental. Similarly, if you try to compare two time series with only one bump each, that is almost only one piece of data.
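One way to put a rough number on the "100 measurements that are really one" point is an effective sample size. The sketch below is not something proposed in this thread: it swaps in a common approximation that assumes the series behaves like an AR(1) process and that you care about estimating its mean, in which case n_eff is roughly n * (1 - r) / (1 + r), with r the lag-1 autocorrelation. The function names are made up for illustration.

```python
def lag1_autocorrelation(xs):
    """Lag-1 autocorrelation of a sequence of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        # A constant series carries no new information between samples.
        return 1.0
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def effective_sample_size(xs):
    """Approximate number of independent observations in xs.

    Assumption: AR(1)-style dependence, so n_eff = n * (1 - r) / (1 + r).
    """
    r = lag1_autocorrelation(xs)
    r = max(r, 0.0)  # conservative: never report more than n rows
    n_eff = len(xs) * (1 - r) / (1 + r)
    return max(1.0, n_eff)  # any non-empty series is at least one observation


# Hypothetical series: a slow drift sampled 100 times vs. a rapidly alternating one.
slow_drift = list(range(100))
alternating = [0, 1] * 50
print(effective_sample_size(slow_drift))
print(effective_sample_size(alternating))
```

A slowly changing signal sampled very often comes out as only a handful of effective observations, while a series whose neighbouring values are unrelated keeps close to its full row count. The same concern applies to cross validation: folds cut from neighbouring time points are not independent tests.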