I have been tracking various aspects of my health daily, scoring each on a 1 to 3 scale. What kinds of statistical tests or functions should I use to identify whether there are relationships between the different columns? I have all the data in Google Docs.
I’d just start with whatever question you find most important right now, mate. You could make a simple graph comparing that one metric against other markers one by one, for example. If you check out @michaelforrest, he’s uploaded some cool videos to YouTube comparing two metrics together and discussing implications and meaning.
I’m not a statistician, and I don’t play one on the Internet, but you could try uploading your spreadsheet to a free online correlation matrix resource such as http://www.sthda.com/english/rsthda/correlation-matrix.php
It might take a bit of tweaking to your spreadsheet format to get it to work, and there are other similar online solutions out there. If you run into trouble, let me know and I’ll try to help.
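If the online tool is fussy about your format, the same correlation matrix is only a few lines of Python with pandas once you’ve exported the sheet as a CSV. Just a sketch — the column names below are made up, not from your actual spreadsheet:

```python
# Sketch: a correlation matrix from an exported tracking spreadsheet.
# The columns ("sleep", "mood", "energy") and 1-3 scores are invented
# for illustration; with a real export you'd use pd.read_csv("tracking.csv").
import pandas as pd

df = pd.DataFrame({
    "sleep":  [3, 2, 3, 1, 2, 3, 1, 2],
    "mood":   [3, 2, 3, 1, 1, 3, 1, 2],
    "energy": [2, 1, 3, 1, 2, 3, 2, 2],
})

corr = df.corr(method="pearson")  # pairwise Pearson correlations between columns
print(corr.round(2))
```

Every column correlates perfectly with itself (the 1.0 diagonal); the interesting numbers are the off-diagonal pairs.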
Looking forward to reading about your progress,
I’m confused by your comment and the comic. The poster asked for resources to see if there are “relationships between different columns”; I don’t think they are looking for causation? Though maybe I’m misunderstanding (either the poster’s question or the point of the comic you posted). A Pearson correlation is a measure of the linear relationship between two variables, no? Wouldn’t that be appropriate? If not, what would be?
For example, I find some strong correlations in my Fitbit data. Not surprisingly, my sleep score and time asleep rise and fall in step. Slightly more interestingly, my historic weight and resting heart rate have a moderate correlation. Adding in RescueTime data, there are varying degrees of correlation between my daily Productivity Pulse at the end of the day and my Fitbit Sleep Score from the morning, among others. I’m not saying that changes in one variable cause the other variable to change – only that there seems to be a relationship.
Ejain, from your other posts here, I know your math/statistics skills are way above mine, so if you can explain your comment a bit more to help me understand, I’d really appreciate it. And if you do have any suggestions to help the poster (and me) with their question, I’d be grateful for that too.
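For what it’s worth, here’s how I check a single pair like sleep score vs. productivity in Python — just a sketch with invented numbers, not my real Fitbit/RescueTime export:

```python
# Sketch: one Pearson correlation with its p-value.
# The sleep-score and productivity numbers are made up for illustration.
from scipy.stats import pearsonr

sleep_score  = [70, 75, 80, 65, 90, 85, 60]
productivity = [55, 60, 72, 50, 80, 78, 48]

r, p = pearsonr(sleep_score, productivity)
print(f"r = {r:.2f}, p = {p:.3f}")  # r near +1 means a strong linear relationship
```

The p-value gives some sense of whether a correlation that size could plausibly be chance, which matters a lot with only a handful of days of data.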
Nothing wrong with checking for simple correlations; I do so frequently myself. But correlations are often made out to be something they aren’t, hence the disclaimer.
Ok, got it. The comic was a warning not to read too much into info from correlations, and not a suggestion that it’s a bad idea.
Thanks, yeah, correlation is the best approach I’m aware of as well. It worked out quite well.
What did you learn?
Save your data to a file, then load it via https://cran.r-project.org/web/packages/readr/readr.pdf. Then visualize it via https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html. Then get hardcore and study https://cran.r-project.org/web/views/TimeSeries.html or https://feasts.tidyverts.org/. EDIT: I expect time series decomposition, as illustrated in the last link, is the most important.
One statistical test that I have used is a distributed lag model (DLM):
R package: https://rdrr.io/cran/dLagM/man/dlm.html
There is often a delay between cause and effect; e.g., a certain medication/food/exercise doesn’t have any effect until the next day, or a few days later. Because of the time delay, these effects often go unnoticed. A DLM will look back in time and evaluate different lag intervals (0 days, 1 day, 2 days, etc.) to try to find these lagged relationships.
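The linked package is R, but the core idea can be sketched in Python with plain least squares: regress today’s outcome on the predictor at several lags and see which lag carries the weight. Synthetic data here — the “effect” is built to respond to the “cause” two days later, so we know what the model should find:

```python
# Sketch of the distributed-lag idea: regress the outcome on the
# predictor at lags 0, 1 and 2 days using ordinary least squares.
# Synthetic data: the effect responds to the cause two days later.
import numpy as np

rng = np.random.default_rng(1)
n = 200
cause = rng.normal(size=n)
effect = np.empty(n)
effect[:2] = rng.normal(size=2)
effect[2:] = 0.8 * cause[:-2] + rng.normal(0, 0.3, n - 2)  # true lag-2 effect

# Design matrix: one column per lag (rows start at day 2 so all lags exist)
X = np.column_stack([cause[2 - k : n - k] for k in range(3)])
y = effect[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print({f"lag {k}": round(c, 2) for k, c in enumerate(coef)})
# The lag-2 coefficient should dominate, recovering the delayed effect.
```

A same-day-only correlation on this data would find almost nothing, which is exactly the point about delayed effects going unnoticed.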
Just want to add that another reason it’s good to look at lagged relationships is that there is less chance of confusing cause and effect. For example, let’s say your records show that on June 1 you did a lot of stretching and also had increased pain. That could mean different things:
- The pain started first, so you did stretches to relieve the pain.
- You did the stretches in the morning, and that caused your pain to flare up for the rest of the day.
So when you look for correlations, you can end up thinking A causes B when in fact it is the opposite. If you add a lag to your model (whether 1 day, 1 hour, etc.), at least you can be sure that A happened before B, and therefore that B did not cause A.
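You can see this directly by comparing the correlation at lag 0 against lag 1. A sketch with made-up stretching/pain data, constructed so stretching raises pain the *next* day:

```python
# Sketch: same-day vs. next-day correlation between stretching and pain.
# Synthetic data: stretching today raises pain the NEXT day.
import numpy as np

rng = np.random.default_rng(2)
n = 300
stretch = rng.integers(0, 2, size=n).astype(float)          # stretched today? 0/1
pain = np.empty(n)
pain[0] = rng.normal()
pain[1:] = 1.0 * stretch[:-1] + rng.normal(0, 0.5, n - 1)   # next-day pain

def lag_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag]."""
    return np.corrcoef(x[: len(x) - lag] if lag else x, y[lag:])[0, 1]

print("same day :", round(lag_corr(stretch, pain, 0), 2))   # near zero
print("next day :", round(lag_corr(stretch, pain, 1), 2))   # clearly positive
```

Since the strong correlation is between stretching on day t and pain on day t+1, the pain on t+1 can’t be what prompted the stretching on day t.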
You say that either A causes B or B causes A. But it may well be that A and B are unrelated or that both A and B are caused by C.
In the example case, a one-time correlation between pain and stretching doesn’t tell us anything. It could be pure chance. For example, a person could bump their knee against something while stretching. The pain is then not caused by the stretching, but by something that accidentally happened during the stretch. The probability that A and B are somehow related increases when they correlate more often, i.e. when the pain appears every time this person stretches.
But even then, A doesn’t necessarily cause B or vice versa. For example, the pain could be caused by the running that precedes both the stretching and the appearance of the pain. Then both A and B are caused by C. So even when a correlation exists, you still have to interpret your data carefully. Cases where there are no confounding variables are very rare, and often people aren’t aware of all the factors that play into the outcome they observe. For example, the person stretching may suffer from arthritis or some inflammation but not know it yet. So they stop stretching, because they think the stretching is hurting them, when in fact they should see a doctor – and continue stretching!
@242 I agree with your post, although I didn’t really say it must be one or the other; I just mentioned those two possibilities because they directly relate to my point about time lag.