This is a topic for general discussion of experiments in self-research. I created it to provide a place to move more “meta” issues from project logs when they get going. Use at your convenience!
I notice a double meaning in the word experiment as it's used in the forum: Sometimes it means a formal experiment, as @Max_Eastwood is using it, and sometimes it just means trying something, including trying to observe something, as I think @bretbernhoft may be using it. Projects described as experiments in the latter sense can be presented more formally; for instance, it's possible to hypothesize that I would be capable of collecting data of a certain type using a certain instrument, and then execute the experimental activities of deploying the instrumentation, etc. But this formality tends to be more or less empty, so maybe it's better to just accept the double meaning.
I’ve noticed that formal experiments, with blinding, an analysis framework described in advance, etc., arrive very late in QS projects, if they arrive at all. Often, there’s a lot to be learned from just trying to make observations. I actually believe this is typical even of academic science and clinical research: there is an exploratory phase involving observing and tinkering that prepares scientists for planning formal experiments. It’s just that this phase is not often described in any detail in published accounts. Here is a wonderful essay that talks about the hidden dimension of scientific research: Night Science.
That's a good point, but with a few small caveats.
In my opinion, prospective observational studies are good for formulating a thesis or theory, but essentially not enough to prove one and use it in decision making for small/medium effect sizes. Using this approach with n=1 may increase internal validity, because between-subject noise isn't present, but to draw conclusions we should find a large effect size and make sure it isn't just random noise found through multiple comparisons. Also, it's hard to confirm that a correlation is causation just by looking at graphs.
One of the main reasons for an exploratory phase is to find out how big the sample size should be to get enough statistical power/significance for the main experiment, by checking data variability/distribution and making rough effect-size estimates on a small sample. The exploratory phase can't prove or disprove a thesis formulated beforehand, but it can tell us that we would need a very big sample size to find the effect and that we don't have enough resources right now.
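For readers who want to try this, the back-of-the-envelope sample-size calculation can be sketched in a few lines of Python, using the standard normal-approximation formula n ≈ ((z_{1-α/2} + z_power) / d)², where d is a standardized (Cohen's d) effect size. The numbers below are illustrative, not from this thread:

```python
from math import ceil
from statistics import NormalDist

def required_n(effect_size, alpha=0.05, power=0.80):
    """Approximate n for a one-sample (or paired) test via the normal
    approximation: n = ((z_{1-alpha/2} + z_power) / d) ** 2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # ~0.84 for 80% power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)

# A "medium" effect (d = 0.5) needs ~32 observations;
# a "small" effect (d = 0.2) needs ~197.
print(required_n(0.5), required_n(0.2))
```

This is exactly the point made above: a pilot's rough effect-size estimate feeds directly into how many observations the main experiment will need.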
This seems like an especially important point. There are a lot of considerations influencing how much certainty you want and need: How consequential the decision is; how much risk is involved; how easy it is to stop/recover/change directions (that is, “reversibility”).
I find the question of effect size is itself a bit complicated, since it can be looked at both from a personal and from a statistical point of view, and these do not always align. For instance, although there are various ways of talking about effect size from a statistical point of view, it's common to think of it as how much of the variance is explained by the treatment/intervention/model. But from a personal perspective, effect size is often taken to mean something different: How big an impact does the treatment/intervention/experiment have on my life? These are clearly not the same concepts. At the same time, they can be related, and how to relate them properly, without imposing excessive formal requirements/demands on self-researchers, is something that I hope somebody, or more than one person, with the right set of skills will consider tackling.
In medicine/biological studies, clinical significance complements statistical significance and effect size. Clinical significance means the importance of the effect we have found. Even small effects can be important, but it seems like the last step in the analysis: 1) check statistical significance and adjust for multiple comparisons; 2) if significant, calculate the effect size; 3) check clinical significance/importance. Anyway, most people will ignore most of these steps and just make decisions by looking at graphs.
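As a concrete illustration of step 1, here is a sketch of the Holm-Bonferroni step-down adjustment for multiple comparisons in Python; the four p-values are made up for the example:

```python
def holm_adjust(pvalues):
    """Holm-Bonferroni step-down adjustment of a list of p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1),
        # and adjusted values are forced to be monotone non-decreasing.
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Four hypothetical tests: after adjustment, fewer survive the 0.05 cutoff.
print(holm_adjust([0.01, 0.04, 0.03, 0.005]))
```

A raw p-value of 0.04 looks "significant" in isolation, but after adjusting for the four comparisons it no longer is, which is the trap described above.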
You are right, we can stop somewhere here. Thanks for an interesting conversation.
Thank you again for sparking this topic. While specialized, it's important for sorting out the relationship between personal science and biomedicine. I think again you highlight something very important, which is the order of operations. You give it from a biomedical perspective, where there is a causal discovery that may have scientific value, and then its clinical significance is evaluated in the context of translation into practice. Whereas most personal science is more like "empirically enhanced trial-and-error." That is, an increase in observational capacity allows more sensitive real-world tinkering, in fast cycles. If a new, better life situation results, the project may stop without reaching much certainty about causes. So in this case the personal significance comes earlier.
I think this conversation is interesting in part because there are important “joints” between the two modes of empirical practice. Techniques from biomedical practice may be needed to solve difficult problems where rough evaluation is inadequate, for instance, just as more sensitive calibration methods may be needed in a workshop that has done well using a rough approach until a more complex problem arises. The # of people competent to work at this border is very small!
You can do a trial. But how can you know whether it worked or not? When a doctor gives a patient a medication and the patient recovers, most doctors think it's because of the medication. For thousands of years, medicine worked like that, curing people with phlebotomy and laxatives. But modern evidence-based medicine disproved this method, because it leads to a huge number of errors and wrong decisions. In reality, most interventions didn't work, and the model "what I see is what happens in reality" doesn't lead to good decisions.
If you can't interpret results correctly, "fast cycling" may make things worse. It will produce more false positives, which most people will interpret as true connections. Some of the results will be true positives and true negatives, but what approach will be used to distinguish them?
The simplicity of "trial and error" attracts people, but it has some additional weak points.
In most QS "trial and error," the participant, experimenter, and observer are one person, which leads to:
- Observer/Experimenter bias
- Subject-expectancy bias
- Confirmation bias
- A lot of other cognitive biases. A good explanation is in "Thinking, Fast and Slow" by Daniel Kahneman
Some studies use double-blinding to avoid these problems and hide the hypothesis from the participant. But most QS experiments can't hide the hypothesis, because the participant formulates it.
So, in summary, repeating "trial and error" will result in:
- Some false negatives / false positives, which will be misinterpreted
- Results distorted by a big pack of cognitive biases that most people can't avoid or account for
- Some true positives, which may provide useful insights
In total, will the value of these decisions be positive, negative, or zero? I'm not sure, but I hope it's between zero and positive…
So, in my opinion, rough "trial and error" should be used with a lot of salt applied to the results, and the experimenter should take these limitations into account. Also, we should remember that people love to share their experience, and the poor external validity of QS experiments is usually not taken into account when generalizing results.
Imagine somebody who read in a blog about the correlation coefficient, which can be computed in Excel. They did an experiment, measured A & B, calculated the correlation, and found it to be 0. What would the conclusion be? There is no association, let's go try something new. What are the problems here? If the association isn't linear, the correlation coefficient may fail. The sample size may not be enough. The data may be inaccurate. We didn't find an association, but that doesn't mean there is no association.
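This failure mode is easy to demonstrate: a perfect but nonlinear relation, y = x², gives a Pearson correlation of exactly 0 on a symmetric range. A small self-contained Python sketch (illustrative only):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]    # y is perfectly determined by x...
print(pearson_r(xs, ys))     # ...yet r comes out 0: the relation is nonlinear
```

So "r = 0 in Excel" really does not mean "no association"; it only means "no linear association in this sample."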
@Max_Eastwood I want to turn your attention to an example from my own project if you have time; I think I will learn something! Here is a test I did using a very basic approach — in fact, I use the correlation coefficient. What do you think of this?
I've looked at your data. It looks like a good linear association, with a Pearson correlation of r=0.74 (large).
But I see a problem with the small sample size: there are only 13 observations of Zio/1-button.
I've checked the 95% confidence interval for r, and it's really wide: [0.31, 0.92]. Even though it's far from 0, the interval spans ~0.6, and a wide 95% CI means a lot of uncertainty in the data. Since the Zio trend was rising, and the same for the 1-button, it would be good to catch a changing or steady trend to verify the association. I would suggest confirming the analysis by making more observations.
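For what it's worth, that interval can be reproduced outside R using the Fisher z-transform, which is what cor.test uses under the hood. A sketch in Python, with the r and n values taken from this thread:

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def pearson_ci(r, n, conf=0.95):
    """Confidence interval for Pearson's r via the Fisher z-transform."""
    z = atanh(r)                    # map r into z-space, where it is ~normal
    se = 1 / sqrt(n - 3)            # standard error in z-space
    zcrit = NormalDist().inv_cdf(0.5 + conf / 2)
    return tanh(z - zcrit * se), tanh(z + zcrit * se)

lo, hi = pearson_ci(0.7360945, 13)      # the thread's r and n
print(round(lo, 2), round(hi, 2))       # the wide interval at n = 13
lo90, hi90 = pearson_ci(0.7360945, 90)  # hypothetical 90 days of data
print(round(lo90, 2), round(hi90, 2))   # noticeably narrower
```

Since the standard error shrinks as 1/sqrt(n - 3), the interval tightens slowly: going from 13 to 90 observations narrows it substantially but does not eliminate the uncertainty.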
Also, did you look at your Zio night results every morning, or were you blind, looking only after 2 weeks? Since the trend was rising, you might have been stressed by it and made more 1-button presses. This may happen subconsciously, but that's not the case if you were blinded and didn't look at the Zio results.
Here is the R code if you want to try it:
events <- read.csv("arrithmia.csv")
c <- cor.test(events$zio, events$button)
c
arrithmia.csv (243 Bytes)
Pearson's product-moment correlation

data:  events$zio and events$button
t = 3.6068, df = 11, p-value = 0.004121
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3114100 0.9156948
sample estimates:
      cor
0.7360945
Since we have only 13 observations, there is not too much data analysis we can do… For my data, I'm always trying to have >30 observations. But often even 30 is not enough to get a narrow confidence interval, so my goal is 50-100 or more. I choose comfortable tracking methods, which can be run long term, for months, to have a big data set and be confident in my results.
Also take into account that I'm not an experienced statistician; I'm learning right now and might make mistakes.
Thank you! This is pretty coherent with my conclusions. The reason for such a small # of observations is that I’m aggregating by day. I actually have a very large number of OBT observations (about 10,000 for the year). The Zio also produces voluminous data, but, as is typical of these biomedical devices, it doesn’t provide any patient access. Instead, I have to get a PDF from my cardiologist, and the only convenient metric on the PDF for hand-conversion into tabular data is the 24 hour aggregate.
I think there are two things to consider here. Yes, there are a small # of observations, so if we think statistically there is high uncertainty. But this illustrates, I think, a key advantage of personal science and a difference between personal science and clinical observational studies.
First: I know exactly how the aggregate 24-hour score was created. I have good reason to trust the OBT data, because I have many observations each day, so even if there is a false positive or two, or some time when I failed to make observations, the error is going to be rather small when all the observations are added together for a daily score. I also trust the Zio daily data, which is constructed from continuous measurement using a well-tested instrument. So, my knowledge of what's happening "under the hood" helps me trust the correlation; whereas if this were an observational study where the measurement methods were less reliable or less transparent, I'd have more doubts.
Second: As you point out, I'm not trying to calibrate the relation between Zio and OBT scores well enough to translate between them. I'm just examining whether my OBT measures are reasonably trustworthy at the resolution of one day. Therefore the fact that they both go up and down in the same way (with the exceptions noted in the post) is reassuring.
I think if I were too worried about the statistical factors, I could get stuck there, producing a lot more doubt and uncertainty than is justified. It's generally good to think about how much certainty is needed before adding statistical firepower. At the same time, I don't have complete confidence, and if you had seen a big problem I would have wanted to know about it!
You can analyse, for example, 90 days of aggregated data (n=90). That will increase the certainty of the results.
If you have a small number of observations, you can't be certain of your results. That's not abstract "statistically high uncertainty" - it's what happens in the real world. If you measure your blood pressure only 5 times in a year, you can't be confident that you don't have hypertension. Calling it "personal science" will not improve the significance of your results - there just isn't enough data to draw a conclusion.
If we assume the device is inaccurate, it doesn't matter how many inaccurate measurements are done per day. Accuracy is about how close the device's measurements are to the TRUE values or an established gold standard, not how many measurements the device makes by itself. Personally, I prefer 5 highly accurate measurements over 100 highly inaccurate ones. For example, if I want to measure my deep sleep, I prefer 5 nights with an EEG device versus 100 nights with a wristband wearable.
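This point can be illustrated with a quick simulation: averaging many readings shrinks random noise, but does nothing about systematic bias, so 100 readings from a biased device still converge on the wrong value. The device parameters below are made up for illustration:

```python
import random

random.seed(42)
TRUE_VALUE = 10.0

# A biased but precise device: reads ~2 units too high, with little noise.
biased = [TRUE_VALUE + 2.0 + random.gauss(0, 0.5) for _ in range(100)]
# An accurate but noisy device: centered on the truth, with more noise.
accurate = [TRUE_VALUE + random.gauss(0, 2.0) for _ in range(5)]

mean_biased = sum(biased) / len(biased)
mean_accurate = sum(accurate) / len(accurate)
print(mean_biased)    # stays near 12, no matter how many readings we take
print(mean_accurate)  # noisy, but centered on the true value of 10
```

More data reduces variance, not bias; only validation against a gold standard reveals the systematic offset.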
If there are no validation studies comparing this device with a gold standard, the assumption that the device has only 1-2 false positives per day might be wrong.
There might be a person who has other knowledge and will not trust that correlation. But our trust doesn't make the correlation stronger and doesn't confirm causality.
If the correlation between A & B is driven by another factor C (C → A and C → B), then there is no causality (A does not cause B), and it doesn't matter what helps us trust the correlation, or how much.
What you did was an observational study with n=1. As I understand you, it's not a comparison of personal science vs observational studies; you compare an observational study with n=1 vs n>1. It's known that n=1 has great internal validity and bad external validity (generalizability), and n>1 is mostly the opposite: more generalizable, but with less internal validity.
But you did that
In other words, it's a calibration between devices at 1-day resolution.
If we correlate having a lighter in a pocket with lung cancer, we will find a big trend: a lighter in the pocket will have a strong correlation with increased odds of lung cancer. But does it cause cancer? No. The cause of the cancer is smoking. One of the devices might be inaccurate, and the same trend may be caused by another factor that wasn't taken into account.
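This confounding pattern is easy to reproduce in a simulation: if a hidden factor C drives both A and B, they correlate strongly even though neither causes the other. All numbers below are made up for illustration:

```python
import random
from math import sqrt

random.seed(1)
n = 2000
# C is the real cause (think: smoking); A and B merely respond to it.
C = [random.gauss(0, 1) for _ in range(n)]
A = [c + random.gauss(0, 1) for c in C]   # think: "carries a lighter"
B = [c + random.gauss(0, 1) for c in C]   # think: "lung-cancer risk"

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A and B never influence each other, yet they correlate around 0.5,
# because both inherit variation from the shared cause C.
print(pearson_r(A, B))
```

No amount of additional A/B data removes this correlation; only accounting for C (or intervening on A) can separate correlation from causation.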
Maybe statistics isn't popular or easy to learn, but replacing it with personal experience and perception may lead to trusting the wrong things. It may be better to take some time and read 1-2 introductory books about statistics. That may result in better data-driven decisions in the long term, and that's what I'm trying to learn now. Anyway, this is not a recommendation; I may be wrong, and it may not make my decisions better. But I want to try.
I think n=13 is very far from complete confidence. If you have more data, for example for 90 days, you may attach it here as a CSV and I'll check it.
Very interesting to note and think about these somewhat conflicting approaches. I'll spend some time making sure I understand correctly before replying. One process note: Real-world conditions are challenging sometimes! The Zio is only available by doctor's prescription, and is required to be returned after 14 days. (Mine came off my chest after 12 days.) So I'm facing the choice: Should I test my OBT against the Zio using the data I have, or make no test at all? That's to say: I agree that 90 days would be better. I have some additional questions involving PVCs at night that are really not addressed at all by my tests. But… step by step.
When I was interested in getting a raw ECG signal, I found a few solutions:
- https://shop.bittium.com/ - seems to be very solid, but expensive
- https://www.aidlab.com/ - not as well verified in studies, but the technology is the same as the Polar H10. Support told me that raw ECG data collection and export are available
- there should be some Holter monitors with data export that are available for open sale
The other question is how to process the raw ECG data, but since I gave up on measuring my ECG 24/7, I didn't investigate this question.
Yes, this is an interesting aspect: I could get better ECG data, but what's important to me is the phenomenon I've chosen to observe, because it promises insight into my question. This phenomenon is not "ECG data" but rather "arrhythmia events." Of course I already know I have these arrhythmias in general, and I know their type, thanks to a different measurement device (the Kardia by AliveCor), plus consultation with my cardiologist. And I already know there is quite a bit of variation over days, weeks, and months. But now I want to watch it more closely, both to see if I get any ideas about things to try, and to evaluate how well my medication is working. I've made nearly 10,000 observations using active tracking, and this indeed has given me quite a few ideas and insights. However, I was curious about what a biomedical measure would tell me, for several reasons. First, I needed a biomedical measure for discussion with my cardiologist, who was looking to see if the severity crossed certain thresholds, which would trigger treatment options. But second, I wanted to know if my self-measurement was basically coherent with the biomedical measurements. Most importantly, are we measuring the same thing; that is, arrhythmia incidents? As you suggest, I could be measuring a subjective feeling of arrhythmia that wasn't really associated with the specific heart rhythm problems named by the cardiologist. Or, more likely, my ability to detect them could have been so influenced by the measurement conditions that there was little correlation between my daily score and the Zio daily score, which would make me look for some refinements in my approach. Getting twelve days of measurement that are so closely correlated is reassuring.
OK, you say, but how reassuring? I think I can answer this but I’m going to think a bit more before I do.