Interventions to Improve Sleep

I’m getting back into self-experimentation after being distracted for a while by starting a new job.

This will be my research log for studies I’m doing on sleep. I post incremental updates at www.quantifieddiabetes.com and will collate/organize them here for people who want to follow along.

Completed Studies:

  • Effect of melatonin on sleep duration (blinded, pre-registered)
    • Pre-registration, extension
    • Report
    • Summary of Conclusions:
      • Sleep measurements from Apple Watch don’t correlate with manually recorded times asleep.
      • Contemporaneous recording of waking disrupted my sleep, leading to more recalled wake-ups and possibly increased fatigue.
      • Melatonin had no observable effect on sleep duration or any other metric examined. It may have an effect that was too small to be observed. However, if true, that’s too small to be of interest/use to me.

Questions/Requests

  • Does anyone have any suggestions for other supplements or interventions for me to try?
  • I’m also always looking for collaborators for future experiments. If you’re interested in collaborating on scientifically rigorous self-experiments with foods, nootropics, sleep aids, or anything else, let me know.
2 Likes

Hi Steve,
It may be that your body is producing enough melatonin that the added amount doesn’t change your sleep.
Have you looked at the effects of blue light on sleep quality? A possible extension: if you do detect an effect, test whether supplemental melatonin can reverse some of the negative effects of blue light before bedtime.

I haven’t quantitatively studied the effect of blue light, but I set my electronics to minimize blue light 90 minutes before I go to sleep and turn off screens entirely 30 minutes before bedtime.

Hello, thanks for sharing. You might try finding correlations / building linear regression models of objective/subjective sleep metrics versus:

  1. Bedtime. I’ve already done some analysis of my EEG-derived hypnogram and found my optimal bedtime.
  2. Vitamin D supplementation and timing. Seth Roberts and Gwern Branwen did some self-experiments.
  3. Daily step count / training load
  4. Food: kcal, weight, etc.
  5. Mood
  6. If you have an EEG-derived hypnogram, you can check how certain things affect your REM or deep sleep, like pink noise or n-backing before bed.
  7. Caffeine intake. You may find the maximal dose and timing that don’t affect your sleep.
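The regression idea above can be sketched with plain NumPy least squares. Everything here is simulated and the variable names, units, and effect sizes are hypothetical, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated daily logs (hypothetical units): caffeine in mg, steps in
# thousands, bedtime in hours after 21:00
caffeine = rng.uniform(0, 300, n)
steps = rng.uniform(2, 15, n)
bedtime = rng.uniform(0, 3, n)

# Simulated sleep duration (hours): caffeine and a late bedtime hurt, plus noise
sleep = 7.5 - 0.004 * caffeine - 0.3 * bedtime + 0.02 * steps + rng.normal(0, 0.4, n)

# Ordinary least squares for sleep ~ caffeine + steps + bedtime
X = np.column_stack([np.ones(n), caffeine, steps, bedtime])
coef, *_ = np.linalg.lstsq(X, sleep, rcond=None)
print(dict(zip(['intercept', 'caffeine', 'steps', 'bedtime'], coef.round(3))))
```

With real logs you would load the columns from your tracking data instead of simulating them; the fitted signs then suggest which candidate factors are worth a controlled experiment.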

Right now I’m doing some n-backing and spaced repetition before sleep to check if they increase my REM sleep.

1 Like

Just read through your studies. Really nice work!

I like the suggestion of doing a more detailed regression analysis from my sleep data. I’ve been manually recording sleep to go along with my watch data for just shy of two months. I think I’ll let that go a few more weeks and then do an analysis.

I just found out how caffeine ruins my sleep. It might be important to check; the experiment shouldn’t be too hard.
In that case it’s important not only to look for things that can improve sleep but also to find out what makes it worse.

I don’t eat or drink anything with caffeine, but definitely worth testing if you do.

There are other adenosine receptor antagonists that act the same way, found mostly in cocoa and tea.

Hi Max,
When you did the bedtime optimization, did you wake up each morning with or without an alarm? Did the time period overlap with the nights you were tracking caffeine? Does decreasing caffeine change your optimal bedtime?

No alarm. There were a few nights when I tried the Dreem 2 smart alarm feature, but on almost all days I woke up naturally.
I also used a sleep mask to make sure the morning sun didn’t disturb my sleep, and foam earplugs cut with scissors to fit my ears without pressure.

Yes, that was the same timeframe. I’m planning to model TST ~ caffeine + bedtime to separate the effects. But since I try to keep a strict schedule around ~22:00, I don’t think caffeine influenced that time.
I checked for a correlation between caffeine and bedtime using a bootstrap and didn’t find one; the CI is [-0.08, 0.23], which crosses 0, with n=146.
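A percentile bootstrap for a correlation CI like this can be sketched as follows. The data are simulated and deliberately uncorrelated, and the 5,000-resample count and percentile method are my assumptions, not necessarily what was used here:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 146

# Simulated daily values: caffeine (mg) and bedtime (seconds after 00:00)
caffeine = rng.uniform(0, 300, n)
bedtime = rng.normal(79200, 1800, n)  # ~22:00 with jitter, independent of caffeine

def pearson_r(x, y):
    return np.corrcoef(x, y)[0, 1]

# Percentile bootstrap: resample (caffeine, bedtime) pairs with replacement
boot = []
for _ in range(5000):
    idx = rng.integers(0, n, n)
    boot.append(pearson_r(caffeine[idx], bedtime[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"r = {pearson_r(caffeine, bedtime):.3f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

If the interval crosses 0, as with the [-0.08, 0.23] result above, the data don’t support a correlation.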


[Scatterplot: caffeine (y-axis) vs. bedtime (x-axis, in seconds from 00:00)]

As you can see above, I haven’t found a correlation between caffeine and bedtime. There were enough low-caffeine days, but the scatterplot didn’t reveal a connection…
Right now I’m lowering my dose and will continue to monitor my sleep, but my Dreem 2 is dying on me after about a year of use, and I can’t find an alternative for precisely tracking my sleep (EEG).

It’s also worth noting that I start my preparations ~1 hour before sleep: dim the lights, wear blue-blocking glasses, read an ebook, and do something calm and relaxing on most days.

Cool. I just checked: the Dreem 2 is sold out in the US.

There are some on eBay and in the AU ResMed store. I’m trying to buy a second one but haven’t yet. The Hypnodyne ZMax also looks solid.

I’ve done an extended analysis of which factors influence my sleep; if anybody is interested, the details are here.
In short: caffeine, bedtime, sickness, vitamin D3, and negative emotions statistically significantly and independently influence my sleep. I’ll lower my caffeine intake, go to sleep earlier, take D3 every other morning, and look for ways to reduce my stress.

2 Likes

red light with f.lux

Fascinating work in this thread so far! Looking forward to diving deeply into it all.

I recently did a massive observational study of my sleep behaviour over 472 nights, relating it to data about my eating windows, mood, exercise, location, activity, heart rate, habits, weather, etc. If you’re interested in the details, you can read the summary here or the full paper here.

The idea was to use observational data (and some clever techniques) to narrow down the key factors that interfere with sleep quality. Those can then be explored in controlled experiments (like @sskaye’s pre-registered melatonin study).

These were the most predictive features in my final model. The 16 final features are a reduced subset of the 308 initial features that were explored throughout the observational study, yet they explain the majority of the variation in sleep quality. There are many caveats about interpreting these that I discuss extensively in Sections 9.5 and 10.5 of the paper.

Some of the features are expected — travelling, eating windows, previous sleep, and melatonin
are all known to affect sleep quality. But the direction of some of these effects was
unexpected. E.g. melatonin consumption is associated with a decrease in sleep quality,
when research suggests it should be an increase. The previous night’s sleep quantity also
has an unexpectedly negative effect on the present night’s sleep quality. Perhaps there is some
kind of trade-off between sleep quality and sleep quantity on successive nights? These unexpected
directions might just be artefacts of the study design, or could be idiosyncrasies in my
sleep patterns.

On the other hand, some of the final 16 features were very much unexpected. Specifically, the lag
in the features. It is not intuitive that pleasure reading and barometric pressure from many days
prior could affect the current night’s sleep. It is also strange that location changes from an entire
week prior are still predictive of the current night’s sleep. One explanation is that these features
merely correlate with other (unmeasured) variables that influence sleep quality, and that this is
being captured by the model. Another (complementary) explanation is that the sparsity effect
of the Lasso model caused it to discard all but one feature from each set of highly-correlated
features.

I’m hoping to write up a more accessible discussion of the study and my findings. I’ll post it on the forum when it’s ready.

3 Likes

Very interesting! This is a small detail but I think of interest to other QS Forum readers: Say something about your experience using the Nomie app. Did it perform as expected? Was it a good solution for some or all of the active observations?

1 Like

I’ve read the full paper, and it reads like a very good guide for serious QS’ers.

  1. When I read about the previous night’s sleep impact, I immediately ran the lasso on my data; sleep time from the previous night survived and increased adjusted R² from 0.24 to 0.28. The p-value was significant in the final lm. A significant improvement :+1:

  2. The idea of Markov unfolding and overcoming missing data is pretty interesting; I’m going to try it once my noob statistics skills improve :slight_smile: Actually, I did that to some degree, but not in as solid a way as you.

  3. The bedtime feature had a strong negative correlation with the target (r ≈−0.6), which is likely
    because it is considered as part of the ideal sleep window calculation that comprises 10% of the
    sleep score. It was important to remove features like this from the dataset prior to training the
    model, in order to prevent data leaks

That looks weird to me. Bedtime isn’t a measure of sleep quality; it’s something that affects sleep quality. Maybe it’s better to remove bedtime from the sleep score (since you know the Oura formula and weights, that shouldn’t be a problem) and keep bedtime in the dataset as a feature that may influence sleep quality.
Keeping bedtime in the final target (the sleep score) has another problem: since bedtime is sometimes determined by your decision and not by the features in the dataset, sleep quality will be distorted by including bedtime.

  4. According to my data, the Oura sleep score is a very noisy variable.
    Sleep stages are pretty poorly detected (in my case).
    Total sleep time is also poorly detected: I did a Bland-Altman analysis on 140 nights of data and found poor limits of agreement and a huge bias.
    [Bland-Altman plot: Oura vs. EEG total sleep time]
    There is a -24 minute bias and limits of agreement from -144 minutes to +95 minutes.
    I also did a correlation analysis for the restlessness score, but even this score didn’t correlate (<0.2) with awake time, awake count, or position-change count from the EEG device. Same for sleep latency :frowning:
    So in total, sleep stages, total sleep time, and restlessness seem to have a bad signal-to-noise ratio, and I’m not sure to what degree the Oura sleep score represents sleep quality. There should be some useful signal, but the data above raise a lot of questions. Only total time in bed looks acceptable.
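For anyone wanting to reproduce a comparison like this: a Bland-Altman analysis is just the mean difference between two paired measurements (the bias) plus the bias ± 1.96·SD of the differences (the limits of agreement). A minimal sketch on simulated paired nights (the numbers are made up, not Max’s data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 140

# Simulated total sleep time (minutes) from an EEG reference and a wearable
reference = rng.normal(420, 45, n)
wearable = reference - 24 + rng.normal(0, 55, n)  # biased, noisy estimate

# Bland-Altman statistics: bias and 95% limits of agreement
diff = wearable - reference
bias = diff.mean()
loa_low = bias - 1.96 * diff.std(ddof=1)
loa_high = bias + 1.96 * diff.std(ddof=1)
print(f"bias = {bias:.1f} min, limits of agreement = [{loa_low:.1f}, {loa_high:.1f}] min")
```

The usual plot puts the pairwise mean on the x-axis and the difference on the y-axis, with horizontal lines at the bias and the two limits.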

Thanks for the work you’ve done :slight_smile: I hope there will be more n=1 papers with deep dives like this.

1 Like

I’ve been using Nomie for probably 5 or 6 years now and really like it. At points, I was using it for almost all of my active quantitative logging. I really like the speed and flexibility it offers.

But, over the years, developer interest seems to have waned. That scared me off. All seasoned QS enthusiasts know that feeling of a tool they’ve been using suddenly going offline and ruining their experiments*.

For the past 2 years, I’ve only been using Nomie for caffeine and alcohol tracking. It’s been particularly great for that! I put the app on the bottom right of my homescreen and became very fast and discreet at logging whenever I had a coffee or an alcoholic drink with a quick tap. For those trackers in particular, you want something that’s super fast and minimal so that it’s not distracting in social situations. Nomie works really well for that — only my immediate family and SO even notice it.

So Nomie is great, but I feel hesitant to rely on it because its support status seems to constantly be changing.

*This actually happened to me with AWARE during the Quantified Sleep experiment, which required me to do some non-trivial workarounds to salvage my data.

Thanks for checking out the paper, Max! I’m glad to hear that it prompted some ideas for your own research :slight_smile:. It’s very exciting to get feedback from other QS practitioners about how well techniques generalise across datasets, so thank you for all the details!

When I read about the previous night’s sleep impact, I immediately ran the lasso on my data; sleep time from the previous night survived and increased adjusted R² from 0.24 to 0.28.

Lagging features are surprisingly powerful, but (as I mention in the paper) there’s a lot of optimisation to be done in how to incorporate information from lags >1. But features with a lag of 1 are already just an easy way to boost predictive performance (and better understand the time series). I’m glad that worked for your model, but bear in mind that you’ve also added more parameters (lowering bias and increasing variance). Are those adjusted R^2 values on a validation set?

The idea of Markov unfolding and overcoming missing data is pretty interesting; I’m going to try it once my noob statistics skills improve…

Much of the complexity in my paper was because I was comparing these techniques to baseline, which requires some fiddly tricks to prevent data leaking. In terms of actually implementing them, it shouldn’t be too complex. Markov unfolding is a bit conceptually complex, but the actual implementation is pretty simple (in Python with Pandas), so long as your dataframe is already in the right format:

import numpy as np
import pandas as pd

def markov_unfolding(df: pd.DataFrame, length: int = 7) -> pd.DataFrame:
    """Create `length` new features for each column, corresponding to time lags."""
    _df = df.copy()  # Work on a copy so the original data is untouched
    # Loop over numeric columns, skipping the target feature (named 'target')
    numeric_cols = _df.select_dtypes(np.number).columns
    for c in [col for col in numeric_cols if 'target' not in col]:
        # Apply shifts of 1..`length` days and stack the lagged columns onto the copy
        for i in range(1, length + 1):
            _df[f'{c}_-{i}day'] = _df[c].shift(i)
    return _df

(Based on the original implementation here.)

As for the missing data imputation, there are existing implementations in both R and Python for this. They handle most of the complexity for us :slight_smile:.

That looks weird to me. Bedtime isn’t a measure of sleep quality; it’s something that affects sleep quality. Maybe it’s better to remove bedtime from the sleep score (since you know the Oura formula and weights, that shouldn’t be a problem) and keep bedtime in the dataset as a feature that may influence sleep quality.

You’re definitely onto something here. I did consider extracting parts of the Oura score to construct a more tailored target feature. I decided against it because I wanted to keep the approach general (so that other Oura users could replicate my methods more directly). It would be interesting to look at how much noise vs. signal each component of the Oura sleep score contributes — perhaps I’ll tackle this one day in the follow-up work!

Keeping bedtime in the final target (the sleep score) has another problem: since bedtime is sometimes determined by your decision and not by the features in the dataset, sleep quality will be distorted by including bedtime.

Exactly! This speaks to one of the major challenges of observational QS studies: the feedback loops make it really hard to figure out candidates for causal relationships. There is also quite a bit of variation between individuals here, I think. For instance, I’ve always struggled with insomnia and having a consistent sleep schedule. For me, my bedtime seems out of my control, so it made more sense to treat it as (part of) the dependent variable. But for someone who can fall asleep easily at whatever time they intend to, it’s much more of an independent variable (and therefore a feature). These nuanced decisions are part of why designing QS studies is so very challenging and bespoke to each individual. But maybe that’s also part of the fun!

According to my data, the Oura sleep score is a very noisy variable.

Yes, I’d say this is expected. Sleep quality (even in the abstract sense) is quite noisy as there are so many factors that contribute to it and sleep is still so minimally understood. Then we add in the noise introduced by sensors and using model predictions as proxy measures, etc.

Sleep stages are pretty poorly detected

This seems to be consistent with the literature. Though I’m not sure that we even have good ways of measuring sleep stages in a lab — all our sensors are “guessing” in some sense. This obviously contributes to the high noise.

Total sleep time is also poorly detected

This surprises me :thinking: My own experiences, comparisons by other QSers, and lab studies all find sleep duration (onset, interruptions, offset) to be quite accurate.

Am I correct that you compared the Oura with a Dreem 2 headband? Both devices will have some noise, surely, but we can’t tell which contributes more just from the Bland-Altman.

So in total, sleep stages, total sleep time, and restlessness seem to have a bad signal-to-noise ratio, and I’m not sure to what degree the Oura sleep score represents sleep quality. There should be some useful signal, but the data above raise a lot of questions.

Unfortunately, these things are all hard to measure (even with lab-grade polysomnography equipment), so we have to just deal with noisy variables here.

One workaround that might apply to you would be to build 2 identical explanatory models: one model using the Oura sleep score as a target, and one using the Dreem score as a target. That way, we know both are noisy, but any agreement between the feature importance/direction in the models is probably robust. I’d be very interested in seeing how consistent the models are, despite the targets not being in agreement.
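A rough sketch of that two-model consistency check. The data are simulated, plain least squares stands in for whatever model would actually be used, and the two targets are just two noisy proxies of one latent quality:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 150

# Shared features (standardized), e.g. caffeine, bedtime, exercise
X = rng.normal(size=(n, 3))
true_quality = X @ np.array([-0.5, -0.3, 0.2])  # latent "true" sleep quality

# Two noisy proxies for the same latent target (e.g. Oura vs. Dreem-based score)
target_a = true_quality + rng.normal(0, 0.8, n)
target_b = true_quality + rng.normal(0, 0.8, n)

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (intercept dropped from the output)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

coef_a, coef_b = fit_ols(X, target_a), fit_ols(X, target_b)
# Feature directions that agree across both noisy targets are probably robust
agree = np.sign(coef_a) == np.sign(coef_b)
print("coefficients A:", coef_a.round(2))
print("coefficients B:", coef_b.round(2))
print("sign agreement:", agree)
```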

Thanks for the work you’ve done :slight_smile: I hope there will be more n=1 papers with deep dives like this.

Thanks for all the great feedback and for replicating some of the analysis! :slight_smile: I’m looking forward to seeing your future experiments!

Are those adjusted R^2 values on a validation set?

I did LOOCV, 10-fold CV, a validation-set approach (1/2 split), and finally fit on the whole sample. With each method, adjusted R² increased by 3-4%.
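For reference, the k-fold variant can be sketched like this (simulated one-predictor data; the held-out R² here is plain, not adjusted):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(0, 1, n)  # simulated predictor and outcome

def r2_holdout(train_idx, test_idx):
    """Fit y ~ x on the training fold, score R^2 on the held-out fold."""
    A = np.column_stack([np.ones(train_idx.size), x[train_idx]])
    coef, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
    resid = y[test_idx] - (coef[0] + coef[1] * x[test_idx])
    return 1 - resid @ resid / ((y[test_idx] - y[test_idx].mean()) ** 2).sum()

# 10-fold CV: shuffle, split into 10 folds, average the held-out R^2
idx = rng.permutation(n)
folds = np.array_split(idx, 10)
cv_r2 = np.mean([
    r2_holdout(np.concatenate([f for j, f in enumerate(folds) if j != i]), folds[i])
    for i in range(10)
])
print(f"10-fold CV R^2 ~ {cv_r2:.2f}")
```

LOOCV is the same loop with n folds of one point each, except the R² is then computed on the pooled predictions rather than per fold.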

Thanks, my R code looks pretty similar to yours.

But the circadian rhythm influences sleep, so when you shift your bedtime (even if you fall asleep fast), it may affect your sleep. Going to sleep at 21:00 one day may result in different sleep quality than sleep starting at 23:00 on another day. And the decision to go to bed at 21:00 vs. 23:00 may be driven by something not included in the model.

That’s correct, but sleep stages are defined by EEG patterns (spindles, K-complexes, etc.). The ring tries to predict stages without EEG, which results in poor agreement with EEG devices. EEG devices also aren’t perfect, but they agree well among themselves, which is a good sign they’re measuring the same thing.

This comparison shows poor agreement. Total sleep time is time in bed minus awake time.


As you can see, 0.6% + 4.6% + 1.8% + 0.7% = 7.7% of the night was awake time. Only 0.7%, i.e. 0.7/7.7 = 9% of the awake time, was detected as awake; the other 91% was classified as light/REM/deep. If we assume time in bed is perfect and awake time is a total mess, how can we get total sleep time? We subtract a wrong awake time from time in bed, which yields a wrong total sleep time. You can also see here that 50% of deep sleep was detected incorrectly. Same for REM.

This study also shows poor accuracy. They seem to use a ±30-minute window because they don’t want to report the poor accuracy. There is no 87% agreement with PSG on total sleep time; rather, the difference between PSG and Oura is under 30 minutes on ~87% of nights, for both total sleep time and awake time. Imagine a 25-minute difference in awake time during the night: this paper treats that difference as agreement. Imagine a total sleep time of 7 hours by PSG and 7h 25m by Oura: that doesn’t look like good agreement, but the paper treats it as agreement.

Yes. Right now I’m using a Fitbit Charge 4 and 5, an Oura, a Withings Sleep, and a Dreem 2, and I manually assess my sleep every morning (before looking at any devices). I’m planning to post my results soon. Using the Dreem as a reference device, manual assessment of total sleep time outperforms the other non-EEG devices (~60 nights of data), so I’m pretty sceptical of total sleep time from non-EEG wearables (we can think of the brain as an EEG device :slight_smile:).

I trust the Dreem more because they validated the EEG signal quality (dry EEG seems not perfect, but good enough) and showed 85% agreement with PSG on sleep staging. The Quantified Scientist got the same results. I haven’t seen the same level of sleep-stage detection in any of the Oura papers/reviews. I hope I’m not biased here :slight_smile:

That’s right, but more signal means a better-quality outcome measure and a bigger chance of detecting associations with predictors.

Dreem doesnt have sleep score and we cant use oura sleep score formula because there is no restlessness parameter. Dreem measure “restlessness” directly and report awakening count and position changes during night. We may try to use same formula as oura, but replace restlessness by standartized awakenings count, scaled to 0-100. But why we should use that formula? How did they get it? Does it validated in some degree and represents sleep quality? I cant find answer to that question.
I’ve found interesting this kind of factor analysis did by Gwern, where him look for latent variables and categorize them as “sleep quantity” and “insomnia”.

Thanks. When are you planning to publish your next paper? I’m ready to read it now :slight_smile:

2 Likes