Interventions to Improve Sleep

Some interesting points to dig into, Max!

This comparison shows poor agreement. Total sleep time is time in bed minus awake time. […] This study also shows poor accuracy. […] Imagine a total sleep time of 7 hours in PSG and 7h 25 minutes for Oura: this doesn’t look like good agreement, but the paper treats this difference as agreement.

I think we might be talking past each other here. I would interpret both the lab results and Quantified Scientist’s as showing good sleep detection accuracy. It seems like you would expect it to be a lot closer? I guess it also depends what your goals are in QS.

Looking at the time-series from Rob’s analysis, it seems that the wake/sleep binary classification aligns pretty well. As I point out in the paper, this is pretty good for a device that’s so unobtrusive. I mean, are humans even this accurate? I’m not sure that I’d get better accuracy by having a sleep scientist label timestamps of a video of me sleeping. But maybe my intuitions are wrong here? I totally see how EEG can detect REM sleep (and why the Oura struggles), but light/wake classification is much less obvious. And using HRV/motion data can probably detect a lot more restless behaviour than EEG, no?

What I really like about the Oura ring is that I have virtually no missing data and no disruption to my natural behaviour. I literally wear it constantly. This means that I have data from aeroplanes and couches and trains and all sorts of occasions where I wouldn’t have worn a headband. As a result, I have >1000 nights of Oura data under a wide variety of situations. For my goals, having 1000 nights of noisy-but-varied data is much more useful than 500 nights of slightly-more-accurate data that may be biased to occasions where I could/bothered to wear the EEG device. That’s definitely a trade-off and there are advantages and drawbacks to both.

Much further up this thread you said:

Right now I’m doing some n-back training and spaced repetition before sleep to check whether it increases my REM sleep.

Given those experiments, I totally agree that the Oura ring is not accurate enough — especially with REM classification.

I’m planning to post my results soon. Using Dreem as a reference device, manual assessment of total sleep time outperforms other non-EEG devices (~60 nights of data), so I’m pretty sceptical about total sleep time from non-EEG wearables

I’m very interested to see these results! Especially with your manual assessment. Sure, it’s not that hard to estimate to within 10-15 minutes when you went to sleep and within 5 minutes when you woke up. But what about disturbed sleep during the night? I find that very difficult to estimate (which is why I rely on devices). Perhaps we have quite different sleep profiles/habits? Or maybe I misunderstood what you meant. But I’m looking forward to your results!

Dreem doesn’t have a sleep score, and we can’t use the Oura sleep score formula because there is no restlessness parameter.

I think it would be good to create your own scoring function anyway. In my study, I had a couple of reasons to stick with the built-in Oura score. For a comparison across devices (and for the types of experiments you seem to be interested in), it makes a lot of sense to devise your own composite score — it could even include some subjective measures of sleep.

This is something I intend to do in my future analyses, particularly focussing on things I personally care more about, like feeling rested and energised when I wake up.

Right now I’m using Fitbit Charge 4 and 5, Oura, Withings Sleep, Dreem 2 and manually assess my sleep every morning (before looking at any devices).

It would be really cool to make some sort of consensus score out of all of these devices! And perhaps it would be valuable to do some analysis where the agreement across all the devices (or between Dreem and the non-EEG devices) is used as a target feature. I’d be really interested to know what other factors cause the devices to disagree, as that would help get an understanding of how to “subtract away” some of the noise in sleep measurement.
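One simple way to picture the consensus idea is sketched below. This is only an illustration under my own assumptions: the function name, the choice of the median as the consensus and of the max-min spread as the disagreement measure, and all of the numbers are hypothetical, not something from either of our analyses.

```python
from statistics import median


def consensus_and_disagreement(readings):
    """Summarise one night of total sleep time readings (minutes) from several devices.

    Returns the median as a robust consensus value and the max-min spread
    as a crude per-night disagreement measure.
    """
    values = list(readings.values())
    consensus = median(values)
    disagreement = max(values) - min(values)
    return consensus, disagreement


# Hypothetical single night; these numbers are made up for illustration only.
night = {"dreem": 432, "oura": 405, "fitbit": 440, "withings": 428, "manual": 435}
consensus, disagreement = consensus_and_disagreement(night)
print(f"consensus: {consensus} min, spread across devices: {disagreement} min")
```

The per-night spread could then serve as the target feature when looking for factors that make the devices disagree.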

When are you planning to show your next paper? I’m ready to read now

It may be quite a wait, I’m afraid :sweat_smile: I’m fairly burned out from the Quantified Sleep paper and am focussing on finishing my MSc thesis at the moment. In the near future, I’m going to try to do some smaller-scale experiments that I can post here. I would like to do controlled (maybe self-blinded) experiments on the factors that were highlighted in the observational analyses.

I’ve also been wanting to do some analysis of my diet/nutrition. I’d design a two-week plan involving various different food types, quantities, timings, etc. Then track everything (calories, macros, timing) and relate it to data from a continuous glucose monitor and ketone strips.

I’m looking forward to your device agreement results (and the ones about spaced repetition’s effect on REM)! Keep it coming, Max!

This is a single night. I have nights where Oura and Dreem have good, almost perfect, agreement. The problem is that most other nights agree less. The confusion matrix summarizes all nights.


Let’s look at deep sleep in the confusion matrix from my previous post. There was 12.3% deep sleep according to PSG. Oura correctly recognized 9.9% and misclassified another 10.2% as deep sleep. So PSG tells us there is 12.3% deep sleep, but Oura tells us there is 20.1%. For an average 8-hour night, PSG gives 8*60*0.123 = 59 minutes of deep sleep, while Oura gives 8*60*0.201 = 96.5 minutes. I don’t see how that ~37 minute difference in deep sleep can be “good” agreement: that’s an error of about 40% of Oura’s figure (over 60% relative to PSG). Same for REM.

Awake time is even worse: Oura says there are 8*60*0.077 = 37 minutes awake, while PSG says there are 8*60*0.007 = 3 minutes. Since total sleep time is time in bed minus awake time, we get a ~34 minute error in total sleep time (37-3). I don’t see how overestimating awake time by 34 minutes is good agreement.
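The arithmetic above is easy to re-run with other percentages or night lengths. Here is a minimal sketch using the shares quoted in this post; the helper name `share_to_minutes` is just something made up for the example.

```python
NIGHT_MINUTES = 8 * 60  # the "average night of 8 hours" assumed in the post


def share_to_minutes(share, night_minutes=NIGHT_MINUTES):
    """Convert a share of the night (0..1) into minutes."""
    return share * night_minutes


# Deep sleep: PSG share vs. everything Oura labelled as deep (correct + misclassified).
psg_deep = share_to_minutes(0.123)
oura_deep = share_to_minutes(0.099 + 0.102)
deep_error = oura_deep - psg_deep

# Awake time: shares quoted from the same confusion matrix.
psg_awake = share_to_minutes(0.007)
oura_awake = share_to_minutes(0.077)
awake_error = oura_awake - psg_awake

print(f"deep sleep: PSG {psg_deep:.0f} min, Oura {oura_deep:.1f} min, error {deep_error:.0f} min")
print(f"deep error relative to PSG: {deep_error / psg_deep:.0%}, relative to Oura: {deep_error / oura_deep:.0%}")
print(f"awake: PSG {psg_awake:.0f} min, Oura {oura_awake:.0f} min, error {awake_error:.0f} min")
```

Note that the size of the relative error depends on which value is used as the denominator, so it helps to state that explicitly when quoting a percentage.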

There is a metric called kappa, which is a measure of agreement. In my case it was 0.42 for Oura, which is a low value. Rob didn’t show kappa for his confusion matrix, but I’m confident it will be about the same.
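For anyone who wants to compute kappa from their own confusion matrix, here is a minimal sketch of Cohen’s kappa: observed agreement corrected for the agreement expected by chance. The counts in the example are hypothetical and are not taken from any of the studies discussed here.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix of counts.

    Rows are the reference labels (e.g. PSG), columns are the device labels.
    """
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of epochs on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of the row and column marginals, summed over classes.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)


# Hypothetical 2-class example (sleep vs. wake); the counts are made up.
example = [
    [40, 10],
    [10, 40],
]
print(f"kappa = {cohens_kappa(example):.2f}")
```

Raw accuracy in this toy example is 80%, but half of that agreement would be expected by chance, which is why kappa comes out lower than the accuracy figure.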

Yes. If I want to check what affects my awake time, an error of 30 minutes looks pretty big. Same for deep or REM sleep. Total sleep time also has an error of ~30 minutes because of the awake time error. If you have read the Dreem 2 PSG study, you may have seen a kappa of 0.74.

There is an accuracy / comfort trade-off, like the bias / variance one :slight_smile: I see Oura on the side of being pretty comfortable at a big cost in accuracy. For example, Fitbit Charge is less comfortable than Oura but agrees better with EEG. Dreem is even less comfortable than Fitbit but outperforms non-EEG wearables.

5 sleep scientists had a Cohen’s kappa of 0.8 according to the Dreem study. Each sleep specialist analyzed the EEG signal data independently. That’s not perfect, but it’s solid agreement. There are some other studies with kappa of 0.76+ between sleep specialists. I’m not requiring Oura to have a kappa of 0.9, but 0.42 (in my case) is too low.

Why is light/wake classification less obvious? As I see it, all sleep stages are complex processes that are roughly categorized. Some specialists say that it’s sometimes hard to classify transition periods, which may cause loss of information. But anyway, there are AASM standards for sleep stage classification which define what we call deep sleep etc.

PSG uses HR, HRV, respiratory rate (direct), temperature, muscle electrodes on the legs, ECG, SpO2 and motion in addition to EEG, so there are a ton of sensors here)). Dreem 2 also has motion data, a sonometer and PPG for HR. These devices aren’t just EEG. ZMax has even more sensors than Dreem, adding SpO2, HRV, temperature etc.

Do you have gaps in your data? Sometimes Oura shows gaps in my data (1-2 days per week, for 20-30 minutes), which can be treated as missing. I don’t understand how they can tell that I was in REM sleep if there was no HR/HRV data.

Same here. It’s pretty comfortable.

Yeah, that’s the comfort / accuracy trade-off. I take my wearables / headband everywhere with me and it’s not a problem, so I have my flights etc. But that’s me; most people will not tolerate this)
Also, I’m not sure these extreme days should be included in the model, because my current goal is to optimize my usual sleep, with less interest in extreme conditions (flights are less than 1% of my sleep).

That’s a good point. I’ve been using Oura for ~330 nights and I have ~160 nights of Dreem data. I’m a bit sad that I didn’t start this a few years ago :slight_smile:

Maybe, but I assume you don’t have 500 nights of less noisy data to check that. I hope in a few years I can better understand whether using a headband adds some value over using Oura alone. But there is a possibility that the headband may not add anything at all.

Same thoughts :+1:

Maybe. Since I have both devices, I’ll come back with results when I have enough data. I’m not sure even Dreem can detect something if the effect size is small. It may take years to get enough data.

I just look at my Fitbit when going to sleep, for example at 23:15. When I wake up in the morning, I look at the Fitbit first. I don’t remember a morning when I had forgotten the previous day’s bedtime :slight_smile: So I’m pretty sure it’s even more accurate (but maybe I’m overconfident?)

After dealing with some psychological questionnaires, I’ve found that to get useful data there should be several questions about the metric of interest. Single-question measures don’t work well. I answer a set of questions related to sleep issues which come from the Pittsburgh Sleep Diary. In a future post I’ll show the full questionnaire results and how they correlate with headband / Oura / Fitbit (each of these devices has some measure of sleep disturbance).

I usually go to sleep at 23:00 and wake up at 7:00. I don’t spend any time in bed during the day and get out of bed immediately when I wake up. My sleep quality and quantity are described in detail here.
I’ve had problems in the past with very long sleep onset (it took 40-60 mins for my brain to stop), which I was able to lower to 10-15 mins, which is a healthy range (the biggest impact came from using Dreem with their sleep improvement programs). My current sleep is mostly OK, but I want to lower my nightly awake time and awakenings count into a healthy range.
I was able to lower them to some degree by dealing with some external stressors (noise, overheating etc.), but I’m still waking up 2-3 times per night.

I’m not sure what this formula would describe. There is the Oura formula, and there is also the ZQ formula from Zeo. But what’s their clinical value? A single measure will lose some information. I understand the practical value of this when I’m trying to build a model, but in general I prefer to look at all the data I have.

I tried to do that, but it’s hard to build one single number. I can, for example, build a sum of standardized awake time, awakenings count and position changes, and call this a “disturbance” score. Maybe it’s worth adding sleep onset latency and calling it “insomnia”. Another metric might be sleep quantity, which is something like deep/REM/light. The problem is that these metrics can’t be compared to anything except myself in the past. And I may be wrong and misinterpret them. That’s why for now I prefer looking at all the data and using well-established metrics.
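To make the “disturbance” idea concrete, here is a minimal sketch of a sum of standardized metrics. Equal weighting of the three z-scores is an assumption on my part, the function names are hypothetical, and the nightly values are made up for illustration.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize a list of nightly values to mean 0 and (sample) SD 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def disturbance_score(awake_minutes, awakenings, position_changes):
    """Per-night sum of z-scores; higher means a more disturbed night.

    Equal weights are assumed here; nothing says this is the right weighting.
    """
    columns = [zscores(c) for c in (awake_minutes, awakenings, position_changes)]
    return [sum(night) for night in zip(*columns)]


# Hypothetical three nights of data.
scores = disturbance_score(
    awake_minutes=[20, 35, 50],
    awakenings=[1, 2, 3],
    position_changes=[10, 15, 20],
)
print([round(s, 2) for s in scores])
```

Because every component is standardized against my own history, the score is only meaningful relative to my own past nights, which is exactly the limitation described above.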

I think I can do that in the future. But the results might not be generalizable, as with most n=1 studies.

It might be the quality of sensors, quality of firmware, sleep detection algorithms etc.

That will work only if Oura has a lot of real signal hidden by the noise. If it doesn’t, you can’t recover something that wasn’t measured accurately. It measures HR, HRV, movement and temperature with acceptable quality. I don’t think that’s enough to describe sleep in as much detail as PSG with all its sensors.

I have all of my nutrition data for 13 months, but don’t have enough stats skill to analyze it :slight_smile: but I hope I can in the near future.

Thanks :slight_smile: I think we can print our posts from this thread and pass peer review :slight_smile:

Yes. Right now I’m using Fitbit Charge 4 and 5, Oura, Withings Sleep, Dreem 2 and manually assess my sleep every morning (before looking at any devices). I’m planning to post my results soon. Using Dreem as a reference device, manual assessment of total sleep time outperforms other non-EEG devices (~60 nights of data), so I’m pretty sceptical about total sleep time from non-EEG wearables (we can think of the brain as an EEG device :slight_smile:).

Also looking forward to this.

Thanks. Right now I have ~60 pairs of data. My total sleep time standard deviation is 46.2 minutes (Dreem 2) and I want to be able to detect an effect of at least 10 minutes, which is 10 / 46.2 = 0.22 of my standard deviation.
A simple power analysis suggests I need 164 nights (at a 5% significance level and 80% power), so it might take some time…

Also, I’m not sure what to do about the multiple comparisons issue here: since there are several devices, my p-values must be adjusted, which may mean that even ~160 nights isn’t enough :crazy_face:
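A rough way to reproduce this kind of estimate is sketched below. It assumes a two-sided one-sample (or paired) test and uses the normal approximation, so it will not match a t-distribution-based calculator exactly, and the result also shifts with how the effect size is rounded. It should land in the same range as the 164 nights quoted above. The Bonferroni correction for 4 comparisons is only one possible adjustment, and the number 4 is an assumption for illustration.

```python
from math import ceil
from statistics import NormalDist


def nights_needed(effect_minutes, sd_minutes, alpha=0.05, power=0.80):
    """Approximate sample size for a two-sided one-sample/paired test.

    Uses n = ((z_{1-alpha/2} + z_{power}) / d)^2 with d = effect / SD
    (normal approximation, no t-distribution correction).
    """
    d = effect_minutes / sd_minutes
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / d) ** 2)


unadjusted = nights_needed(10, 46.2)

# Bonferroni: divide alpha by the number of comparisons (4 is a hypothetical count).
n_comparisons = 4
adjusted = nights_needed(10, 46.2, alpha=0.05 / n_comparisons)

print(f"unadjusted: about {unadjusted} nights")
print(f"Bonferroni-adjusted for {n_comparisons} comparisons: about {adjusted} nights")
```

The adjusted figure is noticeably larger, which is the concern raised above: correcting for several devices pushes the required number of nights well past ~160.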

Preliminary data analysis of 63 nights is available now.

Withings Sleep Analyzer and manual assessment seem to be good estimates of total sleep time (sum of DEEP, REM, LIGHT) and a proxy for an EEG device. Fitbit Charge 4 is less accurate than manual assessment. Oura shows the worst results.

Fitbit Charge 5 was not included since I’ve only been using it for 2 weeks, but FC5 agrees with Dreem 2 about as well as FC4 does, so I assume FC5 will have the same results as FC4.

Time in bed (which is total sleep time plus time awake) seems to agree well across all devices.

A Google spreadsheet with manual sleep assessment seems to be an acceptable alternative to wearables for measuring total sleep time.

The analysis will be updated in the future, when I gather more data.

Works great for me. No need to use mental capacity or willpower.

I just had a look at these initial results. Three things came to mind:

  1. In 6 months, I expect you’ll have some really robust results from this experiment. You should consider writing it up formally as a preprint and getting some feedback from the science teams at each device manufacturer.

  2. I’m surprised that the Oura ring performs that badly compared to the other non-EEG devices. My priors are that a company successfully selling sleep-focussed wearables should be performing at least as well as companies selling general-purpose wearables (e.g. Fitbit). Perhaps the Oura ring is just overhyped, but perhaps there’s something else going on. In the latter case, if I had to guess, I’d imagine that the Oura ring’s performance might vary somewhat across individuals. How’s the fit of your ring? What size did you get and which finger do you wear it on? It seems there are quite some variables there that might have differed and may account for some of the inaccuracy. How often do you have a night where some data is missing? Quantified Scientist has also done some comparisons with two Oura rings on different fingers. IMO, going with the finger-worn device adds a lot more variables and Oura may need to gather more data to refine their usage recommendations.

  3. In your Limitations section you say “manual assessment accuracy may be affected by calibration with wearables and learning.” I think this is an excellent point! I’d estimate that at least half of the value I get from manual tracking in QS experiments is from the mindfulness and intentionality that are involved in actively logging data. As someone who is clearly very focussed on their sleep and manually logs it daily (as well as checking various sensors), you’re probably way above average at estimating sleep duration. This speaks to the feedback loops that make (N-of-1) QS studies so difficult to interpret. I wonder if there would be any principled way to account for that in your experiments. Any ideas?

As always, keep up the great experiments and communication of results! :slight_smile:

I’ll continue using Fitbit Charge 5, Withings Sleep, Dreem 2, Oura ring and manual assessment. Since I’m not familiar with writing scientific articles (only reading :slight_smile: ), I’m not sure I can do that alone. I’ll come back to that after some time, when the data is gathered.

Same thing, right now I’ve just updated my priors with new data… I don’t understand what makes Fitbits good at sleep tracking. I assume the Oura ring’s HR, HRV, accelerometer and temperature data is of good accuracy, so it shouldn’t be so far from Fitbit, which uses the same things to predict sleep…

The fit is perfect. There is no air between finger and ring; it sits fine, without coming loose. Before sleep I check the orientation of the ring and rotate it if needed.

ring finger with US8 size

1-2 times per week, here is last week

Yeah, I’ve seen all of his videos. I’ve watched the ones about Oura / Fitbit Charge / Dreem multiple times. Oura has an announcement coming soon; if it’s a new ring, I’ll preorder it and wear both for some time.

The ring has some limitations due to its small battery. I’ve heard there is a scientific version with raw data access, but it doesn’t seem to be available to consumers.

It may also be worth noting that my sleep looks healthy. The only thing I want to improve is awake time, but it’s not big enough to be a problem. An SpO2 ring didn’t reveal any drops, so I shouldn’t have sleep apnea. It looks like I don’t have serious sleep disorders. Since consumer wearables state that they were validated on people free of sleep disorders, I assume they should work fine on me.

Install all the apps on my wife’s phone, let her sync all the devices, and not allow me to look at the results until the experiment is finished. But I’m not sure I want to be blind for a few months :slight_smile:
Also, I can see that Fitbit / Withings are not far from manual, so it might not be worth quantifying everything manually in the long term.

:+1:

For fans of the quantified scientist and the guy too: YOU CAN CHECK SPO2 SENSOR QUALITY JUST BY HOLDING YOUR BREATH!!!

This is fantastic work. Really rigorous analysis on a large number of measurement methods.

I’ve also been concerned about the accuracy of my automatically collected sleep data (I use an Apple Watch) and have been manually tracking for the last 108 nights. I don’t have the statistics skill to do the kind of detailed analysis you and gianlucatruda are doing, but from a simple scatterplot I see that my watch is reasonably good at picking out sleep/wake times most days (within 20 min), but sometimes is off by hours (e.g. claiming I woke up at 9a, when I get up at 5a and am at the office by 6a). As a result, I find that the average sleep time over the course of a week is OK, but the Watch data can’t be used for any kind of controlled experiment or really anything that relies on day-to-day accuracy.

Is there any way for you to look at frequency of significant error in your data? I’d be very interested if any of the devices were better/worse on that metric.
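One simple way to express “frequency of significant error” is the share of nights on which a device differs from the reference by more than some threshold. A minimal sketch follows; the 60-minute threshold, the function name and the nightly values are all hypothetical choices for illustration.

```python
def large_error_rate(device_minutes, reference_minutes, threshold_minutes=60):
    """Share of nights where |device - reference| exceeds the threshold."""
    errors = [abs(d - r) for d, r in zip(device_minutes, reference_minutes)]
    return sum(e > threshold_minutes for e in errors) / len(errors)


# Hypothetical total sleep times (minutes) for four nights.
watch = [420, 450, 600, 400]
manual = [430, 440, 360, 410]

rate = large_error_rate(watch, manual)
print(f"nights with an error above 60 min: {rate:.0%}")
```

Unlike an average error, this metric is not diluted by the many nights where the device is roughly right, so it highlights exactly the “off by hours” cases.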

It seems like between the three of us, we have a pretty substantial dataset as well as a significant interest in self-experiments on sleep :slight_smile:. Any interest in pooling data for analysis/write-up and/or getting together (by video) to discuss/plan out self-experiments? Sort of a QS working group for sleep studies. For me, at least, it would be really helpful to get feedback on experiment design, data analysis, and general motivational support.

Let me know if you’re interested and I’d be happy to schedule/set-up.

Yeah, I’m confused here too! The HR and HRV for the Oura ring are supposedly very accurate, so sensor-wise it should be at least on a level playing field. As for the software component, I wouldn’t imagine that Fitbit is at much of an advantage by having more users, especially given that Oura is focussing on sleep tracking. Very strange.

Yeah, that all checks out. I wear a US12 on my middle finger and also do the orientation checks regularly. From the photo, the fit looks identical to mine. Now I wish I had more sleep devices with a data history that I could use to see if your results replicate on my hardware and anthropometrics.

I don’t have missing segments quite so regularly. For me it’s probably more like once every 2 weeks on average (2-3 consecutive nights per month, interestingly).

They’re really great! I first met Rob in pre-pandemic times right when he was just starting his channel. It was at the Utrecht QS meetup and he literally rocked up covered in sensors and wires :laughing: That’s how I knew he was serious and worth getting to know! It seems like he’s found an interesting niche with wearable reviews and comparisons.

Yeah, I saw the announcement too! It is indeed a new ring. Seems like they’ve addressed two of the main issues: (1) they now have 24/7 HR logging and (2) blood oxygen levels. They’ve also added more temperature sensors and seem to be interested in the menstrual cycle tracking and illness prediction markets. Hopefully yours will arrive soon and you can get some comparative data!

That’s a really nice baseline analysis to have! This is even more curious, given that you should be right in the middle of the distribution and demographically similar to their calibration cohort (going off your anthropometrics and skin tone in your photos).

“In sickness and in health, in conventional recreation and in obsessive QS experiments…” :laughing:

Just don’t hold it for too long :wink:

Nice idea!
I’d definitely be up to share (some of) my data*.

As for planning experiments, I’d be up for that too but would prefer to delay that until the new year (while I finish off my MSc). But if @Max_Eastwood is also up for it, I think it could be interesting to explore some n-of-few experiments. I’m particularly interested in benchmarking the gen. 2 Oura ring the way Max has done to see if the low accuracy issue shows up in other subjects.

*My current research area is on differentially-private dataset sharing so I might have some suggestions for ways we could preserve some aspects of our privacy too!

How large a change do you see? I tried holding my breath with mine and didn’t see any change.

Any recommendation on a sensor?

Down to 92% on command with the Contec Medical Systems CMS50F.

Yeah, and Fitbit is on the wrist, which seems to be a worse place compared to the finger. But in my case reality surprised me.

Yeah, right now I’m thinking about always having multiple devices for each metric to make sure the devices agree well enough. Most of my checks reveal agreement that is not as good as I thought… Also, it seems manual assessment is a good sleep device :slight_smile:

Yeah, for sure. After watching, my motivation for QS went up and my wearables collection grew.

I bought 4 Oura gen 2 rings for me and my family 1 year ago and yesterday upgraded all 4 to gen 3. When we get the new rings, I’ll collect the old gen 2 ones and plan to wear 2-3 gen 2 rings and the new gen 3 ring simultaneously for some time, to check how 3-4 Oura rings (including gen 3) agree with each other and with other devices.

Actually I don’t want to “experiment” with my sleep too much. Since my sleep is generally healthy, I only want to optimize awakenings and awake time during the night. Right now I don’t know much about that and I’m not sure which interventions to try. Most of my data analysis didn’t reveal a connection between different events and nightly awakenings. My current plan is to talk to some doctors to get some directions to investigate.

Another point of interest is keeping enough deep/REM sleep while aging.

Can you explain what your interests in sleep are and what you are trying to achieve? If we have similar goals, why not try to design a pooled experiment :+1: Also, I’m in the process of learning statistical data analysis, because right now my skills aren’t enough to design / analyse complex things.

I think we’ve got similar interests in terms of “experiments.” I also have issues with waking up during the night, plus I get tired & unproductive in the afternoon.

In terms of interventions/experiments/analyses, I’m interested in

  1. Reducing daytime fatigue (I get tired & unproductive in the afternoons, even when getting a full 7.5h sleep)

  2. Determining accuracy of the new Oura 3 vs. other sleep trackers

  3. Assessing how sleep affects, and is affected by, blood pressure, cognitive function, blood glucose (probably not relevant for either of you if you don’t have diabetes) and other health metrics

  4. Pooling our existing data to get a better assessment of manual sleep tracking vs. tracking via wearables.

I pinged my friend @Wozniak about this topic, since he has a lot of data on his own sleep. I found him to be a skeptic about picking up causal cues from the chaos of daily life. Interestingly, he had an intuition about the relationship between learning and sleep which, when he checked his data, turned out to be wrong, at least at the most obvious level:

I studied similar experiments, and there is one huge problem. The crowd does
not realize that in free running sleep, the sleep model is relatively
simple. It is easy to adjust to biology and sleep like a baby. But once the
interference from the outside world sneaks in, things easily unravel into
chaos and no brain can make order in that. Chaos is hard to study. Add
melatonin at different times, coffee, light, exercise, alcohol, naps,
stress, etc. and this can be unmanageable. Teenagers cannot get normal
healthy sleep! Let alone adults with a great deal of injury to their sleep
control systems.

My primary advice: free run your sleep, keep a simple schedule, respect the
cues, and avoid the impact of technology/chemistry on healthy sleep. Only
then it is simple to maintain and analyze.

Ironically, written joyfully at 4 am :slight_smile: … My rhythm is free, but
computers keep me a prisoner to very unusual hours.

As for the impact of learning on sleep, the default should be lots of
learning which facilitates sleep. Afair, I did not study it too much because
what I found was very insignificant. I can try to look back into what I
found in sleepchart, if you care (if it was big, I would remember :slight_smile:)

apology for super late answer

Then, after checking his Supermemo and sleep tracking data:

oh, I see there is even an option in SuperMemo for that! I recall now:

the more you learn, the less you sleep!

sounds strange? in my case, the interpretation is very simple. On days of
high creativity, you learn a lot and usually go to sleep later. That
shortens the sleep. The quality of sleep may actually increase due to
improved structure. [if this makes sense, you can quote me … :slight_smile: … at
worst I will have to make it more precise :slight_smile:]
