
jedimark · 06-30-2014, 07:17 AM

An old problem with summary-only data has reared its ugly head, and it's time I found a better way to deal with it in SleepyHead.

This doesn't just affect ResMed machines, but for simplicity's sake I'm only referencing ResMed data here.

STR.edf provides a table of summary data for each day, with up to 1 year's worth of information.

For each day, it provides 50th Percentile, 95th Percentile and Maximum statistics for Leak, Pressure and various other data channels.

For days where no PRD/BRP .edf files are available, this is all the data there is for these channels.

Each STR.edf record also provides a list of mask on/off times which can be used to rebuild a day's session times. (Sessions that are close together are merged in an annoying way, but it's enough.)

SleepyHead uses a per-session model to store data indexes, not a per-day one. It's rather difficult to make data that is only available once per day fit SleepyHead's per-session storage model. Where data isn't available to recalculate indexes correctly, I have to use a hack to divide the percentile figures between sessions, so the daily calculation gives back the correct answer.

On SleepyHead's statistics page, every calculation that looks over a time period works out percentiles without having to read the entire data. Instead, each session stores an index containing a frequency count of each possible value (with time weights), and these indexes are summed per day.
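For anyone who wants to see the idea in code, here is a rough Python sketch of reading percentiles off summed frequency tables. It is only an illustration: the function names, the `{value: seconds}` layout and the sample numbers are all assumptions made up for this post, not SleepyHead's actual storage format.

```python
from collections import Counter

def merge_sessions(session_counts):
    """Sum per-session {value: seconds} tables into one table."""
    merged = Counter()
    for counts in session_counts:
        merged.update(counts)  # Counter.update adds to existing counts
    return merged

def percentile(counts, fraction):
    """Smallest value whose cumulative time reaches `fraction` of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no data")
    threshold = fraction * total
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= threshold:
            return value
    return max(counts)  # guard against floating-point round-off

# Hypothetical pressure tables for two sessions: value -> seconds at that value
session_a = {6.0: 3600, 6.5: 1800, 7.0: 600}
session_b = {6.0: 1200, 7.0: 2400, 9.0: 1200}

day = merge_sessions([session_a, session_b])
print(percentile(day, 0.50))  # 6.5
print(percentile(day, 0.95))  # 9.0
```

The point of the table approach is that only the small per-session tables need to be added together, never the raw samples. It also shows why summary-only days are a problem: there is no table to add.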

For summary-only days it is impossible to generate this frequency data, because only the median, 95th percentile and maximum are known. If I had access to minimum values for all channels (some are obviously zero), I do know curve fitting algorithms exist that could help generate a rough estimate, but that's a messy solution I'd rather avoid.

So statistics page summaries are somewhat inaccurate wherever summary-only data is present, until something is done about this issue.

This stuff is the reason I had to lock ResMed data's day splitting down to Noon split without close session sorting. Otherwise the way I divide the daily values between sessions would fail and give incorrect results. I know no other way to do this. :/

I'm half wondering what cheat ResScan uses, and I say cheat, because I highly doubt the same minds that thought up ResMed's summary file format mess could achieve a classy solution to this.

If any maths gurus can grok what I mean, and have any suggestions on this, I'd be very grateful.

I don't expect to find a perfect solution to this problem. What I'm looking for is a compromise that the majority of users will be happy with. :-/

**Sleepster** · 06-30-2014, 09:45 AM

I'm not sure I understand the problem, but I'll take a stab at what I think might be a solution. If I've missed the mark completely just let me know. It won't be the first time I've found a wrong solution to the wrong problem. ;)

Let's say you have three sessions:

Session 1 lasts 3 hours and has a 95th percentile of 8.
Session 2 lasts 4 hours and has a 95th percentile of 5.
Session 3 lasts 2 hours and has a 95th percentile of 9.

We take a weighted average:

(3*8 + 4*5 + 2*9)/(3 + 4 + 2)
= (24 + 20 + 18)/9
= 62/9

For an overall 95th percentile of about 6.9.
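The same arithmetic as a small Python sketch, in case anyone wants to plug in their own numbers. The function name is made up for illustration and is not anything from SleepyHead.

```python
def weighted_average(values, weights):
    """Duration-weighted average of per-session figures."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

p95_per_session = [8, 5, 9]   # 95th percentile of each session
hours_per_session = [3, 4, 2] # length of each session in hours

estimate = weighted_average(p95_per_session, hours_per_session)
print(round(estimate, 2))  # 6.89
```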

**Sleepster** · 06-30-2014, 09:46 AM

Also, let me know if you want me to move this thread to the main forum where more nerds might see it. :)

diamaunt · 06-30-2014, 11:34 AM

(06-30-2014, 09:46 AM)Sleepster Wrote: Also, let me know if you want me to move this thread to the main forum where more nerds might see it.

not just nerds, MATH nerds.. (hello robysue ;) )

jedimark · 06-30-2014, 11:45 AM

I probably should have posted this in main.. :-}

jedimark · (This post was last modified: 06-30-2014, 12:21 PM by jedimark.)

(06-30-2014, 09:45 AM)Sleepster Wrote: I'm not sure I understand the problem, but I'll take a stab at what I think might be a solution. If I've missed the mark completely just let me know. It won't be the first time I've found a wrong solution to the wrong problem.

Let's say you have three sessions:

Session 1 lasts 3 hours and has a 95th percentile of 8.
Session 2 lasts 4 hours and has a 95th percentile of 5.
Session 3 lasts 2 hours and has a 95th percentile of 9.

We take a weighted average:

(3*8 + 4*5 + 2*9)/(3 + 4 + 2)
= (24 + 20 + 18)/9
= 62/9

For an overall 95th percentile of about 6.9.

I use this method already in the Overview graphs legend calculations, which need to be calculated quickly within a single frame. It sort of feels like cheating.

This method does not give the same result as when you combine all the samples overall for those 3 sessions, for a particular data channel, rank them in order (according to duration weights where necessary) and take the sample closest to the 95.0th percentile.

I guess the question arises, does the average person really give a crap whether it's a true percentile given, or a weighted average for a multi-day time period?

Does it still give a statistically valid answer to someone who wants to know what their 95th percentile pressure was over, for example, a 3 month period?

Perhaps the weighted average is what the user really wants to see in the statistics page?

Perhaps I'm overthinking it... but I just don't want it to intentionally give a "wrong" answer. :/

***SuperSleeper*** · 06-30-2014, 01:42 PM

(06-30-2014, 11:45 AM)jedimark Wrote: I probably should have posted this in main.. :-}

Thread has been moved to the Main Forum. Thanks Mark.

robysue · (This post was last modified: 06-30-2014, 04:50 PM by robysue.)

(06-30-2014, 11:34 AM)diamaunt Wrote:
(06-30-2014, 09:46 AM)Sleepster Wrote: Also, let me know if you want me to move this thread to the main forum where more nerds might see it.
not just nerds, MATH nerds.. (hello robysue )

The problem JediMark is facing is that ResMed itself is doing something really kludgy with the statistics AND its kludge is not documented.

From a math point of view, there's just no sound mathematical way of approximating the 95th percentile and median (50th percentile) of a large data set composed of several disjoint subsets of data when all you know is the 95%, median, and the size of each of the subsets.

Yes, you can find a weighted average of the 95% percentiles (or the medians) and hope that it's "close enough". (And for many data sets, that estimate may indeed be close enough.) But the distribution of the data in each of the subsets is pretty important: If the data sets all look pretty much the same, then averaging the 50% and 95% will probably give you a decent enough estimate for the 50% and 95% of the whole data set. But if the data varies substantially from subset to subset, then some pretty wild things can happen.

As an example, here's a collection of three data subsets where averaging the percentiles may not be a good idea for estimating the percentiles for the whole set.

Let's suppose we have one data point for each minute, and we have the following data sets:

Data Set 1:

3 hours of data (180 data points) with this distribution:
- 160 points have value 6 (i.e. 2 hours 40 minutes of data have value 6)
- 10 points have value 7
- 5 points have value 8
- 3 points have value 9
- 2 points have value 10
The median for Data Set 1 is 6, and since 0.95*180 = 171 and the 171st number on our list is an 8, the 95% for Data Set 1 equals 8.

Data Set 2:

2 hours of data (120 data points) with this distribution:
- 50 points have value 6 (i.e. 50 minutes of data have value 6)
- 60 points have value 7
- 3 points have value 8
- 4 points have value 9
- 3 points have value 10
The median for Data Set 2 is 7, and since 0.95*120 = 114 and the 114th number on our list is a 9, the 95% for Data Set 2 equals 9.

Data Set 3:

3:20 or 3.3333 hours of data (200 data points) with this distribution:
- 65 points have value 6 (i.e. 65 minutes of data have value 6)
- 30 points have value 7
- 25 points have value 8
- 20 points have value 9
- 12 points have value 10
- 12 points have value 11
- 10 points have value 12
- 10 points have value 13
- 10 points have value 14
- 6 points have value 15
The median for Data Set 3 is 8, and since 0.95*200 = 190 and the 190th number on our list is a 14, the 95% for Data Set 3 equals 14.

Weighted average of the medians vs. true median of the large data set
There are 180+120+200 = 500 minutes = 8.333 hours in the data.
So the weighted average of the medians is:

Weighted average of the 50% numbers
= (6*3 + 7*2 + 8*3.333)/(3 + 2 + 3.333)
= (18 + 14 + 26.664)/8.333
= 58.664/8.333
= 7.04

But of the 500 numbers in the large data set, 160+50+65 = 275 of them are 6's. Hence the median of the large data set is 6.0. Given the fact that the largest number in the data set is a 15, that means averaging the medians overestimates the size of the true median, perhaps significantly.

Weighted average of the 95% vs. true 95% of the large data set
We compute the weighted average of the 95% as follows:

Weighted average of the 95% numbers
= (8*3 + 9*2 + 14*3.333)/8.333
= (24 + 18 + 46.662)/8.333
= 88.662/8.333
= 10.64

Now let's find the true 95% for the large data set. Since the large data set contains 500 numbers, the 95% of the large data set is the 475th number on the whole list, and when we aggregate the data for Data Sets 1, 2, and 3 together into the one large list, here are the tallies:

Large data set looks like this

8:20 or 8.3333 hours of data (500 data points) with this distribution:
- 275 points have value 6 (i.e. 275 minutes of data have value 6)
- 100 points have value 7
- 33 points have value 8
- 27 points have value 9
- 17 points have value 10
- 12 points have value 11
- 10 points have value 12
- 10 points have value 13
- 10 points have value 14
- 6 points have value 15

And the 475th number on this list is a 13, so the 95% for the whole, large data set is 13.

And when we compare the difference between:

Weighted average of the 95% numbers = 10.64

and

True 95% for large data set = 13

we've clearly got a problem using the weighted average of the 95% numbers as an estimate for the 95% number since the weighted average of the 95% numbers seriously underestimates the true 95% for the large set.
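If you'd rather check these numbers than take them on faith, here is a short Python sketch that rebuilds the three data sets from the tallies above and compares the two approaches. The function names are made up for illustration, and the percentile rule is the one used in this post (the k-th number on the sorted list, with k = percent times the number of points, rounded up).

```python
from collections import Counter

def percentile_from_counts(counts, pct):
    """Return the k-th smallest value, where k = ceil(pct/100 * n)."""
    n = sum(counts.values())
    rank = -(-pct * n // 100)  # integer ceiling of pct * n / 100
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value
    raise ValueError("empty distribution")

def weighted_average_of_percentiles(sessions, pct):
    """Average the per-session percentiles, weighted by session length."""
    total = sum(sum(s.values()) for s in sessions)
    weighted = sum(percentile_from_counts(s, pct) * sum(s.values()) for s in sessions)
    return weighted / total

# value -> number of one-minute data points, copied from the lists above
set1 = Counter({6: 160, 7: 10, 8: 5, 9: 3, 10: 2})
set2 = Counter({6: 50, 7: 60, 8: 3, 9: 4, 10: 3})
set3 = Counter({6: 65, 7: 30, 8: 25, 9: 20, 10: 12,
                11: 12, 12: 10, 13: 10, 14: 10, 15: 6})
sessions = [set1, set2, set3]
pooled = set1 + set2 + set3  # the large 500-point data set

for pct in (50, 95):
    estimate = weighted_average_of_percentiles(sessions, pct)
    true_value = percentile_from_counts(pooled, pct)
    print(pct, round(estimate, 2), true_value)
# prints:
# 50 7.04 6
# 95 10.64 13
```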

A final comment: Is this kind of an example even relevant to CPAP data? Well the answer to that question is really to consider why averaging the 95% numbers for these three data sets fails so miserably at finding the 95% for the whole data set: The problem is that the upper tail of data (the highest data numbers) all belong to ONE of the data subsets.

One situation where this can occur is in the pressure curves for an APAP: If some sessions have little or no supine (or REM) sleep, the pressure may never get very high. But if one session contains a significant amount of REM or supine sleep, the pressure numbers for that session may be high enough that the 95% for the entire night is only reached during that session. And averaging the 95% for the individual sessions may wind up underestimating the 95% for the whole night, perhaps significantly.

Other strategies?
You can try to fix the "weighted average of the percentiles" by taking into account that you do have both the 50% and 95% numbers for each session (subset of data). And the max. (But not the min for some data?) So you have maybe 4 things to try to work with in coming up with a "fit" model for the missing data. That's not really much to work with given the size of the data sets.

And that's why JediMark is stuck.

Unfortunately, I'm not a statistician. So I really don't know what they do when faced with this kind of a situation.

The real question is just how does a Resmed machine calculate these numbers when the SD card is not in place and it's only recording the summary data?

My own guess is they fake it with something that's "good enough" when the data sets are similar (such as a weighted average of averages), and simply don't worry about the fact that this can lead to garbage statistical data when there's some wide variation in a few of the sessions.

**Sleepster** · 06-30-2014, 03:04 PM

(06-30-2014, 12:20 PM)jedimark Wrote: Perhaps I'm overthinking it... but I just don't want it to intentionally give a "wrong" answer. :/

Well, I guess you could call it the weighted average of the percentiles so your conscience wouldn't bother you. :)

Let's see if I've at least stated the problem correctly:

Quote:Let's say you have three sessions:

Session 1 lasts 3 hours and has a 95th percentile of 8.
Session 2 lasts 4 hours and has a 95th percentile of 5.
Session 3 lasts 2 hours and has a 95th percentile of 9.

What is the 95th percentile for the entire 9-hour period that spans those three sessions?

Solution
During the 1st session 95% of the time the readings were below 8. 95% of 3 hours is 2.85 hours.

During the 2nd session 95% of the time the readings were below 5. 95% of 4 hours is 3.8 hours.

During the 3rd session 95% of the time the readings were below 9. 95% of 2 hours is 1.9 hours.

So for 2.85 hours the readings were below 8.
For 3.8 hours the readings were below 5.
For 1.9 hours the readings were below 9.

2.85 + 3.8 + 1.9 = 8.55 hours, which is of course simply 95% of the total time of 9 hours.

I don't believe there's any way to solve this. I would ask for partial credit but I believe my professor would prefer to see a proof of the fact that the solution doesn't exist.

Let's see, what are the professor's office hours, and just where can she be found during those hours? She's probably in the lounge taking a nap. I saw a CPAP machine in there earlier.
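Not a proof, but here is a Python sketch of why no formula can exist. It builds two made-up nights, one reading per minute, whose sessions have exactly the lengths and 95th percentiles from the problem above, and yet the two nights end up with different overall 95th percentiles. All the sample values are invented for illustration, and the percentile rule (the k-th smallest reading, with k = 95% of the count rounded up) is an assumption.

```python
def p95(samples):
    """k-th smallest sample, where k = ceil(0.95 * n)."""
    ordered = sorted(samples)
    rank = -(-95 * len(ordered) // 100)  # integer ceiling
    return ordered[rank - 1]

# Night A: every reading sits right at the session's 95th percentile.
night_a = [
    [8] * 180,  # session 1: 3 hours
    [5] * 240,  # session 2: 4 hours
    [9] * 120,  # session 3: 2 hours
]

# Night B: same lengths and same per-session 95th percentiles,
# but most readings in sessions 1 and 3 are low.
night_b = [
    [4] * 170 + [8] * 10,
    [5] * 240,
    [4] * 113 + [9] * 7,
]

for night in (night_a, night_b):
    per_session = [p95(session) for session in night]
    pooled = [reading for session in night for reading in session]
    print(per_session, p95(pooled))
# prints:
# [8, 5, 9] 9
# [8, 5, 9] 5
```

Both nights give the same summary figures, so anything computed from the summaries alone (including the weighted average of about 6.9) has to give the same answer for both, and it cannot be right for both.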

robysue · 06-30-2014, 05:09 PM

(06-30-2014, 03:04 PM)Sleepster Wrote:
(06-30-2014, 12:20 PM)jedimark Wrote: Perhaps I'm overthinking it... but I just don't want it to intentionally give a "wrong" answer. :/

Well, I guess you could call it the weighted average of the percentiles so your conscience wouldn't bother you.

Let's see if I've at least stated the problem correctly:

Quote:Let's say you have three sessions:

Session 1 lasts 3 hours and has a 95th percentile of 8.
Session 2 lasts 4 hours and has a 95th percentile of 5.
Session 3 lasts 2 hours and has a 95th percentile of 9.

What is the 95th percentile for the entire 9-hour period that spans those three sessions?

Solution
During the 1st session 95% of the time the readings were below 8. 95% of 3 hours is 2.85 hours.

During the 2nd session 95% of the time the readings were below 5. 95% of 4 hours is 3.8 hours.

During the 3rd session 95% of the time the readings were below 9. 95% of 2 hours is 1.9 hours.

NO. You can't say these things.

What you can say is this:

During session 1 the readings were AT or below 8 for 2.85 hours. You don't know if they were BELOW 8 for 2 of those hours and AT 8 for .85 more hours. Or perhaps they were AT 8 for the whole 2.85 hours.

Quiz time:

Let's say Session 1 lasts 3 hours and the readings increase by 0.5 units. Suppose we have the following (known) data:

Pressure is at 6.0 for 0.8 hours.
Pressure is at 6.5 for 0.2 hours.
Pressure is at 7.0 for 0.4 hours.
Pressure is at 7.5 for 0.6 hours.
Pressure is at 8.0 for 1.0 hours.

What are the median, 90%, and 95% levels? And how do you find them?

What if Session 1 lasted 3 hours and the readings increase by 0.5 units, but we have this distribution of the (known) data:

Pressure is at 6.0 for 0.2 hours.
Pressure is at 6.5 for 0.4 hours.
Pressure is at 7.0 for 0.4 hours.
Pressure is at 7.5 for 0.7 hours.
Pressure is at 8.0 for 1.3 hours.

What are the median, 90%, and 95% levels now? And how do you find them?

Finally, what if Session 1 lasted 3 hours and the readings increase by 0.5 units, but we have this distribution of the (known) data:

Pressure is at 6.0 for 0.0 hours.
Pressure is at 6.5 for 0.0 hours.
Pressure is at 7.0 for 0.2 hours.
Pressure is at 7.5 for 1.4 hours.
Pressure is at 8.0 for 1.4 hours.

What are the median, 90%, and 95% levels now? And how do you find them?

I have to go pick up my hubby. Enjoy the quiz.
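For anyone who wants to check their quiz answers, here is one possible way to work them in Python. It is only a sketch: the function name is made up, times are entered in tenths of an hour to keep the arithmetic exact, and "the p% level" is taken to mean the lowest pressure that the readings were AT or below for at least p% of the session.

```python
from bisect import bisect_left
from itertools import accumulate

LEVELS = [6.0, 6.5, 7.0, 7.5, 8.0]  # pressures, lowest first

def level_at(tenths_of_hours, pct):
    """Lowest level with cumulative time >= pct% of the session."""
    cumulative = list(accumulate(tenths_of_hours))
    target = pct * cumulative[-1] / 100
    # bisect_left finds the first cumulative total that reaches the target
    return LEVELS[bisect_left(cumulative, target)]

# Time spent at each level, in tenths of an hour (8 means 0.8 hours)
quizzes = {
    "first": [8, 2, 4, 6, 10],
    "second": [2, 4, 4, 7, 13],
    "third": [0, 0, 2, 14, 14],
}

for name, times in quizzes.items():
    print(name, [level_at(times, pct) for pct in (50, 90, 95)])
```

Run it and compare the three lines of output with each other; that comparison is the point of the quiz.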
