(06-30-2014 11:34 AM)diamaunt Wrote: (06-30-2014 09:46 AM)Sleepster Wrote: Also, let me know if you want me to move this thread to the main forum where more nerds might see it.

not just nerds, MATH nerds.. (hello robysue )

The problem JediMark is facing is that ResMed itself is doing something really kludgy with the statistics AND its kludge is not documented.

From a math point of view, there's just no sound mathematical way of recovering the 95th percentile and the median (50th percentile) of a large data set composed of several disjoint subsets when all you know is the 95th percentile, the median, and the size of each subset.

Yes, you can take a weighted average of the 95th percentiles (or the medians) and hope that it's "close enough". (And for many data sets, that estimate may indeed be close enough.) But the distribution of the data in each subset matters a great deal: if the subsets all look pretty much alike, then averaging the 50% and 95% numbers will probably give a decent estimate for the 50% and 95% of the whole data set. But if the data varies substantially from subset to subset, some pretty wild things can happen.

As an example: Here's a collection of three data subsets where averaging the percentiles may not be a good idea for estimating the percentiles of the whole set:

Let's suppose we have one data point for each minute, and the following three data sets:

Data Set 1:

- 3 hours of data (180 data points) with this distribution:
160 points have value 6 (i.e. 2 hours 40 minutes of data have value 6)

10 points have value 7

5 points have value 8

3 points have value 9

2 points have value 10

The median for Data Set 1 is 6, and since 0.95*180 = 171 and the 171st number on our list is an 8, the 95% for Data Set 1 equals 8.

Data Set 2:

- 2 hours of data (120 data points) with this distribution:
50 points have value 6 (i.e. 50 minutes of data have value 6)

60 points have value 7

3 points have value 8

4 points have value 9

3 points have value 10

The median for Data Set 2 is 7, and since 0.95*120 = 114 and the 114th number on our list is a 9, the 95% for Data Set 2 equals 9.

Data Set 3:

- 3:20 or 3.3333 hours of data (200 data points) with this distribution:
65 points have value 6 (i.e. 65 minutes of data have value 6)

30 points have value 7

25 points have value 8

20 points have value 9

12 points have value 10

12 points have value 11

10 points have value 12

10 points have value 13

10 points have value 14

6 points have value 15

The median for Data Set 3 is 8, and since 0.95*200 = 190 and the 190th number on our list is a 14, the 95% for Data Set 3 equals 14.
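If you want to double-check those numbers, here's a quick Python sanity check. (This is just my own tally-based percentile convention — take the 0.95*n-th value, rounded up, of the sorted list — which matches the counting above; it's not anything ResMed is known to run.)

```python
import math

def percentile_from_counts(counts, p):
    """counts: dict mapping value -> number of data points; p in (0, 1].
    Returns the k-th smallest value, where k = ceil(p * n)."""
    n = sum(counts.values())
    k = math.ceil(p * n)          # 1-indexed rank of the percentile
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= k:
            return value

# The three data sets above, as value -> count tallies
set1 = {6: 160, 7: 10, 8: 5, 9: 3, 10: 2}
set2 = {6: 50, 7: 60, 8: 3, 9: 4, 10: 3}
set3 = {6: 65, 7: 30, 8: 25, 9: 20, 10: 12,
        11: 12, 12: 10, 13: 10, 14: 10, 15: 6}

for name, s in [("Set 1", set1), ("Set 2", set2), ("Set 3", set3)]:
    print(name,
          "median:", percentile_from_counts(s, 0.5),
          "95%:", percentile_from_counts(s, 0.95))
# Set 1: median 6, 95%: 8
# Set 2: median 7, 95%: 9
# Set 3: median 8, 95%: 14
```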

Weighted average of the medians vs. true median of the large data set
There are 180 + 120 + 200 = 500 minutes = 8.333 hours of data in all.

So the weighted average of the medians is:

Weighted average of the 50% numbers

= (6*3 + 7*2 + 8*3.333)/(3 + 2 + 3.333)

= (18 + 14 + 26.664)/8.333

= 58.664/8.333

= 7.04

But of the 500 numbers in the large data set, 160 + 50 + 65 = 275 of them are 6's. Hence the median of the large data set is 6.0. Given that the largest number in the data set is a 15, averaging the medians overestimates the true median, perhaps significantly.

Weighted average of the 95% vs. true 95% of the large data set
We compute the weighted average of the 95% numbers as follows:

Weighted average of the 95% numbers

= (8*3 + 9*2 + 14*3.333)/8.333

= (24 + 18 + 46.662)/8.333

= 88.662/8.333

= 10.64
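Both weighted averages above can be reproduced in a couple of lines of Python (hour figures and per-session statistics are copied straight from the example; nothing ResMed-specific here):

```python
# Session lengths in hours, and each session's median and 95% number
hours   = [3, 2, 200 / 60]   # 3:00, 2:00, and 3:20
medians = [6, 7, 8]
p95s    = [8, 9, 14]

total = sum(hours)           # 8.333... hours
wavg_median = sum(h * m for h, m in zip(hours, medians)) / total
wavg_p95    = sum(h * p for h, p in zip(hours, p95s)) / total

print(round(wavg_median, 2))   # 7.04
print(round(wavg_p95, 2))      # 10.64
```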

Now let's find the true 95% for the large data set. Since the large data set contains 500 numbers, its 95% is the 475th number on the sorted list. When we aggregate the data from Data Sets 1, 2, and 3 into one large list, here are the tallies:

The large data set looks like this:

- 8:20 or 8.3333 hours of data (500 data points) with this distribution:
275 points have value 6 (i.e. 275 minutes of data have value 6)

100 points have value 7

33 points have value 8

27 points have value 9

17 points have value 10

12 points have value 11

10 points have value 12

10 points have value 13

10 points have value 14

6 points have value 15

And the 475th number on this list is a 13, so the 95% for the whole, large data set is 13.

And when we compare the difference between:

Weighted average of the 95% numbers = 10.64

and

True 95% for large data set = 13

we've clearly got a problem: the weighted average of the 95% numbers seriously underestimates the true 95% for the large data set.
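The same tally-based check confirms the true median and 95% of the combined data set (again, just my own sanity-check code, using the same percentile convention as the counting above):

```python
import math
from collections import Counter

def percentile_from_counts(counts, p):
    """counts: dict mapping value -> number of data points; p in (0, 1]."""
    n = sum(counts.values())
    k = math.ceil(p * n)          # 1-indexed rank of the percentile
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= k:
            return value

# Merge the three sessions' tallies into one big tally
sessions = [
    {6: 160, 7: 10, 8: 5, 9: 3, 10: 2},
    {6: 50, 7: 60, 8: 3, 9: 4, 10: 3},
    {6: 65, 7: 30, 8: 25, 9: 20, 10: 12,
     11: 12, 12: 10, 13: 10, 14: 10, 15: 6},
]
combined = Counter()
for s in sessions:
    combined.update(s)

print("true median:", percentile_from_counts(combined, 0.5))   # 6
print("true 95%:", percentile_from_counts(combined, 0.95))     # 13
```

So 7.04 vs. a true median of 6, and 10.64 vs. a true 95% of 13 — the weighted averages miss in both directions.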

A final comment: Is this kind of example even relevant to CPAP data? The answer lies in why averaging the 95% numbers for these three data sets fails so miserably at finding the 95% for the whole data set: the upper tail of the data (the highest values) belongs almost entirely to ONE of the subsets.

One situation where this can occur is in the pressure curves for an APAP: If some sessions have little or no supine (or REM) sleep, the pressure may never get very high. But if one session contains a significant amount of REM or supine sleep, the pressure numbers for that session may be high enough that the 95% for the entire night is only reached during that session. And averaging the 95% numbers for the individual sessions may wind up underestimating the 95% for the whole night, perhaps significantly.

Other strategies?
You can try to fix the "weighted average of the percentiles" approach by taking into account that you have both the 50% and 95% numbers for each session (subset of data), and the max (but not the min for some data?). So you have maybe four numbers per session to work with in fitting a model for the missing data. That's not really much to work with given the size of the data sets.

And that's why JediMark is stuck.

Unfortunately, I'm not a statistician. So I really don't know what they do when faced with this kind of a situation.

The real question is: just how does a ResMed machine calculate these numbers when the SD card is not in place and it's only recording the summary data?

My own guess is they fake it with something that's "good enough" when the data sets are similar (such as a weighted average), and simply don't worry about the fact that this can lead to garbage statistics when there's wide variation in a few of the sessions.