Maintaining massive amounts of data in one spreadsheet for regression analysis/correlation-hunting

Hi everyone.

Background: For years I’ve been gathering data – tons of it! – on my health and well-being, and on various variables possibly connected to my health (supplement intake, diet, exercise, etc.), all in Word files. I want to make use of it, and to keep adding data (even more than I have been) in an easier way. Specifically, my goals – under the obvious overall rubric of maximizing health – are:

  1. to transfer the data I already have to a spreadsheet or some other form that will make it easier to analyze and see connections

  2. to be able, from now on, to enter data directly into the spreadsheet (or other systematically organized format)

  3. to be able to do regression analyses or the like to find connections among variables. Being able to visualize the data would be nice, but that’s minor.

My vision, thus, is that I would have a massive spreadsheet, and could find, use, or create macros (I’m new to Excel – sorry if “macro” isn’t quite the right word!) that find correlations.

Simple, I think:

  1. Set up a spreadsheet with time increasing to the right across the columns (open question: how finely to divide the units of time?), and variables listed in the rows.

  2. Write Visual Basic macros to convert the old Word data into Word tables and then move it over to the spreadsheet. (This part is easy – a rough sketch of such a conversion script appears after this list.)

  2a. Enter things that don’t change much using copy & paste, far into the future. (“Zn 10 mg”: copy, paste, paste, paste, paste… Can delete or alter it in the future if I change the amount of Zn I take.) (This part is also easy.)

  2b. Enter by hand new data that doesn’t come from a separate program (“2016-08-02; 21:00-23:00 / Social interaction: warmth level 7”).

  2c. Enter new data that does come from a separate program automatically where possible. I don’t have many such programs right now (the only one is CRON-O-Meter, which records dietary intake), so this might have to be done partly manually: an end-of-day or end-of-week dump from the program, a macro to reformat it if needed, then an import into the spreadsheet.

  3. Find, or write, macros (or whatever they’re called in the spreadsheet world) to find correlations.
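Here is that rough sketch of the conversion idea – in Python rather than the Visual Basic I mentioned, purely to illustrate; the file names and the exact line format (borrowed from my example in 2b) are assumptions:

```python
import csv
import re

# Hypothetical input/output file names; the line format follows the example
# in step 2b: "2016-08-02; 21:00-23:00 / Social interaction: warmth level 7"
LINE_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2});\s*"
    r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2})\s*/\s*"
    r"(?P<variable>[^:]+):\s*(?P<value>.+)"
)

def parse_line(line):
    """Return a dict of fields, or None if the line doesn't match the format."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None

with open("word_export.txt") as src, open("observations.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=["date", "start", "end", "variable", "value"])
    writer.writeheader()
    for line in src:
        row = parse_line(line)
        if row:
            writer.writerow(row)
```

Each matching line becomes one row in a CSV that the spreadsheet can open directly.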

I see two problems, but I’m wondering whether there are others that aren’t coming to mind.

A) The time-scale needs to be alterable, no doubt. At first I was thinking hour-long time columns would be fine, but that obviously won’t work for tracking correlations among diet, exercise, and blood glucose, for example. Ten-minute increments are probably what I’ll start with. But whatever fineness of scale I begin with, it’s likely I’ll want to change it later. I think that’s manageable, though.
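Why manageable? If each observation keeps its own timestamp, the bin width becomes a reporting choice rather than a storage choice. As a sketch (assuming the data eventually lives as timestamped rows rather than fixed-width time columns; the file and column names are made up), regrouping is a one-liner in a tool like Python’s pandas:

```python
import pandas as pd

# Sketch: with one timestamp per observation, bin width is a reporting
# choice rather than a storage choice. File and column names are made up.
df = pd.read_csv("observations.csv", parse_dates=["timestamp"])

# Regroup 10-minute readings into hourly means; switching to daily bins
# would just mean resample("1D") instead.
hourly = df.set_index("timestamp").resample("1h")["blood_glucose"].mean()
print(hourly.head())
```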

B) The amount of data might be so large that I will need something more powerful than a typical personal computer to manage it! Twenty years of data split into 10-minute increments is about 1,051,200 columns (20 years × 365 days × 144 ten-minute slots per day), and I’ll probably want to track at least 400 variables, so that’s over 400 million cells of data (most of them blank, of course – many will be values from blood tests taken only occasionally). That’s a lot. But maybe I underestimate the power of my HP Pavilion. (Of course, if I want to enter my 23andMe data, the number of cells increases even more. But genetic data consists of fixed contextual variables, so I’m not sure it belongs in the spreadsheet. It’s sort of like height and birth date.)

Anyone see other problems with my approach, or with my reasoning about my goals as stated here?

The post by Irene_Gabashvili here –

– mentions a couple of websites/programs that I could use my data with, I think. But I’d have to learn how to connect my data to the programs. Surely doable, but maybe complicated.

Mainly, for now, I’d like to be entering all my data into one spreadsheet (or spreadsheet-like program), and I’d like to be able to find important correlations.

Thanks!

Hi Zeta,

I am following the progress of two open-source data integration software systems that work with biomedical data from patients: i2b2 and Transmart (which relies on i2b2). I have worked with i2b2 more extensively, and while the system is overkill in your case, I like how patient “observations” are stored. I would suggest you store your own personal observations and measurements using the same EAV (entity-attribute-value) model, but in Excel. I like the i2b2 model because it focuses on well-annotated data.

I am thinking that this model applies to your data because your goal should be to make it easy to find and filter certain variables or values to test a hypothesis. Also, if at some point you want to compare your data with other people’s data or with published studies, or if your data turns out to be of value to some future clinical study (they may even pay you to use it), then you want to keep your data clean and tidy.

Unfortunately there is more than one standard in clinical and biomedical data annotation (e.g. for naming drugs there are SNOMED, RxNorm, ICD-10, HL7, NDC), but it is worse for self-measurement data, where it seems there are no standards at all.

Perhaps we can work with QS to start planning for those self-measurement data standards?

Let me prepare an Excel file to show you what I mean and I hope I can upload it here or you can message me directly.


Hi Zeta,
I think you will want the time as rows, not columns. You will hit the column limit in Excel before the row limit.
The specs I found online are 1,048,576 rows by 16,384 columns for each worksheet.

OP_Engr, thanks! I just found out about that myself.

Patbuen, wow, thanks for the tip. I’m obviously new to all this – well (and this is the key point): new to off-the-shelf products, not to tracking using my own hacked solutions.

But, hacked solution or not, you raise an important criterion I hadn’t thought of: being (easily) able to compare my data to others’, or even to contribute to science.

The absence of unified standards in data annotation has stopped me in the past from entering data in a more usable form. That is why I decided it was time simply to switch from Word (which would require not-yet-available AI to analyze!) to a spreadsheet, and just be entirely systematic and consistent with my naming conventions. If I always write (say) Zoloft as sertraline, always put a space between “25” and “mg”, never use a period after “mg”, and so on, I can do a global search-and-replace later to conform to whatever standard is needed by the solution I ultimately choose. My immediate goal is to save time: start entering data in some way other than in Word, since that is so slow.
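As a sketch of what I mean (the mapping entries are illustrative, not a real terminology), the whole “stay consistent now, map to a standard later” idea could hang on one synonym table plus a dose-format rule:

```python
import re

# Illustrative synonym table, not a real terminology standard.
SYNONYMS = {
    "Zoloft": "sertraline",
    "Zn": "zinc",
}

def normalize_name(name):
    """Map a personal shorthand to its canonical name, if one is known."""
    return SYNONYMS.get(name.strip(), name.strip())

def normalize_dose(text):
    """Enforce '25 mg' style: a space before the unit, no trailing period."""
    return re.sub(r"(\d)\s*(mcg|mg|g)\.?", r"\1 \2", text)

assert normalize_name("Zoloft") == "sertraline"
assert normalize_name("magnesium") == "magnesium"  # unknown names pass through
assert normalize_dose("25mg.") == "25 mg"
```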

In any event, I would love to see the Excel file illustrating what you mean.

Thanks,
Zeta.


I think you’ll find that one spreadsheet using just one tab to store absolutely everything will quickly become unmanageable, especially if you try to use columns or rows as discrete time intervals.

Normally, to record all these data for analysis (and presuming you stick with a spreadsheet instead of a database), one would use separate tabs for recording different types of data. One tab may record all your food/drink consumption, another tab exercise, another sleep, etc. In effect, these tabs are “data tables” that let you keep a “record” of information that logically fits together. Normally, a row is a record (such as all the nutrition information for one component of a meal). You’ll find Excel’s analytics work far better with a record’s fields as columns, and new records added over time as descending rows.

Also, it is far easier to record a timestamp for each record than to attempt to make each row an interval of time, which leaves you with far more blank cells than cells with data. Again, Excel’s analytics are not going to deal well with your time-interval concept.

As patbuen mentioned, it is often far easier to use the EAV model to record data in a tab than to use many, many columns. For example, say you want to record the daily amount of 5 selected supplements you take. The non-EAV way one would likely set up Excel’s columns would be something like:

  • date, time, supp1mg, supp2mg, supp3mg, supp4mg, supp5mg

However, you then cannot record it if you take each supplement at a different time of day. And say one supplement changes its unit of measure because you had to change suppliers. And say you decide to stop taking supplement 2 and add a new supplement 6. The non-EAV method quickly gets complicated and messy.

The EAV way of doing the same thing is:

  • date, time, supplementName, quantityConsumed, unitOfMeasure

Thus each day, five rows get created. The time of day when taken can vary per supplement; the unit of measure for a supplement can change over time; if you don’t take one of the supplements you don’t have an empty column (and don’t create an empty row), which saves space in the spreadsheet; and you can easily increase or decrease the number of supplements taken per day without having to alter the spreadsheet’s columns or introduce many blank columns.
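To make that concrete, here’s a small sketch of the same EAV layout, and of pivoting it back to a wide, one-row-per-day table on demand – in Python/pandas rather than Excel, purely to illustrate, with made-up data:

```python
import pandas as pd

# The same EAV columns as above, with made-up data: one row per observation.
records = pd.DataFrame(
    [
        ("2016-08-02", "08:00", "zinc", 10, "mg"),
        ("2016-08-02", "12:30", "vitamin D", 2000, "IU"),
        ("2016-08-03", "08:10", "zinc", 10, "mg"),
    ],
    columns=["date", "time", "supplementName", "quantityConsumed", "unitOfMeasure"],
)

# Pivot back to a wide one-row-per-day table on demand; untaken supplements
# show up as NaN instead of being stored as blank cells.
wide = records.pivot_table(
    index="date", columns="supplementName",
    values="quantityConsumed", aggfunc="sum",
)
print(wide)
```

Excel’s pivot tables do the equivalent of that last step from an EAV tab.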

It is very easy to combine data from multiple tabs in graphs, regressions, and pivot tables.

Normally, one would attempt to create some common columns across all tabs: startDate, startTime, endDate, endTime, item.

I think if you google around, you’ll find several examples of various people’s QS spreadsheets, and examples of analysis that they have performed. I would suggest learning from these examples before you begin to compose your own spreadsheet.

Just as an FYI, I personally use Excel extensively for various aspects of my own life, including sleep tracking, diet/exercise tracking, weather tracking, personal financial planning, etc. I also develop apps for iOS and Android, and have created custom apps for collecting data the way I want it rather than paying for some goofy app that does not do what I want. One can successfully use Excel for some pretty extensive datasets, pulling data from many sources.

However, Excel does have problems managing very LARGE datasets. Once the Excel file gets up in the 10MB or greater range, Excel will start slowing down. Get into the 100MB range, and you’ll find more problems, aborts, etc. You may find that long term, you may need to create annual Excel files rather than one that is 20 years long.

Hey Zeta,

I use SPSS for the storage and analysis of my data. It is a statistical program and, in my opinion, it gives you a better overview of your data than Excel does. Furthermore, I believe that data analysis (regression analysis, correlations, and the like) is easier in SPSS. However, SPSS is not a free program, so you will have to find your own way of getting it :slight_smile:


Self-Measurements.xlsx (11.3 KB)

I uploaded an Excel file structured like the i2b2 “observation” fact table, with 2 extra tabs for a data dictionary and external terminology. I am not saying this is how it should be done, but this is how it is done in several academic medical research institutes that have adopted i2b2, and it is how I would structure my own data. It does not support genetic data; I will think about how to cover that as well.

justintimmer is right that you can use SPSS or SAS to store and analyze your data, but there is a steep learning curve, and you can easily move your data from Excel to SPSS or another statistical tool by filtering for the concept you want to study.

It is also true, as OP_Engr says, that Excel has a limit on rows, but if your data does not fit in 1,048,576 rows you can start a new Excel file with the same 3 tabs. Ultimately you will want to import the data into a database or analytical software, and that always works from Excel.

Hello Zeta,

I’ve been through this process, initially storing my QS data in an Excel spreadsheet, but after a few years the limitations of Excel became frustrating: error-prone entry, easily damaged data (one wrong keystroke), limited reporting capability, limited data types (I wanted free text, images, etc.).

I ended up writing my own application, but you don’t have to go that far. I recommend you consider a “desktop database” program. These are full-fledged databases with features that let you create your own entry screens, reports, etc. Since you’re looking at Excel, you’re probably in the Windows environment, so Microsoft Access is the program I’d recommend you look at. There are lots of books to help beginners create their own database in Access.

Good luck!

Access is easier to learn and use than other database systems, but it is not widely supported for import into other systems or for transfer/exchange of data. Also, in the biomedical research community it is not as highly regarded as MySQL, SQL Server, or PostgreSQL.

One more comment. Again, I see it from the point of view of integrating our own measurements with our health data. Perhaps the CCDA model is better than the i2b2 model. A friend mentioned the following today:

“To me all EMRs will be dead in 20 years. All we need is a device that can decode your DNA in a few mins and perform analysis of all your vitals among other things in real-time and just save that info in a chip we all carry. Then, all that info will be transferred using a unified format known as CCDA from one place to another. The end of clunky EMRs.”

Maybe read about CCDAs here: https://www.healthit.gov/sites/default/files/c-cda_and_meaningfulusecertification.pdf

Have a look at the PowerPivot extension for Excel (free from Microsoft). It basically transforms the spreadsheet into a full in-memory BI solution that is highly flexible and can handle many millions of rows of data – basically perfect for your requirement. If you also use the Power Query plugin, you can connect directly to APIs.


I will second Nana’s PowerPivot suggestion. It brings a lot of power to Excel. Another benefit is that you can create multiple tables and relate them to one another, à la Access or SQL. The downside is that it uses its own language, called DAX, that you would have to learn.

Microsoft also has Power BI, which seems similar to PowerPivot but is its own application rather than an add-in, and has more of a focus on reporting and visualization. I haven’t used it, but one advantage is that it seems to be free for an individual.

For inspiration for using multiple regression to find connections among many variables, check out Chris Bartley’s Show&Tell talk from the 2013 Global QS Conference on figuring out what affected his chronic fatigue.

1M rows with 400 columns would be pushing the limits of Excel, but shouldn’t pose much of a problem for a tool like R, SPSS, SAS, JMP, or even Python with the proper packages. If you want to write your own data import scripts, Python may be the least Rube-Goldberg-esque option.
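As a minimal sketch of what that looks like in Python (the file and column names are assumptions, not a recommendation of particular variables):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file; each row is one day's record.
df = pd.read_csv("daily_log.csv")

y = df["fatigue_score"]  # dependent variable
X = sm.add_constant(df[["sleep_hours", "caffeine_mg", "exercise_min"]])

# Ordinary least squares; days with missing values are dropped.
model = sm.OLS(y, X, missing="drop").fit()
print(model.summary())

# A quick correlation screen across all numeric columns:
print(df.corr(numeric_only=True)["fatigue_score"].sort_values())
```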

The other replies are generally far more erudite than anything I can offer with regard to sophisticated handling. What they have presented, and the links, give me some interesting things to follow up (so thanks for that).

What I can offer is that I’ve been self-tracking my lifestyle for 40+ years, using Excel these last 20 or so.

Nowadays I track about 500+ variables daily (400 independent variables, 100 or so dependent ones). At present I use both simple correlations and multiple regression analysis. I have a row per day, and group my variables by such things as physical (exercise, etc), supplements, fluid intake, food, nutrition categories, prescription drugs, body composition, biomarkers, etc.

I did look at things like R, but despite my engineering background I found it too cumbersome.

I do find Excel can be slow to update at times, presumably due to the quantity of data, especially the links to the various graphs and collation analyses that recalculate each time.

I’ve attached a somewhat truncated copy of my spreadsheets to show what I do in case it is of use https://1drv.ms/u/s!ApkzBi2PNvSVlWXz4IifJ2GCqmwT

Greetings, everyone!

First, pardon my absence. I was admitted to the hospital, and was just released. Very long story, but some chronic health problems suddenly became acute. The chronic health problems, by the way, are a big part of why I started rethinking how I handle my data.

Second, I must offer my deep and sincere gratitude to everyone for responding. Even without being sent to the hospital, I would have needed time to formulate a thought-through response because the wealth of ideas here demands much research on my part. I’m realizing how little I know about all the progress that’s been made in tracking and the utilization of data.

While in the hospital I also realized that the criteria for the ideal system (for me, at least) need to be expanded. Some of you mentioned criteria I hadn’t thought of (such as being able to participate in research projects as a subject). In the hospital, I realized another one is important: having readable, useful data for healthcare practitioners. I’ve struggled with doctors who throw up their hands for so long that I’m used to being my own doctor. But with the right presentation of my data, there might be some doctors out there who could help me in a way I haven’t been able to help myself.

I have a lot more to research before saying more, and I also need to taper the opioids… Bear with me. Back soon!

Z

Sorry to hear Zeta, hope your recovery is swift.

Just wanted to add my 2 cents… I use databases as well as Excel PowerPivot. I’d really just like to drive home one point: if we are talking about data STORAGE, then I’m afraid a database is not optional. Access has been mentioned frequently, even though it can be annoying to get the data in. How about SQLite? It’s apparently super easy to learn. Once your data is securely stored there, you could analyze it using www.metabase.com
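As a minimal sketch of how little ceremony SQLite needs (the schema here is illustrative, loosely echoing the EAV/observation idea earlier in the thread):

```python
import sqlite3

# Illustrative EAV-style schema, loosely echoing the i2b2 "observation" idea.
conn = sqlite3.connect("selftracking.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS observations (
        observed_at TEXT NOT NULL,  -- ISO timestamp, e.g. '2016-08-02T21:00'
        attribute   TEXT NOT NULL,  -- e.g. 'sertraline', 'social warmth'
        value       REAL,
        unit        TEXT            -- e.g. 'mg'; NULL for unitless scores
    )
""")
conn.execute(
    "INSERT INTO observations VALUES (?, ?, ?, ?)",
    ("2016-08-02T21:00", "social warmth", 7, None),
)
conn.commit()

# Pull one attribute back out for analysis:
rows = conn.execute(
    "SELECT observed_at, value FROM observations WHERE attribute = ?",
    ("social warmth",),
).fetchall()
print(rows)
conn.close()
```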

If you choose to stay with Excel, please have a good backup strategy!

Sorry to hear about your recent hospital visit. Hope you are feeling better.
BTW, this statement is very important: “… another one is important: having readable, useful data for healthcare practitioners.”
A colleague at http://omicshelpdesk.com/ is developing a free open-source health-records viewer. He has not yet compiled it for the App Store and is working on a version for personal use, but we can let you know when he has it ready. Then you can use it to store your data, and any practitioner who uses HL7 can read it. This is his GitHub site: https://github.com/alexandern/ccdaviewer

You can also contact me at info@omicshelpdesk.com if you need help with your data.