Are there any de facto standard formats for common auto-analytics data?

I’m starting a project to write my own auto-analytics tools, beginning with exercise data and how I spend my time. What are the most common database formats for recording those activities? Do any official standards exist?

If so, it would be awesome for a number of reasons. For one thing, it would make it easier for me to keep my old information if I ever decide to switch to a commercial product. More importantly, if I can make my database compatible with commercial programs, I could use them for analysis while still doing data entry through my own scripts.

Thanks for your thoughts!

I’m not aware of anything like a unified standard. It might be because there’s no unified focus for auto-analytics: some folks are interested in fitness, others in diet, geolocation, daily email volume, and so on.

But we are developers, yeah? Why not start the conversation? I think it’d be terribly interesting to have your private database able to take advantage of my quantitative analytics, visualization, etc., and vice versa, by sharing interchangeable formats.

What might it look like? Perhaps the first thing we’d need to define is the quantum of personal data: one entry in the log, one heartbeat from the GPS, one exercise session (or one rep…?). Here’s how I’m handling multiple simultaneous incoming data streams with this approach: I have one big database with a field for record ID, a datum type (“GPS”, “meal”, “commute snapshot”, etc.), ten datum value fields (e.g., when the datum is a GPS heartbeat, the values are latitude, latitude accuracy, longitude, longitude accuracy, elevation, speed, direction, and so forth), and then a bunch of fields for attributes and metadata of that datum (date, time, API version of the incoming data, etc.).
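To make that concrete, here’s a minimal sketch of the wide-record layout in SQLite (the table and field names are my own inventions, not a proposed standard):

```python
import sqlite3

# One wide table holds every datum type; the ten generic value columns
# are interpreted per-type via a data dictionary.
conn = sqlite3.connect("lifelog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS datum (
        record_id   INTEGER PRIMARY KEY,
        datum_type  TEXT,   -- 'GPS', 'meal', 'commute snapshot', ...
        v1 REAL, v2 REAL, v3 REAL, v4 REAL, v5 REAL,
        v6 REAL, v7 REAL, v8 REAL, v9 REAL, v10 REAL,
        date        TEXT,   -- attribute/metadata fields
        time        TEXT,
        api_version TEXT
    )
""")

# The data dictionary records what each value column means per datum type.
DATA_DICTIONARY = {
    "GPS": ["latitude", "latitude_accuracy", "longitude",
            "longitude_accuracy", "elevation", "speed", "direction"],
    "meal": ["calories", "protein_g", "carbs_g", "fat_g"],
}

# One GPS heartbeat becomes one row.
conn.execute(
    "INSERT INTO datum (datum_type, v1, v2, v3, v4, v5, v6, v7,"
    " date, time, api_version)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("GPS", 37.776, 5.0, -122.417, 5.0, 16.0, 2.3, 270.0,
     "2012-09-01", "08:15:00", "1.0"),
)
conn.commit()
```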

For summary stats, I just select out the one data type I need. For correlation, regression, and other more involved stats, I select out all the data types I need and leave the rest. From a data-architecture standpoint this is not the most efficient, but from a simplicity standpoint it works very well, and I think that simplicity is key for maintaining interoperability between users and toolsets. As it is, this whole thing can get dumped into a simple CSV and remains perfectly usable, given a data dictionary noting which values correspond to which fields for each data type.
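Continuing the SQLite sketch above, selecting one type back out and dumping the whole table to CSV are both short operations (column names as assumed earlier):

```python
import csv

# Summary stats: select out just the one datum type needed.
gps_points = conn.execute(
    "SELECT v1, v3 FROM datum WHERE datum_type = 'GPS'"
).fetchall()  # (latitude, longitude) pairs, per the data dictionary

# Interoperability: dump the whole table into a plain CSV; paired with
# the data dictionary, any toolset can make sense of it.
cursor = conn.execute("SELECT * FROM datum")
with open("lifelog.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
```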

I hope this makes some sense. :)

That definitely makes sense. I was thinking about making separate databases for each of my auto-analytics apps, but maybe that’s not the best way to do it. Especially for such small amounts of data where efficiency isn’t essential.

I think it’s stuff like exercise sessions vs. reps that could really mess this project up. One thing we could do is go ahead and make a directory of the most common formats people are using at the moment. Maybe a wiki would be best for collecting that information. Then we could do auto-analytics on our auto-analytics! *

  * Forgive me if this joke has been made a thousand times already. I’m new to this scene.

Very humble suggestion: at DidThis.com we use a micro-format on Twitter, e.g. #run:10miles. If you’re logged in to DidThis, it picks the tag up from Twitter, so you can track #whateveryouwanttotrack:1234unit. Cheers!
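A tag like that is trivial to parse, which is part of the appeal. Here’s a rough sketch of a parser (the exact grammar is my own guess: a tag, a colon, a number, and an optional trailing unit):

```python
import re

# Rough parse of the #tag:valueunit micro-format.
PATTERN = re.compile(r"#(?P<tag>\w+):(?P<value>\d+(?:\.\d+)?)(?P<unit>[A-Za-z]*)")

def parse_tags(tweet):
    """Return (tag, value, unit) triples found in a tweet."""
    return [(m["tag"], float(m["value"]), m["unit"])
            for m in PATTERN.finditer(tweet)]

print(parse_tags("Morning #run:10miles, then #coffee:2cups"))
# -> [('run', 10.0, 'miles'), ('coffee', 2.0, 'cups')]
```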

I like the Keep It Simple Approach that microformats encourage! Very approachable, very easy to pick up for a new entrant to lifelogging.

For my own approach, I’m also attaching metadata to boost the kinds of analytics I can run on the records.

Perhaps a good way to go about looking for standards is to look at the mass-market devices: Fitbit, Jawbone UP, Nike FuelBand, Motorola MOTOACTV, Nike+, etc. All these devices have lovely proprietary websites that do the basic number crunching and visualization for you, but do any of them allow you to download your own data in tabular format? The column headings would be very helpful to us here. :)

EDIT: DidThis.com is doing a good job of at-a-glance visualization. Lovely!

I work with data a lot, and I don’t think a standard would really help here. Data needs to conform to the requirements of the task, both in terms of what the data represents and how the data is analyzed. QS is not a consistent field. People are measuring different things and looking to get different information out of the data, so the data should differ from person to person.

I would worry more about the consistency and clarity of the data. For each piece of data you are collecting, write down exactly what it measures, and be sure you are recording it as consistently as possible.

Hey ichabod801, thanks for your perspective! These boards are otherwise pretty quiet. It’s exciting to have other folks here who work a lot with data; it’s fun to be able to talk shop without, well, the pressures of shop work. I think you’re definitely right about individuals having individual needs from their data.

I also think you’re absolutely right about data needing to conform to the requirements of the task. It certainly makes interoperability and data portability between platforms nontrivial. When interoperability itself is the key requirement of the task, though, as it is for seagreen above, we’re presented with a problem that calls for some thoughtful solutions.

Perhaps one way to go about conceiving a solution to data portability/interoperability/database standards is to look for useful common denominator data and metadata. As an example, look at digital photography. A million photographers photograph for ten million reasons or more. Yet, it is useful and costs little overhead to embed EXIF data in each photo:
  • Date
  • Time
  • Camera manufacturer
  • Camera model
  • Aperture
  • Shutter speed
  • Geodata (increasingly common)
  • etc.
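You can see this common denominator directly; for example, Pillow will read the EXIF block out of any JPEG (the file path here is hypothetical):

```python
from PIL import Image, ExifTags

# Dump whatever EXIF metadata a photo carries; tag IDs are mapped to
# readable names like Make, Model, and DateTime.
img = Image.open("photo.jpg")  # hypothetical path
for tag_id, value in img.getexif().items():
    print(ExifTags.TAGS.get(tag_id, tag_id), value)
```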

These data are extremely useful when considered in aggregate, even though the data needs (the photograph content) are different across photographers and even across a single photographer’s images.

Circling back to lifelogging, regardless of the content I’m capturing, I find myself repeatedly attaching the same few key metadata to each datum (see the sketch after this list):

  • Date
  • Time
  • Location (if available)
  • Data pipeline: did this get logged through my task website, browser extension, manual entry, laptop background process, something else? (It helps to analyze similar kinds of data across these pipelines to look for selection biases.)
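Here’s what that metadata envelope might look like as a sketch (the field names are mine, not any standard):

```python
from datetime import datetime, timezone

def wrap_datum(datum_type, values, location=None, pipeline="manual"):
    """Attach the common-denominator metadata to one lifelog datum.

    'pipeline' records how the datum arrived (task website, browser
    extension, manual entry, laptop background process, ...), which
    makes it possible to check for selection bias between pipelines.
    """
    now = datetime.now(timezone.utc)
    return {
        "type": datum_type,
        "values": values,
        "date": now.date().isoformat(),
        "time": now.time().isoformat(timespec="seconds"),
        "location": location,  # None when unavailable
        "pipeline": pipeline,
    }

print(wrap_datum("meal", {"calories": 450}, pipeline="browser_extension"))
```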

When those metadata get repeated so much, it makes me think long and hard about including them in any kind of data architecture: certainly any kind I’d want to share for interoperability, for switching platforms or hardware, or for letting my users export their own lifelogging data for analyses beyond what my services currently provide.

But in my case I only use one of the four you mention regularly (date), and one more (time) I only need for specific data. I used to track location, but I dropped it because of poor data quality, and haven’t missed it in the slightest. In fact, dropping it has improved data quality in other areas. Admittedly, I only have one data pipeline for QS work, but in other contexts it always seems to be inherent in the data file. So it still seems to me to be a project specific choice. And while aggregation might be helpful across photographers, I don’t see how it would be useful across QS users. At least, not to the users themselves.

All fair points. The usefulness of any kind of standard across QS users (to the users themselves) is in making the data comparable, social, repeatable, etc. One of the biggest criticisms of QS by outside quants to date is the idea that we’re all just running n = 1 studies.

I don’t think that’s a valid criticism, and even if it was I don’t think that would be a valid response to the criticism.

I am not running an n = 1 study. I am not the sample; I am the population. I’m doing a P = 1 study. I am taking samples from myself, such as my mood and what I am doing, and comparing them with population data such as the food I’m taking in. And even if those “quants” who are criticizing us still think I’m doing an n = 1 study, what’s the problem with that? The objection to n = 1 studies is that they don’t transfer to other n’s. I’m not trying to transfer to other n’s; I’m just trying to understand myself. Problem solved.

Let’s say we did develop a standard data format and used it to combine QS data across users. There’s a name for that kind of study: a meta-analysis. Meta-analyses are very popular, and they are highly suspect. They’re popular because they’re cheap: you don’t have to run your own study and deal with all of the human-testing issues. They’re suspect because they combine data from different studies that used different methodologies, and not having a standard methodology for your data collection can introduce bias into your analysis.

Often these meta-analyses combine data from studies with different treatments. They have to, because you don’t get published repeating someone else’s study (an odd state of affairs if you consider the scientific method). That’s an even bigger problem in QS, because everyone is doing a different treatment, often a completely different kind of treatment. It wouldn’t be like combining five studies on aspirin and heart attacks; it would be like combining 500 studies on weight loss, weight gain, sleep quality, cholesterol, blood pressure, mood, …

The first lesson to take away from this is that just because the data is in the same format doesn’t mean it’s comparable. The second lesson is that we should be defining what QS is, not the outside quants. They want papers they can publish in peer-reviewed journals. We don’t have to give them that. If we are getting what we want out of the process, it is not our problem that they are not getting what they want.

I think there is a lack of rigor in QS. (For that matter, I think there is a lack of rigor in peer-reviewed journals, and maybe the outside quants should clean up their own back yard before looking over the fence.) I’ve seen QSers be happy with an increase from 95% to 97% without addressing whether that is statistically significant, much less practically significant. I’ve seen QSers be happy with a decrease in cholesterol when they were only looking at LDL and not HDL, and they’d actually made their situation worse. But neither of these is a data format issue.
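For instance, whether a jump from 95% to 97% is statistically significant depends entirely on the sample size, which a quick two-proportion z-test makes obvious (the counts below are hypothetical):

```python
from math import erf, sqrt

def two_proportion_p(success1, n1, success2, n2):
    """Two-sided p-value for a difference between two proportions."""
    pooled = (success1 + success2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (success2 / n2 - success1 / n1) / se
    # Convert |z| to a two-sided p-value via the normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(two_proportion_p(95, 100, 97, 100))          # ~0.47: not significant
print(two_proportion_p(9500, 10000, 9700, 10000))  # far below 0.001
```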

I think that if we make sure that people are understanding what they are studying, and help them to understand best practices for analyzing the data they collect, it will do far more for our community than worrying about data format issues and trying to appease people outside our community (who perhaps don’t understand what we are doing).

IIRC, meta-analysis isn’t something where you just willy-nilly cherry-pick data. You have to establish a standard procedure up front for which studies to include or exclude.

It’s not about the data. People who do meta-analyses don’t have access to the data; they just have access to the studies reporting the data. And yes, you are supposed to use only studies that are comparable, but that’s not what people do. They just take whatever studies they can get. Take a serious look at the statistical methods in peer-reviewed journals, and you begin to realize how little of the “science” done these days is worthwhile. Just the other day I saw a journal article claiming a link between caffeine and vision problems, despite the fact that the authors couldn’t even show a statistically significant correlation. Earlier this year it was reported that a major cancer researcher decided to replicate what he felt were 57 landmark cancer studies; only 16 of them could be replicated.

Well, I don’t have anything to argue, so I concede the point. In any case, I am interested in rigor; I just have no idea how to achieve it.

Being interested in rigor is good, and it’s not really that complicated. Be consistent in your methodology. Watch out for confounding factors (other things that could be affecting your results). Remember that correlation is not causation, and that statistical significance is not the same as practical significance.
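To make that last point concrete: scipy will hand you a p-value alongside a correlation, and the two numbers answer different questions (the data below is fabricated for illustration):

```python
from scipy import stats

# Fabricated daily logs: caffeine intake (mg) vs. self-rated mood (1-10).
caffeine = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450]
mood = [5, 6, 6, 7, 7, 6, 8, 7, 8, 7]

r, p = stats.pearsonr(caffeine, mood)
print(f"r = {r:.2f}, p = {p:.4f}")
# A small p says the correlation is unlikely to be chance (statistical
# significance); whether r is big enough to act on is practical
# significance, and neither one implies causation.
```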

My problem is more about not knowing how to do statistical analysis than about remembering that correlation is not causation. Specifically, I still don’t know how to find statistical significance.

However, I haven’t really thought about the difference between statistical significance and practical significance.