File format for centralized storage of Quantified Self data

I have released a file format I have been using for the past year to store my QS data. It is based on HDF5 (a standard format for scientific data) via PyTables, and I have included a Python interface for reading and writing. I have also released Python apps to import data from AndroSensor (background QS data collection) and KeepTrack (manual QS data collection). I also just ordered a Fitbit Charge HR, so an importer for that should be coming soon.
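
To give a flavor of the Python side, here is a minimal sketch of writing one time series into an HDF5 file with PyTables. The group path, table name, and column layout below are illustrative only, not necessarily the exact HDFQS schema:

```python
import tables

# One row per sample. This column layout is illustrative, not
# necessarily the exact HDFQS schema.
class Sample(tables.IsDescription):
    time = tables.Int64Col()     # timestamp (ns since epoch)
    value = tables.Float64Col()  # measured value

with tables.open_file("health.h5", mode="w") as f:
    group = f.create_group("/", "raw", "Raw sensor data")
    table = f.create_table(group, "heart_rate", Sample, "Heart rate samples")
    row = table.row
    row["time"] = 1432000000 * 10**9
    row["value"] = 72.0
    row.append()
    table.flush()
```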

More information, along with downloads and source code, is at the link below. If anyone is interested, please try it out. Questions and comments are welcome.

http://projreality.com/blog/?p=39

Sam

HDF5 seems like a good choice for storing raw sensor data, but isn’t it overkill for the small data sets you can export from KeepTrack or Fitbit?

There are HDF5 libraries for most programming languages, and no doubt Real Programmers have no trouble figuring them out (including any problems caused by the native-code dependencies all these libraries have). But seeing how many people struggle just to read or write valid CSV or JSON files, this might have limited appeal…

Thanks for your comments. HDFQS is primarily meant for people using multiple devices and apps. One of my main struggles early on in QS was dealing with a large pile of exported CSV, XML, and binary files. Visualization and analysis became a pain and always ended up being “done later”. When that finally happened, I would sometimes discover errors in how I had been using a device or recording data, and in general I never got timely feedback from the data.

While one app like KeepTrack might not generate much data by itself, being able to centralize data from many different sources and easily query it for visualization could make QS data more useful.
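
As a rough illustration of what “easily query” can look like, PyTables supports in-kernel queries over a table. Something like this (the file name and node path are hypothetical) pulls all of one source’s samples within a time window:

```python
import tables

# Window boundaries, in ns since epoch; read_where() picks these
# variables up from the calling scope.
t0 = 1432000000 * 10**9      # window start
t1 = t0 + 24 * 3600 * 10**9  # one day later

with tables.open_file("health.h5", mode="r") as f:
    table = f.get_node("/raw", "heart_rate")
    rows = table.read_where("(time >= t0) & (time < t1)")
    print(len(rows), "samples in window")
```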

Sam

Makes sense; I was thinking of it more in terms of a data exchange format.

Presumably, HDF5 tools make importing and exporting CSV files easy, so this might be a good alternative to a directory with random CSV files…
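
For example, here is an untested sketch of pulling a two-column CSV export into an HDF5 table with PyTables (the file and column names are made up):

```python
import csv
import tables

class Sample(tables.IsDescription):
    time = tables.Int64Col()
    value = tables.Float64Col()

# Assumes a two-column export with a header row: timestamp,value
with tables.open_file("health.h5", mode="a") as f, \
     open("export.csv", newline="") as csvfile:
    table = f.create_table("/", "imported", Sample)
    for line in csv.DictReader(csvfile):
        row = table.row
        row["time"] = int(line["timestamp"])
        row["value"] = float(line["value"])
        row.append()
    table.flush()
```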

Great work. I like the choice of HDF5; I use it in all my research.

I would like to store my QS data in a database; I’m currently thinking of using DynamoDB. I would like to write a script that takes your thoroughly considered HDFQS format and saves it to a database, and also a script that does the reverse.

Have you considered this already?
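
Roughly what I have in mind, completely untested; the DynamoDB table name, key schema, and HDF5 path here are just my guesses:

```python
from decimal import Decimal

import boto3
import tables

# Copy one HDFQS time series into a DynamoDB table ("qs_data",
# keyed on series + time; both names are my own assumptions).
dynamodb = boto3.resource("dynamodb")
dest = dynamodb.Table("qs_data")

with tables.open_file("health.h5", mode="r") as f:
    source = f.get_node("/raw", "heart_rate")
    with dest.batch_writer() as batch:
        for r in source.iterrows():
            batch.put_item(Item={
                "series": "raw/heart_rate",         # partition key
                "time": int(r["time"]),             # sort key
                "value": Decimal(str(r["value"])),  # DynamoDB wants Decimal, not float
            })
```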

EDIT: I found an example file here to test on

I’m currently finalizing a similar data structure using Cassandra, which will allow easier viewing of data in real time (as opposed to having to use git annex or other means to sync individual files for an update). The tradeoff is that it is more complex to get up and running.

I’ll be looking at writing a script to transfer data between HDFQS and the Cassandra-based system soon. Perhaps it will be useful for your work. Unfortunately, I haven’t used DynamoDB before.
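
For reference, the transfer script I’m planning is roughly this shape; the keyspace, table, and column names are placeholders until the schema is finalized:

```python
import tables
from cassandra.cluster import Cluster

# Stream one HDFQS time series into Cassandra. Keyspace "qs" and
# table "samples" are placeholders, not the final schema.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("qs")
insert = session.prepare(
    "INSERT INTO samples (series, time, value) VALUES (?, ?, ?)")

with tables.open_file("health.h5", mode="r") as f:
    source = f.get_node("/raw", "heart_rate")
    for r in source.iterrows():
        session.execute(insert,
                        ("raw/heart_rate", int(r["time"]), float(r["value"])))

cluster.shutdown()
```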

Side note - it turned out that updating QSLife to display data in the Cassandra-based system was fairly straightforward. If you get to the point where you have a stable API and some example data using DynamoDB, I can add the DynamoDB interface into QSLife.

Hope this helps. Thanks for looking at HDFQS!

Sam

Cassandra is a good choice. In my experience, it’s very easy to set up for simple environments. Some people have also tried running it on low-power hardware like the Raspberry Pi. Of course it won’t break the light-speed limit there, but 200 writes per second on an RPi 1, or 2,500 IOPS on a cluster of RPi 2s, is more than enough for any home usage. On an RPi 3, which is about 10 times faster than the RPi 1, I believe it should work fine as a home server / data sink. So, in the end, it can even run on cheap hardware 24/7 if needed.