I’ve been working with data a lot for the past few years. And have had a number of distinct opportunities to decide where certain data caches should be stores/saved. Sometimes I’ve chosen to save data through a JSON API, connected to an external server. But other times I’ve elected to save my data locally, on a hard drive I physically control.
With all of that said, when it comes to self-quantification, where you store your data matters; IMHO it’s a “mission critical” decision. So, I’m curious about how others have chosen to store their personal analytics information. Where/how do you store your data? Do you make regular backups of your data?
Given how many self-tracking projects I’ve done, it’s a bit frustrating to answer this question and admit that I’ve always been haphazard in my data storage. I tend to pay very close attention for the duration of the project, and then just let the data senesce in various storage locations until I don’t trust it anymore because some crucial information (such as what device I used, or why certain data has gaps) is not retrievable any more. I have data in Google Sheets, in CSV files in various personal cloud services, in JSON files downloaded via export workflows, also in cloud storage, and under the custody of various services where I can (sometimes) access it if I still have the username and password. Pretty horrible answer, but figured it’s useful to be honest. The upside is that I’ve already processed a lot of this data and gotten something interesting out of it, and my lack of concern for long term storage somewhat reflects my intuition that it’s no longer very useful to me. BUT, I do lose the chance to think about long term changes, for instance in my heart rate, which now that I have an arrythmia I’d been interested in looking at.
The LLIF.org nonprofit is trying to address data storage over the long term by making itself a data lake for individuals, just like how enterprise companies have data lakes from which they draw historical data.
Currently, though, there is no way to import historical data from manual entry formats like Google Sheets.
Reading this discussion leads me to the question, which format is most important to allow a person to import to save long term?
Tabular data is the most common for analysis, but JSON is a common format for export. Conversion isn’t hard but requires some decisions to be made about how to structure the output. But have I understood correctly the question you’re asking?
Yes, as far as I can understand it as the asker! I’m not a data scientist or developer but I know we accept specific formats of data exports in CSV, JSON, and HTML. I’ll have to look into what tabular data would look like for import, storage, and trend analysis.
I have a cloud VPS (costs a few $ per month) with Nextcloud instance
My notebook and phone have folders synced with nextcloud.
Web service checks for new files every minute in these folders and process them by different logic based on folder names
Processing is done in a few steps: save original file raw data into mysql table. Then zip and put file into special archive folder.
Process saved raw data from mysql table into key/values and insert them into another mysql table.
For a fast queries and processing mysql tables are being mirrored into clickhouse instance
mysql / clickhouse db being backed up every few hours and zipped backup put into folder. Folder is synced across notebook / pc / phone
R scripts ran by cron do most job of data processing
This allows to avoid work with file system in data processing scripts, all data is in mysql table which are single place where all raw and processed data can be accessed. This also easily allow to re-process data if needed.
Backups are also stored from different places / types - file system archive folder, mysql backup, clickhouse backup.
Also i have a single table for all api’s i’m using (withings, polar, oura etc) where the api call request & request params, api url and response are stored. This table is also used as single place where all raw data from api calls being stored and can be re-processed if needed, without making api calls again.
That is a topic that is very close to my hearth. Getting all the different data sources together is super complex. A the moment I am very much relying on Google Sheets and some IFTTT scripts - but would be very interested if someone has found a nice solution that integrates data in a way that one can draw conclusions across different data sources - e.g. my workout and sleep data.
Last I checked (it’s been a while) Exist.io only stored day-level data – so if you are using their service for analyzing your data, I’d still try to figure out how to keep a copy of the data from the underlying sources somewhere. Same goes for other services of that type
Any text-based format seems fine for long-term storage, as are widely used binary formats (e.g. for images). The only issue is that you might accumulate a lot of different formats over time, until you need to sacrifice an entire weekend to answer a simple question like “how have my daily step counts changed over the past 20 years”. But at least it’s still feasible.
Since I originally wrote this, I have used several other data storage technologies. Based on those experiences, I would say that the most important lesson I’ve learned so far, is the importance of redundancy and reputation.
Pretty much all my QS data is stored in Apple’s Health app for iOS/iPadOS. It is inconvenient if not impossible to do any form of longitudinal analysis on this data within the app therefore I dump it all out — well minus the medication data, which Apple refuse to export — and keep the XML files produced.
Sadly the Health app export function exports everything everytime causing the XML files to grow over time. It would be more useful to have incremental exports instead. To that end I am looking at various (open source) database systems that can accept and process XML. MariaDB is the likely candidate for this with its LOAD XML and CONNECT (to XML) features; not considering MySQL which MariaDB is a fork from because of the issues with Oracle trying a copyright grab of open source licensed code. postgresSQL is a lesser candidate as its XML features appear to be more convulted to use.
Once in a database will need to create a sensible display of the Health data to find correlations.