Data prep before causal inference and ML

These are the steps for preparing self-tracking data. Some of them are fully automatic. The ML and causal-analysis steps (near the end) are complicated enough to need their own threads. Comments and corrections are appreciated. I plan to post an actual analysis with real data and charts in a year or so.

For each data source:

  1. Get files via API or manual download and put them into a folder specific to that data source. The program automatically checks all of these folders for new files. Graphs produced later in the analysis also get dumped into the source's folder, and the parsing code and tags for that specific source live there too.

  2. Parse. Aggregate if the data is too dense, like raw accelerometry. Automatically check whether overlapping new and old data disagree (see the sketch after this list).

  3. Derive new metrics, such as how long the user put off writing down their meal data (also sketched after the list).
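
As a rough illustration of the overlap check in step 2, here is a minimal sketch assuming each parsed file ends up as a pandas DataFrame indexed by timestamp (the column name and tolerance are placeholders, not a fixed design):

```python
import pandas as pd

def check_overlap_agreement(old: pd.DataFrame, new: pd.DataFrame,
                            value_col: str = "value", tol: float = 1e-6) -> pd.DataFrame:
    """Return timestamps where old and new files overlap but their values disagree."""
    overlap = old.index.intersection(new.index)
    merged = old.loc[overlap, [value_col]].join(
        new.loc[overlap, [value_col]], lsuffix="_old", rsuffix="_new")
    diff = (merged[f"{value_col}_old"] - merged[f"{value_col}_new"]).abs()
    return merged[diff > tol]
```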
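
And a sketch of the derived metric in step 3, assuming hypothetical "meal_time" and "logged_at" columns in the parsed meal log:

```python
import pandas as pd

def meal_logging_delay_minutes(meals: pd.DataFrame) -> pd.Series:
    """How long the user put off writing down each meal, in minutes."""
    delay = pd.to_datetime(meals["logged_at"]) - pd.to_datetime(meals["meal_time"])
    return delay.dt.total_seconds() / 60.0
```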

For each metric / time series (or, if several metrics are very closely correlated, for a bundle of them):

  1. Tag appropriately. Include provenance and a description of what the data actually is, plus goal and degree-of-controllability tags; these define the dependent and independent variables. Tags will be used to automate several of the upcoming steps.

  2. Basic single-series analysis and manual cleaning: look at the distribution to find outliers and at the distribution of jumps to catch stranger outliers (sketched below), decide whether 0s are NAs or real zeros, and check autocorrelation.

  3. Normalization, e.g. test scores adjusted for the difficulty of the test (sketched below).

  4. Decomposition. For example, a cognitive test score series can be decomposed into a learning curve, monthly interest in the test, weekly lack of sleep, and noise; each component gets its own graph (a basic decomposition is sketched below). Check the components against common models such as a learning curve. Possibly the derivatives replace the raw metric.

  5. Period detection. Find stretches where the variance differed a lot from usual, or where something like a tear (a sudden break or level shift) happened (sketched below).

  6. The user then annotates these periods, with help from searching the available diary notes. The resulting graph is worth sharing.
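
Here are a few sketches of what some of these per-series steps could look like in Python; all column names and thresholds are placeholders. First, the outlier checks in step 2, flagging points that are extreme either in level or in jump size:

```python
import pandas as pd

def flag_outliers(series: pd.Series, z: float = 4.0) -> pd.DataFrame:
    """Flag points that are extreme in level or in jump size (difference from the previous point)."""
    level_z = (series - series.mean()) / series.std()
    jumps = series.diff()
    jump_z = (jumps - jumps.mean()) / jumps.std()
    return pd.DataFrame({
        "value": series,
        "level_outlier": level_z.abs() > z,
        "jump_outlier": jump_z.abs() > z,
    })
```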
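
The normalization in step 3 could be as simple as z-scoring within each test version, assuming a hypothetical "test_version" column that identifies the difficulty variant:

```python
import pandas as pd

def normalize_by_test_version(scores: pd.DataFrame,
                              score_col: str = "score",
                              version_col: str = "test_version") -> pd.Series:
    """Z-score each result within its test version so harder and easier variants become comparable."""
    grouped = scores.groupby(version_col)[score_col]
    return (scores[score_col] - grouped.transform("mean")) / grouped.transform("std")
```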
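
For step 4, a standard STL decomposition is one possible starting point. This assumes a regularly sampled daily series with gaps already filled; the learning-curve and monthly components described in the step would need extra modelling on top of this:

```python
import pandas as pd
from statsmodels.tsa.seasonal import STL

def decompose_daily(series: pd.Series) -> pd.DataFrame:
    """Split a daily series into trend (e.g. a learning curve), weekly seasonality, and residual noise."""
    result = STL(series, period=7, robust=True).fit()
    return pd.DataFrame({
        "trend": result.trend,
        "weekly": result.seasonal,
        "noise": result.resid,
    })
```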
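
And a crude take on step 5, flagging stretches of unusual variance or a possible tear using rolling statistics:

```python
import pandas as pd

def find_unusual_periods(series: pd.Series, window: int = 14, z: float = 3.0) -> pd.DataFrame:
    """Mark points where rolling variance is far from its typical level, or where the
    mean of the current window differs sharply from the previous, non-overlapping window."""
    roll_var = series.rolling(window).var()
    var_z = (roll_var - roll_var.mean()) / roll_var.std()
    # Level shift: current window mean minus the mean of the window just before it.
    level_shift = series.rolling(window).mean().diff(window)
    shift_z = (level_shift - level_shift.mean()) / level_shift.std()
    return pd.DataFrame({
        "variance_unusual": var_z.abs() > z,
        "possible_tear": shift_z.abs() > z,
    })
```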

Single-series analysis ends here; combined analysis starts here.

  1. Impossible overlaps, like eating while sleeping. These are usually caused by timezone misalignment or user input error (an overlap check is sketched after this list).

  2. Compare multiple data sources for the same metric. Formal standards, such as measurements taken at the doctor's office, can be used to evaluate and improve cheaper, easier, continuous sources (a calibration sketch is below). Broken sensors can be detected this way too.

  3. Composite derived metrics such as posture and activity from multiple accelerometers.

  4. Correlation, clustering, and larger-scale period detection across metrics. This is purely exploratory.

  5. A day-quality graph built from some combination of the smaller goal metrics, possibly including user annotations. This is one type of result; other custom output charts are possible too. These are the results to post on the forum.

  6. Causal impact and cross-correlation (sketched below), with simulations if formal tests are not enough. Very complicated.

  7. Machine learning. (Statistical learning, really, but still.) Very complicated; a deliberately minimal sketch is below.

  8. RESULTS! Possible causes for every goal metric ranked by several measures.

  9. Suggestions for maximizing daily quality, based on the choices made in step 5, or suggested experiments to run.
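
A few sketches for the combined steps too, again with made-up column names. The impossible-overlap check from step 1 could look like this, assuming meal and sleep intervals each have "start" and "end" timestamp columns:

```python
import pandas as pd

def impossible_overlaps(meals: pd.DataFrame, sleep: pd.DataFrame) -> pd.DataFrame:
    """Return meal entries that fall inside a sleep interval."""
    hits = []
    for _, m in meals.iterrows():
        overlapping = sleep[(sleep["start"] < m["end"]) & (sleep["end"] > m["start"])]
        if not overlapping.empty:
            hits.append(m)
    return pd.DataFrame(hits)
```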
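
For step 2, one way to calibrate a cheap continuous sensor against sparse reference measurements (e.g. from the doctor's office) is to pair nearest-in-time readings and fit a linear correction. This assumes both frames have datetime "time" and numeric "value" columns, and the one-hour matching tolerance is a placeholder:

```python
import numpy as np
import pandas as pd

def calibrate_against_reference(sensor: pd.DataFrame, reference: pd.DataFrame) -> tuple[float, float]:
    """Pair each reference measurement with the nearest sensor reading in time,
    then fit a linear correction. Returns (slope, intercept) for the cheap sensor."""
    pairs = pd.merge_asof(
        reference.sort_values("time"), sensor.sort_values("time"),
        on="time", direction="nearest",
        suffixes=("_ref", "_sensor"),
        tolerance=pd.Timedelta("1h"),
    ).dropna()
    slope, intercept = np.polyfit(pairs["value_sensor"], pairs["value_ref"], deg=1)
    return slope, intercept
```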
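
One small piece of step 6 is cross-correlation at different lags between a candidate cause and a goal metric, assuming both are daily series aligned on the same index:

```python
import pandas as pd

def lagged_correlations(cause: pd.Series, effect: pd.Series, max_lag_days: int = 7) -> pd.Series:
    """Correlation between a candidate cause and a goal metric when the cause is
    shifted forward by 0..max_lag_days days."""
    return pd.Series({
        lag: cause.shift(lag).corr(effect)
        for lag in range(max_lag_days + 1)
    })
```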
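
And a deliberately crude illustration of step 7: fit a random forest on lagged candidate-cause features and rank them by importance. This is nowhere near the full causal/ML analysis the step calls for, and importance is not causality; it is only a first-pass ranking of what to test or model properly:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rank_candidate_causes(features: pd.DataFrame, goal: pd.Series) -> pd.Series:
    """Fit a random forest of the goal metric on candidate-cause features and
    rank the features by importance (descending)."""
    aligned = features.join(goal.rename("goal")).dropna()
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(aligned[features.columns], aligned["goal"])
    return pd.Series(model.feature_importances_, index=features.columns).sort_values(ascending=False)
```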
