A reusable architectural pattern for continuously syncing data from external APIs to local storage. Extracted from implementations across Strava, Withings, RetroAchievements, Last.fm, and other quantified-self integrations.

Overview

The API Sync Pattern enables automated, incremental collection of personal data from third-party services while maintaining:

  • Credential security (OAuth refresh tokens in Vaultwarden)
  • Incremental efficiency (fetch only new data since last sync)
  • Local sovereignty (structured JSON under git control)
  • Historical continuity (append-only individual files per entity)

This is infrastructure for continuous quantified self — every sync adds to the historical record without human intervention.

Core Components

1. OAuth 2.0 Refresh Flow

Most modern APIs require OAuth for user data access. The pattern:1

def refresh_access_token(self):
    """Exchange refresh token for new access token."""
    data = {
        'client_id': self.client_id,
        'client_secret': self.client_secret,
        'grant_type': 'refresh_token',
        'refresh_token': self.refresh_token
    }
    response = requests.post(TOKEN_URL, data=data)
    response.raise_for_status()  # fail loudly on revoked/invalid credentials
    return response.json()['access_token']

Credentials stored in Vaultwarden:

  • CLIENT_ID — application identifier
  • CLIENT_SECRET — application password
  • REFRESH_TOKEN — long-lived token for obtaining access tokens

Never store access tokens — they expire (typically within 1-6 hours). Always regenerate from the refresh token at sync time.
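Since the server may also rotate the refresh token on use (see the footnote on RFC 6749 §6), a hedged sketch that keeps whichever token comes back — TOKEN_URL and the function name are placeholders, not part of the original implementation:

```python
import requests

# Hypothetical endpoint; substitute the real provider's token URL.
TOKEN_URL = 'https://example.com/oauth/token'

def refresh_tokens(client_id, client_secret, refresh_token):
    """Exchange a refresh token for an access token, returning the
    (possibly rotated) refresh token alongside it."""
    response = requests.post(TOKEN_URL, data={
        'client_id': client_id,
        'client_secret': client_secret,
        'grant_type': 'refresh_token',
        'refresh_token': refresh_token,
    })
    response.raise_for_status()
    payload = response.json()
    # RFC 6749 §6: the server MAY issue a new refresh_token; keep it if so.
    return payload['access_token'], payload.get('refresh_token', refresh_token)
```

If a new refresh token comes back, write it back to Vaultwarden so the next sync still works.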

2. Incremental Fetching

Track the last sync timestamp to avoid re-fetching historical data:

def get_last_sync_timestamp(self):
    """Read timestamp from last sync metadata file."""
    try:
        with open('data/last_sync.json', 'r') as f:
            return json.load(f)['timestamp']
    except FileNotFoundError:
        return 0  # First sync, fetch everything
 
def fetch_new_activities(self, after_timestamp):
    """Fetch only activities created after given timestamp."""
    return self.api_request('/activities', params={'after': after_timestamp})

Benefits:

  • Fast syncs (seconds instead of minutes)
  • Respects API rate limits
  • Reduces bandwidth and storage growth
  • Enables frequent cron jobs (every 6 hours)

Edge cases:

  • First sync: after=0 fetches complete history
  • Missed syncs: gaps fill in automatically on the next run, since after comes from the last successful sync
  • Retroactive edits: some APIs provide updated_after for changes
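For APIs that do expose such a filter, fetching both new and retroactively edited entities can be sketched as follows — the parameter names are illustrative and vary per API:

```python
def fetch_incremental(self, last_sync):
    """Fetch entities created since the last sync, plus retroactive
    edits, for APIs exposing an updated-after style filter."""
    created = self.api_request('/activities', params={'after': last_sync})
    edited = self.api_request('/activities', params={'updated_after': last_sync})
    return created, edited
```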

3. Individual Files Per Entity

Store each activity/measurement/achievement as a separate JSON file:

data/
├── activities/
│   ├── 2026-02-05-morning-run-123456.json
│   ├── 2026-02-05-evening-ride-123457.json
│   └── 2026-02-06-lunch-walk-123458.json
├── measurements/
│   ├── 2026-02-05-weight-78945.json
│   └── 2026-02-06-weight-78946.json
└── summary.json

Why individual files?

  • Multiple entries per day handled naturally
  • Git-friendly diffs — one activity = one file changed
  • Easy querying — glob patterns, direct file access
  • Scales to thousands without monolithic files
  • Append-only — syncs never modify existing files

Naming convention:

{YYYY-MM-DD}-{entity-type}-{unique-id}.json

The date comes from the activity/measurement timestamp, not sync time. This keeps files organized chronologically even if synced late.
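The convention can be sketched as a small helper — the function name entity_filename is illustrative:

```python
from datetime import datetime

def entity_filename(entity_type, timestamp, unique_id):
    """Build {YYYY-MM-DD}-{entity-type}-{unique-id}.json from the
    entity's own timestamp, not the sync time."""
    date_str = datetime.fromisoformat(timestamp.replace('Z', '+00:00')).date().isoformat()
    return f"{date_str}-{entity_type}-{unique_id}.json"

# entity_filename('morning-run', '2026-02-05T08:30:00Z', 123456)
# → '2026-02-05-morning-run-123456.json'
```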

4. Structured JSON with Metadata

Each file contains:

{
  "id": 123456,
  "timestamp": "2026-02-05T08:30:00Z",
  "type": "run",
  "raw": {
    // Complete API response preserved
  },
  "calculated": {
    // Derived metrics (pace, power, trends)
  },
  "synced_at": "2026-02-05T14:22:15Z"
}

Fields:

  • raw — complete API response (never modify)
  • calculated — derived metrics, trends, analysis
  • synced_at — when this file was created
  • Entity-specific fields extracted for convenience

Why preserve raw responses?

  • Enables retroactive analysis without re-fetching
  • API adds new fields → you already have them
  • Schema changes don’t lose data
  • Debugging and verification
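A sketch of building such a record, assuming the upstream response carries id, start_date, and type fields (names vary per API and are assumptions here):

```python
from datetime import datetime, timezone

def wrap_entity(raw):
    """Wrap a raw API response: extract convenience fields, preserve
    the full payload untouched under 'raw', and stamp the sync time."""
    return {
        'id': raw['id'],
        'timestamp': raw['start_date'],  # field name varies per API
        'type': raw.get('type'),
        'raw': raw,            # complete response, never modified
        'calculated': {},      # filled in by later analysis passes
        'synced_at': datetime.now(timezone.utc).isoformat(),
    }
```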

5. Summary and Index Files

Maintain aggregate statistics for quick queries:

{
  "total_activities": 847,
  "total_distance_km": 3241.5,
  "date_range": {
    "first": "2020-03-15",
    "last": "2026-02-06"
  },
  "by_type": {
    "run": 523,
    "ride": 298,
    "walk": 26
  },
  "last_sync": "2026-02-06T14:22:15Z"
}

Updated on every sync to reflect the current state, so queries for totals never need to scan thousands of individual files.
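A minimal sketch of an incremental summary update under these assumptions — update_summary here is a standalone illustration, and only the counting fields from the example above are shown:

```python
import json
from collections import Counter

def update_summary(new_entities, path='data/summary.json'):
    """Fold newly synced entities into the running totals so readers
    never have to scan thousands of individual files."""
    try:
        with open(path) as f:
            summary = json.load(f)
    except FileNotFoundError:
        summary = {'total_activities': 0, 'by_type': {}}

    summary['total_activities'] += len(new_entities)
    for kind, count in Counter(e['type'] for e in new_entities).items():
        summary['by_type'][kind] = summary['by_type'].get(kind, 0) + count

    with open(path, 'w') as f:
        json.dump(summary, f, indent=2)
    return summary
```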

6. Calculated Metrics

Beyond raw data collection, compute domain-specific insights:

Strava example (training load):

  • CTL (Chronic Training Load) — 42-day moving average
  • ATL (Acute Training Load) — 7-day moving average
  • TSB (Training Stress Balance) — fatigue vs fitness
  • ACWR (Acute:Chronic Workload Ratio) — injury risk

Withings example (body composition):

  • Weight trends (7-day, 30-day moving average)
  • Body fat % change
  • Muscle mass gain/loss
  • Hydration patterns

RetroAchievements example (game progress):

  • Completion percentage
  • Mastery tracking
  • Points per day
  • Difficulty distribution

These calculations run during sync and are stored in the calculated field of each file or in separate derived data files.
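Following the moving-average definitions above, the Strava metrics can be sketched like this — note that production training tools often use exponentially weighted averages rather than simple ones:

```python
def training_load(daily_loads):
    """Moving-average training metrics from daily load values
    (most recent day last)."""
    def avg(xs):
        return sum(xs) / len(xs) if xs else 0.0

    ctl = avg(daily_loads[-42:])  # chronic (fitness): 42-day average
    atl = avg(daily_loads[-7:])   # acute (fatigue): 7-day average
    return {
        'ctl': ctl,
        'atl': atl,
        'tsb': ctl - atl,                   # positive = fresh, negative = fatigued
        'acwr': atl / ctl if ctl else 0.0,  # acute:chronic workload ratio
    }
```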

Implementation Template

Directory Structure

personal/service-name/
├── scripts/
│   ├── fetch_service.py          # Main sync script
│   └── calculate_metrics.py      # Derived metrics
├── data/
│   ├── activities/               # Or measurements/achievements/etc
│   ├── summary.json
│   └── last_sync.json
├── .forgejo/workflows/
│   └── sync.yml                  # Automated sync via CI
├── requirements.txt
└── README.md

Sync Script Pattern

#!/usr/bin/env python3
import json
import os
from datetime import datetime
import requests
 
class ServiceSync:
    def __init__(self):
        # Load credentials from environment (Forgejo secrets)
        self.client_id = os.getenv('SERVICE_CLIENT_ID')
        self.client_secret = os.getenv('SERVICE_CLIENT_SECRET')
        self.refresh_token = os.getenv('SERVICE_REFRESH_TOKEN')
        self.access_token = None
    
    def refresh_access_token(self):
        """Get fresh access token from refresh token."""
        # OAuth refresh flow
        pass
    
    def get_last_sync_timestamp(self):
        """Read last sync timestamp."""
        try:
            with open('data/last_sync.json') as f:
                return json.load(f)['timestamp']
        except FileNotFoundError:
            return 0
    
    def fetch_new_entities(self, after_timestamp):
        """Fetch entities created since last sync."""
        headers = {'Authorization': f'Bearer {self.access_token}'}
        response = requests.get(
            f'{API_BASE}/entities',
            headers=headers,
            params={'after': after_timestamp}
        )
        return response.json()
    
    def save_entity(self, entity):
        """Save entity to individual JSON file."""
        entity_id = entity['id']
        timestamp = entity['timestamp']
        date_str = timestamp.split('T')[0]
        
        filename = f"data/entities/{date_str}-{entity_id}.json"
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        
        with open(filename, 'w') as f:
            json.dump({
                'id': entity_id,
                'timestamp': timestamp,
                'raw': entity,
                'synced_at': datetime.utcnow().isoformat() + 'Z'  # match the documented UTC format
            }, f, indent=2)
    
    def update_summary(self, entities):
        """Update summary statistics."""
        # Calculate totals, ranges, breakdowns
        pass
    
    def run(self):
        """Main sync logic."""
        print("Starting sync...")
        
        # 1. Refresh access token
        self.access_token = self.refresh_access_token()
        
        # 2. Get last sync timestamp
        after = self.get_last_sync_timestamp()
        
        # 3. Fetch new entities
        entities = self.fetch_new_entities(after)
        print(f"Found {len(entities)} new entities")
        
        # 4. Save each entity
        for entity in entities:
            self.save_entity(entity)
        
        # 5. Update summary
        self.update_summary(entities)
        
        # 6. Record sync timestamp
        with open('data/last_sync.json', 'w') as f:
            # Note: utcnow().timestamp() would misread the naive datetime as
            # local time; subtracting the epoch keeps the value in UTC.
            epoch = int((datetime.utcnow() - datetime(1970, 1, 1)).total_seconds())
            json.dump({
                'timestamp': epoch,
                'synced_at': datetime.utcnow().isoformat() + 'Z'
            }, f, indent=2)
        
        print("Sync complete")
 
if __name__ == '__main__':
    sync = ServiceSync()
    sync.run()

Forgejo Actions Workflow

name: Sync Service Data
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:        # Manual trigger
 
jobs:
  sync:
    runs-on: docker
    container:
      image: node:20-bookworm
    steps:
      - uses: actions/checkout@v4
      
      - name: Install Python
        run: |
          apt-get update
          apt-get install -y python3 python3-pip python3-venv git
      
      - name: Install dependencies
        run: |
          python3 -m venv .venv
          .venv/bin/pip install -r requirements.txt
      
      - name: Run sync
        env:
          SERVICE_CLIENT_ID: ${{ secrets.SERVICE_CLIENT_ID }}
          SERVICE_CLIENT_SECRET: ${{ secrets.SERVICE_CLIENT_SECRET }}
          SERVICE_REFRESH_TOKEN: ${{ secrets.SERVICE_REFRESH_TOKEN }}
        run: .venv/bin/python scripts/fetch_service.py
      
      - name: Commit and push
        run: |
          git config user.name "agent"
          git config user.email "agent@dungeon.church"
          git add data/
          git diff --staged --quiet || git commit -m "data: sync $(date -u +%Y-%m-%d)"
          git push

Rate Limiting Strategies

Respect API Limits

Most APIs have rate limits:

  • Per-second: 5-10 requests/sec
  • Per-hour: 200-1000 requests/hour
  • Per-day: 2000-10000 requests/day

Pattern: Add delays between requests

import time
 
for entity in entities:
    fetch_detail(entity)
    time.sleep(0.2)  # 200ms = max 5 req/sec
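If requests are issued from several code paths, the delay can be centralized in a small throttle — a sketch, not taken from the original implementations:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive API calls."""
    def __init__(self, min_interval=0.2):  # 200ms = max 5 req/sec
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Create one Throttle per API and call throttle.wait() immediately before each request.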

Use Webhooks When Available

Some APIs offer webhooks for real-time notifications:

  • Strava: activity created/updated
  • GitHub: push, PR, issues
  • Calendar: event changed

Pattern: Webhook triggers on-demand sync instead of polling

on:
  repository_dispatch:
    types: [strava_activity_created]

Webhook deliveries don’t count against rate limits (though any follow-up detail fetches still do) and provide immediate updates.

Exponential Backoff on Errors

import time
 
def api_request_with_retry(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 429:  # Rate limited
                wait = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
                print(f"Rate limited, waiting {wait}s...")
                time.sleep(wait)
                continue
            return response.json()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")

Real-World Examples

Strava (Activities → Training Metrics)

Repo: personal/strava

  • Syncs runs, rides, walks every 6 hours
  • Individual file per activity: data/activities/2026-02-05-morning-run-123456.json
  • Calculates training load (CTL, ATL, TSB, ACWR)
  • Powers automated coaching insights

Withings (Measurements → Body Composition)

Repo: personal/withings

  • Syncs weight, body fat %, muscle mass, hydration
  • Individual file per measurement: data/measurements/2026-02-05-weight-78945.json
  • Calculates 7-day and 30-day moving averages
  • Tracks body composition changes over time

RetroAchievements (Game Progress → Completion Tracking)

Repo: personal/retroachievements

  • Syncs achievements, game progress, mastery badges
  • Individual file per achievement unlock
  • Calculates completion percentage, points per day
  • Tracks gaming activity and skill development

Last.fm (Scrobbles → Music Taste Analysis)

Repo: personal/scrobbles

  • Syncs listening history (scrobbles)
  • Enriches with genre tags via artist.getTopTags
  • Individual file per scrobble
  • Analyzes genre evolution, discovery patterns

Anti-Patterns

Don’t:

  • Store access tokens (they expire)
  • Re-fetch entire history every sync (use incremental)
  • Use monolithic JSON files (individual files scale better)
  • Discard raw API responses (preserve for retroactive analysis)
  • Sync more frequently than necessary (respect rate limits)
  • Store credentials in code or config files (use Vaultwarden)

Do:

  • Refresh access tokens from refresh tokens at sync time
  • Track last sync timestamp for incremental fetching
  • One file per entity for git-friendly diffs
  • Preserve complete API responses in raw field
  • Sync on a schedule that respects API limits (every 6 hours is common)
  • Store credentials in Vaultwarden, load via CI secrets

Testing Strategy

Local Testing First

Before deploying to CI:

# 1. Set up credentials
export SERVICE_CLIENT_ID=$(rbw get "Service API" -f CLIENT_ID)
export SERVICE_CLIENT_SECRET=$(rbw get "Service API" -f CLIENT_SECRET)
export SERVICE_REFRESH_TOKEN=$(rbw get "Service API" -f REFRESH_TOKEN)
 
# 2. Test sync script
python3 scripts/fetch_service.py
 
# 3. Verify output
ls -lh data/entities/
cat data/summary.json | jq .

See Pre-CI Testing Discipline for full testing checklist.

Test OAuth Flow

Verify token refresh works:

def test_oauth_refresh():
    sync = ServiceSync()
    token = sync.refresh_access_token()
    assert token is not None
    print(f"Access token: {token[:20]}...")

Test Incremental Sync

Simulate multiple syncs:

# First sync (full history)
python3 scripts/fetch_service.py
FIRST_COUNT=$(ls data/entities/ | wc -l)
 
# Second sync (only new)
python3 scripts/fetch_service.py
SECOND_COUNT=$(ls data/entities/ | wc -l)
 
# Should be same unless new entities exist
echo "First: $FIRST_COUNT, Second: $SECOND_COUNT"

Future Enhancements

  • Conflict resolution for retroactive edits (track updated_at)
  • Delta compression for large historical datasets
  • Multi-service aggregation (combine Strava + Withings for health dashboard)
  • Alerting on sync failures (webhook to Discord/Matrix)
  • Data export to standard formats (GPX, TCX, CSV)

Footnotes

  1. This OAuth 2.0 refresh token flow follows RFC 6749 Section 6, the standard specification for refreshing access tokens. The grant_type parameter must be exactly "refresh_token" per the specification. The server responds with a new access_token, token_type, expires_in, and optionally a new refresh_token. This implementation pattern is confirmed against the OAuth 2.0 specification and verified as of 2026-02-15.