A reusable architectural pattern for continuously syncing data from external APIs to local storage. Extracted from implementations across Strava, Withings, RetroAchievements, Last.fm, and other quantified-self integrations.

Overview

The API Sync Pattern enables automated, incremental collection of personal data from third-party services while maintaining:

  • Credential security (OAuth refresh tokens in Vaultwarden)
  • Incremental efficiency (fetch only new data since last sync)
  • Local sovereignty (structured JSON under git control)
  • Historical continuity (append-only individual files per entity)

This is infrastructure for continuous quantified self — every sync adds to the historical record without human intervention.

Core Components

1. OAuth 2.0 Refresh Flow

Most modern APIs require OAuth for user data access. The pattern:1

def refresh_access_token(self):
    """Exchange refresh token for new access token."""
    data = {
        'client_id': self.client_id,
        'client_secret': self.client_secret,
        'grant_type': 'refresh_token',
        'refresh_token': self.refresh_token
    }
    response = requests.post(TOKEN_URL, data=data)
    response.raise_for_status()  # fail loudly on revoked/invalid credentials
    return response.json()['access_token']

Credentials stored in Vaultwarden:

  • CLIENT_ID — application identifier
  • CLIENT_SECRET — application password
  • REFRESH_TOKEN — long-lived token for obtaining access tokens

Never store access tokens — they expire (typically within 1-6 hours). Always regenerate from the refresh token at sync time.
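Since the server may also rotate the refresh token on use (see the footnote on RFC 6749 §6), a hedged sketch that keeps whichever token comes back — TOKEN_URL and the function name are placeholders, not part of the original implementation:

```python
import requests

# Hypothetical endpoint; substitute the real provider's token URL.
TOKEN_URL = 'https://example.com/oauth/token'

def refresh_tokens(client_id, client_secret, refresh_token):
    """Exchange a refresh token for an access token, returning the
    (possibly rotated) refresh token alongside it."""
    response = requests.post(TOKEN_URL, data={
        'client_id': client_id,
        'client_secret': client_secret,
        'grant_type': 'refresh_token',
        'refresh_token': refresh_token,
    })
    response.raise_for_status()
    payload = response.json()
    # RFC 6749 §6: the server MAY issue a new refresh_token; keep it if so.
    return payload['access_token'], payload.get('refresh_token', refresh_token)
```

If a new refresh token comes back, write it back to Vaultwarden so the next sync still works.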

2. Incremental Fetching

Track the last sync timestamp to avoid re-fetching historical data:

def get_last_sync_timestamp(self):
    """Read timestamp from last sync metadata file."""
    try:
        with open('data/last_sync.json', 'r') as f:
            return json.load(f)['timestamp']
    except FileNotFoundError:
        return 0  # First sync, fetch everything
 
def fetch_new_activities(self, after_timestamp):
    """Fetch only activities created after given timestamp."""
    return self.api_request('/activities', params={'after': after_timestamp})

Benefits:

  • Fast syncs (seconds instead of minutes)
  • Respects API rate limits
  • Reduces bandwidth and storage growth
  • Enables frequent cron jobs (every 6 hours)

Edge cases:

  • First sync: after=0 fetches complete history
  • Missed syncs: gaps fill in automatically on the next run, since after comes from the last successful sync
  • Retroactive edits: some APIs provide updated_after for changes
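For APIs that do expose such a filter, fetching both new and retroactively edited entities can be sketched as follows — the parameter names are illustrative and vary per API:

```python
def fetch_incremental(self, last_sync):
    """Fetch entities created since the last sync, plus retroactive
    edits, for APIs exposing an updated-after style filter."""
    created = self.api_request('/activities', params={'after': last_sync})
    edited = self.api_request('/activities', params={'updated_after': last_sync})
    return created, edited
```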

3. Individual Files Per Entity

Store each activity/measurement/achievement as a separate JSON file:

data/
├── activities/
│   ├── 2026-02-05-morning-run-123456.json
│   ├── 2026-02-05-evening-ride-123457.json
│   └── 2026-02-06-lunch-walk-123458.json
├── measurements/
│   ├── 2026-02-05-weight-78945.json
│   └── 2026-02-06-weight-78946.json
└── summary.json

Why individual files?

  • Multiple entries per day handled naturally
  • Git-friendly diffs — one activity = one file changed
  • Easy querying — glob patterns, direct file access
  • Scales to thousands without monolithic files
  • Append-only — syncs never modify existing files

Naming convention:

{YYYY-MM-DD}-{entity-type}-{unique-id}.json

The date comes from the activity/measurement timestamp, not sync time. This keeps files organized chronologically even if synced late.
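The convention can be sketched as a small helper — the function name entity_filename is illustrative:

```python
from datetime import datetime

def entity_filename(entity_type, timestamp, unique_id):
    """Build {YYYY-MM-DD}-{entity-type}-{unique-id}.json from the
    entity's own timestamp, not the sync time."""
    date_str = datetime.fromisoformat(timestamp.replace('Z', '+00:00')).date().isoformat()
    return f"{date_str}-{entity_type}-{unique_id}.json"

# entity_filename('morning-run', '2026-02-05T08:30:00Z', 123456)
# → '2026-02-05-morning-run-123456.json'
```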

4. Structured JSON with Metadata

Each file contains:

{
  "id": 123456,
  "timestamp": "2026-02-05T08:30:00Z",
  "type": "run",
  "raw": {
    // Complete API response preserved
  },
  "calculated": {
    // Derived metrics (pace, power, trends)
  },
  "synced_at": "2026-02-05T14:22:15Z"
}

Fields:

  • raw — complete API response (never modify)
  • calculated — derived metrics, trends, analysis
  • synced_at — when this file was created
  • Entity-specific fields extracted for convenience

Why preserve raw responses?

  • Enables retroactive analysis without re-fetching
  • API adds new fields → you already have them
  • Schema changes don’t lose data
  • Debugging and verification
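A sketch of building such a record, assuming the upstream response carries id, start_date, and type fields (names vary per API and are assumptions here):

```python
from datetime import datetime, timezone

def wrap_entity(raw):
    """Wrap a raw API response: extract convenience fields, preserve
    the full payload untouched under 'raw', and stamp the sync time."""
    return {
        'id': raw['id'],
        'timestamp': raw['start_date'],  # field name varies per API
        'type': raw.get('type'),
        'raw': raw,            # complete response, never modified
        'calculated': {},      # filled in by later analysis passes
        'synced_at': datetime.now(timezone.utc).isoformat(),
    }
```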

5. Summary and Index Files

Maintain aggregate statistics for quick queries:

{
  "total_activities": 847,
  "total_distance_km": 3241.5,
  "date_range": {
    "first": "2020-03-15",
    "last": "2026-02-06"
  },
  "by_type": {
    "run": 523,
    "ride": 298,
    "walk": 26
  },
  "last_sync": "2026-02-06T14:22:15Z"
}

Updated on every sync to reflect the current state, so queries for totals never need to scan thousands of individual files.
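A minimal sketch of an incremental summary update under these assumptions — update_summary here is a standalone illustration, and only the counting fields from the example above are shown:

```python
import json
from collections import Counter

def update_summary(new_entities, path='data/summary.json'):
    """Fold newly synced entities into the running totals so readers
    never have to scan thousands of individual files."""
    try:
        with open(path) as f:
            summary = json.load(f)
    except FileNotFoundError:
        summary = {'total_activities': 0, 'by_type': {}}

    summary['total_activities'] += len(new_entities)
    for kind, count in Counter(e['type'] for e in new_entities).items():
        summary['by_type'][kind] = summary['by_type'].get(kind, 0) + count

    with open(path, 'w') as f:
        json.dump(summary, f, indent=2)
    return summary
```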

6. Calculated Metrics

Beyond raw data collection, compute domain-specific insights:

Strava example (training load):

  • CTL (Chronic Training Load) — 42-day moving average
  • ATL (Acute Training Load) — 7-day moving average
  • TSB (Training Stress Balance) — fatigue vs fitness
  • ACWR (Acute:Chronic Workload Ratio) — injury risk

Withings example (body composition):

  • Weight trends (7-day, 30-day moving average)
  • Body fat % change
  • Muscle mass gain/loss
  • Hydration patterns

RetroAchievements example (game progress):

  • Completion percentage
  • Mastery tracking
  • Points per day
  • Difficulty distribution

These calculations run during sync and are stored in the calculated field of each file or in separate derived data files.
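Following the moving-average definitions above, the Strava metrics can be sketched like this — note that production training tools often use exponentially weighted averages rather than simple ones:

```python
def training_load(daily_loads):
    """Moving-average training metrics from daily load values
    (most recent day last)."""
    def avg(xs):
        return sum(xs) / len(xs) if xs else 0.0

    ctl = avg(daily_loads[-42:])  # chronic (fitness): 42-day average
    atl = avg(daily_loads[-7:])   # acute (fatigue): 7-day average
    return {
        'ctl': ctl,
        'atl': atl,
        'tsb': ctl - atl,                   # positive = fresh, negative = fatigued
        'acwr': atl / ctl if ctl else 0.0,  # acute:chronic workload ratio
    }
```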

Implementation Template

Directory Structure

personal/service-name/
├── scripts/
│   ├── fetch_service.py          # Main sync script
│   └── calculate_metrics.py      # Derived metrics
├── data/
│   ├── activities/               # Or measurements/achievements/etc
│   ├── summary.json
│   └── last_sync.json
├── .forgejo/workflows/
│   └── sync.yml                  # Automated sync via CI
├── requirements.txt
└── README.md

Sync Script Pattern

#!/usr/bin/env python3
import json
import os
from datetime import datetime
import requests
 
class ServiceSync:
    def __init__(self):
        # Load credentials from environment (Forgejo secrets)
        self.client_id = os.getenv('SERVICE_CLIENT_ID')
        self.client_secret = os.getenv('SERVICE_CLIENT_SECRET')
        self.refresh_token = os.getenv('SERVICE_REFRESH_TOKEN')
        self.access_token = None
    
    def refresh_access_token(self):
        """Get fresh access token from refresh token."""
        # OAuth refresh flow
        pass
    
    def get_last_sync_timestamp(self):
        """Read last sync timestamp."""
        try:
            with open('data/last_sync.json') as f:
                return json.load(f)['timestamp']
        except FileNotFoundError:
            return 0
    
    def fetch_new_entities(self, after_timestamp):
        """Fetch entities created since last sync."""
        headers = {'Authorization': f'Bearer {self.access_token}'}
        response = requests.get(
            f'{API_BASE}/entities',
            headers=headers,
            params={'after': after_timestamp}
        )
        return response.json()
    
    def save_entity(self, entity):
        """Save entity to individual JSON file."""
        entity_id = entity['id']
        timestamp = entity['timestamp']
        date_str = timestamp.split('T')[0]
        
        filename = f"data/entities/{date_str}-{entity_id}.json"
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        
        with open(filename, 'w') as f:
            json.dump({
                'id': entity_id,
                'timestamp': timestamp,
                'raw': entity,
                'synced_at': datetime.utcnow().isoformat() + 'Z'  # match the documented UTC format
            }, f, indent=2)
    
    def update_summary(self, entities):
        """Update summary statistics."""
        # Calculate totals, ranges, breakdowns
        pass
    
    def run(self):
        """Main sync logic."""
        print("Starting sync...")
        
        # 1. Refresh access token
        self.access_token = self.refresh_access_token()
        
        # 2. Get last sync timestamp
        after = self.get_last_sync_timestamp()
        
        # 3. Fetch new entities
        entities = self.fetch_new_entities(after)
        print(f"Found {len(entities)} new entities")
        
        # 4. Save each entity
        for entity in entities:
            self.save_entity(entity)
        
        # 5. Update summary
        self.update_summary(entities)
        
        # 6. Record sync timestamp
        with open('data/last_sync.json', 'w') as f:
            # Note: utcnow().timestamp() would misread the naive datetime as
            # local time; subtracting the epoch keeps the value in UTC.
            epoch = int((datetime.utcnow() - datetime(1970, 1, 1)).total_seconds())
            json.dump({
                'timestamp': epoch,
                'synced_at': datetime.utcnow().isoformat() + 'Z'
            }, f, indent=2)
        
        print("Sync complete")
 
if __name__ == '__main__':
    sync = ServiceSync()
    sync.run()

Forgejo Actions Workflow

name: Sync Service Data
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:        # Manual trigger
 
jobs:
  sync:
    runs-on: docker
    container:
      image: node:20-bookworm
    steps:
      - uses: actions/checkout@v4
      
      - name: Install Python
        run: |
          apt-get update
          apt-get install -y python3 python3-pip python3-venv git
      
      - name: Install dependencies
        run: |
          python3 -m venv .venv
          .venv/bin/pip install -r requirements.txt
      
      - name: Run sync
        env:
          SERVICE_CLIENT_ID: ${{ secrets.SERVICE_CLIENT_ID }}
          SERVICE_CLIENT_SECRET: ${{ secrets.SERVICE_CLIENT_SECRET }}
          SERVICE_REFRESH_TOKEN: ${{ secrets.SERVICE_REFRESH_TOKEN }}
        run: .venv/bin/python scripts/fetch_service.py
      
      - name: Commit and push
        run: |
          git config user.name "agent"
          git config user.email "agent@dungeon.church"
          git add data/
          git diff --staged --quiet || git commit -m "data: sync $(date -u +%Y-%m-%d)"
          git push

Rate Limiting Strategies

Respect API Limits

Most APIs have rate limits:

  • Per-second: 5-10 requests/sec
  • Per-hour: 200-1000 requests/hour
  • Per-day: 2000-10000 requests/day

Pattern: Add delays between requests

import time
 
for entity in entities:
    fetch_detail(entity)
    time.sleep(0.2)  # 200ms = max 5 req/sec
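If requests are issued from several code paths, the delay can be centralized in a small throttle — a sketch, not taken from the original implementations:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive API calls."""
    def __init__(self, min_interval=0.2):  # 200ms = max 5 req/sec
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Create one Throttle per API and call throttle.wait() immediately before each request.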

Use Webhooks When Available

Some APIs offer webhooks for real-time notifications:

  • Strava: activity created/updated
  • GitHub: push, PR, issues
  • Calendar: event changed

Pattern: Webhook triggers on-demand sync instead of polling

on:
  repository_dispatch:
    types: [strava_activity_created]

Webhook deliveries don’t count against rate limits (though any follow-up detail fetches still do) and provide immediate updates.

Exponential Backoff on Errors

import time
 
def api_request_with_retry(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 429:  # Rate limited
                wait = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
                print(f"Rate limited, waiting {wait}s...")
                time.sleep(wait)
                continue
            return response.json()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")

Real-World Examples

Strava (Activities → Training Metrics)

Repo: personal/strava

  • Syncs runs, rides, walks every 6 hours
  • Individual file per activity: data/activities/2026-02-05-morning-run-123456.json
  • Calculates training load (CTL, ATL, TSB, ACWR)
  • Powers automated coaching insights

Withings (Measurements → Body Composition)

Repo: personal/withings

  • Syncs weight, body fat %, muscle mass, hydration
  • Individual file per measurement: data/measurements/2026-02-05-weight-78945.json
  • Calculates 7-day and 30-day moving averages
  • Tracks body composition changes over time

RetroAchievements (Game Progress → Completion Tracking)

Repo: personal/retroachievements

  • Syncs achievements, game progress, mastery badges
  • Individual file per achievement unlock
  • Calculates completion percentage, points per day
  • Tracks gaming activity and skill development

Last.fm (Scrobbles → Music Taste Analysis)

Repo: personal/scrobbles

  • Syncs listening history (scrobbles)
  • Enriches with genre tags via artist.getTopTags
  • Individual file per scrobble
  • Analyzes genre evolution, discovery patterns

Anti-Patterns

Don’t:

  • Store access tokens (they expire)
  • Re-fetch entire history every sync (use incremental)
  • Use monolithic JSON files (individual files scale better)
  • Discard raw API responses (preserve for retroactive analysis)
  • Sync more frequently than necessary (respect rate limits)
  • Store credentials in code or config files (use Vaultwarden)

Do:

  • Refresh access tokens from refresh tokens at sync time
  • Track last sync timestamp for incremental fetching
  • One file per entity for git-friendly diffs
  • Preserve complete API responses in raw field
  • Sync on a schedule that respects API limits (every 6 hours is common)
  • Store credentials in Vaultwarden, load via CI secrets

Testing Strategy

Local Testing First

Before deploying to CI:

# 1. Set up credentials
export SERVICE_CLIENT_ID=$(rbw get "Service API" -f CLIENT_ID)
export SERVICE_CLIENT_SECRET=$(rbw get "Service API" -f CLIENT_SECRET)
export SERVICE_REFRESH_TOKEN=$(rbw get "Service API" -f REFRESH_TOKEN)
 
# 2. Test sync script
python3 scripts/fetch_service.py
 
# 3. Verify output
ls -lh data/entities/
cat data/summary.json | jq .

See Pre-CI Testing Discipline for full testing checklist.

Test OAuth Flow

Verify token refresh works:

def test_oauth_refresh():
    sync = ServiceSync()
    token = sync.refresh_access_token()
    assert token is not None
    print(f"Access token: {token[:20]}...")

Test Incremental Sync

Simulate multiple syncs:

# First sync (full history)
python3 scripts/fetch_service.py
FIRST_COUNT=$(ls data/entities/ | wc -l)
 
# Second sync (only new)
python3 scripts/fetch_service.py
SECOND_COUNT=$(ls data/entities/ | wc -l)
 
# Should be same unless new entities exist
echo "First: $FIRST_COUNT, Second: $SECOND_COUNT"

Future Enhancements

  • Conflict resolution for retroactive edits (track updated_at)
  • Delta compression for large historical datasets
  • Multi-service aggregation (combine Strava + Withings for health dashboard)
  • Alerting on sync failures (webhook to Discord/Matrix)
  • Data export to standard formats (GPX, TCX, CSV)

Footnotes

  1. This OAuth 2.0 refresh token flow follows RFC 6749 Section 6, the standard specification for refreshing access tokens. The grant_type parameter must be exactly "refresh_token" per the specification. The server responds with a new access_token, token_type, expires_in, and optionally a new refresh_token. This implementation pattern is confirmed against the OAuth 2.0 specification and verified as of 2026-02-15.