A reusable architectural pattern for continuously syncing data from external APIs to local storage. Extracted from implementations across Strava, Withings, RetroAchievements, Last.fm, and other quantified-self integrations.
Overview
The API Sync Pattern enables automated, incremental collection of personal data from third-party services while maintaining:
- Credential security (OAuth refresh tokens in Vaultwarden)
- Incremental efficiency (fetch only new data since last sync)
- Local sovereignty (structured JSON under git control)
- Historical continuity (append-only individual files per entity)
This is infrastructure for continuous quantified self — every sync adds to the historical record without human intervention.
Core Components
1. OAuth 2.0 Refresh Flow
Most modern APIs require OAuth for user data access. The pattern:[^1]
```python
def refresh_access_token(self):
    """Exchange refresh token for new access token."""
    data = {
        'client_id': self.client_id,
        'client_secret': self.client_secret,
        'grant_type': 'refresh_token',
        'refresh_token': self.refresh_token
    }
    response = requests.post(TOKEN_URL, data=data)
    return response.json()['access_token']
```

Credentials stored in Vaultwarden:

- `CLIENT_ID` — application identifier
- `CLIENT_SECRET` — application password
- `REFRESH_TOKEN` — long-lived token for obtaining access tokens
Never store access tokens — they expire (1-6 hours). Always regenerate from refresh token at sync time.
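Some providers also rotate the refresh token on every exchange; per RFC 6749 (see the footnote) the token response may include a new `refresh_token`. A minimal sketch of handling rotation, where `store_refresh_token()` is a hypothetical helper that writes the new value back to Vaultwarden and `TOKEN_URL` is a placeholder:

```python
import requests

TOKEN_URL = 'https://api.example.com/oauth/token'  # placeholder endpoint

def refresh_access_token(self):
    """Exchange the refresh token; persist a rotated one if returned."""
    response = requests.post(TOKEN_URL, data={
        'client_id': self.client_id,
        'client_secret': self.client_secret,
        'grant_type': 'refresh_token',
        'refresh_token': self.refresh_token,
    })
    response.raise_for_status()
    payload = response.json()
    # RFC 6749 allows the server to issue a new refresh token here.
    # If it does, the old one may be invalidated, so persist the new one.
    if payload.get('refresh_token'):
        self.store_refresh_token(payload['refresh_token'])  # hypothetical helper
    return payload['access_token']
```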
2. Incremental Fetching
Track the last sync timestamp to avoid re-fetching historical data:
```python
def get_last_sync_timestamp(self):
    """Read timestamp from last sync metadata file."""
    try:
        with open('data/last_sync.json', 'r') as f:
            return json.load(f)['timestamp']
    except FileNotFoundError:
        return 0  # First sync, fetch everything

def fetch_new_activities(self, after_timestamp):
    """Fetch only activities created after given timestamp."""
    return self.api_request('/activities', params={'after': after_timestamp})
```

Benefits:
- Fast syncs (seconds instead of minutes)
- Respects API rate limits
- Reduces bandwidth and storage growth
- Enables frequent cron jobs (every 6 hours)
Edge cases:
- First sync: `after=0` fetches complete history
- Missed syncs: gaps fill automatically, since the next run fetches everything after the last *successful* sync
- Retroactive edits: some APIs provide `updated_after` for fetching changes
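Many list endpoints also paginate, so one incremental window can span several requests. A sketch under the assumption that the API takes `page` and `per_page` parameters (Strava, for example, uses these names; check your service's docs):

```python
def fetch_new_activities(self, after_timestamp):
    """Page through everything created after the given timestamp."""
    activities, page = [], 1
    while True:
        batch = self.api_request('/activities', params={
            'after': after_timestamp,
            'page': page,        # assumed parameter name
            'per_page': 100,     # assumed parameter name
        })
        if not batch:            # empty page means we're done
            break
        activities.extend(batch)
        page += 1
    return activities
```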
3. Individual Files Per Entity
Store each activity/measurement/achievement as a separate JSON file:
```
data/
├── activities/
│   ├── 2026-02-05-morning-run-123456.json
│   ├── 2026-02-05-evening-ride-123457.json
│   └── 2026-02-06-lunch-walk-123458.json
├── measurements/
│   ├── 2026-02-05-weight-78945.json
│   └── 2026-02-06-weight-78946.json
└── summary.json
```
Why individual files?
- Multiple entries per day handled naturally
- Git-friendly diffs — one activity = one file changed
- Easy querying — glob patterns, direct file access
- Scales to thousands without monolithic files
- Append-only — syncs never modify existing files
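The querying point is concrete: because filenames embed the date, glob patterns answer most questions without any index. For example:

```python
from glob import glob

# Every run synced for February 2026, straight from the filenames
feb_runs = glob('data/activities/2026-02-*run*.json')
print(len(feb_runs))
```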
Naming convention:
```
{YYYY-MM-DD}-{entity-type}-{unique-id}.json
```
The date comes from the activity/measurement timestamp, not sync time. This keeps files organized chronologically even if synced late.
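A sketch of building that filename, with a small slug helper inline (the `name` field and slug rules here are illustrative, not part of any specific API):

```python
import re

def entity_filename(entity):
    """Build '{YYYY-MM-DD}-{slug}-{id}.json' from the entity's own timestamp."""
    date_str = entity['timestamp'].split('T')[0]  # activity time, not sync time
    slug = re.sub(r'[^a-z0-9]+', '-', entity['name'].lower()).strip('-')
    return f"{date_str}-{slug}-{entity['id']}.json"

# entity_filename({'timestamp': '2026-02-05T08:30:00Z', 'name': 'Morning Run', 'id': 123456})
# -> '2026-02-05-morning-run-123456.json'
```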
4. Structured JSON with Metadata
Each file contains:
```json
{
  "id": 123456,
  "timestamp": "2026-02-05T08:30:00Z",
  "type": "run",
  "raw": {
    // Complete API response preserved
  },
  "calculated": {
    // Derived metrics (pace, power, trends)
  },
  "synced_at": "2026-02-05T14:22:15Z"
}
```

Fields:

- `raw` — complete API response (never modify)
- `calculated` — derived metrics, trends, analysis
- `synced_at` — when this file was created
- Entity-specific fields extracted for convenience
Why preserve raw responses?
- Enables retroactive analysis without re-fetching
- API adds new fields → you already have them
- Schema changes don’t lose data
- Debugging and verification
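As a sketch of what retroactive analysis looks like in practice, this back-fills a derived value from `raw` across the whole archive without touching the API (the `average_heartrate` field is illustrative):

```python
import json
from glob import glob

for path in glob('data/activities/*.json'):
    with open(path) as f:
        doc = json.load(f)
    # The field was never extracted, but raw preserved it all along
    hr = doc['raw'].get('average_heartrate')
    if hr is not None:
        doc.setdefault('calculated', {})['avg_hr'] = hr
        with open(path, 'w') as f:
            json.dump(doc, f, indent=2)
```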
5. Summary and Index Files
Maintain aggregate statistics for quick queries:
```json
{
  "total_activities": 847,
  "total_distance_km": 3241.5,
  "date_range": {
    "first": "2020-03-15",
    "last": "2026-02-06"
  },
  "by_type": {
    "run": 523,
    "ride": 298,
    "walk": 26
  },
  "last_sync": "2026-02-06T14:22:15Z"
}
```

Updated on every sync to reflect current state, so totals never require scanning thousands of individual files.
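Because every entity file is preserved, the summary is effectively a disposable cache: it can be rebuilt from disk at any time. A minimal rebuild sketch (slow path, run rarely; normal syncs update it incrementally):

```python
import json
from glob import glob

def rebuild_summary():
    """Recompute summary.json by scanning every activity file."""
    by_type, dates = {}, []
    for path in glob('data/activities/*.json'):
        with open(path) as f:
            doc = json.load(f)
        kind = doc.get('type', 'unknown')
        by_type[kind] = by_type.get(kind, 0) + 1
        dates.append(doc['timestamp'][:10])
    summary = {
        'total_activities': len(dates),
        'by_type': by_type,
        'date_range': {'first': min(dates), 'last': max(dates)} if dates else None,
    }
    with open('data/summary.json', 'w') as f:
        json.dump(summary, f, indent=2)
```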
6. Calculated Metrics
Beyond raw data collection, compute domain-specific insights:
Strava example (training load):
- CTL (Chronic Training Load) — 42-day moving average
- ATL (Acute Training Load) — 7-day moving average
- TSB (Training Stress Balance) — fatigue vs fitness
- ACWR (Acute:Chronic Workload Ratio) — injury risk
Withings example (body composition):
- Weight trends (7-day, 30-day moving average)
- Body fat % change
- Muscle mass gain/loss
- Hydration patterns
RetroAchievements example (game progress):
- Completion percentage
- Mastery tracking
- Points per day
- Difficulty distribution
These calculations run during sync and are stored in the `calculated` field of each file or in separate derived-data files.
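As a sketch of the Strava-style load metrics, computed here as plain moving averages over a list of daily load values (most recent day last). This assumes at least 42 days of history; production implementations often use exponentially weighted averages instead:

```python
def training_metrics(daily_load):
    """CTL/ATL/TSB/ACWR from per-day training load values."""
    ctl = sum(daily_load[-42:]) / 42   # chronic load: 42-day average
    atl = sum(daily_load[-7:]) / 7     # acute load: 7-day average
    tsb = ctl - atl                    # balance: positive = fresh, negative = fatigued
    acwr = atl / ctl if ctl else 0.0   # acute:chronic ratio, often read against ~0.8-1.3
    return {'ctl': ctl, 'atl': atl, 'tsb': tsb, 'acwr': acwr}
```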
Implementation Template
Directory Structure
```
personal/service-name/
├── scripts/
│   ├── fetch_service.py        # Main sync script
│   └── calculate_metrics.py    # Derived metrics
├── data/
│   ├── activities/             # Or measurements/achievements/etc
│   ├── summary.json
│   └── last_sync.json
├── .forgejo/workflows/
│   └── sync.yml                # Automated sync via CI
├── requirements.txt
└── README.md
```
Sync Script Pattern
```python
#!/usr/bin/env python3
import json
import os
from datetime import datetime, timezone

import requests

# Service-specific endpoints (placeholders)
TOKEN_URL = 'https://api.example.com/oauth/token'
API_BASE = 'https://api.example.com/v1'


class ServiceSync:
    def __init__(self):
        # Load credentials from environment (Forgejo secrets)
        self.client_id = os.getenv('SERVICE_CLIENT_ID')
        self.client_secret = os.getenv('SERVICE_CLIENT_SECRET')
        self.refresh_token = os.getenv('SERVICE_REFRESH_TOKEN')
        self.access_token = None

    def refresh_access_token(self):
        """Get fresh access token from refresh token."""
        # OAuth refresh flow (see section 1)
        response = requests.post(TOKEN_URL, data={
            'client_id': self.client_id,
            'client_secret': self.client_secret,
            'grant_type': 'refresh_token',
            'refresh_token': self.refresh_token,
        })
        response.raise_for_status()
        return response.json()['access_token']

    def get_last_sync_timestamp(self):
        """Read last sync timestamp."""
        try:
            with open('data/last_sync.json') as f:
                return json.load(f)['timestamp']
        except FileNotFoundError:
            return 0

    def fetch_new_entities(self, after_timestamp):
        """Fetch entities created since last sync."""
        headers = {'Authorization': f'Bearer {self.access_token}'}
        response = requests.get(
            f'{API_BASE}/entities',
            headers=headers,
            params={'after': after_timestamp}
        )
        return response.json()

    def save_entity(self, entity):
        """Save entity to individual JSON file."""
        entity_id = entity['id']
        timestamp = entity['timestamp']
        date_str = timestamp.split('T')[0]
        filename = f"data/entities/{date_str}-{entity_id}.json"
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'w') as f:
            json.dump({
                'id': entity_id,
                'timestamp': timestamp,
                'raw': entity,
                'synced_at': datetime.now(timezone.utc).isoformat()
            }, f, indent=2)

    def update_summary(self, entities):
        """Update summary statistics (minimal example)."""
        # Calculate totals, ranges, breakdowns as needed per service
        try:
            with open('data/summary.json') as f:
                summary = json.load(f)
        except FileNotFoundError:
            summary = {'total_entities': 0}
        summary['total_entities'] += len(entities)
        summary['last_sync'] = datetime.now(timezone.utc).isoformat()
        with open('data/summary.json', 'w') as f:
            json.dump(summary, f, indent=2)

    def run(self):
        """Main sync logic."""
        print("Starting sync...")

        # 1. Refresh access token
        self.access_token = self.refresh_access_token()

        # 2. Get last sync timestamp
        after = self.get_last_sync_timestamp()

        # 3. Fetch new entities
        entities = self.fetch_new_entities(after)
        print(f"Found {len(entities)} new entities")

        # 4. Save each entity
        for entity in entities:
            self.save_entity(entity)

        # 5. Update summary
        self.update_summary(entities)

        # 6. Record sync timestamp (timezone-aware, so the epoch is correct
        #    regardless of the runner's local timezone)
        now = datetime.now(timezone.utc)
        with open('data/last_sync.json', 'w') as f:
            json.dump({
                'timestamp': int(now.timestamp()),
                'synced_at': now.isoformat()
            }, f, indent=2)

        print("Sync complete")


if __name__ == '__main__':
    sync = ServiceSync()
    sync.run()
```

Forgejo Actions Workflow
```yaml
name: Sync Service Data

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:       # Manual trigger

jobs:
  sync:
    runs-on: docker
    container:
      image: node:20-bookworm
    steps:
      - uses: actions/checkout@v4

      - name: Install Python
        run: |
          apt-get update
          apt-get install -y python3 python3-pip python3-venv git

      - name: Install dependencies
        run: |
          python3 -m venv .venv
          .venv/bin/pip install -r requirements.txt

      - name: Run sync
        env:
          SERVICE_CLIENT_ID: ${{ secrets.SERVICE_CLIENT_ID }}
          SERVICE_CLIENT_SECRET: ${{ secrets.SERVICE_CLIENT_SECRET }}
          SERVICE_REFRESH_TOKEN: ${{ secrets.SERVICE_REFRESH_TOKEN }}
        run: .venv/bin/python scripts/fetch_service.py

      - name: Commit and push
        run: |
          git config user.name "agent"
          git config user.email "agent@dungeon.church"
          git add data/
          git diff --staged --quiet || git commit -m "data: sync $(date -u +%Y-%m-%d)"
          git push
```

Rate Limiting Strategies
Respect API Limits
Most APIs have rate limits:
- Per-second: 5-10 requests/sec
- Per-hour: 200-1000 requests/hour
- Per-day: 2000-10000 requests/day
Pattern: Add delays between requests
```python
import time

for entity in entities:
    fetch_detail(entity)
    time.sleep(0.2)  # 200ms between requests = max 5 req/sec
```

Use Webhooks When Available
Some APIs offer webhooks for real-time notifications:
- Strava: activity created/updated
- GitHub: push, PR, issues
- Calendar: event changed
Pattern: Webhook triggers on-demand sync instead of polling
```yaml
on:
  repository_dispatch:
    types: [strava_activity_created]
```

Webhooks don't count against rate limits and provide immediate updates.
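On the receiving end, a minimal sketch of an endpoint that triggers the same sync script on demand. Flask is just for illustration (any HTTP handler works), and the `object_type` check mirrors the shape of Strava's event payload:

```python
import subprocess

from flask import Flask, request

app = Flask(__name__)

@app.route('/webhook/strava', methods=['POST'])
def strava_webhook():
    event = request.get_json(force=True)
    # Re-sync only for activity events; ignore everything else
    if event.get('object_type') == 'activity':
        subprocess.Popen(['python3', 'scripts/fetch_service.py'])
    return '', 200
```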
Exponential Backoff on Errors
```python
import time

import requests

def api_request_with_retry(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 429:  # Rate limited
                wait = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
                print(f"Rate limited, waiting {wait}s...")
                time.sleep(wait)
                continue
            return response.json()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError("Max retries exceeded")  # every attempt was rate limited
```

Real-World Examples
Strava (Activities → Training Metrics)
Repo: personal/strava
- Syncs runs, rides, walks every 6 hours
- Individual file per activity: `data/activities/2026-02-05-morning-run-123456.json`
- Calculates training load (CTL, ATL, TSB, ACWR)
- Powers automated coaching insights
Withings (Measurements → Body Composition Trends)
Repo: personal/withings
- Syncs weight, body fat %, muscle mass, hydration
- Individual file per measurement: `data/measurements/2026-02-05-weight-78945.json`
- Calculates 7-day and 30-day moving averages
- Tracks body composition changes over time
RetroAchievements (Game Progress → Completion Tracking)
Repo: personal/retroachievements
- Syncs achievements, game progress, mastery badges
- Individual file per achievement unlock
- Calculates completion percentage, points per day
- Tracks gaming activity and skill development
Last.fm (Scrobbles → Music Taste Analysis)
Repo: personal/scrobbles
- Syncs listening history (scrobbles)
- Enriches with genre tags via `artist.getTopTags`
- Individual file per scrobble
- Analyzes genre evolution, discovery patterns
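The enrichment call itself is a plain GET against the Last.fm API. A sketch, assuming the API key lives in the environment and keeping only the top five tags:

```python
import os

import requests

def top_tags(artist):
    """Fetch genre tags for an artist via artist.getTopTags."""
    response = requests.get('https://ws.audioscrobbler.com/2.0/', params={
        'method': 'artist.getTopTags',
        'artist': artist,
        'api_key': os.environ['LASTFM_API_KEY'],
        'format': 'json',
    })
    response.raise_for_status()
    return [t['name'] for t in response.json()['toptags']['tag'][:5]]
```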
Anti-Patterns
❌ Don’t:
- Store access tokens (they expire)
- Re-fetch entire history every sync (use incremental)
- Use monolithic JSON files (individual files scale better)
- Discard raw API responses (preserve for retroactive analysis)
- Sync more frequently than necessary (respect rate limits)
- Store credentials in code or config files (use Vaultwarden)
✅ Do:
- Refresh access tokens from refresh tokens at sync time
- Track last sync timestamp for incremental fetching
- One file per entity for git-friendly diffs
- Preserve complete API responses in the `raw` field
- Sync on a schedule that respects API limits (every 6 hours is common)
- Store credentials in Vaultwarden, load via CI secrets
Testing Strategy
Local Testing First
Before deploying to CI:
```bash
# 1. Set up credentials
export SERVICE_CLIENT_ID=$(rbw get "Service API" -f CLIENT_ID)
export SERVICE_CLIENT_SECRET=$(rbw get "Service API" -f CLIENT_SECRET)
export SERVICE_REFRESH_TOKEN=$(rbw get "Service API" -f REFRESH_TOKEN)

# 2. Test sync script
python3 scripts/fetch_service.py

# 3. Verify output
ls -lh data/entities/
cat data/summary.json | jq .
```

See Pre-CI Testing Discipline for the full testing checklist.
Test OAuth Flow
Verify token refresh works:
```python
def test_oauth_refresh():
    sync = ServiceSync()
    token = sync.refresh_access_token()
    assert token is not None
    print(f"Access token: {token[:20]}...")
```

Test Incremental Sync
Simulate multiple syncs:
```bash
# First sync (full history)
python3 scripts/fetch_service.py
FIRST_COUNT=$(ls data/entities/ | wc -l)

# Second sync (only new)
python3 scripts/fetch_service.py
SECOND_COUNT=$(ls data/entities/ | wc -l)

# Counts should match unless new entities appeared between runs
echo "First: $FIRST_COUNT, Second: $SECOND_COUNT"
```

Future Enhancements
- Conflict resolution for retroactive edits (track `updated_at`)
- Delta compression for large historical datasets
- Multi-service aggregation (combine Strava + Withings for health dashboard)
- Alerting on sync failures (webhook to Discord/Matrix)
- Data export to standard formats (GPX, TCX, CSV)
See Also
- Forgejo — CI/CD automation for scheduled syncs
- Credential Management — secure OAuth token storage
- Last.fm API — specific implementation example
- Training Metrics and Automated Coaching — Strava sync pattern
- Quantified-Self Health Analytics — health domain analysis powered by this sync architecture
- Agent Skills — includes API sync skills for various services
Footnotes

[^1]: This OAuth 2.0 refresh token flow follows RFC 6749 Section 6, the standard specification for refreshing access tokens. The `grant_type` parameter must be exactly `"refresh_token"` per the specification. The server responds with a new `access_token`, `token_type`, `expires_in`, and optionally a new `refresh_token`. This implementation pattern is confirmed against the OAuth 2.0 specification and verified as of 2026-02-15.