Data Sampling Issues | Blue Frog Docs

Data Sampling Issues

Diagnose and fix data sampling that affects accuracy and reliability in high-traffic analytics reports

What This Means

Data sampling occurs when analytics platforms process only a subset of your data instead of all available data, then extrapolate results. While this speeds up report generation, it can lead to inaccurate insights, especially for detailed segments or custom reports in high-traffic properties.

Impact on Your Business

Inaccurate Insights:

  • Reports based on estimates, not actual data
  • Segment analysis unreliable
  • Small segments may show incorrect trends
  • Confidence in data diminished

Decision-Making Problems:

  • Optimization based on sampled data
  • A/B test results potentially invalid
  • Budget allocation on incomplete information
  • Cannot trust detailed analysis

Analysis Limitations:

  • Cannot drill into specific segments
  • Custom reports show different results each time
  • Funnel analysis inaccurate
  • Time-series data inconsistent

How to Diagnose

Method 1: Check for Sampling Indicator in GA4

  1. Look for sampling badge:

    • Open any GA4 report
    • Top right corner shows sampling status
    • Green checkmark = unsampled
    • Yellow badge = sampled data
  2. Check sampling percentage:

    • Click sampling badge
    • Shows "Based on X% of sessions"
    • Lower percentage = less accurate
  3. Review data collection:

    • Reports → Acquisition → Traffic acquisition
    • Check daily session volume (Realtime only shows the last 30 minutes)
    • Note GA4's sampling thresholds

What to Look For:

  • Sampling badge appears in reports
  • Percentage of sessions sampled
  • Which reports trigger sampling
  • Time periods affected

Method 2: Compare Standard vs Custom Reports

  1. Check standard reports:

    • GA4 → Reports → Acquisition
    • Usually unsampled (up to limits)
    • Note the numbers
  2. Create custom exploration:

    • GA4 → Explore → Free form
    • Use same dimensions/metrics
    • Compare numbers to standard report
  3. Look for discrepancies:

    Standard report: 10,000 sessions
    Custom report: 9,500 sessions (based on 95% of data)
    Difference: 500 sessions (5% variance)
    

What to Look For:

  • Different numbers between reports
  • Sampling indicator on explorations
  • Variance percentage
  • Inconsistent results on refresh
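
The discrepancy check above is simple arithmetic; a quick sketch (the function name is ours, not a GA4 API):

```javascript
// Percentage variance between a standard report and a (possibly sampled)
// exploration built with the same dimensions and metrics.
function samplingVariance(standardValue, exploredValue) {
  return Math.abs(standardValue - exploredValue) / standardValue * 100;
}

// Using the numbers from the example: 10,000 vs 9,500 sessions
console.log(samplingVariance(10000, 9500)); // → 5 (% variance)
```

Variance above a few percent is a signal to prefer the standard report or an unsampled export for that analysis.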

Method 3: Check Property Session Volume

  1. Review daily session count:

    • GA4 → Reports → Acquisition → Traffic acquisition
    • Note daily session volume (Realtime covers only the last 30 minutes)
    • Compare to GA4's sampling thresholds
  2. GA4 sampling thresholds:

    Standard property (free):
    - Explorations sample above 10M events per query
    - Standard reports generally unsampled
    
    360 property (paid):
    - Explorations sample above 1B events per query
    - Unsampled exploration results available
    
  3. Calculate whether a date range exceeds the threshold:

    Daily sessions: 50,000
    Sessions in a 30-day range: 1,500,000
    Average events per session: 10
    Events covered by the query: 15,000,000 (exceeds the 10M threshold)
    Result: Explorations over this range will likely be sampled
    

What to Look For:

  • Sessions approaching or exceeding limits
  • Event counts in typical query ranges near the 10M threshold
  • Frequent sampling in explorations
  • Date range affecting sampling
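
The step-3 arithmetic can be wrapped in a small helper; a sketch assuming the 10M-events-per-query exploration threshold for standard properties (names are illustrative):

```javascript
// Estimate whether an exploration query over a given date range is likely
// to exceed GA4's sampling threshold (10M events per query on standard).
function estimateSampling({ dailySessions, eventsPerSession, days, threshold = 10000000 }) {
  const totalEvents = dailySessions * eventsPerSession * days;
  return {
    totalEvents,
    likelySampled: totalEvents > threshold,
    // Longest date range that stays under the threshold
    maxUnsampledDays: Math.floor(threshold / (dailySessions * eventsPerSession)),
  };
}

// The example from step 3: 50,000 sessions/day, 10 events/session, 30 days
console.log(estimateSampling({ dailySessions: 50000, eventsPerSession: 10, days: 30 }));
// → totalEvents 15,000,000; likelySampled true; maxUnsampledDays 20
```

Here a 20-day window is the longest range this property could explore unsampled, which feeds directly into the date-range fixes below.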

Method 4: Test Date Range Impact

  1. Create test exploration:

    • GA4 → Explore → Free form
    • Add dimension: Session source
    • Add metric: Sessions
  2. Try different date ranges:

    Last 7 days: 100% data (unsampled)
    Last 30 days: 50% data (sampled)
    Last 90 days: 25% data (sampled)
    
  3. Note sampling threshold:

    • Identify date range where sampling starts
    • Document session threshold
    • Plan analyses accordingly

What to Look For:

  • Date range where sampling kicks in
  • Session count triggering sampling
  • Variance with shorter ranges
  • Consistent sampling patterns

General Fixes

Fix 1: Reduce Date Range in Reports

Analyze shorter time periods:

  1. Use shorter date ranges:

    Instead of: Last 90 days (sampled)
    Use: Last 30 days (unsampled)
    Or: Weekly reports combined manually
    
  2. Break analysis into chunks:

    // Pseudo-code approach
    Week 1: Jan 1-7 (unsampled)
    Week 2: Jan 8-14 (unsampled)
    Week 3: Jan 15-21 (unsampled)
    Week 4: Jan 22-28 (unsampled)
    
    Combine results manually or via API
    
  3. Schedule regular exports:

    • Export weekly unsampled data
    • Combine in spreadsheet/database
    • Analyze complete dataset offline
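
Combining the weekly chunks is straightforward once each export is loaded; a sketch assuming each week is an array of `{ source, sessions }` rows (the field names are illustrative and should match however you shape your export):

```javascript
// Merge several weekly unsampled exports into one totals table keyed by source.
function combineWeeklyExports(weeks) {
  const totals = {};
  for (const rows of weeks) {
    for (const { source, sessions } of rows) {
      totals[source] = (totals[source] || 0) + sessions;
    }
  }
  return totals;
}

const week1 = [{ source: 'google', sessions: 1200 }, { source: 'direct', sessions: 400 }];
const week2 = [{ source: 'google', sessions: 1100 }, { source: 'direct', sessions: 450 }];
console.log(combineWeeklyExports([week1, week2]));
// → { google: 2300, direct: 850 }
```

Because each weekly query stayed under the threshold, the combined totals are built entirely from unsampled data.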

Fix 2: Use Standard Reports Instead of Explorations

Leverage pre-calculated reports:

  1. Standard reports are less sampled:

    • GA4 → Reports → Life cycle
    • Pre-aggregated data
    • Usually unsampled up to higher limits
  2. Customize standard reports:

    • Add secondary dimensions
    • Apply filters
    • Use comparison mode
    • Still less sampling than explorations
  3. When you must use explorations:

    • Simplify dimensions (fewer breakdowns)
    • Reduce segments
    • Limit filters
    • Decrease date range

Fix 3: Upgrade to GA4 360

Consider paid version for high traffic:

  1. GA4 360 benefits:

    Standard GA4: exploration sampling above 10M events per query
    GA4 360: exploration sampling above 1B events per query (100x higher)
    
    Standard: Sampling in large explorations
    GA4 360: Unsampled reports and exploration results
    
    Standard: Best-effort support
    GA4 360: Dedicated support with SLAs
    
  2. When to upgrade:

    - Exploration queries regularly exceeding 10M events
    - Seeing frequent sampling
    - Need accurate segment analysis
    - Critical business decisions depend on data
    - Multi-property roll-ups needed
    
  3. Cost vs benefit:

    GA4 360 pricing: typically $50,000 - $150,000+/year (volume-based)
    Consider if:
    - Annual revenue > $10M
    - Data accuracy critical
    - Large marketing budget
    - Enterprise analytics needs
    

Fix 4: Use BigQuery Export

Export raw, unsampled data:

  1. Enable BigQuery export:

    • GA4 → Admin → BigQuery Links
    • Link to BigQuery project
    • Choose daily or streaming export
    • All events exported (unsampled)
  2. Query unsampled data in BigQuery:

    -- Example: Get unsampled session data
    SELECT
      user_pseudo_id,
      event_name,
      event_timestamp,
      (SELECT value.string_value FROM UNNEST(event_params)
       WHERE key = 'session_id') AS session_id
    FROM
      `project.analytics_PROPERTY_ID.events_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
      AND event_name = 'page_view'
    
  3. Benefits of BigQuery:

    • 100% unsampled data
    • Raw event-level data
    • Custom analysis without limits
    • Join with other data sources
    • Machine learning possible
  4. BigQuery costs:

    Storage: ~$0.02/GB per month
    Queries: $5 per TB processed
    
    Example site (1M sessions/month):
    - Storage: ~$2-5/month
    - Queries: ~$10-50/month
    Total: $12-55/month (much less than 360)
    
  5. Setup instructions:

    • Create Google Cloud project
    • Enable BigQuery API
    • Link from GA4
    • Wait 24 hours for first export
    • Start querying
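
The cost arithmetic in step 4 can be sketched as a helper; the rates are the approximate list prices quoted above and may differ by region or change over time, so check current BigQuery pricing before budgeting:

```javascript
// Rough monthly BigQuery cost model (approximate list prices; verify current rates).
function estimateBigQueryCost({ gbStored, tbQueriedPerMonth }) {
  const storage = gbStored * 0.02;       // ~$0.02 per GB stored per month
  const queries = tbQueriedPerMonth * 5; // ~$5 per TB processed
  return { storage, queries, total: storage + queries };
}

// e.g. 150 GB of event tables, 4 TB scanned per month
console.log(estimateBigQueryCost({ gbStored: 150, tbQueriedPerMonth: 4 }));
// ≈ $3 storage + $20 queries = $23/month
```

Even generous assumptions land far below 360 pricing, which is why BigQuery export is usually the first upgrade to try.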

Fix 5: Reduce Event Volume

Optimize tracking to stay under limits:

  1. Audit unnecessary events:

    // Remove excessive scroll tracking
    // Bad - tracks every 10%
    window.addEventListener('scroll', function() {
      const scrollPercent = Math.round(window.scrollY / document.body.scrollHeight * 100);
      if (scrollPercent % 10 === 0) {
        gtag('event', 'scroll', { percent: scrollPercent });
      }
    });
    
    // Good - tracks each milestone once
    // (include viewport height so 100% is actually reachable)
    let tracked25 = false, tracked50 = false, tracked75 = false;
    window.addEventListener('scroll', function() {
      const scrollPercent = (window.scrollY + window.innerHeight) / document.body.scrollHeight * 100;
      if (scrollPercent >= 75 && !tracked75) {
        gtag('event', 'scroll', { percent: 75 });
        tracked75 = true;
      } else if (scrollPercent >= 50 && !tracked50) {
        gtag('event', 'scroll', { percent: 50 });
        tracked50 = true;
      } else if (scrollPercent >= 25 && !tracked25) {
        gtag('event', 'scroll', { percent: 25 });
        tracked25 = true;
      }
    });
    
  2. Consolidate similar events:

    // Bad - separate event for each product view
    gtag('event', 'view_product_1', {...});
    gtag('event', 'view_product_2', {...});
    
    // Good - single event with parameter
    gtag('event', 'view_item', {
      item_id: 'product_1'
    });
    
  3. Remove debug events in production:

    // Only fire debug events in development
    if (window.location.hostname === 'localhost') {
      gtag('event', 'debug_checkpoint', {...});
    }
    
  4. Sample client-side events:

    // Sample low-value events (track only 10%)
    if (Math.random() < 0.1) {
      gtag('event', 'low_value_interaction', {...});
    }
    

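If you sample client-side, remember that observed counts must be scaled back up at analysis time. A sketch (`gtag` is stubbed here so the snippet is self-contained; in production it is the page's gtag):

```javascript
// Stub for illustration only; in production this is the gtag loaded on the page.
const gtag = (...args) => {};

const SAMPLE_RATE = 0.1;

// Fire low-value events for ~10% of users, tagging the rate for later re-weighting.
function trackSampled(name, params = {}) {
  if (Math.random() < SAMPLE_RATE) {
    gtag('event', name, { ...params, sample_rate: SAMPLE_RATE });
  }
}

// At analysis time, divide observed counts by the sample rate to estimate true volume.
function reweight(observedCount, sampleRate) {
  return Math.round(observedCount / sampleRate);
}

console.log(reweight(120, 0.1)); // → 1200 estimated actual events
```

Sending the rate as an event parameter means anyone reading the data later can tell the counts are sampled and by how much.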
Fix 6: Use Data API for Unsampled Data

Extract data programmatically:

  1. GA4 Data API:

    // Node.js example using GA4 Data API
    const {BetaAnalyticsDataClient} = require('@google-analytics/data');
    
    const analyticsDataClient = new BetaAnalyticsDataClient();
    
    // Your numeric GA4 property ID
    const propertyId = 'YOUR_PROPERTY_ID';
    
    async function runReport() {
      const [response] = await analyticsDataClient.runReport({
        property: `properties/${propertyId}`,
        dateRanges: [
          {
            startDate: '30daysAgo',
            endDate: 'today',
          },
        ],
        dimensions: [
          { name: 'sessionSource' },
        ],
        metrics: [
          { name: 'sessions' },
        ],
      });
    
      // Process unsampled data
      console.log('Report result:');
      response.rows.forEach(row => {
        console.log(row.dimensionValues[0].value, row.metricValues[0].value);
      });
    }
    
    runReport();
    
  2. Benefits:

    • Unsampled for most reports
    • Automated data extraction
    • Custom dashboards with fresh data
    • Integration with other systems
  3. API limitations:

    - 10,000 rows per request by default (paginate with limit/offset)
    - Complex explorations may still sample
    - Rate limits apply
    - Requires coding knowledge
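
Given the row limit, large pulls need paging; a sketch using the `limit` and `offset` fields of `runReport` (the helper name is ours; client setup as in the example above):

```javascript
// Fetch all rows for a Data API report by paging with limit/offset.
// `client` is a BetaAnalyticsDataClient; `request` is a runReport request body.
async function runReportAllRows(client, request, pageSize = 10000) {
  const rows = [];
  let offset = 0;
  for (;;) {
    const [response] = await client.runReport({ ...request, limit: pageSize, offset });
    const page = response.rows || [];
    rows.push(...page);
    if (page.length < pageSize) break; // short page = last page
    offset += pageSize;
  }
  return rows;
}
```

Keep an eye on the API's rate limits when pulling many pages in a loop.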
    

Fix 7: Create Filtered Views with Subproperties

Divide traffic across properties:

  1. Create multiple GA4 properties:

    Main property: All traffic
    Property A: North America traffic only
    Property B: Europe traffic only
    Property C: Mobile app only
    
  2. Conditional tracking:

    // Send to different properties based on criteria
    // (userLocation stands in for your own geo-detection logic)
    let propertyId;
    
    if (userLocation === 'north_america') {
      propertyId = 'G-AAAAAAAAAA';
    } else if (userLocation === 'europe') {
      propertyId = 'G-BBBBBBBBBB';
    } else {
      propertyId = 'G-CCCCCCCCCC';
    }
    
    gtag('config', propertyId);
    
  3. Benefits:

    • Each property has lower volume
    • Reduced sampling
    • Focused analysis per region/segment
  4. Drawbacks:

    • More complex setup
    • Cannot easily compare across properties
    • More properties to manage
    • Consider carefully before implementing

Platform-Specific Guides

Detailed implementation instructions for your specific platform:

  • Shopify: Shopify Sampling Issues Guide
  • WordPress: WordPress Sampling Issues Guide
  • Wix: Wix Sampling Issues Guide
  • Squarespace: Squarespace Sampling Issues Guide
  • Webflow: Webflow Sampling Issues Guide

Verification

After implementing fixes:

  1. Check sampling status:

    • Create test exploration
    • Check for sampling badge
    • Note percentage improvement
    • Document which fixes worked
  2. Verify BigQuery export:

    • Check BigQuery console
    • Verify daily tables created
    • Run test query
    • Confirm event counts match
  3. Monitor event volume:

    • GA4 → Admin → DebugView
    • Check events per session
    • Calculate monthly projection
    • Verify exploration queries stay under the 10M-event threshold
  4. Compare report accuracy:

    • Run same report multiple times
    • Results should be consistent
    • No sampling indicator
    • Confidence in data restored

Common Mistakes

  1. Not checking for sampling indicator - Unaware data is sampled
  2. Using long date ranges unnecessarily - Triggers sampling
  3. Over-tracking low-value events - Exceeds event limits
  4. Not considering BigQuery - Free/low-cost unsampled access
  5. Complex explorations on large datasets - Always sampled
  6. Not using API for large exports - Missing unsampled option
  7. Ignoring event volume optimization - Wasteful tracking
  8. Not understanding GA4 limits - Surprised by sampling
  9. Assuming all reports unsampled - Explorations more likely sampled
  10. Not documenting sampling patterns - Cannot optimize

Troubleshooting Checklist

  • Sampling indicator checked in reports
  • Monthly event volume calculated
  • Exploration queries under 10M events (or 360 considered)
  • Unnecessary events removed
  • Date ranges optimized
  • Standard reports used when possible
  • BigQuery export enabled
  • Data API utilized for automation
  • Event sampling implemented for low-value events
  • Regular data exports scheduled
  • Sampling patterns documented
  • Team trained on sampling awareness

Sampling Severity Levels

No Sampling: 100% of data

  • Ideal state
  • Full accuracy
  • Complete confidence in analysis

Light Sampling: 90-99% of data

  • Minimal impact
  • Generally acceptable
  • Small variance in results

Moderate Sampling: 50-89% of data

  • Noticeable impact
  • Use caution with decisions
  • Consider alternative approaches

Heavy Sampling: < 50% of data

  • Significant accuracy concerns
  • Do not trust for critical decisions
  • Implement fixes immediately
  • Consider BigQuery or 360
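
The bands above map directly onto a small classifier (the helper is illustrative):

```javascript
// Classify a report's sampling level using the severity bands above.
function samplingSeverity(percentOfData) {
  if (percentOfData >= 100) return 'none';     // full accuracy
  if (percentOfData >= 90) return 'light';     // generally acceptable
  if (percentOfData >= 50) return 'moderate';  // use caution
  return 'heavy';                              // do not trust for critical decisions
}

console.log(samplingSeverity(95)); // → 'light'
console.log(samplingSeverity(40)); // → 'heavy'
```

Logging the severity alongside each scheduled export makes sampling patterns easy to document and track over time.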
