# Firecrawl Change Tracking Issues

This document outlines known issues when using Firecrawl's Change Tracking feature, particularly during development and testing.

## Issue 1: Backend Comparison Failures

### Summary

When using the Firecrawl `scrape_url` method with the Change Tracking feature enabled, the service may occasionally fail to perform the comparison on the backend, resulting in a missing `changeTracking` attribute in the response object, even when the scrape itself succeeds.

### Evidence

The issue manifests as a warning message within the `ScrapeResponse` object:

```
warning='Comparing failed, please try again later.'
```

This occurs despite correctly configuring the request parameters as per the Firecrawl documentation:

- `formats=["markdown", "changeTracking"]`
- `changeTrackingOptions={"modes": ["git-diff"]}` (optional, but also tested)

### Tested URLs

This behavior was observed intermittently when scraping various URLs, including:

1. `https://api.github.com/zen` (simple text endpoint)
2. `http://www.whattimeisit.com` (HTML clock website)

### Conclusion (Issue 1)

The presence of the specific warning message `Comparing failed, please try again later.` strongly indicates a backend processing issue within the Firecrawl Change Tracking beta feature, rather than an error in the client-side code implementation.

## Issue 2: Extremely Long Cache TTL on Free/Hobby Tiers

### Summary

Firecrawl's API (observed on Free and Hobby tiers) employs a very long internal cache Time-To-Live (TTL) for scraped content, often exceeding 17 hours for a given origin URL. This prevents the Change Tracking feature from detecting real-time changes when polling the same URL frequently, as cached results are served instead of a fresh scrape being performed.
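Both issues surface directly in response fields (a warning string for Issue 1, an unchanged `previousScrapeAt` for Issue 2), so client code can guard for the Issue 1 failure mode before trusting `changeTracking`. A minimal sketch, assuming a response object with optional `warning` and `change_tracking` attributes; the helper name is hypothetical, not part of the Firecrawl SDK:

```python
from types import SimpleNamespace

def get_change_tracking(response):
    """Return the change-tracking payload, or None when the backend
    comparison failed (hypothetical helper; adapt to your SDK's field names)."""
    warning = getattr(response, "warning", None)
    if warning and "Comparing failed" in warning:
        return None  # backend comparison failed; caller should retry later
    return getattr(response, "change_tracking", None)

# Simulated responses illustrating both outcomes
ok = SimpleNamespace(warning=None, change_tracking={"changeStatus": "same"})
failed = SimpleNamespace(warning="Comparing failed, please try again later.",
                         change_tracking=None)

print(get_change_tracking(ok))      # {'changeStatus': 'same'}
print(get_change_tracking(failed))  # None
```

Treating `changeTracking` as optional in this way keeps the scrape result usable even when the backend comparison fails.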
### Evidence

Testing conducted using a script (`test_ttl.py`) that polled `https://api.github.com/zen` every 5 minutes via `fetch(..., mode="GitDiff")` revealed the following:

- Firecrawl consistently returned `changeStatus: "same"` and did not update the `previousScrapeAt` timestamp for approximately **17 hours**.
- Standard cache-busting techniques (appending unique timestamp-based paths or query strings like `?ts=`) were **ineffective** in forcing a fresh scrape during this window; Firecrawl appeared to cache based on the origin URL.
- Only after the ~17-hour TTL expired did Firecrawl perform a new scrape and report `changeStatus: "changed"`.

### Impact on Testing

This long, opaque cache TTL makes it impractical to test application logic that relies on *detecting frequent changes* via Firecrawl during development cycles. Simulating periodic checks (e.g., every 10 minutes) will consistently hit the cache and report "no change," even if the underlying source content *has* changed.

### Conclusion (Issue 2)

The extended cache TTL on lower tiers is a significant limitation for testing real-time or near-real-time change detection workflows built on Firecrawl.

## Recommendations

If encountering issues with Firecrawl Change Tracking:

1. **Check for Backend Errors:** Inspect the response for the `warning='Comparing failed...'` message. If present, report it to Firecrawl support with URL and parameter details.
2. **Verify Long Cache TTL:** If change tracking consistently reports `"same"` status despite known content changes on a dynamic endpoint:
   * Run a test similar to `test_ttl.py` to measure the effective cache window for your target URL.
   * Be aware that the TTL might be many hours (17+ observed) on Free/Hobby tiers.
3. **Inform Firecrawl Support:** If the long TTL is problematic, provide feedback to Firecrawl detailing the observed TTL and its impact on your use case (e.g., testing periodic checks).
4. **Use Local Testing Workarounds:** For rapid development and testing of your application's *diff handling and notification logic* (independent of Firecrawl's API behavior), bypass Firecrawl and perform fetches/diffs locally:
   * Use libraries like `aiohttp` or `requests` to fetch content directly.
   * Use Python's `difflib` to generate diffs between snapshots.
   * Feed these locally generated diffs into your notification system.
5. **Consider Paid Tiers:** If immediate re-scrapes are essential, investigate whether Firecrawl's higher-tier plans offer cache-bypassing options or significantly shorter TTLs.

## Code Context (`test_ttl.py`)

Example script structure used to measure the cache TTL:

```python
# test_ttl.py
import asyncio
import time
from datetime import datetime

# Real usage would import the project's adapter and models, e.g.:
# from ai_lab_tracker.firecrawl_adapter import fetch
# from ai_lab_tracker.models import ChangeTracking

start_time = time.time()  # Used by the mock below to simulate a change

# Mock fetch for demonstration - replace with the actual adapter's fetch
class MockDiff:
    text = "diff"

class MockCT:
    change_status = "same"
    previous_scrape_at = datetime.now()
    diff = None

class MockResult:
    change_tracking = MockCT()

async def fetch(url: str, mode: str) -> MockResult:
    print(f"Mock fetch: {url}, {mode}")
    # Simulate a detected change 10 minutes after start, for example purposes
    if time.time() > start_time + 600:
        MockCT.change_status = "changed"
        MockCT.previous_scrape_at = datetime.now()
        MockCT.diff = MockDiff()
    return MockResult()

async def check_ttl(url: str, interval: int = 60, max_checks: int = 60):
    """
    Hit Firecrawl every `interval` seconds up to `max_checks` times,
    and report when it actually re-scrapes (i.e. changeStatus != 'same').
    """
    # First scrape -> baseline
    r1 = await fetch(url, mode="GitDiff")
    ct1 = r1.change_tracking  # ChangeTracking in the real adapter
    t0 = ct1.previous_scrape_at or datetime.utcnow()
    print(f"[Baseline] previousScrapeAt = {t0.isoformat()}, changeStatus = {ct1.change_status!r}")

    # Subsequent scrapes
    for i in range(1, max_checks + 1):
        print(f"Sleeping {interval}s (check #{i}) …")
        await asyncio.sleep(interval)  # non-blocking sleep in async code
        r2 = await fetch(url, mode="GitDiff")
        ct2 = r2.change_tracking
        t1 = ct2.previous_scrape_at or t0
        status = ct2.change_status
        print(f"[Check #{i}] previousScrapeAt = {t1.isoformat()}, changeStatus = {status!r}")
        if status != "same":
            delta = (t1 - t0).total_seconds()
            print(f"\n🚀 Firecrawl finally re-scraped after ~{delta:.0f}s!")
            return
    print(f"\n⚠️ No re-scrape detected after {interval * max_checks}s (~{(interval * max_checks) / 3600:.1f}h).")

if __name__ == "__main__":
    # Replace with your public test URL (e.g. an ngrok tunnel to dynamic_server.py)
    TEST_URL = "https://api.github.com/zen"
    asyncio.run(check_ttl(TEST_URL, interval=300, max_checks=12))
```