# Firecrawl Change Tracking Issues

This document outlines known issues when using Firecrawl's Change Tracking feature, particularly during development and testing.

## Issue 1: Backend Comparison Failures

### Summary

When using the Firecrawl `scrape_url` method with the Change Tracking feature enabled, the service may occasionally fail to perform the comparison on the backend, resulting in a missing `changeTracking` attribute in the response object, even when the scrape itself succeeds.

### Evidence

The issue manifests as a warning message within the `ScrapeResponse` object:

```
warning='Comparing failed, please try again later.'
```

This occurs despite correctly configuring the request parameters as per the Firecrawl documentation:

- `formats=["markdown", "changeTracking"]`
- `changeTrackingOptions={"modes": ["git-diff"]}` (optional, but also tested)

### Tested URLs

This behavior was observed intermittently when scraping various URLs, including:

1. `https://api.github.com/zen` (simple text endpoint)
2. `http://www.whattimeisit.com` (HTML clock website)

### Conclusion (Issue 1)

The presence of the specific warning message `Comparing failed, please try again later.` strongly indicates a backend processing issue within the Firecrawl Change Tracking beta feature, rather than an error in the client-side code implementation.

## Issue 2: Extremely Long Cache TTL on Free/Hobby Tiers

### Summary

Firecrawl's API (observed on Free and Hobby tiers) employs a very long internal cache Time-To-Live (TTL) for scraped content, often exceeding 17 hours for a given origin URL. This prevents the Change Tracking feature from detecting real-time changes when polling the same URL frequently, as cached results are served instead of a fresh scrape being performed.
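Both issues surface directly in response fields (a warning string for Issue 1, an unchanged `previousScrapeAt` for Issue 2), so client code can guard for the Issue 1 failure mode before trusting `changeTracking`. A minimal sketch, assuming a response object with optional `warning` and `change_tracking` attributes; the helper name is hypothetical, not part of the Firecrawl SDK:

```python
from types import SimpleNamespace

def get_change_tracking(response):
    """Return the change-tracking payload, or None when the backend
    comparison failed (hypothetical helper; adapt to your SDK's field names)."""
    warning = getattr(response, "warning", None)
    if warning and "Comparing failed" in warning:
        return None  # backend comparison failed; caller should retry later
    return getattr(response, "change_tracking", None)

# Simulated responses illustrating both outcomes
ok = SimpleNamespace(warning=None, change_tracking={"changeStatus": "same"})
failed = SimpleNamespace(warning="Comparing failed, please try again later.",
                         change_tracking=None)

print(get_change_tracking(ok))      # {'changeStatus': 'same'}
print(get_change_tracking(failed))  # None
```

Treating `changeTracking` as optional in this way keeps the scrape result usable even when the backend comparison fails.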
### Evidence

Testing conducted using a script (`test_ttl.py`) that polled `https://api.github.com/zen` every 5 minutes via `fetch(..., mode="GitDiff")` revealed the following:

- Firecrawl consistently returned `changeStatus: "same"` and did not update the `previousScrapeAt` timestamp for approximately **17 hours**.
- Standard cache-busting techniques (appending unique timestamp-based paths or query strings like `?ts=`) were **ineffective** in forcing a fresh scrape during this window; Firecrawl appeared to cache based on the origin URL.
- Only after the ~17-hour TTL expired did Firecrawl perform a new scrape and report `changeStatus: "changed"`.

### Impact on Testing

This long, opaque cache TTL makes it impractical to test application logic that relies on *detecting frequent changes* via Firecrawl during development cycles. Simulating periodic checks (e.g., every 10 minutes) will consistently hit the cache and report "no change," even if the underlying source content *has* changed.

### Conclusion (Issue 2)

The extended cache TTL on lower tiers is a significant limitation for testing real-time or near-real-time change detection workflows built on Firecrawl.

## Recommendations

If encountering issues with Firecrawl Change Tracking:

1. **Check for Backend Errors:** Inspect the response for the `warning='Comparing failed...'` message. If present, report it to Firecrawl support with URL and parameter details.
2. **Verify Long Cache TTL:** If change tracking consistently reports `"same"` status despite known content changes on a dynamic endpoint:
   * Run a test similar to `test_ttl.py` to measure the effective cache window for your target URL.
   * Be aware that the TTL might be many hours (17+ observed) on Free/Hobby tiers.
3. **Inform Firecrawl Support:** If the long TTL is problematic, provide feedback to Firecrawl detailing the observed TTL and its impact on your use case (e.g., testing periodic checks).
4. **Use Local Testing Workarounds:** For rapid development and testing of your application's *diff handling and notification logic* (independent of Firecrawl's API behavior), bypass Firecrawl and perform fetches/diffs locally:
   * Use libraries like `aiohttp` or `requests` to fetch content directly.
   * Use Python's `difflib` to generate diffs between snapshots.
   * Feed these locally generated diffs into your notification system.
5. **Consider Paid Tiers:** If immediate re-scrapes are essential, investigate whether Firecrawl's higher-tier plans offer cache-bypassing options or significantly shorter TTLs.

## Code Context (`test_ttl.py`)

Example script structure used to measure the cache TTL:

```python
# test_ttl.py
import asyncio
import time
from datetime import datetime

# Real usage would import the project's adapter and models, e.g.:
# from ai_lab_tracker.firecrawl_adapter import fetch
# from ai_lab_tracker.models import ChangeTracking

start_time = time.time()  # Used by the mock below to simulate a change

# Mock fetch for demonstration - replace with the actual adapter's fetch
class MockDiff:
    text = "diff"

class MockCT:
    change_status = "same"
    previous_scrape_at = datetime.now()
    diff = None

class MockResult:
    change_tracking = MockCT()

async def fetch(url: str, mode: str) -> MockResult:
    print(f"Mock fetch: {url}, {mode}")
    # Simulate a detected change 10 minutes after start, for example purposes
    if time.time() > start_time + 600:
        MockCT.change_status = "changed"
        MockCT.previous_scrape_at = datetime.now()
        MockCT.diff = MockDiff()
    return MockResult()

async def check_ttl(url: str, interval: int = 60, max_checks: int = 60):
    """
    Hit Firecrawl every `interval` seconds up to `max_checks` times,
    and report when it actually re-scrapes (i.e. changeStatus != 'same').
    """
    # First scrape -> baseline
    r1 = await fetch(url, mode="GitDiff")
    ct1 = r1.change_tracking  # ChangeTracking in the real adapter
    t0 = ct1.previous_scrape_at or datetime.utcnow()
    print(f"[Baseline] previousScrapeAt = {t0.isoformat()}, changeStatus = {ct1.change_status!r}")

    # Subsequent scrapes
    for i in range(1, max_checks + 1):
        print(f"Sleeping {interval}s (check #{i}) …")
        await asyncio.sleep(interval)  # non-blocking sleep in async code
        r2 = await fetch(url, mode="GitDiff")
        ct2 = r2.change_tracking
        t1 = ct2.previous_scrape_at or t0
        status = ct2.change_status
        print(f"[Check #{i}] previousScrapeAt = {t1.isoformat()}, changeStatus = {status!r}")
        if status != "same":
            delta = (t1 - t0).total_seconds()
            print(f"\n🚀 Firecrawl finally re-scraped after ~{delta:.0f}s!")
            return
    print(f"\n⚠️ No re-scrape detected after {interval * max_checks}s (~{(interval * max_checks) / 3600:.1f}h).")

if __name__ == "__main__":
    # Replace with your public test URL (e.g. an ngrok tunnel to dynamic_server.py)
    TEST_URL = "https://api.github.com/zen"
    asyncio.run(check_ttl(TEST_URL, interval=300, max_checks=12))
```