Open Methodology

How we detect signals before they become trends.

Most intelligence tools summarize what happened. We built a pipeline that detects what's about to happen — by reading the data nobody has time to read.

25-85K
Documents / week
16
Signal detectors
12
Live sources
5 yrs
Historical depth
The Pipeline

9 steps. From noise to signal.

Every week, we ingest up to 85,000 documents from 12 public sources across 6 layers of the tech ecosystem. Most of it is noise. Our pipeline compresses it down to 5 actionable opportunities with evidence, scores, and playbooks.

01
Ingest
Pull from 12 sources across 6 layers — code, academic, community, news, product reviews, and patents. Normalize into a unified document format. Deduplicate by content hash and unique ID.
25-85K docs/week · 12 APIs · 6 layers
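In miniature, the dedup pass looks like this (a simplified sketch; the field names are illustrative, not our production schema):

```python
import hashlib

def content_hash(doc: dict) -> str:
    """Hash whitespace-normalized text so re-crawled copies collapse together."""
    normalized = " ".join(doc["text"].lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each unique ID and each content hash."""
    seen_ids: set[str] = set()
    seen_hashes: set[str] = set()
    unique = []
    for doc in docs:
        h = content_hash(doc)
        if doc["id"] in seen_ids or h in seen_hashes:
            continue  # duplicate by ID or by identical content
        seen_ids.add(doc["id"])
        seen_hashes.add(h)
        unique.append(doc)
    return unique
```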
02
Cluster
Embed documents into 384-dimensional vectors, then cluster with HDBSCAN. Topics emerge organically — no predefined categories. Stratified sampling handles the volume: 10K documents per weekly run.
BERTopic · MiniLM-L6 · HDBSCAN
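A minimal version of this step, built on the same open-source stack (the cluster parameters here are illustrative choices, not our tuned values):

```python
from sentence_transformers import SentenceTransformer
import hdbscan

# all-MiniLM-L6-v2 is the MiniLM-L6 model behind the 384-dimensional vectors.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_topics(texts: list[str]) -> list[int]:
    """Embed documents, then let HDBSCAN discover topics with no fixed k."""
    embeddings = encoder.encode(texts)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    labels = clusterer.fit_predict(embeddings)
    return labels.tolist()  # -1 marks noise documents outside any topic
```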
03
Detect bursts
Run the Kleinberg burst detection algorithm on each topic's time series. Identifies statistically abnormal activity spikes — not just growth, but acceleration.
Kleinberg · Z-score fallback
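The fallback path fits in a few lines (the 3-sigma threshold is an illustrative choice; the primary path runs Kleinberg's two-state automaton instead):

```python
import statistics

def zscore_burst(weekly_counts: list[int], threshold: float = 3.0) -> bool:
    """Z-score fallback: flag the latest week only if it sits far above the
    historical mean AND is still accelerating past the previous week."""
    if len(weekly_counts) < 5:
        return False  # not enough history to call anything abnormal
    history, current = weekly_counts[:-1], weekly_counts[-1]
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history) or 1.0  # guard zero variance
    return (current - mu) / sigma >= threshold and current > history[-1]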
04
Run 16 signal detectors
Each document passes through 16 specialized detectors across 4 categories: supply-shock signals, gap signals, tracking signals, and meta-validation signals.
5 supply-shock · 2 gap · 5 tracking · 2 meta
05
Extract capabilities & constraints
An LLM reads the top 50 bursting topics (3 docs each, 150 documents total) and extracts: what does this technology make possible? What are the current blockers? Every claim is verified against source quotes.
Gemini Flash · Evidence spans · Quote verification
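The verification step is deliberately unforgiving. Conceptually (a simplified sketch; field names are illustrative):

```python
def claim_is_grounded(claim: dict, source_text: str) -> bool:
    """Accept an LLM-extracted claim only if its evidence quote appears
    verbatim in the source document (after whitespace normalization)."""
    quote = " ".join(claim["evidence_quote"].split())
    source = " ".join(source_text.split())
    return quote in source
```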
06
Match opportunities
Cross-multiply capabilities × constraints. An LLM evaluates each pair for pain level, existing workarounds, timing, and competitive landscape.
Cartesian matching · Pain scoring
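The matching loop itself is plain Cartesian iteration; the judgment lives in the LLM rubric (the prompt wording and field names below are illustrative):

```python
from itertools import product

RUBRIC = (
    "Rate this capability/constraint pair on pain level, existing "
    "workarounds, timing, and competitive landscape (1-5 each)."
)

def candidate_pairs(capabilities: list[dict], constraints: list[dict]):
    """Cross-multiply every capability with every constraint and
    emit one LLM evaluation task per pair."""
    for cap, con in product(capabilities, constraints):
        yield {
            "capability": cap["summary"],
            "constraint": con["summary"],
            "prompt": f"{RUBRIC}\nCapability: {cap['summary']}\n"
                      f"Constraint: {con['summary']}",
        }
```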
07
Score
30+ raw signals are compressed into 8 composite features. A log-space sigmoid formula produces a 0-100 score. High momentum + high readiness + high pain = high score. Competition and mainstream coverage drag it down.
8 features · Log-space sigmoid · 0-100
08
Track lifecycle
Each topic is classified into a lifecycle stage: Embryonic, Emerging, Accelerating, Peaking, Mainstream, Declining. Cross-source confirmation checks how many independent layers validate the signal.
6-state FSM · 5 confirmation layers
09
Generate report
Top 5 opportunities (score >= 40) are enriched with evidence packs and synthesized into actionable playbooks. You receive a weekly intelligence report with concrete next moves.
Top 5 · Evidence packs · Playbooks
Step 1 — Ingestion

12 sources. 6 layers. One pipeline.

Each source captures a different layer of the tech ecosystem. Signals are strongest when they appear across multiple, independent layers.

Code Layer
GitHub Archive · Live
Every public event on GitHub — repo creations, pushes, forks, stars. Filtered by 56 tech keywords across 4 event types.
5K-20K
Documents / week
"This is where builders ship before they announce. A sudden fork spike on an obscure repo is a signal nobody else sees."
Academic Layer
arXiv · Live
Pre-prints across 4 CS categories: AI, ML, Computation & Language, Information Retrieval. Where breakthroughs appear 6-12 months before products.
700-3.5K / week
Papers
"The Attention Is All You Need paper was on arXiv 2 years before GPT-2. The signals are there — buried in math."
OpenAlex · Live
Academic graph covering 250M+ works. We track 5 CS concepts with citation velocity, institutional signals, and cross-reference patterns.
7-35K / week
Works
"When 3 unrelated labs start citing the same paper in the same month — something is converging."
Community Layer
Hacker News · Live
Dual strategy: walking the 500 most recent items + keyword search via Algolia. Not the articles — the comments. That's where early adopters debate.
500-1K / week
Items (deduplicated)
"When a Show HN gets 3 comments saying 'I'm building around this' — that's a signal the headline writer missed."
Stack Exchange · Live
Questions tagged across 14 tech domains. When developers start asking "how to" questions about something new, adoption is beginning.
2K-5K
Questions / week
"A spike in unanswered questions about a new framework means people are trying it. Answers come later. We catch the questions."
Reddit · Live
~15 tech subreddits. Higher volume and more diverse opinions than HN. Captures sentiment from practitioners, not just thought leaders.
3.5-14K / week
Posts
"Reddit catches the 'I switched from X to Y and here's why' posts before any analyst writes a market report."
IndieHackers · Live
Bootstrapper community. Low volume but high signal-to-noise on what solo founders are building and where they see demand.
70-350 / week
Posts
"When 5 indie hackers independently start building the same tool — that's product-market pull, not hype."
News Layer
GDELT · Live
Global news event database. 12 keyword queries, 3-month rolling window. Captures when signals cross into mainstream media.
~3K / week
Articles
"We use GDELT as a penalty signal. The more mainstream coverage, the less alpha. If TechCrunch wrote about it, you're late."
Wikipedia · Live
Pageview tracking on curated tech articles. A lagging indicator — useful to measure when a concept enters public consciousness.
Pageviews
Metrics (not docs)
"When a Wikipedia page goes from 200 views/day to 2,000 — the mainstream moment is here. By then, you should be positioned."
Product Layer
G2 Reviews · Live
SaaS and tech product reviews. Captures real user pain points, feature requests, and switching patterns at scale.
700-3.5K / week
Reviews
"When G2 reviews start mentioning a competitor that doesn't exist yet — someone is about to build it."
Patent Layer
PatentsView (US) · Live
US patents across 3 CPC codes: computing, machine learning, and data transmission. Shows where large companies invest R&D before announcements.
350-1.4K / week
Patents
"A cluster of patents from the same company in an unexpected domain signals a pivot — months before the press release."
EPO OPS (EU) · Live
European patent filings across the same classification codes. Cross-referencing US and EU filings reveals global R&D coordination patterns.
350-1.4K / week
Patents
"Same invention filed in both USPTO and EPO = serious commercial intent, not just IP defense."
The Funnel

85,000 documents in. 5 opportunities out.

Each layer filters more aggressively. By the time something reaches your inbox, it has survived statistical, semantic, and LLM-based scrutiny.

25-85K
Raw documents ingested per week from 12 sources
↓ dedup + clustering
50-200
Topics identified via embedding + HDBSCAN
↓ burst detection
10-30
Topics in active burst — statistically abnormal activity
↓ LLM extraction + matching
20-100
Candidate opportunities from capability × constraint matching
↓ composite scoring >= 40
5
Actionable opportunities with evidence + playbooks
Step 4 — Detection

16 detectors. 4 categories.

Every document is scanned by 16 specialized detectors. Each looks for a specific type of weak signal that precedes a technology shift.

Supply-Shock Detectors
Something just got dramatically cheaper, better, or more accessible.
💰 Cost Collapse
Detects when the cost of a technology drops past a critical threshold. Looks for ratio-based patterns: 10x cheaper, 90% cost reduction, etc.
Trigger: "GPU inference cost dropped 94% in 8 months" → ratio >= 5x detected
📊 Benchmark Jump
Tracks 30+ known benchmarks. Fires when a new model or system achieves >= 20% improvement. Detects SOTA-breaking events.
Trigger: "New architecture beats transformers on 3 benchmarks with 40% less compute"
🔓 Open Source Event
Release of a major open-source project, license change, or sudden contribution spike. Measures community traction via stars, forks, contributor growth.
Trigger: Repo goes from 50 to 5,000 stars in 2 weeks → community signal
⚠️ Deprecation
Patterns like "sunset", "end-of-life", "migration required". Deprecations create forced adoption windows — everyone has to move somewhere.
Trigger: "Service X will be sunset December 2026" → migration timeline extracted
🔧 Hardware Unlock
New hardware availability, sudden cost drops, or performance jumps that enable previously impossible use cases.
Trigger: "M4 chip enables local 70B model inference" → capability unlock
🔎 Gap Detectors
Something that should exist doesn't yet — the space between research and products.
📑 Implementation Gap
Compares papers (arXiv) to code (GitHub). A paper with high citation velocity but no open-source implementation is an opportunity gap waiting to be filled.
Trigger: Paper with 140K+ citations, zero open repos → gap detected
🧩 Ecosystem Gap
Maps 5 reference ecosystems and tracks fill ratio. When >= 40% of expected tools exist, the missing ones are the opportunity.
Trigger: "AI agent ecosystem: auth ✓, payments ✓, monitoring ✗, testing ✗" → 40% gap
📡 Tracking Detectors
Language shifts, concept convergence, and abstraction patterns that reveal where things are heading.
🆕 Neologisms
Extracts new terms and bigrams that didn't exist 4 weeks ago. Tracks growth rate and source spread. When a new word appears across 3+ sources, it names something real.
Trigger: "vibe coding" — first seen week 3, 47 mentions by week 7, 4 sources
🔀 Cross-Domain Analogy
Detects "X for Y" patterns where concepts from one domain are being applied to another. "Kubernetes for ML" or "Stripe for AI agents".
Trigger: "Stripe for AI agents" seen in 4 repos, 2 HN threads, 1 paper
📖 Citation Velocity
Tracks citations per 30-day window per document, aggregated by topic. Acceleration in citations means the research community is converging.
Trigger: Paper receiving 50+ citations/month, up from 5 → 10x acceleration
🔗 Convergence
Detects when previously separate technologies start appearing together. Rising co-occurrence score signals a fusion event.
Trigger: "RAG" + "agents" co-occurrence up 300% in 4 weeks
📐 Abstraction Layers
Patterns like "wrapper", "simplifies", "high-level API". When abstraction layers appear, a technology is maturing from expert-only to mainstream-ready.
Trigger: 12 new "LangChain alternative" repos in 3 weeks → abstraction wave
Meta Detectors
Signals that validate other signals — confidence layers and lifecycle tracking.
🔄 Lifecycle Stage
Classifies each topic into one of 6 stages: Embryonic, Emerging, Accelerating, Peaking, Mainstream, Declining. Rule-based state machine updated weekly.
Example: MCP Protocol — Embryonic (week 38) → Emerging (week 42) → Accelerating (week 48)
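A compressed version of the weekly update rule (the growth and coverage thresholds are illustrative assumptions, not the tuned production values):

```python
STAGES = ["Embryonic", "Emerging", "Accelerating",
          "Peaking", "Mainstream", "Declining"]

def next_stage(stage: str, growth: float, mainstream_share: float) -> str:
    """One weekly FSM step. growth = week-over-week document growth;
    mainstream_share = fraction of mentions coming from news sources."""
    i = STAGES.index(stage)
    if stage == "Declining":
        return stage
    if mainstream_share >= 0.5:
        return "Declining" if growth < 0 else "Mainstream"
    if growth >= 0.5 and i < STAGES.index("Accelerating"):
        return STAGES[i + 1]  # strong growth promotes one stage
    if growth < 0 and stage == "Accelerating":
        return "Peaking"      # momentum stalls at the top
    if growth < 0 and stage in ("Peaking", "Mainstream"):
        return "Declining"
    return stage
```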
🛡️ Cross-Source Confirmation
Checks 5 independent layers: papers, code, community, news, products. A signal confirmed by 4/5 layers is far more reliable than one from a single source.
Example: Signal at 5/5 layers = high confidence. Signal at 1/5 = noise.
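The check itself is simple counting over the five layers listed above (layer labels are illustrative):

```python
CONFIRMATION_LAYERS = {"papers", "code", "community", "news", "products"}

def confirmation_ratio(topic_docs: list[dict]) -> float:
    """Fraction of the 5 independent layers in which a topic appears.
    5/5 reads as high confidence; 1/5 reads as noise."""
    seen = {d["layer"] for d in topic_docs} & CONFIRMATION_LAYERS
    return len(seen) / len(CONFIRMATION_LAYERS)
```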
Step 7 — Scoring

From 30+ signals to one score.

Every opportunity gets a composite score from 0 to 100. The formula rewards momentum, readiness, and pain — and penalizes competition and mainstream coverage.

Tech Momentum +0.17
Tech Readiness +0.15
Pain Scale +0.14
Confirmation +0.13
Workaround +0.10
Infra Barrier ⊖ -0.07
Competition ⊖ -0.05
Mainstream ⊖ -0.05
The score is computed in log-space to handle signals across different magnitudes.
Positive features (momentum, readiness, pain, confirmation) push the score up.
Penalty features (competition, mainstream coverage, infrastructure barriers) pull it down.
Final normalization via sigmoid(3.0 × (log_score + 0.30)), scaled to map the result to 0-100.
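Put together, the scoring step looks roughly like this. The weights and sigmoid constants are the ones above; the feature names and the assumption that features arrive normalized to the (0, 1] range are ours for the sketch:

```python
import math

WEIGHTS = {
    "tech_momentum": 0.17, "tech_readiness": 0.15, "pain_scale": 0.14,
    "confirmation": 0.13, "workaround": 0.10,
    "infra_barrier": -0.07, "competition": -0.05, "mainstream": -0.05,
}

def composite_score(features: dict[str, float]) -> float:
    """Weighted sum of log-features, squashed by
    sigmoid(3.0 * (log_score + 0.30)) and scaled to 0-100."""
    log_score = sum(
        weight * math.log(max(features[name], 1e-6))  # guard log(0)
        for name, weight in WEIGHTS.items()
    )
    return 100.0 / (1.0 + math.exp(-3.0 * (log_score + 0.30)))
```

With this shape, a topic that maxes the positive features while penalty features stay near zero lands in the mid-90s, and one drowning in mainstream coverage drops below 20 — consistent with the example scores below.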
95
Strong signal
High burst + multiple sources + real pain + no mainstream coverage yet. This is where the alpha lives.
52
Watch
Emerging activity but low cross-source confirmation. Could be noise, could be early. Needs another week of data.
18
Mainstream / noise
Already covered by TechCrunch. High competition. The wave happened — you're reading about it, not riding it.
Step 9 — Output

What you receive every week.

Up to 85,000 documents compressed into one intelligence report. Every claim is backed by verifiable evidence from public sources.

Active — February 2026
5
Top opportunities
Ranked by composite score. Each one includes the capability it enables, the constraint it addresses, and why now.
Evidence
Verified quotes
Every claim is backed by exact quotes from source documents — with links to the original GitHub repos, HN threads, papers.
Score
0-100 composite
8 features, transparent weighting. You see the breakdown: what's driving the score up, what's pulling it down.
Playbooks
Actionable next steps
Day 0-1: what to research. Day 1-2: what to prototype. Day 2-3: what to ship. Concrete, not theoretical.
Historical Depth

5 years of history. Pattern-matched.

We don't just look at this week. We've backfilled 5 years of data across all 12 sources to calibrate our detectors against known technology waves — so we know what early signals actually looked like.

12-44M
Documents processed
5 years × 12 sources, deduplicated and signal-extracted
260
Weeks analyzed
Each week replayed through the full 9-step pipeline
6
Layers cross-referenced
Code, academic, community, news, product, patents

The method works.
The data is live.

The next signal is already in this week's data. The question is whether you'll see it before everyone else.

Get Early Access →