Open Methodology

How we detect signals before they become trends.

Most intelligence tools summarize what happened. We built a pipeline that detects what's about to happen — by reading the data nobody has time to read.

25-85K
Documents / week
16
Signal detectors
12
Live sources
5 yrs
Historical depth
The Pipeline

9 steps. From noise to signal.

Every week, we ingest up to 85,000 documents from 12 public sources across 6 layers of the tech ecosystem. Most of it is noise. Our pipeline compresses it down to 5 actionable opportunities with evidence, scores, and playbooks.

01
Ingest
Pull from 12 sources across 6 layers — code, academic, community, news, product reviews, and patents. Normalize into a unified document format. Deduplicate by content hash and unique ID.
25-85K docs/week · 12 APIs · 6 layers
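In miniature, the dedup pass looks like this (a simplified sketch; the field names are illustrative, not our production schema):

```python
import hashlib

def content_hash(doc: dict) -> str:
    """Hash whitespace-normalized text so re-crawled copies collapse together."""
    normalized = " ".join(doc["text"].lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each unique ID and each content hash."""
    seen_ids: set[str] = set()
    seen_hashes: set[str] = set()
    unique = []
    for doc in docs:
        h = content_hash(doc)
        if doc["id"] in seen_ids or h in seen_hashes:
            continue  # duplicate by ID or by identical content
        seen_ids.add(doc["id"])
        seen_hashes.add(h)
        unique.append(doc)
    return unique
```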
02
Cluster
Embed documents into 384-dimensional vectors, then cluster with HDBSCAN. Topics emerge organically — no predefined categories. Stratified sampling handles the volume: 10K documents per weekly run.
BERTopic · MiniLM-L6 · HDBSCAN
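A minimal version of this step, built on the same open-source stack (the cluster parameters here are illustrative choices, not our tuned values):

```python
from sentence_transformers import SentenceTransformer
import hdbscan

# all-MiniLM-L6-v2 is the MiniLM-L6 model behind the 384-dimensional vectors.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_topics(texts: list[str]) -> list[int]:
    """Embed documents, then let HDBSCAN discover topics with no fixed k."""
    embeddings = encoder.encode(texts)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    labels = clusterer.fit_predict(embeddings)
    return labels.tolist()  # -1 marks noise documents outside any topic
```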
03
Detect bursts
Run the Kleinberg burst detection algorithm on each topic's time series. Identifies statistically abnormal activity spikes — not just growth, but acceleration.
Kleinberg · Z-score fallback
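The fallback path fits in a few lines (the 3-sigma threshold is an illustrative choice; the primary path runs Kleinberg's two-state automaton instead):

```python
import statistics

def zscore_burst(weekly_counts: list[int], threshold: float = 3.0) -> bool:
    """Z-score fallback: flag the latest week only if it sits far above the
    historical mean AND is still accelerating past the previous week."""
    if len(weekly_counts) < 5:
        return False  # not enough history to call anything abnormal
    history, current = weekly_counts[:-1], weekly_counts[-1]
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history) or 1.0  # guard zero variance
    return (current - mu) / sigma >= threshold and current > history[-1]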
04
Run 16 signal detectors
Each document passes through 16 specialized detectors across 4 categories: supply-shock signals, gap signals, tracking signals, and meta-validation signals.
5 supply-shock · 2 gap · 5 tracking · 2 meta
05
Extract capabilities & constraints
An LLM reads the top 50 bursting topics (3 docs each, 150 documents total) and extracts: what does this technology make possible? What are the current blockers? Every claim is verified against source quotes.
Gemini Flash · Evidence spans · Quote verification
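The verification step is deliberately unforgiving. Conceptually (a simplified sketch; field names are illustrative):

```python
def claim_is_grounded(claim: dict, source_text: str) -> bool:
    """Accept an LLM-extracted claim only if its evidence quote appears
    verbatim in the source document (after whitespace normalization)."""
    quote = " ".join(claim["evidence_quote"].split())
    source = " ".join(source_text.split())
    return quote in source
```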
06
Match opportunities
Cross-multiply capabilities × constraints. An LLM evaluates each pair for pain level, existing workarounds, timing, and competitive landscape.
Cartesian matching · Pain scoring
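The matching loop itself is plain Cartesian iteration; the judgment lives in the LLM rubric (the prompt wording and field names below are illustrative):

```python
from itertools import product

RUBRIC = (
    "Rate this capability/constraint pair on pain level, existing "
    "workarounds, timing, and competitive landscape (1-5 each)."
)

def candidate_pairs(capabilities: list[dict], constraints: list[dict]):
    """Cross-multiply every capability with every constraint and
    emit one LLM evaluation task per pair."""
    for cap, con in product(capabilities, constraints):
        yield {
            "capability": cap["summary"],
            "constraint": con["summary"],
            "prompt": f"{RUBRIC}\nCapability: {cap['summary']}\n"
                      f"Constraint: {con['summary']}",
        }
```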
07
Score
30+ raw signals are compressed into 8 composite features. A log-space sigmoid formula produces a 0-100 score. High momentum + high readiness + high pain = high score. Competition and mainstream coverage drag it down.
8 features · Log-space sigmoid · 0-100
08
Track lifecycle
Each topic is classified into a lifecycle stage: Embryonic, Emerging, Accelerating, Peaking, Mainstream, Declining. Cross-source confirmation checks how many independent layers validate the signal.
6-state FSM · 5 confirmation layers
09
Generate report
Top 5 opportunities (score >= 40) are enriched with evidence packs and synthesized into actionable playbooks. You receive a weekly intelligence report with concrete next moves.
Top 5 · Evidence packs · Playbooks
Step 1 — Ingestion

12 sources. 6 layers. One pipeline.

Each source captures a different layer of the tech ecosystem. Signals are strongest when they appear across multiple, independent layers.

Code Layer
GitHub Archive · Live
Every public event on GitHub — repo creations, pushes, forks, stars. Filtered by 56 tech keywords across 4 event types.
5K-20K
Documents / week
"This is where builders ship before they announce. A sudden fork spike on an obscure repo is a signal nobody else sees."
Academic Layer
arXiv · Live
Pre-prints across 4 CS categories: AI, ML, Computation & Language, Information Retrieval. Where breakthroughs appear 6-12 months before products.
700-3.5K / week
Papers
"The Attention Is All You Need paper was on arXiv 2 years before GPT-2. The signals are there — buried in math."
OpenAlex · Live
Academic graph covering 250M+ works. We track 5 CS concepts with citation velocity, institutional signals, and cross-reference patterns.
7-35K / week
Works
"When 3 unrelated labs start citing the same paper in the same month — something is converging."
Community Layer
Hacker News · Live
Dual strategy: walking the 500 most recent items + keyword search via Algolia. Not the articles — the comments. That's where early adopters debate.
500-1K / week
Items (deduplicated)
"When a Show HN gets 3 comments saying 'I'm building around this' — that's a signal the headline writer missed."
Stack Exchange · Live
Questions tagged across 14 tech domains. When developers start asking "how to" questions about something new, adoption is beginning.
2K-5K
Questions / week
"A spike in unanswered questions about a new framework means people are trying it. Answers come later. We catch the questions."
Reddit · Live
~15 tech subreddits. Higher volume and more diverse opinions than HN. Captures sentiment from practitioners, not just thought leaders.
3.5-14K / week
Posts
"Reddit catches the 'I switched from X to Y and here's why' posts before any analyst writes a market report."
IndieHackers · Live
Bootstrapper community. Low volume but high signal-to-noise on what solo founders are building and where they see demand.
70-350 / week
Posts
"When 5 indie hackers independently start building the same tool — that's product-market pull, not hype."
News Layer
GDELT · Live
Global news event database. 12 keyword queries, 3-month rolling window. Captures when signals cross into mainstream media.
~3K / week
Articles
"We use GDELT as a penalty signal. The more mainstream coverage, the less alpha. If TechCrunch wrote about it, you're late."
Wikipedia · Live
Pageview tracking on curated tech articles. A lagging indicator — useful to measure when a concept enters public consciousness.
Pageviews
Metrics (not docs)
"When a Wikipedia page goes from 200 views/day to 2,000 — the mainstream moment is here. By then, you should be positioned."
Product Layer
G2 Reviews · Live
SaaS and tech product reviews. Captures real user pain points, feature requests, and switching patterns at scale.
700-3.5K / week
Reviews
"When G2 reviews start mentioning a competitor that doesn't exist yet — someone is about to build it."
Patent Layer
PatentsView (US) · Live
US patents across 3 CPC codes: computing, machine learning, and data transmission. Shows where large companies invest R&D before announcements.
350-1.4K / week
Patents
"A cluster of patents from the same company in an unexpected domain signals a pivot — months before the press release."
EPO OPS (EU) · Live
European patent filings across the same classification codes. Cross-referencing US and EU filings reveals global R&D coordination patterns.
350-1.4K / week
Patents
"Same invention filed in both USPTO and EPO = serious commercial intent, not just IP defense."
The Funnel

85,000 documents in. 5 opportunities out.

Each layer filters more aggressively. By the time something reaches your inbox, it has survived statistical, semantic, and LLM-based scrutiny.

25-85K
Raw documents ingested per week from 12 sources
↓ dedup + clustering
50-200
Topics identified via embedding + HDBSCAN
↓ burst detection
10-30
Topics in active burst — statistically abnormal activity
↓ LLM extraction + matching
20-100
Candidate opportunities from capability × constraint matching
↓ composite scoring >= 40
5
Actionable opportunities with evidence + playbooks
Step 4 — Detection

16 detectors. 4 categories.

Every document is scanned by 16 specialized detectors. Each looks for a specific type of weak signal that precedes a technology shift.

Supply-Shock Detectors
Something just got dramatically cheaper, better, or more accessible.
💰 Cost Collapse
Detects when the cost of a technology drops past a critical threshold. Looks for ratio-based patterns: 10x cheaper, 90% cost reduction, etc.
Trigger: "GPU inference cost dropped 94% in 8 months" → ratio >= 5x detected
📊 Benchmark Jump
Tracks 30+ known benchmarks. Fires when a new model or system achieves >= 20% improvement. Detects SOTA-breaking events.
Trigger: "New architecture beats transformers on 3 benchmarks with 40% less compute"
🔓 Open Source Event
Release of a major open-source project, license change, or sudden contribution spike. Measures community traction via stars, forks, contributor growth.
Trigger: Repo goes from 50 to 5,000 stars in 2 weeks → community signal
⚠️ Deprecation
Patterns like "sunset", "end-of-life", "migration required". Deprecations create forced adoption windows — everyone has to move somewhere.
Trigger: "Service X will be sunset December 2026" → migration timeline extracted
🔧 Hardware Unlock
New hardware availability, sudden cost drops, or performance jumps that enable previously impossible use cases.
Trigger: "M4 chip enables local 70B model inference" → capability unlock
🔎 Gap Detectors
Something that should exist doesn't yet — the space between research and products.
📑 Implementation Gap
Compares papers (arXiv) to code (GitHub). A paper with high citation velocity but no open-source implementation is an opportunity gap waiting to be filled.
Trigger: Paper with 140K+ citations, zero open repos → gap detected
🧩 Ecosystem Gap
Maps 5 reference ecosystems and tracks fill ratio. When >= 40% of expected tools exist, the missing ones are the opportunity.
Trigger: "AI agent ecosystem: auth ✓, payments ✓, monitoring ✗, testing ✗" → 40% gap
📡 Tracking Detectors
Language shifts, concept convergence, and abstraction patterns that reveal where things are heading.
🆕 Neologisms
Extracts new terms and bigrams that didn't exist 4 weeks ago. Tracks growth rate and source spread. When a new word appears across 3+ sources, it names something real.
Trigger: "vibe coding" — first seen week 3, 47 mentions by week 7, 4 sources
🔀 Cross-Domain Analogy
Detects "X for Y" patterns where concepts from one domain are being applied to another. "Kubernetes for ML" or "Stripe for AI agents".
Trigger: "Stripe for AI agents" seen in 4 repos, 2 HN threads, 1 paper
📖 Citation Velocity
Tracks citations per 30-day window per document, aggregated by topic. Acceleration in citations means the research community is converging.
Trigger: Paper receiving 50+ citations/month, up from 5 → 10x acceleration
🔗 Convergence
Detects when previously separate technologies start appearing together. Rising co-occurrence score signals a fusion event.
Trigger: "RAG" + "agents" co-occurrence up 300% in 4 weeks
📐 Abstraction Layers
Patterns like "wrapper", "simplifies", "high-level API". When abstraction layers appear, a technology is maturing from expert-only to mainstream-ready.
Trigger: 12 new "LangChain alternative" repos in 3 weeks → abstraction wave
Meta Detectors
Signals that validate other signals — confidence layers and lifecycle tracking.
🔄 Lifecycle Stage
Classifies each topic into one of 6 stages: Embryonic, Emerging, Accelerating, Peaking, Mainstream, Declining. Rule-based state machine updated weekly.
Example: MCP Protocol — Embryonic (week 38) → Emerging (week 42) → Accelerating (week 48)
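A compressed version of the weekly update rule (the growth and coverage thresholds are illustrative assumptions, not the tuned production values):

```python
STAGES = ["Embryonic", "Emerging", "Accelerating",
          "Peaking", "Mainstream", "Declining"]

def next_stage(stage: str, growth: float, mainstream_share: float) -> str:
    """One weekly FSM step. growth = week-over-week document growth;
    mainstream_share = fraction of mentions coming from news sources."""
    i = STAGES.index(stage)
    if stage == "Declining":
        return stage
    if mainstream_share >= 0.5:
        return "Declining" if growth < 0 else "Mainstream"
    if growth >= 0.5 and i < STAGES.index("Accelerating"):
        return STAGES[i + 1]  # strong growth promotes one stage
    if growth < 0 and stage == "Accelerating":
        return "Peaking"      # momentum stalls at the top
    if growth < 0 and stage in ("Peaking", "Mainstream"):
        return "Declining"
    return stage
```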
🛡️ Cross-Source Confirmation
Checks 5 independent layers: papers, code, community, news, products. A signal confirmed by 4/5 layers is far more reliable than one from a single source.
Example: Signal at 5/5 layers = high confidence. Signal at 1/5 = noise.
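The check itself is simple counting over the five layers listed above (layer labels are illustrative):

```python
CONFIRMATION_LAYERS = {"papers", "code", "community", "news", "products"}

def confirmation_ratio(topic_docs: list[dict]) -> float:
    """Fraction of the 5 independent layers in which a topic appears.
    5/5 reads as high confidence; 1/5 reads as noise."""
    seen = {d["layer"] for d in topic_docs} & CONFIRMATION_LAYERS
    return len(seen) / len(CONFIRMATION_LAYERS)
```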
Step 7 — Scoring

From 30+ signals to one score.

Every opportunity gets a composite score from 0 to 100. The formula rewards momentum, readiness, and pain — and penalizes competition and mainstream coverage.

Tech Momentum +0.17
Tech Readiness +0.15
Pain Scale +0.14
Confirmation +0.13
Workaround +0.10
Infra Barrier ⊖ -0.07
Competition ⊖ -0.05
Mainstream ⊖ -0.05
The score is computed in log-space to handle signals across different magnitudes.
Positive features (momentum, readiness, pain, confirmation) push the score up.
Penalty features (competition, mainstream coverage, infrastructure barriers) pull it down.
Final normalization via sigmoid(3.0 × (log_score + 0.30)), scaled to map the result to 0-100.
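Put together, the scoring step looks roughly like this. The weights and sigmoid constants are the ones above; the feature names and the assumption that features arrive normalized to the (0, 1] range are ours for the sketch:

```python
import math

WEIGHTS = {
    "tech_momentum": 0.17, "tech_readiness": 0.15, "pain_scale": 0.14,
    "confirmation": 0.13, "workaround": 0.10,
    "infra_barrier": -0.07, "competition": -0.05, "mainstream": -0.05,
}

def composite_score(features: dict[str, float]) -> float:
    """Weighted sum of log-features, squashed by
    sigmoid(3.0 * (log_score + 0.30)) and scaled to 0-100."""
    log_score = sum(
        weight * math.log(max(features[name], 1e-6))  # guard log(0)
        for name, weight in WEIGHTS.items()
    )
    return 100.0 / (1.0 + math.exp(-3.0 * (log_score + 0.30)))
```

With this shape, a topic that maxes the positive features while penalty features stay near zero lands in the mid-90s, and one drowning in mainstream coverage drops below 20 — consistent with the example scores below.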
95
Strong signal
High burst + multiple sources + real pain + no mainstream coverage yet. This is where the alpha lives.
52
Watch
Emerging activity but low cross-source confirmation. Could be noise, could be early. Needs another week of data.
18
Mainstream / noise
Already covered by TechCrunch. High competition. The wave happened — you're reading about it, not riding it.
Step 9 — Output

What you receive every week.

Up to 85,000 documents compressed into one intelligence report. Every claim is backed by verifiable evidence from public sources.

Active — February 2026
5
Top opportunities
Ranked by composite score. Each one includes the capability it enables, the constraint it addresses, and why now.
Evidence
Verified quotes
Every claim is backed by exact quotes from source documents — with links to the original GitHub repos, HN threads, papers.
Score
0-100 composite
8 features, transparent weighting. You see the breakdown: what's driving the score up, what's pulling it down.
Playbooks
Actionable next steps
Day 0-1: what to research. Day 1-2: what to prototype. Day 2-3: what to ship. Concrete, not theoretical.
Historical Depth

5 years of history. Pattern-matched.

We don't just look at this week. We've backfilled 5 years of data across all 12 sources to calibrate our detectors against known technology waves — so we know what early signals actually looked like.

12-44M
Documents processed
5 years × 12 sources, deduplicated and signal-extracted
260
Weeks analyzed
Each week replayed through the full 9-step pipeline
6
Layers cross-referenced
Code, academic, community, news, product, patents

The method works.
The data is live.

The next signal is already in this week's data. The question is whether you'll see it before everyone else.

Get Early Access →