
Feature: AI Extraction Skill — Unstructured Documents → NeqSim JSON Process Model

Summary

Create a Copilot skill (and companion agent) that reads unstructured information from various document types (text descriptions, PFDs, data sheets, operational reports, Excel tables, images/sketches) and converts it into the canonical NeqSim JSON format that ProcessSystem.fromJson() already accepts. This closes the gap between “I have a pile of engineering documents” and “I have a running NeqSim simulation.”


Motivation

NeqSim already has a powerful JSON process builder (ProcessSystem.fromJson() / ProcessSystem.fromJsonAndRun()) that can declaratively build and run complete process simulations from structured JSON. An evaluation notebook (examples/notebooks/json_process_builder_evaluation.ipynb, planned) confirms this works well for:

The missing piece is the “first mile”: converting messy, real-world engineering information into that clean JSON. Today, engineers must translate PFDs, design basis documents, and operating data by hand into either Python code or JSON. This is tedious, error-prone, and a major barrier to adoption.

What Engineers Actually Have

| Source Type | Example | Information Content |
|---|---|---|
| Text descriptions | “The well stream at 65 bara and 80°C enters a 3-phase separator, gas goes to a compressor at 120 bara, liquids to a letdown valve at 15 bara” | Topology, conditions, equipment types |
| Process Flow Diagrams (PFDs) | PNG/PDF images of process sketches | Equipment layout, stream connectivity, tag numbers |
| Data sheets | Equipment data sheets (PDF/Excel) | Design pressures, temperatures, materials, sizes |
| Operating reports | Daily/monthly production reports | Flow rates, compositions, operating points |
| Excel spreadsheets | Heat & mass balance tables, well test data | Compositions, conditions, multi-stream data |
| Design basis documents | FEED/concept study reports | Fluid compositions, design envelopes, constraints |
| P&IDs | Piping & Instrumentation Diagrams | Valve types, instrument tags, control loops |

The Gap

Today:   Documents → (manual reading & coding) → NeqSim API calls → Results
Target:  Documents → (AI extraction skill) → JSON → ProcessSystem.fromJson() → Results

Proposed Architecture

┌─────────────────────────────────────────────┐
│              Input Sources                   │
│  Text / PDF / Images / Excel / Tables / OCR  │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│     Skill: neqsim-process-extraction        │
│  (Copilot skill with LLM-guided parsing)    │
│                                             │
│  1. Source classification & chunking         │
│  2. Equipment identification & typing        │
│  3. Stream topology extraction               │
│  4. Operating condition extraction           │
│  5. Fluid composition extraction             │
│  6. Missing data detection & flagging        │
│  7. Assumption tracking                      │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│     Canonical Intermediate Schema            │
│  (streams, units, connections, fluids)       │
│                                             │
│  - Equipment type mapping (60+ synonyms)     │
│  - Unit normalization (barg→bara, °C→K)      │
│  - Composition validation (sum to 1.0)       │
│  - Orphan stream detection                   │
│  - Template matching for known topologies    │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│     NeqSim JSON Builder Format               │
│  ProcessSystem.fromJson() / fromJsonAndRun() │
│                                             │
│  ✓ Already exists and works                  │
│  ✓ 40+ equipment types                      │
│  ✓ Dot-notation stream wiring               │
│  ✓ Structured error responses               │
│  ✓ Session management for iterations        │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│     Validation + Simulation + Report         │
│  SimulationResult with error codes & fixes   │
└─────────────────────────────────────────────┘
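
The normalization steps listed in the intermediate-schema box (barg→bara, °C→K, compositions summing to 1.0) could be sketched roughly as follows. The helper names, the 1.01325 bar atmospheric offset, and the 5% tolerance are assumptions made for illustration; none of them come from NeqSim.

```python
# Sketch only: helper names and the tolerance are hypothetical, not NeqSim API.
ATM_BAR = 1.01325  # standard atmosphere in bar, used for gauge -> absolute


def barg_to_bara(p_barg):
    """Convert gauge pressure (barg) to absolute pressure (bara)."""
    return p_barg + ATM_BAR


def celsius_to_kelvin(t_c):
    """Convert degrees Celsius to kelvin."""
    return t_c + 273.15


def normalize_composition(components, tolerance=0.05):
    """Return (normalized fractions, warnings) for a {name: fraction} dict.

    Raises ValueError when the sum is further than `tolerance` from 1.0,
    because that more likely signals a missing component than rounding.
    """
    total = sum(components.values())
    if total <= 0 or abs(total - 1.0) > tolerance:
        raise ValueError(f"Composition sums to {total:.4f}; refusing to normalize")
    warnings = []
    if abs(total - 1.0) > 1e-9:
        warnings.append(f"Composition sums to {total:.2f}; normalized to 1.0")
    normalized = {name: x / total for name, x in components.items()}
    return normalized, warnings
```

Returning the warnings alongside the data, rather than logging them, lets them flow into the `extraction.warnings` list of the intermediate schema.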

Key Design Principle

The LLM extracts structured data into a constrained JSON schema. It does NOT write NeqSim code.

This is fundamentally more reliable than code-generation because:

  1. The JSON schema is finite and well-defined
  2. Validation is deterministic (rule-based, not LLM-based)
  3. ProcessSystem.fromJson() handles all API calls correctly
  4. Errors are structured and actionable
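
To illustrate points 1 and 2, a rule-based check over the extracted unit list might look like the sketch below. The allowed-type set is only the subset of NeqSim types named in this document, and the error codes are hypothetical; the real builder's error format may differ.

```python
# Sketch only: ALLOWED_TYPES is a subset taken from this document,
# and the error codes are invented for illustration.
ALLOWED_TYPES = {
    "Stream", "Separator", "ThreePhaseSeparator", "Compressor", "Cooler",
    "Heater", "HeatExchanger", "ThrottlingValve", "Pump", "Mixer",
    "Splitter", "Expander",
}


def validate_units(process):
    """Return a list of structured errors for a list of unit dicts."""
    errors = []
    seen = set()
    for index, unit in enumerate(process):
        name = unit.get("name")
        unit_type = unit.get("type")
        if not name:
            errors.append({"code": "MISSING_NAME", "index": index})
        elif name in seen:
            errors.append({"code": "DUPLICATE_NAME", "index": index, "name": name})
        else:
            seen.add(name)
        if unit_type not in ALLOWED_TYPES:
            errors.append({"code": "UNKNOWN_TYPE", "index": index, "type": unit_type})
    return errors
```

Because the check is a pure function of the JSON, the same input always yields the same errors, which is what makes a retry loop around the LLM tractable.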

Deliverables

1. Copilot Skill: neqsim-process-extraction

Location: .github/skills/neqsim-process-extraction/SKILL.md

The skill should contain:

2. Copilot Agent: extract process to neqsim json

Location: .github/agents/extract.process.agent.md

An agent that:

  1. Accepts unstructured input (pasted text, file references, image descriptions)
  2. Loads the neqsim-process-extraction skill
  3. Extracts equipment, topology, conditions, and compositions
  4. Produces the NeqSim JSON builder format
  5. Runs the simulation via ProcessSystem.fromJsonAndRun()
  6. Reports results with confidence score, assumptions used, and missing information flagged

3. Canonical Intermediate Schema

A documented JSON schema for the intermediate representation between raw extraction and NeqSim JSON:

{
  "source": {
    "type": "text|pfd|datasheet|excel|image",
    "description": "Brief description of input source",
    "raw_text": "Original text if applicable"
  },
  "extraction": {
    "confidence": 0.75,
    "assumptions": [
      "Default SRK EOS (not specified in source)",
      "Flow rate assumed 50000 kg/hr (not specified)"
    ],
    "missing_information": [
      "Feed composition not provided",
      "Compressor efficiency not specified"
    ],
    "warnings": [
      "Composition sums to 0.98 — normalized to 1.0"
    ]
  },
  "fluids": [
    {
      "id": "feed_gas",
      "model": "SRK",
      "temperature_C": 50.0,
      "pressure_bara": 65.0,
      "components": {"methane": 0.85, "ethane": 0.10, "propane": 0.05}
    }
  ],
  "equipment": [
    {
      "id": "V-101",
      "type": "ThreePhaseSeparator",
      "name": "Inlet Separator",
      "tag": "20VA001",
      "design_pressure_barg": 70,
      "design_temperature_C": 100
    }
  ],
  "streams": [
    {
      "id": "S-001",
      "type": "material",
      "from": null,
      "to": "V-101",
      "port": "inlet",
      "fluid_ref": "feed_gas",
      "flow_rate_kg_hr": 75000.0
    }
  ],
  "connections": [
    {"from": "V-101.gasOut", "to": "K-101.inlet"},
    {"from": "V-101.oilOut", "to": "VLV-101.inlet"}
  ]
}
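
A minimal sketch of how this intermediate representation might be lowered to the NeqSim JSON builder format is shown below. It assumes a single fluid, uses equipment ids as unit names so that connection strings such as "V-101.gasOut" carry over unchanged, and keeps the equipment list order; the function name is hypothetical and equipment properties (outlet pressures and so on) are left out.

```python
def to_builder_json(intermediate):
    """Convert the intermediate schema to the builder format (sketch only)."""
    fluid_src = intermediate["fluids"][0]  # assumption: single-fluid case
    fluid = {
        "model": fluid_src["model"],
        "temperature": fluid_src["temperature_C"] + 273.15,  # degC -> K
        "pressure": fluid_src["pressure_bara"],
        "components": dict(fluid_src["components"]),
    }

    process = []
    inlet_of = {}  # equipment id -> inlet reference string

    # Streams with no upstream unit are feeds and become "Stream" entries.
    for stream in intermediate["streams"]:
        if stream["from"] is None:
            process.append({
                "type": "Stream",
                "name": stream["id"],
                "properties": {"flowRate": [stream["flow_rate_kg_hr"], "kg/hr"]},
            })
            inlet_of[stream["to"]] = stream["id"]

    # Connections wire an upstream outlet port to a downstream unit.
    for conn in intermediate["connections"]:
        target_id = conn["to"].split(".", 1)[0]
        inlet_of[target_id] = conn["from"]

    for unit in intermediate["equipment"]:
        entry = {"type": unit["type"], "name": unit["id"]}
        if unit["id"] in inlet_of:
            entry["inlet"] = inlet_of[unit["id"]]
        process.append(entry)

    return {"fluid": fluid, "process": process}
```

A fuller version would also sort the equipment topologically and flag any unit that ends up without an inlet as missing information.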

4. Equipment Type Mapping Table

A comprehensive mapping file (JSON or CSV) with 60+ entries:

| Natural Language | NeqSim Type | Category |
|---|---|---|
| separator, 2-phase separator, flash drum, KO drum, scrubber, slug catcher | Separator | Separation |
| 3-phase separator, three-phase separator, production separator | ThreePhaseSeparator | Separation |
| compressor, export compressor, recompressor, booster compressor | Compressor | Compression |
| cooler, aftercooler, air cooler, fin fan cooler | Cooler | Heat Transfer |
| heater, pre-heater, line heater | Heater | Heat Transfer |
| heat exchanger, shell & tube, plate HX | HeatExchanger | Heat Transfer |
| valve, choke valve, JT valve, letdown valve, control valve | ThrottlingValve | Valves |
| pump, export pump, booster pump | Pump | Pumps |
| mixer, junction, manifold | Mixer | Mixing |
| splitter, tee | Splitter | Splitting |
| expander, turbo-expander | Expander | Expansion |
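
One way such a mapping could be applied is a longest-match lookup, sketched below with a few rows from the table. The function name and the substring-matching strategy are assumptions for illustration; a production mapping would likely need word-boundary handling as well.

```python
# Sketch only: a few rows from the table; the full file would hold 60+ entries.
SYNONYMS = {
    "Separator": ["separator", "2-phase separator", "flash drum", "ko drum",
                  "scrubber", "slug catcher"],
    "ThreePhaseSeparator": ["3-phase separator", "three-phase separator",
                            "production separator"],
    "Compressor": ["compressor", "export compressor", "recompressor",
                   "booster compressor"],
    "ThrottlingValve": ["valve", "choke valve", "jt valve", "letdown valve",
                        "control valve"],
}

# Invert to phrase -> NeqSim type for direct lookup.
LOOKUP = {
    phrase: neqsim_type
    for neqsim_type, phrases in SYNONYMS.items()
    for phrase in phrases
}


def map_equipment_type(text):
    """Return the NeqSim type for the longest synonym found in `text`, else None."""
    lowered = text.lower()
    matches = [phrase for phrase in LOOKUP if phrase in lowered]
    if not matches:
        return None
    # Prefer the most specific phrase, so "3-phase separator" beats "separator".
    return LOOKUP[max(matches, key=len)]
```

Preferring the longest match matters because the generic terms ("separator", "valve") are substrings of the more specific ones.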

5. Template Library

Pre-defined JSON templates for common process configurations:

Each template has fixed topology with parametric placeholders (pressures, temperatures, compositions, flow rates) that can be filled from extracted data.
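
A parametric fill of this kind could be sketched as below. The "{{key}}" placeholder syntax, the template contents, and the function name are all hypothetical; the document does not specify how placeholders are written.

```python
# Sketch only: the "{{key}}" placeholder convention is an assumption.
COMPRESSION_TEMPLATE = {
    "process": [
        {"type": "Stream", "name": "feed",
         "properties": {"flowRate": ["{{flow_kg_hr}}", "kg/hr"]}},
        {"type": "Compressor", "name": "compressor", "inlet": "feed",
         "properties": {"outletPressure": "{{discharge_bara}}"}},
    ]
}


def fill_template(template, params):
    """Return (filled copy, sorted list of placeholders with no value)."""
    missing = set()

    def fill(node):
        if isinstance(node, dict):
            return {key: fill(value) for key, value in node.items()}
        if isinstance(node, list):
            return [fill(value) for value in node]
        if isinstance(node, str) and node.startswith("{{") and node.endswith("}}"):
            key = node[2:-2].strip()
            if key in params:
                return params[key]  # keeps the parameter's type (e.g. float)
            missing.add(key)
        return node

    return fill(template), sorted(missing)
```

Unfilled placeholders are reported rather than defaulted, so they can be surfaced as missing information instead of silent assumptions.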


Phased Implementation

| Phase | Scope | Difficulty | Dependencies |
|---|---|---|---|
| Phase 1 | Text descriptions → JSON for linear/branching processes | Easy | None; can start immediately |
| Phase 2 | Template matching + parametric fill from data sheets/Excel | Medium | Phase 1 + template library |
| Phase 3 | Image/sketch interpretation (PFD, process sketches) | Hard | Phase 1 + vision model |
| Phase 4 | Complex topology (recycles, multi-feed, distillation) | Hard | Phase 1 + recycle solver in JSON builder |
| Phase 5 | DEXPI / ISO 15926 P&ID import | Industry standard | Phase 4 + DEXPI schema mapping |

Phase 1 — Text Extraction (Start Here)

Minimum viable skill that can:

Phase 2 — Template + Data Fill

Phase 3 — Vision / Image


Acceptance Criteria

Must Have (Phase 1)

Should Have (Phase 2)

Nice to Have (Phase 3+)



Example: End-to-End Workflow

User provides:

“The well stream arrives at 65 bara and 80°C. It enters a 3-phase separator. Gas from the separator goes to a compressor that boosts pressure to 120 bara. Oil from the separator goes through a letdown valve to 15 bara. The gas composition is approximately 80% methane, 8% ethane, 5% propane, 3% CO2, 2% n-butane, 1% nitrogen, 0.5% n-pentane, 0.5% n-hexane. Flow rate is about 75000 kg/hr.”

Skill extracts:

{
  "fluid": {
    "model": "SRK",
    "temperature": 353.15,
    "pressure": 65.0,
    "mixingRule": "classic",
    "components": {
      "methane": 0.80, "ethane": 0.08, "propane": 0.05,
      "CO2": 0.03, "n-butane": 0.02, "nitrogen": 0.01,
      "n-pentane": 0.005, "n-hexane": 0.005
    }
  },
  "process": [
    {"type": "Stream", "name": "well stream", "properties": {"flowRate": [75000.0, "kg/hr"]}},
    {"type": "ThreePhaseSeparator", "name": "inlet separator", "inlet": "well stream"},
    {"type": "Compressor", "name": "gas compressor", "inlet": "inlet separator.gasOut", "properties": {"outletPressure": 120.0}},
    {"type": "ThrottlingValve", "name": "letdown valve", "inlet": "inlet separator.oilOut", "properties": {"outletPressure": 15.0}}
  ],
  "autoRun": true
}
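
Before handing JSON like this to the builder, the dot-notation wiring can be checked with a few lines of plain Python. The sketch below assumes that the text before the first dot in an "inlet" value is the upstream unit's name and that units must be listed upstream-first; both are inferred from the example above rather than from the builder's documentation, and the function name is hypothetical.

```python
def check_wiring(process):
    """Return a list of problems found in the "inlet" references (sketch only)."""
    problems = []
    defined = set()
    for unit in process:
        inlet = unit.get("inlet")
        if inlet is not None:
            # "inlet separator.gasOut" -> upstream unit "inlet separator"
            upstream = inlet.split(".", 1)[0]
            if upstream not in defined:
                problems.append(
                    f"{unit['name']}: inlet '{inlet}' references "
                    f"undefined unit '{upstream}'"
                )
        defined.add(unit["name"])
    return problems
```

Applied to the four-unit process above, this returns an empty list; renaming the separator without updating the compressor's inlet would produce one problem entry.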

Agent output:


Implementation Status

Already Delivered (Phase 1)

Remaining Work (Phase 2+)


Labels

enhancement, ai-skills, process-simulation, json-builder