# Customer Intelligence Hub
## 1. Overview

The Customer Intelligence Hub is a modular Streamlit application designed to ingest raw customer data (CSV/Excel) and apply four distinct analytical lenses: Churn Prediction, Customer Segmentation, Sentiment Analysis, and Geospatial Mapping. It features a “Self-Healing” state management system and an “Auto-Detection” engine that identifies data types based on column headers.
### Tech Stack

- Frontend: Streamlit, Plotly, Matplotlib.
- ML Core: XGBoost (Churn/CLV), Scikit-Learn (Segmentation/LDA), VADER (Sentiment).
- Data Processing: Pandas, NumPy.
- Geospatial: Geopy, Plotly Mapbox.
- Explainability: SHAP (Shapley Additive Explanations).
## 2. System Architecture

The project follows a standard Frontend-Backend separation pattern within a Python environment. The `src/` folder acts as the backend API.
### Directory Structure

```text
Root/
├── app.py                        # Entry Point & Capability Detection
├── requirements.txt              # Requirements File for Deployment
├── pages/                        # Pages Utility for Streamlit
│   ├── 1_Churn.py                # UI: Risk Dashboard & Simulator
│   ├── 2_Segmentation.py         # UI: Clustering & Logic Trees
│   ├── 3_Sentiment.py            # UI: NLP & Topic Modeling
│   ├── 4_Geospatial.py           # UI: Maps & Route Analytics
│   └── 5_Customer_View.py        # UI: Single Customer Inspector
├── data/                         # Raw CSV/Excel inputs
└── src/                          # "The Brain" (Business Logic & ML)
    ├── churn_engine.py           # XGBoost Classifier Pipeline
    ├── clv_engine.py             # XGBoost Regressor for LTV
    ├── segment_engine.py         # KMeans + Decision Tree Rules
    ├── sentiment_engine.py       # VADER + LDA Topic Extraction
    ├── geo_engine.py             # Hybrid Geocoding (Dict + API)
    ├── nlg_engine.py             # Natural Language Generation
    ├── recommendation_engine.py  # Rule-based Strategy Mapper
    ├── data_loader.py            # Universal File Reader
    ├── config.py                 # Paths & Constants
    └── validator.py              # Column Regex Validation
```

## 3. Core Workflow Diagrams
### A. Data Ingestion & Auto-Detection

The system automatically determines which modules to enable based on fuzzy string matching of column headers (e.g., if it sees “ReviewBody”, it enables Sentiment). A minimal sketch follows the diagram below.
```mermaid
graph TD
    A[User Uploads CSV] --> B[data_loader.py]
    B --> C{app.py: Auto-Register}
    C -->|Has 'churn', 'exited'| D[Enable Churn Module]
    C -->|Has 'lat', 'city'| E[Enable Geo Module]
    C -->|Has 'review', 'text'| F[Enable Sentiment Module]
    C -->|Has 'spend', 'amount'| G[Enable Segmentation Module]
    D & E & F & G --> H[Session State 'Flags']
    H --> I[Update UI Badges]
```
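As an illustration of this auto-registration step, here is a minimal sketch. The keyword lists and function internals are assumptions; the real `auto_register_data` in `app.py` (see Section 7) would write its flags into `st.session_state` rather than returning them.

```python
import pandas as pd

# Illustrative keyword triggers; the real lists in app.py may differ.
MODULE_TRIGGERS = {
    "churn": ("churn", "exited"),
    "geo": ("lat", "city", "country", "location", "route"),
    "sentiment": ("review", "text"),
    "segmentation": ("spend", "amount"),
}

def auto_register_data(df: pd.DataFrame) -> dict[str, bool]:
    """Return {module: enabled} flags from substring matches on headers."""
    headers = [c.lower() for c in df.columns]
    return {
        module: any(kw in h for h in headers for kw in keywords)
        for module, keywords in MODULE_TRIGGERS.items()
    }

df = pd.DataFrame(columns=["CustomerID", "ReviewBody", "MonthlySpend"])
print(auto_register_data(df))
# {'churn': False, 'geo': False, 'sentiment': True, 'segmentation': True}
```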
### B. Churn Prediction Pipeline

The Churn engine uses a production-ready pipeline that handles encoding, training, and explanation.
```mermaid
graph LR
    RawData --> Clean[_clean_data]
    Clean --> Split[Train/Test Split]
    Split --> Encode[Label Encoding]
    Encode --> XGB[XGBoost Classifier]
    XGB --> SHAP[SHAP Explainer]
    XGB --> Preds[Probabilities]
    Preds --> Risk[Risk Groups: Low/Med/High]
```
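A minimal sketch of that pipeline shape, assuming a generic tabular schema; the actual cleaning steps, risk thresholds, and hyperparameters in `src/churn_engine.py` may differ.

```python
import pandas as pd
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def train_churn_pipeline(df: pd.DataFrame, target: str = "Churn"):
    """Clean -> split -> encode -> XGBoost -> SHAP, following the diagram."""
    df = df.dropna(subset=[target]).copy()
    # Label-encode every text column (including a Yes/No target).
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    # scale_pos_weight offsets class imbalance (see the Churn Engine section).
    ratio = (y_train == 0).sum() / max((y_train == 1).sum(), 1)
    model = xgb.XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss")
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    # Bucket probabilities into Low/Med/High risk groups (thresholds assumed).
    risk = pd.cut(proba, bins=[0, 0.33, 0.66, 1.0],
                  labels=["Low", "Med", "High"], include_lowest=True)
    explainer = shap.TreeExplainer(model)
    return model, explainer, X_test, proba, risk
```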
## 4. Analytical Modules (Deep Dive)

### Churn Engine

(`src/churn_engine.py`)
Predicts the likelihood of a customer leaving using XGBoost.
- Algorithm: `XGBClassifier` with `scale_pos_weight` to handle class imbalance.
- Key Features:
  - Simulate Churn: Allows users to modify a single row (e.g., lower the monthly bill) and re-predict risk in real time.
  - Directional Importance: Calculates feature importance and correlates it with risk to determine whether a feature “Increases” or “Decreases” risk (sketched after this list).
  - Retention Plan: Aggregates churn rates by category to suggest the “Safest Segment” (e.g., “2-Year Contract has the lowest churn”).
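The “Directional Importance” idea can be sketched as follows: sign each feature’s global importance by how the feature correlates with predicted risk. This is an illustrative reconstruction, not the exact engine code.

```python
import numpy as np
import pandas as pd

def directional_importance(model, X: pd.DataFrame, proba: np.ndarray) -> pd.DataFrame:
    """Attach an 'Increases'/'Decreases' direction to each feature's importance."""
    rows = []
    for feature, importance in zip(X.columns, model.feature_importances_):
        # Correlate the feature with predicted risk to get the sign
        # (constant features yield NaN and are left unlabelled).
        corr = np.corrcoef(X[feature].astype(float), proba)[0, 1]
        direction = "n/a" if np.isnan(corr) else (
            "Increases risk" if corr > 0 else "Decreases risk")
        rows.append({"feature": feature, "importance": importance,
                     "direction": direction})
    return pd.DataFrame(rows).sort_values("importance", ascending=False)

# model, _, X_test, proba, _ = train_churn_pipeline(df)  # from the sketch above
# print(directional_importance(model, X_test, proba).head())
```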
### Segmentation Engine

(`src/segment_engine.py`)
Groups customers using Unsupervised Learning, then explains the groups using Supervised Learning.
- Algorithm: K-Means Clustering (grouping) followed by a Decision Tree Classifier (rule extraction); see the sketch after this list.
- Mode Detection: Automatically detects whether the data is “Demographic” (Age/Spend) or “RFM” (Recency/Frequency/Monetary) and adjusts preprocessing accordingly.
- Smart Labeling: Dynamically names clusters (e.g., “Young Saver”, “Lost Whale”) based on centroid statistics.
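A minimal sketch of the KMeans-then-tree pattern, assuming purely numeric features; the real engine adds mode detection and smart labeling on top of this.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

def segment_and_explain(df: pd.DataFrame, features: list[str], k: int = 4):
    """Cluster with KMeans, then fit a shallow tree to turn clusters into rules."""
    X_scaled = StandardScaler().fit_transform(df[features])
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    # The tree is trained on the *unscaled* features so its rules are readable.
    tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(df[features], labels)
    return labels, export_text(tree, feature_names=features)

df = pd.DataFrame({"Age":   [22, 25, 58, 61, 34, 40],
                   "Spend": [300, 280, 90, 60, 520, 480]})
labels, rules = segment_and_explain(df, ["Age", "Spend"], k=3)
print(rules)  # e.g. "|--- Spend <= 195.00" style split rules
```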
### Sentiment Engine

(`src/sentiment_engine.py`)
Analyzes unstructured text data.
- Sentiment Scoring: Uses VADER (Valence Aware Dictionary and sEntiment Reasoner) to generate compound scores (-1 to +1).
- Topic Modeling: Uses LDA (Latent Dirichlet Allocation) via `sklearn` to extract top keywords and themes from the text corpus (see the sketch after this list).
- Data Cleaning: Removes punctuation and normalizes text before vectorization.
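A compact sketch of the VADER + LDA combination; the corpus, topic count, and cleaning steps here are placeholders.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

reviews = [
    "Great flight, friendly crew and smooth boarding",
    "Terrible delay, lost my luggage, staff were rude",
    "Seats were comfortable but the food was bland",
]

# VADER compound score in [-1, +1]: negative values mean negative sentiment.
analyzer = SentimentIntensityAnalyzer()
scores = [analyzer.polarity_scores(r)["compound"] for r in reviews]
print(scores)

# LDA over a bag-of-words matrix to surface topic keywords.
vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(dtm)
terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = [terms[j] for j in comp.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top}")
```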
### Geospatial Engine

(`src/geo_engine.py`)
Maps location strings to coordinates.
- Hybrid Approach (sketched after this list):
  - Tier 1: Checks a large internal dictionary of IATA codes (LHR, JFK) and major cities.
  - Tier 2: Performs fuzzy string matching (substrings/typos).
  - Tier 3: (Optional) Calls the Nominatim API for deep scanning.
- Route Parser: Capable of extracting origin/destination from strings like “London to Paris”.
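A compressed sketch of the three tiers plus the route parser. The real dictionary in `geo_engine.py` is far larger, and the matching rules and cutoff here are assumptions.

```python
from difflib import get_close_matches
from geopy.geocoders import Nominatim

# Tier 1: tiny stand-in for the engine's internal lookup table.
KNOWN = {"LHR": (51.470, -0.454), "JFK": (40.641, -73.778),
         "london": (51.507, -0.128), "paris": (48.857, 2.352)}

def geocode(place: str, deep_scan: bool = False):
    key = place.strip()
    if key.upper() in KNOWN:                       # Tier 1a: IATA code hit
        return KNOWN[key.upper()]
    if key.lower() in KNOWN:                       # Tier 1b: city-name hit
        return KNOWN[key.lower()]
    close = get_close_matches(key.lower(), list(KNOWN), n=1, cutoff=0.8)
    if close:                                      # Tier 2: fuzzy match (typos)
        return KNOWN[close[0]]
    if deep_scan:                                  # Tier 3: live Nominatim call
        hit = Nominatim(user_agent="cih-demo").geocode(key)
        return (hit.latitude, hit.longitude) if hit else None
    return None

def parse_route(route: str):
    """Split strings like 'London to Paris' into (origin, destination)."""
    parts = [p.strip() for p in route.split(" to ")]
    return tuple(parts) if len(parts) == 2 else (route, None)

print(parse_route("London to Paris"))  # ('London', 'Paris')
print(geocode("Lodnon"))               # fuzzy match -> London's coordinates
```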
## 5. Business Logic Layer

The `src/recommendation_engine.py` file acts as the bridge between raw math and business advice. It applies conditional logic to generate “Next Best Actions”, summarised in the table below (a rule sketch follows the table).
| Context | Persona Logic Example | Action Example |
|---|---|---|
| Demographic | Age < 30 + Spend > Avg → “Gen Z Spender” | Strategy: Trend & FOMO. Push ‘Trending Now’ items. |
| RFM | Recency < 30 + Monetary > High → “Champion” | Strategy: Reward. Action: VIP Tier Upgrade. |
| Churn | Risk > 80% → “Flight Risk” | Action: Call with 20% Retention Discount. |
| Geo | Traffic > High + Rating < Low → “High Vol/Low Sat” | Action: Audit Ground Staff & Delays. |
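A minimal sketch of how such rules might be expressed. The thresholds come from the table above, but the field names and function shape are assumptions.

```python
def next_best_action(profile: dict) -> str:
    """Map a scored customer profile to a 'Next Best Action' via simple rules."""
    if profile.get("churn_risk", 0.0) > 0.80:
        return "Flight Risk -> Call with 20% Retention Discount."
    if profile.get("recency_days", 999) < 30 and profile.get("monetary_tier") == "high":
        return "Champion -> Reward: VIP Tier Upgrade."
    if profile.get("age", 99) < 30 and profile.get("spend_vs_avg", 0.0) > 1.0:
        return "Gen Z Spender -> Strategy: Trend & FOMO, push 'Trending Now' items."
    return "No rule matched -> monitor."

print(next_best_action({"churn_risk": 0.91}))
# Flight Risk -> Call with 20% Retention Discount.
```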
## 6. Key UI Features

### The Single Customer View

(`5_Customer_View.py`)
This page aggregates data for one specific ID.
- Radar Chart: Plots a customer’s percentile rank against the population average for top metrics (sketched after this list).
- Risk Gauge: Visualizes churn probability using a Plotly gauge chart.
- Cross-Module Access: Can pull data from Churn, Segmentation, or Sentiment caches depending on what is loaded.
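A sketch of the percentile-rank radar chart, assuming a `CustomerID` column; the flat 50th-percentile trace stands in for the population baseline described above.

```python
import pandas as pd
import plotly.graph_objects as go

def radar_for_customer(df: pd.DataFrame, customer_id, metrics: list[str]) -> go.Figure:
    """Plot one customer's percentile ranks against a 50th-percentile baseline."""
    row = df.loc[df["CustomerID"] == customer_id, metrics].iloc[0]
    # Percentile rank: share of the population at or below this customer's value.
    pct = [(df[m] <= row[m]).mean() * 100 for m in metrics]
    fig = go.Figure()
    fig.add_trace(go.Scatterpolar(r=pct, theta=metrics, fill="toself",
                                  name=f"Customer {customer_id}"))
    fig.add_trace(go.Scatterpolar(r=[50] * len(metrics), theta=metrics,
                                  name="Population baseline"))
    fig.update_layout(polar=dict(radialaxis=dict(range=[0, 100])))
    return fig

df = pd.DataFrame({"CustomerID": [1, 2, 3, 4],
                   "Spend": [100, 400, 250, 90],
                   "Tenure": [5, 40, 22, 12]})
fig = radar_for_customer(df, 2, ["Spend", "Tenure"])  # fig.show() to render
```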
### The Simulator

(`1_Churn.py`)
- Users can manipulate sliders (for numeric data) and dropdowns (for categorical data) to create a hypothetical customer profile.
- The system runs `churn_bot.predict_single()` to return the new probability and generates a SHAP waterfall plot to explain the specific prediction (see the sketch below).
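A sketch of what the single-row re-prediction could look like. `churn_bot`’s internals are not shown in this document, so the function body and the Streamlit wiring in the comments are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import shap

def predict_single(model, explainer, profile: dict, columns: list[str]):
    """Score one hypothetical customer and return probability + SHAP values."""
    row = pd.DataFrame([profile])[columns]      # enforce training column order
    proba = float(model.predict_proba(row)[0, 1])
    shap_values = explainer(row)                # shap.Explanation for this row
    return proba, shap_values

# Inside 1_Churn.py, sliders/dropdowns would feed `profile`, e.g.:
#   profile["MonthlyCharges"] = st.slider("Monthly bill", 0, 200, 70)
#   proba, sv = predict_single(model, explainer, profile, feature_cols)
#   st.metric("Churn probability", f"{proba:.0%}")
#   shap.plots.waterfall(sv[0], show=False)
#   st.pyplot(plt.gcf())
```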
## 7. Installation & Usage

- Environment Setup: Ensure requirements are installed (pandas, streamlit, xgboost, shap, plotly, geopy, scikit-learn).
- Run Application:

  ```bash
  streamlit run app.py
  ```

- Loading Data:
  - Option A: Click “Load Demo Data” on the home page.
  - Option B: Upload a CSV. The `auto_register_data` function in `app.py` will scan the columns and activate the relevant tabs in the sidebar.
## 8. Common Troubleshooting

- “No Geospatial Data Active”: The CSV must contain columns like ‘City’, ‘Country’, ‘Location’, or ‘Route’ for the auto-detector to flag it.
- “Unmatched Locations”: If the offline dictionary misses a city, enable “Deep Scan” in `4_Geospatial.py` to hit the live API.
- Churn Model Errors: If required columns are missing (specifically ‘Churn’), the model will refuse to train. Use the `validator.py` logic to check requirements (a sketch follows below).
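A minimal sketch of the kind of regex check `validator.py` performs; the actual patterns and return shape are assumptions.

```python
import re
import pandas as pd

# Illustrative required-column patterns per module (the real regexes in
# validator.py are not shown in this document).
REQUIRED = {"churn": re.compile(r"churn|exited", re.I)}

def validate_for_module(df: pd.DataFrame, module: str) -> tuple[bool, str]:
    """Check that at least one column matches the module's required pattern."""
    pattern = REQUIRED[module]
    if any(pattern.search(col) for col in df.columns):
        return True, "OK"
    return False, f"No column matching /{pattern.pattern}/ found for '{module}'."

print(validate_for_module(pd.DataFrame(columns=["CustomerID"]), "churn"))
# (False, "No column matching /churn|exited/ found for 'churn'.")
```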