stochastic-sisyphus · stochastic-sisyphus · Dec 9, 2024 · sourcery-ai · Dec 9, 2024
diff --git a/README.md b/README.md
@@ -32,7 +32,12 @@ python -m src.main --config path/to/custom_config.yaml
 python -m src.main --input "data/documents/" --output "results/"
 ```
 
-3. **Expected Output Structure**
+3. **Run the pipeline with the new configuration file**
+```bash
+python -m src.main --config config/config.yaml
+```
+
+4. **Expected Output Structure**
 ```
 outputs/
 ├── checkpoints/          # Processing checkpoints
@@ -438,3 +443,113 @@ print(f"ROUGE-L F1: {scores['rougeL']['fmeasure']:.3f}")
 bleu_scores = metrics.calculate_bleu_scores(generated_summaries, reference_summaries)
 print(f"BLEU Score: {bleu_scores['bleu']:.3f}")
 ```
+
+## Configuration
+
+### Configuration Options
+
+The configuration file (`config/config.yaml`) controls each stage of the pipeline. The key options, grouped by section, are:
+
+#### Data Configuration
+```yaml
+data:
+  input_path: "data/input"
+  output_path: "data/output"
+  processed_path: "data/processed"
+  batch_size: 32
+  scisummnet_path: "data/scisummnet_release1.1__20190413"
+  datasets:
+    - name: "xlsum"
+      source: "huggingface"
+      enabled: true
+      language: "english"
+      dataset_name: "GEM/xlsum"
+    - name: "scisummnet"
+      source: "local"
+      enabled: true
+      file_patterns:
+        xml: "{paper_id}.xml"
+        summary: "{paper_id}.gold.txt"
+      subdirs:
+        documents: "Documents_xml"
+        summaries: "summary"
+```
+
+#### Preprocessing Configuration
+```yaml
+preprocessing:
+  min_length: 100
+  max_length: 1000
+  validation:
+    missing_threshold: 5.0
+    min_dataset_size: 10000
+    min_text_length: 100
+    max_text_length: 1000
+```
+
+#### Embedding Configuration
+```yaml
+embedding:
+  model_name: "all-mpnet-base-v2"
+  dimension: 768
+  batch_size: 32
+  max_seq_length: 512
+  device: "cuda"
+```
+
+#### Clustering Configuration
+```yaml
+clustering:
+  algorithm: "hdbscan"
+  min_cluster_size: 5
+  min_samples: 5
+  metric: "euclidean"
+  params:
+    min_cluster_size: 5
+    min_samples: 5
+    metric: "euclidean"
+  output_dir: "outputs/clusters"
+```
+
+#### Visualization Configuration
+```yaml
+visualization:
+  enabled: true
+  output_dir: "outputs/figures"
+```
+
+#### Summarization Configuration
+```yaml
+summarization:
+  model_name: "facebook/bart-large-cnn"
+  max_length: 150
+  min_length: 50
+  device: "cuda"
+  batch_size: 8
+  style_params:
+    concise:
+      max_length: 100
+      min_length: 30
+    detailed:
+      max_length: 300
+      min_length: 100
+    technical:
+      max_length: 200
+      min_length: 50
+  num_beams: 4
+  length_penalty: 2.0
+  early_stopping: true
+```
+
+#### Logging Configuration
+```yaml
+logging:
+  level: "INFO"
+  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+```
+
+#### Checkpoints Configuration
+```yaml
+checkpoints:
+  dir: "outputs/checkpoints"
+```
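Review note on the configuration section above: a few of these options can contradict each other. `min_length` must not exceed `max_length`, and the clustering section states `min_cluster_size`, `min_samples`, and `metric` both at the top level and under `params`. The sketch below shows one way such checks could run on the parsed configuration before the pipeline starts. It assumes the YAML has already been loaded into a plain dictionary (for example with `yaml.safe_load`); `check_config` is a hypothetical helper, not a function from this repository.

```python
def check_config(cfg):
    """Return a list of human-readable problems found in a parsed config dict.

    Hypothetical helper for illustration; key names follow the README example.
    """
    problems = []

    def check_range(section, label):
        # Flag a section whose minimum length is larger than its maximum.
        lo, hi = section.get("min_length"), section.get("max_length")
        if lo is not None and hi is not None and lo > hi:
            problems.append(f"{label}: min_length {lo} exceeds max_length {hi}")

    check_range(cfg.get("preprocessing", {}), "preprocessing")

    summarization = cfg.get("summarization", {})
    check_range(summarization, "summarization")
    for style, params in summarization.get("style_params", {}).items():
        check_range(params, f"summarization.style_params.{style}")

    # The example config repeats some clustering keys under `params`;
    # report any key whose two values disagree.
    clustering = cfg.get("clustering", {})
    for key, value in clustering.get("params", {}).items():
        if key in clustering and clustering[key] != value:
            problems.append(
                f"clustering.{key}: top-level value {clustering[key]!r} "
                f"differs from params value {value!r}"
            )
    return problems


example = {
    "preprocessing": {"min_length": 100, "max_length": 1000},
    "clustering": {
        "min_cluster_size": 5,
        "params": {"min_cluster_size": 10},  # deliberately inconsistent
    },
    "summarization": {
        "min_length": 50,
        "max_length": 150,
        "style_params": {"concise": {"min_length": 30, "max_length": 100}},
    },
}

for problem in check_config(example):
    print(problem)
```

Returning a list rather than raising on the first problem lets a caller report every inconsistency in one pass.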
diff --git a/plan.md b/plan.md
@@ -54,7 +54,14 @@
   - Datasets: N/A
   - Paper Sections: Full paper
   - Concepts to Research: Formatting, NLP publication venues
-  - 
+
+- **Configuration Options**
+  - Lines of Code (Approx.): ~100-200
+  - Files: 1-2
+  - Datasets: Configuration files
+  - Paper Sections: Methodology
+  - Concepts to Research: YAML, configuration management
+
 ---
 
 ### **2. Detailed Breakdown**
@@ -461,10 +468,61 @@ Dynamic Summarization and Adaptive Clustering: A Framework for Real-Time Researc
 2. **Entity Extraction**: Enrich summaries with named entities using `spacy` NER.
 3. **Cloud Deployment**: Enable processing at scale using AWS/GCP.
 
----
----
 ---
 
+## **Configuration Details**
+
+### **Data Configuration**
+- **input_path**: Path to the input data directory.
+- **output_path**: Path to the output data directory.
+- **processed_path**: Path to the processed data directory.
+- **batch_size**: Batch size for data processing.
+- **scisummnet_path**: Path to the ScisummNet dataset.
+- **datasets**: List of datasets to be used, with their respective configurations.
+
+### **Preprocessing Configuration**
+- **min_length**: Minimum length of the text to be considered.
+- **max_length**: Maximum length of the text to be considered.
+- **validation**: Validation thresholds for the preprocessing step (`missing_threshold`, `min_dataset_size`, `min_text_length`, `max_text_length`).
+
+### **Embedding Configuration**
+- **model_name**: Name of the embedding model to be used.
+- **dimension**: Dimension of the generated embeddings.
+- **batch_size**: Batch size for embedding generation.
+- **max_seq_length**: Maximum sequence length for the embedding model.
+- **device**: Device to be used for embedding generation (e.g., "cuda" for GPU).
+
+### **Clustering Configuration**
+- **algorithm**: Clustering algorithm to be used (e.g., "hdbscan").
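+- **min_cluster_size**: Minimum number of points a group must contain to count as a cluster.
+- **min_samples**: Number of samples in a point's neighborhood required for it to be treated as a core point (HDBSCAN); larger values make clustering more conservative.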
+- **metric**: Distance metric to be used for clustering.
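+- **params**: Parameters passed to the clustering algorithm; in the example config these repeat the top-level `min_cluster_size`, `min_samples`, and `metric` values.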
+- **output_dir**: Directory to save the clustering results.
+
+### **Visualization Configuration**
+- **enabled**: Whether visualization is enabled.
+- **output_dir**: Directory to save the visualization results.
+
+### **Summarization Configuration**
+- **model_name**: Name of the summarization model to be used.
+- **max_length**: Maximum length of the generated summaries.
+- **min_length**: Minimum length of the generated summaries.
+- **device**: Device to be used for summarization (e.g., "cuda" for GPU).
+- **batch_size**: Batch size for summarization.
+- **style_params**: Parameters for different summarization styles (e.g., concise, detailed, technical).
+- **num_beams**: Number of beams for beam search.
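+- **length_penalty**: Exponent applied to sequence length when scoring beams; in Hugging Face generation, positive values favor longer outputs.
+- **early_stopping**: Whether beam search stops as soon as `num_beams` complete candidates have been found.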
+
+### **Logging Configuration**
+- **level**: Logging level (e.g., "INFO").
+- **format**: Format of the log messages.
+
+### **Checkpoints Configuration**
+- **dir**: Directory to save the checkpoints.
+
+---
 # INFO:
 
 # **data: XL-Sum and ScisummNet**
@@ -517,4 +575,4 @@ tree -L 3
     └── test_embedding_visualizer.py
 
 13 directories, 18 files
-(.venv) (base) iMac:synsearch vanessa$ 
+(.venv) (base) iMac:synsearch vanessa$
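
Review note on `style_params`: the entry described under Summarization Configuration holds per-style length limits alongside the section-wide `max_length` and `min_length`. One plausible reading is that a style's values override the section defaults. The diff does not show how the pipeline actually combines them, so the sketch below is an assumption. `resolve_generation_params` and `GENERATION_KEYS` are hypothetical names, and the dictionary mirrors the README example.

```python
# Keys from the summarization section that describe text generation.
GENERATION_KEYS = (
    "max_length",
    "min_length",
    "num_beams",
    "length_penalty",
    "early_stopping",
)


def resolve_generation_params(summarization_cfg, style=None):
    """Merge section-wide generation settings with one style's overrides.

    Assumption: values under style_params take precedence over the
    section-wide defaults. The real pipeline may behave differently.
    """
    params = {
        key: summarization_cfg[key]
        for key in GENERATION_KEYS
        if key in summarization_cfg
    }
    if style is not None:
        styles = summarization_cfg.get("style_params", {})
        if style not in styles:
            raise KeyError(f"unknown summarization style: {style!r}")
        params.update(styles[style])
    return params


summarization_cfg = {
    "model_name": "facebook/bart-large-cnn",
    "max_length": 150,
    "min_length": 50,
    "style_params": {
        "concise": {"max_length": 100, "min_length": 30},
        "detailed": {"max_length": 300, "min_length": 100},
    },
    "num_beams": 4,
    "length_penalty": 2.0,
    "early_stopping": True,
}

print(resolve_generation_params(summarization_cfg, "concise"))
```

The result has the shape of keyword arguments for a text-generation call, though whether the pipeline passes them that way is not shown in this diff.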