Update read me and documentation #3

Status: Open. Wants to merge 1 commit into base: main.
117 changes: 116 additions & 1 deletion README.md
@@ -32,7 +32,12 @@ python -m src.main --config path/to/custom_config.yaml
python -m src.main --input "data/documents/" --output "results/"
```

3. **Expected Output Structure**
3. **Run the pipeline with the new configuration file**
```bash
python -m src.main --config config/config.yaml
```

4. **Expected Output Structure**
```
outputs/
├── checkpoints/ # Processing checkpoints
@@ -438,3 +443,113 @@ print(f"ROUGE-L F1: {scores['rougeL']['fmeasure']:.3f}")
bleu_scores = metrics.calculate_bleu_scores(generated_summaries, reference_summaries)
print(f"BLEU Score: {bleu_scores['bleu']:.3f}")
```

## Configuration

### Configuration Options

The configuration file (`config.yaml`) allows you to customize various aspects of the pipeline. Below are the key configuration options:

#### Data Configuration
```yaml
data:
  input_path: "data/input"
  output_path: "data/output"
  processed_path: "data/processed"
  batch_size: 32
  scisummnet_path: "data/scisummnet_release1.1__20190413"
  datasets:
    - name: "xlsum"
      source: "huggingface"
      enabled: true
      language: "english"
      dataset_name: "GEM/xlsum"
    - name: "scisummnet"
      source: "local"
      enabled: true
      file_patterns:
        xml: "{paper_id}.xml"
        summary: "{paper_id}.gold.txt"
      subdirs:
        documents: "Documents_xml"
        summaries: "summary"
```
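A minimal sketch of how the `data` section might be consumed, using only the standard library. The helper names `enabled_datasets` and `scisummnet_paths` are hypothetical and not part of the project, and the directory layout (one folder per paper directly under `scisummnet_path`) is an assumption made for illustration.

```python
from pathlib import Path

# Hypothetical in-memory copy of the `data` section shown above.
data_cfg = {
    "scisummnet_path": "data/scisummnet_release1.1__20190413",
    "datasets": [
        {"name": "xlsum", "source": "huggingface", "enabled": True,
         "language": "english", "dataset_name": "GEM/xlsum"},
        {"name": "scisummnet", "source": "local", "enabled": True,
         "file_patterns": {"xml": "{paper_id}.xml",
                           "summary": "{paper_id}.gold.txt"},
         "subdirs": {"documents": "Documents_xml", "summaries": "summary"}},
    ],
}


def enabled_datasets(cfg):
    """Return the names of datasets whose `enabled` flag is true."""
    return [d["name"] for d in cfg["datasets"] if d.get("enabled", False)]


def scisummnet_paths(cfg, paper_id):
    """Build (xml_path, summary_path) for one paper.

    Assumes one folder per paper under `scisummnet_path`; adjust if the
    release on disk is organized differently.
    """
    ds = next(d for d in cfg["datasets"] if d["name"] == "scisummnet")
    root = Path(cfg["scisummnet_path"]) / paper_id
    xml_name = ds["file_patterns"]["xml"].format(paper_id=paper_id)
    summary_name = ds["file_patterns"]["summary"].format(paper_id=paper_id)
    xml_path = root / ds["subdirs"]["documents"] / xml_name
    summary_path = root / ds["subdirs"]["summaries"] / summary_name
    return xml_path, summary_path
```

Setting `enabled: false` on a dataset entry would drop it from the list returned by `enabled_datasets` without removing its configuration.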

#### Preprocessing Configuration
```yaml
preprocessing:
  min_length: 100
  max_length: 1000
  validation:
    missing_threshold: 5.0
    min_dataset_size: 10000
    min_text_length: 100
    max_text_length: 1000
```

Review comment (on `min_length`):

suggestion: Please specify the units for min_length and max_length (characters, words, or tokens).

Suggested implementation:

  min_length: 100  # characters
  max_length: 1000  # characters

    min_text_length: 100  # characters
    max_text_length: 1000  # characters
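The length bounds and the missing-value threshold could be applied roughly as in the sketch below. It assumes the limits are counted in characters (the configuration does not state the unit) and that `missing_threshold` is a percentage of empty records; both are assumptions, and the function names are hypothetical.

```python
def filter_by_length(texts, min_length=100, max_length=1000):
    """Keep texts whose character count lies in [min_length, max_length].

    Assumption: lengths are measured in characters, not words or tokens.
    """
    return [t for t in texts if min_length <= len(t) <= max_length]


def missing_percentage(records):
    """Return the percentage of records that are None or blank."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r is None or not r.strip())
    return 100.0 * missing / len(records)


def passes_validation(records, missing_threshold=5.0):
    """True when the share of missing records does not exceed the threshold."""
    return missing_percentage(records) <= missing_threshold
```

If the limits turn out to be word or token counts, only the `len(t)` expression in `filter_by_length` would need to change.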

#### Embedding Configuration
```yaml
embedding:
  model_name: "all-mpnet-base-v2"
  dimension: 768
  batch_size: 32
  max_seq_length: 512
  device: "cuda"
```

#### Clustering Configuration
```yaml
clustering:
  algorithm: "hdbscan"
  min_cluster_size: 5
  min_samples: 5
  metric: "euclidean"
  params:
    min_cluster_size: 5
    min_samples: 5
    metric: "euclidean"
  output_dir: "outputs/clusters"
```
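The clustering section lists `min_cluster_size`, `min_samples`, and `metric` both at the top level and under `params`. The sketch below shows one way to resolve them into a single set of arguments; the precedence rule (values under `params` win) and the function name are assumptions, not the project's documented behavior.

```python
# Hypothetical in-memory copy of the `clustering` section shown above.
clustering_cfg = {
    "algorithm": "hdbscan",
    "min_cluster_size": 5,
    "min_samples": 5,
    "metric": "euclidean",
    "params": {"min_cluster_size": 5, "min_samples": 5, "metric": "euclidean"},
    "output_dir": "outputs/clusters",
}

ALGORITHM_KEYS = ("min_cluster_size", "min_samples", "metric")


def resolve_clustering_params(cfg):
    """Merge top-level defaults with `params`; `params` takes precedence."""
    top_level = {k: cfg[k] for k in ALGORITHM_KEYS if k in cfg}
    return {**top_level, **cfg.get("params", {})}
```

The resulting dictionary could then be unpacked into the clusterer's constructor, for example `hdbscan.HDBSCAN(**resolve_clustering_params(clustering_cfg))` when the `hdbscan` package is installed.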

#### Visualization Configuration
```yaml
visualization:
  enabled: true
  output_dir: "outputs/figures"
```

#### Summarization Configuration
```yaml
summarization:
  model_name: "facebook/bart-large-cnn"
  max_length: 150
  min_length: 50
  device: "cuda"
  batch_size: 8
  style_params:
    concise:
      max_length: 100
      min_length: 30
    detailed:
      max_length: 300
      min_length: 100
    technical:
      max_length: 200
      min_length: 50
  num_beams: 4
  length_penalty: 2.0
  early_stopping: true
```
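One plausible reading of `style_params` is that each style overrides the default `max_length` and `min_length` while the beam-search settings stay shared. The sketch below illustrates that reading; the function name `generation_kwargs` and the override semantics are assumptions rather than the pipeline's confirmed behavior.

```python
# Hypothetical in-memory copy of the `summarization` section shown above.
summarization_cfg = {
    "model_name": "facebook/bart-large-cnn",
    "max_length": 150,
    "min_length": 50,
    "style_params": {
        "concise": {"max_length": 100, "min_length": 30},
        "detailed": {"max_length": 300, "min_length": 100},
        "technical": {"max_length": 200, "min_length": 50},
    },
    "num_beams": 4,
    "length_penalty": 2.0,
    "early_stopping": True,
}

GENERATION_KEYS = ("max_length", "min_length", "num_beams",
                   "length_penalty", "early_stopping")


def generation_kwargs(cfg, style=None):
    """Return generation settings, with style-specific values overriding defaults."""
    kwargs = {k: cfg[k] for k in GENERATION_KEYS if k in cfg}
    if style is not None:
        styles = cfg.get("style_params", {})
        if style not in styles:
            raise KeyError(f"unknown summarization style: {style!r}")
        kwargs.update(styles[style])
    return kwargs
```

The keys in `GENERATION_KEYS` match keyword arguments accepted by the Hugging Face `generate` method, so the returned dictionary could be passed along as `model.generate(**inputs, **generation_kwargs(cfg, "concise"))`.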

#### Logging Configuration
```yaml
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
```
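These two keys map directly onto the standard library's `logging.basicConfig`. A minimal sketch, assuming the section has already been loaded into a dictionary; `configure_logging` is a hypothetical helper name.

```python
import logging

# Hypothetical in-memory copy of the `logging` section shown above.
logging_cfg = {
    "level": "INFO",
    "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
}


def configure_logging(cfg):
    """Configure the root logger from the config and return the numeric level."""
    level = getattr(logging, str(cfg.get("level", "INFO")).upper(), None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back when the level name is unknown
    kwargs = {"level": level, "force": True}  # force=True replaces old handlers
    if cfg.get("format"):
        kwargs["format"] = cfg["format"]
    logging.basicConfig(**kwargs)
    return level
```

Calling `configure_logging(logging_cfg)` once at startup, before any module emits log records, keeps the format consistent across the pipeline.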

#### Checkpoints Configuration
```yaml
checkpoints:
  dir: "outputs/checkpoints"
```
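A rough sketch of how a checkpoint directory like this could be used. The JSON format, the one-file-per-stage naming, and the helper names are all assumptions made for illustration; the project's actual checkpoint format is not shown in this diff.

```python
import json
from pathlib import Path


def save_checkpoint(checkpoint_dir, stage, state):
    """Write `state` as JSON to <checkpoint_dir>/<stage>.json and return the path."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)  # create the directory on first use
    path = directory / f"{stage}.json"
    path.write_text(json.dumps(state), encoding="utf-8")
    return path


def load_checkpoint(checkpoint_dir, stage):
    """Return the saved state for `stage`, or None if no checkpoint exists."""
    path = Path(checkpoint_dir) / f"{stage}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

A pipeline stage could call `load_checkpoint("outputs/checkpoints", "embedding")` first and skip its work when a state is returned.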
66 changes: 62 additions & 4 deletions plan.md
@@ -54,7 +54,14 @@
- Datasets: N/A
- Paper Sections: Full paper
- Concepts to Research: Formatting, NLP publication venues
-

- **Configuration Options**
- Lines of Code (Approx.): ~100-200
- Files: 1-2
- Datasets: Configuration files
- Paper Sections: Methodology
- Concepts to Research: YAML, configuration management

---

### **2. Detailed Breakdown**
@@ -461,10 +468,61 @@ Dynamic Summarization and Adaptive Clustering: A Framework for Real-Time Researc
2. **Entity Extraction**: Enrich summaries with named entities using `spacy` NER.
3. **Cloud Deployment**: Enable processing at scale using AWS/GCP.

---

## **Configuration Details**

### **Data Configuration**
- **input_path**: Path to the input data directory.
- **output_path**: Path to the output data directory.
- **processed_path**: Path to the processed data directory.
- **batch_size**: Batch size for data processing.
- **scisummnet_path**: Path to the ScisummNet dataset.
- **datasets**: List of datasets to be used, with their respective configurations.

### **Preprocessing Configuration**
- **min_length**: Minimum length of the text to be considered.
- **max_length**: Maximum length of the text to be considered.
- **validation**: Validation parameters for the preprocessing step.

### **Embedding Configuration**
- **model_name**: Name of the embedding model to be used.
- **dimension**: Dimension of the generated embeddings.
- **batch_size**: Batch size for embedding generation.
- **max_seq_length**: Maximum sequence length for the embedding model.
- **device**: Device to be used for embedding generation (e.g., "cuda" for GPU).

### **Clustering Configuration**
- **algorithm**: Clustering algorithm to be used (e.g., "hdbscan").
- **min_cluster_size**: Minimum size of clusters.
- **min_samples**: Number of neighboring samples required for a point to count as a core point; larger values make the clustering more conservative.
- **metric**: Distance metric to be used for clustering.
- **params**: Additional parameters for the clustering algorithm.
- **output_dir**: Directory to save the clustering results.

### **Visualization Configuration**
- **enabled**: Whether visualization is enabled.
- **output_dir**: Directory to save the visualization results.

### **Summarization Configuration**
- **model_name**: Name of the summarization model to be used.
- **max_length**: Maximum length of the generated summaries.
- **min_length**: Minimum length of the generated summaries.
- **device**: Device to be used for summarization (e.g., "cuda" for GPU).
- **batch_size**: Batch size for summarization.
- **style_params**: Parameters for different summarization styles (e.g., concise, detailed, technical).
- **num_beams**: Number of beams for beam search.
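- **length_penalty**: Exponent applied to sequence length when scoring beams; higher values favor longer summaries.
- **early_stopping**: Whether beam search stops as soon as `num_beams` complete candidates have been found.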

### **Logging Configuration**
- **level**: Logging level (e.g., "INFO").
- **format**: Format of the log messages.

### **Checkpoints Configuration**
- **dir**: Directory to save the checkpoints.

---
# INFO:

# **data: XL-Sum and ScisummNet**
@@ -517,4 +575,4 @@ tree -L 3
└── test_embedding_visualizer.py

13 directories, 18 files
(.venv) (base) iMac:synsearch vanessa$