Contrastive Analysis of Constituent Order Preferences Within Adverbial Roles in English and Chinese News: A Large-Language-Model-Driven Approach

大模型驱动的英汉新闻状语功能成分序偏好对比

This repository contains the source code and research data for a term paper submitted to the Corpus Linguistics (语料库语言学) and Contrastive Linguistics (英汉语言对比) courses at Beijing University of Posts and Telecommunications, Fall 2024.

Data Sources

The research utilizes the following corpora:

ToRCH2019 Modern Chinese Balanced Corpus (Jialei Li, Mingchen Sun, Jiajin Xu. 2022. National Research Centre for Foreign Language Education, Beijing Foreign Studies University)
The CROWN2021 Corpus (Mingchen Sun, Jiajin Xu et al. 2022. National Research Centre for Foreign Language Education, Beijing Foreign Studies University)

Repository Overview

The research paper is available in the paper directory in both Chinese (the original) and English (a translation). The English version makes the paper accessible to non-Chinese speakers while preserving the core academic content.

Components

The analysis was implemented through three main modules:

1. Annotation Module

Utilizes GPT-4o for automated functional block annotation
Implements robust sentence splitting with handling for abbreviations and special cases
Processes both English and Chinese corpora in batches for efficiency
Includes quality control measures through post-processing scripts
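
The sentence-splitting step can be illustrated with a minimal sketch. This is not the code in annotation/; the abbreviation list and the function names split_sentences and split_chinese are assumptions made for illustration only.

```python
import re

# Hypothetical abbreviation list; the repository's actual list may differ.
ABBREVIATIONS = {
    "mr.", "mrs.", "ms.", "dr.", "prof.", "inc.",
    "u.s.", "u.k.", "a.m.", "p.m.", "e.g.", "i.e.", "etc.",
}

def split_sentences(text):
    """Split English text on ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A token ends a sentence if it closes with terminal punctuation
        # and is not listed as an abbreviation.
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences

def split_chinese(text):
    """Split Chinese text after the full-width marks 。！？."""
    parts = re.findall(r"[^。！？]+[。！？]?", text)
    return [p.strip() for p in parts if p.strip()]

if __name__ == "__main__":
    sample = "Dr. Smith visited the U.S. embassy on Monday. He left at 5 p.m. sharp!"
    for sentence in split_sentences(sample):
        print(sentence)
```

A known limitation of this sketch is that an abbreviation which genuinely ends a sentence (for example "... in the U.S.") is not treated as a boundary, and closing quotation marks after the final punctuation are not handled.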

2. Statistical Analysis Module

Analyzes three main research questions:
- Q1: Distribution preferences of functional blocks
- Q2: Patterns in SVO-functional block combinations
- Q3: Multiple functional block ordering patterns
Employs chi-square tests and t-tests for significance testing
Utilizes conditional probability matrices for transition analysis
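
As a rough illustration of the significance testing mentioned above, the Pearson chi-square statistic for a contingency table can be computed by hand. The function name chi_square and the counts in the example are made up for illustration; they are not results from the paper, and the scripts in analysis/ may compute the test differently.

```python
def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    `table` is a list of rows, e.g. rows = languages, columns = positions.
    No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

if __name__ == "__main__":
    # Invented counts: rows = two corpora, columns = pre-verbal vs. post-verbal.
    stat, dof = chi_square([[10, 20], [20, 10]])
    print(f"chi2 = {stat:.3f}, dof = {dof}")
```

Because no continuity correction is applied, the value for a 2x2 table can differ from library routines that apply Yates' correction by default, such as SciPy's chi2_contingency.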

3. Semantic Analysis Module

Uses MiniCPM-Embedding model for semantic feature extraction
Implements dimensionality reduction through t-SNE
Analyzes semantic similarities between functional blocks
Explores semantic influence on word order preferences
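
The similarity comparison can be sketched without any model dependency, assuming each functional block has already been mapped to a list of embedding vectors. The helper names centroid and cosine_similarity and the toy 3-dimensional vectors are assumptions for illustration; real embeddings from MiniCPM-Embedding have far more dimensions.

```python
import math

def centroid(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    # Toy vectors standing in for embeddings of two groups of time expressions.
    group_a = [[1.0, 0.0, 1.0], [1.0, 0.0, 3.0]]
    group_b = [[2.0, 0.0, 4.0]]
    print(cosine_similarity(centroid(group_a), centroid(group_b)))
```

Comparing centroids is only one option; pairwise similarities between individual blocks would preserve more of the within-group variation.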

Setup

Create and activate a conda environment:

conda create -n constituent_order python=3.11
conda activate constituent_order

Install dependencies:

pip install -r requirements.txt

Usage

Data Annotation & Processing

Configure OpenAI API key in .env
Run annotation scripts:

python annotation/anno_en.py  # English corpus
python annotation/anno_ch.py  # Chinese corpus

Post-process and extract patterns:

python annotation/post_ch.py  # Process Chinese annotations
python annotation/post_en.py  # Process English annotations
python annotation/extract.py  # Extract abstract patterns
python analysis/relative_position.py  # Convert to relative positions

# Then manually copy and paste the terminal output into /analysis/ch.txt or /analysis/en.txt (hereafter referred to as relative position files)

Refer to the demo directory for more detailed examples.
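
To make the idea of a relative position concrete, here is a minimal sketch under one assumed definition: a constituent's index divided by the index of the last constituent, so 0.0 is clause-initial and 1.0 is clause-final. The actual definition used by analysis/relative_position.py may differ; the function name relative_positions and the labels are hypothetical.

```python
def relative_positions(labels):
    """Map an ordered list of constituent labels to (label, position) pairs.

    Assumed definition: position = index / (n - 1), ranging from 0.0 to 1.0.
    A clause with a single constituent is assigned position 0.0.
    """
    n = len(labels)
    if n == 1:
        return [(labels[0], 0.0)]
    return [(label, i / (n - 1)) for i, label in enumerate(labels)]

if __name__ == "__main__":
    # Hypothetical abstract pattern: a time adverbial before subject, verb, object.
    for label, position in relative_positions(["TIME", "S", "V", "O"]):
        print(f"{label}\t{position:.2f}")
```

Normalizing by clause length in this way lets clauses with different numbers of constituents be compared on one scale.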

Analysis Pipeline

Q1: Distribution preferences

python analysis/histogram.py  # Distribution visualization
python analysis/chi.py  # Statistical tests for intralingual comparison
python analysis/t_test.py  # Statistical tests for interlingual comparison

Input: Relative position files
Output: Relative position distributions, statistical test results

Q2: SVO-functional block patterns

python analysis/Q2/count_and_patterns/

Input: Relative position files
Output: Pattern frequencies, conditional probabilities

Q3: Multiple block ordering

python analysis/markov.py  # Transition probability calculation
python analysis/heatmap.py  # Visualization

Input: Relative position files
Output: Transition matrices, combination patterns
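
A first-order transition matrix of the kind listed as output can be estimated by counting adjacent pairs of functional blocks. The sketch below is an illustration, not the contents of analysis/markov.py; the function name transition_matrix and the block labels are assumptions.

```python
from collections import Counter, defaultdict

def transition_matrix(sequences):
    """Estimate P(next block | current block) from ordered block sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Count each adjacent (current, next) pair in the sequence.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1

    matrix = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        # Normalize each row so its probabilities sum to 1.
        matrix[current] = {nxt: c / total for nxt, c in followers.items()}
    return matrix

if __name__ == "__main__":
    # Invented sequences of adverbial blocks, one list per sentence.
    data = [["time", "place", "manner"], ["time", "manner"], ["time", "place"]]
    for current, row in transition_matrix(data).items():
        print(current, row)
```

Blocks that never occur in non-final position have no row in the result, which a heatmap script would need to fill with zeros before plotting.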

Semantic Analysis

# Q1: Compare overall corpus semantics and time-SVO vs. SVO-time patterns
python analysis/semantics/1/embed.py  # Generate embeddings
python analysis/semantics/1/1.py  # Calculate and visualize

# Q2: Compare functional block semantics between languages
python analysis/semantics/2/embed.py  # Generate embeddings
python analysis/semantics/2/2.py  # Calculate and visualize

Input: Relative position files
Output: Semantic similarity matrices, visualization plots

Contact

For inquiries regarding code implementation or research reproduction, please contact me by email. Contact information is available on my personal website.

Citation

If you find this research or code useful for your work, feel free to star this repository. If you do need to cite it, contact me for proper citation information; this is a toy project for a course paper. :)
