Analytical Questions Based on Debsources Paper #30

Open
5 tasks
SahithiKasim opened this issue Aug 7, 2023 · 1 comment
SahithiKasim commented Aug 7, 2023

Upstream code analysis:

  • 1a. What trends can be observed in code complexity over time?

  • Lines of Code (LOC): Count the total lines of code in a codebase. Higher LOC generally indicates greater complexity.

  • Dependency Counts: The number of dependencies each package declares. More dependencies imply greater complexity.

  • Code Churn: The number of lines of code added/deleted over time. High churn suggests complex code that requires more frequent changes.
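The churn metric above can be sketched as follows. This is a minimal sketch, assuming the history is exported with `git log --numstat --date=short --pretty=format:@%ad`; the `@` date marker and the function name `churn_by_month` are choices made for this illustration, not part of the project.

```python
from collections import defaultdict


def churn_by_month(numstat_log: str) -> dict[str, tuple[int, int]]:
    """Sum added and deleted lines per month (YYYY-MM).

    Assumes input produced by:
        git log --numstat --date=short --pretty=format:@%ad
    so each commit starts with a line like "@2023-08-07", followed by
    tab-separated "added<TAB>deleted<TAB>path" lines.
    """
    totals = defaultdict(lambda: [0, 0])
    month = None
    for line in numstat_log.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            month = line[1:8]  # keep only "YYYY-MM"
            continue
        parts = line.split("\t")
        if len(parts) != 3 or month is None:
            continue
        added, deleted, _path = parts
        if added == "-" or deleted == "-":
            continue  # git reports "-" for binary files
        totals[month][0] += int(added)
        totals[month][1] += int(deleted)
    return {m: (a, d) for m, (a, d) in sorted(totals.items())}
```

Grouping by month is arbitrary here; the same loop works per year or per release by changing how the date marker is truncated.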

  • 1b. How has the cyclomatic complexity of core Debian packages changed across different time periods?

  • Get the source code history for each package from its git repository.

  • Calculate the cyclomatic complexity of each package version over time using static analysis tools.

  • Apply statistical analysis (such as regression) to test for a significant trend in mean cyclomatic complexity.
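The regression step could look roughly like this. It is a sketch under the assumption that mean cyclomatic complexity has already been computed per period by whatever static analysis tool is chosen; `linear_trend` is a hypothetical helper name, and it fits an ordinary least-squares line by hand so that no third-party library is needed.

```python
import math


def linear_trend(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, t_stat), where t_stat is the slope divided
    by its standard error. Compare t_stat against a t distribution with
    len(xs) - 2 degrees of freedom to judge significance.
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points to estimate a trend")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else math.inf
    return slope, intercept, t_stat
```

Here `xs` would be release years (or version indices) and `ys` the mean complexity for each; a positive slope with a large `t_stat` suggests complexity is growing over time.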

  • How has the diversity of licenses used in the codebase evolved?

  • Extract license information from upstream copyright files.

  • Map the license identifiers to general categories such as BSD, GNU, Apache, OpenSSL, etc. (Debian's license overview is at https://www.debian.org/legal/licenses/).

  • For each year, tally the number of packages under each license category. Calculate category proportions.

  • Analyze the trends in license category proportions over time using descriptive statistics.
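The mapping and tallying steps above might be sketched like this. The `CATEGORY` table is an illustrative assumption, not the project's actual mapping, and `license_proportions` is a hypothetical name; the input is assumed to be one `(year, license_id)` pair per package.

```python
from collections import Counter, defaultdict

# Illustrative mapping only: a real table would need to cover every
# identifier found in the copyright files.
CATEGORY = {
    "GPL-2+": "GNU",
    "GPL-3+": "GNU",
    "LGPL-2.1+": "GNU",
    "BSD-3-clause": "BSD",
    "Apache-2.0": "Apache",
}


def license_proportions(records):
    """Return {year: {category: share}} from (year, license_id) pairs.

    Identifiers missing from CATEGORY are counted under "Other".
    """
    counts = defaultdict(Counter)
    for year, license_id in records:
        counts[year][CATEGORY.get(license_id, "Other")] += 1
    result = {}
    for year, counter in sorted(counts.items()):
        total = sum(counter.values())
        result[year] = {cat: n / total for cat, n in counter.items()}
    return result
```

Keeping an explicit "Other" bucket makes it easy to see how much of the archive the mapping does not yet cover.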

  • What changes have occurred in the usage of programming languages?

  • Examine source code files and identify programming languages using heuristics, file extensions, and tools like cloc.

  • Categorize and tally usage for languages like Python, Java, C/C++, Rust, etc.

  • Track the proportions of each language over time as packages evolve.

  • Apply statistical analysis to determine if language proportions have significantly changed.
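For the extension-based part of the identification step, a minimal sketch follows. The extension table is an assumption covering only the languages named above, and the function counts files rather than lines, so it is a rougher measure than what cloc reports.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed extension table; extend it for other languages as needed.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".java": "Java",
    ".c": "C/C++",
    ".h": "C/C++",
    ".cc": "C/C++",
    ".cpp": "C/C++",
    ".hpp": "C/C++",
    ".rs": "Rust",
}


def language_shares(paths):
    """Return each language's share of the recognized source files."""
    counts = Counter()
    for path in paths:
        ext = PurePosixPath(path).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(ext)
        if language is not None:
            counts[language] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {language: n / total for language, n in counts.items()}
```

Running this once per package version gives a series of shares per language that can then be compared across time periods.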

  • What is the correlation between authors and licenses, and how have contributions varied over time?

  • For each package, analyze git commit history.

  • Extract the author's name and email address for each commit and match the name against the maintainers table. Resolve aliases to unique contributors.

  • Aggregate commits by contributor to determine top contributors and their activity levels.

  • Analyze email domains to categorize contributors by organization (e.g. @debian.org, @redhat.com). Calculate organization proportions over time.

  • Categorize licenses.

  • Use correlation analysis to identify relationships between authors/organizations and license categories.
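Since organizations and license categories are both categorical, one possible reading of the correlation step is a chi-square test of independence on their contingency table. The sketch below takes that approach as an assumption; `email_domain` and `chi_square` are hypothetical helper names, and alias resolution is assumed to have happened already.

```python
from collections import Counter


def email_domain(email):
    """Return the lowercased domain part of an email address."""
    return email.rsplit("@", 1)[-1].lower()


def chi_square(pairs):
    """Pearson chi-square statistic for (organization, license_category) pairs.

    `pairs` must be a list with one entry per commit or per package.
    Returns (statistic, degrees_of_freedom). A statistic near zero means
    organization and license category look independent.
    """
    observed = Counter(pairs)
    row_totals = Counter(org for org, _ in pairs)
    col_totals = Counter(cat for _, cat in pairs)
    n = len(pairs)
    statistic = 0.0
    for org, row_total in row_totals.items():
        for cat, col_total in col_totals.items():
            expected = row_total * col_total / n
            statistic += (observed[(org, cat)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof
```

The statistic still has to be compared against a chi-square distribution with `dof` degrees of freedom, and the approximation is unreliable when expected cell counts are small, so rare organizations may need to be merged first.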

@SahithiKasim SahithiKasim self-assigned this Aug 7, 2023
SahithiKasim commented Aug 7, 2023

Preliminary findings from 500 cloned GitHub repositories cover the following metrics: lines of code (LOC), code churn, dependency counts, and the languages used.

Results for 250 repositories can be found in results.md.
I am updating the spreadsheet regularly, so the latest results are at https://docs.google.com/spreadsheets/d/1UktUekTQd__dGa-vd--p27k6zoUDcPo6cft-jRk3dJ8/edit?usp=sharing
