There’s an interesting pattern that can be found in most codebases. The number of changes per file follows a power-law1. It’s not safe to say that the Pareto principle2 is at play (80% of the changes occur in 20% of the files) in every codebase. But you’ll be pretty close to these values.
Here is a bash script that will create a frequency distribution of changes per file.
#!/usr/bin/env bash
set -euo pipefail
function fail () {
echo "FAIL: $*" >&2
exit 1
}
if ! command -v git > /dev/null 2>&1; then
fail "Git is not installed or not available in PATH."
fi
if ! git rev-parse --is-inside-work-tree > /dev/null 2>&1; then
fail "This is not a git repository!"
fi
function get_changed_files() {
git log --name-only --pretty=format:
}
function clean_filenames() {
grep --invert-match '^$' | sed 's/[[:space:]]*$//' | tr -d '\r'
}
function create_frequency_distribution() {
sort | uniq --count | sort --numeric-sort --reverse
}
get_changed_files \
| clean_filenames \
| create_frequency_distribution
For a better visual experience, get the top 50 files and pipe them into a pager
like less
.
./changes.sh | head -n 50 | less
or save all the data into a CSV
file for further analysis (e.g. plotting, report
generating, spreadsheet analysis)
./changes.sh | head -n 50 | awk '{print $2 "," $1 }' >> changes.csv
I ran the script across four open source project repositories and took their top 20 files.
# Rails top 20 files
2501 activerecord/CHANGELOG.md
1562 actionpack/CHANGELOG
1348 activerecord/CHANGELOG
1215 activerecord/lib/active_record/base.rb
1098 actionpack/CHANGELOG.md
1056 activesupport/CHANGELOG.md
909 activerecord/lib/active_record/associations.rb
892 Gemfile
869 Gemfile.lock
867 railties/CHANGELOG.md
846 activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb
815 guides/source/configuring.md
787 railties/CHANGELOG
786 actionpack/lib/action_dispatch/routing/mapper.rb
738 activerecord/lib/active_record/relation/query_methods.rb
681 railties/test/generators/app_generator_test.rb
677 activesupport/CHANGELOG
613 activerecord/lib/active_record/relation.rb
598 activerecord/lib/active_record/connection_adapters/abstract_adapter.rb
593 activerecord/lib/active_record/connection_adapters/abstract_mysql_adapter.rb
# Laravel top 20 files
1572 src/Illuminate/Foundation/Application.php
903 src/Illuminate/Database/Eloquent/Model.php
772 src/Illuminate/Database/Query/Builder.php
666 composer.json
603 src/Illuminate/Database/Eloquent/Builder.php
565 src/Illuminate/Validation/Validator.php
545 tests/Validation/ValidationValidatorTest.php
517 src/Illuminate/Support/Collection.php
510 tests/Support/SupportCollectionTest.php
489 src/Illuminate/Routing/Router.php
466 tests/Database/DatabaseQueryBuilderTest.php
405 CHANGELOG.md
394 tests/Database/DatabaseEloquentModelTest.php
365 src/Illuminate/Support/Str.php
324 src/Illuminate/Http/Request.php
319 src/Illuminate/Support/helpers.php
316 src/Illuminate/Routing/Route.php
315 src/Illuminate/Database/Eloquent/Relations/BelongsToMany.php
314 src/Illuminate/Foundation/helpers.php
288 src/Illuminate/Container/Container.php
# Django top 20 files
1104 AUTHORS
603 django/db/models/query.py
577 django/db/models/sql/query.py
538 docs/ref/settings.txt
525 django/db/models/fields/__init__.py
511 django/db/models/base.py
480 django/contrib/admin/options.py
468 docs/internals/deprecation.txt
463 django/db/models/fields/related.py
456 docs/ref/models/querysets.txt
447 docs/ref/contrib/admin/index.txt
432 tests/admin_views/tests.py
423 docs/ref/django-admin.txt
359 django/test/testcases.py
358 django/db/models/sql/compiler.py
358 django/conf/global_settings.py
354 docs/ref/models/fields.txt
349 docs/releases/1.8.txt
340 docs/ref/templates/builtins.txt
339 docs/releases/1.7.txt
# Nginx top 20 files
608 src/http/ngx_http_core_module.c
548 src/http/ngx_http_request.c
523 src/core/nginx.h
520 docs/xml/nginx/changes.xml
494 src/http/ngx_http_upstream.c
483 .hgtags
378 src/event/ngx_event_openssl.c
324 src/event/ngx_event_quic.c
275 src/http/modules/ngx_http_proxy_module.c
259 src/http/ngx_http_request.h
243 src/http/modules/perl/nginx.pm
239 src/http/ngx_http_core_module.h
228 src/core/nginx.c
224 src/http/ngx_http.c
198 src/http/modules/ngx_http_fastcgi_module.c
196 src/event/ngx_event.c
179 auto/modules
170 auto/sources
167 src/event/ngx_event.h
165 src/http/ngx_http.h
If we ignore files like changelogs and authors files, since they are not a functional part of the system. The files with the most changes are either part of the system’s core functionality or some kind of external dependency files.
This is a very good strategy to explore an unknown system. Focus on the top
files, which are the core functionality of the system. Check out the content of
those files; read the commit messages with git log --follow
. You can get a
lot of valuable information pretty quickly.
Another use case is to find the most critical parts of the system. In order to test them more and refactor them if it’s needed. Testing the right parts of the system will save time and resources while making it more reliable. Refactoring is a good idea as well, since you’ll probably spend more time there and want the system to be clean and nice in the most impactful places.