Maintaining a clean Git history is essential for collaboration and reproducibility in data science projects. This blog explores how git rebase can help you create a professional, linear history for your NHS-R projects.
Introduction
Have you ever found yourself in “commit hell” with a tangled Git history? It happens to the best of us! When you’re working on a project, it’s easy to create many small, redundant commits. This can make it difficult for you and your collaborators to understand the progression of your work.
What’s the Difference Between merge and rebase?
Think of a project’s commit history as a timeline.
- git merge combines branches by creating a new merge commit. It’s like adding a new chapter to a book without editing the old ones. This can lead to a “tangled” history with multiple branching paths.
- git rebase moves or combines a sequence of commits to a new starting point. It’s like rewriting and polishing a draft to make the story flow better. It creates a cleaner, linear history that’s much easier to read.
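To see the difference for yourself, here is a minimal sketch that builds a throwaway repository and combines the same two branches both ways. The file names, branch names and the "Demo User" identity are made up for illustration, and it assumes Git 2.28 or later (for git init -b).

```shell
# Throwaway demo: the same branches combined with merge and with rebase.
# Identity, file names and branch names below are hypothetical.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q -b main .

echo "base" > base.txt
git add base.txt
git commit -q -m "chore: initial commit"

# A feature branch with one commit...
git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feat: add feature file"

# ...while main moves on with a commit of its own.
git checkout -q main
echo "main" > main.txt
git add main.txt
git commit -q -m "docs: update main"

# Option 1: merge main into a copy of the feature branch (adds a merge commit).
git checkout -q -b feature-merged feature
git merge -q --no-edit main
merge_commits=$(git rev-list --merges --count HEAD)

# Option 2: rebase another copy of the feature branch onto main (stays linear).
git checkout -q -b feature-rebased feature
git rebase -q main
rebase_merge_commits=$(git rev-list --merges --count HEAD)

echo "merge commits after merge:  $merge_commits"
echo "merge commits after rebase: $rebase_merge_commits"
git --no-pager log --oneline --graph feature-merged
git --no-pager log --oneline --graph feature-rebased
```

The two graphs printed at the end show the branching path left by the merge next to the straight line left by the rebase.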
A Practical Use Case: Cleaning Up Commits
Let’s imagine you’ve been working on a new feature for your analysis, and you’ve made several small commits:
- “feat: add patient ID column”
- “fix: corrected patient ID”
- “temp: working on EDA”
- “feat: EDA for patient demographics”
These commits are all related to a single logical feature. We can use git rebase to squash them into a single, meaningful commit.
Step 1: Start an Interactive Rebase in RStudio
First, make sure you’re on the correct branch. Check the Git pane in RStudio (usually top-right panel) to confirm your current branch.
- Open RStudio’s Terminal: Click on the Terminal tab (next to the Console tab at the bottom of RStudio)
- Run the rebase command: Type the following command to start an interactive rebase for the last 4 commits:
git rebase -i HEAD~4
The HEAD~4 tells Git to look at the last four commits on your current branch.
Alternative: If you want to squash every commit on your branch that is not yet on main, use:
git rebase -i main
Note that this also replays your commits on top of the current tip of your local main branch.
- Text editor opens: Git will open your configured text editor (often nano, vim, or VS Code) with a list of commits that looks like this:
pick a46f23b feat: add patient ID column
pick b71a5c1 fix: corrected patient ID
pick 8c9e0d1 temp: working on EDA
pick 3d4b6c8 feat: EDA for patient demographics
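Before starting the rebase, it can help to preview exactly which commits a given range will put in that list. The sketch below does this in a throwaway repository; the analysis.R file, the feature branch name and the commit_line helper are hypothetical, and Git 2.28 or later is assumed.

```shell
# Preview the commits an interactive rebase would list (throwaway repo).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q -b main .

# Hypothetical helper: append a line to analysis.R and commit it.
commit_line() {
  echo "$1" >> analysis.R
  git add analysis.R
  git commit -q -m "$1"
}

commit_line "chore: initial commit"
git checkout -q -b feature
commit_line "feat: add patient ID column"
commit_line "fix: corrected patient ID"
commit_line "temp: working on EDA"
commit_line "feat: EDA for patient demographics"

# Commits that `git rebase -i HEAD~4` would list (oldest first, like the editor).
git --no-pager log --oneline --reverse HEAD~4..HEAD
last_four=$(git rev-list --count HEAD~4..HEAD)

# Commits that `git rebase -i main` would list.
since_main=$(git rev-list --count main..HEAD)

echo "HEAD~4..HEAD contains $last_four commits"
echo "main..HEAD contains $since_main commits"
```

In this small example both ranges cover the same four commits; on a real branch they differ whenever you have more or fewer than four commits since branching from main.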
Step 2: Edit the Rebase Instructions in Your Editor
The editor shows your commits in chronological order (oldest first). Here’s how to squash them:
- Keep the first commit as pick - This will be your main commit
- Change the others to squash - These will be combined into the first commit
- Edit the file to look like this:
pick a46f23b feat: add patient ID column
squash b71a5c1 fix: corrected patient ID
squash 8c9e0d1 temp: working on EDA
squash 3d4b6c8 feat: EDA for patient demographics
- Save and close the editor:
  - If using nano: Press Ctrl+X, then Y, then Enter
  - If using vim: Press Esc, type :wq, then Enter
  - If using VS Code: Save with Ctrl+S (or Cmd+S on Mac) and close the tab
| Command | Shorthand | Description | When to Use It |
|---|---|---|---|
| pick | p | Keep the commit as is | For commits you want to preserve |
| squash | s | Combine with the commit above, keep both messages | When you want to merge commits but review the messages |
| fixup | f | Combine with the commit above, discard this message | For quick fixes where the message isn’t important |
| drop | d | Remove the commit entirely | For commits you want to delete |
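If you want to try these commands without touching a real project, the instruction list can also be edited by a script instead of by hand. This is only a sketch under some assumptions: it runs in a throwaway repository, the commit_line helper and analysis.R are hypothetical, and it uses the GIT_SEQUENCE_EDITOR environment variable with sed to turn every line after the first from pick into fixup.

```shell
# Scripted fixup: edit the rebase instruction list with sed (throwaway repo).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q -b main .

# Hypothetical helper: append a line to analysis.R and commit it.
commit_line() {
  echo "$1" >> analysis.R
  git add analysis.R
  git commit -q -m "$1"
}

commit_line "chore: initial commit"
git checkout -q -b feature
commit_line "feat: add patient ID column"
commit_line "fix: corrected patient ID"
commit_line "temp: working on EDA"
commit_line "feat: EDA for patient demographics"

# Keep line 1 as "pick"; change the remaining "pick" lines to "fixup".
# Git appends the path of the instruction file to this command.
GIT_SEQUENCE_EDITOR="sed -i.bak '2,\$ s/^pick/fixup/'" git rebase -q -i HEAD~4

commits_on_branch=$(git rev-list --count main..HEAD)
kept_subject=$(git log -1 --format=%s)

echo "commits on feature after fixup: $commits_on_branch"
echo "message kept: $kept_subject"
```

Because fixup discards the later messages, the single remaining commit carries the message of the first commit in the list.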
Step 3: Write Your New Commit Message
After closing the first editor, Git will automatically open another editor window showing all the commit messages from your squashed commits. Here’s what to do:
- Review the existing messages - You’ll see all four commit messages listed
- Delete or edit the messages to create one clear, comprehensive message
- Lines starting with # are comments - Git will ignore these
- Write your new message:
feat: Add patient demographics EDA
- Added patient ID column with validation
- Implemented exploratory data analysis for patient demographics
- Includes data cleaning and initial visualizations
- Save and close the editor using the same method as Step 2
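To see what Git offers as the starting point for that message, the sketch below squashes four commits in a throwaway repository and accepts the combined message unchanged. Setting GIT_EDITOR=true is a stand-in for saving and closing the editor without edits; the commit_line helper and file name are hypothetical.

```shell
# Squash four commits and inspect the combined message (throwaway repo).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q -b main .

# Hypothetical helper: append a line to analysis.R and commit it.
commit_line() {
  echo "$1" >> analysis.R
  git add analysis.R
  git commit -q -m "$1"
}

commit_line "chore: initial commit"
git checkout -q -b feature
commit_line "feat: add patient ID column"
commit_line "fix: corrected patient ID"
commit_line "temp: working on EDA"
commit_line "feat: EDA for patient demographics"

# squash keeps every message; GIT_EDITOR=true accepts the proposed text as is.
GIT_SEQUENCE_EDITOR="sed -i.bak '2,\$ s/^pick/squash/'" GIT_EDITOR=true \
  git rebase -q -i HEAD~4

combined_message=$(git log -1 --format=%B)
printf '%s\n' "$combined_message"
```

In a real project you would replace that combined text with one clear message, as described in the steps above, rather than accepting it unchanged.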
Step 4: Verify Your Work in RStudio
Once you’ve completed the rebase:
- Check the Terminal output - You should see “Successfully rebased and updated refs/heads/your-branch-name”
- Look at the Git pane - RStudio will automatically refresh and show your new, clean commit history
- Verify with git log: In the Terminal, run git log --oneline to see your single, clean commit instead of the four messy ones
- Test your code - Run your R scripts to make sure everything still works correctly
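One extra check you might find reassuring: a squash should change the history but leave the files exactly as they were. The sketch below compares the rebased branch against a copy taken beforehand. It runs in a throwaway repository, and the branch name backup-before-squash, the commit_line helper and the file name are hypothetical.

```shell
# Check that squashing changed the history but not the files (throwaway repo).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q -b main .

# Hypothetical helper: append a line to analysis.R and commit it.
commit_line() {
  echo "$1" >> analysis.R
  git add analysis.R
  git commit -q -m "$1"
}

commit_line "chore: initial commit"
git checkout -q -b feature
commit_line "feat: add patient ID column"
commit_line "fix: corrected patient ID"
commit_line "temp: working on EDA"
commit_line "feat: EDA for patient demographics"

# Keep a pointer to the pre-rebase state (hypothetical branch name).
git branch backup-before-squash

# Scripted stand-in for the interactive edit: fold commits 2-4 into the first.
GIT_SEQUENCE_EDITOR="sed -i.bak '2,\$ s/^pick/fixup/'" git rebase -q -i HEAD~4

# History check: how many commits are on the branch now?
commit_count=$(git rev-list --count main..HEAD)

# Content check: git diff --quiet exits 0 when the two states are identical.
if git diff --quiet backup-before-squash HEAD; then
  content_check="identical"
else
  content_check="different"
fi

echo "commits on branch: $commit_count"
echo "files compared with backup: $content_check"
```

If the content check ever reports a difference after a pure squash, that is a sign something was dropped or resolved differently during the rebase and is worth investigating before you push.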
Step 5: Push Your Rebased Changes
After successfully rebasing your local commits, you need to push them to the remote repository. This step applies when you’ve rebased commits that were already pushed to your feature branch.
If you only rebased local (unpushed) commits, a regular git push will work fine. But when you’ve rewritten commits that are already on origin/<your-branch>, a regular git push will be rejected, and git status will show a scary message like:
On branch example
Your branch and 'origin/example' have diverged,
and have 6 and 2 different commits each, respectively.
(use "git pull" to merge the remote branch into yours)
⚠️ DO NOT run git pull at this point! Doing so will merge the old remote commits back into your branch, undoing all your lovely rebase work.
Instead, you need to use:
git push --force-with-lease
Why --force-with-lease instead of --force?
- --force will overwrite the remote branch regardless of what’s there
- --force-with-lease is safer - it checks that nobody else has pushed changes to the remote branch since you last fetched it
- If someone else has pushed changes, --force-with-lease will reject your push and warn you, preventing accidental overwrites
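The safety check is easier to trust once you have seen it refuse a push. Here is a minimal sketch using a local bare repository as a stand-in for the remote and two clones as stand-ins for you and a colleague. All directory, file and branch names are hypothetical.

```shell
# Demo: --force-with-lease refuses to overwrite a colleague's push.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q --bare remote.git          # stand-in for the shared remote

# Your clone: create the feature branch and push it.
git init -q -b feature alice
cd alice
git remote add origin "$demo_dir/remote.git"
echo "first draft" > analysis.R
git add analysis.R
git commit -q -m "feat: first draft"
git push -q -u origin feature

# A colleague clones the branch and pushes a commit you have not fetched.
cd "$demo_dir"
git clone -q -b feature "$demo_dir/remote.git" colleague
cd colleague
echo "summary plot" > plot.R
git add plot.R
git commit -q -m "feat: add summary plot"
git push -q origin feature

# Back in your clone: rewrite history, then try the "safe" force push.
cd "$demo_dir/alice"
git commit -q --amend -m "feat: polished draft"
if git push -q --force-with-lease origin feature 2>/dev/null; then
  lease_result="accepted"
else
  lease_result="rejected"
fi

echo "push with --force-with-lease was $lease_result"
```

A plain --force in the same situation would have replaced the colleague's commit on the remote, which is exactly the accident the lease check is there to prevent.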
In RStudio:
- Open the Terminal tab
- Run git push --force-with-lease
- Check the Git pane - your branch should now be in sync with the remote
Only use --force-with-lease on your own feature branch. Never force-push to shared branches like main or branches where others are actively collaborating. See “The Golden Rule of Rebase” below for the full picture.
If your branch is protected on the remote, the push will be rejected. This is expected. Instead, open a pull request and merge through the normal review process.
The Golden Rule of Rebase
⚠️ Never rebase commits that have been pushed to a shared repository.
This is the most important rule to remember. Rebasing rewrites commit history by changing commit hashes. If you rebase commits that others have already pulled, you’ll create divergent histories that are difficult to reconcile.
Safe to rebase:
- Local commits that haven’t been pushed yet
- Feature branches that only you are working on
- Commits on your personal fork before creating a pull request
Never rebase:
- The main or master branch
- Any branch that others have pulled from
- Commits that have been pushed to a shared repository
RStudio tip: In the Git pane, you can see which commits have been pushed (they appear differently from unpushed commits), helping you follow the golden rule safely.
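You can make the same check from the Terminal. As a sketch, the range @{u}..HEAD lists the commits on your branch that are not yet on its upstream, so anything it shows is still local. The example below uses a local bare repository as a stand-in for the remote; the directory, file and branch names are hypothetical.

```shell
# List commits that exist locally but not on the upstream branch.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q --bare remote.git          # stand-in for the shared remote

git init -q -b feature work
cd work
git remote add origin "$demo_dir/remote.git"

echo "pushed" > analysis.R
git add analysis.R
git commit -q -m "feat: pushed commit"
git push -q -u origin feature          # -u records origin/feature as upstream

echo "local only" >> analysis.R
git add analysis.R
git commit -q -m "feat: local-only commit"

# @{u} means "the upstream of the current branch".
git --no-pager log --oneline '@{u}..HEAD'
unpushed=$(git rev-list --count '@{u}..HEAD')

echo "unpushed commits: $unpushed"
```

Commits in that list have not been pushed from this clone, which makes them the safest candidates for a rebase.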
Troubleshooting in RStudio
If something goes wrong during rebase:
Don’t panic! Git keeps a history of all operations
In RStudio’s Terminal, you can undo the rebase:
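git reflog
git reset --hard HEAD@{n}
Replace HEAD@{n} with the entry listed directly below the line marked "rebase (start)" in the reflog output (older Git versions print "rebase -i (start)"). After a completed rebase, HEAD@{1} usually points at a step inside the rebase rather than the state before it. The Git pane will refresh to show your original commit history.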
Your R files remain safe - rebase only changes commit history, not your actual code
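As a sketch of the undo in action, the example below squashes a branch in a throwaway repository and then restores it. It is a variation on the commands above: instead of picking an entry from HEAD's reflog, it uses the branch's own reflog, where feature@{1} is the tip the branch had before the rebase finished. The branch name, file name and commit_line helper are hypothetical.

```shell
# Undo a finished rebase using the branch's own reflog (throwaway repo).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q -b main .

# Hypothetical helper: append a line to analysis.R and commit it.
commit_line() {
  echo "$1" >> analysis.R
  git add analysis.R
  git commit -q -m "$1"
}

commit_line "chore: initial commit"
git checkout -q -b feature
commit_line "feat: add patient ID column"
commit_line "fix: corrected patient ID"
commit_line "temp: working on EDA"
commit_line "feat: EDA for patient demographics"

before=$(git rev-parse HEAD)

# Scripted stand-in for the interactive squash.
GIT_SEQUENCE_EDITOR="sed -i.bak '2,\$ s/^pick/fixup/'" git rebase -q -i HEAD~4
after_rebase=$(git rev-parse HEAD)

# The rebase updated the branch once, so feature@{1} is the pre-rebase tip.
git reset -q --hard 'feature@{1}'
restored=$(git rev-parse HEAD)

echo "before rebase: $before"
echo "after rebase:  $after_rebase"
echo "after undo:    $restored"
```

Keep in mind that git reset --hard also discards uncommitted changes, so it is best run with a clean working directory.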
Before rebasing, always:
- Ensure your working directory is clean (no uncommitted changes in the Git pane)
- Make sure you’re on the correct branch
- Consider making a backup branch:
git branch backup-branch-name
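The checklist above can be wrapped in a small helper so you run it the same way every time. This is a sketch rather than an established tool: the function name pre_rebase_check and the backup-<branch> naming are my own choices, and git branch --show-current needs Git 2.22 or later. The lines after the function exercise it in a throwaway repository.

```shell
# Hypothetical helper: run the pre-rebase checklist and create a backup branch.
pre_rebase_check() {
  # Refuse to continue if there are uncommitted or untracked changes.
  if [ -n "$(git status --porcelain)" ]; then
    echo "Working directory is not clean; commit or stash first." >&2
    return 1
  fi
  current_branch=$(git branch --show-current)   # needs Git 2.22+
  backup_name="backup-${current_branch}"
  git branch "$backup_name" || return 1         # fails if the backup exists
  echo "On branch $current_branch; backup created as $backup_name"
}

# Try it out in a throwaway repository.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q -b feature .
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "feat: first commit"

# Clean working directory: the check passes and a backup branch appears.
if pre_rebase_check; then clean_result="ok"; else clean_result="blocked"; fi

# Uncommitted edit: the check should now stop you.
echo "x <- 2" >> analysis.R
if pre_rebase_check 2>/dev/null; then dirty_result="ok"; else dirty_result="blocked"; fi

echo "clean working directory: $clean_result"
echo "dirty working directory: $dirty_result"
```

Once the rebase has gone well and you have pushed, the backup branch can be removed with git branch -D followed by its name.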
Conclusion
git rebase is a powerful tool for maintaining a clean and clear commit history, which in turn leads to better collaboration and more robust data science projects. By using it to squash your smaller, work-in-progress commits, you can ensure that your project’s history is a readable and accurate record of your progress. It’s a key practice for any data scientist aiming for reproducibility and a professional workflow.