You want to know which subject line gets more opens. Easy in theory: write two, send half your list each, compare. In practice, this is where most cold email tests go off the rails. You run the test at the wrong sample size, you do not control for the day of the week, you keep losing variants in rotation, and three weeks later your list is half-burnt and you still have no signal.
A clean A/B test on subject lines does not require a six-figure tool. It requires a small, disciplined process you can run with whatever sending stack you already have.

Step 1: Pick two genuinely different subjects
The first mistake is testing variants that are too similar. "Quick question about your onboarding" versus "Question about your onboarding flow" is not an A/B test. It is two near-identical subject lines, and the variance you will see is noise, not signal.
A real test changes the pattern, not just the wording. Test a specific question against a visible-artifact reference. Test a numerical specific against a forward-able routing question. Test five words against seven words and a number. The bigger the structural difference between A and B, the more useful the result.
If you cannot articulate in one sentence why A might perform differently from B, you have not designed a test; you have made a typo.
Step 2: Split the test list, not the live list
Hold back a slice of your list specifically for testing. A hundred recipients is the minimum for any signal; two hundred is more reliable. Split that slice fifty-fifty into A and B groups.
Do not run the test on the whole list. If A wins, you have burnt half your prospects on the loser. If neither wins clearly, you have burnt all of them on undifferentiated outreach. A dedicated test slice means you keep the rest of the list available for whichever variant wins.
The most useful tool for this kind of list segmentation is whatever sending platform you already use. Almost all of them support segments. If yours does not, segment by exporting and uploading two lists; the manual version is fine for a hundred-recipient test.
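If you go the manual route, the split itself is a few lines of scripting. The following is a minimal sketch, not a feature of any particular sending platform: the function name `split_test_slice`, the fixed seed, and the example addresses are all assumptions made for illustration.

```python
import random

def split_test_slice(recipients, slice_size=100, seed=42):
    """Hold back a random test slice and split it evenly into A and B.

    Returns (group_a, group_b, holdout). The holdout is the rest of the
    list, kept untouched for whichever variant wins.
    """
    if slice_size > len(recipients):
        raise ValueError("slice_size is larger than the list")
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(recipients)    # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    test_slice = shuffled[:slice_size]
    half = slice_size // 2
    group_a = test_slice[:half]
    group_b = test_slice[half:]
    holdout = shuffled[slice_size:]
    return group_a, group_b, holdout

# Hypothetical list of 1,000 prospects
recipients = [f"prospect{i}@example.com" for i in range(1000)]
group_a, group_b, holdout = split_test_slice(recipients)
print(len(group_a), len(group_b), len(holdout))  # 50 50 900
```

Shuffling before slicing matters: if the export is sorted by company, signup date, or anything else, taking the first hundred rows would bake that ordering into the test.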
Step 3: Send on the same day at the same hour
Day-of-week and hour-of-day have a measurable effect on open rates. A test that sends A on Tuesday morning and B on Wednesday afternoon is comparing four variables at once: subject, day, hour, and recipient pool. You will get a number, but you will not learn anything about subject lines from it.
The fix is to send both variants in the same fifteen-minute window. Schedule them in advance. Verify they actually went out at the same time before you start counting opens.
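The "verify" step can be scripted if your tool exports send timestamps. This is a rough sketch under that assumption; the function name `sent_in_same_window` and the example timestamps are hypothetical, and the export format will differ by platform.

```python
from datetime import datetime, timedelta

def sent_in_same_window(send_times_a, send_times_b, window=timedelta(minutes=15)):
    """Return True if every send across both variants falls within one window."""
    all_times = list(send_times_a) + list(send_times_b)
    if not all_times:
        raise ValueError("no send timestamps supplied")
    # The spread between the earliest and latest send must fit in the window
    return max(all_times) - min(all_times) <= window

# Hypothetical send logs for the two variants
sends_a = [datetime(2024, 3, 5, 9, 0), datetime(2024, 3, 5, 9, 4)]
sends_b = [datetime(2024, 3, 5, 9, 2), datetime(2024, 3, 5, 9, 11)]
print(sent_in_same_window(sends_a, sends_b))  # True
```

If the check fails, the cleaner option is usually to discard the run rather than try to correct for the timing difference afterwards.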
Step 4: Wait at least 48 hours before counting opens
Open rates do not stabilize for at least a day, sometimes two. People open emails on their commute, at lunch, after dinner, the morning after. A snapshot at four hours is misleading; a snapshot at twenty-four hours is closer but still moving.
Wait forty-eight hours minimum. Seventy-two if the test is not urgent. Count the opens at that point and lock the numbers in. Re-checking the next week will surface a few more late opens, but the test is over.
If your sending tool gives you both open rate and reply rate, count both. Open rate tells you which subject earned the open; reply rate tells you which subject set up the body for conversion. They do not always agree.
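Locking the numbers at a fixed cutoff is easy to do by hand, but a small script keeps it consistent between tests. The sketch below assumes you can export, per recipient, the send time and the time of the first open (or first reply); the function name `rate_at_cutoff` and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta

def rate_at_cutoff(sent_at, event_times, cutoff_hours=48):
    """Share of recipients whose first event landed within the cutoff.

    sent_at:     dict of recipient -> send datetime
    event_times: dict of recipient -> datetime of first open (or reply);
                 recipients with no event are simply absent
    """
    if not sent_at:
        raise ValueError("no recipients supplied")
    cutoff = timedelta(hours=cutoff_hours)
    counted = sum(
        1
        for recipient, sent in sent_at.items()
        if recipient in event_times
        and timedelta(0) <= event_times[recipient] - sent <= cutoff
    )
    return counted / len(sent_at)

# Hypothetical data: four recipients, three opens, one of them after the cutoff
send = datetime(2024, 3, 5, 9, 0)
sent_at = {r: send for r in ["a@example.com", "b@example.com",
                             "c@example.com", "d@example.com"]}
first_open = {
    "a@example.com": send + timedelta(hours=3),
    "b@example.com": send + timedelta(hours=30),
    "c@example.com": send + timedelta(hours=60),  # past 48 hours, not counted
}
open_rate = rate_at_cutoff(sent_at, first_open)
print(open_rate)  # 0.5
```

The same function works for reply rate: pass a dictionary of first-reply times instead of first-open times.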
Step 5: Define the win threshold before you read the data
This is the step most senders skip. They run the test, look at the numbers, and rationalize whichever variant won. "A had a 28 percent open rate and B had 26 percent, so A wins." With a hundred-recipient test slice, fifty per variant, that two-point gap is a single open: noise, not signal. Two more recipients opening B would flip the result.
A useful rule of thumb: at a sample size of one hundred per variant, you need at least a five-percentage-point gap to call it. At two hundred per variant, four points. At five hundred, two points. Below those thresholds, declare the test inconclusive and run it again with a new pair.
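Writing the rule down as code is one way to commit to it before the data arrives. The following is a sketch of the rule of thumb exactly as stated above; the function names are hypothetical, and treating in-between sample sizes as belonging to the nearest smaller tier is an assumption, since the rule only names three sizes.

```python
def required_gap_points(n_per_variant):
    """Minimum gap, in percentage points, needed to call a winner.

    Returns None below 100 per variant, where the rule gives no threshold.
    Sizes between the named tiers use the nearest smaller tier (an assumption).
    """
    if n_per_variant >= 500:
        return 2.0
    if n_per_variant >= 200:
        return 4.0
    if n_per_variant >= 100:
        return 5.0
    return None

def call_test(opens_a, opens_b, n_per_variant):
    """Return 'A', 'B', or 'inconclusive' under the rule of thumb."""
    threshold = required_gap_points(n_per_variant)
    if threshold is None:
        return "inconclusive"
    # Gap in percentage points, positive when A is ahead
    gap = (opens_a - opens_b) * 100 / n_per_variant
    if abs(gap) < threshold:
        return "inconclusive"
    return "A" if gap > 0 else "B"

print(call_test(28, 26, 100))  # inconclusive: 2 points, needs 5
print(call_test(70, 56, 200))  # A: 7 points, needs 4
```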
You can compute the actual significance with any free statistical A/B test calculator if you want the formal number. For most cold email tests, the rule of thumb is enough.
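For reference, most of those calculators run something close to a two-proportion z-test. The sketch below uses only the Python standard library; the function name is an assumption, and other calculators may use a different test (a chi-squared or exact test, for example) and give slightly different numbers.

```python
import math

def two_proportion_p_value(opens_a, n_a, opens_b, n_b):
    """Two-sided p-value for the difference between two open rates."""
    p_a = opens_a / n_a
    p_b = opens_b / n_b
    # Pooled rate under the assumption that A and B perform the same
    pooled = (opens_a + opens_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no opens at all, or everyone opened: nothing to compare
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

# The 28 percent versus 26 percent example, at 100 recipients per variant
p = two_proportion_p_value(28, 100, 26, 100)
print(f"p-value: {p:.2f}")
```

A p-value far above the conventional 0.05 cut-off, as in this example, means the gap is consistent with chance. At small sample sizes the formal number can be noticeably more conservative than the rule of thumb.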
Step 6: Ship the winner, archive the loser
Once you have a winning variant, send it to the rest of the list. Archive the loser; do not keep it in rotation, do not save it for next time. Cold email testing is incremental: each round produces a new champion that you then test the next variant against.
The discipline is to never test more than one thing at a time. Subject lines first, then openers, then CTAs, then length. If you change three things at once and the numbers move, you cannot tell which change did it.
Step 7: Re-test the winner against fresh ideas
The subject line that won this month will not necessarily win next quarter. Lists fatigue. Inbox providers adjust their filters. The reader's pattern recognition for "this is a sales template" updates.
Plan to re-test your champion against new variants every couple of months. The cadence keeps the open rate from drifting downward as your subject style gets predictable to the recipient.
Tools that help the workflow
You can run this entire process with a spreadsheet, two saved-template versions, and your sending tool's existing segmentation. No special software required. But there are two kinds of tools worth knowing about:
For technical deliverability checks before sending the test (so a domain reputation issue does not confound the result), MXToolbox and Spamhaus are the standard free tools. Run your sending domain through both before any volume test.
For scoring each subject line variant against known cold-email patterns before you send them, the Cold Email & Outreach Scorer at EvvyTools rates a subject and body across seven dimensions and points at the lines that are likely to fail. Useful as a pre-test filter: if both variants score below 60, the test is going to be a comparison of two losers, and you might want to write a third candidate first.
For a longer read on what those seven dimensions are, the guide "How to Diagnose a Failing Cold Email Before You Hit Send" walks through all of them. Other writing utilities live in the EvvyTools tools directory.
What not to test
A few things that look like A/B test material but produce noisy or useless data:
- Sender name capitalization (one capital, all lowercase, etc.). The effect size is too small for any reasonable sample to detect.
- Emoji versus no emoji. The effect is so different across audiences that one test result does not generalize.
- Subject line punctuation. Same problem; effect size is too small.
Save the test budget for changes that actually move the number: subject line structure, opener pattern, CTA specificity, value-prop framing. Those are where the wins live.
Run small tests often
The senders who improve their open rate continuously are not the ones who run one big test a year. They run a small test every two weeks, on a hundred-recipient slice, with one variable changed, and they ship the winner to the next batch.
The math compounds. A small, real improvement every two weeks, sustained for a year, is the difference between a campaign that earns the open and one that goes to the trash. Most senders never set up the discipline. The ones who do separate themselves quickly.
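To see what compounding looks like in numbers, here is a small illustration. The starting open rate of 20 percent and the 2 percent relative lift per round are assumptions picked purely for the example, not measured results, and in practice many rounds will be inconclusive and add nothing.

```python
def compounded_open_rate(start_rate, relative_lift_per_round, rounds):
    """Apply the same relative lift once per round, capped at 100 percent."""
    rate = start_rate
    for _ in range(rounds):
        rate = min(1.0, rate * (1 + relative_lift_per_round))
    return rate

rounds_per_year = 52 // 2  # one test every two weeks
final = compounded_open_rate(0.20, 0.02, rounds_per_year)
print(f"after {rounds_per_year} rounds: {final:.1%}")
```

Under those assumptions the open rate ends the year at roughly a third, up from a fifth. Change the inputs to match your own numbers; the point is only that small relative gains multiply rather than add.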
Build the test loop once. Run it forever. The numbers move.