
Counting Sentences: An Implementation in C++20

Counting the sentences in a paragraph looks simple on the surface: just look for the common terminating punctuation marks, “.”, “?” and “!”. Only when you dig deeper do you realize that it is not that simple. For example, consider this text: “Peter met Dr. James at 3 p.m.” How many sentences does it have? Not three, just one! Sentence counting is hard because the most common delimiter, the period, plays multiple roles. It appears in abbreviations, decimal numbers, email addresses, URLs, initials, and ellipses, so understanding the context is essential.
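To make the problem concrete, here is a minimal sketch of the naive approach, which treats every terminator character as a sentence boundary. The function name `count_terminators_naive` is my own, chosen for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Naive approach: every '.', '?' or '!' is assumed to end a sentence.
// This overcounts as soon as the text contains abbreviations or decimals.
std::size_t count_terminators_naive(std::string_view text) {
    std::size_t count = 0;
    for (char c : text) {
        if (c == '.' || c == '?' || c == '!') {
            ++count;
        }
    }
    return count;
}
```

For the text “Peter met Dr. James at 3 p.m.” this returns 3, because the periods in “Dr.” and “p.m.” are each counted as a boundary.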

Before I forget, I must also point out that a sentence might not have any terminator at all. Consider this: “The dog ran after the cat. The cat climbed onto the wall”. Here the second sentence has no terminator, but we know that it has ended because the text has ended.
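One way to handle a missing final terminator is to remember whether any sentence content has been seen since the last boundary, and to count it when the text runs out. The sketch below does only that; it deliberately ignores abbreviations and the other special cases, and the name `count_with_trailing` is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Counts terminator-delimited chunks and also counts trailing text that
// has no terminator. Abbreviations and decimals are NOT handled here.
std::size_t count_with_trailing(std::string_view text) {
    std::size_t count = 0;
    bool pending = false;  // true once we have seen content not yet closed
    for (char c : text) {
        if (c == '.' || c == '?' || c == '!') {
            if (pending) {  // close the current sentence only once
                ++count;
                pending = false;
            }
        } else if (std::isalnum(static_cast<unsigned char>(c))) {
            pending = true;
        }
    }
    if (pending) {
        ++count;  // the text ended without a terminator
    }
    return count;
}
```

Because a boundary is counted only when content is pending, a run such as “?!” also closes just one sentence.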

What are some approaches we can follow for sentence boundary detection? 

1) Rule-based systems: We hand-craft a set of rules that classify each period as a boundary or non-boundary, and apply those rules in a fixed order as a processing pipeline.

2) Machine-Learning classifiers: We can train an unsupervised model that learns which tokens are abbreviations based on statistical cues. This can give good accuracy but can fail in noisy contexts.

3) Neural models: These can yield near-human accuracy, but might require heavy computational resources for training.
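As an illustration of the first (rule-based) approach, here is a minimal sketch that splits the text on whitespace and classifies each token. The abbreviation list, the function names and the choice of rules are all assumptions made for this example; a serious rule set would need to be far larger.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical, deliberately tiny abbreviation list (lower-case).
static const std::unordered_set<std::string> kAbbreviations = {
    "mr.", "mrs.", "ms.", "dr.", "prof.", "etc.", "a.m.", "p.m.", "u.s."};

// Decide whether a whitespace-delimited token closes a sentence.
static bool ends_sentence(std::string token) {
    // Ignore trailing ASCII quotes or brackets, as in: "Go!"
    while (!token.empty() && (token.back() == '"' || token.back() == '\'' ||
                              token.back() == ')')) {
        token.pop_back();
    }
    if (token.empty()) return false;

    const char last = token.back();
    if (last == '?' || last == '!') return true;
    if (last != '.') return false;

    // Rule: an ASCII ellipsis "..." is not a boundary.
    if (token.size() >= 3 && token.compare(token.size() - 3, 3, "...") == 0) {
        return false;
    }
    // Rule: a single capital letter plus dot is treated as an initial.
    if (token.size() == 2 &&
        std::isupper(static_cast<unsigned char>(token[0]))) {
        return false;
    }
    // Rule: known abbreviations are not boundaries.
    std::string lower;
    for (char c : token) {
        lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return kAbbreviations.count(lower) == 0;
}

// Count sentences; text left over after the last boundary counts as one.
std::size_t count_sentences_rule_based(const std::string& text) {
    std::istringstream in(text);
    std::string token;
    std::size_t count = 0;
    bool pending = false;
    while (in >> token) {
        pending = true;
        if (ends_sentence(token)) {
            ++count;
            pending = false;
        }
    }
    return pending ? count + 1 : count;
}
```

The sketch has obvious gaps: it does not understand typographic (curly) quotes or the single-character ellipsis, and it cannot tell when an abbreviation such as “etc.” really does end a sentence in the middle of a text.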

I decided to give this problem to Claude, specifically the “Opus 4.6 Extended” model, and asked it to generate C++20 code. The good news is that it did so quickly and even included test cases for validation!

Here are the cases it has handled:

| # | Rule | Description |
|---|------|-------------|
| 1 | Sentence terminators | Sentences end with . ! or ? |
| 2 | Ellipsis | ‘...’ and ‘…’ do NOT terminate a sentence |
| 3 | Abbreviations | Mr., Dr., U.S., p.m., etc. do NOT end a sentence |
| 4 | Decimal numbers | 3.14, $1,200.50 (dots inside numbers are ignored) |
| 5 | Initials | J. K. Rowling (single-letter dots are not boundaries) |
| 6 | Quoted endings | He said, “Go!” (terminators inside quotes still count) |
| 7 | Multiple terminators | ?! or !!! collapse to a single sentence end |
| 8 | URLs and emails | Dots in http://x.com or a@b.com are ignored |
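Rules 4 and 8 can be approximated by masking the offending dots before any counting takes place. The sketch below uses `std::regex` with patterns I wrote for illustration; they are hypothetical and are not the expressions Claude generated. The placeholders `URL` and `EMAIL` and the function name are likewise assumptions.

```cpp
#include <regex>
#include <string>

// Replace URLs, email addresses and decimal points with dot-free
// placeholders so that a later pass sees only "real" terminators.
// All three patterns are simplified, illustrative approximations.
std::string mask_non_boundary_dots(const std::string& text) {
    // The final character class keeps a trailing '.' out of the match.
    static const std::regex url(R"(https?://\S*[A-Za-z0-9/])");
    static const std::regex email(
        R"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})");
    // A dot with a digit on both sides; the lookahead leaves the second
    // digit unconsumed so that "1.2.3" is fully masked.
    static const std::regex decimal_point(R"((\d)\.(?=\d))");

    std::string masked = std::regex_replace(text, url, "URL");
    masked = std::regex_replace(masked, email, "EMAIL");
    masked = std::regex_replace(masked, decimal_point, "$1_");
    return masked;
}
```

URLs are masked first so that dotted numbers inside them are never touched by the decimal rule. After masking, the only periods left in a text like “Pi is 3.14. See http://x.com or a@b.com.” are the two that end its sentences.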

 

Here are the regular expressions it generated and used for the common cases:

[Figure: Regular Expressions]

The generated code was compiled thus:

g++ -std=c++20 -O2 -o sentence_counter sentence_counter.cpp

Here is the program output:

[Figure: Program Output]

Does the code cover all possible cases? It is reasonable to say that it covers most of the edge cases. Even so, it is quite interesting that Claude was able to generate decent-quality code, and without any errors at that. Personally, between ChatGPT and Claude, I prefer the latter. Of course, we still need to carefully review the generated code (and its logic) before integrating it into any project.

You can download the code here.

Have a wonderful weekend.
