Split a string by max characters length, word aware - but without capturing whitespaces


The following regex (taken from here) splits a string into chunks of a given character length (e.g. 20 characters) while staying word-aware (live demo):

\b[\w\s]{20,}?(?=\s)|.+$

This means that if the length limit would cut a word in the middle, the whole word is kept in the current chunk instead:

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b[\\w\\s]{${substringMaxLength},}?(?=\\s)|.+$`, 'g');

const substrings = str.match(regex);

console.log(substrings);

However, as can be seen when running the snippet above, every substring after the first starts with the whitespace that separated it from the previous one. Can that leading whitespace be left out, so that we end up with this?

[
  "this is an input example",
  "of one sentence that",
  "contains a bit of words",
  "and must be split"
]

I tried adding [^\s], (?:\s), or (?!\s) in various places, but couldn't get it to work.

How can it be done?


There are 3 answers

3
trincot (BEST ANSWER)

You can require that every match starts with \w, in both alternatives of your current regex. Since that \w already consumes one character, the minimum in the quantifier drops by one:

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b\\w(?:[\\w\\s]{${substringMaxLength-1},}?(?=\\s)|.*$)`, 'g');

const substrings = str.match(regex);

console.log(substrings);
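As a rough sketch, the pattern above could be wrapped in a small reusable helper. The function name splitWordAware and its parameters are hypothetical (they are not part of the answer), and the sketch assumes maxLength is at least 1 and that the input is a single line.

```javascript
// Hypothetical helper; the name and signature are assumptions for illustration.
function splitWordAware(text, maxLength) {
  // First alternative: one word character plus at least maxLength - 1
  // word/whitespace characters, extended lazily until whitespace follows.
  // Second alternative: one word character plus the rest of the input.
  const pattern = new RegExp(
    `\\b\\w(?:[\\w\\s]{${maxLength - 1},}?(?=\\s)|.*$)`,
    "g"
  );
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(pattern) || [];
}

const parts = splitWordAware(
  "this is an input example of one sentence that contains a bit of words and must be split",
  20
);
console.log(parts);
// [ "this is an input example", "of one sentence that",
//   "contains a bit of words", "and must be split" ]
```

Returning an empty array instead of null keeps callers from having to special-case empty input.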

0
Sh_gosha

This is how you can do it:

const regex = new RegExp(`\\b((?:[^\\s]+\\s?){${substringMaxLength},}?)(?=\\s)|.+$`, 'g');

The regex wraps a non-capturing group, (?:[^\s]+\s?), in a capturing group and follows it with a positive lookahead, (?=\s). The lookahead only checks that whitespace comes next without consuming it, so the whitespace after each chunk is not part of the match. Note that the quantifier applies to the whole group, so {20,} counts repetitions of that group rather than single characters. The linked demo uses a slightly different form, with an extra \b and without the fallback alternative: \b((?:[^\s]+\s?){20,}?)\b(?=\s) Regex Demo
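To illustrate how a positive lookahead such as (?=\s) checks for whitespace without including it in the match, here is a minimal sketch. The sample string and variable names are made up for the illustration.

```javascript
const sample = "foo bar";

// The lookahead only asserts that whitespace follows; it is not consumed.
const withLookahead = sample.match(/\w+(?=\s)/)[0];

// Without the lookahead, the whitespace becomes part of the match.
const withConsumedSpace = sample.match(/\w+\s/)[0];

console.log(JSON.stringify(withLookahead));     // "foo"
console.log(JSON.stringify(withConsumedSpace)); // "foo "
```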

0
The fourth bird

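Your pattern can start with a single word character, with the quantifier then using the length minus 1 (19 instead of 20).

The negative lookahead (?!\S) asserts a whitespace boundary to the right: the next character is either whitespace or the end of the string.

The alternative matches the rest of the line, and also starts with a word character.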

\b\w(?:[\w\s]{19,}?(?!\S)|.*)

Regex demo

const str = "this is an input example of one sentence that contains a bit of words and must be split"

const substringMaxLength = 20;

const regex = new RegExp(`\\b\\w(?:[\\w\\s]{${substringMaxLength-1},}?(?!\\S)|.*)`, 'g');

const substrings = str.match(regex);

console.log(substrings);
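The practical difference between (?!\S) and (?=\s) shows up at the end of the input. The following is a small sketch with a made-up sample string, not part of the answer above.

```javascript
const tail = "last word";

// (?=\s) needs an actual whitespace character to follow,
// so the final word at the end of the input is not matched.
const withPositive = tail.match(/\w+(?=\s)/g);

// (?!\S) only requires that no non-whitespace character follows,
// which is also true at the end of the input.
const withNegative = tail.match(/\w+(?!\S)/g);

console.log(withPositive); // [ "last" ]
console.log(withNegative); // [ "last", "word" ]
```

This is why a pattern built on (?!\S) can let its first alternative match a chunk that runs to the end of the string, while a pattern built on (?=\s) has to rely on the fallback alternative there.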