Separate a string with SUBSTRING and PATINDEX - last step

98 views Asked by At

I need to finalize a query. The query returns a column which contains values like "P100+P200" or "SUMME(P400:P1200)".

In the end, the result should be:

Column A Column B Column C
P100 + P200
P400 : P1200

Solved to extract column A and column B.

I used for the first two steps this code:

MAX (SUBSTRING(t3.formel, PATINDEX('%[A-Z][0-9]%', t3.formel), PATINDEX('%[+:-]%', SUBSTRING(t3.formel, PATINDEX('%[A-Z][0-9]%', t3.formel), LEN(t3.formel))) - 1)) "Formelteil 1",
MAX (SUBSTRING(t3.formel, PATINDEX('%[+:.-]%', t3.formel), 1) ) AS Sonderzeichen

But guess I'm going to be blind about the solution for the third step.

2

There are 2 answers

2
Patrick Hurst On BEST ANSWER

As mentioned in the comments, this is not really a job for SQL Server.

When asking questions like this it's helpful to provide example DDL/DML:

DECLARE @Table TABLE (formel NVARCHAR(100));
INSERT INTO @Table (formel) VALUES 
('P100+P200'), ('G100/G200'), ('a100*z200'), ('P1005-P2005'), ('SUMME(P400:P1200)');

You're two thirds of the way there. Since we only seem to need to worry about one additional character, we can simply use the position of the operator + 1 to find the start of the last string and use an arbitrary number higher than the remaining characters, and then replace it with nothing:


SELECT t3.formel, 
SUBSTRING(t3.formel, PATINDEX('%[A-Z|a-z][0-9]%', t3.formel),PATINDEX('%[-|*|/|+|:]%', t3.formel)-PATINDEX('%[A-Z|a-z][0-9]%', t3.formel)) AS a,
SUBSTRING(t3.formel, PATINDEX('%[-*/+:]%', t3.formel), 1) AS b,
REPLACE(SUBSTRING(t3.formel, PATINDEX('%[-*/+:]%', t3.formel)+1, LEN(t3.formel)),')','') AS c
  FROM @Table t3;
formel a b c
P100+P200 P100 + P200
G100/G200 G100 / G200
a100*z200 a100 * z200
P1005-P2005 P1005 - P2005
SUMME(P400:P1200) P400 : P1200
0
Panagiotis Kanavos On

T-SQL isn't a text manipulation language and doesn't even have regular expressions. It's a lot easier to do this task in a client language, using a regular expression like ([A-Z\d]+)([+:.-])([A-Z\d]+) to capture the three parts.

In the comments you mention the data is used in Power BI. You can use a Python Transformation in the Query editor to apply a regular expression to the data using Pandas' str.exact and automatically extract the parts into columns.

The Power BI step script is essentially a one-liner

import pandas as pd
pattern=r"([A-Z\d]+)([+:.-])([A-Z\d]+)"
dataset[['a','b','c']]=dataset['formel'].str.extract(pattern)

str.extract applies the regular expression to all the values of the formel column (Series) and extracts each capture group into a separate column. dataset[['a','b','c']]= stores those columns in the original dataset using the names a, b and c.

You can easily test Python scripts in the command line or a Jupyter Notebook in VS Code.

The following script, in either Python or VS Code :

import pandas as pd
dataset=pd.DataFrame({'formel':['P100+P400','SUMME(P200:P300)']})

pattern=r"([A-Z\d]+)([+:.-])([A-Z\d]+)"
dataset[['a','b','c']]=dataset['formel'].str.extract(pattern)
dataset

Prints

             formel     a  b     c
0         P100+P400  P100  +  P400
1  SUMME(P200:P300)  P200  :  P300