What is the most efficient way of extracting these integers from a string using SQL?

111 views Asked by At

I have a table (in a SQL server 2014 database, accessed via SSMS) of engineering data, and I need to report the bolt lengths that are mentioned in a description field, marked in bold:

STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 140 LG C/W 1 NUT, 12 X M24 X 170 LG C/W 1 NUT)
STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (2 X M20 X 80 LG C/W 1 NUT, 6 X M20 X 90 LG C/W 1 NUT)
STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 130 LG C/W 1 NUT, 12 X M24 X 150 LG C/W 1 NUT)
STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H, GALVANISED (28 x 1 3/4" UN8 x 270 LG C/W 1 NUT)

Note that there are two bolt lengths mentioned in the description (eg the first record has "140 LG" and "170 LG"), but I only need the second one, "170", which is also always the larger of the two. While the position of the number should remain a fixed length from the end of the string, the length varies between two and three digits.

So beyond using MID(string,15,3) inside a lot of nested if statements to determine the length and position of the number, I don't really know how best to go about doing this.

Any suggestions would be much appreciated.

2

There are 2 answers

3
Yitzhak Khabinsky On BEST ANSWER

Please try the following solution that is using tokenization via XML and XQuery.

The desired value is not in the same position from the end of the string due to edge cases.

That's why, first, we will find a position of the bolt length using if/else logic.

After that the XPath predicate /root/r[last() - sql:column("t2.position")] does the job.

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (id INT IDENTITY PRIMARY KEY, tokens NVARCHAR(1024));
INSERT INTO @tbl (tokens) VALUES
('STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 140 LG C/W 1 NUT, 12 X M24 X 170 LG C/W 1 NUT)'),
('STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (2 X M20 X 80 LG C/W 1 NUT, 6 X M20 X 90 LG C/W 1 NUT)'),
('STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 130 LG C/W 1 NUT, 12 X M24 X 150 LG C/W 1 NUT)'),
('STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H, GALVANISED (28 x 1 3/4" UN8 x 270 LG C/W 1 NUT)'),
('STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H (12 x 7/8" x 110 LG C/W NUT)')
-- DDL and sample data population, end

DECLARE @separator CHAR(1) = SPACE(1);

SELECT *
    , bolt_length = c.value('(/root/r[last() - sql:column("t2.position")]/text())[1]', 'INT')
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' + 
    REPLACE(tokens, @separator, ']]></r><r><![CDATA[') + 
    ']]></r></root>' AS XML)) AS t1(c)
CROSS APPLY (SELECT c.query('if (/root/r[last() - 1]/text() = "1") then 4
    else 3').value('.','INT')) AS t2(position);

Output

id tokens bolt_length
1 STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 140 LG C/W 1 NUT, 12 X M24 X 170 LG C/W 1 NUT) 170
2 STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (2 X M20 X 80 LG C/W 1 NUT, 6 X M20 X 90 LG C/W 1 NUT) 90
3 STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 130 LG C/W 1 NUT, 12 X M24 X 150 LG C/W 1 NUT) 150
4 STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H, GALVANISED (28 x 1 3/4" UN8 x 270 LG C/W 1 NUT) 270
5 STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H (12 x 7/8" x 110 LG C/W NUT) 110
1
Atmo On

Assuming your ssms tag means you are on SQL server, you have only limited support to some version of regular expression (not the POSIX regex).
The below query will do what you want. Its principle is to locate the rightmost occurrence of LG and from there, the 2nd preceding space. The result you want is between these 2 positions.
Refer to the comments for explanation, remove the CTE included to have a complete example and cast the returned value to integer yourself if needed:

WITH MyTable(description) AS (
SELECT 'STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 140 LG C/W 1 NUT, 12 X M24 X 170 LG C/W 1 NUT)'
UNION ALL
SELECT 'STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (2 X M20 X 80 LG C/W 1 NUT, 6 X M20 X 90 LG C/W 1 NUT)'
UNION ALL
SELECT 'STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 130 LG C/W 1 NUT, 12 X M24 X 150 LG C/W 1 NUT)'
UNION ALL
SELECT 'STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H, GALVANISED (28 x 1 3/4" UN8 x 270 LG C/W 1 NUT)'
)
SELECT 
SUBSTRING(description, LEN(description) - pos1 - pos2, pos2 - 1)
FROM MyTable
/* REVERSE the string so that searching for the rightmost character/pattern will be equivalent to searching the leftmost character/reversed pattern in the reversed string */
CROSS APPLY (SELECT REVERSE(description)) AS rev(revDescription) 
/* Find the rightmost occurrence of 'LG', by searching the leftmost occurrence of 'GL' in the reserved string */
CROSS APPLY (SELECT patindex(REVERSE(N'% % LG%'), revDescription)) AS ca1(pos1)
/* Skipping the L (from LG) and its preceding space (i.e. starting on character 3), find the rightmost space */
CROSS APPLY (SELECT CHARINDEX(' ', SUBSTRING(revDescription, pos1 + 3, 1000))) AS ca2(pos2)

Other databases may have better support for regular expression, making the query much easier to write. Below is an example with Postgres, we can get it done in 1 line; a similar approach may work with other databases too.

WITH MyTable(description) AS (
SELECT 'STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 140 LG C/W 1 NUT, 12 X M24 X 170 LG C/W 1 NUT)'
UNION ALL
SELECT 'STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (2 X M20 X 80 LG C/W 1 NUT, 6 X M20 X 90 LG C/W 1 NUT)'
UNION ALL
SELECT 'STUD BOLT SET, SANS 1700, GR 8.8, SANS 1700, GR 8, GALVANISED (8 X M24 X 130 LG C/W 1 NUT, 12 X M24 X 150 LG C/W 1 NUT)'
UNION ALL
SELECT 'STUD BOLT SET, ASME B1.1, ASTM A193 B7, ASME B1.1, ASTM A194 GR 2H, GALVANISED (28 x 1 3/4" UN8 x 270 LG C/W 1 NUT)'
)
SELECT UNNEST(regexp_matches(description, '(\d+)\s*LG(?!LG)'))
FROM MyTable

In this case, we capture the number ((\d+)) before LG. (?!LG) is a negative lookahead, which means there cannot be allowed another occurrence of LG to the right of the one we want to capture.
If you were to remove (?!LG) (as well as the UNNEST), you would get all the numbers immediately preceding all the occurrences of LG in 1 array per input string.