Extract numbers from chemical formula

2.6k views Asked by At

Apologies if this has already been asked and answered but I couldn't find a satisfactory answer.

I have a list of chemical formulas including, in this order: C, H, N and O. And I would like to pull the number after each of these letters. The problem is that not all the formulas contain an N. All contain a C, H and O however. And the number can be either single, double or (in the case of H only) triple digit.

Thus the data looks like this:

  • C20H37N1O5
  • C10H12O3
  • C20H19N3O4
  • C23H40O3
  • C9H13N1O3
  • C14H26O4
  • C58H100N2O9

I'd like each element number for the list in separate columns. So in the first example it would be:

20 37 1 5

I've been trying:

=IFERROR(MID(LEFT(A2,FIND("H",A2)-1),FIND("C",A2)+1,LEN(A2)),"") 

to separate out the C#. However, after this I get stuck as the H# is flanked by either an O or N.

Is there an excel formula or VBA that can do this?

6

There are 6 answers

7
Pᴇʜ On BEST ANSWER

Use Regular Expressions

This is a good task for regular expressions (regex). Because VBA doesn't support regular expressions out of the box we need to reference a Windows library first.

  1. Add reference to regex under Tools then References enter image description here

  2. and selecting Microsoft VBScript Regular Expression 5.5 enter image description here

  3. Add this function to a module

     Option Explicit 
    
     Public Function ChemRegex(ByVal ChemFormula As String, ByVal Element As String) As Long
         Dim strPattern As String
         strPattern = "([CNHO])([0-9]*)" 
                      'this pattern is limited to the elements C, N, H and O only.
         Dim regEx As New RegExp
    
         Dim Matches As MatchCollection, m As Match
    
         If strPattern <> "" Then
             With regEx
                 .Global = True
                 .MultiLine = True
                 .IgnoreCase = False
                 .Pattern = strPattern
             End With
    
             Set Matches = regEx.Execute(ChemFormula)
             For Each m In Matches
                 If m.SubMatches(0) = Element Then
                     ChemRegex = IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1) 
                                 'this IIF ensures that in CH4O the C and O are count as 1
                     Exit For
                 End If
             Next m
         End If
     End Function
    
  4. Use the function like this in a cell formula

    E.g. in cell B2: =ChemRegex($A2,B$1) and copy it to the other cells enter image description here


Recognize also chemical formulas with multiple occurrences of elements like CH₃OH or CH₂COOH

Note that the code above cannot count something like CH3OH where elements occur more than once. Then only the first H3 is count the last is omitted.

If you need also to recognize formulas in the format like CH3OH or CH2COOH (and summarize the occurrences of the elements) then you need to change the code to recognize these too …

If m.SubMatches(0) = Element Then
    ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
    'Exit For needs to be removed.
End If

enter image description here

Recognize also chemical formulas with 2 letter elements like NaOH or CaCl₂

In addition to the change above for multiple occurrences of elements use this pattern:

strPattern = "([A-Z][a-z]?)([0-9]*)"   'https://regex101.com/r/nNv8W6/2

enter image description here

  1. Note that they need to be in the correct upper/lower letter case. CaCl2 works but not cacl2 or CACL2.

  2. Note that this doesn't proof if these letter combinations are existing elements of the periodic table. So this will also recognize eg. Xx2Zz5Q as fictive elements Xx = 2, Zz = 5 and Q = 1.

    To accept only combinations that exist in the periodic table use the following pattern:

     strPattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
     'https://regex101.com/r/Hlzta2/3
     'This pattern includes all 118 elements up to today. 
     'If new elements are found/generated by scientist they need to be added to the pattern.
    

Recognize also chemical formulas with prenthesis like Ca(OH)₂

Therefore another RegEx is needed to handle the parenthesis and multiply them.

Public Function ChemRegex(ByVal ChemFormula As String, ByVal Element As String) As Long
    Dim regEx As New RegExp
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
    End With
    
    'first pattern matches every element once
    regEx.Pattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
    
    Dim Matches As MatchCollection
    Set Matches = regEx.Execute(ChemFormula)
    
    Dim m As Match
    For Each m In Matches
        If m.SubMatches(0) = Element Then
            ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
        End If
    Next m
    
    'second patternd finds parenthesis and multiplies elements within
    regEx.Pattern = "(\((.+?)\)([0-9]+)+)+?"
    Set Matches = regEx.Execute(ChemFormula)
    For Each m In Matches
        ChemRegex = ChemRegex + ChemRegex(m.SubMatches(1), Element) * (m.SubMatches(2) - 1) '-1 because all elements were already counted once in the first pattern
    Next m
End Function

This will also recognize parenthesis. Note that it does not recognize nested parenthesis.

enter image description here


Also have a look at a similar question: Determine total number of atoms in a chemical formula

0
Vityata On

With VBA this is an easy task - you have to loop through the chars and check the values for being numeric. With Excel, the solution includes some redundancy. But it is doable. E.g.,

C20H37NO5 will return 20375 if you apply the following formula:

=IF(ISNUMBER(1*MID(A1,1,1)),MID(A1,1,1),"")&
IF(ISNUMBER(1*MID(A1,2,1)),MID(A1,2,1),"")&
IF(ISNUMBER(1*MID(A1,3,1)),MID(A1,3,1),"")&
IF(ISNUMBER(1*MID(A1,4,1)),MID(A1,4,1),"")&
IF(ISNUMBER(1*MID(A1,5,1)),MID(A1,5,1),"")&
IF(ISNUMBER(1*MID(A1,6,1)),MID(A1,6,1),"")&
IF(ISNUMBER(1*MID(A1,7,1)),MID(A1,7,1),"")&
IF(ISNUMBER(1*MID(A1,8,1)),MID(A1,8,1),"")&
IF(ISNUMBER(1*MID(A1,9,1)),MID(A1,9,1),"")

Currently, it checks the first 9 characters for being numeric. If you want to include more than 9, then simply add a few lines to the formula.

There is a small trick in the formula - the 1*. It converts a text character to a numeric, if it is possible. Thus, a 5 as a text, multiplied by 1 becomes a numeric character.

0
Dy.Lee On

Use split and like method.

Sub test()
    Dim vDB As Variant, vR() As Variant
    Dim s As String
    Dim vSplit As Variant
    Dim i As Long, n As Long, j As Integer

    vDB = Range("a2", Range("a" & Rows.Count).End(xlUp))

    n = UBound(vDB, 1)
    ReDim vR(1 To n, 1 To 4)
    For i = 1 To n
        s = vDB(i, 1)
        For j = 1 To Len(s)
            If Mid(s, j, 1) Like "[A-Z]" Then
                s = Replace(s, Mid(s, j, 1), " ")
            End If
        Next j
        vSplit = Split(s, " ")
        For j = 1 To UBound(vSplit)

            vR(i, j) = vSplit(j)
        Next j
    Next i
    Range("b2").Resize(n, 4) = vR
End Sub
0
eirikdaude On

I did this in VBA, using regular expressions. You can probably do it like Vityata suggests by looping through the string too, but I suspect that this is slightly faster and easier to read.

Option Explicit

Function find_associated_number(chemical_formula As Range, element As String) As Variant
  Dim regex As Object: Set regex = CreateObject("VBScript.RegExp")
  Dim pattern As String
  Dim matches As Object

  If Len(element) > 1 Or chemical_formula.CountLarge <> 1 Then
    find_associated_number = CVErr(xlErrName)
  Else
    pattern = element + "(\d+)\D"
    With regex
      .pattern = pattern
      .ignorecase = True
      If .test(chemical_formula) Then
        Set matches = .Execute(chemical_formula)
        find_associated_number = matches(0).submatches(0)
      Else
        find_associated_number = CVErr(xlErrNA)
      End If
    End With
  End If
End Function

Then you use the formula in your sheet like normal:

enter image description here

Column C contains the number of carbon atoms, column D the number of nitrogen atoms. Just expand on this by copying the formula and changing the element it searches for.

0
Tom On

If you want a vba solution to extract all numbers my preferred solution is to use Regular Expressions. The following code will extract all numbers from a string

Sub GetMolecularFormulaNumbers()
    Dim rng As Range
    Dim RegExp As Object
    Dim match, matches
    Dim j As Long

    Set rng = Range(Cells(1, 1), Cells(Cells(Rows.Count, 1).End(xlUp).Row, 1))
    Set RegExp = CreateObject("vbscript.regexp")
    With RegExp
        .Pattern = "\d+"
        .IgnoreCase = True
        .Global = True

        For Each c In rng
            j = 0
            Set matches = .Execute(c)
            If matches.Count > 0 Then
                For Each match In matches
                    j = j + 1
                    c.Offset(0, j) = CInt(match)
                Next match
            End If
        Next c
    End With
End Sub
0
CallumDA On

This seems to work just fine:

enter image description here

Formula in B2 is below. Drag across and down

=IFERROR(IFERROR(--(MID($A2,SEARCH(B$1,$A2)+1,3)),IFERROR(--(MID($A2,SEARCH(B$1,$A2)+1,2)),--MID($A2,SEARCH(B$1,$A2)+1,1))),0)

Or a shorter array formula, which must be entered with ctrl+shift+enter

=MAX(IFERROR(--MID($A2,SEARCH(B$1,$A2)+1,ROW($A$1:$A$3)),0))

If you wanted to keep the VBA super simple, something like this works as well:

Public Function ElementCount(str As String, element As String) As Long
    Dim i As Integer
    Dim s As String

    For i = 1 To 3
        s = Mid(str, InStr(str, element) + 1, i)
        On Error Resume Next
        ElementCount = CLng(s)
        On Error GoTo 0
    Next i
End Function

Use it like so:

=ElementCount(A1,"C")