How can I extract part of a PChar into a string?

1.9k views Asked by At

During profiling I came across a function that was taking quite a bit of time, but essentially boiled down to this very simple piece of code:

function GetSubstring(AInput: PChar; AStart, ASubstringLength: Integer): string;
begin
  Result := Copy(AInput, AStart, ASubstringLength);
end;

This function returns the expected sub-string, but it doesn't scale very well for longer inputs. I looked at the assembler code in CPU view, and from what I can tell (I'm not normally working at the assembler level), it seems that AInput is implicitly converted into a string before calling Copy.

But since at this point the length of the string/character array is unknown, the conversion code has to walk the length of the PChar until it finds the null terminator. This would explain the terrible scaling for longer inputs.

However, since the caller passes in the length of the PChar, I initially thought could just convert the method to use SetString instead.

function GetSubstring(AInput: PChar; AStart, ASubstringLength: Integer): string;
begin
  SetString(Result, AInput + AStart - 1, ASubstringLength);
end;

In addition to SetString working zero-based (not one-based as Copy), there seem to be a number of other little things Copy does in terms of validating its inputs, not all of which are documented (e.g. any start value less than 1 is changed to 1). So the naïve implementation above doesn't always work as the original one.

My goal is to replicate the Copy routine as much as possible, as this function is part of a library and has been used extensively by my colleagues.

I'm wondering whether the following implementation accomplishes that, or whether I need to be aware of any other caveats of Copy. NB: FLength is the actual length of AInput that comes from another part in the module this function is part of. I removed that other part for this example.

function GetSubstring(AInput: PChar; AStart, ASubstringLength: Integer): string;
begin
  if (AInput = nil) then begin
    Result := '';
  end else begin
    if (AStart < 1) then begin
      AStart := 0;
    end else begin
      AStart := AStart - 1;
    end;
    if (ASubstringLength + AStart > FLength) then begin
      ASubstringLength := FLength - AStart;
    end;
    SetString(Result, AInput + AStart, ASubstringLength);
  end;
end;

I'm using Delphi 2006, but I assume this isn't much different in other versions of the product (at least the non-Unicode ones).

2

There are 2 answers

1
David Heffernan On BEST ANSWER

Let's consider the corner cases. I think they are:

  1. AInput invalid.
  2. AStart < 1.
  3. AStart > FLength.
  4. ASubstringLength < 0.
  5. ASubstringLength + (AStart-1) > FLength.

We can ignore case 1 in my opinion. The onus should be on the caller to provide a valid PChar. Indeed your check that AInput <> nil is already a step too far in my view because nil is not a valid PChar.

Of the rest you have covered 2 and 5, but not 3 and 4. So if the user supplies a value of AStart that is too large, then you will read off the end of the string. Likewise, the user can readily supply a negative ASubstringLength. I don't think that you need anyone to write the code to check these cases since you are clearly very competent.

Now, if you really care about every last drop of performance, you should not be checking for any of these case. Demand that the user passes valid parameters. In debug mode, with {$IFOPF D+} or Assert you might check inputs. Of course, if these arguments come from external sources, then they should be validated.

On the other hand, the biggest performance hit the original code suffers is the needless scanning of the entire string, and the copying to an intermediate heap allocated string. Once you've removed those, as you have, then the opportunity for further performance gains is much reduced.

5
René Hoffmann On

Instead of converting PChar to string you should try to copy the memory from address AInput + (AStart * SizeOf(PChar)) with a length of ASubstringLength * SizeOf(PChar) to @Result as it's much easier to handle Result as a pointer.

Move procedure could do this.