.NET - CSV-parsing regular expression performance

I'm using a regex to parse a CSV-like file. I'm new to regular expressions, and, while it works, it gets slow when there are many fields and one of the fields contains a very long value. How can I optimize it?

The CSV I have to parse is of the following flavor:

All fields are strings enclosed in quotes and separated by commas
Quotes inside fields are escaped in the form of two consecutive quotes
There is unpredictable garbage at the start of lines that needs to be ignored (so far it didn't contain quotes, thankfully)
Zero-length fields and newlines in fields are possible

I am working in VB.NET. I am using the following regex:

(^(?!").+?|^(?="))(?<entry>"(",|(.*?)"(?<!((?!").("")+)),))*(?<lastentry>"("$|(.*?)"(?<!((?!").("")+))$))

I handle newlines by feeding StreamReader.ReadLine's result into a string variable until the regex succeeds, replacing each newline with a space (this is OK for my purposes). I then extract the field contents using Match.Groups("entry").Captures and Match.Groups("lastentry").

I suppose the performance hit is coming from the look-behind for escaped quotes. Is there a better way?

Thanks for any ideas!

I think your regex is needlessly complicated, and the nested quantifiers cause catastrophic backtracking. Try the following:

^[^"]*(?<entry>(?>"(?>[^"]+|"")*"),)*(?<lastentry>(?>"(?>[^"]+|"")*"))$

Explanation:

    ^                 # start of string
    [^"]*             # optional non-quotes
    (?<entry>         # match group 'entry'
     (?>              # match, and don't allow backtracking (atomic group):
      "               # a quote
      (?>             # followed by an atomic group:
       [^"]+          # one or more non-quote characters
      |               # or
       ""             # two quotes in a row
      )*              # repeat zero or more times.
      "               # match a closing quote
     )                # end of atomic group
     ,                # match a comma
    )*                # end of group 'entry'
    (?<lastentry>     # match the final group 'lastentry'
     (?>              # same as before
      "               # quoted field...
      (?>[^"]+|"")*   # containing non-quotes or double-quotes
      "               # and a closing quote
     )                # once.
    )                 # end of group 'lastentry'
    $                 # end of string

This should work on the entire file as well, so you wouldn't have to add one line after the next until the regex matches, and you wouldn't have to replace the newlines:

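    ' Requires: Imports System.Text.RegularExpressions
    ' The inner group is atomic ((?>...)), matching the pattern explained above;
    ' quotes are doubled because this is a VB string literal.
    Dim RegexObj As New Regex("^[^""]*(?<entry>(?>""(?>[^""]+|"""")*""),)*(?<lastentry>(?>""(?>[^""]+|"""")*""))$", RegexOptions.Multiline)
    Dim MatchResults As Match = RegexObj.Match(SubjectString)
    While MatchResults.Success
        ' You can access MatchResults.Groups("entry").Captures and
        ' MatchResults.Groups("lastentry")
        MatchResults = MatchResults.NextMatch()
    End While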
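Note that each capture of "entry" still carries its surrounding quotes and the trailing comma, and escaped quotes are still doubled. As a minimal sketch of how the raw captures could be turned into plain field values (the module name CsvFieldDemo, the helper Unquote, and the sample input are all made up for illustration, not part of your code):

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CsvFieldDemo
    ' Hypothetical helper: removes the trailing comma (present on "entry"
    ' captures only), strips the enclosing quotes, and collapses "" into ".
    Function Unquote(raw As String) As String
        Dim s As String = raw.TrimEnd(","c)
        Return s.Substring(1, s.Length - 2).Replace("""""", """")
    End Function

    Sub Main()
        Dim RegexObj As New Regex("^[^""]*(?<entry>(?>""(?>[^""]+|"""")*""),)*(?<lastentry>(?>""(?>[^""]+|"""")*""))$", RegexOptions.Multiline)

        ' Hypothetical sample line, written with ' as a stand-in for " to keep
        ' the literal readable: a garbage prefix, an escaped quote, an empty field.
        Dim SubjectString As String = "junk'a','say ''hi''',''".Replace("'"c, """"c)

        Dim MatchResults As Match = RegexObj.Match(SubjectString)
        While MatchResults.Success
            Dim Fields As New List(Of String)
            ' One capture per repetition of the "entry" group.
            For Each c As Capture In MatchResults.Groups("entry").Captures
                Fields.Add(Unquote(c.Value))
            Next
            ' "lastentry" matches exactly once per record.
            Fields.Add(Unquote(MatchResults.Groups("lastentry").Value))
            Console.WriteLine(String.Join(" | ", Fields))
            MatchResults = MatchResults.NextMatch()
        End While
    End Sub
End Module
```

For the sample line, the three fields should come out as a, say "hi", and an empty string. One thing worth checking on real data: with RegexOptions.Multiline, $ in .NET matches only before a line feed, so a file with CRLF line endings may need the carriage returns removed (or an optional \r allowed before $) for the records to match.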
