I'm using a regex to parse a CSV-like file. I'm new to regular expressions, and, while it works, it gets slow when there are many fields and one of the fields contains a very long value. How can I optimize it?
The CSV I have to parse is of the following flavor:
- All fields are strings enclosed in quotes and separated by commas
- Quotes inside fields are escaped in the form of two consecutive quotes
- There is unpredictable garbage at the start of some lines that needs to be ignored (so far it hasn't contained quotes, thankfully)
- Zero-length fields and newlines in fields are possible
I am working with VB.NET. I am using the following regex:
(^(?!").+?|^(?="))(?<entry>"(",|(.*?)"(?<!((?!").("")+)),))*(?<lastentry>"("$|(.*?)"(?<!((?!").("")+))$))
I handle newlines by feeding StreamReader.ReadLine's results into a string variable until the regex succeeds, replacing each newline with a space (this is OK for my purposes). I then extract the field contents using Match.Groups("entry").Captures and Match.Groups("lastentry").
I suppose the performance hit is coming from the look-behind for escaped quotes. Is there a better way?
Thanks for any ideas!
I think your regex is needlessly complicated, and the nested quantifiers cause catastrophic backtracking. Try the following:
^[^"]*(?<entry>(?>"(?>[^"]+|"")*"),)*(?<lastentry>(?>"(?>[^"]+|"")*"))$
Explanation:
^                   # Start of string
[^"]*               # Optional non-quotes
(?<entry>           # Match group 'entry'
 (?>                # Match, and don't allow backtracking (atomic group):
  "                 # a quote
  (?>               # followed by this atomic group:
   [^"]+            # one or more non-quote characters
  |                 # or
   ""               # two quotes in a row
  )*                # Repeat zero or more times.
  "                 # Then match a closing quote
 )                  # End of atomic group
 ,                  # Match a comma
)*                  # End of group 'entry'
(?<lastentry>       # Match the final group 'lastentry'
 (?>                # Same as before:
  "                 # a quoted field...
  (?>[^"]+|"")*     # containing non-quotes or double quotes
  "                 # and a closing quote
 )                  # exactly once.
)                   # End of group 'lastentry'
$                   # End of string
This should work on the entire file as well, so you wouldn't have to add one line after the next until the regex matches, and you wouldn't have to replace the newlines:
' Requires: Imports System.Text.RegularExpressions
' The inner group is atomic, (?>...), to match the pattern explained above.
Dim RegexObj As New Regex("^[^""]*(?<entry>(?>""(?>[^""]+|"""")*""),)*(?<lastentry>(?>""(?>[^""]+|"""")*""))$", RegexOptions.Multiline)
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
    ' Here you can access MatchResults.Groups("entry").Captures and
    ' MatchResults.Groups("lastentry")
    MatchResults = MatchResults.NextMatch()
End While
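The loop above stops at the point where the captures become available. As a rough sketch of what could come next, the following turns each match into a list of field values. The names `CsvFieldDemo`, `Unquote`, and `ParseRows`, as well as the sample text, are made up for illustration and are not part of the original answer. The sketch assumes that each "entry" capture still carries its enclosing quotes and trailing comma, which is what the pattern captures.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CsvFieldDemo
    ' Same pattern as in the answer, with both groups atomic.
    Private ReadOnly RowRegex As New Regex(
        "^[^""]*(?<entry>(?>""(?>[^""]+|"""")*""),)*(?<lastentry>(?>""(?>[^""]+|"""")*""))$",
        RegexOptions.Multiline)

    ' Hypothetical helper: drops the trailing comma that 'entry' captures
    ' include, strips the enclosing quotes, and collapses "" into ".
    Function Unquote(raw As String) As String
        If raw.EndsWith(",", StringComparison.Ordinal) Then
            raw = raw.Substring(0, raw.Length - 1)
        End If
        Return raw.Substring(1, raw.Length - 2).Replace("""""", """")
    End Function

    ' Hypothetical helper: returns one list of field values per matched row.
    Function ParseRows(csvText As String) As List(Of List(Of String))
        Dim rows As New List(Of List(Of String))
        Dim m As Match = RowRegex.Match(csvText)
        While m.Success
            Dim fields As New List(Of String)
            For Each c As Capture In m.Groups("entry").Captures
                fields.Add(Unquote(c.Value))
            Next
            fields.Add(Unquote(m.Groups("lastentry").Value))
            rows.Add(fields)
            m = m.NextMatch()
        End While
        Return rows
    End Function

    Sub Main()
        ' Made-up sample, written with ' in place of " for readability:
        ' leading garbage, an escaped quote, an empty field, and a newline
        ' inside the last field of the first row.
        Dim sample As String =
            ("junk'a','say ''hi''','','two" & vbLf & "lines'" & vbLf & "'x','y'").Replace("'", """")
        For Each row As List(Of String) In ParseRows(sample)
            Console.WriteLine("{0} field(s): {1}", row.Count, String.Join(" | ", row))
        Next
    End Sub
End Module
```

With this sample, the first row should yield four fields (a, say "hi", an empty string, and a value spanning two lines) and the second row two fields. One caveat: with RegexOptions.Multiline, .NET's $ matches only before a line feed, so a file with CRLF line endings would probably need its line endings normalized first, or \r?$ in place of $.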