Parsing Character-Separated Data with a Regular Expression
A line from a flat file typically uses a separator character to
delimit its fields. If the separator is a single character such as a
comma or a tab, the StringTokenizer class can be used to parse the
line into fields. If the separator is more complex (e.g., a comma
followed by a space), a regular expression is needed.
String.split() conveniently parses a line using a regular
expression to specify the separator.
String.split() returns only the nondelimiter strings. To
obtain the delimiter strings, see Parsing a String into Tokens Using a Regular Expression.
Note: StringTokenizer does not handle empty fields. For example,
given the line a,,b, rather than returning three fields (the second
being empty), StringTokenizer returns two fields, discarding the
empty one. String.split() preserves empty fields, except that
trailing empty fields are dropped unless a negative limit is passed
as the second argument.
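The difference described in the note can be checked with a small standalone program. This is a minimal sketch; the class name EmptyFieldDemo and the helper tokenize() are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyFieldDemo {
    // Hypothetical helper: collect every token StringTokenizer yields
    // for the given set of delimiter characters.
    static List<String> tokenize(String line, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "a,,b";

        // StringTokenizer treats consecutive delimiters as one separator,
        // so the empty middle field is lost.
        System.out.println(tokenize(line, ","));            // [a, b]

        // String.split() keeps the empty middle field.
        System.out.println(Arrays.asList(line.split(","))); // [a, , b]
    }
}
```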
// Parse a comma-separated string
String inputStr = "a,,b";
String patternStr = ",";
String[] fields = inputStr.split(patternStr);
// ["a", "", "b"]
// Parse a line whose separator is a comma followed by a space
inputStr = "a, b, c,d";
patternStr = ", ";
fields = inputStr.split(patternStr, -1);
// ["a", "b", "c,d"]
// Parse a line whose fields are separated by commas, spaces, and an optional "and" or "or"
inputStr = "a, b, and c";
patternStr = "[, ]+(and|or)*[, ]*";
fields = inputStr.split(patternStr, -1);
// ["a", "b", "c"]
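The second and third examples pass -1 as the limit argument to split(). The sketch below shows what that argument changes when a line ends with empty fields. The class name SplitLimitDemo and its two helper methods are hypothetical, added only to make the comparison explicit.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Hypothetical helper: split with the default limit (0),
    // which removes trailing empty strings from the result.
    static String[] splitDroppingTrailing(String line, String regex) {
        return line.split(regex);
    }

    // Hypothetical helper: split with a negative limit,
    // which keeps every field, including trailing empty ones.
    static String[] splitKeepingAll(String line, String regex) {
        return line.split(regex, -1);
    }

    public static void main(String[] args) {
        String line = "a,b,,";

        // Trailing empty fields are discarded.
        System.out.println(Arrays.toString(splitDroppingTrailing(line, ","))); // [a, b]

        // Trailing empty fields are preserved.
        System.out.println(Arrays.toString(splitKeepingAll(line, ",")));       // [a, b, , ]
    }
}
```

When every line of a file must yield the same number of fields, the negative limit is usually the safer choice, since the field count then no longer depends on whether the last columns happen to be empty.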