diff --git a/doc/source/api.rst b/doc/source/api.rst
index 6ab7a20d6b41f..52fd8f5838b1c 100644
--- a/doc/source/api.rst
+++ b/doc/source/api.rst
@@ -526,6 +526,7 @@ strings and apply several methods to it. These can be accessed like
    Series.str.encode
    Series.str.endswith
    Series.str.extract
+   Series.str.extractall
    Series.str.find
    Series.str.findall
    Series.str.get
diff --git a/doc/source/text.rst b/doc/source/text.rst
index d5ca24523695d..13421ae3dfa55 100644
--- a/doc/source/text.rst
+++ b/doc/source/text.rst
@@ -168,28 +168,37 @@ Extracting Substrings
 
 .. _text.extract:
 
-The method ``extract`` (introduced in version 0.13) accepts `regular expressions
-`__ with match groups. Extracting a
-regular expression with one group returns a Series of strings.
+Extract first match in each subject (extract)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-.. ipython:: python
+.. versionadded:: 0.13.0
+
+.. warning::
+
+   In version 0.18.0, ``extract`` gained the ``expand`` argument. When
+   ``expand=False`` it returns a ``Series``, ``Index``, or
+   ``DataFrame``, depending on the subject and regular expression
+   pattern (same behavior as pre-0.18.0). When ``expand=True`` it
+   always returns a ``DataFrame``, which is more consistent and less
+   confusing from the perspective of a user.
 
-   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
+The ``extract`` method accepts a `regular expression
+`__ with at least one
+capture group.
 
-Elements that do not match return ``NaN``. Extracting a regular expression
-with more than one group returns a DataFrame with one column per group.
+Extracting a regular expression with more than one group returns a
+DataFrame with one column per group.
 
 .. ipython:: python
 
    pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
 
-Elements that do not match return a row filled with ``NaN``.
-Thus, a Series of messy strings can be "converted" into a
-like-indexed Series or DataFrame of cleaned-up or more useful strings,
-without necessitating ``get()`` to access tuples or ``re.match`` objects.
-
-The results dtype always is object, even if no match is found and the result
-only contains ``NaN``.
+Elements that do not match return a row filled with ``NaN``. Thus, a
+Series of messy strings can be "converted" into a like-indexed Series
+or DataFrame of cleaned-up or more useful strings, without
+necessitating ``get()`` to access tuples or ``re.match`` objects. The
+results dtype always is object, even if no match is found and the
+result only contains ``NaN``.
 
 Named groups like
 
@@ -201,9 +210,109 @@ and optional groups like
 
 .. ipython:: python
 
-   pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
+   pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')
+
+can also be used. Note that any capture group names in the regular
+expression will be used for column names; otherwise capture group
+numbers will be used.
+
+Extracting a regular expression with one group returns a ``DataFrame``
+with one column if ``expand=True``.
+
+.. ipython:: python
+
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
+
+It returns a Series if ``expand=False``.
+
+.. ipython:: python
+
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
+
+Calling on an ``Index`` with a regex with exactly one capture group
+returns a ``DataFrame`` with one column if ``expand=True``,
+
+.. ipython:: python
+
+   s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
+   s
+   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
+
+It returns an ``Index`` if ``expand=False``.
+
+.. ipython:: python
+
+   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
+
+Calling on an ``Index`` with a regex with more than one capture group
+returns a ``DataFrame`` if ``expand=True``.
+
+.. ipython:: python
+
+   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
+
+It raises ``ValueError`` if ``expand=False``.
+
+.. code-block:: python
+
+    >>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
+    ValueError: only one regex group is supported with Index
+
+The table below summarizes the behavior of ``extract(expand=False)``
+(input subject in first column, number of groups in regex in
+first row)
+
++--------+---------+------------+
+|        | 1 group | >1 group   |
++--------+---------+------------+
+| Index  | Index   | ValueError |
++--------+---------+------------+
+| Series | Series  | DataFrame  |
++--------+---------+------------+
+
+Extract all matches in each subject (extractall)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. _text.extractall:
+
+Unlike ``extract`` (which returns only the first match),
+
+.. ipython:: python
+
+   s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
+   s
+   s.str.extract("[ab](?P<digit>\d)")
+
+.. versionadded:: 0.18.0
+
+the ``extractall`` method returns every match. The result of
+``extractall`` is always a ``DataFrame`` with a ``MultiIndex`` on its
+rows. The last level of the ``MultiIndex`` is named ``match`` and
+indicates the order in the subject.
+
+.. ipython:: python
+
+   s.str.extractall("[ab](?P<digit>\d)")
+
+When each subject string in the Series has exactly one match,
+
+.. ipython:: python
+
+   s = pd.Series(['a3', 'b3', 'c2'])
+   s
+   two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
+
+then ``extractall(pat).xs(0, level='match')`` gives the same result as
+``extract(pat)``.
+
+.. ipython:: python
+
+   extract_result = s.str.extract(two_groups)
+   extract_result
+   extractall_result = s.str.extractall(two_groups)
+   extractall_result
+   extractall_result.xs(0, level="match")
 
-can also be used.
 
 Testing for Strings that Match or Contain a Pattern
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -288,7 +397,8 @@ Method Summary
     :meth:`~Series.str.endswith`,Equivalent to ``str.endswith(pat)`` for each element
     :meth:`~Series.str.findall`,Compute list of all occurrences of pattern/regex for each string
     :meth:`~Series.str.match`,"Call ``re.match`` on each element, returning matched groups as list"
-    :meth:`~Series.str.extract`,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
+    :meth:`~Series.str.extract`,"Call ``re.search`` on each element, returning DataFrame with one row for each element and one column for each regex capture group"
+    :meth:`~Series.str.extractall`,"Call ``re.findall`` on each element, returning DataFrame with one row for each match and one column for each regex capture group"
     :meth:`~Series.str.len`,Compute string lengths
     :meth:`~Series.str.strip`,Equivalent to ``str.strip``
     :meth:`~Series.str.rstrip`,Equivalent to ``str.rstrip``
diff --git a/doc/source/whatsnew/v0.18.0.txt b/doc/source/whatsnew/v0.18.0.txt
index ac6267a15b513..d30c0321568bc 100644
--- a/doc/source/whatsnew/v0.18.0.txt
+++ b/doc/source/whatsnew/v0.18.0.txt
@@ -137,6 +137,92 @@ New Behavior:
    s.index
    s.index.nbytes
 
+.. _whatsnew_0180.enhancements.extract:
+
+Changes to str.extract
+^^^^^^^^^^^^^^^^^^^^^^
+
+The :ref:`.str.extract <text.extract>` method takes a regular
+expression with capture groups, finds the first match in each subject
+string, and returns the contents of the capture groups
+(:issue:`11386`). In v0.18.0, the ``expand`` argument was added to
+``extract``. When ``expand=False`` it returns a ``Series``, ``Index``,
+or ``DataFrame``, depending on the subject and regular expression
+pattern (same behavior as pre-0.18.0). When ``expand=True`` it always
+returns a ``DataFrame``, which is more consistent and less confusing
+from the perspective of a user. Currently the default is
+``expand=None`` which gives a ``FutureWarning`` and uses
+``expand=False``. To avoid this warning, please explicitly specify
+``expand``.
+
+.. ipython:: python
+
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
+
+Extracting a regular expression with one group returns a ``DataFrame``
+with one column if ``expand=True``.
+
+.. ipython:: python
+
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)
+
+It returns a Series if ``expand=False``.
+
+.. ipython:: python
+
+   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)
+
+Calling on an ``Index`` with a regex with exactly one capture group
+returns a ``DataFrame`` with one column if ``expand=True``,
+
+.. ipython:: python
+
+   s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
+   s
+   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
+
+It returns an ``Index`` if ``expand=False``.
+
+.. ipython:: python
+
+   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
+
+Calling on an ``Index`` with a regex with more than one capture group
+returns a ``DataFrame`` if ``expand=True``.
+
+.. ipython:: python
+
+   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
+
+It raises ``ValueError`` if ``expand=False``.
+
+.. code-block:: python
+
+    >>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
+    ValueError: only one regex group is supported with Index
+
+In summary, ``extract(expand=True)`` always returns a ``DataFrame``
+with a row for every subject string, and a column for every capture
+group.
+
+.. _whatsnew_0180.enhancements.extractall:
+
+The :ref:`.str.extractall <text.extractall>` method was added
+(:issue:`11386`). Unlike ``extract`` (which returns only the first
+match),
+
+.. ipython:: python
+
+   s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
+   s
+   s.str.extract("(?P<letter>[ab])(?P<digit>\d)")
+
+the ``extractall`` method returns all matches.
+
+.. ipython:: python
+
+   s.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
+
 .. _whatsnew_0180.enhancements.rounding:
 
 Datetimelike rounding
diff --git a/pandas/core/strings.py b/pandas/core/strings.py
index be78c950eff9d..727e3fcb377bd 100644
--- a/pandas/core/strings.py
+++ b/pandas/core/strings.py
@@ -418,38 +418,123 @@ def _get_single_group_name(rx):
         return None
 
 
-def str_extract(arr, pat, flags=0):
+def _groups_or_na_fun(regex):
+    """Used in both extract_noexpand and extract_frame"""
+    if regex.groups == 0:
+        raise ValueError("pattern contains no capture groups")
+    empty_row = [np.nan] * regex.groups
+
+    def f(x):
+        if not isinstance(x, compat.string_types):
+            return empty_row
+        m = regex.search(x)
+        if m:
+            return [np.nan if item is None else item for item in m.groups()]
+        else:
+            return empty_row
+    return f
+
+
+def _str_extract_noexpand(arr, pat, flags=0):
     """
     Find groups in each string in the Series using passed regular
-    expression.
+    expression. This function is called from
+    str_extract(expand=False), and can return Series, DataFrame, or
+    Index.
+
+    """
+    from pandas import DataFrame, Index
+
+    regex = re.compile(pat, flags=flags)
+    groups_or_na = _groups_or_na_fun(regex)
+
+    if regex.groups == 1:
+        result = np.array([groups_or_na(val)[0] for val in arr], dtype=object)
+        name = _get_single_group_name(regex)
+    else:
+        if isinstance(arr, Index):
+            raise ValueError("only one regex group is supported with Index")
+        name = None
+        names = dict(zip(regex.groupindex.values(), regex.groupindex.keys()))
+        columns = [names.get(1 + i, i) for i in range(regex.groups)]
+        if arr.empty:
+            result = DataFrame(columns=columns, dtype=object)
+        else:
+            result = DataFrame(
+                [groups_or_na(val) for val in arr],
+                columns=columns,
+                index=arr.index,
+                dtype=object)
+    return result, name
+
+
+def _str_extract_frame(arr, pat, flags=0):
+    """
+    For each subject string in the Series, extract groups from the
+    first match of regular expression pat. This function is called from
+    str_extract(expand=True), and always returns a DataFrame.
+
+    """
+    from pandas import DataFrame
+
+    regex = re.compile(pat, flags=flags)
+    groups_or_na = _groups_or_na_fun(regex)
+    names = dict(zip(regex.groupindex.values(), regex.groupindex.keys()))
+    columns = [names.get(1 + i, i) for i in range(regex.groups)]
+
+    if len(arr) == 0:
+        return DataFrame(columns=columns, dtype=object)
+    try:
+        result_index = arr.index
+    except AttributeError:
+        result_index = None
+    return DataFrame(
+        [groups_or_na(val) for val in arr],
+        columns=columns,
+        index=result_index,
+        dtype=object)
+
+
+def str_extract(arr, pat, flags=0, expand=None):
+    """
+    For each subject string in the Series, extract groups from the
+    first match of regular expression pat.
+
+    .. versionadded:: 0.13.0
 
     Parameters
     ----------
     pat : string
-        Pattern or regular expression
+        Regular expression pattern with capturing groups
     flags : int, default 0 (no flags)
         re module flags, e.g. re.IGNORECASE
 
+    .. versionadded:: 0.18.0
+    expand : bool, default False
+        * If True, return DataFrame.
+        * If False, return Series/Index/DataFrame.
+
     Returns
     -------
-    extracted groups : Series (one group) or DataFrame (multiple groups)
-        Note that dtype of the result is always object, even when no match is
-        found and the result is a Series or DataFrame containing only NaN
-        values.
+    DataFrame with one row for each subject string, and one column for
+    each group. Any capture group names in regular expression pat will
+    be used for column names; otherwise capture group numbers will be
+    used. The dtype of each result column is always object, even when
+    no match is found. If expand=False and pat has only one capture group,
+    then return a Series (if subject is a Series) or Index (if subject
+    is an Index).
 
-    Examples
+    See Also
     --------
-    A pattern with one group will return a Series. Non-matches will be NaN.
-
-    >>> Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
-    0      1
-    1      2
-    2    NaN
-    dtype: object
+    extractall : returns all matches (not just the first match)
 
-    A pattern with more than one group will return a DataFrame.
+    Examples
+    --------
+    A pattern with two groups will return a DataFrame with two columns.
+    Non-matches will be NaN.
 
-    >>> Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
+    >>> s = Series(['a1', 'b2', 'c3'])
+    >>> s.str.extract('([ab])(\d)')
          0    1
     0    a    1
     1    b    2
@@ -457,7 +542,7 @@ def str_extract(arr, pat, flags=0):
 
     A pattern may contain optional groups.
 
-    >>> Series(['a1', 'b2', 'c3']).str.extract('([ab])?(\d)')
+    >>> s.str.extract('([ab])?(\d)')
          0    1
     0    a    1
     1    b    2
@@ -465,46 +550,147 @@ def str_extract(arr, pat, flags=0):
 
     Named groups will become column names in the result.
 
-    >>> Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)')
+    >>> s.str.extract('(?P<letter>[ab])(?P<digit>\d)')
       letter digit
     0      a     1
     1      b     2
     2    NaN   NaN
 
+    A pattern with one group will return a DataFrame with one column
+    if expand=True.
+
+    >>> s.str.extract('[ab](\d)', expand=True)
+         0
+    0    1
+    1    2
+    2  NaN
+
+    A pattern with one group will return a Series if expand=False.
+
+    >>> s.str.extract('[ab](\d)', expand=False)
+    0      1
+    1      2
+    2    NaN
+    dtype: object
+
     """
-    from pandas.core.frame import DataFrame
-    from pandas.core.index import Index
+    if expand is None:
+        warnings.warn(
+            "currently extract(expand=None) " +
+            "means expand=False (return Index/Series/DataFrame) " +
+            "but in a future version of pandas this will be changed " +
+            "to expand=True (return DataFrame)",
+            FutureWarning,
+            stacklevel=3)
+        expand = False
+    if not isinstance(expand, bool):
+        raise ValueError("expand must be True or False")
+    if expand:
+        return _str_extract_frame(arr._orig, pat, flags=flags)
+    else:
+        result, name = _str_extract_noexpand(arr._data, pat, flags=flags)
+        return arr._wrap_result(result, name=name)
 
-    regex = re.compile(pat, flags=flags)
-    # just to be safe, check this
-    if regex.groups == 0:
-        raise ValueError("This pattern contains no groups to capture.")
-    empty_row = [np.nan] * regex.groups
 
-    def f(x):
-        if not isinstance(x, compat.string_types):
-            return empty_row
-        m = regex.search(x)
-        if m:
-            return [np.nan if item is None else item for item in m.groups()]
-        else:
-            return empty_row
+def str_extractall(arr, pat, flags=0):
+    """
+    For each subject string in the Series, extract groups from all
+    matches of regular expression pat. When each subject string in the
+    Series has exactly one match, extractall(pat).xs(0, level='match')
+    is the same as extract(pat).
 
-    if regex.groups == 1:
-        result = np.array([f(val)[0] for val in arr], dtype=object)
-        name = _get_single_group_name(regex)
+    .. versionadded:: 0.18.0
+
+    Parameters
+    ----------
+    pat : string
+        Regular expression pattern with capturing groups
+    flags : int, default 0 (no flags)
+        re module flags, e.g. re.IGNORECASE
+
+    Returns
+    -------
+    A DataFrame with one row for each match, and one column for each
+    group. Its rows have a MultiIndex with first levels that come from
+    the subject Series. The last level is named 'match' and indicates
+    the order in the subject. Any capture group names in regular
+    expression pat will be used for column names; otherwise capture
+    group numbers will be used.
+
+    See Also
+    --------
+    extract : returns first match only (not all matches)
+
+    Examples
+    --------
+    A pattern with one group will return a DataFrame with one column.
+    Indices with no matches will not appear in the result.
+
+    >>> s = Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
+    >>> s.str.extractall("[ab](\d)")
+             0
+      match
+    A 0      1
+      1      2
+    B 0      1
+
+    Capture group names are used for column names of the result.
+
+    >>> s.str.extractall("[ab](?P<digit>\d)")
+            digit
+      match
+    A 0         1
+      1         2
+    B 0         1
+
+    A pattern with two groups will return a DataFrame with two columns.
+
+    >>> s.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
+            letter digit
+      match
+    A 0          a     1
+      1          a     2
+    B 0          b     1
+
+    Optional groups that do not match are NaN in the result.
+
+    >>> s.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
+            letter digit
+      match
+    A 0          a     1
+      1          a     2
+    B 0          b     1
+    C 0        NaN     1
+
+    """
+    from pandas import DataFrame, MultiIndex
+    regex = re.compile(pat, flags=flags)
+    # the regex must contain capture groups.
+    if regex.groups == 0:
+        raise ValueError("pattern contains no capture groups")
+    names = dict(zip(regex.groupindex.values(), regex.groupindex.keys()))
+    columns = [names.get(1 + i, i) for i in range(regex.groups)]
+    match_list = []
+    index_list = []
+    for subject_key, subject in arr.iteritems():
+        if isinstance(subject, compat.string_types):
+            try:
+                key_list = list(subject_key)
+            except TypeError:
+                key_list = [subject_key]
+            for match_i, match_tuple in enumerate(regex.findall(subject)):
+                na_tuple = [
+                    np.NaN if group == "" else group for group in match_tuple]
+                match_list.append(na_tuple)
+                result_key = tuple(key_list + [match_i])
+                index_list.append(result_key)
+    if 0 < len(index_list):
+        index = MultiIndex.from_tuples(
+            index_list, names=arr.index.names + ["match"])
     else:
-        if isinstance(arr, Index):
-            raise ValueError("only one regex group is supported with Index")
-        name = None
-        names = dict(zip(regex.groupindex.values(), regex.groupindex.keys()))
-        columns = [names.get(1 + i, i) for i in range(regex.groups)]
-        if arr.empty:
-            result = DataFrame(columns=columns, dtype=object)
-        else:
-            result = DataFrame([f(val) for val in arr], columns=columns,
-                               index=arr.index, dtype=object)
-    return result, name
+        index = None
+    result = DataFrame(match_list, index, columns)
+    return result
 
 
 def str_get_dummies(arr, sep='|'):
@@ -599,6 +785,10 @@ def str_findall(arr, pat, flags=0):
     Returns
     -------
     matches : Series/Index of lists
+
+    See Also
+    --------
+    extractall : returns DataFrame with one column per capture group
     """
     regex = re.compile(pat, flags=flags)
     return _na_map(regex.findall, arr)
@@ -1405,9 +1595,12 @@ def translate(self, table, deletechars=None):
     findall = _pat_wrapper(str_findall, flags=True)
 
     @copy(str_extract)
-    def extract(self, pat, flags=0):
-        result, name = str_extract(self._data, pat, flags=flags)
-        return self._wrap_result(result, name=name)
+    def extract(self, pat, flags=0, expand=None):
+        return str_extract(self, pat, flags=flags, expand=expand)
+
+    @copy(str_extractall)
+    def extractall(self, pat, flags=0):
+        return str_extractall(self._orig, pat, flags=flags)
 
     _shared_docs['find'] = ("""
     Return %(side)s indexes in each strings in the Series/Index
diff --git a/pandas/tests/test_categorical.py b/pandas/tests/test_categorical.py
index 733ed2fbcb971..071e280bd112a 100755
--- a/pandas/tests/test_categorical.py
+++ b/pandas/tests/test_categorical.py
@@ -4110,6 +4110,7 @@ def test_str_accessor_api_for_categorical(self):
             ('encode', ("UTF-8",), {}),
             ('endswith', ("a",), {}),
             ('extract', ("([a-z]*) ",), {}),
+            ('extractall', ("([a-z]*) ",), {}),
             ('find', ("a",), {}),
             ('findall', ("a",), {}),
             ('index', (" ",), {}),
diff --git a/pandas/tests/test_strings.py b/pandas/tests/test_strings.py
index bc540cc8bf92b..f0bb002a1c96d 100644
--- a/pandas/tests/test_strings.py
+++ b/pandas/tests/test_strings.py
@@ -509,12 +509,22 @@ def test_match(self):
         exp = Series([True, np.nan, np.nan])
         assert_series_equal(exp, res)
 
-    def test_extract(self):
+    def test_extract_expand_None(self):
+        values = Series(['fooBAD__barBAD', NA, 'foo'])
+        with tm.assert_produces_warning(FutureWarning):
+            values.str.extract('.*(BAD[_]+).*(BAD)', expand=None)
+
+    def test_extract_expand_unspecified(self):
+        values = Series(['fooBAD__barBAD', NA, 'foo'])
+        with tm.assert_produces_warning(FutureWarning):
+            values.str.extract('.*(BAD[_]+).*(BAD)')
+
+    def test_extract_expand_False(self):
         # Contains tests like those in test_match and some others.
         values = Series(['fooBAD__barBAD', NA, 'foo'])
         er = [NA, NA]  # empty row
 
-        result = values.str.extract('.*(BAD[_]+).*(BAD)')
+        result = values.str.extract('.*(BAD[_]+).*(BAD)', expand=False)
         exp = DataFrame([['BAD__', 'BAD'], er, er])
         tm.assert_frame_equal(result, exp)
 
@@ -522,7 +532,7 @@ def test_extract(self):
         mixed = Series(['aBAD_BAD', NA, 'BAD_b_BAD', True, datetime.today(),
                         'foo', None, 1, 2.])
 
-        rs = Series(mixed).str.extract('.*(BAD[_]+).*(BAD)')
+        rs = Series(mixed).str.extract('.*(BAD[_]+).*(BAD)', expand=False)
         exp = DataFrame([['BAD_', 'BAD'], er, ['BAD_', 'BAD'], er, er,
                          er, er, er, er])
         tm.assert_frame_equal(rs, exp)
@@ -530,7 +540,7 @@ def test_extract(self):
         # unicode
         values = Series([u('fooBAD__barBAD'), NA, u('foo')])
 
-        result = values.str.extract('.*(BAD[_]+).*(BAD)')
+        result = values.str.extract('.*(BAD[_]+).*(BAD)', expand=False)
         exp = DataFrame([[u('BAD__'), u('BAD')], er, er])
         tm.assert_frame_equal(result, exp)
 
@@ -539,84 +549,85 @@ def test_extract(self):
         # multi-group would expand to a frame
         idx = Index(['A1', 'A2', 'A3', 'A4', 'B5'])
         with tm.assertRaisesRegexp(ValueError, "supported"):
-            idx.str.extract('([AB])([123])')
+            idx.str.extract('([AB])([123])', expand=False)
 
         # these should work for both Series and Index
         for klass in [Series, Index]:
             # no groups
             s_or_idx = klass(['A1', 'B2', 'C3'])
-            f = lambda: s_or_idx.str.extract('[ABC][123]')
+            f = lambda: s_or_idx.str.extract('[ABC][123]', expand=False)
             self.assertRaises(ValueError, f)
 
             # only non-capturing groups
-            f = lambda: s_or_idx.str.extract('(?:[AB]).*')
+            f = lambda: s_or_idx.str.extract('(?:[AB]).*', expand=False)
             self.assertRaises(ValueError, f)
 
             # single group renames series/index properly
             s_or_idx = klass(['A1', 'A2'])
-            result = s_or_idx.str.extract(r'(?P<uno>A)\d')
+            result = s_or_idx.str.extract(r'(?P<uno>A)\d', expand=False)
             tm.assert_equal(result.name, 'uno')
             tm.assert_numpy_array_equal(result, klass(['A', 'A']))
 
         s = Series(['A1', 'B2', 'C3'])
         # one group, no matches
-        result = s.str.extract('(_)')
+        result = s.str.extract('(_)', expand=False)
         exp = Series([NA, NA, NA], dtype=object)
         tm.assert_series_equal(result, exp)
 
         # two groups, no matches
-        result = s.str.extract('(_)(_)')
+        result = s.str.extract('(_)(_)', expand=False)
         exp = DataFrame([[NA, NA], [NA, NA], [NA, NA]], dtype=object)
         tm.assert_frame_equal(result, exp)
 
         # one group, some matches
-        result = s.str.extract('([AB])[123]')
+        result = s.str.extract('([AB])[123]', expand=False)
         exp = Series(['A', 'B', NA])
         tm.assert_series_equal(result, exp)
 
         # two groups, some matches
-        result = s.str.extract('([AB])([123])')
+        result = s.str.extract('([AB])([123])', expand=False)
         exp = DataFrame([['A', '1'], ['B', '2'], [NA, NA]])
         tm.assert_frame_equal(result, exp)
 
         # one named group
-        result = s.str.extract('(?P<letter>[AB])')
+        result = s.str.extract('(?P<letter>[AB])', expand=False)
         exp = Series(['A', 'B', NA], name='letter')
         tm.assert_series_equal(result, exp)
 
         # two named groups
-        result = s.str.extract('(?P<letter>[AB])(?P<number>[123])')
+        result = s.str.extract('(?P<letter>[AB])(?P<number>[123])',
+                               expand=False)
         exp = DataFrame([['A', '1'], ['B', '2'], [NA, NA]],
                         columns=['letter', 'number'])
         tm.assert_frame_equal(result, exp)
 
         # mix named and unnamed groups
-        result = s.str.extract('([AB])(?P<number>[123])')
+        result = s.str.extract('([AB])(?P<number>[123])', expand=False)
         exp = DataFrame([['A', '1'], ['B', '2'], [NA, NA]],
                         columns=[0, 'number'])
         tm.assert_frame_equal(result, exp)
 
         # one normal group, one non-capturing group
-        result = s.str.extract('([AB])(?:[123])')
+        result = s.str.extract('([AB])(?:[123])', expand=False)
         exp = Series(['A', 'B', NA])
         tm.assert_series_equal(result, exp)
 
         # two normal groups, one non-capturing group
         result = Series(['A11', 'B22', 'C33']).str.extract(
-            '([AB])([123])(?:[123])')
+            '([AB])([123])(?:[123])', expand=False)
         exp = DataFrame([['A', '1'], ['B', '2'], [NA, NA]])
         tm.assert_frame_equal(result, exp)
 
         # one optional group followed by one normal group
         result = Series(['A1', 'B2', '3']).str.extract(
-            '(?P<letter>[AB])?(?P<number>[123])')
+            '(?P<letter>[AB])?(?P<number>[123])', expand=False)
         exp = DataFrame([['A', '1'], ['B', '2'], [NA, '3']],
                         columns=['letter', 'number'])
         tm.assert_frame_equal(result, exp)
 
         # one normal group followed by one optional group
         result = Series(['A1', 'B2', 'C']).str.extract(
-            '(?P<letter>[ABC])(?P<number>[123])?')
+            '(?P<letter>[ABC])(?P<number>[123])?', expand=False)
         exp = DataFrame([['A', '1'], ['B', '2'], ['C', NA]],
                         columns=['letter', 'number'])
         tm.assert_frame_equal(result, exp)
@@ -626,28 +637,431 @@ def test_extract(self):
         def check_index(index):
             data = ['A1', 'B2', 'C']
             index = index[:len(data)]
-            result = Series(data, index=index).str.extract('(\d)')
+            s = Series(data, index=index)
+            result = s.str.extract('(\d)', expand=False)
             exp = Series(['1', '2', NA], index=index)
             tm.assert_series_equal(result, exp)
 
-            result = Series(
-                data, index=index).str.extract('(?P<letter>\D)(?P<number>\d)?')
-            exp = DataFrame([['A', '1'], ['B', '2'], ['C', NA]], columns=[
-                'letter', 'number'
-            ], index=index)
+            result = Series(data, index=index).str.extract(
+                '(?P<letter>\D)(?P<number>\d)?', expand=False)
+            e_list = [
+                ['A', '1'],
+                ['B', '2'],
+                ['C', NA]
+            ]
+            exp = DataFrame(e_list, columns=['letter', 'number'], index=index)
             tm.assert_frame_equal(result, exp)
 
-        for index in [tm.makeStringIndex, tm.makeUnicodeIndex, tm.makeIntIndex,
-                      tm.makeDateIndex, tm.makePeriodIndex]:
+        i_funs = [
+            tm.makeStringIndex, tm.makeUnicodeIndex, tm.makeIntIndex,
+            tm.makeDateIndex, tm.makePeriodIndex, tm.makeRangeIndex
+        ]
+        for index in i_funs:
             check_index(index())
 
-    def test_extract_single_series_name_is_preserved(self):
+        # single_series_name_is_preserved.
         s = Series(['a3', 'b3', 'c2'], name='bob')
-        r = s.str.extract(r'(?P<sue>[a-z])')
+        r = s.str.extract(r'(?P<sue>[a-z])', expand=False)
         e = Series(['a', 'b', 'c'], name='sue')
         tm.assert_series_equal(r, e)
         self.assertEqual(r.name, e.name)
 
+    def test_extract_expand_True(self):
+        # Contains tests like those in test_match and some others.
+        values = Series(['fooBAD__barBAD', NA, 'foo'])
+        er = [NA, NA]  # empty row
+
+        result = values.str.extract('.*(BAD[_]+).*(BAD)', expand=True)
+        exp = DataFrame([['BAD__', 'BAD'], er, er])
+        tm.assert_frame_equal(result, exp)
+
+        # mixed
+        mixed = Series(['aBAD_BAD', NA, 'BAD_b_BAD', True, datetime.today(),
+                        'foo', None, 1, 2.])
+
+        rs = Series(mixed).str.extract('.*(BAD[_]+).*(BAD)', expand=True)
+        exp = DataFrame([['BAD_', 'BAD'], er, ['BAD_', 'BAD'], er, er,
+                         er, er, er, er])
+        tm.assert_frame_equal(rs, exp)
+
+        # unicode
+        values = Series([u('fooBAD__barBAD'), NA, u('foo')])
+
+        result = values.str.extract('.*(BAD[_]+).*(BAD)', expand=True)
+        exp = DataFrame([[u('BAD__'), u('BAD')], er, er])
+        tm.assert_frame_equal(result, exp)
+
+        # these should work for both Series and Index
+        for klass in [Series, Index]:
+            # no groups
+            s_or_idx = klass(['A1', 'B2', 'C3'])
+            f = lambda: s_or_idx.str.extract('[ABC][123]', expand=True)
+            self.assertRaises(ValueError, f)
+
+            # only non-capturing groups
+            f = lambda: s_or_idx.str.extract('(?:[AB]).*', expand=True)
+            self.assertRaises(ValueError, f)
+
+            # single group renames series/index properly
+            s_or_idx = klass(['A1', 'A2'])
+            result_df = s_or_idx.str.extract(r'(?P<uno>A)\d', expand=True)
+            result_series = result_df['uno']
+            tm.assert_numpy_array_equal(result_series, klass(['A', 'A']))
+
+    def test_extract_series(self):
+        # extract should give the same result whether or not the
+        # series has a name.
+        for series_name in None, "series_name":
+            s = Series(['A1', 'B2', 'C3'], name=series_name)
+            # one group, no matches
+            result = s.str.extract('(_)', expand=True)
+            exp = DataFrame([NA, NA, NA], dtype=object)
+            tm.assert_frame_equal(result, exp)
+
+            # two groups, no matches
+            result = s.str.extract('(_)(_)', expand=True)
+            exp = DataFrame([[NA, NA], [NA, NA], [NA, NA]], dtype=object)
+            tm.assert_frame_equal(result, exp)
+
+            # one group, some matches
+            result = s.str.extract('([AB])[123]', expand=True)
+            exp = DataFrame(['A', 'B', NA])
+            tm.assert_frame_equal(result, exp)
+
+            # two groups, some matches
+            result = s.str.extract('([AB])([123])', expand=True)
+            exp = DataFrame([['A', '1'], ['B', '2'], [NA, NA]])
+            tm.assert_frame_equal(result, exp)
+
+            # one named group
+            result = s.str.extract('(?P<letter>[AB])', expand=True)
+            exp = DataFrame({"letter": ['A', 'B', NA]})
+            tm.assert_frame_equal(result, exp)
+
+            # two named groups
+            result = s.str.extract(
+                '(?P<letter>[AB])(?P<number>[123])',
+                expand=True)
+            e_list = [
+                ['A', '1'],
+                ['B', '2'],
+                [NA, NA]
+            ]
+            exp = DataFrame(e_list, columns=['letter', 'number'])
+            tm.assert_frame_equal(result, exp)
+
+            # mix named and unnamed groups
+            result = s.str.extract('([AB])(?P<number>[123])', expand=True)
+            exp = DataFrame(e_list, columns=[0, 'number'])
+            tm.assert_frame_equal(result, exp)
+
+            # one normal group, one non-capturing group
+            result = s.str.extract('([AB])(?:[123])', expand=True)
+            exp = DataFrame(['A', 'B', NA])
+            tm.assert_frame_equal(result, exp)
+
+    def test_extract_optional_groups(self):
+
+        # two normal groups, one non-capturing group
+        result = Series(['A11', 'B22', 'C33']).str.extract(
+            '([AB])([123])(?:[123])', expand=True)
+        exp = DataFrame([['A', '1'], ['B', '2'], [NA, NA]])
+        tm.assert_frame_equal(result, exp)
+
+        # one optional group followed by one normal group
+        result = Series(['A1', 'B2', '3']).str.extract(
+            '(?P<letter>[AB])?(?P<number>[123])', expand=True)
+        e_list = [
+            ['A', '1'],
+            ['B', '2'],
+            [NA, '3']
+        ]
+        exp = DataFrame(e_list, columns=['letter', 'number'])
+        tm.assert_frame_equal(result, exp)
+
+        # one normal group followed by one optional group
+        result = Series(['A1', 'B2', 'C']).str.extract(
+            '(?P<letter>[ABC])(?P<number>[123])?', expand=True)
+        e_list = [
+            ['A', '1'],
+            ['B', '2'],
+            ['C', NA]
+        ]
+        exp = DataFrame(e_list, columns=['letter', 'number'])
+        tm.assert_frame_equal(result, exp)
+
+        # GH6348
+        # not passing index to the extractor
+        def check_index(index):
+            data = ['A1', 'B2', 'C']
+            index = index[:len(data)]
+            result = Series(data, index=index).str.extract('(\d)', expand=True)
+            exp = DataFrame(['1', '2', NA], index=index)
+            tm.assert_frame_equal(result, exp)
+
+            result = Series(data, index=index).str.extract(
+                '(?P<letter>\D)(?P<number>\d)?', expand=True)
+            e_list = [
+                ['A', '1'],
+                ['B', '2'],
+                ['C', NA]
+            ]
+            exp = DataFrame(e_list, columns=['letter', 'number'], index=index)
+            tm.assert_frame_equal(result, exp)
+
+        i_funs = [
+            tm.makeStringIndex, tm.makeUnicodeIndex, tm.makeIntIndex,
+            tm.makeDateIndex, tm.makePeriodIndex, tm.makeRangeIndex
+        ]
+        for index in i_funs:
+            check_index(index())
+
+    def test_extract_single_group_returns_frame(self):
+        # GH11386 extract should always return DataFrame, even when
+        # there is only one group. Prior to v0.18.0, extract returned
+        # Series when there was only one group in the regex.
+ s = Series(['a3', 'b3', 'c2'], name='series_name') + r = s.str.extract(r'(?P[a-z])', expand=True) + e = DataFrame({"letter": ['a', 'b', 'c']}) + tm.assert_frame_equal(r, e) + + def test_extractall(self): + subject_list = [ + 'dave@google.com', + 'tdhock5@gmail.com', + 'maudelaperriere@gmail.com', + 'rob@gmail.com some text steve@gmail.com', + 'a@b.com some text c@d.com and e@f.com', + np.nan, + "", + ] + expected_tuples = [ + ("dave", "google", "com"), + ("tdhock5", "gmail", "com"), + ("maudelaperriere", "gmail", "com"), + ("rob", "gmail", "com"), ("steve", "gmail", "com"), + ("a", "b", "com"), ("c", "d", "com"), ("e", "f", "com"), + ] + named_pattern = r''' + (?P[a-z0-9]+) + @ + (?P[a-z]+) + \. + (?P[a-z]{2,4}) + ''' + expected_columns = ["user", "domain", "tld"] + S = Series(subject_list) + # extractall should return a DataFrame with one row for each + # match, indexed by the subject from which the match came. + expected_index = MultiIndex.from_tuples([ + (0, 0), + (1, 0), + (2, 0), + (3, 0), + (3, 1), + (4, 0), + (4, 1), + (4, 2), + ], names=(None, "match")) + expected_df = DataFrame( + expected_tuples, expected_index, expected_columns) + computed_df = S.str.extractall(named_pattern, re.VERBOSE) + tm.assert_frame_equal(computed_df, expected_df) + + # The index of the input Series should be used to construct + # the index of the output DataFrame: + series_index = MultiIndex.from_tuples([ + ("single", "Dave"), + ("single", "Toby"), + ("single", "Maude"), + ("multiple", "robAndSteve"), + ("multiple", "abcdef"), + ("none", "missing"), + ("none", "empty"), + ]) + Si = Series(subject_list, series_index) + expected_index = MultiIndex.from_tuples([ + ("single", "Dave", 0), + ("single", "Toby", 0), + ("single", "Maude", 0), + ("multiple", "robAndSteve", 0), + ("multiple", "robAndSteve", 1), + ("multiple", "abcdef", 0), + ("multiple", "abcdef", 1), + ("multiple", "abcdef", 2), + ], names=(None, None, "match")) + expected_df = DataFrame( + expected_tuples, 
+ expected_index, expected_columns) + computed_df = Si.str.extractall(named_pattern, re.VERBOSE) + tm.assert_frame_equal(computed_df, expected_df) + + # MultiIndexed subject with names. + Sn = Series(subject_list, series_index) + Sn.index.names = ("matches", "description") + expected_index.names = ("matches", "description", "match") + expected_df = DataFrame( + expected_tuples, expected_index, expected_columns) + computed_df = Sn.str.extractall(named_pattern, re.VERBOSE) + tm.assert_frame_equal(computed_df, expected_df) + + # optional groups. + subject_list = ['', 'A1', '32'] + named_pattern = '(?P<letter>[AB])?(?P<number>[123])' + computed_df = Series(subject_list).str.extractall(named_pattern) + expected_index = MultiIndex.from_tuples([ + (1, 0), + (2, 0), + (2, 1), + ], names=(None, "match")) + expected_df = DataFrame([ + ('A', '1'), + (NA, '3'), + (NA, '2'), + ], expected_index, columns=['letter', 'number']) + tm.assert_frame_equal(computed_df, expected_df) + + # only one of two groups has a name. + pattern = '([AB])?(?P<number>[123])' + computed_df = Series(subject_list).str.extractall(pattern) + expected_df = DataFrame([ + ('A', '1'), + (NA, '3'), + (NA, '2'), + ], expected_index, columns=[0, 'number']) + tm.assert_frame_equal(computed_df, expected_df) + + def test_extractall_single_group(self): + # extractall(one named group) returns DataFrame with one named + # column. + s = Series(['a3', 'b3', 'd4c2'], name='series_name') + r = s.str.extractall(r'(?P<letter>[a-z])') + i = MultiIndex.from_tuples([ + (0, 0), + (1, 0), + (2, 0), + (2, 1), + ], names=(None, "match")) + e = DataFrame({"letter": ['a', 'b', 'd', 'c']}, i) + tm.assert_frame_equal(r, e) + + # extractall(one un-named group) returns DataFrame with one + # un-named column. + r = s.str.extractall(r'([a-z])') + e = DataFrame(['a', 'b', 'd', 'c'], i) + tm.assert_frame_equal(r, e) + + def test_extractall_no_matches(self): + s = Series(['a3', 'b3', 'd4c2'], name='series_name') + # one un-named group.
+ r = s.str.extractall('(z)') + e = DataFrame(columns=[0]) + tm.assert_frame_equal(r, e) + # two un-named groups. + r = s.str.extractall('(z)(z)') + e = DataFrame(columns=[0, 1]) + tm.assert_frame_equal(r, e) + # one named group. + r = s.str.extractall('(?P<first>z)') + e = DataFrame(columns=["first"]) + tm.assert_frame_equal(r, e) + # two named groups. + r = s.str.extractall('(?P<first>z)(?P<second>z)') + e = DataFrame(columns=["first", "second"]) + tm.assert_frame_equal(r, e) + # one named, one un-named. + r = s.str.extractall('(z)(?P<second>z)') + e = DataFrame(columns=[0, + "second"]) + tm.assert_frame_equal(r, e) + + def test_extractall_errors(self): + # Does not make sense to use extractall with a regex that has + # no capture groups. (it returns DataFrame with one column for + # each capture group) + s = Series(['a3', 'b3', 'd4c2'], name='series_name') + with tm.assertRaisesRegexp(ValueError, "no capture groups"): + s.str.extractall(r'[a-z]') + + def test_extract_index_one_two_groups(self): + s = Series( + ['a3', 'b3', 'd4c2'], ["A3", "B3", "D4"], name='series_name') + r = s.index.str.extract(r'([A-Z])', expand=True) + e = DataFrame(['A', "B", "D"]) + tm.assert_frame_equal(r, e) + + # Prior to v0.18.0, index.str.extract(regex with one group) + # returned Index. With more than one group, extract raised an + # error (GH9980). Now extract always returns DataFrame.
+ r = s.index.str.extract( + r'(?P<letter>[A-Z])(?P<digit>[0-9])', expand=True) + e_list = [ + ("A", "3"), + ("B", "3"), + ("D", "4"), + ] + e = DataFrame(e_list, columns=["letter", "digit"]) + tm.assert_frame_equal(r, e) + + def test_extractall_same_as_extract(self): + s = Series(['a3', 'b3', 'c2'], name='series_name') + + pattern_two_noname = r'([a-z])([0-9])' + extract_two_noname = s.str.extract(pattern_two_noname, expand=True) + has_multi_index = s.str.extractall(pattern_two_noname) + no_multi_index = has_multi_index.xs(0, level="match") + tm.assert_frame_equal(extract_two_noname, no_multi_index) + + pattern_two_named = r'(?P[a-z])(?P[0-9])' + extract_two_named = s.str.extract(pattern_two_named, expand=True) + has_multi_index = s.str.extractall(pattern_two_named) + no_multi_index = has_multi_index.xs(0, level="match") + tm.assert_frame_equal(extract_two_named, no_multi_index) + + pattern_one_named = r'(?P[a-z])' + extract_one_named = s.str.extract(pattern_one_named, expand=True) + has_multi_index = s.str.extractall(pattern_one_named) + no_multi_index = has_multi_index.xs(0, level="match") + tm.assert_frame_equal(extract_one_named, no_multi_index) + + pattern_one_noname = r'([a-z])' + extract_one_noname = s.str.extract(pattern_one_noname, expand=True) + has_multi_index = s.str.extractall(pattern_one_noname) + no_multi_index = has_multi_index.xs(0, level="match") + tm.assert_frame_equal(extract_one_noname, no_multi_index) + + def test_extractall_same_as_extract_subject_index(self): + # same as above tests, but s has a MultiIndex.
+ i = MultiIndex.from_tuples([ + ("A", "first"), + ("B", "second"), + ("C", "third"), + ], names=("capital", "ordinal")) + s = Series(['a3', 'b3', 'c2'], i, name='series_name') + + pattern_two_noname = r'([a-z])([0-9])' + extract_two_noname = s.str.extract(pattern_two_noname, expand=True) + has_match_index = s.str.extractall(pattern_two_noname) + no_match_index = has_match_index.xs(0, level="match") + tm.assert_frame_equal(extract_two_noname, no_match_index) + + pattern_two_named = r'(?P[a-z])(?P[0-9])' + extract_two_named = s.str.extract(pattern_two_named, expand=True) + has_match_index = s.str.extractall(pattern_two_named) + no_match_index = has_match_index.xs(0, level="match") + tm.assert_frame_equal(extract_two_named, no_match_index) + + pattern_one_named = r'(?P[a-z])' + extract_one_named = s.str.extract(pattern_one_named, expand=True) + has_match_index = s.str.extractall(pattern_one_named) + no_match_index = has_match_index.xs(0, level="match") + tm.assert_frame_equal(extract_one_named, no_match_index) + + pattern_one_noname = r'([a-z])' + extract_one_noname = s.str.extract(pattern_one_noname, expand=True) + has_match_index = s.str.extractall(pattern_one_noname) + no_match_index = has_match_index.xs(0, level="match") + tm.assert_frame_equal(extract_one_noname, no_match_index) + def test_empty_str_methods(self): empty_str = empty = Series(dtype=str) empty_int = Series(dtype=int) @@ -670,9 +1084,18 @@ def test_empty_str_methods(self): tm.assert_series_equal(empty_str, empty.str.replace('a', 'b')) tm.assert_series_equal(empty_str, empty.str.repeat(3)) tm.assert_series_equal(empty_bool, empty.str.match('^a')) - tm.assert_series_equal(empty_str, empty.str.extract('()')) tm.assert_frame_equal( - DataFrame(columns=[0, 1], dtype=str), empty.str.extract('()()')) + DataFrame(columns=[0], dtype=str), + empty.str.extract('()', expand=True)) + tm.assert_frame_equal( + DataFrame(columns=[0, 1], dtype=str), + empty.str.extract('()()', expand=True)) + tm.assert_series_equal( 
+ empty_str, + empty.str.extract('()', expand=False)) + tm.assert_frame_equal( + DataFrame(columns=[0, 1], dtype=str), + empty.str.extract('()()', expand=False)) tm.assert_frame_equal(DataFrame(dtype=str), empty.str.get_dummies()) tm.assert_series_equal(empty_str, empty_list.str.join('')) tm.assert_series_equal(empty_int, empty.str.len())
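As an illustration of what the ``extractall`` tests in this diff assert (one row per match, keyed by the subject position plus a ``match`` counter), here is a minimal plain-``re`` sketch. It is an analogue only, not how pandas implements ``extractall``; the helper name ``extract_all_rows`` is hypothetical, and the group names ``user``, ``domain`` and ``tld`` are taken from ``expected_columns`` in ``test_extractall``.

```python
import re

# Verbose pattern modelled on test_extractall; the group names are an
# assumption based on expected_columns = ["user", "domain", "tld"].
PATTERN = re.compile(r"""
    (?P<user>[a-z0-9]+)
    @
    (?P<domain>[a-z]+)
    \.
    (?P<tld>[a-z]{2,4})
""", re.VERBOSE)


def extract_all_rows(subjects, pattern):
    """Hypothetical helper: map (subject position, match number) to groups."""
    rows = {}
    for pos, subject in enumerate(subjects):
        if not isinstance(subject, str):
            continue  # missing values such as NaN contribute no rows
        for match_no, m in enumerate(pattern.finditer(subject)):
            rows[(pos, match_no)] = m.groups()
    return rows


subjects = [
    "dave@google.com",
    "rob@gmail.com some text steve@gmail.com",
    float("nan"),
    "",
]
rows = extract_all_rows(subjects, PATTERN)

# Keeping only match number 0 mirrors has_multi_index.xs(0, level="match")
# in the tests, i.e. the "first match per subject" view that extract gives.
first_matches = {pos: groups for (pos, n), groups in rows.items() if n == 0}

for key, groups in rows.items():
    print(key, groups)
```

Subjects with no match (the missing value and the empty string) produce no rows at all, which is why the expected ``MultiIndex`` in ``test_extractall`` skips them.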