
Sklearn-pandas

This module provides a bridge between Scikit-Learn's machine learning methods and pandas-style Data Frames.

In particular, it provides:

  1. a way to map DataFrame columns to transformations, which are later recombined into features
  2. a way to cross-validate a pipeline that takes a pandas DataFrame as input.

Installation

You can install sklearn-pandas with pip:

# pip install sklearn-pandas

Tests

The examples in this file double as basic sanity tests. To run them, use doctest, which is included with Python:

# python -m doctest README.rst
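To illustrate what doctest does with a file like this one, here is a minimal sketch that runs a README-style snippet held in a string, using only the standard library. The snippet text and the names README_SNIPPET and readme_snippet are made up for this illustration and are not part of sklearn-pandas.

```python
import doctest

# A README-style snippet (hypothetical content): each line starting with
# ">>>" is executed, and the line after it is compared with the result.
README_SNIPPET = """
>>> pets = ['cat', 'dog', 'dog', 'fish']
>>> sorted(set(pets))
['cat', 'dog', 'fish']
"""

# Parse the text into a DocTest object and run it with a fresh namespace.
parser = doctest.DocTestParser()
test = parser.get_doctest(README_SNIPPET, globs={}, name="readme_snippet",
                          filename=None, lineno=0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# results.failed counts mismatches; results.attempted counts ">>>" examples.
print(results.failed, results.attempted)
```

A failed count of zero means every printed result in the snippet matched what the code actually produced.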

Usage

Import

Import what you need from the sklearn_pandas package. The choices are:

  • DataFrameMapper, a class for mapping pandas data frame columns to different sklearn transformations
  • cross_val_score, similar to sklearn.cross_validation.cross_val_score but working on pandas DataFrames

For this demonstration, we will import both:

>>> from sklearn_pandas import DataFrameMapper, cross_val_score

For these examples, we'll also use pandas and sklearn:

>>> import pandas as pd
>>> import sklearn.preprocessing, sklearn.decomposition, \
...     sklearn.linear_model, sklearn.pipeline, sklearn.metrics

Load some Data

Normally you'll read the data from a file, but for demonstration purposes I'll create a data frame from a Python dict:

>>> data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
...                      'children': [4., 6, 3, 3, 2, 3, 5, 4],
...                      'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

Transformation Mapping

Map the Columns to Transformations

The mapper takes a list of pairs. The first element of each pair is a column name from the pandas DataFrame (or a list of column names, as we will see later). The second element is an object that performs the transformation applied to that column:

>>> mapper = DataFrameMapper([
...     ('pet', sklearn.preprocessing.LabelBinarizer()),
...     ('children', sklearn.preprocessing.StandardScaler())
... ])

Test the Transformation

We can use the fit_transform shortcut to both fit the model and see what transformed data looks like:

>>> mapper.fit_transform(data)
array([[ 1.        ,  0.        ,  0.        ,  0.20851441],
       [ 0.        ,  1.        ,  0.        ,  1.87662973],
       [ 0.        ,  1.        ,  0.        , -0.62554324],
       [ 0.        ,  0.        ,  1.        , -0.62554324],
       [ 1.        ,  0.        ,  0.        , -1.4596009 ],
       [ 0.        ,  1.        ,  0.        , -0.62554324],
       [ 1.        ,  0.        ,  0.        ,  1.04257207],
       [ 0.        ,  0.        ,  1.        ,  0.20851441]])

Note that the first three columns are the output of the LabelBinarizer (corresponding to cat, dog, and fish respectively) and the fourth column is the standardized value for the number of children. In general, the columns are ordered according to the order given when the DataFrameMapper is constructed.
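To make the column layout concrete, the following sketch rebuilds the same array using only the Python standard library. It is an illustration of the idea, not how DataFrameMapper is implemented: the helper names binarize and standardize are hypothetical, and it assumes that labels are ordered alphabetically and that standardization divides by the population standard deviation, which is consistent with the output shown above.

```python
import math

data = {
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
}

def binarize(values):
    """One indicator column per distinct label, labels in sorted order."""
    classes = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in classes] for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [[(v - mean) / std] for v in values]

# (column name, transformation) pairs, in the same order as the mapper.
steps = [('pet', binarize), ('children', standardize)]

# Apply each transformation to its column, then join the pieces row by row,
# so the output columns follow the order of the pairs.
blocks = [transform(data[column]) for column, transform in steps]
features = [sum(row_parts, []) for row_parts in zip(*blocks)]

for row in features:
    print([round(x, 8) for x in row])
```

Swapping the two entries in steps would move the standardized children value to the first column, which is the ordering rule described above.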

Now that the transformation is trained, we confirm that it works on new data:

>>> mapper.transform({'pet': ['cat'], 'children': [5.]})
array([[ 1.        ,  0.        ,  0.        ,  1.04257207]])
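The new row is scaled with the mean and standard deviation learned from the original eight rows, not with statistics of the new data. The sketch below shows that split between fitting and transforming in plain Python. The function transform_row is a hypothetical helper for this illustration, and the same assumptions apply as before (alphabetical label order, population standard deviation).

```python
import math

train_pets = ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']
train_children = [4., 6, 3, 3, 2, 3, 5, 4]

# "Fit": learn the label order and scaling statistics from the training data.
classes = sorted(set(train_pets))
mean = sum(train_children) / len(train_children)
std = math.sqrt(
    sum((c - mean) ** 2 for c in train_children) / len(train_children)
)

def transform_row(pet, children):
    """Encode one new row using only the statistics learned above."""
    indicators = [1.0 if pet == c else 0.0 for c in classes]
    return indicators + [(children - mean) / std]

new_row = transform_row('cat', 5.0)
print([round(x, 8) for x in new_row])
```

With a training mean of 3.75 children, a value of 5 lands a little over one standard deviation above the mean, matching the last entry of the array above.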

Transform Multiple Columns

Transformations may require multiple input columns. In these cases, the column names can be specified in a list:

>>> mapper2 = DataFrameMapper([
...     (['children', 'salary'], sklearn.decomposition.PCA(1))
... ])

Now running fit_transform will run PCA on the children and salary columns and return the first principal component:

>>> mapper2.fit_transform(data)
array([[ 47.62288153],
       [-18.38596516],
       [  1.62873661],
       [-15.3709553 ],
       [-10.36602451],
       [ 16.62846476],
       [ -6.38116123],
       [-15.37597671]])

Cross-Validation

Now that we can combine features from pandas DataFrames, we may want to use cross-validation to see whether our model works. Scikit-learn provides features for cross-validation, but they expect numpy data structures and won't work with DataFrameMapper.

To get around this, sklearn-pandas provides a wrapper on sklearn's cross_val_score function which passes a pandas DataFrame to the estimator rather than a numpy array:

>>> pipe = sklearn.pipeline.Pipeline([
...     ('featurize', mapper),
...     ('lm', sklearn.linear_model.LinearRegression())])
>>> cross_val_score(pipe, data, data.salary, sklearn.metrics.mean_squared_error)
array([ 2018.185     ,     6.72033058,  1899.58333333])

Sklearn-pandas' cross_val_score function provides exactly the same interface as sklearn's function of the same name.
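The reason the data has to stay in DataFrame form is that the mapper selects columns by name, and those names are lost once rows are sliced out of a bare array. The sketch below shows the idea with a plain dict of columns standing in for a DataFrame. It is not the sklearn-pandas implementation: k_fold_indices and take_rows are hypothetical helpers, and the choice of three contiguous folds is an assumption made because the output above contains three scores.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train, test) row-index lists for contiguous, near-equal folds."""
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, test
        start = stop

def take_rows(table, rows):
    """Select rows from a dict of columns, keeping the column names."""
    return {name: [values[i] for i in rows] for name, values in table.items()}

data = {
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary': [90, 24, 44, 27, 32, 59, 36, 27],
}

folds = []
for train_idx, test_idx in k_fold_indices(len(data['pet']), 3):
    train = take_rows(data, train_idx)
    test = take_rows(data, test_idx)
    folds.append((train, test))
    # Each split is still addressable by column name, which is what a
    # name-based mapper step would need.
    print(sorted(test), len(train['pet']), len(test['pet']))
```

Each fold keeps the pet, children, and salary keys, so a step that looks up columns by name can be fitted on the training part and applied to the test part.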

Credit

The code for DataFrameMapper is based on code originally written by Ben Hamner.

sklearn-pandas -- bridge code for cross-validation of pandas data frames with sklearn

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

Paul Butler <paulgb@gmail.com>

The source code of DataFrameMapper is derived from code originally written by Ben Hamner and released under the following license.

Copyright (c) 2013, Ben Hamner
Author: Ben Hamner (ben@benhamner.com)
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
