<!DOCTYPE html>
<html class="theme-next pisces use-motion" lang="zh-Hans">
<head>
<meta charset="UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/>
<meta name="theme-color" content="#222">
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<link href="/lib/fancybox/source/jquery.fancybox.css?v=2.1.5" rel="stylesheet" type="text/css" />
<link href="/lib/font-awesome/css/font-awesome.min.css?v=4.6.2" rel="stylesheet" type="text/css" />
<link href="/css/main.css?v=5.1.4" rel="stylesheet" type="text/css" />
<link rel="apple-touch-icon" sizes="180x180" href="/images/apple-touch-icon-next.png?v=5.1.4">
<link rel="icon" type="image/png" sizes="32x32" href="/images/favicon-32x32-next.png?v=5.1.4">
<link rel="icon" type="image/png" sizes="16x16" href="/images/favicon-16x16-next.png?v=5.1.4">
<link rel="mask-icon" href="/images/logo.svg?v=5.1.4" color="#222">
<meta name="keywords" content="Hexo, NexT" />
<meta property="og:type" content="website">
<meta property="og:title" content="啰里啰嗦的圈圈">
<meta property="og:url" content="http://yoursite.com/index.html">
<meta property="og:site_name" content="啰里啰嗦的圈圈">
<meta property="og:locale" content="zh-Hans">
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="啰里啰嗦的圈圈">
<script type="text/javascript" id="hexo.configurations">
var NexT = window.NexT || {};
var CONFIG = {
root: '/',
scheme: 'Pisces',
version: '5.1.4',
sidebar: {"position":"left","display":"post","offset":12,"b2t":false,"scrollpercent":false,"onmobile":false},
fancybox: true,
tabs: true,
motion: {"enable":true,"async":false,"transition":{"post_block":"fadeIn","post_header":"slideDownIn","post_body":"slideDownIn","coll_header":"slideLeftIn","sidebar":"slideUpIn"}},
duoshuo: {
userId: '0',
author: '博主'
},
algolia: {
applicationID: '',
apiKey: '',
indexName: '',
hits: {"per_page":10},
labels: {"input_placeholder":"Search for Posts","hits_empty":"We didn't find any results for the search: ${query}","hits_stats":"${hits} results found in ${time} ms"}
}
};
</script>
<link rel="canonical" href="http://yoursite.com/"/>
<title>啰里啰嗦的圈圈</title>
</head>
<body itemscope itemtype="http://schema.org/WebPage" lang="zh-Hans">
<div class="container sidebar-position-left
page-home">
<div class="headband"></div>
<a href="https://github.com/LEIYUAN9759/LEIYUAN9759.github.io"><img style="position: absolute; top: 0; right: 0; border: 0;" src="https://s3.amazonaws.com/github/ribbons/forkme_right_white_ffffff.png" alt="Fork me on GitHub"></a>
<header id="header" class="header" itemscope itemtype="http://schema.org/WPHeader">
<div class="header-inner"><div class="site-brand-wrapper">
<div class="site-meta ">
<div class="custom-logo-site-title">
<a href="/" class="brand" rel="start">
<span class="logo-line-before"><i></i></span>
<span class="site-title">啰里啰嗦的圈圈</span>
<span class="logo-line-after"><i></i></span>
</a>
</div>
<p class="site-subtitle"></p>
</div>
<div class="site-nav-toggle">
<button>
<span class="btn-bar"></span>
<span class="btn-bar"></span>
<span class="btn-bar"></span>
</button>
</div>
</div>
<nav class="site-nav">
<ul id="menu" class="menu">
<li class="menu-item menu-item-home">
<a href="/" rel="section">
<i class="menu-item-icon fa fa-fw fa-home"></i> <br />
Home
</a>
</li>
<li class="menu-item menu-item-categories">
<a href="/categories/" rel="section">
<i class="menu-item-icon fa fa-fw fa-th"></i> <br />
Categories
</a>
</li>
<li class="menu-item menu-item-archives">
<a href="/archives/" rel="section">
<i class="menu-item-icon fa fa-fw fa-archive"></i> <br />
Archives
</a>
</li>
<li class="menu-item menu-item-tags">
<a href="/tags" rel="section">
<i class="menu-item-icon fa fa-fw fa-question-circle"></i> <br />
Tags
</a>
</li>
<li class="menu-item menu-item-about">
<a href="/about" rel="section">
<i class="menu-item-icon fa fa-fw fa-question-circle"></i> <br />
About
</a>
</li>
</ul>
</nav>
</div>
</header>
<main id="main" class="main">
<div class="main-inner">
<div class="content-wrap">
<div id="content" class="content">
<section id="posts" class="posts-expand">
<article class="post post-type-normal" itemscope itemtype="http://schema.org/Article">
<div class="post-block">
<link itemprop="mainEntityOfPage" href="http://yoursite.com/2018/10/11/机器学习-2/">
<span hidden itemprop="author" itemscope itemtype="http://schema.org/Person">
<meta itemprop="name" content="雷源">
<meta itemprop="description" content="">
<meta itemprop="image" content="/images/blog_logo2.jpg">
</span>
<span hidden itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
<meta itemprop="name" content="啰里啰嗦的圈圈">
</span>
<header class="post-header">
<h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/10/11/机器学习-2/" itemprop="url">机器学习之k-近邻算法</a></h1>
<div class="post-meta">
<span class="post-time">
<span class="post-meta-item-icon">
<i class="fa fa-calendar-o"></i>
</span>
<span class="post-meta-item-text">发表于</span>
<time title="创建于" itemprop="dateCreated datePublished" datetime="2018-10-11T20:12:48+08:00">
2018-10-11
</time>
</span>
</div>
</header>
<div class="post-body" itemprop="articleBody">
<p>Machine learning algorithms come in two flavors, supervised learning and unsupervised learning:</p>
<ul>
<li>Supervised learning builds a model from the input data and uses it to predict results for new data. The input data consists of feature values plus target values. The output can be a continuous value (called regression) or one of a finite set of discrete values (called classification).</li>
<li>Unsupervised learning likewise builds a model from the input data and uses it to infer results for new data, but the input data consists of feature values only.</li>
</ul>
<p>The key difference between the two is whether the data carries target values. Here is a quick overview of the algorithms I have come across so far.</p>
<p>Supervised learning mainly includes:</p>
<ul>
<li>Classification: k-nearest neighbors, decision trees, naive Bayes, logistic regression, support vector machines, and so on</li>
<li>Regression: linear regression, ridge regression</li>
<li>Labeling: hidden Markov models</li>
</ul>
<p>Unsupervised learning is mainly clustering, e.g. the k-means algorithm.</p>
<p>Today's topic is the simplest machine learning algorithm, k-nearest neighbors. Naturally I call scikit-learn's API; I certainly cannot write it from scratch yet...</p>
<h3 id="k-近邻算法的简要介绍"><a href="#k-近邻算法的简要介绍" class="headerlink" title="k-近邻算法的简要介绍"></a>k-近邻算法的简要介绍</h3><p>k-近邻算法是通过测量不同特征值之间的距离来进行分类</p>
<blockquote>
<p>Advantages: high accuracy, insensitive to outliers, no assumptions about the input data</p>
</blockquote>
<blockquote>
<p>Disadvantages: high computational complexity and high space complexity</p>
</blockquote>
<blockquote>
<p>Applicable data: numeric and nominal values</p>
</blockquote>
<p>Let's walk through k-NN with a movie-classification example.</p>
<p>Suppose there are two genres of movies, action and romance. Action movies contain more fight scenes, while romance movies contain relatively more kissing scenes. Of course an action movie also has some kissing scenes, and a romance movie some fight scenes, so we cannot classify a film merely by whether fight or kiss scenes exist. Now say we have 6 movies whose genres are known, together with their counts of fight and kiss scenes, plus one movie of unknown genre.<br><img src="/2018/10/11/机器学习-2/电影数据.PNG" alt="movie data"><br>We classify it with k-NN as follows: we have a training set of M labeled samples, each with known features and class. For the unknown sample, we compute its distance to each of the M training samples, sort the distances, and take the K nearest, where K is usually no larger than 20. The class that occurs most often among those K neighbors is the prediction, and by this rule we judge the genre of the unknown movie.<br><img src="/2018/10/11/机器学习-2/电影距离数据.PNG" alt="distance data"><br>If we take K = 3, the three nearest movies are all romance movies, so we classify the unknown movie as a romance movie. That is the basic idea behind k-NN.</p>
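<p>As a minimal sketch of this idea in plain numpy (the scene counts below are made up for illustration):</p>
<figure class="highlight python"><pre><code>import numpy as np
from collections import Counter

# Hypothetical training data: [fight scenes, kiss scenes] for six labeled films
X_train = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
y_train = ['romance', 'romance', 'romance', 'action', 'action', 'action']
unknown = np.array([18, 90])  # the film we want to classify

# Euclidean distance from the unknown film to every training sample
dists = np.sqrt(((X_train - unknown) ** 2).sum(axis=1))

k = 3
nearest = np.argsort(dists)[:k]               # indices of the k closest films
votes = Counter(y_train[i] for i in nearest)  # majority vote among them
print(votes.most_common(1)[0][0])             # -&gt; 'romance'</code></pre></figure>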
<h3 id="k-近邻算法的scikit-learn"><a href="#k-近邻算法的scikit-learn" class="headerlink" title="k-近邻算法的scikit-learn"></a>k-近邻算法的scikit-learn</h3><p>k-近邻算法所调用的API为:<br><code>sklearn.neighbors.KNeighborsClassifier(n_neighbors=5,algorithm='auto')</code></p>
<p>n_neighbors: int (default 5). algorithm: one of {'auto', 'ball_tree', 'kd_tree', 'brute'}, the algorithm used to find the nearest neighbors: 'ball_tree' uses a BallTree, 'kd_tree' uses a KDTree, and 'auto' tries to choose the most appropriate algorithm from the values passed to fit. (The implementations differ only in efficiency.)</p>
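<p>Before the full case study, a minimal usage sketch on toy data:</p>
<figure class="highlight python"><pre><code>from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]   # one feature per sample
y = [0, 0, 1, 1]           # two classes
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.1]]))  # -&gt; [0]: two of its three nearest neighbors are class 0</code></pre></figure>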
<h3 id="案例"><a href="#案例" class="headerlink" title="案例"></a>案例</h3><p>这里使用kaggle中的一个预测入住位置的一个案例,来进行k-近邻算法的练习. 数据来源<a href="https://www.kaggle.com/c/facebook-v-predicting-check-ins" target="_blank" rel="noopener">https://www.kaggle.com/c/facebook-v-predicting-check-ins</a></p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.model_selection <span class="keyword">import</span> train_test_split</span><br><span class="line"><span class="keyword">from</span> sklearn.neighbors <span class="keyword">import</span> KNeighborsClassifier</span><br><span class="line"><span class="keyword">from</span> sklearn.preprocessing <span class="keyword">import</span> StandardScaler</span><br><span class="line"><span class="keyword">import</span> pandas <span class="keyword">as</span> pd</span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">knncls</span><span class="params">()</span>:</span></span><br><span class="line"> <span class="string">"""</span></span><br><span class="line"><span class="string"> K-近邻预测用户签到位置</span></span><br><span class="line"><span class="string"> :return:None</span></span><br><span class="line"><span class="string"> """</span></span><br><span class="line"> <span class="comment"># 读取数据</span></span><br><span class="line"> 
data = pd.read_csv(<span class="string">"./data/FBlocation/train.csv"</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># print(data.head(10))</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 处理数据</span></span><br><span class="line"> <span class="comment"># 1、缩小数据,查询数据晒讯</span></span><br><span class="line"> data = data.query(<span class="string">"x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75"</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 处理时间的数据</span></span><br><span class="line"> time_value = pd.to_datetime(data[<span class="string">'time'</span>], unit=<span class="string">'s'</span>)</span><br><span class="line"></span><br><span class="line"> print(time_value)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 把日期格式转换成 字典格式</span></span><br><span class="line"> time_value = pd.DatetimeIndex(time_value)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 构造一些特征</span></span><br><span class="line"> data[<span class="string">'day'</span>] = time_value.day</span><br><span class="line"> data[<span class="string">'hour'</span>] = time_value.hour</span><br><span class="line"> data[<span class="string">'weekday'</span>] = time_value.weekday</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 把时间戳特征删除</span></span><br><span class="line"> data = data.drop([<span class="string">'time'</span>], axis=<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line"> print(data)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 把签到数量少于n个目标位置删除</span></span><br><span class="line"> place_count = data.groupby(<span class="string">'place_id'</span>).count()</span><br><span class="line"></span><br><span class="line"> tf = place_count[place_count.row_id > <span class="number">3</span>].reset_index()</span><br><span class="line"></span><br><span class="line"> data = data[data[<span class="string">'place_id'</span>].isin(tf.place_id)]</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 取出数据当中的特征值和目标值</span></span><br><span class="line"> y = data[<span class="string">'place_id'</span>]</span><br><span class="line"></span><br><span class="line"> x = data.drop([<span class="string">'place_id'</span>], axis=<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 进行数据的分割训练集合测试集</span></span><br><span class="line"> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=<span class="number">0.25</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 特征工程(标准化)</span></span><br><span class="line"> std = StandardScaler()</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 对测试集和训练集的特征值进行标准化</span></span><br><span class="line"> x_train = std.fit_transform(x_train)</span><br><span class="line"></span><br><span class="line"> x_test = std.transform(x_test)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 进行算法流程 # 超参数</span></span><br><span class="line"> knn = KNeighborsClassifier()</span><br><span class="line"></span><br><span class="line"> <span class="comment"># fit, predict,score</span></span><br><span class="line"> knn.fit(x_train, y_train)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 
得出预测结果</span></span><br><span class="line"> y_predict = knn.predict(x_test)</span><br><span class="line"></span><br><span class="line"> print(<span class="string">"预测的目标签到位置为:"</span>, y_predict)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 得出准确率</span></span><br><span class="line"> print(<span class="string">"预测的准确率:"</span>, knn.score(x_test, y_test))</span><br><span class="line"> <span class="keyword">return</span> <span class="keyword">None</span></span><br><span class="line"></span><br><span class="line"> <span class="keyword">if</span> __name__ == <span class="string">"__main__"</span>:</span><br><span class="line"> knncls()</span><br></pre></td></tr></table></figure>
<p>That's all for today~</p>
</div>
<div>
</div>
<footer class="post-footer">
<div class="post-eof"></div>
</footer>
</div>
</article>
<article class="post post-type-normal" itemscope itemtype="http://schema.org/Article">
<div class="post-block">
<link itemprop="mainEntityOfPage" href="http://yoursite.com/2018/10/10/机器学习-1/">
<span hidden itemprop="author" itemscope itemtype="http://schema.org/Person">
<meta itemprop="name" content="雷源">
<meta itemprop="description" content="">
<meta itemprop="image" content="/images/blog_logo2.jpg">
</span>
<span hidden itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
<meta itemprop="name" content="啰里啰嗦的圈圈">
</span>
<header class="post-header">
<h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/10/10/机器学习-1/" itemprop="url">机器学习</a></h1>
<div class="post-meta">
<span class="post-time">
<span class="post-meta-item-icon">
<i class="fa fa-calendar-o"></i>
</span>
<span class="post-meta-item-text">发表于</span>
<time title="创建于" itemprop="dateCreated datePublished" datetime="2018-10-10T18:48:27+08:00">
2018-10-10
</time>
</span>
</div>
</header>
<div class="post-body" itemprop="articleBody">
<h3 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h3><p>机器学习似乎是个很火的话题,人工智能更是浪潮之巅.而我既然学习了Python当然不能免俗,在国庆假期里偷偷看了一些机器学习的基本算法.发现以前自己感觉很专业的名词,比如朴素贝叶斯,随机森林,逻辑回归等等,在前人的基础上,调用API完成功能的代码,居然只有短短的几行.当然这也有可能是我只是刚刚看了一下scikitlearn这种封装的比较好的模块.接下来要把TensorFlow也粗浅的学一下,继续探索一下机器学习这个高深的领域.</p>
<h3 id="特征工程"><a href="#特征工程" class="headerlink" title="特征工程"></a>特征工程</h3><p>从数据中抽取出来的对预测结果有用的信息,通过专业的技巧进行数据处理,使得特征能在机器学习算法中发挥更好的作用。在数据获取的阶段 最初的原始特征数据集可能太大,或者信息冗余,因此在机器学习的应用中,一个初始步骤就是选择特征的子集,或构建一套新的特征集,减少功能来促进算法的学习,提高泛化能力和可解释性,这也就是我们所说的特征工程。<br>在特征工程阶段,我们主要的工作就是对数据的特征进行抽取,然后进行特征的预处理,最后进行特征的选择,done~~ 数据就处理成我们需要的格式啦!<br>今天就来介绍一下特征工程的各个阶段的各种操作.</p>
<h3 id="特征抽取"><a href="#特征抽取" class="headerlink" title="特征抽取"></a>特征抽取</h3><h4 id="字典特征抽取"><a href="#字典特征抽取" class="headerlink" title="字典特征抽取"></a>字典特征抽取</h4><p>字典特征抽取调用的类:<code>sklearn.feature_extraction.DictVectorizer</code>.<br>将映射列表转换为Numpy数组或scipy.sparse矩阵<br>DictVectorizer的语法为:<br><code>DictVectorizer(saprse=True,....)</code><br>在其中包含的方法有:</p>
<ul>
<li><code>DictVectorizer.fit_transform(x)</code> x: a dict or an iterator of dicts; returns a sparse matrix (or an array when sparse=False)</li>
<li><code>DictVectorizer.inverse_transform(x)</code><br>x: an array or a sparse matrix; returns the data in its original format</li>
<li><code>DictVectorizer.get_feature_names()</code> returns the category names</li>
</ul>
<p>That sounds rather abstract, so here is a small demonstration.<br>Given this list of dicts:<br><code>[{'city': '北京','temperature':100}
{'city': '上海','temperature':60}
{'city': '深圳','temperature':30}]</code><br>just a few lines of code turn it into a numpy array:<br><figure class="highlight python"><pre><code>from sklearn.feature_extraction import DictVectorizer

# sparse=False returns a dense numpy array instead of a sparse matrix
dv = DictVectorizer(sparse=False)
# call fit_transform on the list of dicts
data = dv.fit_transform([{'city': '北京', 'temperature': 100},
                         {'city': '上海', 'temperature': 60},
                         {'city': '深圳', 'temperature': 30}])
print(dv.get_feature_names())
print(dv.inverse_transform(data))
print(data)</code></pre></figure></p>
<p>The array uses one-hot encoding. Here is a small example of this encoding, which is hard to capture in words. First, suppose we have a classification like the one in the figures:</p>
<p><img src="/2018/10/10/机器学习-1/one_hot1.png" alt="one-hot example, before"> <img src="/2018/10/10/机器学习-1/one_hot2.png" alt="one-hot example, after">, and now it should be clear.</p>
<h4 id="文本特征抽取"><a href="#文本特征抽取" class="headerlink" title="文本特征抽取"></a>文本特征抽取</h4><p>文本特征抽取调用的类<code>sklearn.feature_extraction.text.CountVectorizer()</code><br>类似的CountVectorizer的语法为:<br><code>CountVectorizer(max_df=1.0,min_df=1,....)</code>,返回词频矩阵,这里的max_df和min_df有整数和小数两种形式,<br>在其中包含的方法有:</p>
<ul>
<li><code>CountVectorizer.fit_transform(x)</code> x: an iterable of raw text documents; returns a sparse matrix</li>
<li><code>CountVectorizer.inverse_transform(x)</code><br>x: an array or a sparse matrix; returns the data in its original format</li>
<li><code>CountVectorizer.get_feature_names()</code> returns the list of words</li>
</ul>
<p>Concretely:<br><figure class="highlight python"><pre><code>from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data = cv.fit_transform(["life is short,i like python",
                         "life is too long,i dislike python"])
print(cv.get_feature_names())
print(data.toarray())</code></pre></figure></p>
<p>Of course, this only works for English, because the tokenizer splits on whitespace by default. To extract text features from Chinese, we first have to segment it with jieba and join the tokens into space-separated strings. A bigger example:<br><figure class="highlight python"><pre><code>from sklearn.feature_extraction.text import CountVectorizer
import jieba


def cutword():
    con1 = jieba.cut("今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。")
    con2 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。")
    con3 = jieba.cut("如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")
    # materialize the generators into lists
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)
    # join each list into a space-separated string
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)
    return c1, c2, c3


def hanzivec():
    c1, c2, c3 = cutword()
    print(c1, c2, c3)
    cv = CountVectorizer()
    data = cv.fit_transform([c1, c2, c3])
    print(cv.get_feature_names())
    print(data.toarray())
    return None


if __name__ == "__main__":
    hanzivec()</code></pre></figure></p>
<h3 id="特征处理"><a href="#特征处理" class="headerlink" title="特征处理"></a>特征处理</h3><p>通过特定的统计方法(数学方法)将数据转换成算法要求的数据<br>特征处理的方法主要有:归一化,标准化</p>
<h5 id="归一化"><a href="#归一化" class="headerlink" title="归一化"></a>归一化</h5><p>归一化首先在特征(维度)非常多的时候,可以防止某一维或某几维对数据影响过大,也是为了把不同来源的数据统一到一个参考区间下,这样比较起来才有意义,其次可以程序可以运行更快。 例如:一个人的身高和体重两个特征,假如体重50kg,身高175cm,由于两个单位不一样,数值大小不一样。如果比较两个人的体型差距时,那么身高的影响结果会比较大.<br>归一化数学上已经学过很多次了,这里就直接给出scikit_learn的相关操作.<br>归一化API为<code>sklearn.preprocessing.MinMaxScaler</code><br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.preprocessing <span class="keyword">import</span> MinMaxScaler</span><br><span class="line">mm = MinMaxScaler(feature_range=(<span class="number">2</span>, <span class="number">3</span>))</span><br><span class="line">data = mm.fit_transform([[<span class="number">90</span>,<span class="number">2</span>,<span class="number">10</span>,<span class="number">40</span>],[<span class="number">60</span>,<span class="number">4</span>,<span class="number">15</span>,<span class="number">45</span>],[<span class="number">75</span>,<span class="number">3</span>,<span class="number">13</span>,<span class="number">46</span>]])</span><br><span class="line">print(data)</span><br></pre></td></tr></table></figure></p>
<h5 id="标准化"><a href="#标准化" class="headerlink" title="标准化"></a>标准化</h5><p>标准化是通过对原始数据进行变换把数据变换到均值为0,方差为1范围内,处理公式为<img src="/2018/10/10/机器学习-1/标准化.png" alt="blockchain">,其中μ是样本的均值,σ是样本的标准差,它们可以通过现有的样本进行估计,在已有的样本足够多的情况下比较稳定,适合嘈杂的数据场景.<br>sklearn标准化API:<code>scikit-learn.preprocessing.StandardScaler</code></p>
<p>Its attributes include:</p>
<ul>
<li><code>StandardScaler.mean_</code>: the mean of each column of the original data</li>
<li><code>StandardScaler.std_</code>: the standard deviation of each column of the original data</li>
</ul>
<p>Another example:<br><figure class="highlight python"><pre><code>from sklearn.preprocessing import StandardScaler

std = StandardScaler()
data = std.fit_transform([[1., -1., 3.], [2., 4., 2.], [4., 6., -1.]])
print(data)</code></pre></figure></p>
<h5 id="缺失值处理"><a href="#缺失值处理" class="headerlink" title="缺失值处理"></a>缺失值处理</h5><p>由于各种原因,许多现实世界的数据集包含缺少的值,通常编码为空白,NaN或其他占位符。然而,这样的数据集与scikit的分类器不兼容,它们假设数组中的所有值都是数字,并且都具有和保持含义。使用不完整数据集的基本策略是丢弃包含缺失值的整个行和/或列。然而,这是以丢失可能是有价值的数据(即使不完整)的代价。更好的策略是估算缺失值,即从已知部分的数据中推断它们。</p>
<p>(1) Filling missing values<br>Use the Imputer class from sklearn.preprocessing.</p>
<p>Its usage is <code>Imputer(missing_values='NaN', strategy='mean', axis=0)</code>; again, an example:<br><figure class="highlight python"><pre><code>import numpy as np
from sklearn.preprocessing import Imputer

im = Imputer(missing_values='NaN', strategy='mean', axis=0)
data = im.fit_transform([[1, 2], [np.nan, 3], [7, 6]])
print(data)</code></pre></figure></p>
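<p>Note that <code>Imputer</code> was removed from scikit-learn in release 0.22; in current versions the equivalent class is <code>sklearn.impute.SimpleImputer</code>. A sketch of the same example:</p>
<figure class="highlight python"><pre><code>import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaNs with the column-wise mean
im = SimpleImputer(missing_values=np.nan, strategy='mean')
data = im.fit_transform([[1, 2], [np.nan, 3], [7, 6]])
print(data)  # the NaN becomes (1 + 7) / 2 = 4</code></pre></figure>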
<p>(2) Replacing missing values<br>Use fillna from pandas to fill them in.</p>
<h5 id="降维"><a href="#降维" class="headerlink" title="降维"></a>降维</h5><p>数据降维也是我们在特征预处理中经常进行的操作,一般我们使用成分分析法PCA( Principal component analysis)。<br>这里的降维只是特征数量的减少,和我们想象中的不一样哦.<br><img src="/2018/10/10/机器学习-1/水壶.png" alt="blockchain"><br>我们可以形象的利用这个水壶来举例子,PCA法作用就是如何在在信息损失最小的情况下,用一张二维图来表示水壶.</p>
<p> PCA keeps the features of the dataset that contribute most to the variance, but it is extremely sensitive to the ranges of the features, so always standardize the features before applying PCA; only then does every dimension carry equal weight.<br>Usage: <code>PCA(n_components=None)</code>, where n_components, given as a float such as 0.9, is the fraction of information to retain: the reduced data preserves 90% of the variance.</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.decomposition <span class="keyword">import</span> PCA</span><br><span class="line">pca = PCA(n_components=<span class="number">0.9</span>)</span><br><span class="line">data = pca.fit_transform([[<span class="number">2</span>,<span class="number">8</span>,<span class="number">4</span>,<span class="number">5</span>],[<span class="number">6</span>,<span class="number">3</span>,<span class="number">0</span>,<span class="number">8</span>],[<span class="number">5</span>,<span class="number">4</span>,<span class="number">9</span>,<span class="number">1</span>]])</span><br><span class="line">print(data)</span><br></pre></td></tr></table></figure>
<h4 id="特征选择"><a href="#特征选择" class="headerlink" title="特征选择"></a>特征选择</h4><p>特征选择是为了减少冗余,减小计算机的性能消耗(因为部分特征相关度高),同时消除部分噪点特征.在所有特征中选择部分作为训练集特征.<br>主要有以下三种方法:</p>
<ul>
<li>Filter methods: VarianceThreshold</li>
<li>Embedded methods: regularization (used in regression)</li>
<li>Wrapper methods</li>
</ul>
<p>The filter method VarianceThreshold is one of the basic feature-selection techniques. It removes every feature whose variance does not reach a threshold; with the default settings it removes all zero-variance features, i.e. features whose value is identical across all samples.<br>Suppose we want to remove boolean features that are 1 or 0 in more than 80% of the samples. The variance of a Bernoulli variable is p(1 - p), so the threshold is .8 * (1 - .8) = 0.16:<br><figure class="highlight python"><pre><code>from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
print(sel.fit_transform(X))
# Expected output: the first column, which is 0 in more than 80% of samples,
# is removed:
# [[0 1]
#  [1 0]
#  [0 0]
#  [1 1]
#  [1 0]
#  [1 1]]</code></pre></figure></p>
</div>
<div>
</div>
<footer class="post-footer">
<div class="post-eof"></div>
</footer>
</div>
</article>
<article class="post post-type-normal" itemscope itemtype="http://schema.org/Article">
<div class="post-block">
<link itemprop="mainEntityOfPage" href="http://yoursite.com/2018/10/10/机器学习-1/机器学习/">
<span hidden itemprop="author" itemscope itemtype="http://schema.org/Person">
<meta itemprop="name" content="雷源">
<meta itemprop="description" content="">
<meta itemprop="image" content="/images/blog_logo2.jpg">
</span>
<span hidden itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
<meta itemprop="name" content="啰里啰嗦的圈圈">
</span>
<header class="post-header">
<h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/10/10/机器学习-1/机器学习/" itemprop="url">机器学习</a></h1>
<div class="post-meta">
<span class="post-time">
<span class="post-meta-item-icon">
<i class="fa fa-calendar-o"></i>
</span>
<span class="post-meta-item-text">发表于</span>
<time title="创建于" itemprop="dateCreated datePublished" datetime="2018-10-10T18:48:27+08:00">
2018-10-10
</time>
</span>
</div>
</header>
<div class="post-body" itemprop="articleBody">
<h3 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h3><p>机器学习似乎是个很火的话题,人工智能更是浪潮之巅.而我既然学习了Python当然不能免俗,在国庆假期里偷偷看了一些机器学习的基本算法.发现以前自己感觉很专业的名词,比如朴素贝叶斯,随机森林,逻辑回归等等,在前人的基础上,调用API完成功能的代码,居然只有短短的几行.当然这也有可能是我只是刚刚看了一下scikitlearn这种封装的比较好的模块.接下来要把TensorFlow也粗浅的学一下,继续探索一下机器学习这个高深的领域.</p>
<h3 id="特征工程"><a href="#特征工程" class="headerlink" title="特征工程"></a>特征工程</h3><p>从数据中抽取出来的对预测结果有用的信息,通过专业的技巧进行数据处理,使得特征能在机器学习算法中发挥更好的作用。在数据获取的阶段 最初的原始特征数据集可能太大,或者信息冗余,因此在机器学习的应用中,一个初始步骤就是选择特征的子集,或构建一套新的特征集,减少功能来促进算法的学习,提高泛化能力和可解释性,这也就是我们所说的特征工程。<br>在特征工程阶段,我们主要的工作就是对数据的特征进行抽取,然后进行特征的预处理,最后进行特征的选择,done~~ 数据就处理成我们需要的格式啦!<br>今天就来介绍一下特征工程的各个阶段的各种操作.</p>
<h3 id="特征抽取"><a href="#特征抽取" class="headerlink" title="特征抽取"></a>特征抽取</h3><h4 id="字典特征抽取"><a href="#字典特征抽取" class="headerlink" title="字典特征抽取"></a>字典特征抽取</h4><p>字典特征抽取调用的类:<code>sklearn.feature_extraction.DictVectorizer</code>.<br>将映射列表转换为Numpy数组或scipy.sparse矩阵<br>DictVectorizer的语法为:<br><code>DictVectorizer(saprse=True,....)</code><br>在其中包含的方法有:</p>
<ul>
<li><code>DictVectorizer.fit_transform(x)</code> x:为字典或者包含字典的迭代器 返回值为一个sparse矩阵</li>
<li><code>DictVectorizer.fit_transform(x)</code><br>x: array数组或者是sparse矩阵 返回值为转换之前的数据格式</li>
<li><code>DictVectorizer.get_feature_names()</code> 返回类别名称</li>
</ul>
<p>这么说起来实在是太抽象了,我们举个栗子演示一下:<br>有这么一个列表<br><code>[{'city': '北京','temperature':100}
{'city': '上海','temperature':60}
{'city': '深圳','temperature':30}]</code><br>只需要通过一下几行代码,这个字典就可以被转换成numpy数组啦<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.feature_extraction <span class="keyword">import</span> DictVectorizer</span><br><span class="line">dict = DictVectorizer(sparse=<span class="keyword">False</span>)</span><br><span class="line"><span class="comment"># 调用fit_transform</span></span><br><span class="line">data = dict.fit_transform([{<span class="string">'city'</span>: <span class="string">'北京'</span>,<span class="string">'temperature'</span>: <span class="number">100</span>}, {<span class="string">'city'</span>: <span class="string">'上海'</span>,<span class="string">'temperature'</span>:<span class="number">60</span>}, {<span class="string">'city'</span>: <span class="string">'深圳'</span>,<span class="string">'temperature'</span>: <span class="number">30</span>}])</span><br><span class="line">print(dict.get_feature_names())</span><br><span class="line">print(dict.inverse_transform(data))</span><br><span class="line">print(data)</span><br></pre></td></tr></table></figure></p>
<p>这里数组采用的是one_hot编码方式,这里也举个小例子来表示一下,这个难以用语言来形容的编码. 首先,我们有如图的这样一个分类</p>
<p><img src="/2018/10/10/机器学习-1/机器学习/./one_hot1.png" alt="blockchain"> <img src="/2018/10/10/机器学习-1/机器学习/./one_hot2.png" alt="blockchain">,这样就明白了吧.</p>
<h4 id="文本特征抽取"><a href="#文本特征抽取" class="headerlink" title="文本特征抽取"></a>文本特征抽取</h4><p>文本特征抽取调用的类<code>sklearn.feature_extraction.text.CountVectorizer()</code><br>类似的CountVectorizer的语法为:<br><code>CountVectorizer(max_df=1.0,min_df=1,....)</code>,返回词频矩阵,这里的max_df和min_df有整数和小数两种形式,<br>在其中包含的方法有:</p>
<ul>
<li><code>CountVectorizer.fit_transform(x)</code> x:为字典或者包含字典的迭代器 返回值为一个sparse矩阵</li>
<li><code>CountVectorizer.fit_transform(x)</code><br>x: array数组或者是sparse矩阵 返回值为转换之前的数据格式</li>
<li><code>CountVectorizer.get_feature_names()</code> 返回单词列表</li>
</ul>
<p>具体代码:<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">cv = CountVectorizer()</span><br><span class="line"><span class="keyword">from</span> sklearn.feature_extraction.text <span class="keyword">import</span> CountVectorizer</span><br><span class="line">data = cv.fit_transform([<span class="string">"life is short,i like python"</span>,<span class="string">"life is too long,i dislike python"</span>])</span><br><span class="line">print(cv.get_feature_names())</span><br><span class="line">print(data.toarray())</span><br></pre></td></tr></table></figure></p>
<p>当然,这里只能处理英文,因为文本抽取默认是按照空格来进行分词的,如果我们要对中文来进行文本特征抽取,首先要使用jieba来分词,并拼成由空格分隔的字符串. 举个大栗子:<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.feature_extraction.text <span class="keyword">import</span> CountVectorizer</span><br><span class="line"><span class="keyword">import</span> jieba</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">cutword</span><span class="params">()</span>:</span></span><br><span class="line"> con1 = jieba.cut(<span class="string">"今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。"</span>)</span><br><span class="line"> con2 = jieba.cut(<span class="string">"我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。"</span>)</span><br><span class="line"> con3 = jieba.cut(<span class="string">"如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。"</span>)</span><br><span class="line"> <span class="comment"># 转换成列表</span></span><br><span class="line"> content1 = list(con1)</span><br><span class="line"> content2 = list(con2)</span><br><span class="line"> content3 = list(con3)</span><br><span class="line"> <span class="comment"># 吧列表转换成字符串</span></span><br><span class="line"> c1 = <span class="string">' '</span>.join(content1)</span><br><span class="line"> c2 = <span class="string">' '</span>.join(content2)</span><br><span class="line"> c3 = <span class="string">' '</span>.join(content3)</span><br><span class="line"> <span class="keyword">return</span> c1, c2, c3</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">hanzivec</span><span class="params">()</span>:</span></span><br><span class="line"> c1, c2, c3 = cutword()</span><br><span class="line"> print(c1, c2, c3)</span><br><span class="line"> cv = CountVectorizer()</span><br><span class="line"> data = cv.fit_transform([c1, c2, c3])</span><br><span class="line"> print(cv.get_feature_names())</span><br><span class="line"> print(data.toarray())</span><br><span class="line"> <span class="keyword">return</span> <span class="keyword">None</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">"__main__"</span>:</span><br><span class="line"> hanzivec()</span><br></pre></td></tr></table></figure></p>
<h3 id="特征处理"><a href="#特征处理" class="headerlink" title="特征处理"></a>特征处理</h3><p>通过特定的统计方法(数学方法)将数据转换成算法要求的数据<br>特征处理的方法主要有:归一化,标准化</p>
<h5 id="归一化"><a href="#归一化" class="headerlink" title="归一化"></a>归一化</h5><p>归一化首先在特征(维度)非常多的时候,可以防止某一维或某几维对数据影响过大,也是为了把不同来源的数据统一到一个参考区间下,这样比较起来才有意义,其次可以程序可以运行更快。 例如:一个人的身高和体重两个特征,假如体重50kg,身高175cm,由于两个单位不一样,数值大小不一样。如果比较两个人的体型差距时,那么身高的影响结果会比较大.<br>归一化数学上已经学过很多次了,这里就直接给出scikit_learn的相关操作.<br>归一化API为<code>sklearn.preprocessing.MinMaxScaler</code><br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.preprocessing <span class="keyword">import</span> MinMaxScaler</span><br><span class="line">mm = MinMaxScaler(feature_range=(<span class="number">2</span>, <span class="number">3</span>))</span><br><span class="line">data = mm.fit_transform([[<span class="number">90</span>,<span class="number">2</span>,<span class="number">10</span>,<span class="number">40</span>],[<span class="number">60</span>,<span class="number">4</span>,<span class="number">15</span>,<span class="number">45</span>],[<span class="number">75</span>,<span class="number">3</span>,<span class="number">13</span>,<span class="number">46</span>]])</span><br><span class="line">print(data)</span><br></pre></td></tr></table></figure></p>
<h5 id="标准化"><a href="#标准化" class="headerlink" title="标准化"></a>标准化</h5><p>标准化是通过对原始数据进行变换把数据变换到均值为0,方差为1范围内,处理公式为<img src="/2018/10/10/机器学习-1/机器学习/./标准化.png" alt="blockchain">,其中μ是样本的均值,σ是样本的标准差,它们可以通过现有的样本进行估计,在已有的样本足够多的情况下比较稳定,适合嘈杂的数据场景.<br>sklearn标准化API:<code>scikit-learn.preprocessing.StandardScaler</code></p>
<p>其中包含的方法有,- <code>StandardScaler.mean_</code>返回原始数据每列的平均值</p>
<ul>
<li><code>StandardScaler.std_</code>, 原始数据每列的方差</li>
</ul>
<p>同样举个栗子:<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.preprocessing <span class="keyword">import</span> StandardScaler</span><br><span class="line">std = StandardScaler()</span><br><span class="line">data = std.fit_transform([[ <span class="number">1.</span>, <span class="number">-1.</span>, <span class="number">3.</span>],[ <span class="number">2.</span>, <span class="number">4.</span>, <span class="number">2.</span>],[ <span class="number">4.</span>, <span class="number">6.</span>, <span class="number">-1.</span>]])</span><br><span class="line">print(data)</span><br></pre></td></tr></table></figure></p>
<h5 id="缺失值处理"><a href="#缺失值处理" class="headerlink" title="缺失值处理"></a>缺失值处理</h5><p>由于各种原因,许多现实世界的数据集包含缺少的值,通常编码为空白,NaN或其他占位符。然而,这样的数据集与scikit的分类器不兼容,它们假设数组中的所有值都是数字,并且都具有和保持含义。使用不完整数据集的基本策略是丢弃包含缺失值的整个行和/或列。然而,这是以丢失可能是有价值的数据(即使不完整)的代价。更好的策略是估算缺失值,即从已知部分的数据中推断它们。</p>
<p>(1)填充缺失值<br>使用sklearn.preprocessing中的Imputer类进行数据的填充.</p>
<p>具体的用法为<code>Imputer(missing_values='NaN', strategy='mean', axis=0)</code>,同样举个栗子<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.preprocessing <span class="keyword">import</span> Imputer</span><br><span class="line">im = Imputer(missing_values=<span class="string">'NaN'</span>, strategy=<span class="string">'mean'</span>, axis=<span class="number">0</span>)</span><br><span class="line">data = im.fit_transform([[<span class="number">1</span>, <span class="number">2</span>], [np.nan, <span class="number">3</span>], [<span class="number">7</span>, <span class="number">6</span>]])</span><br><span class="line">print(data)</span><br></pre></td></tr></table></figure></p>
<p>(2)替换缺失值<br>使用pandas中的fillna来填充<br>数据降维也是我们在特征预处理中经常进行的操作,一般我们使用成分分析法PCA( Principal component analysis)。他的特点是保存数据集中对方差影响最大的那些特征,但是PCA极其容易受到数据中特征范围影响,所以在运用PCA前一定要做特征标准化,这样才能保证每维度特征的重要性等同。</p>
<h5 id="降维"><a href="#降维" class="headerlink" title="降维"></a>降维</h5><p>数据降维也是我们在特征预处理中经常进行的操作,一般我们使用成分分析法PCA( Principal component analysis)。<br>这里的降维只是特征数量的减少,和我们想象中的不一样哦.<br><img src="/2018/10/10/机器学习-1/机器学习/./水壶.png" alt="blockchain"><br>我们可以形象的利用这个水壶来举例子,PCA法作用就是如何在在信息损失最小的情况下,用一张二维图来表示水壶.</p>
<p> PCA的特点是保存数据集中对方差影响最大的那些特征,但是PCA极其容易受到数据中特征范围影响,所以在运用PCA前一定要做特征标准化,这样才能保证每维度特征的重要性等同。<br>PCA的用法: <code>PCA(n_components=None)</code>,其中n_components是指我们保留信息的百分比,比如0.9,表示降维信息的保留程度为90%</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.decomposition <span class="keyword">import</span> PCA</span><br><span class="line">pca = PCA(n_components=<span class="number">0.9</span>)</span><br><span class="line">data = pca.fit_transform([[<span class="number">2</span>,<span class="number">8</span>,<span class="number">4</span>,<span class="number">5</span>],[<span class="number">6</span>,<span class="number">3</span>,<span class="number">0</span>,<span class="number">8</span>],[<span class="number">5</span>,<span class="number">4</span>,<span class="number">9</span>,<span class="number">1</span>]])</span><br><span class="line">print(data)</span><br></pre></td></tr></table></figure>
<h4 id="特征选择"><a href="#特征选择" class="headerlink" title="特征选择"></a>特征选择</h4><p>特征选择是为了减少冗余,减小计算机的性能消耗(因为部分特征相关度高),同时消除部分噪点特征.在所有特征中选择部分作为训练集特征.<br>主要有以下三种方法:</p>
<ul>
<li>过滤式: VarianceThreshold</li>
<li>嵌入式:正则化(回归分析时候会用到)</li>
<li>包裹式</li>
</ul>
<p>过滤式VarianceThreshold 是特征选择中的一项基本方法。它会移除所有方差不满足阈值的特征。默认设置下,它将移除所有方差为0的特征,即那些在所有样本中数值完全相同的特征。<br>假设我们要移除那些超过80%的数据都为1或0的特征.<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.feature_selection <span class="keyword">import</span> VarianceThreshold</span><br><span class="line">X = [[<span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>], [<span class="number">0</span>, <span class="number">1</span>, <span class="number">0</span>], [<span class="number">1</span>, <span class="number">0</span>, <span class="number">0</span>], [<span class="number">0</span>, <span class="number">1</span>, <span class="number">1</span>], [<span class="number">0</span>, <span class="number">1</span>, <span class="number">0</span>], [<span class="number">0</span>, <span class="number">1</span>, <span class="number">1</span>]]</span><br><span class="line">sel = VarianceThreshold(threshold=(<span class="number">.8</span> * (<span class="number">1</span> - <span class="number">.8</span>)))</span><br><span class="line">sel.fit_transform(X)</span><br><span class="line">array([[<span class="number">0</span>, <span class="number">1</span>],</span><br><span class="line"> [<span class="number">1</span>, <span class="number">0</span>],</span><br><span class="line"> [<span class="number">0</span>, <span class="number">0</span>],</span><br><span class="line"> [<span class="number">1</span>, <span class="number">1</span>],</span><br><span class="line"> [<span class="number">1</span>, <span class="number">0</span>],</span><br><span class="line"> [<span class="number">1</span>, <span class="number">1</span>]])</span><br></pre></td></tr></table></figure></p>
</div>
<div>
</div>
<footer class="post-footer">
<div class="post-eof"></div>
</footer>
</div>
</article>
<article class="post post-type-normal" itemscope itemtype="http://schema.org/Article">
<div class="post-block">
<link itemprop="mainEntityOfPage" href="http://yoursite.com/2018/09/05/git学习笔记2/">
<span hidden itemprop="author" itemscope itemtype="http://schema.org/Person">
<meta itemprop="name" content="雷源">
<meta itemprop="description" content="">
<meta itemprop="image" content="/images/blog_logo2.jpg">
</span>
<span hidden itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
<meta itemprop="name" content="啰里啰嗦的圈圈">
</span>
<header class="post-header">
<h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/09/05/git学习笔记2/" itemprop="url">git学习笔记2</a></h1>
<div class="post-meta">
<span class="post-time">
<span class="post-meta-item-icon">
<i class="fa fa-calendar-o"></i>
</span>
<span class="post-meta-item-text">发表于</span>
<time title="创建于" itemprop="dateCreated datePublished" datetime="2018-09-05T10:07:06+08:00">
2018-09-05
</time>
</span>
<span class="post-category" >
<span class="post-meta-divider">|</span>
<span class="post-meta-item-icon">
<i class="fa fa-folder-o"></i>
</span>
<span class="post-meta-item-text">分类于</span>
<span itemprop="about" itemscope itemtype="http://schema.org/Thing">
<a href="/categories/git/" itemprop="url" rel="index">
<span itemprop="name">git</span>
</a>
</span>
</span>
</div>
</header>
<div class="post-body" itemprop="articleBody">
<p>Having finished git's local operations, the next step is remote repositories, so this second set of git notes starts there.</p>
<h3 id="远程仓库"><a href="#远程仓库" class="headerlink" title="远程仓库"></a>远程仓库</h3><p>关于远程仓库的概念这里就不在赘述了,需要注意的是在使用GitHub时,我们需要添加ssh key,操作很简单。<br>使用远端仓库首先我们需要在GitHub或者码云上建立一个新的仓库,然后按照提示在本地仓库使用命令<code>git remote add origin [email protected]:leiyuan9759/learngit.git</code><br>,接着就能够把本地库的所有东西推送到远程仓库中去,命令为<code>git push -u origin master</code>,由于远程库是空的,我们第一次推送master分支时,加上了-u参数,Git不但会把本地的master分支内容推送的远程新的master分支,还会把本地的master分支和远程的master分支关联起来,在以后的推送或者拉取时就可以简化命令。从现在起,只要本地作了提交,就可以通过命令:<code>git push origin master</code></p>
<ul>
<li>As developers, we usually start from an existing remote repository and clone it: <code>git clone git@github.com:leiyuan9759/gitskills.git</code><br>copies the code onto our machine.</li>
</ul>
<hr>
<h3 id="分支管理"><a href="#分支管理" class="headerlink" title="分支管理"></a>分支管理</h3><p>git的协同操作,我感觉最复杂的就是分支管理了,关于git分支的讲解可以参考<a href="https://www.liaoxuefeng.com/wiki/0013739516305929606dd18361248578c67b8067c8c017b000/001375840038939c291467cc7c747b1810aab2fb8863508000" target="_blank" rel="noopener">廖雪峰git笔记之创建与合并分支</a>。<br>git 分支操作</p>
<ul>
<li>Create a branch<br>To create a dev branch and switch to it: <code>git checkout -b dev</code>, which is equivalent to the two commands <code>git branch dev</code> and <code>git checkout dev</code></li>
<li>List branches: <code>git branch</code> lists all branches and marks the current one with a *</li>
<li>Switch branches: <code>git checkout master</code> switches to the master branch.</li>
<li>Merge branches: <code>git merge dev</code> merges the named branch into the current branch.</li>
<li>Delete a branch: <code>git branch -d dev</code></li>
<li>Merge strategy<br>Normally git merges in Fast forward mode when it can, but in that mode the branch information is lost once the branch is deleted. If Fast forward is forcibly disabled, git creates a new commit at merge time, so the branch remains visible in the history. The command is <code>git merge --no-ff -m "merge with no-ff" dev</code>, where -m supplies the commit message</li>
</ul>
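<p>Putting the commands above together, a minimal feature-branch round trip looks like this (a sketch; the branch and file names are arbitrary):</p>
<figure class="highlight bash"><pre><code>git checkout -b dev                            # create dev and switch to it
# ...edit files on dev...
git add readme.txt
git commit -m "work on dev"
git checkout master                            # switch back to master
git merge --no-ff -m "merge with no-ff" dev    # merge, keeping the branch history
git branch -d dev                              # delete the merged branch</code></pre></figure>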
<hr>
<h3 id="多人协作"><a href="#多人协作" class="headerlink" title="多人协作"></a>多人协作</h3><p>多人协作的工作模式通常是这样:<br>首先,可以试图用<code>git push origin <branch-name></code>推送自己的修改;<br>如果推送失败,则因为远程分支比你的本地更新,需要先用git pull试图合并;<br>如果合并有冲突,则解决冲突,并在本地提交;<br>没有冲突或者解决掉冲突后,再用<code>git push origin <branch-name></code>推送就能成功!<br>如果git pull提示no tracking information,则说明本地分支和远程分支的链接关系没有创建,用命令<code>git branch --set-upstream-to <branch-name> origin/<branch-name></code>。</p>
<p>Common collaboration commands (a sketch of the push-pull cycle follows this list):</p>
<ul>
<li>Inspect the remotes with git remote -v; a local branch that is never pushed to the remote stays invisible to everyone else;</li>
<li>Push a local branch with git push origin branch-name; if the push fails, fetch the new remote commits first with git pull;</li>
<li>Create a local branch that corresponds to a remote one with git checkout -b branch-name origin/branch-name; the local and remote branch names should match;</li>
<li>Link a local branch to its remote branch with git branch --set-upstream-to=origin/branch-name branch-name;</li>
<li>Fetch from the remote with git pull; if there are conflicts, resolve them first.</li>
</ul>
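<p>The workflow above, sketched as the commands you would actually type (the branch name is hypothetical):</p>
<figure class="highlight bash"><pre><code>git push origin dev      # try to push your commits
git pull                 # rejected because the remote is ahead: fetch and merge
# ...resolve any conflicts, then commit the merge...
git add .
git commit -m "fix conflicts"
git push origin dev      # this time the push succeeds</code></pre></figure>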
<hr>
<p> Git also supports tagging, which feels simple enough that I will skip it.<br> That concludes my git study.</p>
</div>
<div>
</div>
<footer class="post-footer">
<div class="post-eof"></div>
</footer>
</div>
</article>
<article class="post post-type-normal" itemscope itemtype="http://schema.org/Article">
<div class="post-block">
<link itemprop="mainEntityOfPage" href="http://yoursite.com/2018/09/04/git学习笔记/">
<span hidden itemprop="author" itemscope itemtype="http://schema.org/Person">
<meta itemprop="name" content="雷源">
<meta itemprop="description" content="">
<meta itemprop="image" content="/images/blog_logo2.jpg">
</span>
<span hidden itemprop="publisher" itemscope itemtype="http://schema.org/Organization">
<meta itemprop="name" content="啰里啰嗦的圈圈">
</span>
<header class="post-header">
<h1 class="post-title" itemprop="name headline">
<a class="post-title-link" href="/2018/09/04/git学习笔记/" itemprop="url">git学习笔记</a></h1>
<div class="post-meta">
<span class="post-time">
<span class="post-meta-item-icon">
<i class="fa fa-calendar-o"></i>
</span>
<span class="post-meta-item-text">发表于</span>
<time title="创建于" itemprop="dateCreated datePublished" datetime="2018-09-04T19:27:53+08:00">
2018-09-04
</time>
</span>
</div>
</header>
<div class="post-body" itemprop="articleBody">
<p>I had long heard that a version-control tool would be indispensable in my future work, but because I studied and researched on my own, I took a lazy attitude toward git: it felt like I could get my everyday work done without it, so my knowledge of git stopped at "aware of it". During my summer internship, my manager told me: "Honestly, sometimes you really can do without git, but only when you carry this mindset will we feel that you are one of us." Only then did I appreciate how important git is, so I decided to re-organize and summarize its usage. These notes mainly follow 廖雪峰's git tutorial, which already feels sufficient; when I have time I will consult the documentation on the official git site.<br>My main confusion with git is the relationship between the working tree, the staging area, and the remote repository; without that clear, there is no point talking about mastering the commands, so the study has to start from the basics.</p>
<h3 id="git简介"><a href="#git简介" class="headerlink" title="git简介"></a>git简介</h3><p>关于git的介绍也不用多说,廖雪峰说git是世界上最先进的分布式版本控制系统,帮助我们(特别是我这种小白)结束了手动管理多个版本的史前时代,进入21世纪.既然说到分布式我们就要了解一下什么是分布式系统,而在此之前我们还要了解一下与之对应的集中式系统.</p>
<ul>
<li>In a centralized system the repository lives on a central server; everyone first fetches the latest version from the server, does their work, then commits it back to the central repository. A centralized system needs a network connection before code can be committed.</li>
<li>In a distributed system, every person's computer holds a complete repository; no network is needed while working, and most operations feel local, although a central server is still useful for exchanging code conveniently.</li>
</ul>
<h3 id="git的基本操作"><a href="#git的基本操作" class="headerlink" title="git的基本操作"></a>git的基本操作</h3><h4 id="创建版本库"><a href="#创建版本库" class="headerlink" title="创建版本库"></a>创建版本库</h4><p>版本库就是一个目录,类似一个仓库,之前爱这个目录里所有的文件都可以使用git来进行管理。创建一个版本库的操作也非常简单,只要创建一个空白的文件夹,然后打开git bash,使用<code>git init</code>命令就可以把这个目录变成git可以管理的 仓库。为了避免问题出现,这个目录的路径名最好是没有中文的。</p>
<h4 id="提交操作"><a href="#提交操作" class="headerlink" title="提交操作"></a>提交操作</h4><p>git的基本操作主要是<code>add commit push</code>三个,将代码提交到本地或者是远端仓库,掌握这三个命令就能够完成平时的大部分工作。<br>add 是把文件加入到git仓库中去的,通常进行的操作就是<code>git add 文件名`</code><br>使用commitm命令时要添加一些信息说明本次提交的内容,这样在版本回退时候才能够清楚的看到本次提交所进行的操作。例如,<code>git commit -m "添加readme文件"</code>。</p>
<h4 id="版本回退"><a href="#版本回退" class="headerlink" title="版本回退"></a>版本回退</h4><p>git做为版本控制工具是可以随时吃“后悔药”的,也就是我们所说的版本回退。进行版本回退主要涉及的命令是reset。<br>在进行版本回退时,我们需要查看我们提交的历史记录,查看历史提交记录的命令是:<code>git log</code>,这时所有的提交日志会按照从近到远显示,如果嫌弃输出的信息太多,可以使用命令<code>git log --pretty=online</code>.</p>
<p>Good. Now suppose we want to roll readme.txt back to the previous version. How?<br>First, git must know which version is current: in git, HEAD denotes the current version, i.e. the latest commit; the previous version is HEAD^, the one before that HEAD^^, and since writing a hundred carets for the hundredth ancestor is easy to miscount, that is written HEAD~100. For example, <code>git reset --hard HEAD^</code> returns to the previous version.<br>After going back, <code>git log</code> no longer shows the commit you left; in that case use<br><code>git reflog</code>, which lists every command you ran, and then use <code>git reset</code> again to move back forward.</p>
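<p>The whole round trip, sketched (the forward commit id is hypothetical; read yours off <code>git reflog</code>):</p>
<figure class="highlight bash"><pre><code>git log --pretty=oneline    # inspect the commit history
git reset --hard HEAD^      # roll back one version
git reflog                  # list every command, including the "lost" commit id
git reset --hard 1094a      # jump forward to that commit again</code></pre></figure>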
<h4 id="撤销操作"><a href="#撤销操作" class="headerlink" title="撤销操作"></a>撤销操作</h4><p><code>git checkout -- file</code>可以丢弃工作区的修改:</p>