目录
一、数据来源
本文使用的数据来源于天池大赛数据集,由蚂蚁金服提供,包含用户基本信息、申购赎回记录、收益率、银行间拆借利率等多个维度,本文通过PySpark实现对该数据集的简单分析。
二、PySpark RDD编程
数据都已上传到HDFS的/data目录下,对于分析结果,保存至/output目录下。
1、查询特定日期的资金流入和流出情况
使用user_balance_table,计算出所有用户在每一天的总资金流入和总资金流出量。
输出格式如下:
<日期> <资金流入量> <资金流出量>
代码如下:
- from pyspark.sql.functions import col, mean, round, row_number, sum
- from pyspark import SparkConf, SparkContext
- from pyspark.sql import SparkSession
- from pyspark.sql.window import Window
- from datetime import datetime
-
- # 创建对象
- conf = SparkConf().setAppName("data analysis")
- sc = SparkContext(conf=conf)
-
- # 读取CSV文件为RDD
- lines_rdd = sc.textFile("/data/user_balance_table.csv")
- # 获取表头
- header = lines_rdd.first()
- data_rdd = lines_rdd.filter(lambda row: row != header)
- # 每行转为列表
- split_rdd = data_rdd.map(lambda line: line.split(","))
-
- # 提取列索引
- extracted_rdd = split_rdd.map(lambda fields: (fields[1], (int(fields[4]), int(fields[8]))))
- # 分组聚合,分别计算流入和流出总额
- aggregated_rdd = extracted_rdd.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
- # 结果转换
- result_rdd = aggregated_rdd.map(lambda x: "\t".join([x[0], str(x[1][0]), str(x[1][1])]))
-
- # 插入表头行
- final_rdd = sc.parallelize(["日期\t资金流入量\t资金流出量"]) + result_rdd
- final_rdd.saveAsTextFile("/output/result1.txt")
-
- print("已完成1、查询特定日期的资金流入和流出情况")
输出结果如下:
- 日期 资金流入量 资金流出量
- 20140805 394780870 221706539
- 20140808 233903717 311648757
- 20140811 331550471 418603336
- 20140814 257702660 211939431
- 20140820 308378692 202452782
- 20140823 141412027 199377531
- 20140826 306945089 285478563
- 20140829 267554713 273756380
- 20140830 199708772 196374134
- 20140827 302194801 468164147
- 20140821 251763517 219963356
- 20140818 298499146 259169016
- 20140815 244551620 236516007
- 20140812 258493673 309754858
- 20140803 173825397 127125217
- 20140728 371762756 345986909
- 20140725 181641088 262874791
- 20140722 243084133 369043423
- 20140719 210318023 155464283
- 20140716 394890140 234775948
- 20140713 179759885 199459990
- 20140710 283095921 326009240
- 20140707 272182847 317612569
- 20140409 383347565 289330278
- 20140403 363877120 266605457
- 20140331 398884905 423852634
- 20140328 225966355 405443946
- 20140831 275090213 292943033
- 20140825 309574223 312413411
- 20140822 246316056 179349206
- 20140816 215059736 219214339
- 20140813 261506619 303975517
- 20140810 259534870 189909225
- 20140804 330640884 322907524
- 20140729 228093046 303480103
- 20140723 265461894 308353077
- 20140717 253011280 298279385
- 20140714 254797524 284753279
- 20140711 208671021 240050748
- 20140708 224240103 340453063
- 20140630 334054112 456547794
- 20140321 282351818 259655286
- 20140324 313180334 437825259
- 20140402 355347118 272612066
- 20140414 309853269 415986984
- 20140417 355792647 265341592
- 20140420 191259529 161057781
- 20140501 193045106 143362755
- 20140504 303087562 413222034
- 20140513 275241493 257918375
- 20140519 259077930 293791406
- 20140522 344636549 251108485
- 20140603 270887462 385622582
- 20140612 332365185 236467885
- 20140615 166080126 116623756
- 20140323 167456369 186443311
- 20140326 272935544 450254233
- 20140329 160250985 155006056
- 20140407 196936223 176966561
- 20140410 386567460 286914864
- 20140416 387847838 255914640
- 20140422 285248757 268810141
- 20140425 220927432 227764292
- 20140506 318002728 341108696
- 20140515 313367089 238307643
- 20140524 160073254 154409868
- 20140527 299049555 321965845
- 20140530 226547701 312802179
- 20140605 380042567 355645445
- 20140614 181574530 229916191
- 20140623 232227670 373779624
- 20140415 428681231 285293076
- 20140427 146837951 191915377
- 20140502 125336258 121222064
- 20140505 370924149 309330781
- 20140508 392838756 273187499
- 20140514 305474522 316225442
- 20140517 170983868 145854372
- 20140526 344890868 274707572
- 20140529 320549863 313827800
- 20140622 147180819 361285652
- 20140625 264663201 547295931
- 20140628 153826161 283400236
- 20140706 199569025 195530758
- 20140709 278005555 269642881
- 20140712 177644343 149081488
- 20140718 208959595 208671287
- 20140724 277044480 347622431
- 20140727 151406251 166610652
- 20140730 209917272 250117716
- 20140802 189092130 172250225
- 20140217 706118060 405809267
- 20131120 166163002 86951218
- 20131111 147986439 227124951
- 20131102 62132300 26705506
- 20131024 69530830 58183921
- 20131015 85704304 35099198
- 20131012 81901136 31322983
- 20131013 71149981 31850475
- 20131029 102050144 43052468
- 20131104 300027403 130970051
- 20131110 164813471 85293644
- 20131113 174554976 79121440
- 20131116 118085705 28996272
- 20131119 131365419 70308235
- 20131122 128528687 82717921
- 20131125 147143602 70113127
- 20131203 268666856 101662045
- 20131209 174652812 133355299
- 20131214 129045596 52234651
- 20131208 124886903 76314292
- 20131205 190825287 118256819
- 20131124 102394991 85555947
- 20131121 221166938 74733962
- 20131118 195766477 136916782
- 20131112 148469647 131263524
- 20131106 181474910 131707820
- 20131028 129993375 55089464
- 20131025 69936453 52225802
- 20131022 145611496 96172537
- 20131019 53148443 26684697
- 20140216 331324628 183057224
- 20131216 196489931 156157326
- 20131213 164672690 90778214
- 20131210 179471612 147147341
- 20131207 138882605 71916047
- 20131123 89109018 48882131
- 20140223 337053711 195214052
- 20140313 346286910 316973542
- 20140310 497338076 308040624
- 20140317 339008082 440624592
- 20140314 315897431 311575572
- 20140311 430500816 496039886
- 20140302 276202230 246199417
- 20140225 563505889 299155352
- 20140309 244752519 206312503
- 20140315 287407002 242799048
- 20140318 435479117 410240726
- 20140221 353981687 178914448
- 20140218 593563145 271125324
- 20140212 763009770 354585919
- 20140209 341911234 262954049
- 20140206 227860276 111635336
- 20140205 209043990 95407704
- 20140208 358255468 183432237
- 20140211 818803205 243058162
- 20140220 478925191 230855410
- 20140207 578818221 273482213
- 20140201 64287172 61527060
- 20140117 345426853 127385210
- 20140116 374947083 153167426
- 20140119 185082022 117629351
- 20140125 440854623 185524856
- 20140127 657199484 280645861
- 20131220 133867400 153678963
- 20131217 193980797 116537843
- 20140112 228255344 141505812
- 20131221 112000779 83151594
- 20131222 154543217 83998249
- 20131228 192808006 61383220
- 20140102 434956739 190155450
- 20140105 206030707 156781996
- 20140113 447207050 178923772
- 20140110 237797636 117259153
- 20140107 589726496 137972793
- 20131224 305406334 118607911
- 20130824 44367268 14316598
- 20131011 82674757 27666473
- 20131002 11562708 7233874
- 20130930 28398050 30181353
- 20130921 32406340 26596179
- 20130906 38923186 33417903
- 20130831 47655303 22012016
- 20130828 46602319 33696861
- 20130816 41683536 33739844
- 20130813 48720919 28790328
- 20130804 45745254 8263965
- 20130729 53512076 18599364
- 20130723 58136147 24404051
- 20131010 72094613 52244531
- 20130820 53675509 30131225
- 20130817 17670519 4674983
- 20130811 55519543 15372680
- 20130805 43632203 15797507
- 20130730 47481243 13048582
- 20130724 48422518 36258592
- 20130725 57433418 38212836
- 20130731 54569637 9534040
- 20130803 21595789 20701797
- 20130806 50866184 16401387
- 20130815 42421222 52659327
- 20130818 59703092 15218300
- 20130827 45268028 60137633
- 20130830 56488844 20630752
- 20130908 38796909 14919122
- 20130911 98944459 71674082
- 20130914 30975443 36312660
- 20131004 19907689 23412350
- 20130904 33366708 27452600
- 20130907 34076183 20344424
- 20130910 94684143 37128363
- 20130919 24778048 11418512
- 20131003 10101194 6223263
- 20131009 72422299 33999376
- 20130711 44075197 3508800
- 20130714 22615303 2784107
- 20130720 20439079 4601143
- 20130718 24234505 11765356
- 20130715 48128555 13107943
- 20130712 34183904 8492573
- 20130706 36751272 1616635
- 20130710 30696506 2597169
- 20130713 15164717 3482829
- 20130719 33680124 9244769
- 20130722 40448896 19144267
- 20140817 149978271 139564084
- 20140824 130195484 191080151
- 20140809 160262764 163611708
- 20140806 288821016 282346594
- 20140731 191728916 277194379
- 20140704 211649838 264494550
- 20140412 177642053 123295320
- 20140406 129477254 139576683
- 20140325 314345006 312710515
- 20140322 191700135 138039412
- 20140828 245082751 297893861
- 20140819 266401973 254929877
- 20140807 247646474 253659514
- 20140801 374884735 252540858
- 20140726 128268053 282653341
- 20140720 176449304 174462836
- 20140705 169383796 272535138
- 20140702 384555819 328950951
- 20140327 266231082 359071642
- 20140330 205533934 264714811
- 20140405 202336542 163199682
- 20140408 354770149 250015131
- 20140411 237829882 277077434
- 20140423 313677307 278470936
- 20140426 151625415 158122962
- 20140429 330607104 307578349
- 20140507 417327518 239372999
- 20140510 287240171 147248328
- 20140516 231967423 282094916
- 20140525 166943526 231004758
- 20140528 276134813 415891684
- 20140531 146823669 142862063
- 20140606 301413900 274862067
- 20140609 366114374 287520152
- 20140618 335262709 421920230
- 20140621 177999186 231003775
- 20140624 245450766 428471509
- 20140627 264282703 399444352
- 20140320 365011495 336076380
- 20140401 453320585 277429358
- 20140404 251895894 200192637
- 20140413 208172985 178934722
- 20140419 268729366 146374940
- 20140428 324937272 327724735
- 20140503 185094488 199568043
- 20140509 281479009 247743971
- 20140512 325108597 293952908
- 20140518 164419642 153440019
- 20140521 297799722 250223726
- 20140602 158219402 170409506
- 20140608 302171269 169525332
- 20140611 327661453 246127540
- 20140617 270350693 502560223
- 20140620 251582530 286583065
- 20140626 297628039 418742109
- 20140629 158801540 261170799
- 20140701 384428753 374164541
- 20140418 239300383 225952909
- 20140421 301134667 295635256
- 20140424 318358891 224536754
- 20140430 260091330 281835975
- 20140511 182424063 152945581
- 20140520 453955303 260040720
- 20140523 249546195 229000787
- 20140601 183489775 149829253
- 20140604 274460744 303978838
- 20140607 187801995 146203577
- 20140610 354031597 298190025
- 20140613 216923770 386799040
- 20140616 387308484 492349489
- 20140619 338609087 284956260
- 20140703 297894643 289009780
- 20140715 334810012 261722182
- 20140721 378088594 434191479
- 20140213 626698794 362757986
- 20131117 111786622 42358155
- 20131114 167996058 52554150
- 20131108 231749471 92856307
- 20131105 180904814 82689864
- 20131030 109663260 63908701
- 20131027 68007543 46520570
- 20131021 106030290 45781403
- 20131018 98151450 15938355
- 20131014 130315300 40652625
- 20131017 87203217 23274764
- 20131020 47766681 50884342
- 20131023 106093112 73788462
- 20131026 55752506 49357211
- 20131101 150904945 66933119
- 20131107 167963464 83306452
- 20131128 139760425 61453591
- 20131206 152197159 82625571
- 20131212 140360007 92090275
- 20131215 205882387 54757146
- 20140215 255393179 121463693
- 20131211 239916977 110314592
- 20131202 345313258 128465086
- 20131130 116865492 119660311
- 20131127 134774128 86129498
- 20131115 124270704 58210254
- 20131109 126467028 45885771
- 20131103 68707870 30895754
- 20131031 90926701 55607833
- 20131016 100948091 51261982
- 20131204 219456372 69981329
- 20131201 137889992 94632233
- 20131129 122536344 123982166
- 20131126 258037702 65802983
- 20140226 680801145 284819682
- 20140301 362865580 211279011
- 20140304 524146340 250562978
- 20140319 359014064 429298917
- 20140316 269387391 390584547
- 20140307 380139779 291087220
- 20140308 243274169 140323202
- 20140305 454295491 209072753
- 20140227 672909288 492786036
- 20140224 656317045 473470156
- 20140204 83653646 44927333
- 20140222 282022706 134735118
- 20140228 428721754 322030204
- 20140303 505305862 513017360
- 20140306 561787770 243149884
- 20140312 377007897 416491268
- 20140203 84329848 46109194
- 20140131 87324175 48132389
- 20140202 57912761 24395720
- 20140214 490710434 263261899
- 20140219 408367966 323990451
- 20140210 647465140 431753411
- 20140126 528402520 331752519
- 20130929 34433848 40295202
- 20140129 952479658 215164666
- 20140123 481450097 209024328
- 20140120 335761027 442996084
- 20140114 356907128 159778389
- 20130721 21142394 2681331
- 20140122 498730606 292856188
- 20140128 843424742 248334316
- 20140130 227916211 105288105
- 20140124 614103894 345829001
- 20140121 412854611 302825136
- 20140118 233200688 142869842
- 20140115 391408718 177618247
- 20131229 161473347 55129989
- 20131226 220287409 241852564
- 20140103 342074805 127714255
- 20131223 198747367 138478245
- 20140106 442494042 190917629
- 20140109 280698487 140391237
- 20131218 180215804 202220872
- 20131219 233631357 164023344
- 20131225 380076148 172755480
- 20131231 265851934 134232530
- 20140108 264025160 213880074
- 20140111 243403438 71530182
- 20140104 186085910 99869074
- 20140101 330926565 77367755
- 20131230 286358778 179826624
- 20131227 211118967 163522548
- 20130925 47274168 74838162
- 20131008 109806912 77856096
- 20131005 26364686 13966453
- 20130927 42166121 27639603
- 20130924 74250859 56680792
- 20130918 42926608 47203221
- 20130915 37059683 25020715
- 20130912 68573684 45147220
- 20130909 49312473 45621186
- 20130903 80507880 39328144
- 20130825 43306288 18117808
- 20130822 89130737 25000672
- 20130819 109184209 28339260
- 20130810 20615307 26614408
- 20130807 43908081 29708706
- 20130801 53369962 18864468
- 20130726 44721817 39192369
- 20130829 47518666 35944968
- 20130901 56239802 59339949
- 20130826 74496541 20637986
- 20130823 37878575 25680323
- 20130814 30541167 15031683
- 20130808 44493490 29551691
- 20130802 38064536 40150769
- 20130727 17194451 15058893
- 20130728 36255382 7683211
- 20130809 33425186 30131015
- 20130812 94520502 28586669
- 20130821 38184446 29983342
- 20130902 140844739 17785524
- 20130905 38716505 21800904
- 20130917 76204815 58260798
- 20130920 28365762 15188254
- 20130923 54658160 58285879
- 20130926 68718693 48467529
- 20131001 19137499 12813267
- 20131007 42797733 28894400
- 20130913 71655946 60512675
- 20130916 161656210 45184589
- 20130922 67514169 47962055
- 20130928 39995798 58105720
- 20131006 38730653 13464528
- 20130717 29015682 10911513
- 20130705 11648749 2763587
- 20130702 29037390 2554548
- 20130708 57258266 8347729
- 20130709 26798941 3473059
- 20130703 27270770 5953867
- 20130701 32488348 5525022
- 20130704 18321185 6410729
- 20130707 8962232 3982735
- 20130716 50622847 11864981
2、活跃用户分析
使用 user_balance_table ,定义活跃用户为在指定月份内有至少5天记录的用户,统计2014年8月的活跃用户总数。
输出格式如下:
<活跃用户总数>
代码如下:
- # 提取user_id、report_date字段
- mapped_rdd = split_rdd.map(lambda fields: (fields[0], datetime.strptime(fields[1], "%Y%m%d")))
- # 筛选出2014年8月的记录
- filtered_rdd = mapped_rdd.filter(lambda x: x[1].year == 2014 and x[1].month == 8)
- # 聚合
- grouped_rdd = filtered_rdd.groupByKey()
- # 统计记录数量,过滤活跃用户
- active_users_rdd = grouped_rdd.filter(lambda x: len(list(x[1])) >= 5)
-
- # 统计活跃用户总数
- total_active_users = active_users_rdd.map(lambda x: x[0]).distinct().count()
-
- print("已完成2、活跃用户分析\n2014年8月期间的活跃用户总数为:", total_active_users)
三、PySpark SQL编程
1、按城市统计2014年3月1日的平均余额
计算每个城市在2014年3月1日的用户平均余额(tBalance),按平均余额降序排列。
输出格式如下:
<城市ID> <平均余额>
代码如下:
- spark = SparkSession.builder.getOrCreate()
- # 从HDFS读取CSV文件
- df_bal = spark.read.csv("/data/user_balance_table.csv", header=True, inferSchema=True)
- df_city = spark.read.csv("/data/user_profile_table.csv", header=True, inferSchema=True)
-
- df = df_bal.select("user_id", "report_date", "tBalance")\
- .filter(col("report_date") == "20140301")\
- .join(df_city.select("user_id","city"), on="user_id")
-
- result_df = df.groupBy("city").agg(mean(col("tBalance")).cast("int").alias("tBalance_mean"))\
- .sort(col("tBalance_mean").desc())
-
- # 将结果转换为RDD,并进行格式调整(添加表头,并以tab分隔每列)
- data_rdd = result_df.rdd.map(lambda row: "\t".join(map(str, row)))
- result2 = sc.parallelize(["城市ID\t平均余额"] + data_rdd.collect())
- result2.saveAsTextFile("/output/result2.txt")
- print("已完成1、按城市统计2014年3月1日的平均余额")
输出结果如下:
- 城市ID 平均余额
- 6281949 2795923
- 6301949 2650775
- 6081949 2643912
- 6481949 2087617
- 6411949 1929838
- 6412149 1896363
- 6581949 1526555
2、统计每个城市总流量前3高的用户
统计每个城市中每个用户在2014年8月的总流量(定义为total purchase_amt+total_redeem_amt),并输出每个城市总流量排名前三的用户ID及其总流量。
输出格式如下:
<城市ID> <用户ID> <总流量>
代码如下:
- windowSpec = Window.partitionBy("city").orderBy(col("total_amt").desc())
- df = df_bal.select("user_id", "report_date", "total_purchase_amt", "total_redeem_amt")\
- .filter(df_bal.report_date.like("201408%"))\
- .join(df_city.select("user_id","city"), on="user_id")
-
- result_df = df.groupBy("city", "user_id") \
- .agg((sum(col("total_purchase_amt")) + sum(col("total_redeem_amt"))).alias("total_amt"))\
- .filter(col("total_amt") > 0)\
- .withColumn("rn", row_number().over(windowSpec))\
- .filter(col("rn") <= 3)\
- .sort(col("city"), col("rn"))\
- .select("city", "user_id", "total_amt")
-
- # 输出结果
- data_rdd = result_df.rdd.map(lambda row: "\t".join(map(str, row)))
- result3 = sc.parallelize(["城市ID\t用户ID\t总流量"] + data_rdd.collect())
- result3.saveAsTextFile("/output/result3.txt")
- print("已完成2、统计每个城市总流量前3高的用户")
-
- sc.stop()
- spark.stop()
输出结果如下:
- 城市ID 用户ID 总流量
- 6081949 27235 108475680
- 6081949 27746 76065458
- 6081949 18945 55304049
- 6281949 15118 149311909
- 6281949 11397 124293438
- 6281949 25814 104428054
- 6301949 2429 109171121
- 6301949 26825 95374030
- 6301949 10932 74016744
- 6411949 662 75162566
- 6411949 21030 49933641
- 6411949 16769 49383506
- 6412149 22585 200516731
- 6412149 14472 138262790
- 6412149 25147 70594902
- 6481949 12026 51161825
- 6481949 670 49626204
- 6481949 14877 34488733
- 6581949 9494 38854436
- 6581949 26876 23449539
- 6581949 21761 21136440
四、总结
RDD编程主要练习了filter、map、reduceByKey、saveAsTextFile、groupByKey等算子的使用,Spark SQL编程主要练习了DataFrame操作、聚合函数、窗口函数等内容。
评论记录:
回复评论: