用Python处理"大"XLS文件

    xiaoxiao2021-03-26  64

    权当学习Python练手用的.

    数据来data.gov.uk,大小有58.4MB

    文件都是些什么内容?

    ’Accident_Index’, ‘Location_Easting_OSGR’,‘Location_Northing_OSGR’, ‘Longitude’, ‘Latitude’, ‘Police_Force’, ‘Accident_Severity’, ‘Number_of_Vehicles’, ‘Number_of_Casualties’, ‘Date’, ‘Day_of_Week’, ‘Time’, ‘Local_Authority_(District)’, ‘Local_Authority_(Highway)’, ‘1st_Road_Class’, ‘1st_Road_Number’, ‘Road_Type’, ‘Speed_limit’, ‘Junction_Detail’, ‘Junction_Control’, ‘2nd_Road_Class’, ‘2nd_Road_Number’, ‘Pedestrian_Crossing-Human_Control’, ‘Pedestrian_Crossing_Physical_Facilities’, ’Light_Conditions’, ‘Weather_Conditions’, ‘Road_Surface_Conditions’, ‘Special_Conditions_at_Site’, ‘Carriageway_Hazards’, ‘Urban_or_Rural_Area’, ‘Did_Police_Officer_Attend_Scene_of_Accident’, ‘LSOA_of_Accident_Location’

    LowMemory 方式读取文件

    #read the file filedir='/home/derek/Desktop/python-data-analyis/large-excel-files/Accidents_2013.csv' data = pd.read_csv(filedir,low_memory=False) print data.ix[:10]['Day_of_Week'] SQL likes 提取数据信息 print 'Accidents' print '----------' #选择星期日发生的事故 accidents_sunday = data[data.Day_of_Week==1] print 'Accidents which happended on a Sunday: ',len(accidents_sunday) #选择星期日发生的且涉事人数在十人以上的事故 accidents_sunday_twenty_cars = data[(data.Day_of_Week==1) & (data.Number_of_Vehicles>10)] print'Accidents which happened on a Sunday involving > 10 cars: ' , len(accidents_sunday_twenty_cars) #选择星期日发生的且涉事人数在十人以上且天气情况是下雨的事故(2对应的是无风下雨) accidents_sunday_twenty_cars_rain = data[(data.Day_of_Week==1) & (data.Number_of_Vehicles>10) & (data.Weather_Conditions==2)] print'Accidents which happened on a Sunday involving > 10 cars with rainning: ' , len(accidents_sunday_twenty_cars_rain) #选择在伦敦的星期日发生的事故 london_data = data[(data['Police_Force'] == 1) & (data.Day_of_Week==1)] print 'Accidents in London on a Sunday',len(london_data) #选择在2000年的伦敦的星期日发生的事故 london_data_2000 = london_data[((pd.to_datetime('2000-1-1', errors='coerce')) > (pd.to_datetime(london_data['Date'],errors='coerce'))) & (pd.to_datetime(london_data['Date'],errors='coerce') < (pd.to_datetime('2000-12-31', errors='coerce')))] print 'Accidents in London on a Sunday in 2000:',len(london_data_2000)

    给人的感觉是特别像SQL语句,DataFrame的这种切片,方式特别好用,对不对?

    pd.to_datetime(london_data['Date'],errors='coerce')

    这里是日期转换函数.

    输出:

    Accidents ---------- Accidents which happended on a Sunday: 14854 Accidents which happened on a Sunday involving > 10 cars: 1 Accidents which happened on a Sunday involving > 10 cars with rainning: 1 Accidents in London on a Sunday 2374 Accidents in London on a Sunday in 2000: 0

    将部分DataFrame数据以XLSX文件存储下来 确保你安装了XlsxWriter

    sudo pip install XlsxWriter

    writer = pd.ExcelWriter('london_data.xlsx', engine='xlsxwriter') london_data.to_excel(writer, 'sheet1') writer.save() writer.close() 块读取,分析一个星期中那一天最有出事故的概率最大 代码.2013,2014,2015三年的事故记录,在’Accidents_2013.csv’,’Accidents_2014.csv’, ‘Accidents_2015.csv’这三个文件中 import pandas as pd from pandas import Series import matplotlib.pyplot as plt #read the file dir='/home/derek/Desktop/python-data-analyis/large-excel-files/' filedir=['Accidents_2013.csv','Accidents_2014.csv', 'Accidents_2015.csv'] tot = Series([]) for i in range(3): #块读取文件, 每次读1000条记录 data = pd.read_csv(dir + filedir[i],chunksize=1000) for piece in data: tot = tot.add(piece['Day_of_Week'].value_counts(), fill_value=0) day_index = ['Sun', 'Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat'] print 'data like:' #tot = tot.sort_values(ascending=False) print tot #重新构造一个Series,是为了给索引命名 new_Series = Series(tot.values, index=day_index) new_Series.plot() plt.show() plt.close()

    控制台输出:

    data like: 1 46052 2 60956 3 65006 4 64039 5 64445 6 69378 7 55162 dtype: float64

    图: 三年记录在案的有425038条记录.

    结论: 看来,英国人在工作日出行要比在休息日造成更多的事故.星期五的出行造成的事故最多,或许,星期五急着回家,哈哈.相比起来,星期五不适合外出.

    参考文章来源

    文件没有提供,是因为:读者可以自己去下载,可能找到更想更好用Python分析的数据.

    转载请注明原文地址: https://ju.6miu.com/read-661279.html

    最新回复(0)