Reading Notes on Introduction to Data Science: Python Quick Start

xiaoxiao, 2025-11-26

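Chapter summary

This chapter is a quick tour of Python's basic syntax and how to use it for data analysis. One of Python's design principles is that there should be one "obvious" way to write the code that gets a job done.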

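Python modules:

Some features of Python are not loaded by default, including some that are part of the language itself. To use them, import the module that contains them:

    import re

re is the module of functions and constants for working with regular expressions. After this import, its functions are called with the re prefix:

    my_regex = re.compile("[0-9]+", re.I)

A module alias saves typing long names:

    import matplotlib.pyplot as plt

If you need only specific names from a module, import them explicitly:

    from collections import defaultdict, Counter

You can also import every name in a module into the current namespace, though this can silently overwrite names you have already defined:

    from re import *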

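Arithmetic and functions in Python

Python 2 uses integer division by default. Importing division from __future__ makes / perform true division:

    from __future__ import division

A function can be assigned to another variable:

    >>> def double(x):
    ...     return x * 2
    ...
    >>> my_double = double
    >>> my_double(4)
    8

An anonymous function can be written with lambda, which is essentially a single expression that gets evaluated:

    >>> def apply_to_one(f):
    ...     return f(1)
    ...
    >>> y = apply_to_one(lambda x: x + 4)   # y is 5

Function parameters can be given default values:

    def my_print(message="my default"):
        print message

    my_print("hello")   # prints hello
    my_print()          # prints my default

Arguments can also be passed by name:

    def subtract(a=0, b=0):
        return a - b

    subtract(b=5)       # -5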
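The same ideas can be sketched in Python 3, which is an assumption here since the notes use Python 2 syntax. In Python 3, `/` is already true division and `//` is integer division, so the `__future__` import is not needed.

```python
# Python 3 sketch: functions are ordinary values.
def double(x):
    return x * 2

def apply_to_one(f):
    """Call f with the argument 1."""
    return f(1)

def subtract(a=0, b=0):
    return a - b

my_double = double                      # bind the function to another name
five = apply_to_one(lambda x: x + 4)    # lambda creates an anonymous function

print(my_double(4))      # 8
print(five)              # 5
print(subtract(b=5))     # -5, a falls back to its default of 0
print(5 / 2, 5 // 2)     # 2.5 2
```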

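Strings in Python

A backslash encodes a special character:

    tab_string = '\t'
    len(tab_string)         # 1

To keep the backslash itself as a character, use a raw string with the r prefix:

    not_tab_string = r'\t'
    len(not_tab_string)     # 2

Triple quotes create multi-line strings:

    >>> multi_line_string = """This is the first line.
    ...  and this is the second line
    ...  and this is the third line"""
    >>> multi_line_string
    'This is the first line.\n and this is the second line\n and this is the third line'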
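The difference between an escaped and a raw string is easiest to see by comparing lengths. A minimal sketch, assuming Python 3; the sample text in the multi-line string is made up for illustration.

```python
tab_string = "\t"        # one character: a tab
not_tab_string = r"\t"   # two characters: a backslash and the letter t

# Made-up sample text; line breaks inside triple quotes become "\n".
multi_line_string = """first line
second line
third line"""

print(len(tab_string))                   # 1
print(len(not_tab_string))               # 2
print(multi_line_string.splitlines())    # ['first line', 'second line', 'third line']
```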

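Exceptions in Python

An unhandled exception crashes the program. As in Java, exceptions can be caught: wrap the risky code in try and handle the exception with the except keyword:

    try:
        print 0 / 0
    except ZeroDivisionError:
        print "Cannot divide by zero"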
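One common way to use this pattern is to wrap the risky operation in a small helper. The sketch below assumes Python 3; `safe_divide` and its `default` parameter are hypothetical names introduced only for illustration.

```python
def safe_divide(numerator, denominator, default=None):
    """Return numerator / denominator, or default when dividing by zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Only this specific exception is caught; anything else still propagates.
        return default

print(safe_divide(6, 3))        # 2.0
print(safe_divide(1, 0))        # None
print(safe_divide(1, 0, 0.0))   # 0.0
```

Catching only ZeroDivisionError, rather than every exception, keeps unrelated bugs visible.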

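Lists

A list is an ordered collection:

    >>> integer_list = [1, 2, 3]
    >>> length = len(integer_list)      # 3
    >>> list_sum = sum(integer_list)    # 6

Use square brackets to access the n-th element:

    >>> x = range(10)       # in Python 2 this is the list [0, 1, ..., 9]
    >>> zero = x[0]         # 0
    >>> one = x[1]          # 1
    >>> nine = x[-1]        # 9
    >>> x[0] = -1
    >>> x
    [-1, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Lists can also be sliced:

    >>> first_three = x[:3]
    >>> three_to_end = x[3:]
    >>> without_first_and_last = x[1:-1]
    >>> copy_of_x = x[:]

Use the in operator to check whether a list contains an element:

    >>> 1 in [1, 2, 3]
    True

This check examines the elements one at a time, so avoid it unless the list is small.

Operations that change a list: extend concatenates another list onto it in place:

    >>> x = [1, 2, 3]
    >>> x.extend([4, 5, 6])     # x is now [1, 2, 3, 4, 5, 6]

List addition, by contrast, does not change the original list:

    >>> x = [1, 2, 3]
    >>> y = x + [4, 5, 6]
    >>> y
    [1, 2, 3, 4, 5, 6]

append adds a single element to the end:

    >>> x = [1, 2, 3]
    >>> x.append(0)     # x is now [1, 2, 3, 0]
    >>> y = x[-1]       # 0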
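The difference between in-place and copying operations is a frequent source of bugs, so here is a small sketch of it. It assumes Python 3, where `range` must be wrapped in `list()` to get a mutable list.

```python
x = list(range(10))     # Python 3: range() alone is not a list
x[0] = -1

first_three = x[:3]
last_three = x[-3:]
copy_of_x = x[:]        # a shallow copy
copy_of_x.append(99)    # changes the copy only

a = [1, 2, 3]
b = a + [4, 5, 6]       # builds a new list, a is untouched
a.extend([7])           # modifies a in place

print(first_three)      # [-1, 1, 2]
print(last_three)       # [7, 8, 9]
print(len(x))           # 10
print(a)                # [1, 2, 3, 7]
print(b)                # [1, 2, 3, 4, 5, 6]
```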

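Tuples

Tuples are very similar to lists. Anything you can do to a list you can also do to a tuple, except modify it. One use is as a convenient way to return multiple values from a function:

    >>> def sum_and_product(x, y):
    ...     return (x + y), (x * y)
    ...
    >>> sp = sum_and_product(2, 3)
    >>> sp
    (5, 6)

Multiple assignment in Python amounts to unpacking a tuple:

    >>> x, y = sum_and_product(3, 4)    # x is 7, y is 12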
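A short sketch of both points, assuming Python 3: trying to modify a tuple raises TypeError, and unpacking also makes it possible to swap two variables in one line.

```python
def sum_and_product(x, y):
    return (x + y), (x * y)

my_tuple = (1, 2)
try:
    my_tuple[1] = 3          # tuples do not support item assignment
    modified = True
except TypeError:
    modified = False

s, p = sum_and_product(3, 4)  # unpack the returned tuple

x, y = 1, 2
x, y = y, x                   # swap via tuple packing and unpacking

print(modified)   # False
print(s, p)       # 7 12
print(x, y)       # 2 1
```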

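Dictionaries

A dictionary associates keys with values and lets you look up a value quickly by its key:

    empty_dict = {}
    grades = {"Joel": 80, "Tim": 95}

Assign a value to a key with square brackets:

    grades["Tim"] = 99

Commonly used dictionary methods:

    >>> grades.keys()       # list of keys
    ['Tim', 'Joel']
    >>> grades.values()     # list of values
    >>> grades.items()      # list of (key, value) tuples

The commonly used defaultdict: looking up a key that is not in a dictionary raises an exception, and handling it everywhere is tedious. A defaultdict behaves like a standard dictionary, except that when you look up a key it does not contain, it first creates an entry for that key using the zero-argument function you supplied. With int, a missing key starts at 0 and can be incremented right away:

    from collections import defaultdict

    word_counts = defaultdict(int)
    for word in document:
        word_counts[word] += 1

Other uses:

    >>> from collections import defaultdict
    >>> dd_list = defaultdict(list)         # values are lists
    >>> dd_list[2].append(1)
    >>> dd_list
    defaultdict(<type 'list'>, {2: [1]})
    >>> dd_dict = defaultdict(dict)         # values are dicts
    >>> dd_dict["joel"]["City"] = "Seattle"
    >>> dd_dict
    defaultdict(<type 'dict'>, {'joel': {'City': 'Seattle'}})
    >>> dd_pair = defaultdict(lambda: [0, 0])
    >>> dd_pair[2][1] = 1                   # dd_pair now contains {2: [0, 1]}

The commonly used Counter: a Counter turns a sequence of values into a defaultdict(int)-like object that maps each key to its count:

    >>> from collections import Counter
    >>> c = Counter([0, 1, 2, 0])
    >>> c
    Counter({0: 2, 1: 1, 2: 1})

This gives a quick way to count words:

    word_counts = Counter(document)

A Counter also has a built-in most_common method, for example to print the 10 most common words and their counts:

    >>> for word, count in word_counts.most_common(10):
    ...     print word, count
    ...
    hello 3
    ccc 2
    ccsa 1
    casda 1
    adsadadqwwq 1
    csadada 1
    asda 1
    dqweq 1
    sdxsa 1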
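To make the comparison concrete, the sketch below counts words three ways: with `dict.get`, with `defaultdict(int)`, and with `Counter`. It assumes Python 3; the sample sentence and the key "Kate" are made up for illustration.

```python
from collections import Counter, defaultdict

grades = {"Joel": 80, "Tim": 95}
kates_grade = grades.get("Kate", 0)     # get returns a default instead of raising KeyError

# Made-up sample document.
document = "the quick fox jumps over the lazy dog the end".split()

plain_counts = {}
for word in document:
    plain_counts[word] = plain_counts.get(word, 0) + 1

dd_counts = defaultdict(int)            # int() produces 0 for missing keys
for word in document:
    dd_counts[word] += 1

counter = Counter(document)

print(kates_grade)                      # 0
print(plain_counts == dict(dd_counts) == dict(counter))   # True
print(counter.most_common(1))           # [('the', 3)]
```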

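Sets

A set represents a collection of distinct elements:

    >>> s = set()
    >>> s.add(1)
    >>> s.add(2)
    >>> x = len(s)      # 2
    >>> y = 2 in s      # True

Benefit 1: membership tests on a set are very fast, because Python does not need to check every element.

Benefit 2: a set makes it easy to find the distinct items in a collection:

    >>> item_list = [1, 2, 3, 1, 2, 3]
    >>> num_items = len(item_list)      # 6
    >>> item_set = set(item_list)
    >>> item_set
    set([1, 2, 3])
    >>> num_distinct_items = len(item_set)
    >>> num_distinct_items
    3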
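Both benefits can be sketched in a few lines. This assumes Python 3; the stopword data is hypothetical and only serves as something to search in.

```python
# Hypothetical sample data, not from the notes.
stopwords_list = ["a", "an", "at", "yet", "you"]
stopwords_set = set(stopwords_list)

in_list = "zip" in stopwords_list   # scans the list element by element
in_set = "zip" in stopwords_set     # uses a hash lookup instead of a scan

item_list = [1, 2, 3, 1, 2, 3]
distinct_items = sorted(set(item_list))   # sorted() gives a stable display order

print(in_list, in_set)      # False False
print(len(item_list))       # 6
print(distinct_items)       # [1, 2, 3]
```

The two membership tests give the same answer; the difference only shows up as speed on large collections.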

Control flow:

    if 1 > 2:
        message = "if only 1 greater than two.."
    elif 1 > 3:
        message = ""
    else:
        message = ""

An if-then-else can also be written on a single line:

    >>> x = 3
    >>> parity = "even" if x % 2 == 0 else "odd"
    >>> parity
    'odd'

Using a for loop:

    >>> for x in range(10):
    ...     print x, "is less than 10"
    ...
    0 is less than 10
    1 is less than 10
    2 is less than 10
    3 is less than 10
    4 is less than 10
    5 is less than 10
    6 is less than 10
    7 is less than 10
    8 is less than 10
    9 is less than 10

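Booleans:

To test whether a value is None in Python, use is:

    x = None
    print x is None     # True

print x == None also works, but it is not idiomatic.

The and operator returns its second operand when the first is truthy, and or returns its second operand when the first is falsy:

    >>> s = 1
    >>> first_char = s and 1
    >>> first_char
    1

    safe_x = x or 0

The all function takes a list and returns True when every element is truthy. The any function returns True when at least one element is truthy:

    all([True, 1, {3}])     # True
    all([True, 1, {}])      # False
    any([True, 1, {}])      # True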
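The short-circuit idioms depend on which values Python treats as falsy. A minimal sketch, assuming Python 3; the list of falsy values is a fact about the language, and the variable names are illustrative.

```python
# Values that count as false in a boolean context.
falsy_values = [False, None, 0, 0.0, "", [], {}, set()]

s = "hello"
first_char = s and s[0]                 # s is truthy, so the result is s[0]

empty = ""
first_char_of_empty = empty and empty[0]   # empty is falsy, so indexing never runs

x = None
safe_x = x or 0                         # falls back to 0 when x is falsy

print(any(falsy_values))                # False
print(first_char)                       # h
print(repr(first_char_of_empty))        # ''
print(safe_x)                           # 0
print(all([]), any([]))                 # True False
```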

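More advanced Python syntax:

These are more advanced Python features that are useful for data work:

Sorting:

    >>> x = [4, 1, 2, 3]
    >>> y = sorted(x)       # returns a new sorted list, x is unchanged
    >>> y
    [1, 2, 3, 4]
    >>> x.sort()            # sorts x in place
    >>> x
    [1, 2, 3, 4]

By default the order is smallest to largest. To sort from largest to smallest, pass reverse=True. Instead of comparing the elements themselves, you can also specify a key function whose results are compared:

    >>> x = sorted([-4, 1, -2, 3], key=abs, reverse=True)
    >>> x
    [-4, 3, -2, 1]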
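A typical use of the key parameter is ordering (word, count) pairs by count. The sketch assumes Python 3 and reuses three of the counts from the sample most_common output in the notes.

```python
# Counts taken from the sample output in the notes.
word_counts = {"ccc": 2, "asda": 1, "hello": 3}

# Sort the (word, count) pairs by count, largest first.
by_count = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)

x = [4, 1, 2, 3]
result_of_sort = x.sort()    # sort() works in place and returns None

print(by_count)          # [('hello', 3), ('ccc', 2), ('asda', 1)]
print(x)                 # [1, 2, 3, 4]
print(result_of_sort)    # None
```

Because list.sort returns None, writing `x = x.sort()` loses the list, which is a common mistake.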

List comprehensions:

Often you need to turn one list into another by keeping only some elements, by transforming the elements, or both. The technique for this is the list comprehension. Earlier sections passed lists into functions; here are some other uses.

A list of even numbers:

    >>> even_numbers = [x for x in range(5) if x % 2 == 0]

A list of squares:

    >>> squares = [x * x for x in range(5)]
    >>> squares
    [0, 1, 4, 9, 16]

A list can also be turned into a dictionary:

    >>> square_dict = {x: x * x for x in range(5)}
    >>> square_dict
    {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

Tip: if you do not need the values from the original list, use an underscore as the variable:

    >>> zeros = [0 for _ in range(5)]
    >>> zeros
    [0, 0, 0, 0, 0]

A comprehension can contain multiple for clauses:

    >>> pairs = [(x, y) for x in range(10) for y in range(10)]
    >>> pairs
    [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9),
     (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9),
     (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7), (2, 8), (2, 9),
     (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6), (3, 7), (3, 8), (3, 9),
     (4, 0), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6), (4, 7), (4, 8), (4, 9),
     (5, 0), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (5, 7), (5, 8), (5, 9),
     (6, 0), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (6, 7), (6, 8), (6, 9),
     (7, 0), (7, 1), (7, 2), (7, 3), (7, 4), (7, 5), (7, 6), (7, 7), (7, 8), (7, 9),
     (8, 0), (8, 1), (8, 2), (8, 3), (8, 4), (8, 5), (8, 6), (8, 7), (8, 8), (8, 9),
     (9, 0), (9, 1), (9, 2), (9, 3), (9, 4), (9, 5), (9, 6), (9, 7), (9, 8), (9, 9)]

A later for clause can also depend on an earlier one, which adds a condition to the pairs produced:

    >>> pairs = [(x, y) for x in range(10) for y in range(x + 1, 10)]
    >>> pairs
    [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9),
     (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9),
     (2, 3), (2, 4), (2, 5), (2, 6), (2, 7), (2, 8), (2, 9),
     (3, 4), (3, 5), (3, 6), (3, 7), (3, 8), (3, 9),
     (4, 5), (4, 6), (4, 7), (4, 8), (4, 9),
     (5, 6), (5, 7), (5, 8), (5, 9),
     (6, 7), (6, 8), (6, 9),
     (7, 8), (7, 9),
     (8, 9)]

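Generators and iterators

One problem with lists is that they can easily grow very large. If you have a list of one million elements but only need to process them one at a time, building the whole list wastes a lot of resources. A generator solves this: you can iterate over it, but its values are produced lazily, only when they are needed. One way to create a generator is with a function and the yield operator:

    >>> def lazy_range(n):
    ...     """a lazy version of range"""
    ...     i = 0
    ...     while i < n:
    ...         yield i
    ...         i += 1
    ...

Using the generator:

    >>> for i in lazy_range(10):
    ...     print i
    ...

When execution reaches the yield line, the function does not continue. It hands the current value to the caller and pauses, keeping the state of all its local variables. When the generator is asked for the next value, the function resumes after the yield with its variables holding the values from the previous step rather than being reinitialized.

The second way to create a generator is a for comprehension wrapped in parentheses:

    lazy_evens = (i for i in lazy_range(20) if i % 2 == 0)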
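Stepping through a generator by hand with `next` shows the pause-and-resume behavior described above. A minimal sketch, assuming Python 3:

```python
def lazy_range(n):
    """A lazy version of range."""
    i = 0
    while i < n:
        yield i      # pause here and hand i to the caller
        i += 1       # resume here on the next request

gen = lazy_range(3)
first = next(gen)        # runs until the first yield
second = next(gen)       # resumes after the yield, i keeps its value
rest = list(gen)         # consumes whatever is left
again = list(gen)        # a generator can only be iterated once

lazy_evens = (i for i in lazy_range(20) if i % 2 == 0)
total = sum(lazy_evens)  # values are produced one at a time, never stored

print(first, second)     # 0 1
print(rest)              # [2]
print(again)             # []
print(total)             # 90
```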

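Working with random numbers

Data work often needs random numbers, and Python has the random module to generate them:

    >>> import random
    >>> four_uniform_randoms = [random.random() for _ in range(4)]

Use random.seed to set the seed when you want reproducible results:

    random.seed(10)
    print random.random()

random.randrange takes one or two arguments and returns an element chosen at random from the corresponding range:

    random.randrange(10)        # chooses from 0, 1, ..., 9
    random.randrange(3, 6)      # chooses from [3, 4, 5]

random.shuffle randomly reorders the elements of a list:

    >>> up_to_ten = range(10)
    >>> random.shuffle(up_to_ten)
    >>> print up_to_ten
    [4, 9, 7, 8, 6, 0, 1, 3, 2, 5]

Use random.choice to pick one element from a list at random:

    my_best = random.choice(["Alice", "Bob", "Charlie"])

Use random.sample to choose a sample of elements without replacement:

    lottery_numbers = range(60)
    winning_numbers = random.sample(lottery_numbers, 6)

To choose with replacement, call random.choice repeatedly:

    four_with_replacement = [random.choice(range(10)) for _ in range(4)]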
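Because the actual numbers depend on the Python version, the sketch below checks properties of the results rather than specific values. It assumes Python 3 and shows that reseeding reproduces the same sequence.

```python
import random

random.seed(10)
first_run = [random.random() for _ in range(3)]
random.seed(10)                       # same seed, so the same sequence follows
second_run = [random.random() for _ in range(3)]

up_to_ten = list(range(10))           # Python 3: shuffle needs a real list
random.shuffle(up_to_ten)             # reorders in place

winning_numbers = random.sample(range(60), 6)   # six distinct numbers
pick = random.randrange(3, 6)                   # one of 3, 4, 5

print(first_run == second_run)                  # True
print(sorted(up_to_ten) == list(range(10)))     # True
print(len(set(winning_numbers)))                # 6
print(pick in (3, 4, 5))                        # True
```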

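Regular expressions

Regular expressions are very useful for searching text and matching strings, but they are fairly complicated. Here is how to use them in Python:

    >>> import re
    >>> print all([
    ...     not re.match("a", "cat"),                # "cat" does not start with "a"
    ...     re.search("a", "cat"),                   # "cat" contains an "a"
    ...     not re.search("c", "dog"),               # "dog" contains no "c"
    ...     3 == len(re.split("[ab]", "carbs")),     # splitting on a or b leaves 3 pieces
    ...     "R-D-" == re.sub("[0-9]", "-", "R2D2")   # replace each digit with a dash
    ... ])
    True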
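Looking at the individual return values makes the combined check easier to follow. A minimal sketch, assuming Python 3; the string "a1b22c333" is made-up sample input.

```python
import re

no_match = re.match("a", "cat")          # match only looks at the start of the string
found = re.search("a", "cat")            # search looks anywhere in the string
pieces = re.split("[ab]", "carbs")       # split on every "a" or "b"
dashed = re.sub("[0-9]", "-", "R2D2")    # replace each digit with a dash

my_regex = re.compile("[0-9]+")          # compile once, reuse many times
numbers = my_regex.findall("a1b22c333")  # made-up sample input

print(no_match)         # None
print(found.start())    # 1
print(pieces)           # ['c', 'r', 's']
print(dashed)           # R-D-
print(numbers)          # ['1', '22', '333']
```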

Object-oriented programming

Python lets you define classes, which encapsulate data and the functions that operate on it. Below is a hand-written Set class:

    >>> class Set:
    ...     def __init__(self, values=None):    # constructor
    ...         self.dict = {}
    ...         if values is not None:
    ...             for value in values:
    ...                 self.add(value)
    ...     def __repr__(self):
    ...         return "Set:" + str(self.dict.keys())
    ...     def add(self, value):
    ...         self.dict[value] = True
    ...     def contains(self, value):
    ...         return value in self.dict
    ...     def remove(self, value):
    ...         del self.dict[value]
    ...
    >>> s = Set([1, 2, 3])
    >>> s
    Set:[1, 2, 3]
    >>> s.add(4)
    >>> print s
    Set:[1, 2, 3, 4]
    >>> s.remove(3)
    >>> s
    Set:[1, 2, 4]
    >>> print s.contains(3)
    False
    >>> s.dict
    {1: True, 2: True, 4: True}

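Functional tools:

When passing functions around, you sometimes want to partially apply a function to create a new one:

    >>> def exp(base, power):
    ...     return base ** power
    ...
    >>> exp(2, 3)
    8

Suppose we only ever want to compute powers of 2. How can this function be adapted? Method 1 is to define a wrapper:

    def two_to_the(power):
        return exp(2, power)

Alternatively, use functools.partial:

    >>> from functools import partial
    >>> xx = partial(exp, 2)            # now a function of one variable
    >>> xx(3)
    8
    >>> xx = partial(exp, power=2)      # fill in power by name instead
    >>> xx(3)
    9

The functions map, reduce, and filter provide functional alternatives to list comprehensions:

    >>> def double(x):
    ...     return 2 * x
    ...
    >>> xs = [1, 2, 3, 4]
    >>> twice_xs = [double(x) for x in xs]
    >>> twice_xs = map(double, xs)              # map the function over the list
    >>> list_doubler = partial(map, double)     # fill in the first argument with double
    >>> twice_xs = list_doubler(xs)
    >>> twice_xs
    [2, 4, 6, 8]

If you provide multiple lists, map works with a multi-argument function:

    >>> def multiply(x, y):
    ...     return x * y
    ...
    >>> products = map(multiply, [1, 2], [4, 5])
    >>> products
    [4, 10]

That is, [1 * 4, 2 * 5].

filter does the work of the if in a list comprehension:

    >>> def is_even(x):
    ...     return x % 2 == 0
    ...
    >>> x_evens = [x for x in range(4) if is_even(x)]
    >>> x_evens
    [0, 2]
    >>> x_evens = filter(is_even, [1, 2, 3, 4])
    >>> x_evens
    [2, 4]

It can likewise be turned into a one-argument function with partial:

    partial(filter, is_even)

reduce combines the first two elements of a list, then combines that result with the next element, and so on until a single result remains:

    x_product = reduce(multiply, xs)    # 1 * 2 * 3 * 4 = 24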
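These tools behave slightly differently in Python 3, which the sketch below assumes: `map` and `filter` return lazy iterators, so they are wrapped in `list()`, and `reduce` must be imported from `functools`. The helpers `multiply` and `is_even` are defined here as their names suggest.

```python
from functools import partial, reduce

def exp(base, power):
    return base ** power

def multiply(x, y):
    return x * y

def is_even(x):
    return x % 2 == 0

two_to_the = partial(exp, 2)            # fixes base=2
square_of = partial(exp, power=2)       # fixes power=2 by name

xs = [1, 2, 3, 4]
twice_xs = list(map(lambda x: 2 * x, xs))
products = list(map(multiply, [1, 2], [4, 5]))
x_evens = list(filter(is_even, xs))
x_product = reduce(multiply, xs)        # ((1 * 2) * 3) * 4

print(two_to_the(3), square_of(3))      # 8 9
print(twice_xs)                         # [2, 4, 6, 8]
print(products)                         # [4, 10]
print(x_evens)                          # [2, 4]
print(x_product)                        # 24
```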

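enumerate in Python:

Sometimes you want to iterate over a list and use both the elements and their indexes. enumerate produces (index, element) tuples:

    for i, document in enumerate(documents):
        do_something(i, document)

If you only need the indexes:

    for i, _ in enumerate(documents):
        do_something(i)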
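A small runnable sketch of the tuples that enumerate produces, assuming Python 3. The `documents` list is made-up sample data, and the `start` argument is an optional parameter of the built-in that the notes do not mention.

```python
documents = ["first doc", "second doc", "third doc"]   # made-up sample data

index_pairs = list(enumerate(documents))               # (index, element) tuples
indexes = [i for i, _ in enumerate(documents)]         # indexes only
numbered_from_one = list(enumerate(documents, start=1))

print(index_pairs)           # [(0, 'first doc'), (1, 'second doc'), (2, 'third doc')]
print(indexes)               # [0, 1, 2]
print(numbered_from_one[0])  # (1, 'first doc')
```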

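zip and argument unpacking

To zip two or more lists together, use zip, which turns them into a single list of tuples of corresponding elements:

    >>> list1 = ['a', 'b', 'c']
    >>> lis2 = [1, 2, 3]
    >>> zip(list1, lis2)
    [('a', 1), ('b', 2), ('c', 3)]

If the lists have different lengths, zip stops as soon as the shortest one ends.

A list can be unzipped with the same function:

    >>> pairs = zip(list1, lis2)
    >>> letters, numbers = zip(*pairs)
    >>> letters
    ('a', 'b', 'c')

The asterisk performs argument unpacking, and it works with any function:

    def add(a, b):
        return a + b

    add(1, 2)       # 3
    add(*[1, 2])    # 3

args and kwargs

Suppose you want to create a higher-order function that takes some function f as input and returns a new function:

    def doubler(f):
        def g(x):
            return 2 * f(x)
        return g

    def f1(x):
        return x + 1

    g = doubler(f1)
    print g(3) == (3 + 1) * 2       # True
    print g(-1) == (-1 + 1) * 2     # True

This breaks down for functions that take more than one argument. What is needed is a way to specify a function that accepts arbitrary arguments, and argument unpacking provides it:

    >>> def magic(*args):
    ...     print args
    ...
    >>> magic((2, 3, 4), (1, 3, 4))
    ((2, 3, 4), (1, 3, 4))
    >>> def magic(**kwargs):
    ...     print kwargs
    ...
    >>> magic(key="word", ke2="wrods2")
    {'ke2': 'wrods2', 'key': 'word'}

So args is a tuple of the unnamed arguments and kwargs is a dict of the named arguments. It also works the other way around, to supply arguments from a list and a dict:

    >>> def other_way_magic(x, y, z):
    ...     return x + y + z
    ...
    >>> x_Y_list = [1, 2]
    >>> z_dict = {"z": 3}
    >>> print other_way_magic(*x_Y_list, **z_dict)
    6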
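Putting the pieces together, one way to make doubler work for functions of any signature is to accept and forward `*args` and `**kwargs`. This is a sketch assuming Python 3; `f2` and `other_way_magic` are illustrative helper functions.

```python
def doubler(f):
    """Return a function that calls f with any arguments and doubles the result."""
    def g(*args, **kwargs):
        return 2 * f(*args, **kwargs)   # forward everything to f unchanged
    return g

def f2(x, y):
    return x + y

def other_way_magic(x, y, z):
    return x + y + z

g = doubler(f2)
positional = g(1, 2)                 # 2 * (1 + 2)
by_keyword = g(x=1, y=2)             # keyword arguments are forwarded too

total = other_way_magic(*[1, 2], **{"z": 3})
letters, numbers = zip(*[("a", 1), ("b", 2), ("c", 3)])

print(positional, by_keyword)   # 6 6
print(total)                    # 6
print(letters, numbers)         # ('a', 'b', 'c') (1, 2, 3)
```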

When reposting, please cite the original URL: https://ju.6miu.com/read-1304425.html
