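...I had been piling new entries onto "Learning Python: odds and ends", but around entry 62 I wrote something (no idea what) that got the post blocked once. I don't dare touch that post any more, so I'm starting a new one here.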
When running pip install XXX I keep getting ReadTimeoutError: HTTPSConnectionPool(host='....', port=443): Read timed out.
See https://github.com/pypa/warehouse/issues/3826
Try pip install --default-timeout=1000 package_name
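Null dates: NaT
df.apply(lambda x: pd.to_datetime(x, errors='coerce'))
This converts dates stored as strings in df (for example 2020/1/26) to datetime; missing or unparseable values become NaT.
13-digit Unix timestamps, converted to human-readable form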
Unix time is also known as Epoch time or POSIX time:
the number of seconds elapsed since January 1, 1970 (UTC). A 13-digit value is the number of milliseconds since 1970-01-01.
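from datetime import datetime
# divide by 1000 to turn milliseconds into seconds
dt_object = datetime.fromtimestamp(1581162409463/1000)
print(dt_object.strftime("%Y-%m-%d %H:%M:%S"))
print(dt_object)
# Output (local time; this run was on a UTC+8 machine):
2020-02-08 19:46:49
2020-02-08 19:46:49.463000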
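datetime.fromtimestamp() uses the machine's local time zone, so the same timestamp prints differently on different machines. Here is a minimal sketch of a conversion that does not depend on the local zone, assuming UTC is what you want. The helper name ms_to_utc is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

def ms_to_utc(ms):
    """Convert a 13-digit (millisecond) Unix timestamp to an aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # timedelta keeps the millisecond part exact (no float division)
    return epoch + timedelta(milliseconds=ms)

dt_utc = ms_to_utc(1581162409463)
print(dt_utc.isoformat())  # 2020-02-08T11:46:49.463000+00:00
```

The result is 8 hours behind the local-time output above, which is what a UTC+8 clock would give.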
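- plotly: use make_subplots() to draw subplots, adjusting the spacing between the two subplots, their relative widths, and sharing the Y axis:
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=2, # two plots side by side
                    column_widths=[0.2, 0.8], # one takes 20% of the width, the other 80%
                    shared_yaxes=True, # share the Y axis
                    horizontal_spacing=0.01) # shrink the gap between the two plots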
67. plotly color roundup:
Discrete colors:
https://plot.ly/python/discrete-color/
Built-in colors:
https://plot.ly/python/builtin-colorscales/#discrete-color-sequences
A great color-palette website (probably aimed at designers), shown below:
os
Rename a file: os.rename('old','new')
Delete a file: os.system('rm XXX') is enough (os.remove('XXX') does the same without spawning a shell)
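A small sketch of doing both steps without a shell. The temporary directory and the file names are placeholders made up for the demo. os.remove() deletes the file directly and also works on Windows, where rm does not exist.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "old.txt")
    new = os.path.join(tmp, "new.txt")
    with open(old, "w") as fh:
        fh.write("demo")

    os.rename(old, new)    # rename the file
    renamed = os.path.exists(new) and not os.path.exists(old)

    os.remove(new)         # delete the file, no shell involved
    removed = not os.path.exists(new)

print(renamed, removed)  # True True
```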
Checking MD5 checksums in Python
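import hashlib
def file_as_bytes(file):
    # read the whole file into memory and close it afterwards
    with file:
        return file.read()
test = ['XXXXXXXXXXXXXXX.fastq',
        'XXXXXXXXXXXXXX.fastq']
# digest() returns raw bytes; use hexdigest() for the hex string that md5sum prints
[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in test]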
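Reading a whole FASTQ file into memory can get expensive. Here is a hedged sketch of the same check that reads the file in chunks and returns the hex string. The helper name md5_of_file and the 1 MiB chunk size are my own choices, not part of the original note.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it chunk by chunk."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # iter(callable, sentinel) keeps calling fh.read until it returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Demo on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
checksum = md5_of_file(tmp.name)
os.remove(tmp.name)
print(checksum)
```

The returned string can be compared directly with the output of a command-line md5 tool.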
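- Binary data vs. Unicode characters in Python 2 and Python 3
decode: binary ---> unicode characters
encode: unicode characters ---> binary
python2:
- str: 8-bit values (binary); unicode: unicode characters - ASCII is used by default
- with open('XXX.bin', 'r') as ... returns raw 8-bit data by default (no decoding)
python3:
- bytes: 8-bit values (binary); str: unicode characters
- bytes and str are completely different types; even their empty values do not compare equal (b'' != '').
- with open('XXX.bin', 'r') as ... opens the file in text mode and decodes it with the default encoding (usually utf-8, depending on the platform), so to open a binary file in Python 3 you need to set the mode to 'rb'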
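The Python 3 rules above can be checked in a few lines. This is only an illustration; the sample string is made up.

```python
text = "日期 date"               # str: Unicode characters
data = text.encode("utf-8")      # encode: str -> bytes
back = data.decode("utf-8")      # decode: bytes -> str

print(type(text).__name__, type(data).__name__)  # str bytes
print(back == text)              # True
print(b"" == "")                 # False: empty bytes and empty str differ
print(len(text), len(data))      # 7 11 (each CJK character takes 3 bytes in UTF-8)
```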
71. get()
See https://stackoverflow.com/questions/2068349/understanding-get-method-in-python
t = {'a':1,'b':2,'c':3}
t['e'] # raises KeyError
t.get('e',None) # if 'e' is not among the keys, return the default (None)
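A short sketch of where get() tends to pay off: counting items without first checking whether the key exists. The counting loop is my own example, not from the linked answer.

```python
t = {'a': 1, 'b': 2, 'c': 3}

print(t.get('a', 0))   # 1: key exists, default ignored
print(t.get('e', 0))   # 0: key missing, default returned
print(t.get('e'))      # None: the default is None when none is given

# Counting without a KeyError
counts = {}
for ch in "abca":
    counts[ch] = counts.get(ch, 0) + 1
print(counts)          # {'a': 2, 'b': 1, 'c': 1}
```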
- The second argument of enumerate()
a = ['a','b','c']
for ind, i in enumerate(a,2): # start counting at 2 instead of 0
    print(ind, i)
# 2 a
# 3 b
# 4 c
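The second argument only shifts the counter; it does not skip any elements. A typical use, sketched with made-up data, is 1-based numbering for display:

```python
lines = ["first", "second", "third"]

# start=1 gives human-friendly numbering; every element is still visited
numbered = [f"{n}: {line}" for n, line in enumerate(lines, start=1)]
print(numbered)  # ['1: first', '2: second', '3: third']
```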
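- zip() loop
for ai, bi in zip(a,b):
    ... # loop body
In Python 3, zip() returns a lazy iterator; in Python 2 it returns a list of all the tuples it creates, which can eat a lot of memory when iterating over a pair of very large lists. If you need zip in Python 2, take a look at izip (itertools).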
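A quick sketch of what "lazy iterator" means in practice in Python 3, using made-up lists of unequal length. zip_longest from itertools is shown only as a contrast to the truncating behaviour.

```python
from itertools import zip_longest

a = ['a', 'b', 'c']
b = [1, 2]

pairs = zip(a, b)
print(iter(pairs) is pairs)   # True: zip returns an iterator, not a list
print(list(pairs))            # [('a', 1), ('b', 2)]  stops at the shorter input
print(list(pairs))            # []  an iterator is exhausted after one pass

# zip_longest pads the shorter input instead of truncating
padded = list(zip_longest(a, b, fillvalue=None))
print(padded)                 # [('a', 1), ('b', 2), ('c', None)]
```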
- try
See https://www.thegeekstuff.com/2019/05/python-try-except-examples/
a = 12
b = 'test'
try:
    print(a+b) # raises TypeError
except TypeError: # if the code in try raises TypeError, then:
    print(str(a)+b)
# 12test
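try also accepts else and finally clauses. Here is a minimal sketch of which branch runs when. The function add_loosely and its log list are invented for this illustration.

```python
def add_loosely(a, b, log):
    """Add two values, falling back to string concatenation on TypeError."""
    try:
        result = a + b
    except TypeError:
        log.append("except")      # runs only if a + b raised TypeError
        result = str(a) + str(b)
    else:
        log.append("else")        # runs only if no exception was raised
    finally:
        log.append("finally")     # runs in every case
    return result

log1 = []
print(add_loosely(12, 'test', log1), log1)  # 12test ['except', 'finally']
log2 = []
print(add_loosely(1, 2, log2), log2)        # 3 ['else', 'finally']
```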
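pd.read_excel()
Read all the sheets of an Excel file:
pd.read_excel('XXX.xlsx', sheet_name=None)
This returns a dictionary whose keys are the sheet names and whose values are the DataFrames read from each sheet.
Apply log10 to every value of a DataFrame
df.applymap(math.log10)
(import math first)
A function should preferably not return None
because if the returned value ends up in an if/else, None behaves the same as 0, an empty list, and so on (all are falsy), which easily causes bugs.
raise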
def divide(a,b):
    try:
        return a/b
    except ZeroDivisionError as e:
        raise ValueError('what?') from e
divide(5,0)
# the traceback shows both exceptions, linked by "from e":
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-9-b8e948d46537> in divide(a, b)
2 try:
----> 3 return a/b
4 except ZeroDivisionError as e:
ZeroDivisionError: division by zero
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-10-0a52e2eb64c8> in <module>
----> 1 divide(5,0)
<ipython-input-9-b8e948d46537> in divide(a, b)
3 return a/b
4 except ZeroDivisionError as e:
----> 5 raise ValueError('what?') from e
6
7
ValueError: what?
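The link created by "from e" is also available in code through the __cause__ attribute. A small sketch that reuses the divide() function from above:

```python
def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError as e:
        raise ValueError('what?') from e

try:
    divide(5, 0)
except ValueError as err:
    caught = err   # keep a reference; "err" itself is cleared after the block

print(str(caught))                       # what?
print(type(caught.__cause__).__name__)   # ZeroDivisionError
```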
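- list.sort(key=)
key takes a function that is called on each element of the list; the list is then sorted by the returned values.
- Sorting a list of tuples
Tuples are sorted by their first element, then by the second, and so on.
test = [(1,2),(1,19),(1,3),(1,4),(0,3),(0,9),(0,10)]
test.sort()
test
[(0, 3), (0, 9), (0, 10), (1, 2), (1, 3), (1, 4), (1, 19)]
So if you have a list to sort but a group of special items has to go first, turn the special items into (0,x) and everything else into (1,x), set up as in entry 80. Sorting then puts the special items at the front.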
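The two notes can be combined: let key build the (0,x) / (1,x) tuple on the fly, so the list itself never needs to be rewritten. A sketch with made-up sample names, where the set special is hypothetical:

```python
samples = ['s3', 'ctrl2', 's1', 'ctrl1', 's2']
special = {'ctrl1', 'ctrl2'}   # hypothetical items that must come first

# key returns (0, x) for special items and (1, x) for the rest
ordered = sorted(samples, key=lambda x: (0 if x in special else 1, x))
print(ordered)  # ['ctrl1', 'ctrl2', 's1', 's2', 's3']
```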
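- Passing input to subprocess
For subprocess itself, see entry 43 of Odds and ends (1)
import subprocess
p = subprocess.Popen('XXXX', shell=True, stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     text=True) # text=True (Python 3.7+): input/output are str, not bytes
stdout, stderr = p.communicate(input='XXX\nXXXX\nXXXXX')
# separate multiple inputs with \n; stdout is only captured when stdout=subprocess.PIPE is set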
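For a self-contained check of the idea, here is a sketch using subprocess.run (Python 3.7+). It assumes a tiny child script that reads lines from stdin; the child script is invented for the demo and is run with the current interpreter.

```python
import subprocess
import sys

# The child upper-cases every line it reads from stdin
child_code = "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    input="first\nsecond\nthird\n",   # several inputs separated by \n
    capture_output=True,              # collect stdout and stderr
    text=True,                        # work with str instead of bytes
)
answers = result.stdout.split()
print(answers)  # ['FIRST', 'SECOND', 'THIRD']
```

Passing the command as a list avoids shell=True, which sidesteps quoting problems when the arguments contain spaces.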
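- tqdm
To use tqdm in jupyter notebook/lab, this import works best:
from tqdm import tqdm_notebook as tqdm # newer tqdm releases prefer: from tqdm.notebook import tqdm
for i in tqdm([1,2,3,4]):
    ....
- index name
Extract a single index label as a string rather than as an Index object:
df.loc[df['col']==i,:].index.tolist()[0]
# only one element matches here
Select the DataFrame columns of a given dtype
First check which dtypes are present:
df.dtypes.value_counts()
Then select:
df.select_dtypes(include=['XX','XXX'])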
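A runnable sketch of the dtype selection on a toy DataFrame. The column names and values are made up. "number" is a generic selector covering the integer and float columns.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [0.5, 1.5, 2.5],
    "group": ["a", "b", "a"],
})

print(df.dtypes.value_counts())   # how many columns of each dtype

numeric = df.select_dtypes(include="number")   # int and float columns
others = df.select_dtypes(exclude="number")    # everything else
print(list(numeric.columns))  # ['id', 'score']
print(list(others.columns))   # ['group']
```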
Missing value imputation
from sklearn.impute import SimpleImputer, KNNImputer
# fill numeric values with KNN
imputer_n = KNNImputer(n_neighbors=2, weights='uniform')
imputer_n.fit_transform(df)
# fill categorical values with the most frequent value
imputer_c = SimpleImputer(strategy='most_frequent')
imputer_c.fit_transform(df)
List comprehension with else
["Even" if i%2==0 else "Odd" for i in range(10)]
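The position of the if matters. A short sketch contrasting the two forms:

```python
# if/else before "for": a conditional expression, every item is kept
labels = ["Even" if i % 2 == 0 else "Odd" for i in range(4)]
print(labels)  # ['Even', 'Odd', 'Even', 'Odd']

# if after "for": a filter, items failing the test are dropped
evens = [i for i in range(10) if i % 2 == 0]
print(evens)   # [0, 2, 4, 6, 8]
```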
melt with a multi-index (wide --> long)
df
   ID gp     value gp2
0   1  a  0.708910  a1
1   2  a  0.273727  a1
2   3  a  0.161171  a2
3   4  a  0.920273  a2
4   5  b  0.147851  b1
5   6  b  0.957274  b1
6   7  b  0.421100  b2
7   8  b  0.807547  b2
df_mean = df.loc[:,['gp','value','gp2']].groupby(['gp','gp2']).mean()
df_mean
           value
gp gp2
a  a1   0.491318
   a2   0.540722
b  b1   0.552562
   b2   0.614323
pd.melt(df_mean.reset_index(),id_vars = ['gp','gp2'])
  gp gp2 variable     value
0  a  a1    value  0.491318
1  a  a2    value  0.540722
2  b  b1    value  0.552562
3  b  b2    value  0.614323
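90. Opening Excel files with pandas
In a Python 3 environment, pd.read_excel may fail to open the file even when openpyxl is installed.
The engine has to be given explicitly: pd.read_excel(XXXX, sheet_name= 'XXX', engine='openpyxl')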