CS190 Scalable Machine Learning Spark - 1
Python Basics
Part 1: NumPy
NumPy is a Python library for working with arrays.
# It is convention to import NumPy with the alias np
import numpy as np
(1a) Scalar multiplication
Here $ a $ is the scalar (constant) and $ \mathbf{v} $ is the vector.
$$ a \mathbf{v} = \begin{bmatrix} a v_1 \\ a v_2 \\ \vdots \\ a v_n \end{bmatrix} $$
# Create a numpy array with the values 1, 2, 3
simpleArray = np.array([1,2,3])
# Multiply the numpy array by the scalar 5
timesFive = simpleArray * 5
print simpleArray
print timesFive
-----
#result
[1 2 3]
[ 5 10 15]
(1b) Element-wise multiplication and dot product
The element-wise calculation is as follows:
$$ \mathbf{x} \odot \mathbf{y} = \begin{bmatrix} x_1 y_1 \\ x_2 y_2 \\ \vdots \\ x_n y_n \end{bmatrix} $$
The dot product is equivalent to performing element-wise multiplication and then summing the result.
$ w \cdot x $ can also be written as $ w^\top x $:
$$ w \cdot x = \sum_{i=1}^n w_i x_i $$
For element-wise multiplication, use the * operator to multiply two ndarray objects of the same length.
For the dot product, use either np.dot() or np.ndarray.dot().
# Create a ndarray based on a range and step size.
u = np.arange(0, 5, .5)
v = np.arange(5, 10, .5)
elementWise = u * v
dotProduct = np.dot(u,v)
print 'u: {0}'.format(u)
print 'v: {0}'.format(v)
print '\nelementWise\n{0}'.format(elementWise)
print '\ndotProduct\n{0}'.format(dotProduct)
----
#result
u: [ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
v: [ 5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5]
elementWise
[ 0. 2.75 6. 9.75 14. 18.75 24. 29.75 36. 42.75]
dotProduct
183.75
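The equivalence stated above, that the dot product equals the element-wise product followed by a sum, can be checked directly. The following is a minimal sketch that reuses the same u and v; the names sumOfProducts, functionDot, and methodDot are illustrative and not part of the original lab. The parenthesized print is used only so the snippet also runs under Python 3.

```python
import numpy as np

u = np.arange(0, 5, .5)
v = np.arange(5, 10, .5)

# Element-wise multiplication followed by a sum
sumOfProducts = (u * v).sum()

# The two dot-product forms mentioned in the text
functionDot = np.dot(u, v)   # np.dot()
methodDot = u.dot(v)         # np.ndarray.dot()

print('all three agree: {0}'.format(
    np.isclose(sumOfProducts, functionDot) and np.isclose(functionDot, methodDot)))
```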
(1c) Matrix math
np.matrix() creates a matrix.
You can perform matrix math on NumPy matrices using *.
To transpose a matrix, call numpy.matrix.transpose() or use .T on the matrix object (e.g. myMatrix.T).
Transposing a matrix produces a matrix where the new rows are the columns from the old matrix. For example: $$ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^\mathbf{\top} = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} $$
Inverting a matrix can be done using numpy.linalg.inv().
Note that only square matrices can be inverted, and square matrices are not guaranteed to have an inverse. If the inverse exists, then multiplying a matrix by its inverse will produce the identity matrix. $ \scriptsize ( \mathbf{A}^{-1} \mathbf{A} = \mathbf{I_n} ) $ The identity matrix $ \scriptsize \mathbf{I_n} $ has ones along its diagonal and zero elsewhere. $$ \mathbf{I_n} = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{bmatrix} $$
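The two caveats above (only square matrices can be inverted, and not every square matrix has an inverse) can be seen in a small sketch. It assumes plain np.array inputs with np.dot rather than np.matrix, and the matrices and variable names are made up for illustration. In both failure cases numpy.linalg.inv() raises numpy.linalg.LinAlgError.

```python
import numpy as np

# An invertible 2x2 matrix: inverse times matrix gives the identity
invertible = np.array([[2., 1.], [1., 1.]])
product = np.dot(np.linalg.inv(invertible), invertible)
isIdentity = np.allclose(product, np.eye(2))

# A singular matrix: the second row is twice the first, so no inverse exists
singular = np.array([[1., 2.], [2., 4.]])
try:
    np.linalg.inv(singular)
    singularRaised = False
except np.linalg.LinAlgError:
    singularRaised = True

# A non-square (2x3) matrix cannot be inverted at all
nonSquare = np.array([[1., 2., 3.], [4., 5., 6.]])
try:
    np.linalg.inv(nonSquare)
    nonSquareRaised = False
except np.linalg.LinAlgError:
    nonSquareRaised = True

print('identity recovered: {0}'.format(isIdentity))
print('singular raised: {0}'.format(singularRaised))
print('non-square raised: {0}'.format(nonSquareRaised))
```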
For this exercise, multiply $ \mathbf{A} $ times its transpose $ ( \mathbf{A}^\top ) $ and then calculate the inverse of the result $ ( [ \mathbf{A} \mathbf{A}^\top ]^{-1} ) $.
from numpy.linalg import inv
A = np.matrix([[1,2,3,4],[5,6,7,8]])
print 'A:\n{0}'.format(A)
# Print A transpose
print '\nA transpose:\n{0}'.format(A.T)
# Multiply A by A transpose
AAt = A * A.T
print '\nAAt:\n{0}'.format(AAt)
# Invert AAt with np.linalg.inv()
AAtInv = np.linalg.inv(AAt)
print '\nAAtInv:\n{0}'.format(AAtInv)
# Show inverse times matrix equals identity
# We round due to numerical precision
print '\nAAtInv * AAt:\n{0}'.format((AAtInv * AAt).round(4))
result
A:
[[1 2 3 4]
[5 6 7 8]]
A transpose:
[[1 5]
[2 6]
[3 7]
[4 8]]
AAt:
[[ 30 70]
[ 70 174]]
AAtInv:
[[ 0.54375 -0.21875]
[-0.21875 0.09375]]
AAtInv * AAt:
[[ 1. 0.]
[-0. 1.]]
Part 2: Additional NumPy and Spark linear algebra
(2a) Slices
features = np.array([1, 2, 3, 4])
print 'features:\n{0}'.format(features)
# The first three elements of features
firstThree = features[0:3]
# The last three elements of features
lastThree = features[-3:]
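As a quick check of what the two slices select, here is a small sketch using the same features array. The comments describe standard slice semantics: the stop index is excluded, and a negative start counts back from the end.

```python
import numpy as np

features = np.array([1, 2, 3, 4])

# Indices 0, 1, and 2; the stop index 3 is excluded
firstThree = features[0:3]
# Start three positions from the end and run to the end
lastThree = features[-3:]

print('firstThree: {0}'.format(firstThree))
print('lastThree: {0}'.format(lastThree))
```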
(2b) Combining ndarray objects
NumPy provides np.hstack(), which combines arrays column-wise, and np.vstack(), which combines arrays row-wise.
Note that both np.hstack() and np.vstack() take in a tuple of arrays as their first argument.
To horizontally combine three arrays a, b, and c, you would run np.hstack((a, b, c)).
If we had two arrays, a = [1, 2, 3, 4] and b = [5, 6, 7, 8], we could use np.vstack((a, b)) to produce the two-dimensional array: $$ \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix} $$
zeros = np.zeros(8)
ones = np.ones(8)
print 'zeros:\n{0}'.format(zeros)
print '\nones:\n{0}'.format(ones)
zerosThenOnes = np.hstack((zeros,ones)) # A 1-D array of length 16
zerosAboveOnes = np.vstack((zeros,ones)) # A 2 by 8 array
print '\nzerosThenOnes:\n{0}'.format(zerosThenOnes)
print '\nzerosAboveOnes:\n{0}'.format(zerosAboveOnes)
result:
zeros:
[ 0. 0. 0. 0. 0. 0. 0. 0.]
ones:
[ 1. 1. 1. 1. 1. 1. 1. 1.]
zerosThenOnes:
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
zerosAboveOnes:
[[ 0. 0. 0. 0. 0. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1.]]
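The a and b example from the text can be run the same way. This sketch assumes a and b are NumPy arrays; stackedRows and sideBySide are illustrative names. It also shows the resulting shapes: stacking two 1-D arrays with np.hstack() gives a longer 1-D array, while np.vstack() gives a 2-D array.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Row-wise: each input becomes one row of a 2-D array
stackedRows = np.vstack((a, b))
# Column-wise: two 1-D inputs are joined end to end
sideBySide = np.hstack((a, b))

print('vstack shape: {0}'.format(stackedRows.shape))
print('hstack shape: {0}'.format(sideBySide.shape))
```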
(2c) PySpark's DenseVector
PySpark provides a DenseVector class within the module pyspark.mllib.linalg.
DenseVector is used to store arrays of values for use in PySpark. DenseVector actually stores values in a NumPy array and delegates calculations to that object. You can create a new DenseVector using DenseVector() and passing in a NumPy array or a Python list.
Note that DenseVector stores all values as np.float64.
DenseVector objects exist locally and are not inherently distributed. DenseVector objects can be used in the distributed setting by either passing functions that contain them to resilient distributed dataset (RDD) transformations or by distributing them directly as RDDs.
from pyspark.mllib.linalg import DenseVector
numpyVector = np.array([-3, -4, 5])
print '\nnumpyVector:\n{0}'.format(numpyVector)
# Create a DenseVector consisting of the values [3.0, 4.0, 5.0]
myDenseVector = DenseVector([3,4,5])
# Calculate the dot product between the two vectors.
denseDotProduct = myDenseVector.dot(numpyVector)
print 'myDenseVector:\n{0}'.format(myDenseVector)
print '\ndenseDotProduct:\n{0}'.format(denseDotProduct)
numpyVector:
[-3 -4 5]
myDenseVector:
[3.0,4.0,5.0]
denseDotProduct:
0.0
Part 3: Python lambda expressions
A lambda is an anonymous function.
Some links: Lambda Functions, Lambda Tutorial, and Python Functions.
# Example function
def addS(x):
    return x + 's'
# Lambda form
addSLambda = lambda x: x + 's'
# Multiplication
multiplyByTen = lambda x: x * 10
print multiplyByTen(5)
# A lambda takes fewer steps than def
# The first function should add two values, while the second function should subtract the second value from the first value.
def plus(x, y):
    return x + y
def minus(x, y):
    return x - y
functions = [plus, minus]
print functions[0](4, 5)
print functions[1](4, 5)
# The same two functions written as lambdas
lambdaFunctions = [lambda x, y: x + y, lambda x, y: x - y]
print lambdaFunctions[0](4, 5)
print lambdaFunctions[1](4, 5)
Lambda expressions consist of a single expression statement and cannot contain other simple statements. In short, this means that the lambda expression needs to evaluate to a value and exist on a single logical line. If more complex logic is necessary, use def in place of lambda.
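To illustrate where the line between lambda and def falls, here is a short sketch. Both functions are hypothetical examples, not part of the original lab: a conditional expression still counts as a single expression, so it fits in a lambda, whereas a loop with assignments needs def.

```python
# A conditional expression is a single expression, so a lambda can hold it
absLambda = lambda x: x if x >= 0 else -x

# A loop and an accumulator require statements, so def is the right tool
def sumOfSquares(values):
    total = 0
    for value in values:
        total += value * value
    return total

print(absLambda(-3))
print(sumOfSquares([1, 2, 3]))
```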
Expression statements evaluate to a value (sometimes that value is None). Lambda expressions automatically return the value of their expression statement. In fact, a return statement in a lambda would raise a SyntaxError.
The following Python keywords refer to simple statements that cannot be used in a lambda expression: assert, pass, del, print, return, yield, raise, break, continue, import, global, and exec (print and exec are statements only in Python 2, which these notes use). Also, note that assignment statements (=) and augmented assignment statements (e.g. +=) cannot be used either.
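These restrictions can be observed without crashing the interpreter by asking Python to compile small source strings. The helper lambdaBodyCompiles below is a hypothetical name introduced for this sketch. It uses the built-in compile() and reports whether a SyntaxError was raised. The list leaves out print and exec, since their status differs between Python 2 and Python 3.

```python
def lambdaBodyCompiles(body):
    """Return True if 'f = lambda x: <body>' compiles, False on SyntaxError."""
    try:
        compile('f = lambda x: ' + body, '<lambda-check>', 'exec')
        return True
    except SyntaxError:
        return False

# A plain expression is accepted
expressionOk = lambdaBodyCompiles('x + 1')

# Simple statements and assignments are rejected
rejected = ['return x', 'pass', 'assert x', 'x = 1', 'x += 1']
rejectedResults = [lambdaBodyCompiles(body) for body in rejected]

print('expression compiles: {0}'.format(expressionOk))
print('any statement compiles: {0}'.format(any(rejectedResults)))
```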