Sunday, March 20, 2016

Useful R tips #3

An added bonus of implementing machine learning algorithms in R is that you can often rely on optimized, vectorized library functions. Here are some of the functions I've found very useful when dealing with data sets, whether they are training, cross-validation, or test sets.

Random permutations of a set

Suppose you have a matrix X with m rows (examples) and n columns (features):

X[sample(1:nrow(X)), ]

This generates a random permutation of the rows of X, i.e. the rows are reshuffled while each row's contents stay intact. To get only the first K examples of the shuffled matrix, we need only use matrix indexing, as in the statement below

X[sample(1:nrow(X)), ][1:K, ]

Or for clarity, break it up into two lines

Xrand = X[sample(1:nrow(X)), ]
Xperm = Xrand[1:K, ]
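
As a small self-contained sketch of the shuffle-then-take-K idea above: the toy matrix, the seed, and the value of K are made up for illustration, and drop = FALSE is added (not in the snippets above) so that the result stays a matrix even when K is 1.

```r
set.seed(42)                            # fix the seed so the shuffle is reproducible

X <- matrix(1:20, nrow = 5, ncol = 4)   # toy data: 5 examples, 4 features
K <- 3                                  # number of examples to keep (illustrative)

idx   <- sample(nrow(X))                # random permutation of the row indices 1..m
Xrand <- X[idx, , drop = FALSE]         # all rows of X, in shuffled order
Xperm <- Xrand[1:K, , drop = FALSE]     # the first K shuffled rows

# Alternative: draw K distinct row indices directly, without shuffling everything
Xk <- X[sample(nrow(X), K), , drop = FALSE]
```

Keeping idx around is handy when a label vector y has to be shuffled in the same order, e.g. y[idx].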

Duplicate (replicate) a row/column vector m or n times

Suppose we have a single row vector and we wish to replicate it m times so we can perform element-wise operations between X (m x n) and centroid (1 x n)

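# repmat is not part of base R; it is assumed here to come from the pracma package
library(pracma)
# centroid is a 1 x n row vector
# X is an m x n matrix
repmat(centroid, m, 1)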

m here is the number of times (rows) the row vector is replicated. Notice that the third argument of repmat is 1, which means that centroid's columns are not duplicated. If centroid were a column vector instead, we would only need to specify the number of columns to generate, as in the statement below

# centroid is an m x 1 column vector
# X is an m x n matrix
repmat(centroid, 1, n)
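
If an add-on package is not available, the same replication can be sketched in base R. This is an alternative to repmat, not what the snippets above use; the toy values and the names C, D and D2 are made up for illustration.

```r
X <- matrix(1:12, nrow = 4, ncol = 3)      # toy data: m = 4 examples, n = 3 features
centroid <- matrix(c(1, 2, 3), nrow = 1)   # 1 x n row vector

m <- nrow(X)

# Stack m copies of the row vector. byrow = TRUE fills the result row by row,
# so the recycled values of centroid line up with the columns of X.
C <- matrix(centroid, nrow = m, ncol = ncol(X), byrow = TRUE)

D <- X - C                                 # element-wise difference, m x n

# sweep() subtracts centroid[j] from column j without building C explicitly
D2 <- sweep(X, 2, as.vector(centroid))
```

Both D and D2 hold each example's offset from the centroid; sweep avoids allocating the replicated matrix at all.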

Apply a function to each row/column of a matrix

Suppose we need to apply a function to each row or column of a matrix without writing an explicit loop. This is often an issue because functions such as sum(), mean(), max(), and min() collapse the whole matrix into a single value. One solution is to use R's apply function.

To get the mean of each column (or feature) in X, we may do so using:

X_means = apply(X, 2, mean)

Instead of implementing it in a loop like

X_means = array(0, c(1, ncol(X)))
for (n in 1:ncol(X)) {
  X_means[n] = mean(X[, n])
}

If X is an m x n matrix, the above call to apply returns a vector of length n, one mean per column. The second argument of apply, 2, indicates that the function is applied to each column. To apply it to each row, we use 1 instead.

X_means = apply(X, 1, mean)

This returns a vector of length m, one mean per row. Some other useful functions to use with apply are sd (standard deviation), max/min (maximum/minimum), and which.max/which.min (index of the maximum/minimum)
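
A short sketch of apply with a few of these functions; the toy matrix and variable names are made up for illustration. The base R shortcut colMeans is shown only as a cross-check on the column means.

```r
X <- matrix(c(2, 9, 4,
              7, 1, 8), nrow = 2, byrow = TRUE)   # toy data: 2 examples, 3 features

col_means  <- apply(X, 2, mean)        # mean of each column:  4.5 5.0 6.0
col_sd     <- apply(X, 2, sd)          # standard deviation of each column
row_max    <- apply(X, 1, max)         # largest value in each row:  9 8
row_argmax <- apply(X, 1, which.max)   # column index of that value: 2 3

# colMeans() is a built-in shortcut for the column means
same_means <- isTRUE(all.equal(col_means, colMeans(X)))
```

Each result is a plain vector with one entry per column (margin 2) or per row (margin 1).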
