I've found some weird behavior with apply.
Assume I have an arbitrary data frame of ordered factors:
set.seed(4)
x <- ordered(sample(1:10, size = 4, replace = TRUE))
y <- ordered(sample(1:10, size = 4, replace = TRUE))
z <- ordered(sample(1:10, size = 4, replace = TRUE))
data1 <- data.frame(x,y,z)
Now I want to get the ranks for each variable. I could do this two ways:
With a for loop:
rankmat1 <- data1
for (i in seq_along(data1)) {
  rankmat1[, i] <- rank(data1[, i])
}
Or with apply:
rankmat2 <- apply(data1, 2, rank)
So, here are the original levels:
data1
x y z
1 6 9 10
2 1 3 1
3 3 8 8
4 3 10 3
And here are the correct rankings:
rankmat1
x y z
1 4.0 3 4
2 1.0 1 1
3 2.5 2 3
4 2.5 4 2
But why are these rankings from apply permuted differently?
rankmat2
x y z
[1,] 4.0 4 2
[2,] 1.0 2 1
[3,] 2.5 3 4
[4,] 2.5 1 3
This happens with order too:
ordermat1 <- data1
for (i in seq_along(data1)) {
  ordermat1[, i] <- order(data1[, i])
}
ordermat2 <- apply(data1, 2, order)
ordermat1
x y z
1 2 2 2
2 3 3 4
3 4 1 3
4 1 4 1
ordermat2
x y z
[1,] 2 4 2
[2,] 3 2 1
[3,] 4 3 4
[4,] 1 1 3
As requested by the OP, here is a detailed explanation which may help other R users avoid this trap.
As joran has pointed out, apply coerces the data frame into a matrix, thereby replacing the ordered factors with character strings. So, the original data frame
data1
x y z
1 6 9 10
2 1 3 1
3 3 8 8
4 3 10 3
becomes
as.matrix(data1)
x y z
[1,] "6" "9" "10"
[2,] "1" "3" "1"
[3,] "3" "8" "8"
[4,] "3" "10" "3"
Character strings are sorted lexically. Thus, sorting the y column as character returns
sort(c("9", "3", "8", "10"))
[1] "10" "3" "8" "9"
instead of
sort(c(9, 3, 8, 10))
[1] 3 8 9 10
This explains why apply returns a different result for the rank operation here.
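To see the coercion in isolation, here is a minimal sketch that uses fixed values instead of sample(), so it does not depend on the random number generator. The names df and m are made up for illustration.

```r
# Hypothetical data frame of ordered factors with fixed values.
df <- data.frame(
  x = ordered(c(6, 1, 3, 3)),
  y = ordered(c(9, 3, 8, 10))
)

# apply() converts its input with as.matrix() first;
# factor columns become character strings in the process.
m <- as.matrix(df)
is.character(m)

# Ranking the ordered factor follows the level order (3 < 8 < 9 < 10).
rank(df$y)

# Ranking the strings compares them character by character,
# so "10" sorts before "3".
rank(m[, "y"])
```

The two rank() calls receive the same values on paper, but one sees an ordered factor and the other sees strings.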
You can use lapply to compute the rank of each column of the data frame.
as.data.frame(lapply(data1, rank))
x y z
1 4.0 3 4
2 1.0 1 1
3 2.5 2 3
4 2.5 4 2
lapply returns a list, and a data frame is a special kind of list.
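As a small sketch of that point: because a data frame is a list of columns, the result of lapply can also be assigned back into an existing data frame with `[] <-`, which keeps the data frame shape and the column names. The data frame df and the copy ranks are hypothetical, and this idiom is one option rather than the only way to do it.

```r
df <- data.frame(
  y = ordered(c(9, 3, 8, 10)),
  z = ordered(c(10, 1, 8, 3))
)

is.list(df)                   # a data frame is a list of columns

ranks <- df                   # work on a copy so df stays untouched
ranks[] <- lapply(df, rank)   # replace every column, keep the data frame

class(ranks)
ranks
```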
Avoid sapply, because sapply takes the output of lapply and "simplifies" it to whatever it thinks is appropriate. Here,
sapply(data1, rank)
x y z
[1,] 4.0 3 4
[2,] 1.0 1 1
[3,] 2.5 2 3
[4,] 2.5 4 2
returns a matrix (again!), which needs to be coerced to a data frame. (See chapter 8.3.20 of The R Inferno by Patrick Burns. The text is a good read anyway.)
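The shape of the sapply result depends on what each individual call returns, which is the reason for the caution above. A short sketch with made-up inputs (df, same_length and ragged are illustrative names):

```r
df <- data.frame(a = c(3, 1, 2), b = c(10, 30, 20))

# Every column yields a vector of length 3, so sapply() builds a matrix.
same_length <- sapply(df, rank)
is.matrix(same_length)

# When the results differ in length, sapply() cannot simplify them
# and returns a plain list instead.
ragged <- sapply(list(a = 1:3, b = 1:2), rev)
is.list(ragged)

# lapply() returns a list regardless of what the function returns.
is.list(lapply(df, rank))
```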
The OP has not indicated why he needs to work with ordered factors. If factors, ordered or not, are not essential to the OP's underlying problem, then plain integer columns could be used instead, and apply would have worked as expected.
set.seed(4)
x2 <- sample(1:10, size = 4, replace = TRUE)
y2 <- sample(1:10, size = 4, replace = TRUE)
z2 <- sample(1:10, size = 4, replace = TRUE)
data2 <- data.frame(x2, y2, z2)
data2
x2 y2 z2
1 6 9 10
2 1 3 1
3 3 8 8
4 3 10 3
apply(data2, 2, rank)
x2 y2 z2
[1,] 4.0 3 4
[2,] 1.0 1 1
[3,] 2.5 2 3
[4,] 2.5 4 2
(Nevertheless, it is better to use lapply instead of apply with a data frame.)
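If the ordered factors merely hold numbers, one option is to convert them back to numeric columns first. The sketch below assumes that every level is a number written as text. Note that as.numeric() applied directly to a factor returns the internal level codes rather than the original values, so the conversion goes through as.character(). The names df and num are hypothetical.

```r
df <- data.frame(
  y = ordered(c(9, 3, 8, 10)),
  z = ordered(c(10, 1, 8, 3))
)

as.numeric(df$y)   # level codes, not the original numbers

# Convert each factor column via its labels.
num <- as.data.frame(lapply(df, function(f) as.numeric(as.character(f))))
num$y

# With plain numeric columns, apply() ranks by numeric value.
apply(num, 2, rank)
```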
When I started to learn R, I was misled by the name of the function ordered(). It took me a while to understand that it creates a special kind of factor. Likewise, it took me some time to figure out the difference between sort() and order() and when to use which function appropriately.
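For readers who hit the same confusion, a brief sketch of the difference, using a made-up vector v and data frame df: sort() returns the values themselves, while order() returns the positions that would sort them.

```r
v <- c(9, 3, 8, 10)

sort(v)       # the values in increasing order
order(v)      # the positions that put v in increasing order
v[order(v)]   # indexing with order() reproduces sort()

# order() is the one to use for reordering the rows of a data frame by a column.
df <- data.frame(id = c("a", "b", "c", "d"), v = v)
df[order(df$v), ]
```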
I am not sure of the exact reason why this happens with the apply function, but you could try sapply to solve the problem.
rankmat3 <- as.data.frame(sapply(data1, rank))
The result would be:
rankmat3
x y z
1 4.0 3 4
2 1.0 1 1
3 2.5 2 3
4 2.5 4 2