+-
到目前为止,我们所学到的是,要找到主键,就要找到列变量的组合,每个列变量的值只能匹配一个观测值。所以,主键中没有一个观测值与数据集中的多个观测值相匹配。
假如你找到了由列A、W、X、Y、Z组成的主键,因为当你按它们分组时,数据集中的每一行都是唯一的。
但后来发现你可以只对A、W、X、Z进行分组。
我的问题如下图所示,这两个列的分组都产生了不同的值,但事实证明我不需要包括Salary。
Lahman::Salaries %>% count(playerID, yearID, salary, teamID) %>% filter(n>1)
A tibble: 0 x 5
… with 5 variables: playerID <chr>, yearID <int>, salary <int>, teamID <fct>,
n <int>
Lahman::Salaries %>% count(playerID, yearID, teamID) %>% filter(n>1)
A tibble: 0 x 4
… with 4 variables: playerID <chr>, yearID <int>, teamID <fct>, n <int>
如何知道哪些列是不需要的,因为你要检查下一列,当包含时,是否给你主键?
1
投票
投票
这里有一个函数,它对处理大数据的效率并不做作。
#' @param x 'data.frame'
#' @param max.columns 'integer', the max number of columns to consider
#' for a key; with most data, this might be at most 'ncol(x)-1'
#' (assuming that there is some numeric or "data" field)
#' @param find 'integer', how many column combinations to look for; if
#' 1 (default), then the first combination is returned, no others
#' attempted; if below 1, all combinations (up through
#' 'max.columns') are considered
#' @param digits 'integer', how to deal with 'numeric' non-'integer'
#' columns; if 'NA' (default), do not consider said columns; if a
#' non-negative integer, the number of digits to round the columns
#' and factorize; note: setting this too low may "find" keys in
#' non-trivial-precision floating-point numbers
#' @return 'list' of 'character', combinations of columns that unique
#' identify the rows given
findkeys <- function(x, max.columns = ncol(x) - 1L, find = 1L, digits = NA) {
isnum <- sapply(x, function(z) is.numeric(z) & !is.integer(z))
force(max.columns)
if (is.na(digits) || digits < 0) {
x <- x[, !isnum, drop = FALSE ]
} else {
possfactors <- lapply(x[,isnum], function(z) as.character(z - trunc(z)))
signif_digits <- sapply(possfactors, function(s) max(nchar(gsub("^0\\.?", "", gsub("(0{5,}|9{5,}).*", "", s)))))
remove <- signif_digits[ signif_digits > digits ]
message("#> Removing: ", paste(sQuote(names(remove)), collapse = ","))
x <- x[, setdiff(colnames(x), names(remove)), drop = FALSE]
}
max.columns <- min(c(max.columns, ncol(x)))
keys <- list()
if (max.columns < 1) {
warning("no columns found, no keys")
return(keys)
}
message("#> Columns : ", paste(sQuote(colnames(x)), collapse = ","))
for (i in seq_len(max.columns)) {
combs <- combn(names(x), i)
for (j in seq_len(ncol(combs))) {
uniq <- unique(x[, combs[,j], drop = FALSE ])
if (nrow(x) == nrow(uniq)) {
keys <- c(keys, list(combs[,j]))
if (find > 0 && length(keys) >= find) return(keys)
}
}
}
keys
}
执行。
findkeys(mtcars)
# Warning in findkeys(mtcars) : no columns found, no keys
# list()
findkeys(mtcars, digits=1)
# #> Removing: 'drat','wt','qsec'
# #> Columns : 'mpg','cyl','disp','hp','vs','am','gear','carb'
# list()
findkeys(mtcars, digits=2)
# #> Removing: 'wt'
# #> Columns : 'mpg','cyl','disp','hp','drat','qsec','vs','am','gear','carb'
# [[1]]
# [1] "mpg" "qsec"
findkeys(tibble::rownames_to_column(mtcars))
# #> Columns : 'rowname'
# [[1]]
# [1] "rowname"
findkeys(Lahman::Salaries)
# #> Columns : 'yearID','teamID','lgID','playerID','salary'
# [[1]]
# [1] "yearID" "teamID" "playerID"
findkeys(Lahman::Salaries, find = 0)
# #> Columns : 'yearID','teamID','lgID','playerID','salary'
# [[1]]
# [1] "yearID" "teamID" "playerID"
# [[2]]
# [1] "yearID" "teamID" "lgID" "playerID"
# [[3]]
# [1] "yearID" "teamID" "playerID" "salary"
这当然是不完美的 例如,我非常怀疑我们是否想使用 "salary"
键入 Lahman::Salaries
.
它当然可以做得更聪明,但就启发式方法而言,它的速度很快,很可能有缺陷,而且目前还算不错。当使用这个方法时,你可以考虑在这一步之前去掉所有已知的浮点数,也许可以对大数据进行下采样......因为知道采样可能会出现假阳性。