regex - Vectorized pattern matching that returns the matching pattern in R

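My question is mainly about efficiency.

I have a vector of patterns that I would like to match against a vector x.

The result should give, for each element of x, the pattern that matched it. A second condition is that if several patterns match a given element of x, the first matching pattern is returned.

For example, say the vector of patterns is: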

patterns <- c("[0-9]{2}[a-zA-Z]", "[0-9][a-zA-Z] ", " [a-zA-Z]{3} ")

and the vector x is:

x <- c("abc 123ab abc", "abc 123 abc ", "a", "12a ", "1a ")

The end result would be:

customeRExp(patterns, x)
[1] "[0-9]{2}[a-zA-Z]" " [a-zA-Z]{3} "
[3]  NA                "[0-9]{2}[a-zA-Z]"
[5] "[0-9][a-zA-Z] "

This is what I have so far:

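customeRExp <- function(pattern, x) {
    # m[i, j] is TRUE when pattern[i] matches x[j]
    m <- matrix(NA, ncol = length(x), nrow = length(pattern))
    for (i in seq_along(pattern)) {
        m[i, ] <- grepl(pattern[i], x)
    }
    # Row index of the first matching pattern for each element of x; min() of
    # an empty set yields Inf (with a warning), and pattern[Inf] is NA
    indx <- suppressWarnings(apply(m, 2, function(y) min(which(y))))
    pattern[indx]
}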

customeRExp(patterns, x)

which correctly returns:

[1] "[0-9]{2}[a-zA-Z]" " [a-zA-Z]{3} "    NA                
[4] "[0-9]{2}[a-zA-Z]" "[0-9][a-zA-Z] "

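The problem is that my dataset is huge, and the list of patterns is quite big too.

Is there a more efficient way of doing the same thing?

Best answer

My default approach for speeding up loops like the one above is usually to rewrite them in C++. Here is a quick attempt using Boost Xpressive: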

// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
#include <boost/xpressive/xpressive.hpp>
#include <string>
#include <vector>

namespace xp = boost::xpressive;

// For each element of x, return the first regex in re that matches it,
// or NA when none of them match.
// [[Rcpp::export]]
Rcpp::CharacterVector
first_match(Rcpp::CharacterVector x, Rcpp::CharacterVector re) {
    R_xlen_t nx = x.size(), nre = re.size(), i = 0, j = 0;
    Rcpp::CharacterVector result(nx, NA_STRING);
    std::vector<xp::sregex> vre(nre);

    // Compile every pattern once, up front.
    for ( ; j < nre; j++) {
        vre[j] = xp::sregex::compile(std::string(re[j]));
    }

    for ( ; i < nx; i++) {
        for (j = 0; j < nre; j++) {
            // Stop at the first pattern that matches x[i].
            if (xp::regex_search(std::string(x[i]), vre[j])) {
                result[i] = re[j];
                break;
            }
        }
    }

    return result;
}

The point of this approach is to avoid unnecessary computation by breaking out of the inner loop as soon as a matching regex is found.
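To illustrate the early-exit idea on its own, outside of R, here is a minimal sketch that uses only the C++ standard library. It makes several assumptions that are not part of the answer: it swaps in std::regex (default ECMAScript syntax) for Boost Xpressive, takes plain std::vector<std::string> arguments instead of Rcpp vectors, and the name first_match_index is hypothetical. It returns 0-based pattern indices, with -1 standing in for NA.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): for each string in x, return the
// 0-based index of the first pattern that matches it, or -1 if none match.
// Assumption: std::regex with its default ECMAScript grammar stands in for
// Boost Xpressive, so regex syntax support may differ from the Rcpp version.
std::vector<int> first_match_index(const std::vector<std::string>& x,
                                   const std::vector<std::string>& patterns) {
    // Compile each pattern once instead of once per (string, pattern) pair.
    std::vector<std::regex> compiled;
    compiled.reserve(patterns.size());
    for (const std::string& p : patterns) {
        compiled.emplace_back(p);
    }

    std::vector<int> result(x.size(), -1);
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < compiled.size(); ++j) {
            if (std::regex_search(x[i], compiled[j])) {
                result[i] = static_cast<int>(j);
                break;  // early exit: later patterns are never tried
            }
        }
    }
    return result;
}
```

Called with the sample patterns and x from the question, this should yield the indices 0, 2, -1, 0, 1, which correspond to the patterns in the expected output. The saving from the break grows with the number of patterns and with how early in the list a typical match occurs; elements that match nothing still cost one search per pattern.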

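The performance gain is not staggering (roughly 40% by median time), but it is an improvement over your current function. Here is a test using larger versions of your sample data: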

x2 <- rep(x, 5000)
p2 <- rep(patterns, 100)

all.equal(first_match(x2, p2), customeRExp(p2, x2))
#[1] TRUE

microbenchmark::microbenchmark(
    first_match(x2, p2),
    customeRExp(p2, x2),
    times = 50
)
# Unit: seconds
#                 expr      min       lq     mean   median       uq      max neval
#  first_match(x2, p2) 1.743407 1.780649 1.900954 1.836840 1.931783 2.544041    50
#  customeRExp(p2, x2) 2.368621 2.459748 2.681101 2.566717 2.824887 3.553025    50

Another option would be to look into the stringi package, which generally outperforms base R by a good margin.

https://stackoverflow.com/questions/38809251/
