powershell - How to sort 30 million csv records in Powershell

I am using an OleDbConnection to sort a csv file on its first column. The OleDb connection successfully processed up to 9 million records in 6 minutes. But when I ran it against 10 million records, I got the following error:

Exception calling "ExecuteReader" with "0" argument(s): "The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or there is not enough temporary storage space on the disk to store the query result."

Is there any other solution for sorting 30 million records using Powershell?

Here is my script:

$OutputFile = "D:\Performance_test_data\output1.csv"
$stream = [System.IO.StreamWriter]::new( $OutputFile )

$sb = [System.Text.StringBuilder]::new()
$sw = [Diagnostics.Stopwatch]::StartNew()

$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='D:\Performance_test_data\';Extended Properties='Text;HDR=Yes;CharacterSet=65001;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from 1crores.csv order by col6"

$conn.open()

$data = $cmd.ExecuteReader()

echo "Query has been completed!"
$stream.WriteLine( "col1,col2,col3,col4,col5,col6")

while ($data.read()) 
{ 
  $stream.WriteLine( $data.GetValue(0) +',' + $data.GetValue(1)+',' + $data.GetValue(2)+',' + $data.GetValue(3)+',' + $data.GetValue(4)+',' + $data.GetValue(5))

}
echo "data written successfully!!!"

$stream.close()
$sw.Stop()
$sw.Elapsed

$cmd.Dispose()
$conn.Dispose()

Best Answer

You can try this:

$CSVPath = 'C:\test\CSVTest.csv'
$Delimiter = ';'

# list we use to hold the results
$ResultList = [System.Collections.Generic.List[Object]]::new()

# Create a stream (I use OpenText because it returns a streamreader)
$File = [System.IO.File]::OpenText($CSVPath)

# Read and parse the header
$HeaderString = $File.ReadLine()

# Get the properties from the string, replace quotes
$Properties = $HeaderString.Split($Delimiter).Replace('"',$null)
$PropertyCount = $Properties.Count

# now read the rest of the data, parse it, build an object and add it to a list
while ($File.EndOfStream -ne $true)
{
    # Read the line
    $Line = $File.ReadLine()
    # split the fields and replace the quotes
    $LineData = $Line.Split($Delimiter).Replace('"',$null)
    # Create a hashtable with the properties (we convert this to a PSCustomObject later on). I use an ordered hashtable to keep the order
    $PropHash = [System.Collections.Specialized.OrderedDictionary]@{}
    # for loop to add the properties and values
    for ($i = 0; $i -lt $PropertyCount; $i++)
    { 
        $PropHash.Add($Properties[$i],$LineData[$i])
    }
    # Now convert the data to a PSCustomObject and add it to the list
    $ResultList.Add($([PSCustomObject]$PropHash))
}

# Close the reader now that all lines have been consumed
$File.Close()

# Now you can sort this list using Linq:
Add-Type -AssemblyName System.Linq
# Sort using propertyname (my sample data had a prop called "Name")
$Sorted = [Linq.Enumerable]::OrderBy($ResultList, [Func[object,string]] { $args[0].Name })

Instead of using Import-Csv, I wrote a quick parser that uses a StreamReader, parses the CSV data on the fly, and puts it into a PSCustomObject, which is then added to a list.
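For reference, here is a self-contained usage example of the Linq sort shown above, with a small in-memory list standing in for the parsed CSV (the "Name" property matches the answer's sample data):

```powershell
# Small in-memory stand-in for the parsed CSV rows
Add-Type -AssemblyName System.Linq

$List = [System.Collections.Generic.List[Object]]::new()
$List.Add([PSCustomObject]@{ Name = 'charlie'; Id = 3 })
$List.Add([PSCustomObject]@{ Name = 'alpha';   Id = 1 })
$List.Add([PSCustomObject]@{ Name = 'bravo';   Id = 2 })

# OrderBy is lazy; wrapping it in @() forces enumeration into an array
$Sorted = @([Linq.Enumerable]::OrderBy($List, [Func[object,string]] { $args[0].Name }))
```

Note that `[Linq.Enumerable]::OrderBy` returns a lazy enumerable, so nothing is actually sorted until you enumerate the result.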

Edit: fixed the linq example
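To complete the round trip, here is a minimal, self-contained sketch of the write-back step: streaming the sorted objects to a file in the original column order, mirroring the questioner's StreamWriter loop. In the real script, `$Sorted`, `$Properties`, and `$HeaderString` would come from the parsing and sorting code above; the inline sample data and the temp-file path here are placeholders so the sketch runs on its own.

```powershell
# Placeholder inputs standing in for the variables built by the parser above
$Delimiter    = ';'
$Properties   = @('Name', 'Id')
$HeaderString = $Properties -join $Delimiter
$Sorted = @(
    [PSCustomObject]@{ Name = 'alpha'; Id = '1' }
    [PSCustomObject]@{ Name = 'bravo'; Id = '2' }
)

# Temp file used as a stand-in output path
$OutPath = [System.IO.Path]::GetTempFileName()
$Writer  = [System.IO.StreamWriter]::new($OutPath)
try {
    $Writer.WriteLine($HeaderString)
    foreach ($Row in $Sorted) {
        # Rebuild each record in the original column order
        $Fields = foreach ($p in $Properties) { $Row.$p }
        $Writer.WriteLine($Fields -join $Delimiter)
    }
}
finally {
    $Writer.Close()
}
```

Streaming the output this way keeps memory usage flat during the write, which matters at the 30-million-row scale the question asks about.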

https://stackoverflow.com/questions/66057891/
