I'm sure someone out there has dealt with this before, so here goes. We have a process that outputs, once every half hour, a CSV file of the last X lines from a given log. The activity recorded in this log has peaks and valleys, so the last X lines might cover only the last hour's activity at busy times, or the last six hours when things are quiet. The end result is that the output CSV files overlap unpredictably.
Is there a tool, script, process, or anything of that sort - anything in Windows, or shell scripts in Unix - that can unify a bunch of these CSVs into one CSV file? I can do it manually in Excel, but it's a pain.
I can probably work something out along the shell-script line with diff and head, but it would be nice if someone has already got one knocking around.
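For the record, the quick-and-dirty shape I have in mind is something like the sketch below -- it assumes the exports have no header rows, that identical lines really are the same log entry, and that plain sorted output is acceptable (e.g. because each line starts with a timestamp); the file names are just placeholders:

# assumes: no header rows, identical lines = the same entry,
# and lexicographic order is fine (lines begin with a timestamp)
cat export-*.csv | sort | uniq > merged.csv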
From:
no subject
The man page for sort will give information on sorting by a certain field.
A better solution would be to have real log rotation, or to have the process record the last entry from the previous run and start from that. It's possible that those already exist and you're just looking for a quick-and-dirty way to grab an overview.
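For example -- a sketch only, assuming comma-delimited files whose first field is a sortable timestamp, and GNU sort (which falls back to comparing the whole line when keys tie, so identical lines end up adjacent for uniq); the file names are placeholders:

# -t, sets comma as the field separator, -k1,1 sorts on the first field
# (the timestamp); uniq then drops the identical overlap lines
sort -t, -k1,1 file1.csv file2.csv | uniq > merged.csv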
From:
no subject
{~/BOSS}
$ cat /tmp/make-sortable
#!/usr/bin/perl -w
use strict;

# Prefix each input line with its md5sum and its original line number, so
# that sort|uniq can drop the duplicated lines and the line numbers can
# restore the original order afterwards.  Deliberately quick and dirty:
# it shells out to md5sum once per line, which is slow but trivial to do
# with plain shell tools.
my $line = 0;
while (<>) {
    print STDERR ">>> working on $_";   # $_ still carries its newline
    my $md5 = `echo "$_" | md5sum`;     # hash the line via the shell
    $md5 =~ s/ .*//;                    # keep only the hex digest
    chomp $md5;
    print STDERR "Got md5sum $md5\n";
    $line = $line + 1;
    print "$md5 $line $_";              # digest, line number, original line
}
{~/BOSS}
$ cat /tmp/test
this,is,a,csv,line
so,is,this,one,too
this,is,a,csv,line
so,is,this,one,too
this,is,a,new,line
{~/BOSS}
$ /tmp/make-sortable /tmp/test | sort | uniq --check-chars 33 | sort -k 2 > /tmp/out
>>> working on this,is,a,csv,line
Got md5sum af169b30f884f16f0e79af4cdfd717e3
>>> working on so,is,this,one,too
Got md5sum 79051761d8d11902ce692ce2fb2dcd1a
>>> working on this,is,a,csv,line
Got md5sum af169b30f884f16f0e79af4cdfd717e3
>>> working on so,is,this,one,too
Got md5sum 79051761d8d11902ce692ce2fb2dcd1a
>>> working on this,is,a,new,line
Got md5sum 1a1b8a55f2d38af0ef6df523f07c3073
{~/BOSS}
$ cat /tmp/out
af169b30f884f16f0e79af4cdfd717e3 1 this,is,a,csv,line
79051761d8d11902ce692ce2fb2dcd1a 2 so,is,this,one,too
1a1b8a55f2d38af0ef6df523f07c3073 5 this,is,a,new,line
(Memo to self: good interview question!)
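A possible follow-on, sketched but not run against real data: make the final sort numeric so line 10 doesn't land before line 2, and strip the md5/line-number prefix back off with cut (field 3 onwards is the original CSV line):

# -k2,2n = numeric sort on the line-number field
# cut -d' ' -f3- = drop the md5 and line-number columns
/tmp/make-sortable /tmp/test 2>/dev/null | sort | uniq --check-chars 33 | sort -k2,2n | cut -d' ' -f3- > /tmp/clean.csv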
From:
no subject
To be honest, the difference in human-perceptible terms is pretty minimal in the little testing I've done. Either should be a step up from doing it manually in Excel :)
From:
no subject
This is *so much faster my eyes bled*.
$ time /tmp/dothing /tmp/test 2>/dev/null > /tmp/out
/tmp/dothing /tmp/test 2> /dev/null > /tmp/out 0.04s user 0.00s system 160000% cpu 0.000 total
*ahem* May not be accurate results :)
Memo to self: wait an hour before posting, and embarrassingly broken interim versions will remain unseen.
EDIT: Ah -- the reason I went with the original crappy version was that it should be easy to implement quickly in plain-old-shell. I knew I had a reason :)
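For comparison -- just a sketch, and not necessarily what /tmp/dothing does -- the usual one-pass way to drop duplicates while keeping first-seen order is an awk one-liner, assuming the set of unique lines fits in memory:

# print a line the first time it's seen, skip later repeats,
# keeping the original order; unique lines are held in memory
awk '!seen[$0]++' /tmp/test > /tmp/out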
From:
no subject
type *.csv >> target.csv
will merge them all (put target.csv in a different folder so the wildcard doesn't match it and append the file to itself)
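Worth noting that type only concatenates, so the rows where the exports overlap will still appear twice in target.csv. If a Unix-ish toolset (Cygwin or similar) happens to be on the box -- an assumption, and the file names are placeholders -- the same dedup idea as above applies:

# keep the first occurrence of each line, preserving order
awk '!seen[$0]++' target.csv > deduped.csv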