I'm sure someone out there has dealt with this before, so here goes. We have a process that outputs, once every half hour, a CSV file of the last X lines from a given log. The activity recorded in this log has peaks and valleys, so the last X lines might cover only the last hour's activity at busy times, or the last six hours when things are quiet. The end result is that the output CSV files overlap unpredictably.

Is there a tool, script, process, or anything of that sort - anything in Windows, or shell scripts in Unix - that can unify a bunch of these CSVs into one CSV file? I can do it manually in Excel, but it's a pain.

I can probably work out something in the shell script line with diff and head, but it would be nice if someone has already got one knocking around.
(deleted comment)

From: [identity profile] utterlymundane.livejournal.com


If I'm understanding you right, and assuming there's a timestamp on each entry, or some other unique and ordered identifier, you could try running them through a sort -u (sort them, and remove duplicate entries).

The man page for sort will give information on sorting by a particular field.
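A minimal sketch of the sort -u approach, assuming the rows start with a sortable timestamp so a plain lexical sort also puts them in time order. The file names (export-1.csv, export-2.csv, merged.csv) and sample rows are invented for illustration:

```shell
# Two overlapping half-hourly exports (invented sample data).
printf '2008-01-15 00:00,a\n2008-01-15 00:30,b\n' > export-1.csv
printf '2008-01-15 00:30,b\n2008-01-15 01:00,c\n' > export-2.csv

# Sort all the rows together and drop exact duplicates in one step.
sort -u export-1.csv export-2.csv > merged.csv
cat merged.csv
```

Note that this dedups on the whole line, so two entries with the same timestamp but different content both survive; it only reorders the output if the rows weren't already in timestamp order.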

A better solution would be to have real log rotation, or to have the process record the last entry from the previous run and start from there. It's possible those already exist and you're just looking for a quick-and-dirty way to grab an overview.

From: [identity profile] utterlymundane.livejournal.com


Also, if there's no single unique field, you could give them one with md5sum, saving the line order so it can be restored later. This is in no way suitable for production use, but it serves as proof of concept:

{~/BOSS}
$ cat /tmp/make-sortable
#!/usr/bin/perl -w
use strict;

my $line = 0;
while (<>) {
    print STDERR ">>> working on $_\n";
    my $md5 = `echo "$_"|md5sum`;
    $md5 =~ s/ .*//;
    chomp $md5;
    print STDERR "Got md5sum $md5\n";
    $line = $line + 1;
    print "$md5 $line $_";
}
{~/BOSS}
$ cat /tmp/test
this,is,a,csv,line
so,is,this,one,too
this,is,a,csv,line
so,is,this,one,too
this,is,a,new,line
{~/BOSS}
$ /tmp/make-sortable /tmp/test | sort | uniq --check-chars 33 | sort -k 2 > /tmp/out
>>> working on this,is,a,csv,line

Got md5sum af169b30f884f16f0e79af4cdfd717e3
>>> working on so,is,this,one,too

Got md5sum 79051761d8d11902ce692ce2fb2dcd1a
>>> working on this,is,a,csv,line

Got md5sum af169b30f884f16f0e79af4cdfd717e3
>>> working on so,is,this,one,too

Got md5sum 79051761d8d11902ce692ce2fb2dcd1a
>>> working on this,is,a,new,line

Got md5sum 1a1b8a55f2d38af0ef6df523f07c3073
{~/BOSS}
$ cat /tmp/out
af169b30f884f16f0e79af4cdfd717e3 1 this,is,a,csv,line
79051761d8d11902ce692ce2fb2dcd1a 2 so,is,this,one,too
1a1b8a55f2d38af0ef6df523f07c3073 5 this,is,a,new,line




(Memo to self: good interview question!)

From: [identity profile] utterlymundane.livejournal.com


Depending on size, it's probably also better to walk through the file, md5sum, and print or drop the lines as appropriate. The method above is... not efficient :)

$ cat /tmp/dothing 
#!/usr/bin/perl -w
use strict;

my %seen;
while (<>) {
    print STDERR ">>> working on $_\n";
    my $md5=`echo "$_"|md5sum`;
    $md5=~s/ .*//;
    chomp $md5;
    print STDERR ">>> Got md5sum $md5\n";
    if (not defined $seen{$md5}) {
       $seen{$md5} = 1;
       print "$_";
    }
}



To be honest, the difference in human-perceptible terms is pretty minimal in the little testing I've done. Either should be a step up from doing it manually in Excel :)

From: [identity profile] utterlymundane.livejournal.com


Heh. Of course, we don't need the md5sum in that second one :)

This is *so much faster my eyes bled*.

$ time /tmp/dothing /tmp/test 2>/dev/null > /tmp/out
/tmp/dothing /tmp/test 2> /dev/null > /tmp/out 0.04s user 0.00s system 160000% cpu 0.000 total


*ahem* May not be accurate results :)

$ cat /tmp/dothing 
#!/usr/bin/perl -w
use strict;

my %seen;
while (<>) {
    print STDERR ">>> working on $_\n";
    if (not defined $seen{$_}) {
       $seen{$_} = 1;
       print "$_";
    }
}
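For what it's worth, the same first-seen-wins trick is also a classic awk one-liner, using an associative array exactly like the %seen hash. The input file name and sample rows below are invented for illustration:

```shell
# Invented sample input with duplicate rows.
printf 'a,b\nc,d\na,b\ne,f\n' > test.csv

# Print each line only the first time it appears, preserving order:
# seen[$0]++ is 0 (false) on first sight, so !seen[$0]++ is true once.
awk '!seen[$0]++' test.csv
```

This prints a,b then c,d then e,f, with the repeats dropped and the original order kept.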



Memo to self: wait an hour before posting, and embarrassingly broken interim versions will remain unseen.

EDIT: Ah -- the reason I went with the original crappy version was that it should be easy to implement quickly in plain old shell. I knew I had a reason :)
Edited Date: 2008-01-15 01:09 pm (UTC)

From: [identity profile] gothwalk.livejournal.com


I appreciate the work, and will preserve this for posterity.

From: [identity profile] xnamkrad.livejournal.com


Has done stuff like this before, using DOS commands (yes, showing my age):

type *.csv >> target.csv

will merge them all (target should be in a different folder)