#!/usr/bin/perl
=pod
In an attempt to make the C code as focused and single-purpose as possible, and
not too closely tied to any particular workflow, I've written this script to
extract the data from an RRD file, and print the output to several separate
flat text files. Each output file is a simple list containing the data
extracted from a distinct RRA within the original RRD file - there is a
one-to-one correspondence between output files and RRAs.
Each output file contains one field per line: the first line is a positive
integer giving the count of (trimmed) data points, the second line is an
integer representing the sign exponent (either 1 or -1), the following lines
are floating-point numbers (in scientific notation) representing the data
points, and the final line is the original row count before any leading or
trailing NaN rows were trimmed. The individual C code transforms are written
to be able to read this simple format, and output the same format.
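As a concrete illustration, an output file holding three data points for a
forward transform might look like this (the values are invented):

 3
 1
 1.2500000000e+01
 3.7000000000e+00
 5.0000000000e-01
 10

The trailing 10 is the row count before any leading/trailing NaN rows were
trimmed - see the main loop below.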
It is envisioned that this is a stepping-stone to a more tightly integrated
process. The intermediate step of writing to output files can be replaced by
starting the transform as an external process, and piping the input to it
directly.
Further integration can be obtained by reading the output from another pipe and
assembling it, as additional RRAs, into the original data structure generated
from the input XML. That data structure can then be written out as a new XML
stream, and piped to RRDtool's "restore" facility, to create a new RRD file
with RRAs representing the transformed frequency-domain data, which can then be
used to generate graphs.
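A minimal sketch of that piped arrangement, assuming a hypothetical external
executable named "./transform" that reads and writes the flat format described
above (illustrative only - neither the executable nor the @fields variable is
part of this script):

 use IPC::Open2;
 my $pid = open2( my $fromTransform, my $toTransform, './transform' );
 print {$toTransform} "$ARG\n" foreach @fields;  # send count, sign, data points
 close $toTransform;
 my @result = readline $fromTransform;           # read back the same format
 waitpid $pid, 0;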
=cut
use strict;
use warnings;
use List::Util;
use XML::Simple;
use Data::Dumper;
use English qw( -no_match_vars );
use Getopt::Long qw( :config no_ignore_case );
sub cartesian_product { # footnote 1 #
    List::Util::reduce {
        return unless defined $a and defined $b; # just to get it to stop complaining
        return [
            map {
                my $item = $ARG;
                map [ @$ARG, $item ], @$a;
            } @$b
        ];
    } [[]], @ARG;
}
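# For illustration, a hypothetical call with two small lists:
#   cartesian_product( [ 1, 2 ], [ 'a', 'b' ] )
# returns [ [1,'a'], [2,'a'], [1,'b'], [2,'b'] ] - every combination of one
# element drawn from each input list. Below, it pairs each <database> block
# with each <pdp_per_row> and <cf> value.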
my $handIn;
my $inputfile;
my $depth = 0;
local $/;
GetOptions(
    "depth+"     => \$depth,
    "filename=s" => \$inputfile,
) or die "getopts error";
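# Typical invocation (the filename here is hypothetical):
#   ./extract.pl --filename mydata.rrd
# Repeating --depth increments $Data::Dumper::Maxdepth, which only matters if
# debug dumps are added while hacking on the script.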
open $handIn, '-|', '/usr/bin/rrdtool', 'dump', $inputfile
    or die "could not dump $inputfile - no point in continuing\n"; # footnote 4 #
$Data::Dumper::Indent = 1;
$Data::Dumper::Maxdepth = $depth;
my $xml = readline $handIn;
my $ref = XMLin $xml, ForceArray => 1, KeyAttr => [];
warn "finished reading ${inputfile}\n";
$inputfile =~ s{ \.rrd $ }{}xi;
my @dsnames = map {
    @{ $ARG->{ name } }
} @{ $ref->{ ds } };
s{ \s+ }{}gx
    foreach @dsnames;
foreach( @{ $ref->{ rra } } ) {
    foreach( @{ cartesian_product( $ARG->{ database }, $ARG->{ pdp_per_row }, $ARG->{ cf } ) } ) { # footnote 3 #
        my %dslists;
        my $database = shift @$ARG;
        my $rows = $database->{ row };
        my $label = sprintf '%s.steps=%s,cf=%s', $inputfile, @$ARG;
        $label =~ s{ \s+ }{}gx;
        foreach( @$rows ) {
            my @tmplist = @dsnames;
            foreach( @{ $ARG->{ v } } ) {
                my $dsname = shift @tmplist;
                s{ \s+ }{}gx;
                push @{ $dslists{ $dsname } }, $ARG;
            }
        }
        foreach( @dsnames ) {
            my $hand;
            my $outputfile = "${label},ds=${ARG}.rra";
            my $dslistref = $dslists{ $ARG };
            my $fullsize = scalar @$dslistref;
            shift @$dslistref
                while( @$dslistref and $dslistref->[0] eq 'NaN' );
            pop @$dslistref
                while( @$dslistref and $dslistref->[-1] eq 'NaN' );
            my $trimmed = scalar @$dslistref;
            unshift @$dslistref, $trimmed, 1; # footnote 2 #
            push @$dslistref, $fullsize;
            unless( open $hand, ">${outputfile}" ) {
                warn "unable to open $outputfile for writing - skipping\n";
                next;
            }
            print $hand "$ARG\n"
                foreach @$dslistref;
            close $hand
                or warn "unable to close ${outputfile}\n";
            warn "wrote ${outputfile}\n";
        }
    }
}
=pod
Footnote 1:
This function was mostly lifted from a contributor at StackOverflow
(http://stackoverflow.com/a/2457928/2700710), with tweaks by me.
Footnote 2:
The second field represents the sign of the exponent in the transform (and thus
the direction of the transform). This is needed for the DFT, but ignored for
the DHT.
Footnote 3:
This is overkill. Chances are this product will never result in more than one
item. The only reason I'm doing this is because each of those is an array
reference, which could possibly contain more than one element (unlikely). And
the only reason for *that* is because I enabled ForceArray. And you ask: Why
did I do that? Because I wanted to be able to treat the data structure in a
consistent manner. Normally (without ForceArray enabled), XML::Simple will
take a shortcut when there is only one element, and that has to be handled
differently. And not knowing ahead of time which way to traverse the data
structure would actually make this more complicated (and probably would still
require the cartesian product anyway). So yes, this way is ultimately simpler,
believe it or not.
But yes, the code is ugly - I make no excuses for it.
Footnote 4:
Yes, I'm aware there is a Perl binding for RRD, but it's horribly inadequate.
In particular, RRDs::dump provides no means of capturing its output, and
pig-headedly dumps to STDOUT despite my best efforts (why bother providing a
programmatic interface in that case?).
RRDs::fetch was better, but it doesn't provide the guarantee of dumping the
*exact* datapoints unmolested. It takes a time range and resolution as input,
and prints to the output the data set that is the closest fit (defaulting to
the highest resolution between one day ago and now).
Using "dump" extracts the exact data points as they are in the RRD, and frees
us from worrying about specifying a time range (which is error prone and
clumsy).
ToDo:
- bail out if the DS is anything other than GAUGE or ABSOLUTE (ultimately, we
want to handle DERIVE and COUNTER types as well, but not yet);
- weed out NaNs at this stage, and update the row count accordingly (with
maybe an optional field specifying what the row count would have been) - some
approaches to consider:
- only operate on valid contiguous ranges of data, if NaNs can be kept to a
run at the end or beginning (or both) of the data, and put the same number of
NaNs back into the output dataset;
- if there are isolated NaNs in the middle of the range of valid data (or
short runs), sample at a lower rate, perhaps at the heartbeat, or a larger
interval;
=cut