JKQTPlotter trunk/v5.0.0
an extensive Qt5+Qt6 Plotter framework (including a feature-rich plotter widget, a speed-optimized but limited variant, and a LaTeX equation renderer!), written fully in C/C++ and without external dependencies
Tutorial (JKQTPDatastore): Advanced 1-Dimensional Statistics with JKQTPDatastore

This tutorial project (see ./examples/datastore_statistics/) explains several advanced functions of JKQTPDatastore in combination with the [JKQTPlotter Statistics Library] contained in JKQTPlotter.

Note that there are additional tutorials explaining other aspects of data management in JKQTPDatastore:

The source code of the main application can be found in datastore_statistics.cpp. This tutorial cites only parts of this code to demonstrate different ways of working with data for the graphs.

Generating different sets of random numbers

The code segments below will fill four instances of JKQTPlotter with different statistical plots. All these plots are based on three sets of random numbers generated as shown here:

size_t randomdatacol1=datastore1->addColumn("random data 1");
size_t randomdatacol2=datastore1->addColumn("random data 2");
size_t randomdatacol3=datastore1->addColumn("random data 3");
std::random_device rd; // random number generators:
std::mt19937 gen{rd()};
std::uniform_int_distribution<> ddecide(0,1);
std::normal_distribution<> d1{0,1};
std::normal_distribution<> d2{6,1.2};
for (size_t i=0; i<150; i++) {
    double v=0;
    const int decide=ddecide(gen);
    if (decide==0) v=d1(gen);
    else v=d2(gen);
    datastore1->appendToColumn(randomdatacol1, v);
    if (decide==0) datastore1->appendToColumn(randomdatacol2, v);
    else datastore1->appendToColumn(randomdatacol3, v);
}

The column randomdatacol1 will contain 150 random numbers. Each one is drawn either from a normal distribution N(0,1) (d1) or N(6,1.2) (d2). The decision which of the two to use is based on the result of a third random distribution, ddecide, which only returns 0 or 1. The two columns randomdatacol2 and randomdatacol3 only collect the random numbers drawn from d1 or d2, respectively. The three columns are created empty by calling JKQTPDatastore::addColumn() with only a name. Then the actual values are added by calling JKQTPDatastore::appendToColumn().

Basic Statistics

The three sets of random numbers from above can be visualized e.g. by a JKQTPPeakStreamGraph graph with code as follows:

plot1box->addGraph(gData1=new JKQTPPeakStreamGraph(plot1box));
gData1->setDataColumn(randomdatacol1);
gData1->setBaseline(-0.1);
gData1->setPeakHeight(-0.05);
gData1->setDrawBaseline(false);

This (if repeated for all three columns) results in a plot like this:

datastore_statistics_dataonly

Based on the raw data we can now use the JKQTPlotter Statistics Library to calculate some basic properties, like the average (jkqtpstatAverage()) or the standard deviation (jkqtpstatStdDev()):

size_t N=0;
double mean=jkqtpstatAverage(datastore1->begin(randomdatacol1), datastore1->end(randomdatacol1), &N);
double std=jkqtpstatStdDev(datastore1->begin(randomdatacol1), datastore1->end(randomdatacol1));

Both statistics functions (like all statistics functions in the library) use an iterator-based interface, comparable to that of the algorithms in the C++ standard library. To this end, the class JKQTPDatastore provides an iterator interface to its columns through the functions JKQTPDatastore::begin() and JKQTPDatastore::end(). Both functions simply receive the column ID as a parameter and exist in a const and a mutable variant; the latter also allows editing the data. In addition, the function JKQTPDatastore::backInserter() returns a back-inserter iterator (like the one generated for STL containers by std::back_inserter(container)) that allows appending to the column.

Note that the iterator interface allows these functions to be used with any container that provides such iterators (e.g. std::vector<double>, std::list<int>, std::set<float>, QVector<double>...).

The output of these functions is shown in the image above in the plot legend/key.
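To illustrate what an iterator-based interface of this kind looks like, here is a minimal sketch that uses only the C++ standard library. The function sketchAverage() is a hypothetical stand-in written for this example and is not the library's jkqtpstatAverage(); it merely mirrors the calling convention described above (a range first ... last plus an optional output for the number of values used).

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <list>
#include <vector>

// Hypothetical helper (NOT part of JKQTPlotter): averages any range whose
// elements can be converted to double. Returns NaN for an empty range.
template <class InputIt>
double sketchAverage(InputIt first, InputIt last, std::size_t* countOut = nullptr) {
    double sum = 0.0;
    std::size_t n = 0;
    for (; first != last; ++first) {
        sum += static_cast<double>(*first); // works for int, float, double, ...
        ++n;
    }
    if (countOut) *countOut = n;            // optionally report the number of values
    return n > 0 ? sum / static_cast<double>(n)
                 : std::numeric_limits<double>::quiet_NaN();
}
```

Because the function only relies on iterators, the same call works for a std::vector<double> (sketchAverage(v.begin(), v.end(), &N)) and for a std::list<int>.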

Of course, several other functions exist that calculate basic statistics from a column, e.g.:

All these functions use all values in the given range and convert each value to a double using jkqtp_todouble(). The return value is always a double. Therefore you can use these functions to calculate statistics of ranges of any type that can be converted to double. Values that do not result in a valid double are not used in calculating the statistics. Therefore you can exclude values by setting them to JKQTP_DOUBLE_NAN (i.e. "not a number").
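The effect of excluding invalid values can be pictured with the following standard-library sketch. sketchStdDevSkippingNaN() is a hypothetical helper, and dividing by N-1 (the sample standard deviation) is an assumption made for this example; the library's exact estimator may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (NOT part of JKQTPlotter): sample standard deviation
// over the finite values of a range. NaN and infinite entries are skipped.
// Uses Welford's single-pass update for numerical stability.
template <class InputIt>
double sketchStdDevSkippingNaN(InputIt first, InputIt last, std::size_t* usedOut = nullptr) {
    double mean = 0.0;
    double m2 = 0.0;   // running sum of squared deviations from the mean
    std::size_t n = 0; // number of values actually used
    for (; first != last; ++first) {
        const double v = static_cast<double>(*first);
        if (!std::isfinite(v)) continue; // excluded from the statistics
        ++n;
        const double delta = v - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (v - mean);
    }
    if (usedOut) *usedOut = n;
    if (n < 2) return std::numeric_limits<double>::quiet_NaN();
    return std::sqrt(m2 / static_cast<double>(n - 1));
}
```

For the input {1, NaN, 3}, only two values are used, so the result equals the standard deviation of {1, 3}.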

Boxplots

Standard Boxplots

As mentioned above and shown in several other examples, JKQTPlotter supports boxplots with the classes JKQTPBoxplotHorizontalElement, JKQTPBoxplotVerticalElement, as well as JKQTPBoxplotHorizontal and JKQTPBoxplotVertical. You can then use the 5-Number Summary functions from the JKQTPlotter Statistics Library to calculate the data for such a boxplot (e.g. jkqtpstat5NumberStatistics()) and set it up by hand. Code would look roughly like this:

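// five-number summary of the range first ... last, with the quartiles (0.25 and 0.75) as box limits
JKQTPStat5NumberStatistics stat=jkqtpstat5NumberStatistics(first, last, 0.25, 0.75);
// boxplot element that displays the summary
JKQTPBoxplotVerticalElement* res=new JKQTPBoxplotVerticalElement(plotter);
res->setMin(stat.minimum);
res->setMax(stat.maximum);
res->setMedian(stat.median);
res->setMean(jkqtpstatAverage(first, last));
res->setPercentile25(stat.quantile1);
res->setPercentile75(stat.quantile2);
res->setMedianConfidenceIntervalWidth(stat.IQRSignificanceEstimate());
res->setDrawMean(true);
res->setDrawNotch(true);
res->setDrawMedian(true);
res->setDrawMinMax(true);
res->setDrawBox(true);
res->setPos(boxposX);
plotter->addGraph(res);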

In order to save you the work of writing out this code, the JKQTPlotter Statistics Library provides "adaptors", such as jkqtpstatAddVBoxplot(), which basically implements the code above. Then drawing a boxplot is reduced to:

JKQTPBoxplotHorizontalElement* gBox2=jkqtpstatAddHBoxplot(plot1box->getPlotter(), datastore1->begin(randomdatacol2), datastore1->end(randomdatacol2), -0.25);
gBox2->setColor(gData2->getKeyLabelColor());

Here, -0.25 indicates the location (on the y-axis) of the boxplot, and the plot is calculated for the data in the JKQTPDatastore column randomdatacol2.

datastore_statistics_boxplots_simple

Boxplots with Outliers

Usually the boxplot draws its whiskers at the minimum and maximum value of the dataset. But if your data contains a lot of outliers, it may make sense to draw them e.g. at the 3% and 97% quantiles and then draw the outliers as additional data points. This can also be done with jkqtpstat5NumberStatistics(), as you can specify the minimum and maximum quantile (the defaults are 0 and 1, i.e. the true minimum and maximum), and the resulting object contains a vector with the outlier values. You could then add them to the JKQTPDatastore and add a scatter plot that displays them. This task, too, is sped up by an "adaptor". Simply call

std::pair<JKQTPBoxplotHorizontalElement*,JKQTPSingleColumnSymbolsGraph*> gBox1;
gBox1=jkqtpstatAddHBoxplotAndOutliers(plot1box->getPlotter(), datastore1->begin(randomdatacol1), datastore1->end(randomdatacol1), -0.3,
0.25, 0.75, // 1. and 3. Quartile for the boxplot box
0.03, 0.97 // Quantiles for the boxplot box whiskers' ends
);

As you can see, this returns the JKQTPBoxplotHorizontalElement and, in addition, a JKQTPSingleColumnSymbolsGraph for displaying the outliers. The result looks like this:

datastore_statistics_boxplots_outliers
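The idea behind quantile-based whiskers and outliers can be sketched with the standard library alone. Everything named sketch... below is hypothetical and written for this example only. In particular, the quantile is computed by linear interpolation between neighbouring ranks, which is an assumption; the JKQTPlotter Statistics Library may use a different quantile definition.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch (not the library's struct).
struct SketchBoxplotStats {
    double minimum = 0.0;         // lower whisker end (value at minimumQuantile)
    double quantile1 = 0.0;       // lower box edge
    double median = 0.0;
    double quantile2 = 0.0;       // upper box edge
    double maximum = 0.0;         // upper whisker end (value at maximumQuantile)
    std::vector<double> outliers; // values outside [minimum, maximum]
};

// Quantile q in [0,1] of an ascending-sorted, non-empty vector, using linear
// interpolation between neighbouring ranks (one of several common definitions).
inline double sketchQuantileSorted(const std::vector<double>& sorted, double q) {
    assert(!sorted.empty() && q >= 0.0 && q <= 1.0);
    const double pos = q * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    return sorted[lo] + (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

template <class InputIt>
SketchBoxplotStats sketchBoxplotStats(InputIt first, InputIt last,
                                      double quantile1Spec = 0.25, double quantile2Spec = 0.75,
                                      double minimumQuantile = 0.0, double maximumQuantile = 1.0) {
    std::vector<double> sorted;
    for (; first != last; ++first) {
        const double v = static_cast<double>(*first);
        if (std::isfinite(v)) sorted.push_back(v); // skip NaN and infinite values
    }
    assert(!sorted.empty());
    std::sort(sorted.begin(), sorted.end());

    SketchBoxplotStats s;
    s.minimum = sketchQuantileSorted(sorted, minimumQuantile);
    s.quantile1 = sketchQuantileSorted(sorted, quantile1Spec);
    s.median = sketchQuantileSorted(sorted, 0.5);
    s.quantile2 = sketchQuantileSorted(sorted, quantile2Spec);
    s.maximum = sketchQuantileSorted(sorted, maximumQuantile);
    // everything beyond the whisker ends is reported as an outlier
    for (double v : sorted) {
        if (v < s.minimum || v > s.maximum) s.outliers.push_back(v);
    }
    return s;
}
```

With the default quantiles 0 and 1 the whiskers reach the true minimum and maximum, so the outlier vector stays empty; narrowing them (e.g. to 0.03 and 0.97) moves the extreme values into outliers, which could then be drawn as separate symbols.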

Histograms

Calculating 1D-Histograms is supported by several functions from the JKQTPlotter Statistics Library, e.g. jkqtpstatHistogram1DAutoranged(). You can use the result to fill new columns in a JKQTPDatastore, which can then be used to draw the histogram (here with 15 bins, spanning the full data range):

size_t histcolX=plotter->getDatastore()->addColumn(histogramcolumnBaseName+", bins");
size_t histcolY=plotter->getDatastore()->addColumn(histogramcolumnBaseName+", values");
// fill both columns through back-inserter iterators (15 bins, autoranged)
jkqtpstatHistogram1DAutoranged(first, last, plotter->getDatastore()->backInserter(histcolX), plotter->getDatastore()->backInserter(histcolY), 15);
// bar graph that displays the two histogram columns
JKQTPBarVerticalGraph* resO=new JKQTPBarVerticalGraph(plotter);
resO->setXColumn(histcolX);
resO->setYColumn(histcolY);
resO->setTitle(histogramcolumnBaseName);
plotter->addGraph(resO);

Again, there are "adaptors" that significantly reduce the amount of code you have to type:

JKQTPBarVerticalGraph* hist1=jkqtpstatAddHHistogram1DAutoranged(plot1->getPlotter(), datastore1->begin(randomdatacol1), datastore1->end(randomdatacol1), 15);

The resulting plot looks like this (the distributions used to generate the random data are also shown as line plots!):

datastore_statistics_hist
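For readers who want to see what an autoranged histogram computes, here is a minimal standard-library sketch. sketchHistogram1DAutoranged() is a hypothetical function written for this example, not the library routine. It assumes that each bin is reported by its left edge and that "normalized" means dividing each count by the number of samples; the library's conventions may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper (NOT part of JKQTPlotter): histogram whose range is taken
// from the data itself. Writes the left edge of every bin to xOut and the
// (optionally normalized) count of every bin to yOut.
template <class InputIt, class OutputIt>
void sketchHistogram1DAutoranged(InputIt first, InputIt last, OutputIt xOut, OutputIt yOut,
                                 int bins = 11, bool normalized = true) {
    assert(bins > 0);
    std::vector<double> data;
    for (; first != last; ++first) {
        const double v = static_cast<double>(*first);
        if (std::isfinite(v)) data.push_back(v); // ignore NaN and infinite values
    }
    if (data.empty()) return;

    const auto mm = std::minmax_element(data.begin(), data.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    // fall back to a width of 1 if all values are identical
    const double width = (hi > lo) ? (hi - lo) / static_cast<double>(bins) : 1.0;

    std::vector<double> counts(static_cast<std::size_t>(bins), 0.0);
    for (double v : data) {
        std::size_t idx = static_cast<std::size_t>((v - lo) / width);
        if (idx >= counts.size()) idx = counts.size() - 1; // the maximum goes into the last bin
        counts[idx] += 1.0;
    }
    for (std::size_t i = 0; i < counts.size(); ++i) {
        *xOut++ = lo + static_cast<double>(i) * width;
        *yOut++ = normalized ? counts[i] / static_cast<double>(data.size()) : counts[i];
    }
}
```

The two output iterators play the same role as the back-inserters in the listing above: called with std::back_inserter(x) and std::back_inserter(y) for two std::vector<double> objects, the function appends one entry per bin to each.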

Kernel Density Estimates (KDE)

Especially when only few samples from a distribution are available, histograms are not good at representing the underlying data distribution. In such cases, Kernel Density Estimates (KDE) can help, which are basically a smoothed variant of a histogram. The JKQTPlotter Statistics Library supports calculating them via e.g. jkqtpstatKDE1D():

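size_t kdecolX=plotter->getDatastore()->addColumn(KDEcolumnBaseName+", bins");
size_t kdecolY=plotter->getDatastore()->addColumn(KDEcolumnBaseName+", values");
// evaluate the KDE between -5 and 10 in steps of 0.01 and append the results to both columns
jkqtpstatKDE1D(first, last, -5.0,0.01,10.0, plotter->getDatastore()->backInserter(kdecolX), plotter->getDatastore()->backInserter(kdecolY), kernel, kdeBandwidth);
// line graph (without symbols) that displays the KDE
JKQTPXYLineGraph* resO=new JKQTPXYLineGraph(plotter);
resO->setXColumn(kdecolX);
resO->setYColumn(kdecolY);
resO->setTitle(KDEcolumnBaseName);
resO->setDrawLine(true);
resO->setSymbolType(JKQTPNoSymbol);
plotter->addGraph(resO);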

The function accepts different kernel functions (any C++ functor double f(double x)) and provides a set of default kernels, e.g. jkqtpstatKernel1DGaussian() and jkqtpstatKernel1DEpanechnikov().

The three parameters -5.0, 0.01, 10.0 tell the function jkqtpstatKDE1D() to evaluate the KDE at positions between -5 and 10, in steps of 0.01.

Finally, the bandwidth controls the smoothing, and the JKQTPlotter Statistics Library provides a simple function to estimate it automatically from the data:

double kdeBandwidth=jkqtpstatEstimateKDEBandwidth(datastore1->begin(randomdatacol1subset), datastore1->end(randomdatacol1subset));

Again a shortcut "adaptor" simplifies this task:

JKQTPXYLineGraph* kde2=jkqtpstatAddHKDE1D(plot1kde->getPlotter(), datastore1->begin(randomdatacol1subset), datastore1->end(randomdatacol1subset),
    // evaluate at locations between -5 and 10, in steps of 0.01
    -5.0,0.01,10.0,
    // kernel function (here: Epanechnikov)
    &jkqtpstatKernel1DEpanechnikov,
    // bandwidth estimated from the data (see kdeBandwidth above)
    kdeBandwidth);

Plots that result from such calls look like this:

datastore_statistics_kde
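As a rough illustration of what a kernel density estimate computes, the following standard-library sketch evaluates the textbook formula f(x) = 1/(N*h) * sum_i K((x - x_i)/h) on a regular grid. All sketch... names are hypothetical and written for this example. The bandwidth helper uses the common normal-reference rule of thumb h = 1.06 * sigma * N^(-1/5), which is an assumption and not necessarily what jkqtpstatEstimateKDEBandwidth() implements.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Standard normal density, used as the smoothing kernel in this sketch.
inline double sketchGaussianKernel(double t) {
    return std::exp(-0.5 * t * t) / std::sqrt(2.0 * 3.14159265358979323846);
}

// KDE at a single position x: 1/(N*h) * sum_i K((x - x_i)/h).
inline double sketchKDEAt(const std::vector<double>& samples, double x,
                          const std::function<double(double)>& kernel, double bandwidth) {
    assert(!samples.empty() && bandwidth > 0.0);
    double sum = 0.0;
    for (double xi : samples) sum += kernel((x - xi) / bandwidth);
    return sum / (static_cast<double>(samples.size()) * bandwidth);
}

// Evaluates the KDE at xMin, xMin+dx, xMin+2*dx, ... up to (about) xMax and
// appends the positions to xOut and the estimates to yOut.
inline void sketchKDE1D(const std::vector<double>& samples,
                        double xMin, double dx, double xMax,
                        const std::function<double(double)>& kernel, double bandwidth,
                        std::vector<double>& xOut, std::vector<double>& yOut) {
    assert(dx > 0.0 && xMax >= xMin);
    // small epsilon guards against losing the last grid point to rounding
    const std::size_t steps = static_cast<std::size_t>(std::floor((xMax - xMin) / dx + 1e-9)) + 1;
    for (std::size_t i = 0; i < steps; ++i) {
        const double x = xMin + static_cast<double>(i) * dx;
        xOut.push_back(x);
        yOut.push_back(sketchKDEAt(samples, x, kernel, bandwidth));
    }
}

// Normal-reference rule of thumb: h = 1.06 * sigma * N^(-1/5),
// with sigma being the sample standard deviation (divisor N-1).
inline double sketchRuleOfThumbBandwidth(const std::vector<double>& samples) {
    assert(samples.size() >= 2);
    const double n = static_cast<double>(samples.size());
    double mean = 0.0;
    for (double v : samples) mean += v / n;
    double sumSq = 0.0;
    for (double v : samples) sumSq += (v - mean) * (v - mean);
    const double sigma = std::sqrt(sumSq / (n - 1.0));
    return 1.06 * sigma * std::pow(n, -0.2);
}
```

A larger bandwidth spreads each sample's contribution over a wider region and yields a smoother curve; a smaller one follows the individual samples more closely.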

Cumulative Histograms and KDEs

Both histograms and KDEs support a parameter bool cummulative, which accumulates the data after calculation and thus allows drawing cumulative histograms/KDEs:

JKQTPBarVerticalGraph* histcum2=jkqtpstatAddHHistogram1DAutoranged(plot1cum->getPlotter(), datastore1->begin(randomdatacol2), datastore1->end(randomdatacol2),
// bin width
0.1,
// normalized, cummulative
false, true);

datastore_statistics_cumhistkde
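Conceptually, the accumulation step is a running sum over the per-bin values. A minimal standard-library sketch (sketchCumulate() is a hypothetical helper for this example, not a library function):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (NOT part of JKQTPlotter): turns per-bin values into
// running totals, i.e. out[i] = in[0] + in[1] + ... + in[i].
inline std::vector<double> sketchCumulate(const std::vector<double>& binValues) {
    std::vector<double> cumulative(binValues.size());
    std::partial_sum(binValues.begin(), binValues.end(), cumulative.begin());
    return cumulative;
}
```

For bin counts {2, 2, 1, 1} this yields {2, 4, 5, 6}; the last entry equals the total number of samples when the input is not normalized.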

Screenshot of the full Program

The output of the full test program datastore_statistics.cpp looks like this:

datastore_statistics