{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# Reference: https://jupyterbook.org/interactive/hiding.html\n", "# Use {hide, remove}-{input, output, cell} tags to hide content\n", "\n", "import sys\n", "import os\n", "if not any(path.endswith('textbook') for path in sys.path):\n", "    sys.path.append(os.path.abspath('../../..'))\n", "from textbook_utils import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Wrangling and Transforming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We begin by taking a peek at the contents of our data file.\n", "To do this, we open the file and examine the first few rows\n", "({numref}`Chapter %s `):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BCS,Age,Sex,Length,Girth,Height,Weight,WeightAlt\n", "3,<2,stallion,78,90,90,77,NA\n", "2.5,<2,stallion,91,97,94,100,NA\n", "1.5,<2,stallion,74,93,95,74,NA\n", "3,<2,female,87,109,96,116,NA\n" ] } ], "source": [ "from pathlib import Path\n", "\n", "# Create a Path pointing to our data file\n", "insp_path = Path('data/donkeys.csv')\n", "\n", "with insp_path.open() as f:\n", "    # Display first five lines of file\n", "    for _ in range(5):\n", "        print(f.readline(), end='')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the file is CSV-formatted, we can easily read it into a dataframe:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>BCS</th><th>Age</th><th>Sex</th><th>Length</th><th>Girth</th><th>Height</th><th>Weight</th><th>WeightAlt</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>3.0</td><td>&lt;2</td><td>stallion</td><td>78</td><td>90</td><td>90</td><td>77</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>2.5</td><td>&lt;2</td><td>stallion</td><td>91</td><td>97</td><td>94</td><td>100</td><td>NaN</td></tr>\n",
"    <tr><th>2</th><td>1.5</td><td>&lt;2</td><td>stallion</td><td>74</td><td>93</td><td>95</td><td>74</td><td>NaN</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>541</th><td>2.5</td><td>10-15</td><td>stallion</td><td>103</td><td>118</td><td>103</td><td>174</td><td>NaN</td></tr>\n",
"    <tr><th>542</th><td>3.0</td><td>2-5</td><td>stallion</td><td>91</td><td>112</td><td>100</td><td>139</td><td>NaN</td></tr>\n",
"    <tr><th>543</th><td>3.0</td><td>5-10</td><td>stallion</td><td>104</td><td>124</td><td>110</td><td>189</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>544 rows × 8 columns</p>\n",
"</div>" ], "text/plain": [ " BCS Age Sex Length Girth Height Weight WeightAlt\n", "0 3.0 <2 stallion 78 90 90 77 NaN\n", "1 2.5 <2 stallion 91 97 94 100 NaN\n", "2 1.5 <2 stallion 74 93 95 74 NaN\n", ".. ... ... ... ... ... ... ... ...\n", "541 2.5 10-15 stallion 103 118 103 174 NaN\n", "542 3.0 2-5 stallion 91 112 100 139 NaN\n", "543 3.0 5-10 stallion 104 124 110 189 NaN\n", "\n", "[544 rows x 8 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "donkeys = pd.read_csv(\"data/donkeys.csv\")\n", "donkeys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Over 500 donkeys participated in the survey, and eight features were recorded for each donkey, although the second weight measurement is missing for most of them. According to the documentation, the granularity is a single donkey ({numref}`Chapter %s `).\n", "{numref}`Table %s <tbl:donkey-codebook>` provides descriptions of the eight features.\n", "\n", ":::{table} Donkey study codebook\n", ":name: tbl:donkey-codebook\n", "\n", "| Feature | Data type | Feature type | Description |\n", "|----------------|-----------|------------|--------------------------------------------------------|\n", "| BCS | float64 | Ordinal | Body condition score: from 1 (emaciated) to 3 (healthy) to 5 (obese) in increments of 0.5 |\n", "| Age | string | Ordinal | Age in years: under 2, 2–5, 5–10, 10–15, 15–20, and over 20 |\n", "| Sex | string | Nominal | Sex categories: stallion, gelding, female |\n", "| Length | int64 | Numeric | Body length (cm) from front leg elbow to back of pelvis |\n", "| Girth | int64 | Numeric | Body circumference (cm), measured just behind front legs |\n", "| Height | int64 | Numeric | Body height (cm) up to point where neck connects to back |\n", "| Weight | int64 | Numeric | Weight (kg) |\n", "| WeightAlt | float64 | Numeric | Second weight measurement (kg) taken on a subset of donkeys |\n", "\n", ":::\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "{numref}`Figure %s <fig:donkeyDiagram>` is a stylized representation of a donkey as a cylinder with neck and legs appended. Height is measured from the ground to the base of the neck above the shoulders; girth is measured around the body, just behind the front legs; and length is measured from the front elbow to the back of the pelvis.\n", "\n", "```{figure} donkeyDiagram.png\n", "---\n", "name: fig:donkeyDiagram\n", "width: 400px\n", "---\n", "Diagram of a donkey's girth, length, and height, characterized as measurements on a cylinder\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our next step is to perform some quality checks on the data. In the previous section, we listed a few potential quality concerns based on scope. Next, we check the quality of the measurements and their distributions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by comparing the two weight measurements made on the subset of donkeys to check on the consistency of the scale. 
We make a histogram of the difference between these two measurements for the 31 donkeys that were weighed twice: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "alignmentgroup": "True", "bingroup": "x", "hovertemplate": "Differences of two weighings (kg)
on the same donkey=%{x}
count=%{y}", "legendgroup": "", "marker": { "color": "#1F77B4", "pattern": { "shape": "" } }, "name": "", "nbinsx": 15, "offsetgroup": "", "orientation": "v", "showlegend": false, "type": "histogram", "x": [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 1, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0, null, null, null, null, null, null, 0, null, null, null, 0, null, -1, null, null, 0, null, 0, 0, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, -1, -1, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0, 0, null, 0, null, null, null, null, null, null, null, null, null, 0, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0, null, null, null, null, null, 0, null, null, null, null, null, null, 0, null, 0, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 
null, null, null, null, null, null, null, 0, null, null, 0, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, -1, null, null, null, null, null, null, null, null, null, null, -1, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 1, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 1, null, null, 0, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, -1, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 1, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0, 0, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], "xaxis": "x", "yaxis": "y" } ], "layout": { "barmode": "relative", "height": 250, "legend": { "tracegroupgap": 0 }, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", 
"gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, 
"type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { 
"outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autosize": true, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 
0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "height": 250, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "margin": { "b": 10, "l": 10, "r": 10, "t": 10 }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": 
"rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.5, "xanchor": "center" }, "width": 350, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "width": 350, "xaxis": { "anchor": "y", "autorange": true, "domain": [ 0, 1 ], "range": [ -1.1, 1.1 ], "title": { "text": "Differences of two weighings (kg)
on the same donkey" }, "type": "linear" }, "yaxis": { "anchor": "x", "autorange": true, "domain": [ 0, 1 ], "range": [ 0, 22.105263157894736 ], "title": { "text": "count" } } } }, "image/png": "", "image/svg+xml": [ "−1−0.500.5105101520Differences of two weighings (kg)on the same donkeycount" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "donkeys = donkeys.assign(difference=donkeys[\"WeightAlt\"] - donkeys[\"Weight\"])\n", "\n", "px.histogram(donkeys, x=\"difference\", nbins=15,\n", "             labels=dict(\n", "                 difference=\"Differences of two weighings (kg)<br>on the same donkey\"\n", "             ),\n", "             width=350, height=250,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The measurements are all within 1 kg of each other, and the majority are exactly the same (to the nearest kilogram). This agreement gives us confidence in the consistency of the scale.\n", "\n", "Next, we look for unusual values in the body condition score:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BCS\n", "3.0 307\n", "2.5 135\n", "3.5 55\n", " ... \n", "1.5 5\n", "4.5 1\n", "1.0 1\n", "Name: count, Length: 8, dtype: int64" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "donkeys['BCS'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this output, we see that there's only one emaciated (BCS = 1) and one obese (BCS = 4.5) donkey.\n", "Let's look at the complete records for these two donkeys:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>BCS</th><th>Age</th><th>Sex</th><th>Length</th><th>Girth</th><th>Height</th><th>Weight</th><th>WeightAlt</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>291</th><td>4.5</td><td>10-15</td><td>female</td><td>107</td><td>130</td><td>106</td><td>227</td><td>NaN</td></tr>\n",
"    <tr><th>445</th><td>1.0</td><td>&gt;20</td><td>female</td><td>97</td><td>109</td><td>102</td><td>115</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " BCS Age Sex Length Girth Height Weight WeightAlt\n", "291 4.5 10-15 female 107 130 106 227 NaN\n", "445 1.0 >20 female 97 109 102 115 NaN" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "donkeys[(donkeys['BCS'] == 1.0) | (donkeys['BCS'] == 4.5)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since these BCS values are extreme, we want to be cautious about including these two donkeys in our analysis. With only one donkey in each of these extreme categories, our model might well not extend to donkeys with a BCS of 1 or 4.5. So we remove these two records from the dataframe and note that our analysis may not extend to emaciated or obese donkeys. In general, we exercise caution in dropping records from a dataframe: we need good reasons for excluding records, we should document our actions, and we should keep the number of excluded observations low so that we avoid overfitting a model by dropping any record that disagrees with it. Later, we may also decide to remove the five donkeys with a score of 1.5 if they appear anomalous in our analysis, but for now, we keep them in our dataframe.\n", "\n", "We remove these two outliers next:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def remove_bcs_outliers(donkeys):\n", "    # Keep donkeys with a BCS between 1.5 and 4, dropping the lone 1.0 and 4.5 records\n", "    return donkeys[(donkeys['BCS'] >= 1.5) & (donkeys['BCS'] <= 4)]\n", "\n", "donkeys = (pd.read_csv('data/donkeys.csv')\n", "           .pipe(remove_bcs_outliers))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we examine the distribution of values for weight to see if there are any issues with quality:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "alignmentgroup": "True", "bingroup": "x", "hovertemplate": "Weight (kg)=%{x}
count=%{y}", "legendgroup": "", "marker": { "color": "#1F77B4", "pattern": { "shape": "" } }, "name": "", "nbinsx": 40, "offsetgroup": "", "orientation": "v", "showlegend": false, "type": "histogram", "x": [ 77, 100, 74, 116, 91, 105, 108, 86, 27, 141, 100, 95, 115, 106, 112, 117, 94, 107, 102, 127, 90, 108, 72, 102, 86, 118, 65, 90, 118, 113, 117, 124, 89, 130, 150, 87, 96, 117, 94, 105, 94, 107, 117, 106, 75, 105, 114, 115, 107, 98, 71, 142, 146, 106, 162, 164, 133, 152, 129, 110, 181, 160, 172, 164, 192, 163, 169, 144, 165, 160, 145, 110, 154, 133, 152, 145, 130, 183, 173, 161, 172, 151, 170, 159, 142, 136, 185, 143, 107, 173, 161, 176, 169, 184, 166, 152, 98, 131, 136, 109, 165, 138, 137, 177, 142, 160, 125, 158, 178, 157, 164, 162, 155, 166, 170, 141, 142, 165, 146, 163, 170, 159, 154, 165, 167, 157, 114, 181, 170, 168, 146, 163, 190, 138, 162, 171, 167, 139, 188, 152, 174, 125, 179, 167, 149, 144, 158, 144, 145, 140, 168, 152, 161, 155, 179, 143, 150, 170, 136, 167, 169, 141, 154, 131, 155, 157, 165, 92, 171, 156, 182, 157, 194, 174, 160, 178, 152, 157, 170, 142, 156, 167, 151, 158, 149, 146, 139, 133, 144, 178, 159, 114, 142, 175, 145, 150, 155, 140, 140, 158, 160, 160, 188, 172, 145, 154, 158, 158, 165, 149, 128, 167, 126, 158, 168, 134, 186, 148, 162, 134, 136, 168, 174, 123, 196, 146, 149, 148, 170, 155, 165, 166, 165, 164, 178, 170, 142, 173, 127, 204, 155, 162, 137, 171, 212, 165, 157, 148, 178, 151, 150, 190, 178, 188, 158, 195, 163, 149, 202, 116, 165, 145, 147, 144, 121, 200, 179, 138, 136, 143, 140, 149, 166, 173, 146, 143, 114, 145, 179, 174, 162, 158, 199, 210, 171, 130, 125, 156, 177, 175, 144, 227, 154, 166, 180, 130, 155, 162, 171, 145, 176, 138, 149, 135, 156, 157, 166, 160, 126, 177, 156, 145, 136, 152, 154, 126, 139, 160, 134, 120, 177, 130, 163, 159, 153, 130, 162, 168, 164, 167, 150, 134, 154, 169, 143, 167, 166, 122, 178, 170, 138, 165, 160, 155, 120, 149, 125, 126, 122, 137, 139, 178, 145, 146, 173, 172, 144, 158, 135, 130, 174, 126, 
129, 127, 161, 168, 183, 122, 184, 160, 148, 146, 160, 137, 184, 141, 142, 147, 146, 132, 142, 114, 158, 148, 143, 152, 113, 133, 159, 183, 171, 136, 180, 167, 132, 184, 150, 147, 135, 180, 166, 160, 175, 114, 166, 181, 129, 154, 174, 165, 141, 156, 142, 96, 156, 178, 141, 175, 183, 151, 161, 183, 163, 161, 162, 173, 132, 143, 122, 174, 115, 144, 177, 175, 164, 204, 133, 163, 133, 160, 140, 151, 132, 143, 158, 115, 147, 146, 158, 152, 165, 152, 161, 144, 139, 174, 164, 153, 153, 142, 150, 180, 173, 126, 154, 130, 171, 172, 168, 196, 132, 192, 159, 173, 185, 181, 140, 185, 119, 153, 142, 173, 170, 177, 194, 143, 145, 142, 159, 184, 156, 179, 157, 179, 163, 177, 143, 177, 174, 143, 193, 183, 183, 181, 170, 189, 193, 191, 195, 163, 153, 183, 156, 185, 170, 170, 214, 179, 173, 147, 171, 189, 214, 230, 145, 162, 169, 178, 177, 151, 172, 180, 187, 132, 167, 152, 165, 213, 189, 145, 183, 174, 139, 189 ], "xaxis": "x", "yaxis": "y" } ], "layout": { "barmode": "relative", "height": 250, "legend": { "tracegroupgap": 0 }, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", 
"ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], 
"histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": 
"outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autosize": true, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 
0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "height": 250, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "margin": { "b": 10, "l": 10, "r": 10, "t": 10 }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": 
false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.5, "xanchor": "center" }, "width": 350, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "width": 350, "xaxis": { "anchor": "y", "autorange": true, "domain": [ 0, 1 ], "range": [ 19.5, 239.5 ], "title": { "text": "Weight (kg)" }, "type": "linear" }, "yaxis": { "anchor": "x", "autorange": true, "domain": [ 0, 1 ], "range": [ 0, 95.78947368421052 ], "title": { "text": "count" } } } }, "image/png": "", "image/svg+xml": [ "50100150200020406080Weight (kg)count" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "px.histogram(donkeys, x='Weight', nbins=40, width=350, height=250,\n", " labels={'Weight':'Weight (kg)'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It appears there is one very light donkey weighing less than 30 kg. Next, we check the relationship between weight and height to assess the quality of the data for analysis:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hovertemplate": "Height (cm)=%{x}
Weight (kg)=%{y}", "legendgroup": "", "marker": { "color": "#1F77B4", "symbol": "circle" }, "mode": "markers", "name": "", "orientation": "v", "showlegend": false, "type": "scatter", "x": [ 90, 94, 95, 96, 91, 98, 96, 89, 71, 99, 92, 92, 96, 96, 97, 98, 90, 96, 94, 98, 93, 96, 86, 96, 92, 106, 91, 96, 97, 98, 98, 99, 94, 101, 106, 97, 98, 100, 93, 93, 92, 100, 97, 96, 89, 96, 98, 91, 99, 96, 89, 101, 104, 97, 100, 101, 99, 97, 98, 93, 101, 104, 101, 97, 106, 99, 102, 100, 97, 100, 98, 95, 98, 104, 102, 105, 98, 108, 107, 106, 101, 101, 102, 99, 98, 104, 107, 99, 94, 97, 107, 103, 100, 98, 101, 101, 97, 96, 97, 100, 100, 98, 99, 102, 99, 99, 102, 101, 104, 103, 103, 104, 107, 102, 103, 103, 102, 102, 98, 102, 97, 103, 103, 103, 107, 101, 97, 98, 102, 104, 100, 102, 102, 104, 106, 102, 98, 97, 105, 106, 106, 99, 105, 102, 101, 95, 104, 97, 101, 102, 105, 102, 103, 100, 97, 98, 100, 102, 105, 105, 104, 102, 100, 97, 106, 102, 103, 98, 101, 104, 104, 101, 101, 107, 99, 99, 101, 99, 98, 99, 104, 104, 102, 105, 99, 102, 102, 101, 99, 98, 101, 95, 96, 102, 98, 98, 100, 98, 98, 106, 105, 100, 103, 101, 102, 105, 99, 103, 101, 99, 97, 103, 105, 101, 104, 100, 106, 99, 103, 94, 101, 103, 108, 96, 101, 103, 100, 98, 103, 101, 100, 102, 102, 101, 105, 102, 99, 100, 94, 111, 102, 103, 105, 103, 113, 103, 98, 101, 103, 102, 99, 103, 105, 103, 100, 107, 99, 99, 105, 101, 100, 105, 99, 99, 96, 103, 102, 100, 98, 104, 97, 98, 99, 99, 106, 100, 97, 101, 104, 102, 100, 101, 110, 110, 101, 102, 89, 102, 104, 104, 99, 106, 103, 99, 108, 97, 103, 99, 103, 101, 104, 101, 100, 97, 106, 102, 106, 105, 97, 101, 101, 102, 101, 99, 103, 98, 103, 100, 100, 100, 102, 103, 101, 103, 108, 102, 105, 112, 105, 103, 99, 101, 105, 107, 105, 107, 102, 94, 102, 108, 102, 107, 106, 107, 102, 107, 103, 102, 99, 103, 100, 98, 101, 101, 105, 107, 104, 102, 100, 99, 109, 101, 96, 100, 99, 104, 107, 100, 105, 104, 99, 102, 106, 103, 110, 104, 100, 99, 104, 102, 96, 89, 103, 100, 103, 99, 102, 98, 105, 101, 
106, 101, 101, 100, 101, 106, 100, 100, 104, 103, 105, 97, 106, 98, 107, 103, 99, 102, 100, 106, 99, 103, 105, 90, 101, 104, 98, 106, 110, 102, 101, 104, 102, 98, 103, 106, 102, 100, 100, 105, 99, 102, 102, 100, 102, 107, 98, 101, 94, 112, 100, 100, 98, 98, 105, 102, 98, 103, 103, 101, 103, 104, 108, 100, 101, 103, 107, 103, 104, 100, 99, 103, 105, 97, 103, 99, 104, 101, 102, 109, 101, 105, 108, 106, 105, 103, 97, 104, 97, 103, 101, 106, 105, 102, 106, 103, 99, 102, 99, 106, 103, 102, 103, 104, 104, 104, 101, 105, 104, 100, 108, 102, 105, 100, 103, 105, 108, 102, 105, 101, 108, 102, 102, 106, 103, 106, 110, 107, 103, 101, 106, 109, 108, 116, 103, 104, 107, 103, 105, 106, 109, 105, 105, 103, 107, 100, 103, 108, 102, 101, 110, 103, 100, 110 ], "xaxis": "x", "y": [ 77, 100, 74, 116, 91, 105, 108, 86, 27, 141, 100, 95, 115, 106, 112, 117, 94, 107, 102, 127, 90, 108, 72, 102, 86, 118, 65, 90, 118, 113, 117, 124, 89, 130, 150, 87, 96, 117, 94, 105, 94, 107, 117, 106, 75, 105, 114, 115, 107, 98, 71, 142, 146, 106, 162, 164, 133, 152, 129, 110, 181, 160, 172, 164, 192, 163, 169, 144, 165, 160, 145, 110, 154, 133, 152, 145, 130, 183, 173, 161, 172, 151, 170, 159, 142, 136, 185, 143, 107, 173, 161, 176, 169, 184, 166, 152, 98, 131, 136, 109, 165, 138, 137, 177, 142, 160, 125, 158, 178, 157, 164, 162, 155, 166, 170, 141, 142, 165, 146, 163, 170, 159, 154, 165, 167, 157, 114, 181, 170, 168, 146, 163, 190, 138, 162, 171, 167, 139, 188, 152, 174, 125, 179, 167, 149, 144, 158, 144, 145, 140, 168, 152, 161, 155, 179, 143, 150, 170, 136, 167, 169, 141, 154, 131, 155, 157, 165, 92, 171, 156, 182, 157, 194, 174, 160, 178, 152, 157, 170, 142, 156, 167, 151, 158, 149, 146, 139, 133, 144, 178, 159, 114, 142, 175, 145, 150, 155, 140, 140, 158, 160, 160, 188, 172, 145, 154, 158, 158, 165, 149, 128, 167, 126, 158, 168, 134, 186, 148, 162, 134, 136, 168, 174, 123, 196, 146, 149, 148, 170, 155, 165, 166, 165, 164, 178, 170, 142, 173, 127, 204, 155, 162, 137, 171, 212, 165, 157, 148, 178, 
151, 150, 190, 178, 188, 158, 195, 163, 149, 202, 116, 165, 145, 147, 144, 121, 200, 179, 138, 136, 143, 140, 149, 166, 173, 146, 143, 114, 145, 179, 174, 162, 158, 199, 210, 171, 130, 125, 156, 177, 175, 144, 227, 154, 166, 180, 130, 155, 162, 171, 145, 176, 138, 149, 135, 156, 157, 166, 160, 126, 177, 156, 145, 136, 152, 154, 126, 139, 160, 134, 120, 177, 130, 163, 159, 153, 130, 162, 168, 164, 167, 150, 134, 154, 169, 143, 167, 166, 122, 178, 170, 138, 165, 160, 155, 120, 149, 125, 126, 122, 137, 139, 178, 145, 146, 173, 172, 144, 158, 135, 130, 174, 126, 129, 127, 161, 168, 183, 122, 184, 160, 148, 146, 160, 137, 184, 141, 142, 147, 146, 132, 142, 114, 158, 148, 143, 152, 113, 133, 159, 183, 171, 136, 180, 167, 132, 184, 150, 147, 135, 180, 166, 160, 175, 114, 166, 181, 129, 154, 174, 165, 141, 156, 142, 96, 156, 178, 141, 175, 183, 151, 161, 183, 163, 161, 162, 173, 132, 143, 122, 174, 115, 144, 177, 175, 164, 204, 133, 163, 133, 160, 140, 151, 132, 143, 158, 115, 147, 146, 158, 152, 165, 152, 161, 144, 139, 174, 164, 153, 153, 142, 150, 180, 173, 126, 154, 130, 171, 172, 168, 196, 132, 192, 159, 173, 185, 181, 140, 185, 119, 153, 142, 173, 170, 177, 194, 143, 145, 142, 159, 184, 156, 179, 157, 179, 163, 177, 143, 177, 174, 143, 193, 183, 183, 181, 170, 189, 193, 191, 195, 163, 153, 183, 156, 185, 170, 170, 214, 179, 173, 147, 171, 189, 214, 230, 145, 162, 169, 178, 177, 151, 172, 180, 187, 132, 167, 152, 165, 213, 189, 145, 183, 174, 139, 189 ], "yaxis": "y" } ], "layout": { "height": 250, "legend": { "tracegroupgap": 0 }, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { 
"aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { 
"line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], 
"scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autosize": true, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 
0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "height": 250, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "margin": { "b": 10, "l": 10, "r": 10, "t": 10 }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { 
"backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.5, "xanchor": "center" }, "width": 350, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "width": 350, "xaxis": { "anchor": "y", "autorange": true, "domain": [ 0, 1 ], "range": [ 67.72166874221669, 119.27833125778331 ], "title": { "text": "Height (cm)" }, "type": "linear" }, "yaxis": { "anchor": "x", "autorange": true, "domain": [ 0, 1 ], "range": [ 10.279279279279276, 246.72072072072072 ], "title": { "text": "Weight (kg)" }, "type": "linear" } } }, "image/png": "", "image/svg+xml": [ "8010050100150200Height (cm)Weight (kg)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "px.scatter(donkeys, x='Height', y='Weight', width=350, height=250,\n", " labels={'Weight':'Weight (kg)', 'Height':'Height (cm)'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 
small donkey is far from the main concentration of donkeys and would overly influence our models. For this reason, we exclude it. Again, we keep in mind that we may also want to exclude the one or two heavy donkeys if they appear to overly influence our future model fitting: " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(541, 8)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def remove_weight_outliers(donkeys):\n", " return donkeys[(donkeys['Weight'] >= 40)]\n", "\n", "donkeys = (pd.read_csv('data/donkeys.csv')\n", " .pipe(remove_bcs_outliers)\n", " .pipe(remove_weight_outliers))\n", "\n", "donkeys.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In summary, based on our cleaning and quality checks, we removed three anomalous observations from the dataframe. Now we're nearly ready to begin our exploratory analysis.\n", "Before we proceed, we set aside some of our data as a test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We talked about why it's important to separate out a test set from the train set in {numref}`Chapter %s `.\n", "A best practice is to separate out a test set early in the analysis, before we explore the data in detail, because in EDA, we begin to make decisions about what kinds of models to fit and what variables to use in the model. It's important that our test set isn't involved in these decisions so that it imitates how our model would perform with entirely new data.\n", "\n", "We divide our data into an 80/20 split, where we use 80\\% of the data to\n", "explore and build a model. Then we evaluate the model with the 20\\% that has\n", "been set aside. We use a simple random sample to split the dataframe into the\n", "test and train sets. 
To begin, we randomly shuffle the indices of the dataframe:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "n = len(donkeys)\n", "indices = np.arange(n)\n", "np.random.shuffle(indices)\n", "n_train = int(np.round(0.8 * n))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we assign the rows at the first 80\\% of the shuffled indices to the train set and the remaining 20\\% to the test set:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "train_set = donkeys.iloc[indices[:n_train]]\n", "test_set = donkeys.iloc[indices[n_train:]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're ready to explore the training data and look for useful relationships and distributions that inform our modeling. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 4 }