Data visualization with Plotly
Introduction (what we'll create):
Unlike the previous tutorials in this map-based visualization series, we will be dealing with a very large dataset in this tutorial (about 2GB of lat, lon coordinates). We will learn how to use the Datashader library to convert this data into a pixel-density raster, which can be superimposed on a Mapbox base-map to create cool visualizations. The image below shows what you will create by the end of this tutorial.
Structure of the tutorial:
The tutorial is structured into the following sections:
- Pre-requisites
- About Datashader
- Getting started with the tutorial
- When to use this library
Pre-requisites:
This tutorial assumes that you are familiar with Python and have it installed on your machine. If you are not familiar with Python but have programmed in other languages, you may still be able to follow along, depending on your proficiency.
It is very strongly recommended that you go through the Plotly tutorial before going through this tutorial. In this tutorial, the installation of plotly and the concepts covered in the Plotly tutorial will not be repeated.
Also, you are strongly encouraged to go through the ‘About Mapbox’ section in the [Plotly + Mapbox] Interactive Choropleth visualization tutorial. We will not repeat that section here, but it is very much a part of this tutorial.
About Datashader:
Quoting the official Datashader website,
Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly
In layman's terms, Datashader converts millions of lat-lon coordinates into a pixel-density map. Say you have a million lat-lon coordinates bounded by latitudes [x,y] and longitudes [a,b]. You create a 100×100 pixel image whose corners correspond to the extreme lat-lon pairs, giving a total of 10,000 pixels. Each pixel corresponds to a physical tile of, say, 100 sq. km (the actual area depends on the values of x, y, a, b). If tile1 contains 100 coordinates and tile2 contains 1000, tile2 has a coordinate density 10 times higher than tile1, so with a linear mapping the pixel for tile2 is 10 times brighter than the pixel for tile1. In effect, a million lat-lon coordinates are reduced to 10,000 pixel-density values: the coordinates have been converted into a raster image. This is what makes Datashader so powerful.
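The binning idea described above can be sketched in a few lines of plain Python. This is only a conceptual illustration, not how Datashader is implemented: the function name bin_points, the 4×4 grid and the sample points are all made up for the example, and the bounds are the India bounding box used later in this tutorial.

```python
# Conceptual sketch only: count (lat, lon) points per pixel of a small grid.
# bin_points and the sample data are hypothetical, not part of Datashader.

def bin_points(points, lat_range, lon_range, height, width):
    """Count how many (lat, lon) points fall into each pixel of a height x width grid."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    grid = [[0] * width for _ in range(height)]
    for lat, lon in points:
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
            continue  # ignore points outside the bounds
        # Scale to a row/column index; clamp so the upper edge lands in the last pixel
        row = min(int((lat - lat_min) / (lat_max - lat_min) * height), height - 1)
        col = min(int((lon - lon_min) / (lon_max - lon_min) * width), width - 1)
        grid[row][col] += 1
    return grid


points = [(10.5, 70.5), (10.6, 70.4), (10.7, 70.9), (30.0, 90.0), (50.0, 70.0)]
grid = bin_points(points, lat_range=(6, 38), lon_range=(68, 98), height=4, width=4)
for row in grid:
    print(row)
```

Three of the sample points share one pixel, one lands in another pixel, and the last point is outside the bounds and is dropped. Each count is the "density" that later gets mapped to a colour.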
Installing datashader:
If you are using Anaconda:
conda install datashader
Otherwise, you can use the pip installer:
pip install datashader
See the Getting Started guide on the datashader website for more information.
Getting started with the tutorial:
GitHub repo: https://github.com/carnot-technologies/MapVisualizations
Relevant notebook: DatashaderDemo.ipynb
View notebook on NBViewer: Click Here
Import relevant packages:
import dask.dataframe as dd
import datashader as ds
import plotly.express as px
Note that we import dask.dataframe instead of pandas. Because we are dealing with a large dataset, dask will be much faster than pandas. For perspective, the .read_csv() operation takes 19 seconds with pandas and less than a second with dask. Part of that gap comes from dask being lazy: read_csv only sets up the work, and the bulk of the file is read when a result is actually computed. Click here to know more about why dask is preferred for large datasets. The gist is that dask splits the data into partitions and can process them in parallel across the cores on your machine, which pandas does not do on its own.
Import and clean data:
Since the relevant CSV for this tutorial is about 2 GB large (74 million + coordinates), it was not possible to host this on GitHub. It can be downloaded from this Google Drive link. It is recommended that you download this file and save it into your data folder. Once that is done, you can simply import it like any other CSV.
Note: Make sure that you don’t have any other heavy software open when you are loading this dataset, especially if your RAM is comparable to the file size.
df = dd.read_csv('data/lat_lon_data.csv')
Now, we will perform some basic cleaning of the data. Since our region of interest is India, we will make sure that all coordinates outside the lat-lon bounds of India are excluded.
# Remove any unwanted columns
df = df[['latitude', 'longitude']]

# Clean data, remove any out of bounds points
df = df[df['latitude'] > 6]
df = df[df['latitude'] < 38]
df = df[df['longitude'] > 68]
df = df[df['longitude'] < 98]
Creating the datashader canvas:
cvs = ds.Canvas(plot_width=1000, plot_height=1000)
agg = cvs.points(df, x='longitude', y='latitude')
# agg is an xarray object, see http://xarray.pydata.org/en/stable/
coords_lat, coords_lon = agg.coords['latitude'].values, agg.coords['longitude'].values

# Corners of the image, which need to be passed to mapbox
coordinates = [[coords_lon[0], coords_lat[0]],
               [coords_lon[-1], coords_lat[0]],
               [coords_lon[-1], coords_lat[-1]],
               [coords_lon[0], coords_lat[-1]]]
We have created a 1000 x 1000 canvas, cvs. Next, we projected the longitude and latitude from the dataframe onto the canvas using cvs.points. Then we fetched the projected coordinates and determined the corner points of the image.
Now that we have the canvas ready, let us define the colormap for the visualization. We will use the hot colormap. You can use alternatives such as fire, or any other colormap of your choice.
from matplotlib.cm import hot
import datashader.transfer_functions as tf
# Shade the aggregate and convert it to a PIL (Python Imaging Library) image
img = tf.shade(agg, cmap=hot, how='log')[::-1].to_pil()
A couple of things to note here. We are using a transfer function to shade the projected coordinates with the hot colormap. We have specified the mapping methodology as log. This ensures that even the low-intensity points are represented adequately in the visualization. If we chose the linear mapping, the high-intensity points would completely overshadow the low-intensity points.
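To see why the log mapping helps, here is a small plain-Python sketch that scales a handful of pixel counts to the 0..1 range both ways. It is a simplification for illustration only and is not Datashader's actual shading code; the normalize helper and the sample counts are assumptions made up for this example.

```python
import math

# Illustrative only: normalize() is a hypothetical helper, not a Datashader function.

def normalize(counts, how="linear"):
    """Scale pixel counts to the 0..1 range, either linearly or on a log scale."""
    transform = math.log1p if how == "log" else float
    top = transform(max(counts))
    return [transform(c) / top for c in counts]


# A few sparse pixels and one extremely dense hotspot
counts = [1, 10, 100, 1_000_000]
linear = normalize(counts, "linear")
logged = normalize(counts, "log")
for c, lin, lg in zip(counts, linear, logged):
    print(f"count={c:>9}  linear={lin:.6f}  log={lg:.3f}")
```

With the linear scaling, every pixel except the hotspot ends up very close to 0, so it would be drawn almost black. With the log scaling, the sparse pixels are spread over a visible part of the range.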
Another mapping option is eq_hist, which produces a result similar to the log transformation. You can read more about it here. A comparison of the outputs of the 3 transformations is shown below.
As you can see, almost nothing is visible with the linear transformation. This is because a couple of pixels with extremely high intensity have overshadowed all others. You will need to zoom-in to identify those hotspots.
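The idea behind eq_hist (histogram equalization) can be sketched in the same spirit. The version below is a rank-based simplification, assuming each pixel is simply mapped to the fraction of pixels whose count is less than or equal to its own; Datashader's real implementation is histogram-based and more involved, and the equalize helper is a made-up name for this example.

```python
from bisect import bisect_right

# Illustrative only: equalize() is a hypothetical helper, not a Datashader function.

def equalize(counts):
    """Map each count to the fraction of pixels whose count is <= that count."""
    ordered = sorted(counts)
    n = len(ordered)
    return [bisect_right(ordered, c) / n for c in counts]


counts = [1, 10, 100, 1_000_000]
equalized = equalize(counts)
print(equalized)
```

Here the four counts end up evenly spaced across the range no matter how extreme the hotspot is, which is why a single very dense pixel no longer drowns out the rest.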
Similar to the transformation, different color map options are also available. To get the list of all color maps, click here. Below, the examples with a few different color maps are shown.
Creating the visualization:
fig = px.scatter_mapbox(df.tail(1),
                        lat='latitude',
                        lon='longitude',
                        zoom=4, width=1000, height=1000)

# Add the datashader image as a mapbox layer image
fig.update_layout(mapbox_style="carto-darkmatter",
                  mapbox_layers=[
                      {
                          "sourcetype": "image",
                          "source": img,
                          "coordinates": coordinates
                      }]
)
fig.show()
Here, we are plotting just one point from the dataframe (the last one), so that plotly can create the scatter visualization. We are using the carto-darkmatter style from Mapbox and overlaying the image output of datashader as a layer on top of the visualization. Congratulations!! Your visualization is ready!
When to use this library:
The answer is perhaps the simplest for this library. Use this when you have a very large data set. If you find this visualization aesthetically appealing as I do, then you can use this for smaller datasets as well, but the results will depend on the density distribution of your data. You won’t get high interactivity, because datashader essentially overlays an image on the Mapbox base-map. But you can still zoom and pan the visualization.
We are trying to fix some broken benches in the Indian agriculture ecosystem through technology, to improve farmers' income. If you share the same passion, join us in the pursuit, or simply drop us a line at report@carnot.co.in
Translated from: https://medium.com/tech-carnot/plotly-datashader-visualizing-large-geospatial-datasets-bea27b9d7824