15. 空间索引

回想一下,空间索引是空间数据库的三个关键特征之一。空间索引,使得大型数据集在空间数据库中的应用成为可能。如果没有索引,任何对空间要素的搜索都需要对数据库中的每条记录进行“顺序遍历”。空间索引就是把空间数据组织成树结构,树可以加速搜索过程以便找到特定的记录。

空间索引是 PostGIS 最大的本领之一。在之前的例子中,构建空间连接需要把每个空间数据表进行比较,这样的性能消耗十分巨大,为了连接两个 10000 条记录的数据表,要进行 100000000 次比较;而使用空间索引的话,最低可以低到 20000 次。

现有的数据已经有空间索引了,为了展示空间索引的表现,先删除空间索引以便比较。

现在我们对 nyc_census_blocks 数据表跑一个 没有 空间索引的查询。

第一步,先 移除 空间索引。

DROP INDEX nyc_census_blocks_geom_idx;

注解

DROP INDEX 表达式移除数据库中现有的索引。更多信息参考 PostgreSQL 文档

执行命令后,观察 pgAdmin 查询窗口右下角的 “Timing” 面板。这个查询搜索的是每个名称以 “B” 开头的地铁站所在地区。

SELECT count(blocks.blkid)
 FROM nyc_census_blocks blocks
 JOIN nyc_subway_stations subways
 ON ST_Contains(blocks.geom, subways.geom)
 WHERE subways.name LIKE 'B%';
    count
---------------
    46

nyc_census_blocks 数据表很小(大概千把个记录),所以它甚至都不需要索引,这个查询在我的电脑上只用了 300 毫秒

现在重新添加空间索引并再次运行查询。

CREATE INDEX nyc_census_blocks_geom_idx
  ON nyc_census_blocks
  USING GIST (geom);

注解

USING GIST 子句告诉 PostgreSQL 在创建索引时,使用通用的索引结构(GIST)。若你在创建索引时,报类似于 ERROR: index row requires 11340 bytes, maximum size is 8191 的错误,你可能就是少写了 USING GIST 子句。

在我的测试机上,查询时间降到了 50 毫秒。你的数据表越大,查询性能提升越明显。

15.1. 空间索引是如何起作用的

标准的数据库索引,是根据被索引的列的值去创建树结构的。空间索引略不同,因为数据库并不能索引几何字段的值 —— 也就是几何对象本身,我们改索引要素的范围边界框。

_images/bbox.png

上图中,和黄色星星相交的线的数量是 1,即红色那条线。但是与黄色框相交的范围框有红色和蓝色,共 2 个。

数据库求解 “什么线与黄色星相交” 这个问题,是先用空间索引求解 “什么范围框与黄色范围框相交” 这个问题的(速度非常快),然后才是 “什么线与黄色的星星相交”。上述过程仅对于第一次测试的空间要素而言。

对于数量庞大的数据表,这种索引先行,然后局部精确计算的 “两遍法” 可以在根本上减少查询计算量。

PostGIS 和 Oracle spatial 都用了 “R-Tree” 1 空间索引结构。R-Tree 把数据描述成一簇簇的矩形,它是一种自调整的索引结构,可以自动处理数据的数量、密度和大小等。

_images/index-01.png

15.2. 空间索引函数

不是所有函数都会使用空间索引的,如果存在空间索引,那么支持使用空间索引的函数会自动使用它。

前四个是查询中最常用的,ST_DWithin 对于 “一定距离内”、“一定半径内” 的查询是非常重要的,能获得指数级别的查询性能。

为了把空间索引加速功能添加到不在上述列表中的函数中(譬如,ST_Relate),你可以想下文的例子一样添加一个 index-only 子句。

15.3. 仅索引型查询

Most of the commonly used functions in PostGIS (ST_Contains, ST_Intersects, ST_DWithin, etc) include an index filter automatically. But some functions (e.g., ST_Relate) do not include an index filter.

To do a bounding-box search using the index (and no filtering), make use of the && operator. For geometries, the && operator means "bounding boxes overlap or touch" in the same way that for numbers the = operator means "values are the same".

Let's compare an index-only query for the population of the 'West Village' to a more exact query. Using && our index-only query looks like the following:

SELECT Sum(popn_total)
FROM nyc_neighborhoods neighborhoods
JOIN nyc_census_blocks blocks
ON neighborhoods.geom && blocks.geom
WHERE neighborhoods.name = 'West Village';
49821

Now let's do the same query using the more exact ST_Intersects function.

SELECT Sum(popn_total)
FROM nyc_neighborhoods neighborhoods
JOIN nyc_census_blocks blocks
ON ST_Intersects(neighborhoods.geom, blocks.geom)
WHERE neighborhoods.name = 'West Village';
26718

A much lower answer! The first query summed up every block whose bounding box intersects the neighborhood's bounding box; the second query only summed up those blocks that intersect the neighborhood itself.

15.4. 分析

The PostgreSQL query planner intelligently chooses when to use or not to use indexes to evaluate a query. Counter-intuitively, it is not always faster to do an index search: if the search is going to return every record in the table, traversing the index tree to get each record will actually be slower than just sequentially reading the whole table from the start.

Knowing the size of the query rectangle is not enough to pin down whether a query will return a large number or small number of records. Below, the red square is small, but will return many more records than the blue square.

_images/index-02.png

In order to figure out what situation it is dealing with (reading a small part of the table versus reading a large portion of the table), PostgreSQL keeps statistics about the distribution of data in each indexed table column. By default, PostgreSQL gathers statistics on a regular basis. However, if you dramatically change the contents of your table within a short period of time, the statistics will not be up-to-date.

To ensure the statistics match your table contents, it is wise the to run the ANALYZE command after bulk data loads and deletes in your tables. This force the statistics system to gather data for all your indexed columns.

The ANALYZE command asks PostgreSQL to traverse the table and update its internal statistics used for query plan estimation (query plan analysis will be discussed later).

ANALYZE nyc_census_blocks;

15.5. 垃圾回收

It's worth stressing that just creating an index is not enough to allow PostgreSQL to use it effectively. VACUUMing must be performed whenever a large number of UPDATEs, INSERTs or DELETEs are issued against a table. The VACUUM command asks PostgreSQL to reclaim any unused space in the table pages left by updates or deletes to records.

Vacuuming is so critical for the efficient running of the database that PostgreSQL provides an "autovacuum" facility by default.

Autovacuum both vacuums (recovers space) and analyzes (updates statistics) on your tables at sensible intervals determined by the level of activity. While this is essential for highly transactional databases, it is not advisable to wait for an autovacuum run after adding indices or bulk-loading data. Whenever a large batch update is performed, you should manually run VACUUM.

Vacuuming and analyzing the database can be performed separately as needed. Issuing VACUUM command will not update the database statistics; likewise issuing an ANALYZE command will not recover unused table rows. Both commands can be run against the entire database, a single table, or a single column.

VACUUM ANALYZE nyc_census_blocks;

15.6. 函数列表

geometry_a && geometry_b:若 A 的范围框覆盖了 B 的,则返回 TRUE.

geometry_a = geometry_b:若 A 的空间范围框和 B 的一致,返回 TRUE.

ST_Intersects(geometry_a, geometry_b):若 Geometries 对象或 Geography 对象存在 “空间相交”,即有任意部分重叠,返回 TRUE,否则返回 FALSE.

脚注

1

http://postgis.net/docs/support/rtree.pdf