apache · jiayuasu · May 8, 2026 · May 8, 2026
@@ -0,0 +1,86 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
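+# Advanced tutorial: Tune your Sedona RDD application
+
+Before starting this advanced tutorial, make sure you have already tried several Sedona functions on your local machine.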
+
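+## Pick a proper Sedona version
+
+A Sedona version number has three levels (for example, 0.8.1).
+
+The first level indicates a major structural redesign, which may bring significant API changes and performance differences.
+
+The second level (for example, 0.8) indicates that the release contains significant performance improvements, important new features, and API changes. If you are an existing Sedona user and want to upgrade to such a release, treat the API changes with care. Before upgrading, read the [Sedona release notes](../setup/release-notes.md) and confirm that you can accept the corresponding API changes.
+
+The third level (for example, 0.8.1) contains only bug fixes, a few small new features, and slight performance improvements, with no API changes. Upgrading to such a release is safe. We strongly recommend that all Sedona users on the same second-level version upgrade to the latest release at this level.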
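
The three-level rule above can be sketched as a small helper that reports which level differs first between two version strings. This is only an illustration: `parse_version` and `upgrade_level` are hypothetical names, not part of Sedona, and the sketch assumes plain `X.Y.Z` version strings.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a three-level version string such as "0.8.1" into integers."""
    first, second, third = (int(part) for part in version.split("."))
    return first, second, third


def upgrade_level(current: str, target: str) -> str:
    """Return the first version level at which the two versions differ."""
    cur, tgt = parse_version(current), parse_version(target)
    if cur[0] != tgt[0]:
        return "first"   # structural redesign: expect large API changes
    if cur[1] != tgt[1]:
        return "second"  # new features and API changes: read the release notes
    if cur[2] != tgt[2]:
        return "third"   # bug fixes, no API changes: safe to upgrade
    return "same"


print(upgrade_level("0.8.0", "0.8.1"))
```

A result of `"third"` corresponds to the upgrades described as safe; `"second"` or `"first"` is a signal to read the release notes before upgrading.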
+
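+## Choose a proper Spatial RDD constructor
+
+Sedona provides several constructors for each SpatialRDD (PointRDD, PolygonRDD, LineStringRDD). In general, you can start from one of two entry points:
+
+1. Initialize a SpatialRDD from a data source such as HDFS or S3. A typical example: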
+
+```java
+public PointRDD(JavaSparkContext sparkContext, String InputLocation, Integer Offset, FileDataSplitter splitter, boolean carryInputData, Integer partitions, StorageLevel newLevel)
+```
+
+2. Initialize a SpatialRDD from an existing RDD. A typical example:
+
+```java
+public PointRDD(JavaRDD<Point> rawSpatialRDD, StorageLevel newLevel)
+```
+
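+Note that these constructors all take a `StorageLevel` parameter. It tells Spark to cache `rawSpatialRDD`, an attribute of the SpatialRDD. Sedona does this because it needs to run several Spark "Actions" to compute the dataset boundary and the approximate total count; this information is very useful when running a Spatial Join Query or a Distance Join Query.
+
+Sometimes, however, you know your dataset well. In that case you can supply this information yourself by calling a Spatial RDD constructor of the following form: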
+
+```java
+public PointRDD(JavaSparkContext sparkContext, String InputLocation, Integer Offset, FileDataSplitter splitter, boolean carryInputData, Integer partitions, Envelope datasetBoundary, Integer approximateTotalCount)
+```
+
+Supplying the dataset boundary and the approximate total count manually lets Sedona skip several slow "Actions" during initialization.
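
To make the two pre-computed values concrete, here is a minimal plain-Python sketch that derives a bounding box and a count from a handful of coordinates in one pass. The function name `dataset_summary` and the sample coordinates are made up for illustration. The tuple order `(min_x, max_x, min_y, max_y)` is chosen to mirror the argument order of the JTS `Envelope(x1, x2, y1, y2)` constructor, on the assumption that this is the `Envelope` type the signature above expects.

```python
from typing import Iterable, Tuple

BoundingBox = Tuple[float, float, float, float]


def dataset_summary(points: Iterable[Tuple[float, float]]) -> Tuple[BoundingBox, int]:
    """Return ((min_x, max_x, min_y, max_y), count) in a single pass."""
    min_x = min_y = float("inf")
    max_x = max_y = float("-inf")
    count = 0
    for x, y in points:
        min_x, max_x = min(min_x, x), max(max_x, x)
        min_y, max_y = min(min_y, y), max(max_y, y)
        count += 1
    if count == 0:
        raise ValueError("cannot summarize an empty dataset")
    return (min_x, max_x, min_y, max_y), count


# Illustrative sample coordinates only
sample = [(8.0, 2.0), (2.5, 4.0), (12.8, 4.5), (1.0, 5.0)]
boundary, total = dataset_summary(sample)
print(boundary, total)
```

If values like these are already known from metadata or an earlier job, passing them to the constructor avoids recomputing them over the full dataset.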
+
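+## Cache the Spatial RDD that is repeatedly used
+
+Each SpatialRDD (PointRDD, PolygonRDD, LineStringRDD) has four RDD attributes:
+
+1. rawSpatialRDD: the RDD generated by the SpatialRDD constructor.
+2. spatialPartitionedRDD: the RDD obtained by spatially partitioning rawSpatialRDD. Note: this RDD contains duplicated spatial objects.
+3. indexedRawRDD: the RDD obtained by building an index on rawSpatialRDD.
+4. indexedRDD: the RDD obtained by building an index on spatialPartitionedRDD. Note: this RDD contains duplicated spatial objects.
+
+These four RDDs do not exist at the same time, so you do not need to worry about memory.
+They are used by different queries: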
+
+1. Spatial Range Query / KNN Query, no index: uses rawSpatialRDD.
+2. Spatial Range Query / KNN Query, with index: uses indexedRawRDD.
+3. Spatial Join Query / Distance Join Query, no index: uses spatialPartitionedRDD.
+4. Spatial Join Query / Distance Join Query, with index: uses indexedRDD.
+
+Therefore, if you run one of these queries many times, it is best to cache the corresponding RDD in memory. Common use cases include:
+
+1. In spatial data mining tasks such as Spatial Autocorrelation and Spatial Co-location Pattern Mining, you may need to run Spatial Join / Spatial Self-join iteratively to compute an adjacency matrix. In this case, cache the spatialPartitionedRDD/indexedRDD that is queried repeatedly.
+2. In Spark RDD sharing applications such as [Livy](https://github.com/cloudera/livy) and [Spark Job Server](https://github.com/spark-jobserver/spark-jobserver), multiple users may run Spatial Range Query / KNN Query with different predicates on the same Spatial RDD. In this case, caching rawSpatialRDD/indexedRawRDD is recommended.
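
The query-to-RDD mapping listed earlier in this section can be written down as a small lookup table, which is handy when deciding what to cache. The query labels (`"range"`, `"knn"`, `"join"`, `"distance_join"`) and the helper `rdd_to_cache` are hypothetical names chosen for this sketch; only the four attribute names come from the text.

```python
# (query kind, index enabled) -> SpatialRDD attribute used by that query
RDD_FOR_QUERY = {
    ("range", False): "rawSpatialRDD",
    ("range", True): "indexedRawRDD",
    ("knn", False): "rawSpatialRDD",
    ("knn", True): "indexedRawRDD",
    ("join", False): "spatialPartitionedRDD",
    ("join", True): "indexedRDD",
    ("distance_join", False): "spatialPartitionedRDD",
    ("distance_join", True): "indexedRDD",
}


def rdd_to_cache(query: str, use_index: bool) -> str:
    """Return the name of the attribute worth caching for a repeated query."""
    try:
        return RDD_FOR_QUERY[(query, use_index)]
    except KeyError:
        raise ValueError(f"unknown query kind: {query!r}") from None


print(rdd_to_cache("join", True))
```

For example, an iterative self-join with an index enabled maps to `indexedRDD`, matching the first use case above.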
+
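+## Be aware of Spatial RDD partitions
+
+Users sometimes report slow execution in certain cases. As a first step, always consider increasing the number of SpatialRDD partitions (2 to 8 times the original value is recommended). You can set this when initializing the SpatialRDD, and it often improves performance significantly.
+
+After that, you can consider tuning other Spark parameters, for example using the Kryo serializer or adjusting the fraction of memory used for cached RDDs.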
@@ -0,0 +1,25 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
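+## Benchmark
+
+We welcome everyone to use Sedona for benchmarking. To get the best performance or to try all of Sedona's features, please:
+
+* Always use the latest version, or state the version you used in your benchmark so that we can track related issues.
+* Enable Sedona's Kryo serializer to reduce the memory footprint.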
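
As a rough sketch of what enabling Kryo can look like, the snippet below collects the two relevant Spark settings in a dictionary and renders them as `spark-submit --conf` arguments. `spark.serializer` and `spark.kryo.registrator` are standard Spark configuration keys. The registrator class name is an assumption based on recent Sedona releases and may differ in your version. `KRYO_CONF` and `as_submit_args` are hypothetical names used only here.

```python
from typing import Dict, List

KRYO_CONF: Dict[str, str] = {
    # Standard Spark setting that switches serialization to Kryo
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Assumed Sedona registrator class; check the docs for your Sedona version
    "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
}


def as_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as a flat list of spark-submit arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(as_submit_args(KRYO_CONF)))
```

The same key/value pairs can instead be passed to a session builder's `.config(key, value)` calls when the session is created in code.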
@@ -0,0 +1,136 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
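+# Clustering with Apache Sedona on Apache Spark
+
+Clustering algorithms group similar data points into "clusters". Apache Sedona can run clustering algorithms on large geometric datasets.
+
+Note that the term "cluster" has two meanings here:
+
+* A computation cluster is a network of computers that work together to execute an algorithm
+* A clustering algorithm groups data points into different "clusters"
+
+On this page, "cluster" refers to the output of a clustering algorithm.
+
+## Clustering with DBSCAN on Spark
+
+This page explains how to use Apache Sedona to perform density-based spatial clustering of applications with noise (DBSCAN).
+
+DBSCAN groups geometric objects in high-density regions into clusters and marks points in low-density regions as noise/outliers.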
+
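+Let's look at a scatter plot of data that can be clustered:
+
+![Scatter plot of points](../../image/tutorial/concepts/dbscan-scatterplot-points.png)
+
+DBSCAN clusters the data as follows:
+
+![Scatter plot with cluster groupings](../../image/tutorial/concepts/dbscan-clustering.png)
+
+* Cluster 0 contains 5 points
+* Cluster 1 contains 4 points
+* 4 points are outliers
+
+Let's create a Spark DataFrame with this data and run the clustering with Sedona. Here is how to build the DataFrame: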
+
+```python
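+from pyspark.sql.functions import col
+from sedona.sql.st_constructors import ST_Point  # import path may differ by Sedona version
+
+# `sedona` is assumed to be an existing Sedona-enabled Spark session
+df = (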
+    sedona.createDataFrame(
+        [
+            (1, 8.0, 2.0),
+            (2, 2.6, 4.0),
+            (3, 2.5, 4.0),
+            (4, 8.5, 2.5),
+            (5, 2.8, 4.3),
+            (6, 12.8, 4.5),
+            (7, 2.5, 4.2),
+            (8, 8.2, 2.5),
+            (9, 8.0, 3.0),
+            (10, 1.0, 5.0),
+            (11, 8.0, 2.5),
+            (12, 5.0, 6.0),
+            (13, 4.0, 3.0),
+        ],
+        ["id", "x", "y"],
+    )
+).withColumn("point", ST_Point(col("x"), col("y")))
+```
+
+The DataFrame contents are:
+
+```
++---+----+---+----------------+
+| id|   x|  y|           point|
++---+----+---+----------------+
+|  1| 8.0|2.0|     POINT (8 2)|
+|  2| 2.6|4.0|   POINT (2.6 4)|
+|  3| 2.5|4.0|   POINT (2.5 4)|
+|  4| 8.5|2.5| POINT (8.5 2.5)|
+|  5| 2.8|4.3| POINT (2.8 4.3)|
+|  6|12.8|4.5|POINT (12.8 4.5)|
+|  7| 2.5|4.2| POINT (2.5 4.2)|
+|  8| 8.2|2.5| POINT (8.2 2.5)|
+|  9| 8.0|3.0|     POINT (8 3)|
+| 10| 1.0|5.0|     POINT (1 5)|
+| 11| 8.0|2.5|   POINT (8 2.5)|
+| 12| 5.0|6.0|     POINT (5 6)|
+| 13| 4.0|3.0|     POINT (4 3)|
++---+----+---+----------------+
+```
+
+Here is how to run the DBSCAN algorithm:
+
+```python
+from sedona.spark.stats import dbscan
+
+dbscan(df, 1.0, 3).orderBy("id").show()
+```
+
+The result is:
+
+```
++---+----+---+----------------+------+-------+
+| id|   x|  y|           point|isCore|cluster|
++---+----+---+----------------+------+-------+
+|  1| 8.0|2.0|     POINT (8 2)|  true|      0|
+|  2| 2.6|4.0|   POINT (2.6 4)|  true|      1|
+|  3| 2.5|4.0|   POINT (2.5 4)|  true|      1|
+|  4| 8.5|2.5| POINT (8.5 2.5)|  true|      0|
+|  5| 2.8|4.3| POINT (2.8 4.3)|  true|      1|
+|  6|12.8|4.5|POINT (12.8 4.5)| false|     -1|
+|  7| 2.5|4.2| POINT (2.5 4.2)|  true|      1|
+|  8| 8.2|2.5| POINT (8.2 2.5)|  true|      0|
+|  9| 8.0|3.0|     POINT (8 3)|  true|      0|
+| 10| 1.0|5.0|     POINT (1 5)| false|     -1|
+| 11| 8.0|2.5|   POINT (8 2.5)|  true|      0|
+| 12| 5.0|6.0|     POINT (5 6)| false|     -1|
+| 13| 4.0|3.0|     POINT (4 3)| false|     -1|
++---+----+---+----------------+------+-------+
+```
+
+The `cluster` column shows which cluster each geometry belongs to.
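
To see why the points group this way, the following plain-Python sketch runs a textbook, single-machine DBSCAN over the same 13 points. It is not Sedona's distributed implementation. It assumes that the two positional arguments in the call above are the neighborhood radius (1.0) and the minimum neighbor count (3), and that a point counts itself as a neighbor. `dbscan_sketch` is a hypothetical name, and the cluster numbering here simply follows id order.

```python
from math import hypot
from typing import Dict, List, Tuple

POINTS: Dict[int, Tuple[float, float]] = {
    1: (8.0, 2.0), 2: (2.6, 4.0), 3: (2.5, 4.0), 4: (8.5, 2.5),
    5: (2.8, 4.3), 6: (12.8, 4.5), 7: (2.5, 4.2), 8: (8.2, 2.5),
    9: (8.0, 3.0), 10: (1.0, 5.0), 11: (8.0, 2.5), 12: (5.0, 6.0),
    13: (4.0, 3.0),
}


def dbscan_sketch(points: Dict[int, Tuple[float, float]],
                  eps: float, min_pts: int) -> Dict[int, int]:
    """Return {point id: cluster label}; -1 marks noise."""
    ids = sorted(points)

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i, including i itself
        xi, yi = points[i]
        return [j for j in ids
                if hypot(points[j][0] - xi, points[j][1] - yi) <= eps]

    labels: Dict[int, int] = {}
    next_cluster = 0
    for i in ids:
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = next_cluster  # i is a core point and starts a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = next_cluster  # noise reachable from a core point
            if j in labels:
                continue
            labels[j] = next_cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also core: keep expanding
        next_cluster += 1
    return labels


labels = dbscan_sketch(POINTS, eps=1.0, min_pts=3)
for point_id in sorted(labels):
    print(point_id, labels[point_id])
```

With these inputs the sketch produces the same grouping as the table above: ids 1, 4, 8, 9, and 11 form one cluster, ids 2, 3, 5, and 7 form another, and ids 6, 10, 12, and 13 are labeled -1.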
+
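+To run this operation, you must first set Spark's checkpoint directory. The checkpoint directory is a persistent temporary cache location where intermediate query results are written.
+
+You can set the checkpoint directory as follows: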
+
+```python
+sedona.sparkContext.setCheckpointDir(myPath)
+```
+
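+`myPath` must be accessible to all executors. A local path works when running on a single machine; HDFS, if available, is usually a better choice. Some runtime environments may allow or require an object storage path (for example Amazon S3 or Google Cloud Storage). Some environments may already have the Spark checkpoint directory set, in which case this step can be skipped.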