248 changes: 248 additions & 0 deletions docs/setup/azure-synapse-analytics.zh.md
@@ -0,0 +1,248 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

This tutorial walks you through installing Sedona in Azure Synapse Analytics when Data Exfiltration Protection (DEP) is enabled, or when other network restrictions prevent the Spark pool from reaching the public internet.

## Before you begin

This tutorial primarily demonstrates how to get Sedona 1.6.1 running on Spark 3.4 with Python 3.10.

If you want to run a newer version, refer to the detailed build-and-diagnosis process described in the second half of this document.

## Strong recommendations

1. Start from a clean Spark pool with no other packages installed, to avoid dependency conflicts.
2. Apache Spark pool -> Apache Spark configuration: use the default configuration.

## Installing Sedona 1.6.1 on Spark 3.4 with Python 3.10

### Step 1: Download the 9 packages

Note: the versions must match exactly; the newest is not always the best.

From Maven:

- [sedona-spark-shaded-3.4_2.12-1.6.1.jar](https://mvnrepository.com/artifact/org.apache.sedona/sedona-spark-shaded-3.4_2.12/1.6.1)

- [geotools-wrapper-1.6.1-28.2.jar](https://mvnrepository.com/artifact/org.datasyslab/geotools-wrapper/1.6.1-28.2)

From PyPI:

- [rasterio-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://files.pythonhosted.org/packages/cd/ad/2d3a14e5a97ca827a38d4963b86071267a6cd09d45065cd753d7325699b6/rasterio-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)

- [shapely-2.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://files.pythonhosted.org/packages/2b/a6/302e0d9c210ccf4d1ffadf7ab941797d3255dcd5f93daa73aaf116a4db39/shapely-2.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)

- [apache_sedona-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://files.pythonhosted.org/packages/b6/71/09f7ca2b6697b2699c04d1649bb379182076d263a9849de81295d253220d/apache_sedona-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)

- [click_plugins-1.1.1-py2.py3-none-any.whl](https://files.pythonhosted.org/packages/e9/da/824b92d9942f4e472702488857914bdd50f73021efea15b4cad9aca8ecef/click_plugins-1.1.1-py2.py3-none-any.whl)

- [cligj-0.7.2-py3-none-any.whl](https://files.pythonhosted.org/packages/73/86/43fa9f15c5b9fb6e82620428827cd3c284aa933431405d1bcf5231ae3d3e/cligj-0.7.2-py3-none-any.whl)

- [affine-2.4.0-py3-none-any.whl](https://files.pythonhosted.org/packages/0b/f7/85273299ab57117850cc0a936c64151171fac4da49bc6fba0dad984a7c5f/affine-2.4.0-py3-none-any.whl)

- [numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl](https://files.pythonhosted.org/packages/fb/25/ba023652a39a2c127200e85aed975fc6119b421e2c348e5d0171e2046edb/numpy-2.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl)
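
Because a DEP-enabled pool cannot reach the public internet, these files have to be fetched on a machine that can and then uploaded in the next step. A minimal sketch of scripting the downloads (the URL is the `affine` link listed above; only one shown for brevity):

```bash
# Fetch each artifact on an internet-connected machine, e.g.:
wget https://files.pythonhosted.org/packages/0b/f7/85273299ab57117850cc0a936c64151171fac4da49bc6fba0dad984a7c5f/affine-2.4.0-py3-none-any.whl
# Repeat for the remaining wheels and the two jars from the links above.
```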

### Step 2: Upload the packages to the Synapse workspace

https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-workspace-packages

### Step 3: Add the packages to the Spark pool

This tutorial uses the second approach described on that page: **If you are updating from the Synapse Studio**

https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages#manage-packages-from-synapse-studio-or-azure-portal

### Step 4: Notebook

Add the following code at the start of the notebook:

```python
from sedona.spark import SedonaContext

# The Maven coordinates below must match the jars uploaded in step 2.
config = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.6.1,"
        "org.datasyslab:geotools-wrapper:1.6.1-28.2",
    )
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config(
        "spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator"
    )
    .config(
        "spark.sql.extensions",
        "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions",
    )
    .getOrCreate()
)

sedona = SedonaContext.create(config)
```

Run a test:

```python
sedona.sql("SELECT ST_GeomFromEWKT('SRID=4269;POINT(40.7128 -74.0060)')").show()
```

If you see the point in the output, the installation succeeded and the configuration work is complete.

## Packages required for Sedona 1.6.0 on Spark 3.4 with Python 3.10

```
spark-xml_2.12-0.17.0.jar
sedona-spark-shaded-3.4_2.12-1.6.0.jar

click_plugins-1.1.1-py2.py3-none-any.whl
affine-2.4.0-py3-none-any.whl
apache_sedona-1.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
cligj-0.7.2-py3-none-any.whl
rasterio-1.3.10-cp310-cp310-manylinux2014_x86_64.whl
shapely-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
snuggs-1.4.7-py3-none-any.whl
geotools-wrapper-1.6.0-28.2.jar
```

## Background: determining the required packages for other/future versions of Spark and/or Sedona

Warning: this process requires considerable technical patience and troubleshooting ability.

The overall approach: build a Linux VM from the same image used by the deployed Spark pool, configure it to match Synapse, install the Sedona packages, and then identify which packages are needed on top of the Synapse baseline.

Below is the hands-on process for Sedona 1.6.1 on Spark 3.4 with Python 3.10 (the same process was used for Sedona 1.6.0).

### Step 1: Identify the Linux image used by the Spark pool for your runtime version

https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-34-runtime

### Step 2: Download the ISO

https://github.com/microsoft/azurelinux/tree/2.0

### Step 3: Build the virtual machine

https://github.com/microsoft/azurelinux/blob/2.0/toolkit/docs/quick_start/quickstart.md#iso-image

If you use Hyper-V, pay attention to the following settings:

- Enable Secure Boot: Microsoft UEFI Certificate Authority
- CPU cores: 2
- Disable dynamic memory (fix it at 8 GB). Forgetting this setting causes a lot of trouble.

### Step 4: Update the virtual machine

Connect to the VM; note that the first boot takes longer than you might expect.

```sh
sudo dnf upgrade
```

### Step 5 (optional but strongly recommended): Install ssh-server (so you can copy and paste)

```sh
sudo tdnf install -y openssh-server
```

Enable root login and password authentication:

```sh
sudo vi /etc/ssh/sshd_config
# Set the following directives:
#   PasswordAuthentication yes
#   PermitRootLogin yes
```

Start the ssh server:

```bash
sudo systemctl enable --now sshd.service
```

Identify the VM's IP address (here using Hyper-V on a Windows 10 desktop):

```ps
Get-VMNetworkAdapter -VMName "Synapse Spark 3.4 Python 3.10 Sedona 1.6.1" | Select-Object -ExpandProperty IPAddresses
```
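
With the IP address in hand, you can connect from the desktop (assumes the root/password login enabled above; `<vm-ip>` is a placeholder for the address returned by the command):

```bash
ssh root@<vm-ip>
```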

### Step 6: Install Miniconda

```bash
cd /tmp
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
```
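
Depending on the answers you give the installer, `conda` may not be on your PATH until you initialize your shell. A sketch assuming the default install location of `~/miniconda3`:

```bash
# Register conda in your shell startup file, then reload the shell
~/miniconda3/bin/conda init bash
exec bash
```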

### Step 7: Install compilers

```sh
sudo tdnf -y install gcc g++
```

### Step 8: Create the Synapse baseline virtual environment

Download the virtual environment definition file:

```bash
wget -O Synapse-Python310-CPU.yml https://raw.githubusercontent.com/microsoft/synapse-spark-runtime/refs/heads/main/Synapse/spark3.4/Synapse-Python310-CPU.yml
```

```bash
conda env create -f Synapse-Python310-CPU.yml -n synapse
```

If the environment creation fails at `fsspec_wrapper`, remove `fsspec_wrapper==0.1.13=py_3` from the yml and retry.
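
If you prefer to script that removal, a one-line sketch (assumes GNU sed; edits the file in place):

```bash
# Delete the conflicting pin from the environment definition
sed -i '/fsspec_wrapper==0.1.13=py_3/d' Synapse-Python310-CPU.yml
```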

If `pip` still reports other errors after this change, they can be ignored; you can continue anyway.
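
Once the environment is created, a quick sanity check that the baseline resolved correctly (the `synapse` name follows the `conda env create` command above):

```bash
conda activate synapse
python --version   # should report Python 3.10.x
```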

### Step 9: Install the Sedona Python package

```bash
conda activate synapse
echo "apache-sedona==1.6.1" > requirements.txt
pip install -r requirements.txt > pip-output.txt
```
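
If you also want to watch progress while capturing the log, piping through `tee` produces the same `pip-output.txt` (assumes a bash-like shell):

```bash
pip install -r requirements.txt | tee pip-output.txt
```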

### Step 10: Identify the Python packages you need to download

```bash
grep Downloading pip-output.txt
```

**This output is the list of packages you need to locate and download from PyPI.**

Example output:

```
Downloading apache_sedona-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB)
Downloading shapely-2.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Downloading rasterio-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (22.2 MB)
Downloading affine-2.4.0-py3-none-any.whl (15 kB)
Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
```
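
To reduce those lines to bare filenames, for example to search each one on PyPI, a small sketch over the same log (the field position assumes the pip output format shown above):

```bash
# The second whitespace-separated field is the wheel filename
grep Downloading pip-output.txt | awk '{print $2}'
```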

### Step 11: Identify package conflicts in the deployed Azure Synapse Spark pool (the real environment, not the VM)

- Upload the packages to the workspace
- Add the packages to your (clean!) Spark pool

Read the error messages Synapse returns carefully, and work through the conflicts one by one.

Note: no problems were encountered installing Sedona 1.6.0 on Spark 3.4, but Sedona 1.6.1 and its supporting packages hit a `numpy`-related conflict; a specific version had to be downloaded and added to the package list. `numpy` did not appear in the grep output above.
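
One way to find the version worth pinning is to check what the baseline VM environment actually resolved (a sketch; run inside the VM):

```bash
conda activate synapse
pip freeze | grep -i '^numpy'   # e.g. numpy==2.1.2
```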
51 changes: 51 additions & 0 deletions docs/setup/cluster.zh.md
@@ -0,0 +1,51 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Set up your Apache Spark cluster

Download a Spark distribution from the [Spark download page](http://spark.apache.org/downloads.html).

## Prerequisites

1. Set up passwordless SSH on your cluster. Each master-worker pair needs passwordless SSH in both directions (a minimal sketch appears at the end of this section).
2. Make sure JRE 1.8 or later is installed.
3. List the IP addresses of all workers in `./conf/slaves`.
4. Besides the required Spark settings, you may need to add the following settings to your Spark configuration files to avoid Sedona memory errors:

In `./conf/spark-defaults.conf`:

```
spark.driver.memory 10g
spark.network.timeout 1000s
spark.driver.maxResultSize 5g
```

* `spark.driver.memory` allocates enough memory for the driver program, because Sedona needs to build global grid files (a global index) on the driver. If you have a large amount of data (typically over 100 GB), setting this parameter to 2-5 GB is a good choice; otherwise you may see an "out of memory" error.
* `spark.network.timeout` is the default timeout for all network interactions. Spatial join queries sometimes need longer shuffle times; increasing this parameter gives Spark enough patience to wait for the results.
* `spark.driver.maxResultSize` limits the total size of serialized results across all partitions for each Spark action. Spatial query results can be large, and a `Collect` operation may otherwise throw an error.

For more details on Spark parameters, see the [official Spark documentation](https://spark.apache.org/docs/latest/configuration.html).
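
For item 1 above, a minimal passwordless-SSH sketch (assumes the same user on every node; `<user>` and `<peer-host>` are placeholders, and the exchange must be repeated for each master-worker pair, in both directions):

```bash
# Generate a key once per node (no passphrase), then copy it to the peer
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
ssh-copy-id <user>@<peer-host>
```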

## Start your cluster

Go to the root of the extracted Apache Spark directory and start the Spark cluster from a terminal:

```
./sbin/start-all.sh
```
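
To confirm the cluster came up, you can check the process list on each node or probe the master's web UI; a quick sketch:

```bash
# The Master and Worker daemons should appear in the process list
ps -ef | grep -i spark | grep -v grep
# The master web UI answers on port 8080 by default
curl -sI http://localhost:8080 | head -n 1
```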