Language Detection

A Docker image that detects the language of PDF documents or text using LangDetect. For PDF output, a PDFix SDK license is required.

Getting started

You need Docker installed. The first run downloads the image and may take longer than later runs.
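
To keep the slower first run from landing at an inconvenient moment, the image can be pulled ahead of time. The following is a minimal sketch for a POSIX shell; `require_cmd` and `prepare_image` are hypothetical helper names introduced here, not part of the image.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: pull the image in advance so later runs start quickly.
prepare_image() {
  if ! require_cmd docker; then
    echo "docker not found on PATH" >&2
    return 1
  fi
  docker pull pdfix/detect-language:latest
}
```

Calling `prepare_image` once after installing Docker should be enough; later `docker run` calls reuse the local copy.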

Usage

Mount a folder into the container and run a subcommand:

docker run --rm -v "$(pwd)":/data -w /data pdfix/detect-language:latest <command> [options]
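
Typing the full `docker run` prefix for every call gets repetitive. One way to shorten it is a small shell function. This is only a sketch: `pdfix_lang` and the `DOCKER_BIN` override are hypothetical names introduced here, not something the image provides.

```shell
# Hypothetical wrapper around the documented invocation.
# DOCKER_BIN is an assumption made here so that the command can be
# replaced with `echo` for a dry run; it defaults to `docker`.
pdfix_lang() {
  "${DOCKER_BIN:-docker}" run --rm -v "$(pwd)":/data -w /data \
    pdfix/detect-language:latest "$@"
}
```

With this defined, `pdfix_lang detect_language --input /data/input.txt --output /data/output.txt` expands to the full command shown above, and setting `DOCKER_BIN=echo` prints the command instead of running it.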

Commands

set-document-language: Detect language from a PDF and set it in document metadata (PDF → PDF)
set-tag-language: Detect language for filtered tags and save it on each tag (PDF → PDF)
set-content-language: Detect language for filtered page content and save it as marked content (PDF → PDF)
detect_language: Detect language from a TXT file or raw text string and write the language code to a TXT file (TXT → TXT; text → TXT)
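
In a script that accepts a subcommand from the user, it can help to know which kind of input that subcommand expects before building paths. The sketch below simply mirrors the list above; `input_kind` is a hypothetical helper name.

```shell
# Hypothetical helper: classify a subcommand by the input it expects,
# following the command list. Prints "pdf", "text", or "unknown".
input_kind() {
  case "$1" in
    set-document-language|set-tag-language|set-content-language) echo pdf ;;
    detect_language) echo text ;;
    *) echo unknown ;;
  esac
}
```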

Arguments

Common (PDF commands)

| Option | Required | Type / expected value | Description |
| --- | --- | --- | --- |
| `--input`, `-i` | yes | Path to an existing `.pdf` file | Input PDF |
| `--output`, `-o` | yes | Path for output `.pdf` file | Output PDF |
| `--name` | no | String (PDFix account license name) | PDFix license name |
| `--key` | no | String (PDFix account license key) | PDFix license key |
| `--maxwords` | no | Integer (default: 100) | How many words are considered for language detection |
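
The examples in this README pass the license through the `LICENSE_NAME` and `LICENSE_KEY` shell variables. Since `--name` and `--key` are optional, a script may want to check whether both variables are set before deciding to pass them. A minimal sketch for a POSIX shell, with `have_license` as a hypothetical helper name:

```shell
# Hypothetical helper: succeed only when both license variables are
# set and non-empty.
have_license() {
  [ -n "${LICENSE_NAME:-}" ] && [ -n "${LICENSE_KEY:-}" ]
}

if have_license; then
  echo "LICENSE_NAME and LICENSE_KEY are set; --name and --key can be passed"
else
  echo "LICENSE_NAME or LICENSE_KEY is missing" >&2
fi
```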

`set-document-language`

Uses the Common (PDF commands) arguments.

`set-tag-language`

Uses the Common (PDF commands) arguments, plus:

| Option | Required | Type / expected value | Description |
| --- | --- | --- | --- |
| `--overwrite` | no | Boolean string (default: `false`) | Overwrite a language that is already set on a tag |

`set-content-language`

Uses the Common (PDF commands) arguments, plus:

| Option | Required | Type / expected value | Description |
| --- | --- | --- | --- |
| `--overwrite` | no | Boolean string (default: `false`) | Overwrite a language that is already set on content |

`detect_language`

| Option | Required | Type / expected value | Description |
| --- | --- | --- | --- |
| `--input`, `-i` | yes | Path to an existing `.txt` file, or a raw text string | Source text or file |
| `--output`, `-o` | yes | Path for output `.txt` file | Output file containing the detected language code |
| `--maxwords` | no | Integer (default: 100) | How many words are considered for language detection |
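
Because `detect_language` writes its result to a text file, a script can read that file and branch on the code. The sketch below assumes the file holds a single language code such as `en` on its first line, possibly with surrounding whitespace. The exact output format is not specified here, so treat that as an assumption; `read_lang_code` is a hypothetical helper name.

```shell
# Hypothetical helper: print the first line of the result file with all
# whitespace (spaces, tabs, CR, LF) removed.
read_lang_code() {
  head -n 1 "$1" | tr -d '[:space:]'
}
```

A caller could then write `code=$(read_lang_code output.txt)` and compare `$code` against the languages it cares about.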

Examples

Set detected language in PDF document metadata:

docker run --rm -v "$(pwd)":/data -w /data pdfix/detect-language:latest \
  set-document-language --name "${LICENSE_NAME}" --key "${LICENSE_KEY}" \
  --input /data/input.pdf --output /data/output.pdf --maxwords 100

Set detected language on PDF tags:

docker run --rm -v "$(pwd)":/data -w /data pdfix/detect-language:latest \
  set-tag-language --name "${LICENSE_NAME}" --key "${LICENSE_KEY}" \
  --input /data/input.pdf --output /data/output.pdf --maxwords 100 --overwrite true

Set detected language on PDF page content:

docker run --rm -v "$(pwd)":/data -w /data pdfix/detect-language:latest \
  set-content-language --name "${LICENSE_NAME}" --key "${LICENSE_KEY}" \
  --input /data/input.pdf --output /data/output.pdf --maxwords 100 --overwrite true

Detect language from a text file and write the language code to output.txt:

docker run --rm -v "$(pwd)":/data -w /data pdfix/detect-language:latest \
  detect_language --input /data/input.txt --output /data/output.txt --maxwords 100
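
To process a whole folder, the single-file command can be wrapped in a loop. This is a minimal sketch, assuming a POSIX shell, PDFs located directly in the current directory, and results written to a hypothetical `out/` subfolder. `process_all_pdfs` and the `DOCKER_BIN` override are names introduced here for illustration.

```shell
# Hypothetical batch helper: run set-document-language on every *.pdf in
# the current directory and write each result to ./out/<same name>.
# DOCKER_BIN is an assumption that allows a dry run with `echo`.
process_all_pdfs() {
  mkdir -p out
  for f in *.pdf; do
    # With no PDFs present the glob stays literal, so skip it.
    [ -e "$f" ] || continue
    "${DOCKER_BIN:-docker}" run --rm -v "$(pwd)":/data -w /data \
      pdfix/detect-language:latest set-document-language \
      --name "${LICENSE_NAME}" --key "${LICENSE_KEY}" \
      --input "/data/$f" --output "/data/out/$f"
  done
}
```

Because the current directory is mounted at `/data`, files written to `/data/out` inside the container appear in `./out` on the host. Running with `DOCKER_BIN=echo` first prints the commands that would be executed.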

Help & support

For PDFix SDK licensing or issues, contact support@pdfix.net.

Licenses

PDFix Terms

Trial versions of the PDFix SDK may apply watermarks and redact random content in the output PDF.
