diff --git a/docs-website/docs/concepts/data-classes.mdx b/docs-website/docs/concepts/data-classes.mdx index 91efbc3674..576c9f6221 100644 --- a/docs-website/docs/concepts/data-classes.mdx +++ b/docs-website/docs/concepts/data-classes.mdx @@ -9,7 +9,7 @@ description: "In Haystack, there are a handful of core classes that are regularl In Haystack, there are a handful of core classes that are regularly used in many different places. These are classes that carry data through the system and you are likely to interact with these as either the input or output of your pipeline. -Haystack uses data classes to help components communicate with each other in a simple and modular way. By doing this, data flows seamlessly through the Haystack pipelines. This page goes over the available data classes in Haystack: ByteStream, Answer (along with its variants ExtractedAnswer and GeneratedAnswer), ChatMessage, Document, and StreamingChunk, explaining how they contribute to the Haystack ecosystem. +Haystack uses data classes to help components communicate with each other in a simple and modular way. By doing this, data flows seamlessly through the Haystack pipelines. This page goes over the available data classes in Haystack: ByteStream, Answer (along with its variants ExtractedAnswer and GeneratedAnswer), ChatMessage, FileContent, Document, and StreamingChunk, explaining how they contribute to the Haystack ecosystem. You can check out the detailed parameters in our [Data Classes](/reference/data-classes-api) API reference. @@ -120,6 +120,12 @@ image = ByteStream.from_file_path("dog.jpg") Read the detailed documentation for the `ChatMessage` data class on a dedicated [ChatMessage](data-classes/chatmessage.mdx) page. +### FileContent + +`FileContent` represents a file payload that can be attached to a `ChatMessage`, including base64 data, MIME type, filename, and provider-specific metadata. + +Read the detailed documentation for the `FileContent` data class on a dedicated [FileContent](data-classes/filecontent.mdx) page. + ### Document #### Overview diff --git a/docs-website/docs/concepts/data-classes/filecontent.mdx b/docs-website/docs/concepts/data-classes/filecontent.mdx new file mode 100644 index 0000000000..5f2c5f40a1 --- /dev/null +++ b/docs-website/docs/concepts/data-classes/filecontent.mdx @@ -0,0 +1,119 @@ +--- +title: "FileContent" +id: filecontent +slug: "/filecontent" +description: "`FileContent` represents file payloads in chat messages, including base64 data, MIME type, filename, and provider-specific metadata." +--- + +# FileContent + +`FileContent` represents a file payload that can be attached to a [`ChatMessage`](chatmessage.mdx). Use it when a chat model accepts file inputs, such as PDFs or other documents, together with the user's text prompt. + +If you need the full list of parameters and methods, see the [`FileContent` API reference](/reference/data-classes-api#filecontent). + +## Attributes + +```python +@dataclass +class FileContent: + base64_data: str + mime_type: str | None = None + filename: str | None = None + extra: dict[str, Any] = field(default_factory=dict) + validation: bool = True +``` + +- `base64_data` stores the file content as a base64-encoded string. +- `mime_type` identifies the file type, for example `application/pdf`. Providing it explicitly is recommended because many model providers require it. +- `filename` is optional, but some providers use it when processing uploaded files. +- `extra` can store provider-specific metadata. Values should be JSON serializable. +- `validation` checks that `base64_data` is valid and tries to infer the MIME type when one is not provided. + +## Create from a file path + +Use `from_file_path` to read a local file, base64-encode it, infer the MIME type from the path, and populate the filename. + +```python +from haystack.dataclasses import ChatMessage, FileContent + +file_content = FileContent.from_file_path("data/attention-is-all-you-need.pdf") + +message = ChatMessage.from_user( + content_parts=[ + file_content, + "Summarize the key ideas in this paper.", + ] +) +``` + +Pass `filename` or `extra` when a provider expects a specific filename or provider-specific options: + +```python +file_content = FileContent.from_file_path( + "data/report.pdf", + filename="quarterly-report.pdf", + extra={"source": "finance"}, +) +``` + +## Create from a URL + +Use `from_url` to download a file and convert it into a `FileContent` instance. + +```python +from haystack.dataclasses import FileContent + +file_content = FileContent.from_url( + "https://example.com/reports/quarterly-report.pdf", + timeout=30, +) +``` + +If no filename is provided, Haystack uses the final path segment of the URL. + +## Create from base64 data + +If you already have file bytes, encode them and pass the MIME type explicitly. + +```python +import base64 +from pathlib import Path + +from haystack.dataclasses import FileContent + +data = Path("data/manual.pdf").read_bytes() +file_content = FileContent( + base64_data=base64.b64encode(data).decode("utf-8"), + mime_type="application/pdf", + filename="manual.pdf", +) +``` + +Set `validation=False` only when the base64 data and MIME type are already trusted and you want to skip validation. + +## Inspect files in a ChatMessage + +After adding `FileContent` to a `ChatMessage`, use the `file` and `files` properties to access file payloads. + +```python +from haystack.dataclasses import ChatMessage, FileContent + +file_content = FileContent.from_file_path("data/invoice.pdf") +message = ChatMessage.from_user(content_parts=[file_content, "Extract the invoice total."]) + +print(message.file) +print(message.files) +``` + +`message.file` returns the first file payload, or `None` if there are no files. `message.files` returns all file payloads. + +## Serialization + +Use `to_dict` and `from_dict` to serialize and restore file content. + +```python +payload = file_content.to_dict() +restored = FileContent.from_dict(payload) +``` + +For tracing, Haystack replaces the full base64 payload with a placeholder so large files are not sent to the tracing backend. diff --git a/docs-website/sidebars.js b/docs-website/sidebars.js index c9c94e0ad7..fc2ed887d5 100644 --- a/docs-website/sidebars.js +++ b/docs-website/sidebars.js @@ -78,6 +78,7 @@ export default { }, items: [ 'concepts/data-classes/chatmessage', + 'concepts/data-classes/filecontent', ], }, { diff --git a/docs-website/versioned_docs/version-2.29/concepts/data-classes.mdx b/docs-website/versioned_docs/version-2.29/concepts/data-classes.mdx index 91efbc3674..576c9f6221 100644 --- a/docs-website/versioned_docs/version-2.29/concepts/data-classes.mdx +++ b/docs-website/versioned_docs/version-2.29/concepts/data-classes.mdx @@ -9,7 +9,7 @@ description: "In Haystack, there are a handful of core classes that are regularl In Haystack, there are a handful of core classes that are regularly used in many different places. These are classes that carry data through the system and you are likely to interact with these as either the input or output of your pipeline. -Haystack uses data classes to help components communicate with each other in a simple and modular way. By doing this, data flows seamlessly through the Haystack pipelines. This page goes over the available data classes in Haystack: ByteStream, Answer (along with its variants ExtractedAnswer and GeneratedAnswer), ChatMessage, Document, and StreamingChunk, explaining how they contribute to the Haystack ecosystem. +Haystack uses data classes to help components communicate with each other in a simple and modular way. By doing this, data flows seamlessly through the Haystack pipelines. This page goes over the available data classes in Haystack: ByteStream, Answer (along with its variants ExtractedAnswer and GeneratedAnswer), ChatMessage, FileContent, Document, and StreamingChunk, explaining how they contribute to the Haystack ecosystem. You can check out the detailed parameters in our [Data Classes](/reference/data-classes-api) API reference. @@ -120,6 +120,12 @@ image = ByteStream.from_file_path("dog.jpg") Read the detailed documentation for the `ChatMessage` data class on a dedicated [ChatMessage](data-classes/chatmessage.mdx) page. +### FileContent + +`FileContent` represents a file payload that can be attached to a `ChatMessage`, including base64 data, MIME type, filename, and provider-specific metadata. + +Read the detailed documentation for the `FileContent` data class on a dedicated [FileContent](data-classes/filecontent.mdx) page. + ### Document #### Overview diff --git a/docs-website/versioned_docs/version-2.29/concepts/data-classes/filecontent.mdx b/docs-website/versioned_docs/version-2.29/concepts/data-classes/filecontent.mdx new file mode 100644 index 0000000000..5f2c5f40a1 --- /dev/null +++ b/docs-website/versioned_docs/version-2.29/concepts/data-classes/filecontent.mdx @@ -0,0 +1,119 @@ +--- +title: "FileContent" +id: filecontent +slug: "/filecontent" +description: "`FileContent` represents file payloads in chat messages, including base64 data, MIME type, filename, and provider-specific metadata." +--- + +# FileContent + +`FileContent` represents a file payload that can be attached to a [`ChatMessage`](chatmessage.mdx). Use it when a chat model accepts file inputs, such as PDFs or other documents, together with the user's text prompt. + +If you need the full list of parameters and methods, see the [`FileContent` API reference](/reference/data-classes-api#filecontent). + +## Attributes + +```python +@dataclass +class FileContent: + base64_data: str + mime_type: str | None = None + filename: str | None = None + extra: dict[str, Any] = field(default_factory=dict) + validation: bool = True +``` + +- `base64_data` stores the file content as a base64-encoded string. +- `mime_type` identifies the file type, for example `application/pdf`. Providing it explicitly is recommended because many model providers require it. +- `filename` is optional, but some providers use it when processing uploaded files. +- `extra` can store provider-specific metadata. Values should be JSON serializable. +- `validation` checks that `base64_data` is valid and tries to infer the MIME type when one is not provided. + +## Create from a file path + +Use `from_file_path` to read a local file, base64-encode it, infer the MIME type from the path, and populate the filename. + +```python +from haystack.dataclasses import ChatMessage, FileContent + +file_content = FileContent.from_file_path("data/attention-is-all-you-need.pdf") + +message = ChatMessage.from_user( + content_parts=[ + file_content, + "Summarize the key ideas in this paper.", + ] +) +``` + +Pass `filename` or `extra` when a provider expects a specific filename or provider-specific options: + +```python +file_content = FileContent.from_file_path( + "data/report.pdf", + filename="quarterly-report.pdf", + extra={"source": "finance"}, +) +``` + +## Create from a URL + +Use `from_url` to download a file and convert it into a `FileContent` instance. + +```python +from haystack.dataclasses import FileContent + +file_content = FileContent.from_url( + "https://example.com/reports/quarterly-report.pdf", + timeout=30, +) +``` + +If no filename is provided, Haystack uses the final path segment of the URL. + +## Create from base64 data + +If you already have file bytes, encode them and pass the MIME type explicitly. + +```python +import base64 +from pathlib import Path + +from haystack.dataclasses import FileContent + +data = Path("data/manual.pdf").read_bytes() +file_content = FileContent( + base64_data=base64.b64encode(data).decode("utf-8"), + mime_type="application/pdf", + filename="manual.pdf", +) +``` + +Set `validation=False` only when the base64 data and MIME type are already trusted and you want to skip validation. + +## Inspect files in a ChatMessage + +After adding `FileContent` to a `ChatMessage`, use the `file` and `files` properties to access file payloads. + +```python +from haystack.dataclasses import ChatMessage, FileContent + +file_content = FileContent.from_file_path("data/invoice.pdf") +message = ChatMessage.from_user(content_parts=[file_content, "Extract the invoice total."]) + +print(message.file) +print(message.files) +``` + +`message.file` returns the first file payload, or `None` if there are no files. `message.files` returns all file payloads. + +## Serialization + +Use `to_dict` and `from_dict` to serialize and restore file content. + +```python +payload = file_content.to_dict() +restored = FileContent.from_dict(payload) +``` + +For tracing, Haystack replaces the full base64 payload with a placeholder so large files are not sent to the tracing backend. diff --git a/docs-website/versioned_docs/version-2.30/concepts/data-classes.mdx b/docs-website/versioned_docs/version-2.30/concepts/data-classes.mdx index 91efbc3674..576c9f6221 100644 --- a/docs-website/versioned_docs/version-2.30/concepts/data-classes.mdx +++ b/docs-website/versioned_docs/version-2.30/concepts/data-classes.mdx @@ -9,7 +9,7 @@ description: "In Haystack, there are a handful of core classes that are regularl In Haystack, there are a handful of core classes that are regularly used in many different places. These are classes that carry data through the system and you are likely to interact with these as either the input or output of your pipeline. -Haystack uses data classes to help components communicate with each other in a simple and modular way. By doing this, data flows seamlessly through the Haystack pipelines. This page goes over the available data classes in Haystack: ByteStream, Answer (along with its variants ExtractedAnswer and GeneratedAnswer), ChatMessage, Document, and StreamingChunk, explaining how they contribute to the Haystack ecosystem. +Haystack uses data classes to help components communicate with each other in a simple and modular way. By doing this, data flows seamlessly through the Haystack pipelines. This page goes over the available data classes in Haystack: ByteStream, Answer (along with its variants ExtractedAnswer and GeneratedAnswer), ChatMessage, FileContent, Document, and StreamingChunk, explaining how they contribute to the Haystack ecosystem. You can check out the detailed parameters in our [Data Classes](/reference/data-classes-api) API reference. @@ -120,6 +120,12 @@ image = ByteStream.from_file_path("dog.jpg") Read the detailed documentation for the `ChatMessage` data class on a dedicated [ChatMessage](data-classes/chatmessage.mdx) page. +### FileContent + +`FileContent` represents a file payload that can be attached to a `ChatMessage`, including base64 data, MIME type, filename, and provider-specific metadata. + +Read the detailed documentation for the `FileContent` data class on a dedicated [FileContent](data-classes/filecontent.mdx) page. + ### Document #### Overview diff --git a/docs-website/versioned_docs/version-2.30/concepts/data-classes/filecontent.mdx b/docs-website/versioned_docs/version-2.30/concepts/data-classes/filecontent.mdx new file mode 100644 index 0000000000..5f2c5f40a1 --- /dev/null +++ b/docs-website/versioned_docs/version-2.30/concepts/data-classes/filecontent.mdx @@ -0,0 +1,119 @@ +--- +title: "FileContent" +id: filecontent +slug: "/filecontent" +description: "`FileContent` represents file payloads in chat messages, including base64 data, MIME type, filename, and provider-specific metadata." +--- + +# FileContent + +`FileContent` represents a file payload that can be attached to a [`ChatMessage`](chatmessage.mdx). Use it when a chat model accepts file inputs, such as PDFs or other documents, together with the user's text prompt. + +If you need the full list of parameters and methods, see the [`FileContent` API reference](/reference/data-classes-api#filecontent). + +## Attributes + +```python +@dataclass +class FileContent: + base64_data: str + mime_type: str | None = None + filename: str | None = None + extra: dict[str, Any] = field(default_factory=dict) + validation: bool = True +``` + +- `base64_data` stores the file content as a base64-encoded string. +- `mime_type` identifies the file type, for example `application/pdf`. Providing it explicitly is recommended because many model providers require it. +- `filename` is optional, but some providers use it when processing uploaded files. +- `extra` can store provider-specific metadata. Values should be JSON serializable. +- `validation` checks that `base64_data` is valid and tries to infer the MIME type when one is not provided. + +## Create from a file path + +Use `from_file_path` to read a local file, base64-encode it, infer the MIME type from the path, and populate the filename. + +```python +from haystack.dataclasses import ChatMessage, FileContent + +file_content = FileContent.from_file_path("data/attention-is-all-you-need.pdf") + +message = ChatMessage.from_user( + content_parts=[ + file_content, + "Summarize the key ideas in this paper.", + ] +) +``` + +Pass `filename` or `extra` when a provider expects a specific filename or provider-specific options: + +```python +file_content = FileContent.from_file_path( + "data/report.pdf", + filename="quarterly-report.pdf", + extra={"source": "finance"}, +) +``` + +## Create from a URL + +Use `from_url` to download a file and convert it into a `FileContent` instance. + +```python +from haystack.dataclasses import FileContent + +file_content = FileContent.from_url( + "https://example.com/reports/quarterly-report.pdf", + timeout=30, +) +``` + +If no filename is provided, Haystack uses the final path segment of the URL. + +## Create from base64 data + +If you already have file bytes, encode them and pass the MIME type explicitly. + +```python +import base64 +from pathlib import Path + +from haystack.dataclasses import FileContent + +data = Path("data/manual.pdf").read_bytes() +file_content = FileContent( + base64_data=base64.b64encode(data).decode("utf-8"), + mime_type="application/pdf", + filename="manual.pdf", +) +``` + +Set `validation=False` only when the base64 data and MIME type are already trusted and you want to skip validation. + +## Inspect files in a ChatMessage + +After adding `FileContent` to a `ChatMessage`, use the `file` and `files` properties to access file payloads. + +```python +from haystack.dataclasses import ChatMessage, FileContent + +file_content = FileContent.from_file_path("data/invoice.pdf") +message = ChatMessage.from_user(content_parts=[file_content, "Extract the invoice total."]) + +print(message.file) +print(message.files) +``` + +`message.file` returns the first file payload, or `None` if there are no files. `message.files` returns all file payloads. + +## Serialization + +Use `to_dict` and `from_dict` to serialize and restore file content. + +```python +payload = file_content.to_dict() +restored = FileContent.from_dict(payload) +``` + +For tracing, Haystack replaces the full base64 payload with a placeholder so large files are not sent to the tracing backend. diff --git a/docs-website/versioned_sidebars/version-2.29-sidebars.json b/docs-website/versioned_sidebars/version-2.29-sidebars.json index 73a566b2b9..88273e63ba 100644 --- a/docs-website/versioned_sidebars/version-2.29-sidebars.json +++ b/docs-website/versioned_sidebars/version-2.29-sidebars.json @@ -73,7 +73,8 @@ "id": "concepts/data-classes" }, "items": [ - "concepts/data-classes/chatmessage" + "concepts/data-classes/chatmessage", + "concepts/data-classes/filecontent" ] }, { diff --git a/docs-website/versioned_sidebars/version-2.30-sidebars.json b/docs-website/versioned_sidebars/version-2.30-sidebars.json index fcefb1373d..5e514ad425 100644 --- a/docs-website/versioned_sidebars/version-2.30-sidebars.json +++ b/docs-website/versioned_sidebars/version-2.30-sidebars.json @@ -73,7 +73,8 @@ "id": "concepts/data-classes" }, "items": [ - "concepts/data-classes/chatmessage" + "concepts/data-classes/chatmessage", + "concepts/data-classes/filecontent" ] }, {