[WIP] Auto infer schema (including fields shape) from the first row #512
WeichenXu123 wants to merge 6 commits into
Codecov Report
@@ Coverage Diff @@
## master #512 +/- ##
==========================================
- Coverage 86.02% 85.86% -0.17%
==========================================
Files 81 81
Lines 4402 4442 +40
Branches 704 713 +9
==========================================
+ Hits 3787 3814 +27
- Misses 504 511 +7
- Partials 111 117 +6
liangz1 left a comment
A convenient feature that would simplify the schema issue! I left a couple of questions.
| if all_cols: | ||
| self.publish_func(all_cols) | ||
| def infer_schema_from_first_row(self): |
nit: I'm not sure whether partition[0] necessarily contains the "first" row. Could the partitions be out of order? If so, we could call it infer_schema_from_a_row.

Here I read the first row of the index-0 row group. But the index-0 row group may be non-deterministic? Not sure. infer_schema_from_a_row sounds good.
| if 'transform_spec' in petastorm_reader_kwargs or \ | ||
| 'infer_schema_from_first_row' in petastorm_reader_kwargs: | ||
| raise ValueError('User cannot set transform_spec and infer_schema_from_first_row ' |
Shall we also allow users to set transform_spec and infer_schema_from_first_row? Keeping transform_spec would make it consistent with the rest of the petastorm library.

I think the preprocess_fn param should cover the functionality of transform_spec, and it is easier to use (it can automatically infer the result schema), so I forbid the two params.
I created a simple PR to address issue 1: #517
What issues does the PR address?
There are 2 issues in make_batch_reader: one is critical, and the other is less critical but a pain point.
(Critical) Schema inference in make_batch_reader cannot infer fields' shape information
Because there is no shape information, errors may occur when we make a TensorFlow dataset from the reader and apply dataset operations such as unroll, batch, or reshaping a field. TensorFlow graph operators depend heavily on field shape information.
(Pain point) TransformSpec requires the edited/removed fields to be specified manually
We hope users can provide only a transform function, and petastorm can automatically infer the result schema from the pandas DataFrame that the transform function outputs.
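To make the pain point concrete, here is a minimal, library-independent sketch of how the edited, removed, and added fields could be derived by running a transform function on one sample row. It does not use petastorm or pandas; `diff_fields` and `example_transform` are hypothetical names, and a plain dict stands in for a row.

```python
# Hypothetical sketch; none of these names are petastorm APIs.

def diff_fields(input_row, transform_fn):
    """Apply transform_fn to one sample row and report how the fields changed."""
    output_row = transform_fn(dict(input_row))  # copy so the sample is not mutated
    removed = sorted(set(input_row) - set(output_row))
    added = sorted(set(output_row) - set(input_row))
    # In this sketch a field counts as "edited" when its Python type changed.
    edited = sorted(
        name for name in set(input_row) & set(output_row)
        if type(input_row[name]) is not type(output_row[name])
    )
    return {"removed": removed, "added": added, "edited": edited}


def example_transform(row):
    row["label"] = float(row["label"])            # int -> float
    row["area"] = row["width"] * row["height"]    # new derived field
    del row["width"], row["height"]               # drop the source fields
    return row


sample = {"label": 1, "width": 3, "height": 4}
changes = diff_fields(sample, example_transform)
print(changes)
```

With this kind of comparison the user supplies only the transform function, and the field bookkeeping follows from its output on a sample.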
The approach in the PR
Add a method ArrowReaderWorker.infer_schema_from_first_row, which reads a row first and infers the schema from that row, so that we can infer accurate shape information.
Add a param infer_schema_from_first_row to make_batch_reader (default off, so it won't break API behavior).
Limitations:
Test
Unit tests are still to be added, but the PR is ready for a first review.
Example code
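A minimal sketch of the first-row inference idea, not the PR's actual implementation: it reads one row and records each field's element type and shape. It assumes fields are scalars or regular nested lists (only the first element at each level is inspected); `nested_shape`, `leaf_type`, and `infer_schema_from_row` are hypothetical names, and the real reader works on Arrow/pandas data instead of dicts.

```python
# Hypothetical sketch; infer_schema_from_row is not a petastorm API.

def nested_shape(value):
    """Return the shape of a regular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if not value:  # empty list: deeper dimensions cannot be determined
            break
        value = value[0]
    return tuple(shape)


def leaf_type(value):
    """Return the type name of the innermost first element."""
    while isinstance(value, (list, tuple)) and value:
        value = value[0]
    return type(value).__name__


def infer_schema_from_row(row):
    """Map each field name to (element type name, shape) based on a single row."""
    return {name: (leaf_type(v), nested_shape(v)) for name, v in row.items()}


first_row = {
    "id": 7,
    "image": [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]],
    "tags": ["a", "b"],
}
schema = infer_schema_from_row(first_row)
print(schema)
```

Because the schema comes from a single row, it is only accurate if every other row has the same per-field shape.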