[WIP] Auto infer schema (including fields shape) from the first row by WeichenXu123 · Pull Request #512 · uber/petastorm

WeichenXu123 · 2020-03-23T05:37:59Z

What issues does the PR address?

There are two issues in `make_batch_reader`: one is critical, and the other is less critical but a pain point.

(Critical) Inferring schema in `make_batch_reader` cannot infer fields' shape information

Because there is no shape information, errors may occur when a TensorFlow dataset is created from the reader and dataset operations such as unroll, batch, or reshaping a field are applied. TensorFlow graph operators depend heavily on field shape information.

(Pain point) `TransformSpec` requires edited/removed fields to be specified manually

We would like users to provide only a transform function and have petastorm automatically infer the result schema from the pandas DataFrame that the transform function returns.
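To illustrate the idea, here is a minimal sketch of inferring removed fields and result types by running a transform function on a one-row sample. It is not petastorm code: `infer_transform_result` and `preprocess` are hypothetical names, and a plain dict of column lists stands in for the pandas DataFrame so the snippet only needs the standard library.

```python
def infer_transform_result(transform_fn, sample_columns):
    """Run transform_fn on a small sample and diff the column sets.

    sample_columns is a dict of column name -> list of values; it is a
    stand-in (assumption) for the pandas DataFrame a real transform receives.
    """
    # Copy the lists so the transform cannot mutate the caller's sample.
    output = transform_fn({name: list(values) for name, values in sample_columns.items()})
    removed = sorted(set(sample_columns) - set(output))
    # Use the first value of each output column to guess the element type.
    result_types = {name: type(values[0]).__name__ for name, values in output.items()}
    return removed, result_types


def preprocess(columns):
    # Hypothetical transform: shift 'id', drop 'raw', add 'scaled'.
    return {
        "id": [i + 10000 for i in columns["id"]],
        "scaled": [x / 255.0 for x in columns["raw"]],
    }


removed, result_types = infer_transform_result(preprocess, {"id": [1], "raw": [51]})
print(removed)
print(result_types)
```

With this kind of inference the user would not need to list the edited or removed fields by hand; the trade-off is that the transform has to be executed once on sample data before reading starts.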

The approach in the PR

Add a method `ArrowReaderWorker.infer_schema_from_first_row`, which reads one row first and infers the schema from it, so that accurate shape information can be inferred.
Add a parameter `infer_schema_from_first_row` to `make_batch_reader` (off by default, so it won't break existing API behavior).
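The first-row inference can be sketched roughly as follows. This is an illustration only, not the code in this PR: `infer_shape` and `infer_schema_from_row` are hypothetical helpers, and nested Python lists stand in for the numpy arrays a real row would hold.

```python
def infer_shape(value):
    """Return the shape of a nested list/tuple as a tuple; scalars give ()."""
    if isinstance(value, (list, tuple)):
        if not value:
            return (0,)
        inner_shapes = {infer_shape(item) for item in value}
        if len(inner_shapes) != 1:
            raise ValueError("ragged value: inner shapes differ")
        return (len(value),) + inner_shapes.pop()
    return ()


def infer_schema_from_row(row):
    """Map each field name to (leaf type name, shape) based on a single row."""
    schema = {}
    for name, value in row.items():
        leaf = value
        # Descend to the first leaf element to find the element type.
        while isinstance(leaf, (list, tuple)) and leaf:
            leaf = leaf[0]
        schema[name] = (type(leaf).__name__, infer_shape(value))
    return schema


first_row = {"id": 7, "v": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}
schema = infer_schema_from_row(first_row)
print(schema)
```

Because only one row is inspected, the inferred shapes are only valid if every other row looks the same.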

Limitations:

For all rows (before predicates are applied), every value in each field must be non-null and have the same shape.
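As a sketch of what this limitation means in practice, the check below validates a set of rows against the shapes of the first one. It is a hypothetical helper (`check_rows_uniform`), not part of petastorm, and it uses nested lists in place of numpy arrays.

```python
def shape_of(value):
    """Shape of a nested list, following only the first element at each level."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def check_rows_uniform(rows):
    """Return per-field shapes, or raise ValueError on a null or a shape mismatch."""
    expected = None
    for index, row in enumerate(rows):
        shapes = {}
        for name, value in row.items():
            if value is None:
                raise ValueError(f"row {index}: field {name!r} is null")
            shapes[name] = shape_of(value)
        if expected is None:
            expected = shapes  # the first row defines the expected shapes
        elif shapes != expected:
            raise ValueError(f"row {index}: shapes {shapes} differ from {expected}")
    return expected


rows = [
    {"id": 0, "v": [0.1, 0.2, 0.3]},
    {"id": 1, "v": [0.4, 0.5, 0.6]},
]
uniform_shapes = check_rows_uniform(rows)
print(uniform_shapes)
```

A dataset that fails a check like this (a null value, or an array of a different length in a later row) would not be a good fit for first-row inference.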

Test

Unit tests are still to be added, but the PR is ready for a first review.

Example code

import os
import pandas as pd
import sys
import numpy as np
from pyspark.sql.functions import pandas_udf
import tensorflow as tf

from petastorm import make_batch_reader
from petastorm.transform import TransformSpec
from petastorm.spark import make_spark_converter
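# `spark` is assumed to be an existing SparkSession (for example, in a notebook or the pyspark shell).
spark.conf.set('petastorm.spark.converter.parentCacheDirUrl', 'file:/tmp/converter')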

data_url = 'file:/tmp/0001'
data_path = '/tmp/t0001'

@pandas_udf('array<float>')
def gen_array(v):
    # Produce a random length-10 float array for every input value.
    return v.map(lambda x: np.random.rand(10))

df1 = spark.range(10).withColumn('v', gen_array('id')).repartition(2)
cv1 = make_spark_converter(df1)

# The shape of a one-dimensional array field can be inferred automatically.
with cv1.make_tf_dataset(batch_size=4, num_epochs=1) as dataset:
    iterator = dataset.make_one_shot_iterator()
    next_op = iterator.get_next()
    with tf.Session() as sess:
        for i in range(3):
            batch = sess.run(next_op)
            print(batch)


def preproc_fn(x):
    # Reshape column 'v' to shape (2, 5) and shift 'id' by 10000.
    x2 = pd.DataFrame({'v': x['v'].map(lambda arr: arr.reshape((2, 5))), 'id': x['id'] + 10000})
    return x2

# The shape of a multi-dimensional array field can now be inferred automatically as well.
with cv1.make_tf_dataset(batch_size=4, preprocess_fn=preproc_fn, num_epochs=1) as dataset:
    iterator = dataset.make_one_shot_iterator()
    next_op = iterator.get_next()
    with tf.Session() as sess:
        for i in range(3):
            batch = sess.run(next_op)
            print(batch)

codecov · 2020-03-23T09:09:10Z

Codecov Report

Merging #512 into master will decrease coverage by 0.16%.
The diff coverage is 72.91%.

@@            Coverage Diff             @@
##           master     #512      +/-   ##
==========================================
- Coverage   86.02%   85.86%   -0.17%     
==========================================
  Files          81       81              
  Lines        4402     4442      +40     
  Branches      704      713       +9     
==========================================
+ Hits         3787     3814      +27     
- Misses        504      511       +7     
- Partials      111      117       +6

Impacted Files	Coverage Δ
petastorm/tf_utils.py	`80.91% <ø> (ø)`	⬆️
petastorm/spark/spark_dataset_converter.py	`87.5% <25%> (-3.13%)`	⬇️
petastorm/reader.py	`90.32% <77.77%> (-0.68%)`	⬇️
petastorm/arrow_reader_worker.py	`90.34% <83.87%> (-1.66%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0b70510...529cb83.

liangz1

A convenient feature that would simplify the schema issue! I left a couple of questions.

liangz1 · 2020-03-23T17:00:23Z

        if all_cols:
            self.publish_func(all_cols)

+    def infer_schema_from_first_row(self):


nit: I'm not sure that partition[0] necessarily contains the "first" row. Could the partitions be out of order? If so, we could call it infer_schema_from_a_row.

WeichenXu123 · 2020-03-24

Here I read the first row of the index-0 row group, but which row group is index 0 may be non-deterministic; I'm not sure. infer_schema_from_a_row sounds good.

liangz1 · 2020-03-23T17:18:31Z

+
+        if 'transform_spec' in petastorm_reader_kwargs or \
+                'infer_schema_from_first_row' in petastorm_reader_kwargs:
+            raise ValueError('User cannot set transform_spec and infer_schema_from_first_row '


Shall we also allow users to combine transform_spec and infer_schema_from_first_row? Keeping transform_spec would make it consistent with the rest of the petastorm library.

WeichenXu123 · 2020-03-24

I think the preprocess_fn parameter covers the functionality of transform_spec and is easier to use (it can automatically infer the result schema), so I forbid those two parameters here.

WeichenXu123 · 2020-03-25T04:11:28Z

I created a simple PR, #517, to address issue 1.
We can merge that one first.
This PR could be longer-term work.

WeichenXu123 added 2 commits March 23, 2020 11:37

init

2295490

update

6ded627

WeichenXu123 changed the title ~~[WIP] Auto infer schema from first row~~ [WIP] Auto infer schema (including fields shape) from the first row Mar 23, 2020

WeichenXu123 added 2 commits March 23, 2020 14:19

update

105f2ad

fix doc

5d01502

WeichenXu123 mentioned this pull request Mar 23, 2020

[WIP][ML-10118] Keep petastorm dataset/dataloader schema fields order the same with spark dataframe #511

Closed

liangz1 reviewed Mar 23, 2020

liangz1 pushed a commit to liangz1/petastorm that referenced this pull request Mar 24, 2020

merge uber#512 auto infer, test fails

6a1c889

WeichenXu123 added 2 commits March 24, 2020 19:54

update

8d41e70

update

529cb83

WeichenXu123 closed this Mar 25, 2020

WeichenXu123 reopened this Mar 25, 2020

selitvin mentioned this pull request Apr 2, 2020

global context not imported in transform_spec function with reader_pool_type="process" #524

Open

WeichenXu123 wants to merge 6 commits into uber:master from WeichenXu123:auto_infer

2 participants
