python - Google BigQuery Incomplete Query Replies on Odd Attempts


When querying BigQuery through the Python API using:

    service.jobs().getQueryResults

we're finding that the first attempt works fine: all expected results are included in the response. However, if the query is run a second time shortly after the first (roughly within 5 minutes), only a small subset of the results is returned (in powers of 2), instantly and with no errors.

See our complete code at: https://github.com/sean-schaefer/pandas/blob/master/pandas/io/gbq.py

Any thoughts on what could cause this?

It looks like the issue is that we return different default numbers of rows for query() and getQueryResults(). So depending on whether your query finished quickly (and so you didn't have to use getQueryResults()), you'd get either more or fewer rows.

I've filed a bug, and we should have a fix soon.

The workaround (and a good idea overall) is to set maxResults for both the query and getQueryResults calls. And if you're going to want a lot of rows, you might want to page through the results using the returned page token.

Below is an example that reads one page of data at a time from a completed query job. It will be included in the next release of bq.py:

    import sys


    class _JobTableReader(_TableReader):
      """A TableReader that reads from a completed job."""

      def __init__(self, local_apiclient, project_id, job_id):
        self.job_id = job_id
        self.project_id = project_id
        self._apiclient = local_apiclient

      def ReadSchemaAndRows(self, max_rows=None):
        """Read at most max_rows rows from a table and the schema.

        Args:
          max_rows: maximum number of rows to return.

        Raises:
          BigqueryInterfaceError: when bigquery returns something unexpected.

        Returns:
          A tuple where the first item is the list of fields and the
          second item a list of rows.
        """
        page_token = None
        rows = []
        schema = {}
        # sys.maxint is Python 2 only; on Python 3 use sys.maxsize.
        max_rows = max_rows if max_rows is not None else sys.maxint
        while len(rows) < max_rows:
          (more_rows, page_token, total_rows, current_schema) = self._ReadOnePage(
              max_rows=max_rows - len(rows),
              page_token=page_token)
          if not schema and current_schema:
            schema = current_schema.get('fields', {})

          # Never ask for more rows than the job actually produced.
          max_rows = min(max_rows, total_rows)
          for row in more_rows:
            rows.append([entry.get('v', '') for entry in row.get('f', [])])
          if not page_token and len(rows) != max_rows:
            raise BigqueryInterfaceError(
                'PageToken missing for %r' % (self,))
          if not more_rows and len(rows) != max_rows:
            raise BigqueryInterfaceError(
                'Not enough rows returned by server for %r' % (self,))
        return (schema, rows)

      def _ReadOnePage(self, max_rows, page_token=None):
        data = self._apiclient.jobs().getQueryResults(
            maxResults=max_rows,
            pageToken=page_token,
            # Sets the timeout to 0 because we assume the table is already ready.
            timeoutMs=0,
            projectId=self.project_id,
            jobId=self.job_id).execute()
        if not data['jobComplete']:
          raise BigqueryError('Job %s is not done' % (self,))
        page_token = data.get('pageToken', None)
        total_rows = int(data['totalRows'])
        schema = data.get('schema', None)
        rows = data.get('rows', [])
        return (rows, page_token, total_rows, schema)
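If you are calling the API directly rather than through bq.py, the same workaround might look roughly like the sketch below. It is only an illustration, not the bq.py implementation: the function name fetch_all_rows and the page_size parameter are made up for this example, and it assumes service is a BigQuery v2 client object built with the Google API Python client. It sets maxResults on both the query and the getQueryResults calls and follows pageToken until no token is returned.

```python
def fetch_all_rows(service, project_id, sql, page_size=1000):
    """Run a query and collect every row, one explicit page at a time.

    `service` is assumed to be a BigQuery v2 client from the Google API
    Python client; `fetch_all_rows` and `page_size` are hypothetical names.
    """
    jobs = service.jobs()

    # Set maxResults on the initial query so the first page size is explicit
    # instead of relying on the server-side default.
    response = jobs.query(
        projectId=project_id,
        body={'query': sql, 'maxResults': page_size},
    ).execute()
    job_id = response['jobReference']['jobId']

    rows = []
    page_token = None
    while True:
        if response.get('jobComplete'):
            rows.extend(response.get('rows', []))
            page_token = response.get('pageToken')
            if not page_token:
                break  # No token means this was the last page.
        # Either the job is still running (poll again) or more pages remain.
        # Use the same maxResults here as in the query call.
        response = jobs.getQueryResults(
            projectId=project_id,
            jobId=job_id,
            maxResults=page_size,
            pageToken=page_token,
        ).execute()
    return rows
```

Each returned row is still in the raw API shape (a dict with an 'f' list of {'v': value} entries), so it would need the same unpacking that ReadSchemaAndRows does above. The sketch also has no retry limit or error handling, which real code would want.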
