When querying BigQuery through the Python API using service.jobs().getQueryResults, we're finding that the first attempt works fine: all of the expected results are included in the response. However, if the query is run a second time shortly after the first (roughly within 5 minutes), only a small subset of the results is returned (in powers of 2), instantly and with no errors.
See our complete code at: https://github.com/sean-schaefer/pandas/blob/master/pandas/io/gbq.py
Any thoughts on what could cause this?
It looks like the issue is that we return different default numbers of rows for query() and getQueryResults(). So depending on whether your query finished quickly (and so you didn't have to use getQueryResults()), you'd get either more or fewer rows.
I've filed a bug, and we should have a fix soon.
The workaround (and a good idea overall) is to set maxResults for both the query and getQueryResults calls. And if you're going to want a lot of rows, you might want to page through the results using the returned page token.
Below is an example that reads one page of data at a time from a completed query job. It will be included in the next release of bq.py:
    import sys

    # _TableReader, BigqueryError and BigqueryInterfaceError are defined in bq.py.


    class _JobTableReader(_TableReader):
      """A TableReader that reads from a completed job."""

      def __init__(self, local_apiclient, project_id, job_id):
        self.job_id = job_id
        self.project_id = project_id
        self._apiclient = local_apiclient

      def ReadSchemaAndRows(self, max_rows=None):
        """Read at most max_rows rows from a table and the schema.

        Args:
          max_rows: maximum number of rows to return.

        Raises:
          BigqueryInterfaceError: when bigquery returns something unexpected.

        Returns:
          A tuple where the first item is the list of fields and the
          second item is a list of rows.
        """
        page_token = None
        rows = []
        schema = {}
        # sys.maxsize works on Python 2.6+ and 3; sys.maxint is Python 2 only.
        max_rows = max_rows if max_rows is not None else sys.maxsize
        while len(rows) < max_rows:
          (more_rows, page_token, total_rows, current_schema) = self._ReadOnePage(
              max_rows=max_rows - len(rows),
              page_token=page_token)
          if not schema and current_schema:
            schema = current_schema.get('fields', {})

          # Never expect more rows than the job actually produced.
          max_rows = min(max_rows, total_rows)
          for row in more_rows:
            rows.append([entry.get('v', '') for entry in row.get('f', [])])
          if not page_token and len(rows) != max_rows:
            raise BigqueryInterfaceError(
                'PageToken missing for %r' % (self,))
          if not more_rows and len(rows) != max_rows:
            raise BigqueryInterfaceError(
                'Not enough rows returned by server for %r' % (self,))
        return (schema, rows)

      def _ReadOnePage(self, max_rows, page_token=None):
        data = self._apiclient.jobs().getQueryResults(
            maxResults=max_rows,
            pageToken=page_token,
            # Sets the timeout to 0 because we assume the table is already ready.
            timeoutMs=0,
            projectId=self.project_id,
            jobId=self.job_id).execute()
        if not data['jobComplete']:
          raise BigqueryError('Job %s is not done' % (self,))
        page_token = data.get('pageToken', None)
        total_rows = int(data['totalRows'])
        schema = data.get('schema', None)
        rows = data.get('rows', [])
        return (rows, page_token, total_rows, schema)
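If you don't want to wait for the bq.py release, the same workaround can be sketched as a standalone function. This is only a rough illustration, not the bq.py code: the function name run_query_all_rows and the page_size parameter are made up for this example, and service is assumed to be a BigQuery v2 client object like the one your gbq.py already builds. It passes the same maxResults to both query() and getQueryResults(), then follows pageToken until none is returned.

```python
def run_query_all_rows(service, project_id, sql, page_size=1000):
    """Run a query and collect every row, asking for page_size rows per call.

    Hypothetical helper for illustration; `service` is assumed to be a
    BigQuery v2 API client object.
    """
    jobs = service.jobs()

    # Set maxResults explicitly so query() and getQueryResults() use the
    # same page size instead of their differing defaults.
    response = jobs.query(
        projectId=project_id,
        body={'query': sql, 'maxResults': page_size}).execute()
    job_id = response['jobReference']['jobId']

    # If the query did not finish within query()'s timeout, keep asking
    # getQueryResults() until the job reports completion.
    while not response.get('jobComplete', False):
        response = jobs.getQueryResults(
            projectId=project_id, jobId=job_id,
            maxResults=page_size).execute()

    rows = []
    while True:
        # Each row is {'f': [{'v': value}, ...]}; flatten it to a list of values.
        rows.extend([cell.get('v') for cell in row.get('f', [])]
                    for row in response.get('rows', []))
        page_token = response.get('pageToken')
        if not page_token:
            break  # No token means this was the last page.
        response = jobs.getQueryResults(
            projectId=project_id, jobId=job_id,
            maxResults=page_size, pageToken=page_token).execute()
    return rows
```

Calling it would look something like rows = run_query_all_rows(service, 'my-project', 'SELECT ...', page_size=500), where the project ID and query are placeholders. Unlike the reader above, this sketch does not check the row count against totalRows, so add that check if you need to detect short reads.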