Thin Reports, Report Level Measures vs Data Model Measures

The previous post explained what Thin reports are, why we should care and how we can create them. This post focuses on a more specific topic, Report Level Measures. We discuss what report-level measures are, when and why we need them and how we create them.

If you are not sure what Thin Report means, I suggest you check out my previous blog post before reading this one.

What are report level measures?

Report level measures are measures that report writers create within a Thin Report. They are available only within the hosting Thin Report: they are not written back to the underlying dataset, so they are not available to any other report.

Why and when do we need report level measures?

In real-world scenarios, the business often needs a report urgently, but the nuts and bolts of the report have not been created in the underlying dataset yet. For instance, the business needs to present a report to the board showing year-to-date sales analysis, but the year-to-date sales measure hasn’t been created in the dataset yet. The business analyst approaches the Power BI developers to add the measure, but they are under the pump to deliver other functionality, and adding a new measure is not even in their project delivery plan. It is perhaps too late if we wait for the developers to plan for creating the required measure, go through the release process, and make it available for us in the dataset. This is when report level measures come to the rescue. We can simply create the missing measure in the Thin Report itself and later share it with the developers so they can implement it as a dataset measure.


Thin Reports, What Are They, Why Should I Care and How Can I Create Them?

Thin Reports in Power BI

Shared Datasets have been around for quite a while now. In June 2019, Microsoft announced a new feature called Shared and Certified Datasets with the mindset of supporting enterprise-grade BI within the Power BI ecosystem. In essence, the shared dataset feature allows organisations to have a single source of truth across the organisation serving many reports.

A Thin Report is a report that connects to an existing dataset on Power BI Service using the Connect Live connectivity mode. So, we basically have multiple reports connected to a single dataset. Now that we know what a thin report is, let’s see why it is best practice to follow this approach.

Prior to the Shared and Certified Datasets announcement, we used to create separate reports in Power BI Desktop and publish those reports into Power BI Service. This approach had many disadvantages, such as:

  • Having many disparate islands of data instead of a single source of truth.
  • Consuming more storage on Power BI Service by having repeated tables across many datasets.
  • Reducing collaboration between data modellers and report creators (contributors) as Power BI Desktop is not a multi-user application.
  • The reports were strictly tied to the underlying dataset, so it was hard, if not impossible, to decouple a report from a dataset and connect it to a different one. This made it difficult for developers to follow the Dev/Test/Prod approach.
  • If we had a fairly large report with many pages, say more than 20 pages, then again, it was almost impossible to break the report down into smaller and more business-centric reports.
  • Putting too much load on the data sources connected to many disparate datasets. The situation gets even worse when we schedule multiple refreshes a day. In some cases, the data refresh process puts exclusive locks on the source system, which can potentially cause many issues down the road.
  • Having many datasets and reports made it harder and more expensive to maintain the solution.

In my previous blog, I explained the different components of a Business Intelligence solution and how they map to the Power BI ecosystem. In that post, I mentioned that the Power BI Service Datasets map to a Semantic Layer in a Business Intelligence solution. So, when we create a Power BI report with Power BI Desktop and publish the report to the Power BI Service, we create a semantic layer with a report connected to it altogether. By creating many disparate reports in Power BI Desktop and publishing them to the Power BI Service, we are indeed creating many semantic layers with many repeated tables on top of our data which does not make much sense.

On the other hand, having some shared datasets with many connected thin reports makes a lot of sense. This approach addresses all the disadvantages of the previous development method. In addition, it reduces confusion for report writers about which datasets they are connecting to, it helps with storage management in Power BI Service, and it makes it easier to comply with security and privacy requirements.


Business Intelligence Components and How They Relate to Power BI

When I decided to write this blog post, I thought it would be a good idea to learn a bit about the history of Business Intelligence. I searched on the internet, and I found this page on Wikipedia. The term Business Intelligence as we know it today was coined by an IBM computer science researcher, Hans Peter Luhn, in 1958, who wrote a paper in the IBM Journal of Research and Development titled A Business Intelligence System as a specific process in data science. In the Objectives and principles section of his paper, Luhn defines the business as “a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera.” and an intelligence system as “the communication facility serving the conduct of a business (in the broad sense)”. Then he refers to Webster’s dictionary’s definition of the word Intelligence as “the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal”.

It is fascinating to see how a fantastic idea in the past sets a concrete future that can help us have a better life. Isn’t it precisely what we do in our daily BI processes, just as Luhn first described a Business Intelligence System? How cool is that?

When we talk about the term BI today, we refer to a specific and scientific set of processes of transforming the raw data into valuable and understandable information for various business sectors (such as sales, inventory, law, etc…). These processes will help businesses to make data-driven decisions based on the existing hidden facts in the data.

Like everything else, BI processes have improved a lot over time. I will try to make some sensible links between today’s BI Components and Power BI in this post.

Generic Components of Business Intelligence Solutions

Generally speaking, a BI solution contains various components and tools that may vary in different solutions depending on the business requirements, data culture and the organisation’s maturity in analytics. But the processes are very similar to the following:

  • We usually have multiple source systems with different technologies containing the raw data, such as SQL Server, Excel, JSON, Parquet files etc…
  • We integrate the raw data into a central repository to reduce the risk of making any interruptions to the source systems by constantly connecting to them. We usually load the data from the data sources into the central repository.
  • We transform the data to optimise it for reporting and analytical purposes, and we load it into another storage. We aim to keep the historical data in this storage.
  • We pre-aggregate the data into certain levels based on the business requirements and load the data into another storage. We usually do not keep the whole historical data in this storage; instead, we only keep the data required to be analysed or reported.
  • We create reports and dashboards to turn the data into useful information

With the above processes in mind, a BI solution consists of the following components:

  • Data Sources
  • Staging
  • Data Warehouse/Data Mart(s)
  • Extract, Transform and Load (ETL)
  • Semantic Layer
  • Data Visualisation

Data Sources

One of the main goals of running a BI project is to enable organisations to make data-driven decisions. An organisation might have multiple departments using various tools to collect the relevant data every day, such as sales, inventory, marketing, finance, health and safety etc.

The data generated by the business tools are stored somewhere using different technologies. A sales system might store the data in an Oracle database, while the finance system stores the data in a SQL Server database in the cloud. The finance team also generate some data stored in Excel files.

The data generated by different systems are the source for a BI solution.

Staging

We usually have multiple data sources contributing to the data analysis in real-world scenarios. To be able to analyse all the data sources, we need a mechanism to load the data into a central repository. The main reason is that the business tools are required to constantly store data in the underlying storage. Therefore, frequent connections to the source systems can put our production systems at risk of being unresponsive or performing poorly. The central repository where we store the data from various data sources is called Staging. We usually store the data in staging with no or only minor changes compared to the data in the data sources. Therefore, the quality of the data stored in staging is usually low and requires cleansing in the subsequent phases of the data journey. In many BI solutions, we use Staging as a temporary environment, so we delete the Staging data regularly after it is successfully transferred to the next stage, the data warehouse or data marts.

If we want to indicate the data quality with colours, it is fair to say the data quality in staging is Bronze.

Data Warehouse/Data Mart(s)

As mentioned before, the data in staging is not in its best shape and format. Multiple data sources generate the data independently of each other. So, analysing the data and creating reports on top of the data in staging would be challenging, time-consuming and expensive. We therefore need to find the links between the data sources, then cleanse, reshape and transform the data to optimise it for data analysis and reporting activities. We store the current and historical data in a data warehouse, so it is pretty normal to have hundreds of millions or even billions of rows of data accumulated over a long period. Depending on the overall architecture, the data warehouse might contain encapsulated business-specific data in a data mart or a collection of data marts. In data warehousing, we use different modelling approaches such as Star Schema. As mentioned earlier, one of the primary purposes of having a data warehouse is to keep the history of the data. This is a massive benefit of having a data warehouse, but this strength comes at a cost. As the volume of the data in the data warehouse grows, it becomes more expensive to analyse the data. The data quality in the data warehouse or data marts is Silver.

Extract, Transform and Load (ETL)

In the previous sections, we mentioned that we integrate the data from the data sources in the staging area, then we cleanse, reshape and transform the data and load it into a data warehouse. To do so, we follow a process called Extract, Transform and Load or, in short, ETL. As you can imagine, the ETL processes are usually pretty complex and expensive, but they are an essential part of every BI solution.
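
The flow described above can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular ETL tool works: the function names (`extract`, `transform`, `load`), the source names and the column names are all assumptions made up for this example.

```python
def extract(sources):
    """Copy raw rows from every source into staging, unchanged apart from a source tag."""
    staging = []
    for name, rows in sources.items():
        for row in rows:
            staging.append({"source": name, **row})
    return staging


def transform(staging):
    """Cleanse and reshape: drop rows without an amount, tidy names, cast amounts to float."""
    cleaned = []
    for row in staging:
        if row.get("amount") in (None, ""):
            continue  # low-quality staging row, skipped during cleansing
        cleaned.append({
            "source": row["source"],
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
        })
    return cleaned


def load(warehouse, rows):
    """Append the transformed rows to the warehouse and report how many were loaded."""
    warehouse.extend(rows)
    return len(rows)


# Hypothetical source systems with inconsistent raw data
sources = {
    "sales_db": [
        {"customer": " alice ", "amount": "120.50"},
        {"customer": "BOB", "amount": None},
    ],
    "finance_excel": [{"customer": "carol", "amount": 80}],
}

staging = extract(sources)
warehouse = []
loaded = load(warehouse, transform(staging))
```

In this sketch, staging keeps all three raw rows, while the warehouse receives only the two rows that survive cleansing.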


Incremental Refresh in Power BI, Part 1: Implementation in Power BI Desktop

Incremental refresh, or in short, IR, refers to loading the data incrementally, which has been around in the world of ETL for data warehousing for a long time. Let us discuss incremental refresh (or incremental data loading) in a simple language to better understand how it works.

From a data movement standpoint, there are always two options when we transfer data from location A to location B:

  1. Truncation and load: We transfer the data as a whole from location A to location B. If location B has some data already, we entirely truncate the location B and reload the whole data from the location A to B
  2. Incremental load: We transfer the data as a whole from location A to location B just once for the first time. The next time, we only load the data changes from A to B. In this approach, we never truncate B. Instead, we only transfer the data that exists in A but not in B
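
To make the difference concrete, here is a minimal Python sketch of the two options. It models locations A and B as dictionaries keyed by a row ID, which is an assumption for illustration only; the function names are hypothetical and this is not how Power BI implements either approach.

```python
def truncate_and_load(source: dict, target: dict) -> int:
    """Option 1: empty the target, then copy every row from the source."""
    target.clear()
    target.update(source)
    return len(source)  # number of rows transferred


def incremental_load(source: dict, target: dict) -> int:
    """Option 2: copy only the rows that exist in the source but not in the target."""
    new_keys = [key for key in source if key not in target]
    for key in new_keys:
        target[key] = source[key]
    return len(new_keys)  # number of rows transferred


location_a = {1: "order 1", 2: "order 2", 3: "order 3"}
location_b = {}

first_run = incremental_load(location_a, location_b)    # first load moves everything
location_a[4] = "order 4"                                # a new row arrives in A
second_run = incremental_load(location_a, location_b)   # only the new row moves
full_reload = truncate_and_load(location_a, location_b)  # moves all rows again
```

The incremental runs transfer three rows and then one row, while the truncation and load transfers all four rows every time it runs.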

When we refresh the data in Power BI, if we have not configured an incremental refresh, we use the first approach, which is truncation and load. Needless to say, in Power BI the first approach only applies to tables with Import or Dual storage modes. Previously, the Incremental load was available only in tables with either Import or Dual storage modes. But the new announcement from Microsoft about Hybrid Tables makes a big difference in how Incremental load works. With Hybrid Tables, the Incremental load is available on a portion of the table when a specific partition is in DirectQuery mode, while the rest of the partitions are in Import storage mode.

Incremental refresh used to be available only on Premium capacities, but from Feb 2020 onwards, it is also available in Power BI Pro with some limitations. However, Hybrid Tables are currently available on Power BI Premium Capacity and Premium Per User (PPU) and not Pro. Let’s hope that Microsoft will change its licensing plan for Hybrid Tables in the future and make them available in Pro.

I will write about Hybrid Tables in a future blog post.

When we successfully configure the incremental refresh policies in Power BI, we always have two ranges of data: the historical range and the incremental range. The historical range includes all data processed in the past, and the incremental range is the current range of data to process. Incremental refresh in Power BI always looks for data changes in the incremental range, not the historical range. Therefore, the incremental refresh will not notice any changes in the historical data. When we talk about data changes, we are referring to rows inserted, updated or deleted; however, the incremental refresh treats an updated row as a deletion of the old row followed by an insertion of a new row.
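
The behaviour of the two ranges can be illustrated with a small Python sketch. It assumes a simplified policy of "reload the last N days" and uses hypothetical names (`split_ranges`, `refresh`, `order_date`, `amount`); it is a conceptual model only, not the Power BI refresh engine.

```python
from datetime import date, timedelta


def split_ranges(rows, today, refresh_days):
    """Split rows into (historical, incremental) around a date cutoff."""
    cutoff = today - timedelta(days=refresh_days)
    historical = [r for r in rows if r["order_date"] < cutoff]
    incremental = [r for r in rows if r["order_date"] >= cutoff]
    return historical, incremental


def refresh(model_rows, source_rows, today, refresh_days):
    """Keep the model's historical rows untouched; reload only the incremental range."""
    kept, _ = split_ranges(model_rows, today, refresh_days)
    _, reloaded = split_ranges(source_rows, today, refresh_days)
    return kept + reloaded


today = date(2022, 3, 3)

# What the model holds after the previous refresh
model = [
    {"id": 1, "order_date": date(2021, 1, 10), "amount": 100},
    {"id": 2, "order_date": date(2022, 3, 1), "amount": 50},
]

# What the source holds now: row 1 changed (historical), row 2 changed, row 3 is new
source = [
    {"id": 1, "order_date": date(2021, 1, 10), "amount": 999},
    {"id": 2, "order_date": date(2022, 3, 1), "amount": 75},
    {"id": 3, "order_date": date(2022, 3, 2), "amount": 20},
]

refreshed = refresh(model, source, today, refresh_days=7)
amounts = {r["id"]: r["amount"] for r in refreshed}
```

After the refresh, row 1 still shows the old amount because it sits in the historical range, while rows 2 and 3 reflect the source because the whole incremental range was dropped and reloaded.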

Benefits of Incremental Refresh

Configuring incremental refresh is beneficial for large tables with hundreds of millions of rows. The following are some benefits of configuring incremental refresh in Power BI:

  • The data refreshes much faster than when we truncate and load the data as the incremental refresh only refreshes the incremental range
  • The data refresh process is less resource-intensive than refreshing the entire data all the time
  • The data refresh is less expensive and more maintainable than the non-incremental refreshes over large tables
  • The incremental refresh is unavoidable when dealing with massive datasets with billions of rows that do not fit into our data model in Power BI Desktop. Remember, Power BI uses an in-memory data processing engine; therefore, it is improbable that our local machine can handle importing billions of rows of data into memory

Now that we understand what incremental refresh is, let us see how it works in Power BI.

Implementing Incremental Refresh Policies with Power BI Desktop

We can currently configure incremental refresh in Power BI Desktop and in Dataflows contained in a Premium Workspace. In this blog post, we look at the incremental refresh implementation within Power BI Desktop.

After we successfully implement the incremental refresh policies with the desktop, we publish the model to Power BI Service. The first data refresh takes longer as we transfer all data from the data source(s) to Power BI Service for the first time. After the first load, all future data refreshes will be incremental.

How to Implement Incremental Refresh

Implementing incremental refresh in Power BI is simple. There are two generic parts of the implementation:

  1. Preparing some prerequisites in Power Query and defining incremental policies in the data model
  2. Publishing the model to Power BI Service and refreshing the dataset

Let’s briefly get to some more details to quickly understand how the implementation works.

  • Preparing Prerequisites in Power Query
    • We need to define two parameters with the DateTime data type in Power Query Editor. The names for the two parameters are RangeStart and RangeEnd, which are reserved for defining incremental refresh policies. As you know, Power Query is case sensitive, so the names of the parameters must be exactly RangeStart and RangeEnd.
    • The next step is to filter the table by a DateTime column using the RangeStart and RangeEnd parameters, so that we keep only the rows where the value of the DateTime column is between RangeStart and RangeEnd.

Notes

  • The data type of the parameters must be DateTime
  • The data type of the column we use for incremental refresh must be Int64 (integer), Date or DateTime. Therefore, for scenarios where our table has a smart date key instead of a Date or DateTime column, we have to convert the RangeStart and RangeEnd parameters to Int64
  • When we filter a table using the RangeStart and RangeEnd parameters, Power BI uses the filter on the DateTime column for creating partitions on the table. So it is important to pay attention to the DateTime ranges when filtering the values: only one of the filter conditions should include an “equal to”, either on RangeStart or on RangeEnd, not both. Otherwise, a row that falls exactly on a boundary can end up in two adjacent partitions
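
The last two notes can be illustrated with a short Python sketch. In Power BI the filter itself is written in Power Query, so this is only a conceptual model: the helper names (`to_date_key`, `in_partition`, `in_partition_by_key`) and the yyyymmdd key format are assumptions for this example.

```python
from datetime import datetime


def to_date_key(value: datetime) -> int:
    """Convert a DateTime to a yyyymmdd integer, a common 'smart date key' format."""
    return value.year * 10000 + value.month * 100 + value.day


def in_partition(value: datetime, range_start: datetime, range_end: datetime) -> bool:
    """'Equal to' on RangeStart only, so a boundary value belongs to one partition."""
    return range_start <= value < range_end


def in_partition_by_key(date_key: int, range_start: datetime, range_end: datetime) -> bool:
    """Same filter for a table with an integer date key: convert the parameters first."""
    return to_date_key(range_start) <= date_key < to_date_key(range_end)


# Two adjacent monthly partitions
january = (datetime(2022, 1, 1), datetime(2022, 2, 1))
february = (datetime(2022, 2, 1), datetime(2022, 3, 1))

# A row that falls exactly on the boundary between them
boundary_row = datetime(2022, 2, 1)

# With 'equal to' on one side only, the row matches exactly one partition
matches = sum(in_partition(boundary_row, start, end) for start, end in (january, february))

# With 'equal to' on both sides, the same row would match both partitions
matches_both_inclusive = sum(start <= boundary_row <= end for start, end in (january, february))

# The same check against an integer date key column
key_matches = sum(
    in_partition_by_key(to_date_key(boundary_row), start, end)
    for start, end in (january, february)
)
```

Here the boundary row is counted once with the half-open filter and twice when both conditions include "equal to", which is the duplication the note warns about.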